mirror of
https://github.com/rclone/rclone.git
synced 2025-12-11 22:14:05 +01:00
filter: Add --hash-filter to deterministically select a subset of files
Fixes #8400
This commit is contained in:
@@ -2808,6 +2808,7 @@ For the filtering options
|
||||
* `--max-size`
|
||||
* `--min-age`
|
||||
* `--max-age`
|
||||
* `--hash-filter`
|
||||
* `--dump filters`
|
||||
* `--metadata-include`
|
||||
* `--metadata-include-from`
|
||||
|
||||
@@ -718,6 +718,98 @@ old or more.
|
||||
|
||||
See [the time option docs](/docs/#time-option) for valid formats.
|
||||
|
||||
### `--hash-filter` - Deterministically select a subset of files {#hash-filter}
|
||||
|
||||
The `--hash-filter` flag enables selecting a deterministic subset of files, useful for:
|
||||
|
||||
1. Running large sync operations across multiple machines.
|
||||
2. Checking a subset of files for bitrot.
|
||||
3. Any other operations where a sample of files is required.
|
||||
|
||||
#### Syntax
|
||||
|
||||
The flag takes two parameters expressed as a fraction:
|
||||
|
||||
```
|
||||
--hash-filter K/N
|
||||
```
|
||||
|
||||
- `N`: The total number of partitions (must be a positive integer).
|
||||
- `K`: The specific partition to select (an integer from `0` to `N`).
|
||||
|
||||
For example:
|
||||
- `--hash-filter 1/3`: Selects the first third of the files.
|
||||
- `--hash-filter 2/3` and `--hash-filter 3/3`: Select the second and third partitions, respectively.
|
||||
|
||||
Each partition is non-overlapping, ensuring all files are covered without duplication.
|
||||
|
||||
#### Random Partition Selection
|
||||
|
||||
Use `@` as `K` to randomly select a partition:
|
||||
|
||||
```
|
||||
--hash-filter @/M
|
||||
```
|
||||
|
||||
For example, `--hash-filter @/3` will randomly select a number between 0 and 2. This will stay constant across retries.
|
||||
|
||||
#### How It Works
|
||||
|
||||
- Rclone takes each file's full path, normalizes it to lowercase, and applies Unicode normalization.
|
||||
- It then hashes the normalized path into a 64 bit number.
|
||||
- The hash result is reduced modulo `N` to assign the file to a partition.
|
||||
- If the calculated partition does not match `K` the file is excluded.
|
||||
- Other filters may apply if the file is not excluded.
|
||||
|
||||
**Important:** Rclone will traverse all directories to apply the filter.
|
||||
|
||||
#### Usage Notes
|
||||
|
||||
- Safe to use with `rclone sync`; source and destination selections will match.
|
||||
- **Do not** use with `--delete-excluded`, as this could delete unselected files.
|
||||
- Ignored if `--files-from` is used.
|
||||
|
||||
#### Examples
|
||||
|
||||
##### Dividing files into 4 partitions
|
||||
|
||||
Assuming the current directory contains `file1.jpg` through `file9.jpg`:
|
||||
|
||||
```
|
||||
$ rclone lsf --hash-filter 0/4 .
|
||||
file1.jpg
|
||||
file5.jpg
|
||||
|
||||
$ rclone lsf --hash-filter 1/4 .
|
||||
file3.jpg
|
||||
file6.jpg
|
||||
file9.jpg
|
||||
|
||||
$ rclone lsf --hash-filter 2/4 .
|
||||
file2.jpg
|
||||
file4.jpg
|
||||
|
||||
$ rclone lsf --hash-filter 3/4 .
|
||||
file7.jpg
|
||||
file8.jpg
|
||||
|
||||
$ rclone lsf --hash-filter 4/4 . # the same as --hash-filter 0/4
|
||||
file1.jpg
|
||||
file5.jpg
|
||||
```
|
||||
|
||||
##### Syncing the first quarter of files
|
||||
|
||||
```
|
||||
rclone sync --hash-filter 1/4 source:path destination:path
|
||||
```
|
||||
|
||||
##### Checking a random 1% of files for integrity
|
||||
|
||||
```
|
||||
rclone check --download --hash-filter @/100 source:path destination:path
|
||||
```
|
||||
|
||||
## Other flags
|
||||
|
||||
### `--delete-excluded` - Delete files on dest excluded from sync
|
||||
|
||||
Reference in New Issue
Block a user