filter: Add --hash-filter to deterministically select a subset of files

Fixes #8400
This commit is contained in:
Nick Craig-Wood
2025-02-25 17:48:15 +00:00
parent 4e2b78f65d
commit 3fb4164d87
4 changed files with 226 additions and 2 deletions

View File

@@ -2808,6 +2808,7 @@ For the filtering options
* `--max-size`
* `--min-age`
* `--max-age`
* `--hash-filter`
* `--dump filters`
* `--metadata-include`
* `--metadata-include-from`

View File

@@ -718,6 +718,98 @@ old or more.
See [the time option docs](/docs/#time-option) for valid formats.
### `--hash-filter` - Deterministically select a subset of files {#hash-filter}
The `--hash-filter` flag enables selecting a deterministic subset of files, useful for:
1. Running large sync operations across multiple machines.
2. Checking a subset of files for bitrot.
3. Any other operations where a sample of files is required.
#### Syntax
The flag takes two parameters expressed as a fraction:
```
--hash-filter K/N
```
- `N`: The total number of partitions (must be a positive integer).
- `K`: The specific partition to select (an integer from `0` to `N`).
For example:
- `--hash-filter 1/3`: Selects the first third of the files.
- `--hash-filter 2/3` and `--hash-filter 3/3`: Select the second and third partitions, respectively.
Each partition is non-overlapping, ensuring all files are covered without duplication.
#### Random Partition Selection
Use `@` as `K` to randomly select a partition:
```
--hash-filter @/M
```
For example, `--hash-filter @/3` will randomly select a number between 0 and 2. This will stay constant across retries.
#### How It Works
- Rclone takes each file's full path, normalizes it to lowercase, and applies Unicode normalization.
- It then hashes the normalized path into a 64 bit number.
- The hash result is reduced modulo `N` to assign the file to a partition.
- If the calculated partition does not match `K` the file is excluded.
- Other filters may apply if the file is not excluded.
**Important:** Rclone will traverse all directories to apply the filter.
#### Usage Notes
- Safe to use with `rclone sync`; source and destination selections will match.
- **Do not** use with `--delete-excluded`, as this could delete unselected files.
- Ignored if `--files-from` is used.
#### Examples
##### Dividing files into 4 partitions
Assuming the current directory contains `file1.jpg` through `file9.jpg`:
```
$ rclone lsf --hash-filter 0/4 .
file1.jpg
file5.jpg
$ rclone lsf --hash-filter 1/4 .
file3.jpg
file6.jpg
file9.jpg
$ rclone lsf --hash-filter 2/4 .
file2.jpg
file4.jpg
$ rclone lsf --hash-filter 3/4 .
file7.jpg
file8.jpg
$ rclone lsf --hash-filter 4/4 . # the same as --hash-filter 0/4
file1.jpg
file5.jpg
```
##### Syncing the first quarter of files
```
rclone sync --hash-filter 1/4 source:path destination:path
```
##### Checking a random 1% of files for integrity
```
rclone check --download --hash-filter @/100 source:path destination:path
```
## Other flags
### `--delete-excluded` - Delete files on dest excluded from sync