ddupes

This is an old version of the document!


Click here to download version 2.3 of ddupes. Click here to download the Debian package.

Ubuntu packages can be found in my PPA.

What is this?

ddupes is a Python program which extends the action of fdupes to directories.

ffdupes (“fast fdupes”) is an enhanced version of fdupes.

fdupes/ffdupes “finds duplicate files in a given set of directories”.

ddupes instead finds duplicate files and directories in a given set of directories.

Basically, ddupes takes the output of fdupes/ffdupes (run recursively) and starts comparing folders containing identical files, to see whether the folders themselves are identical.

When it has detected all groups of identical folders, it prints them in order of size (see also the -w option).
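The folder-comparison step described above can be sketched as follows. This is not the actual ddupes code: the helper name `duplicate_dirs` is hypothetical, and this simplified version only looks at a directory's immediate duplicate files, ignoring nested subdirectories and entries with no duplicates, which the real program must also handle.

```python
import os
from collections import defaultdict

def duplicate_dirs(dup_groups):
    """Given groups of duplicate files (lists of paths, as fdupes would
    report them), find directories whose listed contents are identical.

    Each file is tagged with the id of its duplicate group; a directory
    is then summarized by the set of (name, group id) pairs of its
    entries, and directories with equal summaries hold identical files."""
    # map each path to the id of its duplicate group
    group_of = {}
    for gid, group in enumerate(dup_groups):
        for path in group:
            group_of[path] = gid

    # summarize each parent directory by its entries' names and group ids
    by_dir = defaultdict(list)
    for path, gid in group_of.items():
        by_dir[os.path.dirname(path)].append((os.path.basename(path), gid))

    # directories sharing the same sorted summary are duplicates
    by_summary = defaultdict(list)
    for d, entries in by_dir.items():
        by_summary[tuple(sorted(entries))].append(d)

    return [dirs for dirs in by_summary.values() if len(dirs) > 1]
```

For example, if fdupes reports that `a/x` duplicates `b/x` and `a/y` duplicates `b/y`, the sketch groups `a` and `b` together.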

To see a detailed list of options, run ddupes or ffdupes with the option “--help”.

Is ffdupes efficient?

It seems to be much more efficient than fdupes (taking around half the time in the very few tests made). The main reason is that ffdupes doesn't necessarily read the whole of every file it must compare: it first compares the heads, and reads the rest only if the heads match.
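The lazy comparison just described can be sketched like this (a minimal illustration, not ffdupes' actual code; the function name and chunk size are assumptions):

```python
import os

def files_equal(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files lazily: read and compare successive chunks,
    stopping at the first mismatch, so that files differing early on
    are rejected after reading only their heads."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files exhausted: they match
                return True
```

This also illustrates the worst case mentioned below: two files identical except in their final bytes are read in full before the mismatch is found.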

That said, in the worst case, in which there are many files that are almost identical except for their final part, ffdupes may perform worse than fdupes (even taking twice the time). Please let me know if you find real-life situations in which this happens.

If ffdupes is used with the “--algorithm” option set to “adler32”, it will run slower on average, but faster in the worst case (in particular, it will run faster than fdupes in all cases).

If ffdupes is used with the “--algorithm” option set to “md5”, it will behave exactly like fdupes (and hence presumably run slightly slower, since Python is slower than C, and fdupes is written in C).
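The checksum-based modes trade lazy comparison for a single full read of every file: candidates are grouped by checksum, so each file is read exactly once, which bounds the worst case. A sketch of that approach, assuming hypothetical helper names:

```python
import hashlib
import zlib
from collections import defaultdict

def checksum(path, algorithm="md5", chunk_size=64 * 1024):
    """Checksum a file in one pass, reading it chunk by chunk."""
    with open(path, "rb") as f:
        if algorithm == "adler32":
            value = 1  # adler32's initial value
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    return value
                value = zlib.adler32(chunk, value)
        else:  # md5
            h = hashlib.md5()
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    return h.hexdigest()
                h.update(chunk)

def group_by_checksum(paths, algorithm="md5"):
    """Group same-checksum files; every file is read exactly once."""
    groups = defaultdict(list)
    for path in paths:
        groups[checksum(path, algorithm)].append(path)
    return [g for g in groups.values() if len(g) > 1]
```

adler32 is cheaper to compute than md5 but weaker, so a real tool would still need a byte-by-byte confirmation of adler32 matches.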

Notice that ffdupes (still?) doesn't support all fdupes options.

Technical parenthesis (aka TODO)

Theoretically, ffdupes could be improved to perform better than fdupes in every case, but at the moment this is not possible because it uses filecmp to compare files, and it is not easy to checksum files while comparing them. Probably the best approach would be to use named pipes to checksum on the fly (storing in memory the checksums of exponentially growing head lengths), while at the same time doing the comparison by reading the named pipe.
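The checksum-caching part of this idea can be illustrated in isolation (a hypothetical sketch, not part of ffdupes; it leaves out the named-pipe machinery): read a file once and record checksums of its prefixes at exponentially growing lengths, so that two files can later be rejected as different by comparing a few cached checksums instead of re-reading both.

```python
import zlib

def prefix_checksums(path, first=1024):
    """Read a file once, recording adler32 checksums of its prefixes of
    exponentially growing lengths (first, 2*first, 4*first, ...), plus
    the checksum of the whole file, as (length, checksum) pairs."""
    sums = []
    value = 1  # adler32's initial value
    target = first
    read = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(target - read)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)  # running checksum
            read += len(chunk)
            if read == target:
                sums.append((read, value))
                target *= 2
    if not sums or sums[-1][0] != read:
        sums.append((read, value))  # checksum of the whole file
    return sums
```

Since adler32 is a rolling-style checksum (`zlib.adler32(b, zlib.adler32(a)) == zlib.adler32(a + b)`), each prefix checksum comes for free during the single sequential read.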

Is ddupes efficient?

ddupes is written in Python, which is an interpreted language; however, this doesn't seem to be a key problem, since in all the tests made the step that takes by far the most time is the invocation of fdupes/ffdupes.

In general, the ddupes step proper takes more time when there are many duplicate files, not just when the directory structure given as argument is bigger. Still, in tests run on a directory containing around 100 000 files/folders, which produced around 20 000 lines of fdupes output with groups containing thousands of duplicates, the ddupes step needed less than 20 seconds.

ddupes could scale very badly when there are very big groups of duplicates (thousands of members) residing in directories which are very similar but not identical. This should be quite a remote eventuality, but if you do find some pathological case, please report it.

Who should I blame if this sucks?

Pietro Battiston - me@pietrobattiston.it

The latest version of ddupes can always be found at http://www.pietrobattiston.it/ddupes.

Requirements

ddupes and ffdupes are written in Python, so you need Python to run them.

ddupes.1318366012.txt.gz · Last modified: 2011/10/11 22:46 by pietro