Pietro Battiston

Contents

What is this?
Is ffdupes efficient?
- Technical parenthesis (aka TODO)
Is ddupes efficient?
Who should I blame if this sucks?
Requirements

Click here to download version 2.3 of ddupes. Click here to download the Debian package.

Ubuntu packages can be found in my PPA.

What is this?

ddupes is a Python program that extends the action of fdupes to directories.

ffdupes (“fast fdupes”) is an enhanced version of fdupes.

Update: when I wrote this page, I was unaware of the many other command line tools that find duplicate files: Debian, for instance, ships not only fdupes but also rdfind, hardlink, finddup and duff. I have no idea how they compare to ffdupes; they may well outperform it. I did not, however, find any replacement for ddupes. Note that these tools do not share a common interface (arguments and output), so ddupes cannot use their output.

fdupes/ffdupes “finds duplicate files in a given set of directories”.

ddupes instead finds duplicate files and directories in a given set of directories.

Basically, ddupes takes the output of fdupes/ffdupes (run recursively) and compares the folders that contain identical files, to see whether the folders themselves are identical.

Once it has detected all groups of identical folders, it prints them in order of size (see also the -w option).
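The folder comparison described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual ddupes code: the function name duplicate_dirs and its arguments are hypothetical, the groups of duplicate files are passed in as Python lists rather than parsed from fdupes output, two folders count as identical only if their entries also have the same names (whether ddupes requires this is not stated here), and the result is not sorted by size.

```python
import os
from collections import defaultdict

def duplicate_dirs(root, file_groups):
    """Return groups of directories under `root` with identical content.

    `file_groups` is a list of lists of paths; each inner list holds files
    already known to have the same content (hypothetical input format).
    """
    # Map every duplicated file to the index of the group it belongs to.
    content_id = {}
    for index, group in enumerate(file_groups):
        for path in group:
            content_id[os.path.normpath(path)] = index

    # directory -> hashable signature, or None if it holds unique content.
    signature = {}
    # Walk bottom-up, so subdirectory signatures exist before their parent's.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        dirpath = os.path.normpath(dirpath)
        entries = set()
        complete = True
        for name in filenames:
            cid = content_id.get(os.path.join(dirpath, name))
            if cid is None:
                complete = False  # a file with no duplicate anywhere
            entries.add(("file", name, cid))
        for name in dirnames:
            sub = signature.get(os.path.join(dirpath, name))
            if sub is None:
                complete = False  # a subdirectory with unique content
            entries.add(("dir", name, sub))
        signature[dirpath] = frozenset(entries) if complete else None

    # Directories sharing a signature are identical.
    groups = defaultdict(list)
    for directory, sig in signature.items():
        if sig is not None:
            groups[sig].append(directory)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

A folder holding even one file that has no duplicate elsewhere can never match another folder, which is why such folders get no signature at all. Note that when two folders are identical, this sketch also reports their matching subfolders as separate groups.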

To see a detailed list of options, run ddupes or ffdupes with the “--help” option.

Is ffdupes efficient?

In the very few tests made, it appears to be much more efficient than fdupes, taking around half the time. The main reason is that ffdupes doesn't necessarily read in full every file it must compare: it first compares the heads of the files, and reads the rest only if the heads match.
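The head-first strategy can be illustrated with a minimal sketch. It is not the ffdupes implementation: the function name same_content is hypothetical, and so is the head length of 4096 bytes, since the value actually used is not stated here.

```python
import os

HEAD_SIZE = 4096  # hypothetical head length, chosen only for illustration

def same_content(path_a, path_b, head_size=HEAD_SIZE, chunk_size=65536):
    """Return True if the two files have identical content."""
    # Files of different sizes can never be duplicates.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        # Cheap step: compare only the heads.
        if file_a.read(head_size) != file_b.read(head_size):
            return False
        # Expensive step, reached only when the heads match.
        while True:
            block_a = file_a.read(chunk_size)
            block_b = file_b.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files are exhausted
                return True
```

Files that differ near the beginning are rejected after a single small read each, which is where the saving comes from when most candidates are not real duplicates.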

A larger test (thanks, Florian Bruhin!), run on 2.5 TB of data in ~727 000 files, gave the following results:

fdupes: 6 Hours 23 Minutes
ffdupes: 4 Hours 19 Minutes
ddupes: 40 Minutes

That said, in the worst case, where many files are almost identical and differ only in their final part, ffdupes may perform worse than fdupes (even taking twice as long). Please let me know if you find real-life situations in which this happens.

If ffdupes is used with the “--algorithm” option set to “adler32”, it will run slower on average, but faster in the worst case (in particular, it will run faster than fdupes in all cases).
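For readers unfamiliar with Adler-32, here is a minimal sketch of how files can be bucketed by that checksum using Python's standard zlib module. How ffdupes uses the checksum internally is not described here, so the two function names below are hypothetical and the grouping logic is only one plausible reading.

```python
import zlib
from collections import defaultdict

def adler32_of_file(path, chunk_size=65536):
    """Compute the Adler-32 checksum of a file without loading it whole."""
    checksum = 1  # Adler-32 starts from 1
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # Feed the running value back in to continue the checksum.
            checksum = zlib.adler32(chunk, checksum)
    return checksum

def group_by_checksum(paths):
    """Return lists of paths that share the same Adler-32 checksum."""
    groups = defaultdict(list)
    for path in paths:
        groups[adler32_of_file(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Each file is read exactly once, whatever its content, which matches the trade-off described above: more reading on average, but no bad worst case. Since Adler-32 is only 32 bits long, two different files can share a checksum, so a match is a candidate rather than a proof.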

If ffdupes is used with the “--algorithm” option set to “md5”, it will behave exactly like fdupes (and hence presumably run slightly slower, because fdupes is written in C and Python is slower than C).

Note that ffdupes does not (yet?) support all fdupes options.

Technical parenthesis (aka TODO)

In theory, ffdupes could be improved to perform better than fdupes in every case. Unfortunately this is not possible at the moment, because it uses filecmp to compare files, and it is not easy to checksum files while comparing them. Probably the best approach would be to use named pipes to checksum on the fly (storing in memory the checksums for exponentially growing lengths of heads) while doing the comparison at the same time by reading from the named pipe.
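The idea of checksumming heads of exponentially growing length can be sketched as below. This covers only that half of the proposal: it does not use named pipes and does not perform the byte-by-byte comparison. The function names, the starting length of 4096 bytes and the choice of Adler-32 are all assumptions made for the example.

```python
import zlib
from itertools import zip_longest

def head_checksums(path, first=4096):
    """Yield (length, checksum of the first `length` bytes) for heads of
    length first, 2*first, 4*first, ... and finally for the whole file."""
    checksum = 1  # Adler-32 starts from 1
    done = 0
    target = first
    with open(path, "rb") as f:
        while True:
            data = f.read(target - done)
            if not data:
                return  # end of file reached exactly on a boundary
            checksum = zlib.adler32(data, checksum)
            done += len(data)
            yield done, checksum
            if done < target:
                return  # short read: the file ended inside this head
            target *= 2

def may_be_duplicates(path_a, path_b):
    """Return False as soon as two head checksums disagree."""
    pairs = zip_longest(head_checksums(path_a), head_checksums(path_b))
    for item_a, item_b in pairs:
        if item_a != item_b:
            return False
    return True
```

Because the generators are lazy, two files that differ early are told apart after reading only their first heads. Storing the yielded values per file would let them be reused when the same file is compared against many others.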

Is ddupes efficient?

ddupes is written in Python, which is an interpreted language; however, this doesn't seem to be a key problem, since in all the tests made the step that takes by far the most time is the invocation of fdupes/ffdupes.

In general, the time taken by the ddupes step proper depends on how many duplicate files there are, not just on the size of the directory structure given as argument. Still, tests run on a directory containing around 100 000 files/folders, which produced around 20 000 lines of fdupes output with groups containing thousands of duplicates, showed that the ddupes step needs less than 20 seconds.

ddupes could scale very badly when there are very big groups of duplicates (thousands of members) residing in directories that are very similar but not identical. This should be quite a remote possibility, but if you do find a pathological case, please report it.

Who should I blame if this sucks?

Pietro Battiston - me@pietrobattiston.it

The latest version of ddupes can always be found at http://www.pietrobattiston.it/ddupes. The source repository can be obtained with

git clone git://pietrobattiston.it/ddupes

and browsed at http://www.pietrobattiston.it/gitweb

Requirements

ddupes and ffdupes are written in Python, so you need Python to run them.