Ask HN: Identify duplicate files in my data hoard?
16 points by JKCalhoun on Sept 2, 2022 | 12 comments
Backing up my documents and files to portable hard drives has been my strategy for decades. I have so much data now, copied from one machine after another and from one drive to another, that I have multiple copies of many files squirreled away in different locations on my backup drives (who knows how that happens; organization strategies changing over time?).

There has to be a tool (script?) that can scan a volume and, using filenames and file sizes, come up with a list of possible duplicate files.

I don't think I would trust it to auto-cleanup (auto-delete duplicates) but at least with the duplicate paths laid out I could go through and begin the slow process of pruning until I have a final, canonical (and singular) backup of my files.

(Bonus if I can point it at iCloud for the same purpose.)




If you're handy with Python, a script that computes the MD5 of each file and saves it to a SQLite database isn't that hard to write. It can then identify common files irrespective of the file name.
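A minimal sketch of that approach (the database file name hashes.db and the table layout are just my own choices, not anything standard):

  import hashlib
  import os
  import sqlite3
  import sys

  def md5_of(path, chunk_size=1 << 20):
      # Hash in chunks so large files don't have to fit in memory.
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              h.update(chunk)
      return h.hexdigest()

  def index_tree(root, db_path="hashes.db"):
      # Walk the tree and record (path, size, md5) for every regular file.
      con = sqlite3.connect(db_path)
      con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER, md5 TEXT)")
      for dirpath, _dirs, names in os.walk(root):
          for name in names:
              path = os.path.join(dirpath, name)
              try:
                  con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                              (path, os.path.getsize(path), md5_of(path)))
              except OSError as err:
                  print(f"skipping {path}: {err}", file=sys.stderr)
      con.commit()
      return con

  def report_duplicates(con):
      # Any hash that appears more than once is a set of identical files.
      dupes = con.execute("SELECT md5 FROM files GROUP BY md5 HAVING COUNT(*) > 1").fetchall()
      for (md5,) in dupes:
          print("--", md5)
          for (path,) in con.execute("SELECT path FROM files WHERE md5 = ?", (md5,)):
              print(path)
          print()

  if __name__ == "__main__":
      report_duplicates(index_tree(sys.argv[1]))

Run it with the root of a backup drive as the argument; it deletes nothing, it only prints the duplicate groups.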


Duplicates aren't always bad… some files naturally exist in many places, and removing them from some of those places makes that directory/app incomplete.

If you do want to save space by storing one copy of the bits/blocks, and still retain the index of all the original locations… you can store all your backups on a ZFS file system with dedup turned on (this uses memory and has performance implications).

Or back everything up with restic:

https://github.com/restic/restic

…restic stores files encrypted in a tree keyed by their hash, so it naturally stores one copy of a file but as many references to it as needed. It has lookup and list functions that can tell you what's duplicated.

To simply find/report dups to be dealt with manually, you could quite easily MD5/SHA-1 your entire file tree, storing the output in a text file, which you can then pipe through sort, awk, and uniq to see which hashes occupy multiple lines… this is labor intensive. I just let my backup tools "compress" by saving one copy of each hash, and then it doesn't matter as much (in my opinion).
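If shell pipelines aren't your thing, the same sort/count step is a few lines of Python over an md5sum-style listing (one "hash  path" pair per line; checksums.txt is just a placeholder name):

  import sys
  from collections import defaultdict

  # Group an md5sum-style listing by hash and print only the hashes that
  # appear on more than one line, i.e. the duplicate candidates.
  groups = defaultdict(list)
  with open(sys.argv[1] if len(sys.argv) > 1 else "checksums.txt") as listing:
      for line in listing:
          if not line.strip():
              continue
          digest, path = line.rstrip("\n").split(None, 1)
          groups[digest].append(path)

  for digest, paths in groups.items():
      if len(paths) > 1:
          print(digest)
          for path in paths:
              print("  " + path)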

If it's pictures or some other specific file type that you want to focus on the most… I'd pick an app that's intended for cataloging those. Example: Adobe Lightroom shows me my duplicate pics and I can deal with those easily there.


DupeGuru is an interesting tool to find duplicates.

It's fast and flexible.

It can even search for similar files (binary, music and pictures).

https://github.com/arsenetar/dupeguru


There are two tools I regularly use for Linux:

  fdupes
  rmlint
They don't always give the same results. I also encountered problems scanning an SMB share, but I would say it is worth giving them a try.


There is also jdupes, a fork of fdupes, which can replace duplicate files with hardlinks to save space.



Czkawka worked pretty well for me.

https://github.com/qarmin/czkawka


If you're signed in to the OneDrive sync app on your computer, you can access your OneDrive using File Explorer. You can also access your folders from any device by using the OneDrive mobile app.


Do you really want to delete duplicate files?

If one of your drives gets bricked or accidentally formatted there's a chance you'll still have the files backed up somewhere else.


I can’t recommend Borg Backup enough (OSS).

It does deduplication at the chunk level.

This handles both duplicate files and large binaries that change slowly over time.
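As a toy illustration of what chunk-level dedup means (fixed-size chunks for simplicity; Borg actually uses content-defined chunking, plus compression and encryption):

  import hashlib

  CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, an arbitrary size for the example

  def store(path, chunk_store):
      # Split a file into chunks and keep each unique chunk once, keyed by
      # its hash; the file itself becomes a list of chunk references.
      refs = []
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
              digest = hashlib.sha256(chunk).hexdigest()
              chunk_store.setdefault(digest, chunk)  # identical chunks stored once
              refs.append(digest)
      return refs

Two identical files add nothing new to the store, and a large file where only a few blocks change adds only those chunks; content-defined chunking keeps that true even when inserted data shifts offsets, which is why Borg uses it.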


Uh... I need something like this on Android. My photos in folders have gotten out of hand.


rdfind is amazing for this. You can install it on Windows in a Linux shell; on a Mac it already works.



