Ask HN: Identify duplicate files in my data hoard?
16 points by JKCalhoun on Sept 2, 2022 | 12 comments
Backing up my documents and files to portable hard drives has been my strategy for decades. I have so much data now, copied from one machine after another and from one drive to another, that I have multiple copies of many files squirreled away in different locations on my backup drives (who knows how that happens; organization strategies changing over time?).

There has to be a tool (script?) that can scan a volume and, using filenames and file sizes, come up with a list of possible duplicate files.

I don't think I would trust it to auto-cleanup (auto-delete duplicates) but at least with the duplicate paths laid out I could go through and begin the slow process of pruning until I have a final, canonical (and singular) backup of my files.

(Bonus if I can point it at iCloud for the same purpose.)




If you're handy with Python, a script that computes the MD5 of each file and saves it to a SQLite database isn't that hard to write. It can then identify common files irrespective of the file name.
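A minimal sketch of that approach (the database file name hashes.db and the table layout are just my own choices, not anything standard):

  import hashlib
  import os
  import sqlite3
  import sys

  def md5_of(path, chunk_size=1 << 20):
      # Hash in chunks so large files don't have to fit in memory.
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              h.update(chunk)
      return h.hexdigest()

  def index_tree(root, db_path="hashes.db"):
      # Walk the tree and record (path, size, md5) for every regular file.
      con = sqlite3.connect(db_path)
      con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER, md5 TEXT)")
      for dirpath, _dirs, names in os.walk(root):
          for name in names:
              path = os.path.join(dirpath, name)
              try:
                  con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                              (path, os.path.getsize(path), md5_of(path)))
              except OSError as err:
                  print(f"skipping {path}: {err}", file=sys.stderr)
      con.commit()
      return con

  def report_duplicates(con):
      # Any hash that appears more than once is a set of identical files.
      dupes = con.execute("SELECT md5 FROM files GROUP BY md5 HAVING COUNT(*) > 1").fetchall()
      for (md5,) in dupes:
          print("--", md5)
          for (path,) in con.execute("SELECT path FROM files WHERE md5 = ?", (md5,)):
              print(path)
          print()

  if __name__ == "__main__":
      report_duplicates(index_tree(sys.argv[1]))

Run it with the root of a backup drive as the argument; it deletes nothing, it only prints the duplicate groups.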


Duplicates aren't always bad… some files naturally exist in many places, and removing them from some of those places makes that directory/app incomplete.

If you do want to save space by storing one copy of the bits/blocks, and still retain the index of all the original locations… you can store all your backups on a ZFS file system with dedup turned on (this uses memory and has performance implications).

Or back everything up with restic:

https://github.com/restic/restic

…restic stores files encrypted in a tree keyed by their hash, so it naturally stores one copy of a file but as many references to it as needed. It has lookup and list functions that can tell you what's duplicated.

To simply find/report dups to be dealt with manually, you could quite easily MD5/SHA-1 your entire file tree, storing the output in a text file, which you can then pipe through sort, awk, and uniq to see which hashes occupy multiple lines… this is labor intensive. I just let my backup tools "compress" by saving one copy of each hash, and then it doesn't matter as much (in my opinion).
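If shell pipelines aren't your thing, the same sort/count step is a few lines of Python over an md5sum-style listing (one "hash  path" pair per line; checksums.txt is just a placeholder name):

  import sys
  from collections import defaultdict

  # Group an md5sum-style listing by hash and print only the hashes that
  # appear on more than one line, i.e. the duplicate candidates.
  groups = defaultdict(list)
  with open(sys.argv[1] if len(sys.argv) > 1 else "checksums.txt") as listing:
      for line in listing:
          if not line.strip():
              continue
          digest, path = line.rstrip("\n").split(None, 1)
          groups[digest].append(path)

  for digest, paths in groups.items():
      if len(paths) > 1:
          print(digest)
          for path in paths:
              print("  " + path)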

If it's pictures or some other specific file type that you want to focus on the most… I'd pick an app that's intended for cataloging those. Example: Adobe Lightroom shows me my duplicate pics and I can deal with those easily there.


DupeGuru is an interesting tool to find duplicates.

It's fast and flexible.

It can even search for similar files (binary, music and pictures).

https://github.com/arsenetar/dupeguru


There are two tools I regularly use for Linux:

  fdupes
  rmlint
They don't always give the same results. I also encountered problems scanning an SMB share, but I would say it is worth giving them a try.


There is also jdupes, a fork of fdupes, which can replace duplicate files with hardlinks to save space.



Czkawka worked pretty well for me.

https://github.com/qarmin/czkawka


If you're signed in to the OneDrive sync app on your computer, you can access your OneDrive using File Explorer. You can also access your folders from any device by using the OneDrive mobile app.


Do you really want to delete duplicate files?

If one of your drives gets bricked or accidentally formatted there's a chance you'll still have the files backed up somewhere else.


I can’t recommend Borg Backup enough (OSS).

It does deduplication at the chunk level.

This handles both duplicate files and large binaries that change slowly over time.
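As a toy illustration of what chunk-level dedup means (fixed-size chunks for simplicity; Borg actually uses content-defined chunking, plus compression and encryption):

  import hashlib

  CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, an arbitrary size for the example

  def store(path, chunk_store):
      # Split a file into chunks and keep each unique chunk once, keyed by
      # its hash; the file itself becomes a list of chunk references.
      refs = []
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
              digest = hashlib.sha256(chunk).hexdigest()
              chunk_store.setdefault(digest, chunk)  # identical chunks stored once
              refs.append(digest)
      return refs

Two identical files add nothing new to the store, and a large file where only a few blocks change adds only those chunks; content-defined chunking keeps that true even when inserted data shifts offsets, which is why Borg uses it.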


Uh... I need something like this on Android. My photos in folders have gotten out of hand.


rdfind is amazing for this. You can install it on Windows in a Linux shell; on a Mac it already works.



