Cleaning up the mess left from early-adopting N photo apps and websites that subsequently shut down is why I took on this project. I've got 20-odd hard drives that have accumulated over the years, filled with backups and libraries from Apple Photos, Aperture, Picasa, several hundred gigs of Google Takeout tarballs, and other ancient DAM apps. I wanted a single, organized, deduped copy of my photos and videos, skipping the thumbnails, the files missing original EXIF headers, and anything that has suffered bitrot.
Finally, I've got a single folder hierarchy I can rsync to my NAS or wherever, and know I got everything. There's a simple SQLite db I use for persistence, and a web server that sits on top of it that makes browsing and searching your whole library feel serendipitous.
So yeah, it's Google Photos that lives on your bookshelf. Viva the distributed web! I'm looking into the applicability of dat and ipfs for secure sharing soon.
I've got a limited number of beta users trying it out right now. If you're willing to share your feedback, please consider signing up. The beta is free.
I have on the order of terabytes of digital photos from QuickTake through Nikon's various Ds to the Sony A9, with various pocketables and all the generations of iPhone along the way. I have a quarter million iCloud Photos images, 30K on Flickr, etc.
So this looks fantastic! Subscribed ... very willing to be a beta tester and provide detailed feedback.
However, the problem I'm finding is a small percentage of file corruption from all the storage upgrading and copying over the years, meaning no given file can be 100% trusted to be a valid original.
I haven't found any file or photo deduplication tools with the savvy to figure out which of two identically sized and timestamped files is the least corrupt image.
In many cases, a second generation is viewable while the original is present but unusable. This most often applies to very old Aperture libraries that got copied from NAS to NAS over the years, where a "master" may be corrupt but it still has a viewable generated high res cache as a JPEG.
The implication is that the "structure" of the image files themselves has to be analyzed: is this an uncorrupted, viewable image?
Note that with JPEGs and various flavors of RAW, renderers will still happily open and display the file, but what humans see can show obvious bit rot. Conversely, some files are flagged as corrupt by file examination but can be viewed without problem.
To offer a "principle of least loss" for a mass merge of diverse collections, this would have to be figured out.
> which of two identically sized and timestamped files is the least corrupt image
What I've found on my older hard drive backups is file corruption due to bitrot or file truncation.
I use `jpegtran` to validate JPEG bytestreams, `dcraw` to validate RAW images, and `ffmpeg` to validate videos. At least for my quarter-million-file corpus, those tools detect corruption reliably enough that I'm comfortable skipping any file they flag. I actually had to write a bit rotter to generate test cases for this and to do glitch inspection.
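Roughly, the validation pass looks like this; a minimal sketch, assuming the three tools are on the PATH and routing by file extension only (a production scanner would sniff file contents rather than trust extensions, and the extension lists here are just illustrative):

```typescript
import { execFileSync } from "node:child_process";
import { extname } from "node:path";

// Run an external validator and treat any non-zero exit as "corrupt".
function passes(cmd: string, args: string[]): boolean {
  try {
    execFileSync(cmd, args, { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

export function looksIntact(path: string): boolean {
  const ext = extname(path).toLowerCase();
  if ([".jpg", ".jpeg"].includes(ext)) {
    // jpegtran has to parse the whole JPEG stream to do a lossless "copy",
    // so truncation and many bit flips make it exit non-zero.
    return passes("jpegtran", ["-copy", "none", "-outfile", "/dev/null", path]);
  }
  if ([".nef", ".cr2", ".arw", ".dng"].includes(ext)) {
    // dcraw -i identifies the file without decoding it, which catches
    // mangled RAW headers (a full decode would catch more).
    return passes("dcraw", ["-i", path]);
  }
  if ([".mp4", ".mov", ".avi"].includes(ext)) {
    // Decode the whole stream and discard the output; -xerror makes
    // ffmpeg exit non-zero on the first decode error.
    return passes("ffmpeg", ["-v", "error", "-xerror", "-i", path, "-f", "null", "-"]);
  }
  return true; // unknown types pass through untouched
}
```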
> To offer "principle of least loss" for mass merge of diverse collections, this would have to be figured out
Every unique SHA gets copied into your library (if you have copies enabled), but any given asset will have 1 or more asset files (which are merged in the UI and DB). To minimize risk from bugs^H^H^H^H "undiscovered features," PhotoStructure never moves or deletes files, other than its own cache and db.
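At its core the copy step is just "hash, check, copy"; a minimal sketch assuming SHA-256 and a made-up `destFor` naming scheme (the actual hashing and library layout may differ):

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";
import { copyFile } from "node:fs/promises";

// Stream the file through the hash so multi-GB originals don't need to fit in RAM.
function shaOf(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha256");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}

const seen = new Set<string>(); // SHAs already present in the library

// Copy a candidate original into the library only if its SHA is new.
// Source files are never moved or deleted; exact duplicates are simply
// recorded as additional asset files for the same asset.
async function importIfNew(src: string, destFor: (sha: string) => string): Promise<void> {
  const sha = await shaOf(src);
  if (seen.has(sha)) return; // exact duplicate of something already imported
  seen.add(sha);
  await copyFile(src, destFor(sha));
}
```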
> I've got 20-odd hard drives that have accumulated over the years, filled with backups and libraries from Apple Photos, Aperture, Picasa, several hundred gigs of Google Takeout tarballs, and other ancient DAM apps.
I'm in a similar boat. What I'd like to know is: where are the duplicates and what can I safely delete? Anything that can help me clear it up would be a godsend!
This was the approach I was originally considering (in-place duplicate deletion), but I eventually gave up on it due to the potential impact of "undiscovered features" in my code.
The approach I've settled on, which should work for most people, is to build a new library with unique copies of each of your originals, skipping exact SHA matches and invalid files.
In your case, though, you'd run PhotoStructure in its "don't copy into the library" mode. Once it finishes scanning your drives, you can run a simple SQL query against your SQLite db to get a list of duplicate files. That query will be in the FAQ.
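The query boils down to grouping asset files by their content hash. Here's a rough sketch using better-sqlite3, where the db path, table, and column names are placeholders rather than the real schema (the real query will be in the FAQ):

```typescript
import Database from "better-sqlite3";

// Open the library db read-only; nothing here writes to it.
const db = new Database("photostructure.db", { readonly: true });

// List every content hash that appears more than once, with all of its paths.
const dupes = db
  .prepare(
    `SELECT sha, COUNT(*) AS copies, GROUP_CONCAT(uri, ' | ') AS paths
       FROM AssetFile
      GROUP BY sha
     HAVING COUNT(*) > 1
      ORDER BY copies DESC`
  )
  .all();

for (const row of dupes) console.log(row);
```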
I could manage that query, but your average user couldn't. How about a way to export the results to a CSV file so they can be viewed and filtered in the user's choice of spreadsheet app?
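Even a tiny export helper would cover it; a rough sketch, assuming the duplicate query just returns plain row objects:

```typescript
import { writeFileSync } from "node:fs";

// Quote a value so commas, quotes, and newlines survive the trip into a spreadsheet.
const q = (v: unknown) => `"${String(v ?? "").replace(/"/g, '""')}"`;

// rows: whatever the duplicate query returned, e.g. [{ sha, copies, paths }, ...]
function writeCsv(rows: Record<string, unknown>[], outPath: string): void {
  if (rows.length === 0) return;
  const headers = Object.keys(rows[0]);
  const lines = [
    headers.map(q).join(","),
    ...rows.map((r) => headers.map((h) => q(r[h])).join(",")),
  ];
  writeFileSync(outPath, lines.join("\n") + "\n", "utf8");
}
```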
Man, why do so many sites add this shitty robot-checking stuff just for signing up for a mailing list? Your non-existent project even has it. I'm tired of selecting cars, bridges, and fire hydrants for two minutes at a time only for the robochecker to fail anyway. The audio workaround doesn't even work anymore. I'm really sick of captchas.
I am building it: https://PhotoStructure.com