Cleaning up the mess left from early-adopting N photo apps and websites that subsequently shut down is why I took on this project. I've got 20-odd hard drives that have accumulated over the years, filled with backups and libraries from Apple Photos, Aperture, Picasa, several hundred gigs of Google Takeout tarballs, and other ancient DAM apps. I wanted a single, organized, deduped copy of my photos and videos, skipping the thumbnails, the files missing original EXIF headers, and anything that has suffered bitrot.
Finally, I've got a single folder hierarchy I can rsync to my NAS or wherever, and know I got everything. There's a simple SQLite db I use for persistence, and a web server that sits on top of it that makes browsing and searching your whole library feel serendipitous.
So yeah, it's Google Photos that lives on your bookshelf. Viva the distributed web! I'm looking into the applicability of dat and ipfs for secure sharing soon.
I've got a limited number of beta users trying it out right now. If you're willing to share your feedback, please consider signing up. The beta is free.
I have on the order of terabytes of digital photos from QuickTake through Nikon's various Ds to the Sony A9, with various pocketables and all the generations of iPhone along the way. I have a quarter million iCloud Photos images, 30K on Flickr, etc.
So this looks fantastic! Subscribed ... very willing to be a beta tester and provide detailed feedback.
However, the problem I'm finding is a small percentage of file corruption from all the storage upgrading and copying over the years, meaning no given file can be 100% trusted to be a valid original.
I haven't found any file or photo deduplication tools with the savvy to figure out which of two identically sized and timestamped files is the least corrupt image.
In many cases, a second generation is viewable while the original is present but unusable. This most often applies to very old Aperture libraries that got copied from NAS to NAS over the years, where a "master" may be corrupt but it still has a viewable generated high res cache as a JPEG.
The implication is that the "structure" of the image files themselves has to be analyzed: is this an uncorrupted, viewable image?
Note that with JPEGs and various flavors of RAW, renderers will still happily open and display the file, but what humans see can show obvious bit rot. Conversely, some files are flagged as corrupt by file examination but can be viewed without problem.
To offer a "principle of least loss" for a mass merge of diverse collections, this would have to be figured out.
> which of two identically sized and timestamped files is the least corrupt image
What I've found on my older hard drive backups is file corruption due to bitrot or file truncation.
I use `jpegtran` to validate JPEG bytestreams, `dcraw` to validate RAW images, and `ffmpeg` to validate videos. At least for my quarter-million-file corpus, those tools detect corruption reliably enough that I'm comfortable skipping any file they flag. I actually had to write a bit rotter to generate test cases for this and to do glitch inspection.
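Roughly, the validation pass looks like this; a minimal sketch, assuming the three tools are on the PATH and routing by file extension only (a production scanner would sniff file contents rather than trust extensions, and the extension lists here are just illustrative):

```typescript
import { execFileSync } from "node:child_process";
import { extname } from "node:path";

// Run an external validator and treat any non-zero exit as "corrupt".
function passes(cmd: string, args: string[]): boolean {
  try {
    execFileSync(cmd, args, { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

export function looksIntact(path: string): boolean {
  const ext = extname(path).toLowerCase();
  if ([".jpg", ".jpeg"].includes(ext)) {
    // jpegtran has to parse the whole JPEG stream to do a lossless "copy",
    // so truncation and many bit flips make it exit non-zero.
    return passes("jpegtran", ["-copy", "none", "-outfile", "/dev/null", path]);
  }
  if ([".nef", ".cr2", ".arw", ".dng"].includes(ext)) {
    // dcraw -i identifies the file without decoding it, which catches
    // mangled RAW headers (a full decode would catch more).
    return passes("dcraw", ["-i", path]);
  }
  if ([".mp4", ".mov", ".avi"].includes(ext)) {
    // Decode the whole stream and discard the output; -xerror makes
    // ffmpeg exit non-zero on the first decode error.
    return passes("ffmpeg", ["-v", "error", "-xerror", "-i", path, "-f", "null", "-"]);
  }
  return true; // unknown types pass through untouched
}
```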
> To offer "principle of least loss" for mass merge of diverse collections, this would have to be figured out
Every unique SHA gets copied into your library (if you have copies enabled), but any given asset will have 1 or more asset files (which are merged in the UI and DB). To minimize risk from bugs^H^H^H^H "undiscovered features," PhotoStructure never moves or deletes files, other than its own cache and db.
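At its core the copy step is just "hash, check, copy"; a minimal sketch assuming SHA-256 and a made-up `destFor` naming scheme (the actual hashing and library layout may differ):

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";
import { copyFile } from "node:fs/promises";

// Stream the file through the hash so multi-GB originals don't need to fit in RAM.
function shaOf(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha256");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}

const seen = new Set<string>(); // SHAs already present in the library

// Copy a candidate original into the library only if its SHA is new.
// Source files are never moved or deleted; exact duplicates are simply
// recorded as additional asset files for the same asset.
async function importIfNew(src: string, destFor: (sha: string) => string): Promise<void> {
  const sha = await shaOf(src);
  if (seen.has(sha)) return; // exact duplicate of something already imported
  seen.add(sha);
  await copyFile(src, destFor(sha));
}
```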
> I've got 20-odd hard drives that have accumulated over the years, filled with backups and libraries from Apple Photos, Aperture, Picasa, several hundred gigs of Google Takeout tarballs, and other ancient DAM apps.
I'm in a similar boat. What I'd like to know is: where are the duplicates and what can I safely delete? Anything that can help me clear it up would be a godsend!
This was the approach I was originally considering (in-place duplicate deletion), but I eventually gave up on it due to the potential impact of "undiscovered features" in my code.
The approach I've settled on, which should work for most people, is to build a new library with unique copies of each of your originals, skipping exact SHA matches and invalid files.
In your case, though, you'd run PhotoStructure in its "don't copy into the library" mode. Once it finishes scanning your drives, you can run a simple SQL query against your SQLite db to get a list of duplicate files. That query will be in the FAQ.
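The query boils down to grouping asset files by their content hash. Here's a rough sketch using better-sqlite3, where the db path, table, and column names are placeholders rather than the real schema (the real query will be in the FAQ):

```typescript
import Database from "better-sqlite3";

// Open the library db read-only; nothing here writes to it.
const db = new Database("photostructure.db", { readonly: true });

// List every content hash that appears more than once, with all of its paths.
const dupes = db
  .prepare(
    `SELECT sha, COUNT(*) AS copies, GROUP_CONCAT(uri, ' | ') AS paths
       FROM AssetFile
      GROUP BY sha
     HAVING COUNT(*) > 1
      ORDER BY copies DESC`
  )
  .all();

for (const row of dupes) console.log(row);
```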
I could manage that query, but your average user couldn't. How about a way to export the results to a CSV file so they can be viewed and filtered in the user's choice of spreadsheet app?
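Even a tiny export helper would cover it; a rough sketch, assuming the duplicate query just returns plain row objects:

```typescript
import { writeFileSync } from "node:fs";

// Quote a value so commas, quotes, and newlines survive the trip into a spreadsheet.
const q = (v: unknown) => `"${String(v ?? "").replace(/"/g, '""')}"`;

// rows: whatever the duplicate query returned, e.g. [{ sha, copies, paths }, ...]
function writeCsv(rows: Record<string, unknown>[], outPath: string): void {
  if (rows.length === 0) return;
  const headers = Object.keys(rows[0]);
  const lines = [
    headers.map(q).join(","),
    ...rows.map((r) => headers.map((h) => q(r[h])).join(",")),
  ];
  writeFileSync(outPath, lines.join("\n") + "\n", "utf8");
}
```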
Man, why do so many sites add this shitty robot-checking stuff just for signing up for a mailing list? Your non-existent project even has it. I'm tired of selecting cars, bridges, and fire hydrants for two minutes at a time only for the robochecker to fail anyway. The audio workaround doesn't even work anymore. I'm really sick of captchas.
I am building it: https://PhotoStructure.com