The reference to rsync in Dropbox's YCombinator application is a bit of a tipoff --- rsync uses exactly this technique to avoid recopying files that already exist at the destination.
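The trick rsync relies on is a weak rolling checksum that can be slid across a file one byte at a time in O(1), so matching blocks at the destination can be found cheaply. A minimal sketch of that idea (the modulus and function names here are illustrative; real rsync pairs this weak checksum with a strong checksum to confirm matches):

```python
M = 1 << 16  # modulus for the Adler-32-style weak checksum

def weak_checksum(block: bytes) -> tuple[int, int]:
    """Compute the two-part weak checksum of a block directly."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, n: int, old: int, new: int) -> tuple[int, int]:
    """Slide the window one byte forward in O(1) instead of rehashing."""
    a = (a - old + new) % M
    b = (b - n * old + a) % M
    return a, b

data = b"the quick brown fox jumps over the lazy dog"
n = 16
a, b = weak_checksum(data[0:n])
# Rolling the window forward gives the same result as recomputing:
a, b = roll(a, b, n, data[0], data[n])
assert (a, b) == weak_checksum(data[1:n + 1])
```

The O(1) update is what makes it practical to test every byte offset of a large file against a set of known block checksums.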
Yep... I never dug much into that reason, but that's why I quoted it and quoted the definition of 'Diff'. I guess I never brought it home as coherently as I wanted to.
I thought this was common knowledge. It wouldn't make sense to store Pirated.Movie.DVDRiP.avi thousands of times if it's the same file for thousands of users. Files that hash to the same value get served from a single copy on Dropbox's servers.
It's a pretty common optimization. They had a good plan, and executed well, but let's not get carried away; other people were working on similar ideas. I'd bet that data stored in S3 is deduplicated in a similar manner.
Another plus of their setup is that the file's hash is calculated on your machine: you pay them, yet your own computer does the hashing, and a file is only uploaded if they haven't seen it before.
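The client-side flow being described can be sketched as content-addressed storage: hash locally, ask the server if it already has that content, and transfer bytes only on a miss. This is a minimal illustrative model, not Dropbox's actual API; the `STORE` dict stands in for their backend:

```python
import hashlib

# Hypothetical server-side store mapping content hash -> file bytes.
STORE: dict[str, bytes] = {}

def upload(data: bytes) -> tuple[str, bool]:
    """Client-side dedup: hash the file locally and send the bytes
    only if the server hasn't seen that hash. Returns (hash, sent?)."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in STORE:
        return digest, False   # server already has it; no transfer
    STORE[digest] = data       # first uploader pays the bandwidth
    return digest, True

# Two users "upload" the same file; only the first one sends bytes.
h1, sent1 = upload(b"Pirated.Movie.DVDRiP.avi contents")
h2, sent2 = upload(b"Pirated.Movie.DVDRiP.avi contents")
assert h1 == h2 and sent1 and not sent2
```

Every later uploader of the same content gets an instant "upload" because only the hash crosses the wire.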
Well... I'm not 100% sure, but that's the only logical explanation, given that the 241MB file from the example I gave you uploaded in a minute or two.
Well... considering the alternative, that's the least of it. The alternative being that they store every file X times, where X is the number of users who upload that file.
The money wasted on bandwidth from the redundant transfers is minimal in comparison.