
It sounds to me like Mega is just using block-level hardware de-duplication and they're making sure it's clearly spelled out in the Terms.

After all, if each file is being encrypted with different keys, then Alice and Bob's encrypted copies of THE HOBBIT wouldn't match at all. Even just removing the file from Mega's servers and re-uploading it would seemingly change the key and thus the encrypted data entirely, right?

And when encrypted blocks do happen to match, there'd be absolutely no way of knowing whether they represent the same block of the same source file. In fact, one could be reasonably confident that a block overlap was purely incidental, given the random unique keys (barring completely broken key generation).
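
Concretely, the situation being described looks something like this toy sketch (AES-GCM and 128-bit keys are my assumptions here, not anything Mega has published):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    plaintext = b"the same THE HOBBIT bytes for Alice and Bob"

    def encrypt_with_fresh_key(data):
        key = AESGCM.generate_key(bit_length=128)   # an independent random key per user/upload
        nonce = os.urandom(12)
        return AESGCM(key).encrypt(nonce, data, None)

    # Same plaintext, two random keys: the ciphertexts share no structure,
    # so block-level dedup has nothing to match on.
    print(encrypt_with_fresh_key(plaintext) == encrypt_with_fresh_key(plaintext))   # False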




> After all, if each file is being encrypted with different keys, then Alice and Bob's encrypted copies of THE HOBBIT wouldn't match at all. Even just removing the file from Mega's servers and re-uploading it would seemingly change the key and thus the encrypted data entirely, right?

As far as I can tell, they're generating the keys for the files from a hash of the file, meaning the keys are not random and unique (for the files -- user keys are different). I described rough steps for secure dedupe in another comment below.
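
In rough code, "keys from a hash of the file" amounts to convergent encryption, something like the sketch below (the cipher choice and the derivation details are my guesses, not Mega's actual scheme):

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def convergent_encrypt(data):
        key = hashlib.sha256(b"key" + data).digest()[:16]       # key depends only on the content
        nonce = hashlib.sha256(b"nonce" + data).digest()[:12]   # deterministic nonce is fine here: the key is content-bound
        return AESGCM(key).encrypt(nonce, data, None)

    plaintext = b"the same THE HOBBIT bytes for Alice and Bob"
    # Identical files produce identical ciphertexts, so the server can dedup them
    # (and can also recognize any file whose fingerprint it already knows).
    print(convergent_encrypt(plaintext) == convergent_encrypt(plaintext))   # True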


Then the encryption isn't worth the processing time. And as a 'legal shield' it seems like it will not only be utterly ineffective in court, but it's going to be cited as proof of Mega's ability to know what people are hosting, due to their generating a file fingerprint as a matter of course.

They're going to wind up having to maintain a hash blacklist which is going to be just annoying enough that people aren't going to bother.

It's a shame Mega isn't just running an encrypted block-level store. They should have shipped client binaries that handle the encryption and key generation and simply exchanged blocks with Mega servers.
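
Something along these lines, as a hypothetical sketch (the block size, manifest layout, and upload_block callback are all invented for illustration; this is not how Mega actually works):

    import os, hashlib, json
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    BLOCK_SIZE = 1024 * 1024   # 1MB blocks, purely for the sake of the example

    def store_file(path, upload_block):
        key = AESGCM.generate_key(bit_length=128)   # random per-file key; it never leaves the client
        manifest = []
        with open(path, "rb") as f:
            while block := f.read(BLOCK_SIZE):
                nonce = os.urandom(12)
                ct = AESGCM(key).encrypt(nonce, block, None)
                block_id = hashlib.sha256(ct).hexdigest()   # server only ever addresses ciphertext
                upload_block(block_id, ct)                  # hypothetical "send this block to Mega" call
                manifest.append({"block": block_id, "nonce": nonce.hex()})
        return key, json.dumps(manifest)   # key and manifest stay (or get re-encrypted) client-side

    # e.g. store_file("the_hobbit.epub", lambda block_id, ct: None)

Of course, with random per-file keys like this you give up cross-user dedup entirely, which is exactly the trade-off being argued about above.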


No, it's the other way around; a block overlap is effectively a guarantee it's the same file with the same key.

Suppose they are using 1MB chunks and 16-byte keys. A 1MB chunk can be any of roughly 2^(8 millionish) different things; a key can be only one of 2^128. So for two different 1MB chunks to encrypt to the same ciphertext under different keys, two of their possible 2^128[1] encryptions would have to coincide within a possibility space of 2^(8 millionish); that any two different chunks would even have such an overlap available is exceedingly, exceedingly improbable, let alone that the two users would end up holding exactly the keys that manifest it.

There are only two plausible theories when an overlap occurs: 1. It is the same file with the same key or 2. Someone's worked out how to artificially create a collision. And those two are hardly equal in probability either....

[1]: Yeah, that's a simplification since there aren't actually 2^128 valid keys. Roll with it. Actually I've taken a number of small liberties for simplicity; none of them matter.
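
To put rough numbers on the argument (with the same simplifications as the footnote):

    chunk_bits = 8 * 2**20   # log2 of the number of possible 1MB chunks (~8.4 million)
    key_bits = 128           # log2 of the number of keys, i.e. of possible encryptions per chunk

    # Probability, in log2 terms, that two *different* chunks have any encryption
    # in common at all, treating each chunk's 2**128 possible encryptions as
    # random points in the space of all possible chunks:
    log2_p_overlap_exists = 2 * key_bits - chunk_bits
    print(log2_p_overlap_exists)   # about -8388352: the overlap almost certainly doesn't even exist,
                                   # before asking whether both users hold exactly the right keys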


Why would you choose so large a block size for de-duplication? Your disk space savings decrease as block size increases. Choosing a smaller, more effective block size will increase the chance of collisions. And collisions are rather the desired goal of de-duplication.


And as the blocks get smaller, all of your other per-block costs go up. While I do not work on the backup product my company produces, I've used it as a service, and the team has told me 1MB is pretty much too small nowadays, though it made sense when they started.

People have this weird fetish around deduplication, but it isn't magic. It makes it so the tenth stored backup of Windows XP hardly costs you any additional space. This is where the astonishing compression claims come from. It does not magically compress much of anything else, though. The claims are true, but not generally applicable.

In practice, the middle bits of otherwise unrelated files don't get de-duped, excepting a couple of obvious and rare cases like 'a megabyte of zero padding in the middle of a file', which hardly amount to anything. Your World of Warcraft texture file is simply not going to overlap with your eBook copy of 50 Shades of Grey, and in general, two things sampled even from a 2^(524,288) possibility space, which corresponds to an absurdly small 64KB block size, are very unlikely to collide.

Block size has very little effect on collisions; what collides are identical files, or files that are nearly identical because they are versions of the same thing (and block size tends to matter surprisingly little there for various reasons), and what doesn't collide is everything else.


On a related note, if any of you are designing a file format, please place metadata at the end, so multiple copies of the same file with different metadata can be easily deduplicated. If the metadata is variable-length and at the beginning, all the block boundaries get shifted. To cope with shifted files, one ends up deciding where the dedup block boundaries fall with a pseudorandom function of the block contents (rather than at multiples of the filesystem block size), so that the boundaries can re-sync between two files that begin with different-length metadata.
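
The boundary trick looks roughly like this (a simplified content-defined chunking sketch in the spirit of rsync/LBFS-style chunkers; the gear table, mask, and average chunk size are arbitrary choices for illustration, not any particular product's algorithm):

    import hashlib, os

    # Per-byte pseudorandom values ("gear" table), derived deterministically here.
    GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:8], "big") for b in range(256)]
    MASK = (1 << 13) - 1   # a boundary roughly every 8KB of content on average

    def chunk_boundaries(data):
        h, start, chunks = 0, 0, []
        for i, b in enumerate(data):
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF   # rolling hash of the recent bytes
            if (h & MASK) == 0:                             # boundary decided by content, not offset
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        chunks.append(data[start:])
        return chunks

    body = os.urandom(1_000_000)   # stand-in for shared file contents
    v1 = chunk_boundaries(b"short metadata" + body)
    v2 = chunk_boundaries(b"much longer, different metadata" + body)
    shared = {hashlib.sha256(c).hexdigest() for c in v1} & {hashlib.sha256(c).hexdigest() for c in v2}
    print(len(v1), len(v2), len(shared))   # nearly all chunks after the first line up again

Because the boundaries depend only on nearby content, the two versions fall back into step shortly after the differing prefixes, and everything downstream dedups.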

For that matter, if you're writing a library to handle an existing file format that has a flexible location for metadata, please put any metadata (especially user-generated metadata) as far back in the file as possible.

Of course, the main advantage is that you don't have to shift all of the data around if the user starts editing file metadata, but it does also help block-level deduplication.


I'll readily concede that de-duplication isn't going to get you much, once you start talking about uniquely-encrypted data.

1MB just sounded much larger than what I've read and seen. At that size, yeah, about the only thing you're going to be de-duplicating is entire files. Interesting that smaller block sizes don't de-duplicate enough in practice to justify it.



