
SHA256 pretty much ensures that you have a unique hash for every value - and that's a feature you don't want for anonymization. So why not simply take the first few bytes of a SHA256, a small enough set to ensure that collisions not only might happen but will happen? I mean, that's a required feature to ensure anonymization, not just pseudonymization - if you can select a whole trail of events for ID #123 and be sure that these represent all the events for some (unknown) real user, then that by itself means that those events aren't anonymous, they're pseudonymous.

You can tweak the hash length so that whatever statistics you run on the hashed data remain meaningful (though not exact) despite the collisions, while a dictionary attack over plausible usernames returns an overwhelming number of false positives.
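A minimal sketch of what truncated hashing could look like, assuming Python and a hypothetical helper name (`anonymize`); the byte count and the example addresses are placeholders, not anything from the original comment:

    import hashlib

    def anonymize(user_id: str, n_bytes: int = 2) -> str:
        """Truncate SHA-256 to n_bytes so distinct users can share a bucket.

        With 2 bytes there are only 65,536 possible buckets, so collisions
        are guaranteed once the user base grows past that, which is the
        point: the mapping is deliberately many-to-one.
        """
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        return digest[:n_bytes].hex()

    # Two different users can land on the same truncated hash.
    print(anonymize("alice@example.com"))
    print(anonymize("bob@example.com"))

The choice of `n_bytes` is the tuning knob the comment describes: smaller values mean more collisions (better anonymity, noisier statistics), larger values mean fewer collisions (cleaner statistics, weaker anonymity).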




I'm not sure you can both keep the data statistically meaningful and have so many false positives that an ID can't be de-anonymized. I think you're basically suggesting randomly grouping the IDs so that each grouped ID averages X real IDs. And if you did the grouping randomly instead of by hashing, there would at least be no danger of a dictionary attack.


The expectation is that a brute-force attack would try orders of magnitude more IDs than you actually have. That means that if a random ID is 90% likely to have a unique hash and 10% likely to map to one of your real IDs, your real data won't have that many collisions. However, if someone does a brute-force check of (for example) a million email addresses, they'll get about 100,000 positive responses, the vast majority of which will be false positives.
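A rough simulation of that arithmetic, assuming 2-byte (16-bit) truncated SHA-256 hashes and entirely made-up IDs and dataset sizes (none of these numbers come from the thread beyond the "million addresses, ~100,000 hits" example):

    import hashlib

    N_BUCKETS = 2 ** 16  # 2-byte truncated hash -> 65,536 possible values

    def bucket(s: str) -> int:
        """Map a string to its truncated-hash bucket."""
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:2], "big")

    # Pretend dataset: ~7,000 real users occupy roughly 10% of the buckets.
    real_ids = {f"user{i}@example.com" for i in range(7_000)}
    occupied = {bucket(u) for u in real_ids}
    print(f"occupied buckets: {len(occupied)} / {N_BUCKETS}")  # ~6,600, about 10%

    # Attacker's dictionary of 1,000,000 candidate addresses, almost none real.
    hits = sum(
        1
        for i in range(1_000_000)
        if bucket(f"guess{i}@dictionary.example") in occupied
    )
    print(f"positive responses: {hits}")  # on the order of 100,000, mostly false

Each candidate has roughly a 10% chance of landing in an occupied bucket, so the attacker's hit list is dominated by false positives and tells them little about who is actually in the data.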


That's a reasonable point, but it doesn't explain why you're using hashes instead of random groupings in the first place.



