
SHA256 pretty much ensures that you have a unique hash for every value - and that's a feature you don't want for anonymization. So why not simply take the first few bytes of a SHA256, a small enough set to ensure that collisions not only might happen but will happen? I mean, that's a required feature to ensure anonymization, not just pseudonymization - if you can select a whole trail of events for ID #123 and be sure that these represent all the events for some (unknown) real user, then that by itself means that those events aren't anonymous, they're pseudonymous.

You can tweak the hash length so that whatever statistics you run on the hashed data remain meaningful (though not exact) despite the collisions, while a dictionary attack over plausible usernames returns an overwhelming number of false positives.
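A minimal sketch of what truncated hashing could look like, assuming Python and a hypothetical helper name (`anonymize`); the byte count and the example addresses are placeholders, not anything from the original comment:

    import hashlib

    def anonymize(user_id: str, n_bytes: int = 2) -> str:
        """Truncate SHA-256 to n_bytes so distinct users can share a bucket.

        With 2 bytes there are only 65,536 possible buckets, so collisions
        are guaranteed once the user base grows past that, which is the
        point: the mapping is deliberately many-to-one.
        """
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        return digest[:n_bytes].hex()

    # Two different users can land on the same truncated hash.
    print(anonymize("alice@example.com"))
    print(anonymize("bob@example.com"))

The choice of `n_bytes` is the tuning knob the comment describes: smaller values mean more collisions (better anonymity, noisier statistics), larger values mean fewer collisions (cleaner statistics, weaker anonymity).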




I'm not sure you can both keep the data statistically meaningful and have so many false positives that an ID can't be de-anonymized. I think you're basically suggesting randomly grouping the IDs so that each grouped ID averages X real IDs. And if you did the grouping randomly instead of by hashing, there would at least be no danger of a dictionary attack.


The expectation is that a brute-force attack would try orders of magnitude more IDs than you actually have. That means that if a random ID is 90% likely to have a unique hash and 10% likely to map to one of your real IDs, your real data won't have that many collisions. However, if someone does a brute-force check of (for example) a million email addresses, they'll get about 100,000 positive responses, the vast majority of which will be false positives.
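A rough simulation of that arithmetic, assuming 2-byte (16-bit) truncated SHA-256 hashes and entirely made-up IDs and dataset sizes (none of these numbers come from the thread beyond the "million addresses, ~100,000 hits" example):

    import hashlib

    N_BUCKETS = 2 ** 16  # 2-byte truncated hash -> 65,536 possible values

    def bucket(s: str) -> int:
        """Map a string to its truncated-hash bucket."""
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:2], "big")

    # Pretend dataset: ~7,000 real users occupy roughly 10% of the buckets.
    real_ids = {f"user{i}@example.com" for i in range(7_000)}
    occupied = {bucket(u) for u in real_ids}
    print(f"occupied buckets: {len(occupied)} / {N_BUCKETS}")  # ~6,600, about 10%

    # Attacker's dictionary of 1,000,000 candidate addresses, almost none real.
    hits = sum(
        1
        for i in range(1_000_000)
        if bucket(f"guess{i}@dictionary.example") in occupied
    )
    print(f"positive responses: {hits}")  # on the order of 100,000, mostly false

Each candidate has roughly a 10% chance of landing in an occupied bucket, so the attacker's hit list is dominated by false positives and tells them little about who is actually in the data.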


That's a reasonable point, but it doesn't explain why you're using hashes instead of random groupings in the first place.



