
It's an easy-to-state, largely foolproof test of whether data really is anonymized.

The worry with a poorly anonymized dataset is that someone holding another, non-anonymized dataset can combine the two to deduce the original information. "Your dataset must not be combinable with any other dataset in a way that allows the original data to be inferred" is a hard requirement. How could you possibly test against them all?

Well, it turns out there is one non-anonymized dataset with a useful property: if you can't connect your anonymized data to it at all, you can be pretty sure you couldn't connect it to any other dataset either -- the original data!
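
To make that concrete, here's a minimal sketch of the linkage test in Python (the records and field names are hypothetical, not from any real release): a release fails the test if any of its records matches exactly one record in the original dataset on the attributes that were kept.

    # Minimal sketch of the linkage test described above; records and
    # field names are made up for illustration.
    original = [
        {"name": "Alice", "zip": "94110", "age": 34, "diagnosis": "flu"},
        {"name": "Bob",   "zip": "94110", "age": 34, "diagnosis": "asthma"},
        {"name": "Carol", "zip": "10001", "age": 52, "diagnosis": "flu"},
    ]

    # The "anonymized" release keeps only zip and age.
    released = [{"zip": r["zip"], "age": r["age"]} for r in original]

    def reidentifiable(released_records, original_records, keys):
        """Released records that link back to exactly one original record."""
        hits = []
        for rec in released_records:
            matches = [o for o in original_records
                       if all(o[k] == rec[k] for k in keys)]
            if len(matches) == 1:  # a unique match means re-identification
                hits.append((rec, matches[0]["name"]))
        return hits

    print(reidentifiable(released, original, keys=("zip", "age")))
    # [({'zip': '10001', 'age': 52}, 'Carol')] -- Carol links right back

Alice and Bob share the same zip and age, so their released records match two originals and stay ambiguous; Carol's combination is unique, so her record re-identifies her.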




Let's say you're doing a study of fingerprint patterns. You anonymize a collection of fingerprints from a non-anonymized source by stripping everything but the fingerprint images. Because fingerprints are unique, it seems impossible to meet the GDPR criteria: even if the only thing left is the fingerprint images, comparing them against the source dataset will identify them. a) Is this interpretation accurate? b) If so, it seems there are large swaths of data that can never be in compliance. What are the implications for medical research, for instance?
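
The uniqueness point is easy to demonstrate with a hypothetical sketch (the hex strings stand in for fingerprint templates): when the kept attribute is unique per person, an exact-match lookup against the source re-identifies every single record.

    # Hypothetical: fingerprint templates reduced to unique identifiers.
    original = [("Alice", "a3f1e9"), ("Bob", "9c2e41"), ("Carol", "77b0d2")]
    released = [fp for _, fp in original]  # "anonymized": only templates kept

    lookup = {fp: name for name, fp in original}  # index the source dataset
    print([(fp, lookup[fp]) for fp in released if fp in lookup])
    # every record links back: [('a3f1e9', 'Alice'), ('9c2e41', 'Bob'), ...]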


I think you nailed it: some data can't really be anonymized. How could you anonymize emails, names, Social Security numbers, or DNA samples?

You don't have to use anonymized data all the time; it's just that the requirements for handling and passing around such data are lower.


I don't understand the point, though: if someone already has the source data, what good is the anonymized data to them? And what value is added by requiring more stringent safeguards on data that can't be anonymized this way?



