Hacker News

You are missing the point of differential privacy. This is an oversimplified explanation, but I see it like this (this is also how my professor and his PhD assistant at my university explained it to us).

Differential privacy provides some simple mathematical foundations for sharing any database data whatsoever without revealing the data about any specific person. An example could be when you store someone's name, birthday, and illness. Differential privacy, in a simplified way, says: a name is a direct link to a person, so remove it. A birthday could potentially be used to link to a person, but not directly, so replace it with a range, e.g. an age between 20 and 30 instead of the specific birthday. The illness is the data someone else wants, so that stays. Now someone else can get information from your database without getting to any specific user or person. (There are a lot of other things that can be done, such as adding random numbers to the result when you ask for, say, an average age.)
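The drop-the-name, bucket-the-birthday step described above can be sketched in a few lines. This is a hypothetical illustration (the `generalize` function, field names, and "Alice" record are all made up for the example, and as other replies note, this is really generalization/anonymization rather than DP proper):

```python
from datetime import date

def generalize(record):
    """Drop the direct identifier (name) and coarsen the
    quasi-identifier (birthday) into a ten-year age bracket,
    keeping only the attribute of interest (illness)."""
    age = date.today().year - record["birthday"].year
    low = (age // 10) * 10
    return {"age_range": f"{low}-{low + 9}", "illness": record["illness"]}

row = {"name": "Alice", "birthday": date(1998, 3, 14), "illness": "flu"}
print(generalize(row))
```

The returned record contains only the age bracket and the illness; the name is gone entirely.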

Where this whole thing starts to break down is when it is applied to real situations. Sure, everything mathematically shows that you cannot get to a specific user. But when you already have a large amount of data, or when there are multiple of these databases, you can quite easily combine them to find specific users or people in the data again. And these types of attacks are already happening, with people combining large data breaches to find username, email, and password combinations, for example. This way they can find out whether you have a pattern in your passwords, such as a base password plus a specific extra bit at the end.




As the other poster mentioned, this sounds much more like non-DP anonymization, which (as you note) is usually surprisingly vulnerable to deanonymization through various approaches.

With Differential Privacy, you instead add randomness such that you can't tell whether the answer you got includes any individual person, for whatever question you're asking.
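That "add randomness to the answer" idea is usually done with the Laplace mechanism. A minimal sketch, assuming a simple counting query (a count has sensitivity 1, since one person changes it by at most 1; the function name is mine, not from any particular library):

```python
import random

def noisy_count(true_count, epsilon):
    """Laplace mechanism for a counting query: add noise drawn from
    Laplace(0, 1/epsilon). A Laplace variate can be sampled as the
    difference of two independent exponential variates with rate epsilon."""
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

With `epsilon = 0.1`, `noisy_count(1000, 0.1)` comes back somewhere around 1000 give or take a few tens, so no single answer tells you whether any one individual was counted.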

IIUC, RAPPOR adds that randomness to the original data; Leap Year (where I worked for a while) adds it to the answers to specific queries. There are huge tradeoffs, and they're suitable for very different settings. I am not sure which approach is taken here.
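The randomize-at-the-source model that RAPPOR builds on is classic randomized response. A hedged sketch (the function names and the 0.75 truth probability are illustrative choices, not RAPPOR's actual parameters):

```python
import random

def randomized_response(true_bit, p_truth=0.75):
    """Each user flips their own bit before reporting it: tell the
    truth with probability p_truth, lie otherwise. With p_truth = 0.75
    this gives epsilon = ln(0.75 / 0.25) = ln 3 per report."""
    return true_bit if random.random() < p_truth else 1 - true_bit

def estimate_true_rate(reports, p_truth=0.75):
    """The aggregator never sees raw bits, but can invert the known
    randomization: observed = (2p - 1) * true_rate + (1 - p)."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)
```

The key tradeoff mentioned above shows up directly: no one ever has to be trusted with the raw data, but you need many more reports to get the same accuracy as query-time noise on exact data.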

Edited to add:

Skimming the docs, it seems to be the latter: you ask questions of the exact data and get back noisy answers. This requires ongoing trust in the entity holding the data (so it's most applicable to circumstances where they'd have that data regardless), but is much more flexible.


My understanding is that what you describe is closer to the state of the art before DP. I believe the thing about DP is that it allows you to measure information leakage, even, IIRC, in the face of other data being disclosed.
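That "measure the leakage" property comes from basic sequential composition: the epsilons of successive queries add up, so you can run an explicit privacy budget. A toy sketch of that accounting (the function and numbers are mine, for illustration only):

```python
def remaining_budget(total_epsilon, query_epsilons):
    """Sequential composition: answering queries with epsilons
    e1, e2, ... costs e1 + e2 + ... from the total budget."""
    spent = sum(query_epsilons)
    if spent > total_epsilon:
        raise ValueError("privacy budget exhausted")
    return total_epsilon - spent

# Three queries at eps 0.1, 0.25, 0.25 leave 0.4 of a budget of 1.0.
print(remaining_budget(1.0, [0.1, 0.25, 0.25]))
```

This is what lets a data holder bound total leakage even when an attacker holds side information: the guarantee degrades gracefully and quantifiably rather than failing outright.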



