
The practice of releasing only information that describes groups of people is more broadly known as k-anonymity [1]. This comment does a pretty good job of describing its appeal (it makes some intuitive sense, and it's better than doing nothing) and its drawbacks (it's vulnerable to side information, releasing even two k-anonymous outputs can reveal a lot of information, and it can get pretty complicated).
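
To make that concrete, here's a toy sketch of my own (made-up column names and records) of what checking the k-anonymity condition over quasi-identifiers looks like:

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        # Every combination of quasi-identifier values must appear
        # in at least k records (the usual k-anonymity condition).
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    # Hypothetical released table: only a coarsened zip code and an age band.
    released = [
        {"zip": "021**", "age": "30-39"},
        {"zip": "021**", "age": "30-39"},
        {"zip": "021**", "age": "30-39"},
        {"zip": "946**", "age": "40-49"},  # a group of one, so not 3-anonymous
    ]
    print(is_k_anonymous(released, ["zip", "age"], k=3))  # False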

Differential privacy solves these problems because it gives mathematical bounds on the amount of privacy lost. If you take part in two 0.5-differentially private computations, the probability of any event becomes at most e times more or less probable as a result of those computations, since the two epsilons compose to a total of 1. Of course, differential privacy has a lot of work left to do before it's feasible to just insert it into general data analysis, but it's slowly moving along that road.
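
To make the arithmetic concrete, here's a small sketch of my own (toy counting query, made-up numbers) using the standard Laplace mechanism and basic sequential composition:

    import math
    import random

    def laplace_count(true_count, epsilon, sensitivity=1.0):
        # Standard Laplace mechanism: noise with scale sensitivity/epsilon.
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        scale = sensitivity / epsilon
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return true_count + noise

    eps = 0.5
    release_1 = laplace_count(128, eps)  # first 0.5-DP computation
    release_2 = laplace_count(128, eps)  # second 0.5-DP computation

    # Sequential composition: total epsilon is 0.5 + 0.5 = 1.0, so the
    # probability of any event shifts by at most a factor of e^1 ~ 2.718.
    print(release_1, release_2, math.exp(eps + eps))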

[1] https://en.wikipedia.org/wiki/K-anonymity




One of the interesting insights in differential privacy is that to provide privacy protections that can't be reverse-engineered, the process has to be random rather than deterministic. The sort of algorithm that OP describes is really neat, but in addition to what dp_throw says, deterministic algorithms like this that choose how to anonymize things based on private data can reveal information about that private data in the very way that they format the final data. (This may be less relevant in the case at hand, but consider a setting where it would be sensitive to know if someone is in the database at all, e.g., a medical study.)
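
Classic randomized response is maybe the cleanest illustration of that point (toy example of mine, not the parent's): the randomness is injected per respondent, so no amount of inspecting the released answers, or how they're formatted, pins down any individual's true value.

    import math
    import random

    def randomized_response(truth: bool) -> bool:
        # Answer truthfully on heads; otherwise answer with a second,
        # independent coin flip. Every reported answer is deniable.
        if random.random() < 0.5:
            return truth
        return random.random() < 0.5

    # P(report True | truth True) = 3/4 and P(report True | truth False) = 1/4,
    # so one reported answer shifts the odds by at most a factor of 3:
    # ln(3)-differential privacy, roughly epsilon = 1.1.
    print(randomized_response(True), math.log(3))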


Yes, you're absolutely right. In our case, we were effectively doing one massive annual dump of this data into an online tool (vs. multiple releases of k-anonymous data), so we didn't have to worry about the more complicated cases of two separate data releases causing disclosures.

(That, and the regulations that governed our program said to suppress cells with 3 or fewer individuals... so, that's what we had to do.)
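
For anyone curious, that kind of rule usually boils down to something like this sketch (my own illustration; the threshold is from the regulation, the cell counts are made up):

    def suppress_small_cells(table, threshold=3):
        # Primary suppression: blank out any cell whose count is at or
        # below the threshold before the table is published.
        return {cell: (None if count <= threshold else count)
                for cell, count in table.items()}

    counts = {("site A", "outcome X"): 2, ("site A", "outcome Y"): 57}
    print(suppress_small_cells(counts))
    # {('site A', 'outcome X'): None, ('site A', 'outcome Y'): 57}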


Differential privacy, if I understand correctly, has an upper limit on how many analyses can be performed before there is a privacy risk and the data must be destroyed. For official statistics and scientific research, this is often not an acceptable tradeoff.


Releasing any data or statistic based on sensitive data--even once--bears a privacy risk. The primary purpose of differential privacy is to quantify that risk, both for a single release of data and over many releases of data.

As for the number of analyses you can run, that depends on what you mean. You're right that differential privacy won't allow you to set up a database of _confidential data_ that can be arbitrarily queried infinitely many times with any meaningful privacy guarantee, but this is in no way unique to differential privacy.

What you can do with differential privacy is release noisy statistics once and let researchers use those statistics for arbitrarily many analyses. This is what the 2020 US Census is doing, for example.
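
A sketch of that "release once, analyze forever" pattern (my own toy code, not anything the Census Bureau actually runs) relies on the post-processing property: anything computed from the noisy release costs no additional privacy budget.

    import random

    def noisy_histogram(counts, epsilon):
        # Release every disjoint histogram cell once with Laplace(1/epsilon)
        # noise (difference of two exponentials gives the Laplace draw).
        scale = 1.0 / epsilon
        return {k: v + random.expovariate(1 / scale) - random.expovariate(1 / scale)
                for k, v in counts.items()}

    published = noisy_histogram({"age 0-17": 412, "age 18-64": 1903, "age 65+": 377},
                                epsilon=1.0)

    # Researchers can now run as many analyses as they like on `published`.
    total = sum(published.values())
    print({k: v / total for k, v in published.items()})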


Just to add to the nice responses from lomereiter and georgefox, I think the common response to

> For official statistics and scientific research this is often not an acceptable tradeoff

is that differential privacy is the best known method for rigorously accounting for privacy risks. It's possible to argue that differential privacy is too strong (and plenty of people have), but to the best of my knowledge, systems that say "you don't need DP - we'll answer lots of database queries without DP and still prevent deanonymization" usually end up getting broken. A good example of this is the (repeated) breaking of Diffix [1], a system that attempts to provide privacy without using differential privacy.

So differential privacy is, I think, a good starting point if privacy is critical to your application. It does not offer much guidance for when you should decide privacy is critical, or when the utility of an application outweighs the need for privacy.

For example, many social science researchers have criticized the US Census Bureau for using differential privacy in the 2020 census. It's consistent to say "it's way more important to have accurate counts for all of the decisions made using census data -- let's not try too hard to be private". It's also consistent to say "privacy is important, so we should use a rigorous notion like differential privacy". It's not consistent to say "privacy is important, but let's just use some heuristics and hope for the best", which is what the census had largely been doing until 2020.

[1] https://differentialprivacy.org/diffix-attack/


This conundrum can be resolved by generating synthetic datasets that resemble the true data. The definition of differential privacy doesn't distinguish between output types: the output can just as well be a whole dataset as a single number. The algorithms get a lot more complicated, of course, and quantifying the utility of the synthetic data isn't simple either.
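
As a toy version of that idea (mine, not the commenter's), one of the simplest DP synthetic-data schemes perturbs a histogram over the attribute domain and then samples fake records from it; real systems are considerably more sophisticated.

    import random

    def dp_synthetic_records(counts, epsilon, n_records):
        # Add Laplace noise to each histogram cell, clamp negatives, then
        # sample synthetic records from the noisy distribution. The result
        # stays epsilon-DP because sampling is just post-processing.
        scale = 1.0 / epsilon
        noisy = {k: max(0.0, v + random.expovariate(1 / scale)
                                - random.expovariate(1 / scale))
                 for k, v in counts.items()}
        cells, weights = zip(*noisy.items())
        return random.choices(cells, weights=weights, k=n_records)

    true_counts = {("F", "30-39"): 120, ("M", "30-39"): 95, ("F", "40-49"): 80}
    print(dp_synthetic_records(true_counts, epsilon=0.5, n_records=5))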



