
The practice of releasing only information that describes groups of people is more broadly known as k-anonymity [1]. This comment does a pretty good job of describing its appeal (it makes some intuitive sense, and it's better than doing nothing) and its drawbacks (it's vulnerable to side information, releasing even two k-anonymous outputs can reveal a lot of information, and it can get pretty complicated).
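
To make that concrete, here's a toy sketch of my own (made-up column names and records) of what checking the k-anonymity condition over quasi-identifiers looks like:

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        # Every combination of quasi-identifier values must appear
        # in at least k records (the usual k-anonymity condition).
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    # Hypothetical released table: only a coarsened zip code and an age band.
    released = [
        {"zip": "021**", "age": "30-39"},
        {"zip": "021**", "age": "30-39"},
        {"zip": "021**", "age": "30-39"},
        {"zip": "946**", "age": "40-49"},  # a group of one, so not 3-anonymous
    ]
    print(is_k_anonymous(released, ["zip", "age"], k=3))  # False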

Differential privacy solves these problems because it gives mathematical bounds on the amount of privacy lost. If you take part in two 0.5-differentially private computations, the probability of any event becomes at most e times more or less probable as a result of those computations, since the two epsilons compose to a total of 1. Of course, differential privacy has a lot of work left to do before it's feasible to just insert it into general data analysis, but it's slowly moving along that road.
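
To make the arithmetic concrete, here's a small sketch of my own (toy counting query, made-up numbers) using the standard Laplace mechanism and basic sequential composition:

    import math
    import random

    def laplace_count(true_count, epsilon, sensitivity=1.0):
        # Standard Laplace mechanism: noise with scale sensitivity/epsilon.
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        scale = sensitivity / epsilon
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return true_count + noise

    eps = 0.5
    release_1 = laplace_count(128, eps)  # first 0.5-DP computation
    release_2 = laplace_count(128, eps)  # second 0.5-DP computation

    # Sequential composition: total epsilon is 0.5 + 0.5 = 1.0, so the
    # probability of any event shifts by at most a factor of e^1 ~ 2.718.
    print(release_1, release_2, math.exp(eps + eps))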

[1] https://en.wikipedia.org/wiki/K-anonymity




One of the interesting insights in differential privacy is that to provide privacy protections that can't be reverse-engineered, the process has to be random rather than deterministic. The sort of algorithm that OP describes is really neat, but in addition to what dp_throw says, deterministic algorithms like this that choose how to anonymize things based on private data can reveal information about that private data in the very way that they format the final data. (This may be less relevant in the case at hand, but consider a setting where it would be sensitive to know if someone is in the database at all, e.g., a medical study.)
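
Classic randomized response is maybe the cleanest illustration of that point (toy example of mine, not the parent's): the randomness is injected per respondent, so no amount of inspecting the released answers, or how they're formatted, pins down any individual's true value.

    import math
    import random

    def randomized_response(truth: bool) -> bool:
        # Answer truthfully on heads; otherwise answer with a second,
        # independent coin flip. Every reported answer is deniable.
        if random.random() < 0.5:
            return truth
        return random.random() < 0.5

    # P(report True | truth True) = 3/4 and P(report True | truth False) = 1/4,
    # so one reported answer shifts the odds by at most a factor of 3:
    # ln(3)-differential privacy, roughly epsilon = 1.1.
    print(randomized_response(True), math.log(3))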


Yes, you're absolutely right. In our case, we were effectively doing one massive annual dump of this data into an online tool (vs. multiple releases of k-anonymous data), so we didn't have to worry about the more complicated cases of two separate data releases causing disclosures.

(That, and the regulations that governed our program said to suppress cells with 3 or fewer individuals... so, that's what we had to do.)
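
For anyone curious, that kind of rule usually boils down to something like this sketch (my own illustration; the threshold is from the regulation, the cell counts are made up):

    def suppress_small_cells(table, threshold=3):
        # Primary suppression: blank out any cell whose count is at or
        # below the threshold before the table is published.
        return {cell: (None if count <= threshold else count)
                for cell, count in table.items()}

    counts = {("site A", "outcome X"): 2, ("site A", "outcome Y"): 57}
    print(suppress_small_cells(counts))
    # {('site A', 'outcome X'): None, ('site A', 'outcome Y'): 57}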


Differential privacy, if I understand correctly, has an upper limit on how many analyses can be performed before there is a privacy risk and the data must be destroyed. For official statistics and scientific research, this is often not an acceptable tradeoff.


Releasing any data or statistic based on sensitive data--even once--bears a privacy risk. The primary purpose of differential privacy is to quantify that risk, both for a single release of data and over many releases of data.

As for the number of analyses you can run, that depends on what you mean. You're right that differential privacy won't allow you to set up a database of _confidential data_ that can be arbitrarily queried infinitely many times with any meaningful privacy guarantee, but this is in no way unique to differential privacy.

What you can do with differential privacy is release noisy statistics once and let researchers use those statistics for arbitrarily many analyses. This is what the 2020 US Census is doing, for example.
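
A sketch of that "release once, analyze forever" pattern (my own toy code, not anything the Census Bureau actually runs) relies on the post-processing property: anything computed from the noisy release costs no additional privacy budget.

    import random

    def noisy_histogram(counts, epsilon):
        # Release every disjoint histogram cell once with Laplace(1/epsilon)
        # noise (difference of two exponentials gives the Laplace draw).
        scale = 1.0 / epsilon
        return {k: v + random.expovariate(1 / scale) - random.expovariate(1 / scale)
                for k, v in counts.items()}

    published = noisy_histogram({"age 0-17": 412, "age 18-64": 1903, "age 65+": 377},
                                epsilon=1.0)

    # Researchers can now run as many analyses as they like on `published`.
    total = sum(published.values())
    print({k: v / total for k, v in published.items()})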


Just to add to the nice responses from lomereiter and georgefox, I think the common response to

> For official statistics and scientific research this is often not an acceptable tradeoff

is that differential privacy is the best known method for rigorously accounting for privacy risks. It's possible to argue that differential privacy is too strong (and plenty of people have), but to the best of my knowledge, systems that say "you don't need DP - we'll answer lots of database queries without DP and still prevent deanonymization" usually end up getting broken. A good example of this is the (repeated) breaking of Diffix [1], a system that attempts to provide privacy without using differential privacy.

So differential privacy is, I think, a good starting point if privacy is critical to your application. It does not offer much guidance for when you should decide privacy is critical, or when the utility of an application outweighs the need for privacy.

For example, many social science researchers have criticized the US Census Bureau for using differential privacy in the 2020 census. It's consistent to say "it's way more important to have accurate counts for all of the decisions made using census data -- let's not try too hard to be private". It's also consistent to say "privacy is important, so we should use a rigorous notion like differential privacy". It's not consistent to say "privacy is important, but let's just use some heuristics and hope for the best", which is what the census had largely been doing until 2020.

[1] https://differentialprivacy.org/diffix-attack/


This conundrum can be resolved by generating synthetic datasets that resemble the true data. The definition of differential privacy doesn't distinguish between output types: the output can just as well be a whole dataset as a single number. The algorithms get a lot more complicated, of course, and quantifying the utility of the synthetic data isn't simple either.
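
As a toy version of that idea (mine, not the commenter's), one of the simplest DP synthetic-data schemes perturbs a histogram over the attribute domain and then samples fake records from it; real systems are considerably more sophisticated.

    import random

    def dp_synthetic_records(counts, epsilon, n_records):
        # Add Laplace noise to each histogram cell, clamp negatives, then
        # sample synthetic records from the noisy distribution. The result
        # stays epsilon-DP because sampling is just post-processing.
        scale = 1.0 / epsilon
        noisy = {k: max(0.0, v + random.expovariate(1 / scale)
                                - random.expovariate(1 / scale))
                 for k, v in counts.items()}
        cells, weights = zip(*noisy.items())
        return random.choices(cells, weights=weights, k=n_records)

    true_counts = {("F", "30-39"): 120, ("M", "30-39"): 95, ("F", "40-49"): 80}
    print(dp_synthetic_records(true_counts, epsilon=0.5, n_records=5))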



