They talked a bit more about differential privacy in the State of the Union. Basically, they hash the data and add noise. By collecting data from a bunch of people, that noise gets averaged out. They also limit the number of samples (over a relatively short period of time) they can get from a single person so they won't be able to identify them.
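Apple hasn't published the exact mechanism, but a toy "randomized response" sketch (the classic local differential privacy trick; all the numbers and names below are just illustrative) shows how per-user noise can still give an accurate aggregate:

```python
import random

def randomized_response(truth: bool, p_truth: float = 0.75) -> bool:
    # Report the true bit with probability p_truth, otherwise a fair coin flip.
    if random.random() < p_truth:
        return truth
    return random.random() < 0.5

def estimate_rate(reports, p_truth: float = 0.75) -> float:
    # Invert the noise: E[report] = p_truth * rate + (1 - p_truth) * 0.5
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

# 100,000 simulated users, 30% of whom truly have the property being measured.
reports = [randomized_response(random.random() < 0.3) for _ in range(100_000)]
print(estimate_rate(reports))  # ~0.30, even though any single report is unreliable
```

Any individual report is deniable, but the population estimate converges as more people contribute.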
State of the Union is what the WWDC keynote used to be before the keynote started being watched by the press and the public. Much more technical detail, and information on the underlying frameworks rather than user-visible features.
You can find all the videos from WWDC 2016 some time after the session is done. I usually check the next day. They have the videos for several previous WWDCs up as well.
At some point, regardless of adding noise, you're definitely losing your privacy. I'd be happy with an "opt-out" feature that I know works (which, as far as I can see, is only possible if it's open-source). I didn't watch WWDC; perhaps they mentioned this.
I agree opt-out is definitely something that should be deployed alongside differential privacy, but what makes you so sure that it doesn't work "at some point"? If the noise means a specific query against one user's information has a significant chance of being wrong, how does this not equate to privacy? You can add a lot more noise than you might imagine if you know the kind of analysis you'll be doing with the data; for example, a lot of statistical techniques are constructed to be mostly immune to Gaussian noise, since it's very common with some kinds of data.
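To make that concrete, here's a minimal sketch (made-up numbers, nothing Apple-specific) of heavy zero-mean Gaussian noise destroying individual values while the population average survives:

```python
import random

# Hypothetical per-user metric on a 0-10 scale, with heavy zero-mean noise added per user.
true_values = [random.uniform(0, 10) for _ in range(50_000)]
noisy_values = [v + random.gauss(0, 25) for v in true_values]

# A single noisy value tells you almost nothing about that user...
print(true_values[0], noisy_values[0])
# ...but the aggregate survives, because the noise averages out to zero.
print(sum(true_values) / len(true_values), sum(noisy_values) / len(noisy_values))
```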
The whole point of collecting the data is to predict the actions or information needs of individual users. That in itself is a privacy issue.
If a recommender system for iTunes can predict the likelihood of me appreciating movies that contain violence against women, that information could be subpoenaed when I am falsely accused of having strangled my girlfriend.
I appreciate that Apple is trying to protect our privacy where they can. But if we want them to make predictions about our behavior, we have to be aware of the fact that we are necessarily giving up some privacy.
You're misunderstanding where this is to be used. It is specifically not for things like iTunes suggestions, where it would be useless. It's for situations where they want to get aggregated metrics without collecting identifiable information. The obfuscation can be performed by the client, so they never have a server-side database with data that is accurate at the individual-user level.
I don't think I am misunderstanding (although I'm not completely sure about that). My point isn't about iTunes. My point is about the purpose of data collection. If that purpose is predicting our actions, then that in itself is a privacy issue.
I understand that the database Apple wants to build does not contain accurate information about individual users. But if that database allows them to make predictions of our behavior, then there is a privacy issue. If the purpose is not prediction, then what is it?
It could be a number of things, but one possibility is identifying broad correlations between metrics. Since you can't trust the accuracy of the individual metrics, you will have a limited ability to apply the correlation to individual users, but if you use the right kind of noise, aggregated conditional probabilities may survive.
So Apple can (for example) predict that listening to band A means you are likely to like band C, and then send a list of correlations to your device so the predictions can be made there by examining your library locally. A more probable use is analytics for marketing purposes. Another is selling just these correlations and other aggregate statistics to other parties; this is actually how Mint makes money.
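Here's a rough sketch of how an aggregate conditional probability can survive the noise, assuming a simple per-bit flip as the noise mechanism (the bands, rates, and flip probability are all made up):

```python
import random

f = 0.25  # per-bit flip probability (a hypothetical privacy parameter)

def flip(bit: bool) -> bool:
    # Report the opposite of the true bit with probability f.
    return (not bit) if random.random() < f else bit

# Simulated population: 40% listen to band A; 70% of those like band C, vs. 20% otherwise.
reports = []
for _ in range(200_000):
    a = random.random() < 0.4
    c = random.random() < (0.7 if a else 0.2)
    reports.append((flip(a), flip(c)))

n = len(reports)
ea_noisy = sum(a for a, _ in reports) / n         # observed rate of "listens to A"
ec_noisy = sum(c for _, c in reports) / n         # observed rate of "likes C"
eac_noisy = sum(a and c for a, c in reports) / n  # observed rate of both

# Undo the noise: E[x'] = f + (1-2f)E[x], and
# E[x'y'] = f^2 + f(1-2f)(E[x]+E[y]) + (1-2f)^2 E[xy]
ea = (ea_noisy - f) / (1 - 2 * f)
ec = (ec_noisy - f) / (1 - 2 * f)
eac = (eac_noisy - f * f - f * (1 - 2 * f) * (ea + ec)) / (1 - 2 * f) ** 2

print("P(likes C | listens to A) ~", eac / ea)  # recovers ~0.70 from noisy reports
```

The estimate of whether any particular user listens to band A stays unreliable, but the population-level conditional probability comes out accurately.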
I used "you" incorrectly, my bad. They can predict that people who listen to band A are likely to like band C, but their data for whether you listen to band A still has a significant chance of being wrong.
Yes, the data has a significant chance of being wrong. But it is useful only insofar as it supports a prediction with a probability of being right that is greater than 0.5.
That's what makes the data useful and that's what makes it a privacy issue at the same time.
It doesn't have to support that prediction in specific instances, just in a general trend, where random noise tends to average itself out in a lot of cases. There are lots of different distributions with the same averages, the same conditional probabilities, etc., with wildly different data. If you have a mathematical proof that says you cannot reach one of these other distributions by injecting random noise to mask individual contributions, then please write a paper on it! But to my knowledge, Cynthia Dwork's work and others' still stands. There is definitely no simple, common-sense reason that it doesn't work.
How does sending the same list of conditional probabilities for liking pairs of bands to everyone's device and then having the device pick out the ones actually pertinent to your library compromise your privacy?
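Concretely, the on-device step could look something like this (purely illustrative data, not Apple's actual format):

```python
# Hypothetical aggregate correlations shipped identically to every device;
# nothing in this table identifies a user.
correlations = {
    ("Band A", "Band C"): 0.72,
    ("Band A", "Band D"): 0.31,
    ("Band B", "Band E"): 0.65,
}

local_library = {"Band A", "Band B"}  # this set never leaves the device

# The matching happens on-device; the server only ever sees the aggregate table above.
suggestions = sorted(
    ((candidate, p) for (owned, candidate), p in correlations.items()
     if owned in local_library and candidate not in local_library and p > 0.5),
    key=lambda pair: -pair[1],
)
print(suggestions)  # [('Band C', 0.72), ('Band E', 0.65)]
```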
I don't doubt the validity of Dwork's work. I think we're talking past each other.
What I'm saying is that if Apple keeps data on its servers that is sufficient to predict some of my actions or likes with any accuracy greater than 50%, then that is a privacy concern.
But if you're saying that the data in Apple's database does not have any predictive power on its own, then I agree that it is not a privacy concern.
In that case, my device would have to download some of Apple's data and combine it with data that resides only on my device in order to make a prediction locally on my device.
They even limit the number of samples they get from a specific person so they can't filter out the noise for that person and get their individual response.
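A quick sketch of why that sample limit matters: with randomized response (illustrative parameters again), one report is deniable, but a thousand reports from the same person pin down their true answer:

```python
import random

def randomized_response(truth: bool, p_truth: float = 0.75) -> bool:
    # Tell the truth with probability p_truth, otherwise flip a fair coin.
    return truth if random.random() < p_truth else random.random() < 0.5

secret = True
print(randomized_response(secret))            # one report: plausibly deniable
reports = [randomized_response(secret) for _ in range(1000)]
print(sum(reports) / len(reports))            # ~0.875 vs ~0.125, so the secret leaks
```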
But keep in mind that Apple will have records of all your iTunes rentals and purchases, at least for billing purposes. However, at least in the US there's a law about keeping that data private (the Video Privacy Protection Act, passed because of Robert Bork).