I'm a bit confused about how this article fits in with the previously posted article about the company (http://news.ycombinator.com/item?id=1817631). There, they are harvesting PII (e-mail addresses) and re-selling access to related information. In this article, they explain a machine-learning-type algorithm that anonymizes data so that it contains no PII. I don't really see how the two fit together. What's their deal?
Personally Identifying Information (PII) is defined as data which can uniquely identify an individual--name, social security number, Facebook ID. Rapleaf's personalization service--which is designed to let websites personalize based on who is viewing the website--does not serve PII. Instead, it serves targeting data like "Age 40-50" and "Interests Basketball".
The idea behind the Anonymouse project is that a person should not be personally identifiable from the targeting information served about them. Only sets of targeting data which cannot be traced back to a unique individual are stored in the personalization cookie.
For example, it would be okay to serve a targeting cookie about a person which contained "Male, Wealthy, looking to buy a Ferrari", because thousands of people fit that description, and the ad network or website cannot identify the person with any reasonable specificity.
It would not be okay, however, to target a person as "52 year old Male, makes $256,000, lives in Sometown OH, and was born on April 12th", because in all likelihood only one or two people fit that description. The data would effectively serve as PII, which would be no better than dropping a Facebook ID in the cookie.
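To make the distinction concrete, here is a minimal sketch of the check described above. It is not Rapleaf's actual code: the function names, the toy population, and the threshold `k` are all assumptions made up for illustration. The idea is simply to count how many people match a candidate set of targeting attributes and to serve that set only if the count is at least `k`.

```python
from typing import Dict, List

Profile = Dict[str, str]


def matching_count(population: List[Profile], segment: Profile) -> int:
    """Count the people whose profile matches every attribute in the segment."""
    return sum(
        all(person.get(key) == value for key, value in segment.items())
        for person in population
    )


def safe_to_serve(population: List[Profile], segment: Profile, k: int) -> bool:
    """A segment is servable only if at least k people fit it (hypothetical rule)."""
    return matching_count(population, segment) >= k


# Toy data, invented for this example.
population = [
    {"gender": "M", "age_band": "40-50", "city": "Sometown OH"},
    {"gender": "M", "age_band": "40-50", "city": "Othertown OH"},
    {"gender": "M", "age_band": "40-50", "city": "Othertown OH"},
    {"gender": "F", "age_band": "40-50", "city": "Othertown OH"},
]

broad = {"gender": "M", "age_band": "40-50"}
narrow = {"gender": "M", "age_band": "40-50", "city": "Sometown OH"}

print(safe_to_serve(population, broad, k=3))   # three people match
print(safe_to_serve(population, narrow, k=3))  # only one person matches
```

With a real population the threshold would be much larger than 3, and the hard part is choosing which attributes to drop when a combination is too specific; this sketch only covers the yes/no check.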
Let me know if anything's unclear, or if you want more details. From a CS perspective, it's a really cool/hard problem, and something we've spent a lot of time on. We're planning on writing an update blog post on where we are... as soon as we're done coding : )
A central concept in the article is k-anonymity, which I only learned about maybe a year or two ago. From what I can find, though, it was introduced in a conference article in 1998 and published in 2002 (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163...), which I found rather surprising. It's a shame that knowledge of this concept seems to be mostly restricted to CS circles. Maybe the rather steep technical barrier is preventing the uptake of better database privacy protection. The research seems to be published mostly in technical venues (a non-scientific metric: about half of the URLs in the results of googling for 'k-anonymity' have 'cs' and 'edu' in the domain name), and there seems to be no literature on anonymizing datasets aimed at people with only basic schooling in statistics and database theory.
Of course there's also the issue that the incentives for collecting and storing sufficiently anonymous data are not aligned with (and are sometimes in direct competition with) the goals for which the data is collected. This company seems to be making great strides in this field with this research; I hope they keep pushing the edge and publishing their results.
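For readers outside those circles, the core definition is short: a table is k-anonymous if every record shares its combination of quasi-identifying attributes with at least k-1 other records. The snippet below is a rough sketch of measuring that for a small table. The function name `anonymity_level`, the column names, and the rows are made up for illustration and are not taken from the article.

```python
from collections import Counter
from typing import Dict, List, Sequence


def anonymity_level(rows: List[Dict[str, str]], quasi_identifiers: Sequence[str]) -> int:
    """Return the largest k for which the rows are k-anonymous.

    Rows are grouped by their values on the quasi-identifier columns;
    the smallest group size is the k the table actually achieves.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Invented rows with already-generalized values (age bands, truncated ZIP codes).
rows = [
    {"age": "40-50", "zip": "441**", "interest": "Basketball"},
    {"age": "40-50", "zip": "441**", "interest": "Golf"},
    {"age": "30-40", "zip": "432**", "interest": "Cars"},
    {"age": "30-40", "zip": "432**", "interest": "Cars"},
    {"age": "30-40", "zip": "432**", "interest": "Golf"},
]

print(anonymity_level(rows, ["age", "zip"]))              # smallest group has 2 rows
print(anonymity_level(rows, ["age", "zip", "interest"]))  # one combination is unique
```

Adding more columns to the quasi-identifier set can only shrink the groups, which is why each extra attribute stored about a person makes anonymity harder to keep.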
Technically, this is really interesting, but what's the business justification for the engineering work?
I'm skeptical that these efforts will protect Rapleaf or any other company from public relations disasters or class-action lawsuits or harmful regulation, because the solution is too complex for reporters, lawyers, and politicians to understand. (The cynic in me thinks that even if they did understand it, that's not going to get in the way of a good story / lawsuit / feel-good cause for the public.)
While I'm sympathetic, declaring that your dataset is 16-anonymous due to cluster-based suppression isn't going to persuade anyone who's already decided that behavioral targeting is the devil.
I assume that this was just a feel-good article written in response to their name appearing in the WSJ recently. The project itself has been around a few months, at least publicly. The engineering described isn't particularly novel, but it is a high-level overview of a large-scale project.
We wrote this blog post back at the end of July about the project, not in response to the WSJ article. There have been a few posts since then discussing other aspects of the project, and there will be an update coming quite soon.