While this is intended as just a proof of concept, I think this technique is mathematically flawed. You can't just use the average number of people in each zipcode, because if there is great variation in the number of people per zipcode, a randomly sampled person is more likely to be in a "large" zipcode than a small one.
Consider the case where we have 100,000 people and 10 zipcodes, for an average of 10,000 people per zipcode. Now suppose zipcodes 0-8 each have only one person, and zipcode 9 has the remaining 99,991 people. What percentage of this population can be reliably identified by sex, zipcode, and full birthdate?
I don't know the exact answer, but I'm pretty sure it's much closer to 0.01% than to 87%. The 9 people in the 9 tiny zipcodes are reliably identifiable, while most of the 99,991 people in the large 10th zipcode share their sex and full birthdate with someone else there and so can't be singled out. While the actual population distribution by zipcode is surely not this extreme, I think this example shows that the question cannot be answered unless a specific population distribution is known (or assumed).
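Under some simplifying assumptions (sex and birthdate uniform within each zip, and an age span of about 79 years; both are my assumptions, not anything from the article), you can at least sketch how much the assumed distribution matters:

    # Expected fraction of people who are unique on (sex, full birthdate)
    # within their zipcode, for an assumed list of zipcode populations.
    # Assumes sex and birthdate are uniform within each zip and an age
    # span of about 79 years; these are assumptions, not Cook's or
    # Sweeney's exact setup.
    D = 2 * 365 * 79  # possible (sex, birthdate) combinations, ~57,670

    def expected_unique_fraction(zip_populations, d=D):
        total = sum(zip_populations)
        # A person in a zip of n people is unique with probability (1 - 1/d)^(n - 1).
        uniques = sum(n * (1 - 1 / d) ** (n - 1) for n in zip_populations)
        return uniques / total

    # Ten zips of 10,000 each (what using the average amounts to): about 0.84.
    print(expected_unique_fraction([10_000] * 10))

    # The skewed toy example above (nine zips of 1, one zip of 99,991): about 0.18.
    print(expected_unique_fraction([1] * 9 + [99_991]))

The point isn't the exact numbers, just that the same average gives very different answers depending on how people are spread across zipcodes.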
Actual distribution of population can be seen here: http://proximityone.com/zip-place.htm (click on Population twice to sort by most populous). At a glance, there are quite a few zipcodes with far more than 10,000 people, making me think this is going to skew the results considerably.
I work with this type of data and I assure you that the results are quite plausible. The original hypothesis was tested against US census data. See "Experiment B" here:
I'll add that there are far fewer live births per day in the US than there are zip codes. I agree that some highly populated areas are problematic, but that may be the main reason the 87.1% number isn't 100%!
It may have gotten lost in the revisions of my comment, but I was consciously not trying to argue that Sweeney's numbers were wrong, only that Cook's explanation was lacking since it doesn't discuss distribution. I hadn't yet looked at the paper.
That said, looking at the paper you linked now, I don't see how Cook's simulation (simulation, not explanation) and Sweeney's paper can both be correct. Cook got 84-85% identifiable assuming uniform age distribution and identical population per zipcode. At the bottom of Figure 14, Sweeney says 87% for the US as a whole.
Shouldn't any non-uniformity (of zip code population or age clustering) act only to reduce the percent of the population that is identifiable? That is, shouldn't Cook's simulation with flat age distribution and equal zipcode populations be an upper bound on identifiability? Since Cook's simulation code looks fine, this makes me suspect that there's something off about Sweeney's analysis.
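For reference, here is a quick Monte Carlo along the lines of what I understand Cook's simulation to be doing; the 10,000 people per zip and the roughly 79-year age span are my guesses at his parameters, not something I've verified:

    import random
    from collections import Counter

    # One zipcode of 10,000 people, each assigned a uniformly random
    # (sex, birthdate) bin out of 2 * 365 * 79 possibilities; count how
    # many people end up in a bin by themselves.  The per-zip size and
    # age span are assumptions about Cook's setup.
    D = 2 * 365 * 79
    N = 10_000
    TRIALS = 200

    unique_fraction = 0.0
    for _ in range(TRIALS):
        bins = Counter(random.randrange(D) for _ in range(N))
        singletons = sum(1 for count in bins.values() if count == 1)
        unique_fraction += singletons / N
    print(unique_fraction / TRIALS)  # comes out around 0.84

Since every zip is identical under those assumptions, the per-zip fraction is also the national fraction.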
Is the 87% perhaps an average of the state percentages, and not properly weighted by state population? Or maybe an average across age classes not weighted by population of that class? Oh, I don't know about those, but maybe I see a bigger issue now...
In Section 4.3.1, Sweeney defines the "Number of subjects uniquely identified in a subdivision of a geographical area". But this isn't a simulation like Cook's; she's just using a binary yes/no depending on whether the subpopulation in each age class exceeds a numerical threshold:
if population(zi, a) ≥ |Qa|, then ID_aZi = population(zi, a);
else ID_aZi = 0.
While it's nice that it's clearly defined, I don't think this yields a "percent identifiable" that matches up with Cook's simulation, nor with any common usage of the term. Also (while I'm being picky) isn't the definition backward? Why would we want ID_aZi to be zero when the population is less than the threshold? The small subpopulations, the ones with fewer people than possible attribute combinations, are exactly the ones most likely to be identifiable. I presume the direction of the inequality is just a typo, but if the paper is using a hard threshold rather than some more rigorous approach like Cook's simulation, this seems like a major flaw in the interpretability of the results.
I'll note again that I haven't read the paper closely, and might well be misinterpreting what it is doing.
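To spell out the difference between the two approaches, here is the hard-threshold rule as I read it (with the inequality flipped to what I assume was intended) next to the simulation-style expected count. Both are my paraphrases, not code from the paper, and |Qa| is taken to be the number of possible (sex, birthday) combinations within an age class:

    # My paraphrase of the hard-threshold rule (inequality flipped to
    # what I assume was intended) versus the expected number of uniques
    # a simulation-style calculation would give.
    def threshold_identified(population, q):
        # All-or-nothing: count everyone if the subpopulation is no
        # bigger than the number of possible combinations q.
        return population if population <= q else 0

    def expected_identified(population, q):
        # Expected number of people who are actually unique, if the q
        # combinations are uniformly likely within the subpopulation.
        return population * (1 - 1 / q) ** (population - 1)

    q = 2 * 365  # e.g. (sex, birthday) within a one-year age class
    for n in (100, 500, 730, 1000, 5000):
        print(n, threshold_identified(n, q), round(expected_identified(n, q)))

The threshold rule jumps from "everyone" to "no one" as the subpopulation crosses q, while the expected count changes smoothly, which is why I don't think the two numbers are directly comparable.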
---
Since I'm still in the edit window, I'll add an update here. I downloaded the per-zipcode population data from here: https://blog.splitwise.com/2013/09/18/the-2010-us-census-pop.... Then I wrote a quick Perl program (paralleling Cook's Python simulation) but using the actual per-zipcode populations rather than a fixed average. After confirming Cook's 84% number with the fixed population, I ran it on the actual populations (but still with a flat distribution for age and sex) and got 63% uniques.
Presumably this number would drop somewhat further with actual age distributions, but I don't know how far exactly. My current belief is that Sweeney's paper does a good job of calling attention to the fact that the risk of identification is high, but the methodology and exact numbers should not be trusted. The actual percentage of Americans identifiable by (zip, dob, sex) is large, but something less than 63%. It might be interesting to run the simulation with actual age bracket data, but I didn't find that in any easy-to-download format, so I think I'll stop here.
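For anyone who wants to reproduce it, the whole calculation fits in a few lines of Python; the file name and column name below are guesses at the layout of that census download, so adjust them to whatever the CSV actually uses:

    import csv

    # Weight the expected unique fraction by actual per-zipcode
    # populations.  Still assumes sex and birthdate are uniform within
    # each zip and an age span of about 79 years.  The file name and the
    # "2010 Population" column name are guesses, not checked against the
    # actual download.
    D = 2 * 365 * 79

    total_people = 0.0
    expected_uniques = 0.0
    with open("2010_census_population_by_zip.csv", newline="") as f:
        for row in csv.DictReader(f):
            n = int(row["2010 Population"].replace(",", "") or 0)
            if n == 0:
                continue
            total_people += n
            expected_uniques += n * (1 - 1 / D) ** (n - 1)

    print(expected_uniques / total_people)  # should land near the 63% figure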
Another point that people forget is that real-world populations are not only unevenly distributed across geography; within a given geography, they are also unevenly distributed by age.
The real distribution of birthdates within a neighborhood is almost always far narrower than a uniform distribution would suggest, because schools, DINKs, retirees, students, workers, etc. all tend to cluster in real life.
This tendency of demographics to cluster makes re-identification harder than the theoretical models suggest, and in some ways it even offers a bit of protection, because researchers who don't test their models assume greater accuracy than they actually achieve.
Of course, this might not be as protective as one might assume, because for a lot of activities, even if people only loosely link you to someone else, odds are pretty good you have similar modelled outcomes to someone very close to you in properties anyway...
I editorialized the title a bit - the author's title is "Simulating identification by zip code, sex and birthdate." Instead I've used the salient conclusion of the article, which is pretty interesting.
This article is a footnote to another front page submission by the same author, "No funding for uncomfortable results." While that article is sobering, I find this one to be technically cooler. The author uses a simple probabilistic calculation - alongside a Python simulation - to demonstrate how 87% of Americans are uniquely identifiable, given these three points of data.
Not a novel finding, but I like the way the author demonstrates the conclusion analytically and programmatically.
I think that your alteration of the title is good, but why did you change "sex" to "gender"?! The study appears to be about sex as a matter of public record, so you've added inaccuracy, or at least ambiguity, to the title.
Oh good point. I don't know, I was writing from memory. I didn't mean any offense by it and didn't think about the implied difference between the two terms. Sorry!
In some cultures, sex is what you do, gender is what you have. So for me, when asked for sex, my answer is "yes, please". When asked for gender it's M/F.
I understand that the US is different in this regard.
It's generally understood that sex is biological and gender is social, or grammatical: it makes sense for those two to go together, since in many languages, including standard English, you need to know a person's gender in order to talk about them easily.
In the UK, at least, official documents specify "sex", not "gender". We may be slowly moving towards a world in which sex is officially recorded only in medical records, while gender is an optional field in social media accounts and the like.
All the forms in my country, especially the government ones, use 'gender' and not 'sex'. For me, when I see 'sex' on a form, it's jarring to say the least.
That's interesting. I wonder what's going on there. Do the words "sex" and "gender" have slightly different meanings for an Australian, or are the Australian forms asking a slightly different question from the UK ones?
But your passport says "Sex / Sexe" just below your date of birth, right?
The design of passports is rather precisely defined by an international standard. Interestingly, and perhaps relevantly, Australia is one of a dozen countries that routinely allows its citizens to have an 'X' instead of an 'F' or 'M' in that field. There is an ongoing legal case to obtain that right in the UK (Christie Elan-Cane).
Does your driving licence indicate "sex" or "gender"? A UK driving licence shows neither, and it doesn't give a "title", either: just full name, address, date of birth, and photo. In most cases that would allow one to guess the sex and gender of the holder, of course, but not always.
Biological sex is most often male or female, but not always. People can be born intersex (with genitalia that are neither clearly male nor female, which is often “corrected” surgically at birth, and/or with genitalia that don’t match what’s expected from their chromosomes, and/or with chromosome combinations other than XX or XY). Chimerism can result in different parts of the body having different sex chromosomes. It’s not strictly monolithic or binary.
That aside, yes, “sex” broadly refers to biology in some way, though often oversimplified, while “gender” refers to identity and expression (which also don’t necessarily go hand in hand).
In medical and demographic datasets where there is a need to differentiate between biological sex and self-identified gender, then yes, they are used that way.
It works very well, and - since it's for identification, not authentication - you don't have to worry about it leaking out too much, like you would with SSN.
While the principle is broadly correct, in real life you've got to worry about a host of other things.
Firstly, while you can come up with a theoretical number of persons who are uniquely identified, it's harder to establish who IS uniquely identified. This might seem like splitting hairs, but it's quite fundamental: imagine you knew you could identify 50% of the population uniquely; unless you know which 50% of the population, have you really identified anyone? Clearly, knowing someone's sex, age, and location gives me analytical information about them that I can use to make predictions (a 5-year-old female in California is going to be fundamentally different from an 85-year-old male in Alabama), but is it really identification yet, and do we even need identification to make useful predictions?
Secondly, what you presumably care about in 'identification' is not 'single source' probability per se. In the real world, what most people care about is multiple-source identifiability: not the odds of one dataset uniquely identifying someone (just by collecting gender I might eventually uniquely identify someone in some remote geography, for example), but the odds of uniquely identifying someone across TWO (or more) data sources that were previously unlinked, because that's what it takes to expand your current information set (you didn't know this person was identifiable in both data sources, and now you do, so you can bring together more information than you originally held).
This second point is extremely important, because in the real world you have to worry about transcription errors, recording formats, corruption, temporal changes, and the differing scope of the two data sources. Most people do not get to work with complete, accurate censuses of an unchanging population at two points in time.
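As a toy illustration of what linkage on (zip, dob, sex) looks like (entirely made-up records, and a deliberately naive exact-match join that ignores all of the real-world problems above):

    # Toy multi-source linkage on (zip, dob, sex): join two made-up
    # datasets and keep only the matches that are unambiguous on both
    # sides.  Real-world linkage has to cope with typos, format
    # differences, and people moving; none of that is handled here.
    from collections import defaultdict

    voter_file = [
        {"name": "A. Smith", "zip": "83701", "dob": "1980-02-17", "sex": "F"},
        {"name": "B. Jones", "zip": "83701", "dob": "1975-06-01", "sex": "M"},
    ]
    hospital_records = [
        {"diagnosis": "asthma", "zip": "83701", "dob": "1980-02-17", "sex": "F"},
        {"diagnosis": "flu",    "zip": "83701", "dob": "1990-12-30", "sex": "M"},
    ]

    def index_by_key(records):
        groups = defaultdict(list)
        for r in records:
            groups[(r["zip"], r["dob"], r["sex"])].append(r)
        return groups

    left, right = index_by_key(voter_file), index_by_key(hospital_records)
    for key in left.keys() & right.keys():
        # Only count a link when the key is unique in BOTH sources.
        if len(left[key]) == 1 == len(right[key]):
            print(left[key][0]["name"], "<->", right[key][0]["diagnosis"])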
From my own practical experience, something like the zip code, sex and birthdate combination is powerful, and yes, you'll be able to uniquely identify some people with such information (especially in smaller geographical areas), but the practical rate will be far far less than 87%. But for many modelling purposes, it doesn't need to be spectacularly accurate to be useful anyway.
I like it when they go through the simulation/computational solution before showing the analytical solution. Even much harder problems often have very simple computational or simulation "solutions".
Especially when the analytical solution is so hard to write down that it requires approximations. I've always wondered why we spend so much time learning analytical methods when, once faced with real problems, we have to shortcut them with various empirical approximations...
Not me, because that would identify me as a relatively privacy-conscious nerdy type. Combine that info with one or two other data points, and I'm sure it could be used to identify someone.
It's only one bit of data, but it's easily apparent if we get to ignore things like gender-fluid or intersex people. Sex can be inferred from appearance, name, clothing purchases, or shopping history--so it's almost free data.
Although Google still thinks I'm the opposite gender, I guess I'm not heteronormative enough.
I think it's just about anonymity in research data. These are the most common variables across datasets.
More and more data is being released, especially government-funded data in the US. Anonymity of respondents is a big deal. Would you like everyone to know your debt? Or your salary?
If people knew that they would be identified and their answers made public, they might lie or refuse to participate.
As a researcher you want truthful data about a representative sample of everyone. If you exclude people who are sensitive about privacy, the sample is not representative.
There is a trade-off between respondent anonymity, usefulness of the data, and complexity. Take, for instance, the "Survey of Consumer Finances" released by the Federal Reserve. It provides quite a few variables, which is useful, but it uses multiply imputed observations to provide anonymity. However, multiply imputed data is a pain in the ass to deal with :(
Well, if you work with even smaller data sets, every additional attribute narrows the field quite a bit unless it is ubiquitous. Adding a species field to employee records, or a "doesn't have Marfan's syndrome" flag, wouldn't narrow things much, but cumulative divisions of N help a lot with scaling.
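A rough way to see the cumulative-divisions effect, assuming independent and roughly uniformly distributed attributes (the dataset size and cardinalities here are made up for illustration):

    # Expected size of the group sharing all of your attribute values,
    # assuming independent, roughly uniform attributes.  The dataset
    # size and cardinalities are made-up illustration values.
    N = 10_000  # e.g. employee records

    attributes = {"sex": 2, "birth year": 40, "zip": 50, "department": 20}

    group = float(N)
    for name, cardinality in attributes.items():
        group /= cardinality
        print(f"after {name:<12} expected group size ~ {group:,.2f}")
    # An attribute with only one value in the dataset (like species, or
    # the Marfan example) divides by 1 and doesn't narrow the group at all.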