About 33 bits (33bits.org)
201 points by microarchitect on March 1, 2012 | 37 comments



Hey, I'm the author of this blog. Much of my previous deanonymization research has been discussed on HN; see http://www.google.com/search?q=33bits.orgsite:news.ycombinat... Also, if you find the premise of the blog interesting check out the sitemap linked from the page.

But since this post is about the About page, let me share a couple of lessons I've learned from the blog, which has been more successful in communicating my research than I'd dared to hope for when I started it 3.5 years ago.

1. Those of us working in technical areas often struggle to explain our ideas to others who aren't as technical, in a way that avoids oversimplification and losing essential meaning. Sometimes you'll discover an analogy or metaphor or phrase that does both. Seize those chances; they're powerful.

2. Coming up with a name is more important than you might think. If a good name will make your idea or product even 5% stickier, it follows that it may be worthwhile to spend 5% of your time just coming up with the name. One way to do it is to be constantly on the lookout for a good name while you're working on the product.

3. If you're writing about something that has policy implications, and want it to be read in Washington, it's hard but not impossible. Two important requirements are to network and build up an audience — they aren't going to read your blog just because it ranks high in Google searches — and to use language that non-technical people can understand.

Happy to answer any questions!


You say in your blog that you sat on the "Heritage Health Prize" advisory board. Have you looked at these data sets?

https://www.i2b2.org/NLP/DataSets/Main.php

De-identification of medical charts is a bottleneck in clinical research. It's impractical to ask for thousands of consent forms; however, smaller sample sizes are inconclusive, so much so that most of medicine is driven by inconclusive research findings. Moreover, full anonymization doesn't allow patient records to be followed over time. This will kill any big patient-outcome study, at least financially. What are your thoughts?


Hey! I ran into your blog after I saw an announcement for (one of?) your talk(s) next week.

I submitted the about page because the two key claims that you make both make a lot of sense to me: (1) you only need a few bits of information to identify a person uniquely in the whole world, and (2) this information is becoming easier and easier to obtain. Your about page does an excellent job of communicating these two points, and I thought it might be interesting food for thought for HN.

I'm wondering whether, much as we'd all like to have privacy and anonymity, these could be goals that are impossible to achieve in the future. I'd like to hear your thoughts on where we as a society are heading in this context, and whether it's unrealistic to expect that conventional expectations of privacy will continue to be fulfilled. Perhaps we should accept that the privacy battle is lost and look for other solutions to the problems that privacy was solving?


That's a great question with no simple answer. I've written two essays about this that look at it from two different sides:

http://33bits.org/2011/10/18/printer-dotspervasive-tracking-...

http://33bits.org/2011/06/08/the-many-ways-in-which-the-inte...

The synopses of the two posts are:

My opinion is that it is impossible to put the genie back into the bottle — the cost of tracking every person, object and activity will continue to drop exponentially. ... If we accept that we cannot stop the invention and use of tracking technologies, what are our choices? Our best hope, I believe, is a world in which the ability to conduct tracking and surveillance is symmetrically distributed, a society in which ordinary citizens can and do turn the spotlight on those in power, keeping that power in check. On the other hand, a world in which only the government, large corporations and the rich are able to utilize these technologies, but themselves hide under a veil of secrecy, would be a true dystopia.

and from the other side:

There are many, many things that digital technology allows us to do more privately today than we ever could.

[examples snipped, but I recommend taking a look at the post]

Of course, I’ve only presented one half of the story. The other half, that technology is also allowing us to expose ourselves in ways never before possible, has been told so many times by so many people, and so loudly, that it is drowning out meaningful conversation about privacy.

Although these two opinions might at first sight seem contradictory, they are not. Some day I will get around to putting the two sides of the argument together into a coherent narrative that explains the nuanced scenario that I think we're heading towards, but for now I will offer you the above articles.


While I'm here, thanks for writing your blog. I would never have been able to write my little essay http://www.gwern.net/Death%20Note%20Anonymity without it, and oddly enough, it's turned out to be (on Hacker News) the most popular thing I've ever written: http://news.ycombinator.com/item?id=3634320


I've been trying to think of a way of using typing cadence to capture "bits" of information. Think you would have a good solution for that?

Take this scenario (and also check out the disclaimer at the bottom!).

All users on Earth type the same paragraph, or perhaps some password (clearly, the longer the text, the more distinct the fingerprint, but bear with me on this).

Based on this sequence of keypresses, I capture the timestamp at which each key is pressed and the duration for which it is held down.

Based on this information, how would you suggest a person go about extracting a unique fingerprint from these values?

I was thinking the best way would be to treat each keypress as a point in some space and the duration it's held down as a vector. Then, if each user is entering the same paragraph, the distance across all of the vectors could be used to calculate some identifying fingerprint.
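Roughly what I have in mind, as a toy sketch (made-up timing numbers; the feature vector is just key hold times plus the gaps between presses, compared with Euclidean distance):

    import math

    def features(presses):
        # presses: list of (key, press_time, release_time) for one typing sample
        holds = [release - press for _, press, release in presses]  # how long each key is held
        gaps = [presses[i + 1][1] - presses[i][1]                   # time between consecutive presses
                for i in range(len(presses) - 1)]
        return holds + gaps

    def distance(sample_a, sample_b):
        # Euclidean distance between two samples of the same text
        fa, fb = features(sample_a), features(sample_b)
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))

    # hypothetical samples: the same text "hi" typed by two different people
    alice = [("h", 0.00, 0.08), ("i", 0.15, 0.22)]
    bob = [("h", 0.00, 0.12), ("i", 0.31, 0.36)]
    print(distance(alice, alice))  # 0.0  -- identical cadence
    print(distance(alice, bob))    # ~0.17 -- different cadence

In reality you'd want many samples per user and something more robust than raw Euclidean distance, but that's the general shape of it.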

Disclaimer: I'm absolutely not interested in the slightest in tracking users. Every weekend I try to research something that interests me. Last weekend it was user fingerprinting based on users' typing speed and cadence.


This is a well-known technology :-) See http://en.wikipedia.org/wiki/Keystroke_dynamics

In the research community it's a proven and accepted concept. There are products in the market that do two-factor authentication based on password + keystroke dynamics, but I don't know how well they work.


Thanks! I was sure it must have been called something!


I know a user who treated his computer keyboard like the keyboard of a musical instrument, and "played a tune" with the keys as his password.


It's too bad that this is halfway to the bottom of the page.

That's easily one of the most interesting steganographic techniques I have ever heard of.

It certainly allows for unique password reminders. A tape recording next to your monitor with strange music on it would probably be passed over by... I honestly can't think of anyone who'd make the leap between music and passwords. So let's just say anyone except (potentially) people who've read this post, the user you mentioned, and anyone else who happens to be doing this.


> If a good name will make your idea or product even 5% stickier, it follows that it may be worthwhile to spend 5% of your time just coming up with the name.

I'm a sucker for a great name, but this is misleading. If something will give you an x% better outcome, it doesn't follow that you should spend about x% of your resources on it.

It depends entirely on the opportunity costs, yeah? You should spend time on your name only when you believe that thinking about names for another hour will do more good than coding or talking to customers for another hour.


> 1. Those of us working in technical areas often struggle to explain our ideas to others who aren't as technical, in a way that avoids oversimplification and losing essential meaning. Sometimes you'll discover an analogy or metaphor or phrase that does both. Seize those chances; they're powerful.

Sorry, does both of what? I don't mean to nitpick, but this is a problem I run into regularly and don't seem to be able to find a reliable approach for, so I'm just trying to break it down into the factors involved...


Does {explain ideas} in a way that both {is understandable by non-technical people} and {avoids losing essential meaning}.


I understand the log2 concept of being able to narrow something down via binary search, but I have a question: don't the "facts" about a person have to divide the remaining population in half (or into smaller chunks)?

For instance if you know "Frank" doesn't wear a Rolex, that would not rule out very many people. So statistically, it would probably be better to know if Frank has red hair, as that could rule out a lot more people.

Also, let's say you have it narrowed down to four people, but the last bit of information is common to all of them. You now have to get another bit, and possibly another, correct?

EDIT: Felt like I didn't express my main point well enough: while you can certainly narrow down people with "bits" of information, information is most of the time not just 1 or 0 and can be too fuzzy (or too common) to be useful in a binary search, although with the right bits of information it can of course be fruitful.

I'm really interested in this concept and also curious whether anyone is employing it on a mass scale.


"Also, let's say you have it narrowed down to four people, but the last bit of information is common to all of them."

The definition of a bit is something which removes half the possibilities. If you have 4 people and acquire a "bit" of information that breaks them into two categories, one with 4 people and one with 0 people, you, by definition, in fact have 0 bits.

Fractional bits are not only possible, they are by far the common case. With a lg2 in the definition of the bit, it's pretty uncommon to have integral bits.

Critical insight: What we call a "bit" in a computer and a "bit" in information theory are related but not the same thing. You can't have a fraction of a bit stored in your computer's RAM, the words are meaningless. It is best to simply flush your idea of what you think a bit is and start over again from scratch when studying information theory, then when you are comfortable with it the connections will become obvious. Starting from the RAM side is actively harmful.
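A quick way to see fractional bits in action (toy prevalence numbers, purely illustrative):

    import math

    def bits(p):
        # information revealed by learning a fact that is true of a fraction p of the remaining candidates
        return -math.log2(p)

    print(bits(0.5))   # 1.0   -- splits the population exactly in half: one full bit
    print(bits(0.02))  # ~5.6  -- "has red hair", assuming 2% prevalence: rules out a lot
    print(bits(0.98))  # ~0.03 -- "doesn't wear a Rolex": true of nearly everyone, almost nothing learned
    print(bits(1.0))   # 0.0   -- true of all 4 remaining people: zero bits, as above

Facts that are rare among your remaining candidates are worth many bits; facts that are nearly universal are worth almost none.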


This makes sense, thanks for explaining the semantics. So basically, you're confirming that you'd need 33 "bits" of information, which doesn't necessarily mean you can use 33 pieces of information because a bit is a lot more specific than a one-off piece of information.


33 bits of information can amount to fewer than 33 pieces of information. Quoting from the article: "...knowing your hometown gives me 16 bits of entropy about you".

Edit: a relevant anecdote: when I was in high school, a friend went as an exchange student to Costa Rica. As an experiment, I had her send me a blank postcard with nothing on it but "pohl 68320 USA". It arrived in my post office box without a hitch.


You can learn part of a bit from a fact that doesn't divide the population neatly in half: http://en.wikipedia.org/wiki/Entropy_(information_theory)#Ex...


I think the premise is false. You would need about 33 _unique_ bits. I doubt that you can prove the existence of a person-independent algorithm to gather these.


See <strike>comment #12</strike> the comment posted on February 12, 2010 at 5:15 am in the blog post.* The term entropy refers to uniqueness.

As for the development of algorithms to gather those bits, that's what my entire Ph.D. is about and what my blog is mostly about. This is what I've been proving for the last 6 years.

*Just realized comment numbers are unstable. Bad wordpress.


If you want to be very precise, you need evidence which causes at least a 33-bit reduction in entropy between your prior and posterior estimates of the probability of each person in the world being the one you're looking for.

http://en.wikipedia.org/wiki/Entropy_(information_theory)

As the stuff posted on 33bits regularly demonstrates, it is surprisingly easy to get this much information for a whole lot of people.


I also agree the premise is false, but it doesn't need to be 33 unique bits either. The combination of the bits has to be unique, and that is not something easily provable unless you know the entire dataset, so the premise is just kind of irrelevant to reality, I think. They do this same kind of thing on crime shows all the time. He has a blue truck and a mustache. How many people with that description live in lower Queens? "13, sir."


How fortuitous that you would post such a comment while this discussion is on the front page: http://news.ycombinator.com/item?id=3652067

I suggest you peruse it.


Just out of curiosity, how many bits would it take to include all the people that have ever lived? Also, how many to realistically cover the future?



Isn't the answer always 42?


If you take the timespan of HHGTTG into account, yes. ;)


37 bits for everyone who's ever lived (with room for about 30% more people)... and for the future? I dunno, how far do you want to go? Each additional bit covers 2x as many people...

However, 33 bits is a simplification. You can express about 8.6 billion different values with 33 bits, but unless those 33 bits map to well-distributed discriminators, the number is meaningless...


It's not a simplification. It's information theory. It's not about one-to-one encodings (e.g., bit 14 is whether you live in Arizona or not -- such a scheme would be ridiculous), but rather about the amount of information that a discriminator reveals.

For instance, knowing that a person lives in the US reveals a little under 5 bits of information (there are a little over 2^28 people living in the US, according to Google). The entropy is just a measure of the amount of information that gives us in narrowing down a population; not an exact encoding.
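Spelling out that example with rounded population figures:

    import math

    world = 6.6e9  # world population at the time
    us = 3.1e8     # US population, a little over 2^28

    print(math.log2(world))       # ~32.6 -- bits needed to single out one person on Earth
    print(math.log2(world / us))  # ~4.4  -- bits revealed by "lives in the US"

Stack up enough discriminators like that and you get to 33 bits surprisingly quickly.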

*Edit - s/science/theory/i


How can anyone possibly ever answer that question?

People have been disputing how many people have ever lived on the earth for decades. Some anthropologists have thrown out numbers as high as 70-120 billion, although other scientists have admittedly said the number is probably around 7-10 billion.

However many people have ever lived, though, the answer is just log2(X) bits. As the author of the blog says, 6,000 billion people could be written in just 43 bits.


According to Wikipedia [1], numbers around 100 billion have been bandied about quite seriously, and much lower estimates have been "debunked as unscientific".

An interesting debate, regardless. I had no idea it could be so hard to calculate. At any rate, as you say, 43 bits would have humanity covered for a long, long time.

[1] http://en.wikipedia.org/wiki/World_population#Number_of_huma...


A quick Google search showed me that 107 billion people have ever lived, so you would need at least 37 bits. 2^37 ≈ 137 billion.


The Anthropic Doomsday Argument says that, since it would take 37 bits to include all the people that have lived so far, it takes about 38 bits to include all the people that will ever live. Many people find this somewhat disconcerting.


On anonymity and privacy... I always thought this was an interesting fact:

Birthday, Gender and Zipcode is enough to identify someone uniquely approximately 85% of the time.

And a quickly googled source but the meme is older than that: http://godplaysdice.blogspot.com/2009/12/uniquely-identifyin...


Really? That would imply there are only 730 people living in a given zip code (365 days in a year * 2 genders). Unless you mean birth date.


Yes, he meant birth date.
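A back-of-envelope check, with rounded figures and the (wrong, but good enough here) assumption that the attributes are uniformly distributed:

    import math

    zip_codes = 43000        # roughly the number of US ZIP codes
    genders = 2
    birth_dates = 365 * 79   # distinct birth dates over a ~79-year lifespan

    combo = math.log2(zip_codes) + math.log2(genders) + math.log2(birth_dates)
    needed = math.log2(3.1e8)  # bits to single out one US resident

    print(combo)   # ~31.2 bits carried by the combination
    print(needed)  # ~28.2 bits needed

With a few bits to spare, it's not surprising that the combination is unique for most (though not all) people.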


> There are only 6.6 billion people in the world, so you only need 33 bits (more precisely, 32.6 bits) of information about a person to determine who they are.

I think you should count the dead as well. But then, 33 bits ≈ 8.6 billion, which should still be enough, I guess.



