33 Questions (github.com/markdunne)
283 points by splike on Nov 24, 2013 | 161 comments



Fun to think about, but in the real world, no question neatly divides people, even the gender one. To quote Reddit's u/tailcalled[1], the exo-software/meatspace world is even less standardized than the software world:

Falsehoods programmers believe about gender: http://www.cscyphers.com/blog/2012/06/28/falsehoods-programm...

Falsehoods programmers believe about names: http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...

Falsehoods programmers believe about addresses: http://www.mjt.me.uk/posts/falsehoods-programmers-believe-ab...

Falsehoods programmers believe about time: http://infiniteundo.com/post/25326999628/falsehoods-programm...

More falsehoods programmers believe about time: http://infiniteundo.com/post/25509354022/more-falsehoods-pro...

Falsehoods programmers believe about geography: http://wiesmann.codiferes.net/wordpress/?p=15187&lang=en

[1] http://www.reddit.com/r/programming/comments/1fc147/falsehoo...


I dislike some of the examples of the "falsehoods programmers believe about time" because they typically aren't falsehoods, and will always hold true within the constraints of your system. Yes, time is a fickle thing that is marred by history, but in building a system, I choose a representation for the data I am storing in it. If I choose to store dates using a Gregorian calendar (because I'm following ISO-8601), then,

> Months have either 28, 29, 30, or 31 days.

Does hold true: months do have 28 to 31 days, by definition. If someone wants my Gregorian date in a different calendar system, then we must convert, but that's a display issue. Otherwise, we're comparing apples and oranges, and of course all bets are off.

It's kind of like time: the hour "2 AM" never repeats itself on random politically appointed days, because I have chosen to store timestamps in UTC. If someone wants to see that timestamp in "PST" (or America/Los_Angeles), then that's a display issue that can be accommodated, but it does not suddenly violate the fact that "2 am" never repeats in UTC.

Of course, in real life, people don't know what timezone their date is stored in and things love to break randomly on DST switch-overs or leap days or the ends of leap years.


2am doesn't even repeat in the local timezone, it's just that the timezone changes. So here for example, 1:59:59 am CDT increments to 1:00:00 CST, which does not conflict with the earlier 1:00:00 CDT. I think it would be better not to go through all this hassle though, myself.
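
A minimal sketch of that transition, assuming fixed UTC offsets as stand-ins for CDT and CST (a real system would use a tz database instead). 2013-11-03 was the US fall-back day that year, and the variable names are made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: hard-coded offsets stand in for the two halves of the zone.
CDT = timezone(timedelta(hours=-5), "CDT")
CST = timezone(timedelta(hours=-6), "CST")

# One second after 1:59:59 am CDT, viewed with the CST offset.
last_cdt_second = datetime(2013, 11, 3, 1, 59, 59, tzinfo=CDT)
next_instant = (last_cdt_second + timedelta(seconds=1)).astimezone(CST)

# The two "1:00:00 am" readings on that night.
first_one_am = datetime(2013, 11, 3, 1, 0, 0, tzinfo=CDT)
second_one_am = datetime(2013, 11, 3, 1, 0, 0, tzinfo=CST)

# Same wall-clock digits, different offsets, different instants.
gap = second_one_am - first_one_am
print(next_instant.isoformat(), gap)
```

The wall clock reads 1:00:00 twice, but the two readings carry different offsets and map to UTC instants one hour apart, so nothing actually conflicts.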


Nitpicking a bit, but I'd usually argue that the TZ doesn't change: You went from "America/Chicago" to "America/Chicago". The timezone encapsulates more than just the offset: it also includes when and how DST works. Merely, you went from "1:59:59 am America/Chicago DST=true" to "1:00:00 am America/Chicago DST=false". I'd say this is just terminology, since "CDT" and "CST" work just as well (and how I envision it in my head, mostly due to the official TZ name being "America/Chicago").


2 AM might not repeat in UTC, but midnight does if your internal representation uses unix timestamps (or similar - .Net doesn't count leap second ticks either)

http://en.wikipedia.org/wiki/Unix_time
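
A small illustration, assuming Python's `calendar.timegm`, which does plain calendar arithmetic and therefore accepts a seconds field of 60. The leap second at the end of 2012-06-30 lands on the same Unix timestamp as the following midnight:

```python
import calendar

# 2012-06-30 ended with a leap second, written 23:59:60 UTC.
before = calendar.timegm((2012, 6, 30, 23, 59, 59))
leap_second = calendar.timegm((2012, 6, 30, 23, 59, 60))
midnight = calendar.timegm((2012, 7, 1, 0, 0, 0))

# Unix time counts exactly 86400 seconds per day, so the leap second
# has no timestamp of its own and collides with the next midnight.
same_timestamp = leap_second == midnight

# Unix time advances by 1 here, although 2 SI seconds elapsed.
elapsed_in_unix_time = midnight - before
print(same_timestamp, elapsed_in_unix_time)
```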


If you store time in UTC, then you haven't solved all possible time storage problems - for example, you still need to take into account that some days have 25 hours; so any structures for exchange/storage of periodic data (hourly planning schedules; by-minute temperature readings) need to contain a variable number of data points instead of a fixed set of 24 hours. If your HR system is doing employee scheduling for 24/7 shifts, then those days may cause a lot of issues no matter how you represent that data.

And the 'display issues' are nontrivial - it's not just converting a timestamp to a string; it has tricky consequences for UI layout and printing if those night hours matter.


> some days have 25 hours

I'm curious to know in what circumstance you think a UTC day has 25 hours.

> it has tricky consequences for UI layout and printing if those night hours matter.

This is very true. I don't believe Google Calendar (or any calendar app that I've seen, really) handles displaying DST weirdness well. It would be very tricky to render. I'd love to see something attempt to tackle this.


You can (and likely should) choose to store your locale-specific times in UTC, but you often can't choose to define 'days' as UTC days.

If your days are meaningful for your app in any way, then you have to have clear boundaries between days/weeks/months - if your website has a report 'downloads per day', then it doesn't mean UTC days (which for many locations would mean splitting in the middle of business hours). And it has days where the difference between 'start-of-day' and 'end-of-day' is not 24 hours, but 25 hours.

Also, no matter how you handle time storage, if you're doing any analytics, and your process is minute-dependent instead of day-dependent (say, power consumption, not purchases), then your daily totals will have ~5% jumps twice a year that you might need to adjust.
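
A rough sketch of such a day, assuming fixed offsets in place of a real tz database and using Chicago's 2013-11-03 fall-back day as the example (names are illustrative only):

```python
from datetime import datetime, timedelta, timezone

# Assumption: hard-coded offsets instead of a tz database lookup.
CDT = timezone(timedelta(hours=-5), "CDT")
CST = timezone(timedelta(hours=-6), "CST")

# The local calendar day starts under CDT and ends under CST,
# even though both boundaries are stored as UTC instants.
day_start = datetime(2013, 11, 3, 0, 0, tzinfo=CDT).astimezone(timezone.utc)
day_end = datetime(2013, 11, 4, 0, 0, tzinfo=CST).astimezone(timezone.utc)

day_length = day_end - day_start

# Number of hourly slots an "hourly schedule" needs for this day.
hourly_slots = day_length // timedelta(hours=1)

# Extra share of a minute-driven daily total compared with a 24-hour day.
extra_fraction = day_length / timedelta(hours=24) - 1
print(day_length, hourly_slots, round(extra_fraction, 4))
```

With a constant per-minute load, the extra hour works out to about 4% more than a normal day's total, in line with the rough jump described above.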


It is explicitly not a display issue.


I'd suggest, "Do you self-identify as female?" There are slightly more women than men, so your bifurcation is "female" and male/intersex/genderqueer. I'd suggest that intersex and genderqueer people are currently a small enough population that it would be close to a 51/49 ratio of other to self-identified women.


The question should be about biology, not stuff like gender. Biology solves this problem (most of the time)


Not really. The stats have not been thoroughly calculated (to my knowledge) but here is an attempt: http://www.isna.org/faq/frequency

Not XX and not XY: one in 1,666 births

Klinefelter XXY: one in 1,000 births

There are other relevant cases, not necessarily measured here, like mosaicisms and chimerism.

"Do you self-identify as female?" is a more answerable question.


but doesn't 1:1666 mean the success rate is more than 99%?

Asking about the possession of an XX chromosome should separate the people almost 50:50, the rest is something else.

On the other hand, maybe there are just <1% people who don't identify themselves with being male or female, so the questions would make no difference...


| On the other hand, maybe there are just <1% people who don't identify themselves with being male or female.

Huzzah for logic. But note that the question was, "Do you have 1 apple" and not, "Choose a statement: 'I have 1 apple', 'I have 2 apples'"


True story, but "Do you have XX-Chromosomes?" would do the trick and cut the mass of people at around 50%.


All that matters is that the answer stays consistent for purposes of the 33 questions.


> even the gender one

Sure, but if you ask "do you consider yourself classically male" and "do you consider yourself classically female", you'll get the vast majority of people, so you can still eliminate large swaths of population with either of these.


The interesting nature of the problem is that you can't just 'eliminate swathes', your question must evenly divide the entire population.


I don't think that that's a requirement at all, unless you think that his second example question of a list of countries perfectly divides the population with not even a rounding error


If you don't evenly divide the remaining population of the earlier questions, you'll need more than 33 questions.


For the purposes of this problem (and most other purposes) if you have a question that divides answers in 47%:A 51%:B 2%:'stupid_question_doesn't_fit_me_I'll_answer_randomly', then it's still perfectly okay.


No, answering randomly is not okay. You will not be able to reproduce the bit string if there is a random answer any more than two source texts with a single bit changed between them will produce the same hash.


Yes, there's so much about these.

Of course, in practice, you usually target your system to a narrow set of users at first.

But yeah, if you're facebook, or work with an airline booking system, for example, you will most likely hit every single item on these lists


> you usually target your system to a narrow set of users at first [but eventually the exceptions show up]

Reminded me of the College Humor sketch about security questions: http://www.collegehumor.com/embed/6936880/security-questions...


"but eventually the exceptions show up"

Yes. And they will be pissed/disappointed, especially if you're preventing them from doing something (their job, for example).

Great examples on the video! What's "impossible" today is obvious tomorrow.


"not until I saw the newsweek article and I thought 'thats me!'"


That's how I feel when sites expect me to have a phone number. I've moved on to use more modern replacements of that communication technology (hint: it uses the internet and doesn't care where in the world you are).


>... 33 'Yes' or 'No' general questions that, when answered correctly, uniquely identified everyone on the planet.

I think a lot is possible with this challenge. You could compress over 8 billion yes/no questions in a single yes/no question under these rules.

And if that doesn't neatly divide people, why not let the people divide themselves through unique thought?

A 33-bit hash would probably collide too much. Yet there seems no requirement to communicate your hash back to the creator of the questionnaire with your answers. It could be a 1024-bit hash of a short story like:

  Hi I am blauwbilgorgel. I currently define myself as male.
  My internet names are ... I lived at ... I think we are
  in Time Cube. Today is Setting Orange. I declare ...
that would create a unique hash with which to identify yourself.


But, then again, so would "I am John Doe and live in 23 Maple Road, Kentucky, US, 12345, Apartment 1a", that's why the post office uses it.


You can't identify everyone in the world that way though. There are people with the exact same name living in the same house, and there are people who don't have a postal address.


Well gender is good enough already: "Are you female?" does not imply everyone saying no is male and is close enough to 50% that it's not a big deal (2^33 leaves you with some 1.5 billion people leeway).
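
A back-of-the-envelope check of that leeway, assuming the rough 7 billion population figure used elsewhere in the thread. The entropy bound below is only an upper limit on what 33 equally skewed questions could carry, and it assumes the questions are independent:

```python
import math

POPULATION = 7_000_000_000  # assumption: rough 2013 world population

def binary_entropy(p):
    """Information (in bits) carried by a yes/no answer that is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def max_information(split, questions=33):
    """Upper bound on total bits from `questions` independent questions with the same split."""
    return questions * binary_entropy(split)

bits_needed = math.log2(POPULATION)   # bits required to label everyone
leeway = 2**33 - POPULATION           # unused 33-bit codes

print(round(bits_needed, 2), leeway)
print(max_information(0.51) >= bits_needed)  # mild skew still fits
print(max_information(0.60) >= bits_needed)  # a 60/40 split does not
```

So a question that is a percent or two off 50/50 costs very little, while repeated 60/40 splits would push the total past 33 questions.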


Not only does a "one-sided" question like this get you "close enough", the worldwide male/female gender disparity (around 60 million more males than females in 2010) is almost certainly larger than the worldwide intersex/trans/etc population.

A question like, "Are you a male or a Mexican female?" gets you even closer, though.


Interesting lists.

I was initially confused about the first one. Then I realized that the author mentioned working on HR systems and it clicked. But for most of us who aren't building HR databases, I honestly think most professional programmers don't have to think about gender in their work nearly as much as this list suggests - the biggest thing I could come up with would be for localization, where some strings may need to be tweaked for the gender of people or inanimate objects.


Mirror for the gender post: http://pastebin.com/raw.php?i=25bnhuBC


I don't think this problem is solvable in any elegant form, but it is solvable. You'll just end up with massively conjunctive questions that you can't even hold in your head at once, like "27: Are you a non-practicing Catholic with exactly three children, or an asian owner of a minivan produced between 1998 and 2004 that isn't green, or a licensed boat mechanic with astigmatism, or..." and so on for the next 6 pages.

In short, you can draw categories to include or exclude as precise a number as you like, you just have to be willing to draw really, really complicated boundaries.


Sounds like a premise for a dystopian sci-fi story: a future where every identity is exactly planned, where everyone's life is determined by ... 33 bits. Donald Sutherland can be the benevolent ruler that tells the protagonist how the unbridled greed of the 21st century brought us here (Hollywood adaptation can add an ironic anti-consumerist twist).

Perhaps this could be a retro sci-fi a la "Brazil", with each person carrying around a punch card with their 33 bits on it. A computer error means two people are issued the same bit pattern. In a defining shot, they hold their punch cards up against the sun and see the holes line up. Maybe an Egyptian tomb opens too!


Your mistake was posting this here, and not on Reddit.


Here's my admittedly naive Sunday afternoon spitball on an elegant solution:

I like the idea of a human UUID/GUID type identifier.

I would also like to think that this is solvable using strictly biological and physical properties, sampled at birth.

Otherwise, time and culture factors would seem to make it difficult to produce a static set of "apples to apples" questions and answers.

I wonder if the right maths applied to existing genetic and forensics big data sets could produce the 33 questions.


> I wonder if the right maths applied to existing genetic and forensics big data sets could produce the 33 questions.

I'd imagine that genetic markers would be the best way to do it (Disclaimer: I'm no biologist and might have made completely wrong assumptions here). They're less likely to change than, say, someone's political or religious beliefs; one could get a nasty hit on their head and forget.

The thing with genetics is that they can change over time. Some genes turn on and off. Attributes like your face and fingerprints change over time. They're not constants.

If you could identify a set of 33 lifetime constants, you'd end up with a life-long UUID. If you expanded beyond 33 bits and included genetic markers which change over time, such as genes which flip on/off, you could end up with a point-in-time (PIT) UUID.

    UUID     = Constant throughout life.
    PIT+UUID = UUID plus markers identifying you at a particular point in time.
A constant would be something like, do you have a Y chromosome? (there is fault in this question: XYY syndrome)

Also, you'd probably need more than 33 bits. 33 will encompass all living humans today in 2013 ADE, but would have to be expanded as the living population grows, and to include the billions of deceased humans.

In the end, a "true" unique identifier, encompassing any human, would be their UUID plus a list of all PIT+UUIDs they generated during their lifetime. Or in English, an entire record of their genetics from start to end:

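    struct LIFETIME_UUID {
        void * uuid;                /* constant throughout life */
        void * pit_uuid_TIMESTAMP1; /* point-in-time snapshots */
        void * pit_uuid_TIMESTAMP2;
        void * pit_uuid_TIMESTAMP3;
        /* ... one pit_uuid per recorded timestamp ... */
    };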
That should eliminate conflicts in edge cases like identical twins or cloning.


DNA is your UUID/GUID



Unless you're one of a set of twins/triplets/etc.


append datetime string of birth. Boom problem solved.


What if the twins are delivered via Caesarean, with the exact same datetime recorded? Including datetime (even as a string) also includes the aforementioned falsehoods about time.


unless you have a twin, or other tuplet siblings.


or a clone:)


Unless you're a twin


I don't think so. How do you come up with time invariant questions? If you used these big conjunctions, how do you uniquely identify every person in just 33 bits?


Those are bad examples because those questions don't split the population in two. Very few people are non-practicing Catholics with exactly three children. If you want to limit the number of questions to just 33 then you have to choose your questions very carefully.


I think you misunderstood. powrtoch proposed having a set of questions in which each individual question is itself very complicated. For example, being a non-practicing Catholic with exactly three children is only one small facet to a single question. By or'ing a bunch of really specific questions together you can come very close to getting exactly 50% of the population to answer yes to a single question.
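
A toy sketch of that OR-ing idea, assuming the groups are disjoint and their sizes are known. The group names and counts below are invented for illustration, and the greedy pass is just one simple way to pick clauses, not necessarily the best one:

```python
def build_disjunction(group_sizes, population):
    """Greedily pick disjoint groups whose combined size approaches half the population.

    group_sizes: dict mapping a (hypothetical) group label to its head count.
    Returns the chosen labels and how many people would answer "yes".
    """
    target = population // 2
    chosen, total = [], 0
    # Try the largest groups first, then fill the gap with smaller ones.
    for label, size in sorted(group_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if total + size <= target:
            chosen.append(label)
            total += size
    return chosen, total

# Made-up disjoint groups in a population of 1000.
groups = {"A": 400, "B": 300, "C": 150, "D": 90, "E": 45, "F": 10}
chosen, total = build_disjunction(groups, population=1000)
print(" OR ".join(chosen), total)
```

The resulting question reads "are you in A or D or F?", and the more fine-grained the groups, the closer the total can be pushed to exactly half.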


That's kinda cheating though isn't it? Like chaining a dozen statements on one line with semicolons and going, "look I can write that program in one line!"


That depends what constraints you choose to define on the problem - if they need to be knowable, memorable,... then yes probably. Anyway, sz4kerto has a good comment about using a Karnaugh map which might help see how this works - it's a lot cleverer than just chaining things randomly - but does break a lot of hypothetical arbitrary restrictions.


The example was one question with a bunch of OR operators that when combined, would equal exactly 50%.


> To contribute to the project, open up a pull request and add your question to the list below. All questions are open to debate and discussion.

This is a completely wrong way to approach the problem. Because the questions should all divide the population into two parts the questions should be 'matched' to each other. This approach is a bit like doing a PCA by figuring out one component, then the other, then the rest...

One way to solve this problem is to have a lot of yes/no questions (like a big Karnaugh-table), then everybody would have a long bitstring as his unique ID. Now you need to compress that bitstring -- like the minimization of the Karnaugh-table.

http://en.wikipedia.org/wiki/Karnaugh_map

-- you need to generalize this for N number of questions (which can be done), then you'd have 33 complex questions like 'is it true that (you live in NA AND you are male) OR (you live in Canada AND you are white AND ...)' and so on.


Assuming that the person doesn't necessarily need to know their answer (which is important for babies anyways) the answer is trivial. The first question would be "Given that we ordered all humans in order of the time of their birth, would the 1st bit of your position in the ordering be 1?", continue the other 32 questions with the remaining 32 bits.
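
A minimal sketch of that encoding, assuming each person's position in the birth ordering is already known as a zero-based integer (the function names are made up here):

```python
def answers_for_rank(rank, questions=33):
    """Turn a zero-based birth-order position into yes/no answers, most significant bit first."""
    if not 0 <= rank < 2**questions:
        raise ValueError("rank does not fit in the available questions")
    return [bool((rank >> (questions - 1 - i)) & 1) for i in range(questions)]

def rank_from_answers(answers):
    """Rebuild the position from the list of yes/no answers."""
    rank = 0
    for answer in answers:
        rank = (rank << 1) | int(answer)
    return rank

example = answers_for_rank(6_999_999_999)
print(len(example), rank_from_answers(example))
```

Every position below 2^33 round-trips, which is the sense in which the answer is trivial once an oracle supplies the ordering.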


lol too much


Very interesting thought experiment. A few random thoughts:

Reminds me of Panopticlick by the EFF: https://panopticlick.eff.org/

Everyone's ID would change as time passed (if they move, if they age, if they get a sex change, etc).

The best questions for this are inherently "irrelevant", since "relevant" questions tend to be statistically linked. So, a question like "Was the second letter of your first girlfriend's middle name between A and M?" is better than "Were you younger than 20 when you had your first girlfriend?", since we can likely guess the latter based on the other statistics.

It's very unlikely every ID will be unique if only asking 33 yes/no questions. I mean, look at two twins living together -- very few questions will be able to differentiate between them.

I think it's possible to do based on a random snapshot in time, however less possible if it's meant to last a lifetime.

I also think the questions exist, but not in a manner that we'd be able to come up with on our own. As in, I believe that a program that knew every detail about every human could create 33 yes/no questions that differentiated people, however I don't believe we could do it ourselves.

I also wonder how many questions would be required to ask non-yes/no questions and get a completely unique ID for everyone. For example, questions like "weight? languages spoken? birth place?".


Only ~1/3 of population would even be able to answer a question like "Was the second letter of your first girlfriend's middle name between A and M?".

26% of the global population are pre-teens; and approx half of the remainder aren't dating women.

Not even going into the fact that the social concept of 'girlfriend' isn't universal, there are multimillion cultures where the relationship stages of going from strangers to family are split differently, and none of those stages can be equated with western 'boyfriend/girlfriend' relationship; e.g. you may like someone but not be dating (so no girlfriend) and then move directly to engagement or marriage; or many other possibilities.


If you allow questions with n possible answers, then you need at least ceil(log_n(7e9)) questions, where log_n means logarithm base n.

I don't know if there is a simple formula for the case where different questions allow different number of answers.
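
A quick sketch of the counting argument, assuming the 7e9 population figure from above. For mixed answer counts, one necessary condition is simply that the product of the per-question answer counts is at least the population; the helper names are invented for illustration:

```python
import math

POPULATION = 7_000_000_000  # assumption: same rough figure as above

def questions_needed(n_answers, population=POPULATION):
    """Smallest k with n_answers**k >= population, using integers to avoid float rounding."""
    if n_answers < 2:
        raise ValueError("each question needs at least two possible answers")
    count, combinations = 0, 1
    while combinations < population:
        combinations *= n_answers
        count += 1
    return count

def enough_combinations(answer_counts, population=POPULATION):
    """Necessary condition when questions allow different numbers of answers."""
    return math.prod(answer_counts) >= population

print(questions_needed(2), questions_needed(10), questions_needed(26))
print(enough_combinations([2] * 33), enough_combinations([2] * 32))
```

The product test says nothing about whether real questions split people evenly; it only rules out sets of questions that cannot possibly produce enough distinct answer combinations.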


But "Was the second letter of your first girlfriend's middle name between A and M?" is a bad question on many levels -- people who aren't attracted to females, people who are but have never had a girlfriend (including children), people who got married to their wife as the start of the relationship, and cultures where "girlfriend" would be hard to define because of that sort of thing.

I can think of almost perfectly 50/50 questions based on odd/evenness of numbers (is your current weight in grams/age in hours/number of hairs odd or even) but those are completely time dependent.


[EDIT] I should have said "significant other", not "girlfriend". Even then, it doesn't work -- but at least it's more inclusive :)


33 questions is sort of the Shannon-Hartley optimal encoding of identifying information about human beings.

That means to come up with them is identical to finding an optimal compression of identifying data.

Necessarily, as the second question already implies, for this question to correctly divide the population in half, you would have to group large amounts of small populations together, resulting in very long questions.

For example, if you'd like to make another geographical question that's independent of the second one, it would have to divide in half every population of the 6 countries you mentioned. The next question would necessarily have to divide those 12 again.

By the way, the first question you ask is already suboptimal when combined with the second question, as those countries together probably do not have a clean 50% male/female split. (if they do, you should really explain that as it's not obvious)


Interesting exercise, which I'd call impossible in the given form. Imagine someone magically came up with 32 statistically independent binary indicators. Now you need to come up with the 33rd question Q such that if you pick any two persons who are similar up to the 32nd bit, that single question must distinguish them. Sounds hard.


Question 33 has to split any category of people formed from the first 32 questions - very awkward, as you say.
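
To make the constraint concrete, here is a small check, assuming each person's first 32 answers are packed into an integer prefix (a made-up representation). A final yes/no question can only work if no prefix is shared by more than two people, and the two people sharing a prefix answer differently:

```python
from collections import Counter

def largest_bucket(prefixes):
    """Size of the biggest group of people sharing the same first 32 answers."""
    return max(Counter(prefixes).values())

def question_33_separates(prefixes, answers_33):
    """True if (prefix, answer to question 33) is unique for every person."""
    ids = list(zip(prefixes, answers_33))
    return len(set(ids)) == len(ids)

# Invented data: two people share prefix 5, one person has prefix 9.
prefixes = [5, 5, 9]
ok = question_33_separates(prefixes, [True, False, True])

# With three people on the same prefix, one bit can never separate them.
crowded = [5, 5, 5, 9]
stuck = question_33_separates(crowded, [True, False, False, True])
print(ok, largest_bucket(crowded), stuck)
```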

Here is a moderately-functional question 33, though:

"Are you further North than any other person who has given the same answers as you for the first 32 questions?"


> "Are you further North than any other person who has given the same answers as you for the first 32 questions?"

What if one of the persons was not on the planet at the time?


Excellent :) Of course there's something to be said about the ability to realistically answer reliably.


Just make question 33 be your assigned number then. 0 or 1


Just use:

- Birthday (~19 bits)

- Rough Location (remaining bits)

And base the questions around those two, for example: were you born on the 1st-15th? Does the city you were born in start with the letters A-K? This part would be an exercise in statistics, I would think.

edit: And one bit for if you were the first to be born of two identical twins =p


Lots of people don't know their birthday, or if they do, it was written down as 1st Jan (or another memorable date), because their parents/carer didn't know when they came to register them, if they ever registered them. There will be bias towards those memorable dates.


> Lots of people don't know...

We can go farther. People in a coma or with certain mental impairments will fall out of the system, if knowledge of any given fact is a hard requirement.

The only way this even makes sense as a thought experiment is if you have some magic biographical oracle that helps you answer the questions.


"Lots of people don't know their birthday" Really? I've never met a single person who didn't. I guess "lots" in the terms of thousands could be plausible, vs "lots" in the terms of "a high percentage".


That would work if people's birthdays were uniformly distributed among the dates. But that's not the case.

(Most people born 100 years ago are now dead.)


As long as you split the distribution uniformly into categories with the same number of people, it doesn't matter that it's not uniform by date.


Yes, except that you cannot expect people to be able to answer this "question" anymore.

There is no doubt that you can uniquely identify the world population with 33 bits, so I guess the problem is to do it with questions people can answer themselves.


What if you weren't born in a city? The USA seems to give "city status" to many places, but not all countries follow this.


Latitude, longitude.


and timestamp, getting pretty close


Triplets?


Just add time of birth. I don't see how you could split correctly with a bit for twins.


What if time is unknown?


This project assumes we can know things that are not really knowable for everyone. It starts with gender and birthplace, both tricky questions in some situations.

So maybe we get to assume we have some oracle that helps us simplify the hard questions.

At that stage, it's easy. Begin with, "Assume we build a list of people sorted by time of birth (with some arbitrary tiebreakers, like proximity of birthplace to Barbados, or darkness of hair color...)."

Question 1: Are you on the top half or bottom half of this list?

Question 2: Are you on the top quarter or bottom quarter of the half?

Question 3: ...


I had come up with something similar:

Question 1: Are you north of an east-west line that perfectly divides the population in half?

(proceed to continue to divide each section in half.)

Of course, determining the position of those lines is damn near impossible. And as soon as someone hops on a plane, the whole thing comes crashing down.


I like it. Maybe "were you north of an east-west line" when this list was made.

I guess it gets tricky when people are born.

Wait a minute, just realized this whole question breaks down if we don't recycle numbers from the dead back to the newly born... and there's no guarantee that all of the dead will sufficiently resemble all of the newborns. I want my money back. :)


It's not enough to find 33 independent questions that evenly split the world's population.

An optimal, though inelegant solution to that goal might look something like this:

"Is the {1..33}th bit of sha1(name : location : date of birth) 1?".

Clearly you'll have tons of collisions with that solution, as you would have with any solution using 33 independent questions.

To uniquely identify people, we'd either need to use more bits, or look very closely at the population and derive very specific questions.
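
A minimal sketch of the hash-based questions, assuming the three fields are joined with " : " exactly as written above and encoded as UTF-8 (both details, and the sample values, are guesses made for illustration):

```python
import hashlib

def sha1_answers(name, location, date_of_birth, questions=33):
    """Answer 'is bit i of sha1(name : location : date of birth) 1?' for the leading bits."""
    record = f"{name} : {location} : {date_of_birth}".encode("utf-8")
    digest = hashlib.sha1(record).digest()  # 20 bytes = 160 bits
    leading = int.from_bytes(digest, "big") >> (160 - questions)
    return [bool((leading >> (questions - 1 - i)) & 1) for i in range(questions)]

# Hypothetical person; the same input always gives the same 33 answers.
answers = sha1_answers("John Doe", "Kentucky, US", "1980-01-01")
print("".join("1" if a else "0" for a in answers))
```

Each bit behaves like an independent coin flip, which is what makes the split even, and also why nothing stops two different people from landing on the same 33 bits.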


> Clearly you'll have tons of collisions with that solution

Why?

If we assume the hash code assignment to one of the 2^33 people is uniformly random from a set of 2^160 codes, the odds of finding a collision are astronomically small (order of 2^-95 or so). Am I missing something?


You are not taking the entire hash, you are only taking the first 33 bits of the hash. Since there are only about 8.5 billion different values for the 33 bits and there are about 7 billion people, the odds are astronomically low that each of those 7 billion people will receive a different one of those 8.5 billion possibilities.

This is the birthday paradox, except that instead of 365 days you have 2^33 possible answer values, and instead of 23 people you have 7 billion people. I leave it as an exercise for the reader to plug these values into one of the formulas to calculate the probability of successfully giving each person a unique 33 bit answer: http://en.wikipedia.org/wiki/Birthday_problem
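
Taking up that exercise, a rough sketch assuming every person gets an independent, uniformly random 33-bit value. The probability is far too small for a float, so the code works with its base-10 logarithm instead:

```python
import math

def log10_prob_all_unique(people, slots):
    """log10 of the chance that `people` uniform random draws from `slots` values are all distinct."""
    if people > slots:
        return float("-inf")
    # ln P = ln(slots!) - ln((slots - people)!) - people * ln(slots)
    ln_p = (math.lgamma(slots + 1) - math.lgamma(slots - people + 1)
            - people * math.log(slots))
    return ln_p / math.log(10)

def expected_distinct_ids(people, slots):
    """Expected number of different values actually used: slots * (1 - (1 - 1/slots)**people)."""
    return -slots * math.expm1(people * math.log1p(-1.0 / slots))

PEOPLE, SLOTS = 7_000_000_000, 2**33  # assumption: 7 billion people
print(log10_prob_all_unique(PEOPLE, SLOTS))
print(expected_distinct_ids(PEOPLE, SLOTS))
print(10 ** log10_prob_all_unique(23, 365))  # classic 23-person case as a sanity check
```

Under these assumptions the chance of everyone being unique is on the order of 10 to the power of minus 1.9 billion, and only about 4.8 billion distinct values would be in use.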


> You are not taking the entire hash, you are only taking the first 33 bits

Right, my bad, didn't pay attention to the problem we are trying to solve :)


It is interesting to see your comment way down the thread, even though it is by far the best.


I thought it would be more plausible and probably more interesting to do this in maybe 40 questions. To do this in 33, as several others have pointed out, would require 33 questions that each almost perfectly bisect the population and are almost perfectly independent of each other.

With 40 or 45, we could relax that a bit and use questions that are actually meaningful. Two people who are within a few bits of each other would actually be similar in ways we care about, unlike two people who are similar because their transliterated last names both appear in the last half of the alphabet.


So you want to create a data set with entropy = 1. Think of this in terms of a hash function: you want to create a hash which only has an address space of 33 bits. Something like H(Alice) = 0x12321 (H is a function which generates 0x12321 to store the data of Alice).

Doesn't this sound like perfect hashing with limited memory? I don't really think that this can be done with such memory constraints. Even now we cannot produce a perfect hash function that uses 1 bit / key. The theoretical best we can do is 1.44 bits / key. And the practical best we have done till now is 2.5 bits per key. [1]

This may just be possible without the memory constraint, that is, you answer N questions which uniquely identify you (where N > 48).

[1] http://en.wikipedia.org/wiki/Perfect_hash_function#Minimal_p...


No one here seems to have mentioned Hunch (http://en.wikipedia.org/wiki/Hunch_(website) ).

Picking discrete questions like this is equivalent to building a decision tree for humanity. This is actually something that could be approached as an engineering problem (and there are mechanisms for optimizing decision trees).

The problem still remains in the face of both the technological capabilities of decision trees, and practical implementations like Hunch.com, that decision trees are reductive and discrete. Reality is neither discrete nor reductive.

It may very well be the case that there is a set of questions that could uniquely identify humans, but the insight that could be drawn from those questions might be essentially pointless.

For example:

* Were you born in the northern hemisphere?

* Were you born on an even numbered year in the Gregorian calendar?

* Is the country of your birth governed through a representative system?


This reminds me of Akinator: http://en.akinator.com

It's a little spammy nowadays, but it's had enough input that it seems pretty amazingly accurate at "guessing" what / who you are thinking of in ~ 25 questions.


I thought of this too, or at least a site similar to it. It has so much input that it's good at even very obscure people/characters.


It's similar, except the questions it asks can be crafted based on previous responses.


I don't think it is possible with exactly 33 questions. It will probably require more than that. Binary numbers have the property that every new bit doubles how many numbers you can represent. For example, if you already have 7 bits and you add an 8th one, then you'll be able to represent 128 numbers with that bit off and 128 numbers with that bit on.

To properly mimic this property with yes/no questions, you will have to come up with questions that divide the whole Earth's population equally AT EVERY NEW QUESTION. Even the most obvious one, "are you (fe)male?" is slightly biased toward men (according to wikipedia). At every question that skews your 50/50, you'll have to add another question beyond 33 to catch up with this.


All questions must be independent. That's too hard. I would be surprised to find even a handful of questions that have no statistical correlation when applied to all people in the world.


> To properly mimic this property with yes/no questions, you will have to come up with questions that divide the whole Earth's population equally AT EVERY NEW QUESTION.

Not only that: The first question must divide the population in 2; the second question must divide each of the 2 subsets produced by the first question in 2; and the third has to perfectly divide the 4 subsets produced by the second... And this independently of the order in which you make the questions.


Question 1: What is the first bit in your unique 33-bit string? Question 2: What is the first bit in your unique 33-bit string? ...


I'm pretty sure you didn't mean what you wrote! But maybe I'm whooshing?


I think first you have to show a question exists that effectively separates identical twins before you spend much time working on broad questions like gender and geography.


Maybe something like "Do you have more older siblings than younger siblings?" would work. One twin has to be born before the other.

That question has some problems, but I feel like it could serve as the basis for a more specific question that would define how an only child or middle child would answer without introducing much bias. It would also have to define how half-siblings are counted and probably some other things.


So you are going to burn one entire question that is entirely worthless for the vast majority of Chinese that have been born recently?


Interesting read: "One Child Policy and Arising of Man-Made Twins". http://paa2013.princeton.edu/papers/130113


It's entirely possible that the twins would not know which one was born first.


It's entirely possible the person could not know the answer to any of the questions. I don't think that really counts.


This is a fun exercise, but as others have pointed out likely impossible in its current form.

We don't have true constraints on space though; why limit to 33 bits? How could we still provide a meaningful UUID to each person?

A UUID based on time and location of birth might be more feasible than any other approach, since neither will change and it's the least likely to be ambiguous. Capturing UTC at the time of cutting or otherwise removing the umbilical cord could be one way of choosing as precise, non-debatable a timestamp as any. Adding lat/long and, say, the first byte of the UTF-8 character of the mother's name (or an aspect of the mother's UUID?) could get you the rest of the way there.

Of course, this falls over in places without access to precise timing and geolocation.


I think the limitation to 33 bits is because this is the smallest power of 2 which is higher than the world's population:

2^32 = 4.29 billion,

2^33 = 8.59 billion,

2^34 = 17.18 billion.
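
The same check as a few lines of Python, in case anyone wants to play with other population figures. The 7,126,462,675 value is just an assumed headcount, and `bits_needed` is a hypothetical helper name.

```python
def bits_needed(population: int) -> int:
    """Number of bits required to give `population` people distinct IDs."""
    if population < 1:
        raise ValueError("population must be positive")
    # IDs run from 0 to population - 1, so size the largest ID.
    return max(1, (population - 1).bit_length())

for n in (32, 33, 34):
    print(n, 2 ** n)

print(bits_needed(7_126_462_675))  # 33
```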


Another pitfall of this thought experiment is that there is no room for the steady stream of humans who die and are born every day: today's hypothetical set of 33 questions will not be valid tomorrow.


Do the answers have to be knowable? Time independent?

For example, "are you below the median age at this exact second?" That is not a knowable answer, and changes by the second, but it does give you an exact 50/50 split.

Repeat N times for each split and we're getting very very close.


These need to be questions that are invariant over a lifetime:

- Were you born in the northern hemisphere or southern?

$2^{33}$ is sufficient for those alive now, but the human population is a dynamic function. Set a bit when the person dies?


The Northern Hemisphere has roughly 90% of the population[1]. That is far from the 50% we should be looking for.

[1]: http://en.wikipedia.org/wiki/Northern_Hemisphere
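
To get a feel for how little a lopsided question tells you, here is a small sketch using Shannon entropy as the measure of information per answer. That choice of measure is my assumption; the thread itself only talks about 50/50 splits.

```python
import math

def split_information(p: float) -> float:
    """Shannon entropy, in bits, of a yes/no question answered 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(round(split_information(0.5), 3))  # 1.0
print(round(split_information(0.9), 3))  # 0.469
```

By that measure a 90/10 question is worth less than half a bit on average, so it would take more than two of them to replace one perfect split.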


Were you born north or south of the Tropic of Cancer, then? Or replace it with whatever latitude separates humans into equal groups.


This makes other questions based on geography useless (as in the second question). That's not to say the second question already given is better than yours.


Even with questions invariant over a lifetime, edge cases like identical twins (same gender, same age, likely same upbringing, etc) are likely to mess with it.


Added pull requests to extend the address space from 33 bits to 36 bits to accommodate our revered ancestors, and a bit to indicate liveness.

TODO: don't implement zombies or ghosts at this time (YAGNI principle).


On a related note, this has a very interesting significance in the world of privacy and anonymous tracking.

http://33bits.org/about/


This is a really cool concept, but one that is totally impossible. In the actual world, few things are truly independent. Even if you could find 33 binary questions that did not correlate with each other at all, you still run the risk of having multiple people yield the same 33 answers.

Just because two things aren't statistically linked does not mean that they will never overlap.


Wouldn't the best way to do this be to ask questions related to genetic markers? You need 33 yes/no questions that each divide the population in half independently of the others (knowing which half someone falls into on one question tells you nothing about their answers to the rest).

Are there 33 genetic markers, each uncorrelated with the presence of the others?


33 is not the constraint. If we increase the limit to 50 and these 50 questions can fingerprint an individual then that will be really interesting.

Some hard problems: 1. Distinguishing twins. 2. Using characters in names, since some languages, such as Chinese, use non-ASCII names.


You could distinguish twins by first identifying the twins, then by asking which one's name starts earlier in the alphabet, or whatever analogue exists in languages without an alphabet. Encode appropriately.


If anyone is interested in seeing such an application in a fictional setting, I suggest the anime Death Note, if nothing else for its entertainment value. For those who are familiar with the story, the questions L asked in order to narrow down Kira suspects to a limited demographic in a small region of Japan, among billions of candidates, were some good ones. A good article that analyzes the plot from an information theory perspective: http://www.gwern.net/Death%20Note%20Anonymity


Even if the questions were perfect (each question splitting the population into two exact halves, and all questions totally independent of each other), so that the algorithm gave each person a perfectly random number, the birthday paradox [1] tells us that with just sqrt(2^33) ≈ 93k people we would already have roughly a 39% chance of a collision, and 50% at about 109k. For this to work we would need more bits. (Either that, or create questions that are _not_ independent, crafted to make sure each person gets a different number.)

[1] http://en.wikipedia.org/wiki/Birthday_problem
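
For anyone who wants to check the numbers, a sketch using the usual approximation 1 - exp(-n(n-1)/(2N)) for n people and N possible IDs. It assumes IDs are uniform and independent, which is the idealised case and not a claim about real questions. By this approximation the chance at sqrt(2^33) people comes out near 39%, and 50% is reached at roughly 109k.

```python
import math

def collision_probability(people: int, bits: int) -> float:
    """Approximate chance that at least two of `people` share an ID
    when IDs are uniform random `bits`-bit numbers.

    Uses the standard approximation 1 - exp(-n(n-1) / (2N)).
    """
    space = 2 ** bits
    return 1.0 - math.exp(-people * (people - 1) / (2 * space))

def people_for_half(bits: int) -> int:
    """Approximate group size at which a collision becomes 50% likely."""
    return math.ceil(math.sqrt(2 * math.log(2) * 2 ** bits))

print(collision_probability(92_682, 33))  # roughly 0.39
print(people_for_half(33))                # roughly 109 thousand
```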


No: the linked article requires that further questions split the previous groups each in half. The first question must split the group in half, the second question must split those two groups in half, and so on. When we hit the last bit, the final question is splitting groups of 2 people into the last bit, and each person has been assigned a unique number.

In this manner, 33 bits is enough, and the birthday problem is avoided. The page mentions all this.


How many questions would you need to differentiate between identical twins, particularly if they live and work together? Take identical twin sons of a subsistence farmer - they live together, work together on the same things, know the same people, have the same genetic makeup, and whichever was the first twin born may not have been recorded. You could ask their names, but that's not a yes/no question.

Or even twins who are still babies, no work required? Some cultures wouldn't even have named them yet.


Multiple births would be bad enough, and then you have people with dementia who can't remember the answer to most of the questions, and people in a coma who can't answer questions at all. The chance of succeeding with this exercise is approximately zero.


"As an example, having the questions "Are you male?" and "Are you below the median age?" will not work "

First question is "Are you male?". Made me laugh.


You can have one, but not both.


I should reword that sentence to make the point clearer. As psuter said, you can have one but not both


It was much more clear after having it pointed out to me.


Another problem not mentioned is that the questions should be about the content that does not allow for the answer to change over time. Otherwise the ID is no good.


Could you not have way more than 33 questions created, (maybe a couple hundred) but change what questions are asked based on previous answers? Use the previous answers to determine the strongest next question to ask?

If an early answer states that the candidate lives in the Northern Hemisphere, there's no point in asking them whether they live in a landlocked African country... or whatever much more complicated questions might come up.
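
A toy sketch of that idea, assuming (unrealistically) that you already know who would answer yes to each question: at every step, pick the question that splits the people still remaining most evenly. The names and the `best_question` helper are hypothetical.

```python
def best_question(candidates: list[set[str]], remaining: set[str]) -> int:
    """Index of the question whose 'yes' set splits `remaining` most evenly.

    Each question is modelled as the set of people who would answer yes.
    Ties go to the earliest question in the list.
    """
    def imbalance(yes_set: set[str]) -> int:
        yes = len(yes_set & remaining)
        return abs(2 * yes - len(remaining))

    return min(range(len(candidates)), key=lambda i: imbalance(candidates[i]))

people = {"ann", "bob", "cy", "di"}
questions = [{"ann", "bob", "cy"}, {"ann", "bob"}, {"ann", "cy"}]

first = best_question(questions, people)          # 1: splits 2 vs 2
yes_group = questions[first] & people             # suppose the answer was yes
second = best_question(questions, yes_group)      # 2: separates ann from bob
print(first, second)
```

Note that the best second question depends on the first answer, which is exactly why a fixed list of 33 is so much harder than an adaptive one.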


A boring solution:

Question n (for n = 0 to 32): Consider your position in birth order among all people currently alive. When you divide it by 2^n and discard the remainder, is the result odd?
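
Reading the question as "is bit n of your birth rank set?" (my interpretation, with a zero-based rank assumed), the whole scheme is just reading off a binary number, sketched here with hypothetical helper names:

```python
def answers_from_rank(rank: int, questions: int = 33) -> list[bool]:
    """Answer question n with bit n of the person's rank.

    `rank` is a hypothetical zero-based position in birth order among
    everyone alive; question n asks whether floor(rank / 2**n) is odd.
    """
    if not 0 <= rank < 2 ** questions:
        raise ValueError("rank does not fit in the available questions")
    return [(rank >> n) & 1 == 1 for n in range(questions)]

def rank_from_answers(answers: list[bool]) -> int:
    """Invert answers_from_rank: rebuild the rank from the yes/no answers."""
    return sum(1 << n for n, yes in enumerate(answers) if yes)

answers = answers_from_rank(5, questions=4)
print(answers)                     # [True, False, True, False]
print(rank_from_answers(answers))  # 5
```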


This seems like it would be a lot of work. The intensity and specificity of the questions that would need to be asked would have to be quite unique. It might be possible, but without excluding people of the world because they get lumped into a group, it seems like maybe 33 questions might not be enough to uniquely identify everyone in the world.


A useful question might revolve around language or concepts a person knows, but then this becomes a lot more difficult if the questioner doesn't know which language/concepts a person wouldn't understand (and therefore whether they could even answer the question) - and if they do know, there is a priori knowledge effectively.


The 33-question issue is a tough one for sure.

I'm instead left wondering how many extra questions (35 bits? 36 bits?) it would have to be expanded to in order to produce unique results but without having to be particularly clever in producing the questions. I bet it wouldn't take as many extra as one might be inclined to think.


Good point, especially because at 33 bits, storing everyone's ID would only take about 29 gigabytes.

7,126,462,675 people * 33 bits per person / 8 bits per byte = 29,396,658,534 bytes

World population source: http://www.census.gov/popclock/


Do the questions have to be constant over time? If not it can trivially be solved by asking: Are you born before or after time t? 33 times, where t is the median date of birth of your population. You just need to recompute t 33 times (and know the date of birth of every single person in the world).
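
A sketch of what that looks like, assuming you magically have every birth time and that no two are equal (real data would not guarantee that). Each recursion level corresponds to one "before or after the median of your current group?" question; `assign_ids` is a hypothetical name.

```python
def assign_ids(birth_times: list[float]) -> dict[float, str]:
    """Give each (distinct) birth time a bit string by repeated median splits.

    Each recursion level is one question: "were you born at or after
    the median of your current group?"
    """
    ids: dict[float, str] = {}

    def split(group: list[float], prefix: str) -> None:
        if len(group) == 1:
            ids[group[0]] = prefix
            return
        mid = len(group) // 2             # median position of the sorted group
        split(group[:mid], prefix + "0")  # born before the median
        split(group[mid:], prefix + "1")  # born at or after the median

    if birth_times:
        split(sorted(birth_times), "")
    return ids

ids = assign_ids([1990.5, 1971.2, 2004.9, 1985.0, 2010.1])
print(ids)
```

Each person only answers about log2(N) questions, but the median has to be recomputed for every subgroup along the way, not just once per question.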


If the goal is to have questions that can be answered only with yes or no, I don't think asking for the person's location is a good idea, because there would be as many questions as there are locations.

"Do you live in China, India, The United States, Indonesia, Brazil or Pakistan?" is not a good question.


"Were you born east or west of longitude X" (where X is a longitude that would split the population roughly in half).


Yes, that would work, and likewise north or south of latitude X.


I'm guessing that that question is meant as a yes/no question that would divide the world population in half.


Is the intent of this exercise to build a unique identifier that the individual could reproduce over the course of their life, or does it just uniquely identify them at the time they answered the questions?

I ask because questions like number of siblings, favorite movie, etc. would change over time.


Are you male? This will not split the population 50/50. One group will be slightly larger, and you then only have 32 questions to subdivide this larger group into further categories which is impossible.

This is not possible unless the categories _precisely_ bisect the group each time.


Seems impossible to me. For example, what question would separate two identical twins (identical in DNA and in when they were born)?

And let's say you find such a question: there is no way that question would divide the population in half.


Well if you had conditional questions, you could ask who was born first.


That question doesn't even apply since you can only ask it to two twins, not a single person.


Going along with what someone said earlier, your last question can be "Were you born after someone who answered the previous 32 questions the same as you", which works to identify twins as well as people in general. Although I don't know how this would work for triplets (or any larger number of "twins" than 2); you'd have to make the previous 32 questions split them into groups of at most 2.


Isn't this solvable to a degree by just asking a large number of yes/no questions to a large number of people and then removing all those questions that didn't identify people any further?


I bet you could make a lot of progress by dividing GPS coordinates evenly by population. Simple binary search by primary residence and then leave some space for division within a household.


It's an interesting idea ... But no I don't think it is possible in any way that is not turning the list into a set of questions about their genetics or DNA.


An easy way to construct the questions is to ask for increasingly precise time and location of birth.

There will be corner cases, but then so does asking if someone is male.


The set of questions that would do this is probably a list of genetic questions.

"do you have the mumble allele?" etc.


Less than half the population will be able to "read" the questions due to not speaking English...


Fun version of that ...

http://en.akinator.com/


Do you speak English as your main language? Could that be a good question too? What do you think?


Possible. You need 33 answers, but more than 33 questions.


People have so much time on their hands.


First 33 bits of SHA512(your DNA)

First 33 bits of SHA512(your 3D GPS location)


As someone else mentions, this doesn't work, as the birthday problem[1] applies here.

You need to ensure collisions won't happen; SHA512 does not do that. (A collision in the full 512 bits is vanishingly unlikely; in a 33-bit prefix, far from it.)

[1]: http://en.wikipedia.org/wiki/Birthday_problem
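
A quick sketch to see this for yourself with Python's hashlib. Sequential integers stand in for DNA or GPS data purely as an assumption, and `id33` and `first_collision` are made-up names. With a few hundred thousand inputs a 33-bit collision is expected, though which pair collides first is not something I'd predict.

```python
import hashlib

def id33(data: bytes) -> int:
    """First 33 bits of SHA-512(data), as an integer in [0, 2**33)."""
    digest = hashlib.sha512(data).digest()
    # Take the first 5 bytes (40 bits) and drop the low 7 bits.
    return int.from_bytes(digest[:5], "big") >> 7

def first_collision(limit: int = 500_000):
    """Hash the strings "0", "1", ... and return the first pair of
    inputs whose 33-bit IDs match, or None if there is none below `limit`."""
    seen: dict[int, int] = {}
    for i in range(limit):
        h = id33(str(i).encode())
        if h in seen:
            return seen[h], i
        seen[h] = i
    return None

print(first_collision())
```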


This is really interesting actually. Your entire "uniqueness" can be summed up in 33 yes or no questions, in theory.


I don't think that's quite the right interpretation. Let's say that the human population increased by a factor of 1000. Does that mean you have 10 more bits of uniqueness? You didn't change at all.

In actuality, you already had those 10 more bits of uniqueness, and people just need to ask more questions to find out which one of the many unique people you are.

This is about identification, not uniqueness.



