This is perhaps less impressive than you would think. Watch some mathematical sleight of hand: suppose for the sake of argument that 1% of Facebook has Dark Secret X and you have a DSXdar program which is 99% accurate.
Your program will, out of a sample of 100,000 Facebookers, identify 1,980 people as being DSX. Half of them (99% of 1,000 = 990) have DSX. Half of them (1% of 99,000 = 990) are not actually DSX.
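For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes "99% accurate" means the program is right 99% of the time on both people who have DSX and people who don't; the function name `screen` is made up for illustration.

```python
def screen(population, prevalence, accuracy):
    """Return (true_positives, false_positives) for a classifier that is
    correct `accuracy` of the time on positives and on negatives alike."""
    positives = population * prevalence      # people who really have DSX
    negatives = population - positives       # people who do not
    true_pos = positives * accuracy          # correctly flagged
    false_pos = negatives * (1 - accuracy)   # wrongly flagged
    return true_pos, false_pos


tp, fp = screen(100_000, 0.01, 0.99)
flagged = tp + fp
print(round(tp), round(fp), round(flagged))
print(f"share of flagged people who really have DSX: {tp / flagged:.0%}")
```

The two groups come out the same size only because the error rate (1%) happens to equal the prevalence (1%); change either number and the split moves.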
The math is much more favorable for oracle programs which look for traits which are evenly split in the population. This is one reason why "An AI which reads your work and guesses your gender" works so amazingly well as a tech demo. (That program exists, incidentally, and is better than 80% accurate.) However, even supposing there were sufficient cues in a person's writing to tell whether they were e.g. left-handed or not in the same fashion, a similar Oracle Of Lefties would be almost useless because you'd get too many misclassified righties in with your identified lefties.
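A rough sketch of why the split matters, again in Python. The 80% accuracy figure is the one quoted above; the 10% rate of left-handedness is an assumption used only for illustration, and `precision` is a hypothetical helper, not anything from the gender-guessing program.

```python
def precision(prevalence, accuracy):
    """Fraction of flagged people who really have the trait, for a
    classifier equally accurate on both groups (Bayes' rule)."""
    true_pos = prevalence * accuracy
    false_pos = (1 - prevalence) * (1 - accuracy)
    return true_pos / (true_pos + false_pos)


# Evenly split trait: the flagged group is as good as the classifier.
even_split = precision(0.50, 0.80)

# Assumed 10% prevalence (illustrative figure for left-handedness).
lefties = precision(0.10, 0.80)

print(f"50/50 trait: {even_split:.0%} of flagged are correct")
print(f"10% trait:   {lefties:.0%} of flagged are correct")
```

Under those assumptions most of the "identified lefties" would in fact be righties, even though the classifier itself is no worse than the gender one.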
On a related note, this is why a) you are not particularly recommended to get an HIV test if you're not at risk for AIDS and b) the first thing that will happen after an HIV+ result is scheduling another test.
On a totally different note, many of the worries of privacy experts with regard to Gaydar and similar programs don't require the AI to be functional to come to pass. "My sophisticated computer analysis says you're 83% likely to be a child molester." If you're already under suspicion, folks will not ordinarily say "Wait a second, let's see the study showing that this, in fact, actually works." They'll simply take the "expert"'s word at face value.
(See, for example, computer models of global warming. The big black box will spit out any answer you want from it -- want the answer that global warming is no problem at all? Turn off the "Assume that water vapor evaporation causes a vicious warming cycle" and, bam, you can crank your smokestack slider to the max and not have any problems. Yet folks don't routinely consider that prior to saying that "Global warming is a fact -- the UN reported that the scientists proved it.")
To be clear, there's no checkbox associated with water vapor causing a runaway warming cycle. The IPCC radiative forcing metric pegs the contribution at less than 5% of the total positive warming, and the cloud albedo effect, spurred on by the increased humidity, is more strongly negative. The largest anthropogenic contributors are still greenhouse gases, directly.
The IPCC features 4 feedback scenarios, and 6 original scenarios which are 'equally-likely', and the discussion is always around the worst case of them.
Seems odd that work done in 2007 is reported in the popular press in 2009 and yet the results are not available because the authors "are hoping to have it published in a journal." Google turned up nothing but more secondary references that don't mention a publication (http://tinyurl.com/nqy5zm).
If I am mistaken and the paper is available online please post the link (thank you).
Picking 1,980 people from 100,000 and getting 990 out of the 1,000 who have DSX is in fact quite impressive: 99% recall (in information retrieval) or sensitivity (in medicine). In information retrieval, you rarely get that sort of number out of unstructured data.
If you look at the predicted positives for rare conditions (as in your DSX example), you may find that 50% are false positives. It is still considered useful for many purposes. In medicine, for example, it is used as a first-round filter to screen people for another, more expensive/cumbersome test.
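To illustrate the first-round-filter idea, here is a small Python sketch that applies Bayes' rule twice. It assumes the follow-up test is statistically independent of the first and, purely for simplicity, just as accurate (99% sensitivity and specificity); real confirmatory tests will differ, and `update` is a hypothetical helper name.

```python
def update(prior, sensitivity, specificity):
    """Probability of having the condition after one positive result."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


base_rate = 0.01                                   # 1% of the population
after_first = update(base_rate, 0.99, 0.99)        # positive on the screen
after_second = update(after_first, 0.99, 0.99)     # positive on the retest

print(f"after one positive:  {after_first:.0%}")
print(f"after two positives: {after_second:.0%}")
```

The screen raises the odds from 1 in 100 to about even, and a second independent positive raises them to about 99 in 100, which is why a 50% false positive rate in the first round can still be useful.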
They did this with a software program that looked at the gender and sexuality of a person’s friends and, using statistical analysis, made a prediction.
Gay people have gay friends? Brilliant! Or did they mean the gender and sex of a person's friends? There's a big fat whopping difference here. One is much more impressive than the other.
Gender and sex are the same 99+% of the time. The rest of the time, it's impossible to tell from Facebook alone. It's not like Facebook has an "I'm a transsexual" checkbox on your profile.
Identifying a closeted gay person from the proportion of openly gay friends they have is indeed an obvious approach. The surprise is how well it works.
>they used their private knowledge of 10 people in the network who were gay but did not declare it on their Facebook page as a simple check. They found all 10 people were predicted to be gay by the program.
Does the author care to give a false positive error rate? How many were predicted gay and were not? Without that statistic this argument is bogus.
The two students had no way of checking all of their predictions, but based on their own knowledge outside the Facebook world, their computer program appeared quite accurate for men, they said.
So basically they didn't prove anything, they have anecdotal evidence.
And a healthy dose of selective memory and bias, let's not forget.
It's depressing that this is news. Also news: my friend thinks people who drink Coors are gay and may not realize it yet. Why not, right? It's just as testable as this nonsense story is.
Hmm. No paper to be found on the web; no actual data in the article (apart from the totally bogus 'we checked 10 gay people and they were all marked as gay' analysis); no false negative or false positive data.
Absolutely no way to evaluate whether this has any value whatsoever.
I am so sick to death about these constant "privacy concerns".
If you have some deep dark secret that you don't want the world to know, then don't go around posting it on a system that bills itself as a way of providing ubiquitous access to everything posted to it!
These people who get concerned about privacy are the digital equivalent of somebody posting a billboard with their contact information, then claiming a violation of their privacy when people use that to contact them!
You posted it online. On a public forum. You cannot expect even a modicum of privacy when you do that. If you have a problem with people reading what you write, don't write it!
These are things that people aren't posting, though. This program doesn't look at a field on people's profiles which states their sexuality. It guesses from the information which they make available. You may well conceal the information you want to keep hidden, but then the information you post as public lets people make inferences.
They're ignorant and stupid but they can be educated. They can even be protected from themselves to some extent. I think it's worthwhile to keep pointing out privacy concerns because one day, maybe more people will care.
This would have been more impressive if the authors had used other profile data to infer sexual preference.
As an example, consider those Facebook profiles that are relatively complete yet leave the "interested in" field blank. People not inclined to advertise their homosexuality might feel compelled to be honest and leave the field blank as opposed to outing themselves or posting a lie. Thus, one might infer that a person with a conspicuously absent "interested in" preference is actually gay. (Of course, there's also the chance that the profile owner simply feels that the idea of the "interested in" field seems a bit juvenile, so this particular example might be misleading.)
I can attest that it is very easy to identify who is gay on Facebook, even if their profile does not say so. If you have a lot of gay friends, then Facebook tells you how many mutual friends you have with new people. If you know the mutual friends are gay, then it is likely that the unknown person is also gay. Very easy, so I'm not surprised.