Lead author here. Since my serious thinking on this topic started when I responded to this Ask HN post[1] Π years ago[2], it's nice to see this posted here, to come full circle in a sense. Happy to answer any questions.
I'm impressed by the skill that goes into this but it doesn't seem like an even-handed technology -- it empowers governments, major corporations, and other large organizations more than it does private individuals.
As a specific example, people writing political blogs in China could be seriously harmed by this technique even at the levels that it's at now.
I applaud you for including the link to "manually changing your writing style will defeat these attacks" but that's a link to an academic paper. Could you please also write some good, layperson-oriented docs on "how to beat this"? For that matter, I'll do the writing grunt work if you'll provide the expertise. If you're interested, use the GMail address in my profile.
That's a good question. First, I believe that intelligence agencies are already well aware of the potential of technology like this, and at least some, like the NSA, could very well be ahead of public research. Second, research such as ours is intended to demonstrate a proof of concept, and it takes a lot of work to turn it into a reliable tool — for example, we restrict ourselves to English text. For those two reasons, I think our work does little to directly help governments and other oppressive entities. On the other hand, publicly available research is effective (we hope) in raising awareness of the threat, so on balance it does more good than harm to people writing political blogs.
As for practical tips to defeat stylometry and such, organizations like the EFF specialize in doing that, so I will leave that to them. Comparative advantage, etc. If you would like to help, you are more than welcome.
"could very well be ahead of public research" -- does that mean you've had meetings with groups of 3 federal employees, one of whom does nothing but ensure the other 2 don't say too much? You know, like the feds who visited IBM to make DES more resistant to differential cryptanalysis (http://en.wikipedia.org/wiki/Data_Encryption_Standard#NSA.27...)?
I'd also be interested in what anonymizing techniques come from this, but the way to specifically beat the Chinese government here is to carefully guard the border between your online presence and your offline government-issued identity. At minimum this means anyone in China should be encrypting everything before it leaves their machine and goes to the network through a connection with their name on it, and anyone in the business of writing anti-China political blogs needs to treat everything they do publicly as a threat to their identity. You can't carelessly leak your identity today and try to hide it tomorrow: either you don't leak your identity at all, or whatever you're trying to keep anonymous has to carry so little identifying information that it's probably worthless, because it reads like everything else.
I'd be interested in seeing more tools in this vein. I already know that if I want to camouflage my writing, I have to cut way down on the dashes, but it'd be neat to upload a few large documents and see a nice list of the top ten traits that could be used to identify me.
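A minimal sketch of what that report might look like (the feature list and file names here are invented for illustration): compare your function-word and dash frequencies against a background corpus and rank the biggest deviations.

    import re
    from collections import Counter

    # Hypothetical feature list -- function words and dashes are the
    # kind of low-level markers stylometry leans on.
    FEATURES = ["since", "because", "however", "thus", "upon",
                "whilst", "moreover", "therefore", "--", "-"]

    def freqs(text):
        tokens = re.findall(r"[a-z']+|--|-", text.lower())
        n = max(len(tokens), 1)
        counts = Counter(tokens)
        return {f: counts[f] / n for f in FEATURES}

    def telltales(mine, background, top=10):
        a, b = freqs(mine), freqs(background)
        # Rank features by how far your usage strays from the baseline.
        return sorted(FEATURES, key=lambda f: -abs(a[f] - b[f]))[:top]

    # Illustrative file names; substitute your own documents.
    print(telltales(open("my_posts.txt").read(),
                    open("background.txt").read()))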
The technology must be assumed to exist, as there is plenty of commercial incentive to develop and use it. A threat model that fails to weigh future advances in technology against the expected period of protection is unacceptable.
Making this technology available and easily accessible for widespread public use encourages development of counter-authorship-detection techniques. If tools are available that allow authors to easily assume the identity of other people (through morphing of writing styles), this technology rapidly loses value.
I attended 28c3 and from memory there was a tool called JStylo for discovering authorship and Anonymouth for defeating authorship recognition. Check the Chaos Computer Club site.
Imagine I want to post anonymously about some sensitive subject. Whenever I do, I write the article normally, then Google-translate it from English to French and back to English, and then clean up obvious errors in the retranslation.
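In code, the round trip is a one-liner around whatever translation service is available; translate(text, src, dst) below is just a placeholder, not a real library call.

    def translate(text, src, dst):
        # Placeholder: wire this to your translation API of choice.
        raise NotImplementedError

    def launder(text, pivot="fr"):
        # English -> pivot -> English, hoping the translator's
        # phrasing overwrites the author's.
        return translate(translate(text, "en", pivot), pivot, "en")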
Practically that would work, at least for the foreseeable future.
Theoretically it's not so easy. If you know how Google translates from English to French (and vice versa), you can at least partially reverse the process.
Or perhaps authorship detection can be performed on the output from Google Translate? The authorship entropy contained within the original English text is likely to be carried through (at least partially) to the post-translation output.
Is your software and the dataset it uses publicly available?
If so, perhaps it would be possible to use it to engineer countermeasures. In the example you used ("since" vs "because"), it would be fairly simple to alter the ratio.
Of course I don't know what more complex indicators you may be using, but I'm having a hard time imagining what would be easy to measure, but difficult to alter.
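Taking the "since" vs "because" example, measuring and nudging that ratio really is trivial -- a toy sketch (temporal uses of "since" would still need a human pass):

    import re

    def since_because_ratio(text):
        since = len(re.findall(r"\bsince\b", text, re.I))
        because = len(re.findall(r"\bbecause\b", text, re.I))
        return since / max(because, 1)

    def nudge(text):
        # Crude: rewrite every "since" as "because".
        return re.sub(r"\bsince\b", "because", text, flags=re.I)

    t = "I left since it was late, since the talk had ended."
    print(since_because_ratio(t), "->", since_because_ratio(nudge(t)))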
Do you expect your system to scale easily to the "entire internet" data set? You mention that you degrade gracefully as the problem size increases; is getting the accuracy up to a usable figure only a question of more signals, processing power, and dirty ML tricks, or do you think you need a fundamental improvement?
The algorithm achieves significantly higher accuracy if it has more text per author. Also, if you're willing to do human analysis on a few dozen candidates after algorithmically shortlisting them, that gives you a further advantage. Finally, there is much room for straightforward algorithmic improvement (e.g., ensembles of classifiers) that we didn't have time to fully investigate. In short, IMO it's just a matter of more data and slightly better ML, not fundamental improvements.
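To make "slightly better ML" concrete, an ensemble over simple character-n-gram features might look like this in scikit-learn -- an illustration of the general idea, not our actual pipeline:

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    ensemble = make_pipeline(
        # Character n-grams: a common stylometric feature set.
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
        VotingClassifier(
            estimators=[("lr", LogisticRegression(max_iter=1000)),
                        ("nb", MultinomialNB()),
                        ("rf", RandomForestClassifier(n_estimators=200))],
            voting="soft",  # average predicted probabilities across models
        ),
    )
    # texts: list of posts; authors: matching list of author labels
    # ensemble.fit(texts, authors)
    # candidates = ensemble.predict_proba([unknown_post])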
Great paper, glad to see the fruit of your efforts.
I guess you did not look at texts with multiple authors, or professionally edited texts? I am curious if a different editor, or publication house style, can be detected.
Reading this, it's perhaps worth mentioning that this is how the Unabomber (Ted Kaczynski), the Luddite who ran a mail bombing campaign spanning nearly 20 years, was caught.
Before the publication of the manifesto, Theodore Kaczynski's brother, David Kaczynski, was encouraged by his wife Linda to follow up on suspicions that Ted was the Unabomber. David was at first dismissive, but progressively began to take the possibility more seriously after reading the manifesto a week after it was published in September 1995. He browsed through old family papers and found letters dating back to the 1970s, written by Ted and sent to newspapers protesting the abuses of technology, which contained phrasing similar to what was found in the Unabomber Manifesto.
While impressive, I don't think these results are actually that bad for privacy. 80% precision, for example, is useless when you're matching against tens of millions. It's much the same as the fallacy of the medical test for a disease that occurs in 1 out of 1000 people: even a test with 99% accuracy still means roughly 90% of positive results are false positives.
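Spelling out that arithmetic for a population of 100,000:

    population = 100_000
    sick = population // 1000                 # 100 actually have it
    true_pos = 0.99 * sick                    # 99 correctly flagged
    false_pos = 0.01 * (population - sick)    # 999 healthy people flagged
    print(true_pos / (true_pos + false_pos))  # ~0.09: a positive is ~91% likely wrong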
It reminds me of the claims of being able to identify, for example, the gender of an author with ~65% accuracy -- which is actually completely unimpressive, as it's hardly better than guessing, and certainly not something you could rely on for any serious purpose.
The author mentions that topic is one way to help correlate beyond the results of the algorithm. But if I wrote "anonymous" posts in my area of expertise, you certainly would not need stylistic analysis to guess what my identity might be! There has never been privacy in this regard, I don't think.
Where privacy is needed most, I think, is exactly where this deanonymizing tool still isn't sufficient: talking about unrelated topics. A person should be free to express themselves under multiple names for different purposes, and there is no reason why an employer needs to know about a programmer's side hobby as a fiction writer if s/he doesn't want them to.
Finally, I do wonder how well these results correlate to the case where someone is intentionally operating under a different name. Matching one post by tech blogger A against blogger A is easy, because tech blogger A is making no attempt to write any differently or in any different context. However, what if tech-writer A ghost-wrote YA fiction on the side? Could you use these techniques to detect that the fiction was written by that blogger? It can't be ruled out without trying, but generalizing these results to that seems questionable.
The difficulty of doing it cross-context is actually slightly more surprising to me than the possibility. I would've guessed that, once a suitable data set were found (a main impediment to previous studies), accuracy would be quite good, along the lines of how easy it is to guess browser fingerprints from a few dozen telltale markers. But it appears that only about 10% of authors can be guessed to a precision of 80%, which is still pretty decent odds of not being identified automatically, at least for now, even without actively trying to cover up (though the linked post is right that with a specific target, intelligently adding some ad-hoc additional features can probably help).
One thing that'd be interesting to me is whether there are certain characteristics that make it particularly easy to identify people cross-context, like a top-10-telltale-markers sort of thing. Are a disproportionate number of the 10% who can be identified with high precision using a handful of unusual grammatical or lexical features, or is it more of a diffuse sort of thing?
That's a very interesting paper (and very accessible to anyone with a stats/data mining background). I went back and read Jason Baldridge's intro, which is excellent.
It seems you didn't attempt to fingerprint misspellings, among the variables on PDF p. 5. Also, I'm curious why you needed to round the dataset up to exactly 100k with the extra 5.7k.
Much better than that: if the research/software that identifies authors is published, and some reasonable approximation of the public training set that deanonymizers would use is available, then anyone can check their writing against the tool before publishing it.
If your writing is too identifying, just perturb the text until the tool fails to identify the author. Or even better: perturb the writing until the deanonymizer fingers someone else, in a usefully confounding way.
The deanonymizer's feature-extraction/analysis could itself help drive the perturbation routines. "Make my word choice more like Paul Graham", you could say. And even if there are limits to its automatic substitutions, it could offer coaching: "To make your writing more Graham-like, decrease your average sentence length and use fewer interjections."
Edit, resubmit, repeat until the right author is fingered.
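That loop is easy to automate once the deanonymizer is available as a black box; attribute() and rewrite() below are placeholders for the real tool and for whatever perturbation step you use:

    def anonymize(text, attribute, rewrite, me, max_rounds=50):
        # attribute(text) -> author name; rewrite(text) -> perturbed text.
        # Both are stand-ins, not real APIs.
        for _ in range(max_rounds):
            if attribute(text) != me:
                return text           # the tool no longer fingers you
            text = rewrite(text)      # e.g. a synonym swap or reordering
        raise RuntimeError("still identified; edit by hand and retry")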
Business idea: website that offers this tuning to help un-deanonymize or faux-deanonymize writing.
Evil business idea: this website remembers everything submitted, to allow the super-deep-pocketed to peek in and de-un-deanonymize (or re-deanonymize?) blocks of text.
It's not a privacy implication, it's an anonymity one. Different things. If you want privacy, use PGP or check the appropriate setting in Facebook to make sure only the intended recipient(s) see your message. If you want anonymity for publicly broadcast messages, that's a lot harder, since one can draw inferences from content as well as style -- and from collections of lies just as easily as from collections of truths.
I would imagine that a program able to detect you by writing style in the way explained here could also anonymize your writing, with a little rewriting. Or even be used to frame other authors.
I know that I semi-consciously engage in a few spelling anachronisms that probably serve to isolate me. Actually, since I recognized both them and their likely effect, I've become somewhat more conscious in applying them -- or in checking for them while proofreading and deciding whether to leave them in.
"Developing fully automated methods to hide traces of one’s writing style remains a challenge". How would the following 3 methods fare?
Method 1: Run the text through a Markov chain constructed from a mixture of, say, 0.5 your text, 0.25 Shakespeare, and 0.25 Alice in Wonderland. Do something like sampling every third word from your text, with the other two coming from the chain. Then run that text through WordNet for synonym-based replacement (a sketch of the WordNet step appears after this list).
Method 2: Do a translation to a nearby language and back again using some translation API.
Method 3: Replace less common words with hypernyms, and more common words with synonyms (or possibly with "not" + antonym).
Might also want a few heuristics to swap punctuation -- parentheses, ellipses, dashes, colons, brackets -- with each other, and to randomize spacing around punctuation.
Optionally, run the outputs through Mechanical Turk to iron out the result, leave it as-is, or clean it up yourself.
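A rough cut of the WordNet steps from Methods 1 and 3, using NLTK (run nltk.download('wordnet') once first); the tokenization and the common-word test are deliberately crude:

    import random
    from nltk.corpus import wordnet as wn

    def synonym(word):
        lemmas = {l.name().replace("_", " ")
                  for s in wn.synsets(word) for l in s.lemmas()}
        lemmas.discard(word)
        return random.choice(sorted(lemmas)) if lemmas else word

    def hypernym(word):
        synsets = wn.synsets(word)
        if synsets and synsets[0].hypernyms():
            return synsets[0].hypernyms()[0].lemmas()[0].name().replace("_", " ")
        return word

    def rewrite(text, common):
        # Method 3: rare words get hypernyms, common ones get synonyms.
        return " ".join(synonym(w) if w in common else hypernym(w)
                        for w in text.split())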
Of course, this is dependent on translation algorithms that are at least somewhat inaccurate.
You may want to choose one nearby language and one further removed, and find the equilibrium phrase. For example, translate English -> German -> Italian -> German -> English, and repeat until you get the same English phrase each time.
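The equilibrium search is then just a fixed-point loop around a placeholder translate(text, src, dst) wrapper (again, stand in your own translation API):

    def equilibrium(text, chain=("en", "de", "it", "de", "en"), max_iters=10):
        for _ in range(max_iters):
            out = text
            for src, dst in zip(chain, chain[1:]):
                out = translate(out, src, dst)  # placeholder wrapper
            if out == text:
                return out      # fixed point: the round trip is stable
            text = out
        return text             # may not converge; keep the last iterate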
[1] http://news.ycombinator.com/item?id=413730
[2] No, really, it's been exactly Π years to the day :-)