Lead author here. Since my serious thinking on this topic started when I responded to this Ask HN post[1] Π years ago[2], it's nice to see this posted here, to come full circle in a sense. Happy to answer any questions.
I'm impressed by the skill that goes into this but it doesn't seem like an even-handed technology -- it empowers governments, major corporations, and other large organizations more than it does private individuals.
As a specific example, people writing political blogs in China could be seriously harmed by this technique even at the levels that it's at now.
I applaud you for including the link to "manually changing your writing style will defeat these attacks" but that's a link to an academic paper. Could you please also write some good, layperson-oriented docs on "how to beat this"? For that matter, I'll do the writing grunt work if you'll provide the expertise. If you're interested, use the GMail address in my profile.
That's a good question. First, I believe that intelligence agencies are already well aware of the potential of technology like this, and at least some, like the NSA, could very well be ahead of public research. Second, research such as ours is intended to demonstrate a proof of concept, and it takes a lot of work to turn it into a reliable tool — for example, we restrict ourselves to English text. For those two reasons, I think our work does little to directly help governments and other oppressive entities. On the other hand, publicly available research is effective (we hope) in raising awareness of the threat, so on balance it does more good than harm to people writing political blogs.
As for practical tips to defeat stylometry and such, organizations like the EFF specialize in doing that, so I will leave that to them. Comparative advantage, etc. If you would like to help, you are more than welcome.
"could very well be ahead of public research" -- does that mean you've had meetings with groups of 3 federal employees, one of whom does nothing but ensure the other 2 don't say too much? You know, like the feds who visited IBM to make DES more resistant to differential cryptanalysis (http://en.wikipedia.org/wiki/Data_Encryption_Standard#NSA.27...)?
I'd also be interested in what anonymizing techniques come from this, but the way to specifically beat the Chinese government here is to carefully guard the border between your online presence and your offline government-issued identity. At minimum this means anyone in China should be encrypting everything before it leaves their machine and goes to the network through a connection with their name on it, and anyone in the business of writing anti-China political blogs needs to treat everything they do publicly as a threat to their identity. You can't carelessly leak your identity today and try to hide it tomorrow: either you don't leak your identity at all, or whatever you're trying to keep anonymous has to carry so little identifying information that it's probably worthless, because it reads like everything else.
I'd be interested in seeing more tools in this vein. I already know that if I want to camouflage my writing, I have to cut way down on the dashes, but it'd be neat to upload a few large documents and see a nice list of the top ten traits that could be used to identify me.
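A minimal sketch of what that report might look like (the feature list and file names here are invented for illustration): compare your function-word and dash frequencies against a background corpus and rank the biggest deviations.

    import re
    from collections import Counter

    # Hypothetical feature list -- function words and dashes are the
    # kind of low-level markers stylometry leans on.
    FEATURES = ["since", "because", "however", "thus", "upon",
                "whilst", "moreover", "therefore", "--", "-"]

    def freqs(text):
        tokens = re.findall(r"[a-z']+|--|-", text.lower())
        n = max(len(tokens), 1)
        counts = Counter(tokens)
        return {f: counts[f] / n for f in FEATURES}

    def telltales(mine, background, top=10):
        a, b = freqs(mine), freqs(background)
        # Rank features by how far your usage strays from the baseline.
        return sorted(FEATURES, key=lambda f: -abs(a[f] - b[f]))[:top]

    # Illustrative file names; substitute your own documents.
    print(telltales(open("my_posts.txt").read(),
                    open("background.txt").read()))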
The technology must be assumed to exist, as there is plenty of commercial incentive to develop and use it. A threat model that fails to weigh future advances in technology against the expected period of protection is unacceptable.
Making this technology available and easily accessible for widespread public use encourages development of counter-authorship-detection techniques. If tools are available that allow authors to easily assume the identity of other people (through morphing of writing styles), this technology rapidly loses value.
I attended 28c3 and from memory there was a tool called JStylo for discovering authorship and Anonymouth for defeating authorship recognition. Check the Chaos Computer Club site.
Imagine I want to post anonymously about some sensitive subject. Whenever I do, I write the article normally, then Google-translate it from English to French and back to English, and then clean up obvious errors in the retranslation.
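In code, the round trip is a one-liner around whatever translation service is available; translate(text, src, dst) below is just a placeholder, not a real library call.

    def translate(text, src, dst):
        # Placeholder: wire this to your translation API of choice.
        raise NotImplementedError

    def launder(text, pivot="fr"):
        # English -> pivot -> English, hoping the translator's
        # phrasing overwrites the author's.
        return translate(translate(text, "en", pivot), pivot, "en")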
Practically that would work, at least for the foreseeable future.
Theoretically it's not so easy. If you know how Google translates from English to French (and vice versa), you can at least partially reverse the process.
Or perhaps authorship detection can be performed on the output from Google Translate? The authorship entropy contained within the original English text is likely to be carried through (at least partially) to the post-translation output.
Is your software and the dataset it uses publicly available?
If so, perhaps it would be possible to use it to engineer countermeasures. In the example you used ("since" vs "because"), it would be fairly simple to alter the ratio.
Of course I don't know what more complex indicators you may be using, but I'm having a hard time imagining what would be easy to measure, but difficult to alter.
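Taking the "since" vs "because" example, measuring and nudging that ratio really is trivial -- a toy sketch (temporal uses of "since" would still need a human pass):

    import re

    def since_because_ratio(text):
        since = len(re.findall(r"\bsince\b", text, re.I))
        because = len(re.findall(r"\bbecause\b", text, re.I))
        return since / max(because, 1)

    def nudge(text):
        # Crude: rewrite every "since" as "because".
        return re.sub(r"\bsince\b", "because", text, flags=re.I)

    t = "I left since it was late, since the talk had ended."
    print(since_because_ratio(t), "->", since_because_ratio(nudge(t)))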
Do you expect your system to scale easily to the "entire internet" data set? You mention that you degrade gracefully as the problem size increases; is getting the accuracy up to a usable figure only a question of more signals, processing power, and dirty ML tricks, or do you think you need a fundamental improvement?
The algorithm achieves significantly higher accuracy if it has more text per author. Also, if you're willing to do human analysis on a few dozen candidates after algorithmically shortlisting them, that gives you a further advantage. Finally, there is much room for straightforward algorithmic improvement (e.g., ensembles of classifiers) that we didn't have time to fully investigate. In short, IMO it's just a matter of more data and slightly better ML, not fundamental improvements.
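To make "slightly better ML" concrete, an ensemble over simple character-n-gram features might look like this in scikit-learn -- an illustration of the general idea, not our actual pipeline:

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    ensemble = make_pipeline(
        # Character n-grams: a common stylometric feature set.
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
        VotingClassifier(
            estimators=[("lr", LogisticRegression(max_iter=1000)),
                        ("nb", MultinomialNB()),
                        ("rf", RandomForestClassifier(n_estimators=200))],
            voting="soft",  # average predicted probabilities across models
        ),
    )
    # texts: list of posts; authors: matching list of author labels
    # ensemble.fit(texts, authors)
    # candidates = ensemble.predict_proba([unknown_post])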
Great paper, glad to see the fruit of your efforts.
I guess you did not look at texts with multiple authors, or professionally edited texts? I am curious if a different editor, or publication house style, can be detected.
Reading this, it's perhaps worth mentioning that this is how the Unabomber (Ted Kaczynski), the Luddite who ran a mail bombing campaign spanning nearly 20 years, was caught.
Before the publication of the manifesto, Theodore Kaczynski's brother, David Kaczynski, was encouraged by his wife Linda to follow up on suspicions that Ted was the Unabomber. David was at first dismissive, but progressively began to take the possibility more seriously after reading the manifesto a week after it was published in September 1995. He browsed through old family papers and found letters dating back to the 1970s, written by Ted and sent to newspapers protesting the abuses of technology, which contained phrasing similar to what was found in the Unabomber Manifesto.
While impressive, I don't think these results are actually that bad for privacy. 80% precision, for example, is useless when you're matching against tens of millions. It's much the same as the fallacy of the medical test for a disease that occurs in 1 out of 1000 people: even a test with 99% accuracy still means roughly 90% of positive results are false positives.
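Spelling out that arithmetic for a population of 100,000:

    population = 100_000
    sick = population // 1000                 # 100 actually have it
    true_pos = 0.99 * sick                    # 99 correctly flagged
    false_pos = 0.01 * (population - sick)    # 999 healthy people flagged
    print(true_pos / (true_pos + false_pos))  # ~0.09: a positive is ~91% likely wrong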
It reminds me of the claims of being able to identify, for example, the gender of an author with ~65% accuracy -- which is actually completely unimpressive, as it's hardly better than guessing, and certainly not something you could rely on for any serious purpose.
The author mentions that topic is one way to help correlate beyond the results of the algorithm. But if I wrote "anonymous" posts in my area of expertise, you certainly would not need stylistic analysis to guess what my identity might be! There has never been privacy in this regard, I don't think.
Where privacy is needed most, I think, is exactly where this deanonymizing tool still isn't sufficient: talking about unrelated topics. A person should be free to express themselves under multiple names for different purposes, and there is no reason why an employer needs to know about a programmer's side hobby as a fiction writer if s/he doesn't want them to.
Finally, I do wonder how well these results correlate to the case where someone is intentionally operating under a different name. Matching one post by tech blogger A against blogger A is easy, because tech blogger A is making no attempt to write any differently or in any different context. However, what if tech-writer A ghost-wrote YA fiction on the side? Could you use these techniques to detect that the fiction was written by that blogger? It can't be ruled out without trying, but generalizing these results to that seems questionable.
The difficulty of doing it cross-context is actually slightly more surprising to me than the possibility. I would've guessed that, once a suitable data set were found (a main impediment to previous studies), accuracy would be quite good, along the lines of how easy it is to guess browser fingerprints from a few dozen telltale markers. But it appears that only about 10% of authors can be guessed to a precision of 80%, which is still pretty decent odds of not being identified automatically, at least for now, even without actively trying to cover up (though the linked post is right that with a specific target, intelligently adding some ad-hoc additional features can probably help).
One thing that'd be interesting to me is whether there are certain characteristics that make it particularly easy to identify people cross-context, like a top-10-telltale-markers sort of thing. Are a disproportionate number of the 10% who can be identified with high precision using a handful of unusual grammatical or lexical features, or is it more of a diffuse sort of thing?
That's a very interesting paper (and very accessible to anyone with a stats/data mining background). I went back and read Jason Baldridge's intro, which is excellent.
It seems you didn't attempt to fingerprint misspellings, among the variables on PDF p. 5. Also, I'm curious why you needed to round the dataset up to exactly 100k with the extra 5.7k.
Much better than that: if the research/software that identifies authors is published, and some reasonable approximation of the public training set that deanonymizers would use is available, then anyone can check their writing against the tool before publishing it.
If your writing is too identifying, just perturb the text until the tool fails to identify the author. Or even better: perturb the writing until the deanonymizer fingers someone else, in a usefully confounding way.
The deanonymizer's feature-extraction/analysis could itself help drive the perturbation routines. "Make my word choice more like Paul Graham", you could say. And even if there are limits to its automatic substitutions, it could offer coaching: "To make your writing more Graham-like, decrease your average sentence length and use fewer interjections."
Edit, resubmit, repeat until the right author is fingered.
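That loop is easy to automate once the deanonymizer is available as a black box; attribute() and rewrite() below are placeholders for the real tool and for whatever perturbation step you use:

    def anonymize(text, attribute, rewrite, me, max_rounds=50):
        # attribute(text) -> author name; rewrite(text) -> perturbed text.
        # Both are stand-ins, not real APIs.
        for _ in range(max_rounds):
            if attribute(text) != me:
                return text           # the tool no longer fingers you
            text = rewrite(text)      # e.g. a synonym swap or reordering
        raise RuntimeError("still identified; edit by hand and retry")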
Business idea: website that offers this tuning to help un-deanonymize or faux-deanonymize writing.
Evil business idea: this website remembers everything submitted, to allow the super-deep-pocketed to peek in and de-un-deanonymize (or re-deanonymize?) blocks of text.
It's not a privacy implication, it's an anonymity one. Different things. If you want privacy, use PGP or check the appropriate setting in Facebook to make sure only the intended recipient(s) see your message. If you want anonymity for publicly broadcast messages, that's a lot harder, since one can draw inferences from content as well as style -- and from collections of lies just as easily as from collections of truths.
I would imagine that a program able to detect you by writing style in the way explained here could also anonymize your writing, with a little rewriting. Or even be used to frame other authors.
I know that I semi-consciously engage in a few spelling anachronisms that probably serve to isolate me. Actually, since I recognized both them and their likely effect, I've become somewhat more conscious in applying them -- or in checking for them while proofreading and deciding whether to leave them in.
"Developing fully automated methods to hide traces of one’s writing style remains a challenge". How would the following 3 methods fare?
Method 1: Run the text through a Markov chain constructed from a mixture of, say, 0.5 your text, 0.25 Shakespeare, and 0.25 Alice in Wonderland. Do something like sampling every third word from your text, with the other two coming from the chain. Then run that text through WordNet for synonym-based replacement (a sketch of the WordNet step appears after this list).
Method 2: Do a translation to a nearby language and back again using some translation API.
Method 3: Replace less common words with hypernyms, and more common words with synonyms (or possibly with "not" + antonym).
Might also want a few heuristics to swap punctuation -- parentheses, ellipses, dashes, colons, brackets -- with each other, and to randomize spacing around punctuation.
Optionally, run the outputs through Mechanical Turk to iron out the result, leave it as-is, or clean it up yourself.
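A rough cut of the WordNet steps from Methods 1 and 3, using NLTK (run nltk.download('wordnet') once first); the tokenization and the common-word test are deliberately crude:

    import random
    from nltk.corpus import wordnet as wn

    def synonym(word):
        lemmas = {l.name().replace("_", " ")
                  for s in wn.synsets(word) for l in s.lemmas()}
        lemmas.discard(word)
        return random.choice(sorted(lemmas)) if lemmas else word

    def hypernym(word):
        synsets = wn.synsets(word)
        if synsets and synsets[0].hypernyms():
            return synsets[0].hypernyms()[0].lemmas()[0].name().replace("_", " ")
        return word

    def rewrite(text, common):
        # Method 3: rare words get hypernyms, common ones get synonyms.
        return " ".join(synonym(w) if w in common else hypernym(w)
                        for w in text.split())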
Of course, this is dependent on translation algorithms that are at least somewhat inaccurate.
You may want to choose one nearby language and one further removed, and find the equilibrium phrase. For example, translate English -> German -> Italian -> German -> English, and repeat until you get the same English phrase each time.
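The equilibrium search is then just a fixed-point loop around a placeholder translate(text, src, dst) wrapper (again, stand in your own translation API):

    def equilibrium(text, chain=("en", "de", "it", "de", "en"), max_iters=10):
        for _ in range(max_iters):
            out = text
            for src, dst in zip(chain, chain[1:]):
                out = translate(out, src, dst)  # placeholder wrapper
            if out == text:
                return out      # fixed point: the round trip is stable
            text = out
        return text             # may not converge; keep the last iterate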
[1] http://news.ycombinator.com/item?id=413730
[2] No, really, it's been exactly Π years to the day :-)