The author mentions in the second paragraph that the data comes from scraping Wi...

ghaff · on April 28, 2017

Before drawing any strong conclusions, I'd probably want to do at least some validation against original sources, e.g. Project Gutenberg. You're talking about plot descriptions written mostly by a fairly narrow demographic. I'd be hesitant to use that to draw conclusions about the source material.

SerLava · on April 28, 2017

Hm. Wouldn't that have little bearing on the result? Can't really say "he poisons" when that wasn't the plot of the story.

Bartweiss · on April 28, 2017

It might have a large impact on the language, though. 'Empoison' used to be a verb, 'burgle' has largely been replaced with 'rob', and so on. Still, I think this would tend to improve the data - 'empoison' and 'poison' ought to be grouped.

lmkg · on April 28, 2017

But that's the point. "She poisons" is more likely to be the plot of the story than "he poisons."

davrosthedalek · on April 28, 2017

There could be a bias in the summary writers too, in that they prefer "he murders" and "she poisons" for the same method of killing.