The author mentions in the second paragraph that the data comes from scraping Wikipedia articles' plot descriptions. So the plots might be old, but the descriptions (and language) were all written recently.



Before drawing any strong conclusions, I'd want to do at least some validation against original sources, e.g. Project Gutenberg. These plot descriptions were written by a fairly narrow demographic, so I'd be hesitant to use them to draw conclusions about the source material.


Hm. Wouldn't that have little bearing on the result? Can't really say "he poisons" when that wasn't the plot of the story.


It might have a large impact on the language, though. 'Empoison' used to be a verb, 'burgle' has largely been replaced with 'rob', and so on. I think this would tend to improve the data, though - 'empoison' and 'poison' ought to be grouped.
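
A rough sketch of what that grouping could look like, assuming the data is already reduced to (pronoun, verb) pairs. The `VERB_GROUPS` table and the function names are made up for illustration and are not from the article:

```python
from collections import Counter

# Hypothetical mapping from archaic or variant verb forms to one canonical verb.
# A real analysis would need a much larger table or a proper lemmatizer.
VERB_GROUPS = {
    "empoison": "poison",
    "empoisons": "poison",
    "poisons": "poison",
}

def canonical(verb):
    """Return the canonical form of a verb, or the lowercased verb if unmapped."""
    v = verb.lower()
    return VERB_GROUPS.get(v, v)

def count_pairs(pairs):
    """Count (pronoun, canonical verb) pairs so variant spellings share a bucket."""
    return Counter((pronoun.lower(), canonical(verb)) for pronoun, verb in pairs)

counts = count_pairs([
    ("she", "empoisons"),
    ("she", "poisons"),
    ("he", "poisons"),
])
```

With this, "she empoisons" and "she poisons" land in the same ("she", "poison") bucket instead of being counted separately.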


But that's the point. "She poisons" is more likely to be the plot of the story than "he poisons."


There could be a bias in the summary writers too, in that they prefer "he murders" and "she poisons" for the same method of killing.



