
YES! These are pretty much exactly the methods I used when I developed my project http://prosecraft.io

You can see the emotional story arc -- the shapes of the stories -- for more than 16,000 books.

I train a Word2Vec model on the vocabulary of all those books (almost 1.5 billion words) and then I use a clustering algorithm to score all those words on a sentiment scale of 1 to 10 (where 1 is the most negative and 10 is the most positive). Then I break the books into 50 equal-sized chunks and aggregate the positive and negative scores for each chunk.
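The chunk-and-aggregate step can be sketched in a few lines. This is an illustrative toy, not the site's actual code: the real pipeline derives word scores from a Word2Vec model, while here a tiny hand-made lexicon (1 = most negative, 10 = most positive) stands in for it.

```python
# Hypothetical sketch of the "50 equal-sized chunks" aggregation step.
# SENTIMENT is a stand-in for the Word2Vec-derived sentiment lexicon.
SENTIMENT = {"dark": 2.0, "danger": 1.5, "death": 1.0,
             "good": 8.0, "great": 9.0, "love": 9.5}
NEUTRAL = 5.5  # midpoint of the 1-10 scale

def story_arc(words, n_chunks=50):
    """Split a token list into n_chunks equal-sized chunks and return the
    mean sentiment deviation from neutral for each chunk."""
    size = max(1, len(words) // n_chunks)
    arc = []
    for i in range(n_chunks):
        chunk = words[i * size:(i + 1) * size]
        if not chunk:
            break
        scores = [SENTIMENT[w] - NEUTRAL for w in chunk if w in SENTIMENT]
        arc.append(sum(scores) / len(scores) if scores else 0.0)
    return arc
```

Plotting the resulting list of per-chunk means gives the rising-and-falling line you see on the book pages.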

You can click on any of the chart segments to see a word cloud of all the words that contributed to the positive and negative sentiment of that chunk. You can really see the ups and downs of the stories, as the protagonists struggle to overcome their obstacles, when you look at those charts!

Here are a few of my favorite example books to show people:

The Hobbit

http://prosecraft.io/library/j-r-r-tolkien/the-hobbit/

Harry Potter and the Deathly Hallows

http://prosecraft.io/library/j-k-rowling/harry-potter-and-th...

Animal Farm

http://prosecraft.io/library/george-orwell/animal-farm/

I first encountered this method not through Vonnegut but through the "Hedonometer" project at the University of Vermont Computational Story Lab. They apply this technique to the Twitter firehose to measure the overall emotional arc of the world, as expressed in social media.

https://hedonometer.org/timeseries/en_all/

There's an excellent episode of the podcast Lexicon Valley where they discuss the hedonometer project, with the researchers at UVM who developed it...

http://www.slate.com/articles/podcasts/lexicon_valley/2015/0...



I don't mean to be overly negative, but browsing through some titles, the "emotional story arc" is indistinguishable from a randomly generated line graph. Clicking on the bars reveals how this was obtained... "bad, death, dark, danger" = lower score, "good, great, love" = higher score. Of course such a trivial and simplistic analysis cannot ever produce any meaningful result.

The "most passive page" thing also does not seem to be working. Passive as in passive voice? If yes it's also pretty off the mark.


I respect your skepticism :)

It's easy to imagine exceptions to the idea of a simple numerical word-scoring algorithm...

Of course, a word like "bad" might be used ironically, or in some other slang-sense, with a different literal meaning on the page...

But that's totally fine. In principle, the word2vec algorithm is designed to cope with ambiguities like that.

When you analyze billions of words of prose, you can build a model of word-associativity that captures the superposition of all those different word-senses, and the contexts where they tend to appear on the page.

After a big crazy machine-learning process, each word is modeled as a vector in 300-dimensional space, with a vast network of associations and relationships between the other words in the vector-space, based on the way those words are used together in typical English grammar.

When we score the emotional valence of a particular word, we use a "word-vector" technique where those ambiguities are basically already priced into the scoring calculation. Words with a "less ambiguous" sentiment score (joy, paradise, ..., agony, depression) have their lack-of-ambiguity baked into the formula already.

Extreme scores are reserved for words with unambiguous intensity.

But the important thing is: we're not really as concerned about the numerical scores of individual words as we are with the shifting balance of those sentiment scores over the course of a long document.

It's not a perfect way of scoring sentiment of individual words, but it's REALLY reliable for estimating the basic structure of a narrative.
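To make the "ambiguity priced in" idea concrete, here's a hedged sketch (not the author's actual formula) of one common word-vector scoring technique: score a word by comparing its embedding's cosine similarity to small seed sets of clearly positive and clearly negative words. Toy 3-d vectors stand in for real 300-dimensional Word2Vec ones; all the vectors and seed words below are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings: words used in similar contexts get similar vectors.
VECS = {
    "joy":       (0.90, 0.10, 0.00),
    "paradise":  (0.80, 0.20, 0.10),
    "agony":     (-0.90, 0.10, 0.00),
    "wonderful": (0.85, 0.15, 0.05),
    "dreadful":  (-0.80, 0.20, 0.10),
}

POS_SEEDS = ["joy", "paradise"]
NEG_SEEDS = ["agony"]

def valence(word):
    """Mean similarity to positive seeds minus mean similarity to
    negative seeds: positive for happy words, negative for sad ones."""
    pos = sum(cosine(VECS[word], VECS[s]) for s in POS_SEEDS) / len(POS_SEEDS)
    neg = sum(cosine(VECS[word], VECS[s]) for s in NEG_SEEDS) / len(NEG_SEEDS)
    return pos - neg
```

A word whose contexts are genuinely mixed ends up with a vector between the two seed clusters, so its valence lands near zero; only words used consistently one way get extreme scores.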


Wow! I've been a longtime lurker on HN, and I created an account to just tell you that prosecraft.io is beautifully designed! Would you mind sharing what visualization tools or libraries you used to render your graphs?


Awww thank you! I really appreciate it!

I'm not using any visualization libraries. It's all just hand-coded JavaScript... I've been meaning to learn D3 for a long time, but I haven't gotten around to it yet.


Oops, I almost forgot... The one viz component I'm using is the excellent WordCloud2 library by Timothy Guan-tin Chien...

https://timdream.org/portfolio/wordcloud/


Even more impressive! I immediately assumed D3 but wasn't entirely sure. Congrats again on this work corroborating Vonnegut's 'shapes' of stories.


Second this. Love the prose craft site! Beautiful!


Your site is very impressive!

Of the books you've analyzed, it's interesting, though not necessarily surprising, to see that a Palahniuk book has the least "passive voice" usage (1).

1.) http://prosecraft.io/library/chuck-palahniuk/pygmy/


Nice site, it's fun to look around!

I threw a curveball at it: http://prosecraft.io/library/mark-z-danielewski/house-of-lea...

It would be interesting to see if Prosecraft would ever correlate "similar books" with Borges since Danielewski said that was an influence.


Right now the "similar books" thing is based on a "topic-model"...

So books are more likely to be similar if they're roughly in the same genre and discuss similar kinds of topics (dragons, computers, romance, spies, war, shopping, time-travel, magic, hunting, etc).

Someday I hope the "similar books" feature will be a bit more sophisticated, where other kinds of "similarity" will also be relevant, beyond just the topic-model... Other things like: story structure, narrative voice, irony, vocabulary, sense-of-humor, lyricisim, etc...


This is a beautiful site done in such an original way. I just tried it on my favorite Vonnegut book: http://prosecraft.io/library/kurt-vonnegut/mother-night/



