I did something like this in a project at a research group I used to work for, and it was very difficult to visualize due to the number of variables we needed to communicate.
What we were trying to do was use word2vec to model changes in the relationships between 700 proteins and one protein in particular, which is related to cancer/tumor growth. We created multiple word2vec models from year windows of medical journals (so the 2003 model was trained on journals from 2000-2003).
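If it helps anyone picture the setup, the per-year training looked roughly like this. This is just a sketch - gensim is my choice for illustration, and the corpus loading, year range, and hyperparameters are all placeholders rather than our exact pipeline:

    # Rough sketch of per-year training; gensim, the year range, and all
    # hyperparameters here are placeholders, not the exact pipeline.
    from gensim.models import Word2Vec

    def tokenized_abstracts(years):
        # Placeholder: yield lists of tokens from journals published in
        # the given years. Swap in real corpus loading/tokenization.
        for year in years:
            yield ["proteinA", "binds", "proteinB"]  # dummy sentence

    models = {}
    for end_year in range(2003, 2017):
        # Trailing four-year window of journals, e.g. 2000-2003 for 2003.
        window_years = range(end_year - 3, end_year + 1)
        sentences = list(tokenized_abstracts(window_years))
        models[end_year] = Word2Vec(sentences, vector_size=100, min_count=1)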
To visualize the models we used a D3 force graph. The nodes were the 700 proteins, and the edges were known protein-protein relationships, each tagged with its discovery year - as in, protein X was discovered to be related to protein Y in 2007. The relationship data was curated by people, independently of the word2vec models. The size of each node was determined by that year's model's similarity score between the protein and the cancer-related protein we were interested in.
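In data terms the graph itself was simple - nodes for proteins, and human-curated edges tagged with discovery years. Roughly this shape (the format here is illustrative, not our actual files):

    # Illustrative shape of the graph data fed to D3: nodes are proteins,
    # edges are human-curated relationships with a discovery year.
    graph = {
        "nodes": [{"id": "proteinA"}, {"id": "proteinB"}],
        "edges": [{"source": "proteinA", "target": "proteinB", "year": 2007}],
    }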
To see the changes between years, we used a year slider; the force graph responded by animating node sizes to match each protein's similarity score in that year's model. In addition, node color represented the change in similarity between consecutive models - greener meant the word2vec similarity score had increased relative to the previous year's model, and red meant it had decreased.
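In spirit, the data prep behind the slider was something like this - per year, each protein's similarity to the target (node size) and the difference from the previous year's score (node color). Names and the file layout are illustrative:

    # Illustrative data prep for the slider: per-year node sizes and
    # year-over-year deltas (the green/red coloring). `models` is the
    # dict of per-year models from the sketch above; TARGET and
    # `proteins` are placeholders.
    import json

    TARGET = "proteinB"              # the cancer-related protein (placeholder)
    proteins = ["proteinA", "binds"]  # stands in for the ~700 protein tokens

    frames, prev = {}, {}
    for year, model in sorted(models.items()):
        scores = {
            p: float(model.wv.similarity(p, TARGET))
            for p in proteins
            if p in model.wv and TARGET in model.wv
        }
        frames[year] = {
            p: {"size": s, "delta": s - prev.get(p, s)}  # delta > 0 -> green
            for p, s in scores.items()
        }
        prev = scores

    with open("frames.json", "w") as f:
        json.dump(frames, f)  # the D3 page indexes this by the slider year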
The visualization is useful, but it is a bit of a mess considering there are 700 nodes. I'll message you the link, and if anyone else wants it I can send it, but I'd rather not post it here since it's hosted on my college's CS department server and it's not equipped to deal with a lot of traffic.
Also, if anyone has an idea of how else we might have done it, I'd be interested to hear it.
Edit: Didn't realize HN doesn't have a PM feature - if there's another way to send it to you and you're interested let me know.
Thanks, that's helpful! And you can email me at <my HN username> at gmail.com.
I'm also curious whether there are ways to quantify the changes over time mathematically. There's the simple sum of squared changes in distance to get a sense of the "kinetic energy" of the system, but I'm wondering if there are more clever analyses, especially something that can distinguish localized changes from global ones.
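To make that concrete, here's the kind of thing I mean (all names illustrative): treat each protein's similarity score as a coordinate, sum the squared year-over-year changes for a total "energy", and look at how concentrated the per-node terms are to separate localized from global change:

    # "Kinetic energy" of the system between consecutive years, plus a
    # crude localized-vs-global measure: the share of the energy carried
    # by the most-changed nodes. All names here are illustrative.
    import numpy as np

    def kinetic_energy(scores_t, scores_prev, top_k=10):
        # scores_*: dicts mapping protein -> similarity score for a year.
        common = scores_t.keys() & scores_prev.keys()
        sq = np.array([(scores_t[p] - scores_prev[p]) ** 2 for p in common])
        energy = float(sq.sum())
        # If a handful of nodes carry most of the energy, the change is
        # localized; if it's spread evenly, it's global.
        top_share = float(np.sort(sq)[-top_k:].sum() / energy) if energy else 0.0
        return energy, top_share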
Edit: so are you running a separate word2vec model for each year's dataset? If so, how do you map between them? The orientation of the embedding word2vec generates is random, and I worry that trying to rotate the mappings to some common axis could obscure some of the data.
Sent! Yeah, we made a model for each year's dataset. In our case, we were only interested in the similarity between our target protein and the others, so we used each model's own internal similarity measure between those, which avoids the problem of varying orientations between models.
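(For anyone wondering why that works: cosine similarity is unchanged by any rotation of the embedding space, which is exactly the arbitrary part of each model's orientation. A quick sanity check:)

    # Within-model cosine similarity is invariant under any orthogonal
    # transform (rotation/reflection) of the embedding space - the exact
    # ambiguity between independently trained word2vec models.
    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=100), rng.normal(size=100)
    Q, _ = np.linalg.qr(rng.normal(size=(100, 100)))  # random orthogonal matrix

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    assert np.isclose(cos(a, b), cos(Q @ a, Q @ b))  # identical either way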
The author mentions in the second paragraph that the data comes from scraping Wikipedia articles' plot descriptions. So the plots might be old, but the descriptions (and language) were all written recently.
Before drawing any strong conclusions, I'd probably want to do at least some validation against original sources, e.g. Project Gutenberg. You're talking about plot descriptions written by a fairly narrow demographic. I'd be hesitant to use that to draw conclusions about the source material.
It might have a large impact on the language, though. 'Empoison' used to be a verb, 'burgle' has largely been replaced with 'rob', and so on. I think this would actually tend to improve the data - 'empoison' and 'poison' ought to be grouped together.