I did something like this in a project at a research group I used to work for, and it was very difficult to visualize due to the number of variables we needed to communicate.
What we were trying to do was use word2vec to model changes in the relationships between 700 proteins and one protein in particular, which is related to cancer/tumor growth. We created multiple word2vec models from year windows of medical journals (so the 2003 model was trained on journals from 2000-2003).
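If it helps anyone picture the setup, the per-year training looked roughly like this. This is just a sketch - gensim is my choice for illustration, and the corpus loading, year range, and hyperparameters are all placeholders rather than our exact pipeline:

    # Rough sketch of per-year training; gensim, the year range, and all
    # hyperparameters here are placeholders, not the exact pipeline.
    from gensim.models import Word2Vec

    def tokenized_abstracts(years):
        # Placeholder: yield lists of tokens from journals published in
        # the given years. Swap in real corpus loading/tokenization.
        for year in years:
            yield ["proteinA", "binds", "proteinB"]  # dummy sentence

    models = {}
    for end_year in range(2003, 2017):
        # Trailing four-year window of journals, e.g. 2000-2003 for 2003.
        window_years = range(end_year - 3, end_year + 1)
        sentences = list(tokenized_abstracts(window_years))
        models[end_year] = Word2Vec(sentences, vector_size=100, min_count=1)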
To visualize the models we used a D3 force graph. The nodes were the 700 proteins, and the edges were known protein-protein relationships, each tagged with its discovery year - as in, protein X was discovered to be related to protein Y in 2007. The relationship data was curated by people, independently of the word2vec models. The size of each node was determined by that year's model's similarity score between the protein and the cancer-related protein we were interested in.
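In data terms the graph itself was simple - nodes for proteins, and human-curated edges tagged with discovery years. Roughly this shape (the format here is illustrative, not our actual files):

    # Illustrative shape of the graph data fed to D3: nodes are proteins,
    # edges are human-curated relationships with a discovery year.
    graph = {
        "nodes": [{"id": "proteinA"}, {"id": "proteinB"}],
        "edges": [{"source": "proteinA", "target": "proteinB", "year": 2007}],
    }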
To see the changes between years, we used a year slider; the force graph responded by animating node sizes to match each protein's similarity score in that year's model. In addition, node color represented the change in similarity between consecutive models - greener meant the word2vec similarity score had increased relative to the previous year's model, and red meant it had decreased.
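In spirit, the data prep behind the slider was something like this - per year, each protein's similarity to the target (node size) and the difference from the previous year's score (node color). Names and the file layout are illustrative:

    # Illustrative data prep for the slider: per-year node sizes and
    # year-over-year deltas (the green/red coloring). `models` is the
    # dict of per-year models from the sketch above; TARGET and
    # `proteins` are placeholders.
    import json

    TARGET = "proteinB"              # the cancer-related protein (placeholder)
    proteins = ["proteinA", "binds"]  # stands in for the ~700 protein tokens

    frames, prev = {}, {}
    for year, model in sorted(models.items()):
        scores = {
            p: float(model.wv.similarity(p, TARGET))
            for p in proteins
            if p in model.wv and TARGET in model.wv
        }
        frames[year] = {
            p: {"size": s, "delta": s - prev.get(p, s)}  # delta > 0 -> green
            for p, s in scores.items()
        }
        prev = scores

    with open("frames.json", "w") as f:
        json.dump(frames, f)  # the D3 page indexes this by the slider year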
The visualization is useful, but it is a bit of a mess considering there are 700 nodes. I'll message you the link, and if anyone else wants it I can send it, but I'd rather not post it here since it's hosted on my college's CS department server and it's not equipped to deal with a lot of traffic.
Also, if anyone has an idea of how else we might have done it, I'd be interested to hear it.
Edit: Didn't realize HN doesn't have a PM feature - if there's another way to send it to you and you're interested let me know.
Thanks, that's helpful! And you can email me at <my HN username> at gmail.com.
I'm also curious whether there are ways to quantify the changes over time mathematically. There's the simple sum of squared changes in distance to get a sense of the "kinetic energy" of the system, but I'm wondering if there are more clever analyses, especially something that can distinguish localized changes from global ones.
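To make that concrete, here's the kind of thing I mean (all names illustrative): treat each protein's similarity score as a coordinate, sum the squared year-over-year changes for a total "energy", and look at how concentrated the per-node terms are to separate localized from global change:

    # "Kinetic energy" of the system between consecutive years, plus a
    # crude localized-vs-global measure: the share of the energy carried
    # by the most-changed nodes. All names here are illustrative.
    import numpy as np

    def kinetic_energy(scores_t, scores_prev, top_k=10):
        # scores_*: dicts mapping protein -> similarity score for a year.
        common = scores_t.keys() & scores_prev.keys()
        sq = np.array([(scores_t[p] - scores_prev[p]) ** 2 for p in common])
        energy = float(sq.sum())
        # If a handful of nodes carry most of the energy, the change is
        # localized; if it's spread evenly, it's global.
        top_share = float(np.sort(sq)[-top_k:].sum() / energy) if energy else 0.0
        return energy, top_share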
Edit: so are you running a separate word2vec model for each year's dataset? If so, how do you map between them? The orientation of the embedding word2vec generates is random, and I worry that trying to rotate the mappings to some common axis could obscure some of the data.
Sent! Yeah, we made a model for each year's dataset. In our case, we were only interested in the similarity between our target protein and the others, so we used each model's own internal similarity measure between those, which avoids the problem of varying orientations between models.
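(For anyone wondering why that works: cosine similarity is unchanged by any rotation of the embedding space, which is exactly the arbitrary part of each model's orientation. A quick sanity check:)

    # Within-model cosine similarity is invariant under any orthogonal
    # transform (rotation/reflection) of the embedding space - the exact
    # ambiguity between independently trained word2vec models.
    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=100), rng.normal(size=100)
    Q, _ = np.linalg.qr(rng.normal(size=(100, 100)))  # random orthogonal matrix

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    assert np.isclose(cos(a, b), cos(Q @ a, Q @ b))  # identical either way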
The author mentions in the second paragraph that the data comes from scraping Wikipedia articles' plot descriptions. So the plots might be old, but the descriptions (and language) were all written recently.
Before drawing any strong conclusions, I'd probably want to do at least some validation against original sources, e.g. Project Gutenberg. You're talking about plot descriptions written by a fairly narrow demographic. I'd be hesitant to use that to draw conclusions about the source material.
It might have a large impact on the language, though. 'Empoison' used to be a verb, 'burgle' has largely been replaced with 'rob', and so on. I think this would actually tend to improve the data - 'empoison' and 'poison' ought to be grouped together.