I just gave an invited talk at KDD about deep learning in which I covered this algorithm, so it's great to see this code appear now.
For anyone interested in text analysis: PLEASE study and use this code and the referenced papers. Its importance is hard to overstate. It is far, far better than all previous approaches to word analysis. These representations are the dimensional compression that occurs in the middle of a deep neural net. The resulting vectors encode rich information about the semantics and usage patterns of each word in a very concise way.
We have barely scratched the surface of the applications of these distributed representations. This is a great time to get started in this field - previous techniques are almost totally obsoleted by this, so everyone is starting from the same point.
I have previously used the Explicit Semantic Analysis (ESA) algorithm for individual word similarity calculations. ESA uses the text of Wikipedia entries and Wikipedia's ontology as its basis, and it worked quite well.
Do you / does anyone know if there is an easy way to use word2vec to compare similarities of two different documents (think of TF-IDF & cosine similarity)? It is stated on the page that "The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences [2]", but the referenced paper has not yet been published.
It would be super interesting if there was a simple way to compare the similarities of two documents using something like this.
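One simple (if crude) approach people use is to average the word vectors of each document and then take the cosine similarity of the resulting document vectors. Here's a minimal sketch of that idea; the vectors below are made up purely for illustration, not output of word2vec:

```python
import numpy as np

# Toy word vectors (in practice, load these from word2vec's output).
# The values here are invented just to show the mechanics.
vectors = {
    "cat":    np.array([0.9, 0.1, 0.0]),
    "dog":    np.array([0.8, 0.2, 0.1]),
    "pet":    np.array([0.7, 0.3, 0.0]),
    "stock":  np.array([0.0, 0.1, 0.9]),
    "market": np.array([0.1, 0.0, 0.8]),
}

def doc_vector(words):
    """Represent a document as the mean of its known word vectors."""
    return np.mean([vectors[w] for w in words if w in vectors], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = doc_vector(["cat", "dog", "pet"])
d2 = doc_vector(["dog", "pet"])
d3 = doc_vector(["stock", "market"])

print(cosine(d1, d2))  # high: both documents are about animals
print(cosine(d1, d3))  # much lower: different topics
```

Averaging throws away word order, so it's closer in spirit to TF-IDF than to a real sentence model, but it's often a surprisingly strong baseline.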
Could anybody explain (or provide a pointer to an explanation of) the details of how the individual words are mapped to vectors? The source is available, but optimized such that the underlying how's are a bit opaque, and the underlying whys even more so.
You can think of this as a square matrix W. The size of the matrix is the size of the vocabulary. If we look at the 100k most frequent words in our corpus, W will be a 100k x 100k matrix.
The value of W(i,j) is the distance between words i and j, and a row of the matrix is the vector representation of that word. Research around word vectors is all about computing W(i,j) in an efficient way that is also useful in natural language processing applications.
Word vectors are often used to compute similarity between words: since words are represented as vectors, we can compute the cosine angle between a given pair of words to find out how similar the two words are.
TL;DR: The answer to your query is a person named Chaudhry Sitwell Borisovich who is definitely an entomologist-hymnist and probably is also a mineralogist-ornithologist.
A Google search suggests that he was born in 1961.
I ran a few queries using the code and its default dataset, trying to use neutral words for subtraction: "mosquito -small +mountaineer", "mosquito -big +mountaineer", "mosquito -loud +mountaineer", "mosquito -normal +mountaineer", "mosquito -usual +mountaineer", "mosquito -air +mountaineer", "mosquito -nothing +mountaineer".
You inadvertently stumbled onto the punchline of the joke - "You can't cross them because a mountaineer is a scalar." (scaler) - works better when spoken.
Yeah, the papers linked in the references are probably a better place to start than the readme, especially [1]. I'm not sure how closely this implementation is aligned with that research, but the paper is still a good read.
I wrote a simple library[1] in Ruby for measuring the similarity between documents using word vectors. It has none of the cleverness of this one, but is much simpler, if that helps?
Just as a hint, you'll get better results if you apply some dimensionality reduction to this. LDA is an old standby, but I like what these people are doing...
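For anyone unfamiliar with what dimensionality reduction looks like in practice, here's a minimal sketch using plain PCA via SVD (not LDA; chosen only because it fits in a few lines). The matrix is a random stand-in for a real word-by-feature matrix:

```python
import numpy as np

# A toy matrix of 6 "word" rows in 4 dimensions; real matrices are far larger.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))

# PCA via SVD: center the rows, then project onto the top-k right singular vectors.
k = 2
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
reduced = Xc @ Vt[:k].T  # shape (6, 2): each "word" now lives in 2 dimensions

print(reduced.shape)  # (6, 2)
```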
If you've ever complained on HN about NSA technology ruining our society, then it's time to take this tool and save society. This is powerful and cool technology. It could expose corruption. It could be a bullshit detector. The same types of tools that have stripped our liberties can be used to re-balance democracy.
But please write free and open source software. Otherwise no one can ever trust the software or consider it secure and safe.
This is great! It would be amazing to have a list of other tools/libraries like this.
For one project I'm working on, I was looking to compare the word/topic frequencies of a document against common usage. I'm sure something out there already does this, as opposed to me doing it (poorly) from scratch.
"... and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1]. You can try out a simple demo by running demo-analogy.sh..."
I downloaded the code, but I don't understand how to get queen as a result. I tried submitting king - man + woman, but the binary doesn't understand it.
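If I remember correctly, the demo binaries expect plain space-separated words (e.g. `king man woman`) rather than arithmetic syntax. Either way, the underlying operation is simple to sketch. The toy vectors below are hand-picked so the analogy works; real ones come from training:

```python
import numpy as np

# Invented 2-d vectors, chosen purely so the analogy comes out right.
vecs = {
    "king":  np.array([0.8, 0.9]),
    "man":   np.array([0.7, 0.1]),
    "woman": np.array([0.2, 0.1]),
    "queen": np.array([0.3, 0.9]),
    "apple": np.array([0.9, 0.5]),
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = vecs[a] - vecs[b] + vecs[c]
    best, best_sim = None, -2.0
    for w, v in vecs.items():
        if w in (a, b, c):
            continue
        sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("king", "man", "woman"))  # -> "queen" with these toy vectors
```

Note the exclusion of the query words: without it, the nearest neighbor of the target vector is often one of the inputs themselves.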
I'm working on an experimental library to do a similar thing, but using echo state networks (https://github.com/neuromancer/libmind).
It would be nice to compare both approaches with an SVM for classifying words.
Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words
It is similar, but at least according to what Mikolov wrote as a response to reviewer comments regarding LDA/LSA/tf-idf [1], LDA does not preserve linguistic regularities such as king - man + woman ~ queen. I asked for additional clarification, but so far I haven't received a reply.
A good intuition as to why these kinds of regularities could even exist was given by Chris Quirk as a blog comment [2]. Essentially, imagine that each word is approximately represented by the contexts it appears in, if so, swapping in and out the contexts of other words could indeed preserve some linguistic regularities.
I wonder how well it works; my takeaway was that you need to tweak the internal thresholds and matrix sizes a lot to get optimal results, which in turn is highly dependent on the datasets you use (which is also made very clear in every LSA paper you'll read).