I found that explanation to focus almost entirely on negative sampling, without explaining much about the actual Skip-Gram and CBOW models.
However, as I've understood it, negative sampling is a big part of why those models are so computationally efficient, combined with Hierarchical Softmax to reduce the complexity further.
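For intuition on the efficiency point, here's a rough numpy sketch of a single skip-gram negative-sampling update: each (target, context) pair only touches the true context's output vector plus k sampled negative vectors, instead of a full |V|-way softmax. The array names (W_in, W_out) and the helper are made up for illustration, not taken from the linked note.

    import numpy as np

    def sgns_step(W_in, W_out, target, context, neg_ids, lr=0.025):
        # One skip-gram negative-sampling update for a single (target, context) pair.
        v = W_in[target]                                # input vector of the target word
        ids = np.concatenate(([context], neg_ids))      # true context + k negatives
        labels = np.zeros(len(ids))
        labels[0] = 1.0                                 # 1 for the true context, 0 for negatives
        u = W_out[ids]                                  # only k+1 output vectors are touched
        scores = 1.0 / (1.0 + np.exp(-(u @ v)))         # sigmoids, not a |V|-way softmax
        grad = scores - labels                          # gradient of the logistic loss w.r.t. scores
        W_out[ids] -= lr * np.outer(grad, v)            # update the k+1 output vectors
        W_in[target] -= lr * (grad @ u)                 # update the single input vector

    # Usage: 10k-word vocab, 100-dim vectors, 5 negative samples per pair.
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.1, size=(10_000, 100))
    W_out = np.zeros((10_000, 100))
    sgns_step(W_in, W_out, target=42, context=7, neg_ids=rng.integers(0, 10_000, size=5))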
This current article seems to cover the various choices for constructing 'contexts' (which include skip-gram and CBOW) pretty well.
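For concreteness, here's a tiny sketch of the two context constructions, assuming a symmetric window (the helper below is made up for illustration):

    def training_pairs(tokens, window=2, cbow=False):
        # Yield (input, prediction target) pairs for CBOW or skip-gram.
        for i, target in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if cbow:
                yield context, target          # CBOW: the context words jointly predict the target
            else:
                for c in context:
                    yield target, c            # skip-gram: the target predicts each context word

    sent = "the quick brown fox jumps".split()
    print(list(training_pairs(sent)))              # skip-gram pairs
    print(list(training_pairs(sent, cbow=True)))   # CBOW (context list, target) pairs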
Note that negative sampling and hierarchical softmax are actually alternative ways to interpret the hidden layer and arrive at error values to back-propagate. Each can be used completely independently.
If you enable both, you're training two independent hidden layers, which then update the same shared input vectors in an interleaved fashion. (Essentially, each example is trained jointly: first the hierarchical-softmax codepath nudges the vectors, then the separate negative-sampling codepath nudges them.) So the combination doesn't actually reduce the complexity – it adds to model state size and training time – and I think most projects with large amounts of data just use one or the other (usually just negative sampling).
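If it helps, that independence is visible in gensim's Word2Vec parameters (assuming the gensim 4.x names), where hierarchical softmax and negative sampling are toggled separately and you'd normally pick just one:

    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"],
                 ["jumps", "over", "the", "lazy", "dog"]]

    # Skip-gram with negative sampling only (the common choice).
    m_ns = Word2Vec(sentences, sg=1, hs=0, negative=5, vector_size=100, min_count=1)

    # Skip-gram with hierarchical softmax only (negative=0 disables sampling).
    m_hs = Word2Vec(sentences, sg=1, hs=1, negative=0, vector_size=100, min_count=1)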
Ah, thank you for pointing that out. I guess I got confused in all the papers I've read on the topic recently. It's hard to get into.
However, I would still not agree that the article linked in that comment, which explains negative sampling, explains how word2vec works well enough. Or maybe I just didn't understand it.
Nice paper. I especially like how he has equations, pseudo-code, and Python code snippets. He could turn this paper into a book, adding full Python examples, and I would buy a copy.
Huh, I wrote a few pages on neural networks for Natural Language Processing just a few days ago. Too bad I didn't have access to this. It seems to cover all the different kinds of networks I figured were relevant to mention, and it has a comprehensive explanation of Recursive Neural Networks, which I couldn't really find elsewhere.
I glanced through the entire PDF. While it looks like an outstanding, comprehensive overview of neural networks, it doesn't appear to really address NLP all that much, despite the title.
I would gladly welcome it if you or someone else could write a guide with the comprehensiveness of the PDF above but with more NLP domain-specific discussion and concrete examples.
http://arxiv.org/pdf/1402.3722v1.pdf
https://levyomer.wordpress.com/2014/04/25/word2vec-explained...