For dealing with multiple meanings, it's already been done. Sense2vec resolves most of these issues, and a WordNet-integrated version of word2vec, or the newer XLNet, would be state of the art by a long shot, but no one seems to want to implement it, so the world waits longer for good NLP models, I guess...
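For reference, querying sense2vec is pretty painless. A minimal sketch, assuming the standalone sense2vec package from Explosion and one of their pretrained vector archives on disk (the path and keys are illustrative):

```python
from sense2vec import Sense2Vec

# Assumes a downloaded pretrained vector archive; path is a placeholder.
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

# Keys encode the sense as word|TAG, so different senses get different vectors.
for query in ("duck|NOUN", "duck|VERB"):
    if query in s2v:
        print(query, s2v.most_similar(query, n=3))
```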
It relies on word sense disambiguation, which tends to be one of those very language-specific things, and so I'd expect (but haven't verified) that, like other techniques that rely on language-specific bits, it wouldn't work as well on most non-English text. And the most interesting polysemy problems aren't to do with part of speech. They're things like "apple-as-in-food" vs "apple-as-in-computer", or figuring out that "The Big Apple" doesn't have anything to do with either of those. What would be really interesting is dealing well with jargon, slang, and terms of art.
As far as those notebooks go, is there one in particular I should be looking at? I might have missed something, but the stuff I saw basically just demonstrated, "Hey, we can handle a lot of training data really fast." What I'd be more interested in seeing is, "Hey, plug us into your document classification pipeline and your performance (as in accuracy) metrics won't know what hit them."
edit: For a more concrete example of what I'd like to see, and going back to the analogy task: The holy grail I'm looking for isn't "king - man + woman = queen". It's more like "software engineer = programmer", and also "software engineer != software + engineer".
You need to find collocations such as "software engineer" and "The Big Apple" and replace them with "software_engineer" and "The_Big_Apple" in the training corpus, then run regular w2v or GloVe. You will get exactly what you want, and also slightly improved vectors for the rest of the vocabulary.
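Concretely, something like this with gensim's phrase detection on a toy corpus (gensim 4.x API; the data and thresholds are illustrative):

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Toy corpus standing in for the real tokenized training data.
sentences = [
    ["she", "works", "as", "a", "software", "engineer", "in", "new", "york"],
    ["the", "software", "engineer", "fixed", "the", "bug"],
    ["he", "visited", "the", "big", "apple", "last", "year"],
    ["the", "big", "apple", "is", "a", "nickname", "for", "new", "york"],
]

# Learn collocations and merge them into single tokens, e.g.
# "software engineer" -> "software_engineer" when the score clears the threshold.
bigrams = Phraser(Phrases(sentences, min_count=1, threshold=1))
merged = [bigrams[s] for s in sentences]
# (Run Phrases a second time over `merged` to pick up longer phrases
#  like "the_big_apple".)

# Then train regular word2vec (or GloVe) on the merged corpus.
model = Word2Vec(merged, vector_size=50, window=2, min_count=1, workers=1)
```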
I've heard good things about "Scalable Topical Phrase Mining from Text Corpora" [1], but it's been a while, so I don't know how close to the state of the art it is.
You'll find that if you run UMAP on a large corpus (the same size as your original word embeddings), the embeddings it generates (especially if you feed it any labels, since UMAP supports semi-supervised and supervised dimensionality reduction) should outperform the originals; I'd even wager they'd beat those from modern transformers. If they don't, they'll be something like 2% worse in exchange for a big speed improvement, even with the currently single-threaded implementation of UMAP.
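Roughly this, assuming umap-learn and its semi-supervised convention (a label of -1 means "unlabeled"); the random matrix is just a stand-in for a real pre-trained embedding matrix:

```python
import numpy as np
import umap

# Stand-in for a real pre-trained embedding matrix (n_words x 300).
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(10_000, 300)).astype(np.float32)

# Optional labels for semi-supervised reduction; umap-learn treats -1 as "unlabeled".
labels = np.full(word_vecs.shape[0], -1)
labels[:500] = 0  # e.g. a few hundred words with a known class

reduced = umap.UMAP(n_components=50, metric="cosine").fit_transform(word_vecs, y=labels)
print(reduced.shape)  # (10000, 50)
```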
Oh, and you can use UMAP to concatenate tons of vector models together, along with any other side data, into super-loaded embeddings.
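Again a rough sketch with placeholder data: stack the row-aligned vector models with hstack and let UMAP fuse them into one embedding:

```python
import numpy as np
import umap

# Row-aligned stand-ins for several vector models over the same vocabulary.
rng = np.random.default_rng(0)
n_words = 10_000
w2v_vecs = rng.normal(size=(n_words, 300)).astype(np.float32)
glove_vecs = rng.normal(size=(n_words, 300)).astype(np.float32)
side_feats = rng.normal(size=(n_words, 20)).astype(np.float32)  # e.g. frequency, POS one-hots

combined = np.hstack([w2v_vecs, glove_vecs, side_feats])        # (n_words, 620)
fused = umap.UMAP(n_components=100, metric="cosine").fit_transform(combined)
print(fused.shape)  # (10000, 100)
```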
Yes, there have been. The author of UMAP shows impressive results with 3 million word vectors: https://github.com/lmcinnes/umap_paper_notebooks