What I meant, more precisely, was hard to beat in terms of effort vs. quality of outcome: it’s two lines of code in scikit-learn (CountVectorizer() + TruncatedSVD()) to go from raw text to document/word embeddings, and the result is often “good enough” depending on what you’re trying to do. See the results on pg. 6 (note LSI == LSA): http://proceedings.mlr.press/v37/kusnerb15.pdf
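For concreteness, here’s roughly what that pipeline looks like (a minimal sketch; the corpus and n_components are just placeholders, not a tuned setup):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    corpus = ["the quick brown fox", "lazy dogs sleep all day", "foxes chase dogs"]

    vectorizer = CountVectorizer()              # raw text -> term-document count matrix
    counts = vectorizer.fit_transform(corpus)

    svd = TruncatedSVD(n_components=2)          # low-rank factorization, i.e. LSA/LSI
    doc_embeddings = svd.fit_transform(counts)  # one dense vector per document

    # Word embeddings fall out of the same factorization: each term projects
    # onto the same latent space via the right singular vectors.
    word_embeddings = svd.components_.T         # shape: (n_terms, n_components)

Swap CountVectorizer for TfidfVectorizer if you want the more common tf-idf weighting before the SVD.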
Also, at least based on papers I’ve read recently, BERT doesn’t work that well for producing word embeddings compared to word2vec and GloVe (which can be formulated as matrix factorization methods, like LSA). See table on pg. 6: https://papers.nips.cc/paper/9031-spherical-text-embedding.p...
Point being: mastering the old models gives you a solid foundation to build from.