A relatively underdiscussed quirk of the rise of super-large language models like GPT-3 for certain NLP tasks is that, since those models have incorporated so much real-world grammar, there's no need to do advanced preprocessing: you can just YOLO and work with the generated embeddings instead, without going into spaCy's (excellent) parsing/NER features.
Hugging Face Transformers makes this easier (and for free), as most models can be configured to return a "last_hidden_state" from which you can derive an aggregated embedding. Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs) and you're probably good to go.
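For reference, a minimal sketch of that workflow (the model name and mean-pooling are just one reasonable choice; "last_hidden_state" is per-token, so you still pick an aggregation):

```python
# Minimal sketch: sentence embeddings from DistilBERT via Hugging Face Transformers.
# last_hidden_state is per-token, so we mean-pool over non-padding tokens here.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

texts = ["The quick brown fox jumps over the lazy dog."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)                    # last_hidden_state: (batch, seq_len, 768)

mask = inputs["attention_mask"].unsqueeze(-1)    # zero out padding positions
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)                          # torch.Size([1, 768])
```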
While you make sensible points, in the case of GPT-3, not everyone will be willing to route their data through OpenAI's servers.
> Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs)
This can still be impractical, at least in my case of regularly needing to process hundreds of pages of text. Simpler systems can be much faster for an acceptable loss in accuracy, and you can get more robustness by working with label distributions instead of just picking the argmax.
Fast, simpler classifiers can also help decide where the more resource-intensive models should focus attention.
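As a hedged sketch of that triage idea (the threshold and toy training data here are purely illustrative):

```python
# A cheap TF-IDF + logistic regression classifier keeps the full label
# distribution and only escalates low-confidence documents to a heavier model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["cheap fast text", "another training example"]
train_labels = [0, 1]

cheap_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cheap_clf.fit(train_texts, train_labels)

docs = ["a new document to classify", "an ambiguous one"]
probs = cheap_clf.predict_proba(docs)          # full label distributions, not just argmax

CONFIDENCE_THRESHOLD = 0.9                     # tune for your accuracy/latency trade-off
needs_heavy_model = np.max(probs, axis=1) < CONFIDENCE_THRESHOLD
for doc, escalate in zip(docs, needs_heavy_model):
    if escalate:
        pass  # hand off to DistilBERT/GPT-3/etc. here
```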
Another reason for preprocessing is rule systems. Even if they aren't glamorous to talk about, they still see heavy use in practical settings. While dependency parses are hard to make use of, shallow parses (chunking) and part-of-speech data can be usefully fed into rule systems.
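For example, a small sketch of feeding part-of-speech tags and noun chunks into rules with spaCy's Matcher (the pattern and model are just illustrations):

```python
# Part-of-speech rules and shallow parses with spaCy
# (requires: python -m spacy download en_core_web_sm).
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])   # e.g. "heavy use"

doc = nlp("Rule systems still see heavy use in practical settings.")
for _, start, end in matcher(doc):
    print("rule hit:", doc[start:end].text)

for chunk in doc.noun_chunks:                                  # shallow parse / chunking
    print("chunk:", chunk.text)
```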
lol. a rough translation is that the new super language models are good enough that you don't have to keep track of specific parts of speech in your programming. if you look at the arrays of floating point weights that underlie gpt-3 etc, you can use them to match present participle phrases with other present participle phrases and so forth
this is of course a correct and prescient observation. minimaxir is kind of an NLP final boss, so I wouldn't expect most people to be able to follow everything he says
I don't think it's so much a final boss thing: IMO working with embeddings/word vectors, even in the simplest case such as word2vec/GloVe, is easier to understand than some of the more conventional NLP techniques (e.g. bag of words/TF-IDF).
The spaCy tutorials in the submission also have a section on word vectors.
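As a tiny sketch, with spaCy's medium English model (which bundles static GloVe-style vectors) similarity comparisons are about as simple as it gets:

```python
# Word-vector similarity with spaCy's bundled static vectors
# (requires: python -m spacy download en_core_web_md).
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp("dog").similarity(nlp("puppy")))       # high (roughly 0.8)
print(nlp("dog").similarity(nlp("economics")))   # much lower
```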
Ah, although TF-IDF is still good to know. Semantic search hasn't eliminated the need for classical retrieval techniques. It can also be used to select a subset of words whose word vectors you average into a document signature, a quick-and-dirty method for document embeddings.
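A rough sketch of that quick-and-dirty signature (the top_k cutoff and model choice are arbitrary here):

```python
# Document signature: average the word vectors of each document's top TF-IDF terms.
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_md")     # ships with static word vectors
docs = ["Natural language processing with word vectors.",
        "Classical retrieval still uses TF-IDF weighting."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
vocab = np.array(vectorizer.get_feature_names_out())

top_k = 5
signatures = []
for row in tfidf.toarray():
    top_words = vocab[np.argsort(row)[::-1][:top_k]]           # highest-weighted terms
    vectors = [nlp.vocab[w].vector for w in top_words if nlp.vocab[w].has_vector]
    signatures.append(np.mean(vectors, axis=0) if vectors
                      else np.zeros(nlp.vocab.vectors_length))

print(np.array(signatures).shape)       # (2, 300)
```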
Word co-occurrence counts in matrix form are also nice to know; factorizing such matrices was the original vector space model for distributional semantics and provides historical context for GloVe and the like.
> Word co-occurrence counts in matrix form are also nice to know; factorizing such matrices was the original vector space model for distributional semantics and provides historical context for GloVe and the like.
And also, IIRC, still outperforms them on some tasks.
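A toy illustration of that count-based lineage (window size and dimensionality are arbitrary):

```python
# Build a word co-occurrence matrix over a tiny corpus and factorize it with
# truncated SVD to get dense word vectors, in the spirit of pre-GloVe models.
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [
    "dogs chase cats",
    "cats chase mice",
    "mice fear cats",
]
window = 1

vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}
cooc = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[index[w], index[words[j]]] += 1

svd = TruncatedSVD(n_components=2, random_state=0)
word_vectors = svd.fit_transform(cooc)      # one dense vector per vocabulary word
print(dict(zip(vocab, word_vectors.round(2))))
```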
Readjusting expectations for pre-processing was one of the biggest differences I noticed going from NLP courses to working on NLP in production. For the amount of pre-processing learning material there is, I expected it to be much more important in practice.
I feel lucky to have gotten into NLP when I did (learning in 2017/2018 and starting work at the beginning of 2020). Changing our system from GloVe to BERT was super exciting and a great way to learn about the drawbacks and benefits of each.
IMHO it's not a difference between courses and production, but rather about the difference between preprocessing needs of different NLP ML approaches.
For some NLP methods, all the extra preprocessing steps are absolutely crucial (and take most of the time in production); for other NLP methods they are of limited benefit or even harmful. It's just that older courses (and many production environments still!) use the former methods, so the preprocessing needs to be discussed. But if you're using a BERT-like system, then BERT (or something similar) and its subword tokenization effectively become your preprocessing stage.
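A tiny illustration of that last point: raw text goes straight into a WordPiece tokenizer, no classical preprocessing pipeline in front of it.

```python
# BERT's subword (WordPiece) tokenizer on raw text; the '##' pieces show how
# long or unseen words get split without any hand-built preprocessing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Preprocessing-free inputs get split into subwords!"))
```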
OpenAI recently released an Embeddings API for GPT-3 with good demos and explanations: https://beta.openai.com/docs/guides/embeddings
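For reference, a minimal call with the openai Python package (the model name is just an example; check the linked docs for current models and pricing):

```python
# Minimal sketch of the OpenAI Embeddings API using the v0.x openai client.
import openai

openai.api_key = "sk-..."  # your API key

response = openai.Embedding.create(
    model="text-embedding-ada-002",   # example model name, see the docs above
    input="Some text to embed",
)
vector = response["data"][0]["embedding"]
print(len(vector))  # embedding dimensionality
```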