A relatively underdiscussed quirk of the rise of super-large language models like GPT-3 for certain NLP tasks is that, since those models have incorporated so much real-world grammar, there's no need to do advanced preprocessing: you can just YOLO and work with the generated embeddings instead, without going into spaCy's (excellent) parsing/NER features.
Hugging Face Transformers makes this easier (and for free), as most models can be configured to return a "last_hidden_state" from which you can derive an aggregated embedding. Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs) and you're probably good to go.
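For reference, a minimal sketch of that workflow (the model name and mean-pooling are just one reasonable choice; "last_hidden_state" is per-token, so you still pick an aggregation):

```python
# Minimal sketch: sentence embeddings from DistilBERT via Hugging Face Transformers.
# last_hidden_state is per-token, so we mean-pool over non-padding tokens here.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

texts = ["The quick brown fox jumps over the lazy dog."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)                    # last_hidden_state: (batch, seq_len, 768)

mask = inputs["attention_mask"].unsqueeze(-1)    # zero out padding positions
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)                          # torch.Size([1, 768])
```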
While you make sensible points, in the case of GPT-3, not everyone will be willing to route their data through OpenAI's servers.
> Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs)
This can still be impractical, at least in my case of regularly needing to process hundreds of pages of text. Simpler systems can be much faster for an acceptable loss in accuracy, and you can get more robustness by working with label distributions instead of just picking the argmax.
Fast, simpler classifiers can also help decide where the more resource-intensive models should focus attention.
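As a hedged sketch of that triage idea (the threshold and toy training data here are purely illustrative):

```python
# A cheap TF-IDF + logistic regression classifier keeps the full label
# distribution and only escalates low-confidence documents to a heavier model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["cheap fast text", "another training example"]
train_labels = [0, 1]

cheap_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cheap_clf.fit(train_texts, train_labels)

docs = ["a new document to classify", "an ambiguous one"]
probs = cheap_clf.predict_proba(docs)          # full label distributions, not just argmax

CONFIDENCE_THRESHOLD = 0.9                     # tune for your accuracy/latency trade-off
needs_heavy_model = np.max(probs, axis=1) < CONFIDENCE_THRESHOLD
for doc, escalate in zip(docs, needs_heavy_model):
    if escalate:
        pass  # hand off to DistilBERT/GPT-3/etc. here
```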
Another reason for preprocessing is rule systems. Even if they aren't glamorous to talk about, they still see heavy use in practical settings. While dependency parses are hard to make use of, shallow parses (chunking) and part-of-speech data can be usefully fed into rule systems.
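For example, a small sketch of feeding part-of-speech tags and noun chunks into rules with spaCy's Matcher (the pattern and model are just illustrations):

```python
# Part-of-speech rules and shallow parses with spaCy
# (requires: python -m spacy download en_core_web_sm).
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])   # e.g. "heavy use"

doc = nlp("Rule systems still see heavy use in practical settings.")
for _, start, end in matcher(doc):
    print("rule hit:", doc[start:end].text)

for chunk in doc.noun_chunks:                                  # shallow parse / chunking
    print("chunk:", chunk.text)
```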
lol. a rough translation is that the new super language models are good enough that you don't have to keep track of specific parts of speech in your programming. if you look at the arrays of floating point weights that underlie gpt-3 etc, you can use them to match present participle phrases with other present participle phrases and so forth
this is of course a correct and prescient observation. minimaxir is kind of an NLP final boss, so I wouldn't expect most people to be able to follow everything he says
I don't think it's so much a final boss thing: IMO working with embeddings/word vectors, even in the simplest case such as word2vec/GloVe, is easier to understand than some of the more conventional NLP techniques (e.g. bag of words/TF-IDF).
The spaCy tutorials in the submission also have a section on word vectors.
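As a tiny sketch, with spaCy's medium English model (which bundles static GloVe-style vectors) similarity comparisons are about as simple as it gets:

```python
# Word-vector similarity with spaCy's bundled static vectors
# (requires: python -m spacy download en_core_web_md).
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp("dog").similarity(nlp("puppy")))       # high (roughly 0.8)
print(nlp("dog").similarity(nlp("economics")))   # much lower
```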
Ah, although TF-IDF is still good to know. Semantic search hasn't eliminated the need for classical retrieval techniques. It can also be used to select a subset of words whose word vectors you average into a document signature, a quick-and-dirty method for document embeddings.
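A rough sketch of that quick-and-dirty signature (the top_k cutoff and model choice are arbitrary here):

```python
# Document signature: average the word vectors of each document's top TF-IDF terms.
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_md")     # ships with static word vectors
docs = ["Natural language processing with word vectors.",
        "Classical retrieval still uses TF-IDF weighting."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
vocab = np.array(vectorizer.get_feature_names_out())

top_k = 5
signatures = []
for row in tfidf.toarray():
    top_words = vocab[np.argsort(row)[::-1][:top_k]]           # highest-weighted terms
    vectors = [nlp.vocab[w].vector for w in top_words if nlp.vocab[w].has_vector]
    signatures.append(np.mean(vectors, axis=0) if vectors
                      else np.zeros(nlp.vocab.vectors_length))

print(np.array(signatures).shape)       # (2, 300)
```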
Word co-occurrence counts in matrix form are also nice to know; factorizing such matrices was the original vector space model for distributional semantics and provides historical context for GloVe and the like.
> Word co-occurrence counts in matrix form are also nice to know; factorizing such matrices was the original vector space model for distributional semantics and provides historical context for GloVe and the like.
And also, IIRC, still outperforms them on some tasks.
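A toy illustration of that count-based lineage (window size and dimensionality are arbitrary):

```python
# Build a word co-occurrence matrix over a tiny corpus and factorize it with
# truncated SVD to get dense word vectors, in the spirit of pre-GloVe models.
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [
    "dogs chase cats",
    "cats chase mice",
    "mice fear cats",
]
window = 1

vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}
cooc = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[index[w], index[words[j]]] += 1

svd = TruncatedSVD(n_components=2, random_state=0)
word_vectors = svd.fit_transform(cooc)      # one dense vector per vocabulary word
print(dict(zip(vocab, word_vectors.round(2))))
```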
Readjusting expectations for pre-processing was one of the biggest differences I noticed going from NLP courses to working on NLP in production. For the amount of pre-processing learning material there is, I expected it to be much more important in practice.
I feel lucky to have gotten into NLP when I did (learning in 2017/2018 and starting work at the beginning of 2020). Changing our system from GloVe to BERT was super exciting and a great way to learn about the drawbacks and benefits of each.
IMHO it's not a difference between courses and production, but rather about the difference between preprocessing needs of different NLP ML approaches.
For some NLP methods, all the extra preprocessing steps are absolutely crucial (and take most of the time in production); for other NLP methods they are of limited benefit or even harmful. It's just that older courses (and many production environments still!) use the former methods, so the preprocessing needs to be discussed. But if you're using a BERT-like system, then BERT (or something similar) and its subword tokenization effectively become your preprocessing stage.
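A tiny illustration of that last point: raw text goes straight into a WordPiece tokenizer, no classical preprocessing pipeline in front of it.

```python
# BERT's subword (WordPiece) tokenizer on raw text; the '##' pieces show how
# long or unseen words get split without any hand-built preprocessing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Preprocessing-free inputs get split into subwords!"))
```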
OpenAI recently released an Embeddings API for GPT-3 with good demos and explanations: https://beta.openai.com/docs/guides/embeddings
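For reference, a minimal call with the openai Python package (the model name is just an example; check the linked docs for current models and pricing):

```python
# Minimal sketch of the OpenAI Embeddings API using the v0.x openai client.
import openai

openai.api_key = "sk-..."  # your API key

response = openai.Embedding.create(
    model="text-embedding-ada-002",   # example model name, see the docs above
    input="Some text to embed",
)
vector = response["data"][0]["embedding"]
print(len(vector))  # embedding dimensionality
```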