As a linguist and software engineer, I can't imagine someone doing serious NLP without ever having studied [concrete] syntax trees and the like. It's easy to impress people with some tokenization, but it's n-grams that are really useful in the real world, as is understanding syntax trees and all the interconnections possible inside them, so you can NLP the shit out of real-world text/speech instead of simple examples with a tagger (and a good training set of tagged data, carefully crafted for a demo like Apple's). This is a good summary article with very good links nonetheless.
As a linguist, I see what you mean. However, I disagree on a couple of things. First, I wouldn't say "n-grams and understanding syntax trees and all the interconnections possible inside them" will make your NLP skills or results any better. Understanding n-grams and syntax trees should be quite easy for linguists, but that won't do the trick. It turns out that tagging, lemmatizing, WSD, parsing, and many other fundamental tasks in NLP are not that "simple" after all. Of course, you can use libraries for all of those tasks (and that part is pretty simple, indeed), but most libraries will end up making flagrant mistakes once texts, spoken or written, get too complex. Second, real-world texts are complex by default, so for someone to "NLP the shit out of real world text/speech", they will have to find creative solutions to improve the tools, and that can get pretty difficult. Once again, understanding n-grams and syntax trees is not enough. There's one thing I totally agree with you on: tokenization, tagging, and the like might not be that impressive. Nonetheless, creative solutions to the problems underlying those tasks are, in fact, rare and impressive.
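For what it's worth, here is roughly what the "pretty simple" library route looks like, sketched with spaCy (model name assumed; any tagger/lemmatizer would do), on a garden-path sentence of the kind that tends to trip taggers up:

```python
# A sketch of the "pretty simple" library route, using spaCy as an example.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The old man the boat.")  # garden-path sentence: "man" is the verb

for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)

# Taggers typically label "man" here as a noun rather than the main verb,
# the kind of flagrant mistake described above once text gets tricky.
```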
> as is understanding syntax trees and all the interconnections possible inside them so you can NLP the shit out of real world text/speech
But don't modern DL approaches (e.g. SQuAD models, translation models) defy this approach? They train DL models on labeled data without knowing anything about syntax trees, and let the NN do all the magic.
> They train DL models on labeled data without knowing anything about syntax trees, and let the NN do all the magic.
Sure, but that doesn't mean that knowing about syntax won't improve the result. If you're training a translation model on a huge database of labeled examples, it might discover syntactic relationships from scratch. But if you don't have so much data, you're probably better off using all the auxiliary information you can get.
> But that doesn't mean that knowing about syntax won't improve the result
This is likely correct: additional high-quality input will likely improve performance. But creating such input for 200 modern human languages requires significant effort, much larger than letting a NN solve the problem. That's why researchers invest this effort into NN improvement, not syntax-tree creation tooling.
All of us have to strike the right balance between stubbornly believing that NNs will solve all problems in NLP (which, imho, underestimates what linguistics can do for this field of study) and stubbornly saying that only linguistic knowledge will lead to an improvement of the results (which, imho, means underestimating the outstanding work on deep learning techniques done so far). As you said, creating high-quality input is likely to be costly, and I'm not talking about syntax-tree creation tooling only here. But isn't it worth it? I believe it is.
You could jointly train a NN to parse the sentence and tackle your NLP task at the same time, or, instead of using just the tokens, you could use a parser to add more features.
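A rough sketch of the second idea, using spaCy as the example parser (model name assumed): run the parser over the input and hand its output to the downstream model as extra per-token features alongside the raw tokens.

```python
# Augment plain tokens with parser output (POS tag, dependency label, head
# word) as extra features for whatever downstream model you train.
import spacy

nlp = spacy.load("en_core_web_sm")

def featurize(sentence):
    doc = nlp(sentence)
    # Each token becomes (surface form, POS, dependency relation, head form).
    return [(t.text, t.pos_, t.dep_, t.head.text) for t in doc]

print(featurize("The cat sat on the mat"))
# [('The', 'DET', 'det', 'cat'), ('cat', 'NOUN', 'nsubj', 'sat'), ...]
```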
From reading the WaveNet and AlphaGo papers, the problem seems to be that the papers give incomplete information, but I am not going to argue with the popular results.
Sorry, but in a lot of situations n-grams just don't scale: there are way too many combinations. There are a lot of people dealing with GBs and TBs of text.
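To make the combinatorial point concrete, here is a toy count of distinct n-grams as n grows (the corpus path is hypothetical; on any sizable corpus the vocabulary grows steeply with n):

```python
# Count distinct word n-grams for n = 1..5; on real corpora the counts grow
# steeply with n, which is the scaling problem described above.
# ("corpus.txt" is a hypothetical plain-text file.)
tokens = open("corpus.txt").read().split()

for n in range(1, 6):
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    print(f"{n}-grams: {len(ngrams)} distinct")
```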
The state of the art is primarily hybrid models that use deep neural networks in combination with classical parsing algorithms. For example, the Stack LSTM method uses a form of arc-standard parsing with an LSTM and a stack of sorts.
Arc-standard [1] is the classical algorithm in question. See section 3.2 of Dyer's paper: "Our parser is based on the arc-standard transition inventory (Nivre, 2004), given in Figure 3".
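For the curious, here is a toy version of what the arc-standard transition system does (no scoring model; the action sequence is hand-written for illustration, whereas the Stack LSTM learns to choose the actions):

```python
# Toy arc-standard transitions: the parser state is (stack, buffer, arcs)
# and three actions manipulate it.

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, label):
    dep = stack.pop(-2)           # second-from-top becomes a dependent
    arcs.append((stack[-1], label, dep))

def right_arc(stack, buffer, arcs, label):
    dep = stack.pop()             # top becomes a dependent of the new top
    arcs.append((stack[-1], label, dep))

# Parse "cats chase dogs" with a hand-written action sequence.
stack, buffer, arcs = [], ["cats", "chase", "dogs"], []
shift(stack, buffer, arcs)               # stack: [cats]
shift(stack, buffer, arcs)               # stack: [cats, chase]
left_arc(stack, buffer, arcs, "nsubj")   # cats <- chase
shift(stack, buffer, arcs)               # stack: [chase, dogs]
right_arc(stack, buffer, arcs, "dobj")   # chase -> dogs
print(arcs)  # [('chase', 'nsubj', 'cats'), ('chase', 'dobj', 'dogs')]
```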
Perhaps it's no longer state of the art, but I don't see dependency parsing as a task in that suite of benchmarks, and NLP isn't my primary field, so I'm not up to date on whether or not another method has outperformed it in the last three years. (I would be surprised if one hadn't.) At the time it was the state of the art, and regardless, I don't doubt that the current state of the art similarly uses neural networks only as one component of the complete method.
Edit: I can't seem to reply to the post below me. There isn't an entry for CMU/Pittsburgh in your benchmark. The paper refers to the Stanford Dependencies [SD] and the Penn Chinese Treebank 5 [CTB5], and reports LAS scores of 90.9/85.7, respectively. I'm not sure whether any of the treebanks listed correspond to these, but those numbers do seem at least competitive regardless of which treebank is used.
I really love that this getting started guide is "do lots of studying and practice, here are the canonical textbooks, papers, conferences, tools, and problems" instead of "spend a few hours on this superficial toy problem." I'd love to see more guides like this.
It's easy to list a lot of books and papers (and drown newcomers in them) without pointing to actual step-by-step starting points. Sure, doing superficial problems is only the first step (and it's foolish to think it's the last step). Yet you can read all the books in the world, but unless you are able to prove theorems or write code, you know less than someone who wrote a small script to predict names.
Additionally, it's weird that they recommend NLTK (no, please not) and SpaCy (cool and very useful, but high-level), but not Gensim, PyTorch, or at least Keras. As a side note, PyTorch has readable implementations of classical ML techniques, such as word2vec (see https://adoni.github.io/2017/11/08/word2vec-pytorch/).
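In that spirit, a minimal skip-gram word2vec sketch in PyTorch (vocabulary size, dimensions, and the toy positive pairs are made up; negative sampling is omitted):

```python
# Skip-gram scores a (center, context) pair by the dot product of the
# center's input embedding and the context's output embedding.
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)
        self.out_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center, context):
        v = self.in_embed(center)
        u = self.out_embed(context)
        return (v * u).sum(dim=-1)

model = SkipGram(vocab_size=10_000, embed_dim=100)
center = torch.tensor([1, 2])    # toy word indices
context = torch.tensor([3, 4])
loss = nn.functional.binary_cross_entropy_with_logits(
    model(center, context), torch.ones(2))  # positive pairs only
loss.backward()
```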
I finished in the top 25 on Kaggle using NLTK and sklearn. word2vec is thrown around like the gospel in NLP, but simple techniques usually do a lot better because (1) there isn't that much data in most cases, and, most importantly, (2) your corpus differs substantially from the one word2vec was fit on. I am really flabbergasted by how many people start with word2vec and LSTMs, come up with really over-fit models, and never even try the simple things.
Using n-grams (1- and 2-grams on words, and 3-5 character n-grams with truncated SVD) gets you really far.
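For reference, here is roughly what that looks like with scikit-learn (the toy documents and component count are placeholders):

```python
# Word 1-2 grams plus character 3-5 grams, reduced with truncated SVD.
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "NLP is fun"]

word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))

X = hstack([word_vec.fit_transform(docs), char_vec.fit_transform(docs)])

# Reduce the sparse n-gram matrix to a dense low-rank representation.
svd = TruncatedSVD(n_components=2)  # a few hundred components in practice
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)
```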
> Yet you can read all the books in the world, but unless you are able to prove theorems or write code
This is the implicit intent of reading all those books. If you actively follow along when learning from those books you'll be guided through plenty of those toy projects anyway.
Besides the NLP course by Jurafsky, the course "Introduction to Natural Language Processing" by Dragomir Radev is quite good; it covered some topics not covered in Jurafsky's course.
Given that several NLP algorithms are able to achieve >90% accuracy and many more achieve >80% accuracy, how do you come to the conclusion that "all ideas in NLP are garbage"?
80-90% accuracy on your arbitrary data set is meaningless. Your cute NLP algorithms suffer miserably in real life. You are severely underestimating how complex language is if you think getting high accuracy on these narrow tasks is a meaningful measure of progress. Language is a big open problem for cognitive scientists and linguists alike. What makes you think these computer scientists are magically further ahead of those scientists and will create true NLP? Alexa is shit, Google Duplex is shit, Siri is shit. All these NLP applications will continue to be shit until we actually understand what language is and how it works.
It's really easy to say experts are wrong. It's a lot harder to prove it. So prove it. If you're coming in here to run down the current experts, I expect you to have a better solution. If not, I expect you to delete this comment as it adds absolutely nothing to the conversation.
Experts aren't wrong at all. The experts that Silicon Valley likes to favour and pour billions of dollars into are completely wrong, and it's funny how these best-of-the-best are completely ignoring advancements that have taken place in niche fields over the past 50 years. Your so-called 'experts' aren't even asking the right questions. They are all doing the exact same thing over and over again, expecting a different result.
Luckily, there are extremely bright experts in other fields who do know what they are doing. But because they don't have the 'Harvard', 'Stanford', or whatever elitist label, they get ignored and remain obscure.
Traditionally, "science advances one funeral at a time".