As a linguist and software engineer, I can't imagine someone doing serious NLP without ever having studied [concrete] syntax trees and the like. It's easy to impress people with some tokenization, but it's n-grams that are really useful in the real world, as is understanding syntax trees and all the interconnections possible inside them, so you can NLP the shit out of real-world text/speech instead of simple examples with a tagger (and a good training set of tagged data, carefully crafted for a demo like Apple's). This is a good summary article with very good links nonetheless.
As a linguist, I see what you mean. However, I disagree on a couple of things. First, I wouldn't say "n-grams and understanding syntax trees and all the interconnections possible inside them" will make your NLP skills or results any better. Understanding n-grams and syntax trees should be quite easy for linguists, but that won't do the trick. It turns out that tagging, lemmatizing, WSD, parsing, and many other fundamental tasks in NLP are not that "simple" after all. Of course, you can use libraries for all of those tasks (and that part is pretty simple, indeed), but most libraries will end up making flagrant mistakes once texts, spoken or written, get too complex. Second, real-world texts are complex by default, so for someone to "NLP the shit out of real world text/speech", they will have to find creative solutions to improve the tools, and that can get pretty difficult. Once again, understanding n-grams and syntax trees is not enough. There's one thing I totally agree with you on: tokenization, tagging, and the like might not be that impressive. Nonetheless, creative solutions to the problems underlying those tasks are, in fact, rare and impressive.
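For what it's worth, here is roughly what the "pretty simple" library route looks like, sketched with spaCy (model name assumed; any tagger/lemmatizer would do), on a garden-path sentence of the kind that tends to trip taggers up:

```python
# A sketch of the "pretty simple" library route, using spaCy as an example.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The old man the boat.")  # garden-path sentence: "man" is the verb

for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)

# Taggers typically label "man" here as a noun rather than the main verb,
# the kind of flagrant mistake described above once text gets tricky.
```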
> as is understanding syntax trees and all the interconnections possible inside them so you can NLP the shit out of real world text/speech
But don't modern DL approaches (e.g. SQuAD models, translation models) defy this approach? They train DL models on labeled data without knowing anything about syntax trees, and let the NN do all the magic.
> They train DL models on labeled data without knowing anything about syntax trees, and let the NN do all the magic.
Sure, but that doesn't mean that knowing about syntax won't improve the result. If you're training a translation model on a huge database of labeled examples, it might discover syntactic relationships from scratch. But if you don't have so much data, you're probably better off using all the auxiliary information you can get.
> But that doesn't mean that knowing about syntax won't improve the result
This is likely correct: additional high-quality input will likely improve performance. But creating such input for 200 modern human languages requires significant effort, much larger than letting a NN solve the problem. That's why researchers invest this effort into NN improvement, not syntax-tree creation tooling.
All of us have to strike the right balance between stubbornly believing that NNs will solve all problems in NLP (which, imho, underestimates what linguistics can do for this field of study) and stubbornly saying that only linguistic knowledge will lead to an improvement of the results (which, imho, means underestimating the outstanding work on deep learning techniques done so far). As you said, creating high-quality input is likely to be costly, and I'm not talking about syntax-tree creation tooling only here. But isn't it worth it? I believe it is.
You could jointly train a NN to parse the sentence and tackle your NLP task at the same time, or, instead of using just the tokens, you could use a parser to add more features.
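A rough sketch of the second idea, using spaCy as the example parser (model name assumed): run the parser over the input and hand its output to the downstream model as extra per-token features alongside the raw tokens.

```python
# Augment plain tokens with parser output (POS tag, dependency label, head
# word) as extra features for whatever downstream model you train.
import spacy

nlp = spacy.load("en_core_web_sm")

def featurize(sentence):
    doc = nlp(sentence)
    # Each token becomes (surface form, POS, dependency relation, head form).
    return [(t.text, t.pos_, t.dep_, t.head.text) for t in doc]

print(featurize("The cat sat on the mat"))
# [('The', 'DET', 'det', 'cat'), ('cat', 'NOUN', 'nsubj', 'sat'), ...]
```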
From reading the WaveNet and AlphaGo papers, the problem seems to be that the papers give incomplete information, but I am not going to argue with the popular results.
Sorry, but in a lot of situations n-grams just don't scale: there are way too many combinations. There are a lot of people dealing with GBs and TBs of text.
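To make the combinatorial point concrete, here is a toy count of distinct n-grams as n grows (the corpus path is hypothetical; on any sizable corpus the vocabulary grows steeply with n):

```python
# Count distinct word n-grams for n = 1..5; on real corpora the counts grow
# steeply with n, which is the scaling problem described above.
# ("corpus.txt" is a hypothetical plain-text file.)
tokens = open("corpus.txt").read().split()

for n in range(1, 6):
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    print(f"{n}-grams: {len(ngrams)} distinct")
```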
The state of the art is primarily hybrid models that use deep neural networks in combination with classical parsing algorithms. For example, the Stack LSTM method uses a form of arc-standard parsing with an LSTM and a stack of sorts.
Arc-standard [1] is the classical algorithm in question. See section 3.2 of Dyer's paper: "Our parser is based on the arc-standard transition inventory (Nivre, 2004), given in Figure 3".
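For the curious, here is a toy version of what the arc-standard transition system does (no scoring model; the action sequence is hand-written for illustration, whereas the Stack LSTM learns to choose the actions):

```python
# Toy arc-standard transitions: the parser state is (stack, buffer, arcs)
# and three actions manipulate it.

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, label):
    dep = stack.pop(-2)           # second-from-top becomes a dependent
    arcs.append((stack[-1], label, dep))

def right_arc(stack, buffer, arcs, label):
    dep = stack.pop()             # top becomes a dependent of the new top
    arcs.append((stack[-1], label, dep))

# Parse "cats chase dogs" with a hand-written action sequence.
stack, buffer, arcs = [], ["cats", "chase", "dogs"], []
shift(stack, buffer, arcs)               # stack: [cats]
shift(stack, buffer, arcs)               # stack: [cats, chase]
left_arc(stack, buffer, arcs, "nsubj")   # cats <- chase
shift(stack, buffer, arcs)               # stack: [chase, dogs]
right_arc(stack, buffer, arcs, "dobj")   # chase -> dogs
print(arcs)  # [('chase', 'nsubj', 'cats'), ('chase', 'dobj', 'dogs')]
```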
Perhaps it's no longer state of the art, but I don't see dependency parsing as a task in that suite of benchmarks, and NLP isn't my primary field, so I'm not up to date on whether or not another method has outperformed it in the last three years. (I would be surprised if one hadn't.) At the time it was the state of the art, and regardless, I don't doubt that the current state of the art similarly uses neural networks only as one component of the complete method.
Edit: I can't seem to reply to the post below me. There isn't an entry for CMU/Pittsburgh in your benchmark. The paper refers to the Stanford Dependencies [SD] and the Penn Chinese Treebank 5 [CTB5], and reports LAS scores of 90.9/85.7, respectively. I'm not sure whether any of the treebanks listed correspond to these, but those numbers do seem at least competitive regardless of which treebank is used.
I really love that this getting started guide is "do lots of studying and practice, here are the canonical textbooks, papers, conferences, tools, and problems" instead of "spend a few hours on this superficial toy problem." I'd love to see more guides like this.
It's easy to list a lot of books and papers (and drown newcomers in them) without pointing to actual step-by-step starting points. Sure, doing superficial problems is only the first step (and it's foolish to think it's the last step). Yet you can read all the books in the world, but unless you are able to prove theorems or write code, you know less than someone who wrote a small script to predict names.
Additionally, it's weird that they recommend NLTK (no, please not) and SpaCy (cool and very useful, but high-level), but not Gensim, PyTorch, or at least Keras. As a side note, PyTorch has readable implementations of classical ML techniques, such as word2vec (see https://adoni.github.io/2017/11/08/word2vec-pytorch/).
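In that spirit, a minimal skip-gram word2vec sketch in PyTorch (vocabulary size, dimensions, and the toy positive pairs are made up; negative sampling is omitted):

```python
# Skip-gram scores a (center, context) pair by the dot product of the
# center's input embedding and the context's output embedding.
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)
        self.out_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center, context):
        v = self.in_embed(center)
        u = self.out_embed(context)
        return (v * u).sum(dim=-1)

model = SkipGram(vocab_size=10_000, embed_dim=100)
center = torch.tensor([1, 2])    # toy word indices
context = torch.tensor([3, 4])
loss = nn.functional.binary_cross_entropy_with_logits(
    model(center, context), torch.ones(2))  # positive pairs only
loss.backward()
```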
I finished in the top 25 on Kaggle using NLTK and sklearn. word2vec is thrown around like the gospel in NLP, but simple techniques usually do a lot better because (1) there isn't that much data in most cases, and, most importantly, (2) your corpus differs substantially from the one word2vec was fit on. I am really flabbergasted by how many people start with word2vec and LSTMs, come up with really over-fit models, and never even try the simple things.
Using n-grams (1- and 2-grams on words, and 3-5 character n-grams with truncated SVD) gets you really far.
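For reference, here is roughly what that looks like with scikit-learn (the toy documents and component count are placeholders):

```python
# Word 1-2 grams plus character 3-5 grams, reduced with truncated SVD.
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "NLP is fun"]

word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))

X = hstack([word_vec.fit_transform(docs), char_vec.fit_transform(docs)])

# Reduce the sparse n-gram matrix to a dense low-rank representation.
svd = TruncatedSVD(n_components=2)  # a few hundred components in practice
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)
```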
> Yet you can read all the books in the world, but unless you are able to prove theorems or write code
This is the implicit intent of reading all those books. If you actively follow along when learning from those books you'll be guided through plenty of those toy projects anyway.
Besides the NLP course by Jurafsky, the course "Introduction to Natural Language Processing" by Dragomir Radev is quite good; it covered some topics not covered in Jurafsky's course.
Given that several NLP algorithms are able to achieve >90% accuracy and many more achieve >80% accuracy, how do you come to the conclusion that "all ideas in NLP are garbage"?
80-90% accuracy on your arbitrary data set is meaningless. Your cute NLP algorithms suffer miserably in real life. You are severely underestimating how complex language is if you think getting high accuracy on these narrow tasks is a meaningful measure of progress. Language is a big open problem for cognitive scientists and linguists alike. What makes you think these computer scientists are magically further ahead of those scientists and will create true NLP? Alexa is shit, Google Duplex is shit, Siri is shit. All these NLP applications will continue to be shit until we actually understand what language is and how it works.
It's really easy to say experts are wrong. It's a lot harder to prove it. So prove it. If you're coming in here to run down the current experts, I expect you to have a better solution. If not, I expect you to delete this comment as it adds absolutely nothing to the conversation.
Experts aren't wrong at all. The experts that Silicon Valley likes to favour and pour billions of dollars into are completely wrong, and it's funny how these best-of-the-best are completely ignoring advancements that have taken place in niche fields over the past 50 years. Your so-called 'experts' aren't even asking the right questions. They are all doing the exact same thing over and over again, expecting a different result.
Luckily, there are extremely bright experts in other fields who do know what they are doing. But because they don't have the 'Harvard', 'Stanford', or whatever elitist label, they get ignored and remain obscure.
Traditionally, "science advances one funeral at a time".