A Guide to Natural Language Processing (tomassetti.me)
418 points by ftomassetti on Nov 15, 2017 | 52 comments



> Essentially, when dealing with natural languages hacking a solution is the suggested way of doing things, since nobody can figure out how to do it properly.

That's really the TL;DR I also got from the computational linguistics courses I attended.

The Pareto principle is probably at work here. Having no solution is worse than having an 80% solution that works well enough, especially when the 100% solution is much harder to achieve (and some of the problems not even humans can solve properly).


Couldn't agree more!

Recently I wrote a web-extension for Firefox that displays funny "Deep thought" quotes.

I wanted to analyse the quote text and fetch relevant images to animate in the background of the quote text. After reading several NLP tutorials, guess what I did as a first PoC: pick the 3 longest words in the quote text and run an image search with those 3 words.

It's 6 lines of plain JavaScript that can be run anywhere almost instantly: https://github.com/TheCodeArtist/deep-thought-tabs/blob/mast...

I get relevant images in the search results 99/100 times. The quirks of searching often result in the image adding to the funny-ness of the "Deep Thought" on display.

It's so effective that I ended up publishing the "Deep Thought Tabs" web-extension with just this approach: https://addons.mozilla.org/en-US/android/addon/deep-thought-...

Later I tried using the nlp-compromise js library to identify "topics" of interest within a quote text - typically nouns, verbs, and adjectives. Comparing the results with my "3-longest-words" approach, I found that the longest words were almost always the "topic" words that NLP identified for any given quote text.
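For reference, the whole idea is roughly this (sketched here in Python; the actual extension linked above is plain JavaScript, and the image-search call is left out):

    import re

    def topic_guess(quote, n=3):
        """Crude topic extraction: just take the n longest distinct words."""
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", quote)}
        return sorted(words, key=len, reverse=True)[:n]

    quote = "The early bird gets the worm, but the second mouse gets the cheese."
    query = " ".join(topic_guess(quote))
    # 'query' is then handed to whatever image-search API you have at hand
    print(query)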


That's pretty awesome.

Back in games, we'd do all sorts of networking tricks to make it look like things were happening (sound effects, decals, etc.) in response to local events, until the server could provide the definitive call on the game state.

Most players thought we had a much higher fidelity sim than we actually did. It's a pretty common technique across a lot of games. You can get away with quite a bit by being smart about what you "fake" and what you actually make work end-to-end.


Neat observation. The "3-longest-words" approach probably works well because grammatical (function) words tend to get shortened as much as possible, while longer words tend to be more indicative of the actual topic at hand rather than of grammatical structure.


You could use the same argument against pretty much any discipline that's undergoing active research. Of course no one knows (yet) how to do it properly, or else there would be no research going on. Image understanding, robotics, even non-computational disciplines such as medicine... Staying with the latter, take HIV for example: no one knows how to cure it, but I'm sure a lot of people are very grateful for the 80% solutions that prolong lives today.

So, in summary, your point is not wrong, but it's no reason for bashing computational linguistics. It is common across many disciplines to use not-yet-perfect solutions as long as you don't know how to do better.

That said, I don't fully agree with the notion that "hacking a solution" is the suggested way of doing things. Computational linguistics is a pretty wild field with a lot of sub-disciplines. In a lot of those, the state of the art consists of quite sophisticated approaches that are the result of years of research. Take speech recognition, for instance. Currently, deep learning approaches take the cake, but there is also a plethora of insights that have been gained from improving the traditional methods over decades.

I think a more nuanced point of view is called for here.


I didn't intend to bash computational linguistics. Those were some of my favorite courses. I wouldn't have attended more of them than I needed to if I didn't like the topic and hadn't gotten something out of it.

It's surprising how often you can get very far with imperfect solutions. ELIZA is the classic example. A simple program with very little code could convince people that they were talking to another human or at least machine with an understanding of their feelings.
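A toy sketch of that style of pattern-and-reflect rule (not Weizenbaum's actual script, just the flavor, in Python):

    import re

    # A few ELIZA-flavored rules: a regex paired with a response template.
    RULES = [
        (re.compile(r"\bI feel (.+)", re.I), "Why do you feel {0}?"),
        (re.compile(r"\bI am (.+)", re.I), "How long have you been {0}?"),
        (re.compile(r"\bmy (\w+)", re.I), "Tell me more about your {0}."),
    ]

    def reply(utterance):
        for pattern, template in RULES:
            match = pattern.search(utterance)
            if match:
                return template.format(*match.groups())
        return "Please go on."  # fallback when nothing matches

    print(reply("I feel ignored by my coworkers"))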

ELIZA was coded completely by humans. Of course, nowadays we have more sophisticated ways of doing that. We can throw a few topic-tagged example sentences with associated replies at a computer, and it will mostly give the right answers to similar sentences. This is only possible because computational linguistics provided the foundation for that.

Still, many solutions are hacky to this day, but that is because computational linguistics is more concerned with interaction with imperfect humans than most other disciplines in computer science.


Eh, that really comes down to applied theory vs. pure theory. There's no one Grand Unifying Theory of Natural Language Processing, and not likely to be a strong candidate for a while yet. Until then, there can still be a lot of good problem-solving that can be used with either traditional NLP or with neural networks, or even a hacked-together hybrid approach, and both application and research will feed into each other to refine the processes.


Yeah, when you look at some of the SemEval contest winners or top 3, many use fairly simple methods combined into a powerful solution (except when LSTM with attention grabs the throne).


Ha, there's a whole section on clones of the summarizer from Classifier4J.

I wrote that in 2003 (I think?) based on @pg's "A plan for spam" essay, and then "invented" the summarization approach (I'm sure others had done similar, but I thought it up myself anyway).

Turns out it was rather well tuned. The 2003 implementation, presumably downloaded from SourceForge(!), still wins comparisons on datasets which didn't even exist when I wrote it [1].

I much prefer the Python implementation though[2], which I hadn't seen before.

Also, Textacy on top of Spacy is awesome for any kind of text work.

[1] https://dl.acm.org/citation.cfm?id=2797081

[2] https://github.com/thavelick/summarize/blob/master/summarize...
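For anyone curious, that style of frequency-based extractive summarizer boils down to something like this (a rough Python sketch, not the Classifier4J code itself):

    import re
    from collections import Counter

    def summarize(text, max_sentences=2):
        """Score each sentence by the frequency of its words in the whole text,
        then return the top-scoring sentences in their original order."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"[a-z']+", text.lower()))
        scored = [(sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())), i, s)
                  for i, s in enumerate(sentences)]
        best = sorted(sorted(scored, reverse=True)[:max_sentences], key=lambda t: t[1])
        return " ".join(s for _, _, s in best)

A real implementation would drop stop words and normalise for sentence length, but word-frequency scoring is the core trick.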


There are a few applications missing:

- Answering a question by returning a search result from a large body of texts. E.g. "How do I change the background color of a page in Javascript?"

- Improving the readability of a text. The article only mentions "understanding how difficult to read is a text".

- Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.

- Answering a question, using a large body of facts. Like search, but now it gives a precise answer.

- Finding and correcting spelling/grammatical errors.


> - Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.

That's a simple example because with 'CO2' you at least have the same string that can serve as a keyword connecting those two facts. Usually in natural language we make frequent use of anaphora to refer to people, objects and concepts previously mentioned in the text by name.

Anaphora resolution is one of the really hard problems not only in NLP but in linguistics in general. The simplest anaphoric device in languages like English is the pronoun, and even with pronouns it can be quite difficult to determine what a 'he' or 'she' refers to in context.


>Anaphora resolution is one of the really hard problems not only in NLP but in linguistics in general.

This was one of the most frustrating parts of studying Latin rhetoric. The speakers would keep referring to "That thing I was talking about," and it's a noun from a subordinate clause 2 and a half paragraphs ago.


That’s actually very common in most languages. English is one of the few Western languages that doesn't do this, which makes it quite complicated for some people to write in: in their native language, such far-reaching backreferences and long run-on sentences may be a lot more common.


> - Answering a question by returning a search result from a large body of texts. E.g. "How do I change the background color of a page in Javascript?"

> - Answering a question, using a large body of facts. Like search, but now it gives a precise answer.

That is essentially a Natural Language Interface. There are simple ways to implement one for bots that receive simple commands [1]. The problem is that it quickly becomes very hard if you are trying to do something more open-ended than a bot. So, there was simply no room to include it.

> - Improving the readability of a text. The article only mentions "understanding how difficult to read is a text".

The issue is that the formulas that measure the readability of a text cannot really be used to suggest improvements. That's because the user ends up focusing on improving the score instead of improving the text. To suggest improvements you need a much more sophisticated system.
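For context, the kind of formula in question is something like Flesch reading ease, which only looks at surface counts (words per sentence, syllables per word), so it can score a text but cannot tell you what to change. A rough sketch, with a crude syllable heuristic:

    import re

    def count_syllables(word):
        # Very rough heuristic: count groups of vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        n = max(1, len(words))
        # Higher scores mean the text is easier to read.
        return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

    print(flesch_reading_ease("The cat sat on the mat. It was happy."))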

> - Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.

This is one of the things that was axed, because in some sense it is simple if you just want to link together concepts without any causality, i.e. stuff that happens together. To do that you could combine named entity recognition (to find entities) with a simple way to find a relationship between words (e.g., they appear in the same sentence, therefore they are related). However, a more sophisticated form of the process, like the one that results in the Knowledge Graph [2], would be quite hard to do.
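A minimal sketch of that simpler, causality-free version: run a named entity recognizer and treat entities that show up in the same sentence as related (this uses spaCy and assumes the en_core_web_sm model is installed):

    from collections import Counter
    from itertools import combinations

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def cooccurring_entities(text):
        """Count pairs of named entities that appear in the same sentence."""
        pairs = Counter()
        for sent in nlp(text).sents:
            ents = sorted({ent.text for ent in sent.ents})
            for a, b in combinations(ents, 2):
                pairs[(a, b)] += 1
        return pairs

    print(cooccurring_entities("Google acquired DeepMind in 2014. "
                               "DeepMind is based in London."))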

> - Finding and correcting spelling/grammatical errors.

That's a great idea, we will add how to detect spelling errors.

[1] https://medium.com/swlh/a-natural-language-user-interface-is...

[2] https://en.wikipedia.org/wiki/Knowledge_Graph
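On the spelling-error point, a bare-bones sketch is to flag tokens missing from a known vocabulary and suggest close matches; the standard library's difflib is enough for a toy version (a real checker would use a proper dictionary and word frequencies):

    import re
    from difflib import get_close_matches

    VOCABULARY = {"natural", "language", "processing", "is", "fun", "and", "hard"}

    def check_spelling(text):
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in VOCABULARY:
                suggestions = get_close_matches(word, VOCABULARY, n=3, cutoff=0.7)
                print(f"possible typo: {word!r}, suggestions: {suggestions}")

    check_spelling("Natural langage procesing is fun")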


The fact that those things are hard is exactly why a guide on them would be valuable.


That's true up to a point. We wrote the article for programmers with no previous knowledge, so we avoided stuff that is too hard. To such readers, stuff that is too advanced would look cool, but it would also be impractical to use.

However, we are thinking about creating a more advanced article at a later date.


Author profiling comes to mind as well


- Text generation and dialogue systems


A lot to review, read, learn. Thanks a lot for sharing this. Any plans to extend it or write another one covering even more, like Natural Language Generation (not limited to bots; we are using it for weather forecasts) and co-reference?


Thanks. Well, there are interesting things that we had to cut because they were too advanced for an introductory article. We were thinking about making a new article for them in a few months. And Natural Language Generation would be another great topic to talk about.

However, if you already have experience in the topic we would be happy if you would like to write a guest post for us.


I'm always astonished how little mention gensim gets, considering that it can basically be used for all the listed tasks, including parsing, if you combine it with your favorite deep learning library (DyNet, anyone?).


gensim is one of the best libraries for word vectors and summarization. For parsing and NER, Stanford CoreNLP works best in my experience.


Well, a model you fine-tune to your specific corpus/domain works even (in fact: much) better... And gensim gives you the tools to build the best possible embeddings there.

But you do need a use case and an economic reward to justify the substantial increase in cost over a pre-trained, vanilla, off-the-shelf parser (model). Yet, if your domain is technical enough (pharma, finance, law, ... - essentially, anything but parsing news, blogs, and tweets...) it might be the only way to get an NLP system that really works.
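A minimal sketch of that route with gensim (assumes gensim 4.x; my_corpus.txt is a hypothetical file with one document or sentence per line from your domain):

    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    # Hypothetical domain corpus: one sentence per line.
    with open("my_corpus.txt", encoding="utf-8") as f:
        sentences = [simple_preprocess(line) for line in f]

    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
    model.save("domain_word2vec.model")

    # Words used in similar domain contexts end up close together.
    print(model.wv.most_similar("dosage", topn=5))  # e.g. for a pharma corpus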


Regarding finding similar documents, what is the state of the art nowadays: LDA, word2vec, something else? What do you normally use?


Like everything else, depends on your use-case. I have personally used TF-IDF vectors and token sets with Cosine and Jaccard distances in practice.

Some examples of use-cases: are you searching for "semantically similar", or "near duplicate"? You can compare documents under different metrics and different _representations_. Some representations are: LSA, PLSA, LDA, TF-IDF, and Set representations, along with metrics such as Jaccard Distance, Cosine Distance, Euclidean distance, etc.

Doc2vec is the Word2vec analog for documents.
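For the TF-IDF + cosine route, a minimal sketch with scikit-learn:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "Burning coal increases CO2 emissions.",
        "CO2 emissions from coal plants keep rising.",
        "The cat sat on the mat.",
    ]

    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform(docs)

    # Pairwise cosine similarities; the first two documents should score
    # much higher with each other than either does with the third.
    print(cosine_similarity(matrix))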


Word Mover Distance on Word2Vec vectors.

There is an implementation in Textacy.


Have you heard of word mover’s distance? It works really well!


This is the first time I've seen reading time and readability scores mentioned together with NLP.


Was hoping for some discussion about word vectors like word2vec. I keep reading about them, but don't really understand what they're useful for.


The interesting thing about word2vec is that it is an unsupervised method that builds vectors to represent each word in a way that makes it easy to find relationships between them.

There is a video by the creator of Gensim on word2vec and friends: https://www.youtube.com/watch?v=wTp3P2UnTfQ

We didn't include it, simply because it relies on machine learning and we wanted to show simpler methods.


Yes, I agree that the applications of word vectors are not explained as clearly as they should be. One direct application is as the first layer of a neural network [1], which could be part of either a 1-dimensional convolution or a recurrent neural network. Using pre-trained word vectors is a form of transfer learning and allows for much more predictive models with smaller amounts of training data.

[1] https://blog.keras.io/using-pre-trained-word-embeddings-in-a...
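A minimal sketch of that first-layer use (assuming you have already built an embedding_matrix of shape (vocab_size, dim) from pre-trained vectors, as in the linked post):

    import numpy as np
    import tensorflow as tf

    vocab_size, dim = 10000, 100
    # Assumed to be filled with pre-trained vectors: row i = vector for word id i.
    embedding_matrix = np.zeros((vocab_size, dim))

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=False,  # keep the pre-trained vectors frozen (transfer learning)
        ),
        tf.keras.layers.Conv1D(64, 5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")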


Let me try:

Take the famous example of [king] and [queen] being close neighbors in vector space after generating the word vectors ("embedding"). If you then use these vectors to represent the words in your text, a sentence about kings will also add information about the concept of queens, and vice versa. To a far lesser degree, such a sentence will also add to your knowledge of [ceo], and, further down, [mechanical engineer]. But it will not change the system's knowledge of [stereo].
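You can poke at exactly this with pre-trained vectors, e.g. through gensim's downloader (glove-wiki-gigaword-50 is one of the small GloVe sets it ships, roughly a 65 MB download):

    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")

    print(vectors.similarity("king", "queen"))   # high
    print(vectors.similarity("king", "stereo"))  # low
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))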


Thanks, yeah I get that, but I think I'm lacking the imagination for what to do with it: how to build something useful and user-friendly out of it.


Essentially they are useful for comparing the semantic similarity of pieces of text. The text could be a word, phrase, sentence, paragraph, or document. One practical use case is semantic keyword search where the vectors can be used to automatically find a keyword's synonyms. Another is recommendation engines that recommend other documents based on semantic similarity.


Are you sure it allows you to find synonyms? I was under the impression that word2vec only tells you how similar words are, which is different from being synonyms. E.g. red is like blue in the word2vec sense, but not a synonym.


Technically yes: it will find words which are used in similar contexts, such as synonyms, antonyms, etc. However, in practice, word2vec plus clustering does a good job of finding synonyms [1].

1. https://www.slideshare.net/mobile/lucidworks/implementing-co...


I was very pleased to find this out when I first started studying word embeddings (the abstract principles behind word2vec). Essentially it comes down to words sharing the verbs and objects they most frequently occur with, so they end up being semantically close.


My experience with your site on mobile: https://m.imgur.com/5vLrEJH

Can't get it to go away, can't read the article.


Is there an equivalent to MNIST for NLP? I've always wanted to play around in this space, but I don't know a good and simple dataset to start with.


There are a few different datasets that might be of use, depending on what you're playing with:

- bAbI https://research.fb.com/downloads/babi/ and https://github.com/facebook/bAbI-tasks

- SQuAD https://rajpurkar.github.io/SQuAD-explorer/

- WebQuestions https://github.com/brmson/dataset-factoid-webquestions

Edit: there's also a great list of datasets on the ParlAI project page https://github.com/facebookresearch/ParlAI


I worked with NLP for my research, and I used to build my corpora from wikipedia documents. Here's a tool that I've built to do it: https://github.com/joaoventura/WikiCorpusExtractor


Well, there's word2vec, which, while it isn't quite the same (its whole point is the vector classification it already embodies), I think is actually the kind of thing you were asking for.


Depends on what you want to try. NLTK has built-in datasets. 20 Newsgroups is useful for trying lots of things.
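E.g. 20 Newsgroups is one import away with scikit-learn (NLTK's corpora are fetched with nltk.download):

    from sklearn.datasets import fetch_20newsgroups

    # Downloads and caches the dataset on first use.
    train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
    print(len(train.data), "documents in", len(train.target_names), "categories")
    print(train.data[0][:200])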


Your 'send me a PDF' popup has the background fade div above the form so it's impossible to fill in the form (without opening dev tools).


Thanks for your comment! We have now fixed the issue.


FYI, still a glitch: the email form for the PDF doesn't work right on mobile Safari for me; the cursor shows up in strange places unrelated to the form fields, and I have to click in random places to go from editing the name field to the email field.


Thanks for your comment. We are going to look into it.


The 'send me the PDF' pop-up cannot be closed on my iPhone. Had to close the page.


Hmm, worked fine for me.


Using Chrome on both a Chromebook and Galaxy S5, the right sidebar is screwed up. On the phone, it completely blocks the content.


Quite an obnoxious website on my phone. Anyway I came here to point to GATE as a mature FLOSS option: https://gate.ac.uk/


I recommend Dan Jurafsky and Chris Manning's online Stanford course:

https://www.youtube.com/watch?v=nfoudtpBV68



