> Essentially, when dealing with natural languages hacking a solution is the suggested way of doing things, since nobody can figure out how to do it properly.
That's really the TL;DR I also got from the computational linguistics courses I attended.
The Pareto principle is probably at work here. Having no solution is worse than having an 80% solution that works well enough, especially when the 100% solution is much harder to achieve (and some of these problems not even humans can solve properly).
Recently I wrote a web-extension for Firefox that displays funny "Deep thought" quotes.
I wanted to analyse the quote text and fetch relevant images to animate in the background of the quote text. After reading several NLP tutorials, guess what I did as a first PoC: pick the 3 longest words in the quote text and run an image search with those 3 words.
I get relevant images in the search results 99/100 times. The quirks of searching often result in the image adding to the funny-ness of the "Deep Thought" on display.
Later I tried using the nlp-compromise JS library to identify "topics" of interest within a quote text, typically nouns, verbs, and adjectives. Comparing the results with my "3-longest-words" approach, I found that the longest words were almost always the same "topic" words that NLP identified for any given quote text.
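For illustration, here's a minimal Python sketch of the heuristic (the actual extension is JavaScript, and the image search call is left out; the word extraction and tie-breaking are my own simplifications):

```python
import re

def top_keywords(quote, n=3):
    """Pick the n longest distinct words in a quote as image-search keywords."""
    words = set(re.findall(r"[a-z]+", quote.lower()))
    # Longest first; break ties alphabetically so the result is deterministic.
    return sorted(words, key=lambda w: (-len(w), w))[:n]

quote = "The ships hung in the sky in much the same way that bricks don't."
print(" ".join(top_keywords(quote)))  # -> "bricks ships hung", used as the search query
```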
Back in games we'd do all sorts of tricks in networking to make it look like things were happening (sound effects, decals, etc.) in response to local events until we could have the server provide the definitive call on some game state.
Most players thought we had a much higher fidelity sim than we actually did. It's a pretty common technique across a lot of games. You can get away with quite a bit by being smart about what you "fake" and what you actually make work end-to-end.
Neat observation. The "3-longest-words" approach probably works well because grammatical words tend to get worn down to the shortest form possible, while longer words tend to reflect the actual topic at hand rather than grammatical structure.
You could use the same argument against pretty much any discipline that's undergoing active research. Of course no one knows (yet) how to do it properly, or else there would be no research going on. Image understanding, robotics, even non-computational disciplines such as medicine... Staying with the latter, take HIV for example: no one knows how to cure it, but I'm sure a lot of people are very grateful for the 80% solutions that prolong lives today.
So, in summary, your point is not wrong. But it's no reason for bashing computational linguistics. It is common across many disciplines to use not-yet-perfect solutions as long as you don't know how to do better.
That said, I don't fully agree with the notion that "hacking a solution" is the suggested way of doing things. Computational linguistics is a pretty wild field with a lot of sub-disciplines. In a lot of those, the state of the art consists of quite sophisticated approaches that are the result of years of research. Take speech recognition, for instance. Currently, deep learning approaches take the cake, but there is also a plethora of insights that have been gained from improving the traditional methods over decades.
I think a more nuanced point of view is called for here.
I didn't intend to bash computational linguistics. Those were some of my favorite courses; I wouldn't have attended more of them than I needed to if I hadn't liked the topic and gotten something out of it.
It's surprising how often you can get very far with imperfect solutions. ELIZA is the classic example: a simple program with very little code could convince people that they were talking to another human, or at least to a machine with some understanding of their feelings.
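For a sense of how little is needed, here's a minimal ELIZA-style sketch in Python (the rules and reflections are simplified stand-ins, not Weizenbaum's original script):

```python
import re

# A few hand-written ELIZA-style rules: (pattern, response template).
RULES = [
    (r"i feel (.*)", "Why do you feel {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r"my (.*)", "Tell me more about your {0}."),
    (r"(.*)", "Please go on."),  # catch-all keeps the conversation moving
]

# Swap first/second person so the echoed fragment reads naturally.
REFLECTIONS = {"my": "your", "i": "you", "me": "you", "am": "are", "you": "I"}

def reflect(fragment):
    return " ".join(REFLECTIONS.get(w, w) for w in fragment.lower().split())

def respond(utterance):
    for pattern, template in RULES:
        match = re.match(pattern, utterance.lower())
        if match:
            return template.format(*[reflect(g) for g in match.groups()])

print(respond("I feel nobody understands me"))
# -> "Why do you feel nobody understands you?"
```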
ELIZA was coded completely by humans. Of course, nowadays we have more sophisticated ways of doing that: we can throw a few topic-tagged example sentences with associated replies at a computer, and it will mostly give the right answers to similar sentences. This is only possible because computational linguistics provided the foundation for it.
Still, many solutions are hacky to this day, but that is because computational linguistics is more concerned with interaction with imperfect humans than most other disciplines in computer science.
Eh, that really comes down to applied theory vs. pure theory. There's no one Grand Unifying Theory of Natural Language Processing, and there isn't likely to be a strong candidate for a while yet. Until then, there is still a lot of good problem-solving that can be done with either traditional NLP or with neural networks, or even a hacked-together hybrid approach, and both application and research will feed into each other to refine the processes.
Yeah, when you look at some of the SemEval contest winners or top 3, many use fairly simple methods combined into a powerful solution (except when LSTM with attention grabs the throne).
Ha, there's a whole section on clones of the summarizer from Classifier4J.
I wrote that in 2003 (I think?) based on @pg's "A plan for spam" essay, and then "invented" the summarization approach (I'm sure others had done similar, but I thought it up myself anyway).
Turns out it was rather well tuned. The 2003 implementation, presumably downloaded from sourceforge(!) still wins comparisons on datasets which didn't even exist when I wrote it[1].
I much prefer the Python implementation though[2], which I hadn't seen before.
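For anyone curious, the approach boils down to scoring sentences by how many of the document's most frequent words they contain. Here's a rough Python sketch in that spirit (the sentence splitting and stopword list are my own simplifications, not the Classifier4J code):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "this"}

def summarize(text, num_sentences=2):
    """Frequency-based extractive summary: keep the sentences containing the most frequent words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    top_words = {w for w, _ in Counter(words).most_common(10)}
    scored = [(sum(w in top_words for w in re.findall(r"[a-z]+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    best = sorted(scored, reverse=True)[:num_sentences]
    return " ".join(s for _, _, s in sorted(best, key=lambda t: t[1]))  # original order
```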
Also, Textacy on top of Spacy is awesome for any kind of text work.
- Answering a question by returning a search result from a large body of texts. E.g. "How do I change the background color of a page in Javascript?"
- Improving the readability of a text. The article only mentions "understanding how difficult to read is a text".
- Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.
- Answering a question, using a large body of facts. Like search, but now it gives a precise answer.
- Finding and correcting spelling/grammatical errors.
> - Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.
That's a simple example because with 'CO2' you at least have the same string that can serve as a keyword connecting those two facts. Usually in natural language we make frequent use of anaphora to refer to people, objects and concepts previously mentioned in the text by name.
Anaphora resolution is one of the really hard problems not only in NLP but in linguistics in general. The simplest anaphoric device in languages like English is the pronoun, and even with pronouns it can be quite difficult to determine what a 'he' or 'she' refers to in context.
>Anaphora resolution is one of the really hard problems not only in NLP but in linguistics in general.
This was one of the most frustrating parts of studying Latin rhetoric. The speakers would keep referring to "That thing I was talking about," and it's a noun from a subordinate clause 2 and a half paragraphs ago.
That's actually very common in most languages. English is one of the few western languages that doesn't do this, which makes it quite complicated for some people to write sentences in it, since in their native language such long-range backreferences and long run-on sentences may be a lot more common.
> - Answering a question by returning a search result from a large body of texts. E.g. "How do I change the background color of a page in Javascript?"
> - Answering a question, using a large body of facts. Like search, but now it gives a precise answer.
That is essentially a Natural Language Interface. There are simple ways to implement one for bots that receive simple commands[1]. The problem is that it quickly becomes very hard if you are trying to do something more open-ended than a bot. So there was simply no room to include it.
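A minimal sketch of what I mean by a bot that receives simple commands; the patterns and intents here are hypothetical, and this is exactly the approach that falls apart once the input becomes open-ended:

```python
import re

# Map simple command patterns to intents; fine for a bot, hopeless for open-ended questions.
COMMANDS = [
    (r"remind me to (?P<task>.+) at (?P<time>\d{1,2}(:\d{2})?\s*(am|pm)?)", "set_reminder"),
    (r"what('s| is) the weather in (?P<city>[a-z ]+)", "get_weather"),
]

def parse(utterance):
    for pattern, intent in COMMANDS:
        match = re.search(pattern, utterance.lower())
        if match:
            return intent, match.groupdict()
    return "unknown", {}

print(parse("Remind me to call Bob at 5pm"))
# -> ('set_reminder', {'task': 'call bob', 'time': '5pm'})
```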
> - Improving the readability of a text. The article only mentions "understanding how difficult to read is a text".
The issue is that the formulas that measure the readability of a text cannot really be used to suggest improvements, because the user ends up focusing on improving the score instead of improving the text. To suggest improvements you need a much more sophisticated system.
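For example, the Flesch reading ease score only looks at sentence length and word length, so you can raise it by chopping up sentences and words without making the text any clearer. A rough Python sketch (the syllable counter is a crude heuristic):

```python
import re

def flesch_reading_ease(text):
    """Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[a-z]+", text.lower())
    # Crude syllable heuristic: count groups of vowels in each word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w))) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```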
> - Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.
This is one of the things that were axed, because in some sense it is simple if you just want to link together concepts without any causality, i.e. things that happen together. To do that you could combine named entity recognition (to find the entities) with a simple way to relate words (e.g., if they appear in the same phrase they are related). However, a more sophisticated form of the process, like the one that produces the Knowledge Graph[2], would be quite hard to do.
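A minimal sketch of that simpler co-occurrence version using spaCy (it needs the en_core_web_sm model installed, and it only says that entities appear together, nothing about causality):

```python
from itertools import combinations
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def cooccurrence_edges(text):
    """Link named entities that appear in the same sentence (co-occurrence, not causality)."""
    doc = nlp(text)
    edges = set()
    for sent in doc.sents:
        entities = {ent.text for ent in sent.ents}
        for a, b in combinations(sorted(entities), 2):
            edges.add((a, b))
    return edges
```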
> - Finding and correcting spelling/grammatical errors.
That's a great idea; we will add a section on how to detect spelling errors.
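Detection on its own can be as simple as a dictionary lookup; a rough Python sketch (the word list path is just an example, and real systems also need edit-distance suggestions):

```python
import re

# Example word list; on many Unix systems /usr/share/dict/words exists.
with open("/usr/share/dict/words") as f:
    DICTIONARY = {line.strip().lower() for line in f}

def misspelled(text):
    """Return the words in the text that are not in the dictionary."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in DICTIONARY]

print(misspelled("This sentnce has a speling error."))
# likely -> ['sentnce', 'speling']
```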
That's true up to a point. We wrote the article for programmers with no previous knowledge of the field, so we avoided anything too hard. To such readers the advanced stuff would look cool, but it would also be impractical to use.
However, we are thinking about creating a more advanced article at a later date.
A lot to review, read, and learn. Thanks a lot for sharing this. Any plans to extend it or write another one covering even more, like Natural Language Generation (not limited to bots; we are using it for weather forecasts) and co-reference?
Thanks. Well, there are interesting things that we had to cut because they were too advanced for an introductory article. We were thinking about making a new article for them in a few months. And Natural Language Generation would be another great topic to talk about.
However, if you already have experience in the topic we would be happy if you would like to write a guest post for us.
I'm always astonished how little mention gensim gets, considering that it can basically be used for all the listed tasks, including parsing, if you combine it with your favorite deep learning library (DyNet, anyone?).
Well, a model you fine-tune to your specific corpus/domain works even (in fact: much) better... And gensim gives you the tools to build the best possible embeddings.
But you do need a use case and an economic reward that justify the substantial increase in cost compared to a pre-trained, vanilla, off-the-shelf parser (model). Yet, if your domain is technical enough (pharma, finance, law, ... essentially, anything but parsing news, blogs, and tweets) it might be the only way to get an NLP system that really works.
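For the record, training embeddings on your own corpus with gensim is only a few lines; a minimal sketch (the toy corpus is far too small to give meaningful neighbours, it just shows the API; gensim 3.x calls the parameter `size` instead of `vector_size`):

```python
from gensim.models import Word2Vec

# Each document is a list of tokens; in practice you'd stream a large domain corpus.
corpus = [
    ["patient", "was", "prescribed", "ibuprofen", "for", "inflammation"],
    ["ibuprofen", "reduces", "inflammation", "and", "pain"],
    ["the", "patient", "reported", "reduced", "pain"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("ibuprofen", topn=3))
```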
Like everything else, depends on your use-case. I have personally used TF-IDF vectors and token sets with Cosine and Jaccard distances in practice.
Some examples of use-cases: are you searching for "semantically similar", or "near duplicate"? You can compare documents under different metrics and different _representations_. Some representations are: LSA, PLSA, LDA, TF-IDF, and Set representations, along with metrics such as Jaccard Distance, Cosine Distance, Euclidean distance, etc.
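A minimal sketch of both comparisons, using scikit-learn for the TF-IDF part (the documents are toy examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "a cat was sitting on a mat", "stock prices fell sharply"]

# TF-IDF vectors + cosine similarity.
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))  # higher than cosine_similarity(tfidf[0], tfidf[2])

# Token sets + Jaccard similarity.
def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

print(jaccard(docs[0], docs[1]), jaccard(docs[0], docs[2]))
```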
The interesting thing about word2vec is that it is an unsupervised method that builds vectors representing each word in a way that makes it easy to find relationships between them.
Yes, I agree that the applications for word vectors are not explained as clearly as they should be. One direct application is as the first layer of a neural network [1], which could be part of either a 1-dimensional convolution or a recurrent neural network. Using pre-trained word vectors is a form of transfer learning and allows for much more predictive models with smaller amounts of training data.
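A minimal PyTorch sketch of that first-layer idea; the embedding matrix here is random just to show the wiring, in practice you'd load it from word2vec/GloVe files before freezing it:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300

# Stand-in for a real pre-trained matrix loaded from word2vec/GloVe files.
pretrained = torch.randn(vocab_size, embed_dim)

embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)  # frozen: transfer learning
encoder = nn.LSTM(embed_dim, 128, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 12))  # a batch with one 12-token sentence
outputs, _ = encoder(embedding(token_ids))
print(outputs.shape)  # torch.Size([1, 12, 128])
```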
Take the famous example of [king] and [queen] being close neighbors in vector space after generating the word vectors ("embedding"). If you then use these vectors to represent the words in your text, a sentence about kings will also add information about the concept of queens, and vice versa. To a far lesser degree, such a sentence will also add to your knowledge of [ceo], and, further down, [mechanical engineer]. But it will not change the system's knowledge of [stereo].
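You can see this directly with gensim's bundled pre-trained vectors (downloaded on first use); a small sketch:

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # pre-trained GloVe vectors, fetched on first use

# king - man + woman lands near queen in the vector space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# king is close to queen, far from stereo.
print(vectors.similarity("king", "queen"), vectors.similarity("king", "stereo"))
```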
Thanks, yeah I get that, but I think I'm having a lack of imagination about what to do with that in terms of how to build something useful and user friendly out of it.
Essentially they are useful for comparing the semantic similarity of pieces of text. The text could be a word, phrase, sentence, paragraph, or document. One practical use case is semantic keyword search where the vectors can be used to automatically find a keyword's synonyms. Another is recommendation engines that recommend other documents based on semantic similarity.
Are you sure it lets you find synonyms? I was under the impression that word2vec only lets you know how similar words are, which is different from being synonyms. E.g. red is like blue in the word2vec sense, but it's not a synonym.
Technically, yes. It will find words which are used in similar contexts, such as synonyms, antonyms, etc. However, in practice, word2vec plus clustering does a good job of finding synonyms [1].
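A small sketch of the word2vec-plus-clustering idea with pre-trained vectors and KMeans (the word list and cluster count are arbitrary; as noted, antonyms like "happy" and "sad" often land in the same cluster, so the output still needs filtering):

```python
import gensim.downloader as api
from sklearn.cluster import KMeans

vectors = api.load("glove-wiki-gigaword-50")
words = ["happy", "glad", "joyful", "sad", "unhappy", "car", "truck", "vehicle", "red", "blue"]

X = [vectors[w] for w in words]
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

for cluster in range(4):
    print(cluster, [w for w, l in zip(words, labels) if l == cluster])
```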
Was very pleased to find this out when I first started studying word embeddings (the abstract principles of word2vec). Essentially it comes down to words having similar verbs and objects that come up most frequently together, so they end up being semantically close.
Well, there's word2vec, which, while it isn't quite the same (its whole point is the vector classification it already embodies), I think is actually the kind of thing you were asking for.
FYI, still a glitch: the email form for the PDF doesn't work right on mobile Safari for me; the cursor shows up in strange places unrelated to the form fields, and I have to click in random places to go from editing the name field to the email field.