Teaching a Computer to Read: NLP Hacking in Python

drakaal · on Sept 26, 2013

The CPU cost to do use this approach is terribly high. I don't think this approach is going to give better results than a few simple rules and NLTK would.

This API will do a better job telling you what an article is about. https://www.mashape.com/stremor/stremor-noun-phrase-and-part...

That said, the approach we use for our TLDR software and search rankings doesn't rely on just frequency, the adjectives that amplify the content, the sentences with emotion attached to them, and the "charge" of words matters too much.

Consider the following:

That frakking loser Drakaal came over and hijacked my NLP thread. Just because he does NLP for a living, and thinks he knows everything doesn't mean a thing. My NLP is way cooler because it uses machine learning and that is the future of NLP, not the heuristics model he uses for his stuff.

What is the "core" of that? Clearly it is about how Drakaal sucks, but we only mention him once. NLP is important, machine learning is important, but really it is about why Drakaal sucks.

samuizo · on Sept 27, 2013

While CPU cost is a concern, often memory and correspondingly IO is the bottleneck in vector space approaches. Practitioners can leverage highly-optimized libraries for performing the matrix decompositions, so random disk seeks become more of a concern than CPU time. It's in iterative SVD that Gensim really shines in my opinion.

Turning to your example, any model based on term frequencies, vector space treatments included, would have trouble identifying 'Drakaal' as the most important term. But, this can be mitigated to some extent by preprocessing. In particular, naive coreference resolution would simply assign 'Drakaal' to every occurrence of 'he'/'his' in the sentence (since there are no other candidates). In which case, the count of 'Drakaal' jumps from 1 to 5. Just taking the comments in this thread as the corpus, that's a pretty high frequency for a single document, which might indeed get it to stand out on that basis alone.

Now whether we could get even more nuanced and determine that it's not just about 'Drakaal' but also a certain disposition toward him really depends on the task. If it's important to uncover those sorts of patterns then I would incorporate some documents that are illustrative of the distinction. In this sense, vector space approaches can be both purely exploratory as well as guided toward the divisions you aim for.

msalahi · on Sept 26, 2013

i've actually found the performance of gensim (the topic modeling python module i use here) to be pretty great. we're not at a scale where CPU performance is make or break just yet, so i haven't done any comprehensive testing of performance. but i've definitely not run into any performance issues worth complaining about. however, gensim is 100% based on lazy evaluation where it can be, so it's relatively light on the CPU. i love NLTK as well, but it did lack in the dimensionality reduction/topic modeling department which gensim did so beautifully. LDA + SVM seemed like an interesting approach to go with, and it didn't disappoint.

drakaal · on Sept 26, 2013

The issue with Genism is you have to know what you are trying to analyze before you analyze it. It doesn't do well if you use the wrong corpus or if like you mention start with a million word corpus.

If you were analyzing emails in a single organization all day you could probably sort out topics really well. Doing all of the web it breaks down because it gets less accurate the larger the variety of content.

msalahi · on Sept 27, 2013

"doing all of the web" will cause pretty much any approach to AI/machine learning/NLP to break down. i'm a big believer in it being the responsibility of the engineer employing these techniques to take stock of the problem at hand and find out what constraints you can take advantage of to achieve better performance/accuracy/prettiness of code. there's not really a silver bullet that you can just release on the internet with the task of bringing back incredibly useful information without "knowing what you're trying to analyze before you analyze it."

agibsonccc · on Sept 27, 2013

Web Developers are a neat bunch. It's amazing what kind of inference you can do with exploiting document structure to make different kind of inferences alongside more traditional approaches like word frequency analysis, LDA, or even deep learning/word distributional inference. NLP on the web especially question answering and search, can still be greatly expanded upon.

drakaal · on Sept 27, 2013

Wait an hour. We decided to push that bit of code live in Alpha. :-)

msalahi · on Sept 27, 2013

SCIENCE!

adpreese · on Sept 26, 2013

I tried to test your example with your API, but it requires a credit card even for the freemium plan. Is there any way you can make a rate limited API that never charges to avoid that? I'm not familiar with mashape so it may not be possible.

drakaal · on Sept 26, 2013

I can't easily. The free quota is pretty high on their. (5000 calls), For development I'd work off a few test cases that you save the response on, then if you get something working decide if 5000 calls a month will fill your needs.

drakaal · on Sept 27, 2013

Saw you are trying it out. Awesome! Sorry the documentation is a bit weak right now, we had people wanting it so we got it out, rather than getting all the docs complete.

adpreese · on Sept 27, 2013

I did try it out. It does a good job of pulling out different bits and categorizing them. I went ahead and ran the example you had and put it up to continue the conversation(https://gist.github.com/adpreese/6722561). If you want me to take it down, I will certainly respect that but I thought it'd be convenient for anyone else paying attention.

The noun phrases part of the response gave a concise list of things, including the word thing(hijacked, NLP thread, stuff, My NLP, thing, Drakaal, cooler, heuristics, Just). That's maybe good at picking out the nouns, but it's not really actionable yet. It might be great as the bag of words to use for trying to classify something, but by itself the best it's giving me is NLP/heuristics if I had the concepts grouped together, somehow. I think that's a reasonable takeaway from your example, but I'd be curious what your thoughts are on it.

PS. I tried the TLDR API on a copy and paste of the original article with commas, periods, and single and double quotes removed but it returned a 500 error. I'm probably doing something wrong, but I'd love to see what it spits out if you can help me out.

drakaal · on Sept 27, 2013

We have the groupings and we can use sentiment to get the importance. The API is somewhat limited compared to our full bag of tricks, mostly because we don't want to give away all of our secrets, but also because we change things pretty often and would have to let others know when we made changes if they were consuming an API.

drakaal · on Sept 27, 2013

Play with the http://TLDRStuff.com tools there is zero learning curve over there.

One of the API's takes a URL, one takes plaintext or HTML, not sure which one you hit, but the TLDRStuff.com will make it really easy for you to play.

tdj · on Sept 27, 2013

Actually, I think you could save yourself some trouble and use scikit-learn's built-in text preprocessing utils:

Word counter: http://scikit-learn.org/stable/modules/generated/sklearn.fea...

Hashing vectorizer if you want to trade off explainability for speed and scalability: http://scikit-learn.org/stable/modules/generated/sklearn.fea...

TF-IDF weighing: http://scikit-learn.org/stable/modules/generated/sklearn.fea...

Also, if you transform bag-of-words vectors into a dense form, you're gonna have a bad time (insert appropriate meme picture here). In large corpora, dimensionality grows quite substantially - if you work with news corpora or Wikipedia, you're in the 100k-1M dimensional space pretty quickly.

Great to see an approachable explanation for NLP. As they say sometimes, when you know how it's done, it stops being "Artificial Intelligence".

adpreese · on Sept 26, 2013

How do you deal with keeping your super rare words list sensible? For many forms of technical writing I could see things getting out of hand where you have lots of tiny dense clusters not really close to anything else if you didn't manage the list well.

msalahi · on Sept 26, 2013

As with most successful applications of machine learning, it's about finessing your approach based on the problem at hand. In our case, we have classes divided on the level of "Medicine," "Real Estate," etc. So, we could throw away lots of words that only occurred once or twice in the massive corpus we crawled to build the language model and still have a pretty robust representation of the subject you're trying to represent.

msalahi · on Sept 26, 2013

In fact, if your training corpus is sufficiently large, you'd be shocked how many words you can eliminate right away for a term frequency of one or two. I went from millions of words in the vocabulary to something like 60k just by ignoring words that happen once or twice in the corpus. Plus, you probably won't learn much about the relationships between words if they only occur a few times in the corpus.

jlees · on Sept 27, 2013

Yeah, but consider that some rare words are much stronger indicators of topic than more common ones. Even more so if you look at n-grams. If you use something like wordnet you can get a lot of meaning out of low-frequency words and throw away the meaningless higher-frequency ones that occur in too many categories to be useful.

adpreese · on Sept 27, 2013

Sure, there's value in rare words, but I don't think anything that occurs across the corpus fewer than 3 times is going to tell you anything useful. You need a certain amount just to have it be a real signal. What was the least frequent useful word in the data set, msalahi?

samuizo · on Sept 27, 2013

You can often strike a balance between rare words that appear in only a couple of documents and very frequent words that occur all over the place by employing both a term frequency and a document frequency weighting scheme; 'tf-idf' in the nomenclature [1].

The basic idea is that you keep track of counts both within documents and among documents. For English, word like 'the' will be frequent in each document it occurs in. It will also occur in every document. The high document frequency counteracts the high term frequency. On the other hand, 'motherboard' might be infrequent overall (but not extremely so), but its low document frequency boosts its importance.

The scheme is commonly employed and works quite well, sometimes obviating the need for careful vocabulary pruning. FWIW, scikit-learn implements it in their feature extraction library [2].

[1] http://en.wikipedia.org/wiki/Tf–idf‎ [2] http://scikit-learn.org/stable/modules/generated/sklearn.fea...

taariqlewis · on Sept 26, 2013

Interesting to see a content marketplace company using technology well ahead of its peers. I wonder how many other firms are pursuing this type of commercial research.

rbucks · on Sept 26, 2013

Great question! Probably not very many.

aas48 · on Sept 27, 2013

Genius