This is all really cool stuff, but I can't help thinking they've got a dozen or so separate packages here. They've also re-invented the wheel at every turn: an all-new wrapper for the Twitter API and every search engine? An all-new graphing library for JavaScript?
Don't get me wrong, this is all awesome, but a lot of it could have been re-used from better sources without having to spend time working on random API wrappers and the like. I would definitely like to know the reasoning behind creating everything from scratch.
The reason is mainly 1) licensing and 2) integration. I took what I found that was compatible with the BSD license and wrote the rest myself. The idea was to have everything I needed in a single, concise package. I like carrying around one MacGyver knife instead of a heavy toolbox - even if all the separate tools in the box are more robust than the knife.
As for the JavaScript graph library: Daniel Friesen had already ported 90% of the Python code, so it only took me a day or two to finish it. The result is a single file (graph.js) with lots of things besides visualization (eigenvector centrality, etc.), which seemed a better fit for Pattern than integrating another, bigger project.
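(For the curious: eigenvector centrality boils down to repeatedly summing neighbour scores until they settle - power iteration. The sketch below is not the actual graph.js or pattern.graph code, just the idea in plain Python with a made-up toy graph.)

    # Rough sketch of eigenvector centrality via power iteration.
    # The graph format (node -> list of neighbours) is made up for this example.
    def eigenvector_centrality(graph, iterations=100, tolerance=1e-6):
        nodes = list(graph)
        score = dict.fromkeys(nodes, 1.0 / len(nodes))
        for _ in range(iterations):
            # The score[n] term is an identity shift: it keeps the iteration from
            # oscillating on bipartite graphs without changing the ranking.
            new = {n: score[n] + sum(score[m] for m in graph[n]) for n in nodes}
            norm = sum(v * v for v in new.values()) ** 0.5 or 1.0
            new = {n: v / norm for n, v in new.items()}
            done = sum(abs(new[n] - score[n]) for n in nodes) < tolerance
            score = new
            if done:
                break
        return score

    g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}  # a - b - c
    print(eigenvector_centrality(g))               # "b", the hub, scores highest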
Looks nice, though less full-featured than NLTK. I'd be interested to see how nicely they'd play together and whether applications could exploit the strengths of both at the same time. The only thing that's better than one good NLP framework is two good NLP frameworks, after all.
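Something quick and dirty like the following could be a start - NLTK as the corpus reader, Pattern as the tagger and sentiment scorer. (Purely a sketch: it assumes both packages plus the NLTK movie_reviews corpus data are installed, and that Pattern's sentiment() helper is available.)

    # Hypothetical mash-up: NLTK supplies the corpus, Pattern the shallow
    # parser and the lexicon-based sentiment score.
    from nltk.corpus import movie_reviews
    from pattern.en import parse, sentiment

    for fileid in movie_reviews.fileids()[:5]:
        text = movie_reviews.raw(fileid)
        polarity, subjectivity = sentiment(text)      # Pattern's sentiment score
        print("%s  polarity %.2f  subjectivity %.2f" % (fileid, polarity, subjectivity))
        print(parse(text.split(".")[0]))              # shallow parse of the first sentence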
A recurring meme with frameworks is that people keep bringing out tools: some disappear, some find niche applications, and some become mainstream. I haven't heard of many NLP toolkits, though (but then, I'm not in that field).
I want to jump into some basic NLP, but I'd like to stick with one or two toolkits. I had heard of NLTK before this, but are there any other comprehensive or sort of successful frameworks out there one should be aware of? (Either in Python or something else.)
Many smaller components are made to be compatible with IBM UIMA (of Watson fame), so they can be integrated into a pipeline fairly easily. For examples of this in biomedical text mining, see http://u-compare.org/ .
People will kill me for saying this, but truly: Python's performance isn't adequate for large-scale text mining, _especially_ if you want to do deep/full parsing. Shallow parsing as shown in this package's demo is more feasible.
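To illustrate the difference: shallow output is just per-token POS and chunk labels, no full tree, which is why it stays cheap. A one-liner with Pattern's parser (tag format from memory, so treat as approximate):

    # Shallow (chunk) parsing: per-token tags instead of a full parse tree.
    from pattern.en import parse
    print(parse("The cat sat on the mat."))
    # prints slash-separated word/POS/chunk labels for each token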
I personally find NLTK convoluted, but in its favor, it does have readers for a TON of corpora, which is really nice.
My friends in the natural language field tell me Python and NLTK are more common than Java. Then again, this is at a sort-of Python-centric university (Toronto).
I don't know a whole lot about text analysis or the algorithms mentioned, but can this be used to analyze articles and determine which ones deal with the same subject? Techmeme-ish? Or what would be a good starting point for this? (Or would this be better off as an 'Ask HN' post? I am one of those horrible new people on here.)
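The kind of thing I'm imagining, if it helps - no idea whether this is how Pattern does it; this is just bag-of-words TF-IDF plus cosine similarity in plain Python, with made-up articles:

    # Toy "same subject?" check: TF-IDF vectors + cosine similarity.
    import math
    from collections import Counter

    articles = {
        "a1": "apple announces new iphone model",
        "a2": "new iphone model unveiled by apple",
        "a3": "senate debates budget bill",
    }

    def tfidf(docs):
        tokens = {k: v.lower().split() for k, v in docs.items()}
        df = Counter(w for words in tokens.values() for w in set(words))
        n = float(len(docs))
        return {k: {w: tf * math.log(n / df[w]) for w, tf in Counter(words).items()}
                for k, words in tokens.items()}

    def cosine(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        norm = lambda x: math.sqrt(sum(t * t for t in x.values())) or 1.0
        return dot / (norm(u) * norm(v))

    vectors = tfidf(articles)
    print(cosine(vectors["a1"], vectors["a2"]))  # same subject: clearly above zero
    print(cosine(vectors["a1"], vectors["a3"]))  # different subject: 0.0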
I'm going to be involved in teaching an NLP course this semester, and we're debating what to put in it. What are some things you want to do with NLP, and what would you hope to learn (or have a future employee learn) in an honours- or masters-level course?
Imagine spidering Twitter for phrases of the type "x is a type of y" in order to form a database of real-world objects in an inheritance hierarchy. Now imagine, once you have these objects, finding out what they do by looking at the verbs that occur around them. Boom. You have objects, and you have the methods you need to write. Now you just need someone to write the code! Writing the methods could become a sort of captcha exercise.
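Something like this for the first step, minus the actual Twitter spidering (the tweets below are made up), just as a toy illustration:

    # Toy version of the "x is a type of y" harvesting step.
    import re
    from collections import defaultdict

    tweets = [
        "a chihuahua is a type of dog",
        "a dog is a type of animal",
        "chess is a type of game",
    ]

    IS_A = re.compile(r"(?:a |an |the )?(\w+) is a type of (\w+)", re.I)

    parents = defaultdict(set)  # child -> set of parents: the inheritance hierarchy
    for tweet in tweets:
        for child, parent in IS_A.findall(tweet):
            parents[child.lower()].add(parent.lower())

    print(dict(parents))  # {'chihuahua': {'dog'}, 'dog': {'animal'}, 'chess': {'game'}}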
A project that's been in the back of my head for a while, but that I have no time to do:
Analyze HN comments over time with some NLP techniques, maybe sentiment analysis. Then if the next wave of "HN is turning into Reddit" posts comes, point people to the analysis, whatever the conclusions are.
Seems like this would be well suited for the task. Any takers?
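A starting point might look something like this - assuming Pattern's sentiment() helper, with the comment loading (HN Search API, a dump, a scraper, whatever) left out and a placeholder list in its place:

    # Average comment polarity per month; the comments list is a placeholder.
    from collections import defaultdict
    from pattern.en import sentiment

    comments = [
        ("2010-03", "Great write-up, thanks for sharing."),
        ("2011-07", "This place is turning into reddit, downvoted."),
    ]

    by_month = defaultdict(list)
    for month, text in comments:
        polarity, subjectivity = sentiment(text)  # polarity is roughly -1..+1
        by_month[month].append(polarity)

    for month in sorted(by_month):
        scores = by_month[month]
        print("%s  avg polarity %+.2f over %d comments"
              % (month, sum(scores) / len(scores), len(scores)))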
For those of us who have had to roll some of the same functionality piecemeal out of tools like Stanford CoreNLP, tregex/tsurgeon, WordNet, Beautiful Soup, and Python's NLTK, this looks pretty sweet on the surface. BSD licensing.
Here's a cool application - tagging negation and speculation clauses in some text (their demo has been trained on biomedical text):
Example sentence:
When U937 cells were infected with HIV-1, no induction of NF-KB factor was detected, whereas high level of progeny virions was produced, suggesting that this factor was not required for viral replication.
Result:
When U937 cells were infected with HIV-1 , [NEG0 no induction of NF-KB factor was detected NEG0] , whereas high level of progeny virions was produced , [SPEC2 suggesting that this factor was [NEG1 not required for viral replication NEG1] SPEC2] .