Pattern: A web mining & natural language processing system for Python (ua.ac.be)
356 points by rbreve on Feb 24, 2011 | 22 comments



This is all really cool stuff, but I can't help thinking they've got a dozen or so separate packages here. They've also reinvented the wheel at every turn. An all-new wrapper for the Twitter API and every search engine? An all-new graphing library for JavaScript?

Don't get me wrong, this is all awesome, but a lot of it could have been re-used from better sources without having to spend time working on random API wrappers and the like. I would definitely like to know the reasoning behind creating everything from scratch.


The reason is mainly 1) licensing and 2) integration. I took what I found that fitted the BSD license and wrote the rest myself. The idea was having all the things I needed in a single, concise package. I like carrying around 1 MacGyver knife instead of a heavy toolbox - even if all the separate tools in the box are more robust than the knife.

As for the JavaScript graph library: Daniel Friesen had already ported 90% of the Python code, so it only took me a day or two to finish it. The result is a single file (graph.js) with lots of things besides visualization (eigenvector centrality etc.) which seemed better suited to Pattern than integrating another, bigger project.

Best, Tom


Maybe they wanted it to fit together more seamlessly than if they were to use the existing pieces?


Looks nice, though less full-featured than NLTK. I'd be interested to see how nicely they'd play together and whether applications could exploit the strengths of both at the same time. The only thing better than one good NLP framework is two good NLP frameworks, after all.


A recurring meme with frameworks is that people keep bringing out tools: some of them disappear, some find niche applications, and some become mainstream. That said, I haven't heard of many NLP toolkits (but I'm not in that field).

I want to jump into some basic NLP, but I'd like to stick with one or two toolkits. I had heard of NLTK before this, but are there any other comprehensive or reasonably successful frameworks out there one should be aware of? (Either in Python or something else.)


The best toolkits are probably in Java:

-Stanford's Tagger, Parser, and NLP Core

-Apache OpenNLP

-Lingpipe

Many smaller components are made to be compatible with IBM UIMA (of Watson fame), so they are able to be integrated into a pipeline somewhat easily. For examples of this in biomedical TM, see http://u-compare.org/ .

People will kill me for saying this, but truly: Python's performance isn't adequate for large-scale text mining, _especially_ if you want to do deep/full parsing. Shallow parsing as shown in this package's demo is more feasible.

I personally find NLTK convoluted, but in its favor, it does have readers for a TON of corpora, which is really nice.


My friends in the natural language field tell me Python and NLTK are more common than Java. Then again, this is at a sort-of Python-centric university (Toronto).


It seems like a very nice tool, and many hackers would want to play with it. However, it would be really convenient if the project were put on GitHub.



I don't know a whole lot about text analysis and the mentioned algorithms. Can this be used to analyze articles and determine which are dealing with the same subject? Techmeme-ish? Or what would be a good starting point for this? (Or would this be better off in an 'Ask HN' post? I am one of those horrible new people on here.)


The "tf-idf + cosine similarity + LSA metrics" bit from Pattern is what you are looking for.


In other words, the vector module: http://www.clips.ua.ac.be/pages/pattern-vector
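For anyone wondering what the tf-idf + cosine similarity part boils down to, here is a minimal sketch in plain Python. It does not use Pattern's vector module, it leaves LSA out entirely, and the function names and sample headlines are made up for illustration. Two articles on the same subject should score higher than two on different subjects.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} tf-idf vector per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(token_lists)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up sample headlines, for illustration only.
articles = [
    "Apple unveils a new iPhone with a faster chip",
    "The new iPhone from Apple ships with a faster chip",
    "Senate votes on the federal budget bill",
]
vectors = tfidf_vectors(articles)
same_topic = cosine(vectors[0], vectors[1])
different_topic = cosine(vectors[0], vectors[2])
```

Grouping articles Techmeme-style would then be a matter of clustering documents whose pairwise similarity is above some threshold, which you would have to tune on real data.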


I'm going to be involved in teaching an NLP course this semester, and we're debating what to put in it. What are some things you want to do with NLP, and what would you hope to learn (or have a future employee learn) in an honours and masters level course?


Imagine spidering Twitter for phrases of the type "x is a type of y" in order to form a database of real-world objects in an inheritance hierarchy. Now imagine, once you have these objects, finding out what they do by looking at the verbs that occur around them. Boom. You have objects, and you have the methods you need to write. Now you just need someone to write the code! Writing the methods could become a sort of captcha exercise.
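A rough sketch of the first step, assuming the tweets are already fetched as plain strings (no Twitter API calls here). The regular expression, the function name, and the sample tweets are all hypothetical, and the pattern only handles single-word x and y, so treat it as a toy rather than a real extractor.

```python
import re
from collections import defaultdict

# Matches "<x> is a type of <y>" or "<x> is a kind of <y>", optionally
# preceded by an article. Only single-word x and y are captured.
ISA_PATTERN = re.compile(
    r"\b(?:an?\s+|the\s+)?(\w+)\s+is\s+(?:a|an)\s+(?:type|kind)\s+of\s+(\w+)",
    re.IGNORECASE,
)


def extract_isa(texts):
    """Map each parent concept to the set of child concepts found in texts."""
    hierarchy = defaultdict(set)
    for text in texts:
        for child, parent in ISA_PATTERN.findall(text):
            hierarchy[parent.lower()].add(child.lower())
    return hierarchy


# Made-up sample tweets, for illustration only.
tweets = [
    "TIL a sparrow is a type of bird",
    "An eagle is a kind of bird, obviously",
    "Pretty sure oak is a type of tree",
    "nothing to see here",
]
hierarchy = extract_isa(tweets)
```

Here `hierarchy` ends up mapping "bird" to sparrow and eagle, and "tree" to oak. Finding the verbs around each object would need a part-of-speech tagger, which this sketch does not attempt.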


All of this is, of course, to upload the universe into some sort of Minecraft game.


You need to check out Freebase.


i.e. the semantic web, circa forever?


A project that's been in the back of my head for a while, but that I have no time to do:

Analyze HN comments over time with some NLP techniques, maybe sentiment analysis. Then if the next wave of "HN is turning into Reddit" posts comes, point people to the analysis, whatever the conclusions are.

Seems like this would be well suited for the task. Any takers?
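To make the idea concrete, here is a very crude sketch of what the analysis could look like: score each comment against a tiny word list and average the scores per month. The word lists, function names, and sample comments are all made up for illustration, this is not Pattern's API, and a real analysis would need a proper sentiment model and actual HN data.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon; a real one would be far larger.
POSITIVE = {"great", "awesome", "nice", "cool", "fantastic"}
NEGATIVE = {"bad", "awful", "terrible", "boring", "worse"}


def score(text):
    """Count positive words minus negative words in a comment."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def mean_score_by_month(comments):
    """Average the score of (date, text) pairs per "YYYY-MM" month.

    Dates are assumed to be ISO strings such as "2011-02-24".
    """
    buckets = defaultdict(list)
    for date, text in comments:
        buckets[date[:7]].append(score(text))
    return {month: sum(s) / len(s) for month, s in sorted(buckets.items())}


# Made-up sample comments, for illustration only.
comments = [
    ("2011-01-05", "This is awesome, really cool work"),
    ("2011-01-20", "Nice writeup"),
    ("2011-02-03", "Terrible title, boring article"),
]
trend = mean_score_by_month(comments)
```

Plotting `trend` over a few years of comments would be the chart to point people at, whatever it ends up showing.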



Really fantastic piece of software. It's about time we moved away from using Java wrappers for NLP stuff. Anyone know of a similar project in Ruby?


For those of us who have had to roll some of the same functionality piecemeal out of tools like the Stanford NLP Core, tregex/tsurgeon, WordNet, Beautiful Soup, and Python's NLTK, this looks on the surface to be pretty sweet. BSD licensing.

Here's a cool application - tagging negation and speculation clauses in some text (their demo has been trained on biomedical text):

Example sentence: When U937 cells were infected with HIV-1, no induction of NF-KB factor was detected, whereas high level of progeny virions was produced, suggesting that this factor was not required for viral replication.

Result: When U937 cells were infected with HIV-1 , [NEG0 no induction of NF-KB factor was detected NEG0] , whereas high level of progeny virions was produced , [SPEC2 suggesting that this factor was [NEG1 not required for viral replication NEG1] SPEC2] .


I actually just literally drooled on the keyboard!



