This is all really cool stuff, but I can't help thinking they've got a dozen or so separate packages here. They've also re-invented the wheel at every turn: an all-new wrapper for the Twitter API and every search engine? An all-new graphing library for JavaScript?
Don't get me wrong, this is all awesome, but a lot of it could have been re-used from better sources without having to spend time working on random API wrappers and the like. I would definitely like to know the reasoning behind creating everything from scratch.
The reason is mainly 1) licensing and 2) integration. I took what I found that was compatible with the BSD license and wrote the rest myself. The idea was to have everything I needed in a single, concise package. I like carrying around one MacGyver knife instead of a heavy toolbox - even if all the separate tools in the box are more robust than the knife.
As for the JavaScript graph library: Daniel Friesen had already ported 90% of the Python code, so it only took me a day or two to finish it. The result is a single file (graph.js) with lots of things besides visualization (eigenvector centrality, etc.), which seemed a better fit for Pattern than integrating another, bigger project.
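(For the curious: eigenvector centrality boils down to repeatedly summing neighbour scores until they settle - power iteration. The sketch below is not the actual graph.js or pattern.graph code, just the idea in plain Python with a made-up toy graph.)

    # Rough sketch of eigenvector centrality via power iteration.
    # The graph format (node -> list of neighbours) is made up for this example.
    def eigenvector_centrality(graph, iterations=100, tolerance=1e-6):
        nodes = list(graph)
        score = dict.fromkeys(nodes, 1.0 / len(nodes))
        for _ in range(iterations):
            # The score[n] term is an identity shift: it keeps the iteration from
            # oscillating on bipartite graphs without changing the ranking.
            new = {n: score[n] + sum(score[m] for m in graph[n]) for n in nodes}
            norm = sum(v * v for v in new.values()) ** 0.5 or 1.0
            new = {n: v / norm for n, v in new.items()}
            done = sum(abs(new[n] - score[n]) for n in nodes) < tolerance
            score = new
            if done:
                break
        return score

    g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}  # a - b - c
    print(eigenvector_centrality(g))               # "b", the hub, scores highest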
Looks nice, though less full-featured than NLTK. I'd be interested to see how nicely they'd play together and whether applications could exploit the strengths of both at the same time. The only thing that's better than one good NLP framework is two good NLP frameworks, after all.
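Something quick and dirty like the following could be a start - NLTK as the corpus reader, Pattern as the tagger and sentiment scorer. (Purely a sketch: it assumes both packages plus the NLTK movie_reviews corpus data are installed, and that Pattern's sentiment() helper is available.)

    # Hypothetical mash-up: NLTK supplies the corpus, Pattern the shallow
    # parser and the lexicon-based sentiment score.
    from nltk.corpus import movie_reviews
    from pattern.en import parse, sentiment

    for fileid in movie_reviews.fileids()[:5]:
        text = movie_reviews.raw(fileid)
        polarity, subjectivity = sentiment(text)      # Pattern's sentiment score
        print("%s  polarity %.2f  subjectivity %.2f" % (fileid, polarity, subjectivity))
        print(parse(text.split(".")[0]))              # shallow parse of the first sentence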
A recurring meme with frameworks is that people keep bringing out tools: some disappear, some find niche applications, and some become mainstream. I haven't heard of many NLP toolkits, though (but then, I'm not in that field).
I want to jump into some basic NLP, but I'd like to stick with one or two toolkits. I had heard of NLTK before this, but are there any other comprehensive or sort of successful frameworks out there one should be aware of? (Either in Python or something else.)
Many smaller components are made to be compatible with IBM UIMA (of Watson fame), so they can be integrated into a pipeline fairly easily. For examples of this in biomedical text mining, see http://u-compare.org/ .
People will kill me for saying this, but truly: Python's performance isn't adequate for large-scale text mining, _especially_ if you want to do deep/full parsing. Shallow parsing as shown in this package's demo is more feasible.
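To illustrate the difference: shallow output is just per-token POS and chunk labels, no full tree, which is why it stays cheap. A one-liner with Pattern's parser (tag format from memory, so treat as approximate):

    # Shallow (chunk) parsing: per-token tags instead of a full parse tree.
    from pattern.en import parse
    print(parse("The cat sat on the mat."))
    # prints slash-separated word/POS/chunk labels for each token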
I personally find NLTK convoluted, but in its favor, it does have readers for a TON of corpora, which is really nice.
My friends in the natural language field tell me Python and NLTK are more common than Java. Then again, this is at a sort-of Python-centric university (Toronto).
I don't know a whole lot about text analysis or the algorithms mentioned, but can this be used to analyze articles and determine which ones deal with the same subject? Techmeme-ish? Or what would be a good starting point for this? (Or would this be better off as an 'Ask HN' post? I am one of those horrible new people on here.)
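The kind of thing I'm imagining, if it helps - no idea whether this is how Pattern does it; this is just bag-of-words TF-IDF plus cosine similarity in plain Python, with made-up articles:

    # Toy "same subject?" check: TF-IDF vectors + cosine similarity.
    import math
    from collections import Counter

    articles = {
        "a1": "apple announces new iphone model",
        "a2": "new iphone model unveiled by apple",
        "a3": "senate debates budget bill",
    }

    def tfidf(docs):
        tokens = {k: v.lower().split() for k, v in docs.items()}
        df = Counter(w for words in tokens.values() for w in set(words))
        n = float(len(docs))
        return {k: {w: tf * math.log(n / df[w]) for w, tf in Counter(words).items()}
                for k, words in tokens.items()}

    def cosine(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        norm = lambda x: math.sqrt(sum(t * t for t in x.values())) or 1.0
        return dot / (norm(u) * norm(v))

    vectors = tfidf(articles)
    print(cosine(vectors["a1"], vectors["a2"]))  # same subject: clearly above zero
    print(cosine(vectors["a1"], vectors["a3"]))  # different subject: 0.0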
I'm going to be involved in teaching an NLP course this semester, and we're debating what to put in it. What are some things you want to do with NLP, and what would you hope to learn (or have a future employee learn) in an honours- or masters-level course?
Imagine spidering Twitter for phrases of the type "x is a type of y" in order to form a database of real-world objects in an inheritance hierarchy. Now imagine, once you have these objects, finding out what they do by looking at the verbs that occur around them. Boom. You have objects, and you have the methods you need to write. Now you just need someone to write the code! Writing the methods could become a sort of captcha exercise.
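Something like this for the first step, minus the actual Twitter spidering (the tweets below are made up), just as a toy illustration:

    # Toy version of the "x is a type of y" harvesting step.
    import re
    from collections import defaultdict

    tweets = [
        "a chihuahua is a type of dog",
        "a dog is a type of animal",
        "chess is a type of game",
    ]

    IS_A = re.compile(r"(?:a |an |the )?(\w+) is a type of (\w+)", re.I)

    parents = defaultdict(set)  # child -> set of parents: the inheritance hierarchy
    for tweet in tweets:
        for child, parent in IS_A.findall(tweet):
            parents[child.lower()].add(parent.lower())

    print(dict(parents))  # {'chihuahua': {'dog'}, 'dog': {'animal'}, 'chess': {'game'}}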
A project that's been in the back of my head for a while, but that I have no time to do:
Analyze HN comments over time with some NLP techniques, maybe sentiment analysis. Then if the next wave of "HN is turning into Reddit" posts comes, point people to the analysis, whatever the conclusions are.
Seems like this would be well suited for the task. Any takers?
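A starting point might look something like this - assuming Pattern's sentiment() helper, with the comment loading (HN Search API, a dump, a scraper, whatever) left out and a placeholder list in its place:

    # Average comment polarity per month; the comments list is a placeholder.
    from collections import defaultdict
    from pattern.en import sentiment

    comments = [
        ("2010-03", "Great write-up, thanks for sharing."),
        ("2011-07", "This place is turning into reddit, downvoted."),
    ]

    by_month = defaultdict(list)
    for month, text in comments:
        polarity, subjectivity = sentiment(text)  # polarity is roughly -1..+1
        by_month[month].append(polarity)

    for month in sorted(by_month):
        scores = by_month[month]
        print("%s  avg polarity %+.2f over %d comments"
              % (month, sum(scores) / len(scores), len(scores)))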
For those of us who have had to roll some of the same functionality piecemeal out of tools like Stanford CoreNLP, tregex/tsurgeon, WordNet, Beautiful Soup, and Python's NLTK, this looks pretty sweet on the surface. BSD licensing.
Here's a cool application - tagging negation and speculation clauses in some text (their demo has been trained on biomedical text):
Example sentence:
When U937 cells were infected with HIV-1, no induction of NF-KB factor was detected, whereas high level of progeny virions was produced, suggesting that this factor was not required for viral replication.
Result:
When U937 cells were infected with HIV-1 , [NEG0 no induction of NF-KB factor was detected NEG0] , whereas high level of progeny virions was produced , [SPEC2 suggesting that this factor was [NEG1 not required for viral replication NEG1] SPEC2] .