I have a good amount of experience in natural language processing and machine learning, and I don't think offering an API that provides easy access to the algorithms is the right solution. The major algorithms in text classification aren't that complex to implement, and can be done in a few hundred lines. Moreover, all of the most widely used, widely tested, and reliable algorithms have public implementations that are readily adaptable to your needs. And that's the problem: understanding your needs.
Understanding your needs (or your company's needs) is where people with PhDs make their money. Machine learning isn't a panacea, and we won't be seeing a one-size-fits-all approach for a while. Even though data has become more accessible, it might be noisy, incomplete, streaming, partially labeled, etc. This is why understanding exactly what you're trying to model with these algorithms is crucial and why "just applying" them is impractical at best and misleading at worst.
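To give a sense of what "a few hundred lines" means in practice, here's a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. It's my own toy illustration: the tokenizer is deliberately crude, the smoothing choice is arbitrary, and the two-document "corpus" is made up.

    import math
    from collections import Counter

    def tokenize(text):
        # crude lowercase/whitespace tokenizer; real systems need far more care
        return text.lower().split()

    class MultinomialNB:
        def __init__(self, alpha=1.0):
            self.alpha = alpha  # Laplace smoothing constant

        def fit(self, docs, labels):
            self.classes = set(labels)
            self.class_doc_counts = Counter(labels)
            self.word_counts = {c: Counter() for c in self.classes}
            self.total_words = Counter()
            self.vocab = set()
            for doc, label in zip(docs, labels):
                for w in tokenize(doc):
                    self.word_counts[label][w] += 1
                    self.total_words[label] += 1
                    self.vocab.add(w)
            self.n_docs = len(docs)
            return self

        def predict(self, doc):
            V = len(self.vocab)
            scores = {}
            for c in self.classes:
                # log prior + sum of smoothed log likelihoods
                score = math.log(self.class_doc_counts[c] / self.n_docs)
                for w in tokenize(doc):
                    score += math.log(
                        (self.word_counts[c][w] + self.alpha)
                        / (self.total_words[c] + self.alpha * V)
                    )
                scores[c] = score
            return max(scores, key=scores.get)

    # toy usage
    clf = MultinomialNB().fit(
        ["cheap viagra now", "meeting at noon tomorrow"],
        ["spam", "ham"],
    )
    print(clf.predict("cheap meeting now"))

The algorithm itself really is the small part; everything interesting (features, data cleaning, evaluation) lives outside this sketch.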
There are a number of text analysis SaaS offerings such as OpenCalais, AlchemyAPI, Zemanta, and OpenAmplify. They've all got impressive science under the hood, but none of them are accurate enough to be useful.
I spend most of my time these days thinking about why that is and what to do about it.
For systems to do better, they'll need to incorporate world knowledge; they'll need to test different interpretations of a text and select the ones that "make sense". This is likely to be a form of statistical inference rather than Cyc-style logic.
Based on some systems I've worked with, I'd estimate that a space-optimized "background" knowledge base that can estimate satisfiability in the common-sense domain is on the order of 10-100 GB. It will puff out to at least an order of magnitude beyond that in the process of creating it.
Few users will have the ability to create a KB of that type, and it would be a serious thing to download and install.
Hosting the services of that kind of system in a SaaS manner makes a lot of sense.
Without looking under the hood I'd say there could be at least four reasons why they fail (based on what most of the NLP literature is lacking):
- did not remove contradicting information from the training sets (two very similar vectors having contradicting labels)
- did not try enough feature selection algorithms
- did not estimate ALL learner parameters using the training sets with internal CV (see the sketch after this list)
- did not include domain knowledge
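On the internal-CV point, here's a toy sketch of what I mean (my own illustration with synthetic data and scikit-learn): parameters are tuned only on the training folds of an inner loop, so the outer cross-validation estimate isn't biased by the tuning itself.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = make_classification(n_samples=200, n_features=50, random_state=0)

    # internal CV: parameter search sees only the training folds
    inner = GridSearchCV(LogisticRegression(max_iter=1000),
                         {"C": [0.01, 0.1, 1, 10, 100]}, cv=3)

    # external CV: an unbiased estimate of the whole tuned procedure
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print(outer_scores.mean())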
The last item in the list above refers to Paul Houle's comment. Besides using tools like OpenCyc, WordNet, or UMLS, there are many other ways to embed domain expertise in an automated classification process. Injecting semantically related features into a vector representation of a document is extremely difficult, and forward feature selection doesn't work well for sparse, noisy data.
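To make "injecting semantically related features" concrete, here's a deliberately naive sketch using NLTK's WordNet interface (a toy example of mine, assuming the wordnet corpus has been downloaded). Even this simple version hints at where the difficulty comes from: every expansion also drags in wrong senses and noise.

    # Expand a bag-of-words with WordNet synonyms so that semantically
    # related documents share features. Requires nltk.download("wordnet").
    from nltk.corpus import wordnet as wn

    def expand_with_synonyms(tokens, max_syns=3):
        expanded = list(tokens)
        for tok in tokens:
            syns = {lemma.name().lower()
                    for synset in wn.synsets(tok)[:2]   # top senses only
                    for lemma in synset.lemmas()}
            expanded.extend(sorted(syns - {tok})[:max_syns])
        return expanded

    print(expand_with_synonyms(["car", "purchase"]))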
The curse of dimensionality is the worst problem affecting machine learning.
Customers don't want to create training sets large enough to train text classifiers; often the number of documents they have for a given category is simply too small to learn from.
As for semantic indexing, it was hard to do in 2005. In 2011 it's easy. DBpedia and Freebase are a chromosome map for the human memome. With large amounts of instance information, it's possible to do things that a big rulebox can't.
These tools are aiming for the market segment that Cyc aimed for, but will use very different methodologies.
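As a small illustration of how accessible that instance information has become, here's a sketch of mine that pulls type information for a single entity from DBpedia's public SPARQL endpoint, using the SPARQLWrapper library (the entity and the LIMIT are arbitrary example choices).

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?type WHERE {
            <http://dbpedia.org/resource/IBM> a ?type .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    # print the classes DBpedia asserts for this entity
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["type"]["value"])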
The other big catch is randomness, which is often not understood by neophytes. If you try to find relations in a sufficiently large data set, you're bound to find some that are "caused" by randomness. Tools like p-values are of little help when you fish for many relations (and not just one in particular).
Randomness can be a curse, but can also be a blessing when introduced as in the random subspace methods. This again abstracts to understanding your business needs and whether the results encountered make sense given the features' [absence of] independence. An API giving you a wide choice of algorithms will still rely on you to run something like ICA as a pre-processing step to identify this statistically independent randomness.
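For what that pre-processing step might look like in practice, here's a toy sketch of mine (synthetic data, scikit-learn's FastICA) that recovers statistically independent components from mixed observations before any downstream learning happens.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.RandomState(0)
    sources = rng.laplace(size=(500, 3))            # pretend independent sources
    observed = sources @ rng.normal(size=(3, 10))   # mixed, correlated features

    ica = FastICA(n_components=3, random_state=0)
    recovered = ica.fit_transform(observed)         # estimated independent components
    print(recovered.shape)                          # (500, 3)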
I've found in my data-mining experience that the most interesting data (at least on the Web) is not particularly easy to parse, even if you write something that automates a form's POST submissions. The second difficult part is normalizing it, as much web/text data is formatted for display to humans, which is quite different from data in an easily analyzable form.
So given that, it's just worth learning enough programming to do loops, conditionals, and regexes to get what you want.
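Something as small as this already covers a surprising amount of that work. It's a toy sketch; the patterns and the example rows are my own assumptions about one hypothetical page layout.

    import re

    raw_rows = [
        "  Price: $1,299.00 (was $1,499.00)  ",
        "Price:$89.99",
        "Price: n/a",
    ]

    for row in raw_rows:
        # loop + conditional + regex: normalize display text into numbers
        m = re.search(r"Price:\s*\$?([\d,]+\.\d{2})", row)
        if m:
            print(float(m.group(1).replace(",", "")))
        else:
            print(None)  # missing or malformed values need explicit handling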
This is so true. I've been mining mailing lists and framework documentation for my Ph.D. and most of my effort and time was spent normalizing the data. Once that was done, classifying content and linking concepts was relatively easy...
Text mining is one of my specialties and I have had similar ideas for a business. One thing that has stopped me is the awesome (and free for about 50K API calls a day) Open Calais service that does entity extraction and identifies some relationships between entities in input text.
For document clustering there are many good open source tools that people and companies can use. The commercial LingPipe product does a good job at sentiment analysis.
Obtaining, scrubbing, and generally curating the data is a pain point that users of this system may still need to worry about.
I wish this new business good luck, but there are definitely some real problems to work around. Perhaps we should go into business together :-)
True, but problems are there to be solved. :)
We've done a lot of work around data normalization/scrubbing from a multitude of sources as part of a sister project, so I'm fairly confident about this aspect.
Curation and classification is another issue, but we have a few ideas.
As for business, you never know, just let me try to get off this Ramen based diet first. ;)
Text mining: most of the time is spent on gathering the data, curating the data, and working with your annotators (domain experts). After that, you try a dozen or more ways to convert documents into a matrix format. Then, you try a dozen or more feature selection algorithms. Finally, the icing on the cake: you get to try a dozen or more machine learning algorithms, each having a dozen or more parameters to be estimated.
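For a sense of what one pass through that pipeline looks like, here's a sketch using scikit-learn. The 20 newsgroups corpus stands in for real data, and every choice here (TF-IDF vectorizer, chi-squared selection, linear SVM) is just one of the "dozen or more" options at each stage, not a recommendation.

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import cross_val_score

    data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])

    pipe = Pipeline([
        ("vec", TfidfVectorizer(min_df=2, stop_words="english")),  # documents -> matrix
        ("sel", SelectKBest(chi2, k=1000)),                        # feature selection
        ("clf", LinearSVC(C=1.0)),                                 # learning algorithm
    ])
    print(cross_val_score(pipe, data.data, data.target, cv=5).mean())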
Yep, it would be very nice to have an API that would do all that for you. But that would require a group of at least 10 ML experts + 10 NLP experts + 20 domain experts. Still, I think it's doable, and one should take small steps toward making it happen.
Marginal thoughts: decision trees are very bad for large p >> n problems - random forest might work, though. If TextMinr doesn't have radial SVM with auto-tuning then it will not cope with more difficult problems.
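For reference, the "radial SVM with auto-tuning" idea is roughly this (a sketch of mine on synthetic data): an RBF-kernel SVM whose C and gamma are chosen by grid search rather than guessed.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=40, random_state=1)

    grid = GridSearchCV(
        SVC(kernel="rbf"),
        {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, "scale"]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)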
Appreciate your comment. And you're right about decision trees, though they can be useful for simpler problems, such as classifying documents into categories where you have sub-categories and the parent categories are mutually exclusive.
Decision trees are really only useful for problems where there is mutual exclusion between the different options, so they are definitely no silver bullet.
Fully grown decision trees are notorious for their risk of overfitting your training set. If you're uncomfortable fully growing the trees, you then have to consider whether you want to grow them out completely and then prune them, stop growing after a specific depth, train the trees using a random subset of features in the feature space (and then how many do you select? Do you use the square root? Logarithm?), etc. Even then, what are you using to choose when a node splits? Information gain? Information gain ratio? Gini index? What about when you have a feature like credit card numbers, which are unique?
These are all choices that the user has to make. For something as seemingly simple as a decision tree, you can see why some knowledge is required before embarking on any machine learning mission.
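To see how many of those knobs surface in a single library call, here's a sketch using scikit-learn's decision tree. The parameter values are arbitrary examples, and note that information gain ratio isn't offered there at all.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=30, random_state=2)

    tree = DecisionTreeClassifier(
        criterion="entropy",     # or "gini"; gain ratio isn't available here
        max_depth=8,             # stop growing at a fixed depth...
        ccp_alpha=0.01,          # ...or grow fully and cost-complexity prune
        max_features="sqrt",     # random subset of features per split (sqrt? log2?)
        min_samples_leaf=5,      # guards against memorizing unique IDs like card numbers
    )
    print(tree.fit(X, y).get_depth())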
I suspect that you are referring to what's called "hierarchical text classification". If so, then any classifier can be used for that, and it's not a simple problem. I found it to be a good way to deal with unbalanced classes if you understand the domain knowledge that sits behind your class labels. I suggest taking a look at these papers:
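For what the parent-then-subcategory setup looks like in code, here's a minimal toy sketch (the data and classifier choices are mine, purely illustrative): one classifier picks the parent category, then a per-parent classifier picks the subcategory. Any base classifier could be swapped in.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    docs     = ["stock markets fell sharply", "bond yields rose again",
                "the striker scored a late goal", "she won the tennis final"]
    parents  = ["finance", "finance", "sports", "sports"]
    children = ["equities", "fixed-income", "football", "tennis"]

    # stage 1: parent-level classifier
    parent_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(docs, parents)

    # stage 2: one subcategory classifier per parent
    child_clfs = {}
    for p in set(parents):
        idx = [i for i, lab in enumerate(parents) if lab == p]
        child_clfs[p] = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
            [docs[i] for i in idx], [children[i] for i in idx])

    def classify(doc):
        parent = parent_clf.predict([doc])[0]
        return parent, child_clfs[parent].predict([doc])[0]

    print(classify("government bond yields"))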
I'm not familiar with 80legs, but the main idea is to "democratise" access to this technology and make it easily accessible to anyone who wants to build something on it.
The initial few beta releases will probably be aimed at people who want to build applications themselves, by providing them with APIs, but hopefully we'll build out the analytics side of things soon enough so it becomes accessible to non-techies as well.
I'm really happy to see more people moving into this space.
I've used a number of different systems (openCalais, AlchemyAPI, Zemanta...) in a variety of projects (Sentiment analysis, document classification...), and what I've found thus far is that while each system works extremely well within some restricted application classes, none come close to being general purpose APIs for the myriad applications developers try to throw at them.
A couple of pain points I've encountered: needing a larger-than-expected corpus to generate meaningful results, because the platform's analysis is overly broad in scope, and not being able to apply negative signals from external sources. I find a large quantity of logic tends to remain, rather redundantly, on the application end to post-filter what's generated.
I don't pretend to understand the level of complexity involved or what's being worked on currently (not an NLP guy), but I do think there's a huge space to create publicly available text mining which can more effectively be applied to narrow domains.
Machine learning is concerned with developing algorithms that enable programs to use data to evolve new behavior. Calling it a meme seems to put it in the same category as advice animals and Old Spice ads. Still, I see what you're saying. It really is "in" right now, which is probably because:
(1) There's a lot more data now for people to use it with
(2) Infrastructure for doing it is cheap and scalable
(3) Automation is a key driver of progress, and ML algos are now getting good enough to automate a lot of stuff that humans used to have to do.
I think I should look into this. I sense that I lack a general understanding of what is really possible with ML. (My current understanding is weighted toward underestimating what is possible with ML).
That's a very good list. I think we're just starting to scratch at the surface of what is possible, which is why I've devoted myself to working in this field!
The high-throughput imaging that I'm familiar with is cell imaging. Cells are cultured in high-density plates (96-well or 384-well dishes), each well is given a different experimental condition (drugs, RNAi, etc.), and then imaged on an automated microscope.
As you can imagine, this generates tons of data. Our lab did a high-throughput screen of genetic mutants in neurons, and then used software to quantify basic morphology such as neurite length, arborization, and cell death.
Crystallographers will use a similar system to bathe their protein in billions of compounds to find the right combination for crystallizing. Automated cameras will capture images and try to identify which ones have crystallized so the researcher doesn't have to do it by hand.
Robots chugging away at petri dishes that contain hundreds of mini-dishes with chemicals/biostuff. Making sense of the millions/billions of data points is the job of machine learning.
This probably won't work, as Google found out with its Prediction API (it hasn't been used much). There's already enough open source software out there that's state of the art and easy to use.
There's good business to be had in selling data, though, which is where these folks should probably divert their effort.
I don't know; ML has a lot of applications and the bar for most people to be able to implement it is rather high. Lowering the bar so "mere mortals" can have some serious infrastructure and data that's a mere API call away seems pretty huge.
The problem is that if you're not confident enough to get these systems working yourself, you're probably not going to be confident enough in your business to pay by the sip for someone else's api.
I honestly don't think this is feasible without regular expressions. There are very many minor details that make data mining work, and they have to be custom-tailored to different solutions.