
Text mining: most of the time is spent on gathering the data, curating the data, and working with your annotators (domain experts). After that, you try a dozen or more ways to convert documents into a matrix format. Then, you try a dozen or more feature selection algorithms. Finally, the icing on the cake: you get to try a dozen or more machine learning algorithms, each having a dozen or more parameters to be estimated.
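To make the "convert documents into a matrix" step concrete, here is a minimal sketch of just one of those dozen ways: a plain bag-of-words count matrix. The function name `build_term_matrix` and the whitespace tokenisation are illustrative assumptions, not anything TextMinr is known to do.

```python
from collections import Counter


def build_term_matrix(docs):
    """Turn raw documents into a dense document-term count matrix."""
    # Naive tokenisation (an assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives every term a stable column index.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # One row per document, one column per vocabulary term.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


docs = ["the cat sat", "the dog sat down", "cat and dog"]
vocab, matrix = build_term_matrix(docs)
for doc, row in zip(docs, matrix):
    print(row, doc)
```

Swapping raw counts for binary indicators, TF-IDF weights, or n-grams changes only the row-building line, which is why the number of candidate representations multiplies so quickly.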

Yep, it would be very nice to have an API that would do all that for you. But that would require a group of at least 10 ML experts + 10 NLP experts + 20 domain experts. Still, I think it's doable and one should make small efforts to make it happen.

Marginal thoughts: decision trees do very badly on p >> n problems (far more features than samples), though a random forest might work. If TextMinr doesn't have an RBF-kernel (radial) SVM with auto-tuning, then it will not cope with more difficult problems.




Appreciate your comment. And you're right about decision trees, though they can be useful for simpler problems, such as classifying documents into categories where you have "sub categories" and the parent categories are mutually exclusive.

Decision trees are really only useful for problems where there is mutual exclusion between the different options, so they are definitely no silver bullet.


Fully grown decision trees are notorious for their risk of overfitting your training set. If you're uncomfortable fully growing the trees, you then have to consider whether you want to grow them out completely and then prune them, stop growing after a specific depth, train the trees using a random subset of features in the feature space (and then how many do you select? Do you use the square root? Logarithm?), etc. Even then, what are you using to choose when a node splits? Information gain? Information gain ratio? Gini index? What about when you have a feature like credit card numbers, which are unique?

These are all choices that the user has to make. For something as seemingly simple as a decision tree, you can see why some knowledge is required before embarking on any machine learning mission.
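As a rough illustration of why the split criterion matters, the sketch below scores two features on a tiny made-up dataset: a binary `has_link` flag and a unique `card_number`-style ID. The data, the feature names, and the helper functions are all invented for this example. Information gain and the Gini decrease both favour the unique ID because it splits every sample into its own pure leaf, while gain ratio penalises it for having so many distinct values.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gini(items):
    """Gini impurity of a list of discrete values."""
    n = len(items)
    return 1.0 - sum((c / n) ** 2 for c in Counter(items).values())


def partition(values, labels):
    """Group the labels by the feature value they occur with."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    return list(groups.values())


def impurity_decrease(values, labels, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(labels)
    children = sum(len(g) / n * impurity(g) for g in partition(values, labels))
    return impurity(labels) - children


def information_gain(values, labels):
    return impurity_decrease(values, labels, entropy)


def gain_ratio(values, labels):
    # Split information is the entropy of the feature's own value distribution.
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


# Made-up toy data: 8 messages, one useful feature, one unique ID.
labels = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]
has_link = [1, 1, 0, 0, 1, 0, 1, 0]
card_number = list(range(8))  # every value is unique, like a card number

for name, feature in [("has_link", has_link), ("card_number", card_number)]:
    print(name,
          round(information_gain(feature, labels), 3),
          round(gain_ratio(feature, labels), 3),
          round(impurity_decrease(feature, labels, gini), 3))
```

A tree that splits on the ID memorises the training set and generalises to nothing, which is the overfitting risk described above in miniature.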


I suspect that you are referring to what's called "hierarchical text classification". If so, then any classifier can be used for that. And it's not a simple problem. I found it to be a good way to deal with unbalanced classes if you understand the domain knowledge that sits behind your class labels. I suggest taking a look at these papers:

http://scholar.google.com/scholar?q=%22hierarchical+text+cla...
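For what it's worth, the "any classifier can be used" point can be sketched as top-down routing: one classifier picks the parent category, then a per-parent classifier picks the sub category. Everything here is hypothetical, including the keyword-matching `keyword_classifier`, which merely stands in for whatever trained model you would actually plug in, and the category names.

```python
def keyword_classifier(keyword_to_label, default):
    """Hypothetical stand-in for a trained model: first matching keyword wins."""
    def predict(text):
        tokens = text.lower().split()
        for keyword, label in keyword_to_label.items():
            if keyword in tokens:
                return label
        return default
    return predict


def classify_top_down(text, parent_clf, child_clfs):
    """Predict the parent category, then route to that parent's own classifier."""
    parent = parent_clf(text)
    child_clf = child_clfs.get(parent)
    # Parents without a dedicated child classifier get no sub category.
    child = child_clf(text) if child_clf else None
    return parent, child


# Made-up categories and keywords, purely for illustration.
parent_clf = keyword_classifier(
    {"goal": "sports", "match": "sports", "election": "politics"}, "other")
child_clfs = {
    "sports": keyword_classifier({"goal": "football", "serve": "tennis"}, "other-sports"),
    "politics": keyword_classifier({"election": "elections"}, "other-politics"),
}

print(classify_top_down("late goal wins the match", parent_clf, child_clfs))
print(classify_top_down("tennis match decided by one serve", parent_clf, child_clfs))
```

Each child classifier only ever sees documents from its own parent, so a rare sub category competes with a handful of siblings rather than with every class at once.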



