Text mining: most of the time is spent on gathering the data, curating the data,...

wfaler · on Nov 21, 2011

I appreciate your comment, and you're right about decision trees. They can still be useful for simpler problems, such as classifying documents into categories where you have "sub categories" and the parent categories are mutually exclusive.

Decision trees are really only useful for problems where there is mutual exclusion between the different options, so they are definitely no silver bullet.

law · on Nov 21, 2011

Fully grown decision trees are notorious for overfitting the training set. If you're uncomfortable fully growing the trees, you then have to decide whether to grow them out completely and prune them afterwards, stop growing at a specific depth, or train the trees on a random subset of the features (and then how many do you select? The square root of the feature count? The logarithm?), etc. Even then, what criterion do you use to choose when a node splits? Information gain? Information gain ratio? Gini index? And what about a feature like credit card numbers, where every value is unique?

These are all choices that the user has to make. For something as seemingly simple as a decision tree, you can see why some knowledge is required before embarking on any machine learning mission.
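To make the credit card question concrete, here is a minimal sketch in plain Python. The toy data and the names `card_number`, `high_amount`, and `split_stats` are made up for illustration. It compares information gain with gain ratio on a feature whose values are all unique against an ordinary binary feature.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def split_stats(feature_values, labels):
    """Return (information gain, gain ratio) for splitting on one feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Weighted entropy of the labels that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the partition the feature induces.
    split_info = entropy(feature_values)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return info_gain, gain_ratio

# Hypothetical toy data: 8 transactions, 3 of them fraudulent.
labels = ["fraud", "ok", "ok", "fraud", "ok", "ok", "fraud", "ok"]
card_number = [f"card-{i}" for i in range(8)]          # unique per row
high_amount = [True, False, False, True, False, True, True, False]

ig_card, gr_card = split_stats(card_number, labels)
ig_amount, gr_amount = split_stats(high_amount, labels)

print(f"card_number: gain={ig_card:.3f} ratio={gr_card:.3f}")
print(f"high_amount: gain={ig_amount:.3f} ratio={gr_amount:.3f}")
```

On this toy data, plain information gain ranks the unique ID highest, because every one-row partition is trivially pure. Gain ratio penalizes the split for having so many branches and ranks `high_amount` first. Neither is a universal answer; it is one more choice you have to make.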

zeratul · on Nov 21, 2011

I suspect that you are referring to what's called "hierarchical text classification". If so, then any classifier can be used for that, and it's not a simple problem. I found it to be a good way to deal with unbalanced classes if you understand the domain knowledge that sits behind your class labels. I suggest taking a look at these papers:

http://scholar.google.com/scholar?q=%22hierarchical+text+cla...
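For a rough idea of what the two-level setup looks like, here is a minimal sketch in plain Python. The word-count "classifier", the toy documents, and all names (`train_keyword_model`, `classify`, etc.) are hypothetical stand-ins. Any real classifier could be swapped in at either level.

```python
from collections import Counter

def train_keyword_model(docs, labels):
    """Hypothetical stand-in classifier: word counts per label."""
    counts = {}
    for text, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def predict(model, text):
    """Pick the label whose training documents share the most words."""
    words = text.lower().split()
    # sorted() makes tie-breaking deterministic.
    return max(sorted(model),
               key=lambda label: sum(model[label][w] for w in words))

# Toy training set: (text, parent category, subcategory).
training = [
    ("goal scored in the football match", "sports", "football"),
    ("football striker signs new contract", "sports", "football"),
    ("tennis player wins the open", "sports", "tennis"),
    ("tennis serve and volley", "sports", "tennis"),
    ("new laptop cpu benchmark", "tech", "hardware"),
    ("gpu and cpu prices drop", "tech", "hardware"),
    ("python library release notes", "tech", "software"),
    ("compiler bug fixed in release", "tech", "software"),
]

# Level 1: one model that separates the parent categories.
parent_model = train_keyword_model(
    [text for text, _, _ in training],
    [parent for _, parent, _ in training],
)

# Level 2: one model per parent, trained only on that parent's documents.
child_models = {}
for parent in {p for _, p, _ in training}:
    subset = [(text, child) for text, p, child in training if p == parent]
    child_models[parent] = train_keyword_model(
        [text for text, _ in subset],
        [child for _, child in subset],
    )

def classify(text):
    """Route a document to a parent category, then to a subcategory."""
    parent = predict(parent_model, text)
    return parent, predict(child_models[parent], text)

print(classify("tennis match at the open"))
print(classify("cpu benchmark for a new gpu"))
```

Each second-level model only ever sees documents from its own parent, so a rare subcategory competes with its siblings rather than with every class in the corpus.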