My thesis is related to trees, forests, and ensembles of forests.
This is pretty concise, but it's mostly about decision trees, and even then only half of the topic.
CART is the framework for decision trees for both classification and regression. This article only addresses the classification part, which usually uses Gini impurity, a splitting criterion that splits along axis-parallel boundaries (oblique trees also exist). The regression part uses more traditional statistical measures, like the total sum of squares (SSTO), to calculate split points.
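Just to make the Gini idea concrete, here's a toy sketch (my own illustration, not code from the article; the function names are made up):

    import numpy as np

    def gini_impurity(labels):
        # Gini impurity of a set of class labels: 1 - sum_k p_k^2
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def split_score(feature_values, labels, threshold):
        # Weighted Gini impurity of the two children produced by an
        # axis-parallel split at `threshold` (lower is better).
        left = labels[feature_values <= threshold]
        right = labels[feature_values > threshold]
        n = len(labels)
        return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([0, 0, 1, 1])
    print(split_score(x, y, threshold=2.5))  # 0.0, a perfect split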
It's very light on Random Forests though; it doesn't talk about out-of-bag samples, the implications of bootstrapping for ordinal data, etc. But overall I think it's a neat introduction.
> we only try a subset of the features,
You sample a subset of the features without replacement at every split, rather than just bootstrapping rows/observations like in bagging.
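Roughly, the two sampling steps look like this (my own illustrative sketch with made-up sizes, not from the article):

    import numpy as np

    rng = np.random.default_rng(0)
    n_rows, n_features = 100, 16

    # Bagging: bootstrap the rows (sampled WITH replacement), once per tree.
    row_idx = rng.choice(n_rows, size=n_rows, replace=True)

    # Random forest adds this: at EACH split, consider only a random
    # subset of the features (sampled WITHOUT replacement).
    mtry = int(np.sqrt(n_features))
    feature_idx = rng.choice(n_features, size=mtry, replace=False)

    print(row_idx[:10], feature_idx)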
The concept of weak learners ensembled together to become a strong learner comes from Dr. Ho's work in her Random Subspace paper, where she does it with decision trees and essentially proposed Random Forests before Dr. Leo Breiman (both arrived at Random Forests independently). Her advisor has the theoretical paper proving the weak-learner idea, on stochastic discrimination.
I think this assumes writing an article only benefits the author, when in fact I think it benefits readers far more (if the blog isn't monetized, the author only really gets satisfaction out of it). If I had expertise in some area I would be happy to contribute to someone else's blog, provided I were properly credited (which the author stated he would do).
My blog is still pretty new, so I haven't had to think about this situation yet - I'm the only one who's committed so far. If someone were to contribute I'd be more than happy to figure out a fair way to attribute them, though!
For Random Forests, I like this one: https://www.gormanalysis.com/blog/random-forest-from-top-to-..., which also has a link to a decision-tree post. That blog also has the best GBM explainer I've seen yet (Gradient Boosted Machines are the _other_ tree-ensembling method in common use, where the trees are built _sequentially_ (boosted) instead of _bagged_).
Your goal should not be to know enough to write an RF implementation, but rather to have some intuition for how it works, so you can better choose when to use it or not. It's extremely unlikely it will ever make sense for you to write an RF algorithm for production use; use the great code that already exists for most languages.
I wrote this regression tree tutorial a few years back that might be a good complement to the tutorial above since it covers regression instead of classification and goes on to talk about bagging vs random forest, out-of-bag samples, and tuning parameters: https://github.com/savagedata/regression-tree-tutorial I wrote it at the start of my career and haven't shared it beyond my study group, so I'm happy to hear feedback.
I like how you talk about Conditional Inference. My thesis is supposed to overcome the brute-force exhaustive search for the best splits that Random Forest does, using statistical methods (I use Dr. Loh's GUIDE trees).
> Many implementations of random forest default to 1/3 of your predictor variables.
This is interesting. I heard it was sqrt(total number of predictors).
> Ensemble methods combine many individual trees to create one better, more stable model.
I think "stable" could be clarified as having good training accuracy and low generalization error (error rate on unseen data) compared to an individual tree. This is what Dr. Ho talks about with forests.
But other than that I think it's an awesome tutorial.
What I've seen other tree and forest methods do for better generalization on unseen data is pruning, using CV and choosing 0.5 to 1.0 standard errors as the cutoff point. That may be a thing to talk about if you are interested in it.
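For example, something along these lines with scikit-learn's cost-complexity pruning path (my own sketch; the dataset is just a stand-in, and I use a plain 1-SE cutoff):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Candidate pruning strengths from the cost-complexity pruning path.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    alphas = path.ccp_alphas

    # Cross-validate each candidate, then keep the simplest tree whose
    # CV score is within one standard error of the best one.
    scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5)
              for a in alphas]
    means = np.array([s.mean() for s in scores])
    stderrs = np.array([s.std(ddof=1) / np.sqrt(len(s)) for s in scores])

    best = means.argmax()
    threshold = means[best] - stderrs[best]  # 1-SE cutoff
    chosen = max(i for i in range(len(alphas)) if means[i] >= threshold)
    print("chosen ccp_alpha:", alphas[chosen])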
Thank you for the useful feedback! I'll have to look up GUIDE trees.
> This is interesting. I heard it was sqrt(total number of predictors).
I was probably looking at the randomForest R package documentation [1], which says:
> mtry Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)
I checked the H2O implementation of random forest [2] and they use the same defaults.
I'll add a note about the one third default being specific to regression since that seems like an important distinction.
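For a concrete feel of how different the two defaults are (a toy illustration; exact rounding varies by implementation):

    import math

    p = 12  # number of predictor variables (made up)

    mtry_classification = math.floor(math.sqrt(p))  # sqrt(p) -> 3
    mtry_regression = max(1, p // 3)                # p/3     -> 4
    print(mtry_classification, mtry_regression)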
> is there a better tutorial/course for a beginner into this field ?
I have no clue. I have a degree in Computer Science and am finishing my master's in Applied Statistics. It just happens that my skill set, statistical modeling, has a lot of overlap with Machine Learning.
But for straight-up modeling I would start with, and recommend, this book: https://otexts.org/fpp2/ (update: this book doesn't go over EDA, outliers, and such... r4ds covers EDA. For outliers and imputation, a statistics book can cover that.)
It's for time series modeling (statistically slanted), but I think it covers the modeling aspects really well and it's very introductory (Coursera has a step above that book for a little more depth in time series). To be fair... statistical models are number one-ish for univariate time series; see the M4 competition or the Uber blog post on time series.
> the end goal not being academia, but being able to think and write reasonable production code.
For the thinking part, which affects how you write code: from my experience (I could be wrong), the data science/ML approach to modeling is different from the statistics approach.
The vast majority of the time, Data Science/ML is handed the data, which is why I believe AI algorithms can be biased (see Gfycat and its problem classifying Asian faces with computer vision). Whereas in statistics you usually figure out your hypothesis, or what you're trying to answer, and then design an experiment or a strategy for collecting the data without bias, hopefully controlling for factors. Statistics also works with given data, but the vast majority of models there are slanted toward explanation rather than forecasting/prediction. DS/ML seems to care more about forecasting/prediction.
I also think how each discipline approaches modeling affects the way a person thinks within that field. I haven't figured out the unified thinking behind how the ML/DS discipline approaches models, but I'm confident it's not the same as statistics. Speaking of statistics vs applied math, though, I can give an example. For time series data, statisticians model on the assumption that all we're given is the data, and we try to extract every bit of information out of it and explain away the variance with models (each predictor explains away some of the noise; whatever is left is error/chance). It's more data-focused. Applied mathematicians instead figure out how the data is generated, e.g., their model assumes the data arose from a certain process, so they end up with stochastic processes and lean more toward probability than statistics.
So thinking affects code... and how statisticians code things is probably different from ML/DS. I've seen people call imputation witchcraft and throw temporal data into random forests >___>.
> write reasonable production code.
I do R for modeling.
If you're not creating any new fancy algorithm, you can just use packages. You train them on the data set and just ship the trained model. You can wrap it as a REST service via https://www.rplumber.io/. I'd like to think Python has something similar?
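Something roughly equivalent on the Python side might look like this (a minimal sketch; I'm assuming a scikit-learn model saved with joblib as model.joblib, and the route/field names are made up):

    # pip install flask scikit-learn joblib
    import joblib
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    model = joblib.load("model.joblib")  # previously trained model

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON body like {"features": [[...], [...]]}
        features = request.get_json()["features"]
        preds = model.predict(features).tolist()
        return jsonify({"predictions": preds})

    if __name__ == "__main__":
        app.run(port=8000)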
Do note I know very little about deep learning or how to ship that. I've chosen to specialize in statistical modeling and R for modeling.
I've started studying ML (as someone who's hoping to transition at 47 to a different career from teaching and music), and some concepts I've found easy to get, others not so much (I'm old, so give me a break!). Often I need to have a concept explained in a different manner before I'll have a 'Oh, I see!' moment, and because I don't have a great grasp on the maths needed (which I'm working on, but it's s-l-o-w), as soon as equations I can't get appear, I tend to lose focus and think I can't do it.
This post covers things that I've seen in the past, and seems to sum up my internal understanding of the concept, which is good for me as reading through it I had a number of 'see, I do get it' moments. The only criticism I have of it is that it seems to gloss over what differences the trees within the random forest have - as I understand it, they are all slightly different, and this gives them greater accuracy?
Anyway, thanks for posting it - I'll read the other posts when I get a chance.
> what differences the trees within the random forest have - as I understand it, they are all slightly different, and this gives them greater accuracy?
I can give you an example from my own work. We have a random forest on a 400+ attribute input (ie: 400 variables). All we want at the end is a probability from 0.0 to 1.0.
Our random forest model builds around 500 trees. Each tree randomly selects a small subset of those 400+ input attributes and asks "what's the best I can do using only these attributes?". Generally, it does okay. But when you average the 500 trees, the accuracy is pretty darned good.
Edit later: To be clear, each new tree is generated using its own random subset of variables. The point is that each tree may glean some insight about that small combination of variables.
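A toy illustration of that averaging step (made-up numbers; in practice the library does this for you):

    import numpy as np

    rng = np.random.default_rng(1)
    n_trees, n_samples = 500, 3

    # Pretend each of the 500 trees produced a probability in [0, 1]
    # for each sample, using only its own small subset of the attributes.
    per_tree_probs = rng.uniform(0.0, 1.0, size=(n_trees, n_samples))

    # The forest's answer is just the average across trees.
    forest_probs = per_tree_probs.mean(axis=0)
    print(forest_probs)  # one probability per sample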
Since this is a blog post it may not be completely kosher on all the details. But if you're studying from books and papers, all the notation will either be standard (set theory, Cartesian products, R^d vector spaces, lp/Lp norms) or explicitly explained.
If you're behind or fuzzy on the more basic stuff (what's an equivalence class? what's a Cartesian product?), I recommend the first two chapters of Munkres' topology book. The book builds pretty far out into uncharted territory, but its recap of the basics is rigorous and superbly explained with copious illuminating prose.
> it seems to gloss over what differences the trees within the random forest have - as I understand it, they are all slightly different, and this gives them greater accuracy?
They kinda cover it in sections 3.1 and 3.2 with Bagging and Bagging -> Random Forest, but it'd be good for them to explain Boosted Trees here as well.
As far as I understand it, random forests are an aggregation of trained trees based on randomly sampled data points from the original data set. It doesn't necessarily make them more accurate on the training dataset, but it makes them more generalised and less likely to overfit (https://en.wikipedia.org/wiki/Overfitting), because the different trees are likely to focus on different characteristics of the dataset.
Boosted trees do become more accurate, as they resample, but give more priority to data points that weren't correctly classified by the earlier models.
Just to add here that the columns, and not just the rows, are also sampled at each node, so the trees are deliberately prevented from learning too much. This helps improve diversity and reduce overfitting. The subset is typically around 1/3 of the number of columns and is controlled by the mtry parameter.
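In scikit-learn terms the two knobs look roughly like this (a sketch on a toy dataset; max_features plays the role of mtry):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    forest = RandomForestClassifier(
        n_estimators=200,
        bootstrap=True,       # rows: bootstrap-sample the training set per tree
        max_features="sqrt",  # columns: random subset considered at each split
        oob_score=True,       # evaluate on the rows each tree did NOT see
        random_state=0,
    ).fit(X, y)

    print(forest.oob_score_)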
If you understand how CART decision trees are trained, then you can see why random forests are powerful by thinking through their training process.
In the first step of decision tree training we pick the best feature split, divide the data into two groups, and then train a subtree on each group independently. But what about the 2nd-best feature split? In some sense, we lose the information the other splits could provide.
To see this: when evaluating a query, the first step is to look at that best split and pass the query down to one of the two sub-trees. But each sub-tree has only been trained on part of the training data, and thus has weaker discriminatory power. Every split down the tree has diminishing returns in terms of how much information it provides.
Now think about what the random forest does. If the feature containing the best split is available to a particular tree, the same split will be chosen. But if it isn't, then the feature for the second-best split will be chosen if it's present. If the top two features aren't present, the third-best split will be chosen, and so on.
Thus, across our forest we have representatives of a range of feature splits, each trained on more data and thus having more discriminatory power per split. The aggregation step at the end combines the information gleaned from these different models. Each one of them is weaker than the original CART decision tree, but has gotten more information out of the data for the features it was given. Thus, together, they are much better predictors than they are individually.
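A quick sketch of that "individually weaker, jointly stronger" effect, comparing one CART tree against a forest on synthetic data (purely illustrative; the exact scores depend on the data):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=8, random_state=0)

    tree = DecisionTreeClassifier(random_state=0)
    forest = RandomForestClassifier(n_estimators=300, random_state=0)

    print("single tree:", cross_val_score(tree, X, y, cv=5).mean())
    print("forest     :", cross_val_score(forest, X, y, cv=5).mean())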
When you say "real world examples", what do you have in mind? A lot of "real world" uses of random forests are basically just directly calling scikit-learn or something like that. In my neural network post, I implemented a simple neural network from scratch because I felt like it'd be valuable for beginners, but I wouldn't call that "real world".
While there’s nothing wrong with random forests, they’re a bit of a red flag for me as they’re easy to implement without any real understanding of what’s going on. A lot of junior data scientists just default to saying random forest to solve any problem because it tends to have the most predictive power of the models they’re comfortable with. That’s a bad sign.
I am probably one of the junior DS you are referring to. But I genuinely want to know the reason for using anything other than gradient boosted trees for classification on structured data.
It depends on your goal, and the nature of the problem. If you need to explain something in simplest terms, maybe use regularized logistic regression. If you need to make sensitive decisions, maybe a tree would be best because you have a clear sense of the variance in your answers at each node.
There’s nothing wrong with random forest. It’s a perfectly good model. But when it is someone’s only tool, it implies they don’t know much about the toolkit, and also don’t know how that one particular tool works.
I rarely use anything but linear models, trees and forests fwiw.
What's the bad sign there? If juniors are reaching for RF as a first-pass over say, (logistic) regression, that seems like a _great_ sign; no other approach has such a high average performance-effort ratio. For non-huge, naturally tabular problems, it usually takes me 10x or more the effort to beat the very first RF I train. Developer time matters!
If they start going to GBM's or neural nets first...I'd call _that_ a bad sign (and it happens).
There are also lots of good ways to peer into the inner workings of a tree ensemble model nowadays. It's not laid out plainly for you like a linear model, but it's not an impenetrable black box as people like to suggest.
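For example, permutation importance in scikit-learn (a sketch; the dataset is just a stand-in):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

    # How much does shuffling each column hurt the held-out score?
    result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
    for i in result.importances_mean.argsort()[::-1][:5]:
        print(i, round(result.importances_mean[i], 4))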
That’s true, but they also tend not to know them. Or worse, they point to variable importance without knowing how it works, which leads them to the wrong conclusion.
The real differentiation will come from feature engineering. Random forests and boosted trees (the other fire-and-forget model of choice) can tell you a lot about the data set from their validation performance under tuning. It might be that you've got a straightforward situation, but often the next step is to dig into what the model is doing and see if you can better optimize the features.
I was so hoping more like "random cities" and "random islands" where the forest is a collection of tall flora to be rendered for the viewer's/gamer's pleasure.