In my early journey into the murky depths of data science and machine learning I’ve come across the phrase Random Forest
a few times, and been completely clueless as to what it actually referred to. Today I decided to dive in and explore the concept.
So, what is a Random Forest? Here’s what the wise and mighty Wikipedia came up with:
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
For those, like myself, who find this to be a little bit oblique, let’s break it down, starting with the basic element of any forest: the tree.
Can’t see the Forest for the (Decision) Trees
A decision tree is the basic unit of a random forest, and chances are you already know what it is (just perhaps not by that name). A decision tree is a way to model decisions or classifications based on different criteria at each branch. The simplest version asks a binary (yes or no) question at each branch to classify a group of objects or arrive at the answer to a query.
Decision trees are actually fairly common – you’ve probably seen them crop up as memes or pseudo-serious posts – here are a few examples.
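A decision tree can also be written directly as code. The toy function below is purely illustrative (the drinks and the questions are made up, not taken from any dataset); each `if` is one yes-or-no branch, and each `return` is a leaf.

```python
def classify_drink(is_hot: bool, has_caffeine: bool) -> str:
    """Toy decision tree: every `if` is one yes/no branch, every return is a leaf."""
    if is_hot:
        if has_caffeine:
            return "coffee"
        return "herbal tea"
    if has_caffeine:
        return "cola"
    return "water"


print(classify_drink(is_hot=True, has_caffeine=False))
```

A learned decision tree has the same shape; the difference is that the questions and their order are chosen from the training data rather than written by hand.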
So, the Forest part of a Random Forest seems fairly intuitive; it’s just a bunch of these Decision Trees, right? True, but let’s get a little more nuanced and call it an ensemble, which refers to a specific method in statistics and machine learning.
Ensemble Learning
Ensemble learning is a method of using multiple models to make predictions, rather than a single constituent model. The strength of this approach is that, by combining multiple models, issues such as overfitting or high bias can be reduced. For random forests specifically, this corrects the habit decision trees have of overfitting to their training set. In other words, a decision tree by itself can make very accurate predictions for a specific training set but fails to generalize these predictions well outside of that set.
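To see that overfitting habit for yourself, here is a minimal sketch that fits one unconstrained tree on scikit-learn's bundled wine dataset (the same dataset the tutorial further down uses) and compares its accuracy on the data it was trained on with its accuracy on held-out data. The `random_state` values are my own choices, there only to make the run repeatable.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=12
)

# With no depth limit, a single tree keeps splitting until its leaves are pure
# (or cannot be split any further), so it can memorise the training set.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

A gap between the two numbers is the overfitting described above: the tree knows its training set better than it knows wine in general.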
The strength of the forest comes from the fact that it takes the mean prediction (for regression models) or the mode (most common for you non-stats types) of predictions for classification problems.
In other words, the decision trees differ from one another enough to correct for the error or overfitting that any individual tree may introduce.
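The aggregation step itself is tiny. In this sketch the per-tree predictions are invented numbers for five imaginary trees; only the mode and mean calls matter.

```python
from statistics import mean, mode

# Made-up predictions from five imaginary trees (illustration only).
tree_votes = ["red", "white", "red", "red", "white"]   # classification
tree_estimates = [12.0, 14.5, 13.0, 12.5, 13.0]        # regression

forest_class = mode(tree_votes)      # most common class wins
forest_value = mean(tree_estimates)  # average of the individual estimates

print(forest_class, forest_value)
```

Two trees voted "white", but the forest as a whole still answers "red"; that is the sense in which the other trees correct an individual tree's mistake.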
You can think about this in ecological terms. A forest with only one type of tree and no other flora or fauna is inherently weak; a single disease could wipe it out. An ecologically diverse forest is resilient to disease (or error) because it has a wide variety of flora and fauna.
Additionally, aside from supporting different types of trees in general, Random Forests benefit from the fact that there does not need to be any correlation or relationship between the decision trees that make them up; in fact, a forest functions best when there is little correlation between its trees and they have a high degree of randomness.
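There is a standard textbook formula behind this (it is not from the post above, so treat it as my addition): if every tree's prediction has the same variance sigma2, and any two trees have correlation rho, then the average of n_trees predictions has variance rho * sigma2 + (1 - rho) * sigma2 / n_trees. The sketch below just evaluates that expression under those simplifying assumptions.

```python
def ensemble_variance(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the average of n_trees equally correlated predictions.

    Assumes every tree has variance sigma2 and every pair has correlation rho.
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho}: variance of a 100-tree average = "
          f"{ensemble_variance(1.0, rho, 100):.3f}")
```

The first term does not shrink no matter how many trees you plant, which is why uncorrelated trees matter more than simply having lots of them.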
But how can a Random Forest ensure that it is, in fact, random?
Bagging
The fancy name for this technique is bootstrap aggregating. However, bagging just sounds so much better in my opinion; plus, in the Random Forest context this process typically goes by the name bagging.
Bagging is a process that selects a random subsample of the training set (drawn with replacement) for each tree and fits the tree to those examples. What does this ultimately mean for the model?
The bagging process leads to better model performance because it decreases the variance of the model without increasing the bias, AKA the “sweet spot” for machine learning.
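Here is a minimal sketch of the sampling half of bagging, using only the standard library. The row indices stand in for training examples, and `bootstrap_sample` is a name I made up for illustration; it is not a scikit-learn function.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one 'bag')."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)           # seeded only so the run is repeatable
training_rows = list(range(10))  # stand-in for 10 training examples

bags = [bootstrap_sample(training_rows, rng) for _ in range(3)]
for i, bag in enumerate(bags):
    print(f"tree {i} trains on rows {sorted(bag)}")
```

Because the draw is with replacement, each bag usually repeats some rows and leaves others out, so every tree sees a slightly different version of the training set.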
Feature Bagging
Random Forests leverage bagging in a unique and powerful way. They implement a slightly modified tree-learning approach in which a random subset of features is selected at each candidate split in a tree. Without this step, if a few features in a data set are very strong predictors, they would be chosen by many of the trees, bag after bag, and those trees would end up looking alike. The random selection of features ensures that such features do not overly distort the outcome of the model.
Random subset selection of features keeps the trees grown on different bags from becoming strongly correlated with one another.
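As a rough sketch of the idea: before each split, the tree is only allowed to look at a random handful of features. The feature names below are hypothetical, `candidate_features` is my own helper, and the square-root rule is a common heuristic rather than the only option (in scikit-learn this is controlled by the `max_features` parameter).

```python
import math
import random

# Hypothetical feature names, used only for illustration.
features = ["alcohol", "ash", "hue", "magnesium", "proline",
            "color", "acidity", "phenols", "sugar"]


def candidate_features(all_features, rng):
    """Pick roughly sqrt(n) distinct features to consider at one split."""
    k = max(1, math.isqrt(len(all_features)))
    return rng.sample(all_features, k)


rng = random.Random(0)  # seeded only so the run is repeatable
for split in range(3):
    print(f"split {split}: pick the best of {candidate_features(features, rng)}")
```

Even if one feature is far and away the strongest predictor, it is simply not on offer at many splits, so other features get a chance to shape the trees.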
Random Forest Tutorial using Scikit Learn
How, you may ask, does one cultivate such a forest? The (very) cursory example below shows how to set up a Random Forest Classifier using the Scikit-Learn package for Python.
Import the necessary libraries
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
Loading the data
We’re going to use data directly from the sklearn package – this is a very convenient trick for developing and testing the structure of a model, without loading a massive dataset.
wine = datasets.load_wine()
print(f'This dataset has {len(wine.target_names)} classes')
print(f'These are the features in this dataset:\n {wine.feature_names}')
This dataset has 3 classes
These are the features in this dataset:
 ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Define our X and y
X = wine.data
y = wine.target
Split the data into train and test sets for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)
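If you are curious how much data ends up on each side, the sketch below repeats the split and prints the shapes. Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=12
)

# Rows are samples, columns are the 13 features listed earlier.
print(f"train: {X_train.shape}, test: {X_test.shape}")
```

With only a few dozen test rows, every single misclassified wine moves the accuracy score by a couple of percentage points, which is worth keeping in mind when reading the numbers below.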
Create the Random Forest Model
I’m creating an instance of a RandomForestClassifier() using only 2 trees to generate predictions. It’s more of a small woodland grove than a forest, I suppose. The model is fit to the training set using fit().
random_forest = RandomForestClassifier(n_estimators=2)
random_forest.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=2, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
Predicting on the test set for Accuracy
Use the X_test dataset to generate predictions; then, get the accuracy of these predictions by comparing them to the y_test set.
y_predict = random_forest.predict(X_test)
print(f'Our Random Forest Classifier is {metrics.accuracy_score(y_test, y_predict)} accurate')
Our Random Forest Classifier is 0.8888888888888888 accurate
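To peek inside the grove, a fitted forest keeps its individual trees in the `estimators_` attribute. One nuance worth knowing: as far as I understand the scikit-learn documentation, its forest combines trees by averaging their predicted class probabilities rather than by counting hard votes. The sketch below refits a 2-tree forest (with a `random_state` I chose so the run is repeatable) and checks that averaging by hand gives the same probabilities.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=12
)

forest = RandomForestClassifier(n_estimators=2, random_state=0)
forest.fit(X_train, y_train)

# The fitted trees live in `estimators_`, one decision tree per entry.
print(f"trees in the forest: {len(forest.estimators_)}")

# Average the per-tree class probabilities by hand...
manual_proba = np.mean(
    [tree.predict_proba(X_test) for tree in forest.estimators_], axis=0
)
# ...and compare with what the forest itself reports.
matches = bool(np.allclose(manual_proba, forest.predict_proba(X_test)))
print(f"hand-averaged probabilities match the forest: {matches}")
```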
Adding trees to the forest
Let’s add a few more trees to the random forest and see how the prediction score is affected.
random_forest = RandomForestClassifier(n_estimators=42)
random_forest.fit(X_train, y_train)
y_predict = random_forest.predict(X_test)
print(f'Our Random Forest Classifier is {metrics.accuracy_score(y_test, y_predict)} accurate')
Our Random Forest Classifier is 0.9555555555555556 accurate
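Sure enough, in this run the accuracy of the model increased with a higher number of trees in the random forest. Because neither forest sets random_state, the exact scores will vary a little from run to run. You can find all of the above code as a Gist below.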
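A single train/test split is a fairly noisy way to compare forests, so here is a slightly more careful sketch: it tries a few forest sizes and scores each with 5-fold cross-validation on the full wine dataset. The list of sizes, `cv=5`, and `random_state=0` are all my own choices, not anything from the code above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

wine = datasets.load_wine()

mean_scores = {}
for n_trees in (2, 10, 42, 100):
    # Fixing random_state makes each forest reproducible between runs.
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, wine.data, wine.target, cv=5)
    mean_scores[n_trees] = scores.mean()
    print(f"{n_trees:>3} trees: mean accuracy {mean_scores[n_trees]:.3f}")
```

Each score here is an average over five different held-out folds, so it is less at the mercy of which wines happened to land in one test set.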
If you found this overview useful, do something nice for another human today! And as always, comment and share your thoughts.