In this blog post, we'll walk through the ultimate guide to random forests and explain exactly what a random forest is in machine learning.
If you’re new to the world of machine learning, the topic of random forests may seem overwhelming.
What are they? How do they work? In this guide, we’ll give you everything you need to know to understand what exactly a random forest is and why it’s so important to machine learning enthusiasts and professionals alike.
We’ll also share some practical tips for getting started with your own experiments using random forests!
How Does a Decision Tree Work?
A decision tree works by dividing all possible outcomes into mutually exclusive groups.
Each group contains one or more outcomes that have similar probabilities of occurring, but there is no overlap between groups.
The branches of a decision tree are split based on an attribute value, like gender or income level.
For example, if you sell something online and ask for a customer's age, a split could be broad (everyone over 18) or narrow (only people who are 25–35 years old).
Each decision tree grows from a single root node, which contains the entire data set before any splitting happens.
For example, if you asked for someone's age and got 60 as an answer, that record would start in the root node and then be routed down the branches according to the split conditions it satisfies.
Any record that doesn't satisfy a split condition simply falls down the other branch.
Because the root node holds all of the data, every record in your data set passes through it before being divided into progressively smaller groups.
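As a quick illustration (not from the original article), here is a minimal sketch of a decision tree learning yes/no splits on made-up age and income values with scikit-learn's DecisionTreeClassifier; the data and thresholds are purely illustrative.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy features: [age, income]; target: 1 = bought, 0 = did not buy (made-up data)
X = [[22, 30000], [25, 42000], [30, 50000], [35, 58000],
     [40, 61000], [50, 72000], [60, 80000], [65, 85000]]
y = [0, 1, 1, 1, 0, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned yes/no splits, e.g. a rule such as "age <= 37.5" at the root
print(export_text(tree, feature_names=["age", "income"]))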
What Are Ensemble Learning Techniques?
Ensemble learning techniques are commonly used to improve machine learning algorithms by combining their predictions and weights.
Popular ensemble learning techniques include boosting, bagging, random forests, stacking, and additive trees.
Each technique can produce models that outperform individual models in some applications.
By combining one model's output with another's, two or more weak models become effectively stronger than they would be alone.
The result of data science ensembles can also be superior to that of any single member of an ensemble.
These results can often hold regardless of which particular members were chosen for an ensemble.
Choosing how many members to use (and which members) depends on your task, but more on that below.
Bagging was developed in the mid-1990s by Leo Breiman to address weaknesses he observed in regression modeling methods and neural networks.
At training time, random subsets of the training data are drawn (sampled with replacement), and each subset is used to train its own model.
The popular random forest algorithm builds on this idea: each subset is fed into its own decision tree, which produces an individual prediction, and the trees' predictions are then combined.
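To make the bootstrap-sampling idea concrete, here is a minimal sketch using scikit-learn's BaggingClassifier with decision trees as the base learners; the iris data set (used again later in this post) stands in for real training data, and the parameter values are illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each of the 50 trees is trained on its own bootstrap sample
# drawn with replacement from the training data.
bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner (named base_estimator in scikit-learn < 1.2)
    n_estimators=50,
    bootstrap=True,
    random_state=1,
)
bagger.fit(X, y)
print("Training accuracy:", bagger.score(X, y))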
Related Article: Boosting Machine Learning Algorithm – Complete Guide
How do Ensemble Learning Techniques Work Together?
First, what exactly is an ensemble method? At its most basic, ensemble learning uses multiple weak learners (algorithms) to produce more accurate results than those produced by any of their component algorithms alone.
The result of each weak learner may be weak on its own, but taken together they make up for each other’s deficiencies.
Ensemble methods generally fall into two families: bagging and boosting.
Bagging: In bagging, or bootstrap aggregation, different samples are drawn with replacement from the original data set in order to construct multiple training subsets.
A model is trained on each subset, and the resulting models are combined using some type of voting or averaging mechanism.
Boosting: Boosting trains models iteratively, with each new model focusing on the examples the previous models got wrong.
The trained models are then combined, with weights that reflect how accurate each one turned out to be.
When applied together, these two techniques provide even better predictions than when used separately.
A third powerful technique, random forests, builds on the bagging paradigm and adds an extra layer of randomness to create even stronger ensembles.
Random forests: Random forests build many decision trees at training time, each on a random sub-sample drawn with replacement, and each considering only a random subset of features at its splits.
Each tree is subsequently used as part of a larger ensemble containing many individual decision trees.
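To make the distinction concrete, here is a hedged sketch that fits a bagged ensemble, a boosted ensemble (AdaBoost), and a random forest on the same data and compares their cross-validated accuracy; the iris data and parameter values are illustrative choices, and your scores will differ.

from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=1),
    "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=50, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=1),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(name, "mean CV accuracy:", round(scores.mean(), 3))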
Related Article: What is a Supervised Learning? – Detail Explained
What is Random Forest?
A random forest, or random decision forest, is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time.
Like any ensemble method, random forests are useful when no single model clearly outperforms the alternatives on our chosen metrics.
The main advantage of such methods over single models (e.g., neural networks) is that they can be more robust in the presence of irrelevant features.
In general, each tree in a random forest considers only a small, randomly chosen subset of the candidate predictor variables at each split rather than all of them, and interactions between predictors can still be captured within each tree.
For regression forests, we can write down the formula for the mean squared error of the predictions as: MSE(y_true, y_pred) = (1/n) * sum over i = 1..n of (y_true[i] - y_pred[i])^2.
Because the individual trees are largely decorrelated by the random sampling, averaging more of them pushes the ensemble's prediction toward its expected value and steadily shrinks this error; the improvement from each additional tree gets smaller as the forest grows, so beyond a certain point extra trees mostly add computation.
That helps explain why random forests tend to perform better than a single decision tree trained on the same data.
When implementing them, you should decide how many trees you want and test several values for m, the number of features considered at each split.
This could greatly improve the performance of your method at minimal cost and maintenance effort.
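As a sketch of that tuning advice (the candidate values below are assumptions, not recommendations), scikit-learn exposes the number of trees as n_estimators and the per-split feature subset m as max_features, so a simple grid search can test several values of each.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],   # how many trees to grow
    "max_features": [1, 2, "sqrt"],   # how many features each split may consider (m)
}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))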
How Does A Random Forest Work?
A random forest uses decision trees at training time, so let’s review exactly how those work.
Decision trees start with an input variable and then make yes/no decisions (which we call nodes or splits) based on that variable and other inputs to that node.
At each split, the condition is either true or false, and each outcome leads down a different branch (called a child node).
Splitting creates two new branches, one for when the condition is true and one for when it isn't, and the process then repeats at each of these new branches until the tree reaches its leaves.
Take the age example from earlier: hopefully you can now begin to grasp what makes up a decision tree and why it can be useful for prediction.
But a single tree's classification only gets us so far, which is why we use ensembles! We randomly choose different samples from our dataset and train a separate decision tree on each one.
Then, at testing time, we combine the individual trees' predictions into what's called an ensemble prediction.
Because they are relatively fast to train and tune compared with many alternatives such as neural networks or support vector machines, random forests have been widely used in both academic research and industry as off-the-shelf machine learning tools; they have reportedly even been used to rank news stories in large-scale search systems.
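To see the ensemble step in code, here is a minimal sketch (using the iris data as a stand-in) that asks each fitted tree in rf.estimators_ for its own prediction and then takes the majority vote; note that scikit-learn's predict actually averages the trees' class probabilities rather than counting hard votes, but the intuition is the same.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

sample = X[:1]  # a single observation to classify
# Each fitted tree in the forest makes its own prediction for the sample
# (iris labels are already 0/1/2, so the trees' outputs match the class labels).
votes = [int(tree.predict(sample)[0]) for tree in rf.estimators_]
majority = np.bincount(votes).argmax()
print("First ten tree votes:", votes[:10])
print("Majority vote:", majority, "| forest prediction:", rf.predict(sample)[0])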
How to Build a Random Forest Algorithm?
Building a random forest in Python can be accomplished by using 4 basic components:
Training Data,
Input Variables,
Output Variable, and
Classification Function.
To do so we’ll first create an object to represent our model and fit it on our training data. We’ll use 70% of our data to train and 30% to test.
Next, we import scikit-learn (sklearn), which contains the estimators for classification and regression models.
The estimator that builds our forest is RandomForestClassifier, which grows its individual trees internally using the same logic as DecisionTreeClassifier. After importing it, we instantiate our estimator with two parameters, n_estimators and max_depth.
n_estimators lets us specify how many trees make up our forest, while max_depth determines how complex each individual tree can be (a larger value allows more complex trees).
After setting these parameters, the estimator still needs training data, along with anything else you want to include, such as a feature subset or sample weights.
Now that our random forest classifier has been instantiated and configured, we can call its fit method.
Once it finishes learning from its training data, it will be ready to predict classifications for new cases/observations using its predict method.
This final piece of code demonstrates how to construct and train our forest. As a reminder, 70% of our data was used for training and the remaining 30% for testing.
At testing time, we pass our entire X_test array into the predict method; no additional preprocessing is needed at this stage.
Also note that we aren't visualizing anything here: if we wanted to inspect a sample tree, we could plot it (scikit-learn provides sklearn.tree.plot_tree for this).
For now, though, since we're only interested in getting predictions, it's fine to skip any plotting at testing time.
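Here is a hedged end-to-end sketch of that workflow, using iris as a stand-in data set and illustrative parameter values: a 70/30 split, a forest configured with n_estimators and max_depth, a call to fit, and predictions on the held-out 30%.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)  # 70% train / 30% test

rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)
rf.fit(X_train, y_train)          # learn from the training portion

predictions = rf.predict(X_test)  # classify the held-out observations
print("Test accuracy:", rf.score(X_test, y_test))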
Random Forest Algorithm in Python
Importing the required Libraries for model Building
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
Loading the iris data from sklearn.datasets python package
iris = load_iris()
Separating the data into the features (X) and the target (y)
X = iris['data']
y = iris['target']
Building the Random forest using RandomForestClassifier() function with n_estimators and random_state HyperParameters
rf = RandomForestClassifier(n_estimators=100, random_state=1)
Fitting the model on the data
rf.fit(X, y)
Finding the Accuracy of the model
print("Accuracy:", rf.score(X, y))  ## accuracy score may vary on your environment and sample data
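As a quick usage follow-up (not part of the original snippet), the fitted forest can classify a brand-new measurement with its predict method; the numbers below are made-up iris-like values.

# Classifying a new, unseen flower measurement with the fitted forest
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # sepal length/width, petal length/width (illustrative values)
print("Predicted species:", iris['target_names'][rf.predict(new_flower)[0]])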
Guide for Implementing New Algorithms:
With the first 100 trees, after 10-fold cross-validation we see that the random forest has 93% training accuracy and 87% test accuracy.
How to Interpret the Scores
Accuracy scores are obtained from 10-fold cross-validation.
For example, if you chose to perform 5-fold cross-validation, you would run the model five times, each time training on a randomly selected 80% of the data (using the remaining 20% as the test set), and report the mean accuracy across all five runs: if each run scored 0.8, the mean would be (0.8 + 0.8 + 0.8 + 0.8 + 0.8) / 5 = 0.8, in other words 80%.
Please note that for classification tasks, a given fold may contain fewer than all of the classes, because some classes may end up with zero samples in that fold.
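Here is a minimal sketch of how such 10-fold cross-validation scores can be produced and averaged with scikit-learn's cross_val_score; the data set and parameter values are illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=1)

# Train on 9 folds, test on the held-out fold, repeat ten times,
# then report the mean accuracy across the folds.
scores = cross_val_score(rf, X, y, cv=10)
print("Fold accuracies:", [round(s, 3) for s in scores])
print("Mean accuracy:", round(scores.mean(), 3))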
Working of Random Forest Algorithm
A random forest classifier works by taking an input vector of features, X = (x1, x2, …, xn), and producing an output f(X), the predicted class.
To construct each tree in the forest, start by drawing a bootstrap sample of the training data.
At each node, consider a random subset of the candidate features and evaluate the possible splits on them, associating a score with each split based on how well it separates the examples (for instance, how much it reduces class impurity).
Pick the best of these splits, create the two child nodes, and repeat the process down each branch until the leaves are nearly pure or a stopping rule is reached.
After building all K trees this way, collect their outputs for your test dataset: each tree casts a vote and the forest returns the most common class.
If there's no clear winner, the tie can be broken arbitrarily or by comparing the trees' averaged class probabilities before taking the final classification decision.
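Here is a hedged from-scratch sketch of that procedure: draw a bootstrap sample for each tree, let each tree consider a random feature subset at its splits (via max_features), and combine the trees by majority vote. It uses scikit-learn's DecisionTreeClassifier for the individual trees, and the function names are my own.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, max_features="sqrt", seed=1):
    """Grow n_trees decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n_samples, size=n_samples)  # sample rows with replacement
        tree = DecisionTreeClassifier(max_features=max_features,  # random feature subset per split
                                      random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Each tree votes; the most common class wins for every row of X."""
    votes = np.array([tree.predict(X) for tree in trees], dtype=int)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = load_iris(return_X_y=True)
forest = fit_forest(X, y)
print("Training accuracy:", (predict_forest(forest, X) == y).mean())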
Random Forest Model Validation
The standard way to evaluate an ensemble classifier such as random forests is to compute its out-of-bag (OOB) error.
If your test set does not cover all of the training data, it is also worth comparing the OOB error with leave-one-out cross-validation (LOOCV).
The latter evaluates the classifier on every data point in turn and has lower variance than evaluating on a single randomly selected subset.
For classification problems with binary outcomes and 20 or more predictors, LOOCV is usually more accurate than OOB error.
Some practitioners use LOOCV for all supervised learning tasks, even if it isn't required by publication guidelines.
It’s a good scientific practice to show how well new models generalize beyond their training sets.
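As a small sketch of the OOB check (parameter values are illustrative), scikit-learn's RandomForestClassifier can report an out-of-bag accuracy estimate directly via oob_score=True.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each tree is scored on the training rows left out of its bootstrap sample,
# which gives a built-in estimate of generalization accuracy.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            bootstrap=True, random_state=1)
rf.fit(X, y)
print("OOB accuracy estimate:", round(rf.oob_score_, 3))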
Conclusion
For classification tasks, a random forest (RF) selects a class for each new instance by looking at how each of its trees votes across the possible classes.
The randomness in a random forest stems from how different subsets of variables are selected when building individual decision trees; no tree uses all variables but rather only samples from them (hence random).
This way we worry far less about overfitting our model, because every tree looks different on its own and each one can learn something different!
DataScience Team is a group of data scientists working as IT professionals who add value to analayticslearn.com as authors. The team consists of skilled technical writers who write about several types of data science tools and technologies to build a more skillful community for learners.