Tip It’s recommended that you should read the following articles regarding “Decision Tree: Regression Trees” and ”Decision Trees Classifier” before going deep in ensemble methods.
The basic idea behind ensemble methods is to build a group or ensemble of different predictive models -weak models- (each of these models is independent of the others) and then combine their outputs in some way to get a precise final predicted value.
As using different algorithms to train different models (independent models) and combining their results is very very difficult, ensemble methods uses only one algorithm to train all of these models in a way to be completely independent.
So, ensemble methods employ a hierarchy of two algorithms.
_base learner_
. The base learner is a single machine learning algorithm that gets used for all of the models that will get aggregated into an ensemble.Many algorithms could be used as base-learner such as linear regression, SVM, but the in practical, binary decision-trees algorithm was chosen to be the default base learner of the ensemble methods.
The ensembles may consists of a contribution of hundreds or may thousands of binary decision trees, and of course inherit their properties from it.
There are two ensemble learning paradigms: bagging (abbreviation of bootstrap aggregation
) and boosting. The bagging paradigm is behind the random forest algorithm whereas boosting paradigm is behind gradient boosting algorithm.
Where: _𝑦_̂ : is the average predicted output and _𝑆 _: is number of samples.
Fig.1 Bagging Algorithm
The random forest algorithm has an advantage over bagging algorithm that, the training samples don’t incorporate all the attributes (columns) of the original data set, instead they take a random subset of them also.
Random forest is one of the most widely used ensemble learning algorithms. Why is it so effective? The reason is that by using multiple samples of the original dataset, we reduce the variance of the final model. Remember that the low variance means low overfitting. Overfitting happens when our model tries to explain small variations in the dataset because our dataset is just a small sample of the population of all possible examples of the phenomenon we try to model.
Gradient boosting develops an ensemble of tree-based models by training each of the trees in the ensemble on different labels and then combining the trees.
Where the objective is to minimize residuals (errors), each successive tree is trained on the errors left over by the collection of earlier trees. Let’s see how it works?
2. Then replace labels of each example (instance) i = 1, . . . , N in the training data set with the result of the following formula:
Where _𝑦_̂ 𝑖 , called the residual, is the new label for example 𝑥𝑖, 𝑦𝑖 is actual output and 𝑓(𝑥𝑖) is the first predicted output of boosting model (calculated mean value which is fixed through all values of X in this case).
3. Now we use the modified training set, with residuals instead of original labels, to build a new decision tree model, _𝑓_1
4. Get output of _𝑓_1 decision-tree then the boosting model is now defined as:
where 𝛼 is the learning rate (a hyper-parameter with value 0→1).
5. Repeat point-2 to get new residuals and again replace the labels in our training data set with these new residuals.
6. Build a new decision tree model, _𝑓_2 based on new data set and get results.
7. The new boosting model is now redefined as:
8. The process continues until the maximum of _𝑀 _(another hyper-parameter) trees are combined.
9. For this reason gradient boosting is called sequential model
Fig.2 Gradient Boosting Regression
So, what’s happening here? Boosting algorithm has applied the policy of reducing the residuals (errors) that results from each base-learner gradually.
#ensemble #gradient-boosting #random-forest #algorithms