Combining tree-based models with a linear baseline model to improve extrapolation

This post is a short introduction to combining different machine learning models in practice, so that the strengths of one offset the weaknesses of the other. We will ensemble a random forest, a powerful non-linear, non-parametric, tree-based all-rounder, with a classical linear regression model, which is easy to interpret and can be verified using domain knowledge.

For many problems, gradient boosting or random forests are the go-to models. They often outperform other models because they can learn almost any linear or non-linear relationship. Nevertheless, one disadvantage of tree models is that they handle inputs outside the training range poorly: each tree predicts an average of training targets, so its output cannot leave the range of targets it has seen, and the model extrapolates badly. In practice this can lead to undesired behaviour, for example when predicting time, distances or cost, as outlined shortly.

We can quickly verify that the sklearn random forest implementation learns the identity function very well on the training range (0 to 49), but fails badly for values outside that range:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
X = np.arange(0, 100).reshape(-1, 1)
y = np.arange(0, 100)
# train on the first 50 points (0 to 49)
model.fit(X[:50], y[:50])
# predict for the full range (0 to 99), half of it unseen
plt.ylim(0, 100)
# x and y must be passed as keywords in recent seaborn versions
sns.lineplot(x=X.ravel(), y=model.predict(X))
plt.show()
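
The same behaviour can be checked numerically, without a plot. The snippet below is a minimal sketch, not necessarily the approach developed later in the post: it confirms that the forest's predictions stay flat beyond the training range, and then tries one possible way of adding a linear baseline, namely fitting the forest on the residuals of a linear regression. The residual-fitting scheme and the names `baseline` and `residual_forest` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(0, 100).reshape(-1, 1)
y = np.arange(0, 100, dtype=float)
X_train, y_train = X[:50], y[:50]

# Forest alone: every prediction is an average of training targets,
# so it cannot exceed the largest target seen in training (49).
forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)
forest_pred = forest.predict(X)

# Assumed combination (illustrative only): a linear baseline captures the
# trend, and a second forest is fitted on what the baseline leaves over.
baseline = LinearRegression().fit(X_train, y_train)
residuals = y_train - baseline.predict(X_train)
residual_forest = RandomForestRegressor(random_state=0).fit(X_train, residuals)
combined_pred = baseline.predict(X) + residual_forest.predict(X)

print("forest alone, largest prediction:", forest_pred.max())
print("forest alone, prediction at x=99:", forest_pred[-1])
print("baseline + residual forest, prediction at x=99:", combined_pred[-1])
```

On this toy problem the linear baseline already fits the identity exactly, so the residual forest has almost nothing to learn. With real data the forest would model the non-linear structure left in the residuals, while the baseline alone determines how the prediction continues outside the training range.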


towardsdatascience.com
