If you have ever combined multiple ML models to boost your score on a Kaggle leaderboard, you know what this is about. In fact, many winning Kaggle solutions use ensemble models rather than a single fine-tuned model.

The intuition behind ensemble models is simple: combining multiple ML algorithms effectively reduces the risk of an unfortunate choice of a single poor model.

In this post, I will discuss Stacking, a popular ensemble method, and how to implement a simple 2-layer stacking regression model in Python using the _mlxtend_ library. The sample task I have chosen is Airbnb pricing prediction.


Table of Curiosities

  1. What is stacking and how does it work?
  2. How does the stacking model perform compared with the base models?
  3. What are the caveats of stacking?
  4. What can we do next?

Overview

Stacking, or Stacked Generalization, is a meta-learning algorithm that learns how to combine the predictions of each base algorithm in the best way [1].

In simple terms, here is how you build a 2-layer stacking model. The first layer, sometimes called the base layer, contains the base models. In the context of classification, the base models can be any classifiers you would use to make predictions, such as a neural network, an SVM, or a decision tree; the same logic applies to regression. After these models are trained on the training set, the second layer builds the meta-model from their predictions and the corresponding expected outputs on out-of-sample data. In the simplest case, the meta-model just averages the base models' predictions. A minimal sketch of this two-layer setup is shown below.
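To make the two layers concrete, here is a minimal sketch using mlxtend's `StackingRegressor` on synthetic data; the base models (Ridge, SVR, a decision tree) and the data are placeholders rather than the models used later for the Airbnb task. Note that plain `StackingRegressor` fits the meta-model on the base models' training-set predictions; the cross-validated variant appears after the next paragraph.

```python
# A minimal 2-layer stacking sketch with mlxtend's StackingRegressor.
# The base models and the synthetic data are placeholders.
from mlxtend.regressor import StackingRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Layer 1: the base models; Layer 2: a linear meta-model that combines their predictions.
base_models = [Ridge(alpha=1.0), SVR(kernel="rbf"), DecisionTreeRegressor(max_depth=5)]
stack = StackingRegressor(regressors=base_models, meta_regressor=LinearRegression())

stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))  # R^2 of the stacked model on held-out data
```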

A common approach to avoid over-fitting is to perform cross-validation on these base models and then use the out-of-fold predictions, together with the corresponding true outputs, to build the meta-model, as in the sketch below.
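mlxtend provides this cross-validated variant as `StackingCVRegressor`. Below is a minimal sketch, again on synthetic placeholder data, with Lasso, Ridge, and a random forest as stand-in base models:

```python
# StackingCVRegressor runs k-fold CV on the base models and trains the
# meta-model on their out-of-fold predictions, the safeguard described above.
from mlxtend.regressor import StackingCVRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack_cv = StackingCVRegressor(
    regressors=(Lasso(alpha=0.1), Ridge(alpha=1.0), RandomForestRegressor(n_estimators=100)),
    meta_regressor=LinearRegression(),
    cv=5,  # 5-fold CV generates the out-of-fold predictions for the meta-model
)
stack_cv.fit(X_train, y_train)
y_pred = stack_cv.predict(X_test)
```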

Check out David Wolpert's paper for more technical details on how stacking works.

I have chosen the Airbnb pricing prediction task from Kaggle (link) as the example here because it is a relatively simple dataset with a small sample size and feature set. The task is to predict the listing price of an apartment rental on Airbnb based on information such as the number of bedrooms, the type of rental (entire apartment, private room, or shared room), and so on.
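To give a rough idea of the preprocessing, here is a hedged sketch; the file name and column names (`bedrooms`, `accommodates`, `room_type`, `price`) are illustrative assumptions and may not match the actual Kaggle file, which you can find in the kernel linked below.

```python
# Hypothetical preprocessing sketch; the file and column names are assumptions,
# not necessarily the exact schema of the Kaggle dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("airbnb_listings.csv")         # hypothetical file name
df = pd.get_dummies(df, columns=["room_type"])  # one-hot encode the rental type

features = ["bedrooms", "accommodates"] + [c for c in df.columns if c.startswith("room_type_")]
X = df[features].fillna(0)
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```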

To see the full Python code, check out my Kaggle kernel.

Without further ado, let’s get to the details!

#regression #machine-learning #python #ensemble-learning #airbnb
