How to make scikit-learn work (better) with pandas

In addition to providing common machine learning algorithms, scikit-learn allows users to build reusable *pipelines* that integrate data processing and model building steps into one object.

Photo by iMattSmart on Unsplash.

Each step in the pipeline object consists of a Transformer instance, which exposes the easy-to-use `fit`/`transform` API. Transformers may encode logic for imputing missing values, feature engineering, model inference, and so on.
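To make the idea concrete, here is a minimal sketch of such a pipeline. The toy data and the step names (`impute`, `scale`, `model`) are made up for illustration; note that in this sketch the last step is an estimator that exposes `fit`/`predict` rather than `transform`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (invented for this example): two numeric features, one missing value.
X = np.array([[1.0, 20.0], [2.0, np.nan], [3.0, 40.0], [4.0, 50.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="mean")),  # fill NaN with the column mean
        ("scale", StandardScaler()),                 # standardize each column
        ("model", LogisticRegression()),             # final estimator
    ]
)

# A single fit call runs fit/transform on each preprocessing step in order,
# then fits the final estimator on the transformed data.
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Because the whole chain lives in one object, the same preprocessing is reapplied automatically whenever `pipe.predict` is called on new data.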

Unfortunately, scikit-learn works directly with numpy arrays or scipy sparse matrices, but not with `pandas.DataFrame`, which is widespread in data science work. That is a loss, because the metadata attached to a DataFrame, e.g. column names, is *immensely* helpful for **debugging** and **model interpretation** purposes.

How should we get around the issues discussed above? While StackOverflow is helpful as usual, in the long term I would rather use well-organized code than a snippet that I have to Google every time. Hence, I have written my own code, which is available on GitHub (notebook here) and is showcased in this article.

To be fair, there is *one* way that scikit-learn utilizes metadata in DataFrames: `ColumnTransformer` can identify DataFrame columns by their string names and direct your desired transformers to each column. Here is an example by Allison Honold on TDS.
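As a quick illustration of selecting columns by name, the sketch below routes a numeric column to `StandardScaler` and a categorical column to `OneHotEncoder`. The DataFrame, its column names (`age`, `city`) and the step names (`num`, `cat`) are invented for this example and are not taken from the linked article.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy DataFrame (invented for this example).
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "city": ["Paris", "Tokyo", "Paris", "Lima"],
    }
)

ct = ColumnTransformer(
    transformers=[
        # (step name, transformer, list of DataFrame column names)
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# One scaled column for "age" plus one indicator column per distinct city.
Xt = ct.fit_transform(df)
```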

Unfortunately, `ColumnTransformer` produces numpy arrays or scipy sparse matrices. This article will extend `ColumnTransformer` such that it produces `pandas.DataFrame` as well.

