How to make scikit-learn work (better) with pandas

In addition to providing common machine learning algorithms, scikit-learn allows users to build reusable *pipelines* that integrate data processing and model building steps into one object.

Photo by iMattSmart on Unsplash.

Each step in the pipeline object consists of a Transformer instance, which exposes the easy-to-use `fit`/`transform` API. Transformers may encode logic for imputing missing values, feature engineering, model inference, and so on.
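To make the pipeline idea concrete, here is a minimal sketch of a three-step pipeline. The toy data and the step names (`impute`, `scale`, `model`) are assumptions made up for illustration; they are not taken from the author's repository.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (invented for this sketch): two numeric features, one missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

# Each step is a (name, estimator) pair; every step except the last
# must implement fit/transform.
pipe = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="mean")),  # fill the missing value
        ("scale", StandardScaler()),                 # standardize each column
        ("model", LogisticRegression()),             # final estimator
    ]
)

# One call fits every step in order.
pipe.fit(X, y)
predictions = pipe.predict(X)

# Slicing the pipeline gives the preprocessing steps on their own.
X_prepared = pipe[:-1].transform(X)
```

Note that `X_prepared` comes back as a plain numpy array, which is where the column-name problem discussed next comes from.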

Unfortunately, scikit-learn works directly with numpy arrays or scipy sparse matrices, but not with `pandas.DataFrame`, which is widespread in data science work. The metadata attached to a DataFrame, e.g. column names, is *immensely* helpful for **debugging** and **model interpretation** purposes.

How should we get around the issue discussed above? While StackOverflow is helpful as usual, in the long term I would rather use well-organized code than a snippet that I have to Google every time. Hence, I have written my own code, which is available on GitHub (notebook here) and is showcased in this article.

To be fair, there is *one* way that scikit-learn utilizes metadata in DataFrames: `ColumnTransformer` can identify DataFrame columns by their string names and direct your desired transformers to each column. Here is an example by Allison Honold on TDS.
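As a rough illustration of selecting columns by name, the sketch below routes one numeric and one categorical column to different transformers. The DataFrame contents and the labels `num` and `cat` are hypothetical, chosen only for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one numeric and one categorical column.
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "city": ["Paris", "Tokyo", "Paris", "Lima"],
    }
)

# Each entry is (label, transformer, list of column names).
ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# The input is a DataFrame, but by default the output is a numpy array
# or a scipy sparse matrix, so the column names are gone.
out = ct.fit_transform(df)
print(type(out), out.shape)
```

Here the result has four columns (one scaled, three one-hot), but nothing in `out` itself says which is which.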

Unfortunately, `ColumnTransformer` produces numpy arrays or scipy sparse matrices. This article will extend `ColumnTransformer` such that it produces `pandas.DataFrame` as well.
