Making scikit-learn work (better) with pandas


Never lose track of column names again after feature transformation

Why bother?

In addition to providing common machine learning algorithms, scikit-learn allows users to build reusable pipelines that integrate data processing and model building steps into one object.


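Each step in the pipeline object is an instance that exposes the easy-to-use fit/transform API: the intermediate steps are transformers, and the final step may be an estimator such as a model. Transformers may encode logic for imputing missing values, feature engineering, and so on, while the final step takes care of model fitting and inference.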
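To make the idea concrete, here is a minimal sketch of such a pipeline. The toy data and the step names ("impute", "scale", "model") are made up for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumed for illustration): two numeric features, one missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

# Intermediate steps are transformers; the last step is the model.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),  # fill the missing value
    ("scale", StandardScaler()),                 # standardize each feature
    ("model", LogisticRegression()),             # final estimator
])

# One call fits every step in order; predict reuses the fitted steps.
pipe.fit(X, y)
predictions = pipe.predict(X)
print(predictions.shape)
```

The whole object can then be saved, cross-validated, or reused on new data as a single unit, which is the main appeal of pipelines.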

Unfortunately, scikit-learn transformers return numpy arrays or scipy sparse matrices by default, rather than the pandas.DataFrame objects that are widespread in data science work. The metadata attached to a DataFrame, e.g. column names, is immensely helpful for debugging and model interpretation, and it is lost along the way.
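A small illustration of the problem, using a made-up DataFrame (the column names "age" and "income" are assumptions for the example): a DataFrame goes in, but what comes out with default settings is a plain array with no column names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with meaningful column names.
df = pd.DataFrame({"age": [25.0, 32.0, 47.0], "income": [40.0, 55.0, 80.0]})

# With default settings the result is a numpy array, not a DataFrame.
scaled = StandardScaler().fit_transform(df)

print(type(scaled))                # the container type of the output
print(hasattr(scaled, "columns"))  # whether column names survived
```

After this step, the only way to know that the first column is "age" is to remember the original column order.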

How should we get around the issue discussed above? While StackOverflow is helpful as usual, in the long term I would rather use well-organized code than a snippet that I have to Google every time. Hence, I have written my own code, which is available on GitHub (notebook here) and is showcased in this article.

What about sklearn.compose.ColumnTransformer?

To be fair, there is one way that scikit-learn utilizes metadata in DataFrames: ColumnTransformer can identify DataFrame columns by their string names and direct your desired transformer to each of them. Here is an example by Allison Honold on TDS.
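As a quick sketch of that behaviour (with a made-up DataFrame and made-up transformer labels "num" and "cat"), columns are selected by name and each group gets its own transformer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one numeric and one categorical column.
df = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, 51.0],
    "city": ["Paris", "Lima", "Paris", "Oslo"],
})

# Columns are referenced by their string names in the DataFrame.
ct = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age"]),   # scale the numeric column
    ("cat", OneHotEncoder(), ["city"]),   # one-hot encode the categorical column
])

# One scaled column plus one column per distinct city.
out = ct.fit_transform(df)
print(out.shape)
```

The input is addressed by name, but the output is again a numpy array or scipy sparse matrix without any column labels.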

Unfortunately, ColumnTransformer produces numpy arrays or scipy sparse matrices. This article will extend ColumnTransformer such that it produces pandas.DataFrame as well.
