How to make scikit-learn work (better) with pandas
In addition to providing common machine learning algorithms, scikit-learn allows users to build reusable pipelines that integrate data processing and model building steps into one object.
Each step in the pipeline object consists of a Transformer instance, which exposes the easy-to-use transform API. Transformers may encode logic for imputing missing values, feature engineering, model inference, and so on.
Unfortunately, scikit-learn works directly with numpy arrays or scipy sparse matrices, but not with pandas.DataFrame, which is ubiquitous in data science work. The metadata attached to a DataFrame, such as column names, is immensely helpful for debugging and model interpretation.
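To make the problem concrete, here is a minimal sketch of a two-step pipeline applied to a DataFrame. The toy data and the column names (age, income) are invented for illustration, and the behaviour shown assumes scikit-learn's default output configuration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two numeric columns, one missing value.
df = pd.DataFrame({"age": [25.0, np.nan, 47.0], "income": [40.0, 52.0, 61.0]})

# Two transformer steps chained into one reusable object.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

out = pipe.fit_transform(df)

# Under the default configuration the result is a plain numpy array,
# so the column names "age" and "income" are no longer attached to it.
print(type(out).__name__)
print(out.shape)
```

The values are imputed and scaled correctly, but nothing in the result tells you which column is which.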
How should we get around the issue discussed above? While StackOverflow is helpful as usual, in the long term I would rather use well-organized code than a snippet that I have to Google every time. Hence, I have written my own code, which is available on GitHub (notebook here) and is showcased in this article.
To be fair, there is one way that scikit-learn utilizes metadata in DataFrames: ColumnTransformer can identify DataFrame columns by their string names and direct your desired transformers to each column. Here is an example by Allison Honold on TDS.
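As a rough illustration of selecting columns by name, the sketch below routes a numeric column to a scaler and a categorical column to a one-hot encoder. The DataFrame, its column names, and the step names ("num", "cat") are made up for this example and are not taken from the linked article.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): one categorical and one numeric column.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo"],
    "age": [25.0, 31.0, 47.0],
})

# Each entry is (step name, transformer, list of DataFrame column names).
ct = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

out = ct.fit_transform(df)

# One scaled column plus one indicator column per distinct city.
print(out.shape)
```

The column names are used only to pick the inputs. The output does not carry them along.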
ColumnTransformer produces numpy arrays or scipy sparse matrices. This article will extend ColumnTransformer such that it produces pandas.DataFrame as well.