Getting Started with scikit-learn Pipelines for Machine Learning

Building a pipeline from the ground up. (All code in this post is also included in this GitHub repository.)

Why Use Pipelines?

The typical overall machine learning workflow with scikit-learn looks something like this:

  1. Load all data into X and y
  2. Use X and y to perform a train-test split, creating X_train, X_test, y_train, and y_test
  3. Fit preprocessors such as StandardScaler and SimpleImputer on X_train
  4. Transform X_train using the fitted preprocessors, and perform any other preprocessing steps (such as dropping columns)
  5. Create various models, tune hyperparameters, and pick a final model that is fit on the preprocessed X_train as well as y_train
  6. Transform X_test using the fitted preprocessors, and perform any other preprocessing steps (such as dropping columns)
  7. Evaluate the final model on the preprocessed X_test as well as y_test

Here is an example code snippet that follows these steps, using an antelope dataset (“antelope.csv”) from a statistics textbook. The goal is to predict the number of spring fawns based on the adult antelope population, annual precipitation, and winter severity. This is a very tiny dataset and should only be used for example purposes! This example skips any hyperparameter tuning, and simply fits a vanilla linear regression model on the preprocessed training data before evaluating it on the preprocessed testing data.

    # Step 0: import relevant packages
    import pandas as pd

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LinearRegression

    # Step 1: load all data into X and y
    antelope_df = pd.read_csv("antelope.csv")
    X = antelope_df.drop("spring_fawn_count", axis=1)
    y = antelope_df["spring_fawn_count"]

    # Step 2: train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=42, test_size=3)

    # Step 3: fit preprocessor
    ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
    ohe.fit(X_train[["winter_severity_index"]])

    # Step 4: transform X_train with fitted preprocessor(s), and perform
    # custom preprocessing step(s)

    train_winter_array = ohe.transform(X_train[["winter_severity_index"]])
    train_winter_df = pd.DataFrame(train_winter_array, index=X_train.index)
    X_train = pd.concat([train_winter_df, X_train], axis=1)
    X_train.drop("winter_severity_index", axis=1, inplace=True)

    # for the sake of example, this "feature engineering" step also encodes a
    # numeric column as a binary column ("low" meaning "less than 12" here)
    X_train["low_precipitation"] = [int(x < 12) for x in X_train["annual_precipitation"]]

    # Step 5: create a model (skipping cross-validation and hyperparameter tuning
    # for the moment) and fit on preprocessed training data
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Step 6: transform X_test with fitted preprocessor(s), and perform
    # custom preprocessing step(s)

    test_winter_array = ohe.transform(X_test[["winter_severity_index"]])
    test_winter_df = pd.DataFrame(test_winter_array, index=X_test.index)
    X_test = pd.concat([test_winter_df, X_test], axis=1)
    X_test.drop("winter_severity_index", axis=1, inplace=True)

    X_test["low_precipitation"] = [int(x < 12) for x in X_test["annual_precipitation"]]

    # Step 7: evaluate model on preprocessed testing data
    print("Final model score:", model.score(X_test, y_test))

An example without pipelines

The train-test split is one of the most important components of a machine learning workflow. It helps a data scientist understand model performance, particularly in terms of overfitting. A proper train-test split means that preprocessors must be fit on the training data only, then applied to the training and testing data separately, so that no information "leaks" from the testing set into the model-building process. As the example above shows, doing this by hand means repeating every preprocessing step for both X_train and X_test, which is tedious and error-prone; this is exactly the bookkeeping that pipelines automate, as sketched below.
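
For comparison, here is a minimal sketch of the same workflow rewritten with scikit-learn's Pipeline, ColumnTransformer, and FunctionTransformer. (The helper name add_low_precipitation is ours, introduced for illustration; it is not part of the snippet above.) Because all preprocessing lives inside the pipeline and the pipeline is fit only on X_train, leakage is prevented by construction:

    import pandas as pd

    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

    # load the data and split it exactly as before
    antelope_df = pd.read_csv("antelope.csv")
    X = antelope_df.drop("spring_fawn_count", axis=1)
    y = antelope_df["spring_fawn_count"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=42, test_size=3)

    # the custom feature engineering step, wrapped as a function so it can
    # live inside the pipeline (a name we introduce here for illustration)
    def add_low_precipitation(df):
        df = df.copy()
        df["low_precipitation"] = (df["annual_precipitation"] < 12).astype(int)
        return df

    # one-hot encode winter_severity_index; pass the other columns through
    # (in scikit-learn 1.2+, the `sparse` parameter is named `sparse_output`)
    preprocessor = ColumnTransformer(
        transformers=[
            ("winter", OneHotEncoder(sparse=False, handle_unknown="ignore"),
             ["winter_severity_index"]),
        ],
        remainder="passthrough",
    )

    pipeline = Pipeline(steps=[
        ("feature_eng", FunctionTransformer(add_low_precipitation)),
        ("preprocess", preprocessor),
        ("model", LinearRegression()),
    ])

    # fitting on the raw X_train runs every step in order; scoring on the raw
    # X_test reuses the transformers fitted on the training data only
    pipeline.fit(X_train, y_train)
    print("Final model score:", pipeline.score(X_test, y_test))

Calling pipeline.fit runs the feature-engineering, encoding, and modeling steps in order on the training data; pipeline.score then applies the already-fitted transformers to X_test before scoring, so there is no chance of accidentally fitting on the test set.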
