Getting Started with scikit-learn Pipelines for Machine Learning: Building a pipeline from the ground up. (All code in this post is also included in this GitHub repository.)
The typical overall machine learning workflow with scikit-learn looks something like this:

1. Load all data into X and y
2. Use X and y to perform a train-test split, creating X_train, X_test, y_train, and y_test
3. Fit preprocessors on X_train
4. Transform X_train using the fitted preprocessors, and perform any other preprocessing steps (such as dropping columns)
5. Create a model and fit it on the preprocessed X_train as well as y_train
6. Transform X_test using the fitted preprocessors, and perform any other preprocessing steps (such as dropping columns)
7. Evaluate the model on the preprocessed X_test as well as y_test
Here is an example code snippet that follows these steps, using an antelope dataset (“antelope.csv”) from a statistics textbook. The goal is to predict the number of spring fawns based on the adult antelope population, annual precipitation, and winter severity. This is a very tiny dataset and should only be used for example purposes! This example skips any hyperparameter tuning, and simply fits a vanilla linear regression model on the preprocessed training data before evaluating it on the preprocessed testing data.
# Step 0: import relevant packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# Step 1: load all data into X and y
antelope_df = pd.read_csv("antelope.csv")
X = antelope_df.drop("spring_fawn_count", axis=1)
y = antelope_df["spring_fawn_count"]

# Step 2: train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=3)

# Step 3: fit preprocessor
ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
ohe.fit(X_train[["winter_severity_index"]])

# Step 4: transform X_train with fitted preprocessor(s), and perform
# custom preprocessing step(s)
train_winter_array = ohe.transform(X_train[["winter_severity_index"]])
train_winter_df = pd.DataFrame(train_winter_array, index=X_train.index)
X_train = pd.concat([train_winter_df, X_train], axis=1)
X_train.drop("winter_severity_index", axis=1, inplace=True)
# for the sake of example, this "feature engineering" encodes a numeric column
# as a binary column also ("low" meaning "less than 12" here)
X_train["low_precipitation"] = [int(x < 12) for x in X_train["annual_precipitation"]]

# Step 5: create a model (skipping cross-validation and hyperparameter tuning
# for the moment) and fit on preprocessed training data
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: transform X_test with fitted preprocessor(s), and perform
# custom preprocessing step(s)
test_winter_array = ohe.transform(X_test[["winter_severity_index"]])
test_winter_df = pd.DataFrame(test_winter_array, index=X_test.index)
X_test = pd.concat([test_winter_df, X_test], axis=1)
X_test.drop("winter_severity_index", axis=1, inplace=True)
X_test["low_precipitation"] = [int(x < 12) for x in X_test["annual_precipitation"]]

# Step 7: evaluate model on preprocessed testing data
print("Final model score:", model.score(X_test, y_test))
An example without pipelines
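For comparison, the same workflow can be bundled into a scikit-learn Pipeline, with a ColumnTransformer handling the one-hot encoding and a FunctionTransformer handling the custom precipitation flag. This is a minimal sketch, not the original post's code: it substitutes a small synthetic DataFrame (with made-up values, but the same column names as the example above) for "antelope.csv" so it runs on its own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Synthetic stand-in for antelope.csv (values are invented for illustration)
antelope_df = pd.DataFrame({
    "adult_antelope_population": [9.2, 8.5, 9.1, 8.9, 8.0, 7.9, 8.5, 9.6],
    "annual_precipitation": [13.2, 11.5, 10.8, 12.3, 12.6, 10.6, 14.1, 11.2],
    "winter_severity_index": [2, 3, 4, 2, 3, 5, 1, 3],
    "spring_fawn_count": [2.9, 2.4, 2.0, 2.3, 3.2, 1.9, 3.4, 2.1],
})
X = antelope_df.drop("spring_fawn_count", axis=1)
y = antelope_df["spring_fawn_count"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=3)

def add_low_precipitation(df):
    # Same "feature engineering" as the manual step above:
    # flag rows where annual_precipitation is below 12
    df = df.copy()
    df["low_precipitation"] = (df["annual_precipitation"] < 12).astype(int)
    return df

preprocessor = ColumnTransformer(
    transformers=[
        ("winter_ohe", OneHotEncoder(handle_unknown="ignore"),
         ["winter_severity_index"]),
    ],
    remainder="passthrough",  # leave the other columns untouched
)

pipeline = Pipeline(steps=[
    ("feature_eng", FunctionTransformer(add_low_precipitation)),
    ("preprocess", preprocessor),
    ("model", LinearRegression()),
])

# The fit-on-train / transform-both bookkeeping now happens internally:
# pipeline.fit fits every step on the training data only, and
# pipeline.score transforms X_test with those already-fitted steps
pipeline.fit(X_train, y_train)
print("Final model score:", pipeline.score(X_test, y_test))
```

Note that nothing about the preprocessing logic changed; the pipeline just replaces the duplicated transform code for X_train and X_test with a single object that can be fit, scored, and later cross-validated as one unit.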
The train-test split is one of the most important components of a machine learning workflow. It helps a data scientist understand model performance, particularly in terms of overfitting. A proper train-test split means that we have to perform the preprocessing steps on the training data and testing data separately, so there is no “leakage” of information from the testing set into the training set.
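To make the leakage risk concrete, here is a small illustration (not from the original post) using StandardScaler on random data: when the scaler is fit on the full dataset, the test rows influence the mean and variance that are then applied to the training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=2.0, size=(20, 1))
X_train, X_test = train_test_split(X, random_state=42, test_size=5)

# Correct: scaling statistics come from the training data only
scaler = StandardScaler().fit(X_train)

# Leaky: scaling statistics are influenced by the held-out test rows
leaky_scaler = StandardScaler().fit(X)

print("train-only mean:", scaler.mean_[0])
print("leaky mean:     ", leaky_scaler.mean_[0])
```

The two means differ, so the "leaky" version hands the model a slightly different (and unfairly informed) view of the training data, which can make evaluation look better than it would be on genuinely unseen data.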