Matt Towne


Pipelines & Custom Transformers in Scikit-learn

Academic machine learning curricula tend to focus almost exclusively on models. One may argue that the model is what performs the magic. That may hold some truth, but the magic only works if the data is in the right form. To complicate matters, the ‘right form’ depends on the type of model.

Getting the data into the right form is what the industry calls preprocessing. It takes up a large share of a machine learning practitioner's time. For the engineer, preprocessing and fitting, or preprocessing and predicting, are two distinct processes, but in a production environment, when we serve the model, no distinction is made. It is only data in, prediction out. Pipelines exist to do exactly that. They integrate the preprocessing steps and the fitting or predicting into a single operation. Apart from helping to make the model production-ready, they add a great deal of reproducibility to the experimental phase.

Learning Objectives

  • What is a pipeline
  • What is a transformer
  • What is a custom transformer


Scikit-learn: Dataset transformations

From the scikit-learn documentation we have:

Dataset transformation …Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modeling and transforming the training data simultaneously.
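As a minimal sketch of the fit / transform / fit_transform contract described in that quote, the snippet below uses scikit-learn's StandardScaler on a tiny array. The data is invented purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[4.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                     # learns mean_ and scale_ from the training set
X_new_scaled = scaler.transform(X_new)  # reuses those statistics on unseen data

# fit_transform fits and transforms the same data in one call
X_train_scaled = StandardScaler().fit_transform(X_train)
```

Note that X_new is scaled with the training mean and standard deviation, not with its own.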

We will focus on two of the transformer types, namely:

Custom transformer

Although scikit-learn comes loaded with a set of standard transformers, we will begin with a custom one to understand what they do and how they work. The first thing to remember is that a custom transformer is both an estimator and a transformer, so we will create a class that inherits from both BaseEstimator and TransformerMixin. It is good practice to initialize it with super().__init__(). By inheriting, we get standard methods such as get_params and set_params (from BaseEstimator) and fit_transform (from TransformerMixin) for free. In __init__, we also want to create the model parameter or parameters we want to learn.

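from sklearn.base import BaseEstimator, TransformerMixin


class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        # Learned parameters, populated by fit()
        self.means_ = None
        self.std_ = None

    def fit(self, X, y=None):
        # Learn the per-column mean and standard deviation from a DataFrame
        X = X.to_numpy()
        self.means_ = X.mean(axis=0, keepdims=True)
        self.std_ = X.std(axis=0, keepdims=True)

        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        X[:] = (X.to_numpy() - self.means_) / self.std_

        return X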

The fit method is where “learning” takes place. Here we perform the operation based upon the training data that yields the model parameters.

In the transform method, we apply the parameters learned in fit to unseen data. Bear in mind that the preprocessing becomes part of the whole model, so during training, fit and transform are applied to the same dataset. Later, when you use the trained model, you only call transform: the parameters were learned by fit on the training dataset, but they are applied to unseen data.

It is key that the learned parameters, and hence the transformer operation, are the same regardless of the data to be applied to.
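To make this concrete, here is a minimal, self-contained sketch using a deliberately simple transformer. MeanCenterer is a hypothetical name introduced only for this example; it is not part of scikit-learn. The point is that the means learned on the training data are reused unchanged on unseen data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.means_


# Toy data, invented for illustration only
X_train = np.array([[1.0, 2.0], [3.0, 6.0]])
X_unseen = np.array([[10.0, 10.0]])

centerer = MeanCenterer()
train_out = centerer.fit_transform(X_train)  # fit_transform is inherited from TransformerMixin
unseen_out = centerer.transform(X_unseen)    # uses the training means, not those of X_unseen
```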

Standard Transformers

Scikit-learn comes with a variety of standard transformers out of the box. Given their almost unavoidable use, you should be familiar with standardization (mean removal and variance scaling) and SimpleImputer for numerical data, and with categorical feature encoding for categorical data, especially one-of-K, also known as one-hot encoding.
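A short sketch of these three standard transformers on toy data follows. The arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration only
ages = np.array([[20.0], [np.nan], [40.0]])
colors = np.array([["red"], ["blue"], ["red"]])

# Fill the missing age with the column mean (here 30.0)
ages_imputed = SimpleImputer(strategy="mean").fit_transform(ages)

# Remove the mean and scale to unit variance
ages_scaled = StandardScaler().fit_transform(ages_imputed)

# One-hot encode; categories are sorted, so the columns are blue, red
colors_encoded = OneHotEncoder().fit_transform(colors).toarray()
```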

The pipeline

Chaining estimators

Remember that a transformer is an estimator, but so is your model (logistic regression, random forest, etc.). Think of a pipeline as stacking steps vertically. Here order matters, so you want to put the preprocessing before the model. The key is that the output of one step is the input of the next.
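A minimal sketch of such a chain, assuming the iris dataset that ships with scikit-learn and a logistic regression as the final model. The step names "scaler" and "model" are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Steps run in order: the scaler's output is the model's input
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X, y)                    # fits the scaler, transforms X, then fits the model
predictions = pipe.predict(X[:5]) # transforms with the fitted scaler, then predicts
```

Calling fit and predict on the pipeline is all a serving environment needs: data in, prediction out.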

FeatureUnion: composite feature spaces

Often you want to apply different transformations to different features. For instance, the transformations required for numerical and categorical data are not the same. It is as if you had two parallel branches, or as if the steps were stacked horizontally.

Every parallel branch receives the same input. So each branch has to begin by selecting the features relevant to its transformation (for example, the numerical features or the categorical features).
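The sketch below illustrates the idea with two parallel branches. ColumnSelector is a hypothetical helper written for this example (it is not a scikit-learn class), and the DataFrame is invented.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keeps only the named DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.columns]


# Toy data, invented for illustration only
df = pd.DataFrame({"height": [1.5, 1.7, 1.9], "color": ["red", "blue", "red"]})

# Both branches receive the full DataFrame and start by selecting their columns
features = FeatureUnion([
    ("numerical", Pipeline([("select", ColumnSelector(["height"])),
                            ("scale", StandardScaler())])),
    ("categorical", Pipeline([("select", ColumnSelector(["color"])),
                              ("encode", OneHotEncoder())])),
])

combined = features.fit_transform(df)  # 1 scaled column + 2 one-hot columns
```

scikit-learn also provides ColumnTransformer in sklearn.compose, which handles the column selection itself and may be more convenient for this use case.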

Pipelines and Custom Transformers in scikit-learn

This article will cover:

  • Why another tutorial on Pipelines?
  • Creating a Custom Transformer from scratch, to include in the Pipeline.
  • Modifying and parameterizing Transformers.
  • Custom target transformation via TransformedTargetRegressor.
  • Chaining everything together in a single Pipeline.
  • Link to download the complete code from GitHub.

There’s a video walkthrough of the code at the end for those who prefer the format. I personally like written tutorials, but I’ve had requests for video versions too in the past, so there it is.
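One of the topics listed above, target transformation with TransformedTargetRegressor, can be sketched as follows. This is not the code from that article, only a small illustration on synthetic data in which the logarithm of the target is exactly linear in the feature.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data, invented for illustration: log(y) is linear in X
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.exp(0.3 * X.ravel())

# The regressor is fitted on func(y); predictions are mapped back with inverse_func
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)
y_pred = model.predict(X)  # already on the original scale of y
```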

Michael Hamill


Scikit-Learn Is Still Rocking, Been Introduced To French President

A milestone for open source projects: French President Emmanuel Macron has recently been introduced to Scikit-learn. In a recent tweet, Gael Varoquaux, Scikit-learn creator and tenured research director at Inria, announced the presentation of Scikit-Learn, with applications of machine learning in digital health, to the president of France.

He stated the advancement of this free software machine learning library — “started from the grassroots, built by a community, we are powering digital revolutions, adding transparency and independence.”

Vaughn Sauer


Top Free Resources To Learn Scikit-Learn

Scikit-Learn is one of the popular machine learning libraries. It is built on top of NumPy, SciPy, and Matplotlib, supports supervised and unsupervised learning, and provides various tools for model fitting, data preprocessing, model selection, and evaluation.

Scikit-Learn Tutorials

About: From the developers of Scikit-Learn, this tutorial provides an introduction to machine learning with Scikit-Learn. It includes topics such as problem setting, loading an example dataset, learning and predicting. The tutorial is suitable for both beginners and advanced students.

Perform Sentiment Analysis with Scikit-Learn

About: In this project-based course, you will learn the fundamentals of sentiment analysis, and build a logistic regression model to classify movie reviews as either positive or negative. You will learn how to develop and employ a logistic regression classifier using Scikit-Learn, perform feature extraction with The Natural Language Toolkit (NLTK), tune model hyperparameters, and evaluate model accuracy.

Python Machine Learning: Scikit-Learn Tutorial

About: Python Machine Learning: Scikit-Learn tutorial will help you learn the basics of Python machine learning. You will learn how to use Python and its libraries to explore your data with the help of Matplotlib and Principal Component Analysis (PCA). You will also learn how to work with the KMeans algorithm to construct an unsupervised model, fit this model to your data, predict values, and validate the model.

Scikit Learn Tutorial | Machine Learning with Python

About: Edureka’s video tutorial introduces machine learning in Python. It takes you through regression and clustering techniques, along with a demo of SVM classification on the famous iris dataset. The video also introduces Scikit-learn, shows how to install it, and explains how machine learning works, among other things.

Regression using Scikit-Learn

About: In this Coursera offering, you will learn about Linear Regression, Regression using Random Forest Algorithm, Regression using Support Vector Machine Algorithm. Scikit-Learn provides a comprehensive array of tools for building regression models.

Machine Learning with Scikit-Learn Tutorial

About: In this course, you will learn about machine learning, algorithms, and how Scikit-Learn makes it all so easy. You will get to know the machine learning approach, the jargon needed to understand a dataset, the features of supervised and unsupervised learning models, and algorithms such as regression, classification, clustering, and dimensionality reduction.

Predict Sales Revenue with Scikit-Learn

About: In this two-hour long project-based course, you will build and evaluate a simple linear regression model using Python. You will employ the Scikit-Learn module for calculating the linear regression while using pandas for data management and seaborn for plotting. By the end of this course, you will be able to build a simple linear regression model in Python with Scikit-Learn, employ Exploratory Data Analysis (EDA) to small data sets with seaborn and pandas.

SciPy 2016 Scikit-learn Tutorial

About: This tutorial is available on GitHub. It includes an introduction to machine learning with sample applications, data formats, preparation and representation, supervised learning: training and test data, the Scikit-Learn estimator interface and more.

Build NLP pipelines using Scikit-Learn

About: This is a two-hour long project-based course, where you will understand the business problem and the dataset and learn how to generate a hypothesis to create new features based on existing data. You will learn to perform text pre-processing and create custom transformers to generate new features. You will also learn to implement an NLP pipeline, create custom transformers and build a text classification model.

Shawn Durgan


How Digital Transformation Is Redefining Customer Experience

Digital transformation gives a personalized look into customers’ purchasing habits, along with their likes and dislikes, making it easy for brands to provide a tailor-made premium customer experience based on personal preferences and unspoken needs.

Ever since smartphones became a part and parcel of human life, people have been a part of a digital network that connects them to friends, businesses, colleagues, and peers.

People don’t just buy products now, they connect with brands, register on their web portals, use their application, give email addresses, and phone numbers at cash counters.

They expect brands to understand their individual needs & answer back when they complain. This has encouraged brands to embrace digital transformation and reinvent customer success.

Writing custom scikit-learn transformers

Scikit-learn’s transformer API is a great tool for data cleaning, preprocessing, feature engineering, and extraction. Sometimes, however, none of the wide range of available transformers matches the specific problem at hand. On these occasions, it is handy to be able to write one yourself. Luckily, it’s straightforward to leverage scikit-learn’s classes to build a transformer that follows the package’s conventions and can be included in scikit-learn pipelines.

Problem setting

To make it practical, let’s look at an example. We have a data set called TAO which stands for Tropical Atmosphere Ocean. It contains some weather measurements such as temperature, humidity, or wind speed. A subsample of these data comes with the R library VIM. Here, we are working with a slightly preprocessed version.

A quick look at the data frame tells us there is a substantial number of missing values in the air_temp variable, which we will need to impute before modeling.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 733 entries, 0 to 732
Data columns (total 8 columns):
year                733 non-null int64
latitude            733 non-null int64
longitude           733 non-null int64
sea_surface_temp    733 non-null float64
air_temp            655 non-null float64
humidity            642 non-null float64
uwind               733 non-null float64
vwind               733 non-null float64
dtypes: float64(5), int64(3)
memory usage: 45.9 KB

Scikit-learn offers imputing transformers such as SimpleImputer, which fills in a variable’s missing values with its mean, median, or some other quantity. However, such imputation is known to distort the relations between variables in the data.

But look, there is another variable called sea_surface_temp with no missing values! We could expect the water temperature to be highly correlated with air temperature! Let’s plot these two variables against each other.

As we expected, there is a clear linear relationship. Also, we can see why mean or median imputation makes no sense: setting air temperature to its median value of 24.5 degrees for observations where the water temperature is 22 or 29 completely destroys the relation between these two variables.

It seems that a good strategy for imputing air_temp would be to use linear regression with sea_surface_temp as a predictor. Since scikit-learn version 0.21, we can use IterativeImputer (flagged as experimental) and pass LinearRegression as its estimator. However, by default this will use all the other variables in the data as predictors, while we only want the water temperature. Let’s write our own transformer to achieve this.
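For reference, the IterativeImputer route mentioned above might look like the sketch below. The small DataFrame is invented and only borrows the column names from the TAO data. Because IterativeImputer is experimental, it has to be enabled explicitly before it can be imported.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the next import
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration only
df = pd.DataFrame({
    "sea_surface_temp": [22.0, 24.0, 26.0, 28.0, 29.0],
    "air_temp": [21.5, 23.5, np.nan, 27.5, 28.5],
    "humidity": [80.0, 82.0, 85.0, 88.0, 90.0],
})

# By default every other column (here also humidity) is used as a predictor
imputer = IterativeImputer(estimator=LinearRegression(), random_state=0)
imputed = imputer.fit_transform(df)  # NumPy array with the missing air_temp filled in
```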

A custom transformer

A scikit-learn transformer should be a class implementing three methods:

  • fit(), which simply returns self,
  • transform(), which takes the data X as input and performs the desired transformations,
  • fit_transform(), which is added automatically if you include TransformerMixin as a base class.

On top of these, we have the __init__() to capture the parameters - in our example the indices of air and water temperature columns. We can also include BaseEstimator as a base class, which will allow us to retrieve the parameters from the transformer object.
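Putting these pieces together, one possible sketch of such a transformer is shown below. It is not necessarily the implementation the article goes on to build: the class name RegressionImputer and its parameters target_col and predictor_col are assumptions made for this example, and, unlike a fit() that simply returns self, this version learns the regression in fit() so that it can be reused on unseen data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression


class RegressionImputer(BaseEstimator, TransformerMixin):
    """Hypothetical imputer: fills one column from a linear fit on another."""

    def __init__(self, target_col, predictor_col):
        self.target_col = target_col        # index of the column with missing values
        self.predictor_col = predictor_col  # index of the column used as predictor

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        observed = ~np.isnan(X[:, self.target_col])
        # Learn the regression on rows where the target is observed
        self.model_ = LinearRegression().fit(
            X[observed][:, [self.predictor_col]], X[observed, self.target_col]
        )
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy, so the input is left untouched
        missing = np.isnan(X[:, self.target_col])
        if missing.any():
            X[missing, self.target_col] = self.model_.predict(
                X[missing][:, [self.predictor_col]]
            )
        return X


# Toy data: column 0 is water temperature, column 1 is air temperature
data = np.array([[22.0, 21.5], [24.0, 23.5], [26.0, np.nan], [28.0, 27.5]])
filled = RegressionImputer(target_col=1, predictor_col=0).fit_transform(data)
```

Because the class inherits from BaseEstimator, get_params() returns target_col and predictor_col, and fit_transform() comes from TransformerMixin.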
