1591799400

This article will cover:

Why another tutorial on Pipelines?

Creating a Custom Transformer from scratch, to include in the Pipeline.

Modifying and parameterizing Transformers.

Custom target transformation via TransformedTargetRegressor.

Chaining everything together in a single Pipeline.

Link to download the complete code from GitHub.

There’s a video walkthrough of the code at the end for those who prefer that format. I personally like written tutorials, but I’ve had requests for video versions in the past, so here it is.

1618278600

A milestone for open source projects: French President Emmanuel Macron has recently been introduced to Scikit-learn. In a recent tweet, Scikit-learn creator and Inria tenured research director Gael Varoquaux announced the presentation of Scikit-learn, with applications of machine learning in digital health, to the president of France.

He described the progress of this free machine learning library in these words: “started from the grassroots, built by a community, we are powering digital revolutions, adding transparency and independence.”

1622792520

Scikit-Learn is one of the most popular open-source machine learning libraries. It is built on top of NumPy, SciPy, and Matplotlib. It supports supervised and unsupervised learning and provides tools for model fitting, data preprocessing, model selection, and evaluation.

**About:** From the developers of Scikit-Learn, this tutorial provides an introduction to machine learning with Scikit-Learn. It includes topics such as problem setting, loading an example dataset, learning and predicting. The tutorial is suitable for both beginners and advanced students.

**About:** In this project-based course, you will learn the fundamentals of sentiment analysis and build a logistic regression model to classify movie reviews as either positive or negative. You will learn how to develop and employ a logistic regression classifier using Scikit-Learn, perform feature extraction with the Natural Language Toolkit (NLTK), tune model hyperparameters, and evaluate model accuracy.

**About:** The Python Machine Learning: Scikit-Learn tutorial will help you learn the basics of Python machine learning. You will learn how to use Python and its libraries to explore your data with the help of Matplotlib and Principal Component Analysis (PCA). You will also learn how to work with the KMeans algorithm to construct an unsupervised model, fit this model to your data, predict values, and validate the model.

**About:** Edureka’s video tutorial introduces machine learning in Python. It takes you through regression and clustering techniques, along with a demo of SVM classification on the famous iris dataset. The video introduces Scikit-learn, shows how to install it, and explains how machine learning works, among other things.

**About:** In this Coursera offering, you will learn about linear regression, regression using the random forest algorithm, and regression using the support vector machine algorithm. Scikit-Learn provides a comprehensive array of tools for building regression models.

**About:** In this course, you will learn about machine learning, algorithms, and how Scikit-Learn makes it all so easy. You will get to know the machine learning approach, the jargon needed to understand a dataset, the features of supervised and unsupervised learning models, and algorithms for regression, classification, clustering, and dimensionality reduction.

**About:** In this two-hour project-based course, you will build and evaluate a simple linear regression model using Python. You will use the Scikit-Learn module to fit the linear regression, pandas for data management, and seaborn for plotting. By the end of this course, you will be able to build a simple linear regression model in Python with Scikit-Learn and apply Exploratory Data Analysis (EDA) to small data sets with seaborn and pandas.

**About:** This tutorial is available on GitHub. It includes an introduction to machine learning with sample applications, data formats, preparation and representation, supervised learning: training and test data, the Scikit-Learn estimator interface and more.

**About:** In this two-hour project-based course, you will understand the business problem and the dataset, and learn how to generate hypotheses for creating new features from existing data. You will learn to perform text pre-processing and create custom transformers to generate new features. You will also learn to implement an NLP pipeline and build a text classification model.

1603825260

Digital transformation gives brands a personalized look into customers’ purchasing habits, along with their likes and dislikes, making it easy to provide a tailor-made premium customer experience based on personal preferences and unspoken needs.

Ever since smartphones became a part and parcel of human life, people have been a part of a digital network that connects them to friends, businesses, colleagues, and peers.

People don’t just buy products now; they connect with brands, register on their web portals, use their applications, and give email addresses and phone numbers at cash counters.

They expect brands to understand their individual needs and to respond when they complain. This has encouraged brands to embrace digital transformation and reinvent customer success.

1595443380

Scikit-learn’s transformer API is a great tool for data cleaning, preprocessing, feature engineering, and extraction. Sometimes, however, none of the wide range of available transformers matches the specific problem at hand. On these occasions, it is handy to be able to write one yourself. Luckily, it’s straightforward to leverage scikit-learn’s classes to build a transformer that follows the package’s conventions and can be included in scikit-learn pipelines.

To make it practical, let’s look at an example. We have a data set called `TAO`, which stands for Tropical Atmosphere Ocean. It contains weather measurements such as temperature, humidity, and wind speed. A subsample of these data comes with the R library `VIM`. Here, we are working with a slightly preprocessed version.

A quick look at the data frame tells us there is a substantial number of missing values in the `air_temp` variable, which we will need to impute before modeling.

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 733 entries, 0 to 732
Data columns (total 8 columns):
year 733 non-null int64
latitude 733 non-null int64
longitude 733 non-null int64
sea_surface_temp 733 non-null float64
air_temp 655 non-null float64
humidity 642 non-null float64
uwind 733 non-null float64
vwind 733 non-null float64
dtypes: float64(5), int64(3)
memory usage: 45.9 KB
```

Scikit-learn offers imputing transformers such as `SimpleImputer`, which fills in a variable’s missing values with its mean, median, or some other quantity. However, such imputation is known to destroy relations in the data.

But look, there is another variable called `sea_surface_temp` with no missing values! We could expect the water temperature to be highly correlated with air temperature! Let’s plot these two variables against each other.

As we expected, there is a clear linear relationship. Also, we can see why mean or median imputation makes no sense: setting air temperature to its median value of 24.5 degrees for observations where the water temperature is 22 or 29 completely destroys the relation between these two variables.
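To see the effect in numbers, here is a minimal sketch on synthetic data. The values below are made up for illustration and are not the TAO measurements; the slope, noise level, and share of missing values are all assumptions. It compares the correlation between the two temperatures on the observed rows with the correlation after median imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two temperature columns (NOT the TAO data).
sea_surface_temp = rng.uniform(22, 29, size=200)
air_temp = 0.9 * sea_surface_temp + 1.5 + rng.normal(0, 0.3, size=200)

# Hide roughly 30% of the air temperatures.
missing = rng.random(200) < 0.3
air_temp_missing = air_temp.copy()
air_temp_missing[missing] = np.nan

# Median imputation looks only at air_temp and ignores sea_surface_temp.
imputer = SimpleImputer(strategy="median")
air_temp_imputed = imputer.fit_transform(air_temp_missing.reshape(-1, 1)).ravel()

corr_observed = np.corrcoef(sea_surface_temp[~missing], air_temp[~missing])[0, 1]
corr_imputed = np.corrcoef(sea_surface_temp, air_temp_imputed)[0, 1]
print(f"correlation on observed rows:        {corr_observed:.2f}")
print(f"correlation after median imputation: {corr_imputed:.2f}")
```

Every imputed row receives the same value regardless of the water temperature, so the correlation after imputation comes out lower than on the observed rows.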

It seems that a good strategy for imputing `air_temp` would be to use linear regression with `sea_surface_temp` as a predictor. As of scikit-learn version 0.21, we can use the `IterativeImputer` and set `LinearRegression` as the imputing engine. However, this will use all the variables in the data as predictors, while we only want the water temperature. Let’s write our own transformer to achieve this.
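For comparison, a minimal sketch of the `IterativeImputer` route might look like this. The data are synthetic stand-ins rather than the TAO set, and the three-column layout is an assumption made for illustration. Note the `enable_iterative_imputer` import, which scikit-learn requires while this estimator is still experimental.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100

# Synthetic columns (assumed layout): sea_surface_temp, air_temp, humidity.
sea = rng.uniform(22, 29, n)
air = 0.9 * sea + 1.5 + rng.normal(0, 0.3, n)
humidity = rng.uniform(70, 95, n)
X = np.column_stack([sea, air, humidity])

# Knock out about 20% of the air temperatures.
X[rng.random(n) < 0.2, 1] = np.nan

# Each column with missing values is regressed on ALL the other columns,
# so humidity also enters the model for air_temp.
imputer = IterativeImputer(estimator=LinearRegression(), random_state=0)
X_imputed = imputer.fit_transform(X)
print(np.isnan(X_imputed).sum())
```

This works, but there is no option here to restrict the predictors to a single chosen column, which is the motivation for a custom transformer.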

A scikit-learn transformer should be a class implementing three methods:

- `fit()`, which simply returns `self`,
- `transform()`, which takes the data `X` as input and performs the desired transformations,
- `fit_transform()`, which is added automatically if you include `TransformerMixin` as a base class.

On top of these, we have `__init__()` to capture the parameters: in our example, the indices of the air and water temperature columns. We can also include `BaseEstimator` as a base class, which will allow us to retrieve the parameters from the transformer object.
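Putting these pieces together, one possible sketch of such a transformer is shown below. The class name `TemperatureImputer` and the parameter names are hypothetical, chosen for illustration; the default indices assume the column order from the data frame summary above. The regression is fitted inside `transform()` on the rows where air temperature is observed, which is one simple design among several.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression


class TemperatureImputer(BaseEstimator, TransformerMixin):
    """Impute air temperature from water temperature (hypothetical name)."""

    def __init__(self, air_temp_index=4, sea_temp_index=3):
        # Defaults assume the column order shown in the data summary.
        self.air_temp_index = air_temp_index
        self.sea_temp_index = sea_temp_index

    def fit(self, X, y=None):
        # Nothing is learned here; fit() simply returns self.
        return self

    def transform(self, X, y=None):
        X = np.array(X, dtype=float)  # work on a copy of the input
        air = X[:, self.air_temp_index]
        sea = X[:, [self.sea_temp_index]]  # 2-D, as LinearRegression expects
        missing = np.isnan(air)
        if missing.any():
            # Regress air temperature on water temperature using observed rows.
            model = LinearRegression().fit(sea[~missing], air[~missing])
            X[missing, self.air_temp_index] = model.predict(sea[missing])
        return X


# Toy usage: column 0 is water temperature, column 1 is air temperature.
X = np.array([[22.0, 21.0], [24.0, 23.0], [26.0, np.nan], [28.0, 27.0]])
imputer = TemperatureImputer(air_temp_index=1, sea_temp_index=0)
X_filled = imputer.fit_transform(X)  # provided by TransformerMixin
print(X_filled)
print(imputer.get_params())          # provided by BaseEstimator
```

Because the regression is refitted on whatever data reaches `transform()`, a stricter variant would estimate the coefficients in `fit()` and only apply them in `transform()`, so that test data are imputed with a model learned on the training data.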

