Pipelines & Custom Transformers in Scikit-learn

Machine Learning academic curricula tend to focus almost exclusively on the models. One may argue that the model is what performs the magic. That may hold some truth, but the magic only works if the data is in the right form. And to make things more complicated, the ‘right form’ depends on the type of model.

Getting the data into the right form is what the industry calls preprocessing, and it takes a large chunk of a machine learning practitioner’s time. For the engineer, preprocessing and fitting, or preprocessing and predicting, are two distinct processes, but in a production environment, when we serve the model, no distinction is made: it is only data in, prediction out. Pipelines are built for exactly that. They integrate the preprocessing steps and the fitting or predicting into a single operation. Apart from helping to make the model production-ready, they add a great deal of reproducibility to the experimental phase.

In this post we will cover:

- What is a pipeline
- What is a transformer
- What is a custom transformer

Scikit Learn. Dataset transformations

From the Scikit-learn documentation we have:

> Dataset transformation … Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modeling and transforming the training data simultaneously.

We will focus on two of the transformer types, namely:

- custom transformers, which we write ourselves
- standard transformers, which Scikit-learn provides out of the box

Although Scikit-learn comes loaded with a set of standard transformers, we will begin with a custom one to understand what they do and how they work. The first thing to remember is that a custom transformer is both an estimator and a transformer, so we will create a class that inherits from both `BaseEstimator` and `TransformerMixin`. It is good practice to initialize it with `super().__init__()`. By inheriting, we get standard methods such as `get_params` and `set_params` for free. In `__init__`, we also want to create the model parameter or parameters we want to learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        # Model parameters, to be learned in fit
        self.means_ = None
        self.std_ = None

    def fit(self, X, y=None):
        X = np.asarray(X)  # accepts DataFrames and arrays alike
        self.means_ = X.mean(axis=0, keepdims=True)
        self.std_ = X.std(axis=0, keepdims=True)
        return self

    def transform(self, X, y=None):
        # Apply the parameters learned in fit
        return (np.asarray(X) - self.means_) / self.std_
```

The fit method is where the “learning” takes place: here we perform, on the training data, the operation that yields the model parameters.

In the transform method, we apply the parameters learned in fit to unseen data. Bear in mind that the preprocessing is going to be part of the whole model, so during training, fit and transform are applied to the same dataset. But later, when you use the trained model, you only apply the transform method to unseen data, with the parameters learned by fit on the training dataset.

It is key that the learned parameters, and hence the transformer operation, are the same regardless of the data to be applied to.
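To make this concrete, here is a minimal sketch (repeating the CustomScaler from above so the snippet is self-contained, with made-up toy numbers) showing that the parameters learned on the training set are reused unchanged on unseen data:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        self.means_ = None
        self.std_ = None

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.means_ = X.mean(axis=0, keepdims=True)
        self.std_ = X.std(axis=0, keepdims=True)
        return self

    def transform(self, X, y=None):
        return (np.asarray(X) - self.means_) / self.std_

X_train = np.array([[1.0], [2.0], [3.0]])  # mean 2.0, std ~0.816
X_new = np.array([[4.0]])                  # unseen data

scaler = CustomScaler().fit(X_train)
# transform uses the TRAINING mean/std, not statistics of X_new
print(scaler.transform(X_new))             # (4 - 2) / 0.816 ≈ 2.449
```

The new data point is scaled with the training statistics; no matter what data you pass to transform, the operation stays the same.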

Scikit-learn comes with a variety of standard transformers out of the box. Given their almost unavoidable use, you should be familiar with standardization (mean removal and variance scaling) and SimpleImputer for numerical data, and with encoding of categorical features, especially one-of-K, also known as one-hot encoding.
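As a short sketch of these built-in transformers (the tiny arrays are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# SimpleImputer: fill missing numerical values, here with the column mean
X_num = np.array([[1.0], [2.0], [np.nan]])
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_num)    # NaN -> 1.5

# StandardScaler: mean removal and variance scaling, like our CustomScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)

# OneHotEncoder: one-of-K encoding for categorical features
X_cat = np.array([["red"], ["blue"], ["red"]])
encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(X_cat).toarray()  # columns: blue, red
print(X_onehot)
```

Note that all three expose the same fit/transform interface as our custom transformer, which is what makes them interchangeable inside a pipeline.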

Remember that a transformer is an estimator, but so is your model (logistic regression, random forest, etc.). Think of a pipeline as vertical stacking of steps. Here order matters, so you want to put the preprocessing before the model. The key is that one step’s output is the next step’s input.
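A minimal sketch of such a stack, assuming a scaler followed by a logistic regression on a made-up toy dataset:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

# Steps run in order: each step's output is the next step's input
pipe = Pipeline([
    ("scale", StandardScaler()),      # preprocessing first
    ("model", LogisticRegression()),  # the model last
])
pipe.fit(X, y)        # fits the scaler, then the model, in one call
print(pipe.predict([[3.5]]))
```

Calling fit on the pipeline runs fit_transform on every transformer step and fit on the final estimator; calling predict runs transform on every step and predict at the end. Data in, prediction out.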

Often you want to apply different transformations to different subsets of your features. The required transformations for numerical and categorical data are different. It is as if you had two parallel paths, or as if the transformers were horizontally stacked.

The input to the parallel paths is the same, so each path has to begin by selecting the features relevant to its transformation (for example, the numerical features or the categorical features).
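Scikit-learn’s ColumnTransformer implements exactly this horizontal stacking: each branch selects its columns, transforms them, and the results are concatenated. A sketch with a made-up two-column DataFrame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],        # numerical feature
    "color": ["red", "blue", "red"],  # categorical feature
})

# Each parallel branch first selects its columns, then transforms them
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["color"]),
])
X = preprocess.fit_transform(df)
print(X.shape)  # (3, 3): 1 scaled column + 2 one-hot columns
```

A ColumnTransformer can itself be used as the preprocessing step of a Pipeline, so the whole model remains a single fit/predict object.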
