In general, Feature Scaling is an important, often essential, preprocessing step before training a Machine Learning model. *Why is it that important?* Well, simply because each algorithm we use for prediction or clustering relies on mathematical formulas, and many of these formulas are sensitive to differences in the scale of values between features. This is especially visible when it comes to gradient descent!

In fact, unscaled data make visualizations harder to read and, more importantly, can degrade the predictive performance of many machine learning algorithms. Unscaled data can also slow down the convergence of many gradient-based estimators, or even prevent it altogether.
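To make the convergence point concrete, here is a minimal sketch using plain NumPy. Everything in it is an assumption made for illustration: the synthetic data, the made-up price formula, the helper name `gd_iterations`, and the choice of step size (one over the largest eigenvalue of the Hessian, picked separately for each dataset so the comparison stays fair). It runs batch gradient descent on a least-squares problem twice, once on centered features and once on standardized features, and counts the iterations needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features on very different scales (values are made up).
surface = rng.uniform(10, 500, n)   # wide range
rooms = rng.uniform(1, 10, n)       # narrow range
X = np.column_stack([surface, rooms])
y = 200 * surface + 5000 * rooms + rng.normal(0, 1000, n)  # hypothetical price model


def gd_iterations(X, y, tol=1e-6, max_iter=200_000):
    """Batch gradient descent on mean squared error.

    Returns the number of iterations needed for the gradient norm to drop
    below `tol`, or `max_iter` if that never happens.
    """
    n_samples = len(y)
    hessian = X.T @ X / n_samples
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()  # stable step size for this data
    w = np.zeros(X.shape[1])
    for step in range(1, max_iter + 1):
        grad = X.T @ (X @ w - y) / n_samples
        if np.linalg.norm(grad) < tol:
            return step
        w -= lr * grad
    return max_iter


# Center both versions so that no intercept term is needed.
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()

# Standardized version: centered and divided by the per-feature standard deviation.
X_standardized = X_centered / X.std(axis=0)

iters_raw = gd_iterations(X_centered, y_centered)
iters_scaled = gd_iterations(X_standardized, y_centered)

print("iterations without scaling:", iters_raw)
print("iterations with scaling:   ", iters_scaled)
```

On data like this, the unscaled run should need far more iterations, because the loss surface is stretched along the large-scale feature and the step size has to stay small enough for that direction.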

Indeed, many estimators are designed with the assumption that all features vary on comparable scales. In particular, gradient-based estimators often assume that the training data is already standardized (centered features with unit variance). A notable exception is decision tree-based estimators, which are robust to arbitrary scaling of the data.
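As a small illustration of what "standardized" means here, the sketch below centers each feature and divides it by its standard deviation, then compares the result with `StandardScaler`. The toy data is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Toy data (made up): two features on very different scales.
X = np.column_stack([
    rng.normal(loc=250.0, scale=100.0, size=1000),
    rng.normal(loc=5.0, scale=2.0, size=1000),
])

# Standardization by hand: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation with scikit-learn.
X_sklearn = StandardScaler().fit_transform(X)

print("means:", X_sklearn.mean(axis=0))  # close to 0 for each feature
print("stds: ", X_sklearn.std(axis=0))   # close to 1 for each feature
print("manual result matches StandardScaler:", np.allclose(X_manual, X_sklearn))
```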

Let’s take an example: imagine that you are working on house price prediction. You will have features such as price, surface area, number of rooms, etc. Of course, the scales of values in this dataframe differ widely from one feature to another, yet you will have to process them all with the same algorithm. This is where Feature Scaling is necessary! Your algorithm will indeed have to mix prices in [0 … 100,000] $, areas in [10 … 500] m², and numbers of rooms in [1 … 10]. Scaling, therefore, consists in bringing these values to a comparable range.
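One way to see the problem is to look at a distance between two houses. The sketch below uses two hypothetical houses and the value ranges quoted above as min-max bounds. Without scaling, the price difference dominates the Euclidean distance almost entirely; after rescaling each feature to [0, 1], the difference in the number of rooms is what matters.

```python
import numpy as np

# Feature order: price ($), surface (m²), number of rooms.
# Both houses are hypothetical.
house_a = np.array([50_000.0, 100.0, 2.0])
house_b = np.array([51_000.0, 100.0, 9.0])

# Bounds taken from the ranges given in the text.
lower = np.array([0.0, 10.0, 1.0])
upper = np.array([100_000.0, 500.0, 10.0])


def squared_diff_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared = (a - b) ** 2
    return squared / squared.sum()


# Min-max scaling: map each feature from [lower, upper] to [0, 1].
scaled_a = (house_a - lower) / (upper - lower)
scaled_b = (house_b - lower) / (upper - lower)

raw_share = squared_diff_share(house_a, house_b)
scaled_share = squared_diff_share(scaled_a, scaled_b)

print("raw distance:   ", np.linalg.norm(house_a - house_b))
print("raw shares:     ", np.round(raw_share, 4))     # price dominates
print("scaled distance:", np.linalg.norm(scaled_a - scaled_b))
print("scaled shares:  ", np.round(scaled_share, 4))  # rooms dominate
```

A $1,000 price gap is small on a $100,000 range, while 2 versus 9 rooms is a large gap on a 1 to 10 range. Only the scaled version reflects that.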

If you don’t apply Feature Scaling wisely, you may observe slow learning and reduced performance.

Fortunately, **Scikit-Learn** will help us do the job once again, but before using any technique we have to understand how each one works.

Basically, **Scikit-Learn** (sklearn.preprocessing) provides several scaling techniques; we will review four:

- StandardScaler
- MinMaxScaler
- MaxAbsScaler
- RobustScaler
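As a quick preview, here is a minimal sketch that applies the four scalers, with their default settings, to a single made-up column containing one outlier. The formulas in the comments summarize the default behavior of each scaler.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# One feature with an obvious outlier (values are made up).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scalers = {
    "StandardScaler": StandardScaler(),  # (x - mean) / std
    "MinMaxScaler": MinMaxScaler(),      # (x - min) / (max - min)
    "MaxAbsScaler": MaxAbsScaler(),      # x / max(|x|)
    "RobustScaler": RobustScaler(),      # (x - median) / IQR
}

# Fit each scaler on the column and keep the transformed values.
results = {
    name: scaler.fit_transform(x).ravel() for name, scaler in scalers.items()
}

for name, values in results.items():
    print(f"{name:>14}: {np.round(values, 3)}")
```

Notice how the outlier squeezes the other values together with `MinMaxScaler` and `MaxAbsScaler`, while `RobustScaler` keeps them spread out because the median and interquartile range are barely affected by a single extreme value.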

First of all, we are going to create random datasets as well as some plotting functions that will help us better understand the effects of the different techniques mentioned above.
