What is Standardization and why is it so darn important?

It’s possible that you’ll come across datasets with a lot of numerical noise built in, such as high variance or features on very different scales, so good preprocessing is a must before you even think about machine learning. A common preprocessing solution for this kind of problem is standardization.

Photo by Fidel Fernando on Unsplash

Standardization is a preprocessing method used to transform continuous data so that it looks like data drawn from a standard normal distribution: each feature ends up with a mean of zero and unit variance. In scikit-learn this is often a necessary step, because many models assume that the data you are training on is normally distributed, and if it isn’t, you risk biasing your model.
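
In practice, standardizing a feature just means subtracting its mean and dividing by its standard deviation (the so-called z-score). Here’s a minimal sketch with NumPy, using made-up values purely for illustration:

```python
import numpy as np

# Made-up feature values on an arbitrary scale
x = np.array([12.0, 15.0, 9.0, 30.0, 22.0])

# Standardize: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

print(round(z.mean(), 10))  # 0.0 -- centered
print(round(z.std(), 10))   # 1.0 -- unit variance
```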

You can standardize your data in different ways, and in this article we’re going to talk about the most popular data scaling method: _standard scaling_.

It’s also important to note that standardization is a preprocessing method applied to continuous, numerical data, and there are a few different scenarios in which you want to use it:

  1. When working with any kind of model that uses a linear distance metric or operates on a linear space, such as KNN, linear regression, or K-means.
  2. When a feature or features in your dataset have high variance. This can bias a model that assumes the data is normally distributed: a feature whose variance is an order of magnitude or more larger than that of the other features can end up dominating the model (see the sketch below).

Let’s now proceed with data scaling.
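
To make the second scenario above concrete, here is a minimal sketch (with made-up numbers) of standard scaling using scikit-learn’s `StandardScaler`, where one feature’s variance dwarfs the other’s:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second column is on a much larger scale than the first
X = np.array([
    [0.5, 10_000.0],
    [0.3, 25_000.0],
    [0.9, 40_000.0],
    [0.1, 55_000.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X.var(axis=0))         # variances differ by many orders of magnitude
print(X_scaled.var(axis=0))  # both columns now have variance 1.0
```

After scaling, the two columns are directly comparable, which is exactly what distance-based models like KNN and K-means need.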

