It’s possible that you will come across datasets with lots of numerical noise built in, such as high variance or differently-scaled data. The preprocessing solution for that is standardization.
Standardization is a preprocessing method used to transform continuous data so that it looks normally distributed. In scikit-learn this is often a necessary step, because many models assume that the data you are training on is normally distributed, and if it isn’t, you risk biasing your model. You can standardize your data in different ways, and in this article we’re going to talk about two popular data scaling methods: normalization and standardization.
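As a quick, made-up sketch of what these two methods look like in scikit-learn (using StandardScaler for standardization and MinMaxScaler for normalization on a toy two-column array):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy feature matrix: two columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Standardization: rescale each column to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # ~[1. 1.]

# Normalization (min-max scaling): rescale each column to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm.min(axis=0))  # [0. 0.]
print(X_norm.max(axis=0))  # [1. 1.]
```

Both transformers learn their statistics (mean and standard deviation, or min and max) from the data passed to fit_transform, so in a real project you’d fit them on your training set only and reuse the fitted scaler on your test set.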
It’s also important to note that standardization is a preprocessing method applied to continuous, numerical data. There are a few different scenarios in which you want to standardize your data:
- First, if you are working with any kind of model that uses a linear distance metric or operates on a linear space, like k-nearest neighbors, linear regression, or k-means clustering, the model is assuming that the data and features you’re giving it are related in a linear fashion, or can be measured with a linear distance metric.
- Second, and related to this, is the case when a feature or features in your dataset have high variance. This could bias a model that assumes the data is normally distributed: if a feature in your dataset has a variance that’s an order of magnitude or more greater than that of other features, it can drown out the model’s ability to learn from the other features in the dataset (see the sketch after this list).
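Here’s a minimal sketch of both scenarios at once, on synthetic data I’m making up for illustration: k-nearest neighbors relies on Euclidean distances, so if one feature’s variance dominates, that feature dominates the distances too. Standardizing inside a pipeline puts all features back on equal footing:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; blow up the scale of one feature so its variance
# dominates the Euclidean distances that k-NN relies on
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 1000

knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Without scaling, distances are dominated by the inflated feature;
# with scaling, every feature contributes to the neighbor search
print(cross_val_score(knn, X, y, cv=5).mean())
print(cross_val_score(scaled_knn, X, y, cv=5).mean())
```

The exact scores will depend on the random data, but the point of the sketch is the pattern: wrap the scaler and the model in a pipeline so the scaling is learned from the training folds only during cross-validation.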
#data-science #data-analysis #machine-learning #deep-learning #data-visualization