While building an ML model, we often deal with numerical variables that differ in range, units, and magnitude. A common practice is to apply Standardization or Normalization to all the features before training a model. However, it is crucial to study the distributions of the data before deciding which technique to apply for feature scaling.

In this article, we will go through the difference between Standardization and Normalization while examining the distributions of the data. In the end, we will see how to choose between the two, based on whether a feature's distribution is Gaussian or non-Gaussian, to improve the performance of a Logistic Regression model.


Standardization Vs Normalization

These two terms are sometimes used interchangeably, but they refer to different techniques.

Standardization: This technique rescales a feature so that it has a mean of zero and a standard deviation of one (a z-score transform).
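As a quick sketch of what standardization does, here is scikit-learn's StandardScaler applied to a small made-up array (the values are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two toy features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# after scaling, each column has mean 0 and standard deviation 1
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

Note that StandardScaler only shifts and rescales each feature; it does not change the shape of its distribution, which is why checking for a roughly Gaussian shape first matters.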

Normalization: This technique rescales the values of a variable into the range 0 to 1 (min-max scaling).
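Normalization can be sketched the same way with scikit-learn's MinMaxScaler, again on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# same toy features as before
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_norm = MinMaxScaler().fit_transform(X)

# after scaling, each column spans exactly [0, 1]
print(X_norm.min(axis=0))
print(X_norm.max(axis=0))
```

Because min-max scaling depends on the observed minimum and maximum, it is sensitive to outliers: a single extreme value compresses all the other points into a narrow band.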


We will be using the Pima Indians Diabetes dataset, which you can find [here]

import pandas as pd
import numpy as np

# load the Pima Indians Diabetes dataset
data = pd.read_csv("Pima Indian Diabetes.csv")
data.head()
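Before picking a scaler, it helps to quantify how Gaussian each feature looks. One quick check is skewness: values near 0 suggest a roughly symmetric, Gaussian-like shape, while large positive or negative values flag non-Gaussian features. The sketch below uses a synthetic stand-in (the column names and distributions are assumed for illustration, since the CSV is not bundled here):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the dataset, for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 500),   # roughly Gaussian
    "Insulin": rng.exponential(80, 500),   # strongly right-skewed
})

# skewness near 0 -> Gaussian-like; large skew -> non-Gaussian
print(demo.skew())
```

A histogram per column (`demo.hist()`) gives the same information visually and is often the easier way to eyeball the distributions.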

