As a reminder, this end-to-end project tackles a classification problem in data science, specifically in the finance industry, and is divided into 3 parts:
If you missed the 1st part, feel free to check it out here before continuing with the 2nd part below, for better context.
What is feature scaling and why do we need it prior to modelling?
According to Wikipedia,
Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
If you recall from the 1st part, we completed engineering all of our features on both datasets (A & B), as shown below:
Dataset A (encoded without target)
Dataset B (encoded with target)
As seen above, the ranges and distributions of the features differ considerably from one another, and some variables also contain outliers. It is therefore highly recommended that we apply feature scaling consistently across the entire dataset to make it more digestible to machine learning algorithms.
There are a number of scaling methods available, but I will focus on the three I find most distinctive: StandardScaler, MinMaxScaler and RobustScaler. In brief,
- StandardScaler removes the mean and scales each feature to unit variance. Because it uses the mean and standard deviation, it is sensitive to outliers.
- MinMaxScaler rescales each feature to a fixed range, [0, 1] by default. A single extreme value can squeeze all the inliers into a narrow band.
- RobustScaler centres each feature on its median and scales by the interquartile range (IQR), which makes it much less sensitive to outliers.
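To make the differences concrete, here is a minimal sketch (not from the project itself) that runs all three scalers on a toy column containing one outlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy column with one outlier (100.0) to illustrate each scaler's behaviour
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

ss = StandardScaler().fit_transform(x).ravel()  # zero mean, unit variance; outlier dominates
mm = MinMaxScaler().fit_transform(x).ravel()    # squeezed into [0, 1]; inliers end up near 0
rs = RobustScaler().fit_transform(x).ravel()    # centred on median (3), scaled by IQR (2)

print(ss)
print(mm)  # → [0.     0.0101 0.0202 0.0303 1.    ] (approximately)
print(rs)  # → [-1.  -0.5  0.   0.5 48.5]
```

Note how MinMaxScaler compresses the four inliers into roughly the bottom 3% of the range, while RobustScaler leaves them evenly spread and pushes only the outlier far out.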
Let’s see how the three scalers differ in our dataset:
import pandas as pd
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
# StandardScaler: zero mean, unit variance
x_a_train_ss = pd.DataFrame(StandardScaler().fit_transform(x_a_train), columns=x_a_train.columns)
# MinMaxScaler: rescale each feature to [0, 1]
x_a_train_mm = pd.DataFrame(MinMaxScaler().fit_transform(x_a_train), columns=x_a_train.columns)
# RobustScaler: centre on the median, scale by the IQR
x_a_train_rs = pd.DataFrame(RobustScaler().fit_transform(x_a_train), columns=x_a_train.columns)
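One caveat worth flagging: the scaler should be fitted on the training data only, and the test split should be transformed with those same learned statistics to avoid data leakage. A minimal sketch of that pattern, using synthetic stand-ins for the project's train/test splits (the name `x_a_test` is an assumption, not from the article):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the article's splits; x_a_test is a hypothetical name
x_a_train = pd.DataFrame(rng.normal(size=(100, 2)), columns=["f1", "f2"])
x_a_test = pd.DataFrame(rng.normal(size=(20, 2)), columns=["f1", "f2"])

# Fit on train only, then reuse the learned mean/std on the test split
scaler = StandardScaler().fit(x_a_train)
x_a_train_ss = pd.DataFrame(scaler.transform(x_a_train), columns=x_a_train.columns)
x_a_test_ss = pd.DataFrame(scaler.transform(x_a_test), columns=x_a_test.columns)
```

Calling `fit_transform` separately on the test set would leak test-set statistics into the preprocessing and make the evaluation optimistic.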