This is the second part of a two-part series. You should read the first part first.
We’re looking at a way to compare the bag of words across categories, and to do feature engineering, without first building machine learning models.
So far, we have looked at:
The results gave us some confidence in the approach, but the picture is still incomplete. To use this technique for reducing features, we have to gauge the impact on the model’s performance with and without those features in the training dataset. That is what we look into in this part.
## split the data into test and train
X = train_df_w_ft.iloc[:, 1:]
Y = train_df_w_ft.iloc[:, 0]
print("Shape of X: ", X.shape)
print("Shape of Y: ", Y.shape)
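To make the "with and without" comparison concrete, here is a minimal sketch of how it could be set up. It is not the article's own pipeline: `demo_df` is a small synthetic stand-in for `train_df_w_ft`, the `word_*` column names and the `reduced_columns` list are hypothetical, and logistic regression with a held-out accuracy score is simply an assumed choice of model and metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train_df_w_ft: target in column 0, features after it.
rng = np.random.default_rng(0)
n_rows = 200
features = pd.DataFrame(
    rng.integers(0, 5, size=(n_rows, 6)),
    columns=[f"word_{i}" for i in range(6)],
)
# Synthetic label that depends only on the first two word counts.
target = (features["word_0"] + features["word_1"] > 4).astype(int)
demo_df = pd.concat([target.rename("label"), features], axis=1)

# Same slicing as in the article: first column is the target, the rest are features.
X = demo_df.iloc[:, 1:]
Y = demo_df.iloc[:, 0]

# Hold out 20% of the rows for testing, keeping the class balance similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)

def holdout_accuracy(columns):
    """Fit on the chosen feature columns and score on the held-out rows."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[columns], y_train)
    return accuracy_score(y_test, model.predict(X_test[columns]))

all_columns = list(X.columns)
# Assumption: the features kept by the bag-of-words comparison.
reduced_columns = ["word_0", "word_1"]

acc_all = holdout_accuracy(all_columns)
acc_reduced = holdout_accuracy(reduced_columns)
print("Accuracy with all features:    ", acc_all)
print("Accuracy with reduced features:", acc_reduced)
```

Using the same train/test split for both runs keeps the comparison fair: any difference in the score then comes from the feature set rather than from which rows landed in the test set.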