This is the second part of a two-part series. You should read the first part first.

We are exploring a way to compare bag-of-words features across categories, and to use that comparison for feature engineering, without building machine learning models first.

So far, we have looked at how to:

  1. Treat the frequency of each word across the categories as separate distributions
  2. Apply the Mann-Whitney U Test, a non-parametric test, to the distributions of each word to test for significance (a rough sketch follows this list)
  3. Analyze the results, i.e. compare the frequencies and the p-values of the significant words
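
As a rough sketch of steps 1 and 2 (not the exact code from Part 1), the idea looks like the following. Here doc_term_df is assumed to be a document-term DataFrame built from CountVectorizer, labels is a Series of category labels, and "category_a"/"category_b" stand in for the real category names; all of these are placeholders.

## for each word, compare its frequency distribution across the two categories
from scipy.stats import mannwhitneyu

p_values = {}
for word in doc_term_df.columns:
    freq_cat_a = doc_term_df.loc[labels == "category_a", word]
    freq_cat_b = doc_term_df.loc[labels == "category_b", word]
    # two-sided test: does the word's frequency distribution differ between categories?
    stat, p = mannwhitneyu(freq_cat_a, freq_cat_b, alternative="two-sided")
    p_values[word] = p

## words whose frequency distributions do NOT differ significantly (p >= 0.05)
insignificant_words = [w for w, p in p_values.items() if p >= 0.05]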

The results gave us some confidence in the approach, but the work is not complete. The next step is to use this technique to reduce the feature set and gauge the impact on the model’s performance with and without those features in the training dataset. That is what we look into in this part.

Test 1 — Train a model with all the features

  • First, I will build a model with all the features we got from CountVectorizer and check the coefficients of the words whose frequency distributions showed an insignificant difference.
  • The aim is to check whether these coefficients are very low (close to 0), which would support our hypothesis that these features/words can be excluded from the model.
  • I will also use Lasso, as it assigns a weight of exactly 0 to features that are not important at all; a sketch of this step follows the code below.
## separate the features (X) from the target (Y)

X = train_df_w_ft.iloc[:, 1:]
Y = train_df_w_ft.iloc[:, 0]

print("Shape of X: ", X.shape)
print("Shape of Y: ", Y.shape)
