A New Way to BOW Analysis & Feature Engineering: Part 1. Compare the frequency distributions across labels without building an ML model.

One of my friends asked me a question: “How can we compare the bag of words (BOW) across different categories or labels?” Here the categories or labels could be a sentiment, a state, or a customer segment.

My intuitive response was to create a bar graph of the word frequencies for each category. This is simple to implement, but it has several drawbacks, some of which are:

- The data scientist/analyst working on this has to compare words and their frequencies across all the categories, and in some cases, such as countries, the number of categories could easily exceed 100.
- Comparing raw frequencies may not give any insight. For example, suppose the word “data” has a frequency of 150 in label1 and 100 in label2. There is indeed a difference of 50, but is this difference even relevant?
- What if I want to know the top words that differentiate these categories without building a model/classifier? Is there even a way to tell?

The next idea was to create a word cloud, but even that does not solve the problems mentioned above.

After thinking for a while, I knew that the solution lies in comparing the frequencies across the categories, but I could not find the answer in one-hot encoding, count vectorizing, or TF-IDF, because all of them share a common issue.

The issue is that these methods create features and values for each document, so how do we roll them up to the label/category level? Take the case of Count Vectorizer: we get the frequencies of the words present in each document, but to do any analysis at the label/category level we have to sum the counts over all the documents in each category. Once we roll up, we can certainly compute the difference in word frequencies across labels, but again, that won’t solve the second issue mentioned above. That is, if the difference in the frequency of the word ‘data’ is 50, what does it tell us? Is it even significant?
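The roll-up described above can be sketched as follows. This is a minimal illustration rather than the article's own code: it uses only the Python standard library in place of a real Count Vectorizer, and the `label_word_counts` helper, the simple regex tokenizer, and the toy corpus are all assumptions made up for the example.

```python
import re
from collections import Counter, defaultdict


def label_word_counts(docs, labels):
    """Sum per-document word counts into one Counter per label.

    Hypothetical helper: tokenization is a plain lowercase regex split,
    which is only a rough stand-in for a real count vectorizer.
    """
    totals = defaultdict(Counter)
    for text, label in zip(docs, labels):
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[label].update(tokens)  # roll document counts up to the label
    return dict(totals)


# Toy corpus, invented purely for illustration.
docs = [
    "data science needs clean data",
    "the model needs more data",
    "great service and friendly staff",
    "friendly staff but slow service",
]
labels = [0, 0, 1, 1]

counts = label_word_counts(docs, labels)
print(counts[0]["data"], counts[1]["data"])  # 3 0
```

The result is one frequency table per label, which is the form needed for any label-level comparison. The raw difference (3 versus 0 here) still says nothing about significance on its own.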

Now you can guess where we are headed: we have differences in frequencies, and we want to know whether those differences are significant. This is where **STATISTICS** comes to our rescue.

I am assuming you are aware of tests such as the z-test, t-test, and ANOVA. We use these tests to compare means or distributions across groups, which aligns with the problem we are trying to solve here.

The table below contains the frequency of a word across the labels Target-0 and Target-1, which is nothing but its distribution across the labels. We can use tests such as the z-test or t-test to compare these distributions, and if the difference turns out to be significant at the chosen significance level, we can say that such words have different distributions across the labels and hence can be distinguishing features in the model.
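As one possible way to make this concrete, the sketch below applies a two-proportion z-test to the earlier “data” example (150 versus 100 occurrences). The article names the z-test only in general terms, so the choice of the two-proportion variant is an assumption, as are the function name `two_proportion_z_test` and the total word counts per label, which are invented for illustration.

```python
import math


def two_proportion_z_test(count_a, total_a, count_b, total_b):
    """Two-sided z-test for the difference between two proportions.

    count_* is how often the word occurs under a label; total_* is the
    total number of words under that label (hypothetical inputs).
    """
    p_a = count_a / total_a
    p_b = count_b / total_b
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# "data" appears 150 and 100 times; assume 10,000 words under each label.
z, p = two_proportion_z_test(150, 10_000, 100, 10_000)
print(f"equal totals:   z = {z:.2f}, p = {p:.4f}")

# Same raw counts, but assume label1 has 15,000 words in total:
# both rates are then 1%, so the difference of 50 carries no signal.
z2, p2 = two_proportion_z_test(150, 15_000, 100, 10_000)
print(f"unequal totals: z = {z2:.2f}, p = {p2:.4f}")
```

Under the first assumption the difference is significant at the usual 5% level, and under the second it is not, even though the raw difference is 50 in both cases. This is why the comparison has to account for the size of each label's vocabulary counts and not only the difference itself.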
