Let’s continue our sentiment analysis journey.

Remember, last time we talked about using the bag-of-words model to detect the sentiment of review texts, and it already achieved relatively good performance. Today we will build on what we have accomplished and upgrade the bag-of-words model with a smarter weighting scheme. In this post, we will reuse the custom StreamlinedModel pipeline object from part 1 and see the improvements that applying a TF-IDF transformer brings to simple models like logistic regression.
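Roughly speaking, the upgrade amounts to inserting a TF-IDF step between the count vectorizer and the classifier. Here is a minimal sketch of that idea in plain scikit-learn (the actual StreamlinedModel wrapper is covered in part 1; the variable names below are placeholders):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Bag-of-words counts -> TF-IDF re-weighting -> simple linear classifier.
model = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# train_texts / train_labels are placeholders for your own data:
# model.fit(train_texts, train_labels)
# predictions = model.predict(test_texts)
```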

Term Frequency — Inverse Document Frequency (TF-IDF)

We already know that counting word frequencies can help us gauge the sentiment of a review text, but this approach ignores how informative each word is across the whole collection of documents. We talked about using a smarter weighting scheme to solve this problem, but how exactly do we execute it?

General Idea

The answer is to use a weight that is inversely proportional to the frequency with which a word appears across all documents. If a word appears in almost every document (such as “I”, “you” or “they”), it carries little information, yet it would dominate if we weighted all words equally. Therefore, we would like to up-weight the words that appear infrequently. For example, using the same 3 reviews:

I love dogs, I think they have adorable personalities.
I don't like cats
My favorite kind of pet is bird

We can see that “adorable” appears in only 1 of the 3 documents, so its inverse document frequency is 3/(1 + 1). We add 1 to the denominator to avoid dividing by 0. We then take the natural log of this value, which gives us ln(3/2) ≈ 0.405.
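As a quick sanity check, here is a minimal sketch of that calculation, with scikit-learn’s TfidfVectorizer for comparison. Note that scikit-learn’s default, smoothed formula is slightly different, ln((1 + N)/(1 + df)) + 1, so the numbers will not match exactly:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "I love dogs, I think they have adorable personalities.",
    "I don't like cats",
    "My favorite kind of pet is bird",
]

# Hand-computed IDF for "adorable" using the formula from the text:
# idf = ln(N / (1 + df)), where N = 3 documents and df = 1.
idf_adorable = np.log(3 / (1 + 1))
print(f"idf('adorable') = {idf_adorable:.3f}")  # ~0.405

# scikit-learn applies smoothing by default: idf = ln((1 + N) / (1 + df)) + 1.
vectorizer = TfidfVectorizer()
vectorizer.fit(reviews)
idx = vectorizer.vocabulary_["adorable"]
print(f"sklearn idf('adorable') = {vectorizer.idf_[idx]:.3f}")  # ~1.693
```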
