The cleaning process includes tokenization, stopword removal, and lemmatization. In this article, I will discuss the next step: transforming the cleaned text data into a sparse matrix. Specifically, I will discuss the use of different vectorizers, with simple examples.
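As a quick refresher, a minimal sketch of those cleaning steps using NLTK might look like the following; the sample review text and variable names are invented for illustration.

```python
# A minimal sketch of the cleaning steps, assuming NLTK is installed;
# the sample review text is invented for illustration.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # stopword lists
nltk.download("wordnet", quiet=True)    # lemmatizer dictionary

text = "The shoes were great, but the delivery was disappointing."

tokens = word_tokenize(text.lower())                  # 1. tokenization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens
          if t.isalpha() and t not in stop_words]     # 2. stopword removal
lemmatizer = WordNetLemmatizer()
cleaned = [lemmatizer.lemmatize(t) for t in tokens]   # 3. lemmatization

print(cleaned)  # e.g. ['shoe', 'great', 'delivery', 'disappointing']
```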

Before we get more technical, I want to introduce two terms that are widely used in text analysis. A collection of text data we want to analyze is called a corpus. A corpus contains many observations, such as news articles or customer reviews. Each of these observations is called a document. I will use these two terms from now on.
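To make these terms concrete, a corpus can be represented in Python as a simple list of strings, where each string is one document (the reviews below are invented):

```python
# A toy corpus: each string in the list is one document.
corpus = [
    "The battery life is awesome.",                   # document 1
    "The delivery was bad and the box was damaged.",  # document 2
    "Great value, I am very satisfied.",              # document 3
]
```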

The transformation step builds a bridge between the information carried in the text data and the machine learning model. To make a sentiment prediction on a document, the model needs to learn a sentiment score for each unique word and how many times each word appears in the document. For example, if we conduct sentiment analysis on customer reviews of a product, after training, the model is likely to associate words like “bad” and “unsatisfied” with negative reviews, and words like “awesome” and “great” with positive reviews.
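As a rough illustration of this idea, the sketch below trains a linear classifier on a tiny invented set of labeled reviews and prints the learned weight of each word; words like “bad” and “unsatisfied” should receive negative weights, while “awesome” and “great” should receive positive ones. (CountVectorizer, used here just to turn the text into counts, is introduced properly below.)

```python
# A hedged sketch, not an exact recipe: the labeled reviews are invented,
# and the learned coefficients act as word-level "sentiment scores".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = [
    "awesome product, great quality",
    "bad experience, unsatisfied with the quality",
    "great value and awesome support",
    "bad packaging, unsatisfied overall",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(reviews)             # word counts per document
clf = LogisticRegression().fit(X, labels)

# Positive weights push a prediction toward "positive",
# negative weights toward "negative".
for word, weight in zip(vec.get_feature_names_out(), clf.coef_[0]):
    print(f"{word:12s} {weight:+.3f}")
```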

As with any supervised machine learning problem, to train the model we need to specify features and target values. Sentiment analysis is a classification problem, and in most cases a binary one, with the target values defined as positive and negative. The features fed to the model are the text data transformed by a vectorizer, and different vectorizers construct the features differently. Scikit-learn provides three vectorizers: CountVectorizer, TfidfVectorizer, and HashingVectorizer. Let’s discuss CountVectorizer first.
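A minimal sketch of CountVectorizer on a two-document toy corpus might look like this (the output comments assume scikit-learn ≥ 1.0, where `get_feature_names_out` is available):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the battery is awesome",
    "the delivery was bad and the box was damaged",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['and' 'awesome' 'bad' 'battery' 'box' 'damaged' 'delivery' 'is' 'the' 'was']
print(X.toarray())
# [[0 1 0 1 0 0 0 1 1 0]
#  [1 0 1 0 1 1 1 0 2 2]]
```

Each row corresponds to one document and each column to one word in the learned vocabulary, with the entries being raw counts. The matrix is stored in SciPy's sparse format because most entries are zero; `.toarray()` is only for inspecting small examples.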
