Not long ago, tasks such as machine translation, text summarization, and speech recognition were out of reach for machines. A question answering system or a chatbot would have seemed like magic before the rise of machine learning and, in particular, natural language processing (NLP), which is often considered a subfield of machine learning that deals with language and aims to make machines understand and interpret it at a human level. One of the most popular applications of NLP is sentiment analysis, which lets us classify a text, tweet, or comment as positive, neutral, or negative. For example, to evaluate people's satisfaction with a specific product, we can apply sentiment analysis to its reviews and calculate the percentage of positive and negative ones.


In this tutorial we'll do something similar: we'll build a sentiment classifier from scratch based on logistic regression and train it on a corpus of tweets. We'll cover:

Text processing

Feature extraction

Sentiment classifier

Training & evaluating the sentiment classifier

Text processing

First, we'll use the Natural Language Toolkit (NLTK), an open-source Python library with many functions for processing textual data. It also ships a Twitter dataset that we'll work with:

import nltk
import numpy as np
from nltk.corpus import twitter_samples

# Download the corpus on first use (does nothing if it is already present)
nltk.download('twitter_samples')

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

# First 4,000 tweets of each class for training, the rest for testing
train_pos = positive_tweets[:4000]
test_pos = positive_tweets[4000:]
train_neg = negative_tweets[:4000]
test_neg = negative_tweets[4000:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# Labels as column vectors: 1 for positive, 0 for negative
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

The Python code above gives us a list of positive tweets and a list of negative tweets, which we split into train_x, train_y, test_x, and test_y: 80% for training and 20% for testing. These tweets contain a lot of irrelevant information such as hashtags, mentions, and stop words. Data cleaning, or preprocessing, is a key step in any data science workflow because it prepares the data for training a classification algorithm. In the context of NLP, text processing includes:

Tokenization: splitting a sentence into a list of words.

Removing stop words: stop words are frequent words that occur in a text without adding semantic value to it.

Removing punctuation: punctuation refers to marks such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

Stemming: reducing a word to its stem (its root form) by stripping affixes such as suffixes and prefixes.
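To make the four steps above concrete, here is a minimal, dependency-free sketch that applies them one at a time to a made-up sentence. It is only an illustration: TOY_STOPWORDS, tokenize, and crude_stem are hypothetical stand-ins for NLTK's stop-word list, TweetTokenizer, and PorterStemmer, and the real tools will produce somewhat different output.

```python
import re
import string

# Hypothetical, hand-picked stop words; NLTK's English list is much longer.
TOY_STOPWORDS = {"i", "am", "the", "a", "is", "and", "to", "this"}
PUNCTUATION = set(string.punctuation)


def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def crude_stem(word):
    """Strip one common suffix. A rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


sentence = "I am loving the new phones, and the camera is amazing!"

tokens = tokenize(sentence)                                  # 1. tokenization
no_stop = [t for t in tokens if t not in TOY_STOPWORDS]      # 2. stop words
no_punct = [t for t in no_stop if t not in PUNCTUATION]      # 3. punctuation
stems = [crude_stem(t) for t in no_punct]                    # 4. stemming

print(stems)  # ['lov', 'new', 'phone', 'camera', 'amaz']
```

Note how stems such as "amaz" are not real words. That is expected: the goal of stemming is to map related forms (amazing, amazed) to the same token, not to produce dictionary entries.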

We'll implement these operations in a single Python function that processes each tweet before we feed it into our classifier:

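import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')  # needed once for the stop-word list

def text_process(tweet):
    # Remove the retweet marker "RT", hyperlinks, and the "#" sign of hashtags
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    # Split the tweet into tokens
    tokenizer = TweetTokenizer()
    tweet_tokenized = tokenizer.tokenize(tweet)
    # Drop stop words and punctuation (punctuation check assumed from the
    # `string` import; compared in lowercase because NLTK's list is lowercase)
    stopwords_english = stopwords.words('english')
    tweet_processed = [word for word in tweet_tokenized
                       if word.lower() not in stopwords_english
                       and word not in string.punctuation]
    # Reduce each remaining word to its stem
    stemmer = PorterStemmer()
    tweet_after_stem = []
    for word in tweet_processed:
        tweet_after_stem.append(stemmer.stem(word))
    return tweet_after_stem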

Feature extraction

After text processing, it's time for feature extraction. Computers don't work with text directly; they only understand numbers, so we need to transform tweets into vectors that can be fed into our logistic regression function. There are many ways to represent text as vectors, and the right one depends on the problem we are trying to solve. In our case, we are doing binary classification, which means classifying a tweet as either positive or negative. We would therefore expect some words, such as happy or good, to occur more often in the positive tweets, and likewise some words to be more frequent than others in the negative tweets.
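One common way to turn that observation into numbers, sketched here as an assumption since the text above does not fix a specific representation, is to count how often each word appears in positive and in negative tweets, then describe a tweet by a bias term plus the summed positive and negative counts of its words. The names build_freqs and extract_features, and the tiny hand-made corpus of already-processed tweets, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_freqs(tweets, labels):
    """Count occurrences of each (word, label) pair.

    tweets: list of token lists; labels: 1 for positive, 0 for negative.
    """
    freqs = Counter()
    for tokens, label in zip(tweets, labels):
        for word in tokens:
            freqs[(word, label)] += 1
    return freqs


def extract_features(tokens, freqs):
    """Return [bias, positive count sum, negative count sum] for one tweet."""
    unique_words = set(tokens)  # count each distinct word once
    pos = sum(freqs.get((w, 1), 0) for w in unique_words)
    neg = sum(freqs.get((w, 0), 0) for w in unique_words)
    return [1.0, float(pos), float(neg)]


# Made-up, already processed tweets (tokens are stems)
toy_tweets = [
    ["happi", "good", "day"],
    ["good", "friend"],
    ["sad", "bad", "day"],
    ["bad", "news"],
]
toy_labels = [1, 1, 0, 0]

freqs = build_freqs(toy_tweets, toy_labels)
features = extract_features(["good", "day"], freqs)
print(features)  # [1.0, 3.0, 1.0]
```

Here "good" appears twice in positive tweets and "day" once in each class, so the tweet gets a positive sum of 3 and a negative sum of 1. A three-dimensional vector like this is small enough to feed straight into logistic regression.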

