Data Collection

We will use a dataset hosted on Kaggle.com. Use the link below to open the dataset page.

Twitter Sentiment Analysis: Detecting hatred tweets, provided by Analytics Vidhya (www.kaggle.com)

1. Understanding the dataset

Let’s read the context of the dataset to understand the problem statement.

In the training data, tweets are labeled ‘1’ if they are associated with racist or sexist sentiment. Otherwise, tweets are labeled ‘0’.

2. Downloading the dataset

Now that you understand the dataset, go ahead and download the two CSV files: the training data and the test data. Simply click “Download (5MB).”

After you download the dataset, make sure to unzip the file.

Let’s move on to Google Colab now!

Data Exploration (Exploratory Data Analysis)

Let’s check what the training and the test data look like.

Checking the train and test data
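As a minimal sketch of this step, the data can be loaded and inspected with pandas. The column names (id, label, tweet), the sample rows, and the file name train.csv in the comment are assumptions for illustration; check them against the files you actually downloaded.

```python
import io

import pandas as pd

# Tiny stand-in for the downloaded CSV. The column names and the rows
# are assumptions made for illustration, not real data from the dataset.
sample_csv = io.StringIO(
    "id,label,tweet\n"
    "1,0,@user thanks for the #lyft credit!\n"
    "2,1,an example of a tweet flagged as hateful\n"
)

# With the real files, pass a path instead, e.g. pd.read_csv("train.csv").
train = pd.read_csv(sample_csv)

print(train.shape)   # (number of rows, number of columns)
print(train.head())  # first few rows, including the raw tweet text
```

The same two calls on the test data should show the same layout, presumably without the label column.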

Notice the special characters such as @, #, and !. We will remove these characters later in the data cleaning step.

Next, check for missing values. Neither the training nor the test data contains any.

Checking missing values for training data

Checking missing values for test data
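A missing-value check like this one is commonly written with pandas as shown below. This is a sketch on a made-up DataFrame (the column names and values are assumptions), and it deliberately includes one missing tweet so the count is visible; on the real data every count should be zero.

```python
import pandas as pd

# Made-up frame for illustration; one tweet is missing on purpose.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "label": [0, 1, 0],
        "tweet": ["first tweet", None, "third tweet"],
    }
)

# isnull() marks each missing cell; sum() counts them per column.
missing = df.isnull().sum()
print(missing)
```

Running the same call on both the training and the test DataFrame covers the two checks described above.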

Data Cleaning

We will clean the data using the tweet-preprocessor library. Here’s the link: https://pypi.org/project/tweet-preprocessor/

This library removes URLs, Hashtags, Mentions, Reserved words (RT, FAV), Emojis, and Smileys.

We will also use Python’s regular expression library (re) to remove other special cases that the tweet-preprocessor library does not handle.
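The regular-expression pass might look like the sketch below. The function name remove_special_cases and the specific patterns (lowercasing, dropping everything except letters and whitespace) are assumptions chosen for illustration; the article does not say which special cases it targets. The commented-out lines show where the tweet-preprocessor call would typically go, assuming the package is installed.

```python
import re

# Assumed earlier step, requires `pip install tweet-preprocessor`:
# import preprocessor as p
# text = p.clean(text)  # strips URLs, hashtags, mentions, emojis, etc.


def remove_special_cases(text: str) -> str:
    """Hypothetical helper: keep only lowercase letters and single spaces."""
    text = text.lower()
    # Replace digits, punctuation, and other symbols with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace left behind by the previous step.
    text = re.sub(r"\s+", " ", text).strip()
    return text


print(remove_special_cases("what a day!!! 100% worth it..."))
```

Removing all punctuation is a blunt choice; if characters such as apostrophes matter for your model, narrow the pattern accordingly.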
