Data Collection

We will use a dataset hosted on Kaggle.com. Use the link below to open the dataset page on Kaggle.

1. Understanding the dataset

Let’s read the context of the dataset to understand the problem statement.

In the training data, tweets are labeled ‘1’ if they are associated with racist or sexist sentiment. Otherwise, tweets are labeled ‘0’.

2. Downloading the dataset

Now that you understand the dataset, go ahead and download the two CSV files: the training data and the test data. Simply click “Download (5MB).”

After you have downloaded the dataset, make sure to unzip the file.

Let’s move on to Google Colab now!

Data Exploration (Exploratory Data Analysis)

Let’s check what the training and the test data look like.

Checking the train and test data
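
A minimal sketch of this check, using Python's built-in csv module and a small in-memory sample so that it runs on its own. The file name `train.csv`, the column names `id`, `label`, and `tweet`, and the sample rows are all assumptions made for illustration, not values taken from the dataset page.

```python
import csv
import io

# Hypothetical sample standing in for train.csv; the column names are assumed.
SAMPLE_TRAIN = """id,label,tweet
1,0,@user thanks for the #lyft credit!
2,1,example of a tweet flagged as offensive
3,0,"bihday, your majesty"
"""

def read_rows(handle):
    """Read a CSV file handle into a list of dicts, one per row."""
    return list(csv.DictReader(handle))

def preview(rows, n=5):
    """Return the first n rows, similar to looking at the top of a table."""
    return rows[:n]

train_rows = read_rows(io.StringIO(SAMPLE_TRAIN))
for row in preview(train_rows, n=2):
    print(row)

# With the real file (name assumed) the same helper would be used like this:
# with open("train.csv", newline="", encoding="utf-8") as f:
#     train_rows = read_rows(f)
```

The test data can be inspected the same way by passing its file handle to `read_rows`.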

Notice that the tweets contain special characters such as @, #, and !. We will remove these characters later in the data cleaning step.

Next, check whether there are any missing values. There are none in either the training or the test data.

Checking missing values for training data

Checking missing values for test data
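
One way to sketch a missing-value check in plain Python is shown below. The helper `count_missing` and the sample rows are hypothetical; with pandas, the common equivalent would be `df.isnull().sum()`.

```python
def count_missing(rows, columns):
    """Count, per column, the rows where the value is absent or blank."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                counts[col] += 1
    return counts

# Hypothetical rows; the second one has an empty tweet on purpose.
rows = [
    {"id": "1", "label": "0", "tweet": "good morning"},
    {"id": "2", "label": "1", "tweet": ""},
    {"id": "3", "label": "0", "tweet": "have a nice day"},
]

missing = count_missing(rows, ["id", "label", "tweet"])
print(missing)
```

A result of zero for every column corresponds to the "no missing values" outcome described above.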

Data Cleaning

We will clean the data using the tweet-preprocessor library. Here’s the link: https://pypi.org/project/tweet-preprocessor/

This library removes URLs, Hashtags, Mentions, Reserved words (RT, FAV), Emojis, and Smileys.
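
To give a feel for what this kind of cleaning does, here is a rough stand-in written with Python's built-in `re` module. It is not the tweet-preprocessor implementation: the patterns and the function name `strip_tweet_markup` are assumptions for illustration, and emojis and smileys are left out to keep the sketch short.

```python
import re

# Simplified, assumed patterns; the real library's rules are more thorough.
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RESERVED = re.compile(r"\b(?:RT|FAV)\b")

def strip_tweet_markup(text):
    """Remove URLs, mentions, hashtags, and reserved words, then tidy spaces."""
    # URLs go first so that a '#' or '@' inside a link is removed with it.
    for pattern in (URL, MENTION, HASHTAG, RESERVED):
        text = pattern.sub("", text)
    return " ".join(text.split())

print(strip_tweet_markup("RT @user check this out https://example.com #nlp"))
# prints: check this out
```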

We will also use the regular expression library to remove other special characters that the tweet-preprocessor library doesn’t handle.
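
A minimal sketch of such a regular-expression pass follows. Which characters to drop is a judgment call; this version assumes that everything except lowercase letters and whitespace should go, and the function name `remove_special_characters` is hypothetical.

```python
import re

# Assumed rule: keep only lowercase letters and whitespace.
NON_LETTERS = re.compile(r"[^a-z\s]")

def remove_special_characters(text):
    """Lowercase the text, replace non-letters with spaces, collapse spaces."""
    text = text.lower()
    text = NON_LETTERS.sub(" ", text)
    return " ".join(text.split())

print(remove_special_characters("Wow!!! 100% great... isn't it?"))
# prints: wow great isn t it
```

Note that this rule splits contractions such as "isn't" into two tokens. If that matters for the model, the apostrophe could be added to the characters that are kept.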

towardsdatascience.com

Twitter Sentiment Analysis | NLP | Text Analytics