We are going to use Kaggle.com to find the dataset. Use the link below to go to the dataset on Kaggle.
Detecting hatred tweets, provided by Analytics Vidhya
Let’s read the context of the dataset to understand the problem statement.
In the training data, tweets are labeled ‘1’ if they express racist or sexist sentiment. Otherwise, tweets are labeled ‘0’.
Now that you have an understanding of the dataset, go ahead and download the two CSV files: the training data and the test data. Simply click “Download (5MB).”
After you have downloaded the dataset, make sure to unzip the file.
Let’s move on to Google Colab now!
Let’s check what the training and the test data look like.
Checking the train and test data
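As a rough sketch of this step, the snippet below loads a dataset with pandas and prints its shape and first rows. To keep it runnable anywhere, it reads a tiny in-memory stand-in instead of the real file. The column names (`id`, `label`, `tweet`) and the file name `train.csv` mentioned in the comments are assumptions about the download, so adjust them to match your files.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the training file so this sketch runs anywhere.
# In Colab, after uploading the unzipped files, you would instead call
# pd.read_csv("train.csv") (the file name is an assumption).
sample_csv = io.StringIO(
    "id,label,tweet\n"
    "1,0,@user thanks for the #lyft credit!\n"
    "2,1,placeholder text standing in for a flagged tweet\n"
)

train = pd.read_csv(sample_csv)

# Shape gives (rows, columns); head() shows the first few rows.
print(train.shape)
print(train.head())
```

The same two calls on the test data give a quick side-by-side view of both files.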
Notice that the tweets contain special characters such as @, #, and !. We will remove these characters later in the data cleaning step.
Next, check whether there are any missing values. There are none in either the training or the test data.
Checking missing values for training data
Checking missing values for test data
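One common way to run this check is to count null entries per column with pandas. The sketch below uses a small made-up DataFrame with one deliberately missing tweet so the output is not all zeros. The helper name `count_missing` is hypothetical, not part of pandas.

```python
import pandas as pd


def count_missing(df):
    """Return the number of missing (null) values in each column."""
    return df.isnull().sum()


# Made-up demo data with one missing tweet, for illustration only.
demo = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "label": [0, 1, 0],
        "tweet": ["first tweet", None, "third tweet"],
    }
)

missing = count_missing(demo)
print(missing)
```

On the real training and test data, every count should come back as zero.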
We will clean the data using the tweet-preprocessor library. Here’s the link: https://pypi.org/project/tweet-preprocessor/
This library removes URLs, Hashtags, Mentions, Reserved words (RT, FAV), Emojis, and Smileys.
We will also use Python’s regular expression library to remove other special characters that the tweet-preprocessor library doesn’t handle.
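To illustrate the regular-expression part, here is a minimal sketch of one possible cleanup pass. The patterns and the function name `remove_special_chars` are assumptions chosen for illustration, not necessarily the exact rules used later in this article. In practice you would run the tweet-preprocessor step first and then apply something like this to its output.

```python
import re


def remove_special_chars(text):
    """Lowercase the text and strip characters other than letters, digits, and spaces."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text


cleaned = remove_special_chars("Can't wait!!!  So   happy :) 100%")
print(cleaned)  # can t wait so happy 100
```

Note that this simple pattern splits contractions such as "can't" into two tokens. Whether that matters depends on the model you train afterwards.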