Predicting Fake News using NLP and Machine Learning | Scikit-Learn | GloVe | Keras

The fake news dataset is one of the classic text analytics datasets available on Kaggle. It consists of the titles and text of genuine and fake articles from different authors. In this article, I walk through the entire text classification process using traditional machine learning approaches as well as deep learning.

Getting Started

I started by downloading the dataset from Kaggle in Google Colab.

Next, I read the data into a DataFrame and checked it for null values. There are 7 null values in text, 122 in title and 503 in author out of a total of 20800 rows. I decided to drop those rows. For the test data, I filled the nulls with a blank.
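A minimal sketch of the null check and the two treatments described here, assuming pandas and the column names `title`, `author`, `text` and `label`. The tiny frames are made-up stand-ins for the Kaggle train and test files, and dropping on a null in any column is an assumption about how the rows were dropped.

```python
import pandas as pd

# Toy stand-ins for the train and test files (made-up rows, not the Kaggle data).
train = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "author": ["x", "y", None, "z"],
    "text": ["some text", "more text", None, "last text"],
    "label": [0, 1, 1, 0],
})
test = pd.DataFrame({
    "title": ["E", None],
    "author": [None, "w"],
    "text": ["test text", None],
})

# Count the nulls in each column.
null_counts = train.isnull().sum()
print(null_counts)

# Training data: drop every row with a null in any column (assumed policy).
train_clean = train.dropna().reset_index(drop=True)

# Test data: keep every row and fill the nulls with an empty string.
test_filled = test.fillna("")

print(train_clean.shape, test_filled.shape)
```

The test rows are filled rather than dropped because every test row still needs a prediction.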

I also checked the distribution of ‘Fake’ and ‘Genuine’ news in the dataset. Usually, I set the rcParams for all plots in the notebook while importing matplotlib.

The label 0 is genuine news while 1 is fake news.

The ratio of genuine to fake news is no longer 1:1; it has shifted to about 4:5.
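One way to put a number on that ratio is to count the labels directly. This is only a sketch: the labels below are made up to give a 4:5 split, and the column name `label` is an assumption.

```python
import pandas as pd

# Made-up labels: 0 = genuine, 1 = fake (not the Kaggle data).
labels = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, 1], name="label")

# Count each class, ordered by label value.
counts = labels.value_counts().sort_index()

# Ratio of genuine to fake articles.
genuine_to_fake = counts.loc[0] / counts.loc[1]

print(counts)
print(f"genuine:fake = {genuine_to_fake:.2f}")
```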

Next, I looked at the article length for each class.

The median length is lower for fake articles, but they also have many outliers. Both classes appear to have a minimum length of zero.

The lengths appear to start from 0, which is concerning. When I used .describe() to see the numbers, the minimum was actually 1. So I took a look at these texts and found that they are blank. The obvious fix is to strip the texts and drop those with length zero. I checked, and the total number of zero-length texts is 74.
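The check can be sketched as follows, assuming pandas and a `text` column. The rows are made up to show how whitespace-only texts report a length of 1 or more until they are stripped.

```python
import pandas as pd

# Made-up rows: some texts are empty or whitespace only.
df = pd.DataFrame({
    "text": ["A long genuine article body.", " ", "Short fake text.", "", "  \n"],
    "label": [0, 0, 1, 1, 1],
})

# Raw character length of each text.
df["length"] = df["text"].str.len()

# Per-class summary statistics of the raw lengths.
summary = df.groupby("label")["length"].describe()
print(summary)

# Texts that are truly empty before stripping.
n_empty_raw = int((df["length"] == 0).sum())

# Texts that are empty once surrounding whitespace is stripped.
stripped_length = df["text"].str.strip().str.len()
n_blank = int((stripped_length == 0).sum())

print(n_empty_raw, n_blank)
```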

**I decided to start over again.** So, I would fill all NaNs with a blank, strip the texts, and then remove the zero-length texts, which should be a good starting point for preprocessing. The new code handles missing values this way. The final shape of the data is (20684, 6), that is, it contains 20684 rows, only 116 fewer than 20800.
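A hedged sketch of that fill, strip and drop sequence, not the original notebook code. The function name `clean_frame`, the column names and the added `length` column are assumptions for illustration, and the rows are made up.

```python
import pandas as pd

def clean_frame(df, text_cols=("title", "author", "text"), body_col="text"):
    """Fill NaNs with blanks, strip whitespace, drop rows with an empty body."""
    out = df.copy()
    for col in text_cols:
        # Replace missing values with an empty string, then strip whitespace.
        out[col] = out[col].fillna("").astype(str).str.strip()
    # Length of the article body after stripping.
    out["length"] = out[body_col].str.len()
    # Keep only the rows whose body is not empty.
    return out[out["length"] > 0].reset_index(drop=True)

# Made-up rows covering a whitespace-only text, a null text and a null author.
raw = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "title": ["T1", None, "T3", "T4"],
    "author": ["a", "b", "c", None],
    "text": ["Body one.", "   ", None, " Body four "],
    "label": [0, 1, 1, 0],
})

cleaned = clean_frame(raw)
print(cleaned.shape)
print(cleaned)
```

Rows with a missing title or author survive as long as the body has content, which is why far fewer rows are lost than with a blanket drop of nulls.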

The target variable is now evenly distributed between the two classes, which is good for model training.

It turned out afterwards that there are more texts with single-digit lengths, or lengths as low as 10. They seemed more like comments than proper texts. I will keep them as they are for the time being and move on to the next step.
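To inspect such short texts before deciding what to do with them, a filter on the length is enough. This sketch assumes pandas and a `text` column; the rows are made up, and the cutoff of 10 characters simply follows the figure mentioned above.

```python
import pandas as pd

# Made-up rows: three comment-like texts and one proper article body.
df = pd.DataFrame({
    "text": [
        "Great!",
        "ok",
        "A full paragraph of article text that runs on.",
        "Nice post",
    ],
    "label": [1, 1, 0, 1],
})

df["length"] = df["text"].str.len()

# Select the texts of at most 10 characters for manual inspection.
short = df[df["length"] <= 10]
print(short)
```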

#deep-learning #text-analytics #keras #machine-learning #python


towardsdatascience.com
