Introduction

Coronavirus disease 2019 (COVID-19) is an infectious disease that has resulted in an ongoing pandemic. It was first identified in Wuhan, China, in December 2019. As of 21 August 2020, more than 22 million cases have been reported across 180 countries and territories. The sheer scale of this pandemic has led to myriad problems. One acute problem that I have come across is the circulation of bogus news articles; in today’s world, spurious news can cause panic and mass hysteria. Realizing the gravity of this problem, I decided to base my next machine learning project on it.

Problem Statement

To develop a fake news classifier that classifies a news article on COVID-19 as either real news or fake news.

Work Flow

Before starting this project, I had to find a data set of news articles related to COVID-19. This was a challenge, since few data sets record COVID-19 news articles. After scouring the internet for days, I finally found one. The remaining tasks were to clean the data, fit an appropriate machine learning model to it, and assess the model’s performance.

Data Exploration and Data Engineering

Step 1: Checking for missing values.

I started the project by exploring the data and looking for missing values. Every column in the data set had some missing values but, most importantly, the “Label” column had 5. Fortunately, the source from which I downloaded the data set had values for the missing labels, which let me fill in every gap in the “Label” column. In the other columns, i.e. “Title”, “Source” and “Text”, the missing values were replaced with an empty string.
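To make this step concrete, here is a minimal sketch in pandas. It assumes the data is loaded into a DataFrame with the four columns named above; the rows, the label spellings, and the `recovered_labels` lookup are made up for illustration and are not taken from the actual data set.

```python
import pandas as pd

# Hypothetical toy data: column names follow the article, the rows are invented.
df = pd.DataFrame({
    "Title": ["Headline A", None, "Headline C"],
    "Source": ["site-a.example", "site-b.example", None],
    "Text": ["Body A", "Body B", None],
    "Label": ["Fake", None, "TRUE"],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fill missing labels with values looked up from the original source.
# The mapping (row index -> label) is a hypothetical stand-in for that lookup.
recovered_labels = {1: "Fake"}
for row_index, label in recovered_labels.items():
    df.loc[row_index, "Label"] = label

# Replace the remaining missing text fields with an empty string.
text_columns = ["Title", "Source", "Text"]
df[text_columns] = df[text_columns].fillna("")

print(df.isna().sum())
```

Filling the text columns with an empty string rather than dropping rows keeps every labelled article available for training, at the cost of some rows carrying little or no text.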

Step 2: Looking for inconsistencies in the “Label” column.

After handling the missing data, I checked the target labels for inconsistencies. Exploring the “Label” column revealed two different labels for fake news, as can be seen in the image below. To fix this anomaly, I changed the fake news labels so that they are consistent. The final labels can be seen in the second image below.
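A check like this can be sketched with `value_counts`, followed by a `replace` to merge the duplicate label. The label spellings below ("Fake", "fake", "TRUE") are assumptions for illustration, since the exact values appear only in the images.

```python
import pandas as pd

# Hypothetical label values; the real spellings are not given in the text.
labels = pd.Series(["Fake", "fake", "TRUE", "Fake", "TRUE"], name="Label")

# Listing the distinct values and their counts exposes the duplicate fake label.
print(labels.value_counts())

# Map the variant spelling onto a single canonical fake label.
cleaned = labels.replace({"fake": "Fake"})
print(cleaned.value_counts())
```

An explicit mapping is used here instead of blanket case-folding so that only the known variant is touched and any other unexpected label would still show up in the counts.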


Fake News Classifier to Tackle COVID-19 Disinformation