Is this headline clickbait?

Using Machine Learning to detect clickbait.

The term “clickbait” refers to an article headline written solely to lure a viewer, through sensationalist language, into clicking through to a certain webpage. The webpage then generates ad revenue from the user’s clicks or monetizes the user’s activity data. The article itself is not written with journalistic integrity or research, and it does not strive for any deeper meaning; it is simply a vehicle to monetize user clicks and data.

With the explosion of social media, smartphones and an increasingly digital world, there is no shortage of content vying for our attention. The ease of sharing and reposting on social media has allowed clutter like clickbait to run absolutely wild.

As clickbait has become increasingly prevalent across the web (remember when you could scroll through your Twitter feed and only see genuine content?), I wanted to see whether a headline could be classified using machine learning and what that process looks like. My goal with this project is to provide evidence for implementation on a larger scale, on social media or various publisher websites, as a ‘clickbait blocker’ (think ‘ad blocker’), where clickbait could be flagged or filtered out before a viewer ever lays eyes on it!


The Data

For this project, my data consisted of 52,000 headlines from a variety of clickbait and non-clickbait sources from roughly 2007–2020. My final dataset was compiled from a dataset found on Kaggle, as well as my own scraping of, and API calls to, Twitter and various online publications. Each headline was labeled as clickbait or non-clickbait according to its source, and my final dataset was roughly balanced (see distribution below).

Clickbait sources: Buzzfeed, Upworthy, ViralNova, BoredPanda, Thatscoop, Viralstories, PoliticalInsider, Examiner, TheOdyssey

Non-clickbait sources: NY Times, The Washington Post, The Guardian, Bloomberg, The Hindu, WikiNews, Reuters
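
To make the source-based labeling concrete, here is a minimal sketch of how a headline could be labeled from its source and how the class balance could be checked. The source names come from the lists above; the function names, the lowercase matching and the sample rows are my own assumptions for illustration, not the code used in the project.

```python
from collections import Counter

# Source names taken from the lists above; matching on lowercase is an assumption.
CLICKBAIT_SOURCES = {
    "buzzfeed", "upworthy", "viralnova", "boredpanda", "thatscoop",
    "viralstories", "politicalinsider", "examiner", "theodyssey",
}
NON_CLICKBAIT_SOURCES = {
    "ny times", "the washington post", "the guardian", "bloomberg",
    "the hindu", "wikinews", "reuters",
}


def label_for_source(source):
    """Return 1 for a clickbait source and 0 for a non-clickbait source."""
    key = source.strip().lower()
    if key in CLICKBAIT_SOURCES:
        return 1
    if key in NON_CLICKBAIT_SOURCES:
        return 0
    raise ValueError(f"unknown source: {source!r}")


def label_distribution(rows):
    """Count labels over an iterable of (headline, source) pairs."""
    return Counter(label_for_source(source) for _, source in rows)


# Hypothetical rows for illustration only; not taken from the real dataset.
sample_rows = [
    ("Example headline A", "Buzzfeed"),
    ("Example headline B", "Reuters"),
    ("Example headline C", "Upworthy"),
    ("Example headline D", "The Guardian"),
]
distribution = label_distribution(sample_rows)
print(distribution)
```

Labeling by source is cheap, but it assumes every headline from a given outlet belongs to the same class, which is worth keeping in mind when reading the results.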

Label distribution, plotted with Seaborn.

Data Processing & Feature Engineering

Because I was initially working only with the text of each headline, the cleaning and feature engineering steps I took are described below.
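
As a rough illustration of what text cleaning and feature engineering on a headline can look like, here is a generic sketch using only the Python standard library. The cleaning rules and every feature in it are assumptions chosen for the example; they are not necessarily the steps used in this project.

```python
import re
import string


def clean_headline(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    # string.punctuation covers ASCII only, so curly quotes are left in place.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()


def headline_features(text):
    """Build a few hand-made features (hypothetical choices) from a raw headline."""
    words = clean_headline(text).split()
    return {
        "word_count": len(words),
        "starts_with_digit": text.strip()[:1].isdigit(),
        "has_question_mark": "?" in text,
        "has_exclamation_mark": "!" in text,
    }


# Made-up headline for illustration only.
example = "  10 Things You WON'T Believe!  "
cleaned = clean_headline(example)
features = headline_features(example)
print(cleaned)
print(features)
```

Punctuation and casing are read from the raw headline before cleaning, since those signals disappear once the text is normalized.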
