Building a Prediction Model to Classify Texts that are Spam

In this article, my colleagues and I use a dataset of text messages to build a prediction model that classifies texts as spam or legitimate.

The Dataset

The dataset holds 5,574 messages, each tagged as spam or not spam. It is considered a gold standard because the legitimate texts were collected for research at the Department of Computer Science at the National University of Singapore, while the spam messages were extracted from a UK forum in which cell phone users make public claims about SMS spam messages.

The Steps

The steps we took to approach the challenge were: data exploration; data pre-processing (tokenization, stemming, lemmatization, whitespace and stopword removal, etc.); identifying the top spam words with a word cloud; creating training and test sets; building a classification model on the training set; testing the model; and, lastly, evaluating the model.
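The last three steps (build, test, evaluate) can be sketched as follows. This is a minimal illustration, not the model from this project: the classifier (`MultinomialNB`), the toy messages, and the variable names are all assumptions chosen to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages standing in for the real dataset (1 = spam, 0 = legitimate).
spam = [
    "win a free prize now",
    "free cash claim now",
    "urgent claim your free reward",
    "text win to get free entry",
    "you won cash call now",
]
legitimate = [
    "see you at lunch",
    "call me when you are home",
    "are we meeting today",
    "thanks for the notes",
    "running late see you soon",
    "can you pick up some milk",
    "happy birthday have a great day",
    "the meeting moved to friday",
    "did you finish the report",
    "dinner at my place tonight",
]
texts = spam + legitimate
y = [1] * len(spam) + [0] * len(legitimate)

# 80:20 split; stratify=y keeps the spam share the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed classifier: TF-IDF features fed into multinomial naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)   # build on the training set
y_pred = model.predict(X_test)  # test on held-out messages

# Evaluate per class, so a weak spam score is not hidden by overall accuracy.
print(classification_report(
    y_test, y_pred, labels=[0, 1],
    target_names=["legitimate", "spam"], zero_division=0,
))
```

Reporting precision and recall per class, rather than a single accuracy figure, is what makes the evaluation meaningful on a dataset where one class dominates.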

Initial Issues

An initial issue was that the large majority of the dataset consisted of legitimate text messages. This is a common challenge when modeling fraud or spam detection as a classification problem: in real-world data, most examples are not fraudulent, which leaves us with imbalanced data. We had to ensure our training dataset was not biased toward legitimate messages. There are multiple ways of dealing with imbalanced data, such as SMOTE, RandomUnderSampler, ENN, etc. The team and I used stratified sampling, which keeps the ratio of spam to legitimate messages the same in the training and test sets. We wanted to avoid a situation where the model predicts most messages as legitimate and the team accepts it as a good fit because of its high accuracy despite the skew.
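To make both points concrete, here is a minimal sketch using only the standard library. The class counts are made up for illustration (they are not the real dataset's split), and `stratified_split` is a hypothetical helper, not the function we used. It shows a stratified 80:20 split, then a baseline that labels every test message as legitimate.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately so both
    parts keep roughly the same class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Made-up imbalanced labels: 90 legitimate, 10 spam.
labels = ["legitimate"] * 90 + ["spam"] * 10
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
print(Counter(test_labels))  # 18 legitimate, 2 spam: same 9:1 ratio

# A "model" that always predicts the majority class.
predictions = ["legitimate"] * len(test_labels)
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)

# Recall on spam: the share of actual spam messages that were caught.
spam_total = sum(y == "spam" for y in test_labels)
spam_caught = sum(p == "spam" and y == "spam"
                  for p, y in zip(predictions, test_labels))
spam_recall = spam_caught / spam_total

print(f"accuracy    = {accuracy:.2f}")     # 0.90
print(f"spam recall = {spam_recall:.2f}")  # 0.00
```

The baseline reaches 90% accuracy while catching no spam at all, which is exactly the outcome accuracy alone would hide.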

The Process

Our approach to classifying SMS messages as spam or not spam included:

Preliminary text analysis

  • Checking how many messages are spam or legitimate with a pie chart
  • Creating word clouds of the words in spam and legitimate messages
  • Identifying the top 10 words in spam messages and the top 10 words in legitimate messages
  • Analyzing the length of spam and legitimate messages and plotting a graph for each to compare their length distributions
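The numbers behind these charts can be computed with the standard library alone. The sketch below uses a handful of made-up messages in place of the real dataset and leaves out the plotting itself; the `tokenize` helper and the variable names are assumptions for illustration.

```python
import re
from collections import Counter

# Tiny made-up sample standing in for the real dataset.
messages = [
    ("spam", "WIN a free prize now! Call now to claim your free prize"),
    ("spam", "Free entry: text WIN to claim cash"),
    ("legitimate", "Are we still meeting for lunch today?"),
    ("legitimate", "I will call you after lunch"),
    ("legitimate", "See you at home"),
]

def tokenize(text):
    """Lowercase the text and keep runs of letters as words."""
    return re.findall(r"[a-z']+", text.lower())

# Class balance: the values a pie chart would show.
class_counts = Counter(label for label, _ in messages)

# Word frequencies and message lengths, kept separately per class.
word_counts = {"spam": Counter(), "legitimate": Counter()}
lengths = {"spam": [], "legitimate": []}
for label, text in messages:
    word_counts[label].update(tokenize(text))
    lengths[label].append(len(text))

for label in ("spam", "legitimate"):
    top_words = word_counts[label].most_common(10)  # feeds the top-10 list
    mean_length = sum(lengths[label]) / len(lengths[label])
    print(label, class_counts[label], top_words[:3], round(mean_length, 1))
```

Stopwords are deliberately left in here, since removing them belongs to the cleaning step; on real data they would dominate both top-10 lists until that step is applied.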

Text transformation

  • We cleaned the data by removing stopwords and extra whitespace and by performing tokenization, stemming, lemmatization, etc.
  • We used ‘SnowballStemmer’ from the NLTK library to remove morphological affixes from words, leaving only the word stem
  • We used ‘TfidfVectorizer’ to convert the cleaned text into a matrix of TF-IDF features
  • We encoded the categories in our dataset: ‘spam’ as 1 and ‘legitimate’ as 0
  • We split the data into training and test sets in an 80:20 ratio
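The transformation steps above can be sketched as follows. This is an illustration under stated assumptions rather than our exact code: the messages are made up, the `clean` helper is hypothetical, scikit-learn's built-in `ENGLISH_STOP_WORDS` stands in for whichever stopword list is used, and lemmatization is omitted to keep the example free of extra downloads.

```python
import re

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up messages standing in for the real dataset.
texts = [
    "WINNER!! Claim your free prize now",
    "Free entry in a weekly cash draw, text WIN",
    "Urgent: your mobile number has won a reward",
    "Call now to claim your free ringtone",
    "Are we still meeting for lunch today?",
    "I'll call you when I get home",
    "Thanks for sending the notes",
    "Running late, see you soon",
    "Can you pick up some milk?",
    "Happy birthday! Hope you have a great day",
]
labels = ["spam"] * 4 + ["legitimate"] * 6

stemmer = SnowballStemmer("english")

def clean(text):
    """Lowercase, tokenize on letters, drop stopwords, and stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in kept)

cleaned = [clean(t) for t in texts]

# Encode categories: 'spam' as 1 and 'legitimate' as 0.
y = [1 if label == "spam" else 0 for label in labels]

# 80:20 split; stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, y, test_size=0.2, stratify=y, random_state=42
)

# Learn the vocabulary and IDF weights from the training messages only,
# then apply the same mapping to the test messages.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting the vectorizer after the split is a design choice in this sketch: it keeps information about the test messages out of the features the model is trained on.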
