In this article, my colleagues and I use a dataset of text messages to build a model that classifies each message as spam or legitimate.
The dataset holds 5,574 messages, each tagged as spam or not spam. It is considered the gold standard because the legitimate texts were collected for research at the Department of Computer Science at the National University of Singapore, while the spam messages were extracted from a UK forum where cell phone users publicly report SMS spam.
The steps taken to approach the challenge were: data exploration; data pre-processing (tokenization, stemming, lemmatization, whitespace and stopword removal, etc.); identifying the top spam words with a word cloud; creating training and test sets; building a classification model on the training set; testing the model; and, lastly, evaluating the model.
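To make the pre-processing step concrete, here is a minimal sketch in plain Python. It is not the code we ran: the stopword list is a tiny made-up sample, and `crude_stem` is a hypothetical stand-in for a real stemmer or lemmatizer, which a project would normally take from an NLP library.

```python
import re
from typing import List

# Tiny illustrative stopword list (an assumption; a real project would
# use a much fuller list from an NLP library).
STOPWORDS = {"a", "an", "the", "to", "is", "are", "you", "your",
             "for", "of", "and", "in", "on"}


def crude_stem(token: str) -> str:
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(message: str) -> List[str]:
    """Lowercase, tokenize, drop stopwords, and stem one message."""
    text = message.lower().strip()            # normalize case and outer whitespace
    tokens = re.findall(r"[a-z0-9]+", text)   # keep alphanumeric runs only
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]


print(preprocess("  You are WINNING a free prize!!  "))
```

The resulting token lists are what a word cloud or a bag-of-words classifier would consume in the later steps.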
An initial issue was that the large majority of the dataset consisted of legitimate text messages. A common challenge when modeling fraud or spam detection as a classification problem is that in real-world data most examples are not fraudulent, which leaves us with imbalanced data. We had to ensure our training dataset was not biased toward legitimate messages. There are multiple ways of dealing with imbalanced data, such as SMOTE, RandomUnderSampler, and ENN. The team and I used stratified sampling, which keeps the proportion of spam to legitimate messages the same in the training and test sets. We wanted to avoid a situation in which the model predicts most messages as legitimate and the team accepts it as a good fit because of its high accuracy despite the skew.
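The sketch below illustrates both points under stated assumptions: a plain-Python stratified split (the hypothetical helper `stratified_split`, not the code we actually used) and a baseline that labels everything as legitimate. The 90/10 toy labels are invented for illustration and do not reflect the real dataset's ratio.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting every class at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a made-up skew (an assumption, not the real class ratio).
labels = ["legit"] * 90 + ["spam"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_labels = [labels[i] for i in test_idx]
print(Counter(test_labels))  # class counts in the test set

# Baseline that calls every message legitimate.
predictions = ["legit"] * len(test_labels)
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
spam_caught = sum(p == "spam" and y == "spam" for p, y in zip(predictions, test_labels))
spam_recall = spam_caught / test_labels.count("spam")
print(f"accuracy={accuracy:.2f}, spam recall={spam_recall:.2f}")
```

On these toy labels the do-nothing baseline reaches high accuracy while catching no spam at all, which is why accuracy alone is a poor yardstick on skewed data and why metrics such as recall on the spam class are worth checking alongside it.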
The approach to developing a solution for SMS classification as spam or not spam included:
Preliminary text analysis