The dataset contains 5,574 messages, each labeled as spam or not spam (ham). It is widely regarded as a gold standard: the legitimate texts were collected for research at the Department of Computer Science at the National University of Singapore, while the spam messages were extracted from a UK forum in which cell phone users make public claims about SMS spam messages.
The steps taken to approach the challenge were: data exploration; data pre-processing (tokenization, stemming, lemmatization, whitespace removal, stopword removal, etc.); identification of the top spam words through a word cloud; creation of training and test sets; building a classification model on the training set; testing the model; and, lastly, evaluating the model.
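The pre-processing step can be sketched in plain Python. This is a minimal illustration, not the project's exact pipeline: the stopword list and cleaning rules below are assumptions for the example.

```python
import re

# Illustrative stopword list; a real pipeline would use a fuller set
# (e.g. from NLTK or scikit-learn).
STOPWORDS = {"a", "an", "the", "is", "to", "you", "your", "for", "of", "and"}

def preprocess(message):
    """Lowercase, strip digits/punctuation, tokenize on whitespace,
    and drop stopwords."""
    text = message.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and spaces
    tokens = text.split()                  # split on any run of whitespace
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("WINNER!! You have won a $900 prize. Call now!"))
# ['winner', 'have', 'won', 'prize', 'call', 'now']
```

Stemming or lemmatization would follow as a further pass over the token list.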
An initial issue we faced was that the large majority of the dataset consisted of legitimate text messages. This is a common challenge when modeling fraud/spam detection as a classification problem: in real-world data, most observations are not fraudulent, leaving us with imbalanced data. We had to ensure our training dataset was not biased toward legitimate messages. There are multiple ways of dealing with imbalanced data, such as SMOTE, RandomUnderSampler, and ENN; the team and I opted for stratified sampling. We wanted to avoid a model that predicts most messages as legitimate yet gets accepted as fit because the class skew alone produces a high accuracy.
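A stratified train/test split can be sketched with scikit-learn's `train_test_split` (my assumption of the tooling; the 87/13 class ratio below is illustrative of the dataset's skew, not an exact figure):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels mimicking the dataset's skew (roughly 87% ham, 13% spam).
labels = ["ham"] * 870 + ["spam"] * 130
messages = [f"msg {i}" for i in range(len(labels))]

# stratify=labels preserves the ham/spam ratio in both splits, so the
# model is trained and evaluated on the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, stratify=labels, random_state=42
)

print(Counter(y_train))  # Counter({'ham': 696, 'spam': 104})
print(Counter(y_test))   # Counter({'ham': 174, 'spam': 26})
```

Without `stratify`, a random split could leave the test set with even fewer spam examples than the base rate, making evaluation unreliable.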
The approach to developing a solution for SMS classification as spam or not spam included:
Preliminary text analysis
Text transformation
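The text-transformation and model-building steps can be sketched together as a small pipeline. The choice of TF-IDF features and a multinomial Naive Bayes classifier here is my assumption of a typical baseline, not necessarily the model the team built; the corpus is a toy example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; real input would be the pre-processed dataset.
train_msgs = [
    "win a free prize call now",
    "free entry in a weekly prize draw",
    "urgent claim your cash award",
    "are we still meeting for lunch today",
    "can you pick up milk on the way home",
    "see you at the gym tomorrow",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# TF-IDF turns each message into a weighted term-frequency vector,
# down-weighting words that appear in many messages; multinomial
# Naive Bayes is a common baseline classifier for such features.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_msgs, train_labels)

print(model.predict(["free prize waiting, call now"]))  # ['spam']
```

Evaluation on the held-out stratified test set would then use metrics beyond accuracy (e.g. precision and recall on the spam class), given the class imbalance.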
#data #data-visualization #text-mining #classification #python #data-science