The Dataset

The dataset holds 5,574 SMS messages, each tagged as spam or not spam. It is considered a gold standard because the legitimate texts were collected for research at the Department of Computer Science at the National University of Singapore, while the spam messages were extracted from a UK forum where cell phone users publicly report SMS spam messages.

The Steps

The steps taken to approach the challenge were: data exploration; data pre-processing (tokenization, stemming, lemmatization, whitespace and stopword removal, etc.); identifying the top spam words with a word cloud; creating training and test sets; building a classification model on the training set; testing the model; and, lastly, evaluating the model.
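The last three steps (build, test, evaluate) can be sketched as follows. This is only an illustration under stated assumptions: the article does not name the classifier, so Multinomial Naive Bayes is a stand-in chosen here, and the toy messages are hypothetical rather than taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam, 0 = legitimate.
texts = [
    "win a free prize now", "claim your free cash reward",
    "urgent call to claim prize", "free entry to win cash",
    "winner claim your reward now",
    "are you home for dinner", "see you at the office",
    "call me when you land", "thanks for the lift today",
    "running late see you soon",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Hold out 20% for testing, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# Assumed model: TF-IDF features fed into Multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

# Test on unseen messages, then evaluate with per-class precision and recall.
predictions = model.predict(X_test)
report = classification_report(y_test, predictions, zero_division=0)
print(report)
```

Per-class precision and recall are reported rather than accuracy alone, because accuracy can look good on skewed data even when little spam is caught.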

Initial Issues

An initial issue we faced was that the large majority of the dataset consisted of legitimate text messages. A common challenge when modeling fraud or spam detection as a classification problem is that in real-world data most examples are not fraudulent, which leaves us with imbalanced data. We had to ensure our training dataset was not biased toward legitimate messages. There are multiple ways of dealing with imbalanced data, such as SMOTE, RandomUnderSampler, and ENN (Edited Nearest Neighbours). The team and I used stratified sampling. We wanted to avoid a situation in which our model predicts most messages as legitimate and the team accepts it as a good fit because of its high accuracy, despite the skew.
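To make the accuracy pitfall concrete, here is a minimal sketch in plain Python. The `stratified_split` helper and the 87/13 class counts are hypothetical, chosen only for illustration; they are not the team's code or the dataset's actual ratio.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately so both
    parts keep roughly the same class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels: 87 legitimate (0) and 13 spam (1).
labels = [0] * 87 + [1] * 13
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
print(Counter(test_labels))

# A "model" that labels every message legitimate still gets high accuracy...
always_legit = [0] * len(test_labels)
accuracy = sum(p == t for p, t in zip(always_legit, test_labels)) / len(test_labels)

# ...but its recall on the spam class is zero: it catches no spam at all.
spam_total = sum(t == 1 for t in test_labels)
spam_caught = sum(p == 1 and t == 1 for p, t in zip(always_legit, test_labels))
spam_recall = spam_caught / spam_total
print(accuracy, spam_recall)
```

Stratifying ensures the test set contains spam in the first place, so a metric such as spam recall can expose a model that accuracy alone would flatter.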

The Process

The approach to developing a solution for SMS classification as spam or not spam included:

Preliminary text analysis

  • Checking how many messages are spam or legitimate with a pie chart
  • Creating word clouds of the words found in spam and in legitimate messages
  • Identifying the top 10 words in spam messages and the top 10 words in legitimate messages
  • Analyzing the length of spam and legitimate text messages and plotting a graph for each to compare their length distributions
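The numbers behind these plots can be computed with the standard library alone. The sketch below uses hypothetical toy messages and helper names (`top_words`, `mean_length`) invented for illustration. The resulting counts are the kind of input a pie chart, word cloud, or length histogram would be drawn from.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy messages; the real dataset is not reproduced here.
messages = [
    ("spam", "WIN a free prize now call now"),
    ("spam", "free entry claim your prize"),
    ("legitimate", "are you coming home for dinner"),
    ("legitimate", "call me when you are home"),
    ("legitimate", "ok see you soon"),
]

# Class balance: the values a pie chart would show.
class_counts = Counter(label for label, _ in messages)

def top_words(label, n=10):
    """Most frequent lowercase words among messages with the given label."""
    words = Counter()
    for msg_label, text in messages:
        if msg_label == label:
            words.update(text.lower().split())
    return words.most_common(n)

def mean_length(label):
    """Average character length of messages with the given label."""
    return mean(len(text) for msg_label, text in messages if msg_label == label)

print(class_counts)
print(top_words("spam"))
print(mean_length("spam"), mean_length("legitimate"))
```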

Text transformation

  • Cleaning the data by removing stopwords and extra whitespace and performing tokenization, stemming, lemmatization, etc.
  • We used ‘SnowballStemmer’ from the NLTK library to remove morphological affixes from words, leaving only the word stem
  • We used ‘TfidfVectorizer’ to convert the cleaned messages into a matrix of TF-IDF features
  • We encoded the categories in our dataset: ‘spam’ to 1 and ‘legitimate’ to 0
  • We split the data into training and test sets in an 80:20 ratio
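The transformation steps above might look roughly like the sketch below. Several details are assumptions made for illustration: the `clean` helper and the toy messages are hypothetical, scikit-learn's built-in English stop word list stands in for whichever list the team used, and the vectorizer is fitted on the training split only.

```python
import re

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.model_selection import train_test_split

stemmer = SnowballStemmer("english")

def clean(text):
    """Lowercase, tokenize on letters, drop stop words, and stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)

# Hypothetical toy data standing in for the labelled SMS messages.
texts = [
    "Win a FREE prize now", "Claim your free cash reward",
    "Urgent: call to claim prize", "Free entry to win cash",
    "Winner! Claim your reward now",
    "Are you home for dinner", "See you at the office",
    "Call me when you land", "Thanks for the lift today",
    "Running late, see you soon",
]
labels = ["spam"] * 5 + ["legitimate"] * 5

# Encode categories: 'spam' to 1 and 'legitimate' to 0.
y = [1 if label == "spam" else 0 for label in labels]
cleaned = [clean(t) for t in texts]

# 80:20 split, stratified so both parts keep the class ratio.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    cleaned, y, test_size=0.2, stratify=y, random_state=42
)

# Learn the vocabulary and IDF weights from the training messages only,
# then reuse them to transform the test messages.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)
print(X_train.shape, X_test.shape)
```

Fitting the vectorizer after the split keeps information about the test messages out of the vocabulary and IDF weights.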
