In this blog post, we’re going to build a spam filter using Python and the multinomial Naive Bayes algorithm. Our goal is to code a spam filter from scratch that classifies messages with an accuracy greater than 80%.

To build our spam filter, we’ll use a dataset of 5,572 SMS messages. Tiago A. Almeida and José María Gómez Hidalgo put together the dataset, which you can download from the UCI Machine Learning Repository.

We’re going to focus on the Python implementation throughout the post, so we’ll assume that you are already familiar with multinomial Naive Bayes and conditional probability.

If you need to fill in any gaps before moving forward, Dataquest has a course that covers both conditional probability and multinomial Naive Bayes, as well as a broad variety of other courses that can round out your knowledge and lead to a data science certificate.

Exploring the Dataset

Let’s start by opening the SMSSpamCollection file with the read_csv() function from the pandas package. We’re going to use:

  • sep='\t' because the data points are tab separated
  • header=None because the dataset doesn’t have a header row
  • names=['Label', 'SMS'] to name the columns
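
Put together, the three arguments above might look like the sketch below. The helper name load_sms and the two sample rows are made up for illustration (they are not part of the dataset), and the in-memory buffer only stands in for the real file so the snippet runs without a download.

```python
import io

import pandas as pd


def load_sms(source):
    """Read a tab-separated, header-less SMS file into a two-column DataFrame.

    `source` can be a file path (e.g. 'SMSSpamCollection') or a file-like object.
    load_sms is a hypothetical helper name used only for this example.
    """
    return pd.read_csv(
        source,
        sep='\t',               # the data points are tab separated
        header=None,            # the dataset doesn't have a header row
        names=['Label', 'SMS']  # name the two columns
    )


# Two made-up rows in the same layout as the dataset: label, tab, message.
sample = io.StringIO("ham\tSee you at lunch\nspam\tWin a free prize now\n")

sms_spam = load_sms(sample)
print(sms_spam.shape)   # (2, 2)
print(sms_spam.head())
```

With the downloaded file in the working directory, calling load_sms('SMSSpamCollection') should read the full dataset the same way.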

