How Naive Bayes Classifiers Work – with Python Code Examples

Naive Bayes Classifiers (NBC) are simple yet powerful Machine Learning algorithms. They are based on conditional probability and Bayes's Theorem.

In this post, I explain "the trick" behind NBC and I'll give you an example that we can use to solve a classification problem.

In the next sections, I'll be talking about the math behind NBC. Feel free to skip those sections and go to the implementation part if you are not interested in the math.

In the implementation section, I'll show you a simple NBC algorithm. Then we'll use it to solve a classification problem. The task will be to determine whether a certain passenger on the Titanic survived the accident or not.

Before talking about the algorithm itself, let's talk about the simple math behind it. We need to understand what conditional probability is and how we can use Bayes's Theorem to calculate it.

Think about a fair die with six sides. What's the probability of getting a six when rolling the die? That's easy, it's 1/6. We have six possible and equally likely outcomes but we are interested in just one of them. So, 1/6 it is.

But what happens if I tell you that I have rolled the die already and the outcome is an even number? What's the probability that we have got a six now?

This time, the possible outcomes are just three because there are only three even numbers on the die. We are still interested in just one of those outcomes, so now the probability is greater: 1/3. What's the difference between both cases?

In the first case, we had no **prior** information about the outcome. Thus, we needed to consider every single possible result.

In the second case, we were told that the outcome was an even number, so we could reduce the space of possible outcomes to just the three even numbers that appear in a regular six-sided die.

In general, when calculating the probability of an event A given the occurrence of another event B, we say we are calculating the **conditional probability** of A given B, or just the probability of A given B. We denote it `P(A|B)`.

For example, the probability of getting a six given that the number we have got is even is `P(Six|Even) = 1/3`. Here, we denoted with **Six** the event of getting a six and with **Even** the event of getting an even number.
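To make the dice example concrete, here is a small sketch that checks both probabilities by enumerating the sample space (using Python's `fractions` module for exact arithmetic):

```python
from fractions import Fraction

# All equally likely outcomes of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]

# P(Six): one favorable outcome out of six.
p_six = Fraction(sum(1 for o in outcomes if o == 6), len(outcomes))

# P(Six | Even): restrict the sample space to the even outcomes only.
evens = [o for o in outcomes if o % 2 == 0]
p_six_given_even = Fraction(sum(1 for o in evens if o == 6), len(evens))

print(p_six)             # 1/6
print(p_six_given_even)  # 1/3
```

Conditioning on "even" is just shrinking the sample space from six outcomes to three, exactly as described above.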

But, how do we calculate conditional probabilities? Is there a formula?

Now, I'll give you a couple of formulas to calculate conditional probabilities. I promise they won't be hard, and they are important if you want to understand the inner workings of the Machine Learning algorithms we'll be talking about later.

The probability of an event A given the occurrence of another event B can be calculated as follows:

`P(A|B) = P(A,B)/P(B)`

where `P(A,B)` denotes the probability of both A and B occurring at the same time, and `P(B)` denotes the probability of B.

Notice that we need `P(B) > 0`, because it makes no sense to talk about the probability of A given B if the occurrence of B is not possible.
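We can check this formula on the same dice example, computing `P(A,B)` and `P(B)` by counting outcomes (a quick numerical sanity check, not a new result):

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
n = len(outcomes)

# P(A, B): probability of getting a six AND an even number
# (only the outcome 6 satisfies both).
p_a_and_b = Fraction(sum(1 for o in outcomes if o == 6 and o % 2 == 0), n)

# P(B): probability of getting an even number.
p_b = Fraction(sum(1 for o in outcomes if o % 2 == 0), n)

# P(A | B) = P(A, B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 1/3
```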

We can also calculate the probability of an event A, given the occurrence of multiple events B1, B2,..., Bn:

`P(A|B1,B2,...,Bn) = P(A,B1,B2,...,Bn)/P(B1,B2,...,Bn)`

There's another way of calculating conditional probabilities: the well-known Bayes's Theorem.

```
P(A|B) = P(B|A)P(A)/P(B)
P(A|B1,B2,...,Bn) = P(B1,B2,...,Bn|A)P(A)/P(B1,B2,...,Bn)
```

Notice that we are calculating the probability of event A given event B by *inverting* the order of occurrence of the events: on the right-hand side, we suppose the event A has occurred and we use the probability of event B given A (or of events B1, B2,..., Bn given A in the second and more general formula).
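Sticking with the dice example, here is a quick sanity check of Bayes's Theorem, plugging in probabilities we already know:

```python
from fractions import Fraction

p_six = Fraction(1, 6)          # P(Six)
p_even = Fraction(1, 2)         # P(Even)
p_even_given_six = Fraction(1)  # a six is always even, so P(Even | Six) = 1

# Bayes's Theorem: P(Six | Even) = P(Even | Six) P(Six) / P(Even)
p_six_given_even = p_even_given_six * p_six / p_even
print(p_six_given_even)  # 1/3
```

We recover the same `1/3` we got by shrinking the sample space directly.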

An important fact that can be derived from this Theorem is the formula to calculate `P(B1,B2,...,Bn,A)`. That's called the chain rule for probabilities.

```
P(B1,B2,...,Bn,A) = P(B1 | B2, B3, ..., Bn, A)P(B2,B3,...,Bn,A)
= P(B1 | B2, B3, ..., Bn, A)P(B2 | B3, B4, ..., Bn, A)P(B3, B4, ..., Bn, A)
= P(B1 | B2, B3, ..., Bn, A)P(B2 | B3, B4, ..., Bn, A)...P(Bn | A)P(A)
```

That's an ugly formula, isn't it? But under some conditions we can work around it and avoid computing it in full.
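To see that the chain rule holds, here is a sketch that verifies it on a toy joint distribution over three binary events; the particular probabilities are invented for illustration, chosen only so they sum to 1:

```python
from fractions import Fraction

# A toy joint distribution over (B1, B2, A), invented for illustration.
joint = {
    (0, 0, 0): Fraction(1, 8),  (0, 0, 1): Fraction(1, 8),
    (0, 1, 0): Fraction(1, 8),  (0, 1, 1): Fraction(1, 4),
    (1, 0, 0): Fraction(1, 16), (1, 0, 1): Fraction(1, 16),
    (1, 1, 0): Fraction(1, 8),  (1, 1, 1): Fraction(1, 8),
}

def p(b1=None, b2=None, a=None):
    """Marginal probability that the given events take the given values."""
    total = Fraction(0)
    for (x1, x2, xa), pr in joint.items():
        if (b1 is None or x1 == b1) and (b2 is None or x2 == b2) \
                and (a is None or xa == a):
            total += pr
    return total

# Chain rule: P(B1, B2, A) = P(B1 | B2, A) P(B2 | A) P(A)
lhs = p(b1=1, b2=1, a=1)
rhs = (p(b1=1, b2=1, a=1) / p(b2=1, a=1)) \
    * (p(b2=1, a=1) / p(a=1)) \
    * p(a=1)
print(lhs == rhs)  # True
```

Each conditional factor is written as a ratio of joint probabilities, using the definition of conditional probability from above.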

Let's talk about the last concept we need to know to understand the algorithms.

The last concept we are going to talk about is independence. We say events A and B are independent if

`P(A|B) = P(A)`

That means that the probability of event A is not affected by the occurrence of event B. A direct consequence is that `P(A,B) = P(A)P(B)`.

In plain English, this means that the probability of both A and B occurring at the same time is equal to the product of the probabilities of A and B occurring separately.

We'll also need the closely related notion of **conditional independence**: we say A and B are conditionally independent given a third event C if

`P(A,B|C) = P(A|C)P(B|C)`

(note that plain independence of A and B does not, in general, imply this).
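A quick sketch of independence using two fair coin flips, where the result of the first flip tells us nothing about the second:

```python
from fractions import Fraction

# All four equally likely outcomes of two fair coin flips.
outcomes = [(a, b) for a in ("H", "T") for b in ("H", "T")]
n = len(outcomes)

p_first_heads = Fraction(sum(1 for a, b in outcomes if a == "H"), n)
p_second_heads = Fraction(sum(1 for a, b in outcomes if b == "H"), n)
p_both_heads = Fraction(sum(1 for a, b in outcomes
                            if a == "H" and b == "H"), n)

# Independence: P(A, B) = P(A) P(B)
print(p_both_heads == p_first_heads * p_second_heads)  # True
```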

Now we are ready to talk about Naive Bayes Classifiers!

Suppose we have a vector **X** of *n* features and we want to determine the class of that vector from a set of *k* classes *y1, y2,...,yk*. For example, suppose we want to determine whether it'll rain today or not.

We have two possible classes (*k = 2*): *rain*, *not rain*, and the length of the vector of features might be 3 (*n = 3*).

The first feature might be whether it is cloudy or sunny, the second feature could be whether humidity is high or low, and the third feature would be whether the temperature is high, medium, or low.

So, these could be possible feature vectors.

```
<Cloudy, H_High, T_Low>
<Sunny, H_Low, T_Medium>
<Cloudy, H_Low, T_High>
```

Our task is to determine whether it'll rain or not, given the weather features.

After learning about conditional probabilities, it seems natural to approach the problem by trying to calculate the prob of raining given the features:

```
R = P(Rain | Cloudy, H_High, T_Low)
NR = P(NotRain | Cloudy, H_High, T_Low)
```

If `R > NR`, we answer that it'll rain; otherwise, we say it won't.
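As a sketch of this decision rule, here is a toy version where the probabilities are estimated from a small invented table of past days (the data is hypothetical, made up for illustration, not from the article):

```python
from fractions import Fraction

# Hypothetical record of past days: (sky, humidity, temperature, rained?).
days = [
    ("Cloudy", "H_High", "T_Low",    True),
    ("Cloudy", "H_High", "T_Low",    True),
    ("Cloudy", "H_High", "T_Medium", True),
    ("Sunny",  "H_Low",  "T_High",   False),
    ("Sunny",  "H_Low",  "T_Medium", False),
    ("Cloudy", "H_High", "T_Low",    False),
]

x = ("Cloudy", "H_High", "T_Low")

def joint_prob(features, rained):
    """Estimate P(features, class) from frequencies in the toy data."""
    matches = [d for d in days if d[:3] == features and d[3] == rained]
    return Fraction(len(matches), len(days))

# R and NR share the denominator P(Cloudy, H_High, T_Low), so comparing
# the joint probabilities (the numerators) gives the same decision.
R = joint_prob(x, True)    # proportional to P(Rain | x)
NR = joint_prob(x, False)  # proportional to P(NotRain | x)
print("Rain" if R > NR else "NotRain")  # Rain
```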

In general, if we have *k* classes *y1, y2, ..., yk*, and a vector of *n* features **X = <X1, X2, ..., Xn>**, we want to find the class *yi* that maximizes

`P(yi | X1, X2, ..., Xn) = P(X1, X2,..., Xn, yi)/P(X1, X2, ..., Xn)`

Notice that the denominator is constant and it does not depend on the class *yi*. So, we can ignore it and just focus on the numerator.

In a previous section, we saw how to calculate `P(X1, X2,..., Xn, yi)` by decomposing it into a product of conditional probabilities (the ugly formula):

`P(X1, X2,..., Xn, yi) = P(X1 | X2,..., Xn, yi)P(X2 | X3,..., Xn, yi)...P(Xn | yi)P(yi)`

Assuming all the features **Xi** are conditionally independent given the class (this is the "naive" assumption that gives the algorithm its name) and using Bayes's Theorem, we can calculate the conditional probability as follows:

```
P(yi | X1, X2,..., Xn) = P(X1, X2,..., Xn | yi)P(yi)/P(X1, X2, ..., Xn)
= P(X1 | yi)P(X2 | yi)...P(Xn | yi)P(yi)/P(X1, X2, ..., Xn)
```

And we just need to focus on the numerator.

By finding the class *yi* that maximizes the previous expression, we are classifying the input vector. But, how can we get all those probabilities?
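Putting the pieces together, here is a minimal sketch of a Naive Bayes classifier for categorical features. It estimates every probability from raw frequencies in the training data and applies no smoothing; the class name `NaiveBayes` and the toy weather data are my own, not the article's Titanic implementation:

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal Naive Bayes classifier for categorical features (a sketch)."""

    def fit(self, X, y):
        self.n = len(y)
        self.class_counts = Counter(y)  # for P(yi)
        # feature_counts[class][feature_index][value] = occurrences
        self.feature_counts = defaultdict(lambda: defaultdict(Counter))
        for features, label in zip(X, y):
            for i, value in enumerate(features):
                self.feature_counts[label][i][value] += 1
        return self

    def predict(self, features):
        best_class, best_score = None, -1.0
        for label, count in self.class_counts.items():
            score = count / self.n  # P(yi)
            for i, value in enumerate(features):
                # P(Xi | yi), estimated from frequencies (no smoothing).
                score *= self.feature_counts[label][i][value] / count
            if score > best_score:
                best_class, best_score = label, score
        return best_class

# Toy usage with the weather features from the text:
X = [("Cloudy", "H_High", "T_Low"),
     ("Sunny",  "H_Low",  "T_Medium"),
     ("Cloudy", "H_High", "T_Medium"),
     ("Sunny",  "H_Low",  "T_High")]
y = ["Rain", "NotRain", "Rain", "NotRain"]

model = NaiveBayes().fit(X, y)
print(model.predict(("Cloudy", "H_High", "T_Low")))  # Rain
```

Note that `predict` computes exactly the numerator `P(X1 | yi)...P(Xn | yi)P(yi)` for each class and keeps the largest; the shared denominator is never computed.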
