Most of the time I write longer articles on data science topics, but recently I've been thinking about writing small, bite-sized pieces around specific concepts, algorithms and applications. This is my first attempt in that direction, and I hope you find these pieces useful.

In today's "small bite" I'm writing about the Z-score in the context of **anomaly detection**.

Anomaly detection is the process of identifying unexpected data points, events or behaviors that require further examination. It is a well-established field within data science, and there are many algorithms for detecting anomalies in a dataset, depending on the data type and business context. The Z-score is probably the simplest of them: it can rapidly screen candidates for further examination to determine whether they are truly suspicious.

**What is Z-score**

Simply speaking, the Z-score is a statistical measure that tells you how far a data point is from the rest of the dataset. In more technical terms, the Z-score tells you how many standard deviations a given observation is away from the mean.

For example, a Z-score of 2.5 means that the data point is 2.5 standard deviations from the mean. Since it is that far from the center, it is flagged as an outlier/anomaly.

**How does it work?**

The Z-score is a parametric measure that takes two parameters: the mean and the standard deviation.

Once you calculate these two parameters, finding the Z-score of a data point is easy: z = (x − mean) / standard deviation.

Note that the mean and standard deviation are calculated over the whole dataset, whereas *x* represents a single data point. That means every data point gets its own Z-score, while the mean and standard deviation stay the same throughout.
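To make the computation concrete, here is a minimal sketch (the data values are made up for illustration): the mean and standard deviation are computed once from the whole dataset, and then a single observation *x* is scored against them.

```python
import numpy as np

data = [4, 5, 6, 5, 40]   # small made-up dataset
mean = np.mean(data)      # one mean for the whole dataset
sd = np.std(data)         # one standard deviation for the whole dataset

x = 40                    # a single observation to score
z = (x - mean) / sd       # its z-score: distance from the mean in sd units
print(round(z, 2))        # about 2 standard deviations above the mean
```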

**Example**

Below is a Python implementation of the Z-score with a few sample data points. I've added a comment to each line of code to explain what's going on.

```
## import numpy
import numpy as np
## random data points to calculate z-scores for
data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]
## calculate mean
mean = np.mean(data)
## calculate standard deviation
sd = np.std(data)
## choose a threshold
threshold = 2
## create an empty list to store outliers
outliers = []
## detect outliers
for i in data:
    z = (i - mean) / sd         ## calculate the z-score
    if abs(z) > threshold:      ## identify outliers
        outliers.append(i)      ## add to the list
## print outliers
print("The detected outliers are:", outliers)
```
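The loop above can also be written as a single vectorized NumPy expression, which is the more idiomatic form on larger arrays. This is a sketch using the same data and threshold as above:

```python
import numpy as np

data = np.array([5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5])
z = (data - data.mean()) / data.std()   # z-score of every point at once
outliers = data[np.abs(z) > 2]          # boolean mask selects the anomalies
print("The detected outliers are:", outliers.tolist())   # [-99, 88]
```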

**Caution and conclusion**

If you play with these data points you will notice a few things:

- There are 14 data points, and the Z-score correctly detected both outliers [-99 and 88]. However, if you remove five of the 5s from the list, it detects only one outlier [-99]. In other words, the Z-score needs a sufficiently large sample to work reliably.
- On large production datasets, the Z-score works best if the data are normally distributed (i.e. follow a Gaussian distribution).
- I used an arbitrary threshold of 2, beyond which all data points are flagged as outliers. Common rules of thumb use 2, 2.5, 3 or 3.5 as the threshold.
- Finally, the Z-score is sensitive to extreme values, because the mean and standard deviation it relies on are themselves sensitive to extreme values.
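To see that last point in action, here is a sketch (the value 1000 is made up for illustration): appending one very extreme value inflates the mean and standard deviation so much that the original outliers -99 and 88 are no longer flagged at all.

```python
import numpy as np

# same data as before, plus one extreme made-up value
data = np.array([5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5, 1000])
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 2]
print(outliers.tolist())   # only [1000] — the extreme value masks -99 and 88
```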

I hope this was useful. Feel free to get in touch via Twitter.

#machine-learning #anomaly-detection #outlier-detection #statistics #data-science
