Most of the time I write longer articles on data science topics, but recently I’ve been thinking about writing small, bite-sized pieces on specific concepts, algorithms and applications. This is my first attempt in that direction, and I hope people will like these pieces.

In today’s “small bite” I’m writing about the Z-score in the context of anomaly detection.

Anomaly detection is the process of identifying unexpected data, events or behaviors that require further examination. It is a well-established field within data science, and there are many algorithms for detecting anomalies in a dataset, depending on the data type and business context. Z-score is probably the simplest of them: it can rapidly screen candidates for further examination to determine whether they are suspicious or not.

What is a Z-score?

Simply speaking, a Z-score is a statistical measure that tells you how far a data point is from the rest of the dataset. In more technical terms, the Z-score tells you how many standard deviations a given observation is away from the mean.

For example, a Z-score of 2.5 means that the data point is 2.5 standard deviations away from the mean. If that distance exceeds the threshold you have chosen, the point is far enough from the center to be flagged as an outlier/anomaly.

How it works

Z-score is a parametric measure that takes two parameters: the mean and the standard deviation.

Once you calculate these two parameters, finding the Z-score of a data point x is easy:

z = (x - mean) / standard deviation

Note that the mean and standard deviation are calculated for the whole dataset, whereas x represents a single data point. That means every data point has its own Z-score, while the mean and standard deviation stay the same for all of them.
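To make the "one Z-score per data point" idea concrete, here is a minimal sketch that computes the scores for a whole list at once. The function name z_scores and the sample values are my own illustration, and I'm assuming the population standard deviation (NumPy's default), which is also what the example further down uses.

```python
import numpy as np

def z_scores(values):
    """Return the z-score of every value in `values` (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    mean = arr.mean()   # one mean for the whole dataset
    sd = arr.std()      # one population standard deviation (ddof=0)
    if sd == 0:
        # All values are identical, so no point deviates from the mean.
        return np.zeros_like(arr)
    return (arr - mean) / sd  # one z-score per data point

# Illustrative sample with mean 5 and standard deviation 2
sample = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(sample)
print(scores)  # z-scores: -1.5, -0.5, -0.5, -0.5, 0, 0, 1, 2
```

The value 9 sits two standard deviations above the mean, so its Z-score is 2; the value 2 sits one and a half standard deviations below, so its Z-score is -1.5.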

Example

Below is a Python implementation of Z-score with a few sample data points. I’ve added a comment on each line of code to explain what’s going on.

# import numpy
import numpy as np

# sample data points: mostly 5s, plus two extreme values
data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]
# calculate the mean of the whole dataset
mean = np.mean(data)
# calculate the (population) standard deviation of the whole dataset
sd = np.std(data)
# choose a threshold, in units of standard deviations
threshold = 2
# create an empty list to store outliers
outliers = []
# detect outliers
for x in data:
    z = (x - mean) / sd     # calculate the z-score of this data point
    if abs(z) > threshold:  # flag it if it is too far from the mean
        outliers.append(x)  # add it to the list of outliers
# print outliers
print("The detected outliers are:", outliers)


Caution and conclusion

If you play with this data you will notice a few things:

  • There are 14 data points, and Z-score correctly detected 2 outliers [-99 and 88]. However, if you remove five of the ordinary data points (the 5s) from the list, it detects only 1 outlier [-99]. With fewer points, the two extreme values inflate the standard deviation enough to hide 88, so you need a sufficient number of data points for Z-score to work.
  • In large production datasets, Z-score works best if the data are normally distributed (i.e. follow a Gaussian distribution).
  • I used an arbitrary threshold of 2, beyond which all data points are flagged as outliers. The rule of thumb is to use 2, 2.5, 3 or 3.5 as the threshold.
  • Finally, Z-score is sensitive to extreme values, because the mean and standard deviation it is built on are themselves sensitive to extreme values.
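If you want to reproduce these observations, the following is a rough sketch. The helper detect_outliers is a name I made up for this illustration, and the reduced list assumes that the five removed points are all ordinary 5s.

```python
import numpy as np

def detect_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold` (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    sd = arr.std()  # population standard deviation, as in the example above
    if sd == 0:
        return []   # identical values: nothing can be an outlier
    z = (arr - arr.mean()) / sd
    return arr[np.abs(z) > threshold].tolist()

# The original 14 data points
full = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]
# The same data with five of the 5s removed (9 points left)
reduced = [5, 5, 5, -99, 5, 5, 88, 5, 5]

print(detect_outliers(full))     # [-99.0, 88.0]
print(detect_outliers(reduced))  # [-99.0]

# Effect of the rule-of-thumb thresholds on the full dataset
for t in (2, 2.5, 3, 3.5):
    print(t, detect_outliers(full, threshold=t))
```

On the full dataset the Z-scores of -99 and 88 are roughly -2.88 and 2.38, so a threshold of 2.5 keeps only -99 and a threshold of 3 flags nothing at all. In the reduced dataset the standard deviation grows, the Z-score of 88 drops to about 1.93, and it slips under the threshold of 2.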

I hope this was useful. Feel free to get in touch via Twitter.

#machine-learning #anomaly-detection #outlier-detection #statistics #data-science
