Most of the time I write longer articles on data science topics, but recently I’ve been thinking about writing small, bite-sized pieces around specific concepts, algorithms and applications. This is my first attempt in that direction, and I hope people will like these pieces.
In today’s “small-bite” I’m writing about Z-score in the context of anomaly detection.
Anomaly detection is the process of identifying unexpected data, events or behavior that require examination. It is a well-established field within data science, and there are many algorithms for detecting anomalies in a dataset, depending on data type and business context. Z-score is probably the simplest of them: it rapidly screens candidates for further examination to determine whether they are suspicious or not.
What is a Z-score?
Simply speaking, a Z-score is a statistical measure that tells you how far a data point is from the rest of the dataset. In more technical terms, a Z-score tells you how many standard deviations a given observation is away from the mean.
For example, a Z-score of 2.5 means that the data point is 2.5 standard deviations away from the mean. If that distance exceeds a chosen threshold, the point is far enough from the center to be flagged as an outlier/anomaly.
How does it work?
Z-score is a parametric measure and it takes two parameters: mean and standard deviation.
Once you calculate these two parameters, finding the Z-score of a data point is easy: z = (x - mean) / standard deviation.
Note that the mean and standard deviation are calculated for the whole dataset, whereas x represents a single data point. That means every data point has its own Z-score, while the mean and standard deviation stay the same for all of them.
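As a minimal sketch of that point, the snippet below uses a small hypothetical set of values (made up purely for illustration) and computes the mean and standard deviation once, then one Z-score per value. It assumes NumPy's default `np.std` behavior, which is the population standard deviation (`ddof=0`).

```python
import numpy as np

# Hypothetical values, chosen only so the arithmetic is easy to follow.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()  # a single mean for the whole dataset (5.0 here)
sd = x.std()     # a single population standard deviation, ddof=0 (2.0 here)

# Broadcasting applies the same mean and sd to every element,
# producing one z-score per data point.
z_scores = (x - mean) / sd

for value, z in zip(x, z_scores):
    print(f"x = {value:4.1f}  z = {z:+.2f}")
```

With these values the mean is 5 and the standard deviation is 2, so the value 9 gets a Z-score of +2.00 and the value 2 gets -1.50.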
Example
Below is a Python implementation of Z-score with a few sample data points. I’m adding a comment to each line of code to explain what’s going on.
## import numpy
import numpy as np
## sample data points to calculate z-score
data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]
## calculate mean
mean = np.mean(data)
## calculate standard deviation
sd = np.std(data)
## determine a threshold
threshold = 2
## create empty list to store outliers
outliers = []
## detect outliers
for i in data:
    z = (i - mean) / sd  ## calculate z-score
    if abs(z) > threshold:  ## identify outliers
        outliers.append(i)  ## add to the list of outliers
## print outliers
print("The detected outliers are: ", outliers)
Caution and conclusion
If you play with these data you will notice a few things:
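One way to experiment, sketched here with the same sample data, is to rerun the detection with different thresholds. The helper `detect_outliers` and the threshold values 2, 2.5 and 3 are my own choices for illustration, not part of the example above.

```python
import numpy as np

# Same sample data as in the example above.
data = np.array([5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5])

mean = data.mean()
sd = data.std()  # population standard deviation (ddof=0), NumPy's default
z_scores = (data - mean) / sd


def detect_outliers(values, z, threshold):
    """Return the values whose absolute z-score exceeds the threshold."""
    return values[np.abs(z) > threshold].tolist()


# Illustrative thresholds; the right value depends on the data and use case.
for threshold in (2, 2.5, 3):
    print(f"threshold = {threshold}: {detect_outliers(data, z_scores, threshold)}")
```

With these data, a threshold of 2 flags both -99 and 88, a threshold of 2.5 flags only -99, and a threshold of 3 flags nothing. The two extreme values themselves pull the mean down to 3.5 and inflate the standard deviation to roughly 35.5, which shrinks their own Z-scores to about -2.9 and +2.4.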
I hope this was useful. Feel free to get in touch via Twitter.