In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

Image for post

In this new article we will explain what outliers are and why they are so important, we will see a practical step by step example in Python, 1, 2 and 3 dimensional displays and the use of a general purpose library.

##What are the Outliers?

It is interesting to see the translations of “outlier” — according to its context — in English:

  • Atypical
  • Featured
  • Exceptional
  • Abnormal
  • Extreme Value, Abnormal Value, Aberrant Value!

That gives us an idea, doesn’t it?

That is, the outliers in our dataset will be the values that “escape the range where most samples are concentrated”. According to Wikipedia they are the samples that are distant from other observations.

Box plot of data from the Michelson–Morley experiment displaying four outliers in the middle column, as well as one outlier in the first column.

Detection of Outliers

And why are we interested in detecting these Outliers? Because they can considerably affect the results that a Machine Learning model can obtain… For bad… or for good! That’s why we have to detect them, and take them into account. For example in Linear Regression or Assembly algorithms can have a negative impact on your predictions.

Good Outliers vs. Bad Outliers

Outliers can mean a number of things:

  • ERROR: If we have a “people age” group and we have a 160 year old person, it’s probably a data loading error. In this case, detecting outliers helps us to detect errors.
  • LIMITS: In other cases, we can have values that escape from the “middle group”, but we want to keep the data modified, so that it does not harm the learning of the ML model.
  • Point of Interest: maybe the “anomalous” cases are the ones we want to detect and they are our target (and not our enemy!)

Many times it is easy to identify the outliers in graphs. Let’s see examples of outliers in 1, 2 and 3 dimensions.

#detection #outliers #anomaly #python

Finding Outliers
1.30 GEEK