1594645440

# Robust Location and Scale Estimator in Outlier Detection

One of the main goals of statistical analysis is to estimate the location and scale parameters of a statistical distribution. The location parameter specifies the typical, or central, value of the distribution, while the scale parameter measures its dispersion or variation.

Three common definitions of the location parameter are:

1. **_Mean value:_** the arithmetic mean of the data samples, usually referred to as their average. The mean is easily affected by extreme values in the tails.

2. **_Median value:_** the middle data point, such that half of the data is smaller and half is larger. Unlike the mean, the median can be an actual data point. When the number of observations is odd, the median is the data point in the middle of the observations sorted in ascending order. When the number of observations is even, the median is the average of the two middle data points. One advantage of the median is that it is less affected by extreme values than the mean, which makes it a strong candidate for robust location estimation when there are many outliers.

3. **_Mode value:_** the value that occurs with the highest probability. It is usually obtained from a histogram of the observation samples.
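As a small illustration of how differently the three estimators react to an extreme value, the sketch below uses Python's standard `statistics` module on a made-up sample. The numbers are hypothetical and chosen only to make the effect visible.

```python
import statistics

# Hypothetical sample: nine ordinary readings, then the same readings
# plus one extreme value.
clean = [10, 11, 11, 12, 12, 12, 13, 13, 14]
with_outlier = clean + [100]


def summarize(samples):
    """Return (mean, median, mode) of a list of numbers."""
    return (
        statistics.mean(samples),
        statistics.median(samples),
        statistics.mode(samples),
    )


clean_summary = summarize(clean)
outlier_summary = summarize(with_outlier)

print("clean:       ", clean_summary)
print("with outlier:", outlier_summary)
```

With this sample, the single extreme value moves the mean from 12 to 20.8, while the median and the mode both stay at 12.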

Depending on the shape of the distribution, the mean, median, or mode can serve as the representative of the location parameter. For example, if the underlying distribution is symmetric around its central value and has no heavy tails, as with the normal distribution, the mean is a good candidate for the location. For a normal distribution, the mean, median, and mode coincide, so their sample estimates are almost the same. In a skewed distribution, however, such as the exponential or log-normal distribution, the mean differs from the median. For skewed distributions it is not always obvious which location describes the distribution best, and it is better to report all of these parameters. **For distributions that are symmetric but heavy-tailed, the median is a better location estimator than the mean.** An example of a heavy-tailed distribution is the Cauchy distribution, whose mean is not defined. In other words, the sample mean does not converge to a single value as the sample size increases, because it is heavily affected by extreme values in the tails of the distribution. In this case, the median is a good location estimator because it is rank-based.

Because the mean is a good representative when the underlying distribution is normal, robust statistics proposes various alternatives for dealing with non-normal data. Two common approaches to a robust version of the mean are:

_Mid-mean:_ computes the mean using only the data between the 25th and 75th percentiles.

_Trimmed mean:_ computes the mean using only the data between the 5th and 95th percentiles.
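Both estimators can be sketched with one helper that sorts the data and averages only the values whose rank falls between two percentiles. The function name `trimmed_mean`, the sample data, and the rank-truncation rule for locating percentiles are assumptions made for this illustration; libraries may use a different percentile convention.

```python
import statistics


def trimmed_mean(samples, lower_pct, upper_pct):
    """Mean of the sorted samples lying between two integer percentiles.

    Percentiles are mapped to list positions by simple rank truncation,
    which is only one of several possible conventions.
    """
    ordered = sorted(samples)
    n = len(ordered)
    start = (n * lower_pct) // 100              # values dropped from the low end
    stop = n - (n * (100 - upper_pct)) // 100   # values kept up to the high end
    return statistics.mean(ordered[start:stop])


# Hypothetical sample: the integers 1..19 plus one extreme value.
data = list(range(1, 20)) + [1000]

plain_mean = statistics.mean(data)
mid_mean = trimmed_mean(data, 25, 75)       # mid-mean: middle 50% of the data
trimmed_5_95 = trimmed_mean(data, 5, 95)    # trimmed mean: drop 5% per tail

print(plain_mean, mid_mean, trimmed_5_95)
```

On this sample the plain mean is 59.5, while the mid-mean and the 5 to 95 percentile trimmed mean are both 10.5, which matches the median of the data.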

When measuring the scale parameter of a distribution, two key components should be taken into account:

1. The dispersion around the centre value, i.e., the location parameter.
2. How dispersed the tails are.
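The excerpt stops before naming specific scale estimators. As a rough sketch, assuming the sample standard deviation, the interquartile range (IQR), and the median absolute deviation (MAD) are the kind of estimators in question, the code below compares how each reacts to a single extreme value in the tail. The helper names `iqr` and `mad` and the sample data are hypothetical.

```python
import statistics


def iqr(samples):
    """Interquartile range: spread of the middle half of the data."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return q3 - q1


def mad(samples):
    """Median absolute deviation around the median (unscaled)."""
    center = statistics.median(samples)
    return statistics.median(abs(x - center) for x in samples)


# Hypothetical sample, with and without one extreme value.
clean = [10, 11, 11, 12, 12, 12, 13, 13, 14]
with_outlier = clean + [100]

for name, sample in (("clean", clean), ("with outlier", with_outlier)):
    print(name, statistics.stdev(sample), iqr(sample), mad(sample))
```

The standard deviation squares every distance from the mean, so one value far out in the tail dominates it. The IQR and MAD depend on ranks, so they change little or not at all.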

#outliers #data-science #statistics #data analysis

1603011600

## Outlier Detection with Multivariate Normal Distribution in Python

_All the code files will be available at:_ https://github.com/ashwinhprasad/Outliers-Detection/blob/master/Outliers.ipynb

## What is an Outlier?

Anything that is unusual and deviates from the standard “normal” is called an anomaly or an outlier.

Detecting these anomalies in the given data is called anomaly detection.

For more theoretical information about outlier or anomaly detection, check out: **How Anomaly Detection Works?**

## Why do we need to remove outliers or detect them?

**Case 1:** Consider a situation where a big manufacturing company is building an airplane. An airplane has many different parts, and we don’t want any of them to behave in an unusual way. Such unusual behaviour can have various causes. We want to detect these parts before they are fitted to an airplane; otherwise, the lives of the passengers might be in danger.

**Case 2:** As you can see in the image above, outliers can affect the equation of the line of best fit. So, before fitting the line, it is important to remove outliers in order to get the most accurate predictions.

In this post, I will be using the multivariate normal distribution.
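The general idea can be sketched without any third-party library for the two-feature case: fit a mean vector and covariance matrix to data assumed to be normal, evaluate the bivariate normal density at each new point, and flag points whose density falls below a threshold. This is not the code from the linked notebook. The function names, the sample points, and the threshold `EPSILON` are all assumptions made for illustration.

```python
import math


def fit_gaussian_2d(points):
    """Estimate the mean vector and 2x2 covariance of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return (mean_x, mean_y), (sxx, sxy, syy)


def density(point, mean, cov):
    """Bivariate normal probability density at `point`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Squared Mahalanobis distance, using the closed-form 2x2 inverse.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * d2) / (2 * math.pi * math.sqrt(det))


# Hypothetical "normal" observations used to fit the distribution.
train = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (1.1, 2.2),
         (0.9, 1.8), (1.0, 2.0), (1.3, 2.1), (0.7, 1.9)]
mean, cov = fit_gaussian_2d(train)

# Hypothetical threshold; in practice it would be tuned on labelled data.
EPSILON = 0.01

candidates = [(1.05, 2.05), (3.0, 0.5)]
flags = [density(p, mean, cov) < EPSILON for p in candidates]
print(flags)
```

The first candidate sits close to the fitted mean and is kept, while the second lies far from the bulk of the training points, receives a vanishingly small density, and is flagged as an outlier.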

#outlier-detection #anomaly-detection #machine-learning #python #outliers

1601528520

## What is an Outlier? Algorithms that are affected by outliers.

In statistics, an outlier is an observation point that is distant from other observations.

These extreme values do not necessarily impact model performance or accuracy, but when they do, they are called “influential” points.

Note: _An outlier is a data point that diverges from an overall pattern in a sample. An influential point is any point that has a large effect on the slope of a regression line._
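The distinction in the note can be made concrete with a small least-squares sketch. The data below is hypothetical: five points lying exactly on y = 2x, to which one extra point is added in two different places. The helper `fit_line` is written out by hand for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]  # exactly y = 2x

slope_clean, _ = fit_line(xs, ys)

# Outlier in y placed at the mean of x: it shifts the line but not its slope.
slope_mid, _ = fit_line(xs + [3], ys + [20])

# Outlier far out in x with an unexpected y: it pulls the slope strongly.
slope_far, _ = fit_line(xs + [10], ys + [2])

print(slope_clean, slope_mid, slope_far)
```

The point (3, 20) is clearly an outlier, yet the slope stays at 2 because it sits at the mean of x. The point (10, 2) is influential: with it included, the fitted slope becomes slightly negative.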

The question now is how we can detect these outliers and how we should handle them.

Before jumping straight into the solution, let’s explore how outliers get added to our dataset and what the root cause is.

#outliers #anomaly-detection #algorithms #outlier-detection #machine-learning

1601334000

## Anomaly detection with Local Outlier Factor (LOF)

Today’s article is the fifth in a series of “bite-size” articles I am writing on different techniques used for anomaly detection. If you are interested, the following are the previous four articles:

Today I am going beyond statistical techniques and stepping into machine learning algorithms for anomaly detection.

#outlier-detection #fraud-detection #data-science #machine-learning #anomaly-detection

1601280960

## Statistical techniques for anomaly detection

Anomaly and fraud detection is a multi-billion-dollar industry. According to a Nilson Report, global credit card fraud alone amounted to USD 7.6 billion in 2010. In the UK, losses from fraudulent credit card transactions were estimated at more than USD 1 billion in 2018. To counter these kinds of financial losses, a huge amount of resources is employed to identify fraud and anomalies in every single industry.

In data science, “outlier”, “anomaly” and “fraud” are often used synonymously, but there are subtle differences. An “outlier” generally refers to a data point that somehow stands out from the rest of the crowd. However, when this outlier is completely unexpected and unexplained, it becomes an anomaly. That is to say, all anomalies are outliers, but not all outliers are necessarily anomalies. In this article, however, I am using these terms interchangeably.

There are numerous reasons why understanding and detecting outliers is important. As data scientists, when we prepare data we take great care to check whether any data point is unexplained and may have been entered erroneously. Sometimes we also filter out completely legitimate outlier data points to ensure better model performance.

Anomaly detection also has huge industrial applications. Credit card fraud detection is the most cited one, but in numerous other cases anomaly detection is an essential part of doing business, such as detecting network intrusion, identifying instrument failure, and detecting tumor cells.

A range of tools and techniques are used to detect outliers and anomalies, from simple statistical techniques to complex machine learning algorithms, depending on the complexity of the data and the sophistication needed. The purpose of this article is to summarise some simple yet powerful statistical techniques that can be readily used for initial screening of outliers. While complex algorithms are sometimes unavoidable, simple techniques are often more than enough to serve the purpose.

Below is a primer on five statistical techniques.

#anomaly-detection #machine-learning #outlier-detection #data-science #fraud-detection