Missing values, or the values used to replace them, can lead to large errors in your analysis output, whether it is a machine learning model, a set of KPIs or a report.

Analysts often treat missing values as if there were only one type. That is not the case: there are three types of missing values, and each one calls for its own way of dealing with it.

Types of null values

Missing at random (MAR) : The presence of a null value in a variable is not random but depends on another characteristic of the record, known or unknown. So why is it called missing at random, you might ask? Because, once that characteristic is accounted for, the presence of the null value is independent of the actual value. Depending on your dataset, this may or may not be testable. To find out, compare the distributions of the other variables for records with missing values and records without.

**_Ex:_** A dataset on education that contains many missing values for the IQ score of young children, simply because a four-year-old is less likely to take the test than a twelve-year-old. The null values aren’t correlated with the actual IQ value but with age.

Missing completely at random (MCAR) : The presence of the null value is independent of any known or unknown characteristic of the record. Here again, depending on your dataset, this may or may not be testable. Just like for MAR, the test consists in comparing the distributions of the other variables for records with missing values against those with no null values.

Ex: Missing data for survey respondents whose questionnaire results were lost in the mail. The loss is totally independent of the variable concerned and of the characteristics of the respondents (i.e. the records).
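
The comparison described for MAR and MCAR can be sketched in a few lines. This is a minimal illustration in plain Python, not a formal test: the `records` data and the `split_by_missingness` helper are hypothetical, and comparing means is only a first look (a proper statistical test on the two groups would be the next step).

```python
from statistics import mean

# Hypothetical records: (age, iq). None marks a missing IQ score.
records = [
    (4, None), (5, None), (4, None), (6, 101), (5, None),
    (11, 98), (12, 105), (10, 110), (12, 95), (11, 102),
]

def split_by_missingness(rows):
    """Return the ages of rows with a missing IQ and of rows with an observed IQ."""
    ages_missing = [age for age, iq in rows if iq is None]
    ages_observed = [age for age, iq in rows if iq is not None]
    return ages_missing, ages_observed

ages_missing, ages_observed = split_by_missingness(records)
mean_age_missing = mean(ages_missing)
mean_age_observed = mean(ages_observed)

print(f"mean age when IQ is missing:  {mean_age_missing:.1f}")
print(f"mean age when IQ is observed: {mean_age_observed:.1f}")
```

A clear gap between the two groups, as in this made-up data, suggests that the nulls depend on age and are therefore not MCAR. Similar distributions are compatible with MCAR but do not prove it.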

Missing not at random (MNAR): The presence of the null value depends on its actual value. This one cannot be tested unless you know the actual value, which is a bit paradoxical.

Ex: Missing values for the IQ variable only for individuals who had a low score.


As you might have guessed, it is safe to drop the null values only in the second case (MCAR).

In the two other cases, dropping values would result in ignoring a group of the overall population.

In the last case (MNAR), the fact that the record has a null value carries some information about the actual value.

Dealing with missing values

Drop

Dropping row : (Only for MCAR) This can be the perfect solution if you have only a small proportion of missing values relative to your dataset size. However, it quickly becomes unviable as the proportion grows.

**Dropping col :** This one is often not considered because it results in a significant loss of information. As a rule of thumb, you can start considering it when the proportion of null values in the column is higher than 60%.
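
Both drop strategies can be sketched as follows. This is a minimal example in plain Python under stated assumptions: the `rows` data, the helper names and the `COLUMN_DROP_THRESHOLD` constant are hypothetical, and the threshold simply encodes the 60% rule of thumb above.

```python
# Hypothetical dataset: one dict per record. None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "phone": None},
    {"age": 41, "income": None, "phone": None},
    {"age": 29, "income": 48000, "phone": "555-0101"},
    {"age": 50, "income": 61000, "phone": None},
    {"age": 38, "income": 57000, "phone": None},
]

COLUMN_DROP_THRESHOLD = 0.6  # rule of thumb from the text, not a hard limit

def null_ratio(rows, column):
    """Proportion of records whose value for `column` is missing."""
    return sum(row[column] is None for row in rows) / len(rows)

def drop_sparse_columns(rows, threshold=COLUMN_DROP_THRESHOLD):
    """Remove every column whose proportion of nulls exceeds the threshold."""
    sparse = {col for col in rows[0] if null_ratio(rows, col) > threshold}
    return [{k: v for k, v in row.items() if k not in sparse} for row in rows]

def drop_incomplete_rows(rows):
    """Keep only the records that have no missing value at all."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Drop mostly-empty columns first so that they do not wipe out every row.
without_sparse_columns = drop_sparse_columns(rows)
cleaned = drop_incomplete_rows(without_sparse_columns)
print(cleaned)
```

Dropping the sparse column before dropping rows is a deliberate ordering: in this data, dropping incomplete rows first would leave a single record.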

Imputation

Last or next value : (Only for time series with MCAR) In a time series, it is acceptable to fill a missing value with the last value before it or the next value after it.

Mean value : (Only for MCAR) Using the mean value is often a bad solution as it is sensitive to outliers.
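
Filling with the last or the next value can be sketched like this. The `forward_fill` and `backward_fill` helpers and the `series` data are hypothetical names used for illustration, and the series is assumed to be ordered in time.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value before it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def backward_fill(values):
    """Replace each None with the next observed value after it."""
    return forward_fill(values[::-1])[::-1]

# Hypothetical time-ordered series. None marks a missing observation.
series = [None, 10.0, None, None, 13.0, None]

ffilled = forward_fill(series)   # a leading None has no previous value
bfilled = backward_fill(series)  # a trailing None has no next value
print(ffilled)
print(bfilled)
```

Note that each method leaves one edge of the series unfilled, so the two are sometimes applied one after the other.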

Median value : (Only for MCAR) Similar to mean value but more robust to outliers.

Mode value : (Only for MCAR) By choosing the most common value, you fill the null with the single value that is most likely to be correct. Beware of multimodal distributions, for which this is no longer a viable solution.

Replace with constant : (Only for MNAR) As we have seen before, a missing value in the MNAR case actually holds some information about the actual value. So it makes sense to fill the nulls with a constant that is different from all the other values.
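
The three statistics above can be compared on a small example. This is a sketch using only Python's `statistics` module. The `impute` helper and the sample data are hypothetical, and refusing to impute a multimodal column is one possible policy, not the only one.

```python
from statistics import mean, median, multimode

def impute(values, strategy):
    """Fill None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        modes = multimode(observed)
        if len(modes) > 1:
            # Several equally common values: the mode is not a safe choice.
            raise ValueError(f"distribution is multimodal: {modes}")
        fill = modes[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical salaries (in thousands). 500 is an outlier.
salaries = [30, 32, 31, None, 29, 500]
mean_filled = impute(salaries, "mean")      # pulled up by the outlier
median_filled = impute(salaries, "median")  # stays close to typical values

# Hypothetical categorical column with a single clear mode.
colours = ["red", "blue", None, "red"]
mode_filled = impute(colours, "mode")

print(mean_filled[3], median_filled[3], mode_filled[2])
```

On this data the mean fills the gap with a value far above every typical salary, while the median stays in the normal range.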
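
A constant fill for the MNAR case might look like the sketch below. The `MISSING_SENTINEL` value, the `fill_with_constant` helper and the data are hypothetical. The extra `was_missing` flag is an addition of this example, kept so that the information carried by the null is not lost if the constant is later transformed.

```python
MISSING_SENTINEL = -1  # assumption: real IQ scores are never negative

def fill_with_constant(values, constant=MISSING_SENTINEL):
    """Replace None with a constant and return a was-missing flag per value."""
    if any(v == constant for v in values if v is not None):
        # The constant must differ from every observed value.
        raise ValueError("constant collides with an observed value")
    filled = [constant if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# Hypothetical IQ scores. None marks a missing score.
iq_scores = [102, None, 95, None, 110]
filled, was_missing = fill_with_constant(iq_scores)
print(filled)
print(was_missing)
```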

Linear interpolation : (Only for time series with MCAR) In a time series problem with a trend and little to no seasonality, a missing value can be approximated by linear interpolation between the value before it and the value after it. Here is the formula :

Linear interpolation (1st order)
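
In its standard first-order form, the estimate for a missing point at time $t$ lying between two observed points $(t_0, y_0)$ and $(t_1, y_1)$ is:

$$y = y_0 + (t - t_0)\,\frac{y_1 - y_0}{t_1 - t_0}$$

A minimal sketch of that formula in plain Python follows. The `interpolate_linear` helper is a hypothetical name, and the list index is assumed to be the time axis, which means the observations are assumed to be evenly spaced.

```python
def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between observed neighbours.

    The list index is used as the time axis. Leading and trailing None
    values have only one neighbour and are left untouched.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for t0, t1 in zip(known, known[1:]):
        y0, y1 = values[t0], values[t1]
        for t in range(t0 + 1, t1):
            filled[t] = y0 + (t - t0) * (y1 - y0) / (t1 - t0)
    return filled

# Hypothetical series with a steady upward trend. None marks a gap.
series = [10.0, None, None, 16.0, None, 20.0]
result = interpolate_linear(series)
print(result)
```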

Spline interpolation : (Only for time series with MCAR) This is similar to linear interpolation, but it uses higher-order piecewise polynomials to produce a smoother interpolation. Again, it is not suitable for seasonal data.

Linear/Spline interpolation with seasonal adjustment : (Only for time series with MCAR) It follows the same principle as linear and spline interpolation but adjusts for seasonality. It consists in deseasonalizing the data, applying linear/spline interpolation, and then adding the seasonality back to the time series. STL (Seasonal and Trend decomposition using Loess) is one method for deseasonalizing the data.
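
The three steps (deseasonalize, interpolate, add the seasonality back) can be sketched as below. This is a deliberately simplified illustration and not STL: the seasonal component is assumed to be additive and is estimated as the average deviation of each position in the cycle from the overall mean. All function names and the sample data are hypothetical, the period is assumed to be known, and every position in the cycle is assumed to have at least one observed value.

```python
from statistics import mean

def interpolate_linear(values):
    """Fill interior None gaps linearly, using the list index as the time axis."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for t0, t1 in zip(known, known[1:]):
        y0, y1 = values[t0], values[t1]
        for t in range(t0 + 1, t1):
            filled[t] = y0 + (t - t0) * (y1 - y0) / (t1 - t0)
    return filled

def seasonal_offsets(values, period):
    """Average deviation from the overall mean at each position in the cycle."""
    overall = mean(v for v in values if v is not None)
    return [
        mean(v for v in values[phase::period] if v is not None) - overall
        for phase in range(period)
    ]

def interpolate_seasonal(values, period):
    """Deseasonalize, interpolate linearly, then add the seasonality back."""
    offsets = seasonal_offsets(values, period)
    deseasonalized = [
        None if v is None else v - offsets[i % period]
        for i, v in enumerate(values)
    ]
    filled = interpolate_linear(deseasonalized)
    return [
        None if v is None else v + offsets[i % period]
        for i, v in enumerate(filled)
    ]

# Hypothetical series repeating the pattern 10, 20, 30, 20 (period 4).
# The value at index 4, which should be a seasonal low of 10, is missing.
series = [10, 20, 30, 20, None, 20, 30, 20, 10, 20, 30, 20]

plain = interpolate_linear(series)
adjusted = interpolate_seasonal(series, period=4)
print(plain[4], round(adjusted[4], 6))
```

Plain linear interpolation fills the gap with the average of its two neighbours and misses the seasonal low, while the adjusted version recovers it. On data with a strong trend, a proper decomposition such as STL would be a better way to estimate the seasonal component than these simple averages.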


Don’t let missing values ruin your analysis output. Deal with them!