Missing Value Imputation – A Review

Missing values occur in all kinds of datasets, from industry to academia. They can be represented in different ways: sometimes by a question mark or -999, sometimes by “n/a”, or by some other dedicated number or character. Detecting and handling missing values correctly is important, because they can affect the results of the analysis and some algorithms cannot handle them at all. So what is the correct way?
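Because the same gap can be encoded in several ways, a first step is usually to map every marker to a single representation. The following is a minimal Python sketch of that idea, not a definitive implementation. The marker set, the function names, and the sample values are assumptions made for illustration; the markers used in a real dataset have to come from its documentation.

```python
# Hypothetical set of markers that stand for "missing" in this example.
# Real datasets may use other markers; check the data documentation.
MISSING_MARKERS = {"?", "n/a", "na", "", "-999", "-999.0"}


def normalize_missing(raw_values, markers=MISSING_MARKERS):
    """Replace every missing marker with None and convert the rest to float."""
    cleaned = []
    for value in raw_values:
        if value is None:
            cleaned.append(None)
            continue
        text = str(value).strip().lower()
        if text in markers:
            cleaned.append(None)
        else:
            cleaned.append(float(text))
    return cleaned


def count_missing(values):
    """Count the entries that were recognized as missing."""
    return sum(1 for v in values if v is None)


# Made-up raw column mixing several missing-value encodings.
raw = ["12.5", "?", -999, "n/a", "7", None]
cleaned = normalize_missing(raw)
print(cleaned)
print("missing:", count_missing(cleaned))
```

Once all markers share one representation, counting and imputing missing values no longer depends on how each source happened to encode them.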

How to choose the correct strategy

Two common approaches to imputing missing values are to replace all missing values with either a fixed value, for example zero, or with the mean of all available values. Which approach is better?

Let’s see the effects on two different case studies:

  • Case Study 1: threshold-based anomaly detection on sensor data
  • Case Study 2: a report on aggregated customer data

Case Study 1: Imputation for threshold-based anomaly detection

In a classic threshold-based solution for anomaly detection, a threshold, calculated from the mean and variance of the original data, is applied to the sensor data to generate an alarm. If the missing values are imputed with a fixed value, e.g. zero, this will affect the calculation of the mean and variance used for the threshold definition. This would likely lead to a wrong estimate of the alarm threshold and to some expensive downtime.

Here imputing the missing values with the mean of the available values is the right way to go.
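The effect on the threshold can be illustrated with a small Python sketch. It assumes the alarm rule is "mean plus k standard deviations", which is one common choice and not necessarily the rule used in a given system. The readings, the value of k, and the function names are made up for the example.

```python
import statistics


def impute(values, fill):
    """Replace None entries with the given fill value."""
    return [fill if v is None else v for v in values]


def alarm_threshold(values, k=3.0):
    """Assumed rule for this sketch: mean + k * (population) standard deviation."""
    return statistics.mean(values) + k * statistics.pstdev(values)


# Made-up sensor readings; None marks a value the sensor failed to record.
readings = [10.0, 11.0, None, 9.0, None, 10.0]
observed = [v for v in readings if v is not None]

zero_imputed = impute(readings, 0.0)
mean_imputed = impute(readings, statistics.mean(observed))

print("observed only:", alarm_threshold(observed))
print("zero imputed: ", alarm_threshold(zero_imputed))
print("mean imputed: ", alarm_threshold(mean_imputed))
```

With these numbers, zero imputation pulls the mean down and inflates the spread, so the threshold drifts far from the one computed on the observed readings. Mean imputation keeps the mean unchanged, but it also shrinks the variance somewhat, because the filled-in values sit exactly at the center.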

Case Study 2: Imputation for aggregated customer data

In a classic reporting exercise on customer data, the number of customers and the total revenue for each geographical area of the business need to be aggregated and visualized, for example via bar charts. The customer dataset has missing values for those areas where the business has not started or has not picked up yet, so no customers and no revenue have been recorded. In this case, imputing the missing values with the mean of the available numbers would invent customers and revenue where there are none.

The right way to go here is to impute the missing values with a fixed value of zero.
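A short Python sketch shows how the two strategies change the report. The area names and revenue figures are hypothetical, and the dictionary layout is only an assumption about how the aggregated data might be held.

```python
import statistics

# Hypothetical revenue per area; None means no business has been recorded yet.
revenue_by_area = {"North": 120.0, "South": 80.0, "East": None, "West": None}


def fill_missing(data, fill):
    """Return a copy of the mapping with None values replaced by fill."""
    return {area: (fill if value is None else value) for area, value in data.items()}


observed = [v for v in revenue_by_area.values() if v is not None]

zero_filled = fill_missing(revenue_by_area, 0.0)
mean_filled = fill_missing(revenue_by_area, statistics.mean(observed))

total_zero = sum(zero_filled.values())
total_mean = sum(mean_filled.values())

print("total revenue, zero fill:", total_zero)
print("total revenue, mean fill:", total_mean)
```

In this example, zero fill leaves the total equal to the revenue actually recorded, while mean fill credits the two inactive areas with revenue that was never earned and doubles the reported total.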

In both cases, it is our knowledge of the process that tells us how to impute the missing values. In the case of sensor data, missing values are due to a malfunction of the measuring machine, so real numerical values exist but were not recorded. In the case of the customer dataset, missing values appear where there is nothing to measure yet.

These two examples already show that there is no panacea for missing value imputation, and we can’t give a general answer to the classic question: “which strategy is correct for missing value imputation for my dataset?” The answer depends too much on the domain and on business knowledge.

We can, however, provide a review of the most commonly used techniques to:

  • Detect whether the dataset contains missing values and of which type, 
  • Impute the missing values.
