Missing values occur in all kinds of datasets, from industry to academia. They can be represented in different ways: sometimes by a question mark or -999, sometimes by “n/a”, or by some other dedicated number or character. Detecting and handling missing values correctly is important, because they can distort the results of the analysis and some algorithms can’t handle them at all. So what is the correct way?
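Because missing values can hide behind several different markers, a first step is often to map all of them to a single representation. The following is a minimal sketch of that idea in plain Python. The marker set and the sample column are assumptions made for illustration; the markers that actually occur depend on the dataset at hand.

```python
# Hypothetical set of markers that stand for "missing" in this example.
# Real datasets may use other markers; adjust the set accordingly.
MISSING_MARKERS = {"?", "n/a", "", "-999"}


def normalize_missing(value):
    """Return None if the value is a known missing-value marker, else the value."""
    if value is None:
        return None
    # Compare on the lowercased, stripped string form so "N/A" and " n/a " match.
    if str(value).strip().lower() in MISSING_MARKERS:
        return None
    return value


# Hypothetical raw column mixing real values and missing-value markers.
raw_column = [12.5, "?", -999, "n/a", 7.0, "N/A", 3]

cleaned = [normalize_missing(v) for v in raw_column]
missing_count = sum(v is None for v in cleaned)

print(cleaned)
print(missing_count)
```

Note that this sketch compares string forms, so a float such as -999.0 would not match the "-999" marker; whether that matters depends on how the data was read in.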
Two common approaches to imputing missing values are to replace all missing values with either a fixed value, for example zero, or with the mean of all available values. Which approach is better?
Let’s see the effects on two different case studies:
Case Study 1: Imputation for threshold-based anomaly detection
In a classic threshold-based solution for anomaly detection, a threshold, calculated from the mean and variance of the original data, is applied to the sensor data to generate an alarm. If the missing values are imputed with a fixed value, e.g. zero, this will affect the calculation of the mean and variance used for the threshold definition. This would likely lead to a wrong estimate of the alarm threshold and to some expensive downtime.
Here imputing the missing values with the mean of the available values is the right way to go.
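The effect described in this case study can be checked on a few numbers. The sketch below assumes a threshold rule of mean plus k standard deviations, which is one common choice and not necessarily the rule used in any particular system; the sensor readings are invented for illustration.

```python
from statistics import mean, stdev

# Hypothetical sensor readings; None marks a reading that was not recorded.
readings = [20.1, 19.8, None, 20.4, None, 20.0, 19.7]


def threshold(values, k=3.0):
    """Upper alarm threshold, assumed here to be mean + k standard deviations."""
    return mean(values) + k * stdev(values)


# Statistics computed only on the values that were actually recorded.
observed = [r for r in readings if r is not None]
observed_mean = mean(observed)

# The two imputation strategies compared in the text.
zero_imputed = [0.0 if r is None else r for r in readings]
mean_imputed = [observed_mean if r is None else r for r in readings]

print("observed only:", threshold(observed))
print("zero imputed: ", threshold(zero_imputed))
print("mean imputed: ", threshold(mean_imputed))
```

With these numbers, zero imputation pulls the mean down and inflates the standard deviation, so the resulting threshold ends up far from the one computed on the recorded values. Mean imputation leaves the mean unchanged, although it does shrink the standard deviation somewhat, because the imputed points sit exactly on the mean.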
Case Study 2: Imputation for aggregated customer data
In a classic reporting exercise on customer data, the number of customers and the total revenue for each geographical area of the business need to be aggregated and visualized, for example via bar charts. The customer dataset has missing values for those areas where the business has not started or has not picked up yet, so no customers and no revenue have been recorded. In this case, using the mean of the available numbers to impute the missing values would invent customers and revenue where there are none.
*The right way to go here is to impute the missing values with a fixed value of zero.*
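A small numeric sketch makes the difference visible. The area names and revenue figures below are invented for illustration; None stands for an area where no business has been recorded yet.

```python
from statistics import mean

# Hypothetical revenue per area; None means no business recorded there yet.
revenue_by_area = {
    "North": 120000.0,
    "South": 80000.0,
    "East": None,
    "West": None,
}

observed = [v for v in revenue_by_area.values() if v is not None]
observed_mean = mean(observed)

# Fixed-value imputation: areas without business contribute nothing.
zero_filled = {
    area: (0.0 if v is None else v) for area, v in revenue_by_area.items()
}

# Mean imputation: areas without business are assigned an average revenue.
mean_filled = {
    area: (observed_mean if v is None else v) for area, v in revenue_by_area.items()
}

total_zero = sum(zero_filled.values())
total_mean = sum(mean_filled.values())

print("total revenue, zero imputation:", total_zero)
print("total revenue, mean imputation:", total_mean)
```

In this example the mean-imputed total is twice the recorded revenue, because two areas with no business are each credited with the average of the other two.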
In both cases, it is our knowledge of the process that suggests to us the right way to proceed in imputing missing values. In the case of sensor data, missing values are due to a malfunctioning of the measuring machine and therefore real numerical values are just not recorded. In the case of the customer dataset, missing values appear where there is nothing to measure yet.
These two examples already show that there is no panacea for missing value imputation, and we clearly can’t provide an answer to the classic question: “Which strategy is correct for missing value imputation for my dataset?” The answer depends too heavily on the domain and on business knowledge.
We can however provide a review of the most commonly used techniques to:
Popular strategies to handle missing values in the dataset. Real-world data often has many missing values. They can be caused by data corruption or by a failure to record data. Handling missing data is an important part of preprocessing, as many machine learning algorithms do not support missing values. Missing values can be handled by deleting the rows or columns that contain null values. If more than half of the rows in a column are null, the entire column can be dropped.
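The deletion strategy can be sketched in a few lines of plain Python. The dataset, the column names, and the helper functions below are hypothetical and exist only for this illustration; the 0.5 cutoff follows the "more than half" rule of thumb mentioned above.

```python
# Hypothetical dataset as a list of rows; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": None, "city": "Rome"},
    {"id": 2, "age": None, "income": None, "city": "Oslo"},
    {"id": 3, "age": 51, "income": None, "city": "Lima"},
    {"id": 4, "age": 29, "income": 4200, "city": "Kyiv"},
]


def null_fraction(rows, column):
    """Fraction of rows in which the given column is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


def drop_sparse_columns(rows, max_null_fraction=0.5):
    """Drop every column whose missing fraction exceeds max_null_fraction."""
    keep = [c for c in rows[0] if null_fraction(rows, c) <= max_null_fraction]
    return [{c: row[c] for c in keep} for row in rows]


def drop_incomplete_rows(rows):
    """Keep only the rows that have no missing value in any column."""
    return [row for row in rows if all(v is not None for v in row.values())]


# First drop mostly-empty columns, then drop rows that are still incomplete.
without_sparse = drop_sparse_columns(rows)
complete = drop_incomplete_rows(without_sparse)

print(list(without_sparse[0].keys()))
print([row["id"] for row in complete])
```

The order matters here: dropping incomplete rows first would leave only one row, because the mostly-empty income column makes almost every row incomplete.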
Overview of methods for dealing with null values. Missing values, or the values used to replace them, can lead to huge errors in your analysis output, whether it is a machine learning model, a set of KPIs, or a report.