Missing value imputation is an ever-old question in data science and machine learning. Techniques go from the simple mean/median imputation to more sophisticated methods based on machine learning. How much of an impact approach selection has on the final results? As it turns out, a lot.
Let’s get a couple of things straight — missing value imputation is domain-specific more often than not. For example, a dataset might contain missing values because a customer isn’t using some service, so imputation would be the wrong thing to do.
Further, simple techniques like mean/median/mode imputation often don’t work well. And it’s easy to reason why. Extremes can influence average values in the dataset, the mean in particular. Also, filling 10% or more of the data with the same value doesn’t sound too peachy, at least for the continuous variables.
The article is structured as follows:
#data-science #machine-learning #python