How to Use Python and MissForest Algorithm to Impute Missing Data

Missing value imputation is an age-old question in data science and machine learning. Techniques range from simple mean/median imputation to more sophisticated methods based on machine learning. How much of an impact does the choice of approach have on the final results? As it turns out, a lot.

Let’s get a couple of things straight: missing value imputation is domain-specific more often than not. For example, a dataset might contain missing values because a customer isn’t using some service, in which case the gap carries real information and imputation would be the wrong thing to do.

Further, simple techniques like mean/median/mode imputation often don’t work well, and it’s easy to see why. Extreme values can skew the averages in a dataset, the mean in particular. Also, filling 10% or more of the data with the same value doesn’t sound too peachy, at least for continuous variables, since it flattens the variable’s natural spread.
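To make the point about extremes concrete, here is a minimal sketch using only the Python standard library. The numbers are invented for illustration (a hypothetical spend column with one outlier and two missing entries), and the `impute` helper is a name made up for this example, not part of any library.

```python
from statistics import mean, median

# Hypothetical values for illustration; None marks a missing entry.
values = [10.0, 12.0, 11.0, None, 13.0, 500.0, None]

# Compute the fill statistics from the observed entries only.
observed = [v for v in values if v is not None]
mean_fill = mean(observed)      # pulled upward by the single outlier (500.0)
median_fill = median(observed)  # stays close to the typical values


def impute(data, fill):
    """Replace every missing entry (None) with one constant fill value."""
    return [fill if v is None else v for v in data]


mean_imputed = impute(values, mean_fill)
median_imputed = impute(values, median_fill)

print("mean fill:  ", mean_fill)
print("median fill:", median_fill)
print("mean-imputed:  ", mean_imputed)
print("median-imputed:", median_imputed)
```

In this toy case the mean fill lands far from every typical observation, while the median fill does not. Either way, each missing entry receives the same constant, which is the second weakness mentioned above.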

The article is structured as follows:

Problems with KNN imputation
What is MissForest?
MissForest in practice
MissForest evaluation
Conclusion

#data-science #machine-learning #python

towardsdatascience.com
