Summary

Some commonly used correlation filtering methods tend to drop more features than necessary. The problem grows as datasets get larger and as more pairwise correlations exceed the specified threshold. Dropping more variables than necessary leaves less information available to the model, which can lead to suboptimal performance. In this article, I demonstrate the shortcomings of current methods and propose a possible solution.
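To make the over-dropping concrete, here is a minimal sketch in Python. It uses a hypothetical three-feature correlation matrix (not the dataset used in this article) in which A and C are each highly correlated with B but not with each other. The `drop_upper_triangle` function reproduces a widely shared upper-triangle recipe. The `drop_greedy` function is only an illustrative alternative, assumed here for comparison, and is not necessarily the solution proposed later. The function names and the 0.75 threshold are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy correlation matrix (illustration only, not real data).
corr = pd.DataFrame(
    [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.8],
     [0.3, 0.8, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)
threshold = 0.75  # assumed cutoff for "highly correlated"


def drop_upper_triangle(corr, threshold):
    """Common recipe: drop every column that is correlated above the
    threshold with any column that appears before it."""
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.abs().where(mask)  # keep only entries above the diagonal
    return [col for col in upper.columns if (upper[col] > threshold).any()]


def drop_greedy(corr, threshold):
    """Illustrative alternative: repeatedly drop the feature involved in
    the most above-threshold pairs until no such pair remains."""
    remaining = corr.abs().copy()
    dropped = []
    while True:
        # Above-threshold pairs, ignoring each feature's self-correlation.
        high = (remaining > threshold) & ~np.eye(len(remaining), dtype=bool)
        counts = high.sum(axis=1)
        if counts.max() == 0:
            break
        worst = counts.idxmax()
        dropped.append(worst)
        remaining = remaining.drop(index=worst, columns=worst)
    return dropped


print(drop_upper_triangle(corr, threshold))  # ['B', 'C']
print(drop_greedy(corr, threshold))          # ['B']
```

In this toy case, removing B alone already leaves no pair above the threshold, yet the upper-triangle recipe also removes C, because it still compares C against the already-dropped B.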

Example

Let’s look at an example of how current methods drop features that should have remained in the dataset. We will use the Boston Housing revised dataset and show examples in both R and Python.

Are you dropping too many correlated features?