Linear regressions are among the most common and most powerful tools for data analysis. While other, more advanced forms of statistics have been developed over the years, linear regressions remain incredibly popular, because they’re easy to understand, interpret, and perform.

You can find regression implementations in nearly every programming language and analytical software package, and even on the standard TI-84 calculator. This ubiquity allows math teachers to introduce it as early as middle school, meaning most people are at least familiar with it.

With the linear regression’s success, however, comes its misuse. Because people may not completely understand its underlying assumptions, they’re more likely to make basic mistakes when applying it.

Luckily, some of those mistakes are easy to fix.

Fitting to Non-Linear Data

A Line of Best Fit on non-linear data. Figure produced by author.

Despite “linear” being in the name, one of the most common mistakes in linear regressions is fitting to non-linear data. The illustration above shows why this is a bad idea.

The straight line, the linear regression, doesn’t follow the curve of the data it’s designed to mimic. As a result, the model fits the data poorly and makes unreliable predictions.

Nearly everybody does this at least once because they don’t take the time to do proper data exploration. Plotting each independent variable against the dependent variable to check for a linear relationship, calculating correlation coefficients, or performing a principal component analysis can help prevent this mistake in the first place.
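As a minimal sketch of that exploration step, here’s how a correlation check can flag a non-linear relationship. The dataset below is made up for illustration: a Pearson correlation coefficient captures only the linear component of a relationship, so it comes out near zero on parabolic data even though the dependence is strong.

```python
import numpy as np

# Hypothetical dataset: y has a quadratic (non-linear) relationship with x.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.5, size=x.size)

# Pearson correlation measures only the linear component of the
# relationship, so it is near zero here despite the strong dependence.
r_linear = np.corrcoef(x, y)[0, 1]

# Correlating a transformed feature (x squared) reveals the relationship.
r_quadratic = np.corrcoef(x**2, y)[0, 1]

print(f"corr(x,  y) = {r_linear:.2f}")
print(f"corr(x², y) = {r_quadratic:.2f}")
```

A near-zero correlation on a variable you expected to matter is a hint to plot it, not to drop it.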

The best solution, however, is to identify what type of relationship X has with Y and transform X accordingly before fitting. For example, if the data forms a parabolic relationship, like in the example above, use X² as the independent variable instead of X.
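The X² transformation can be sketched as follows. The parabolic data here is simulated for illustration; the point is the gap in R² between regressing Y on X directly and regressing Y on X².

```python
import numpy as np

# Hypothetical parabolic data, like the figure above.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.5, size=x.size)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Naive fit: regress y on x directly. The line cannot follow the curve.
slope, intercept = np.polyfit(x, y, 1)
naive_pred = slope * x + intercept

# Transformed fit: use x squared as the independent variable instead.
slope2, intercept2 = np.polyfit(x**2, y, 1)
transformed_pred = slope2 * x**2 + intercept2

print(f"R² with x : {r_squared(y, naive_pred):.2f}")
print(f"R² with x²: {r_squared(y, transformed_pred):.2f}")
```

The model is still a linear regression in both cases; only the feature changed. That’s why this fix is cheap: no new algorithm, just a transformed column.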

#machine-learning #data-science #linear-regression #data-analysis #regression

4 Common Mistakes Everybody Makes With Regressions