The Consequences of Violating Linear Regression Assumptions. How violating assumptions can impact prediction and inference

Recently, a friend learning linear regression asked me what happens when assumptions, such as the absence of multicollinearity, are violated. Despite being a former statistics student, I could only give him general answers like “you won’t be able to trust the estimates of your model.” Unsatisfied with my response, I decided to create a real-world example, via simulation, to show what can happen to prediction and inference when certain assumptions are violated.

Suppose researchers are interested in understanding what drives the price of a house. Let’s pretend that housing prices are determined by just two variables: the size and age of the house. While age holds a negative, linear relationship with price, the size of the house has a positive, quadratic (non-linear) relationship with price. Mathematically, we can model this relationship like so:

_Priceᵢ = β₀ + β₁*sqftᵢ + β₂*sqftᵢ² − β₃*age_yearsᵢ + eᵢ_

where *Price* is the price of a house in thousands of dollars, *sqft* is the size of the house in thousands of square feet, and *age_years* is the age of the house in years. The errors *e* are normally distributed with mean 0 and variance *σₑ²*. Let’s call this the true model since it accounts for everything that drives housing prices (apart from random error). Since researchers don’t have a crystal ball telling them what the true model is, they test out a few linear regression models. Here’s what they came up with, in no particular order:

(1) _Priceᵢ = β₀ + β₁*sqftᵢ + β₂*sqftᵢ² − β₃*age_yearsᵢ + eᵢ_

(2) _Priceᵢ = β₀ + β₁*sqftᵢ + β₂*sqftᵢ² − β₃*age_yearsᵢ − β₄*age_monthsᵢ + eᵢ_

(3) _Priceᵢ = β₀ + β₁*sqftᵢ − β₂*age_yearsᵢ + eᵢ_

(4) _Priceᵢ = β₀ − β₁*age_yearsᵢ + eᵢ_

The researchers were smart and nailed the true model (Model 1), but the other models (Models 2, 3, and 4) violate certain OLS assumptions: Model 2 includes the age of the house twice, in years and in months, which introduces multicollinearity, while Models 3 and 4 leave out terms that belong in the true model. Lastly, let’s say that there were 10K researchers who conducted the same study. Each took 50 independent observations from the population of houses and fit the above models to their data. By examining the results across these 10K studies, we can see how the different models behave.
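To make the setup concrete, here is a minimal sketch of how such a simulation could be written in Python with NumPy. Everything numeric in it is my own assumption for illustration, since none of it is specified above: the coefficient values in `BETA`, the error standard deviation `SIGMA`, the ranges for size and age, and the definition of *age_months* as exactly 12 times *age_years*. It also defaults to 1,000 replications rather than 10K to keep the run short, and it uses `np.linalg.lstsq`, which returns the minimum-norm solution when the design matrix is rank deficient, so other software may handle Model 2 differently.

```python
import numpy as np

# Hypothetical "true" coefficients and error SD (assumptions, not from the article).
BETA = {"b0": 50.0, "b1": 100.0, "b2": 20.0, "b3": 1.5}
SIGMA = 20.0

# Column index of age_years in each model's design matrix.
AGE_COL = {1: 3, 2: 3, 3: 2, 4: 1}


def simulate_houses(n, rng):
    """Draw n houses from the assumed true model."""
    sqft = rng.uniform(0.5, 4.0, size=n)        # thousands of square feet (assumed range)
    age_years = rng.uniform(0.0, 60.0, size=n)  # assumed range
    price = (BETA["b0"] + BETA["b1"] * sqft + BETA["b2"] * sqft**2
             - BETA["b3"] * age_years + rng.normal(0.0, SIGMA, size=n))
    return sqft, age_years, price


def design_matrices(sqft, age_years):
    """Design matrices for Models 1-4, intercept column first."""
    one = np.ones_like(sqft)
    age_months = 12.0 * age_years  # assumed exact multiple: perfectly collinear
    return {
        1: np.column_stack([one, sqft, sqft**2, age_years]),
        2: np.column_stack([one, sqft, sqft**2, age_years, age_months]),
        3: np.column_stack([one, sqft, age_years]),
        4: np.column_stack([one, age_years]),
    }


def run_study(n_reps=1000, n_obs=50, seed=0):
    """Repeat the study n_reps times; collect age_years estimates and test RMSE."""
    rng = np.random.default_rng(seed)
    age_coef = {m: np.empty(n_reps) for m in AGE_COL}
    test_rmse = {m: np.empty(n_reps) for m in AGE_COL}
    for r in range(n_reps):
        sqft, age, price = simulate_houses(n_obs, rng)
        sqft_new, age_new, price_new = simulate_houses(n_obs, rng)  # fresh houses
        x_train = design_matrices(sqft, age)
        x_test = design_matrices(sqft_new, age_new)
        for m in AGE_COL:
            # Least-squares fit; explicit rcond so the collinear column is
            # treated as rank deficient rather than inverted numerically.
            coef, _, _, _ = np.linalg.lstsq(x_train[m], price, rcond=1e-10)
            age_coef[m][r] = coef[AGE_COL[m]]
            resid = price_new - x_test[m] @ coef
            test_rmse[m][r] = np.sqrt(np.mean(resid**2))
    return age_coef, test_rmse


if __name__ == "__main__":
    age_coef, test_rmse = run_study()
    for m in sorted(AGE_COL):
        print(f"Model {m}: mean age_years coef = {age_coef[m].mean():8.3f}, "
              f"sd = {age_coef[m].std(ddof=1):6.3f}, "
              f"mean test RMSE = {test_rmse[m].mean():7.2f}")
```

Because the models above write the age term with a minus sign, the fitted coefficient on *age_years* in this sketch estimates the negative of that β. Comparing the spread of the estimates speaks to inference, while the RMSE on freshly drawn houses speaks to prediction.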
