Introduction

No matter your exposure to data science and the world of statistics, it’s likely that at some point you’ve at least heard of regression. As a precursor to this quick lesson on multiple regression, you should have some familiarity with simple linear regression. If you don’t, you can start here! Otherwise, let’s dive in with multiple linear regression.

The distinction we draw between simple linear regression and multiple linear regression is the number of explanatory variables that help us understand our dependent variable: simple regression uses one, multiple regression uses two or more.

Multiple linear regression is an incredibly popular statistical technique and is foundational to many of the more complex methodologies data scientists use.

Multiple Linear Regression

In my post on simple linear regression, I gave the example of predicting home prices using a single numeric variable — square footage.

Let’s continue to build on some of what we’ve already done there. We’ll build that same model, only this time, we’ll include an additional variable.

# Fit a multiple regression with one numeric and one categorical predictor
fit <- lm(price ~ sqft_living + waterfront,
          data = housing)
summary(fit)

Similar to what you would’ve seen before, we’re predicting price by square feet of living space, only now we’re also including a waterfront variable. Take note of the data type of our new variable.
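Since the whole point here is the categorical predictor, it’s worth checking how waterfront is actually stored. Here’s a minimal sketch, assuming a housing data frame with these column names:

# Check how the new variable is stored
str(housing$waterfront)

# If it's a 0/1 numeric, convert it to a factor so lm() treats it as categorical
housing$waterfront <- factor(housing$waterfront,
                             levels = c(0, 1),
                             labels = c("no", "yes"))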

Parallel Slopes Model

We’ve just created what is known as a parallel slopes model. A parallel slopes model is the result of a multiple linear regression model that has both one numeric explanatory variable and one categorical explanatory variable.

The formula derived from linear regression is the equation of a line.

y = mx + b

  • y is our dependent variable
  • m is the coefficient assigned to our explanatory variable
  • x is the value of the explanatory variable
  • b is the y-intercept

With that resemblance to the equation of a line in mind: when we model home prices using square footage alone, we derive a coefficient on x and a y-intercept that best approximate price by minimizing error.
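To make that concrete, here’s a quick sketch of pulling those two numbers out of a simple model, again assuming the same hypothetical housing data:

# Simple linear regression on square footage alone
fit_simple <- lm(price ~ sqft_living, data = housing)

# Returns the y-intercept (b) and the sqft_living slope (m)
coef(fit_simple)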

The question we’re left with is: when we introduce a categorical variable alongside the numeric predictor in our regression formula, how is it handled and reflected in the model’s output?

If you’ve ever built a simple linear regression model using only a categorical explanatory variable, you may be familiar with the idea that group means across the levels of the categorical variable inform the coefficients assigned. You can read a more detailed explanation of that here.
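As a quick sketch of that idea: fit a model with only the categorical predictor, and the coefficients encode the group means directly (the dummy’s name assumes waterfront is a factor with baseline level “no”):

# Regression on the categorical variable alone
fit_cat <- lm(price ~ waterfront, data = housing)
coef(fit_cat)
# (Intercept)   = mean price for the baseline group ("no")
# waterfrontyes = difference between the two group means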

In a parallel slopes model, the inclusion of a categorical variable is reflected in changes to the y-intercept: each level of the categorical variable gets its own intercept, while the slope on the numeric variable is shared across levels.
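You can see this in the coefficients of the model we fit earlier (again, the dummy’s name assumes the factor labels used above):

coef(fit)
# (Intercept)   = y-intercept for non-waterfront homes
# sqft_living   = slope, shared by both groups
# waterfrontyes = amount added to the intercept for waterfront homes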

You may have asked yourself why these multiple regression models are called parallel slopes models. Because every level of the categorical variable shares the same slope and differs only in its intercept, the fitted line for each group runs parallel to the others.
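A quick plot makes the name obvious. Here’s a sketch using ggplot2, assuming the same fitted model and factor labels as above:

library(ggplot2)

# Pull the fitted coefficients out of the model
coefs <- coef(fit)

# Scatter the data, then draw one fitted line per waterfront group:
# same slope, different intercepts, so the lines are parallel
ggplot(housing, aes(x = sqft_living, y = price, color = waterfront)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = coefs["(Intercept)"],
              slope = coefs["sqft_living"]) +
  geom_abline(intercept = coefs["(Intercept)"] + coefs["waterfrontyes"],
              slope = coefs["sqft_living"])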
