A Beginner’s Guide to Using Instrumental Variables

As an undergraduate I studied economics, which meant I studied a lot of regressions. It was basically 90% of the curriculum (when we weren’t discussing supply and demand curves, of course). The effect of corruption on sumo wrestling? _Regression._ The effect of minimum wage changes on a Wendy’s in NJ? _Regression._ Or maybe The Zombie Lawyer Apocalypse is more your speed (O.K., not a regression, but the title was cool).

Either way, my undergrad taught me three things: 1) supply-and-demand, 2) regressions are _life_, and 3) economists think they are gosh darn hilarious.

**But what if your regression fails you?** What if it isn’t predicting the thing it’s supposed to predict, because your X is all tied up with things you don’t have data for?

Well that, my friends, is when you might want to contemplate using an IV.

_An instrumental variable is a third variable, Z, used in regression analysis when you have endogenous variables: variables that are influenced by other variables in the model. In other words, you use it to account for unexpected behavior between variables. Using an instrumental variable to identify the hidden (unobserved) correlation allows you to see the true correlation between the explanatory variable and response variable, Y._ (Statistics How To)

Let’s break down some of this into pieces we can understand.

Part 1: The Linear Regression Equation

Let’s say you have two variables that you think are correlated, education and wages (X and Y). You would like to investigate whether education _leads_ to higher wages, i.e. X → Y. It makes sense enough. You write **_y = α + βx + ε_**, and, content with yourself, spend the rest of the night binging Game of Thrones.

Wait. Slow down. First let’s clarify some things.

**_α_ = the “starting point”.** Not all regressions are going to start at zero; α is the intercept, the value of Y the line predicts when X is zero. For example, if X is education and Y is wages, you’re probably not going to start at _zero_ education, because most people nowadays don’t drop out after the 2nd, 3rd, 4th, or even 9th grade (yay, progress!). You’re probably going to be looking at education from a high school diploma onward, one year at a time, and the intercept accounts for this.

**_β_ = the weight with which X affects Y.** For example, if 1 year of education was predicted to bring in $100 in additional salary, then the coefficient _β_ would be 100. _β_ specifies just how much an additional year of education gets you.

**ε = the error term.** Economists love using fancy jargon and Greek almost as much as they love snappy titles, so ε is just a fancy “e” for error. This absorbs anything that X couldn’t perfectly map to Y; very rarely are you going to get a perfectly straight line that maps one-to-one.
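
To make the three pieces concrete, here is a minimal sketch in Python that simulates wages from made-up values of α and β and then recovers them with ordinary least squares. Everything in it is an assumption for illustration: the data are synthetic, X is taken to be years of education beyond high school, and the numbers (a $30,000 starting point, $100 per extra year, the size of the noise) are invented, not estimates from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: years of education beyond high school (0 through 8)
n = 1_000
x = rng.integers(0, 9, size=n).astype(float)

# Assumed "true" values, chosen only for illustration
alpha_true = 30_000.0                 # predicted wage when x = 0
beta_true = 100.0                     # extra wage per additional year of education
eps = rng.normal(0.0, 50.0, size=n)   # error term: everything x does not explain

# The regression equation from the text: y = alpha + beta * x + eps
y = alpha_true + beta_true * x + eps

# Ordinary least squares: a column of ones estimates alpha, the x column estimates beta
X = np.column_stack([np.ones(n), x])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"alpha_hat = {alpha_hat:.1f}, beta_hat = {beta_hat:.1f}")
```

Because the error term here is pure noise, unrelated to X, the estimates should land close to the values we plugged in. That "unrelated to X" part is the assumption that matters for everything that follows.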

Now we’ve translated **_y = x_** to **_y = α + βx + ε_**. The remaining problem is a theoretical one: is it truly X leading to Y? Education leading to wages makes sense; but what if people who strive for higher education also earn higher wages because they are a more energetic, ambitious, and driven subset of the population? In that case ambition hides inside ε while also moving with X, and β ends up taking credit for both.
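
The worry about ambition can be simulated too. The sketch below is a toy example under loudly stated assumptions: `ambition` is a hypothetical unobserved trait, it raises both education and wages, and all of the coefficients are invented. The true effect of education is fixed at 100, and we compare a regression that cannot see ambition with one that can.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical unobserved trait; in real data we would not have this column
ambition = rng.normal(0.0, 1.0, size=n)

# Assumption: more ambitious people get more education...
x = 4.0 + 1.0 * ambition + rng.normal(0.0, 1.0, size=n)

# ...and also earn more for reasons that have nothing to do with schooling
beta_true = 100.0
y = 30_000.0 + beta_true * x + 200.0 * ambition + rng.normal(0.0, 50.0, size=n)

def ols(y, *columns):
    """Return OLS coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Regression that ignores ambition: ambition ends up inside the error term
beta_naive = ols(y, x)[1]

# Regression that controls for ambition (only possible because we simulated it)
beta_controlled = ols(y, x, ambition)[1]

print(f"true beta = {beta_true:.0f}")
print(f"naive beta = {beta_naive:.1f}")
print(f"beta controlling for ambition = {beta_controlled:.1f}")
```

Under these made-up numbers the naive slope lands well above the true 100, because education is soaking up the payoff to ambition, while the controlled slope comes back close to 100. In real life you usually cannot measure ambition, which is exactly the situation where an instrumental variable becomes interesting.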

towardsdatascience.com
