“Correlation does not (necessarily) imply causation.” You have probably heard this famous sentence if you have taken an introductory inferential statistics or data science class. At the same time, as you carefully read academic and non-academic research, you realize that in almost all cases the first step in inferring a causal relationship between a predictor and an outcome is to establish that a correlation/association exists between them.

For example, based on evidence of fewer COVID-19 cases in warmer places than in colder places, some researchers suggest that coronavirus transmission slows as temperature increases. Essentially, researchers found a negative correlation between temperature and the number of COVID-19 cases, and postulated (with caution) a causal relationship. So, correlation can imply causation, huh!

If you build regression models for your academic or non-academic research, you probably also know that the coefficient of a predictor in a regression model expresses the correlation/association between the predictor and an outcome. In this article, we will try to understand under what circumstances the coefficient of a predictor can also (potentially) indicate the causal effect of the predictor on an outcome. Using an example with three scenarios, we are going to shed light on the correlation-causation conundrum.

Example

A restaurant business analyst is trying to investigate the causal relationship between sending discount cards to home addresses and the monthly count of customer visits to a restaurant.

Scenario 1: A Simple Linear Regression Model Using Non-Experimental Data

The analyst finds a dataset that includes data on whether restaurants sent discount cards and on the count of customer visits to restaurants during the month of X for all the restaurants in the city of Y.

Next, based on the data, the analyst builds the following regression model in which Discount Card Sent is a binary predictor (1=Yes, 0=No) and Monthly Count of Customer Visits is a continuous outcome:

Monthly Count of Customer Visits = 1951 + 674 * Discount Card Sent

Based on the above model, if a restaurant sends discount cards, the expected Monthly Count of Customer Visits = 1951+674*1 = 2625

And, if a restaurant does not send discount cards, the expected Monthly Count of Customer Visits = 1951+674*0 = 1951
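The two calculations above can be reproduced with a minimal Python sketch. The intercept and coefficient come from the model in the text; the function and constant names are hypothetical and chosen only for illustration.

```python
# Coefficients taken from the fitted model in the text.
INTERCEPT = 1951        # expected visits when no discount cards are sent
DISCOUNT_COEF = 674     # coefficient of the binary predictor Discount Card Sent


def expected_visits(discount_card_sent: int) -> int:
    """Expected Monthly Count of Customer Visits (1 = cards sent, 0 = not sent)."""
    return INTERCEPT + DISCOUNT_COEF * discount_card_sent


with_cards = expected_visits(1)
without_cards = expected_visits(0)
print(with_cards, without_cards, with_cards - without_cards)  # 2625 1951 674
```

Because the predictor is binary, the difference between the two expected values is exactly the coefficient.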

The analyst concludes: compared against the _restaurants that do not send discount cards_, the _restaurants that send discount cards_, on average, get 2625 - 1951 = 674 more customer visits per month.
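The analyst's reading of the coefficient as a difference in group averages can be checked on simulated data. The sketch below is not the analyst's dataset: the 200 restaurants, the noise level, and all names are assumptions made up for illustration. It fits a one-predictor least-squares line by hand and compares the slope with the gap between the two group means.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical data: a 0/1 flag per restaurant and made-up monthly visit counts.
sent = [random.randint(0, 1) for _ in range(200)]
visits = [random.gauss(1951 + 674 * s, 300) for s in sent]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_simple_ols(sent, visits)

# Average visits within each group.
mean_sent = mean(v for s, v in zip(sent, visits) if s == 1)
mean_not_sent = mean(v for s, v in zip(sent, visits) if s == 0)

print(f"slope = {slope:.2f}, difference in means = {mean_sent - mean_not_sent:.2f}")
print(f"intercept = {intercept:.2f}, mean of not-sent group = {mean_not_sent:.2f}")
```

With a single binary predictor, the fitted slope matches the difference between the two group means and the intercept matches the mean of the group coded 0. The coefficient therefore summarizes an observed difference between groups; it does not, on its own, say why the groups differ.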

On its face, sending discount cards to home addresses could induce more customer visits to a restaurant, as discount cards make food items more affordable. But based only on the results of this bivariate analysis (a linear regression model with only one predictor), can the analyst suggest that next month, restaurants that send discount cards will, on average, get more customer visits than restaurants that do not?

