Causal inference for data scientists: a skeptical view

Introduction

The purpose of this post is to show why causal inference is hard, how it fails us, and why directed acyclic graphs (DAGs) don’t help.

Machine learning practitioners are concerned with prediction, rarely with explanation. This is a luxury: we can work even on problems for which no data generating process can possibly be written down. Nobody believes that an LSTM trained on Shakespeare’s plays works because it approximates Shakespeare. Yet it works.

In recent years, new tools for causal inference have become available to a broader audience. These tools promise to help non-experts not only predict, but also explain, patterns in data. Directed acyclic graphs and do-calculus are among the most influential.

People love shiny new tools, and there is danger in that. New tools arrive with a rush to adopt them, a feeling of being in the vanguard, and new opportunities. That often makes them unreliable: they get misused and substituted for better theory, better design, or better data.

In this post, I focus on directed acyclic graphs and the Python library DoWhy, because DAGs are especially popular in the machine learning community. The points I make apply equally to the potential outcomes framework or any other formal language for expressing causality.
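
To make the discussion concrete, here is a minimal sketch of the workflow DoWhy encourages: write your causal beliefs down as a DAG, let the library identify an estimand, and estimate it. The data, variable names, and toy graph below are invented for illustration; the calls follow DoWhy’s documented model/identify/estimate steps, though details may vary with the library version.

```python
# A minimal DoWhy sketch (illustrative only): toy data generated from the very
# DAG we then assume, so the "right" answer is baked in by construction.
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(0)
n = 5_000
age = rng.normal(50, 10, n)
running = (rng.random(n) < 1 / (1 + np.exp((age - 50) / 10))).astype(int)
life_expectancy = 75 - 0.2 * (age - 50) + 2 * running + rng.normal(0, 5, n)
df = pd.DataFrame({"age": age, "running": running, "life_expectancy": life_expectancy})

model = CausalModel(
    data=df,
    treatment="running",
    outcome="life_expectancy",
    # The graph *is* the causal assumption: age confounds running and longevity.
    graph="digraph { age -> running; age -> life_expectancy; running -> life_expectancy; }",
)
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)  # close to the simulated effect of 2
```

The estimate comes out right here only because the data were simulated from the very DAG we fed to the model; with real observational data, that graph is precisely the thing we cannot verify.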

Why causal inference is hard, in theory

Causal inference relies on causal assumptions: beliefs that let us move from statistical association to causation.

Randomized experiments are the gold standard for causal inference because treatment assignment is random and physically manipulated: one group gets the treatment, the other does not. The assumptions here are straightforward, secured by design, and easy to defend.
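
A toy simulation (all numbers invented) shows why those assumptions are so easy to defend: with coin-flip assignment, a plain difference in means recovers the true effect without modeling anything about who gets treated.

```python
# Toy simulation (invented numbers): under coin-flip assignment a plain
# difference in means recovers the true effect, no adjustment model required.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
health = rng.normal(0, 1, n)               # would confound an observational study
treated = rng.integers(0, 2, n)            # random assignment ignores health entirely
outcome = 2.0 * treated + 3.0 * health + rng.normal(0, 1, n)  # true effect = 2.0

diff_in_means = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print(round(diff_in_means, 2))  # ~2.0
```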

When there is no control over treatment assignment, as with observational data, researchers try to model it. Modeling here is equivalent to saying “we assume that after adjusting for age, gender, social status, and smoking, runners and non-runners are as similar to each other as if they had been randomly assigned to running.” Then one can regress life expectancy on running, declare that “running increases life expectancy by n%”, and call it a day.
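
In code, that modeling step usually boils down to an ordinary regression with the chosen controls on the right-hand side. Here is a hedged sketch with statsmodels, using made-up column names and simulated data.

```python
# The observational "adjust and regress" recipe in a few lines of statsmodels.
# Column names and data are made up; only the shape of the workflow matters.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2_000
df = pd.DataFrame({
    "age": rng.normal(50, 10, n),
    "gender": rng.integers(0, 2, n),
    "social_status": rng.normal(0, 1, n),
    "smoking": rng.integers(0, 2, n),
    "running": rng.integers(0, 2, n),
})
df["life_expectancy"] = 75 - 0.2 * (df["age"] - 50) + 2 * df["running"] + rng.normal(0, 5, n)

fit = smf.ols(
    "life_expectancy ~ running + age + gender + social_status + smoking",
    data=df,
).fit()
# The coefficient on 'running' is then reported as "the causal effect",
# which is valid only if adjusting for these controls really mimics randomization.
print(fit.params["running"])
```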

The logic behind this approach is clunky. It implicitly assumes that we know exactly why people start running or live long, and that the only missing piece is the very thing we are trying to estimate; a story that is not very believable and a bit circular. It also assumes that, by a happy coincidence, every part of our model has an available empirical proxy measured without error. Finally, since there is no principled way to check how well the chosen model approximates the real assignment mechanism, all of its assumptions can be debated for eternity.
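
A short simulation (again, with invented numbers) makes the circularity concrete: leave one relevant confounder out of the adjustment set and the “adjusted” estimate is biased, while nothing in the observed data signals that anything is wrong.

```python
# Toy simulation (invented numbers): an unmeasured confounder ("grit") biases
# the adjusted estimate, and nothing in the observed data flags the problem.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 50_000
age = rng.normal(0, 1, n)
grit = rng.normal(0, 1, n)                  # unobserved: drives both running and longevity
running = (grit - 0.5 * age + rng.normal(0, 1, n) > 0).astype(float)
life_exp = 2.0 * running + 1.0 * age + 3.0 * grit + rng.normal(0, 1, n)  # true effect = 2.0

X = sm.add_constant(np.column_stack([running, age]))  # adjust for age only; grit is missing
beta_running = sm.OLS(life_exp, X).fit().params[1]
print(round(beta_running, 2))  # well above 2.0, yet the model's own diagnostics look fine
```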

This brings us to a situation best summarized by Jasjeet Sekhon [1]:

“Without an experiment, natural experiment, a discontinuity, or some other strong design, no amount of econometric or statistical modeling can make the move from correlation to causation persuasive.”

Why causal inference is hard, in practice

The concerns raised above are better demonstrated with practical examples. Although there are plenty, I will stick to three: one each from economics, epidemiology, and political science.

In 1986, Robert LaLonde showed that econometric procedures fail to replicate experimental results. He used an experiment in which individuals were randomly assigned to a job training program. Randomization allowed him to estimate the program’s unbiased effect on earnings. He then asked: would we have gotten the same estimate without randomization? To mimic observational data, LaLonde constructed several non-experimental control groups. After comparing the estimates, he concluded that the econometric procedures failed to replicate the experimental results [2].

Epidemiology has the same problem. Consider the story of HDL cholesterol and heart disease. It was long believed that ‘good cholesterol’ protects against coronary heart disease; researchers even declared the observational studies robust to covariate adjustment. Yet several years later, randomized experiments demonstrated that HDL does not protect your heart. This situation is not unique, and many epidemiological findings are later overturned by randomized controlled trials [3].

Democracy-growth studies were a hot topic in political science back in the day. Researchers put GDP per capita or the like on the left side of their equations, put democracy on the right, and, to avoid looking unsophisticated, threw in a bunch of controls: life expectancy, educational attainment, population size, and foreign direct investment, among others. Of the 470 estimates presented in 81 papers published before 2006, 16% showed a statistically significant negative effect of democracy on growth, 20% were negative but insignificant, 38% were positive but still insignificant, and 38% (you will be really surprised here) were positive and statistically significant [4].

The pattern is clear: no matter how confident researchers are in their observational research, it does not necessarily bring them closer to the truth.
