An introduction to Generalized Estimating Equations

A key assumption underpinning generalized linear models (of which linear regression is one type) is that observations are independent. In longitudinal data this assumption does not hold: observations within an individual (across time points) are likely to be more similar than observations between individuals.

So, how do you deal with this? One option is to fit a generalized linear mixed model with a random intercept and a random slope for each individual. This tells you the effect of a variable on the outcome for a specific individual (i.e. conditional on that individual’s random intercept and slope). However, this isn’t very useful if you are interested in the marginal effect, i.e. the effect of a variable on the outcome on average in the population.
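To make the distinction concrete, here is a sketch in symbols. It assumes a binary outcome and a logit link purely for illustration, and the notation is introduced here rather than taken from a particular source: $i$ indexes individuals, $j$ indexes time points, $x_{ij}$ is the variable of interest, and $b_i = (b_{0i}, b_{1i})$ are the random intercept and slope.

$$
\text{conditional:}\quad \operatorname{logit} \Pr(Y_{ij} = 1 \mid b_i) = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, x_{ij}
$$

$$
\text{marginal:}\quad \operatorname{logit} \Pr(Y_{ij} = 1) = \beta_0^{*} + \beta_1^{*}\, x_{ij}
$$

With a non-identity link such as the logit, $\beta_1$ and $\beta_1^{*}$ are generally different numbers, which is why the choice between the two models depends on the question being asked.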

If you want to answer these population-level questions, one standard approach is to fit a generalized linear model using _generalized estimating equations_ (GEE). This approach estimates the population-average effect while accounting for the fact that observations within individuals are likely to be more similar than those between individuals.
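For readers who like to see the machinery, the estimating equations are usually written in roughly the following form. The symbols are notation assumed for this sketch: $y_i$ is the vector of outcomes for individual $i$, $\mu_i(\beta)$ is its mean under the marginal model, $A_i$ is a diagonal matrix of variance functions, $R(\alpha)$ is a "working" correlation matrix chosen by the analyst, and $\phi$ is a dispersion parameter.

$$
\sum_{i=1}^{N} D_i^{\top} V_i^{-1} \bigl(y_i - \mu_i(\beta)\bigr) = 0,
\qquad D_i = \frac{\partial \mu_i}{\partial \beta},
\qquad V_i = \phi\, A_i^{1/2} R(\alpha)\, A_i^{1/2}
$$

If $R(\alpha)$ is the identity matrix, these reduce to the ordinary score equations of a generalized linear model. The within-individual similarity is then handled through the choice of $R(\alpha)$ and through a robust ("sandwich") variance estimate.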

An example

Suppose our outcome is all-cause mortality, recorded every month for 10 months for every person. Suppose also that our exposure is simply time. We can now define a logistic regression model in which the sole independent variable is time (in months) and the dependent variable is death at that time. “Okay, great,” I hear you say, “but these observations are obviously not independent!” Spot on, but we’ll come to that.
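As a rough preview of where this is heading, the sketch below fits that kind of model in Python with NumPy only. Everything in it is an assumption made for illustration: the data are simulated (a generic binary outcome recorded monthly, with a person-level random intercept to induce correlation), the variable names are hypothetical, and the working correlation is taken to be independence, which is the simplest GEE variant. In practice you would normally reach for a library routine (for example the `GEE` class in statsmodels) rather than hand-rolling this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: 200 people, each observed monthly for 10 months.
n_people, n_months = 200, 10
person = np.repeat(np.arange(n_people), n_months)      # cluster id for each row
month = np.tile(np.arange(1, n_months + 1), n_people)  # time in months

# A person-level random intercept makes rows from the same person correlated.
b = rng.normal(0.0, 1.5, size=n_people)
p_true = 1.0 / (1.0 + np.exp(-(-2.0 + 0.15 * month + b[person])))
y = rng.binomial(1, p_true)

# Design matrix: intercept and time.
X = np.column_stack([np.ones_like(month, dtype=float), month])

# Step 1: solve the estimating equations under an independence working
# correlation. With a logit link this is ordinary logistic regression,
# fitted here by Newton-Raphson (IRLS).
beta = np.zeros(X.shape[1])
for _ in range(50):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    w = mu * (1.0 - mu)
    step = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - mu))
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

# Step 2: variance estimates at the fitted coefficients.
mu = 1.0 / (1.0 + np.exp(-X @ beta))
w = mu * (1.0 - mu)
bread = np.linalg.inv(X.T @ (X * w[:, None]))  # naive covariance (assumes independence)

# "Meat" of the sandwich: sum each person's score contributions, then take
# the sum over people of the outer products of those per-person totals.
scores = X * (y - mu)[:, None]
cluster_scores = np.zeros((n_people, X.shape[1]))
np.add.at(cluster_scores, person, scores)
meat = cluster_scores.T @ cluster_scores

robust_cov = bread @ meat @ bread
naive_se = np.sqrt(np.diag(bread))
robust_se = np.sqrt(np.diag(robust_cov))

print("coefficients (intercept, month):", beta)
print("naive SE: ", naive_se)
print("robust SE:", robust_se)
```

The point estimates are the same as those from a plain logistic regression; what changes is the standard errors, because the sandwich estimate groups the residuals by person instead of treating every row as independent. Other working correlation structures (exchangeable, autoregressive) change the weighting in step 1 as well, which this sketch does not attempt.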

