This is the seventeenth article from my column Mathematical Statistics and Machine Learning for Life Sciences, where I try to explain some mysterious analytical techniques used in Bioinformatics and Computational Biology in a simple way. The Linear Mixed Model (LMM), also known as the Linear Mixed Effects Model, is one of the key techniques in traditional Frequentist statistics. Here I will attempt to derive the LMM solution from scratch from the Maximum Likelihood principle by optimizing the mean and variance parameters of the Fixed and Random Effects. However, before diving into the derivations, I will start slowly in this post with an introduction to when and how to technically run an LMM. I will cover examples of linear modelling from both the Frequentist and Bayesian frameworks.

Problem of Non-Independence in Data

Traditional Mathematical Statistics is based to a large extent on the assumptions of the Maximum Likelihood principle and the Normal distribution. In the case of, e.g., multiple linear regression, these assumptions might be violated if there is non-independence in the data. Provided that the data are expressed as a p-by-n matrix, where p is the number of variables and n is the number of observations, there can be two types of non-independence in the data:

  • non-independent variables / features (multicollinearity)
  • non-independent statistical observations (grouping of samples)

In both cases, the matrix that has to be inverted to solve the Linear Model becomes singular, or nearly so: its determinant is close to zero due to the correlated variables or observations, which makes the inversion numerically unstable. This problem is particularly pronounced when working with high-dimensional data (p >> n), where variables easily become redundant and correlated; this is known as the Curse of Dimensionality.
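
To make this concrete, below is a minimal simulated sketch in R (all variable names are my own placeholders) showing how a single nearly duplicated variable drives the determinant of the matrix behind the Ordinary Least Squares solution towards zero and inflates the estimated coefficients:

```r
# Minimal sketch: multicollinearity makes t(X) %*% X nearly singular,
# so the OLS solution beta = (X^T X)^(-1) X^T y becomes unstable.
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1e-4)     # x2 is almost an exact copy of x1
X  <- cbind(Intercept = 1, x1, x2) # design matrix
y  <- 2 * x1 + rnorm(n)            # only x1 truly matters

det(t(X) %*% X)                    # determinant close to zero
solve(t(X) %*% X, t(X) %*% y)      # huge, unstable estimates for x1 and x2
```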

[Image: The Curse of Dimensionality: the solution of the linear model diverges in the high-dimensional limit p >> n]

To overcome the problem of non-independent variables, one can, for example, select the most informative variables with LASSO, Ridge or Elastic Net regression, while the non-independence among statistical observations can be taken into account via Random Effects modelling within the Linear Mixed Model.

[Image: Ways to overcome non-independence in the data: LASSO and Random Effects modelling]
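
To give a feeling for the first strategy, here is a minimal LASSO sketch with the glmnet R package on simulated p >> n data (the data and variable names are placeholders of mine, not from a real study):

```r
# Minimal LASSO sketch with glmnet on simulated high-dimensional data.
library(glmnet)

set.seed(1)
n <- 50; p <- 500                    # many more variables than observations
X <- matrix(rnorm(n * p), nrow = n)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)  # only two truly informative variables

cv_fit <- cv.glmnet(X, y, alpha = 1) # alpha = 1 corresponds to the LASSO penalty
coef(cv_fit, s = "lambda.min")       # most coefficients are shrunk exactly to zero
```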

I covered a few variable selection methods, including LASSO, in my post Select Features for OMICs Integration. In the next section, we will see an example of longitudinal data where the grouping of data points should be addressed through Random Effects modelling.
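
As a small preview of the Frequentist framework, a Random Effects fit in R typically looks like the minimal sketch below, using the built-in longitudinal sleepstudy data set from the lme4 package (the formula with a random intercept and slope per subject is my choice for illustration):

```r
# Minimal Frequentist LMM sketch: reaction time over days of sleep
# deprivation, with repeated measurements grouped within each subject.
library(lme4)

fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)  # Fixed Effect of Days + Random intercept and slope per Subject
```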

**LMM and Random Effects modelling** are widely used in various types of data analysis in the Life Sciences. One example is the GCTA tool, which contributed a lot to the research on the long-standing problem of Missing Heritability. The idea of GCTA is to fit the genetic variants with small effects all together as a Random Effect within the LMM framework, as sketched below. Thanks to the GCTA model, the problem of Missing Heritability seems to be solved, at least for Human Height.
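
In my own notation (a sketch of the underlying GREML model, not copied verbatim from the GCTA paper), the idea can be written as:

```latex
% Sketch of the GREML model behind GCTA (notation is mine):
% y: phenotypes, X*beta: Fixed Effects (covariates),
% g: joint Random Effect of all genotyped variants,
% A: genetic relationship matrix estimated from the SNPs.
y = X\beta + g + \epsilon, \qquad
g \sim N(0, A\sigma_g^2), \qquad
\epsilon \sim N(0, I\sigma_e^2)

% SNP-based heritability is then the genetic fraction of the variance:
h^2_{SNP} = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}
```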

Another popular example from Computational Biology is Differential Gene Expression analysis with the DESeq / **DESeq2** R packages, which do not actually run an LMM but perform a variance stabilization/shrinkage that is one of the essential ingredients of LMM. The advantage of this approach is that lowly expressed genes can borrow information from the highly expressed genes, which allows for their more stable and robust testing.
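
For illustration, a minimal DESeq2 sketch could look as follows (the count matrix `counts`, with genes in rows and samples in columns, and the sample table `coldata` with a `condition` column are hypothetical placeholders); the dispersion shrinkage step inside `DESeq()` is what lets genes borrow information from each other:

```r
# Minimal DESeq2 sketch; `counts` and `coldata` are hypothetical placeholders.
library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)      # fits per-gene models with shrinkage of dispersions
plotDispEsts(dds)      # visualizes how per-gene dispersions are shrunk
res <- results(dds)    # differential expression statistics per gene
```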

Finally, LMM is one of the most popular analytical techniques in Evolutionary Science and Ecology, where the state-of-the-art MCMCglmm package is used for estimating, e.g., trait heritability.
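
As a minimal Bayesian sketch (the phenotype table `phen`, with an `animal` column linking individuals to a pedigree `ped`, is a hypothetical placeholder), a classical "animal model" for heritability with MCMCglmm might look like this:

```r
# Minimal Bayesian animal-model sketch with MCMCglmm;
# `phen` (phenotypes) and `ped` (pedigree) are hypothetical placeholders.
library(MCMCglmm)

# Weakly informative inverse-Wishart priors on the variance components
prior <- list(R = list(V = 1, nu = 0.002),
              G = list(G1 = list(V = 1, nu = 0.002)))

model <- MCMCglmm(phenotype ~ 1,        # Fixed Effects: intercept only
                  random   = ~ animal,  # Random additive genetic effect
                  pedigree = ped,
                  data     = phen,
                  prior    = prior,
                  nitt = 13000, burnin = 3000, thin = 10)

# Heritability = posterior fraction of variance explained by genetics
h2 <- model$VCV[, "animal"] / (model$VCV[, "animal"] + model$VCV[, "units"])
posterior.mode(h2)
```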
