Since the rise of the Machine Learning and Artificial Intelligence fields, Probability Theory has been a powerful tool, allowing us to handle uncertainty in many applications, from classification to forecasting tasks. Today I would like to talk with you more about the use of probability and the Gaussian distribution in clustering problems, implementing the GMM model along the way. So let’s get started!

What is GMM?

GMM (or Gaussian Mixture Models) is an algorithm that uses density estimation to split the dataset into a predefined number of clusters. For better understandability, I will explain the theory and show the code for implementing it in parallel.
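To make the idea concrete, a Gaussian mixture models the density of the data as a weighted sum of k Gaussian components:

p(x) = π₁·N(x | μ₁, Σ₁) + π₂·N(x | μ₂, Σ₂) + … + πₖ·N(x | μₖ, Σₖ)

where every N(x | μ, Σ) is a multivariate normal distribution, and the mixing weights π sum to 1. Fitting the model means estimating all the means, covariance matrices and weights from the data.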

For this implementation, I will use the EM (Expectation-Maximization) algorithm.

Theory and code together are the best combination.

First, let’s import all the needed libraries:

import numpy as np
import pandas as pd

I highly recommend following the standards of the scikit-learn library when implementing a model on your own. That’s why we will implement GMM as a class. Let’s also write the __init__ function.

class GMM:
    def __init__(self, n_components, max_iter=100, comp_names=None):
        self.n_components = n_components
        self.max_iter = max_iter
        if comp_names is None:
            self.comp_names = [f"comp{index}" for index in range(self.n_components)]
        else:
            self.comp_names = comp_names
        ## The pi list contains the fraction of the dataset belonging to every cluster
        self.pi = [1/self.n_components for comp in range(self.n_components)]

Briefly, n_components is the number of clusters into which we want to split our data, max_iter represents the number of iterations performed by the algorithm, and comp_names is a list of strings with n_components elements, interpreted as the names of the clusters.
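Just to see the constructor in action, here is a quick usage example:

gmm = GMM(n_components=3)
print(gmm.comp_names)  ## ['comp0', 'comp1', 'comp2']
print(gmm.pi)          ## [0.333..., 0.333..., 0.333...]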

The fit function.

So before we get to the EM algorithm, we must split our dataset. After that, we must initialize two lists. The first list contains the mean vectors (each element of the vector is the mean of a column) for every subset. The second list contains the covariance matrix of each subset.

def fit(self, X):
        ## Splitting the data into n_components sub-sets
        new_X = np.array_split(X, self.n_components)
        ## Initial computation of the mean vector and covariance matrix
        self.mean_vector = [np.mean(x, axis=0) for x in new_X]
        self.covariance_matrixes = [np.cov(x.T) for x in new_X]
        ## Deleting the new_X matrix because we will not need it anymore
        del new_X
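A note on this design choice: np.array_split simply cuts the data into n_components consecutive chunks, so these initial means and covariances are only a rough starting point that the EM iterations will refine. A quick illustration of what the split produces:

X = np.arange(12).reshape(6, 2)   ## 6 points with 2 features each
chunks = np.array_split(X, 3)     ## 3 chunks of 2 rows each
print([c.shape for c in chunks])  ## [(2, 2), (2, 2), (2, 2)]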

Now we can get to the EM algorithm.

The EM algorithm.

As the name says, the EM algorithm is divided into two steps: E (Expectation) and M (Maximization).
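Before wiring them into the class, here is a minimal standalone sketch of what one EM iteration computes. The helper names gaussian_density and em_step are my own, introduced only for illustration; the actual class methods may be organized differently:

def gaussian_density(x, mean, cov):
    ## Multivariate normal pdf evaluated at a single point x
    d = len(mean)
    diff = x - mean
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def em_step(X, pi, means, covs):
    ## One EM iteration; X is a NumPy array of shape (n, d)
    n, k = len(X), len(pi)
    ## E-step: r[i, c] is the probability that point i belongs to cluster c
    r = np.zeros((n, k))
    for i in range(n):
        for c in range(k):
            r[i, c] = pi[c] * gaussian_density(X[i], means[c], covs[c])
    r /= r.sum(axis=1, keepdims=True)
    ## M-step: re-estimate the parameters, weighting each point by r
    for c in range(k):
        n_c = r[:, c].sum()
        pi[c] = n_c / n
        means[c] = (r[:, c][:, None] * X).sum(axis=0) / n_c
        diff = X - means[c]
        covs[c] = (r[:, c][:, None] * diff).T @ diff / n_c
    return pi, means, covs

In short: the E-step computes the responsibilities (how strongly each cluster explains each point), and the M-step re-estimates pi, the mean vectors and the covariance matrices using those responsibilities as weights.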

