
GaussianMixtures.jl: Large scale Gaussian Mixture Models

GaussianMixtures

A Julia package for Gaussian Mixture Models (GMMs).

This package contains support for Gaussian Mixture Models. Basic training, likelihood calculation, model adaptation, and I/O are implemented.

This Julia type is more specific than Dahua Lin's MixtureModels, in that it deals only with normal (multivariate) distributions (a.k.a. Gaussians), but it does so more efficiently, hopefully. We have support for switching between GMM and MixtureModels types.

At this moment, we have implemented both diagonal covariance and full covariance GMMs, and full covariance variational Bayes GMMs.

In training the parameters of a GMM using the Expectation Maximization (EM) algorithm, the inner loop (computing the Baum-Welch statistics) can be executed efficiently using Julia's standard parallelization infrastructure, e.g., by using SGE. We further support very large data (larger than will fit in the combined memory of the computing cluster) through BigData, which has now been incorporated in this package.

Install

using Pkg
Pkg.add("GaussianMixtures")

Vector dimensions

Some remarks on the dimension. There are three main indexing variables:

  • The Gaussian index
  • The data point
  • The feature dimension (for full covariance this becomes two dimensions)

Often data is stored in 2D slices, and computations can be done efficiently as matrix multiplications. For this it is nice to have the data in standard row, column order. However, we can't have this consistently across all three indices.

My approach is to have:

  • The data point index (i) is always the first (row) index
  • The feature dimension index (k) is always the second (column) index
  • The Gaussian index (j) is mixed, depending on how it is combined with either dimension above.

The consequence is that "data points run down" in a matrix, just like records do in a DataFrame. Hence, statistics per feature dimension occur consecutively in memory, which may be advantageous for caching efficiency. On the other hand, features belonging to the same data point are separated in memory, which probably does not match the order in which they are generated, and does not lend itself to a streaming implementation. The choice in which direction the data must run is an almost philosophical problem that I haven't come to a final conclusion about.

Please note that the choice for "data points run down" is the opposite of the convention used in Distributions.jl, so if you convert GMMs to MixtureModels in order to benefit from the methods provided for these distributions, you need to transpose the data.
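As a concrete illustration (the variable names here are just placeholders), a data set of 1000 three-dimensional points, and its transposed layout for use with Distributions.jl, could look like:

x = randn(1000, 3)      ## 1000 data points run down the rows, 3 feature columns
xt = permutedims(x)     ## 3 x 1000: the "sideways" layout used by Distributions.jl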

Type

A simplified version of the type definition for the Gaussian Mixture Model is

mutable struct GMM
    n::Int                           # number of Gaussians
    d::Int                           # dimension of Gaussian
    w::Vector                        # weights: n
    μ::Matrix                        # means: n x d
    Σ::Union{Matrix, Vector{Matrix}} # diagonal covariances n x d, or a Vector of n d x d full covariances
    hist::Vector{History}            # history of this GMM
end

Currently, the variable Σ is heterogeneous both in form and interpretation, depending on whether it represents full covariance or diagonal covariance matrices.

  • full covariance: Σ is represented by a Vector of chol(inv(Σ), :U)
  • diagonal covariance: Σ is formed by vertically stacking row-vectors of diag(Σ)
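As a minimal sketch (not part of the package API), assuming a full-covariance gmm.Σ holds upper-triangular factors U with inv(Σ) = U'U, the covariance of Gaussian j could be recovered as:

using LinearAlgebra
## hypothetical helper: recover the covariance of Gaussian j from the stored factor U,
## where U = chol(inv(Σ_j), :U), so Σ_j = inv(U'U)
covariance_of(gmm, j) = inv(gmm.Σ[j]' * gmm.Σ[j])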

Constructors

GMM(weights::Vector, μ::Array, Σ::Array, hist::Vector)

This is the basic outer constructor for GMM. Here we have

  • weights a Vector of length n with the weights of the mixtures
  • μ a matrix of n by d means of the Gaussians
  • Σ either a matrix of n by d variances of diagonal Gaussians, or a vector of n Triangular matrices of d by d, representing the Cholesky decomposition of the full covariance
  • hist a vector of History objects describing how the GMM was obtained. (The type History simply contains a time t and a comment string s)
GMM(x::Matrix; kind=:diag)
GMM(x::Vector)

Create a GMM with 1 mixture, i.e., a multivariate Gaussian, and initialize it with the mean and variance of the data in x. The data in x must be an nx x d Matrix, where nx is the number of data points, or a Vector of length nx.
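For example (a hedged sketch with made-up data; the names are placeholders):

x = randn(1000, 3)      ## 1000 data points of dimension 3
g1 = GMM(x)             ## a single multivariate (diagonal) Gaussian fitted to x
gv = GMM(randn(1000))   ## the univariate case, from a Vector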

GMM(n::Int, x::Matrix; method=:kmeans, kind=:diag, nInit=50, nIter=10, nFinal=nIter)

Create a GMM with n mixtures, given the training data x, using the Expectation Maximization algorithm. There are two ways of arriving at n Gaussians. method=:kmeans uses K-means clustering from the Clustering package to initialize n centers; nInit is the number of iterations for the K-means algorithm and nIter the number of EM iterations. method=:split initializes a single Gaussian with the data x and then repeatedly splits the Gaussians and retrains with EM until n Gaussians are obtained; n must be a power of 2 for this method. Here nIter is the number of iterations in the EM algorithm, and nFinal the number of iterations in the final step.
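A hedged sketch of both initialization methods, reusing the data matrix x from the example above:

gk = GMM(8, x; method=:kmeans, kind=:full, nInit=50, nIter=10)  ## K-means initialization
gs = GMM(8, x; method=:split, nIter=10, nFinal=20)              ## binary splitting; 8 is a power of 2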

GMM(n::Int, d::Int; kind=:diag)

Initialize a GMM with n multivariate Gaussians of dimension d. The means are all set to 0 (the origin) and the variances to I, which is silly by itself. If kind=:full is specified, the covariances are full rather than diagonal. One should replace the values of the weights, means and covariances afterwards.
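A minimal sketch of filling in the parameters by hand, assuming the field layout shown in the Type section:

ge = GMM(4, 3)            ## 4 diagonal Gaussians of dimension 3
ge.μ = randn(4, 3)        ## replace the zero means
ge.Σ = 2.0 .* ones(4, 3)  ## replace the unit variances
ge.w = fill(0.25, 4)      ## weights must sum to 1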

Training functions

em!(gmm::GMM, x::Matrix; nIter::Int = 10, varfloor=1e-3)

Update the parameters of the GMM using the Expectation Maximization (EM) algorithm nIter times, optimizing the log-likelihood given the data x. The function em!() returns a vector of average log likelihoods for each of the intermediate iterations of the GMM given the training data.
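For example (a hedged sketch, reusing the data x from the earlier examples):

g = GMM(8, x)                 ## initial model via K-means
avlls = em!(g, x; nIter=25)   ## 25 further EM iterations
avlls[end] - avlls[1]         ## typically ≥ 0: EM should not decrease the likelihood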

llpg(gmm::GMM, x::Matrix)

Returns ll_ij = log p(x_i | gauss_j), the Log Likelihood Per Gaussian j given data point x_i.

avll(gmm::GMM, x::Matrix)

Computes the average log likelihood of the GMM given all data points, further normalized by the feature dimension d = size(x,2). A 1-mixture GMM then has an avll of -(log(2π) + 1 + log(σ))/2 if the data x is distributed as a multivariate diagonal covariance Gaussian with Σ = σI. With σ=1 we then have avll ≈ -1.42.
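A quick numerical check of this value (a sketch with synthetic data):

x1 = randn(100_000, 5)   ## approximately N(0, I) data
g0 = GMM(x1)             ## single-Gaussian model of the data
avll(g0, x1)             ## ≈ -(log(2π) + 1)/2 ≈ -1.42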

gmmposterior(gmm::GMM, x::Matrix)

Returns a tuple containing p_ij = p(j | gmm, x_i), the posterior probability that data point x_i 'belongs' to Gaussian j, and the Log Likelihood Per Gaussian j given data point x_i as in llpg.

history(gmm::GMM)

Shows the history of the GMM, i.e., how it was initialized, split, how the parameters were trained, etc. A history item contains a time of completion and an event string.

You can examine a minimal example using GMM for clustering.
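For instance, a minimal clustering sketch along those lines (made-up two-blob data; hard assignments via gmmposterior):

using GaussianMixtures
y = vcat(randn(500, 2), randn(500, 2) .+ 5.0)    ## two well-separated blobs
gc = GMM(2, y; kind=:full)
p, _ = gmmposterior(gc, y)
labels = [argmax(p[i, :]) for i in 1:size(y, 1)] ## cluster index per data point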

Other functions

  • split(gmm; minweight=1e-5, covfactor=0.2): Doubles the number of Gaussians by splitting each Gaussian into two Gaussians. minweight is used for pruning Gaussians with too little weight; these are replaced by an extra split of the Gaussian with the highest weight. covfactor controls how far apart the means of a split Gaussian are positioned.
  • kind(gmm): returns :diag or :full, depending on the type of covariance matrix
  • eltype(gmm): returns the datatype of w, μ and Σ in the GMM
  • weights(gmm): returns the weights vector w
  • means(gmm): returns the means μ as an n by d matrix
  • covars(gmm): returns the covariances Σ
  • copy(gmm): returns a deep copy of the GMM
  • full(gmm): returns a full covariance GMM based on gmm
  • diag(gmm): returns a diagonal GMM, by ignoring off-diagonal elements in gmm
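A short sketch combining a few of these functions (assuming the GMM g and data x from the earlier examples):

kind(g)          ## :diag or :full
gf = full(g)     ## promote to full covariance
g2 = split(gf)   ## twice as many Gaussians
size(means(g2))  ## (2 * g.n, g.d)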

Converting the GMMs

The element type in the GMM can be changed as you would expect. We also have import and export functions for MixtureModel, which currently only supports the Float64 element type.

  • convert(::Type{GMM{datatype}}, gmm): convert the data type of the GMM
  • float16(gmm), float32(gmm), float64(gmm): convenience functions for convert()
  • MixtureModel(gmm): construct an instance of type MixtureModel from the GMM. Please note that for functions like pdf(m::MixtureModel, x::Matrix) the data x run "sideways" rather than "down" as in this package.
  • GMM(m::MixtureModel{Multivariate,Continuous,MvNormal}): construct a GMM from the right kind of MixtureModel.
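For example (a hedged sketch; g is the GMM from the earlier examples):

g32 = float32(g)                  ## element type Float32
eltype(g32)                       ## Float32
g64 = convert(GMM{Float64}, g32)  ## back to Float64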

Parallelization

Training a large GMM with huge quantities of data can take a significant amount of time. We have built-in support for the parallelization infrastructure in Julia.

The method stats(), which is at the heart of the EM algorithm, detects whether multiple processors are available (through nprocs()). If there is more than one processor available, the data is split into chunks, each chunk is mapped to a separate processor, and afterwards all the statistics from the sub-processes are aggregated. In an SGE environment you can obtain more cores (in the example below, 20) by issuing:

using ClusterManagers
ClusterManagers.addprocs_sge(20)                                        
@everywhere using GaussianMixtures

The size of the GMM and the data determine whether or not it is advantageous to do this.
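On a single multi-core machine (without SGE), a minimal sketch is to add local workers instead:

using Distributed
addprocs(4)                        ## 4 local worker processes
@everywhere using GaussianMixtures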

Memory

The stats() method (see below) needs to be very efficient because for many algorithms it is in the inner loop of the calculation. We have a highly optimized, BLAS-friendly and parallelizable implementation, but this requires a fair bit of memory. Therefore the input data is processed in blocks in such a way that only a limited amount of memory is used. By default this is set at 2GB, but it can be specified through a global setting:

setmem(gig) 

Set the memory approximately used in stats(), in Gigabytes.
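For example, to allow stats() to use roughly 4 GB:

setmem(4.0)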

Baum-Welch statistics

At the heart of EM training, and of many other operations with GMMs, lies the computation of the Baum-Welch statistics of the data when aligning them to the GMM. We have optimized implementations of the basic calculation of the statistics:

  • stats(gmm::GMM, x::Matrix; order=2, parallel=true) Computes the Baum-Welch statistics up to order order for the alignment of the data x to the Universal Background GMM gmm. The 1st and 2nd order statistics are returned as an n x d matrix, so for obtaining statistics in supervector format, flattening needs to be carried out in the right direction. These statistics are uncentered.
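A hedged usage sketch, assuming the statistics come back as a tuple of zeroth, first and second order matrices (N, F, S), and reusing g and x from the earlier examples:

N, F, S = stats(g, x; order=2)   ## assumed return layout: N of length n, F and S of size n x d
fsv = vec(F')                    ## supervector with the d features of each Gaussian contiguous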

Random GMMs

Sometimes it is interesting to generate random GMMs, and to use these to generate random points.

g = rand(GMM, n, d; kind=:full, sep=2.0)

This generates a GMM with normally distributed means according to N(x|μ=0, Σ=sep*I). The covariance matrices are also chosen randomly.

rand(g::GMM, n)

Generate n data points sampled from the GMM, resulting in an n × g.d array.
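For example:

gr = rand(GMM, 8, 2; kind=:full, sep=3.0)   ## a random 8-component GMM in 2 dimensions
z = rand(gr, 1000)                          ## 1000 samples, a 1000 x 2 matrix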

Speaker recognition methods

The following methods are used in speaker and language recognition; they may eventually move to another module.

csstats(gmm::GMM, x::Matrix, order=2)

Computes centered and scaled statistics. These are similar to the statistics above, but centered w.r.t. the UBM mean and scaled by the covariance.

CSstats(gmm::GMM, x::Matrix)

This constructor returns a CSstats object for centered stats of order 1. The type is currently defined as:

mutable struct CSstats
    n::Vector           # zero-order stats, ng
    f::Matrix           # first-order stats, ng x d
end

The CSstats type can be used for MAP adaptation and a simple but elegant dotscoring speaker recognition system.

dotscore(x::CSstats, y::CSstats, r::Float64=1.) 

Computes the dot-scoring approximation to the GMM/UBM log likelihood ratio for a GMM MAP adapted from the UBM (means only) using the data from x and a relevance factor of r, and test data from y.

maxapost(gmm::GMM, x::Matrix, r=16.; means::Bool=true, weights::Bool=false, covars::Bool=false)

Perform Maximum A Posteriori (MAP) adaptation of the UBM gmm to the data from x using relevance r. means, weights and covars indicate which parts of the UBM need to be updated.
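A hedged end-to-end sketch of these functions with made-up features (a real system would of course train the UBM on proper background data):

using GaussianMixtures
ubm = GMM(32, randn(10_000, 20))   ## stand-in for a properly trained UBM
enrol = randn(2_000, 20)           ## hypothetical enrolment features
test = randn(1_000, 20)            ## hypothetical test features
spk = maxapost(ubm, enrol, 16.)    ## MAP adaptation of the means to the enrolment data
score = dotscore(CSstats(ubm, enrol), CSstats(ubm, test))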

Saving / loading a GMM

Using the JLD package, two methods allow saving a GMM or an array of GMMs to disk:

using JLD
save(filename::String, name::String, gmm::GMM)
save(filename::String, name::String, gmms::Array{GMM})

This saves a GMM or an array of GMMs under the name name in the file filename. The data can be loaded back into a Julia session using plain JLD's

gmm = load(filename, name)

When using kind=:full covariance matrices, make sure you have also loaded the LinearAlgebra module into the current session. Without it, JLD is unable to reconstruct the LinearAlgebra.UpperTriangular type and the gmm object won't be created:

using JLD
using GaussianMixtures
using LinearAlgebra
gmm = load(filename, name)

Support for large amounts of training data

In many of the functions defined above, a Data type is accepted in the place where the data matrix x is indicated. An object of type Data is basically a list of either matrices or filenames, see BigData.

If kind(x::Data)==:file, then the matrix x is represented by vertically stacking the matrices that can be obtained by loading the listed files from disc. The functions in GaussianMixtures try to run computations in parallel by processing these files simultaneously on multiple cores/machines, and they further try to limit the number of times the data needs to be loaded from disc. In parallel execution on a computer cluster, an attempt is made to ensure the same data is always processed by the same worker, so that local file caching can work to your advantage.

Variational Bayes training

We have started support for Variational Bayes GMMs. Here, the parameters of the GMM are not point estimates but are rather represented by a distribution. Training of the parameters that govern these distributions can be carried out by an EM-like algorithm.

In our implementation, we follow the approach from section 10.2 of Christopher Bishop's book. In the current version of GaussianMixtures we have not attempted to optimize the code for efficiency.

The variational Bayes training uses two new types, a prior and a variational GMM:

mutable struct GMMprior
    α0                       # effective prior number of observations
    β0
    m0::Vector               # prior on μ
    ν0                       # scale precision
    W0::Matrix               # prior precision
end

mutable struct VGMM
    n::Int                   # number of Gaussians
    d::Int                   # dimension of Gaussian
    π::GMMprior              # the prior used in this VGMM
    α::Vector                # Dirichlet, n
    β::Vector                # scale of precision, n
    m::Matrix                # means of means, n x d
    ν::Vector                # no. degrees of freedom, n
    W::Vector{Matrix}        # scale matrix for precision? n x (d x d)
    hist::Vector{History}    # history
end

Please note that we currently only have full covariance VGMMs.
Training using observed data x needs some initial GMM and a prior.

g = GMM(8, x, kind=:full, nIter=0) ## only do k-means initialization of GMM g
p = GMMprior(g.d, 0.1, 1.0)  ## set α0=0.1 and β0=1, and other values to a default
v = VGMM(g, p) ## initialize variational GMM v with g

Training can then proceed with

em!(v, x)

This EM training checks whether the occupancy of each Gaussian is still nonzero after each M-step. If it isn't, that Gaussian is removed, so the total number of Gaussians can decrease during this procedure.
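A small sketch of observing this, assuming the VGMM fields shown above:

em!(v, x)   ## as above
v.n         ## may now be smaller than the 8 Gaussians we started with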

Working with Distributions.jl

A GMM can be used to build a MixtureModel in the Distributions.jl package. For example:

using GaussianMixtures
using Distributions
g = rand(GMM, 3, 4)
m = MixtureModel(g)

This can be conveniently used for sampling from the GMM, e.g.

sample = rand(m)

Furthermore, a Gaussian mixture model constructed using MixtureModel can be converted to GMM via a constructor call

gg = GMM(m)

Author: Davidavdav
Source Code: https://github.com/davidavdav/GaussianMixtures.jl 
License: View license

#julia #models #statistics 

