Gaussian Mixture Model, K-means algorithm, and how to solve them.

Introduction

A mixture model is a weighted combination of probability distributions. Mixture models are a powerful and well-understood tool for various problems in artificial intelligence, computer vision, and statistics. In this post, we will examine Gaussian mixture models and algorithms for solving them.

Let’s first introduce the Gaussian mixture model. Let (\mu_j, \sigma_j) be a collection of Gaussian distributions, with associated weights \pi_j \in [0,1] satisfying \sum_j \pi_j = 1. Let (x_i)_{i=1}^n be a set of observations generated by these distributions:

x_i \sim \sum_j \pi_j\, G(\mu_j, \sigma_j)

Figure 1. Mixture of two Gaussian distributions, with \pi = (0.25, 0.75), \mu = (0, 5), \sigma=(1,3).

We call this model a Mixture of Gaussians. The parameter \pi is called the mixing proportion of the Gaussians.

To solve the Gaussian mixture model, we are given a set of observed points X=(x_i)_{i=1}^n, and we want to find the model parameters (\pi, \mu, \sigma) that maximize the likelihood P(X | \pi,\mu,\sigma).

Figure 2. Histogram of 10,000 points generated by the above Gaussian mixture model. We want to find the model parameters that maximize the likelihood of generating those points.
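
To make the generative model concrete, here is a minimal NumPy sketch of sampling from such a mixture. The function name sample_mixture is my own; the parameter values are taken from Figures 1 and 2.

```python
import numpy as np

def sample_mixture(n, pi, mu, sigma, rng=None):
    """Draw n points from a 1-D Gaussian mixture with weights pi,
    means mu, and standard deviations sigma."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick a component j for each point with probability pi[j] ...
    components = rng.choice(len(pi), size=n, p=pi)
    # ... then draw from the chosen Gaussian G(mu_j, sigma_j).
    return rng.normal(np.asarray(mu)[components], np.asarray(sigma)[components])

# Parameters from Figure 1, 10,000 points as in Figure 2.
x = sample_mixture(10_000, pi=[0.25, 0.75], mu=[0, 5], sigma=[1, 3])
```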

K-means clustering

K-means clustering solves a different problem, but it gives insight into how we can solve the mixture-of-Gaussians problem. In K-means clustering, we are given a set of observed points (x_i)_{i=1}^{n} and a positive integer k. We want to partition the points into k clusters (S_j)_{j=1}^{k} so as to minimize the within-cluster sum of squares (WCSS):

\underset{S}{argmin}\sum_{j=1}^{k} \sum_{x \in S_j} || x - \mu_j ||^2

where \mu_j = \frac{\sum_{x \in S_j} x}{\#S_j} is the mean of the cluster S_j.
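
As a quick illustration, the WCSS objective can be computed in a few lines of NumPy. The array names points, labels, and centres are assumptions of this sketch, not from the original formulation.

```python
import numpy as np

def wcss(points, labels, centres):
    """Within-cluster sum of squares: sum over clusters j of
    sum over x in S_j of ||x - mu_j||^2."""
    diffs = points - centres[labels]   # x - mu_j, using each point's cluster label
    return float(np.sum(diffs ** 2))
```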

Figure 3. Example of a K-means clustering in 2 dimensions, K = 2. The observed points are clustered into the two clusters. The X mark denotes the centre of the cluster.

Figure 4. Another example of a K-means clustering in 2 dimensions, K=4.

Expectation Maximization (EM)

Expectation Maximization is an iterative algorithm for solving maximum likelihood problems. Given a set of observed data X, generated by a probability model with parameter \theta, we want to find the parameter \theta that maximizes the likelihood P(X|\theta). In K-means clustering,

  • X – observed data (x_i)_{i=1}^{n}
  • \theta – the clustering of the observed data points (S_j)_{j=1}^{k}, and their associated centres \mu_j.

The EM algorithm starts with an initial hypothesis for the parameter \theta, and at each step it calculates the likelihood P(X|\theta) and re-calculates the hypothesis \theta. The details differ between EM algorithms, but the approximate steps are as follows (a sketch of the loop follows the list).

  1. Start with an initial hypothesis 𝜃.
  2. (Expectation step) Calculate the likelihood P(X | \theta).
  3. (Maximization step) Categorize the observed data according to the probability, and update the hypothesis accordingly.
  4. Repeat this process until the hypothesis converges.
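The skeleton below is a minimal sketch of that loop in Python. The callables e_step and m_step are hypothetical placeholders supplied by a specific algorithm (Hard, Soft, or Gaussian K-means), \theta is assumed to be an array such as the cluster centres, and the convergence test is one reasonable choice, not the only one.

```python
import numpy as np

def em(x, theta0, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop: alternate the E-step and M-step (steps 2-3 above)
    until the hypothesis theta stops changing (step 4)."""
    theta = theta0
    for _ in range(max_iter):
        resp = e_step(x, theta)        # Expectation step
        new_theta = m_step(x, resp)    # Maximization step
        if np.allclose(new_theta, theta, atol=tol):   # convergence check
            return new_theta
        theta = new_theta
    return theta
```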

Algorithm – Hard K-means

We assume that P(x_i|S_j), the likelihood of point x_i belonging to the cluster centred at \mu_j, is 1 if and only if \mu_j is the closest cluster centre:

P(x_i|S_j) = 1\ if\ j = \underset{k}{argmin} || x_i - \mu_k ||, 0\ otherwise

During the E-step, the distances || x_i - \mu_j || are calculated to determine P(x_i|S_j). During the M-step, the clusters S_j are determined by assigning each x_i to the cluster closest to it. Finally, each \mu_j is re-calculated by taking the mean of the points in S_j.

\mu_j = \displaystyle\sum\limits_{x \in S_j}\frac{x}{\#S_j}

Figure 5. Example of a K-means algorithm, for K=2. Initial cluster centres are randomly chosen from the observed points. Each iteration, the observed points are categorized to the nearest cluster centres, and the cluster centres are re-calculated. This process is repeated until the cluster centres converge.
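
The following is a minimal NumPy sketch of the Hard K-means iteration described above. The function and variable names are my own, the initial centres are chosen randomly from the observed points as in Figure 5, and the sketch assumes no cluster ever becomes empty.

```python
import numpy as np

def hard_kmeans(x, k, n_iter=100, rng=None):
    """Hard K-means: assign each point to its nearest centre (E-step),
    then move each centre to the mean of its points (M-step)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = x[rng.choice(len(x), size=k, replace=False)]   # initial centres from the data
    for _ in range(n_iter):
        # E-step: distances ||x_i - mu_j||, then nearest-centre assignment.
        dists = np.linalg.norm(x[:, None, :] - mu[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centre as the mean of its cluster
        # (assumes every cluster keeps at least one point).
        new_mu = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_mu, mu):   # stop once the centres converge
            break
        mu = new_mu
    return mu, labels
```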

Algorithm – Soft K-means

In Hard K-means, an observed point can only belong to a single cluster. However, it may be useful to consider the probability P(x_i|S_j) during the computation of \mu. This is especially true for points near the boundary between two clusters.

We want to calculate \mu_j as the weighted average of the x_i, weighted by P(x_i|S_j).

We assume that the likelihood P(x_i|S_j) decays exponentially with distance, with a stiffness factor \beta > 0.

P(x_i|S_j) \sim e^{-\beta || x_i - \mu_j ||}

This way, P(x_i|S_j) is still monotonically decreasing with respect to the distance from the centre, but the probability is merely penalized at larger distances rather than set to zero.

As in the Hard K-means algorithm, the E-step computes this value for every point and cluster; P(x_i|S_j) is then obtained by normalizing it over the clusters:

P(x_i|S_j) = \frac{e^{-\beta||x_i-\mu_j||}}{\sum_l e^{-\beta || x_i - \mu_l ||}}

During the M-step, \mu_j is obtained as the weighted average of all x_i:

\mu_j = \frac{\sum_i P(x_i|S_j) x_i}{\sum_i P(x_i|S_j)}

One thing to note is that the Hard K-means algorithm is equivalent to the Soft K-means algorithm in the limit \beta \rightarrow \infty.
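
Here is a minimal sketch of one Soft K-means iteration under the exponential responsibility model above. The stiffness parameter beta and the helper name soft_kmeans_step are assumptions of the sketch; iterating this step from random initial centres gives the Soft K-means algorithm, and large beta recovers the hard assignments.

```python
import numpy as np

def soft_kmeans_step(x, mu, beta):
    """One Soft K-means iteration: exponential responsibilities
    normalized over the clusters, then weighted-mean centre update."""
    dists = np.linalg.norm(x[:, None, :] - mu[None, :, :], axis=-1)   # ||x_i - mu_j||
    resp = np.exp(-beta * dists)
    resp /= resp.sum(axis=1, keepdims=True)   # P(x_i|S_j), normalized over clusters j
    # mu_j = sum_i P(x_i|S_j) x_i / sum_i P(x_i|S_j)
    return resp.T @ x / resp.sum(axis=0)[:, None]
```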

Algorithm – Gaussian K-means

K-means algorithms are great, but they only reveal information about cluster membership. However, we can modify the EM algorithm to also recover the parameters (\pi, \mu, \sigma) of the underlying Gaussian mixture model.

We come back to the original assumption that the likelihood P(x_i|S_j) follows a Gaussian distribution:

P(x_i|S_j) \sim \pi_j G(||x_i - \mu_j||, \sigma_j)

where G(\bullet, \bullet) is the Gaussian probability density function. The actual probability is calculated as follows:

P(x_i|S_j) = \frac{\pi_j G(||x_i-\mu_j||, \sigma_j)}{\sum_k \pi_k G(||x_i-\mu_k||, \sigma_k)}

We have to re-calculate three parameters: \mu, \sigma, and \pi. Similar to the Soft K-means algorithm, \mu_j is the weighted average of all x_i, \pi_j is the normalized ratio of the sums \sum_i P(x_i|S_j), and \sigma_j is the weighted standard deviation.

\mu_j = \frac{\sum_i P(x_i|S_j) x_i}{R_j}

\sigma_j^2 = \frac{\sum_i P(x_i|S_j)|| x_i - \mu_j ||^2 }{I R_j}

\pi_j = \frac{R_j}{\sum_k R_k}

R_j = \sum_i P(x_i|S_j)

where I is the dimensionality of x.
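
Putting the E-step (the responsibilities) and the M-step (the \mu, \sigma, \pi updates above) together gives the sketch below, assuming spherical Gaussians as in the formulas above. The function name and the initialization choices are my own; the constant factor of the Gaussian density is dropped because it cancels in the normalization.

```python
import numpy as np

def gaussian_kmeans(x, k, n_iter=100, rng=None):
    """EM for a mixture of spherical Gaussians (Gaussian K-means)."""
    rng = np.random.default_rng() if rng is None else rng
    n, dim = x.shape                                  # dim is I, the dimensionality of x
    mu = x[rng.choice(n, size=k, replace=False)]      # initial centres from the data
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: P(x_i|S_j) proportional to pi_j G(||x_i - mu_j||, sigma_j).
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # ||x_i - mu_j||^2
        dens = pi * np.exp(-d2 / (2 * sigma ** 2)) / sigma ** dim
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: R_j, then the weighted updates for mu, sigma, pi.
        R = resp.sum(axis=0)
        mu = resp.T @ x / R[:, None]
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        sigma = np.sqrt((resp * d2).sum(axis=0) / (dim * R))
        pi = R / R.sum()
    return pi, mu, sigma
```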

Figure 6. Points sampled from a mixture of two Gaussians. The top-left plot shows the data points generated by the model parameters \pi=(\frac{1}{2},\frac{1}{2}), \sigma_x = \sigma_y = (0.3, 1), \mu_x = (-2, 0), \mu_y = (0, 0). The other plots show the results of Hard K-means, Soft K-means, and Gaussian K-means.

Figure 7.  Another example of Gaussian K-means algorithm, with 4 clusters.

