library(ggplot2)
library(parallel)
library(viridis)
library(dplyr)
library(knitr)

1. The Palmer penguin dataset

Adelie Penguins

As a first illustrative example, let us consider the Palmer penguin dataset Horst, Hill, and Gorman (2020). The dataset contains several physical measurements collected on \(342\) penguins from three islands in the Palmer Archipelago, Antarctica..

library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

We focus on the measurement of bill length for \(342\) penguins.

library(palmerpenguins)

gg <- ggplot(penguins,aes(x=bill_length_mm,y = after_stat(density))) 
gg <- gg + geom_histogram(color = "white",fill="cyan4")
gg

2. EM algorithm for 2 univariate gaussian distributions

2.a Functions

For params a list of

  • params$p: \(p\), the proportion of cluster 1

  • params$mu \((\mu_1,\mu_2)\), the 2 means

  • params$var : \((\sigma^2_1, \sigma^2_2)\), the two variances

we write the following functions

  • logLikelihood = function(params,y)
  • EM_2Mixture = function(params0,y,tol=10^{-5},estim_only_mu=FALSE)
source('Functions_EM.R')
y <- penguins$bill_length_mm
y <- y[!is.na(y)]

2.b Run the EM

We now run the EM from various starting points.

KM <- kmeans(y, 2)

params0 = list(list(mu=c(40,50),var=c(5,5),p=0.5),
               list(mu=c(20,50),var=c(5,5),p=0.5),
               list(mu=c(35,70),var=c(5,5),p=0.6),
               list(mu=c(50,40),var=c(10,10),p=0.4))
params0[[5]] = list(mu=c(40,50),var=c(1,1),p=0.5)
params0[[6]] = list(mu=KM$centers,var=c(3,3),p=0.5)
res_all_run_all <-do.call(rbind,lapply(1:length(params0),function(run){
  res_EM_run  <- EM_2Mixture(params0[[run]],y,tol=10^{-7},estim_only_mu = FALSE)
  niter <- nrow(res_EM_run)
  res_EM_run <- cbind(res_EM_run,rep(run,niter))
  res_EM_run <- cbind(res_EM_run, c(1:nrow(res_EM_run)))
  names(res_EM_run)[7]='Init'
  names(res_EM_run)[8]='numIter'
  
  return(res_EM_run)}))

The following table provides 5 initial points \(\theta^{(0)}\) and the likelihood reached by the EM starting from theses points. The initial points 1,2,4,5, and 6 lead to the same value of the likelihood (\(p_{\widehat{\theta}}(y)=-1043.56\)) and to the same value of parameter \(\widehat{\theta}\) (not shown here). However, the third initial point does not allow the EM algorithm to reach the global maximum.

mu_1 mu_2 sigma_1_2 sigma_2_2 p ll
40.00000 50.0000 5 5 0.5 -1043.558
20.00000 50.0000 5 5 0.5 -1043.558
35.00000 70.0000 5 5 0.6 -1053.445
50.00000 40.0000 10 10 0.4 -1043.558
40.00000 50.0000 1 1 0.5 -1043.558
48.49432 39.0741 3 3 0.5 -1043.558

We plot the trajectories of the log-likelihood along the iterations of the algorithm starting from the initial points \(\theta^{(0)}\) reported in the table.

3. Problems of convergence

3.a Simulation of data

In order to illustrate a bit more the behaviour of the EM algorithm, we consider the special case where the parameters \((p,\sigma^2_{1}, \sigma^2_2)\) are known and fixed to their true value and we only want to estimate \((\mu_1,\mu_2)\). For simplicity, for this particular experiment, we simulated data \(y\) with parameters \[p=0.36 \quad \sigma^2_2 = \sigma^2_1 = 9 \quad (\mu_1 , \mu_2) = (35,45).\]

We fix the variance to their true values and only estimate \(\mu_1,\mu_2\).

theta_true = list(mu = c(35, 45), var = c(9,9), p = 0.36)
Zsim = sample(c(1,2),length(y),prob=c(theta_true$p,1-theta_true$p),replace=TRUE)
ysim = theta_true$mu[Zsim] +sqrt(theta_true$var[Zsim])*rnorm(length(y))

3.b Heatmap of the likelihood

When only estimating \(\mu_1\) and \(\mu_2\), we can represent the 2 variables-function \((\mu_1, \mu_2) \mapsto \log p_\theta(y)\) with a heat map as represented in the following figure.

N = 600
res_like = data.frame(mu1=double(),
                      mu2 = double(), 
                      ll = double())
grid_mu1 = seq(20,60,len=N)
grid_mu2 = seq(20,60,len=N)
grid_params <- as.data.frame(matrix(0,N*N,2))
names(grid_params) <- c('mu1', 'mu2') 
grid_params[,1]<-  rep(grid_mu1,N)
grid_params[,2] <- rep(grid_mu2,each=N)
res_loglik <- mclapply(1:N^2,function(i){
  params.i = theta_true
  params.i$mu <- c(grid_params[i,1],grid_params[i,2])
  logLikelihood(params.i,y=ysim)},mc.cores = 4)
res_like = grid_params
res_like[,3]  = unlist(res_loglik)
names(res_like)[3] <- 'loglik'
save(res_like, file = "loglik_heatmap.Rdata")

Parameter maximizing the likelihood

grid_params <- res_like[,c(1,2)]
wmax = which.max(res_like[,3])
grid_params[wmax,]
##            mu1      mu2
## 225228 35.1586 45.04174

3.b EM starting from several points

theta0 = theta_true
params0 <- lapply(1:5,function(i){theta0})
params0[[1]]$mu = c(25,25)
params0[[2]]$mu = c(30,25)
params0[[3]]$mu = c(55,40)
params0[[4]]$mu = c(25,55)
params0[[5]]$mu = c(25,45)

res_all_run <-do.call(rbind,lapply(1:length(params0),function(run){
  res_EM_run  <- EM_2Mixture(params0[[run]],ysim,tol=10^{-5},estim_only_mu = TRUE)
  niter <- nrow(res_EM_run)
  res_EM_run <- cbind(res_EM_run,rep(run,niter))
  res_EM_run <- cbind(res_EM_run, 1:niter)
  names(res_EM_run)[7]='Init'
  names(res_EM_run)[8]='numIter'
  
  return(res_EM_run)}))

Final plot:

The true parameter \((\mu_1^{\star},\mu^{\star}_{2})\) is represented by a black point at coordinates \((35,45)\). It is visibly close to the global maximum of the likelihood. In this figure we also observe a local maximum at coordinates about \((48,35)\) and a saddle point at around \((40,40)\). On the same plot, we plotted the trajectories of various EM algorihtm starting respectively at the values provided in the legend. The initial values \(4\) and \(5\) lead the EM to the global maximum (\(-1197.48\)). Starting from the initial points \(2\) and \(3\), the EM converges to a local maximum, while starting at the first initial value, the algorithm reaches a saddle point.

Conclusion

In practice the EM algorithm must be initialized on several points and/or well chosen points, trying to explore various regions of the parameter space.

References

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.