Palmer Penguins dataset

library(ggplot2)
library(parallel)
library(viridis)
library(dplyr)
library(knitr)

1. The Palmer penguin dataset

Adelie Penguins

As a first illustrative example, let us consider the Palmer penguin dataset Horst, Hill, and Gorman (2020). The dataset contains several physical measurements collected on $342$ penguins from three islands in the Palmer Archipelago, Antarctica..

library(palmerpenguins)
head(penguins)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

We focus on the measurement of bill length for $342$ penguins.

library(palmerpenguins)

gg <- ggplot(penguins,aes(x=bill_length_mm,y = after_stat(density))) 
gg <- gg + geom_histogram(color = "white",fill="cyan4")
gg

2. EM algorithm for 2 univariate gaussian distributions

2.a Functions

For params a list of

params$p: $p$, the proportion of cluster 1
params$mu $(\mu_1,\mu_2)$, the 2 means
params$var : $(\sigma^2_1, \sigma^2_2)$, the two variances

we write the following functions

logLikelihood = function(params,y)
EM_2Mixture = function(params0,y,tol=10^{-5},estim_only_mu=FALSE)

source('Functions_EM.R')
y <- penguins$bill_length_mm
y <- y[!is.na(y)]

2.b Run the EM

We now run the EM from various starting points.

KM <- kmeans(y, 2)

params0 = list(list(mu=c(40,50),var=c(5,5),p=0.5),
               list(mu=c(20,50),var=c(5,5),p=0.5),
               list(mu=c(35,70),var=c(5,5),p=0.6),
               list(mu=c(50,40),var=c(10,10),p=0.4))
params0[[5]] = list(mu=c(40,50),var=c(1,1),p=0.5)
params0[[6]] = list(mu=KM$centers,var=c(3,3),p=0.5)
res_all_run_all <-do.call(rbind,lapply(1:length(params0),function(run){
  res_EM_run  <- EM_2Mixture(params0[[run]],y,tol=10^{-7},estim_only_mu = FALSE)
  niter <- nrow(res_EM_run)
  res_EM_run <- cbind(res_EM_run,rep(run,niter))
  res_EM_run <- cbind(res_EM_run, c(1:nrow(res_EM_run)))
  names(res_EM_run)[7]='Init'
  names(res_EM_run)[8]='numIter'
  
  return(res_EM_run)}))

The following table provides 5 initial points $\theta^{(0)}$ and the likelihood reached by the EM starting from theses points. The initial points 1,2,4,5, and 6 lead to the same value of the likelihood ($p_{\widehat{\theta}}(y)=-1043.56$) and to the same value of parameter $\widehat{\theta}$ (not shown here). However, the third initial point does not allow the EM algorithm to reach the global maximum.

mu_1	mu_2	sigma_1_2	sigma_2_2	p	ll
40.00000	50.0000	5	5	0.5	-1043.558
20.00000	50.0000	5	5	0.5	-1043.558
35.00000	70.0000	5	5	0.6	-1053.445
50.00000	40.0000	10	10	0.4	-1043.558
40.00000	50.0000	1	1	0.5	-1043.558
48.49432	39.0741	3	3	0.5	-1043.558

We plot the trajectories of the log-likelihood along the iterations of the algorithm starting from the initial points $\theta^{(0)}$ reported in the table.

3. Problems of convergence

3.a Simulation of data

In order to illustrate a bit more the behaviour of the EM algorithm, we consider the special case where the parameters $(p,\sigma^2_{1}, \sigma^2_2)$ are known and fixed to their true value and we only want to estimate $(\mu_1,\mu_2)$. For simplicity, for this particular experiment, we simulated data $y$ with parameters \[p=0.36 \quad \sigma^2_2 = \sigma^2_1 = 9 \quad (\mu_1 , \mu_2) = (35,45).\]

We fix the variance to their true values and only estimate $\mu_1,\mu_2$.

theta_true = list(mu = c(35, 45), var = c(9,9), p = 0.36)

Zsim = sample(c(1,2),length(y),prob=c(theta_true$p,1-theta_true$p),replace=TRUE)
ysim = theta_true$mu[Zsim] +sqrt(theta_true$var[Zsim])*rnorm(length(y))

3.b Heatmap of the likelihood

When only estimating $\mu_1$ and $\mu_2$, we can represent the 2 variables-function $(\mu_1, \mu_2) \mapsto \log p_\theta(y)$ with a heat map as represented in the following figure.

N = 600
res_like = data.frame(mu1=double(),
                      mu2 = double(), 
                      ll = double())
grid_mu1 = seq(20,60,len=N)
grid_mu2 = seq(20,60,len=N)
grid_params <- as.data.frame(matrix(0,N*N,2))
names(grid_params) <- c('mu1', 'mu2') 
grid_params[,1]<-  rep(grid_mu1,N)
grid_params[,2] <- rep(grid_mu2,each=N)
res_loglik <- mclapply(1:N^2,function(i){
  params.i = theta_true
  params.i$mu <- c(grid_params[i,1],grid_params[i,2])
  logLikelihood(params.i,y=ysim)},mc.cores = 4)
res_like = grid_params
res_like[,3]  = unlist(res_loglik)
names(res_like)[3] <- 'loglik'
save(res_like, file = "loglik_heatmap.Rdata")

Parameter maximizing the likelihood

grid_params <- res_like[,c(1,2)]
wmax = which.max(res_like[,3])
grid_params[wmax,]

##            mu1      mu2
## 225228 35.1586 45.04174

3.b EM starting from several points

theta0 = theta_true
params0 <- lapply(1:5,function(i){theta0})
params0[[1]]$mu = c(25,25)
params0[[2]]$mu = c(30,25)
params0[[3]]$mu = c(55,40)
params0[[4]]$mu = c(25,55)
params0[[5]]$mu = c(25,45)

res_all_run <-do.call(rbind,lapply(1:length(params0),function(run){
  res_EM_run  <- EM_2Mixture(params0[[run]],ysim,tol=10^{-5},estim_only_mu = TRUE)
  niter <- nrow(res_EM_run)
  res_EM_run <- cbind(res_EM_run,rep(run,niter))
  res_EM_run <- cbind(res_EM_run, 1:niter)
  names(res_EM_run)[7]='Init'
  names(res_EM_run)[8]='numIter'
  
  return(res_EM_run)}))

Final plot:

The true parameter $(\mu_1^{\star},\mu^{\star}_{2})$ is represented by a black point at coordinates $(35,45)$. It is visibly close to the global maximum of the likelihood. In this figure we also observe a local maximum at coordinates about $(48,35)$ and a saddle point at around $(40,40)$. On the same plot, we plotted the trajectories of various EM algorihtm starting respectively at the values provided in the legend. The initial values $4$ and $5$ lead the EM to the global maximum ($-1197.48$). Starting from the initial points $2$ and $3$, the EM converges to a local maximum, while starting at the first initial value, the algorithm reaches a saddle point.

Conclusion

In practice the EM algorithm must be initialized on several points and/or well chosen points, trying to explore various regions of the parameter space.

References

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.