library(ggplot2)
library(parallel)
library(viridis)
library(dplyr)
library(knitr)
As a first illustrative example, let us consider the Palmer penguin dataset Horst, Hill, and Gorman (2020). The dataset contains several physical measurements collected on \(342\) penguins from three islands in the Palmer Archipelago, Antarctica..
library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
We focus on the measurement of bill length for \(342\) penguins.
library(palmerpenguins)
gg <- ggplot(penguins,aes(x=bill_length_mm,y = after_stat(density)))
gg <- gg + geom_histogram(color = "white",fill="cyan4")
gg
For params
a list of
params$p
: \(p\),
the proportion of cluster 1
params$mu
\((\mu_1,\mu_2)\), the 2 means
params$var
: \((\sigma^2_1, \sigma^2_2)\), the two
variances
we write the following functions
logLikelihood = function(params,y)
EM_2Mixture = function(params0,y,tol=10^{-5},estim_only_mu=FALSE)
source('Functions_EM.R')
y <- penguins$bill_length_mm
y <- y[!is.na(y)]
We now run the EM from various starting points.
KM <- kmeans(y, 2)
params0 = list(list(mu=c(40,50),var=c(5,5),p=0.5),
list(mu=c(20,50),var=c(5,5),p=0.5),
list(mu=c(35,70),var=c(5,5),p=0.6),
list(mu=c(50,40),var=c(10,10),p=0.4))
params0[[5]] = list(mu=c(40,50),var=c(1,1),p=0.5)
params0[[6]] = list(mu=KM$centers,var=c(3,3),p=0.5)
res_all_run_all <-do.call(rbind,lapply(1:length(params0),function(run){
res_EM_run <- EM_2Mixture(params0[[run]],y,tol=10^{-7},estim_only_mu = FALSE)
niter <- nrow(res_EM_run)
res_EM_run <- cbind(res_EM_run,rep(run,niter))
res_EM_run <- cbind(res_EM_run, c(1:nrow(res_EM_run)))
names(res_EM_run)[7]='Init'
names(res_EM_run)[8]='numIter'
return(res_EM_run)}))
The following table provides 5 initial points \(\theta^{(0)}\) and the likelihood reached by the EM starting from theses points. The initial points 1,2,4,5, and 6 lead to the same value of the likelihood (\(p_{\widehat{\theta}}(y)=-1043.56\)) and to the same value of parameter \(\widehat{\theta}\) (not shown here). However, the third initial point does not allow the EM algorithm to reach the global maximum.
mu_1 | mu_2 | sigma_1_2 | sigma_2_2 | p | ll |
---|---|---|---|---|---|
40.00000 | 50.0000 | 5 | 5 | 0.5 | -1043.558 |
20.00000 | 50.0000 | 5 | 5 | 0.5 | -1043.558 |
35.00000 | 70.0000 | 5 | 5 | 0.6 | -1053.445 |
50.00000 | 40.0000 | 10 | 10 | 0.4 | -1043.558 |
40.00000 | 50.0000 | 1 | 1 | 0.5 | -1043.558 |
48.49432 | 39.0741 | 3 | 3 | 0.5 | -1043.558 |
We plot the trajectories of the log-likelihood along the iterations of the algorithm starting from the initial points \(\theta^{(0)}\) reported in the table.
In order to illustrate a bit more the behaviour of the EM algorithm, we consider the special case where the parameters \((p,\sigma^2_{1}, \sigma^2_2)\) are known and fixed to their true value and we only want to estimate \((\mu_1,\mu_2)\). For simplicity, for this particular experiment, we simulated data \(y\) with parameters \[p=0.36 \quad \sigma^2_2 = \sigma^2_1 = 9 \quad (\mu_1 , \mu_2) = (35,45).\]
We fix the variance to their true values and only estimate \(\mu_1,\mu_2\).
theta_true = list(mu = c(35, 45), var = c(9,9), p = 0.36)
Zsim = sample(c(1,2),length(y),prob=c(theta_true$p,1-theta_true$p),replace=TRUE)
ysim = theta_true$mu[Zsim] +sqrt(theta_true$var[Zsim])*rnorm(length(y))
When only estimating \(\mu_1\) and \(\mu_2\), we can represent the 2 variables-function \((\mu_1, \mu_2) \mapsto \log p_\theta(y)\) with a heat map as represented in the following figure.
N = 600
res_like = data.frame(mu1=double(),
mu2 = double(),
ll = double())
grid_mu1 = seq(20,60,len=N)
grid_mu2 = seq(20,60,len=N)
grid_params <- as.data.frame(matrix(0,N*N,2))
names(grid_params) <- c('mu1', 'mu2')
grid_params[,1]<- rep(grid_mu1,N)
grid_params[,2] <- rep(grid_mu2,each=N)
res_loglik <- mclapply(1:N^2,function(i){
params.i = theta_true
params.i$mu <- c(grid_params[i,1],grid_params[i,2])
logLikelihood(params.i,y=ysim)},mc.cores = 4)
res_like = grid_params
res_like[,3] = unlist(res_loglik)
names(res_like)[3] <- 'loglik'
save(res_like, file = "loglik_heatmap.Rdata")
Parameter maximizing the likelihood
grid_params <- res_like[,c(1,2)]
wmax = which.max(res_like[,3])
grid_params[wmax,]
## mu1 mu2
## 225228 35.1586 45.04174
theta0 = theta_true
params0 <- lapply(1:5,function(i){theta0})
params0[[1]]$mu = c(25,25)
params0[[2]]$mu = c(30,25)
params0[[3]]$mu = c(55,40)
params0[[4]]$mu = c(25,55)
params0[[5]]$mu = c(25,45)
res_all_run <-do.call(rbind,lapply(1:length(params0),function(run){
res_EM_run <- EM_2Mixture(params0[[run]],ysim,tol=10^{-5},estim_only_mu = TRUE)
niter <- nrow(res_EM_run)
res_EM_run <- cbind(res_EM_run,rep(run,niter))
res_EM_run <- cbind(res_EM_run, 1:niter)
names(res_EM_run)[7]='Init'
names(res_EM_run)[8]='numIter'
return(res_EM_run)}))
Final plot:
The true parameter \((\mu_1^{\star},\mu^{\star}_{2})\) is represented by a black point at coordinates \((35,45)\). It is visibly close to the global maximum of the likelihood. In this figure we also observe a local maximum at coordinates about \((48,35)\) and a saddle point at around \((40,40)\). On the same plot, we plotted the trajectories of various EM algorihtm starting respectively at the values provided in the legend. The initial values \(4\) and \(5\) lead the EM to the global maximum (\(-1197.48\)). Starting from the initial points \(2\) and \(3\), the EM converges to a local maximum, while starting at the first initial value, the algorithm reaches a saddle point.
In practice the EM algorithm must be initialized on several points and/or well chosen points, trying to explore various regions of the parameter space.