We consider the dataset ohrazeni, available through data(ohrazeni) in the R package traitor. This package accompanies the article “Community trait response to environment: disentangling species turnover vs intraspecific trait variability effects” by Lepš et al. (2011), which studies the vegetation composition of meadows in Bohemia, Czech Republic. Four traits, namely the specific leaf area (SLA), the leaf dry matter content (LDMC), the reproductive plant height and the seed weight, are measured for 58 species. Our objective is to identify groups of species that share similar traits.
To do so, we will use a multivariate Gaussian mixture model.
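Recall that a multivariate Gaussian mixture model with \(K\) components assumes that each observation \(y_i \in \mathbb{R}^d\) (here \(d = 4\), one coordinate per trait) is drawn independently from the density
\[
p(y_i \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(y_i ; \mu_k, \Sigma_k\right),
\qquad \pi_k \geq 0, \quad \sum_{k=1}^{K} \pi_k = 1,
\]
where \(\theta = (\pi_1,\dots,\pi_K, \mu_1,\dots,\mu_K, \Sigma_1,\dots,\Sigma_K)\) gathers the mixing proportions, the mean vectors and the covariance matrices. Each component corresponds to a candidate group of species.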
An extract of the table corresponding to this dataset is provided hereafter.
In this dataset, we have removed the non-numerical variables and the outliers (very large seed weights). The resulting dataset is stored in the file Bohemia_vegetation.csv.
library(dplyr)  # data manipulation and the pipe %>%
library(knitr)  # kable() to print tables

donnees <- read.table("Bohemia_vegetation.csv",
                      header = TRUE, sep = ",") %>%
  rename("Seed weight" = "Seed.weight")
kable(head(donnees))
Species | SLA | LDMC | Height | Seed weight |
---|---|---|---|---|
Sp1 | 28.3 | 275.9 | 60.5 | 0.0 |
Sp2 | 27.7 | 294.3 | 39.5 | 0.1 |
Sp3 | 23.8 | 227.4 | 105.0 | 0.2 |
Sp4 | 30.4 | 166.2 | 45.0 | 1.3 |
Sp5 | 28.4 | 187.4 | 75.0 | 0.7 |
Sp6 | 23.3 | 322.0 | 27.5 | 0.5 |
We now plot the data.
# Descriptive figures -----------------------------------------------------
## Pair plot ---------------------------------------------------------------
library(ggplot2)  # plotting
library(GGally)   # ggpairs() for pair plots

pair_plot <- ggpairs(donnees %>% select_if(is.numeric),
                     upper = list(continuous = wrap("points", size = 1 / .pt)),
                     lower = "blank") +
  theme_bw() +
  theme(text = element_text(size = 10),
        axis.text = element_blank(),
        axis.ticks = element_blank())
pair_plot
The functions used in the remainder of this analysis (in particular get_EM_estimation_mixture and get_theta0) are gathered in the file utils_gaussian_mixture_functions.R.
source('utils_gaussian_mixture_functions.R')
ys <- read.table("Bohemia_vegetation.csv",
header = TRUE, row.names = 1, sep = ",") %>% as.matrix()
We now run the EM algorithm for several numbers of clusters and several initialisation points, using the following code.
library(purrr)     # map(), map_dbl()
library(parallel)  # mclapply(), detectCores()

# For each number of components k in ks, run the EM algorithm from n_trys
# random starting points and keep the run reaching the highest log-likelihood.
get_best_estimate <- function(ks, ys, n_trys){
  purrr::map(ks, function(k){
    if(k == 1){
      # Single component: the maximum likelihood estimate is unique,
      # so one run from an arbitrary starting point is enough.
      all_res_k <- get_EM_estimation_mixture(ys = ys,
                                             theta0 = list(pi = 1,
                                                           mu = rep(0, ncol(ys)),
                                                           Sigma = array(diag(1, ncol(ys)),
                                                                         dim = c(ncol(ys), ncol(ys), 1))),
                                             n_iter_max = 2000)
      return(c(all_res_k, k = k))
    }
    else{
      # Run the EM from n_trys random starting points, in parallel.
      all_res_k <- mclapply(1:n_trys, function(i){
        set.seed(i)
        initial_theta <- get_theta0(k = k, ys = ys)
        all_res <- try(get_EM_estimation_mixture(ys = ys,
                                                 theta0 = initial_theta,
                                                 n_iter_max = 2000))
        if(inherits(all_res, "try-error")){
          return(NULL)  # discard runs where the EM failed
        }
        else{
          return(c(all_res, seed = i))
        }
      }, mc.cores = detectCores() - 2)
      # Final log-likelihood of each run (-Inf for failed runs).
      lls <- map_dbl(all_res_k, function(x){
        if(is.null(x)){
          return(-Inf)
        }
        else{
          return(x$final_criterions$log_lik)
        }
      })
      # Keep the best run for this value of k.
      return(c(all_res_k[[which.max(lls)]], k = k))
    }
  })
}
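The random starting points are drawn by get_theta0(), defined in utils_gaussian_mixture_functions.R. As an illustration only (the actual implementation may differ), such an initialisation could for instance use k randomly chosen observations as initial means, equal mixing proportions and the empirical covariance matrix for every component:

# Illustrative sketch of a random initialisation; this is NOT necessarily
# the strategy implemented in utils_gaussian_mixture_functions.R.
get_theta0_sketch <- function(k, ys){
  d <- ncol(ys)
  list(pi = rep(1 / k, k),                           # equal mixing proportions
       mu = ys[sample(nrow(ys), k), , drop = FALSE], # k observations as initial means (k x d matrix)
       Sigma = array(cov(ys), dim = c(d, d, k)))     # empirical covariance for each component
}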
We obtain the estimates with the following code. For \(K = 3\), we additionally keep all 200 runs (and not only the best one) in order to display the log-likelihood trajectories below.
# Best estimate for each K in 1:6 (200 random starts per K)
gmm_est_multiple_ks <- get_best_estimate(1:6, ys, 200)

# For K = 3, keep the 200 individual runs (and not only the best one)
# so that the log-likelihood trajectories can be displayed below.
gmm_est_k3 <- mclapply(1:200, function(i){
  set.seed(i)
  initial_theta <- get_theta0(k = 3, ys = ys)
  all_res <- try(get_EM_estimation_mixture(ys = ys,
                                           theta0 = initial_theta,
                                           n_iter_max = 2000))
  if(inherits(all_res, "try-error")){
    return(NULL)
  }
  else{
    return(c(all_res, seed = i))
  }
}, mc.cores = detectCores() - 2)

# Best K = 3 run: the one reaching the highest final log-likelihood
gmm_est_best <- map_dbl(gmm_est_k3, function(x){
  if(is.null(x))
    return(-Inf)
  else
    return(x$final_criterions$log_lik)
}) %>%
  which.max() %>%
  gmm_est_k3[[.]]

# Save all objects whose names start with "gmm_est"
save(list = ls(pattern = "gmm_est"),
     file = "gmm_est.RData")
First, we illustrate an important feature of the EM algorithm: its sensitivity to the starting parameter \(\theta^{(0)}\). For \(K = 3\), we randomly chose \(200\) different starting points and ran the algorithm from each of them. The log-likelihood is monitored throughout the algorithm, which stops when its increase becomes smaller than \(10^{-8}\). The following figure shows the differences in the final log-likelihood, thereby illustrating the number of local maxima.
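Concretely, the monitored quantity is the observed-data log-likelihood, recomputed after each E step. The following minimal sketch shows how it can be obtained with mvtnorm::dmvnorm(); it assumes that the means are stored as a \(K \times d\) matrix and the covariances as a \(d \times d \times K\) array, which may differ from the actual implementation in utils_gaussian_mixture_functions.R.

library(mvtnorm)  # dmvnorm(): multivariate Gaussian density

# Sketch of an E step: posterior weights tau_ik and observed log-likelihood.
e_step_sketch <- function(ys, theta){
  K <- length(theta$pi)
  # n x K matrix whose (i, k) entry is pi_k * N(y_i; mu_k, Sigma_k)
  dens <- sapply(seq_len(K), function(j){
    theta$pi[j] * dmvnorm(ys, mean = theta$mu[j, ], sigma = theta$Sigma[, , j])
  })
  list(posterior_weights = dens / rowSums(dens),
       log_lik = sum(log(rowSums(dens))))  # quantity whose increase is compared to 1e-8
}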
likelihoods_plot <- map(gmm_est_k3, "log_likelihoods") %>%
map_dfr(function(x){
if(is.null(x)){
return(NULL)
}
else{
n_iter <- length(x) - 1
return(data.frame(iteration = 0:n_iter,
log_lik = 2 * x * nrow(donnees)))
}
}, .id = "Replicate") %>%
ggplot() +
aes(x = iteration, y = log_lik, group = Replicate) +
geom_line(alpha = .5) +
theme_bw() +
lims(y = c(-2000, -1500)) +
labs(x = "EM iteration", y = "Log likelihood value")
likelihoods_plot
It is worth noting that this is not a problem in practice, as one can run the algorithm (in parallel) from multiple starting points and keep the best one with respect to the likelihood. It does, however, illustrate the difficulty of finding a global maximum of the likelihood in complex settings.
The previous procedure was performed for \(K\in \lbrace 1,\dots, 6\rbrace\), keeping the best final point among the 200 trials for each \(K\). For the 6 resulting models, we compute the three model selection criteria discussed in the course (AIC, BIC, ICL), together with the maximised log-likelihood.
k | log_lik | AIC | BIC | ICL |
---|---|---|---|---|
1 | -1656.045 | -1684.045 | -1712.891 | -1712.891 |
2 | -1598.695 | -1656.695 | -1716.448 | -1730.783 |
3 | -1531.577 | -1619.577 | -1710.236 | -1715.665 |
4 | -1471.263 | -1589.263 | -1710.829 | -1713.910 |
5 | -1442.812 | -1590.812 | -1743.285 | -1748.982 |
6 | -1422.497 | -1600.497 | -1783.877 | -1788.049 |
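The penalties of AIC and BIC involve the number of free parameters of the \(K\)-component model in dimension \(d = 4\): \(K - 1\) mixing proportions, \(Kd\) mean coordinates and \(K d (d + 1)/2\) covariance terms. As a small illustrative helper (the criteria themselves are returned by the estimation functions in final_criterions):

# Number of free parameters of a K-component Gaussian mixture with full
# covariance matrices in dimension d.
n_free_parameters <- function(K, d){
  (K - 1) + K * d + K * d * (d + 1) / 2
}
n_free_parameters(K = 3, d = 4)  # 44 free parameters for the model retained below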
We show the results in Figure \(\ref{fig:gmm:criterions}\).
library(tidyr)  # pivot_longer()

criterions_plot <- map_dfr(gmm_est_multiple_ks, "final_criterions") %>%
mutate(log_lik = - 2 * log_lik) %>%
pivot_longer(-k) %>%
ggplot() +
aes(x = k, y = -value, color = name) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 1:6) +
theme_bw() +
labs(x = "Number of components",
y = "Criterion value",
color = "")
criterions_plot
We can notice that AIC and ICL select a 4-component model, while BIC leads to \(K = 3\), which is kept as the final model in what follows.
For \(K = 3\), the best parameter estimate (over the 200 runs) is retained. We are then able to cluster the observations.
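Each observation is assigned to the component with the highest posterior probability (MAP rule),
\[
\hat{z}_i = \arg\max_{k \in \{1,\dots,K\}} \hat{\tau}_{ik},
\qquad
\hat{\tau}_{ik} = \frac{\hat{\pi}_k \, \mathcal{N}(y_i ; \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{l=1}^{K} \hat{\pi}_l \, \mathcal{N}(y_i ; \hat{\mu}_l, \hat{\Sigma}_l)},
\]
which is exactly what the which.max over the rows of posterior_weights performs below.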
estimated_clusters <- gmm_est_best$posterior_weights %>%
apply(1, which.max) %>%
factor()
ClusterTable <- as.data.frame(table(estimated_clusters))
kable(ClusterTable)
estimated_clusters | Freq |
---|---|
1 | 15 |
2 | 28 |
3 | 15 |
The three clusters (displayed in red, green and blue) gather 15, 28 and 15 observations respectively. We now plot the distributions of the variables within each cluster.
cluster_pair_plot <- ggpairs(donnees %>%
select_if(is.numeric) %>%
mutate(Cluster = estimated_clusters),
mapping = aes(color = Cluster, alpha = 1),
upper = list(continuous = wrap("points", alpha = 1,
size = 1/.pt)),
lower = "blank") +
theme_bw() +
theme(text = element_text(size = 10),
axis.text = element_blank(),
axis.ticks = element_blank())
cluster_pair_plot
It can be interesting to compare the clusterings obtained with \(K = 3\) and \(K = 4\).
gmm_est_k4 <- gmm_est_multiple_ks[[4]]
estimated_clusters_k4 <- gmm_est_k4$posterior_weights %>%
apply(1, which.max) %>%
factor()
# Contingency table of the two clusterings (rows: K = 4, columns: K = 3)
Table_compare_cluster <- table(estimated_clusters_k4, estimated_clusters)
rownames(Table_compare_cluster) <- 1:4
kable(Table_compare_cluster)
K = 4 \ K = 3 | 1 | 2 | 3 |
---|---|---|---|
1 | 0 | 0 | 15 |
2 | 1 | 28 | 0 |
3 | 6 | 0 | 0 |
4 | 8 | 0 | 0 |
Clusters 2 and 3 of the \(K = 3\) solution are essentially preserved when moving to \(K = 4\), whereas cluster 1 is mainly split into two new components.
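To summarise the agreement between the two partitions with a single number, one could also compute the adjusted Rand index, for instance with the mclust package (this goes slightly beyond the original analysis and assumes mclust is installed).

library(mclust)  # adjustedRandIndex()

# Adjusted Rand index between the K = 3 and K = 4 partitions (1 = identical partitions)
adjustedRandIndex(estimated_clusters, estimated_clusters_k4)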