library(extraDistr)
library(ggplot2)
library(knitr)

1. The Barents Fish dataset

Species distribution models (SDM) aim at understanding how environmental conditions affect the abundance of a given species in a given site. The data are typically collected in the following way: \(n\) sites are visited and in each site \(i\) (\(1 \leq i \leq n\)) a \(d\)-dimensional vector \(x_i\) of environmental descriptors is recorded, as well as the number \(y_i\) of individuals of the species observed in the site.

Map of the Barents sea, where data were collected

Meadow in Bohemia

Fossheim, Nilssen, and Aschan (2006) measured the abundance of cod ({}) measured in \(n=89\) stations of the Barents sea. In each station, fishes were captured according to the same protocole, the latitude and longitude of each site were measured together with two environmental covariates: depth and temperature of the water. The data are available from the R package Chiquet, Mariadassou, and Robin (2021).

We first give the first few lines of the dataset and the histogram of the observed abundances, which display a large variance and a high number of observations equal to \(0\): the species is actually not observed (i.e. \(y_i = 0\)) in \(n_0 = 61\) stations.

dataCodBarents <- read.table('BarentsFish.csv', sep=';', header=TRUE)
Covariates <- as.matrix(dataCodBarents[, (2:5)])
Counts <- dataCodBarents[, 6:ncol(dataCodBarents)]
j <- 21 # We focus on Species Ga_mo
Abundance <- Counts[, j]; Presence <- (Abundance>0)
Cod<- as.data.frame(Covariates)
Cod$Abundance <-Abundance 
Cod$Presence <- Presence
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Latitude Longitude Depth Temperature Abundance Presence
1 71.10 22.43 349 3.95 309 TRUE
2 71.32 23.68 382 3.75 1041 TRUE
3 71.60 24.90 294 3.45 218 TRUE
4 71.27 25.88 304 3.65 77 TRUE
5 71.52 28.12 384 3.35 13 TRUE
10 71.30 32.15 256 3.35 0 FALSE
11 71.22 33.15 254 2.55 0 FALSE
12 71.58 32.37 297 2.65 0 FALSE

We renormalize the covariates.

p <- 1+ncol(Covariates)
Covariates <- as.matrix(scale(Covariates))
Barents <- as.data.frame(cbind(Covariates, Abundance))
names(Barents)[5] <- 'Abundance'
Covariates <- cbind(rep(1, n), Covariates)
colnames(Covariates)[1] <- 'Intercept'

2. Classical Poisson or logistic regression approaches.

The Poisson regression model (which is special instance of generalized linear models, see Appendix of the book) provides a natural and well established framework for such count data. This model states that the sites are all independent and that the mean number of observed individuals in site \(i\) depends linearly on the covariates, through the log link function:

The model can be adapted to account for heterogeneous sampling efforts (e.g. different observation times) by adding an known site-specific offset term \(o_i\) to the regression model. The unknown parameter \(\theta\) is only the vector of regression coefficients \(\beta\), its estimation (by maximizing the likelihood) and interpretation are straightforward.

# Poisson regression model for the Abundance
regP <- glm(Abundance ~ Latitude + Longitude + Depth + Temperature, data=Barents, family='poisson')
summary(regP)
## 
## Call:
## glm(formula = Abundance ~ Latitude + Longitude + Depth + Temperature, 
##     family = "poisson", data = Barents)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.009781   0.085862   0.114    0.909    
## Latitude    -0.721225   0.117634  -6.131 8.73e-10 ***
## Longitude   -0.042936   0.032051  -1.340    0.180    
## Depth        0.916611   0.025273  36.268  < 2e-16 ***
## Temperature  2.478534   0.114704  21.608  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12865.9  on 88  degrees of freedom
## Residual deviance:  2148.5  on 84  degrees of freedom
## AIC: 2295.7
## 
## Number of Fisher Scoring iterations: 7
regP$beta <- regP$coefficients

Alternatively, one may aim at understanding the drivers of the simple presence of the species in each site. One way is to consider the binary variable which equal to \(1\) is the count is strictly positive, \(0\) otherwise and to use a logistic regression model:

Barents$Presence = 1*(Barents$Abundance>0)
regL <- glm(Presence ~  Latitude + Longitude + Depth + Temperature, data=Barents, family='binomial')
regL$alpha <- regL$coefficients
regL$linPred <- as.vector(Covariates%*%regL$alpha)
logLik(regL)
## 'log Lik.' -29.29516 (df=5)
summary(regL)
## 
## Call:
## glm(formula = Presence ~ Latitude + Longitude + Depth + Temperature, 
##     family = "binomial", data = Barents)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2753     0.3670  -3.475 0.000511 ***
## Latitude     -0.2506     0.6646  -0.377 0.706134    
## Longitude     0.3009     0.4030   0.747 0.455265    
## Depth        -0.3869     0.3837  -1.008 0.313313    
## Temperature   1.9941     0.6823   2.923 0.003470 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 110.85  on 88  degrees of freedom
## Residual deviance:  58.59  on 84  degrees of freedom
## AIC: 68.59
## 
## Number of Fisher Scoring iterations: 5

This model also suffers limitation, because the presence of the species is not directly observed. Indeed, whenever the species is not observed (\(Y_i = 0\)), it is not possible to decide whether it is actually absent from the site, or simply unobserved (the two cases are sometimes referred to as ‘true zero’ versus ‘false zero’).

3. ZIP model

We consider the ZIP model presented in the Class.

2. EM algorithm for multivariate gaussian distributions

2.a Functions

The useful functions are in the file FunctionsZIPreg.R.

source('FunctionsZIPreg.R')

Have a look at the codes

2.b Run the EM

thetaInit <- InitZIP(X=Covariates, Y=Abundance)
em <- EMZIPreg(X=Covariates, Y=Abundance, thetaInit=thetaInit)
## iter 2 
## -0.9208967 -0.139851 0.2837439 -0.5660354 1.784963 1.494935 -0.3852219 -0.256608 0.8664746 1.876195 1.485154 -892.3938 
## iter 3 
## -0.9704102 -0.2033428 0.3579273 -0.5653342 1.689021 1.528935 -0.3523572 -0.2594829 0.8664457 1.883845 0.09594166 -892.1953 
## iter 4 
## -0.9663127 -0.247716 0.3714958 -0.5647264 1.643797 1.543748 -0.372098 -0.2646722 0.864083 1.856969 0.04522449 -892.1628 
## iter 5 
## -0.960954 -0.2668028 0.3728521 -0.5689256 1.621961 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.02183546 -892.1604 
## iter 6 
## -0.9579928 -0.2758148 0.3733262 -0.5715473 1.609573 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.01238802 -892.1596 
## iter 7 
## -0.9551884 -0.2807287 0.3735456 -0.5734176 1.602532 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.007041541 -892.1594 
## iter 8 
## -0.9537642 -0.2835233 0.3738369 -0.5747695 1.598074 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.004458231 -892.1593 
## iter 9 
## -0.9523285 -0.2849615 0.3737477 -0.5757986 1.595363 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.002710613 -892.1592 
## iter 10 
## -0.9516835 -0.2857548 0.3737287 -0.5762628 1.593824 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.001538903 -892.1592 
## iter 11 
## -0.9512547 -0.2861404 0.3738916 -0.5766751 1.59319 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.0006343814 -892.1592 
## iter 12 
## -0.9510709 -0.28605 0.3738501 -0.5771132 1.592726 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.0004642512 -892.1592 
## iter 13 
## -0.9507453 -0.2864516 0.3742083 -0.5770331 1.592106 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.00061946 -892.1592 
## iter 14 
## -0.950547 -0.2867738 0.37427 -0.5773241 1.591798 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.0003222188 -892.1592 
## iter 15 
## -0.9504101 -0.2866875 0.3742244 -0.5773307 1.591723 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.0001368686 -892.1592 
## iter 16 
## -0.9504101 -0.2866875 0.3742244 -0.5773307 1.591723 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0 -892.1592

We also consider alternative methods : a direct optimization and we use the R package pscl We will compare the performances of the methods by comparing the final Log Likelihoods.

# Direct optimization
opt <- optim(unlist(thetaInit), f=LogLikZIP, Y=Abundance, X=Covariates, control=list(fnscale=-1))
opt$alpha <- opt$par[1:ncol(Covariates)]; opt$beta <- opt$par[-(1:ncol(Covariates))]
opt$theta <- list(alpha=opt$alpha, beta=opt$beta)
# Package
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
fit <- zeroinfl(Abundance ~ -1 + Covariates, dist="poisson")
fit$theta <- list(alpha=fit$coefficients$zero, beta=fit$coefficients$count)

LL <- c(LogLikZIP(unlist(thetaInit), X=Covariates, Y=Abundance), 
  LogLikZIP(unlist(opt$theta), X=Covariates, Y=Abundance), 
  LogLikZIP(unlist(fit$theta), X=Covariates, Y=Abundance),
  LogLikZIP(unlist(em$theta), X=Covariates, Y=Abundance))
names(LL) = c('Init','Optim','pscl',"EM")
print(LL)
##      Init     Optim      pscl        EM 
## -988.3747 -894.6347 -975.5075 -892.1592

The EM leads to a higher likelihood value. We keep the required quantities.

regZIP <- em
regZIP$linPredPresence <- Covariates%*%regZIP$theta$alpha
regZIP$linPredAbundance <- Covariates%*%regZIP$theta$beta
regZIP$tau <- regZIP$tau[, 2]

3. Results

3.a Parameter estimates

We now compare the estimates with the 3 models (ZIP, Logistic and Poisson)

regCoef <- matrix(NA,3,p*2) 
rownames(regCoef) <- c('Logistic', 'Poisson', 'ZIP')
colnames(regCoef) <- c(names(regP$coefficients), names(regL$coefficients))
regCoef[1,1:p] <- round(regL$alpha,3)
regCoef[2,p+(1:p)] <- round(regP$beta,3)
regCoef[3,] <- round(as.vector(c(regZIP$theta$alpha, regZIP$theta$beta)),3)
kable(regCoef)
(Intercept) Latitude Longitude Depth Temperature (Intercept) Latitude Longitude Depth Temperature
Logistic -1.275 -0.251 0.301 -0.387 1.994 NA NA NA NA NA
Poisson NA NA NA NA NA 0.010 -0.721 -0.043 0.917 2.479
ZIP -0.950 -0.287 0.374 -0.577 1.592 1.544 -0.372 -0.265 0.864 1.857

This table gives the MLE of the regression coefficients for the Poisson regression , the logistic regression and the ZIP regression models. We observe that the regression coefficients for both the presence probability (\(\alpha\)) and the abundance (\(\beta\)) are different when dealing with both aspect separately (i.e logistic regression or Poisson regression) or jointly (ZIP regression).

As the covariates have been centered, one may focus on the intercepts, which control the presence probability and the abundance, respectively, in a ‘mean’ site. The ZIP regression yields a higher mean presence probability than the logistic, because it accounts for the fact that the species can be present, when it is actually not observed.

As for the Poisson part (which deals with the mean abundance), the Poisson regression yields a smaller mean abundance, as it needs to accommodate for the numerous zeros in the data set, whereas the abundance part of the ZIP regression only deals with case where the species is actually present.

3.b Presence probability.

We first plot the estimated probability \(\widehat{\pi}_i^{ZIP}\) of presence in each station, as a function of the linear predictor \(x_i^\top \widehat{\alpha}^{ZIP}\). The blue crosses indicate the probability of presence according to the logistic regression \(\pi_i^{logistic}\): we see that the two models yield similar probabilities. Still, the binary part of the ZIP model (encoded in \(\pi_i^{ZIP}\)) does not contain all information, regarding the prediction of the actual presence of the species in a given site: the abundance part must also be accounted for.

xGrid <- seq(-4, 4, by=.001)
orderRegL <- order(regL$linPred)
plot(regL$linPred, Presence, xlab='Logistic regression', ylab='presence probability',
     pch=20, cex=1.5, cex.axis=1.5, cex.lab=1.5, col=2-Presence); 
lines(xGrid, plogis(xGrid), col=4, lwd=2); 

# ZIP
orderZIP <- order(regZIP$linPredPresence)
orderZIP0 <- order(regZIP$linPredPresence[which(!Presence)])
plot(regZIP$linPredPresence, Presence, xlab='ZIP regression', ylab='presence probability',
     pch=20, cex=1.5, cex.axis=1.5, cex.lab=1.5, col=2-Presence); 
lines(xGrid, plogis(xGrid), lwd=2); 
points(regZIP$linPredPresence[orderZIP],regL$fitted.values[orderZIP], pch='+', col=4, cex=1.5) 

# comparison 
plot(regZIP$tau, regL$fitted.values, 
     xlab='ZIP regression',  ylab='Logistic regression', 
     pch=20, col=1+(Abundance==0), cex=1.5, cex.axis=1.5, cex.lab=1.5)
abline(v=.5, h=.5, col=4, lty=2, lwd=2)

References

Chiquet, J., M. Mariadassou, and S. Robin. 2021. “The Poisson-Lognormal Model as a Versatile Framework for the Joint Analysis of Species Abundances.” Frontiers in Ecology and Evolution 9: 188. https://doi.org/10.3389/fevo.2021.588292.
Fossheim, M., E. M Nilssen, and M. Aschan. 2006. “Fish Assemblages in the Barents Sea.” Marine Biology Research 2 (4): 260–69.