library(extraDistr)
library(ggplot2)
library(knitr)
Species distribution models (SDM) aim at understanding how environmental conditions affect the abundance of a given species in a given site. The data are typically collected in the following way: \(n\) sites are visited and in each site \(i\) (\(1 \leq i \leq n\)) a \(d\)-dimensional vector \(x_i\) of environmental descriptors is recorded, as well as the number \(y_i\) of individuals of the species observed in the site.
Map of the Barents sea, where data were collected
Fossheim, Nilssen, and Aschan (2006) measured the abundance of cod ({}) measured in \(n=89\) stations of the Barents sea. In each station, fishes were captured according to the same protocole, the latitude and longitude of each site were measured together with two environmental covariates: depth and temperature of the water. The data are available from the R package Chiquet, Mariadassou, and Robin (2021).
We first give the first few lines of the dataset and the histogram of the observed abundances, which display a large variance and a high number of observations equal to \(0\): the species is actually not observed (i.e. \(y_i = 0\)) in \(n_0 = 61\) stations.
dataCodBarents <- read.table('BarentsFish.csv', sep=';', header=TRUE)
Covariates <- as.matrix(dataCodBarents[, (2:5)])
Counts <- dataCodBarents[, 6:ncol(dataCodBarents)]
j <- 21 # We focus on Species Ga_mo
Abundance <- Counts[, j]; Presence <- (Abundance>0)
Cod<- as.data.frame(Covariates)
Cod$Abundance <-Abundance
Cod$Presence <- Presence
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Latitude | Longitude | Depth | Temperature | Abundance | Presence | |
---|---|---|---|---|---|---|
1 | 71.10 | 22.43 | 349 | 3.95 | 309 | TRUE |
2 | 71.32 | 23.68 | 382 | 3.75 | 1041 | TRUE |
3 | 71.60 | 24.90 | 294 | 3.45 | 218 | TRUE |
4 | 71.27 | 25.88 | 304 | 3.65 | 77 | TRUE |
5 | 71.52 | 28.12 | 384 | 3.35 | 13 | TRUE |
10 | 71.30 | 32.15 | 256 | 3.35 | 0 | FALSE |
11 | 71.22 | 33.15 | 254 | 2.55 | 0 | FALSE |
12 | 71.58 | 32.37 | 297 | 2.65 | 0 | FALSE |
We renormalize the covariates.
p <- 1+ncol(Covariates)
Covariates <- as.matrix(scale(Covariates))
Barents <- as.data.frame(cbind(Covariates, Abundance))
names(Barents)[5] <- 'Abundance'
Covariates <- cbind(rep(1, n), Covariates)
colnames(Covariates)[1] <- 'Intercept'
The Poisson regression model (which is special instance of generalized linear models, see Appendix of the book) provides a natural and well established framework for such count data. This model states that the sites are all independent and that the mean number of observed individuals in site \(i\) depends linearly on the covariates, through the log link function:
The model can be adapted to account for heterogeneous sampling efforts (e.g. different observation times) by adding an known site-specific offset term \(o_i\) to the regression model. The unknown parameter \(\theta\) is only the vector of regression coefficients \(\beta\), its estimation (by maximizing the likelihood) and interpretation are straightforward.
# Poisson regression model for the Abundance
regP <- glm(Abundance ~ Latitude + Longitude + Depth + Temperature, data=Barents, family='poisson')
summary(regP)
##
## Call:
## glm(formula = Abundance ~ Latitude + Longitude + Depth + Temperature,
## family = "poisson", data = Barents)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.009781 0.085862 0.114 0.909
## Latitude -0.721225 0.117634 -6.131 8.73e-10 ***
## Longitude -0.042936 0.032051 -1.340 0.180
## Depth 0.916611 0.025273 36.268 < 2e-16 ***
## Temperature 2.478534 0.114704 21.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12865.9 on 88 degrees of freedom
## Residual deviance: 2148.5 on 84 degrees of freedom
## AIC: 2295.7
##
## Number of Fisher Scoring iterations: 7
regP$beta <- regP$coefficients
Alternatively, one may aim at understanding the drivers of the simple presence of the species in each site. One way is to consider the binary variable which equal to \(1\) is the count is strictly positive, \(0\) otherwise and to use a logistic regression model:
Barents$Presence = 1*(Barents$Abundance>0)
regL <- glm(Presence ~ Latitude + Longitude + Depth + Temperature, data=Barents, family='binomial')
regL$alpha <- regL$coefficients
regL$linPred <- as.vector(Covariates%*%regL$alpha)
logLik(regL)
## 'log Lik.' -29.29516 (df=5)
summary(regL)
##
## Call:
## glm(formula = Presence ~ Latitude + Longitude + Depth + Temperature,
## family = "binomial", data = Barents)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2753 0.3670 -3.475 0.000511 ***
## Latitude -0.2506 0.6646 -0.377 0.706134
## Longitude 0.3009 0.4030 0.747 0.455265
## Depth -0.3869 0.3837 -1.008 0.313313
## Temperature 1.9941 0.6823 2.923 0.003470 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 110.85 on 88 degrees of freedom
## Residual deviance: 58.59 on 84 degrees of freedom
## AIC: 68.59
##
## Number of Fisher Scoring iterations: 5
This model also suffers limitation, because the presence of the species is not directly observed. Indeed, whenever the species is not observed (\(Y_i = 0\)), it is not possible to decide whether it is actually absent from the site, or simply unobserved (the two cases are sometimes referred to as ‘true zero’ versus ‘false zero’).
We consider the ZIP model presented in the Class.
The useful functions are in the file
FunctionsZIPreg.R
.
source('FunctionsZIPreg.R')
Have a look at the codes
thetaInit <- InitZIP(X=Covariates, Y=Abundance)
em <- EMZIPreg(X=Covariates, Y=Abundance, thetaInit=thetaInit)
## iter 2
## -0.9208967 -0.139851 0.2837439 -0.5660354 1.784963 1.494935 -0.3852219 -0.256608 0.8664746 1.876195 1.485154 -892.3938
## iter 3
## -0.9704102 -0.2033428 0.3579273 -0.5653342 1.689021 1.528935 -0.3523572 -0.2594829 0.8664457 1.883845 0.09594166 -892.1953
## iter 4
## -0.9663127 -0.247716 0.3714958 -0.5647264 1.643797 1.543748 -0.372098 -0.2646722 0.864083 1.856969 0.04522449 -892.1628
## iter 5
## -0.960954 -0.2668028 0.3728521 -0.5689256 1.621961 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.02183546 -892.1604
## iter 6
## -0.9579928 -0.2758148 0.3733262 -0.5715473 1.609573 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.01238802 -892.1596
## iter 7
## -0.9551884 -0.2807287 0.3735456 -0.5734176 1.602532 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.007041541 -892.1594
## iter 8
## -0.9537642 -0.2835233 0.3738369 -0.5747695 1.598074 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.004458231 -892.1593
## iter 9
## -0.9523285 -0.2849615 0.3737477 -0.5757986 1.595363 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.002710613 -892.1592
## iter 10
## -0.9516835 -0.2857548 0.3737287 -0.5762628 1.593824 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.001538903 -892.1592
## iter 11
## -0.9512547 -0.2861404 0.3738916 -0.5766751 1.59319 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.0006343814 -892.1592
## iter 12
## -0.9510709 -0.28605 0.3738501 -0.5771132 1.592726 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.0004642512 -892.1592
## iter 13
## -0.9507453 -0.2864516 0.3742083 -0.5770331 1.592106 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.00061946 -892.1592
## iter 14
## -0.950547 -0.2867738 0.37427 -0.5773241 1.591798 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.0003222188 -892.1592
## iter 15
## -0.9504101 -0.2866875 0.3742244 -0.5773307 1.591723 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0.0001368686 -892.1592
## iter 16
## -0.9504101 -0.2866875 0.3742244 -0.5773307 1.591723 1.544026 -0.372099 -0.264938 0.8640869 1.856701 0 -892.1592
We also consider alternative methods : a direct optimization and we
use the R package pscl
We will compare the performances of
the methods by comparing the final Log Likelihoods.
# Direct optimization
opt <- optim(unlist(thetaInit), f=LogLikZIP, Y=Abundance, X=Covariates, control=list(fnscale=-1))
opt$alpha <- opt$par[1:ncol(Covariates)]; opt$beta <- opt$par[-(1:ncol(Covariates))]
opt$theta <- list(alpha=opt$alpha, beta=opt$beta)
# Package
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
fit <- zeroinfl(Abundance ~ -1 + Covariates, dist="poisson")
fit$theta <- list(alpha=fit$coefficients$zero, beta=fit$coefficients$count)
LL <- c(LogLikZIP(unlist(thetaInit), X=Covariates, Y=Abundance),
LogLikZIP(unlist(opt$theta), X=Covariates, Y=Abundance),
LogLikZIP(unlist(fit$theta), X=Covariates, Y=Abundance),
LogLikZIP(unlist(em$theta), X=Covariates, Y=Abundance))
names(LL) = c('Init','Optim','pscl',"EM")
print(LL)
## Init Optim pscl EM
## -988.3747 -894.6347 -975.5075 -892.1592
The EM leads to a higher likelihood value. We keep the required quantities.
regZIP <- em
regZIP$linPredPresence <- Covariates%*%regZIP$theta$alpha
regZIP$linPredAbundance <- Covariates%*%regZIP$theta$beta
regZIP$tau <- regZIP$tau[, 2]
We now compare the estimates with the 3 models (ZIP, Logistic and Poisson)
regCoef <- matrix(NA,3,p*2)
rownames(regCoef) <- c('Logistic', 'Poisson', 'ZIP')
colnames(regCoef) <- c(names(regP$coefficients), names(regL$coefficients))
regCoef[1,1:p] <- round(regL$alpha,3)
regCoef[2,p+(1:p)] <- round(regP$beta,3)
regCoef[3,] <- round(as.vector(c(regZIP$theta$alpha, regZIP$theta$beta)),3)
kable(regCoef)
(Intercept) | Latitude | Longitude | Depth | Temperature | (Intercept) | Latitude | Longitude | Depth | Temperature | |
---|---|---|---|---|---|---|---|---|---|---|
Logistic | -1.275 | -0.251 | 0.301 | -0.387 | 1.994 | NA | NA | NA | NA | NA |
Poisson | NA | NA | NA | NA | NA | 0.010 | -0.721 | -0.043 | 0.917 | 2.479 |
ZIP | -0.950 | -0.287 | 0.374 | -0.577 | 1.592 | 1.544 | -0.372 | -0.265 | 0.864 | 1.857 |
This table gives the MLE of the regression coefficients for the Poisson regression , the logistic regression and the ZIP regression models. We observe that the regression coefficients for both the presence probability (\(\alpha\)) and the abundance (\(\beta\)) are different when dealing with both aspect separately (i.e logistic regression or Poisson regression) or jointly (ZIP regression).
As the covariates have been centered, one may focus on the intercepts, which control the presence probability and the abundance, respectively, in a ‘mean’ site. The ZIP regression yields a higher mean presence probability than the logistic, because it accounts for the fact that the species can be present, when it is actually not observed.
As for the Poisson part (which deals with the mean abundance), the Poisson regression yields a smaller mean abundance, as it needs to accommodate for the numerous zeros in the data set, whereas the abundance part of the ZIP regression only deals with case where the species is actually present.
We first plot the estimated probability \(\widehat{\pi}_i^{ZIP}\) of presence in each station, as a function of the linear predictor \(x_i^\top \widehat{\alpha}^{ZIP}\). The blue crosses indicate the probability of presence according to the logistic regression \(\pi_i^{logistic}\): we see that the two models yield similar probabilities. Still, the binary part of the ZIP model (encoded in \(\pi_i^{ZIP}\)) does not contain all information, regarding the prediction of the actual presence of the species in a given site: the abundance part must also be accounted for.
xGrid <- seq(-4, 4, by=.001)
orderRegL <- order(regL$linPred)
plot(regL$linPred, Presence, xlab='Logistic regression', ylab='presence probability',
pch=20, cex=1.5, cex.axis=1.5, cex.lab=1.5, col=2-Presence);
lines(xGrid, plogis(xGrid), col=4, lwd=2);
# ZIP
orderZIP <- order(regZIP$linPredPresence)
orderZIP0 <- order(regZIP$linPredPresence[which(!Presence)])
plot(regZIP$linPredPresence, Presence, xlab='ZIP regression', ylab='presence probability',
pch=20, cex=1.5, cex.axis=1.5, cex.lab=1.5, col=2-Presence);
lines(xGrid, plogis(xGrid), lwd=2);
points(regZIP$linPredPresence[orderZIP],regL$fitted.values[orderZIP], pch='+', col=4, cex=1.5)
# comparison
plot(regZIP$tau, regL$fitted.values,
xlab='ZIP regression', ylab='Logistic regression',
pch=20, col=1+(Abundance==0), cex=1.5, cex.axis=1.5, cex.lab=1.5)
abline(v=.5, h=.5, col=4, lty=2, lwd=2)