Disease mapping with log-Gaussian Cox processes

In this document we will perfom disease mapping with log-Gaussian Cox processes. A great bottleneck on fitting such models is that in most cases we have not point data available. Thus for this particular example I simulated data from the population of the canton of Zurich mimicking the childhood Type 1 Diabetes (T1D). The data file available are the population density of children 0-15 years old in the canton of Zurich and 1503 simulated childhood T1D cases. I also provide you with the shapefile of the canton of Zurich.

Install and load packages

This practical requires the following packages to be installed and attached: sf, spdep, tmap,ggplot2, patchwork, kableExtra and INLA.

  • To install the entire suite of packages, we can use:
install.packages(c("sf", "spdep", "ggplot2", "patchwork"), dependencies = TRUE, 
                 repos = "http://cran.r-project.org")
  • For INLA, you can use:
install.packages("INLA",repos=c(getOption("repos"),
                                INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
  • Then, load the needed packages:
library(sf)
library(raster)
library(tmap)
library(kableExtra) 
library(INLA)
library(maptools)
library(tidyverse)

The truth

Please notice that for this example we are using simulated data. The advantage of this, is that we know the truth and we can assess how well this truth is retrieved by the LGCP.

  • Load the datafiles
load("caseproces.RData")
load("poppointproc.RData")
load("ZHshapefile.RData")

prjstr <- CRS("+proj=somerc +lat_0=46.95240555555556 +lon_0=7.439583333333333 +k_0=1 +x_0=600000 +y_0=200000 +ellps=bessel +towgs84=674.374,15.056,405.346,0,0,0,0 +units=m +no_defs")

For this particular example I assumed a 3-fold risk increase inside some arbitrary circles of 5km radii with centroids in the municipality of Zurich, Wintertur and Gossau.

centroid.x <- c(681689, 696839, 700017)
centroid.y <- c(247620, 261850, 240222)

centroids <- data.frame(centroid.x = centroid.x, centroid.y = centroid.y)

radius <- 5000

n = 1000
pts = seq(0, 2 * pi, length.out = n)
ZH = cbind(centroid.x[1] + radius * sin(pts), centroid.y[1] + radius * cos(pts))
WH <- cbind(centroid.x[2] + radius * sin(pts), centroid.y[2] + radius * cos(pts))
GL <- cbind(centroid.x[3] + radius * sin(pts), centroid.y[3] + radius * cos(pts))
sl = SpatialPolygons(list(Polygons(list(Polygon(ZH)), 1), Polygons(list(Polygon(WH)), 2), 
                          Polygons(list(Polygon(GL)), 3))) 

crs(sl) <- prjstr

mun2017 <- st_as_sf(mun2017)
sl <- st_as_sf(sl)

ggplot() + geom_sf(data = mun2017, fill = NA, col = "grey55") + 
  geom_sf(data = sl, fill = NA, col = "red", size = 1) + 
  geom_point(data = caseproces, aes(x = x, y = y), size = .7) +
  theme_bw() + theme(
    axis.text = element_blank(),
    axis.title = element_blank()
  )

Prepare data

  • We will first create a mesh.
cantonZH <- st_union(mun2017)

mesh <- inla.mesh.create.helper(boundary = as(cantonZH, "Spatial"), 
                                cutoff=500, max.edge=c(1400, 10000), offset=c(.1,10000))

par(mar= c(0,0,.7,0))
plot(mesh)

You can play with the cutoff, max.edge and offset. What do you see? For good introduction to mesh please refer to https://folk.ntnu.no/fuglstad/Lund2016/Session6/spde-tutorial.pdf chapter 2.

Population density

Fir the population density, I will use the Voronoi tesselation linked with the mesh I created and calculate the number of points per polygon divided by the corresponding area.

To do this I will use a function (inla.mesh.dual()) by Elias that creates the shapefile from a mesh object in INLA.

# load the function
source("VoronoiFun.R")
VoronoiTes <- inla.mesh.dual(mesh)
crs(VoronoiTes) <- prjstr

par(mar= c(0,0,0,0))
plot(VoronoiTes)

and now I will aggregate the population per Voronoi to create the population density:

temp.spp <- SpatialPoints(coordinates(VoronoiTes), prjstr)
ov_1 <- over(temp.spp, as(cantonZH, "Spatial"))

newdata <- data.frame(ID = 1:length(VoronoiTes@polygons))
row.names(newdata) <- newdata$ID
VoronoiTes <- SpatialPolygonsDataFrame(VoronoiTes, newdata)

VoronoiTes$InOut <- "In"
VoronoiTes$InOut[is.na(ov_1)] <- "Out"

# Now count population per polygon
 
spat.points.temp <- SpatialPoints(poppointproc, proj4string = prjstr)
over.tes <- over(spat.points.temp, VoronoiTes)
over.tes$x <-  spat.points.temp@coords[,1]
over.tes$y <-  spat.points.temp@coords[,2]

# add 1 as counts to be aggregated
  
over.tes$count <- 1
temp.ag <- aggregate(over.tes$count, by = list(ID = over.tes$ID), sum)
  
temp.merge <- merge(VoronoiTes, temp.ag, by.x= "ID", by.y = "ID")
# everything flagged as out stays NA, everything in becomes 0
temp.merge$x[temp.merge$InOut %in% "In" & is.na(temp.merge$x)] <- 0

# and divide with the area to calculate the population density
temp.merge$dens <- temp.merge$x/unlist(lapply(temp.merge@polygons, function(X) slot(X, "area")))

# and plot

temp.merge <- st_as_sf(temp.merge)
ggplot() + geom_sf(data = temp.merge, aes(fill = dens), col = NA) + scale_fill_viridis_c() + theme_bw()

Fitting an LGCP

In this section, we will fit LGCPs with R-INLA. First we need to define priors for the latent field. The latent field depends on a correlation function, which is a function of distance between the points. For the SPDE approach the correlation function is the Matern which is a very flexible class of correlation functions (Lindgren, Rue, and Lindström 2011). The Matern correlation function takes as parameters a smoothness parameter (\(\nu\)) which is fixed by the investigators, a range parameter which controls the correlation (\(\phi\)) and a variance parameter which defines the variance around 0 for the latent field (\(\tau\)). \[ \Xi|\lambda(s) \sim\text{Poisson}(\int_{\mathcal{W}}\lambda(s)ds) \\[5pt] \log(\lambda(s) ) = log(\lambda_0(s)) + \beta_0 + u(s) \\[5pt] u(s) \sim \text{GF}(0, \mathbf{\Sigma}(h, \tau, \phi))\\[5pt] \kappa(h) = \tau^2\rho_{\nu}(h/\phi), \rho_{\nu}(\cdot) \text{ Matern} \]

  • To define spde we use the R-INLA function inla.spde2.pcmatern(). For the prior selection we used the PC-priors (Simpson et al. 2017).
spde <- inla.spde2.pcmatern(mesh=mesh, alpha=2, 
                            prior.range=c(60000,0.5), # Pr(phi < 60000) = 0.5
                            prior.sigma=c(1,0.01) # Pr(sigma > 1) = 0.01
)
  • Some preparatory steps

  • The case vector. It should be NA on the mesh nodes outside the domain, 0 on the mesh nodes inside the domain, and 1 for the case locations

temp.y <- rep(NA, times = (mesh$n))
temp.y[(VoronoiTes$InOut %in% "In")] <- 0
y.pp <- c(temp.y, rep(1, times = nrow(caseproces)))
  • The offset. This is basically the number of people per Voronoi.
e.pp <- c(diag(spde$param.inla$M0), rep(0,nrow(caseproces)))
ForOffset <- c(temp.merge$dens, rep(0, times = nrow(caseproces)))
ForOffset <- e.pp*ForOffset
  • The projector matrix, required when the locations are not vertices.
lmat <- inla.spde.make.A(mesh = mesh, loc = as.matrix(caseproces))
imat <- Diagonal(mesh$n, rep(1, mesh$n))

A.pp <- rBind(imat, lmat)
  • The stack.
stk.pp <- inla.stack(data=list(y=y.pp, e=ForOffset), A=list(A.pp, 1), 
                     tag='pp', effects=c(list(i=1:mesh$n), list(intercept=rep(1, length(y.pp)))))
  • Run the inla call
lgcp_mod <- inla(y ~ -1 + intercept + f(i, model = spde), 
               family='poisson', data = inla.stack.data(stk.pp),
               control.predictor = list(A = inla.stack.A(stk.pp), link = 1),
               E = inla.stack.data(stk.pp)$e, verbose=F, 
               control.compute=list(dic=TRUE,cpo=TRUE, config = TRUE), 
               control.inla=list(strategy="simplified.laplace", 
                                 int.strategy="eb"))

Let’s check the results. Haakon has written a nice function to visualize the results:

Extract the results

To extract the field, I will grid a grid over the domain and project the results of the field on the grid specification.Please note that the estimates of the model at this point is for the mesh nodes, however using the SPDE theory we can project them in any resolution and thus in theory we have a continuous surface. To do so we need the functions inla.mesh.projector() and inla.mesh.project(). The first one defines where to project the results, whereas in the second one you give the object to be projected. Let’s say that due to data confidentiality considerations we are allowed to present results at an \(500\times500\) m resolution.

grdpts <- makegrid(as(cantonZH, "Spatial"), cellsize = c(500,500))
spgrd <- SpatialPoints(grdpts, proj4string = prjstr)
spgrdWithin <- SpatialPixels(spgrd[as(cantonZH, "Spatial"),])

# and project on the centroids
pgrid0 <- inla.mesh.projector(mesh, loc = coordinates(spgrdWithin))
# the median random effect
prd0.m <- inla.mesh.project(pgrid0, lgcp_mod$summary.random$i$`0.5quant`)

# covert to polygon
pol <- as(spgrdWithin, "SpatialPolygons")
pol_df <- SpatialPolygonsDataFrame(pol, data.frame(median.ranef = prd0.m, row.names = names(pol)))

pol_df <- st_as_sf(pol_df)

ggplot() + geom_sf(data = pol_df, aes(fill = median.ranef), col = NA) + theme_bw() + 
  scale_fill_viridis_c()

Of course this means that you can project it in lower resolutions if needed.

  • We can also calculate exceedance probabilities. Recall you can calculate any exceedance probability that is on interest of you. For here I select \(\Pr(\text{risk} > 0.0075)\), where \(1503/199979 = 0.0075\) the overall risk.
marfitted <- lgcp_mod$marginals.fitted.values[1:mesh$n]
threshold <- 0.0075
exceed.prob <- lapply(X= marfitted, FUN = function(x) inla.pmarginal(marginal = x, threshold))
exceed.prob <- 1 - unlist(exceed.prob)

# and we can also project this on the 1km grid as before

exceed.prob <- inla.mesh.project(pgrid0, exceed.prob)
pol_df$exceed.prob <- exceed.prob

ggplot() + geom_sf(data = pol_df, aes(fill = exceed.prob), col = NA) + theme_bw() + 
  scale_fill_viridis_c()

and we could, as in the BYM case, create a rule of which areas might be considered as of higher risk. Let’s say that the grid cells that have an \(\Pr(\text{risk} > 0.0075)>0.95\) are high-risk areas. Then we have (I will change the colours again to make it visually clearer):

temp.ex <-  pol_df[pol_df$exceed.prob >=0.95,]
temp.ex <- st_union(temp.ex)

ggplot() + geom_sf(data = pol_df, aes(fill = exceed.prob), col = NA) + theme_bw() + 
  scale_fill_viridis_c() +  geom_sf(data = sl, fill = NA, col = "red", size = 1) + 
  geom_sf(data = temp.ex, fill = NA, col = "blue", size = 1)

where the red areas is the true areas of higher risk and the blue ones are the ones that the model suggested as high-risk.

As the last part of this tutorial we will assess the hyperparameters:

par(mfrow = c(1,2), mai = c(.8,.8,.2,.1))
prior.median.range <- 60000
tmp = inla.tmarginal(function(x) 1/x, lgcp_mod$marginals.hyperpar$`Range for i`) 
lambda = -log(.5)/(1/prior.median.range)
plot(tmp, type = "l", xlab = "inverse range nominal", ylab = "Density", xlim = c(0,0.0002), 
     ylim = c(0, lambda))
xvals = seq(0, 0.0004, length.out=1000)
lines(xvals, lambda*exp(-lambda*xvals), col = "blue")
grid()

prior.median.sd <- 1
tmp = inla.tmarginal(function(x) x, lgcp_mod$marginals.hyperpar$`Stdev for i`) 
plot(tmp, type = "l", xlab = expression(sigma[u]), ylab = "Density", xlim = c(0, 1))
xvals = seq(0, 1, length.out=1000)
lambda = -log(.01)/prior.median.sd; lines(xvals, lambda*exp(-lambda*xvals), col = "blue")
grid()

References

Lindgren, Finn, Håvard Rue, and Johan Lindström. 2011. “An Explicit Link Between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (4): 423–98.

Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, Sigrunn H Sørbye, and others. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science 32 (1): 1–28.