In this document we will perform disease mapping with log-Gaussian Cox processes (LGCPs). A major bottleneck in fitting such models is that in most cases point data are not available. For this particular example I therefore simulated data from the population of the canton of Zurich, mimicking childhood Type 1 Diabetes (T1D). The data files available are the population density of children 0-15 years old in the canton of Zurich and 1503 simulated childhood T1D cases. I also provide you with the shapefile of the canton of Zurich.
This practical requires the following packages to be installed and attached: sf, spdep, tmap, ggplot2, patchwork, kableExtra and INLA.
install.packages(c("sf", "spdep", "ggplot2", "patchwork"), dependencies = TRUE,
                 repos = "http://cran.r-project.org")
For INLA, you can use:
install.packages("INLA",
                 repos = c(getOption("repos"),
                           INLA = "https://inla.r-inla-download.org/R/stable"),
                 dep = TRUE)
library(sf)
library(raster)
library(tmap)
library(kableExtra)
library(INLA)
library(maptools)
library(tidyverse)
Please notice that for this example we are using simulated data. The advantage of this is that we know the truth and can assess how well that truth is retrieved by the LGCP.
load("caseproces.RData")
load("poppointproc.RData")
load("ZHshapefile.RData")
# the Swiss projection CH1903 / LV03 (EPSG:21781)
prjstr <- CRS("+proj=somerc +lat_0=46.95240555555556 +lon_0=7.439583333333333 +k_0=1 +x_0=600000 +y_0=200000 +ellps=bessel +towgs84=674.374,15.056,405.346,0,0,0,0 +units=m +no_defs")
For this particular example I assumed a 3-fold risk increase inside some arbitrary circles of 5 km radius with centroids in the municipalities of Zurich, Winterthur and Gossau.
centroid.x <- c(681689, 696839, 700017)
centroid.y <- c(247620, 261850, 240222)
centroids <- data.frame(centroid.x = centroid.x, centroid.y = centroid.y)
radius <- 5000  # 5 km radius
n <- 1000       # number of vertices used to draw each circle
pts <- seq(0, 2 * pi, length.out = n)
# circle outlines around Zurich, Winterthur and Gossau
ZH <- cbind(centroid.x[1] + radius * sin(pts), centroid.y[1] + radius * cos(pts))
WH <- cbind(centroid.x[2] + radius * sin(pts), centroid.y[2] + radius * cos(pts))
GL <- cbind(centroid.x[3] + radius * sin(pts), centroid.y[3] + radius * cos(pts))
sl <- SpatialPolygons(list(Polygons(list(Polygon(ZH)), 1),
                           Polygons(list(Polygon(WH)), 2),
                           Polygons(list(Polygon(GL)), 3)))
crs(sl) <- prjstr
mun2017 <- st_as_sf(mun2017)
sl <- st_as_sf(sl)
ggplot() + geom_sf(data = mun2017, fill = NA, col = "grey55") +
geom_sf(data = sl, fill = NA, col = "red", size = 1) +
geom_point(data = caseproces, aes(x = x, y = y), size = .7) +
theme_bw() + theme(
axis.text = element_blank(),
axis.title = element_blank()
)
cantonZH <- st_union(mun2017)
mesh <- inla.mesh.create.helper(boundary = as(cantonZH, "Spatial"),
cutoff=500, max.edge=c(1400, 10000), offset=c(.1,10000))
par(mar= c(0,0,.7,0))
plot(mesh)
You can play with the cutoff, max.edge and offset arguments. What do you see? For a good introduction to mesh construction, please refer to https://folk.ntnu.no/fuglstad/Lund2016/Session6/spde-tutorial.pdf, chapter 2.
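As a quick illustration (a sketch; the alternative parameter values are arbitrary), a larger cutoff and max.edge yield a much coarser mesh with fewer nodes:
mesh.coarse <- inla.mesh.create.helper(boundary = as(cantonZH, "Spatial"),
                                       cutoff = 2000, max.edge = c(5000, 20000),
                                       offset = c(.1, 10000))
c(mesh$n, mesh.coarse$n) # compare the number of nodes
plot(mesh.coarse)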
For the population density, I will use the Voronoi tessellation linked with the mesh I created and calculate the number of points per polygon divided by the corresponding area. To do this I will use a function (inla.mesh.dual()) by Elias that creates the shapefile from a mesh object in INLA.
# load the function
source("VoronoiFun.R")
VoronoiTes <- inla.mesh.dual(mesh)
crs(VoronoiTes) <- prjstr
par(mar= c(0,0,0,0))
plot(VoronoiTes)
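A quick check (and the property we rely on later, when the cell areas serve as integration weights): the dual tessellation has exactly one polygon per mesh node.
length(VoronoiTes@polygons) == mesh$n # should be TRUE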
and now I will aggregate the population per Voronoi polygon to create the population density:
# use the Voronoi cell centroids to flag cells lying inside/outside the canton
temp.spp <- SpatialPoints(coordinates(VoronoiTes), prjstr)
ov_1 <- over(temp.spp, as(cantonZH, "Spatial"))
newdata <- data.frame(ID = 1:length(VoronoiTes@polygons))
row.names(newdata) <- newdata$ID
VoronoiTes <- SpatialPolygonsDataFrame(VoronoiTes, newdata)
VoronoiTes$InOut <- "In"
VoronoiTes$InOut[is.na(ov_1)] <- "Out"
# Now count population per polygon
spat.points.temp <- SpatialPoints(poppointproc, proj4string = prjstr)
over.tes <- over(spat.points.temp, VoronoiTes)
over.tes$x <- spat.points.temp@coords[,1]
over.tes$y <- spat.points.temp@coords[,2]
# add 1 as counts to be aggregated
over.tes$count <- 1
temp.ag <- aggregate(over.tes$count, by = list(ID = over.tes$ID), sum)
temp.merge <- merge(VoronoiTes, temp.ag, by.x= "ID", by.y = "ID")
# everything flagged as out stays NA, everything in becomes 0
temp.merge$x[temp.merge$InOut %in% "In" & is.na(temp.merge$x)] <- 0
# and divide with the area to calculate the population density
temp.merge$dens <- temp.merge$x/unlist(lapply(temp.merge@polygons, function(X) slot(X, "area")))
# and plot
temp.merge <- st_as_sf(temp.merge)
ggplot() + geom_sf(data = temp.merge, aes(fill = dens), col = NA) + scale_fill_viridis_c() + theme_bw()
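Before moving on, a small sanity check (a sketch, assuming the objects built above): the counts aggregated over the Voronoi cells should add up to the number of simulated population points.
sum(temp.merge$x, na.rm = TRUE) # total children counted across the cells
nrow(poppointproc)              # total simulated population points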
In this section, we will fit LGCPs with R-INLA. First we need to define priors for the latent field. The latent field depends on a correlation function, which is a function of the distance between points. For the SPDE approach the correlation function is the Matérn, a very flexible class of correlation functions (Lindgren, Rue, and Lindström 2011). The Matérn correlation function takes a smoothness parameter (\(\nu\)), which is fixed by the investigators, a range parameter (\(\phi\)), which controls the correlation, and a variance parameter (\(\tau\)), which defines the variance around 0 for the latent field. \[ \Xi|\lambda(s) \sim \text{Poisson}\Big(\int_{\mathcal{W}}\lambda(s)\,ds\Big) \\[5pt] \log(\lambda(s)) = \log(\lambda_0(s)) + \beta_0 + u(s) \\[5pt] u(s) \sim \text{GF}(0, \mathbf{\Sigma}(h, \tau, \phi))\\[5pt] \kappa(h) = \tau^2\rho_{\nu}(h/\phi), \quad \rho_{\nu}(\cdot) \text{ Matérn} \]
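The PC priors used below are specified through tail probabilities on interpretable parameters, here the range (in metres) and the marginal standard deviation of the latent field: \[ \Pr(\phi < 60\,000) = 0.5, \qquad \Pr(\sigma > 1) = 0.01 \]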
In R-INLA this model is defined with inla.spde2.pcmatern(). For the prior selection we used the PC priors (Simpson et al. 2017):
spde <- inla.spde2.pcmatern(mesh = mesh, alpha = 2,
                            prior.range = c(60000, 0.5), # Pr(phi < 60000) = 0.5
                            prior.sigma = c(1, 0.01)     # Pr(sigma > 1) = 0.01
)
Some preparatory steps follow. First, the case vector: it should be NA at mesh nodes outside the domain, 0 at mesh nodes inside the domain, and 1 at the case locations.
temp.y <- rep(NA, times = mesh$n)
temp.y[(VoronoiTes$InOut %in% "In")] <- 0
y.pp <- c(temp.y, rep(1, times = nrow(caseproces)))
# integration weights: the dual-mesh (Voronoi) cell areas at the mesh nodes, 0 at the cases
e.pp <- c(diag(spde$param.inla$M0), rep(0, nrow(caseproces)))
# multiply the weights by the population density, which acts as an offset
ForOffset <- c(temp.merge$dens, rep(0, times = nrow(caseproces)))
ForOffset <- e.pp*ForOffset
# projector: identity at the mesh nodes, the usual A-matrix at the case locations
lmat <- inla.spde.make.A(mesh = mesh, loc = as.matrix(caseproces))
imat <- Diagonal(mesh$n, rep(1, mesh$n))
A.pp <- rbind(imat, lmat) # rBind() is defunct in recent Matrix versions; rbind() works
stk.pp <- inla.stack(data=list(y=y.pp, e=ForOffset), A=list(A.pp, 1),
tag='pp', effects=c(list(i=1:mesh$n), list(intercept=rep(1, length(y.pp)))))
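A quick dimension check (a sketch): the stacked response has one entry per mesh node plus one per case, and A.pp maps all of them onto the mesh nodes.
length(y.pp) # mesh$n + nrow(caseproces)
dim(A.pp)    # (mesh$n + nrow(caseproces)) x mesh$n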
lgcp_mod <- inla(y ~ -1 + intercept + f(i, model = spde),
family='poisson', data = inla.stack.data(stk.pp),
control.predictor = list(A = inla.stack.A(stk.pp), link = 1),
E = inla.stack.data(stk.pp)$e, verbose=F,
control.compute=list(dic=TRUE,cpo=TRUE, config = TRUE),
control.inla=list(strategy="simplified.laplace",
int.strategy="eb"))
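Let’s first take a quick look at the fitted object (a sketch using standard INLA accessors):
summary(lgcp_mod)
lgcp_mod$summary.fixed    # posterior summary of the intercept
lgcp_mod$summary.hyperpar # range and standard deviation of the latent field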
To extract the field, I will lay a grid over the domain and project the estimates of the field onto it. Please note that at this point the model estimates refer to the mesh nodes; however, using the SPDE theory we can project them at any resolution, so in theory we have a continuous surface. To do so we need the functions inla.mesh.projector() and inla.mesh.project(). The first defines where to project the results, whereas the second takes the object to be projected. Let’s say that due to data confidentiality considerations we are allowed to present results at a \(500\times500\) m resolution.
grdpts <- makegrid(as(cantonZH, "Spatial"), cellsize = c(500,500))
spgrd <- SpatialPoints(grdpts, proj4string = prjstr)
spgrdWithin <- SpatialPixels(spgrd[as(cantonZH, "Spatial"),])
# and project on the centroids
pgrid0 <- inla.mesh.projector(mesh, loc = coordinates(spgrdWithin))
# the median random effect
prd0.m <- inla.mesh.project(pgrid0, lgcp_mod$summary.random$i$`0.5quant`)
# convert to polygons
pol <- as(spgrdWithin, "SpatialPolygons")
pol_df <- SpatialPolygonsDataFrame(pol, data.frame(median.ranef = prd0.m, row.names = names(pol)))
pol_df <- st_as_sf(pol_df)
ggplot() + geom_sf(data = pol_df, aes(fill = median.ranef), col = NA) + theme_bw() +
scale_fill_viridis_c()
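The same projector can be reused for any posterior summary of the field; for instance, a sketch projecting its posterior standard deviation:
prd0.sd <- inla.mesh.project(pgrid0, lgcp_mod$summary.random$i$sd)
pol_df$sd.ranef <- prd0.sd
ggplot() + geom_sf(data = pol_df, aes(fill = sd.ranef), col = NA) + theme_bw() +
scale_fill_viridis_c()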
Of course this means that you can also project it at coarser resolutions if needed. Another useful summary is the exceedance probability: the posterior probability that the fitted intensity exceeds a given threshold.
marfitted <- lgcp_mod$marginals.fitted.values[1:mesh$n]
threshold <- 0.0075
exceed.prob <- lapply(X= marfitted, FUN = function(x) inla.pmarginal(marginal = x, threshold))
exceed.prob <- 1 - unlist(exceed.prob)
# and we can also project this on the 500 m grid as before
exceed.prob <- inla.mesh.project(pgrid0, exceed.prob)
pol_df$exceed.prob <- exceed.prob
ggplot() + geom_sf(data = pol_df, aes(fill = exceed.prob), col = NA) + theme_bw() +
scale_fill_viridis_c()
and we could, as in the BYM case, create a rule for which areas might be considered of higher risk. Let’s say that the grid cells with \(\Pr(\text{risk} > 0.0075)>0.95\) are high-risk areas. Then we have (I will change the colours again to make it visually clearer):
temp.ex <- pol_df[pol_df$exceed.prob >=0.95,]
temp.ex <- st_union(temp.ex)
ggplot() + geom_sf(data = pol_df, aes(fill = exceed.prob), col = NA) + theme_bw() +
scale_fill_viridis_c() + geom_sf(data = sl, fill = NA, col = "red", size = 1) +
geom_sf(data = temp.ex, fill = NA, col = "blue", size = 1)
where the red circles are the true areas of higher risk and the blue ones are those that the model suggested as high-risk.
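Since we know the truth here, we can also quantify the agreement; a rough sketch using sf geometry operations (the overlap share is just one possible measure):
# share of the flagged (blue) area that falls inside the true (red) circles
true.zones <- st_union(sl)
flagged.in.true <- st_intersection(temp.ex, true.zones)
as.numeric(st_area(flagged.in.true)) / as.numeric(st_area(temp.ex))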
As the last part of this tutorial we will assess the hyperparameters:
par(mfrow = c(1,2), mai = c(.8,.8,.2,.1))
# posterior of the inverse range vs. its exponential PC prior (blue line)
prior.median.range <- 60000
tmp = inla.tmarginal(function(x) 1/x, lgcp_mod$marginals.hyperpar$`Range for i`)
lambda = -log(.5)/(1/prior.median.range)
plot(tmp, type = "l", xlab = "inverse range nominal", ylab = "Density", xlim = c(0,0.0002),
ylim = c(0, lambda))
xvals = seq(0, 0.0004, length.out=1000)
lines(xvals, lambda*exp(-lambda*xvals), col = "blue")
grid()
# posterior of the field standard deviation vs. its exponential PC prior (blue line)
prior.median.sd <- 1
tmp = inla.tmarginal(function(x) x, lgcp_mod$marginals.hyperpar$`Stdev for i`)
plot(tmp, type = "l", xlab = expression(sigma[u]), ylab = "Density", xlim = c(0, 1))
xvals = seq(0, 1, length.out=1000)
lambda = -log(.01)/prior.median.sd; lines(xvals, lambda*exp(-lambda*xvals), col = "blue")
grid()
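The corresponding numeric summaries can be read directly from the standard INLA output:
lgcp_mod$summary.hyperpar[, c("0.5quant", "0.025quant", "0.975quant")] # medians and 95% credible intervals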
Lindgren, Finn, Håvard Rue, and Johan Lindström. 2011. “An Explicit Link Between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (4): 423–98.
Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, Sigrunn H Sørbye, and others. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science 32 (1): 1–28.