In this document we will perform ecological regression using R-INLA
(Rue, Martino, and Chopin 2009). We will BYM2 (Riebler et al. 2016), a reparametrization of (Besag, York, and Mollié 1991) to stroke mortality in Sheffield examining the effect of NO\(_x\) after adjusting for deprivation. The dataset includes information about stroke mortality in Sheffield (counts of deaths), population at risk and relevant covariates (NOx concentration, and Townsend index, which is a deprivation index). In the folder there is also a shapefile with the enumeration districts for the city of Sheffield. You can download the data from github.
This practical requires the following packages to be installed and attached: sf
, spdep
, dplyr
,ggplot2
, patchwork
, kableExtra
and INLA
.
install.packages(c("sf", "spdep", "ggplot2", "patchwork"), dependencies = TRUE,
repos = "http://cran.r-project.org")
INLA
, you can useinstall.packages("INLA",repos=c(getOption("repos"),
INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(sf)
library(spdep)
library(dplyr)
library(ggplot2)
library(patchwork)
library(kableExtra)
library(INLA)
stroke <- read.csv("Stroke.csv")
stroke$index <- 1:nrow(stroke)
head(stroke)
## SP_ID stroke_exp pop y Townsend.class NOx.class Townsend NOx
## 1 05CGFA01 2.620953 931.65 3 [-6.54,-2.07] (41.4,53.6] 1 2
## 2 05CGFA02 1.989744 710.36 1 (-2.07,0.086] (82.1,243] 2 5
## 3 05CGFA03 2.286056 654.12 3 (0.086,3.11] (82.1,243] 3 5
## 4 05CGFA04 2.151746 1109.37 0 (-2.07,0.086] (82.1,243] 2 5
## 5 05CGFA05 4.380834 1159.31 1 (-2.07,0.086] (53.6,64.8] 2 3
## 6 05CGFA06 3.094954 1020.21 5 [-6.54,-2.07] (53.6,64.8] 1 3
## Offset index
## 1 -2.549570 1
## 2 -2.551463 2
## 3 -2.455050 3
## 4 -2.711442 4
## 5 -2.420999 5
## 6 -2.516716 6
ggplot() + geom_point(data = stroke, aes(x = index, y = stroke_exp), size = 0.5) +
theme_bw() + ylab("expected")
table(stroke$Townsend.class)
##
## (-2.07,0.086] (0.086,3.11] (3.11,5.44] (5.44,9.2] [-6.54,-2.07]
## 206 207 204 207 206
table(stroke$NOx.class)
##
## (41.4,53.6] (53.6,64.8] (64.8,82.1] (82.1,243] [7.8,41.4]
## 207 205 206 206 206
The dataset has 9 columns. The SP_ID gives us information about the ID of the enumeration districts, the columns stroke_exp and pop give infomation about the expected number of deaths and the population at risk at the i-th spatial unit. Information on counts of stroke are given by the variable y. Covariate information (in quintiles) is given in the columns Townsend.class (deprivation) and NOx.class (air pollutant used to capture mainly traffic related air pollution). Townsend and NOx give the categories of the class of this covariates and the offset is defined as \(\text{offset}_i = \text{logit}_i(E_i/\text{Pop}_i)\).
sheffield.gen <- read_sf("Sheffield.shp")
*and plot it
ggplot() + geom_sf(data = sheffield.gen, fill = NA) + theme_bw()
The research question of this particular problem is to examine the spatial distribution of deaths by stroke in Sheffield and examine to which extent this distribution can be explained by variables as traffic-related air pollution and deprivation. We are also interested in examining the ecological association between these covariates and stroke mortality. To answer these questions we will perform disease mapping and ecological regression with INLA. We will use the binomial distribution rather the Poisson one for this example, since stroke is not a rare outcome:
\[y_i \sim \text{Binomial}(p_i, n_i)\\ \text{logit}(p_i) = \beta_0 + u_i + v_i\\ u_{i}|\mathbf{u}_{-i} \sim \mathcal{N}\Big(\frac{\sum_{j=1}^Nw_{ij}u_j}{\sum_{j = 1}^{N}w_{ij}}, \frac{1}{\tau_1\sum_{j=1}^{N}w_{ij}}\Big) \\[5pt] v_i \sim \mathcal{N}(0, \tau_2^{-1}) \\\]
as before we considered the reparametrization of Simpson (Riebler et al. 2016), however by typing model = “bym” you fit exactly the above model.
# define the neighbour structure
W.nb <- poly2nb(sheffield.gen)
nb2INLA("W.adj", W.nb)
# match
stroke <- stroke[match(sheffield.gen$SP_ID,stroke$SP_ID),]
#Check for the first 10 areas
stroke$SP_ID[1:10]
## [1] "05CGGD02" "05CGFT44" "05CGGF35" "05CGGD29" "05CGFH02" "05CGGD31"
## [7] "05CGFM08" "05CGGD22" "05CGGD17" "05CGGD16"
sheffield.gen$SP_ID[1:10]
## [1] "05CGGD02" "05CGFT44" "05CGGF35" "05CGGD29" "05CGFH02" "05CGGD31"
## [7] "05CGFM08" "05CGGD22" "05CGGD17" "05CGGD16"
# create an ID per spatial unit
stroke$ID<- seq(1,1030)
formula <- y ~ 1 + f(ID, model="bym2", graph="W.adj", scale.model = TRUE, constr = TRUE,
# priors
hyper = list(theta1 = list("PCprior", c(1, 0.01)),
# Pr(sd<1) = 0.01, unlikely to have rr>3just based on the spatial confounding
theta2 = list("PCprior", c(0.5, 0.5)))
# Pr(phi<0.5)=0.5, we believe that the unmeasured spatial confounding
# is driven 50% from the strucutred and 50% from the unstructured random effect
)
strokes_DM <- inla(formula, data = stroke, family="binomial", offset=Offset, Ntrials=pop,
control.compute = list(dic = TRUE, waic = TRUE), verbose = F)
strokes_DM$summary.hyperpar
## mean sd 0.025quant 0.5quant 0.975quant
## Precision for ID 2.43832343 0.19061808 2.087736949 2.43015297 2.8358812
## Phi for ID 0.04684131 0.04132873 0.003290217 0.03516711 0.1559025
## mode
## Precision for ID 2.413137841
## Phi for ID 0.009054035
sd_mar <- as.data.frame(inla.tmarginal(function(x) exp(-1/2*x),
strokes_DM$internal.marginals.hyperpar$`Log precision for ID`))
head(sd_mar)
## x y
## 1 0.5635654 0.0792093
## 2 0.5680135 0.1486057
## 3 0.5707516 0.2140805
## 4 0.5727700 0.2758374
## 5 0.5743826 0.3361945
## 6 0.5757298 0.3956633
ggplot() + geom_line(data = sd_mar, aes(x = x, y = y)) + theme_bw() +
ggtitle("Posterior of sd of the spatial field") -> p1
ggplot() + geom_line(data = as.data.frame(strokes_DM$marginals.hyperpar$`Phi for ID`),
aes(x = x, y = y)) + theme_bw() +
ggtitle("Posterior of sd of the mixing parameter") -> p2
p1|p2
There seems to be variation in the field captured by the BYM2 prior, nevertheless the mixing parameter is very small, implying that most variation is an attribute to the unstructured spatial components. In other words, the field is dominated by overdispersion rather than spatial autocorelation.
sheffield.gen$sp <- strokes_DM$summary.random$ID$`0.5quant`[1:1030]
ggplot() + geom_sf(data = sheffield.gen, aes(fill = sp), col = NA) + theme_bw() +
scale_fill_viridis_c()
threshold <- log(1)
exceed.prob <- lapply(X= strokes_DM$marginals.random$ID[1:1030], FUN = function(x) inla.pmarginal(marginal = x, threshold))
exceed.prob <- 1 - unlist(exceed.prob)
ggplot() + geom_boxplot(data=as.data.frame(exceed.prob), aes(y = exceed.prob)) +
geom_hline(yintercept = 0.95, col = "red") + theme_bw() + xlim(c(-0.6,0.6)) +
ylab("") +
theme(axis.text.x = element_blank()) -> p1
sheffield.gen$ex <- exceed.prob
temp.ex <- sheffield.gen[sheffield.gen$ex >=0.95,]
ggplot() + geom_sf(data = sheffield.gen, aes(fill = ex), col = NA) + scale_fill_viridis_c() +
geom_sf(data = temp.ex, col = "red", fill = NA) + theme_bw() -> p2
(p1|p2) + plot_annotation(title = "Posterior probability")
formula_2 <- y ~ 1 + f(ID, model="iid", constr = TRUE,
hyper = list(theta = list("PCprior", c(1, 0.01))))
strokes_DM_unstr <- inla(formula_2, data = stroke,
family="binomial", offset=Offset, Ntrials=pop,
control.compute = list(dic = TRUE, waic = TRUE), verbose = F)
strokes_DM$dic$dic; strokes_DM_unstr$dic$dic
## [1] 3947.695
## [1] 3947.14
It seems that the models without a spatial component performs better.
Now we will add the covariates of interest:
\[y_i \sim \text{Binomial}(p_i, n_i)\\ \text{logit}(p_i) = \beta_0 + \beta_1\text{Townsend} + \beta_2\text{NOx} + u_i + v_i\\ u_{i}|\mathbf{u}_{-i} \sim \mathcal{N}\Big(\frac{\sum_{j=1}^Nw_{ij}u_j}{\sum_{j = 1}^{N}w_{ij}}, \frac{1}{\tau_1\sum_{j=1}^{N}w_{ij}}\Big) \\[5pt] v_i \sim \mathcal{N}(0, \tau_2^{-1}) \\ \beta_1, \beta_2 \sim \mathcal{N}(0, 0.5)\]
stroke <- left_join(sheffield.gen, stroke, by = c("SP_ID" = "SP_ID"))
ggplot() + geom_sf(data = stroke, aes(fill = NOx), col = NA) + scale_fill_viridis_c() +
theme_bw() -> p1
ggplot() + geom_sf(data = stroke, aes(fill = Townsend), col = NA) + scale_fill_viridis_c() +
theme_bw() -> p2
p1|p2
What do you observe?
formula_eco <- y ~ 1 + Townsend + NOx +
f(ID, model="bym2", graph="W.adj", scale.model = TRUE, constr = TRUE,
# hyper priors
hyper = list(theta1 = list("PCprior", c(1, 0.01)),
theta2 = list("PCprior", c(0.5, 0.5)))
)
model_eco <- inla(formula_eco, family="binomial", data=stroke, offset=Offset, Ntrials=pop,
control.compute = list(dic = TRUE, waic = TRUE),
control.fixed=list(prec=list(mean=0, prec=0.5)))
tab <- exp(model_eco$summary.fixed[c("Townsend", "NOx"),c("0.5quant", "0.025quant", "0.975quant")])
tab <- round(tab, digits = 2)
kableExtra::kable(tab, caption = "Median and 95%CrI for the odd ratio for NOx and deprivation") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
0.5quant | 0.025quant | 0.975quant | |
---|---|---|---|
Townsend | 1.01 | 0.97 | 1.06 |
NOx | 1.08 | 1.03 | 1.13 |
ggplot() + geom_line(data = as.data.frame(model_eco$marginals.hyperpar$`Phi for ID`),
aes(x = x, y = y)) + theme_bw() +
ggtitle("Posterior of sd of the mixing parameter")
now almost the entire amount of variation is due to the ustructured random effect.
sheffield.gen$sp_eco <- model_eco$summary.random$ID$`0.5quant`[1:1030]
ggplot() + geom_sf(data = sheffield.gen, aes(fill = sp_eco), col = NA) + theme_bw() +
scale_fill_viridis_c()
not much of variation was explained by the covariates.
Besag, Julian, Jeremy York, and Annie Mollié. 1991. “Bayesian Image Restoration, with Two Applications in Spatial Statistics.” Annals of the Institute of Statistical Mathematics 43 (1): 1–20.
Riebler, Andrea, Sigrunn H Sørbye, Daniel Simpson, and Håvard Rue. 2016. “An Intuitive Bayesian Spatial Model for Disease Mapping That Accounts for Scaling.” Statistical Methods in Medical Research 25 (4): 1145–65.
Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92.