In this document we will perform ecological regression using R-INLA
(Rue, Martino, and Chopin 2009). We will BYM2 (Riebler et al. 2016), a reparametrization of (Besag, York, and Mollié 1991) to lung cancer mortality in England examining the effect of deprivation. The dataset includes information about the geographical units (or intermediate geographies or IG) in England as for example the name of the region, the code etc. It also includes the number of cases diagnosed with lung cancer per intermediate geographies (IG) in England in 2007, the type of cancer, a deprivation index, the expected number of cases in the corresponding IG and the logarithm of this expected number of cases. You can download the data from github.
This practical requires the following packages to be installed and attached: sf
, spdep
, tmap
,ggplot2
, patchwork
, kableExtra
and INLA
.
install.packages(c("sf", "spdep", "ggplot2", "patchwork"), dependencies = TRUE,
repos = "http://cran.r-project.org")
INLA
, you can use:install.packages("INLA",repos=c(getOption("repos"),
INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(sf)
library(spdep)
library(tmap)
library(ggplot2)
library(patchwork)
library(kableExtra)
library(INLA)
load("lung.RData")
class(lung)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
sf
object.lung <- st_as_sf(lung)
ggplot() + geom_sf(data = lung, fill = NA) + theme_bw()
ggplot() + geom_sf(data = lung, aes(fill = cases), col = NA) + scale_fill_viridis_c() + theme_bw()
This map does not tell us a lot about the spatial distribution of lung cancer incidence. It mainly reflects the population per IG. We can calculate the standardized incidence ratio (SIR) instead.
lung$SIR <- lung$cases/lung$expected
ggplot() + geom_sf(data = lung, aes(fill = SIR), col = NA) + scale_fill_viridis_c() + theme_bw()
Often the SIR or SMR is not the best estimator you can get because of its large variance. That means that we can often see extremes when the denominator is small. To mitigate this, in the Bayesian setting we smooth the estimators with a combination of local and global smoothing (through conditional autoregressive priors or CAR). The most popular one is the Besag-York-Molli{'e} prior (Besag, York, and Mollié 1991). In the next junk we will run 2 different models, one with an overdispersion parameter and one with a BYM prior. Here we considered a nice reparametrisation given by (Simpson et al. 2017). This reparametrisation overcomes issues as scaling and identifiability issues the common BYM confronts (Riebler et al. 2016, @freni2018). The mathematical formula is given bellow. Supposed \(Y_i\sim\text{Poisson}(\lambda_iE_i)\), where \(Y_i\) is the number of cases, \(E_i\) the expected number of cases and \(\lambda_i\) the SIR of the \(i\)-th IG. Then:
\[\text{log}Y_i = \beta_0 + \pmb{\beta}\mathbf{X} + \frac{1}{\sqrt{\tau}}(\sqrt{1-\phi}v_i + \sqrt{\phi}u_i^*)\] where \(v_i \sim \mathcal{N}(0, 1)\) and \(u_i^*\) is a standardized spatially structured component with characteristic marginal variance equal to 1. The hyperparameter \(\phi\) stands for the mixing parameter and it shows the percentage of variability explained by each random effect, with values close to 0 implying that the majority of the observed variation comes from the unstructured component, whereas values close to 1 the opposite. The hyperparameter \(\tau\)^is the random precision (1/variance). To complete the Bayesian representation we assign penalized complexity priors to the hyperparameters (Simpson et al. 2017).
First we will fit models without covariates, applying some level of smoothing on the SIRs. Lets begin by including just an overdispersion parameter, ie:
\[\text{log}Y_i = \beta_0 + v_i\] The steps to do so are:
lung$ID <- 1:nrow(lung)
formula_iid <- cases ~ 1 + offset(logExpected) +
f(ID, model="iid", constr = TRUE,
hyper = list(theta = list("PCprior", c(1, 0.01))))
lung_mod_ov <- inla(formula_iid, data = lung, family = "poisson",
control.compute = list(dic=T, waic=T), verbose = F)
summary(lung_mod_ov)
##
## Call:
## c("inla(formula = formula_iid, family = \"poisson\", data = lung, ", "
## verbose = F, control.compute = list(dic = T, waic = T))" )
## Time used:
## Pre = 0.558, Running = 0.81, Post = 1.92, Total = 3.29
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) -0.002 0.004 -0.009 -0.002 0.005 -0.002 0
##
## Random effects:
## Name Model
## ID IID model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Precision for ID 16.08 1.91 12.57 16.00 20.04 15.84
##
## Expected number of effective parameters(stdev): 145.46(0.536)
## Number of equivalent replicates : 1.03
##
## Deviance Information Criterion (DIC) ...............: 1526.14
## Deviance Information Criterion (DIC, saturated) ....: 295.66
## Effective number of parameters .....................: 145.56
##
## Watanabe-Akaike information criterion (WAIC) ...: 1484.70
## Effective number of parameters .................: 75.10
##
## Marginal log-Likelihood: -973.52
## Posterior marginals for the linear predictor and
## the fitted values are computed
lung$ov <- lung_mod_ov$summary.random$ID$`0.5quant`
ggplot() + geom_sf(data = lung, aes(fill = ov), col = NA) + scale_fill_viridis_c() + theme_bw()
# posterior
marg_post <- inla.tmarginal(function(x) exp(-1/2*x), marginal =
lung_mod_ov$internal.marginals.hyperpar$`Log precision for ID`)
# prior
marg_prior <- inla.tmarginal(function(x) exp(-1/2*x),
data.frame(x= seq(from = 1, to = 1000, length.out = 1000),
y = inla.pc.dprec(seq(from = 1, to = 1000, length.out = 1000), 1, 0.01))
)
# prepare to plot
marg_post <- as.data.frame(marg_post)
marg_post$type = "posterior"
marg_prior <- as.data.frame(marg_prior)
marg_prior$type = "prior"
dat <- rbind(marg_post, marg_prior)
ggplot() + geom_line(data = dat, aes(x = x, y = y, col = type), size = .8) + theme_bw() +
scale_y_continuous(limits = c(0, 30)) + scale_color_viridis_d(end = .7) +
ggtitle("Prior-posterior plot")
This plot is quite nice, telling us that there is some overdispersion in the data that we need to adjust for. Now we will add the spatially structured random effect using the Simpson parameterisation explained above. To fit this model in INLA
you need to specify model = "bym2"
, however is still possible to fit the original BYM by specifying model = "bym"
instead.
W.nb <- poly2nb(lung)
nb2INLA("W.adj", W.nb)
formula_bym <- cases ~ 1 + offset(logExpected) +
f(ID, model="bym2", graph="W.adj", scale.model = TRUE, constr = TRUE,
# priors
hyper = list(theta1 = list("PCprior", c(1, 0.01)),
# Pr(sd<1) = 0.01, unlikely to have rr>3just based on the spatial confounding
theta2 = list("PCprior", c(0.5, 0.5)))
# Pr(phi<0.5)=0.5, we state that we believe that the unmeasured spatial confounding
# is driven 50% from the structured and 50% from the unstructured random effect
)
lung_mod_bym <- inla(formula_bym, data = lung, family="poisson",
control.compute = list(dic = TRUE, waic = TRUE), verbose = F)
*We can check the DIC (Deviance information criterion) to examine which model performs better. It seems that the model with the bym prior performs slightly better compared to the model with just an overdispersion parameter:
lung_mod_ov$dic$dic; lung_mod_bym$dic$dic
## [1] 1526.145
## [1] 1521.987
# The precision
# sd
marg_post_prec <- inla.tmarginal(function(x) exp(-1/2*x), marginal =
lung_mod_bym$internal.marginals.hyperpar$`Log precision for ID`)
# prior
marg_prior_prec <- inla.tmarginal(function(x) exp(-1/2*x),
data.frame(x= seq(from = 1, to = 1000, length.out = 1000),
y = inla.pc.dprec(seq(from = 1, to = 1000, length.out = 1000),
1, 0.01))
)
# prepare to plot
marg_post_prec <- as.data.frame(marg_post_prec)
marg_post_prec$type = "posterior"
marg_prior_prec <- as.data.frame(marg_prior_prec)
marg_prior_prec$type = "prior"
dat_sd <- rbind(marg_post_prec, marg_prior_prec)
ggplot() + geom_line(data = dat_sd, aes(x = x, y = y, col = type), size = .8) + theme_bw() +
scale_y_continuous(limits = c(0, 30)) + scale_color_viridis_d(end = .7) +
ggtitle("Prior-posterior plot for sd") -> p1
# The mixing parameter
Q = INLA:::inla.pc.bym.Q("W.adj")
n <- dim(Q)[1]
Q = INLA:::inla.scale.model(Q, constr=list(A=matrix(1, 1, n), e=0))
phi.u = 0.5
phi.alpha = 0.5 ## prob(phi < phi.u) = phi.alpha
phis = 1/(1+exp(-seq(-8, 8, len = 10000)))
phi.prior = INLA:::inla.pc.bym.phi(Q=Q, u= phi.u, alpha = phi.alpha)
m.r = inla.smarginal(lung_mod_bym$internal.marginals.hyperpar$`Logit phi for ID`,
factor = 100, extrapolate = 0.5)
mm.r = inla.tmarginal(function(x)1/(1+exp(-x)), m.r)
# prepare to plot
marg_post_mixing <- data.frame(x = mm.r$x, y = mm.r$y)
marg_post_mixing$type = "posterior"
marg_prior_mixing <- data.frame(x = phis, y = exp(phi.prior(phis)))
marg_prior_mixing$type = "prior"
dat_mix <-rbind(marg_post_mixing, marg_prior_mixing)
ggplot() + geom_line(data = dat_mix, aes(x = x, y = y, col = type), size = .8) + theme_bw() +
scale_color_viridis_d(end = .7) +
ggtitle("Prior-posterior plot for mixing parameter") -> p2
p1|p2
lung$spatialRR <- exp(lung_mod_bym$summary.random$ID$`0.5quant`[1:150])
tmap_mode("view")
tm_shape(shp = lung) +
tm_fill(col = "spatialRR", alpha = .6, palette="viridis", breaks = c(0,0.5,1,1.5,2)) +
tm_borders(lty="solid", col="black") +
tm_style("natural") +
tm_layout(title="Spatial relative risk of lung cancer in England")
However these maps show only point estimates, without any variation component. To get an idea about the variation on the map we can use the notion of exceedance probability. Exceedance probability is defined as \(\Pr(\text{RR}_i>\alpha)\) and this \(\alpha\) can be set to anything make sense for each specific application. In our case we consider \(\alpha = 1.5\) meaning we are looking for the probability that the RR of the \(i\)-th region is larger than 1.5 (compared to the overall risk over the domain).
threshold <- log(1.5)
exceed.prob <- lapply(X= lung_mod_bym$marginals.random$ID[1:150],
FUN = function(x) inla.pmarginal(marginal = x, threshold))
exceed.prob <- 1 - unlist(exceed.prob)
lung$exprob <- exceed.prob
tmap_mode("view")
tm_shape(shp = lung) +
tm_fill(col = "exprob", alpha = .6, palette="viridis", breaks = c(0, 0.2, 0.8, 1)) +
tm_borders(lty="solid", col="black") +
tm_style("natural") +
tm_layout(title="Exceedance probability of lung cancer in England")
We will perform ecological regression to examine the association between lung cancer incidence and deprivation. First we will have a look at the deprivation map in the UK.
ggplot() + geom_sf(data = lung, aes(fill = pctDeprived), col = NA) +
scale_fill_viridis_c() + theme_bw()
To perform ecological regression, we just need to add the variable of interest, which in our case is the deprivation, in the formula and re-run the inla()
command. The mathematical formula writes:
\[\text{log}Y_i = \beta_0 + \beta_1\text{deprivation} + \frac{1}{\sqrt{\tau}}(\sqrt{1-\phi}v_i + \sqrt{\phi}u_i^*)\]
formula_bym_cov <- cases ~ 1 + offset(logExpected) + pctDeprived +
f(ID, model="bym2", graph="W.adj", scale.model = TRUE, constr = TRUE,
# priors
hyper = list(theta1 = list("PCprior", c(1, 0.01)),
theta2 = list("PCprior", c(0.5, 0.5)))
)
lung_mod_bym_cov <- inla(formula_bym_cov, data = lung, family="poisson",
# priors for the fixed effects
control.fixed = list(mean=list(pctDeprived = 0),
prec = list(pctDeprived =.0001)),
control.compute = list(dic = TRUE, waic = TRUE), verbose = F)
kableExtra::kable(
round(exp(lung_mod_bym_cov$summary.fixed["pctDeprived",
c("0.5quant", "0.025quant", "0.975quant")]),
digits = 3)) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
0.5quant | 0.025quant | 0.975quant | |
---|---|---|---|
pctDeprived | 1.009 | 1.008 | 1.011 |
Thus for every unit increase in the deprivation index, the lung cancer mortality increases by 0.9% (95%CI: 0.8-1.1%).
lung$ranef_unadj <- lung_mod_bym$summary.random$ID$`0.5quant`[1:150]
lung$ranef_adj <- lung_mod_bym_cov$summary.random$ID$`0.5quant`[1:150]
ggplot() + geom_sf(data = lung, aes(fill = ranef_unadj), col = NA) +
scale_fill_viridis_c() + theme_bw() -> p1
ggplot() + geom_sf(data = lung, aes(fill = ranef_adj), col = NA) +
scale_fill_viridis_c() + theme_bw() -> p2
p1|p2
we observe a shrinkage to 0, since some observed spatial variation is now explained by deprivation, however there seem to be still spatial confounding, mainly driven by the spatially structured component (possibly because we miss two major lung cancer risk factors, ie smoking and radon). We can also quantify the percentage of the observed spatial variation explained by deprivation.
sd_before <- 1/lung_mod_bym$summary.hyperpar$`0.5quant`[1]
# sd after adjusting for deprivation
sd_after <- 1/lung_mod_bym_cov$summary.hyperpar$`0.5quant`[1]
(sd_before - sd_after)/sd_before
## [1] 0.6233753
We found that approximately 62% of the observed spatial variation can be explained by deprivation.
Besag, Julian, Jeremy York, and Annie Mollié. 1991. “Bayesian Image Restoration, with Two Applications in Spatial Statistics.” Annals of the Institute of Statistical Mathematics 43 (1): 1–20.
Freni-Sterrantino, Anna, Massimo Ventrucci, and Håvard Rue. 2018. “A Note on Intrinsic Conditional Autoregressive Models for Disconnected Graphs.” Spatial and Spatio-Temporal Epidemiology 26: 25–34.
Riebler, Andrea, Sigrunn H Sørbye, Daniel Simpson, and Håvard Rue. 2016. “An Intuitive Bayesian Spatial Model for Disease Mapping That Accounts for Scaling.” Statistical Methods in Medical Research 25 (4): 1145–65.
Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92.
Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, Sigrunn H Sørbye, and others. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science 32 (1): 1–28.