Lung Cancer in England

In this document we will perform ecological regression using R-INLA (Rue, Martino, and Chopin 2009). We will BYM2 (Riebler et al. 2016), a reparametrization of (Besag, York, and Mollié 1991) to lung cancer mortality in England examining the effect of deprivation. The dataset includes information about the geographical units (or intermediate geographies or IG) in England as for example the name of the region, the code etc. It also includes the number of cases diagnosed with lung cancer per intermediate geographies (IG) in England in 2007, the type of cancer, a deprivation index, the expected number of cases in the corresponding IG and the logarithm of this expected number of cases. You can download the data from github.

Install and load packages

This practical requires the following packages to be installed and attached: sf, spdep, tmap,ggplot2, patchwork, kableExtra and INLA.

  • To install the entire suite of packages, we can use:
install.packages(c("sf", "spdep", "ggplot2", "patchwork"), dependencies = TRUE, 
                 repos = "http://cran.r-project.org")
  • For INLA, you can use:
install.packages("INLA",repos=c(getOption("repos"),
                                INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
  • Then, load the needed packages:
library(sf)
library(spdep)
library(tmap)
library(ggplot2)
library(patchwork)
library(kableExtra) 
library(INLA)
  • First we will load the object.
load("lung.RData")
class(lung)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
  • Transform it to an sf object.
lung <- st_as_sf(lung)
  • Plot the shp.
ggplot() + geom_sf(data = lung, fill = NA) + theme_bw()

  • Plot the maps containing the information of the shp.
ggplot() + geom_sf(data = lung, aes(fill = cases), col = NA) + scale_fill_viridis_c() + theme_bw()

This map does not tell us a lot about the spatial distribution of lung cancer incidence. It mainly reflects the population per IG. We can calculate the standardized incidence ratio (SIR) instead.

  • Calculate the SIR per IG.
lung$SIR <- lung$cases/lung$expected
ggplot() + geom_sf(data = lung, aes(fill = SIR), col = NA) + scale_fill_viridis_c() + theme_bw()

Often the SIR or SMR is not the best estimator you can get because of its large variance. That means that we can often see extremes when the denominator is small. To mitigate this, in the Bayesian setting we smooth the estimators with a combination of local and global smoothing (through conditional autoregressive priors or CAR). The most popular one is the Besag-York-Molli{'e} prior (Besag, York, and Mollié 1991). In the next junk we will run 2 different models, one with an overdispersion parameter and one with a BYM prior. Here we considered a nice reparametrisation given by (Simpson et al. 2017). This reparametrisation overcomes issues as scaling and identifiability issues the common BYM confronts (Riebler et al. 2016, @freni2018). The mathematical formula is given bellow. Supposed \(Y_i\sim\text{Poisson}(\lambda_iE_i)\), where \(Y_i\) is the number of cases, \(E_i\) the expected number of cases and \(\lambda_i\) the SIR of the \(i\)-th IG. Then:

\[\text{log}Y_i = \beta_0 + \pmb{\beta}\mathbf{X} + \frac{1}{\sqrt{\tau}}(\sqrt{1-\phi}v_i + \sqrt{\phi}u_i^*)\] where \(v_i \sim \mathcal{N}(0, 1)\) and \(u_i^*\) is a standardized spatially structured component with characteristic marginal variance equal to 1. The hyperparameter \(\phi\) stands for the mixing parameter and it shows the percentage of variability explained by each random effect, with values close to 0 implying that the majority of the observed variation comes from the unstructured component, whereas values close to 1 the opposite. The hyperparameter \(\tau\)^is the random precision (1/variance). To complete the Bayesian representation we assign penalized complexity priors to the hyperparameters (Simpson et al. 2017).

Disease Mapping of lung cancer

First we will fit models without covariates, applying some level of smoothing on the SIRs. Lets begin by including just an overdispersion parameter, ie:

\[\text{log}Y_i = \beta_0 + v_i\] The steps to do so are:

  1. Create an ID per IG.
lung$ID <- 1:nrow(lung)
  1. Define the formula. Here we will also set the hyperprior of the overdispersion parameter.
formula_iid <- cases ~ 1 + offset(logExpected) + 
               f(ID, model="iid", constr = TRUE, 
                 hyper = list(theta = list("PCprior", c(1, 0.01))))
  1. Run INLA.
lung_mod_ov <- inla(formula_iid, data = lung, family = "poisson", 
                    control.compute = list(dic=T, waic=T), verbose = F)
  • Get the first summary of the results.
summary(lung_mod_ov)
## 
## Call:
##    c("inla(formula = formula_iid, family = \"poisson\", data = lung, ", " 
##    verbose = F, control.compute = list(dic = T, waic = T))" ) 
## Time used:
##     Pre = 0.558, Running = 0.81, Post = 1.92, Total = 3.29 
## Fixed effects:
##               mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## (Intercept) -0.002 0.004     -0.009   -0.002      0.005 -0.002   0
## 
## Random effects:
##   Name     Model
##     ID IID model
## 
## Model hyperparameters:
##                   mean   sd 0.025quant 0.5quant 0.975quant  mode
## Precision for ID 16.08 1.91      12.57    16.00      20.04 15.84
## 
## Expected number of effective parameters(stdev): 145.46(0.536)
## Number of equivalent replicates : 1.03 
## 
## Deviance Information Criterion (DIC) ...............: 1526.14
## Deviance Information Criterion (DIC, saturated) ....: 295.66
## Effective number of parameters .....................: 145.56
## 
## Watanabe-Akaike information criterion (WAIC) ...: 1484.70
## Effective number of parameters .................: 75.10
## 
## Marginal log-Likelihood:  -973.52 
## Posterior marginals for the linear predictor and
##  the fitted values are computed
  • Plot the overdispersion parameter.
lung$ov <- lung_mod_ov$summary.random$ID$`0.5quant`
ggplot() + geom_sf(data = lung, aes(fill = ov), col = NA) + scale_fill_viridis_c() + theme_bw()

  • Plot the prior together with the posterior of the sd hyperparameter.
# posterior
marg_post <- inla.tmarginal(function(x) exp(-1/2*x), marginal = 
                          lung_mod_ov$internal.marginals.hyperpar$`Log precision for ID`) 

# prior
marg_prior <- inla.tmarginal(function(x) exp(-1/2*x),
                        data.frame(x= seq(from = 1, to = 1000, length.out = 1000), 
                                   y = inla.pc.dprec(seq(from = 1, to = 1000, length.out = 1000), 1, 0.01))
)

# prepare to plot

marg_post <- as.data.frame(marg_post)
marg_post$type = "posterior"

marg_prior <- as.data.frame(marg_prior)
marg_prior$type = "prior"

dat <- rbind(marg_post, marg_prior)



ggplot() + geom_line(data = dat, aes(x = x, y = y, col = type), size = .8) + theme_bw() + 
  scale_y_continuous(limits = c(0, 30)) + scale_color_viridis_d(end = .7) + 
  ggtitle("Prior-posterior plot")

This plot is quite nice, telling us that there is some overdispersion in the data that we need to adjust for. Now we will add the spatially structured random effect using the Simpson parameterisation explained above. To fit this model in INLA you need to specify model = "bym2", however is still possible to fit the original BYM by specifying model = "bym" instead.

  • Specify a neighboring matrix.
W.nb <- poly2nb(lung)
nb2INLA("W.adj", W.nb) 
  • Specify the formula.
formula_bym <- cases ~ 1 + offset(logExpected) + 
               f(ID, model="bym2", graph="W.adj", scale.model = TRUE, constr = TRUE, 
               # priors
               hyper = list(theta1 = list("PCprior", c(1, 0.01)),  
               # Pr(sd<1) = 0.01, unlikely to have rr>3just based on the spatial confounding
                            theta2 = list("PCprior", c(0.5, 0.5))) 
               # Pr(phi<0.5)=0.5, we state that we believe that the unmeasured spatial confounding
              # is driven 50% from the structured and 50% from the unstructured random effect
               )
  • Run INLA.
lung_mod_bym <- inla(formula_bym, data = lung, family="poisson",  
                     control.compute = list(dic = TRUE, waic = TRUE), verbose = F)

*We can check the DIC (Deviance information criterion) to examine which model performs better. It seems that the model with the bym prior performs slightly better compared to the model with just an overdispersion parameter:

lung_mod_ov$dic$dic; lung_mod_bym$dic$dic 
## [1] 1526.145
## [1] 1521.987
  • We can replot the prior posterior plots and examine the hyperparameters.
# The precision

# sd
marg_post_prec <- inla.tmarginal(function(x) exp(-1/2*x), marginal = 
                          lung_mod_bym$internal.marginals.hyperpar$`Log precision for ID`) 

# prior
marg_prior_prec <- inla.tmarginal(function(x) exp(-1/2*x),
                        data.frame(x= seq(from = 1, to = 1000, length.out = 1000), 
                                   y = inla.pc.dprec(seq(from = 1, to = 1000, length.out = 1000), 
                                                     1, 0.01))
)



# prepare to plot

marg_post_prec <- as.data.frame(marg_post_prec)
marg_post_prec$type = "posterior"

marg_prior_prec <- as.data.frame(marg_prior_prec)
marg_prior_prec$type = "prior"

dat_sd <- rbind(marg_post_prec, marg_prior_prec)

ggplot() + geom_line(data = dat_sd, aes(x = x, y = y, col = type), size = .8) + theme_bw() + 
  scale_y_continuous(limits = c(0, 30)) + scale_color_viridis_d(end = .7) + 
  ggtitle("Prior-posterior plot for sd") -> p1




# The mixing parameter

Q = INLA:::inla.pc.bym.Q("W.adj")
n <- dim(Q)[1]
Q = INLA:::inla.scale.model(Q,  constr=list(A=matrix(1, 1, n), e=0))

phi.u = 0.5
phi.alpha = 0.5 ## prob(phi < phi.u) = phi.alpha

phis = 1/(1+exp(-seq(-8, 8,  len = 10000)))
phi.prior = INLA:::inla.pc.bym.phi(Q=Q, u= phi.u, alpha = phi.alpha)

m.r = inla.smarginal(lung_mod_bym$internal.marginals.hyperpar$`Logit phi for ID`, 
                     factor = 100, extrapolate = 0.5)
mm.r = inla.tmarginal(function(x)1/(1+exp(-x)), m.r)

# prepare to plot

marg_post_mixing <- data.frame(x = mm.r$x, y = mm.r$y)
marg_post_mixing$type = "posterior"

marg_prior_mixing <- data.frame(x = phis, y = exp(phi.prior(phis)))
marg_prior_mixing$type = "prior"

dat_mix <-rbind(marg_post_mixing, marg_prior_mixing)


ggplot() + geom_line(data = dat_mix, aes(x = x, y = y, col = type), size = .8) + theme_bw() + 
  scale_color_viridis_d(end = .7) + 
  ggtitle("Prior-posterior plot for mixing parameter") -> p2

p1|p2

  • We can also plot the spatial relative risk on the map:
lung$spatialRR <- exp(lung_mod_bym$summary.random$ID$`0.5quant`[1:150])

tmap_mode("view")
tm_shape(shp = lung) +
  tm_fill(col = "spatialRR", alpha = .6, palette="viridis", breaks = c(0,0.5,1,1.5,2)) +
          tm_borders(lty="solid", col="black") +
            tm_style("natural") +
            tm_layout(title="Spatial relative risk of lung cancer in England")

However these maps show only point estimates, without any variation component. To get an idea about the variation on the map we can use the notion of exceedance probability. Exceedance probability is defined as \(\Pr(\text{RR}_i>\alpha)\) and this \(\alpha\) can be set to anything make sense for each specific application. In our case we consider \(\alpha = 1.5\) meaning we are looking for the probability that the RR of the \(i\)-th region is larger than 1.5 (compared to the overall risk over the domain).

  • Calculate exceedance probability.
threshold <- log(1.5)
exceed.prob <- lapply(X= lung_mod_bym$marginals.random$ID[1:150], 
                      FUN = function(x) inla.pmarginal(marginal = x, threshold))
exceed.prob <- 1 - unlist(exceed.prob)
  • Plot it.
lung$exprob <- exceed.prob

tmap_mode("view")
tm_shape(shp = lung) +
  tm_fill(col = "exprob", alpha = .6, palette="viridis", breaks = c(0, 0.2, 0.8, 1)) +
          tm_borders(lty="solid", col="black") +
            tm_style("natural") +
            tm_layout(title="Exceedance probability of lung cancer in England")

Ecological regression of lung cancer

We will perform ecological regression to examine the association between lung cancer incidence and deprivation. First we will have a look at the deprivation map in the UK.

ggplot() + geom_sf(data = lung, aes(fill = pctDeprived), col = NA) + 
  scale_fill_viridis_c() + theme_bw()

To perform ecological regression, we just need to add the variable of interest, which in our case is the deprivation, in the formula and re-run the inla() command. The mathematical formula writes:

\[\text{log}Y_i = \beta_0 + \beta_1\text{deprivation} + \frac{1}{\sqrt{\tau}}(\sqrt{1-\phi}v_i + \sqrt{\phi}u_i^*)\]

  • Define the formula.
formula_bym_cov <- cases ~ 1 + offset(logExpected) + pctDeprived +
               f(ID, model="bym2", graph="W.adj", scale.model = TRUE, constr = TRUE, 
               # priors
               hyper = list(theta1 = list("PCprior", c(1, 0.01)), 
                            theta2 = list("PCprior", c(0.5, 0.5)))
               )
  • Fit INLA.
lung_mod_bym_cov <- inla(formula_bym_cov, data = lung, family="poisson",  
                     # priors for the fixed effects 
                     control.fixed = list(mean=list(pctDeprived = 0), 
                                              prec = list(pctDeprived =.0001)),
                     control.compute = list(dic = TRUE, waic = TRUE), verbose = F)
  • Extract the results for deprivation.
kableExtra::kable(
round(exp(lung_mod_bym_cov$summary.fixed["pctDeprived",
                                         c("0.5quant", "0.025quant", "0.975quant")]),
      digits = 3))  %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
0.5quant 0.025quant 0.975quant
pctDeprived 1.009 1.008 1.011

Thus for every unit increase in the deprivation index, the lung cancer mortality increases by 0.9% (95%CI: 0.8-1.1%).

  • Plot the random effect before and after including deprivation.
lung$ranef_unadj <- lung_mod_bym$summary.random$ID$`0.5quant`[1:150]
lung$ranef_adj <- lung_mod_bym_cov$summary.random$ID$`0.5quant`[1:150]

ggplot() + geom_sf(data = lung, aes(fill = ranef_unadj), col = NA) + 
  scale_fill_viridis_c() + theme_bw() -> p1

ggplot() + geom_sf(data = lung, aes(fill = ranef_adj), col = NA) + 
  scale_fill_viridis_c() + theme_bw() -> p2

p1|p2

we observe a shrinkage to 0, since some observed spatial variation is now explained by deprivation, however there seem to be still spatial confounding, mainly driven by the spatially structured component (possibly because we miss two major lung cancer risk factors, ie smoking and radon). We can also quantify the percentage of the observed spatial variation explained by deprivation.

sd_before <- 1/lung_mod_bym$summary.hyperpar$`0.5quant`[1]

# sd after adjusting for deprivation
sd_after <- 1/lung_mod_bym_cov$summary.hyperpar$`0.5quant`[1]
(sd_before -  sd_after)/sd_before
## [1] 0.6233753

We found that approximately 62% of the observed spatial variation can be explained by deprivation.

References

Besag, Julian, Jeremy York, and Annie Mollié. 1991. “Bayesian Image Restoration, with Two Applications in Spatial Statistics.” Annals of the Institute of Statistical Mathematics 43 (1): 1–20.

Freni-Sterrantino, Anna, Massimo Ventrucci, and Håvard Rue. 2018. “A Note on Intrinsic Conditional Autoregressive Models for Disconnected Graphs.” Spatial and Spatio-Temporal Epidemiology 26: 25–34.

Riebler, Andrea, Sigrunn H Sørbye, Daniel Simpson, and Håvard Rue. 2016. “An Intuitive Bayesian Spatial Model for Disease Mapping That Accounts for Scaling.” Statistical Methods in Medical Research 25 (4): 1145–65.

Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92.

Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, Sigrunn H Sørbye, and others. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science 32 (1): 1–28.