Ecological regression with BYM2 in R-INLA: A case study with stroke mortality in Sheffield

Stroke mortality in Sheffield

In this document we will perform ecological regression using R-INLA (Rue, Martino, and Chopin 2009). We will BYM2 (Riebler et al. 2016), a reparametrization of (Besag, York, and Mollié 1991) to stroke mortality in Sheffield examining the effect of NO\(_x\) after adjusting for deprivation. The dataset includes information about stroke mortality in Sheffield (counts of deaths), population at risk and relevant covariates (NOx concentration, and Townsend index, which is a deprivation index). In the folder there is also a shapefile with the enumeration districts for the city of Sheffield. You can download the data from github.

Install and load packages

This practical requires the following packages to be installed and attached: sf, spdep, dplyr,ggplot2, patchwork, kableExtra and INLA.

To install the entire suite of packages, we can use:

install.packages(c("sf", "spdep", "ggplot2", "patchwork"), dependencies = TRUE, 
                 repos = "http://cran.r-project.org")

For INLA, you can use

install.packages("INLA",repos=c(getOption("repos"),
                                INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)

Then, load the needed packages:

library(sf)
library(spdep)
library(dplyr)
library(ggplot2)
library(patchwork)
library(kableExtra) 
library(INLA)

Importing/cleaning data

Import data

stroke <- read.csv("Stroke.csv")
stroke$index <- 1:nrow(stroke)
head(stroke)

##      SP_ID stroke_exp     pop y Townsend.class   NOx.class Townsend NOx
## 1 05CGFA01   2.620953  931.65 3  [-6.54,-2.07] (41.4,53.6]        1   2
## 2 05CGFA02   1.989744  710.36 1  (-2.07,0.086]  (82.1,243]        2   5
## 3 05CGFA03   2.286056  654.12 3   (0.086,3.11]  (82.1,243]        3   5
## 4 05CGFA04   2.151746 1109.37 0  (-2.07,0.086]  (82.1,243]        2   5
## 5 05CGFA05   4.380834 1159.31 1  (-2.07,0.086] (53.6,64.8]        2   3
## 6 05CGFA06   3.094954 1020.21 5  [-6.54,-2.07] (53.6,64.8]        1   3
##      Offset index
## 1 -2.549570     1
## 2 -2.551463     2
## 3 -2.455050     3
## 4 -2.711442     4
## 5 -2.420999     5
## 6 -2.516716     6

Check data

ggplot() + geom_point(data = stroke, aes(x = index, y = stroke_exp), size = 0.5) + 
  theme_bw() + ylab("expected")

table(stroke$Townsend.class)

## 
## (-2.07,0.086]  (0.086,3.11]   (3.11,5.44]    (5.44,9.2] [-6.54,-2.07] 
##           206           207           204           207           206

table(stroke$NOx.class)

## 
## (41.4,53.6] (53.6,64.8] (64.8,82.1]  (82.1,243]  [7.8,41.4] 
##         207         205         206         206         206

The dataset has 9 columns. The SP_ID gives us information about the ID of the enumeration districts, the columns stroke_exp and pop give infomation about the expected number of deaths and the population at risk at the i-th spatial unit. Information on counts of stroke are given by the variable y. Covariate information (in quintiles) is given in the columns Townsend.class (deprivation) and NOx.class (air pollutant used to capture mainly traffic related air pollution). Townsend and NOx give the categories of the class of this covariates and the offset is defined as \(\text{offset}_i = \text{logit}_i(E_i/\text{Pop}_i)\).

Now lets have a look at the enumeration districts of Sheffield. Please notice that the number of rows in the shapefile is the same as the number of rows in our dataset.

sheffield.gen <- read_sf("Sheffield.shp")

*and plot it

ggplot() + geom_sf(data = sheffield.gen, fill = NA) + theme_bw()

The research question of this particular problem is to examine the spatial distribution of deaths by stroke in Sheffield and examine to which extent this distribution can be explained by variables as traffic-related air pollution and deprivation. We are also interested in examining the ecological association between these covariates and stroke mortality. To answer these questions we will perform disease mapping and ecological regression with INLA. We will use the binomial distribution rather the Poisson one for this example, since stroke is not a rare outcome:

\[y_i \sim \text{Binomial}(p_i, n_i)\\ \text{logit}(p_i) = \beta_0 + u_i + v_i\\ u_{i}|\mathbf{u}_{-i} \sim \mathcal{N}\Big(\frac{\sum_{j=1}^Nw_{ij}u_j}{\sum_{j = 1}^{N}w_{ij}}, \frac{1}{\tau_1\sum_{j=1}^{N}w_{ij}}\Big) \\[5pt] v_i \sim \mathcal{N}(0, \tau_2^{-1}) \\\]

as before we considered the reparametrization of Simpson (Riebler et al. 2016), however by typing model = “bym” you fit exactly the above model.

# define the neighbour structure
W.nb <- poly2nb(sheffield.gen)
nb2INLA("W.adj", W.nb)

# match
stroke <- stroke[match(sheffield.gen$SP_ID,stroke$SP_ID),]
#Check for the first 10 areas
stroke$SP_ID[1:10]

##  [1] "05CGGD02" "05CGFT44" "05CGGF35" "05CGGD29" "05CGFH02" "05CGGD31"
##  [7] "05CGFM08" "05CGGD22" "05CGGD17" "05CGGD16"

sheffield.gen$SP_ID[1:10]

##  [1] "05CGGD02" "05CGFT44" "05CGGF35" "05CGGD29" "05CGFH02" "05CGGD31"
##  [7] "05CGFM08" "05CGGD22" "05CGGD17" "05CGGD16"

# create an ID per spatial unit
stroke$ID<- seq(1,1030)

Disease mapping stroke mortality

Fit a BYM2 model without covariates

formula <- y ~ 1 + f(ID, model="bym2", graph="W.adj", scale.model = TRUE, constr = TRUE, 
               # priors
               hyper = list(theta1 = list("PCprior", c(1, 0.01)),  
               # Pr(sd<1) = 0.01, unlikely to have rr>3just based on the spatial confounding
                            theta2 = list("PCprior", c(0.5, 0.5))) 
                # Pr(phi<0.5)=0.5, we believe that the unmeasured spatial confounding
                # is driven 50% from the strucutred and 50% from the unstructured random effect
               )


strokes_DM <- inla(formula, data = stroke, family="binomial",  offset=Offset, Ntrials=pop,
                   control.compute = list(dic = TRUE, waic = TRUE), verbose = F)

Extract results of hyperparameters

strokes_DM$summary.hyperpar

##                        mean         sd  0.025quant   0.5quant 0.975quant
## Precision for ID 2.43832343 0.19061808 2.087736949 2.43015297  2.8358812
## Phi for ID       0.04684131 0.04132873 0.003290217 0.03516711  0.1559025
##                         mode
## Precision for ID 2.413137841
## Phi for ID       0.009054035

Tranform to get the standard deviation

sd_mar <- as.data.frame(inla.tmarginal(function(x) exp(-1/2*x), 
                        strokes_DM$internal.marginals.hyperpar$`Log precision for ID`))
head(sd_mar)

##           x         y
## 1 0.5635654 0.0792093
## 2 0.5680135 0.1486057
## 3 0.5707516 0.2140805
## 4 0.5727700 0.2758374
## 5 0.5743826 0.3361945
## 6 0.5757298 0.3956633

and plot the posteriors of the hyperparameters of the spatial field

ggplot() + geom_line(data = sd_mar, aes(x = x, y = y)) + theme_bw() + 
  ggtitle("Posterior of sd of the spatial field") -> p1

ggplot() + geom_line(data = as.data.frame(strokes_DM$marginals.hyperpar$`Phi for ID`), 
                     aes(x = x, y = y)) + theme_bw() + 
  ggtitle("Posterior of sd of the mixing parameter") -> p2

p1|p2

There seems to be variation in the field captured by the BYM2 prior, nevertheless the mixing parameter is very small, implying that most variation is an attribute to the unstructured spatial components. In other words, the field is dominated by overdispersion rather than spatial autocorelation.

Plot the spatial random effect

sheffield.gen$sp <- strokes_DM$summary.random$ID$`0.5quant`[1:1030]
ggplot() + geom_sf(data = sheffield.gen, aes(fill = sp), col = NA) + theme_bw() +
  scale_fill_viridis_c()

Calculate posterior probability that the field is larger than 0

threshold <- log(1)
exceed.prob <- lapply(X= strokes_DM$marginals.random$ID[1:1030], FUN = function(x) inla.pmarginal(marginal = x, threshold))
exceed.prob <- 1 - unlist(exceed.prob)

Plots of the posterior probability

ggplot() + geom_boxplot(data=as.data.frame(exceed.prob), aes(y = exceed.prob)) + 
  geom_hline(yintercept = 0.95, col = "red") + theme_bw() + xlim(c(-0.6,0.6)) + 
  ylab("") + 
  theme(axis.text.x = element_blank()) -> p1

  
sheffield.gen$ex <- exceed.prob
temp.ex <-  sheffield.gen[sheffield.gen$ex >=0.95,]

ggplot() + geom_sf(data = sheffield.gen, aes(fill = ex), col = NA) + scale_fill_viridis_c() + 
  geom_sf(data = temp.ex, col = "red", fill = NA) + theme_bw() -> p2

(p1|p2) + plot_annotation(title = "Posterior probability")

At this point and based on Occam’s razor grounds it would make sense to fit a model with just an ustructured random effect and compare the models.

formula_2 <- y ~ 1 +  f(ID, model="iid", constr = TRUE, 
                        hyper = list(theta = list("PCprior", c(1, 0.01))))
                        
strokes_DM_unstr <- inla(formula_2, data = stroke, 
                         family="binomial",  offset=Offset, Ntrials=pop,
                   control.compute = list(dic = TRUE, waic = TRUE), verbose = F)

strokes_DM$dic$dic; strokes_DM_unstr$dic$dic

## [1] 3947.695

## [1] 3947.14

It seems that the models without a spatial component performs better.

Ecological regression stroke mortality

Now we will add the covariates of interest:

\[y_i \sim \text{Binomial}(p_i, n_i)\\ \text{logit}(p_i) = \beta_0 + \beta_1\text{Townsend} + \beta_2\text{NOx} + u_i + v_i\\ u_{i}|\mathbf{u}_{-i} \sim \mathcal{N}\Big(\frac{\sum_{j=1}^Nw_{ij}u_j}{\sum_{j = 1}^{N}w_{ij}}, \frac{1}{\tau_1\sum_{j=1}^{N}w_{ij}}\Big) \\[5pt] v_i \sim \mathcal{N}(0, \tau_2^{-1}) \\ \beta_1, \beta_2 \sim \mathcal{N}(0, 0.5)\]

First lets plot them:

stroke <- left_join(sheffield.gen, stroke, by = c("SP_ID" = "SP_ID"))

ggplot() + geom_sf(data = stroke, aes(fill = NOx), col = NA) + scale_fill_viridis_c() + 
  theme_bw() -> p1

ggplot() + geom_sf(data = stroke, aes(fill = Townsend), col = NA) + scale_fill_viridis_c() + 
  theme_bw() -> p2

p1|p2

What do you observe?

Perform ecological regression.

formula_eco <- y ~  1 + Townsend + NOx + 
               f(ID, model="bym2", graph="W.adj", scale.model = TRUE, constr = TRUE, 
               # hyper priors
               hyper = list(theta1 = list("PCprior", c(1, 0.01)),  
                            theta2 = list("PCprior", c(0.5, 0.5))) 
               )

model_eco <- inla(formula_eco, family="binomial", data=stroke, offset=Offset, Ntrials=pop, 
                  control.compute = list(dic = TRUE, waic = TRUE), 
                  control.fixed=list(prec=list(mean=0, prec=0.5)))

Extract the odds ratios.

tab <- exp(model_eco$summary.fixed[c("Townsend", "NOx"),c("0.5quant", "0.025quant", "0.975quant")])
tab <- round(tab, digits = 2)

kableExtra::kable(tab, caption = "Median and 95%CrI for the odd ratio for NOx and deprivation") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "center")

Median and 95%CrI for the odd ratio for NOx and deprivation
	0.5quant	0.025quant	0.975quant
Townsend	1.01	0.97	1.06
NOx	1.08	1.03	1.13

Check the mixing parameter.

ggplot() + geom_line(data = as.data.frame(model_eco$marginals.hyperpar$`Phi for ID`), 
                     aes(x = x, y = y)) + theme_bw() + 
  ggtitle("Posterior of sd of the mixing parameter")

now almost the entire amount of variation is due to the ustructured random effect.

Check the spatial field.

sheffield.gen$sp_eco <- model_eco$summary.random$ID$`0.5quant`[1:1030]
ggplot() + geom_sf(data = sheffield.gen, aes(fill = sp_eco), col = NA) + theme_bw() + 
  scale_fill_viridis_c()

not much of variation was explained by the covariates.

References

Besag, Julian, Jeremy York, and Annie Mollié. 1991. “Bayesian Image Restoration, with Two Applications in Spatial Statistics.” Annals of the Institute of Statistical Mathematics 43 (1): 1–20.

Riebler, Andrea, Sigrunn H Sørbye, Daniel Simpson, and Håvard Rue. 2016. “An Intuitive Bayesian Spatial Model for Disease Mapping That Accounts for Scaling.” Statistical Methods in Medical Research 25 (4): 1145–65.

Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92.