Install and load packages

This practical requires the following packages to be installed and attached: sf, dplyr, tidyverse, nimble, coda, spdep, patchwork, fastDummier, ggmcmc, gplots and INLA.

To check where the packages are saved type .libPaths()
To check if already installed type .packages(all.available = TRUE). As seen on the previous practical, you can type is.element(“PACKAGE_NAME”, installed.packages()) to check if the individual package “PACKAGE_NAME” is installed.

If the packages are not installed you need to do it through:

install.packages(c("sf","dplyr", "tidyverse", "nimble",  "coda", 
                   "spdep", "patchwork", "fastDummies", "ggmcmc", "gplot", "kableExtra"), 
                 type = "both",dependencies = TRUE, 
                 repos = c(CRAN = "http://cran.r-project.org"))

For INLA, you can use

install.packages("INLA",repos=c(getOption("repos"),
                                INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)

Then you need to load these through:

library(sf)           # Simple feature for R
library(dplyr)        # A package for data manipulation
library(tidyverse)    # A package for data manipulation and plotting


library(nimble)       # A package for performing MCMC in R
library(coda)         # A package for summarizing and plotting of MCMC output 
                      # and diagnostic tests
library(spdep)        # A package to calculate neighbors

library(patchwork)    # A package to combine plots
library(ggmcmc)       # MCMC diagnostics with ggplot
library(INLA)         # A package for quick Bayesian inference (here used to calculate
                      # the scale for BYM2)
library(fastDummies)  # A package to create dummies from a factor
library(gplots)       # A package to create CrI graphs

library(kableExtra)   # package for formatting tables in rmarkdown

Introduction to the problem

In this document we will perform ecological regression using NIMBLE (de Valpine et al. 2017). We will fit the BYM2 model (Riebler et al. 2016) and examine the effect of long term exposure to air-pollution on COVID-19 mortality in England. Long-term exposure to air-pollution has been hypothesised to worsen COVID-19 prognosis: either directly, as it can suppress early immune responses to the infection, or indirectly, as it can increase the risk of stroke, hypertension and other pre-existing conditions (Konstantinoudis et al. 2020). We will use data openly available on COVID-19 mortality that can be downloaded from ONS. This data covers England during March and July 2020. Long term exposure to air-pollution was determined as the averaged NO\(_2\) concentration during 2014-2018 as retrieved from PCM model. We will consider total number of intensive care unit (ICU) beds as of Ferbuary 2020 NHS and socio-economic deprivation in 2019 (IMD) as potential confounders. The data has been cleaned and stored in a .shp format.

Import and prepare the data

In the git folder you will find the following files:

COVIDecoregression.prj, COVIDecoregression.dbf, COVIDecoregression.shp and COVIDecoregression.shx: This is a spatial polygon data frame containing information about: the lower tier local authorities (LTLA) in 2018 in England, the number of COVID-19 deaths (deaths), the expected number of deaths (expectd), the total ICU beds per population (TtlICU), the average concentration of NO2 during 2014-2018 in \(\mu g/m^3\) (NO2) and the deprivation index (IMD).

Read data

Import COVID-19 deaths. Note that the file is a shp, so we will use the read_sf() function!

COVID19eco <- read_sf("COVIDecoregression.shp")

Print the first few rows.

head(COVID19eco)

## Simple feature collection with 6 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 344666.1 ymin: 378867 xmax: 478448.5 ymax: 537152
## Projected CRS: OSGB36 / British National Grid
## # A tibble: 6 × 7
##   LTLA      deaths expectd TtlICUB   NO2   IMD                          geometry
##   <chr>      <int>   <dbl>   <dbl> <dbl> <dbl>                <MULTIPOLYGON [m]>
## 1 E06000001    110    78.8 0.00995  12.9     1 (((447213.9 537036.1, 447228.8 5…
## 2 E06000002    206   103.  0.0392   16.4     1 (((448609.9 521982.6, 448616.8 5…
## 3 E06000003    134   131.  0.0412   11.8     1 (((455932.3 527880.7, 455919.4 5…
## 4 E06000004    155   157.  0.00974  13.7     2 (((444157 527956.3, 444165.9 527…
## 5 E06000005     90    97.7 0.0128   11.6     2 (((423496.6 524724.3, 423497.2 5…
## 6 E06000006    128    93.0 0        17.1     1 (((358374.7 384376.8, 358367.7 3…

Creating the adjacency matrix

Plot the shapefile.

par(mar = c(0,0,0,0))
plot(COVID19eco$geometry)

Adjacency matrix in R

LTLA_nb <- poly2nb(COVID19eco)
summary(LTLA_nb)

## Neighbour list object:
## Number of regions: 316 
## Number of nonzero links: 1610 
## Percentage nonzero weights: 1.612322 
## Average number of links: 5.094937 
## 1 region with no links:
## 44
## 2 disjoint connected subgraphs
## Link number distribution:
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 
##  1  8 25 39 54 58 46 49 21 10  3  2 
## 8 least connected regions:
## 10 26 60 66 87 88 114 262 with 1 link
## 2 most connected regions:
## 50 226 with 11 links

# convert in the format NIMBLE wants
nbWB_B <- nb2WB(LTLA_nb)
names(nbWB_B)

## [1] "adj"     "weights" "num"

summary(nbWB_B)

##         Length Class  Mode   
## adj     1610   -none- numeric
## weights 1610   -none- numeric
## num      316   -none- numeric

Creating the dummy variables

The variable IMD is a categorical one. To make it compatible with the NIBLE format required for categorical variables we will use the function dummy_cols in the fastDummies package.

fastDummies::dummy_cols(COVID19eco$IMD) %>% as_tibble() %>% 
  select(.data_1:.data_5) %>% 
  rename(IMD_1 = .data_1, IMD_2 = .data_2, IMD_3 = .data_3, 
         IMD_4 = .data_4, IMD_5 = .data_5) -> dummies
                                            
COVID19eco <- cbind(COVID19eco, dummies)

Exploratory analysis

Summary of the shapefile

summary(COVID19eco)

##      LTLA               deaths          expectd           TtlICUB         
##  Length:316         Min.   :   4.0   Min.   :  8.287   Min.   :0.000e+00  
##  Class :character   1st Qu.:  77.0   1st Qu.: 99.339   1st Qu.:8.548e-05  
##  Mode  :character   Median : 117.5   Median :129.067   Median :8.092e-03  
##                     Mean   : 155.6   Mean   :155.630   Mean   :1.399e-02  
##                     3rd Qu.: 191.2   3rd Qu.:177.896   3rd Qu.:1.460e-02  
##                     Max.   :1226.0   Max.   :903.219   Max.   :1.911e-01  
##       NO2             IMD            IMD_1            IMD_2       
##  Min.   : 4.48   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:11.09   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :13.96   Median :3.000   Median :0.0000   Median :0.0000  
##  Mean   :15.05   Mean   :2.997   Mean   :0.2025   Mean   :0.1994  
##  3rd Qu.:17.26   3rd Qu.:4.000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :46.45   Max.   :5.000   Max.   :1.0000   Max.   :1.0000  
##      IMD_3            IMD_4            IMD_5                 geometry  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   MULTIPOLYGON :316  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   epsg:27700   :  0  
##  Median :0.0000   Median :0.0000   Median :0.0000   +proj=tmer...:  0  
##  Mean   :0.1994   Mean   :0.1962   Mean   :0.2025                      
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000                      
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Boxplots of the variables

ggplot() + geom_boxplot(data = COVID19eco, aes(y = deaths)) + 
  theme_bw() + ylab("") + ggtitle("Deaths") -> p1

ggplot() + geom_boxplot(data = COVID19eco, aes(y = expectd)) + 
  theme_bw() + ylab("") + ggtitle("Expected") -> p2

ggplot() + geom_boxplot(data = COVID19eco, aes(y = TtlICUB)) + 
  theme_bw() + ylab("") + ggtitle("ICU beds") -> p3

ggplot() + geom_boxplot(data = COVID19eco, aes(y = NO2)) + 
  theme_bw() + ylab("") + ggtitle(expression(NO[2])) -> p4

ggplot() + geom_boxplot(data = COVID19eco, aes(y = IMD)) + 
  theme_bw() + ylab("") + ggtitle("IMD") -> p5

(p1|p2)/(p3|p4|p5)

Maps of the variables

ggplot() + geom_sf(data = COVID19eco, aes(fill = deaths), col = NA) + 
  theme_bw() + ggtitle("Deaths") + scale_fill_viridis_c() -> p1

ggplot() + geom_sf(data = COVID19eco, aes(fill = expectd), col = NA) + 
  theme_bw() + ggtitle("Expected") + scale_fill_viridis_c() -> p2

ggplot() + geom_sf(data = COVID19eco, aes(fill = TtlICUB), col = NA) + 
  theme_bw() + ggtitle("ICU beds") + scale_fill_viridis_c() -> p3

ggplot() + geom_sf(data = COVID19eco, aes(fill = NO2), col = NA) + 
  theme_bw() + ggtitle(expression(NO[2])) + scale_fill_viridis_c() -> p4

ggplot() + geom_sf(data = COVID19eco, aes(fill = IMD), col = NA) + 
  theme_bw() + ggtitle("IMD") + scale_fill_viridis_c() -> p5

(p1|p2)/(p3|p4|p5)

Correlation between ICU beds and NO2

ggplot() + geom_point(data = COVID19eco, aes(x = NO2, y = TtlICUB)) + 
  theme_bw() + annotate("text", x = 10, y = 0.17, size = 6,
                        label = paste0("cor = ", round(cor(COVID19eco$NO2,
                                                           COVID19eco$TtlICUB), 
                                                       digits = 2))) -> p1

p1

Boxplot and map of the crude SMR (standardized mortality ratio)

COVID19eco$crudeSMR <- COVID19eco$deaths/COVID19eco$expectd

ggplot() + geom_boxplot(data = COVID19eco, aes(y = crudeSMR)) + 
  theme_bw() + ggtitle("") + ylab("") -> p1

ggplot() + geom_sf(data = COVID19eco, aes(fill = crudeSMR), col = NA) + 
  theme_bw() + ggtitle("") + scale_fill_viridis_c() -> p2

(p1|p2) + plot_annotation(title = "Crude SMR")

Ecological regression analysis

Model specification

We will fit a Poisson log-linear model in the Bayesian framework to quantify the effect of NO\(_2\), while adjusting for ICU beds and deprivation. To model the spatial autocorrelation we will use the BYM2 prior (Riebler et al. 2016), which is an attractive extension of the BYM model (Besag, York, and Mollié 1991). Let \(i=1,...N\) be the index for the LTLAs in England, \(Y_i\) be the number of COVID-19 deaths, \(E_i\) the expected number of COVID-19 death, \(X_1\) the NO\(_2\) concentration, \(X_2\) the number of ICU beds per population, and \(X_{3ji}\) the dummy variables for deprivation with \(j=2,...,5\):

\[\begin{equation} \begin{aligned} \hbox{O}_i & \sim \hbox{Poisson}(E_i \lambda_i); \;\;\; i=1,...,N\\ \log \lambda_i & = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \sum_{j=2}^5\beta_{3j} X_{3ji} + b_{i} \\ b_{i} & = \frac{1}{\sqrt{\tau_b}}\big(\theta\sqrt{1-\rho} + \phi\sqrt{\rho}\big)\\ \theta_{i} & \sim N(0,1) \\ \phi_{i} & \sim \text{ICAR}({\bf W^*}, 1)\\ \end{aligned} \end{equation}\]

where \(\alpha\) is an intercept term, \(\mathbf{\beta}\) a vector of coefficients, \(\theta\) the spatially unstructured random effect, \(\phi\) the scaled spatially structured random effect, \(\rho\) a mixing parameter, \(tau_b\) the precision of the field.

Write the model

BYMecoCode <- nimbleCode(
{
  for (i in 1:N){
    
    O[i] ~ dpois(mu[i])                                       # Poisson likelihood 
    log(mu[i]) <- log(E[i]) + inprod(beta[], X[i,]) + b[i]
    
    # of course you can write this explicitly too:
    # log(mu[i]) <- log(E[i]) + alpha + b[i] +
    #               beta1*X1[i] + beta2*X2[i] + 
    #               beta32*X32[i] + beta33*X33[i] + 
    # beta34*X34[i] + beta35*X35[i]
                             
    b[i] <- (1/sqrt(tau.b))*(sqrt((1-rho))*theta[i] + 
                               sqrt(rho/scale)*phi[i])
    
    theta[i] ~ dnorm(0, tau = 1)                              # area-specific RE
    SMR[i] <- exp(b[1] + b[i])                                # area-specific SIR
    resSMR[i] <- exp(b[i])                                    # area-specific residual SIR
    e[i] <- (O[i]-mu[i])/sqrt(mu[i])                          # residuals  
    proba.resSMR[i]<-step(resSMR[i]-1)                        # Posterior probability
  } 
  
  # ICAR prior
  phi[1:N] ~ dcar_normal(adj[1:L], weights[1:L], 
                         num[1:N], tau = 1, zero_mean = 1)    # its scaled so tau = 1
  
  # Priors
  beta[1] ~ dflat()                                           # vague prior (Unif(-inf, +inf)) 
  overallRR <- exp(beta[1])

  # precision parameter of the reparametrisation
  tau.b ~ dgamma(1, 0.01)                                     # prior for the precision of b
  sigma2.b <- 1/tau.b                                         # the variance of b
                           
  # mixing parameter                         
  rho ~ dbeta(1, 1)                                           # prior for the mixing parameter
  
  # priors for the fixed effects
  for(j in 1:(K-1)){
        beta[j+1] ~ dnorm(0, tau = 1)
      RR.beta[j] <- exp(beta[j+1])
  }
  
  if(quin){
  }else{
   beta.no2.unscaled <- beta[2]/sd.no2
   rr.no2.unscaled <- exp(beta.no2.unscaled)
   RR.beta1_5NO2 <- exp(beta.no2.unscaled * 5)                 # this will be explained later
  }

}

)

Calculate the scaling factor

W.scale <- nb2mat(LTLA_nb, zero.policy = TRUE, style = "B")           
W.scale <- -W.scale
diag(W.scale) <- abs(apply(W.scale, 1, sum))
# solve(W.scale) # this should not work since by definition the matrix is singular

Q = inla.scale.model(W.scale, constr=list(A=matrix(1, nrow=1, ncol=nrow(W.scale)), e=0))
scale = exp((1/nrow(W.scale))*sum(log(1/diag(Q))))

Format the data. Don’t forget to include the covariates and the adjacency matrix:

n.LTLA <- dim(COVID19eco)[1]

# Format the data for NIMBLE in a list
COVIDdata = list(
                 O = COVID19eco$deaths,                        # observed nb of deaths
                 
                 # covariates
                 X = cbind(1,                                  # for the intercept
                           scale(COVID19eco$NO2),              # NO2
                           scale(COVID19eco$TtlICUB)[,1],      # ICU beds
                           COVID19eco$IMD_2,                   # IMD 2
                           COVID19eco$IMD_3,                   # IMD 3
                           COVID19eco$IMD_4,                   # IMD 4
                           COVID19eco$IMD_5)                   # IMD 5                        
                                  

)      
      
COVIDConsts <-list(      
                 N = n.LTLA,                                   # nb of LTLAs
                 
                 # adjacency matrix
                 L = length(nbWB_B$weights),                   # the number of neighboring areas
                 E = COVID19eco$expectd,                       # expected number of deaths
                 adj = nbWB_B$adj,                             # the elements of the neigh. matrix
                 num = nbWB_B$num,
                 weights = nbWB_B$weights, 
                 
                 scale = scale,                                # the scale for the covariance
                 K = 7,                                        # the total number of covariates 
                 sd.no2 = sd(COVID19eco$NO2), 
                 quin = FALSE
)

Create the initial values for ALL the unknown parameters. As usual, create two different chains.

inits <- list(
  list(
  # first chain 
  tau.b = 1,
  theta = rep(0.01, times = n.LTLA),
  phi = rep(0.01, times = n.LTLA), 
  rho = 0.6,
  beta = rep(0, 7)
),
#chain 2 
  list(
  tau.b = 3,      
  theta = rep(0.1, times = n.LTLA),
  phi = rep(0.1, times = n.LTLA), 
  rho = 0.9,
  beta = c(-2,-2, rep(1, 5))
  )
)

Specify the parameters to monitor

parameters = c("resSMR", "proba.resSMR", "SMR", "RR.beta[1]",
                 "RR.beta[2]", "RR.beta[3]", "RR.beta[4]", "RR.beta[5]", "RR.beta[6]",
                  "overallRR", "sigma2.b", "mu", "rho", "beta.no2.unscaled", "rr.no2.unscaled", "RR.beta1_5NO2")

Specify the MCMC setting.

ni <- 500000   # nb iterations 
nt <- 250       # thinning interval
nb <- 250000    # nb iterations as burn-in 
nc <- 2         # nb chains

Run the MCMC simulations calling nimbleMCMC:

t0<- Sys.time()
modelBYM.eco <- nimbleMCMC(code = BYMecoCode,
                      data = COVIDdata,
                      constants = COVIDConsts, 
                      inits = inits,
                      monitors = parameters,
                      niter = ni,
                      nburnin = nb,
                      thin = nt, 
                      nchains = nc, 
                      setSeed = 9, 
                      progressBar = TRUE, 
                      samplesAsCodaMCMC = TRUE, 
                      summary = TRUE, 
                      WAIC = TRUE
                      )
t1<- Sys.time()
t1 - t0

MCMC diagnostics

Gelman-Rubin

GR.diag <- gelman.diag(modelBYM.eco$samples, multivariate = FALSE)
all(GR.diag$psrf[,"Point est."] < 1.1)

## [1] FALSE

which(GR.diag$psrf[,"Point est."] > 1.1)

##   proba.resSMR[1]   proba.resSMR[6]  proba.resSMR[13]  proba.resSMR[31] 
##               642               647               654               672 
##  proba.resSMR[74] proba.resSMR[104] proba.resSMR[118] proba.resSMR[134] 
##               715               745               759               775 
## proba.resSMR[173] proba.resSMR[174] proba.resSMR[179] proba.resSMR[182] 
##               814               815               820               823 
## proba.resSMR[187] proba.resSMR[196] proba.resSMR[221] proba.resSMR[256] 
##               828               837               862               897 
## proba.resSMR[284] proba.resSMR[289] proba.resSMR[290] 
##               925               930               931

Traceplots for the regression coefficients

ggs_BYMeco <- ggs(modelBYM.eco$samples) 

ggs_BYMeco %>% filter(Parameter %in% c("RR.beta[1]",
                 "RR.beta[2]", "RR.beta[3]", "RR.beta[4]", 
                 "RR.beta[5]", "RR.beta[6]")) %>% 
  ggs_traceplot() + theme_bw()

ggs_BYMeco %>% filter(Parameter %in% c("beta.no2.unscaled",
                 "rr.no2.unscaled", "RR.beta1_5NO2")) %>% 
  ggs_traceplot() + theme_bw()

Running means for the regression coefficients:

ggs_BYMeco %>% filter(Parameter %in% c("RR.beta[1]",
                 "RR.beta[2]", "RR.beta[3]", "RR.beta[4]", 
                 "RR.beta[5]", "RR.beta[6]")) %>% 
  ggs_running() + theme_bw()

Results

Print summary statistics for the regression coefficients on the exp scale:

modelBYM.eco$summary$all.chains[c("rr.no2.unscaled",
                 "RR.beta[2]", "RR.beta[3]", "RR.beta[4]", 
                 "RR.beta[5]", "RR.beta[6]"),]
modelBYM.eco$summary$all.chains["rr.no2.unscaled",]

Check also the density of the posteriors of the regression coefficients on the exp scale:

ggs_BYMeco %>% filter(Parameter %in% c("RR.beta[1]",
                 "RR.beta[2]", "RR.beta[3]", "RR.beta[4]", 
                 "RR.beta[5]", "RR.beta[6]")) %>% 
  ggs_compare_partial() + theme_bw()

Plot the residual SMR together with exceedance

COVID19eco$resRR <- 
  modelBYM.eco$summary$all.chains[paste0("resSMR[", 1:n.LTLA, "]"), "Mean"]

COVID19eco$PostProb <- 
  modelBYM.eco$summary$all.chains[paste0("proba.resSMR[", 1:n.LTLA, "]"), "Mean"]

ggplot() + geom_sf(data = COVID19eco, aes(fill=resRR), col = NA) + ggtitle("Residual SMR") + 
           scale_fill_viridis_c() + theme_bw() -> p1

ggplot() + geom_sf(data = COVID19eco, aes(fill=PostProb), col = NA) + ggtitle("Pr(resSMR>1)") + 
           scale_fill_viridis_c() + theme_bw() -> p2
p1|p2

Interpretation

Extract the results in a nice table:

tab <- 
round(
modelBYM.eco$summary$all.chains[c("rr.no2.unscaled",
                 "RR.beta[2]", "RR.beta[3]", "RR.beta[4]", 
                 "RR.beta[5]", "RR.beta[6]",  
                 "sigma2.b", "rho"), c("Median", "95%CI_low", "95%CI_upp")], 
digits = 2
)

rownames(tab) <- c("NO$_2$", "ICUBeds", "IMD2", "IMD3", "IMD4", "IMD5", 
                   "$\\sigma^2_{b}$", "$\\rho$")

knitr::kable(tab, caption = "Median and 95%CrI for the covariate coefficients and hyperparameters of the models.") %>%  kable_styling(bootstrap_options = "striped", full_width = F, position = "center")

Median and 95%CrI for the covariate coefficients and hyperparameters of the models.
	Median	95%CI_low	95%CI_upp
NO\(_2\)	1.03	1.02	1.04
ICUBeds	0.96	0.93	0.99
IMD2	0.92	0.83	1.02
IMD3	0.85	0.77	0.94
IMD4	0.90	0.81	1.00
IMD5	0.85	0.76	0.95
\(\sigma^2_{b}\)	0.08	0.06	0.09
\(\rho\)	0.93	0.79	1.00

Based on the table above, we see that for every unit increase in the long term exposure to NO\(_2\), the COVID-19 mortality increases by 0.03% (95% CrI: 0.02% to 0.04%). Number of ICU beds per population seems to be protective, ie the more the beds in one particular LTLA the less likely is someone living in this LTLA to die from COVID-19: for every 1sd increase in the number of beds per population (it is sd because we scale the ICU beds before we put them in the model, in addition 1sd = 2 beds per 100 people). To quantify this, for every 2 beds per 100 people more, there is a -0.04% (95% CrI: -0.07% to -0.01%) decrease in the COVID-19 mortality. The only level of IMD that seems to be associated with COVID-19 mortality is the 5-th: the COVID19 mortality rate for people living in the least deprived areas is 0.85 (95%CrI: 0.76 to 0.95) times the one for people living the the most deprived areas.

With respect to the hyperparameters, we see that the unstructured spatial term has very low variance, thus we could consider exclude it from the model. The mixing parameter is 0.93, and although the uncertainty is high, we see that 93% of the total variation captured is attributed to the spatially structured term.

Alternative interpretation:

It would be of interest for some, to get the effect of air pollution to COVID-19 mortality for every \(k\) units of increase in NO\(_2\). That could be the case when we want results directly comparable with a specific analysis in the literature. To do so we add this line in the model: RR.beta2_kNO2 <- exp(beta[3] * k) and we remember to add it in the parameters to be monitored. To show this explicitly, say we want to find the effect for an increase of 5\(\mu g/m^3\) in the long term exposure to NO\(_2\). We then add RR.beta1_5NO2 <- exp(beta[3] * 5) this in the model and monitor it:

round(
modelBYM.eco$summary$all.chains[c("RR.beta1_5NO2"), c("Median", "95%CI_low", "95%CI_upp")], 
digits = 2
)

##    Median 95%CI_low 95%CI_upp 
##      1.15      1.09      1.22

More flexible NO2 fits

We will finish this tutorial by saying that the modeling approach we selected for far to model the effect of NO\(_2\) on COVID-19 related mortality is a bit restrictive, since we assumed that such an effect is linear. To relax this assumption we could consider several different things. In this tutorial, we will make the continuous variable NO\(_2\) categorical and we will see how valid it the assumption of linearity. Although this approach have some advantages, ie it straight-forward to do and the effect of some categories is easier to communicate plus is might be of interest in a particular application, there are other ways of allowing flexible fits (without categorizing), such as Gaussian processes and/or splines. This is the focus of another tutorial.

Select the categories for NO\(_2\). Lets use quintiles!

COVID19eco$NO2_quintiles <- cut(COVID19eco$NO2, 
                                breaks = quantile(COVID19eco$NO2, 
                                                  probs = seq(from = 0, to = 1, length.out = 6)), 
                                labels = 1:5, include.lowest = T)

Create the dummy variable as we did for the IMD

fastDummies::dummy_cols(COVID19eco$NO2_quintiles)  %>% 
  select(.data_1:.data_5) %>% 
  rename(NO2_1 = .data_1, NO2_2 = .data_2, NO2_3 = .data_3, NO2_4 = .data_4, NO2_5 = .data_5) -> 
  dummies
                                            
COVID19eco <- cbind(COVID19eco, dummies)

Modify the data

# Format the data for NIMBLE in a list
COVIDdata = list(
                 O = COVID19eco$deaths,                     # observed nb of deaths
                 
                 # covariates
                 X = cbind(1,                               # for the intercept
                           COVID19eco$NO2_2,                # NO2 2
                           COVID19eco$NO2_3,                # NO2 3
                           COVID19eco$NO2_4,                # NO2 4
                           COVID19eco$NO2_5,                # NO2 5
                           scale(COVID19eco$TtlICUB)[,1],   # ICU beds
                           COVID19eco$IMD_2,                # IMD 2
                           COVID19eco$IMD_3,                # IMD 3
                           COVID19eco$IMD_4,                # IMD 4
                           COVID19eco$IMD_5)                # IMD 5                        
                                  

)      
      
COVIDConsts <-list(      
                 N = n.LTLA,                                # nb of LTLAs
                 
                 # adjacency matrix
                 L = length(nbWB_B$weights),                # the number of neighboring areas
                 E = COVID19eco$expectd,                    # expected number of deaths
                 adj = nbWB_B$adj,                          # the elements of the neigh. matrix
                 num = nbWB_B$num,
                 weights = nbWB_B$weights, 
                 
                 scale = scale,                             # the scale for the covariance matrix
                 K = 10,                                    # the total number of covariates
                 quin = TRUE
)

Create the initial values for ALL the unknown parameters. As usual, create two different chains.

inits <- list(
  list(
  # first chain 
  tau.b = 1,
  theta = rep(0.01, times = n.LTLA),
  phi = rep(0.01, times = n.LTLA), 
  rho = 0.5,
  beta = rep(0, 10)
),
#chain 2 (different values from chain 1)
  list(    
  tau.b = 5,      
  theta = rep(0.1, times = n.LTLA),
  phi = rep(0.1, times = n.LTLA), 
  rho = 0.8,
  beta = c(rep(-2, 5), rep(1, 5))
  )
)

Specify the parameters to monitor (lets monitor only the covariates now)

parameters = paste0("RR.beta[", 1:9, "]")

Run the model

t0<- Sys.time()
modelBYM.eco_q <- nimbleMCMC(code = BYMecoCode,
                      data = COVIDdata,
                      constants = COVIDConsts, 
                      inits = inits,
                      monitors = parameters,
                      niter = ni,
                      nburnin = nb,
                      thin = nt, 
                      nchains = nc, 
                      setSeed = 9, 
                      progressBar = TRUE, 
                      samplesAsCodaMCMC = TRUE, 
                      summary = TRUE
                      )
t1<- Sys.time()
t1 - t0

# saveRDS(modelBYM.eco_q, file = "modelBYM.eco_q")
modelBYM.eco_q <- readRDS("modelBYM.eco_q")

Check convergence of NO2

ggs(modelBYM.eco_q$samples) %>% filter(Parameter %in% paste0("RR.beta[", 1:4, "]")) %>% 
                       ggs_traceplot() + theme_bw()

Extract results

rbind(
  c(1, NA, NA), 
  round(
modelBYM.eco_q$summary$all.chains[paste0("RR.beta[", 1:4, "]"), 
                                  c("Median", "95%CI_low", "95%CI_upp")], 
digits = 2
 )
) -> tab

rownames(tab) <- levels(cut(COVID19eco$NO2, 
                            breaks = quantile(COVID19eco$NO2, 
                                              probs = seq(from = 0, to = 1, length.out = 6)), 
                            include.lowest = T))

options(knitr.kable.NA = "-")
knitr::kable(tab, 
             caption = "Median and 95%CrI for the NO$_2$ coefficients based on quintiles.") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "center")

Median and 95%CrI for the NO\(_2\) coefficients based on quintiles.
	Median	95%CI_low	95%CI_upp
[4.48,10.7]	1.00
(10.7,13.1]	1.19	1.09	1.31
(13.1,15.4]	1.20	1.08	1.33
(15.4,18.4]	1.37	1.22	1.53
(18.4,46.4]	1.53	1.32	1.79

plotCI(y = tab[,1], 
       x = seq(from = 1, to = 3, length.out = 5), 
       ui = tab[,3], 
       li = tab[,2], 
       ylim = c(.5, 2), xlim = c(0.4, 3.5), 
       lwd = 2, type = "p", pch = 15, cex = 1,  las = 1, 
       ylab = "", main = "", xaxt = "n", xlab = "", cex.axis = 1, bty = "n"
)

axis(side = 1, at = seq(from = 1, to = 3, length.out = 5), 
     labels = levels(cut(COVID19eco$NO2, 
                         breaks = quantile(COVID19eco$NO2, 
                                           probs = seq(from = 0, to = 1, length.out = 6)), 
                         include.lowest = T)), 
     cex.axis = .8, las = 2)
grid()
abline(h = 1, col = "red", lty = "dashed", lwd = 1.5)
mtext(expression(bold("NO"[2])), side = 3, cex = 1.1, line = 0, font = 2)
mtext("Posterior Relative Risk", side = 2, cex = 1.1, line = 3,font = 1)

The linearity assumption looks reasonable.

Conclusion

In this practical we saw an example of ecological regression with NIMBLE. We found a positive association between long term averaged NO\(_2\) exposure and COVID-19 mortality after adjusting for total number of ICU beds per population and deprivation. The interpretation of the results are conditional on the convergence of the coefficients, and for some of them it seems that we should have put more effort and tune the MCMC setting in a better way to achieve convergence (for instance there seems to be insufficient mixing for the linear NO\(_2\) coefficient).

References

Besag, Julian, Jeremy York, and Annie Mollié. 1991. “Bayesian Image Restoration, with Two Applications in Spatial Statistics.” Annals of the Institute of Statistical Mathematics 43 (1): 1–20.

de Valpine, P., D. Turek, C. J. Paciorek, C. Anderson-Bergman, D. Temple Lang, and R. Bodik. 2017. “Programming with Models: Writing Statistical Algorithms for General Model Structures with NIMBLE.” Journal of Computational and Graphical Statistics 26: 403–17. https://doi.org/10.1080/10618600.2016.1172487.

Konstantinoudis, Garyfallos, Tullia Padellini, James E Bennett, Bethan Davies, Majid Ezzati, and Marta Blangiardo. 2020. “Long-Term Exposure to Air-Pollution and COVID-19 Mortality in England: A Hierarchical Spatial Analysis.” medRxiv.

Riebler, Andrea, Sigrunn H Sørbye, Daniel Simpson, and Håvard Rue. 2016. “An Intuitive Bayesian Spatial Model for Disease Mapping That Accounts for Scaling.” Statistical Methods in Medical Research 25 (4): 1145–65.

Ecological regression with NIMBLE in R: A case study examining the long term effect of NO\(_2\) exposure on COVID-19 mortality

Garyfallos Konstantinoudis