Intervals

Inference

Confidence, credible and prediction intervals

Author

Chi Zhang

Published

February 24, 2024

Key difference: confidence and credible intervals are about the (unknown) parameter; a prediction interval is about individual (unseen) observations.

Confidence vs credible interval

Confidence interval (frequentist): about the unknown but fixed parameter. The parameter is not treated as a random variable, so it has no probability distribution; the CI itself is random, because it is computed from your sample.

Credible interval (Bayesian): derived from the posterior distribution of the parameter. The parameter is treated as a random variable and hence has a probability distribution.
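As a minimal sketch of the distinction (simulated data; the flat-prior Bayesian reading is an assumption for illustration), the two intervals can coincide numerically while their interpretations differ:

```r
set.seed(1)
x <- rnorm(30, mean = 5, sd = 2)

# Frequentist 95% CI: the interval is random, the parameter is fixed;
# "95%" refers to coverage over repeated samples
t.test(x)$conf.int

# With a flat prior, the 95% credible interval works out to the same
# numbers, but is read as quantiles of the parameter's posterior
mean(x) + qt(c(0.025, 0.975), df = 29) * sd(x) / sqrt(30)
```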

Prediction interval

In a regression model, you might want to know both confidence and prediction intervals.

  • CI for the mean value of \(y\) at \(x = x_{new}\): the mean response (e.g. average GDP growth) is a parameter, an average
  • PI for \(y\) at \(x = x_{new}\): an individual observation
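In R this distinction maps directly onto the interval argument of predict() (toy data simulated here for illustration):

```r
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 2 + 0.5 * toy$x + rnorm(20)
fit <- lm(y ~ x, data = toy)

newx <- data.frame(x = 10)
predict(fit, newdata = newx, interval = "confidence")  # narrower: mean response
predict(fit, newdata = newx, interval = "prediction")  # wider: a single new y
```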

In simple linear regression

The standard deviation of the linear predictor \(\alpha + \beta x\) is

\(\hat{\sigma}_{linpred} = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum(x_i - \bar{x})^2}}\)

Confidence interval

\(\hat{y}_{new} \pm t_{1-\frac{\alpha}{2}, n-2} \times \sqrt{\hat{\sigma}^2 (\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum(x_i - \bar{x})^2})}\)

The standard deviation of the predicted value \(\alpha + \beta x + \epsilon\) is

\(\hat{\sigma}_{prediction} = \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum(x_i - \bar{x})^2}}\)

Prediction interval (frequentist)

\(\hat{y}_{new} \pm t_{1-\frac{\alpha}{2}, n-2} \times \sqrt{\hat{\sigma}^2 (1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum(x_i - \bar{x})^2})}\)

The prediction interval is always wider than the confidence interval: the extra \(1\) under the square root accounts for the irreducible noise \(\epsilon\) in a single new observation.

Posterior predictive distribution

Difference between posterior distribution and posterior predictive distribution PPD

  • posterior distribution \(p(\theta|x) = c \times p(x|\theta)p(\theta)\): depends on the parameter \(\theta\)
  • PPD: does not depend on \(\theta\), since the parameter is integrated out; for an unobserved \(x^*\),

\(p(x^*|x) = \int_{\Theta} p(x^*, \theta|x) \, d\theta = \int_{\Theta} p(x^*|\theta)p(\theta|x) \, d\theta\)

The posterior distribution (PD) is a building block of the PPD:

  • the PD describes the unknown parameter (treated as a random variable), conditional on the evidence observed (data);
  • the PPD is the distribution of future data, given the data you have already seen.
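To make the integral concrete, here is a small Monte Carlo sketch under a conjugate normal model with known \(\sigma = 1\) (the prior and data values are made up for illustration): drawing \(\theta\) from the posterior and then \(x^*\) given each draw integrates the parameter out.

```r
set.seed(1)
x <- rnorm(20, mean = 3, sd = 1)            # observed data

# posterior of theta under a N(0, 10^2) prior, with sigma = 1 known
post_var  <- 1 / (1/100 + length(x))
post_mean <- post_var * sum(x)

theta_draws <- rnorm(5000, post_mean, sqrt(post_var))   # posterior draws
x_star      <- rnorm(5000, theta_draws, 1)              # PPD draws

quantile(theta_draws, c(0.025, 0.975))  # interval for the parameter (narrow)
quantile(x_star, c(0.025, 0.975))       # predictive interval (wider)
```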

Example: Mortality

We explore all-cause mortality counts in Norway from 2000 to 2023. More specifically, we use the years 2000-2019 (pre-pandemic) to fit a simple linear regression model, then predict the expected number of deaths during the pandemic years (2020-2023). I include both the frequentist approach with lm and the Bayesian approach with rstanarm::stan_glm.

The data of mortality example comes from Statistics Norway and is publicly available.

library(ggplot2)
library(patchwork)
library(data.table)
# load data 
mortality <- readRDS("data/mortality_2000_2023.rds")

# select the 0-59 age group
d <- mortality[age == '000_059']
d
     year deaths_n     age pop_jan1_n deaths_vs_pop_per_100k
    <num>    <num>  <char>      <num>                  <num>
 1:  2000     5427 000_059    3613275              150.19615
 2:  2001     5289 000_059    3636329              145.44889
 3:  2002     5303 000_059    3656613              145.02492
 4:  2003     5144 000_059    3679479              139.80240
 5:  2004     5069 000_059    3694134              137.21755
 6:  2005     4957 000_059    3705027              133.79120
 7:  2006     4773 000_059    3718828              128.34689
 8:  2007     4605 000_059    3733477              123.34347
 9:  2008     4582 000_059    3766509              121.65111
10:  2009     4666 000_059    3806779              122.57081
11:  2010     4557 000_059    3844344              118.53778
12:  2011     4588 000_059    3885654              118.07536
13:  2012     4278 000_059    3931709              108.80765
14:  2013     4380 000_059    3976015              110.16055
15:  2014     4156 000_059    4011213              103.60956
16:  2015     3988 000_059    4044006               98.61509
17:  2016     4004 000_059    4067801               98.43156
18:  2017     3812 000_059    4086278               93.28783
19:  2018     3826 000_059    4099015               93.33950
20:  2019     3749 000_059    4105628               91.31368
21:  2020     3711 000_059    4118541               90.10472
22:  2021     3645 000_059    4115975               88.55739
23:  2022     3910 000_059    4124531               94.79866
24:  2023     3896 000_059    4161429               93.62169
     year deaths_n     age pop_jan1_n deaths_vs_pop_per_100k

Explore the data

q1 <- ggplot(d, aes(x = year, y = deaths_n))
q1 <- q1 + geom_line()
q1 <- q1 + geom_point(size = 2)
q1 <- q1 + geom_vline(xintercept = 2019.5, color = "red", lty = 2)
q1 <- q1 + theme_bw()
q1 <- q1 + scale_x_continuous(breaks = seq(2000, 2023, 2))
q1 <- q1 + labs(
  x = 'Year', 
  y = 'Number of deaths', 
  title = 'Number of deaths in Norway \n2000 - 2023'
)
q1 <- q1 + theme(
  axis.text = element_text(size = 11),
  axis.title = element_text(size = 12), 
  plot.title = element_text(size = 12), 
  axis.text.x = element_text(angle = 45, vjust = 0.5) 
)
# q1

# per 100k
q2 <- ggplot(d, aes(x = year, y = deaths_vs_pop_per_100k))
q2 <- q2 + geom_line()
q2 <- q2 + geom_point(size = 2)
q2 <- q2 + geom_vline(xintercept = 2019.5, color = "red", lty = 2)
q2 <- q2 + theme_bw()
q2 <- q2 + scale_x_continuous(breaks = seq(2000, 2023, 2))
q2 <- q2 + labs(
  x = 'Year', 
  y = 'Deaths per 100k population', 
  title = 'Number of deaths in Norway \n(per 100k population) 2000 - 2023'
)
q2 <- q2 + theme(
  axis.text = element_text(size = 11),
  axis.title = element_text(size = 12), 
  plot.title = element_text(size = 12), 
  axis.text.x = element_text(angle = 45, vjust = 0.5) 
)
# plot side by side
q1 + q2

Model mortality using 2000-2019 data

# take pre-pandemic (2000-2019) data
dt <- d[year <= 2019, .(year, deaths_vs_pop_per_100k)]

# prediction 
dnew <- data.frame(year = c(2020, 2021, 2022, 2023))

Linear regression with lm

m_linear <- lm(deaths_vs_pop_per_100k ~ year, 
               data = dt)

# summary(m_linear)

# produce two intervals
pred_freq_pi <- predict(m_linear, newdata = dnew, interval = 'prediction')
pred_freq_ci <- predict(m_linear, newdata = dnew, interval = 'confidence')

Verify from formula

# verify with formula
n <- nrow(dt)

# option 1
fitted_val <- m_linear$fitted.values
mse <- sum((dt$deaths_vs_pop_per_100k - fitted_val)^2)/(n-2)
mse
[1] 4.383279
# option 2
sum((m_linear$residuals)^2)/(n-2)
[1] 4.383279
# option 3
summary(m_linear)$sigma^2
[1] 4.383279
# option 4
dvmisc::get_mse(m_linear)
[1] 4.383279
# t-val
tval <- qt(p=0.975, df=n-2)
mean_x <- mean(dt$year)

# sum of (x_i - xbar)^2
ssx <- sum((dt$year - mean_x)^2)

sd_confint <- sqrt(mse * (1/n + ((dnew$year - mean_x)^2)/ssx))
sd_predint <- sqrt(mse * (1 + 1/n + ((dnew$year - mean_x)^2)/ssx))


# point prediction
b0 <- coef(m_linear)[1]
b <- coef(m_linear)[2]
prednew <- b0 + b*dnew$year
prednew
[1] 85.68773 82.50765 79.32757 76.14749
# prediction interval
predint_linear <- data.frame(fit = prednew, 
                             lwr = prednew - tval*sd_predint, 
                             upr = prednew + tval*sd_predint)

predint_linear
       fit      lwr      upr
1 85.68773 80.83777 90.53770
2 82.50765 77.59214 87.42316
3 79.32757 74.34154 84.31360
4 76.14749 71.08617 81.20880
pred_freq_pi # compare with result from lm
       fit      lwr      upr
1 85.68773 80.83777 90.53770
2 82.50765 77.59214 87.42316
3 79.32757 74.34154 84.31360
4 76.14749 71.08617 81.20880
# confidence interval
confint_linear <- data.frame(fit = prednew, 
                             lwr = prednew - tval*sd_confint, 
                             upr = prednew + tval*sd_confint)

confint_linear
       fit      lwr      upr
1 85.68773 83.64447 87.73100
2 82.50765 80.31334 84.70196
3 79.32757 76.97954 81.67560
4 76.14749 73.64355 78.65142
pred_freq_ci # compare with result from lm
       fit      lwr      upr
1 85.68773 83.64447 87.73100
2 82.50765 80.31334 84.70196
3 79.32757 76.97954 81.67560
4 76.14749 73.64355 78.65142

Linear regression with rstanarm

m_bayes <- rstanarm::stan_glm(
  deaths_vs_pop_per_100k ~ year, 
  data = dt, 
  family = gaussian,
  iter = 2000,
  chains = 8,
  refresh = 0
)

m_bayes 
stan_glm
 family:       gaussian [identity]
 formula:      deaths_vs_pop_per_100k ~ year
 observations: 20
 predictors:   2
------
            Median MAD_SD
(Intercept) 6510.0  169.5
year          -3.2    0.1

Auxiliary parameter(s):
      Median MAD_SD
sigma 2.2    0.4   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
sims <- as.matrix(m_bayes)
post_median <- apply(sims, 2, median)
post_median
# median absolute deviation (a robust analogue of sd)
post_mad_sd <- apply(sims, 2, mad)
post_mad_sd

Credible interval

# credible interval about the fit
cred <- rstanarm::posterior_interval(m_bayes, prob = 0.95)
cred
                   2.5%       97.5%
(Intercept) 6147.110403 6852.327354
year          -3.350749   -2.999700
sigma          1.614450    3.220209
# equivalent to
sims <- as.matrix(m_bayes)
apply(sims, 2, quantile, probs = c(0.025, 0.5, 0.975))
       parameters
        (Intercept)      year    sigma
  2.5%     6147.110 -3.350749 1.614450
  50%      6510.045 -3.180305 2.194318
  97.5%    6852.327 -2.999700 3.220209

Point prediction

# point prediction from the fitted model
y_point_pred <- predict(m_bayes, newdata = dnew)
y_point_pred
       1        2        3        4 
85.69091 82.51177 79.33263 76.15349 
a_hat <- coef(m_bayes)[1] # median
b_hat <- coef(m_bayes)[2] # median
a_hat + b_hat*dnew$year
[1] 85.82895 82.64865 79.46834 76.28804

Uncertainty of linear predictor

The uncertainty in the linear predictor \(a + bx\) is propagated from the uncertainty in \(a\) and \(b\). For now the error term is not included.

rstanarm::posterior_linpred is equivalent to computing the point prediction with each pair of \(a, b\) draws from the posterior distribution.

# linear predictor with uncertainty via a's and b's
y_linpred <- rstanarm::posterior_linpred(m_bayes, newdata = dnew)
head(y_linpred)
          
iterations        1        2        3        4
      [1,] 86.76040 83.68034 80.60028 77.52022
      [2,] 86.33213 83.24274 80.15334 77.06394
      [3,] 86.24299 83.12216 80.00133 76.88050
      [4,] 85.26344 82.07579 78.88813 75.70048
      [5,] 85.49074 82.23835 78.98596 75.73357
      [6,] 85.28163 82.01827 78.75492 75.49156
# focus on one year: 2023
hist(y_linpred[, 4], main = 'Predictions for 2023')

head(y_linpred[, 4])
[1] 77.52022 77.06394 76.88050 75.70048 75.73357 75.49156
y_linpred_byhand <- sims[,1] + dnew$year[4] * sims[,2]
y_linpred_byhand[1:6]
[1] 77.52022 77.06394 76.88050 75.70048 75.73357 75.49156

Posterior predictive distribution

The PPD includes the additional uncertainty from \(\sigma\). It is equivalent to taking the linear predictor from above and adding a random draw from rnorm(1, 0, sigma).

Because of the randomness in the error draws we cannot reproduce the exact same results, but they should be close.

# posterior predictive dist (with sigma)
y_postpred <- rstanarm::posterior_predict(m_bayes, newdata = dnew)
head(y_postpred)
            1        2        3        4
[1,] 85.38304 80.76660 80.77002 78.29299
[2,] 83.58124 83.70593 83.34616 76.07834
[3,] 86.75044 85.02622 78.82639 76.81196
[4,] 86.80538 78.61904 81.45173 78.71479
[5,] 83.38782 82.76790 79.21554 75.24868
[6,] 90.54799 80.21181 83.45230 75.14096
# by hand
# focus on one year: 2023
n_sim <- nrow(sims)
y_postpred_byhand <- y_linpred_byhand + rnorm(n_sim, 0, sims[,3])
par(mfrow = c(1,2))
hist(y_postpred_byhand, main = 'PPD for 2023 (linpred + error)')
hist(y_postpred[,4], main = 'PPD for 2023 (post_predict)')

y_postpred[,4] |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  66.36   74.44   76.12   76.15   77.85   87.16 
y_postpred_byhand |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  65.10   74.51   76.23   76.21   77.89   87.72 

Prediction for 2020-2023 mortality

Here we choose to use rstanarm::posterior_predict().

pred_all <- rstanarm::posterior_predict(m_bayes, newdata = d)

pred_all_quantile <- apply(pred_all, 2, quantile, 
                    probs = c(0.025, 0.5, 0.975)) |> 
  t()  |> 
  as.data.frame()
pred_all_quantile[21:24,]
       2.5%      50%    97.5%
21 80.76780 85.74303 90.58276
22 77.46361 82.48300 87.47500
23 74.24409 79.29768 84.40856
24 70.99240 76.20945 81.30611

We can compare with the frequentist prediction interval, and we see that they are very close.

predint_linear
       fit      lwr      upr
1 85.68773 80.83777 90.53770
2 82.50765 77.59214 87.42316
3 79.32757 74.34154 84.31360
4 76.14749 71.08617 81.20880

Visualize mortality prediction

pd <- copy(d)
# head(pd)
pd <- cbind(pd, pred_all_quantile)

setnames(pd, old = '2.5%', new = 'exp_death_per100k_025')
setnames(pd, old = '50%', new = 'exp_death_per100k_50')
setnames(pd, old = '97.5%', new = 'exp_death_per100k_975')

# also compute the expected number of deaths overall

pd[, exp_death_025 := exp_death_per100k_025 * pop_jan1_n/100000]
pd[, exp_death_50 := exp_death_per100k_50 * pop_jan1_n/100000]
pd[, exp_death_975 := exp_death_per100k_975 * pop_jan1_n/100000]

pd[, alert := fcase(
  deaths_vs_pop_per_100k > exp_death_per100k_975, "Higher than expected",
  default = "Expected"
)]

pd[, type := fcase(
  year <= 2019, paste0("Baseline (2000-2019)"),
  default = "Pandemic years (2020-2023)"
)]


# make plot
q <- ggplot(pd, aes(x = year))
q <- q + geom_ribbon(mapping = aes(ymin = exp_death_per100k_025, 
                                   ymax = exp_death_per100k_975), 
                     alpha = 0.3)
q <- q + geom_line(mapping = aes(y = exp_death_per100k_50, lty = type), linewidth = 1)
q <- q + geom_point(mapping = aes(y = deaths_vs_pop_per_100k, color = alert), size = 3)
q <- q + geom_vline(xintercept = 2019.5, color = "red", lty = 2)
q <- q + expand_limits(y=0)
q <- q + scale_y_continuous("Number of deaths per 100k", expand = expansion(mult = c(0, 0.1)))
q <- q + scale_x_continuous("Year", breaks = seq(2000, 2023, 2))
q <- q + scale_linetype_discrete(NULL)
q <- q + scale_color_brewer(NULL, palette = "Set1", direction = -1)
q <- q + theme_bw()
q <- q + theme(
  axis.text = element_text(size = 11),
  axis.title = element_text(size = 12), 
  plot.title = element_text(size = 15), 
  axis.text.x = element_text(angle = 45, vjust = 0.5) 
)
q <- q + theme(legend.position = "bottom", legend.direction = "horizontal")
q <- q + theme(legend.box = "horizontal", legend.margin = margin(2, 2, 2, 2))
q
