Length of hospital stay: Part II

Survival

Part 2: time-to-event analysis

Author

Chi Zhang

Published

October 2, 2024

This analysis is in preparation for interviews related to time-to-event analysis. Focus will be put on the procedure (and how to do it in R), as well as interpretation of the results.

The data liggetid was collected at the Geriatric Department at Ullevål Sykehus. Detailed description can be found in part 1 of the case study.

We will focus on the following variables:

los <- readRDS('data/los.rds')
head(los, 3)
  admission_year age    sex admission_from stroke los
1           1987  81 female           home      0  13
2           1987  96 female           home      0  17
3           1987  79 female           home      0   6

Simulate times and status

In this analysis we will do the following steps for processing. First we limit the analysis to data from 1985 onwards (not including).

Simulate censor time. We know the time for discharge is given as los - this is the true event time. In this example for hospital LOS, all patients are discharged (with final status 1). However for the purpose of learning, we randomly select 20% of the subjects and assume they are censored. The time for censored is fixed to 0.9 times their final LOS.

Simulate stroke time. stroke is a variable that is typically not measured at baseline, hence it should be treated as a time dependent variable. The time for stroke is fixed at 0.7 of their time (after censoring).

Show the code
suppressMessages(library(dplyr))
library(ggplot2)
suppressMessages(library(data.table))


los <- readRDS('data/los.rds')
# table(dlos$admission_from)

# to increase readability, process
data.table::setDT(los)

# code a status: 1 indicates released. all of them
los$discharge <- 1

# take those from 1985 onwards
los <- los[admission_year > 1985]

# we know the true time (los)
# now randomly select subjects (20%), make a new time, then mark censor

set.seed(1)
n <- nrow(los)
id_time <- sample(1:n, size = 0.2*n, replace = F)

los[, time := los]
los[id_time, time := round(0.9*los, digits = 0)]

# make new status
los[, status := 1]
los[id_time, status := 0] # censored

# for those with stroke, simulate a time
# if no stroke, the time is same as overall time
los[stroke == 1, time_stroke := round(0.7*time, digits = 0)]
los[stroke == 0, time_stroke := time]

head(los, 3)
   admission_year   age    sex admission_from stroke   los discharge  time
            <num> <num> <char>         <char> <char> <num>     <num> <num>
1:           1987    81 female           home      0    13         1    13
2:           1987    96 female           home      0    17         1    17
3:           1987    79 female           home      0     6         1     6
   status time_stroke
    <num>       <num>
1:      1          13
2:      1          17
3:      1           6

Kaplan-Meier analysis

Create KM curve

library(survival)
library(ggsurvfit)

# from ggsurvfit pkg. using surfit also works
ss <- survfit2(Surv(time, status) ~1, data = los)

ggsurvfit(ss) + 
  labs(x = 'Days', 
       y = 'Overall survival prob') + 
  add_confidence_interval() + 
  add_censor_mark() + 
  add_risktable()

x-day survival probability

# beyond 100 days
summary(survfit(Surv(time, status)~1, data = los), times = 100)
Call: survfit(formula = Surv(time, status) ~ 1, data = los)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  100     38     290    0.152  0.0212        0.116          0.2
nrow(los[time>=100]) # n at risk (still alive)
[1] 38
# nrow(los[time<=100])  # this includes censored, 363+1
nrow(los[status == 1 & time <= 100]) # n event (already happened)
[1] 290
# n at risk: 38; n event 290, p = 0.15

401 total, 38 at risk, within the 364 (-1), 290 died

Naive but incorrect way of calculating: 1-290/401 = 27.7%. This over-estimates how many are alive, as (364-290) = 74 are censored and we do not know the outcome.

Median survival time

Median survival time means half of the subjects have events happen before, and half after. It corresponds to the time where survival probability is 0.5.

The median survival time is higher than using median(time_to_event) on those with outcome. It is intuitive because with censored data there are time periods that are not observed, which are ignored by median().

survfit(Surv(time, status) ~ 1, data = los)
Call: survfit(formula = Surv(time, status) ~ 1, data = los)

       n events median 0.95LCL 0.95UCL
[1,] 401    321     35      28      41
median(los[status == 1, time]) # only compute time for uncensored
[1] 28

Compare with the complete case when there’s no censoring. We use discharge as the outcome status, all subjects have 1. The time variable is los which is the true time. Now these two should be consistent.

survfit(Surv(los, discharge) ~ 1, data = los)
Call: survfit(formula = Surv(los, discharge) ~ 1, data = los)

       n events median 0.95LCL 0.95UCL
[1,] 401    401     28      26      33
median(los$los)
[1] 28
ggsurvfit(ss) + geom_vline(xintercept = c(100, 35), col = 'red')

  • over-estimated (higher) survival probability calculated as (1 - dead/n), since the numerator does not include censored subjects whose outcome is unknown.
  • under-estimated (lower) median survival time, since their time before event will prolong the survival time.

Cox proportional hazards model

Time varying covariate