Survey, stratification

RWD

Survey sampling

Author

Chi Zhang

Published

April 23, 2024

Survey sampling differs somewhat from a randomized controlled trial (RCT).

Terms

Weights

Probability weight: the inverse of the probability of being included in the sample.

  • N/n under simple random sampling
  • N: number of elements in the population; n: number of elements in the sample
  • if the population has 10 elements and 3 are sampled at random, the probability weight is 10/3 = 3.33
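
The arithmetic above can be sketched in a few lines of R. This is a minimal illustration that assumes simple random sampling; the variable names `incl_prob` and `prob_weight` are made up for this example.

```r
# Toy population of 10 elements, 3 drawn by simple random sampling
N <- 10
n <- 3

incl_prob <- n / N             # probability that any one element is sampled
prob_weight <- 1 / incl_prob   # probability weight = N / n

round(prob_weight, 2)
#> [1] 3.33

# Each sampled element "stands for" N/n population elements,
# so the weights of the sampled elements add up to the population size
sum(rep(prob_weight, n))
#> [1] 10
```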

Sampling weight: a probability weight that may include further corrections (e.g. for unit non-response, calibration, or trimming).

Strata

Stratification breaks up the population into groups. Each element in the population must belong to exactly one stratum.

Typically you need two or more PSUs in each stratum.

Purpose of stratification: reduce the standard error of the estimates.

PSU: Primary sampling unit
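
The claim that stratification reduces the standard error can be checked with a small simulation. The sketch below is only an illustration: the population, the two strata `a` and `b`, and the helper functions `srs_mean()` and `strat_mean()` are all hypothetical, and the strata are assumed to be equally sized with very different means.

```r
set.seed(1)

# Hypothetical population: two equally sized strata with different means
pop <- data.frame(
  stratum = rep(c("a", "b"), each = 500),
  y = c(rnorm(500, mean = 100, sd = 10), rnorm(500, mean = 200, sd = 10)),
  stringsAsFactors = FALSE
)

# Mean of a simple random sample of size n drawn from the whole population
srs_mean <- function(n) mean(sample(pop$y, n))

# Mean of a stratified sample: n/2 elements per stratum,
# combined with the population shares (0.5 and 0.5)
strat_mean <- function(n) {
  ya <- sample(pop$y[pop$stratum == "a"], n / 2)
  yb <- sample(pop$y[pop$stratum == "b"], n / 2)
  0.5 * mean(ya) + 0.5 * mean(yb)
}

# Empirical standard error = spread of the estimate over repeated samples
se_srs <- sd(replicate(2000, srs_mean(20)))
se_strat <- sd(replicate(2000, strat_mean(20)))
c(srs = se_srs, stratified = se_strat)
```

With strata this different, the stratified estimate should vary far less between repeated samples, because the between-stratum variation no longer contributes to the sampling error.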

Post-stratification

Stratify the data AFTER it is collected, so that the weighted sample is representative of the target population.

Examples

  • male: n = 20, mean \(\bar{y}_1 = 180\)
  • female: n = 80, mean \(\bar{y}_2 = 120\)
  • overall sample mean: 132, from \((20 \times 180 + 80 \times 120)/100\)

This underestimates the population mean because females are over-represented in the sample.

Adjustment:

  • in the population, the proportions are 0.5 and 0.5
  • the adjusted mean is \(\bar{y}_{st} = 0.5 \times 180 + 0.5 \times 120 = 150\)

This is the post-stratification estimator.
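
The same calculation in R, as a minimal sketch. The data frame `samp` and its column names are made up for this illustration; the numbers are the ones from the example above.

```r
# Group sizes and group means observed in the sample,
# plus the known population share of each group
samp <- data.frame(
  group = c("male", "female"),
  n = c(20, 80),
  ybar = c(180, 120),
  pop_share = c(0.5, 0.5),
  stringsAsFactors = FALSE
)

# Unadjusted mean: groups are weighted by their share of the sample
unadjusted <- sum(samp$n * samp$ybar) / sum(samp$n)

# Post-stratified mean: groups are weighted by their share of the population
post_strat <- sum(samp$pop_share * samp$ybar)

c(unadjusted = unadjusted, post_stratified = post_strat)
#>      unadjusted post_stratified 
#>             132             150 
```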

Calibration

Adjust the inverse probability weights so that the weighted sample matches known totals in the target population.
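
A minimal sketch of the idea, assuming a single calibration variable (sex) whose population totals are known. The sample, the base weights and the totals in `pop_total` are all invented for this illustration; real calibration usually involves several variables and a dedicated package.

```r
# Hypothetical sample of 5 respondents, each with a base (design) weight of 10
samp <- data.frame(
  sex = c("male", "male", "female", "female", "female"),
  base_w = c(10, 10, 10, 10, 10),
  stringsAsFactors = FALSE
)

# Assumed known population totals
pop_total <- c(male = 40, female = 40)

# Weighted totals implied by the base weights: female 30, male 20
weighted_total <- sapply(split(samp$base_w, samp$sex), sum)

# Adjustment factor per group, then calibrated weight per respondent
adj <- pop_total[names(weighted_total)] / weighted_total
samp$cal_w <- samp$base_w * adj[samp$sex]

# After calibration the weighted totals match the population totals
sapply(split(samp$cal_w, samp$sex), sum)
#> female   male 
#>     40     40 
```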

Horvitz-Thompson estimator

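A method to estimate a population total or mean from a sample with unequal inclusion probabilities, such as a stratified sample. Each observation is weighted by the inverse of its inclusion probability (IPW) to account for the difference between the distribution of the collected data and that of the target population.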
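
As a rough sketch, the estimator can be written directly in base R. The stratified sample below is hypothetical: two strata of 50 elements each, with 2 and 4 elements sampled, and the column names are chosen only for this example.

```r
# Hypothetical stratified sample: N_h = stratum size, n_h = number sampled
samp <- data.frame(
  y = c(180, 170, 120, 130, 125, 115),
  stratum = c("a", "a", "b", "b", "b", "b"),
  N_h = c(50, 50, 50, 50, 50, 50),
  n_h = c(2, 2, 4, 4, 4, 4),
  stringsAsFactors = FALSE
)

# Inclusion probability of each sampled element within its stratum
samp$incl_prob <- samp$n_h / samp$N_h

N <- 100  # population size (sum of the stratum sizes)

# Horvitz-Thompson estimate of the population total and mean
ht_total <- sum(samp$y / samp$incl_prob)
ht_mean <- ht_total / N

c(unweighted = mean(samp$y), horvitz_thompson = ht_mean)
#>       unweighted horvitz_thompson 
#>           140.00           148.75 
```

The unweighted mean is pulled toward stratum `b` because twice as many elements were sampled there; weighting by the inverse inclusion probability restores the equal population shares of the two strata.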

Inverse probability weighting

A crude example of how weighting affects the mean estimate:

df <- data.frame(
  v = c(100, 100, 200),                   # observed values
  w = c(1, 1, 1),                         # equal weights
  popw = c(0.35, 0.5, 0.15),              # assumed population proportions
  invw = c(1/0.35, 1/0.5, 1/0.15),        # inverse of the population proportions
  w2 = c(0.3, 0.2, 0.5),                  # observed proportions
  invw2 = c(0.3/0.35, 0.2/0.5, 0.5/0.15)  # observed / population proportions
)

df
#>     v w popw     invw  w2     invw2
#> 1 100 1 0.35 2.857143 0.3 0.8571429
#> 2 100 1 0.50 2.000000 0.2 0.4000000
#> 3 200 1 0.15 6.666667 0.5 3.3333333

# crude (unweighted) mean
crude_mean <- mean(df$v)
crude_mean
#> [1] 133.3333

# equal weights give the same result as the crude mean
weighted.mean(df$v, w = df$w)
#> [1] 133.3333

# weight by the inverse of the population proportions
weighted.mean(df$v, w = df$invw)
#> [1] 157.8512

# the same calculation written out by hand
sum(df$v / df$popw) / sum(1 / df$popw)
#> [1] 157.8512

# weight by the observed proportions
weighted.mean(df$v, w = df$w2)
#> [1] 150

# weight by the ratio of observed to population proportions
weighted.mean(df$v, w = df$invw2)
#> [1] 172.6141