Missing data and imputation

Missing data

Overview of multiple imputation

Author

Chi Zhang

Published

February 14, 2024

Overview

Missing data can occur in various situations:

  • unit non-response: individuals decline to participate
  • item non-response: some questions or measurements left out by participating individuals
  • loss to follow-up: dropouts

For unit non-response and loss to follow-up, weighting the data could help. This requires knowing the response rate, the characteristics of the non-responders, and how the respondents differ from them.

Solutions (to item non-response)

  • complete case analysis (listwise deletion)
  • single imputation
    • mean imputation
    • conditional imputation: make a prediction, plus or minus some noise
  • multiple imputation: repeat single imputation M times, produce M complete datasets, use Rubin’s rules to pool the estimates.
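The pooling step can be sketched as follows. This is a minimal illustration of Rubin's rules for a single scalar estimate, not the implementation of any particular package; the function name pool_rubin and the input numbers are made up for the example.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate: average over the M datasets
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor M - 1)
    t = w + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, t

# Hypothetical estimates of one coefficient from M = 3 imputed datasets
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]
q_bar, total_var = pool_rubin(estimates, variances)
print(q_bar, total_var)
```

The total variance is larger than the average within-imputation variance whenever the M estimates disagree, which is how the uncertainty due to missing data enters the pooled standard error.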

MCAR, MAR, MNAR

Three categories of missingness mechanism. Example: the variables are disease status, level of exposure and age, and exposure is missing for some subjects.

  • MCAR (missing completely at random): any two individuals have the same probability of having a missing value for exposure. This is unlikely.
  • MAR (missing at random): any two individuals with the same disease status and age have the same chance of having the exposure missing
  • MNAR (missing not at random): the chance of having missing exposure depends on the value of the exposure itself
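To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the function names, the probabilities, the age cut-off of 60 and the exposure threshold of 1.0 are invented, and the data are simulated.

```python
import random

def p_missing_mcar(disease, age, exposure):
    # MCAR: the same probability for every individual
    return 0.2

def p_missing_mar(disease, age, exposure):
    # MAR: depends only on the fully observed variables (disease status, age)
    return 0.1 + 0.2 * disease + (0.1 if age >= 60 else 0.0)

def p_missing_mnar(disease, age, exposure):
    # MNAR: depends on the value of the exposure itself
    return 0.5 if exposure > 1.0 else 0.1

def apply_missingness(rows, p_missing, rng):
    """Return copies of the rows with exposure set to None according to p_missing."""
    out = []
    for disease, age, exposure in rows:
        if rng.random() < p_missing(disease, age, exposure):
            exposure = None
        out.append((disease, age, exposure))
    return out

rng = random.Random(1)
# Simulated rows: (disease status 0/1, age in years, exposure level)
rows = [(rng.randint(0, 1), rng.randint(30, 80), rng.gauss(0, 1)) for _ in range(1000)]

for name, mechanism in [("MCAR", p_missing_mcar), ("MAR", p_missing_mar), ("MNAR", p_missing_mnar)]:
    data = apply_missingness(rows, mechanism, rng)
    fraction = sum(row[2] is None for row in data) / len(data)
    print(name, "fraction of exposure missing:", round(fraction, 3))
```

Only the MNAR function looks at the exposure argument, which is exactly the part of the data that would be unavailable when trying to model the missingness.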

Data could also have special missing structures: multi-level, longitudinal and repeated measurements. In RCTs, a specific technique, reference-based imputation, can be applied.

Consideration

Need to account for the missing data process, and preserve the relations in the data and the uncertainty in those relations.

  • whether the MAR assumption is plausible (FCS can handle both MAR and MNAR)
  • form of imputation model: structure and error distribution
  • predictors, as many relevant as possible, including interactions
  • the order in which to impute
  • set up the starting imputations and the number of iterations
  • decide the number of imputed datasets

Univariate imputation

Assume only missing in one (continuous) variable.

  • predict from (linear) regression. No uncertainty
  • predict + noise. Assume normality, add a random value drawn from N(0, sd), where sd is the residual standard deviation
  • predict + noise + parameter uncertainty. Bayesian method, bootstrap (resamples observed data, re-estimate parameters)
  • a second predictor (or more)
  • draw from observed data - predictive mean matching
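The first three options can be sketched with a single predictor. This is a minimal illustration under assumptions: simple linear regression fitted by least squares, simulated data, and helper names (fit_ols, impute_predict, impute_predict_noise, impute_bootstrap) invented for the example. The bootstrap is used here as one way to reflect parameter uncertainty; a Bayesian draw of the parameters would be an alternative.

```python
import random
from statistics import mean

def fit_ols(x, y):
    """Least-squares intercept, slope and residual sd for y = a + b * x + error."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sd = (rss / (len(x) - 2)) ** 0.5
    return a, b, sd

def impute_predict(x_obs, y_obs, x_mis):
    # Option 1: plain regression prediction, no uncertainty
    a, b, _ = fit_ols(x_obs, y_obs)
    return [a + b * xi for xi in x_mis]

def impute_predict_noise(x_obs, y_obs, x_mis, rng):
    # Option 2: prediction plus a normal residual drawn from N(0, sd)
    a, b, sd = fit_ols(x_obs, y_obs)
    return [a + b * xi + rng.gauss(0, sd) for xi in x_mis]

def impute_bootstrap(x_obs, y_obs, x_mis, rng):
    # Option 3: resample the observed data, re-estimate the parameters,
    # then predict plus noise with the re-estimated parameters
    n = len(x_obs)
    idx = [rng.randrange(n) for _ in range(n)]
    a, b, sd = fit_ols([x_obs[i] for i in idx], [y_obs[i] for i in idx])
    return [a + b * xi + rng.gauss(0, sd) for xi in x_mis]

rng = random.Random(42)
x_obs = [rng.uniform(0, 10) for _ in range(50)]       # predictor, fully observed
y_obs = [2 + 3 * xi + rng.gauss(0, 1) for xi in x_obs]  # outcome where observed
x_mis = [1.0, 5.0, 9.0]                               # predictor values of cases with missing outcome

print("predict:          ", impute_predict(x_obs, y_obs, x_mis))
print("predict + noise:  ", impute_predict_noise(x_obs, y_obs, x_mis, rng))
print("bootstrap + noise:", impute_bootstrap(x_obs, y_obs, x_mis, rng))
```

Calling the second or third function repeatedly gives different imputations each time, which is what multiple imputation needs; the first always returns the same values.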

Regression

VIM::regressionImp

Predictive mean matching

Assumption: the missing value follows the same distribution as the observed values of the candidate donors, i.e. the observed cases whose predicted values are closest to the predicted value for the missing entry.

Robust to transformation, less vulnerable to model misspecification.
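A minimal sketch of the matching step, assuming the predicted values have already been obtained from some imputation model. The function name pmm_impute, the choice of k = 5 donors and the numbers are assumptions for illustration, not the internals of any package.

```python
import random

def pmm_impute(pred_obs, y_obs, pred_mis, k=5, rng=None):
    """For each missing entry, draw one observed value from the k donors
    whose predicted values are closest to the prediction for that entry."""
    rng = rng or random.Random()
    imputed = []
    for p in pred_mis:
        # rank the observed cases by distance between predicted values
        order = sorted(range(len(y_obs)), key=lambda i: abs(pred_obs[i] - p))
        donors = order[:k]
        # the imputed value is the donor's observed value, not the prediction
        imputed.append(y_obs[rng.choice(donors)])
    return imputed

# Hypothetical predictions and observed values for eight complete cases
pred_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_obs = [1.1, 1.9, 3.2, 4.1, 4.8, 6.3, 7.0, 8.4]
pred_mis = [2.9, 6.6]   # predictions for two cases with a missing value

print(pmm_impute(pred_obs, y_obs, pred_mis, k=3, rng=random.Random(0)))
```

Because every imputed value is copied from an observed case, imputations always stay within the range of the observed data, which is one reason the method is less sensitive to a misspecified model.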

Implementation in MICE: method = "pmm" in mice::mice().

Multivariate imputation

Keywords: Sequential imputation (MICE), joint modeling, multi-level imputation

Joint modeling

For more information, see the book by van Buuren.

Data can be described by a multivariate distribution, imputations are created by draws from the fitted distribution.

Continuous data: imputation under the multivariate normal assumption is robust to non-normal data, see the simulation by Demirtas et al. (2008).

Fully conditional specification, MICE algorithm

For more information, see the book by van Buuren.

FCS imputes multivariate data variable by variable: specify an imputation model for each variable, then impute iteratively.

One iteration goes through all \(Y_j, j = 1, ..., p\); the number of iterations can be as low as 5-10.

Multivariate distribution \(P(Y, X, R|\theta)\) is specified through a set of conditional densities \(P(Y_j | X, Y_{-j}, R, \phi_j)\)
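The iteration can be sketched for two incomplete variables. This is a simplified illustration and makes several assumptions: each conditional model is a simple linear regression with normal noise, parameter uncertainty is ignored, starting imputations are random draws from the observed values, and the names fit_ols and fcs_impute are invented for the example.

```python
import random
from statistics import mean

def fit_ols(x, y):
    """Least-squares intercept, slope and residual sd for y = a + b * x + error."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, (rss / (len(x) - 2)) ** 0.5

def fcs_impute(y1, y2, n_iter=10, seed=0):
    """Impute two variables (None = missing) by cycling through conditional models."""
    rng = random.Random(seed)
    cols = [list(y1), list(y2)]
    miss = [[v is None for v in col] for col in cols]
    # starting imputation: random draw from the observed values of the same variable
    for j in range(2):
        observed = [v for v in cols[j] if v is not None]
        cols[j] = [rng.choice(observed) if m else v for v, m in zip(cols[j], miss[j])]
    for _ in range(n_iter):
        # one iteration visits every incomplete variable once
        for j in range(2):
            target, other = cols[j], cols[1 - j]
            # fit the conditional model on rows where the target is observed,
            # using the currently completed values of the other variable
            x_fit = [o for o, m in zip(other, miss[j]) if not m]
            y_fit = [t for t, m in zip(target, miss[j]) if not m]
            a, b, sd = fit_ols(x_fit, y_fit)
            # replace only the originally missing entries with new draws
            for i, m in enumerate(miss[j]):
                if m:
                    target[i] = a + b * other[i] + rng.gauss(0, sd)
    return cols

# Simulated correlated data with about 20% missing in each variable
rng = random.Random(7)
full1 = [rng.gauss(0, 1) for _ in range(200)]
full2 = [0.8 * v + rng.gauss(0, 0.6) for v in full1]
y1 = [None if rng.random() < 0.2 else v for v in full1]
y2 = [None if rng.random() < 0.2 else v for v in full2]

completed = fcs_impute(y1, y2, n_iter=10, seed=1)
print(sum(v is None for v in y1), "values imputed in y1")
print(sum(v is None for v in y2), "values imputed in y2")
```

Running fcs_impute with M different seeds would give M completed datasets; observed values are never changed, only the originally missing entries are redrawn in each pass.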

Other names for this category of approaches:

  • variable by variable imputation (Brand 1999)
  • switching regressions (van Buuren, Boshuizen and Knook 1999)
  • sequential regressions (Raghunathan 2001)
  • partially incompatible MCMC (Rubin 2003)
  • iterated univariate imputation (Gelman 2004)
  • chained equations (van Buuren, Groothuis-Oudshoorn 2000)
  • fully conditional specification FCS (van Buuren 2006)

MICE algorithm

Multiple imputation via chained equations is an MCMC method. If the conditionals are compatible, MICE is a Gibbs sampler.

Gibbs sampling (idea): knowledge of the conditional distributions is sufficient to determine the joint distribution.
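A textbook illustration of this idea, separate from imputation itself: for a standard bivariate normal with correlation rho, each conditional is X | Y = y ~ N(rho * y, 1 - rho^2), and alternating draws from the two conditionals yields draws from the joint distribution. The function names, burn-in length and rho = 0.6 are assumptions chosen for the sketch.

```python
import random
from statistics import mean

def gibbs_bivariate_normal(rho, n_draws, burn_in=100, seed=0):
    """Sample a standard bivariate normal using only its two conditionals."""
    rng = random.Random(seed)
    cond_sd = (1 - rho ** 2) ** 0.5
    x, y = 0.0, 0.0   # arbitrary starting values
    draws = []
    for t in range(burn_in + n_draws):
        x = rng.gauss(rho * y, cond_sd)   # draw X given the current Y
        y = rng.gauss(rho * x, cond_sd)   # draw Y given the new X
        if t >= burn_in:
            draws.append((x, y))
    return draws

def correlation(pairs):
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in pairs)
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

draws = gibbs_bivariate_normal(rho=0.6, n_draws=5000)
print("sample correlation:", round(correlation(draws), 3))
```

The sampler is never given the joint density, yet the sample correlation of the draws should land near the rho built into the conditionals. In MICE the conditionals are the per-variable imputation models.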

Reference based methods (RCT)

(This type of imputation is specific to RCT, might deserve a separate post)

Package: rbmi

Keywords: delta adjustment, tipping point analysis, sensitivity analysis, jump to reference

https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-021-01261-6