Missing data and imputation

Missing data

Overview of multiple imputation

Author

Chi Zhang

Published

February 14, 2024

Overview

Missing data can occur in various situations:

unit non-response: individuals decline to participate
item non-response: some questions or measurements left out by participating individuals
loss of follow-up: drop outs

For unit non-response and loss of follow-up, weighting the data could help. Need to know the response rate, characteristics of the non-responders and how the respondent differ.

Solutions (to item non-response)

complete case analysis (listwise deletion)
single imputation
- mean imputation
- conditional imputation: make a prediction, plus minus some noise
multiple imputation: repeat single imputation M times, produce M complete datasets, use Rubin’s rules to pool the estimates.

MCAR, MAR, MNAR

Three categories of missing mechanism. Example: disease status, level of exposure, age. Exposure is missing for some subjects.

MCAR: any two individuals have the same probability of having missing value for exposure. This is unlikely.
MAR: any two individuaals with the same disease status and age have the same chance of having the exposure missing
MNAR: chance of having missing exposure depends on the value of exposure

Data could also have special missing structures: multi-level, longitudinal and repeated measurements. In RCT a specific technique reference based imputation can be applied.

Consideration

Need to account for the missinng data process, preserve the relations in the data and uncertainty in the relations.

MAR assumption whether plausible. (FCS can handle both MAR anad MNAR)
form of imputation model: structure and error distribution
predictors, as many relevant as possible, including interactions
the order in which to impute
set up starting imputation and number of iteration
decide number of imputed datasets

Univariate imputation

Assume only missing in one (continuous) variable.

predict from (linear) regression. No uncertainty
predict + noise. Assume normality, draw a random value from N(0, sd)
predict + noise + parameter uncertainty. Bayesian method, bootstrap (resamples observed data, re-estimate parameters)
a second predictor (or more)
draw from observed data - predictive mean matching

Regression

VIM::regressionImp

Linear regression without parameter uncertainty, mice.impute.norm.nob
Linear regression through prediction, mice.impute.norm.predict
Bayesian linear regression, mice.impute.norm
Linear regression bootstrap, mice.impute.norm.boot

Predictive mean matching

Assumption: distribution of missing is the same aas obsereved data of the candidates that produce the closest values to the predicted value by the missing entry.

Robust to transformation, less vulnerable to model misspecification.

Implementation in MICE:

Predictive mean matching, mice.impute.pmm
Weighted predictive mean matching, mice.impute.midastouch
Multivariate predictive mean matching, mice.impute.mpmm

Multivariate imputation

Keywords: Sequential imputation (MICE), joint modeling, multi-level imputation

Joint modeling

More information see book by van Buuren

Data can be described by a multivariate distribution, imputations are created by draws from the fitted distribution.

Continuous data: imputation under multivariate normal assumption is robusst to non-normal data, see simulation by Demirtas et al 2008.

Fully conditional specification, MICE algorithm

More information see book by van Buuren

FCS imputes multivariate data variable-by-variable, specify imputation model for each variable, then impute iteratively.

One iteration goes through all \(Y_j, j = 1, ..., p\), number of iteration can be low as 5-10.

Multivariate distribution \(P(Y, X, R|\theta)\) is specified through a set of conditional densities \(P(Y_j | X, Y_{-j}, R, \phi_j)\)

Other names of this category of approach:

variable by variable imputation (Brand 1999)
switching regressions (van Buuren, Boshuizen aand Knook 1999)
sequential regressions (Raaghunathan 2001)
partially incompatible MCMC (Rubin 2003)
iterated univariate imputation (Gelman 2004)
chained equations (van Buuren, Groothuis-Oudshoorn 2000)
fully conditional specification FCS (van Buuren 2006)

MICE algorithm

Multiple imputation via chained equations, it is a MCMC method. If conditionals are compatible, MICEE is a Gibbs sampler.

Gibbs sampling (idea): knowledge on the conditional distribution is sufficient to determine a joint distribution.

Reference based methods (RCT)

(This type of imputation is specific to RCT, might deserve a separate post)

Package: rbmi

Keywords: delta adjustment, tipping point analysis, sensitivity analysis, jump to reference,

https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-021-01261-6

Interview questions

How do you handle missing data in clinical trials? * Can you explain the concepts of missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR)?
Can you explain how multiple imputation works, and in what situations would you use it?
What is a sensitivity analysis, and why is it important when dealing with missing data?