Missing data and imputation
Overview of multiple imputation
Overview
Missing data can occur in various situations:
- unit non-response: individuals decline to participate
- item non-response: some questions or measurements left out by participating individuals
- loss of follow-up: drop outs
For unit non-response and loss of follow-up, weighting the data could help. Need to know the response rate, characteristics of the non-responders and how the respondent differ.
Solutions (to item non-response)
- complete case analysis (listwise deletion)
- single imputation
- mean imputation
- conditional imputation: make a prediction, plus minus some noise
- multiple imputation: repeat single imputation M times, produce M complete datasets, use Rubin’s rules to pool the estimates.
MCAR, MAR, MNAR
Three categories of missing mechanism. Example: disease status, level of exposure, age. Exposure is missing for some subjects.
- MCAR: any two individuals have the same probability of having missing value for exposure. This is unlikely.
- MAR: any two individuaals with the same disease status and age have the same chance of having the exposure missing
- MNAR: chance of having missing exposure depends on the value of exposure
Data could also have special missing structures: multi-level, longitudinal and repeated measurements. In RCT a specific technique reference based imputation can be applied.
Consideration
Need to account for the missinng data process, preserve the relations in the data and uncertainty in the relations.
- MAR assumption whether plausible. (FCS can handle both MAR anad MNAR)
- form of imputation model: structure and error distribution
- predictors, as many relevant as possible, including interactions
- the order in which to impute
- set up starting imputation and number of iteration
- decide number of imputed datasets
Univariate imputation
Assume only missing in one (continuous) variable.
- predict from (linear) regression. No uncertainty
- predict + noise. Assume normality, draw a random value from N(0, sd)
- predict + noise + parameter uncertainty. Bayesian method, bootstrap (resamples observed data, re-estimate parameters)
- a second predictor (or more)
- draw from observed data - predictive mean matching
Regression
VIM::regressionImp
- Linear regression without parameter uncertainty, mice.impute.norm.nob
- Linear regression through prediction, mice.impute.norm.predict
- Bayesian linear regression, mice.impute.norm
- Linear regression bootstrap, mice.impute.norm.boot
Predictive mean matching
Assumption: distribution of missing is the same aas obsereved data of the candidates that produce the closest values to the predicted value by the missing entry.
Robust to transformation, less vulnerable to model misspecification.
Implementation in MICE
:
- Predictive mean matching, mice.impute.pmm
- Weighted predictive mean matching, mice.impute.midastouch
- Multivariate predictive mean matching, mice.impute.mpmm
Multivariate imputation
Keywords: Sequential imputation (MICE), joint modeling, multi-level imputation
Joint modeling
More information see book by van Buuren
Data can be described by a multivariate distribution, imputations are created by draws from the fitted distribution.
Continuous data: imputation under multivariate normal assumption is robusst to non-normal data, see simulation by Demirtas et al 2008.
Fully conditional specification, MICE algorithm
More information see book by van Buuren
FCS imputes multivariate data variable-by-variable, specify imputation model for each variable, then impute iteratively.
One iteration goes through all \(Y_j, j = 1, ..., p\), number of iteration can be low as 5-10.
Multivariate distribution \(P(Y, X, R|\theta)\) is specified through a set of conditional densities \(P(Y_j | X, Y_{-j}, R, \phi_j)\)
Other names of this category of approach:
- variable by variable imputation (Brand 1999)
- switching regressions (van Buuren, Boshuizen aand Knook 1999)
- sequential regressions (Raaghunathan 2001)
- partially incompatible MCMC (Rubin 2003)
- iterated univariate imputation (Gelman 2004)
- chained equations (van Buuren, Groothuis-Oudshoorn 2000)
- fully conditional specification FCS (van Buuren 2006)
MICE algorithm
Multiple imputation via chained equations, it is a MCMC method. If conditionals are compatible, MICEE is a Gibbs sampler.
Gibbs sampling (idea): knowledge on the conditional distribution is sufficient to determine a joint distribution.
Reference based methods (RCT)
(This type of imputation is specific to RCT, might deserve a separate post)
Package: rbmi
Keywords: delta adjustment, tipping point analysis, sensitivity analysis, jump to reference,
https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-021-01261-6
Interview questions
- How do you handle missing data in clinical trials? * Can you explain the concepts of missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR)?
- Can you explain how multiple imputation works, and in what situations would you use it?
- What is a sensitivity analysis, and why is it important when dealing with missing data?