Regression
Linear, logistic, Cox proportional hazard
Summary
Aspect | Linear | Logistic | Cox |
---|---|---|---|
Variable selection | Stepwise, regularisation | | |
Model selection | AIC, BIC, R2 | AIC, BIC, adjusted R2, ROC/AUC | AIC, BIC, concordance index |
Hypothesis test for one or more coefficients | t-test / Wald test; F-test (overall model); LRT (less common) | Wald test; LRT | Wald test; score (log-rank) test; LRT |
Diagnostics | Residual plot; QQ plot; influence (Cook’s distance) | Hosmer-Lemeshow test; calibration plots; influence | Proportional hazards (PH) assumption; residuals (Schoenfeld, martingale); influence measures |
When assumption does not hold | | | PH: stratified Cox model; time-varying covariates; parametric models; competing risks |
Model building workflow
1. State the hypothesis
2. Explore the data
3. Fit a regression model
4. Run diagnostics (for linear regression):
   - residuals vs fitted
   - normal QQ plot of residuals
   - added-variable plots
   - influence plot (residuals vs leverage)
5. Fix the biggest problem, then go back to step 3
6. Compare alternative models with nested model tests
7. Interpret the coefficients
Hypothesis tests for regression
T-test in linear regression for individual coefficient
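Hypothesis: \(H_0: \beta_1 = 0\)
Test statistic: \(T_0 = \frac{\hat{\beta}_1 - 0}{\mathrm{se}(\hat{\beta}_1)}\)
Reject \(H_0\) (two-sided test) if \(|t_0| > t_{\alpha/2, n-p-1}\), where \(p\) is the number of covariates. With one covariate, the degrees of freedom are \(n-2\).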
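The t-test above can be reproduced by hand from a fitted model. The sketch below is only an illustration: it assumes the built-in `mtcars` data and a single covariate `wt`, neither of which comes from these notes.

```r
# Simple linear regression on the built-in mtcars data (illustrative choice)
fit <- lm(mpg ~ wt, data = mtcars)

est <- coef(summary(fit))["wt", "Estimate"]
se  <- coef(summary(fit))["wt", "Std. Error"]

# Test statistic for H0: beta_1 = 0
t0 <- (est - 0) / se

# Degrees of freedom n - p - 1; with one covariate this is n - 2
df <- df.residual(fit)

# Two-sided critical value and p-value at alpha = 0.05
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df)
p_val  <- 2 * pt(abs(t0), df, lower.tail = FALSE)

reject <- abs(t0) > t_crit
```

The hand-computed `t0` and `p_val` should match the `t value` and `Pr(>|t|)` columns of `summary(fit)`.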
F-test in linear regression for joint significance
Evaluates the overall significance of the model, and compares nested models.
Hypothesis: \(H_0: \beta_1 = \beta_2 = \beta_3 = ... = 0\). Joint non-significance of the model
Alternative hypothesis: \(H_1: \beta_i \neq 0\) for at least one \(i\), i.e. at least one coefficient is non-zero
Can be used for two nested linear regression models. Based on variance decomposition, not likelihood.
Compare the unrestricted sum of squared residuals (SSR) with the restricted SSR.
- restricted: coefficients are restricted to be 0 (i.e. intercept only, H0). Has a higher SSR, as no variance is explained by covariates
- unrestricted: coefficients are not restricted to be 0 (H1)
If the restricted SSR is much larger than the unrestricted SSR, reject the null.
\[F = \frac{(SSR_r - SSR_u)/p}{SSR_u/(n-p-1)} \sim F_{p, n-p-1}\]
Here \(p\) is the number of coefficients restricted to zero; for the overall test this is the number of covariates.
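As a rough check of the F statistic, the restricted and unrestricted SSR can be computed directly and compared with `anova()`. The data set (`mtcars`) and the covariates are assumptions chosen for illustration.

```r
# Unrestricted model (H1) and restricted intercept-only model (H0)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ 1, data = mtcars)

ssr_u <- sum(resid(full)^2)     # unrestricted SSR
ssr_r <- sum(resid(reduced)^2)  # restricted SSR

n <- nrow(mtcars)
p <- length(coef(full)) - 1     # number of coefficients restricted to 0

f_stat <- ((ssr_r - ssr_u) / p) / (ssr_u / (n - p - 1))
p_val  <- pf(f_stat, p, n - p - 1, lower.tail = FALSE)

# anova() on the two nested fits performs the same comparison
anova(reduced, full)
```

For the intercept-only restriction, `f_stat` is the same overall F statistic that `summary(full)` reports.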
Likelihood ratio test for nested models
Typically used for GLM and Cox regression. The LRT can also be used for linear regression, but the F-test is more common.
The LRT compares the goodness-of-fit of two nested models: it tests whether the extra predictors in the larger model significantly improve the fit (equivalently, whether removing them significantly worsens it). It is based on the likelihood function.
For GLM we do not have SSR, so comparing nested models requires likelihood from restricted and unrestricted models.
Null hypothesis: reduced (fewer predictors) model is sufficient
Alternative: full (more predictors) model is better
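Deviance: minus twice the difference in log-likelihood between the fitted model and the saturated model, i.e. it measures how far the current model is from the ideal model that fits the data perfectly.
- null deviance: deviance of the intercept-only model
- residual deviance: deviance of the model with predictors included. Lower residual deviance means better fit
\[D = -2 \log\left(\frac{L_{\text{fitted model}}}{L_{\text{saturated}}}\right)\]
The LRT statistic is the difference in deviance between the two nested models (model 1 deviance - model 2 deviance, with model 1 the reduced model):
\[D_1 - D_2 = -2(\ell_1 - \ell_2) \sim \chi^2_{p_2 - p_1}\]
where \(\ell_j\) and \(p_j\) are the maximised log-likelihood and the number of parameters of model \(j\).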
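A minimal sketch of the deviance-based LRT for two nested logistic models follows. The `mtcars` data and the choice of `am` as a binary outcome are illustrative assumptions, not part of these notes.

```r
# Reduced and full logistic models (nested)
reduced <- glm(am ~ wt, data = mtcars, family = binomial)
full    <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# LRT statistic: difference in residual deviance
lrt   <- deviance(reduced) - deviance(full)
df    <- df.residual(reduced) - df.residual(full)
p_val <- pchisq(lrt, df, lower.tail = FALSE)

# Same statistic from the log-likelihoods: -2 * (l_reduced - l_full)
lrt_ll <- -2 * (as.numeric(logLik(reduced)) - as.numeric(logLik(full)))

# Built-in equivalent
anova(reduced, full, test = "Chisq")
```

Both routes give the same statistic because the saturated-model term cancels when two deviances are subtracted.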
Wald test
Can be used to test an individual coefficient or a set of coefficients in a regression.
\[W = \frac{(\hat{\theta} - \theta_0)^2}{I(\hat{\theta})^{-1}} \sim \chi^2_1\]
For an individual \(\beta\), it is equivalent to the square of the t-test statistic under normality assumptions:
\[W = \frac{(\hat{\beta} - \beta_0)^2}{var(\hat{\beta})}\]
For multiple coefficients, i.e. testing whether \(q\) coefficients are simultaneously zero, the test uses the vector \(\hat{\theta}\) and its covariance matrix:
\[W = (\hat{\theta} - \theta_0)^T \, var(\hat{\theta})^{-1} \, (\hat{\theta} - \theta_0) \sim \chi^2_q\]
The Wald test is commonly seen in logistic and Cox models to test the significance of covariates.
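The single-coefficient and joint Wald statistics can be sketched from the estimated coefficients and their covariance matrix. The logistic model on `mtcars` below is an assumed example for illustration only.

```r
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
b <- coef(fit)   # estimated coefficients
V <- vcov(fit)   # estimated covariance matrix

# Single coefficient, H0: beta_wt = 0
w_wt <- unname((b["wt"] - 0)^2 / V["wt", "wt"])
p_wt <- pchisq(w_wt, df = 1, lower.tail = FALSE)

# Joint test, H0: beta_wt = beta_hp = 0
idx     <- c("wt", "hp")
w_joint <- as.numeric(t(b[idx]) %*% solve(V[idx, idx]) %*% b[idx])
p_joint <- pchisq(w_joint, df = length(idx), lower.tail = FALSE)
```

`w_wt` is the square of the `z value` that `summary(fit)` prints for `wt`.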
Log rank test for survival curves (semi-parametric)
Score test (log rank) for overall fit
The score test assesses the overall fit of the model using only quantities evaluated under the null hypothesis, so the full model does not need to be fitted.
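To see how the log-rank test and the Cox score test relate, the sketch below runs both on one binary covariate. It assumes the `survival` package and its `lung` data set, which these notes do not mention; with tied event times the two statistics are close but not necessarily identical.

```r
library(survival)

# Log-rank test comparing survival curves by sex
lr <- survdiff(Surv(time, status) ~ sex, data = lung)

# Cox model with the same single covariate
fit <- coxph(Surv(time, status) ~ sex, data = lung)

# Score (log-rank) test reported for the Cox model
sc <- summary(fit)$sctest   # named vector: test, df, pvalue

c(logrank = lr$chisq, score = unname(sc["test"]))
```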
Model selection and comparison
Forward selection: start with the null model, add one variable at a time, refit, repeat
Backward selection: start with the full model, eliminate one variable at a time (starting from the one with the largest p-value), refit, repeat
There is no guarantee that the two procedures arrive at the same final model. When they differ, generally choose the one with the larger adjusted R2.
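A small sketch of both directions, using base R's `step()`. Note the swap: `step()` adds or drops terms by AIC rather than by p-value, so this illustrates the procedure but is not the exact p-value rule described above. The data (`mtcars`) and candidate covariates are assumptions.

```r
null <- lm(mpg ~ 1, data = mtcars)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Forward: start from the null model, candidate terms taken from the full model
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Backward: start from the full model and drop terms
bwd <- step(full, direction = "backward", trace = 0)

# The two final models need not coincide; compare adjusted R2
formula(fwd)
formula(bwd)
c(forward  = summary(fwd)$adj.r.squared,
  backward = summary(bwd)$adj.r.squared)
```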
F-test and likelihood ratio tests for nested model
The F-test (linear models) and the LRT can both be used for comparing nested models. The commands are similar: `anova(lm1, lm2)` or `anova(lr1, lr2, test = 'Chisq')`.
AIC, BIC
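Both trade off goodness-of-fit against the number of parameters. Lower AIC or BIC is better.
Can be used for non-nested models, provided they are fitted to the same data.
BIC penalises complexity more heavily (\(k \log n\) versus \(2k\) for AIC, with \(k\) parameters and \(n\) observations), so it favours simpler models, increasingly so as the sample size grows.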
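The two criteria can be computed by hand from the log-likelihood, which makes the penalty terms explicit. The two non-nested models on `mtcars` below are assumed for illustration.

```r
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ hp + qsec, data = mtcars)   # not nested in m1

# Built-in comparison; lower is better
AIC(m1, m2)
BIC(m1, m2)

# Manual computation for m1
ll <- as.numeric(logLik(m1))
k  <- attr(logLik(m1), "df")   # parameters, including the error variance
n  <- nobs(m1)

aic_manual <- -2 * ll + 2 * k
bic_manual <- -2 * ll + log(n) * k
```

With n = 32, `log(n)` is about 3.47, so the BIC penalty per parameter already exceeds the AIC penalty of 2.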
Assumptions and diagnostics
- distribution (QQ)
- residuals (pattern, outliers; deviance; Schoenfeld)
- fit (overall fit, goodness of fit, comparing nested models)
- collinearity (variance inflation factor)
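For the collinearity check, the variance inflation factor of a predictor is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the others. The helper `vif_manual` below is a hypothetical name written for this sketch, and the `mtcars` predictors are assumed; in practice `car::vif()` on a fitted model is the usual route.

```r
# Hypothetical helper: VIF for each predictor via auxiliary regressions
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    aux    <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
vifs
```

Values near 1 indicate little collinearity; common rules of thumb flag values above 5 or 10.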
When assumptions do not hold
- model mis-specification: add more or remove variables; interaction terms; transformation of variables
- collinearity: remove redundant variables after checking VIF
- other models: GAM, mixed effects, parametric models
Related to my own work (TBF)
Linear regression
Assumptions:
- the relationship between the outcome and each independent variable is linear
- independent observations
- no perfect collinearity, no zero variance of an independent variable (e.g. only female gender in the data, no male)
- error term is normally distributed
- exogeneity: error term has expected value of zero and is uncorrelated with the independent variables
- homoscedasticity: error term has equal variance
Use residuals as estimates of the error terms.
- Residual vs fitted: should show no pattern. Patterns (clusters, butterfly, U shape …) indicate either non-linearity or heteroscedasticity
- heteroscedasticity: try robust standard errors
- non-linearity: consider transformation
- Q-Q plot: checks normality of the residuals
- Residual vs leverage: identify outliers and influential observations
- leverage: distance of an observation's covariate values from the centre of the data
- Cook’s distance: overall measure of influence of an observation
Less important: scale-location plot
Other plots: car::avPlots
See case study: prestige for more examples.
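The plots and influence measures listed above can be produced with base R. In this sketch the model on `mtcars` is an assumed example, and the cut-offs (`2p/n` for leverage, `4/n` for Cook's distance) are common rules of thumb rather than anything prescribed in these notes.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # overall influence of each observation

# Rule-of-thumb flags (assumed thresholds)
n_par     <- length(coef(fit))
high_lev  <- names(lev)[lev > 2 * n_par / nobs(fit)]
high_cook <- names(cook)[cook > 4 / nobs(fit)]
```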
When assumption does not hold
Logistic regression
\(\text{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k\)
Assumptions:
- binary or ordinal outcome
- large samples
- independent observations
- linear relationship between the independent variables and the log odds (so that effects add linearly on the logit scale)
- no or little multicollinearity between independent variables
When assumption does not hold
Poisson regression
And negative binomial
Key assumption for Poisson regression: variance approximately equal to the mean. If the data are overdispersed (variance greater than mean), use negative binomial (NB).
How to choose
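- fit a Poisson model, compute the dispersion parameter (sum of squared Pearson residuals / df), where df is the residual degrees of freedom n-p, with p the number of estimated parameters. If it is much greater than 1, consider NB
- fit an NB model
- compare goodness of fit measures (AIC, BIC, log-likelihood)
- use a likelihood ratio test (Poisson is nested in NB)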
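The steps above can be sketched on simulated data. Everything in the simulation (sample size, coefficients, `size = 1`) is an assumption chosen to produce overdispersion; `MASS::glm.nb` fits the NB model. The halved p-value is a common convention because the Poisson model sits on the boundary of the NB parameter space.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
# Overdispersed counts: negative binomial with mean exp(0.5 + 0.3 x)
y <- rnbinom(n, mu = exp(0.5 + 0.3 * x), size = 1)

# Step 1: Poisson fit and dispersion estimate
pois <- glm(y ~ x, family = poisson)
disp <- sum(residuals(pois, type = "pearson")^2) / df.residual(pois)

# Step 2: negative binomial fit
nb <- MASS::glm.nb(y ~ x)

# Step 3: information criteria (lower is better)
AIC(pois, nb)

# Step 4: likelihood ratio test, Poisson (reduced) vs NB (full)
lrt   <- 2 * (as.numeric(logLik(nb)) - as.numeric(logLik(pois)))
p_val <- pchisq(lrt, df = 1, lower.tail = FALSE) / 2
```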
Assumptions:
- count data
- response follows a Poisson or NB distribution
- independent observations
- linear relationship between the independent variables and the log of the expected count (log link)
- no excessive zeros
When assumption does not hold
- robust standard errors if over dispersion is mild
- excessive zeros: zero-inflated Poisson or NB; hurdle models that split the zeros and non-zeros
- independence: use mixed model
- linearity: polynomial term, transformation
Cox regression
Proportional hazards assumption, tested with `cox.zph()`.
- gives a p-value for each covariate
- a significant p-value suggests proportional hazards is violated for this covariate
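A minimal sketch of the proportional hazards check. The `lung` data from the `survival` package and the chosen covariates are assumptions for illustration.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Test based on scaled Schoenfeld residuals
zph <- cox.zph(fit)

# One row per covariate plus a GLOBAL row; the p column holds the p-values
zph$table

# Covariates for which PH looks questionable at the 5% level
rownames(zph$table)[zph$table[, "p"] < 0.05]
```

Calling `plot(zph)` shows the residuals against time; a flat trend is consistent with proportional hazards.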
Concordance: the model’s ability to predict the ordering of survival times, i.e. how well the model can rank individual subjects by risk. It ranges from 0 to 1, where 0.5 corresponds to random ranking and 1 to perfect ranking; the higher the better.
Residual diagnostics, shouldn’t display patterns
- martingale residual
- deviance residual
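Concordance and the two residual types can be pulled from a fitted Cox model as sketched below. The `lung` data and covariates are again assumed for illustration.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Concordance index and its standard error
c_index <- summary(fit)$concordance[["C"]]

# Residuals for diagnostics; plot against covariates or the linear predictor
mart <- residuals(fit, type = "martingale")
dev  <- residuals(fit, type = "deviance")

plot(predict(fit), mart,
     xlab = "Linear predictor", ylab = "Martingale residual")
lines(lowess(predict(fit), mart))   # smooth should be roughly flat
```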
When assumption does not hold
- Modify the model within Cox: stratified Cox model, time-varying covariates (e.g. landmark analysis)
- Other models: parametric models such as the accelerated failure time (AFT) model, cure models, competing risks models
General: VIF, Cook’s distance, GoF
Subgroup analysis, Sensitivity, interaction
Related to my own work (TBF)
- requests for post-hoc power analysis
- separate analyses stratified by sex and smoking status (COVITA)
- interaction: alcohol study
Subgroup analysis
E.g. analyse the effect of a new treatment on patients under 50 vs above 50 to see if the treatment works differently in the two age subgroups.
- pre-specified (a priori): planned in the SAP
- post-hoc: conducted afterwards. Useful for generating hypotheses, but carries a high risk of false positives (type I error) due to multiple comparisons
Steps: define subgroups -> conduct the analysis and estimate the treatment effect in each -> check for interaction -> interpret
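The steps can be sketched on simulated data. All variables (`treat`, `older`, `y`) and effect sizes below are hypothetical; the point is that the formal check for a subgroup difference is the interaction term, not a comparison of two separate p-values.

```r
set.seed(42)
n     <- 400
treat <- rbinom(n, 1, 0.5)   # 1 = new treatment
older <- rbinom(n, 1, 0.5)   # 1 = aged 50 or above
y     <- 1 + 0.5 * treat + 0.2 * older + 0.4 * treat * older + rnorm(n)
dat   <- data.frame(y, treat, older)

# Treatment effect estimated separately in each subgroup
eff_young <- coef(lm(y ~ treat, data = subset(dat, older == 0)))[["treat"]]
eff_old   <- coef(lm(y ~ treat, data = subset(dat, older == 1)))[["treat"]]

# Interaction model: treat:older estimates the difference between subgroups
fit_int <- lm(y ~ treat * older, data = dat)
coef(summary(fit_int))["treat:older", ]
```

With two binary variables the interaction coefficient equals `eff_old - eff_young`, and its p-value tests whether the subgroup effects differ.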
Risks of subgroup analysis, how to mitigate
Risk | Description | Solution |
---|---|---|
Multiple comparison | Increases the risk of false positives (type I error): statistically significant differences occur by chance | Bonferroni correction, FDR adjustments |
Reduced power | (Even) smaller sample in each subgroup | |
Over-interpretation | Post-hoc analyses do not confirm the hypotheses made in the SAP; interpretation needs to be cautious | Relate to other studies |
p-hacking | Searching for a significant subgroup without a clear hypothesis | |
Loss of generalizability | Obscures the overall treatment effect | |
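For the multiple comparison row, base R's `p.adjust()` covers both corrections. The raw p-values below are made up for illustration, one per hypothetical subgroup test.

```r
# Hypothetical raw p-values from five subgroup analyses
p_raw <- c(0.004, 0.03, 0.04, 0.20, 0.65)

# Bonferroni: multiply by the number of tests, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Benjamini-Hochberg: controls the false discovery rate
p_fdr <- p.adjust(p_raw, method = "BH")

data.frame(p_raw, p_bonf, p_fdr)
```

In this made-up example three tests are significant at 0.05 before adjustment and only the first remains so after either correction.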
ANOVA et al
- ANOVA compares means across groups
- ANCOVA additionally adjusts for one or more continuous covariates, e.g. age, tumor size (not gender, as it is not continuous)
- MANCOVA extends ANCOVA to more than one dependent variable, adjusting for one or more continuous covariates
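The three models differ only in the formula. The sketch below assumes `mtcars`, with `cyl` as the grouping factor and `wt` as the continuous covariate; these choices are illustrative.

```r
dat <- transform(mtcars, cyl = factor(cyl))

# ANOVA: mean mpg across cylinder groups
anova_fit <- aov(mpg ~ cyl, data = dat)

# ANCOVA: same comparison, adjusting for the continuous covariate wt
ancova_fit <- aov(mpg ~ wt + cyl, data = dat)

# MANCOVA: two dependent variables, same covariate and grouping factor
mancova_fit <- manova(cbind(mpg, qsec) ~ wt + cyl, data = dat)

summary(anova_fit)
summary(ancova_fit)
summary(mancova_fit)   # Pillai's trace by default
```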
Interview questions
Concept explanation
Procedure
Model selection
Diagnostics