Regression

Model

Linear, logistic, Cox proportional hazard

Author

Chi Zhang

Published

October 2, 2024

Summary

Aspect	Linear	Logistic	Cox
Variable selection	stepwise regularisation
Model selection	AIC BIC R2	AIC BIC adjusted R2 ROC/AUC	AIC BIC Concordance index
Hypothesis test for one or more coefficients	t-test / Wald test F-test (overall model) LRT (less common)	Wald test LRT	Wald test Score (log rank) test LRT
Diagnostics	Residual plot QQ plot Influence (Cook’s distance)	Hosmer-Lemeshow test Calibration plots influence	Proportional hazard assumption (PH) residuals (Schoenfeld, martingale) influence measures
When assumption does not hold			PH: stratified Cox model time-varying covariates parametric models competing risk

Model building workflow

1. state hypothesis
2. data exploration
3. fit a regression model
4. diagnostics

- (linear reg:) residual vs fitted
- normal QQ plot of residuals
- added variable plots
- influence plot (residual vs leverage)

5. fix biggest problem, go back to 3
6. compare alternative models with nested model tests
7. interpret the coefficients

Hypothesis tests for regression

T-test in linear regression for individual coefficient

Hypothesis: \(H_0: \beta_1 = 0\)

Test statistic: \(T_0 = \frac{\hat{\beta}_1 - 0}{\mathrm{se}(\hat{\beta}_1)}\)

Reject H0 if \(|t_0| > t_{\alpha/2, n-p-1}\) (two-sided test), where p is the number of covariates. With one covariate the degrees of freedom are \(n-2\)
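A minimal sketch of the t-test for a single slope, using only the Python standard library. The toy data and the helper name `simple_ols_t` are made up for illustration; in practice the statistic comes straight from the regression summary.

```python
import math

def simple_ols_t(x, y):
    """Return (slope, standard error, t statistic) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and residual variance with n - 2 df
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = ssr / (n - 2)
    se = math.sqrt(sigma2 / sxx)
    return slope, se, slope / se

# Toy data (hypothetical)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, se, t0 = simple_ols_t(x, y)

# Tabulated two-sided 5% critical value for 3 degrees of freedom
t_crit = 3.182
reject_h0 = abs(t0) > t_crit
```

Here the slope is 0.6 with t statistic about 2.12, which does not exceed the critical value, so H0 is not rejected for this tiny sample.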

F-test in linear regression for joint significance

Evaluate the overall significance of model, and compare nested models.

Hypothesis: \(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\). Joint non-significance of the model

Alternative hypothesis: \(H_1: \beta_i \neq 0\) for at least one i, i.e. at least one of them is significant

Can be used for two nested linear regression models. Based on variance decomposition, not likelihood.

Compare unrestricted sum of square of residuals (SSR) with restricted SSR.

restricted: coefficients are restricted to be 0 (i.e. intercept only, H0). Has higher SSR, as no variance is explained by covariates
unrestricted: coefficients are not restricted to be 0, H1

If the restricted SSR is much larger than the unrestricted SSR, reject the null.

\[F = \frac{(SSR_r - SSR_u)/p}{SSR_u/(n-p-1)} \sim F_{p, n-p-1}\]
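As a rough illustration of the formula above, the F statistic can be computed directly from the two residual sums of squares. The function name `f_statistic` and the numbers are illustrative assumptions, not output from a real analysis.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, n, p):
    """Overall F statistic: p covariates tested jointly, n observations."""
    numerator = (ssr_restricted - ssr_unrestricted) / p
    denominator = ssr_unrestricted / (n - p - 1)
    return numerator / denominator

# Hypothetical values: intercept-only SSR = 6.0, one-covariate SSR = 2.4, n = 5
f0 = f_statistic(ssr_restricted=6.0, ssr_unrestricted=2.4, n=5, p=1)
```

With a single covariate the F statistic equals the square of the t statistic for that coefficient, which is a handy sanity check.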

Likelihood ratio test for nested models

Typically used for GLM and Cox regression. LRT can also be used for linear regression, but F-test is more common.

LRT compares the goodness-of-fit of two nested models, i.e. it tests whether adding one or more predictors significantly improves the model fit (equivalently, whether removing them significantly worsens it). It is based on the likelihood function.

For GLM we do not have SSR, so comparing nested models requires likelihood from restricted and unrestricted models.

Null hypothesis: reduced (fewer predictors) model is sufficient

Alternative: full (more predictors) model is better

Deviance: measures the difference in log-likelihood between fitted model and saturated model, which means it measures how far the current model is from the ideal model that fits the data perfectly.

null deviance: intercept only
residual deviance: deviance of the model with predictors included. Lower residual deviance, better fit

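\[D = -2 \log\left(\frac{lik_{\text{fitted model}}}{lik_{\text{saturated}}}\right)\]

Test statistic for nested models: model 1 (reduced) deviance - model 2 (full) deviance, which under the null follows \(\chi^2_{p_2 - p_1}\), where \(p_1\) and \(p_2\) are the numbers of parameters in the reduced and full models.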
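A small sketch of the likelihood ratio test computed from two log-likelihoods. The log-likelihood values are hypothetical, and `chi2_sf` is a helper written here that only covers 1 or 2 degrees of freedom, where the chi-square tail has a closed form in the standard library; a statistics package would be used for the general case.

```python
import math

def chi2_sf(stat, df):
    """Upper tail probability of a chi-square variable (df = 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only supports df = 1 or 2")

def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    """LRT statistic (difference in deviance) and its p-value."""
    stat = 2 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df)

# Hypothetical log-likelihoods; the full model has one extra predictor
lr_stat, lr_p = likelihood_ratio_test(-120.5, -117.3, df=1)
```

The statistic is the same number as the difference in deviance between the two models, because the saturated-model term cancels.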

Wald test

Can be used to test individual coefficients or a set of coefficients in a regression model.

\[W = \frac{(\hat{\theta} - \theta_0)^2}{I(\hat{\theta})^{-1}} \sim \chi^2_1\]

For an individual \(\beta\), it is equivalent to the square of the t-test statistic under normality assumptions:

\[W = \frac{(\hat{\beta} - \beta_0)^2}{var(\hat{\beta})}\]

For multiple coefficients, i.e. testing whether several coefficients are simultaneously zero, the test uses the vector of estimates and its covariance matrix.

The Wald test is commonly seen in logistic and Cox models to test the significance of covariates.
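A minimal sketch of the single-coefficient Wald test, assuming a hypothetical estimate and standard error (for example a log odds ratio from a logistic model). It shows that the chi-square p-value with 1 df matches the two-sided normal p-value for z.

```python
import math

def wald_test(beta_hat, se, beta_null=0.0):
    """Return (z, W, p-value) for a single coefficient."""
    z = (beta_hat - beta_null) / se
    w = z ** 2
    # Chi-square(1) upper tail, identical to the two-sided normal tail for z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, w, p_value

# Hypothetical estimate and standard error
z, w, p_wald = wald_test(beta_hat=0.8, se=0.3)
```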

See an example here.

Log rank test for survival curves (semi-parametric)

Score test (log rank) for overall fit

Score test is used to test overall fit of the model without fitting the

Model selection and comparison

Forward selection: start with null, add one-at-a-time, refit

Backward selection: start with full, eliminate one-at-a-time (from the one with largest p-value), refit, repeat

No guarantee that they arrive at the same final model. Generally choose the one with the larger adjusted R2.

F-test and likelihood ratio tests for nested model

F-test (linear) and LRT can be used for comparing nested models. The commands are similar, anova(lm1, lm2) or anova(lr1, lr2, test = 'Chisq').

AIC, BIC

Trade-offs between goodness-of-fit and number of parameters. Lower AIC, BIC is better

Can be used for non-nested models.

BIC penalises model complexity more strictly than AIC (log(n) per parameter instead of 2, so stricter once n >= 8), so it favors simpler models, especially in large datasets.
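Both criteria are simple functions of the maximised log-likelihood. The sketch below uses the standard definitions with hypothetical inputs (k estimated parameters, n observations).

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k log(n) - 2 log L."""
    return k * math.log(n) - 2 * loglik

# Hypothetical model: log-likelihood -117.3, 3 parameters, 100 observations
aic_value = aic(-117.3, k=3)
bic_value = bic(-117.3, k=3, n=100)
```

With n = 100 each parameter costs about 4.6 in BIC versus 2 in AIC, which is why BIC tends to pick the smaller model.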

Assumptions and diagnostics

Aspects to consider

distribution (QQ)
residuals (pattern, outliers; deviance; Schoenfeld)
fit (overall fit, goodness of fit, comparing nested models)
collinearity (variance inflation factor)
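For the collinearity check, the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. A minimal sketch for the two-predictor case, where R_j^2 is just the squared correlation; the data and function names are illustrative.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors (same for both)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical predictors with correlation 0.8
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)
```

A correlation of 0.8 gives a VIF of about 2.8; common rules of thumb start to worry around 5 or 10.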

When assumptions do not hold

model mis-specification: add more or remove variables; interaction terms; transformation of variables
collinearity: remove after checking VIF
other models: GAM, mixed effects, parametric models

Related to my own work (TBF)

Linear regression

Assumptions:

all relationships are linear
independent observation
no perfect collinearity, no zero variance of independent variable (e.g. only female gender in the data, no male)
error term is normally distributed
exogeneity: error term has expected value of zero, uncorrelated with independent variables
homoscedasticity: error term has equal variance

Use residual as estimate for error terms.

Residual vs fitted: should show no pattern. Patterns (clusters, butterfly, U shape …) indicate either non-linearity or heteroscedasticity
- heteroscedasticity: try robust standard errors
- non-linearity: consider transformation
Q-Q plot
Residual vs leverage: identify outliers (influential observations)
- leverage: distance from the mass center of the data
- Cook’s distance: overall measure of influence of an observation
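Leverage and Cook's distance can be computed by hand for a simple linear regression, which makes the definitions concrete. This is a sketch on made-up data, using the closed-form leverage for one covariate and the usual Cook's distance formula with 2 estimated parameters (intercept and slope).

```python
def influence_measures(x, y):
    """Return (leverages, Cook's distances) for y = b0 + b1 * x."""
    n = len(x)
    n_params = 2  # intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - n_params)
    # Leverage grows with squared distance from the mean of x
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    cooks = [
        (e ** 2 / (n_params * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]
    return leverage, cooks

# Toy data (hypothetical)
leverage, cooks = influence_measures([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

The leverages sum to the number of parameters, and the first point combines high leverage with a sizeable residual, so it has the largest Cook's distance.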

Less important: scale-location plot

Other plots: car::avPlots

See case study: prestige for more examples.

When assumption does not hold

Logistic regression

Logit(p) = log(p/(1-p)) = b0 + b1x1 + ... + bpxp
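A short sketch of the link function and its inverse, with hypothetical coefficients for a single covariate. Exponentiating a coefficient gives the odds ratio for a one-unit increase in that covariate.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Map a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients: intercept b0 and slope b1
b0, b1 = -1.0, 0.5
p_at_2 = inv_logit(b0 + b1 * 2)   # linear predictor is 0 at x = 2
odds_ratio = math.exp(b1)         # odds ratio per unit increase in x
```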

Assumptions:

binary or ordinal outcome
large samples
independence
linearity of indep variables and log odds (so that it’s linear addition)
none or little multicollinearity between independent variables

When assumption does not hold

Poisson regression

And negative binomial

Key assumption for Poisson regression: variance approximately equal to mean. If overdispersed (variance greater than mean), use NB.

How to choose

fit a Poisson, compute dispersion parameter (sum of squared Pearson residuals / df), where df is n-p. If much greater than 1, consider NB
fit a nb
compare goodness of fit measures (AIC, BIC, log-likelihood)
use likelihood ratio tests
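The dispersion check in the first step can be sketched for the simplest case, an intercept-only Poisson model, where the fitted mean is just the sample mean. The counts are made up, and a real model with covariates would use its own fitted means and p parameters.

```python
def pearson_dispersion(counts):
    """Pearson chi-square / df for an intercept-only Poisson model."""
    n = len(counts)
    mu = sum(counts) / n                     # fitted mean for every observation
    pearson_chi2 = sum((y - mu) ** 2 / mu for y in counts)
    df = n - 1                               # one estimated parameter
    return pearson_chi2 / df

# Hypothetical counts with one large value driving overdispersion
dispersion = pearson_dispersion([0, 1, 2, 3, 14])
```

A value near 1 is consistent with the Poisson assumption; the value here is far above 1, which would point towards a negative binomial model.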

Assumptions:

count data
response follows poisson or nb distribution.
independent observations
linearity of indep variables and the log link
no excessive zeros

When assumption does not hold

robust standard errors if over dispersion is mild
excessive zeros: zero-inflated Poisson or NB; hurdle models that split the zeros and non-zeros
independence: use mixed model
linearity: polynomial term, transformation

Cox regression

Proportional hazards assumption, tested with cox.zph().

p-value for each covariate
significant suggests proportional hazards is violated for this covariate

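Concordance: model’s ability to predict the ordering of survival times, i.e. how well the model can rank individual subjects by risk. It typically ranges from 0.5 (no better than random ordering) to 1 (perfect ranking); values below 0.5 indicate ranking worse than chance. The higher the better.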
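A simplified sketch of how a concordance index can be computed with right censoring. A pair is comparable only if the subject with the shorter time had an observed event; ties in survival time are ignored here, which real implementations handle more carefully. Data and names are hypothetical.

```python
def concordance_index(times, events, risk):
    """Fraction of comparable pairs where higher risk means shorter survival."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Subject i must fail first, and that failure must be observed
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable

# Hypothetical data: events = 1 means observed event, 0 means censored
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk)
```

Of the 5 comparable pairs, 4 are ranked correctly by the risk score, giving a concordance of 0.8.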

Residual diagnostics, shouldn’t display patterns

martingale residual
deviance residual

Note

When assumption does not hold

Modify the model within Cox: stratified Cox, time-varying covariate (e.g. landmark analysis)
Parametric model: accelerated failure time AFT model, cure model, competing risk

General: VIF, Cook’s distance, GoF

Subgroup analysis, Sensitivity, interaction

Related to my own work (TBF)

requests for post-hoc power analysis
- separate analysis stratified by sex and smoking status (COVITA)
- interaction: alcohol study

Subgroup analysis

E.g. analyse effect of new treatment on patient under 50 vs above 50 to see if treatment works differently in two age subgroups.

pre-specified (a priori): planned in SAP
post-hoc: conducted afterwards. useful for generating hypothesis, but high risk of false positives (type I) due to multiple comparisons

Steps: define subgroup -> conduct analysis and estimate treatment effect -> check for interaction -> interpretation

Risks of subgroup analysis, how to mitigate

Risk	Description	Solution
Multiple comparison	Increase the risk of FP (type I error), statistically significant differences occur by chance	Bonferroni correction, FDR adjustments
Reduced power	(even) Smaller sample
Over-interpretation	Post-hoc analyses do not confirm the hypothesis made in the SAP, so interpretation needs to be cautious	Relate to other studies
p-hacking	Search for significant subgroup without clear hypothesis
Loss of generalizability	Obscure the overall treatment effect
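The two multiple-comparison corrections named in the table can be sketched in a few lines. The p-values are invented, standing in for one test per subgroup; the Benjamini-Hochberg procedure is used here as the FDR adjustment.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values, in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values from four subgroup tests
subgroup_p = [0.01, 0.04, 0.03, 0.20]
p_bonf = bonferroni(subgroup_p)
p_bh = benjamini_hochberg(subgroup_p)
```

At the 5% level only the first subgroup survives either correction, even though three of the raw p-values are below 0.05.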

ANOVA et al

Anova compares the mean difference between groups
ANCOVA adds one or more continuous covariates, e.g. adjusting for age, tumor size (not gender as it’s not continuous)
MANCOVA extends ANCOVA to more than one dependent variable, adjusting for one or more continuous covariates.

Interview questions

Concept explanation

Procedure

Model selection

Diagnostics