Linear regression example: prestige

Regression

Linear regression

Author

Chi Zhang

Published

October 5, 2024

This analysis was prepared for interviews on linear regression. The focus is on the procedure (and how to carry it out in R), as well as on interpreting the results.

suppressMessages(library(car))   # car also loads carData, which provides the Prestige data
prestige <- carData::Prestige    # 102 Canadian occupations

# Full model: prestige on education, income, and percentage of women
m1 <- lm(prestige ~ education + income + women, 
         data = prestige)

summary(m1)

Call:
lm(formula = prestige ~ education + income + women, data = prestige)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8246  -5.3332  -0.1364   5.1587  17.5045 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.7943342  3.2390886  -2.098   0.0385 *  
education    4.1866373  0.3887013  10.771  < 2e-16 ***
income       0.0013136  0.0002778   4.729 7.58e-06 ***
women       -0.0089052  0.0304071  -0.293   0.7702    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.846 on 98 degrees of freedom
Multiple R-squared:  0.7982,    Adjusted R-squared:  0.792 
F-statistic: 129.2 on 3 and 98 DF,  p-value: < 2.2e-16

T-test

In the summary() output above, each t value is the coefficient estimate divided by its standard error, and the p-value tests whether that coefficient is zero given the other predictors. Here education and income are highly significant, while women is not (p = 0.77).
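
As a quick check of where these numbers come from (a sketch using only the fitted model m1 and base R), the t statistic and two-sided p-value for women can be recomputed from the coefficient table:

ct <- coef(summary(m1))                      # columns: Estimate, Std. Error, t value, Pr(>|t|)
t_women <- ct["women", "Estimate"] / ct["women", "Std. Error"]
p_women <- 2 * pt(abs(t_women), df = df.residual(m1), lower.tail = FALSE)
c(t = t_women, p = p_women)                  # about -0.29 and 0.77, as in the summary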

F-test

The partial F-test compares nested models: refit without the term(s) of interest and test whether the resulting increase in the residual sum of squares is larger than chance would allow.

# Reduced model: drop women
m11 <- lm(prestige ~ education + income, data = prestige)
summary(m11)

Call:
lm(formula = prestige ~ education + income, data = prestige)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.4040  -5.3308   0.0154   4.9803  17.6889 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.8477787  3.2189771  -2.127   0.0359 *  
education    4.1374444  0.3489120  11.858  < 2e-16 ***
income       0.0013612  0.0002242   6.071 2.36e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.81 on 99 degrees of freedom
Multiple R-squared:  0.798, Adjusted R-squared:  0.7939 
F-statistic: 195.6 on 2 and 99 DF,  p-value: < 2.2e-16

# Smallest model: education only
m10 <- lm(prestige ~ education, data = prestige)
summary(m10)

Call:
lm(formula = prestige ~ education, data = prestige)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.0397  -6.5228   0.6611   6.7430  18.1636 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -10.732      3.677  -2.919  0.00434 ** 
education      5.361      0.332  16.148  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.103 on 100 degrees of freedom
Multiple R-squared:  0.7228,    Adjusted R-squared:   0.72 
F-statistic: 260.8 on 1 and 100 DF,  p-value: < 2.2e-16

anova(m1, m11)  # partial F-test for women: not significant
Analysis of Variance Table

Model 1: prestige ~ education + income + women
Model 2: prestige ~ education + income
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     98 6033.6                           
2     99 6038.9 -1   -5.2806 0.0858 0.7702

anova(m1, m10)  # joint F-test for income and women: significant
Analysis of Variance Table

Model 1: prestige ~ education + income + women
Model 2: prestige ~ education
  Res.Df    RSS Df Sum of Sq    F    Pr(>F)    
1     98 6033.6                                
2    100 8287.0 -2   -2253.4 18.3 1.765e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
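
As a quick arithmetic check (using the residual sums of squares reported in the first anova table above), the partial F statistic for women can be reproduced by hand:

rss_full    <- 6033.6   # RSS of m1 (education + income + women), 98 df
rss_reduced <- 6038.9   # RSS of m11 (education + income), 99 df
((rss_reduced - rss_full) / 1) / (rss_full / 98)   # about 0.086; the table's 0.0858 uses unrounded RSS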

Likelihood ratio test (less used)

The likelihood ratio test compares nested models through twice the difference in log-likelihood, which is approximately chi-squared with df equal to the number of dropped parameters; for linear models it gives essentially the same answer as the F-test.

lmtest::lrtest(m1, m11) # not sig
Likelihood ratio test

Model 1: prestige ~ education + income + women
Model 2: prestige ~ education + income
  #Df  LogLik Df  Chisq Pr(>Chisq)
1   5 -352.82                     
2   4 -352.86 -1 0.0892     0.7652

lmtest::lrtest(m1, m10) # sig
Likelihood ratio test

Model 1: prestige ~ education + income + women
Model 2: prestige ~ education
  #Df  LogLik Df Chisq Pr(>Chisq)    
1   5 -352.82                        
2   3 -369.00 -2 32.37  9.355e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
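
A minimal by-hand check of the second test (using only logLik() on the models fitted above): the chi-squared statistic is twice the difference in log-likelihoods, with df equal to the two dropped parameters.

chisq <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m10)))
chisq                                        # about 32.37
pchisq(chisq, df = 2, lower.tail = FALSE)    # about 9.4e-08, matching lrtest(m1, m10)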

AIC, BIC

Both criteria penalize the log-likelihood for the number of estimated parameters (BIC more heavily); lower values are better.

AIC(m1, m11, m10) # m11 has the lowest AIC
    df      AIC
m1   5 715.6358
m11  4 713.7251
m10  3 744.0053

BIC(m1, m11, m10)  # BIC also favors m11
    df      BIC
m1   5 728.7607
m11  4 724.2250
m10  3 751.8802
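
As a sanity check on these numbers, AIC = 2*k - 2*logLik and BIC = log(n)*k - 2*logLik, where k counts all estimated parameters including the residual variance (hence df = 5 for m1):

k <- 5; n <- nrow(prestige)                   # n = 102 occupations
2 * k      - 2 * as.numeric(logLik(m1))       # 715.64, matching AIC(m1)
log(n) * k - 2 * as.numeric(logLik(m1))       # 728.76, matching BIC(m1)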

Diagnostics

plot() on an lm object produces the four standard diagnostic plots: residuals vs fitted values, a normal Q-Q plot of the standardized residuals, a scale-location plot, and residuals vs leverage with Cook's distance contours.

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(m1)
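
A few numeric follow-ups are also common (a sketch of additional checks, not part of the original analysis; vif() comes from the already-loaded car package and bptest() from lmtest):

car::vif(m1)                  # variance inflation factors: check for multicollinearity
lmtest::bptest(m1)            # Breusch-Pagan test for non-constant residual variance
shapiro.test(residuals(m1))   # Shapiro-Wilk test for normality of the residuals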