The Data

Cognitive test scores of three- and four-year-old children and characteristics of their mothers (from a subsample of the National Longitudinal Survey of Youth).

In R

cognitive <- read.csv("http://bit.ly/dasi_cognitive")
head(cognitive, 5)
##   kid_score mom_hs    mom_iq mom_work mom_age
## 1        65    yes 121.11753      yes      27
## 2        98    yes  89.36188      yes      25
## 3        85    yes 115.44316      yes      27
## 4        83    yes  99.44964      yes      25
## 5       115    yes  92.74571      yes      27

Multiple linear regression model

In R

cog_final <- lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data=cognitive)
m_summary <- summary(cog_final)
print(m_summary)
## 
## Call:
## lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age, 
##     data = cognitive)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.045 -12.918   1.992  11.563  49.267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.59241    9.21906   2.125   0.0341 *  
## mom_hsyes    5.09482    2.31450   2.201   0.0282 *  
## mom_iq       0.56147    0.06064   9.259   <2e-16 ***
## mom_workyes  2.53718    2.35067   1.079   0.2810    
## mom_age      0.21802    0.33074   0.659   0.5101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.14 on 429 degrees of freedom
## Multiple R-squared:  0.2171, Adjusted R-squared:  0.2098 
## F-statistic: 29.74 on 4 and 429 DF,  p-value: < 2.2e-16
anova(cog_final)
## Analysis of Variance Table
## 
## Response: kid_score
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## mom_hs      1  10125 10125.0 30.7574 5.121e-08 ***
## mom_iq      1  28504 28504.1 86.5892 < 2.2e-16 ***
## mom_work    1    393   392.5  1.1924    0.2755    
## mom_age     1    143   143.0  0.4345    0.5101    
## Residuals 429 141222   329.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
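
As a cross-check, the residual standard error reported by summary() is the square root of the residual mean square in the ANOVA table above (a quick sketch, reusing the cog_final model fitted earlier):

```r
# residual mean square from the ANOVA decomposition of the fitted model
rms <- anova(cog_final)['Residuals', 'Mean Sq']
sqrt(rms)   # matches the residual standard error (18.14) in the summary output
```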

Statistical Inference for the model

H0: β1 = β2 = … = βk = 0
Ha: At least one βi is different from 0

The p-value for the model is < 2.2e-16, so the model as a whole is significant. This does not mean the model fits the data well; it only means that the alternative hypothesis is true, i.e. at least one of the βs is non-zero. Conversely, if the p-value were large and the model as a whole insignificant, the combination of these variables would not yield a useful model.
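
The overall F statistic and its p-value can also be recovered directly from the summary object (a sketch using the m_summary object defined above):

```r
# summary(lm) stores the F statistic as a named vector: value, numdf, dendf
f <- m_summary$fstatistic
p_overall <- pf(f['value'], f['numdf'], f['dendf'], lower.tail=FALSE)
print(p_overall)   # same p-value reported on the "F-statistic" line of the summary
```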

Hypothesis testing for the slope

Is whether or not the mother works a significant predictor of the cognitive test scores of children, given all the other variables in the model? Find the T score and p-value for the slope of mom_work.

H0: β_mom_work = 0, while all other variables are included in the model
Ha: β_mom_work != 0, while all other variables are included in the model

T = (point estimate - null value) / SE, with df = n - k - 1, where k is the number of predictors (slopes) in the model

In R

T_score <- (m_summary$coefficients['mom_workyes', 'Estimate'] - 0) / m_summary$coefficients['mom_workyes', 'Std. Error']
p_value <- pt(T_score, df=(dim(cognitive)[1]-4-1), lower.tail=FALSE) * 2
print(paste('T score = ', T_score, '; p-value = ', p_value, sep=""))
## [1] "T score = 1.07934202975356; p-value = 0.281041738350315"

Confidence intervals for slopes

Calculate the 95% confidence interval for the slope of mom_hs.

point estimate +/- margin of error

In R

t_star <- qt(0.975, df=dim(cognitive)[1]-4-1)  # critical t* for a 95% CI
est <- m_summary$coefficients['mom_hsyes', 'Estimate']
ci <- c(est - t_star * 2.3145, est + t_star * 2.3145)  # 2.3145 is the SE of mom_hsyes from the summary table
print(ci)
## [1] 0.545654 9.643996

We are 95% confident that, all else held equal, the model predicts that children whose moms went to high school score 0.546 to 9.644 points higher than those whose moms did not go to high school.
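
The same interval is available directly from R's built-in confint(), applied to the cog_final model fitted above:

```r
# 95% confidence interval for the mom_hsyes slope, computed by R
confint(cog_final, 'mom_hsyes', level=0.95)
```

Because confint() uses the full-precision standard error rather than the rounded 2.3145, its endpoints may differ from the manual calculation in the last few decimal places.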

Checking Assumptions for Multiple Linear Regression

Linear relationships between numerical variables x and y

Using residual plots, check that each numerical explanatory variable is linearly related to the response variable; look for random scatter around 0.

In R

plot(cog_final$residuals ~ cognitive$mom_iq)

Using ggplot2

cognitive$residuals <- cog_final$residuals

library(ggplot2)
library(gridExtra)
ggplot(cognitive, aes(x=mom_iq, y=residuals)) + geom_point()

Nearly normal residuals with mean 0

Using a histogram or a normal probability plot, check that the residuals are nearly normally distributed and centered at 0.

In R

par(mfrow=c(1, 2))
hist(cog_final$residuals)
qqnorm(cog_final$residuals)
qqline(cog_final$residuals)

Using ggplot2

p1 <- ggplot(cognitive, aes(x=residuals)) + geom_histogram(bins=10, color="red", fill="grey60") + ggtitle("Histogram of residuals")
p2 <- ggplot(cognitive, aes(sample=residuals)) + geom_qq() + geom_qq_line() + ggtitle("Normal Q-Q Plot")

grid.arrange(p1, p2, ncol=2)

Constant variability of residuals

Using plots of residuals vs. predicted values, check that the residuals are randomly scattered in a band with a constant width around 0 (no fan shape).

In R

par(mfrow=c(1, 2))
plot(cog_final$fitted.values, cog_final$residuals, main='Residuals vs. fitted')
plot(cog_final$fitted.values, abs(cog_final$residuals), main='Absolute value of residuals vs. fitted')

Using ggplot2

cognitive$fitted <- cog_final$fitted.values

p1 <- ggplot(cognitive, aes(x=fitted, y=residuals)) + geom_point() + ggtitle("Residuals vs. fitted")
p2 <- ggplot(cognitive, aes(x=fitted, y=abs(residuals))) + geom_point() + ggtitle("Absolute value of residuals vs. fitted")

grid.arrange(p1, p2, ncol=2)

Independence of residuals

Using a scatter plot of residuals against observation order, check the independence of the observations.

In R

plot(cog_final$residuals, main='Residuals vs. order of data collection')

Using ggplot2

ggplot(cognitive, aes(x=seq_len(nrow(cognitive)), y=residuals)) + geom_point() + xlab("Index") + ggtitle('Residuals vs. order of data collection')