Cognitive test scores of three- and four-year old children and characteristics of their mothers (from a subsample from the National Longitudinal Survey of Youth).
In R
cognitive <- read.csv("http://bit.ly/dasi_cognitive")
head(cognitive, 5)
## kid_score mom_hs mom_iq mom_work mom_age
## 1 65 yes 121.11753 yes 27
## 2 98 yes 89.36188 yes 25
## 3 85 yes 115.44316 yes 27
## 4 83 yes 99.44964 yes 25
## 5 115 yes 92.74571 yes 27
In R
cog_final <- lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data=cognitive)
m_summary <- summary(cog_final)
print(m_summary)
##
## Call:
## lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age,
## data = cognitive)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.045 -12.918 1.992 11.563 49.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.59241 9.21906 2.125 0.0341 *
## mom_hsyes 5.09482 2.31450 2.201 0.0282 *
## mom_iq 0.56147 0.06064 9.259 <2e-16 ***
## mom_workyes 2.53718 2.35067 1.079 0.2810
## mom_age 0.21802 0.33074 0.659 0.5101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.14 on 429 degrees of freedom
## Multiple R-squared: 0.2171, Adjusted R-squared: 0.2098
## F-statistic: 29.74 on 4 and 429 DF, p-value: < 2.2e-16
anova(cog_final)
## Analysis of Variance Table
##
## Response: kid_score
## Df Sum Sq Mean Sq F value Pr(>F)
## mom_hs 1 10125 10125.0 30.7574 5.121e-08 ***
## mom_iq 1 28504 28504.1 86.5892 < 2.2e-16 ***
## mom_work 1 393 392.5 1.1924 0.2755
## mom_age 1 143 143.0 0.4345 0.5101
## Residuals 429 141222 329.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: β1 = β2 = … = βk = 0
Ha: At least one βi is different than 0
The p-value for the model is < 2.2e-16 hence the model as a whole is significant. However, it doesn’t mean the model fits the data well, it just meant the alternate hypothesis is true that at least one of the βs is non-zero. If the p-value is large and the model as a whole is insignificant, it means the combination of these variables doesn’t yield a good model.
Is whether or not the mother works a signficant predictor of the cognitive test scores of children, given all other variables in the model? Find the T score and p-value for the slope of mom_work.
H0: β1 = 0 while all other variables are included in the model
Ha: β1 != 0 while all other variables are included in the model
T = (point estimate - null value) / SE with df = n - k - 1 where k is the number of variables
In R
T_score <- (m_summary$coefficients['mom_workyes', 'Estimate'] - 0) / m_summary$coefficients['mom_workyes', 'Std. Error']
p_value <- pt(T_score, df=(dim(cognitive)[1]-4-1), lower.tail=FALSE) * 2
print(paste('T score = ', T_score, '; p-value = ', p_value, sep=""))
## [1] "T score = 1.07934202975356; p-value = 0.281041738350315"
Calculate the 95% confidence interval for the slope of mom_hs.
point estimate +/ margin of error
In R
t <- qt((1-0.95)/2, df=dim(cognitive)[1]-4-1)
ci <- c(m_summary$coefficients['mom_hsyes', 'Estimate'] + t * 2.3145, m_summary$coefficients['mom_hsyes', 'Estimate'] - t * 2.3145)
print(ci)
## [1] 0.545654 9.643996
We are 95% confident that, all else held equal, the model predicts that children whose moms went to high school score 0.546 to 9.644 points higher than those whose moms did not went to high school.
Using residuals plots check each numerical explanatory variable linearly related to the response variable, look for random scatter around 0.
In R
plot(cog_final$residuals ~ cognitive$mom_iq)
Using ggplot2
cognitive$residuals <- cog_final$residuals
library(ggplot2)
library(gridExtra)
ggplot(cognitive, aes(x=mom_iq, y=residuals)) + geom_point()
Using histogram or normal probability plot, check residuals are nearly normal distributed centered at 0.
In R
par(mfrow=c(1, 2))
hist(cog_final$residuals)
qqnorm(cog_final$residuals)
qqline(cog_final$residuals)
Using ggplot2
p1 <- ggplot(cognitive, aes(x=residuals)) + geom_histogram(stat="bin", bins=10, color="red", fill="grey60") + ggtitle("Histogram of residuals")
p2 <- ggplot(cognitive, aes(sample=residuals)) + geom_qq() + ggtitle("Normal Q-Q Plot")
grid.arrange(p1, p2, ncol=2)
Using residuals plots of residuals vs. predicted to check residuals randomly scattered in a band with a constant width around 0 (no fan shape).
In R
par(mfrow=c(1, 2))
plot(cog_final$residuals, cog_final$fitted, main='Residuals vs. fitted')
plot(abs(cog_final$residuals), cog_final$fitted, main='Absolute value of residuals vs. fitted')
Using ggplot2
cognitive$fitted <- cog_final$fitted
p1 <- ggplot(cognitive, aes(x=residuals, y=fitted)) + geom_point() + ggtitle("Residuals vs. fitted")
p2 <- ggplot(cognitive, aes(x=abs(residuals), y=fitted)) + geom_point() + ggtitle("Absolute value of residuals vs. fitted")
grid.arrange(p1, p2, ncol=2)
Using a scatter plot of residuals to check independency of observations.
In R
plot(cog_final$residuals, main='Residuals vs. mom_iq')
Using ggplot2
library(scales)
ggplot(cognitive, aes(x=row.names(cognitive), y=residuals)) + geom_point() + xlab("Index") + scale_x_discrete(breaks=seq(0, 450, 50)) + ggtitle('Residuals vs. mom_iq')