In 1966 Cyril Burt published a paper called ‘The genetic determination of differences in intelligence: A study of monozygotic twins reared apart?’ The data consist of IQ scores for 27 identical twins, one raised by foster parents, the other by the biological parents.
In R
fosIQ <- c(82, 80, 88, 108, 116, 117, 132, 71, 75, 93, 95, 88, 111, 63, 77, 86, 83, 93, 97, 87, 94, 96, 112, 113, 106, 107, 98)
bioIQ <- c(82, 90, 91, 115, 115, 129, 131, 78, 79, 82, 97, 100, 107, 68, 73, 81, 85, 87, 87, 93, 94, 95, 97, 97, 103, 106, 111)
Class <- c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C')
Burt <- data.frame(fosIQ, bioIQ, Class)
head(Burt)
## fosIQ bioIQ Class
## 1 82 82 A
## 2 80 90 A
## 3 88 91 A
## 4 108 115 A
## 5 116 115 A
## 6 117 129 A
Map categorical variable into binary
In R
burt_cat <- lm(fosIQ ~ Class, data=Burt)
summary(burt_cat)
##
## Call:
## lm(formula = fosIQ ~ Class, data = Burt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.714 -12.274 2.286 12.500 28.714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.286 5.973 17.293 4.72e-15 ***
## ClassB -14.452 8.792 -1.644 0.113
## ClassC -9.571 7.315 -1.308 0.203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.8 on 24 degrees of freedom
## Multiple R-squared: 0.1088, Adjusted R-squared: 0.03452
## F-statistic: 1.465 on 2 and 24 DF, p-value: 0.2511
anova(burt_cat)
## Analysis of Variance Table
##
## Response: fosIQ
## Df Sum Sq Mean Sq F value Pr(>F)
## Class 2 731.5 365.77 1.4648 0.2511
## Residuals 24 5993.1 249.71
In R
burt_num <- lm(fosIQ ~ bioIQ, data=Burt)
m_sum <- summary(burt_num)
print(m_sum)
##
## Call:
## lm(formula = fosIQ ~ bioIQ, data = Burt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3512 -5.7311 0.0574 4.3244 16.3531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.20760 9.29990 0.990 0.332
## bioIQ 0.90144 0.09633 9.358 1.2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.729 on 25 degrees of freedom
## Multiple R-squared: 0.7779, Adjusted R-squared: 0.769
## F-statistic: 87.56 on 1 and 25 DF, p-value: 1.204e-09
anova(burt_num)
## Analysis of Variance Table
##
## Response: fosIQ
## Df Sum Sq Mean Sq F value Pr(>F)
## bioIQ 1 5231.1 5231.1 87.563 1.204e-09 ***
## Residuals 25 1493.5 59.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Is the explanatory variable a significant predictor of the response variable? i.e. is the slope 0?
H0: β1 = 0; not a significant predictor, i.e. no relationship => slope is 0
Ha: βi ≠ 0; a significant predictor, i.e. relationship => slope is different than 0
T = (point estimate - null value) / SE with df = n - 2
In R
T_score <- (m_sum$coefficients[2, 1] - 0) / m_sum$coefficients[2, 2]
p_value <- pt(T_score, df=(dim(Burt)[1]-2), lower.tail=FALSE) * 2
print(paste('T score = ', T_score, ' p-value = ', p_value))
## [1] "T score = 9.35751276770383 p-value = 1.20360006499638e-09"
Calculate the 95% confidence interval for the slope of the relationship between biological and foster twin’s IQs?
point estimate +/ margin of error
In R
t <- qt((1-0.95)/2, df=dim(Burt)[1]-2)
ci <- c(m_sum$coefficients[2, 1] + t * m_sum$coefficients[2, 2], m_sum$coefficients[2, 1] - t * m_sum$coefficients[2, 2])
print(ci)
## [1] 0.7030348 1.0998372
We are 95% confident that for each additional point on the biological twins’ IQs, the foster twins’ IQs are expected on average to be higher by 0.7 to 1.1 points.
Using a scatterplot of the data or a residuals plots, check the relationship between the numerical explanatory variable linearly related to the response variable. Look for random scatter around 0.
In R
par(mfrow=c(1, 2))
plot(fosIQ ~ bioIQ, data=Burt, type='p', main='Scatterplot of bioIQ & fosIQ')
abline(burt_num, col="red")
plot(burt_num$residuals ~ bioIQ, main='Residuals plot of bioIQ')
Using ggplot2
library(ggplot2)
library(gridExtra)
Burt$res <- burt_num$residuals
p1 <- ggplot(Burt, aes(x=bioIQ, y=fosIQ)) + geom_point() + geom_abline() + ggtitle("Scatterplot of bioIQ & fosIQ")
p2 <- ggplot(Burt, aes(x=bioIQ, res)) + geom_point() + ggtitle("Residuals plot of bioIQ")
grid.arrange(p1, p2, ncol=2)
Using histogram or normal quantile-quantile plot, check residuals are nearly normal distributed centered at 0.
In R
par(mfrow=c(1, 2))
hist(burt_num$residuals)
qqnorm(burt_num$residuals)
qqline(burt_num$residuals)
Using ggplot2
p1 <- ggplot(Burt, aes(x=res)) + geom_histogram(stat="bin", bins=10, color="red", fill="grey60") + ggtitle("Historgram of residuals")
p2 <- ggplot(Burt, aes(sample=res)) + geom_qq() + ggtitle("Normal Q-Q Plot")
grid.arrange(p1, p2, ncol=2)
Variability of points around the least squares line should be roughly constant, this is also called homoscedasticity. Using residuals plots of residuals vs. predicted to check residuals randomly scattered in a band with a constant width around 0 (no fan shape).
In R
par(mfrow=c(1, 2))
plot(fosIQ~bioIQ, data=Burt, main='Scatterplot of bioIQ & fosIQ')
abline(burt_num, col="red")
plot(burt_num$residuals, burt_num$fitted, main='Residuals vs. fitted')
Using ggplot2
Burt$fitted <- burt_num$fitted.values
p1 <- ggplot(Burt, aes(x=bioIQ, y=fosIQ)) + geom_point() + geom_abline() + ggtitle("Scatterplot of bioIQ & fosIQ")
p2 <- ggplot(Burt, aes(x=res, fitted)) + geom_point() + ggtitle("Residuals vs fitted")
grid.arrange(p1, p2, ncol=2)