Eva's Photo Album

The Data
Linear regression model
- For categorical explanatory variable
- For numerical explanatory variable
Statistical Inference for the model
Confidence intervals for slopes
Checking Assumptions for Linear Regression

The Data

In 1966 Cyril Burt published a paper called ‘The genetic determination of differences in intelligence: A study of monozygotic twins reared apart?’ The data consist of IQ scores for 27 identical twins, one raised by foster parents, the other by the biological parents.

In R

fosIQ <- c(82, 80, 88, 108, 116, 117, 132, 71, 75, 93, 95, 88, 111, 63, 77, 86, 83, 93, 97, 87, 94, 96, 112, 113, 106, 107, 98)
bioIQ <- c(82, 90, 91, 115, 115, 129, 131, 78, 79, 82, 97, 100, 107, 68, 73, 81, 85, 87, 87, 93, 94, 95, 97, 97, 103, 106, 111)
Class <- c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C')
Burt <- data.frame(fosIQ, bioIQ, Class)
head(Burt)

##   fosIQ bioIQ Class
## 1    82    82     A
## 2    80    90     A
## 3    88    91     A
## 4   108   115     A
## 5   116   115     A
## 6   117   129     A

Linear regression model

For categorical explanatory variable

Map categorical variable into binary

In R

burt_cat <- lm(fosIQ ~ Class, data=Burt)
summary(burt_cat)

## 
## Call:
## lm(formula = fosIQ ~ Class, data = Burt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.714 -12.274   2.286  12.500  28.714 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  103.286      5.973  17.293 4.72e-15 ***
## ClassB       -14.452      8.792  -1.644    0.113    
## ClassC        -9.571      7.315  -1.308    0.203    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.8 on 24 degrees of freedom
## Multiple R-squared:  0.1088, Adjusted R-squared:  0.03452 
## F-statistic: 1.465 on 2 and 24 DF,  p-value: 0.2511

anova(burt_cat)

## Analysis of Variance Table
## 
## Response: fosIQ
##           Df Sum Sq Mean Sq F value Pr(>F)
## Class      2  731.5  365.77  1.4648 0.2511
## Residuals 24 5993.1  249.71

For numerical explanatory variable

In R

burt_num <- lm(fosIQ ~ bioIQ, data=Burt)
m_sum <- summary(burt_num)
print(m_sum)

## 
## Call:
## lm(formula = fosIQ ~ bioIQ, data = Burt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3512  -5.7311   0.0574   4.3244  16.3531 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.20760    9.29990   0.990    0.332    
## bioIQ        0.90144    0.09633   9.358  1.2e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.729 on 25 degrees of freedom
## Multiple R-squared:  0.7779, Adjusted R-squared:  0.769 
## F-statistic: 87.56 on 1 and 25 DF,  p-value: 1.204e-09

anova(burt_num)

## Analysis of Variance Table
## 
## Response: fosIQ
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## bioIQ      1 5231.1  5231.1  87.563 1.204e-09 ***
## Residuals 25 1493.5    59.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Statistical Inference for the model

Is the explanatory variable a significant predictor of the response variable? i.e. is the slope 0?

H0: β1 = 0; not a significant predictor, i.e. no relationship => slope is 0
Ha: βi ≠ 0; a significant predictor, i.e. relationship => slope is different than 0

T = (point estimate - null value) / SE with df = n - 2

In R

T_score <- (m_sum$coefficients[2, 1] - 0) / m_sum$coefficients[2, 2]
p_value <- pt(T_score, df=(dim(Burt)[1]-2), lower.tail=FALSE) * 2
print(paste('T score = ', T_score, ' p-value = ', p_value))

## [1] "T score =  9.35751276770383  p-value =  1.20360006499638e-09"

Confidence intervals for slopes

Calculate the 95% confidence interval for the slope of the relationship between biological and foster twin’s IQs?

point estimate +/ margin of error

In R

t <- qt((1-0.95)/2, df=dim(Burt)[1]-2)
ci <- c(m_sum$coefficients[2, 1] + t * m_sum$coefficients[2, 2], m_sum$coefficients[2, 1] - t * m_sum$coefficients[2, 2])
print(ci)

## [1] 0.7030348 1.0998372

We are 95% confident that for each additional point on the biological twins’ IQs, the foster twins’ IQs are expected on average to be higher by 0.7 to 1.1 points.

Checking Assumptions for Linear Regression

Linearity

Using a scatterplot of the data or a residuals plots, check the relationship between the numerical explanatory variable linearly related to the response variable. Look for random scatter around 0.

In R

par(mfrow=c(1, 2))
plot(fosIQ ~ bioIQ, data=Burt, type='p', main='Scatterplot of bioIQ & fosIQ')
abline(burt_num, col="red")
plot(burt_num$residuals ~ bioIQ, main='Residuals plot of bioIQ')

Using ggplot2

library(ggplot2)
library(gridExtra)

Burt$res <- burt_num$residuals

p1 <- ggplot(Burt, aes(x=bioIQ, y=fosIQ)) + geom_point() + geom_abline() + ggtitle("Scatterplot of bioIQ & fosIQ")
p2 <- ggplot(Burt, aes(x=bioIQ, res)) + geom_point() + ggtitle("Residuals plot of bioIQ")

grid.arrange(p1, p2, ncol=2)

Nearly normal residuals with mean 0

Using histogram or normal quantile-quantile plot, check residuals are nearly normal distributed centered at 0.

In R

par(mfrow=c(1, 2))
hist(burt_num$residuals)
qqnorm(burt_num$residuals)
qqline(burt_num$residuals)

Using ggplot2

p1 <- ggplot(Burt, aes(x=res)) + geom_histogram(stat="bin", bins=10, color="red", fill="grey60") + ggtitle("Historgram of residuals")
p2 <- ggplot(Burt, aes(sample=res)) + geom_qq() + ggtitle("Normal Q-Q Plot")

grid.arrange(p1, p2, ncol=2)

Constant variability of residuals

Variability of points around the least squares line should be roughly constant, this is also called homoscedasticity. Using residuals plots of residuals vs. predicted to check residuals randomly scattered in a band with a constant width around 0 (no fan shape).

In R

par(mfrow=c(1, 2))
plot(fosIQ~bioIQ, data=Burt, main='Scatterplot of bioIQ & fosIQ')
abline(burt_num, col="red")
plot(burt_num$residuals, burt_num$fitted, main='Residuals vs. fitted')

Using ggplot2

Burt$fitted <- burt_num$fitted.values

p1 <- ggplot(Burt, aes(x=bioIQ, y=fosIQ)) + geom_point() + geom_abline() + ggtitle("Scatterplot of bioIQ & fosIQ")
p2 <- ggplot(Burt, aes(x=res, fitted)) + geom_point() + ggtitle("Residuals vs fitted")

grid.arrange(p1, p2, ncol=2)

Linear Regression Statistical Inference and Assumptions in R and Python Pandas

March 30, 2016

The Data

Linear regression model

For categorical explanatory variable

For numerical explanatory variable

Statistical Inference for the model

Confidence intervals for slopes

Checking Assumptions for Linear Regression

Linearity

Nearly normal residuals with mean 0

Constant variability of residuals