Example of Simpson’s Paradox using synthetic data.
Consider book price (y) by number of pages (x)
library(tidyverse)
library(ggiraphExtra)
Create the example data in a tibble.
# hard cover books
x1 <- c( 150, 225, 342, 185)
y1 <- c( 27.43, 48.76, 50.25, 32.01 )
# paperback books
x2 <- c( 475, 834, 1020, 790)
y2 <- c( 10.00, 15.73, 20.00, 17.89 )
x <- c(x1, x2)
y <- c(y1, y2)
prices <- tibble(
z = c("hard cover","hard cover","hard cover","hard cover",
"paperback", "paperback","paperback", "paperback"),
x = x,
y = y
)
prices
The best model with fitt the data like this.
prices %>% ggplot(aes(x = x, y = y, color = z)) +
geom_point() +
geom_smooth(method = lm)
How are these models fitted to the data? Answer: Using linear regression with a dummy variable.
Covert the type of book variable z to a factor. In R using factors is turns a variable into a dummy variable.
prices <- prices %>% mutate(z = as_factor(z))
prices
Fit a linear regression ignoring the type of book vaiable z.
model1 <- lm( y ~ x, data = prices)
summary(model1)
Call:
lm(formula = y ~ x, data = prices)
Residuals:
Min 1Q Median 3Q Max
-18.495 -5.591 -2.705 7.922 18.211
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.15238 8.71061 4.724 0.00324 **
x -0.02665 0.01470 -1.813 0.11977
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.05 on 6 degrees of freedom
Multiple R-squared: 0.3539, Adjusted R-squared: 0.2463
F-statistic: 3.287 on 1 and 6 DF, p-value: 0.1198
model.matrix(~ x, data = prices)
(Intercept) x
1 1 150
2 1 225
3 1 342
4 1 185
5 1 475
6 1 834
7 1 1020
8 1 790
attr(,"assign")
[1] 0 1
Plot the fitted model.
ggPredict(model1, se=TRUE, interactive=TRUE)
Using ggplot directly.
prices %>% ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = lm)
Fit a linear regression model including the type of book variable z.
model2 <- lm( y ~ x + z, data = prices)
summary(model2)
Call:
lm(formula = y ~ x + z, data = prices)
Residuals:
1 2 3 4 5 6 7 8
-9.909 9.163 7.129 -6.383 3.272 -1.809 -3.140 1.676
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.82168 5.70239 5.756 0.00222 **
x 0.03011 0.01855 1.623 0.16544
zpaperback -40.39847 11.65128 -3.467 0.01790 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.751 on 5 degrees of freedom
Multiple R-squared: 0.8102, Adjusted R-squared: 0.7343
F-statistic: 10.67 on 2 and 5 DF, p-value: 0.01569
model.matrix.lm(~ x + z, data = prices)
(Intercept) x zpaperback
1 1 150 0
2 1 225 0
3 1 342 0
4 1 185 0
5 1 475 1
6 1 834 1
7 1 1020 1
8 1 790 1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$z
[1] "contr.treatment"
Plot the fitted model.
ggPredict(model2, se=TRUE, interactive=TRUE)
Fit a linear regression model including the type of book variable z and the interaction between the variables x and z.
model3 <- lm( y ~ x + z + x*z, data = prices)
summary(model3)
Call:
lm(formula = y ~ x + z + x * z, data = prices)
Residuals:
1 2 3 4 5 6 7 8
-3.2928 9.2064 -3.0796 -2.8339 -0.3626 -1.1616 -0.2744 1.7986
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.06128 8.89055 1.469 0.2157
x 0.11774 0.03754 3.136 0.0350 *
zpaperback -11.33740 14.24991 -0.796 0.4708
x:zpaperback -0.09956 0.04002 -2.488 0.0676 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.429 on 4 degrees of freedom
Multiple R-squared: 0.9255, Adjusted R-squared: 0.8696
F-statistic: 16.57 on 3 and 4 DF, p-value: 0.01014
model.matrix(~ x + z + x*z, data = prices)
(Intercept) x zpaperback x:zpaperback
1 1 150 0 0
2 1 225 0 0
3 1 342 0 0
4 1 185 0 0
5 1 475 1 475
6 1 834 1 834
7 1 1020 1 1020
8 1 790 1 790
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$z
[1] "contr.treatment"
Plot the fitted model.
Using ggplot directly.
prices %>% ggplot(aes(x = x, y = y, color = z)) +
geom_point() +
geom_smooth(method = lm)