Example of Simpson’s Paradox using synthetic data.

Consider book price (y) as a function of the number of pages (x).

library(tidyverse)
library(ggiraphExtra)

Create the example data in a tibble.

# hard cover books
x1 <- c( 150, 225, 342, 185)
y1 <- c( 27.43, 48.76, 50.25, 32.01 )
# paperback books
x2 <- c( 475, 834, 1020, 790)
y2 <- c( 10.00, 15.73, 20.00, 17.89 )
x <- c(x1, x2)
y <- c(y1, y2)
prices <- tibble( 
  z = c("hard cover","hard cover","hard cover","hard cover",
      "paperback", "paperback","paperback", "paperback"),
  x = x,
  y = y
)
prices 
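
As a quick check (a sketch, not part of the models fitted below), compare the pooled correlation between pages and price with the within-group correlations; the sign flip is the essence of Simpson's Paradox.

# pooled correlation across all 8 books vs. correlation within each type
cor(x, y)    # negative: overall, more pages appears to mean a lower price
cor(x1, y1)  # positive within hard cover books
cor(x2, y2)  # positive within paperback books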

The best model fits the data like this.

prices %>% ggplot(aes(x = x, y = y, color = z)) +
  geom_point() +
  geom_smooth(method = lm)

How are these models fitted to the data? Answer: Using linear regression with a dummy variable.

Convert the type of book variable z to a factor. In R, when a factor appears in a regression formula, lm() encodes it as a dummy variable.

prices <- prices %>% mutate(z = as_factor(z))
prices
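
To see the dummy coding R will use for z, inspect its contrasts (a quick check; the column is named after the non-reference level).

# 0/1 dummy coding for z: "hard cover" is the reference level, "paperback" gets the indicator
contrasts(prices$z)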

Fit a linear regression ignoring the type of book variable z.

model1 <- lm( y ~ x, data = prices)
summary(model1)

Call:
lm(formula = y ~ x, data = prices)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.495  -5.591  -2.705   7.922  18.211 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 41.15238    8.71061   4.724  0.00324 **
x           -0.02665    0.01470  -1.813  0.11977   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.05 on 6 degrees of freedom
Multiple R-squared:  0.3539,    Adjusted R-squared:  0.2463 
F-statistic: 3.287 on 1 and 6 DF,  p-value: 0.1198
model.matrix(~ x, data = prices)
  (Intercept)    x
1           1  150
2           1  225
3           1  342
4           1  185
5           1  475
6           1  834
7           1 1020
8           1  790
attr(,"assign")
[1] 0 1
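
The pooled slope estimate is negative (about -0.027), even though price increases with pages within each type of book. A short sketch (not in the original notebook) fitting a separate regression per group makes the sign flip explicit; by_type is just a helper for illustration.

# slope of y on x within each type of book; both are positive
by_type <- split(prices, prices$z)
sapply(by_type, function(d) coef(lm(y ~ x, data = d))["x"])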

Plot the fitted model.

ggPredict(model1, se=TRUE, interactive=TRUE)

Using ggplot directly.

prices %>% ggplot(aes(x = x, y = y)) +
  geom_point() + 
  geom_smooth(method = lm)

Fit a linear regression model including the type of book variable z.

model2 <- lm( y ~ x + z, data = prices)
summary(model2)

Call:
lm(formula = y ~ x + z, data = prices)

Residuals:
     1      2      3      4      5      6      7      8 
-9.909  9.163  7.129 -6.383  3.272 -1.809 -3.140  1.676 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  32.82168    5.70239   5.756  0.00222 **
x             0.03011    0.01855   1.623  0.16544   
zpaperback  -40.39847   11.65128  -3.467  0.01790 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.751 on 5 degrees of freedom
Multiple R-squared:  0.8102,    Adjusted R-squared:  0.7343 
F-statistic: 10.67 on 2 and 5 DF,  p-value: 0.01569
model.matrix.lm(~ x + z, data = prices)
  (Intercept)    x zpaperback
1           1  150          0
2           1  225          0
3           1  342          0
4           1  185          0
5           1  475          1
6           1  834          1
7           1 1020          1
8           1  790          1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$z
[1] "contr.treatment"

Plot the fitted model.

ggPredict(model2, se=TRUE, interactive=TRUE)

Fit a linear regression model including the type of book variable z and the interaction between the variables x and z.

model3 <- lm( y ~ x + z + x*z, data = prices)
summary(model3)

Call:
lm(formula = y ~ x + z + x * z, data = prices)

Residuals:
      1       2       3       4       5       6       7       8 
-3.2928  9.2064 -3.0796 -2.8339 -0.3626 -1.1616 -0.2744  1.7986 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)   13.06128    8.89055   1.469   0.2157  
x              0.11774    0.03754   3.136   0.0350 *
zpaperback   -11.33740   14.24991  -0.796   0.4708  
x:zpaperback  -0.09956    0.04002  -2.488   0.0676 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.429 on 4 degrees of freedom
Multiple R-squared:  0.9255,    Adjusted R-squared:  0.8696 
F-statistic: 16.57 on 3 and 4 DF,  p-value: 0.01014
model.matrix(~ x + z + x*z, data = prices)
  (Intercept)    x zpaperback x:zpaperback
1           1  150          0            0
2           1  225          0            0
3           1  342          0            0
4           1  185          0            0
5           1  475          1          475
6           1  834          1          834
7           1 1020          1         1020
8           1  790          1          790
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$z
[1] "contr.treatment"

Plot the fitted model.

ggPredict(model3, se=TRUE, interactive=TRUE)

Using ggplot directly.

prices %>% ggplot(aes(x = x, y = y, color = z)) +
  geom_point() + 
  geom_smooth(method = lm)
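
As a further check (not in the original notebook), the three nested models can be compared with F tests; this is one reasonable way to judge whether adding z and the interaction improves the fit.

# compare the nested models: pooled line, parallel lines, separate slopes
anova(model1, model2, model3)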
