library(pacman)
p_load(mdsr, nycflights13, tidyverse, skimr, moderndive, infer)
Chapter 9 Statistical Foundations
Samples and Populations
The population considered in the book is contained in the nycflights13 dataset. In the nycflights13 dataset contains all of the departing flights from airports in the NYC Area.
Recall the flights dataframe.
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Take a sample of flights from NYC to SF. Get all of the flights to SF and take a sample of 25 of them.
Note that I did not set a seed for the sample, so your answers will differ from what is in the book.
<- flights %>%
SF filter(dest == "SFO", !is.na(arr_delay))
SF
# A tibble: 13,173 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 558 600 -2 923 937
2 2013 1 1 611 600 11 945 931
3 2013 1 1 655 700 -5 1037 1045
4 2013 1 1 729 730 -1 1049 1115
5 2013 1 1 734 737 -3 1047 1113
6 2013 1 1 745 745 0 1135 1125
7 2013 1 1 746 746 0 1119 1129
8 2013 1 1 803 800 3 1132 1144
9 2013 1 1 826 817 9 1145 1158
10 2013 1 1 1029 1030 -1 1427 1355
# ℹ 13,163 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
<- SF %>%
sf_25 ::slice_sample(n = 25)
dplyr sf_25
# A tibble: 25 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 11 1451 1453 -2 1753 1811
2 2013 12 13 1542 1530 12 1915 1856
3 2013 6 7 1001 1000 1 1257 1314
4 2013 10 4 1533 1529 4 1842 1847
5 2013 5 23 801 800 1 1120 1135
6 2013 12 15 2258 2021 157 228 2357
7 2013 5 20 1926 1929 -3 2335 2310
8 2013 9 26 848 854 -6 1229 1213
9 2013 1 12 554 600 -6 915 927
10 2013 2 16 1930 1935 -5 2234 2316
# ℹ 15 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Get the summary statistics for the sample taken.
%>%
sf_25 skim(arr_delay)
Name | Piped data |
Number of rows | 25 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
arr_delay | 0 | 1 | 2 | 39.63 | -42 | -17 | -10 | 12 | 151 | ▇▃▁▁▁ |
Since the SF dataset contains all flights from NYC to SF in 2013, the statistics computed from the SF dataset are the calculated population paramaters.
%>%
SF skim(arr_delay)
Name | Piped data |
Number of rows | 13173 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
arr_delay | 0 | 1 | 2.67 | 47.67 | -86 | -23 | -8 | 12 | 1007 | ▇▁▁▁▁ |
Get the 98th percentile of the sample.
%>%
sf_25 summarize(q98 = quantile(arr_delay, p = 0.98))
# A tibble: 1 × 1
q98
<dbl>
1 109.
Get the estimated proportion of flights to SF with a delay less than 90 minutes.
%>%
SF group_by(arr_delay < 90) %>%
count() %>%
mutate(pct = n / nrow(SF))
# A tibble: 2 × 3
# Groups: arr_delay < 90 [2]
`arr_delay < 90` n pct
<lgl> <int> <dbl>
1 FALSE 640 0.0486
2 TRUE 12533 0.951
Compare with the 98th percentile. 90 minutes is the 95th percentile.
%>%
SF summarize(q98 = quantile(arr_delay, p = 0.98))
# A tibble: 1 × 1
q98
<dbl>
1 153
Sample Statistics
The sampling distribution.
Usually sampling is done without replacement.
<- 25
n %>%
SF ::slice_sample(n = n) %>%
dplyrsummarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 1
mean_arr_delay
<dbl>
1 1.36
<- 25
n %>%
SF ::rep_slice_sample(n = n, reps = 1) %>%
infersummarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 2
replicate mean_arr_delay
<int> <dbl>
1 1 5.04
Note that a different random sample produces a different value for the sample statistic.
%>%
SF ::rep_slice_sample(n = n, reps = 1) %>%
infersummarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 2
replicate mean_arr_delay
<int> <dbl>
1 1 8.72
Using simulation we can approximate the sampling distribution of the sample statistics. Here we are computing the mean, but we could do this for any statistic we are interested in, the median, sample variance, sample standard deviation, etc.
<- 500
num_trials
<- 25
n
<- SF %>%
sf_25_means ::rep_slice_sample(n = n, reps = num_trials) %>%
infersummarize(mean_arr_delay = mean(arr_delay)) %>%
mutate(n = n)
head(sf_25_means)
# A tibble: 6 × 3
replicate mean_arr_delay n
<int> <dbl> <dbl>
1 1 0.44 25
2 2 -4.96 25
3 3 4.28 25
4 4 -4.12 25
5 5 9.72 25
6 6 6.96 25
Summarize the sample means.
%>%
sf_25_means skim(mean_arr_delay)
Name | Piped data |
Number of rows | 500 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean_arr_delay | 0 | 1 | 2.9 | 9.75 | -18.76 | -3.88 | 1.82 | 8.58 | 41.04 | ▂▇▅▁▁ |
Confidence intervals
%>%
sf_25_means summarize(
x_bar = mean(mean_arr_delay),
se = sd(mean_arr_delay)
%>%
) mutate(
ci_lower = x_bar - 2 * se, # approximately 95% of observations
ci_upper = x_bar + 2 * se # are within two standard errors
)
# A tibble: 1 × 4
x_bar se ci_lower ci_upper
<dbl> <dbl> <dbl> <dbl>
1 2.90 9.75 -16.6 22.4
Alternatively, it can be calculated directly using a t-test.
%>%
sf_25_means ::t_test(response = mean_arr_delay) infer
# A tibble: 1 × 7
statistic t_df p_value alternative estimate lower_ci upper_ci
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 6.66 499 7.10e-11 two.sided 2.90 2.05 3.76
Using a larger sample size give a smaller Standard Error.
<- 100
n
<- SF %>% infer::rep_slice_sample(n = n, reps = num_trials) %>%
sf_100_means group_by(replicate) %>%
summarize(mean_arr_delay = mean(arr_delay)) %>%
mutate(n = n)
head(sf_100_means)
# A tibble: 6 × 3
replicate mean_arr_delay n
<int> <dbl> <dbl>
1 1 6.09 100
2 2 -1 100
3 3 4.47 100
4 4 7.56 100
5 5 8.04 100
6 6 -2.21 100
Plots to compare the sampling distributions with different sample sizes.
%>%
sf_25_means bind_rows(sf_100_means) %>%
ggplot(aes(x = mean_arr_delay)) +
geom_histogram(bins = 30) +
facet_grid( ~ n) +
xlab("Sample mean")
Boostrap
Usually for random sampling we sample without replacement.
<- SF %>%
three_flights slice_sample(n = 3, replace = FALSE) %>%
select(year, month, day, dep_time)
three_flights
# A tibble: 3 × 4
year month day dep_time
<int> <int> <int> <int>
1 2013 10 27 1931
2 2013 9 11 1709
3 2013 6 8 726
With the Bootstrap we use sampling with replacement
%>% slice_sample(n = 3, replace = TRUE) three_flights
# A tibble: 3 × 4
year month day dep_time
<int> <int> <int> <int>
1 2013 9 11 1709
2 2013 6 8 726
3 2013 9 11 1709
<- 200
n
<- SF %>%
orig_sample slice_sample(n = n, replace = FALSE)
orig_sample
# A tibble: 200 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 4 2 1530 1530 0 1905 1922
2 2013 10 24 1914 1630 164 2216 2010
3 2013 8 27 656 700 -4 947 1023
4 2013 6 2 2029 1830 119 2329 2148
5 2013 9 13 1457 1453 4 1758 1811
6 2013 11 10 1721 1721 0 2049 2055
7 2013 6 9 1102 1030 32 1406 1350
8 2013 11 16 1601 1600 1 1926 1942
9 2013 9 22 1034 1030 4 1337 1350
10 2013 10 19 1733 1735 -2 2031 2107
# ℹ 190 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The reason for sampling with replacement is that we only have the Original Sample of size \(n\). We do not have the population. So if we sampled with out replacement we would only have one sample to look at. Sampling the Original Sample with replace allow us to general lots of samples and to investigate the variability of these samples.
%>%
orig_sample slice_sample(n = n, replace = TRUE) %>%
summarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 1
mean_arr_delay
<dbl>
1 2.65
<- orig_sample %>% infer::rep_slice_sample(n = n, replace = TRUE, reps = num_trials) %>%
sf_200_bs summarize(mean_arr_delay = mean(arr_delay)) %>%
mutate(n = n)
%>%
sf_200_bs skim(mean_arr_delay)
Name | Piped data |
Number of rows | 500 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean_arr_delay | 0 | 1 | 5.53 | 3.27 | -3.2 | 3.2 | 5.42 | 7.8 | 16.75 | ▁▇▇▃▁ |
%>% ggplot(aes(x = mean_arr_delay)) +
sf_200_bs geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Compare with the entire population.
<- SF %>% infer::rep_slice_sample(n = n, replace = TRUE, reps = num_trials) %>%
sf_200_pop summarize(mean_arr_delay = mean(arr_delay)) %>%
mutate(n = n)
%>%
sf_200_pop skim(mean_arr_delay)
Name | Piped data |
Number of rows | 500 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean_arr_delay | 0 | 1 | 2.66 | 3.41 | -5.72 | 0.46 | 2.3 | 4.73 | 13.35 | ▂▇▇▃▁ |
%>% ggplot(aes(x = mean_arr_delay)) +
sf_200_pop geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Question: Do the histograms look somewhat similar?
Using the code from the Modern Dive book.
To learn about the infer package this is a good link to start with.
%>%
orig_sample specify(response = arr_delay) %>%
calculate(stat = "mean")
Response: arr_delay (numeric)
# A tibble: 1 × 1
stat
<dbl>
1 5.32
<- orig_sample %>%
bootstrap_distribution specify(formula = arr_delay ~ NULL) %>%
generate(reps = num_trials, type = "bootstrap") %>%
calculate(stat = "mean")
bootstrap_distribution
Response: arr_delay (numeric)
# A tibble: 500 × 2
replicate stat
<int> <dbl>
1 1 4.7
2 2 6.01
3 3 9.48
4 4 5.36
5 5 0.52
6 6 3.14
7 7 -0.945
8 8 5.95
9 9 -1.74
10 10 7.94
# ℹ 490 more rows
%>% visualize() bootstrap_distribution
See the Modern Dive book for the infer process.
<- bootstrap_distribution %>%
percentile_ci get_confidence_interval(level = 0.95, type = "percentile")
percentile_ci
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 -0.691 12.1
%>% visualize() +
bootstrap_distribution shade_confidence_interval(endpoints = percentile_ci)
Example: Setting travel policy
%>%
orig_sample summarize(q98 = quantile(arr_delay, p = 0.98))
# A tibble: 1 × 1
q98
<dbl>
1 133.
<- nrow(orig_sample)
n <- orig_sample %>% infer::rep_slice_sample(n = n, replace = TRUE, reps = num_trials) %>%
sf_200_bs summarize(q98 = quantile(arr_delay, p = 0.98))
%>%
sf_200_bs skim(q98)
Name | Piped data |
Number of rows | 500 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
q98 | 0 | 1 | 138.87 | 34.32 | 73.02 | 124.04 | 133 | 156.06 | 285 | ▂▇▂▁▁ |
<- 10000
n_large <- SF %>% slice_sample(n = n_large, replace = FALSE)
sf_10000_bs
<- sf_10000_bs %>% infer::rep_slice_sample(n = n, replace = TRUE, reps = num_trials) %>%
sf_200_bs summarize(q98 = quantile(arr_delay, p = 0.98))
%>%
sf_200_bs skim(q98)
Name | Piped data |
Number of rows | 500 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
q98 | 0 | 1 | 139.02 | 31.86 | 62 | 114.27 | 136.05 | 164.07 | 253.62 | ▂▇▆▂▁ |
outliers
%>%
SF filter(arr_delay >= 420) %>%
select(month, day, dep_delay, arr_delay, carrier)
# A tibble: 7 × 5
month day dep_delay arr_delay carrier
<int> <int> <dbl> <dbl> <chr>
1 12 7 374 422 UA
2 7 6 589 561 DL
3 7 7 629 676 VX
4 7 7 653 632 VX
5 7 10 453 445 B6
6 7 10 432 433 VX
7 9 20 1014 1007 AA
%>%
SF filter(arr_delay < 420) %>%
ggplot(aes(arr_delay)) +
geom_histogram(binwidth = 15) +
labs(x = "Arrival delay (in minutes)")
%>%
SF group_by(hour) %>%
count() %>%
pivot_wider(names_from = hour, values_from = n) %>%
data.frame()
X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21
1 55 663 1696 987 429 1744 413 504 476 528 946 897 1491 1091 731 465 57
%>%
SF ggplot(aes(x = hour, y = arr_delay)) +
geom_boxplot(alpha = 0.1, aes(group = hour)) +
geom_smooth(method = "lm") +
xlab("Scheduled hour of departure") +
ylab("Arrival delay (minutes)") +
coord_cartesian(ylim = c(-30, 120))
`geom_smooth()` using formula = 'y ~ x'
<- lm(arr_delay ~ hour, data = SF)
mod1 ::tidy(mod1) broom
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -22.9 1.23 -18.6 2.88e- 76
2 hour 2.01 0.0915 22.0 1.78e-105