library(pacman)
p_load(mdsr, nycflights13, tidyverse, skimr)
Chapter 9 Statistical Foundations
Samples and Populations
The population considered in the book is contained in the nycflights13 dataset. In the nycflights13 dataset contains all of the departing flights from airports in the NYC Area.
Recall the flights dataframe.
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Take a sample of flights from NYC to SF. Get all of the flights to SF and take a sample of 25 of them.
Note that I did not set a seed for the sample, so your answers will differ from what is in the book.
<- flights %>%
SF filter(dest == "SFO", !is.na(arr_delay))
SF
# A tibble: 13,173 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 558 600 -2 923 937
2 2013 1 1 611 600 11 945 931
3 2013 1 1 655 700 -5 1037 1045
4 2013 1 1 729 730 -1 1049 1115
5 2013 1 1 734 737 -3 1047 1113
6 2013 1 1 745 745 0 1135 1125
7 2013 1 1 746 746 0 1119 1129
8 2013 1 1 803 800 3 1132 1144
9 2013 1 1 826 817 9 1145 1158
10 2013 1 1 1029 1030 -1 1427 1355
# ℹ 13,163 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
<- SF %>%
sf_25 slice_sample(n = 25)
sf_25
# A tibble: 25 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 7 8 555 559 -4 845 902
2 2013 7 30 1208 1030 98 1534 1350
3 2013 11 24 2034 2019 15 1 2355
4 2013 7 15 821 829 -8 1103 1143
5 2013 9 15 828 829 -1 1127 1151
6 2013 6 3 1023 1030 -7 1319 1345
7 2013 5 27 1403 1400 3 1730 1730
8 2013 7 8 1710 1659 11 2139 2012
9 2013 2 13 1632 1630 2 1953 2015
10 2013 10 1 2024 2025 -1 2319 2350
# ℹ 15 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Get the summary statistics for the sample taken.
%>%
sf_25 skim(arr_delay)
Name | Piped data |
Number of rows | 25 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
arr_delay | 0 | 1 | 4.4 | 37.12 | -40 | -22 | -3 | 21 | 104 | ▇▆▂▂▂ |
Since the SF dataset contains all flights from NYC to SF in 2013, the statistics computed from the SF dataset are the calculated population paramaters.
%>%
SF skim(arr_delay)
Name | Piped data |
Number of rows | 13173 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
arr_delay | 0 | 1 | 2.67 | 47.67 | -86 | -23 | -8 | 12 | 1007 | ▇▁▁▁▁ |
Get the 98th percentile of the sample.
%>%
sf_25 summarize(q98 = quantile(arr_delay, p = 0.98))
# A tibble: 1 × 1
q98
<dbl>
1 95.8
Get the estimated proportion of flights to SF with a delay less than 90 minutes.
%>%
SF group_by(arr_delay < 90) %>%
count() %>%
mutate(pct = n / nrow(SF))
# A tibble: 2 × 3
# Groups: arr_delay < 90 [2]
`arr_delay < 90` n pct
<lgl> <int> <dbl>
1 FALSE 640 0.0486
2 TRUE 12533 0.951
Compare with the 98th percentile. 90 minutes is the 95th percentile.
%>%
SF summarize(q98 = quantile(arr_delay, p = 0.98))
# A tibble: 1 × 1
q98
<dbl>
1 153
Sample Statistics
The sampling distribution.
Usually sampling is done without replacement.
<- 25
n %>%
SF slice_sample(n = n) %>%
summarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 1
mean_arr_delay
<dbl>
1 44.3
Note that different random sample produce different values for the sample statistic.
%>%
SF slice_sample(n = n) %>%
summarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 1
mean_arr_delay
<dbl>
1 2
Using simulation we can approximate the sampling distribution of the sample statistics. Here we are computing the mean, but we could do this for any statistic we are interested in, the median, sample variance, sample standard deviation, etc.
<- 500
num_trials <- 1:num_trials %>%
sf_25_means map_dfr(
~ SF %>%
slice_sample(n = n) %>%
summarize(mean_arr_delay = mean(arr_delay))
%>%
) mutate(n = n)
head(sf_25_means)
# A tibble: 6 × 2
mean_arr_delay n
<dbl> <dbl>
1 11 25
2 -2 25
3 13.2 25
4 -8.44 25
5 -12.6 25
6 -5.24 25
Summarize the sample means.
%>%
sf_25_means skim(mean_arr_delay)
Name | Piped data |
Number of rows | 500 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean_arr_delay | 0 | 1 | 3.37 | 9.53 | -20.6 | -3.77 | 2.44 | 9.25 | 52.2 | ▂▇▃▁▁ |
Confidence intervals
%>%
sf_25_means summarize(
x_bar = mean(mean_arr_delay),
se = sd(mean_arr_delay)
%>%
) mutate(
ci_lower = x_bar - 2 * se, # approximately 95% of observations
ci_upper = x_bar + 2 * se # are within two standard errors
)
# A tibble: 1 × 4
x_bar se ci_lower ci_upper
<dbl> <dbl> <dbl> <dbl>
1 3.37 9.53 -15.7 22.4
Using a larger sample size give a smaller Standard Error.
<- 100
n <- 1:500 %>%
sf_100_means map_dfr(
~ SF %>%
slice_sample(n = n) %>%
summarize(mean_arr_delay = mean(arr_delay))
%>%
) mutate(n = n)
Plots to compare the sampling distributions with different sample sizes.
%>%
sf_25_means bind_rows(sf_100_means) %>%
ggplot(aes(x = mean_arr_delay)) +
geom_histogram(bins = 30) +
facet_grid( ~ n) +
xlab("Sample mean")
Boostrap
Usually for random sampling we sample without replacement.
<- SF %>%
three_flights slice_sample(n = 3, replace = FALSE) %>%
select(year, month, day, dep_time)
three_flights
# A tibble: 3 × 4
year month day dep_time
<int> <int> <int> <int>
1 2013 2 22 952
2 2013 4 23 748
3 2013 4 8 1102
With the Bootstrap we use sampling with replacement
%>% slice_sample(n = 3, replace = TRUE) three_flights
# A tibble: 3 × 4
year month day dep_time
<int> <int> <int> <int>
1 2013 4 23 748
2 2013 4 23 748
3 2013 4 23 748
<- 200
n <- SF %>%
orig_sample slice_sample(n = n, replace = FALSE)
orig_sample
# A tibble: 200 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 5 31 2100 2059 1 2359 34
2 2013 1 27 1101 1100 1 1504 1439
3 2013 7 23 1702 1659 3 2009 2012
4 2013 5 20 719 735 -16 951 1110
5 2013 8 15 1153 1029 84 1457 1336
6 2013 7 9 2250 1855 235 147 2215
7 2013 5 23 1631 1600 31 2020 1951
8 2013 4 20 612 612 0 936 935
9 2013 4 13 2057 1950 67 17 2340
10 2013 3 13 1659 1530 89 2026 1910
# ℹ 190 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The reason for sampling with replacement is that we only have the Original Sample of size \(n\). We do not have the population. So if we sampled with out replacement we would only have one sample to look at. Sampling the Original Sample with replace allow us to general lots of samples and to investiage the variability of these samples.
%>%
orig_sample slice_sample(n = n, replace = TRUE) %>%
summarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 1
mean_arr_delay
<dbl>
1 -3.54
<- 1:num_trials %>%
sf_200_bs map_dfr(
~orig_sample %>%
slice_sample(n = n, replace = TRUE) %>%
summarize(mean_arr_delay = mean(arr_delay))
%>%
) mutate(n = n)
%>%
sf_200_bs skim(mean_arr_delay)
Name | Piped data |
Number of rows | 500 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean_arr_delay | 0 | 1 | -2.21 | 3.42 | -11.76 | -4.5 | -2.17 | 0.2 | 8.04 | ▁▅▇▅▁ |
%>% ggplot(aes(x = mean_arr_delay)) +
sf_200_bs geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
<- 1:num_trials %>%
sf_200_pop map_dfr(
~SF %>%
slice_sample(n = n, replace = TRUE) %>%
summarize(mean_arr_delay = mean(arr_delay))
%>%
) mutate(n = n)
%>%
sf_200_pop skim(mean_arr_delay)
Name | Piped data |
Number of rows | 500 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean_arr_delay | 0 | 1 | 2.44 | 2.98 | -5.18 | 0.52 | 2.33 | 4.19 | 12.57 | ▂▇▇▃▁ |
%>% ggplot(aes(x = mean_arr_delay)) +
sf_200_pop geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Question: Do the histograms look somewhat similar?
%>%
orig_sample summarize(q98 = quantile(arr_delay, p = 0.98))
# A tibble: 1 × 1
q98
<dbl>
1 135.
<- nrow(orig_sample)
n <- 1:num_trials %>%
sf_200_bs map_dfr(
~orig_sample %>%
slice_sample(n = n, replace = TRUE) %>%
summarize(q98 = quantile(arr_delay, p = 0.98))
)
%>%
sf_200_bs skim(q98)
Name | Piped data |
Number of rows | 500 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
q98 | 0 | 1 | 159.85 | 75.62 | 41.04 | 82.04 | 134.58 | 212.78 | 347 | ▇▇▅▃▂ |
<- 10000
n_large <- SF %>%
sf_10000_bs slice_sample(n = n_large, replace = FALSE)
<- 1:num_trials %>%
sf_200_bs map_dfr(~sf_10000_bs %>%
slice_sample(n = n_large, replace = TRUE) %>%
summarize(q98 = quantile(arr_delay, p = 0.98))
)
%>%
sf_200_bs skim(q98)
Name | Piped data |
Number of rows | 500 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
q98 | 0 | 1 | 151.77 | 6.07 | 133 | 148.04 | 152 | 155.01 | 167 | ▁▂▇▅▂ |
outliers
%>%
SF filter(arr_delay >= 420) %>%
select(month, day, dep_delay, arr_delay, carrier)
# A tibble: 7 × 5
month day dep_delay arr_delay carrier
<int> <int> <dbl> <dbl> <chr>
1 12 7 374 422 UA
2 7 6 589 561 DL
3 7 7 629 676 VX
4 7 7 653 632 VX
5 7 10 453 445 B6
6 7 10 432 433 VX
7 9 20 1014 1007 AA
%>%
SF filter(arr_delay < 420) %>%
ggplot(aes(arr_delay)) +
geom_histogram(binwidth = 15) +
labs(x = "Arrival delay (in minutes)")
%>%
SF group_by(hour) %>%
count() %>%
pivot_wider(names_from = hour, values_from = n) %>%
data.frame()
X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21
1 55 663 1696 987 429 1744 413 504 476 528 946 897 1491 1091 731 465 57
%>%
SF ggplot(aes(x = hour, y = arr_delay)) +
geom_boxplot(alpha = 0.1, aes(group = hour)) +
geom_smooth(method = "lm") +
xlab("Scheduled hour of departure") +
ylab("Arrival delay (minutes)") +
coord_cartesian(ylim = c(-30, 120))
`geom_smooth()` using formula = 'y ~ x'
<- lm(arr_delay ~ hour, data = SF)
mod1 ::tidy(mod1) broom
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -22.9 1.23 -18.6 2.88e- 76
2 hour 2.01 0.0915 22.0 1.78e-105