library(pacman)
p_load(mdsr, nycflights13, tidyverse, skimr)
Chapter 9 Statistical Foundations
Samples and Populations
The population considered in the book is contained in the nycflights13 dataset. In the nycflights13 dataset contains all of the departing flights from airports in the NYC Area.
Recall the flights dataframe.
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Take a sample of flights from NYC to SF. Get all of the flights to SF and take a sample of 25 of them.
Note that I did not set a seed for the sample, so your answers will differ from what is in the book.
<- flights |>
SF filter(dest == "SFO", !is.na(arr_delay))
SF
# A tibble: 13,173 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 558 600 -2 923 937
2 2013 1 1 611 600 11 945 931
3 2013 1 1 655 700 -5 1037 1045
4 2013 1 1 729 730 -1 1049 1115
5 2013 1 1 734 737 -3 1047 1113
6 2013 1 1 745 745 0 1135 1125
7 2013 1 1 746 746 0 1119 1129
8 2013 1 1 803 800 3 1132 1144
9 2013 1 1 826 817 9 1145 1158
10 2013 1 1 1029 1030 -1 1427 1355
# ℹ 13,163 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
<- SF |>
sf_25 slice_sample(n = 25)
sf_25
# A tibble: 25 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 3 24 1855 1855 0 2207 2240
2 2013 5 29 1851 1855 -4 2307 2215
3 2013 4 8 1029 1027 2 1356 1359
4 2013 9 12 1904 1730 94 2357 2100
5 2013 7 30 1606 1609 -3 1948 1928
6 2013 12 9 1435 1430 5 1757 1757
7 2013 10 2 1528 1530 -2 1851 1845
8 2013 7 22 1854 1900 -6 18 2240
9 2013 10 1 1026 1025 1 1317 1340
10 2013 2 18 1134 1049 45 1448 1431
# ℹ 15 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Get the summary statistics for the sample taken.
|>
sf_25 skim(arr_delay)
Name | sf_25 |
Number of rows | 25 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
arr_delay | 0 | 1 | 17.8 | 56.07 | -69 | -6 | 5 | 17 | 177 | ▃▇▁▁▁ |
Since the SF dataset contains all flights from NYC to SF in 2013, the statistics computed from the SF dataset are the calculated population paramaters.
|>
SF skim(arr_delay)
Name | SF |
Number of rows | 13173 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
arr_delay | 0 | 1 | 2.67 | 47.67 | -86 | -23 | -8 | 12 | 1007 | ▇▁▁▁▁ |
Get the 98th percentile of the sample.
|>
sf_25 summarize(q98 = quantile(arr_delay, p = 0.98))
# A tibble: 1 × 1
q98
<dbl>
1 157.
Get the estimated proportion of flights to SF with a delay less than 90 minutes.
|>
SF group_by(arr_delay < 90) |>
count() |>
mutate(pct = n / nrow(SF))
# A tibble: 2 × 3
# Groups: arr_delay < 90 [2]
`arr_delay < 90` n pct
<lgl> <int> <dbl>
1 FALSE 640 0.0486
2 TRUE 12533 0.951
Compare with the 98th percentile. 90 minutes is the 95th percentile.
|>
SF summarize(q98 = quantile(arr_delay, p = 0.98))
# A tibble: 1 × 1
q98
<dbl>
1 153
Sample Statistics
The sampling distribution.
Usually sampling is done without replacement.
<- 25
n |>
SF slice_sample(n = n) |>
summarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 1
mean_arr_delay
<dbl>
1 -0.48
Note that different random sample produce different values for the sample statistic.
|>
SF slice_sample(n = n) |>
summarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 1
mean_arr_delay
<dbl>
1 2.24
Using simulation we can approximate the sampling distribution of the sample statistics. Here we are computing the mean, but we could do this for any statistic we are interested in, the median, sample variance, sample standard deviation, etc.
<- 500
num_trials <- 1:num_trials |>
sf_25_means map_dfr(
~ SF |>
slice_sample(n = n) |>
summarize(mean_arr_delay = mean(arr_delay))
|>
) mutate(n = n)
head(sf_25_means)
# A tibble: 6 × 2
mean_arr_delay n
<dbl> <dbl>
1 16.3 25
2 6.36 25
3 8.72 25
4 2.56 25
5 -1.56 25
6 -5.4 25
Summarize the sample means.
|>
sf_25_means skim(mean_arr_delay)
Name | sf_25_means |
Number of rows | 500 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean_arr_delay | 0 | 1 | 2.39 | 9.09 | -22.76 | -4.02 | 1.3 | 7.88 | 34 | ▁▇▇▃▁ |
Confidence intervals
|>
sf_25_means summarize(
x_bar = mean(mean_arr_delay),
se = sd(mean_arr_delay)
|>
) mutate(
ci_lower = x_bar - 2 * se, # approximately 95% of observations
ci_upper = x_bar + 2 * se # are within two standard errors
)
# A tibble: 1 × 4
x_bar se ci_lower ci_upper
<dbl> <dbl> <dbl> <dbl>
1 2.39 9.09 -15.8 20.6
Using a larger sample size give a smaller Standard Error.
<- 100
n <- 1:500 |>
sf_100_means map_dfr(
~ SF |>
slice_sample(n = n) |>
summarize(mean_arr_delay = mean(arr_delay))
|>
) mutate(n = n)
Plots to compare the sampling distributions with different sample sizes.
|>
sf_25_means bind_rows(sf_100_means) |>
ggplot(aes(x = mean_arr_delay)) +
geom_histogram(bins = 30) +
facet_grid( ~ n) +
xlab("Sample mean")
Boostrap
Usually for random sampling we sample without replacement.
<- SF |>
three_flights slice_sample(n = 3, replace = FALSE) |>
select(year, month, day, dep_time)
three_flights
# A tibble: 3 × 4
year month day dep_time
<int> <int> <int> <int>
1 2013 8 10 1026
2 2013 3 27 1651
3 2013 9 23 2020
With the Bootstrap we use sampling with replacement
|> slice_sample(n = 3, replace = TRUE) three_flights
# A tibble: 3 × 4
year month day dep_time
<int> <int> <int> <int>
1 2013 9 23 2020
2 2013 3 27 1651
3 2013 8 10 1026
<- 200
n <- SF |>
orig_sample slice_sample(n = n, replace = FALSE)
orig_sample
# A tibble: 200 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 3 20 607 600 7 941 925
2 2013 10 29 1555 1555 0 1944 1939
3 2013 7 8 1802 1710 52 2144 2035
4 2013 6 12 1154 1147 7 1506 1454
5 2013 3 3 1723 1725 -2 2047 2050
6 2013 5 26 858 800 58 1210 1135
7 2013 7 13 1155 1130 25 1442 1446
8 2013 11 7 846 850 -4 1209 1228
9 2013 12 12 1709 1710 -1 2021 2045
10 2013 11 24 1129 1030 59 1424 1355
# ℹ 190 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The reason for sampling with replacement is that we only have the Original Sample of size \(n\). We do not have the population. So if we sampled with out replacement we would only have one sample to look at. Sampling the Original Sample with replace allow us to general lots of samples and to investiage the variability of these samples.
|>
orig_sample slice_sample(n = n, replace = TRUE) |>
summarize(mean_arr_delay = mean(arr_delay))
# A tibble: 1 × 1
mean_arr_delay
<dbl>
1 7.10
<- 1:num_trials |>
sf_200_bs map_dfr(
~orig_sample |>
slice_sample(n = n, replace = TRUE) |>
summarize(mean_arr_delay = mean(arr_delay))
|>
) mutate(n = n)
|>
sf_200_bs skim(mean_arr_delay)
Name | sf_200_bs |
Number of rows | 500 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean_arr_delay | 0 | 1 | 4.7 | 3.56 | -4.66 | 2.42 | 4.79 | 6.97 | 16.73 | ▂▆▇▂▁ |
|> ggplot(aes(x = mean_arr_delay)) +
sf_200_bs geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
<- 1:num_trials |>
sf_200_pop map_dfr(
~SF |>
slice_sample(n = n, replace = TRUE) |>
summarize(mean_arr_delay = mean(arr_delay))
|>
) mutate(n = n)
|>
sf_200_pop skim(mean_arr_delay)
Name | sf_200_pop |
Number of rows | 500 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean_arr_delay | 0 | 1 | 2.36 | 3.28 | -6.82 | 0.07 | 2.2 | 4.62 | 11.26 | ▁▅▇▅▁ |
|> ggplot(aes(x = mean_arr_delay)) +
sf_200_pop geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Question: Do the histograms look somewhat similar?
|>
orig_sample summarize(q98 = quantile(arr_delay, p = 0.98))
# A tibble: 1 × 1
q98
<dbl>
1 151.
<- nrow(orig_sample)
n <- 1:num_trials |>
sf_200_bs map_dfr(
~orig_sample |>
slice_sample(n = n, replace = TRUE) |>
summarize(q98 = quantile(arr_delay, p = 0.98))
)
|>
sf_200_bs skim(q98)
Name | sf_200_bs |
Number of rows | 500 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
q98 | 0 | 1 | 147.46 | 46.93 | 78.06 | 112.78 | 151 | 166 | 327 | ▇▇▁▁▁ |
<- 10000
n_large <- SF |>
sf_10000_bs slice_sample(n = n_large, replace = FALSE)
<- 1:num_trials |>
sf_200_bs map_dfr(~sf_10000_bs |>
slice_sample(n = n_large, replace = TRUE) |>
summarize(q98 = quantile(arr_delay, p = 0.98))
)
|>
sf_200_bs skim(q98)
Name | sf_200_bs |
Number of rows | 500 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
q98 | 0 | 1 | 151.07 | 4.73 | 136 | 149 | 152 | 154 | 166 | ▁▂▇▃▁ |
outliers
|>
SF filter(arr_delay >= 420) |>
select(month, day, dep_delay, arr_delay, carrier)
# A tibble: 7 × 5
month day dep_delay arr_delay carrier
<int> <int> <dbl> <dbl> <chr>
1 12 7 374 422 UA
2 7 6 589 561 DL
3 7 7 629 676 VX
4 7 7 653 632 VX
5 7 10 453 445 B6
6 7 10 432 433 VX
7 9 20 1014 1007 AA
|>
SF filter(arr_delay < 420) |>
ggplot(aes(arr_delay)) +
geom_histogram(binwidth = 15) +
labs(x = "Arrival delay (in minutes)")
|>
SF group_by(hour) |>
count() |>
pivot_wider(names_from = hour, values_from = n) |>
data.frame()
X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21
1 55 663 1696 987 429 1744 413 504 476 528 946 897 1491 1091 731 465 57
|>
SF ggplot(aes(x = hour, y = arr_delay)) +
geom_boxplot(alpha = 0.1, aes(group = hour)) +
geom_smooth(method = "lm") +
xlab("Scheduled hour of departure") +
ylab("Arrival delay (minutes)") +
coord_cartesian(ylim = c(-30, 120))
`geom_smooth()` using formula = 'y ~ x'
<- lm(arr_delay ~ hour, data = SF)
mod1 ::tidy(mod1) broom
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -22.9 1.23 -18.6 2.88e- 76
2 hour 2.01 0.0915 22.0 1.78e-105