Suppose we are interested in estimating the stopping distance of cars.
We can investigate the data contained in the cars dataset in R.
help(cars)
head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
We will focus our investigation on the second variable, dist, which measures the stopping distance in feet (ft). In R, the $ operator selects the column of a data frame we want to work with.
y <- cars$dist
y
[1] 2 10 4 22 16 10 18 26 34 17 28 14 20 24 28 26 34 34 46
[20] 26 36 60 80 20 26 54 32 40 32 40 50 42 56 76 84 36 46 68
[39] 32 48 52 56 64 66 54 70 92 93 120 85
First estimate the population mean \(\mu\) using this one sample of stopping distances.
y_bar <- mean(y)
y_bar
[1] 42.98
So our estimate of the mean stopping distance is 43 feet.
Now let's compute a 95% confidence interval for \(\mu\) to see the range of likely values of \(\mu\).
We need the sample mean, the sample standard deviation, and the sample size. Since \(n\) is greater than 30, we will use the z critical value and substitute the sample standard deviation for the population standard deviation \(\sigma\).
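Written out, the interval described above has the following form. The symbols are chosen here to match the R names used in this section: \(\bar{y}\) is y_bar, \(s\) is y_sd, \(n\) is y_n, and \(z_{\alpha/2}\) is the z critical value for a confidence level of \(1 - \alpha\).

\[
\bar{y} \;\pm\; z_{\alpha/2}\,\frac{s}{\sqrt{n}}
\]

The quantity added to and subtracted from \(\bar{y}\) is the margin of error.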
y_sd <- sd(y)
y_n <- length(y)
Plot the data to see if there are any outliers.
hist(y)
plot(density(y))
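As a rough numerical companion to the plots, one possible check is sketched below. It is not part of the original analysis: it assumes the common rule of thumb that flags points lying more than 1.5 box lengths beyond the hinges, which is what boxplot.stats() implements, and the name outliers is introduced only for this illustration.

```r
y <- cars$dist  # stopping distances from the built-in cars dataset

# boxplot.stats() flags values lying more than 1.5 box lengths
# beyond the hinges and returns them in the $out component
outliers <- boxplot.stats(y)$out
outliers

# A large gap between mean and median hints that extreme values pull the mean
c(mean = mean(y), median = median(y))
```

A flagged value is only a candidate for a closer look, not automatically an error in the data.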
y_bar
[1] 42.98
y_sd
[1] 25.76938
y_n
[1] 50
# z critical value
C <- .95
alpha_2 <- (1-C)/2 # This is alpha/2
z_cv <- qnorm(alpha_2, lower.tail = FALSE)
z_cv
[1] 1.959964
# confidence interval
y_ci <- c( y_bar - z_cv * y_sd/sqrt(y_n),
           y_bar + z_cv * y_sd/sqrt(y_n) )
y_ci
[1] 35.83722 50.12278
I can say:
“The estimated mean stopping distance is 43 feet. We are 95% confident that the (unknown) true population mean is between 36 and 50 feet.”
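To see where these endpoints come from, the margin of error can be worked out by hand from the values printed earlier (sample mean 42.98, sample standard deviation 25.76938, sample size 50, z critical value 1.959964):

\[
z_{\alpha/2}\,\frac{s}{\sqrt{n}} = 1.959964 \times \frac{25.76938}{\sqrt{50}} \approx 7.14,
\qquad
42.98 \pm 7.14 \;\Rightarrow\; (35.84,\ 50.12)
\]

Rounded to whole feet, this gives the interval from 36 to 50 feet.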
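The t-distribution is more appropriate here because \(\sigma\) was replaced by the sample standard deviation. When working on a computer there is no reason to fall back on the z critical value, since R can compute the t critical value for any degrees of freedom.
t_cv <- qt(alpha_2, df = y_n - 1, lower.tail = FALSE)
y_ci <- c( y_bar - t_cv * y_sd/sqrt(y_n), y_bar + t_cv * y_sd/sqrt(y_n) )
y_ci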
[1] 35.65642 50.30358
R also has the t.test function, which computes the t confidence interval directly from the data.
library(ISwR)
t.test(y)
One Sample t-test
data: y
t = 11.794, df = 49, p-value = 6.384e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
35.65642 50.30358
sample estimates:
mean of x
42.98
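As a check that the by-hand t interval and t.test agree, the steps can be collected into a small function. This is a minimal sketch: the helper name t_interval and its conf_level argument are made up for this illustration and are not part of R or of the original notes.

```r
# Hypothetical helper (not a built-in): t confidence interval for a mean
t_interval <- function(x, conf_level = 0.95) {
  n <- length(x)
  alpha_2 <- (1 - conf_level) / 2
  t_cv <- qt(alpha_2, df = n - 1, lower.tail = FALSE)  # t critical value
  margin <- t_cv * sd(x) / sqrt(n)                     # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

y <- cars$dist
manual_ci <- t_interval(y)
builtin_ci <- t.test(y)$conf.int

# The two intervals should agree up to floating-point error
all.equal(unname(manual_ci), as.numeric(builtin_ci))
```

Calling t_interval(y, conf_level = 0.90) would give a narrower interval from the same data, since a lower confidence level uses a smaller critical value.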
We now have the 95% confidence interval for the unknown population mean \(\mu\), the mean stopping distance for cars. We can say the following:
Question: Why can’t we say that there is a 95% probability that the true population mean stopping distance is between 36 and 50 feet?
Recall our probability simulations: to estimate a probability, we need to know what we are counting as a success of our experiment (the numerator of our calculation). Here, success means that the computed 95% confidence interval includes the unknown \(\mu\). Since \(\mu\) is unknown, we do not know whether it is included or not. Either it is or it is not. Therefore the probability that this particular interval includes \(\mu\) is either 0 or 1, and we do not know which.
To understand 95% confidence we need to understand the idea of repeated sampling and what the CLT says about the distribution of the sample mean \(\bar{x}\).
When we say we are 95% confident that the true mean is contained in the computed confidence interval, we are saying we believe that the “method” used to compute the interval is good.
We can say, “If the experiment is repeated with a new sample each time and a 95% confidence interval is computed from each sample, about 95% of those intervals should contain the true (unknown) mean.”
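In symbols, the 95% refers to the random interval before any data are collected. Writing \(\bar{Y}\) and \(S\) for the sample mean and sample standard deviation viewed as random quantities (notation introduced here for illustration), the t interval is built so that

\[
P\!\left(\bar{Y} - t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}} \;<\; \mu \;<\; \bar{Y} + t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}\right) = 0.95 .
\]

This holds exactly when the population is normal and approximately, by the CLT, for large \(n\) otherwise. Once one sample is observed and the endpoints become fixed numbers, nothing random is left in the statement.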
The problem here is that this simulation is not performed. We have taken only one sample and we do not know the value of \(\mu\).
To simulate the performance of the 95% confidence interval we need to assume we know \(\mu\).
If we Google the average stopping distance of a car, we find many websites with information about stopping distance. Note that there is a difference between stopping distance and braking distance. I am going to assume the cars dataset contains braking distance.
So let's assume the population mean stopping distance for cars traveling less than 25 mph is 55 feet with a standard deviation of 20 feet.
mu <- 55
sigma <- 20
y_sample <- rnorm(y_n, mu, sigma)
y_sample
[1] 35.23952 51.63520 77.97296 59.79019 68.92897 37.83072 91.09382
[8] 61.72958 43.77240 57.88949 58.27106 58.38760 36.47747 47.00615
[15] 55.80350 73.75179 77.25147 77.82498 29.78769 54.39745 39.49608
[22] 14.21419 47.17743 50.91242 49.25746 60.01363 58.86216 79.86691
[29] 73.11861 40.82893 31.36866 106.63731 44.77388 34.91503 94.30470
[36] 18.99751 64.85707 47.27459 56.16636 61.85862 28.74674 37.69893
[43] 40.80150 76.32590 51.04598 40.08785 68.62321 31.60441 46.93854
[50] 36.57960
y_sample <- trunc(y_sample)
y_sample
[1] 35 51 77 59 68 37 91 61 43 57 58 58 36 47 55 73 77 77 29
[20] 54 39 14 47 50 49 60 58 79 73 40 31 106 44 34 94 18 64 47
[39] 56 61 28 37 40 76 51 40 68 31 46 36
y_ci <- t.test(y_sample)
y_ci
One Sample t-test
data: y_sample
t = 19.456, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
47.70509 58.69491
sample estimates:
mean of x
53.2
y_ci$conf.int
[1] 47.70509 58.69491
attr(,"conf.level")
[1] 0.95
y_ci$conf.int[1]
[1] 47.70509
y_ci$conf.int[2]
[1] 58.69491
Run one iteration of the repeated sampling procedure and see that the result is either 0 (the interval misses \(\mu\)) or 1 (the interval contains \(\mu\)).
y_01 <- (mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
y_01
[1] TRUE
y_conf <- mean(mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
y_conf
[1] 1
Here is all the code from above that we need to replicate over and over again.
y_sample <- trunc(rnorm(y_n, mu, sigma))
y_ci <- t.test(y_sample)
y_conf <- mean(mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
y_conf
[1] 1
Do the repeated sampling B times to see that the “method” works about 95% of the time.
B <- 100000
y_conf <- mean(replicate(B, {
  y_sample <- trunc(rnorm(y_n, mu, sigma))
  y_ci <- t.test(y_sample)
  mean(mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
}))
y_conf
[1] 0.94619
This probability simulation is never done in practice because it requires knowledge of \(\mu\). So we do not speak of probability. Instead, we trust that the process or method used to produce the confidence interval is a good way to produce the margin of error.
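For experimenting with the idea, the repeated sampling can be wrapped in a function so that the sample size, the assumed population values, and the confidence level can be varied. This is only a sketch under stated assumptions: the function name coverage and its arguments are made up here, the population is assumed to be normal with a known mean and standard deviation, the truncation step is left out for simplicity, and the seed is arbitrary and used only to make the run reproducible.

```r
# Hypothetical helper: estimate how often the t interval covers a known mean
coverage <- function(B, n, mu, sigma, conf_level = 0.95) {
  covered <- replicate(B, {
    y_sample <- rnorm(n, mu, sigma)                         # one new sample
    ci <- t.test(y_sample, conf.level = conf_level)$conf.int
    ci[1] < mu && mu < ci[2]                                # TRUE if mu is inside
  })
  mean(covered)  # proportion of intervals that contain mu
}

set.seed(1)  # arbitrary seed, for reproducibility only
cov_95 <- coverage(B = 2000, n = 50, mu = 55, sigma = 20)
cov_90 <- coverage(B = 2000, n = 50, mu = 55, sigma = 20, conf_level = 0.90)
c(cov_95 = cov_95, cov_90 = cov_90)
```

With a smaller B the estimates are noisier, but they should land near the nominal 0.95 and 0.90, which is what it means for the method to work at the stated confidence level.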