Stat. 316: Confidence Intervals

Author

Prof. Eric A. Suess

Published

April 15, 2024

95% Confidence Intervals

Suppose we are interested in estimating the stopping distance of cars.

We can investigate the data contained in the cars dataset in R.

help(cars)

head(cars)

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

We will focus our investigation on the second variable, dist, which measures the stopping distance in feet (ft). In R, the $ operator can be used to select the column of a data frame we want to work with.

y <- cars$dist
y

 [1]   2  10   4  22  16  10  18  26  34  17  28  14  20  24  28  26  34  34  46
[20]  26  36  60  80  20  26  54  32  40  32  40  50  42  56  76  84  36  46  68
[39]  32  48  52  56  64  66  54  70  92  93 120  85

First estimate the population mean $\mu$ using this one sample of stopping distances.

y_bar <- mean(y)
y_bar

[1] 42.98

So our estimate of the mean stopping distance is about 43 feet.

Now let's compute a 95% confidence interval for $\mu$ to see the range of likely values of $\mu$.

Calculate a 95% confidence interval for a data set.

We need the sample mean, the sample standard deviation, and the sample size. Since $n$ is greater than 30, we will use the z critical value and substitute the sample standard deviation for the population standard deviation $\sigma$.
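In symbols, the large-sample interval described above can be sketched as follows, where $\bar{y}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $z_{\alpha/2}$ the z critical value. The notation is an assumption made here for illustration; the notes name these quantities only in words and in R code.

$$\bar{y} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}, \qquad \alpha = 1 - 0.95 = 0.05$$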

y_sd  <- sd(y)
y_n   <- length(y)

Plot the data to see if there are any outliers.

hist(y)

plot(density(y))

y_bar

[1] 42.98

y_sd

[1] 25.76938

y_n

[1] 50

# z critical value

C <- .95

alpha_2 <- (1-C)/2  # This is alpha/2

z_cv <- qnorm(alpha_2, lower.tail = FALSE)
z_cv

[1] 1.959964

# confidence interval

y_ci <- c( y_bar - z_cv * y_sd/sqrt(y_n), 
           y_bar + z_cv * y_sd/sqrt(y_n) )
y_ci

[1] 35.83722 50.12278

I can say:

“The estimated mean stopping distance is 43 feet. We are 95% confident that the (unknown) true population mean is between 36 and 50 feet.”

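The t-distribution is more appropriate here because we estimate $\sigma$ with the sample standard deviation. On a computer the t critical value is as easy to obtain as the z critical value, so there is no need to rely on the normal approximation.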

t_cv <- qt(alpha_2, df = y_n - 1, lower.tail = FALSE)

y_ci <- c( y_bar - t_cv * y_sd/sqrt(y_n), y_bar + t_cv * y_sd/sqrt(y_n) )
y_ci

[1] 35.65642 50.30358
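To see how much the switch from z to t matters for these data, the two critical values and the resulting margins of error can be compared side by side. This is a minimal sketch that recomputes everything from the built-in cars data; the names se, z_cv, t_cv, and me are chosen here for illustration.

```r
y  <- cars$dist
n  <- length(y)
se <- sd(y) / sqrt(n)            # estimated standard error of the mean

z_cv <- qnorm(0.975)             # z critical value for 95% confidence
t_cv <- qt(0.975, df = n - 1)    # t critical value with n - 1 degrees of freedom

# Critical values and margins of error (critical value times standard error)
cv <- c(z = z_cv, t = t_cv)
me <- cv * se
cv
me
```

With $n = 50$ the two critical values are close, which is why the z and t intervals above differ by only a fraction of a foot at each end.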

In R, the t.test function produces the t confidence interval directly from the data.

library(ISwR)  # optional here: t.test() comes from the stats package, which R loads by default

t.test(y)


    One Sample t-test

data:  y
t = 11.794, df = 49, p-value = 6.384e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 35.65642 50.30358
sample estimates:
mean of x 
    42.98

We now have the 95% confidence interval for the unknown population mean $\mu$, the mean stopping distance for cars. We can say the following:

Our best estimate of the population mean stopping distance $\mu$ is 43 feet.
We are 95% confident that the true population mean $\mu$ is between 36 and 50 feet.
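The two statements above are often combined into the form "estimate plus or minus margin of error". A minimal sketch, assuming the margin of error is taken as half the width of the interval returned by t.test (the names ci, estimate, and margin are illustrative and not from the notes):

```r
ci       <- as.numeric(t.test(cars$dist)$conf.int)  # lower and upper limits
estimate <- mean(cars$dist)                         # point estimate of mu
margin   <- diff(ci) / 2                            # half the interval width

round(c(estimate = estimate, margin_of_error = margin), 1)
```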

Question: Why can’t we say that there is a 95% probability that the true population mean stopping distance is between 36 and 50 feet?

We do not know what the value of $\mu$ is. So we cannot check to see if it is included in the one interval we have calculated.
We have only performed the “experiment” of sampling from the population one time.

Recall our probability simulations: to estimate a probability, we need to know what counts as a success in the experiment (the numerator of our calculation). Here, success means that the computed 95% confidence interval includes the unknown $\mu$. Since $\mu$ is unknown, we cannot tell whether it is included. Either it is or it is not, so for this one interval the probability of including $\mu$ is either 0 or 1, and we do not know which.

The idea of the CLT and repeated sampling.

To understand 95% confidence we need to understand the idea of repeated sampling and what the CLT says about the distribution of the sample mean $\bar{y}$.
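For reference, the CLT statement can be sketched as follows, assuming independent observations from a population with mean $\mu$ and finite standard deviation $\sigma$. The capital $\bar{Y}$ is notation introduced here to stress that the sample mean is random before the sample is drawn.

$$\bar{Y} \;\overset{\text{approx.}}{\sim}\; N\!\left(\mu,\ \frac{\sigma^2}{n}\right) \quad \text{for large } n$$

$$P\!\left(\bar{Y} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{Y} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha = 0.95$$

The probability refers to the random interval over repeated samples, not to any single interval computed from one observed sample.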

When we say we are 95% confident that the true mean is contained in the computed confidence interval, we are saying we believe that the “method” used to compute the interval is good.

We can say, “If repeated sampling is used to repeat the experiment and the 95% confidence interval is computed each time, the interval should contain the true (unknown) mean 95% of the time.”

The problem here is that this repeated sampling is not performed in practice. We have only taken one sample and we do not know the value of $\mu$.

To simulate the performance of the 95% confidence interval we need to assume we know $\mu$.

If we Google the average stopping distance of a car, there are many websites that contain information about stopping distance. You will note that there is a difference between stopping distance and braking distance. I am going to assume the cars dataset contains braking distance.

So let's assume the population stopping distance for cars traveling less than 25 mph is 55 feet with a standard deviation of 20 feet.

mu <- 55
sigma <- 20

y_sample <- rnorm(y_n, mu, sigma)
y_sample

 [1]  35.23952  51.63520  77.97296  59.79019  68.92897  37.83072  91.09382
 [8]  61.72958  43.77240  57.88949  58.27106  58.38760  36.47747  47.00615
[15]  55.80350  73.75179  77.25147  77.82498  29.78769  54.39745  39.49608
[22]  14.21419  47.17743  50.91242  49.25746  60.01363  58.86216  79.86691
[29]  73.11861  40.82893  31.36866 106.63731  44.77388  34.91503  94.30470
[36]  18.99751  64.85707  47.27459  56.16636  61.85862  28.74674  37.69893
[43]  40.80150  76.32590  51.04598  40.08785  68.62321  31.60441  46.93854
[50]  36.57960

y_sample <- trunc(y_sample)
y_sample

 [1]  35  51  77  59  68  37  91  61  43  57  58  58  36  47  55  73  77  77  29
[20]  54  39  14  47  50  49  60  58  79  73  40  31 106  44  34  94  18  64  47
[39]  56  61  28  37  40  76  51  40  68  31  46  36

y_ci <- t.test(y_sample)
y_ci


    One Sample t-test

data:  y_sample
t = 19.456, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 47.70509 58.69491
sample estimates:
mean of x 
     53.2

y_ci$conf.int

[1] 47.70509 58.69491
attr(,"conf.level")
[1] 0.95

y_ci$conf.int[1]

[1] 47.70509

y_ci$conf.int[2]

[1] 58.69491

Run one iteration of the repeated sampling procedure and see that the result is either 0 or 1: the interval either contains $\mu$ or it does not.

y_01 <- (mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
y_01

[1] TRUE

y_conf <- mean(mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
y_conf

[1] 1

Here is all the code from above that we need to replicate over and over again.

y_sample <- trunc(rnorm(y_n, mu, sigma))
y_ci <- t.test(y_sample)
y_conf <- mean(mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])

y_conf

[1] 1

Repeat the sampling B times to see that the “method” works about 95% of the time.

B <- 100000

y_conf <- mean(replicate(B, {
  y_sample <- trunc(rnorm(y_n, mu, sigma))
  y_ci <- t.test(y_sample)
  mean(mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
}))

y_conf

[1] 0.94619
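The estimate 0.94619 is close to, but slightly below, the nominal 0.95. One possible contributor, offered here as an assumption rather than something stated in the notes, is the call to trunc(): rounding each simulated distance down lowers the sample mean by roughly half a foot relative to mu. The sketch below compares coverage with and without truncation and reports the Monte Carlo standard error. The helper coverage(), the seed, and the vectorized margin-of-error calculation are all illustrative choices, not the method used above.

```r
# Hypothetical helper: estimate the coverage of the 95% t interval over
# B simulated samples of size n, with or without truncating the data.
coverage <- function(B, n, mu, sigma, truncate) {
  x <- matrix(rnorm(n * B, mu, sigma), nrow = n)   # one sample per column
  if (truncate) x <- trunc(x)
  x_bar <- colMeans(x)                             # sample means
  x_sd  <- apply(x, 2, sd)                         # sample standard deviations
  me    <- qt(0.975, df = n - 1) * x_sd / sqrt(n)  # margins of error
  mean(mu > x_bar - me & mu < x_bar + me)          # proportion containing mu
}

set.seed(316)  # arbitrary seed, used only to make this sketch reproducible
B <- 100000

cov_trunc <- coverage(B, n = 50, mu = 55, sigma = 20, truncate = TRUE)
cov_raw   <- coverage(B, n = 50, mu = 55, sigma = 20, truncate = FALSE)

# Monte Carlo standard error of an estimated proportion near 0.95
mc_se <- sqrt(0.95 * 0.05 / B)

c(truncated = cov_trunc, untruncated = cov_raw, mc_se = mc_se)
```

Comparing each estimate with 0.95 in units of mc_se gives a rough sense of whether a shortfall is larger than simulation noise alone would explain.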

This probability simulation is never done in practice because it requires knowledge of $\mu$. So we do not talk about probability for a single computed interval. Instead, we trust that this process or method of producing a confidence interval is a good way to produce the margin of error.