---
title: "Stat. 316: Confidence Intervals"
author: "Prof. Eric A. Suess"
date: "4/15/2024"
format:
  html:
    self-contained: true
---

# 95% Confidence Intervals

Suppose we are interested in estimating the stopping distance of cars. We can investigate the data contained in the **cars** dataset in R.

```{r}
help(cars)
head(cars)
```

We will focus our investigation on the second variable **dist**, which measures the stopping distance in feet (ft).

In R the `$` operator can be used to specify the column of a data frame we want to work with.

```{r}
y <- cars$dist
y
```

First, estimate the population mean $\mu$ using this one sample of stopping distances.

```{r}
y_bar <- mean(y)
y_bar
```

So our estimate of the mean stopping distance is 43 feet.

Now let's compute a 95% confidence interval for $\mu$ to see the range of likely values of $\mu$.

## Calculate a 95% confidence interval for a data set.

We need the sample mean, the sample standard deviation, and the sample size. Since $n$ is greater than 30, we will use the *z* critical value and we will substitute the sample standard deviation for the population standard deviation $\sigma$.

```{r}
y_sd <- sd(y)
y_n <- length(y)
```

Plot the data to see if there are any outliers.

```{r}
hist(y)
plot(density(y))
```

```{r}
y_bar
y_sd
y_n

# z critical value
C <- .95
alpha_2 <- (1-C)/2    # This is alpha/2
z_cv <- qnorm(alpha_2, lower.tail = FALSE)
z_cv

# confidence interval
y_ci <- c( y_bar - z_cv * y_sd/sqrt(y_n), y_bar + z_cv * y_sd/sqrt(y_n) )
y_ci
```

I can say: "The estimated mean stopping distance is 43 feet. We are 95% confident that the (unknown) true population mean is between 36 and 50 feet."

Because $\sigma$ is estimated by the sample standard deviation, the *t* critical value is more appropriate, and when working on a computer it is just as easy to obtain as the *z* critical value.

```{r}
# t critical value with n - 1 degrees of freedom
t_cv <- qt(alpha_2, df = y_n - 1, lower.tail = FALSE)

y_ci <- c( y_bar - t_cv * y_sd/sqrt(y_n), y_bar + t_cv * y_sd/sqrt(y_n) )
y_ci
```

In R there is the *t.test* function that produces the *t* confidence interval when we have data.

```{r}
library(ISwR)
# t.test() itself comes from the stats package, which R loads by default
t.test(y)
```

We now have the 95% confidence interval for the unknown population mean $\mu$, the mean stopping distance for cars.

We can say the following:

1. Our best estimate of the population mean stopping distance $\mu$ is 43 feet.
2. We are 95% confident that the true population mean $\mu$ is between 36 and 50 feet.

**Question:** Why can't we say that there is a 95% probability that the true population mean stopping distance is between 36 and 50 feet?

1. We do not know what the value of $\mu$ is. So we cannot check to see if it is included in the one interval we have calculated.
2. We have only performed the "experiment" of sampling from the population one time.

Recall all of our probability simulations: to estimate a probability we need to know what we are counting toward the success of our experiment (the numerator of our calculation). Here success means that the computed 95% confidence interval includes the unknown $\mu$. Since $\mu$ is unknown, we do not know if it is included or not. Either it is or it is not. Therefore the probability of including $\mu$ is either 0 or 1, and we do not know which.

## The idea of the CLT and repeated sampling.

To understand 95% confidence we need to understand the idea of repeated sampling and what the CLT says about the distribution of the sample mean $\bar{x}$.

When we say we are 95% confident that the true mean is contained in the computed confidence interval, we are saying we believe that the "method" used to compute the interval is good.

We can say, "If repeated sampling is used to repeat the experiment and the 95% confidence interval is computed each time, it should contain the true (unknown) mean 95% of the time."

The problem here is that this simulation is not performed. We have only taken one sample and we do not know the value of $\mu$.

To simulate the performance of the 95% confidence interval we need to assume we know $\mu$.

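
The repeated-sampling statement can also be written as a formula. This is only a sketch: it assumes $n$ is large enough for the CLT to apply and that the sample standard deviation $s$ stands in for $\sigma$, as in the *z* interval computed earlier.

$$
\bar{x} \overset{\text{approx.}}{\sim} N\!\left(\mu, \frac{\sigma^2}{n}\right)
\quad\Longrightarrow\quad
P\!\left(\bar{x} - z_{\alpha/2}\,\frac{s}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\,\frac{s}{\sqrt{n}}\right) \approx 0.95
$$

The probability in this statement is about the interval endpoints, which change from sample to sample, while $\mu$ stays fixed. Once a single interval such as 36 to 50 feet has been computed, nothing random is left, which is why we use the word confidence instead of probability.
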
If we Google *average stopping distance of a car* there are many websites that contain information about [stopping distance](https://desimonelawoffice.com/blog/how-long-does-it-take-to-stop-a-moving-vehicle/). You will note that there is a difference between stopping distance and braking distance. I am going to assume the cars dataset contains braking distance.

So let's assume the population mean stopping distance for cars traveling less than 25 mph is 55 feet with a standard deviation of 20 feet.

```{r}
mu <- 55
sigma <- 20

# simulate one sample of the same size as the cars data
y_sample <- rnorm(y_n, mu, sigma)
y_sample

# drop the decimals so the values look like the whole-number cars data
y_sample <- trunc(y_sample)
y_sample
```

```{r}
y_ci <- t.test(y_sample)
y_ci
```

```{r}
y_ci$conf.int
y_ci$conf.int[1]
y_ci$conf.int[2]
```

Run one iteration of the repeated sampling procedure and see that the probability is 0 or 1.

```{r}
y_01 <- (mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
y_01

# the mean of a single TRUE/FALSE value is 1 or 0
y_conf <- mean(mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
y_conf
```

Here is all the code from above that we need to replicate over and over again.

```{r}
y_sample <- trunc(rnorm(y_n, mu, sigma))
y_ci <- t.test(y_sample)
y_conf <- mean(mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
y_conf
```

Do the repeated sampling B times to see that the "method" works 95% of the time.

```{r}
B <- 100000
y_conf <- mean(replicate(B, {
  y_sample <- trunc(rnorm(y_n, mu, sigma))
  y_ci <- t.test(y_sample)
  mean(mu > y_ci$conf.int[1] & mu < y_ci$conf.int[2])
}))
y_conf
```

This probability simulation is **never done** in practice, and it requires knowledge of $\mu$. So we do not discuss probability. We trust that this process, or method, for producing a confidence interval is a good way to produce the margin of error.

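
As a rough check on the two critical values used earlier, the sketch below compares the simulated coverage of the *z* interval and the *t* interval on the same simulated samples. The names `mu_sim`, `sigma_sim`, `n_sim`, and `B_small`, their values, and the seed are assumptions chosen for illustration. They are not part of the analysis above.

```{r}
# Sketch only: all values below are assumed for illustration
set.seed(316)                 # assumed seed, only for reproducibility
mu_sim <- 55                  # assumed "known" population mean
sigma_sim <- 20               # assumed population standard deviation
n_sim <- 50                   # same sample size as the cars data
B_small <- 2000               # fewer replications so the chunk runs quickly
alpha_2_sim <- (1 - 0.95) / 2

z_cv_sim <- qnorm(alpha_2_sim, lower.tail = FALSE)
t_cv_sim <- qt(alpha_2_sim, df = n_sim - 1, lower.tail = FALSE)

# For each simulated sample, record whether each interval contains mu_sim
covers <- replicate(B_small, {
  y_sim <- rnorm(n_sim, mu_sim, sigma_sim)
  se <- sd(y_sim) / sqrt(n_sim)
  err <- abs(mean(y_sim) - mu_sim)
  c(z = err < z_cv_sim * se, t = err < t_cv_sim * se)
})

# Proportion of samples whose interval contains mu_sim, for each method
coverage <- rowMeans(covers)
coverage
```

Because the *t* critical value is slightly larger than the *z* critical value, every sample covered by the *z* interval is also covered by the *t* interval, so the *t* coverage cannot come out lower in this comparison. With $n = 50$ the two should be close.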