--- title: "Statistical Foundations" author: "Prof. Eric A. Suess" date: "January 25, 2021" output: beamer_presentation: default ioslides_presentation: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Statistical Foundations The authors of our book make many important observations about Data Science at the beginning of Chapter 7 (2e Chapter 9). - "The objective of Data Science is to extract meaning from data." - "Visualization is good for seeing patterns in noisy data." - "It is important to be able to see when the patterns we see are strong enough that they are not mere accidents." - **"Statistical methods quantify patterns and their strengths."** - "Some people think that "big data" has made statistics obsolete. The argument is that with lots of data, the data can speak clearly for themselves. This is wrong, as we shall see." - "This chapter will illustrate a Data Science *workflow*." ## Samples and Populations - Recall the Law of Large Numbers - Recall the Central Limit Theorem ## LLN Sample mean $\bar{x}$ converges to the population mean $\mu$. Simulation assuming the populations parameters are known. ```{r} mu <- 65; sigma <- 3; # population parameters assumed B <- 1000 x.sample <- rnorm(B, mu, sigma) x.running.mean <- numeric(B) for(i in 1: B){ x.running.mean[i] <- sum(x.sample[1:i])/i } ``` ## LLN ```{r} plot(x.running.mean, type="l") abline(h=mu, col="red") ``` ## CLT The *sampling distribution* of $\bar{x}$ is approximately $N(\mu, \frac{\sigma^2}{n})$. Simulation assuming the populations parameters are known. ```{r} mu <- 65; sigma <- 3; n <- 25 B <- 1000 x.mean <- numeric(B) for(i in 1: B){ x.mean[i] <- mean(rnorm(n, mu, sigma)) } ``` ## CLT ```{r} hist(x.mean) ``` ## CLT in the Tidyverse Simulation assuming the populations parameters are known. ```{r} library(pacman) p_load(mosaic, tidyverse) Trials_n <- do(1000) * mean( rnorm(n, mu, sigma) ) ``` ## CLT in the Tidyverse ```{r} Trials_n %>% ggplot(aes(mean)) + geom_histogram(bins=10) ``` ## ModernDive Chapter 7 Sampling For a [ModernDive](https://moderndive.com/) into sampling and the Central Limit Theorem try the code in [Chapter 7 Sampling](https://moderndive.com/7-sampling.html). In this chapter the [infer](http://infer.tidymodels.org/) R package is ued. This package has a number of modern functions to easily simulate resampling from a datafram. For further modern code, check out the [tidymodels](https://www.tidymodels.org/) R package [rsample](https://rsample.tidymodels.org/). ## Bootstrap The *bootstrap* is a statistical method that allows us to approximate the *sampling distribution* even without access to the population. Bootstrapping is a *resampling* method. Bootstrapping uses *sampling with replacement.* ## Sketch the difference Note that the main difference between the CLT and the Bootstrap is that for the CLT the sample size $n$ goes to infinity and with the Bootstrap the sample size remains fixed and the number of samples $B$ goes to infinity. ## ModernDive Chapter 8 Bootstrapping and Confidence Intervals. For a [ModernDive](https://moderndive.com/) into Bootstrapping and Confidence Intervals try the code in [Chapter 8 Bootstrapping and Confidence Intervals](https://moderndive.com/8-confidence-intervals.html). In this chapter the [infer](http://infer.tidymodels.org/) R package is ued. This package has a number of modern functions to easily simulate resampling from a dataframe. For further modern code, check out the [tidymodels](https://www.tidymodels.org/) R package [rsample](https://rsample.tidymodels.org/). ## Outliers Outliers should never be dropped unless there is a clear rationale. If outliers are dropped this should be clearly reported. ```{r, eval=FALSE} p_load(anomalize, tibbletime) tidyverse_cran_downloads %>% time_decompose(count, merge = TRUE) %>% anomalize(remainder) %>% time_recompose() %>% plot_anomalies(ncol = 3, alpha_dots = 0.25) ``` ## Outliers ```{r, echo=FALSE} p_load(anomalize, tibbletime) tidyverse_cran_downloads %>% time_decompose(count, merge = TRUE) %>% anomalize(remainder) %>% time_recompose() %>% plot_anomalies(ncol = 3, alpha_dots = 0.25) ``` ## Statistical Models Statistical models are used to explain variation between *response variables* and *explanatory variables*. Linear Regression models are commonly used to build models. They are fit using the least squares algorithm. This algorithm leads to unbiased estimators that have minimum variance. - We know that the estimators of the parameters in the model are computed used **optimization**. - We know that the estimators are **unbiased**. ## Confounding Variables What does the correlation coefficient measure? **Answer:** ??? Recall "Correlation does not imply causation." The gold standard is a controlled experiment. The authors describe the idea of *A/B testing*. Most data collected today is observational. So no designed experiment has been used. Recall **Simpson's Paradox**. ## Problems with **p-values** Everyone is using *many many many* p-values all assuming $\alpha = 0.05$. This causes much higher overall error rates. When using *multiple comparisons* and overall error rate should be addressed. ## Appendix E. Be sure to read Appendix E at the end of the book. It includes a very nice summary of fitting Multiple Linear Regression. ## Confounding variables A [confounding variable](https://en.wikipedia.org/wiki/Confounding) is another variable that influences the other variables. [Simpson's Paradox](http://en.wikipedia.org/wiki/Simpson%27s_paradox) [Edward Simpson: Bayes at Bletchley Park](http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2010.00424.x/abstract) ## Example of Simpson's Paradox ```{r} ### synthetic data # Consider book price (y) by number of pages (x) z = c("hardcover","hardcover", "hardcover","hardcover", "paperback", "paperback","paperback", "paperback") x1 = c( 150, 225, 342, 185) y1 = c( 27.43, 48.76, 50.25, 32.01 ) x2 = c( 475, 834, 1020, 790) y2 = c( 10.00, 15.73, 20.00, 17.89 ) x = c(x1, x2) y = c(y1, y2) ``` ## Example of Simpson's Paradox ```{r} plot(x,y) ``` ## Example of Simpson's Paradox ```{r} # correlation cor(y, x) cor(y1, x1) cor(y2, x2) ``` ## Example of Simpson's Paradox ```{r} # linear regression lm(y ~ x) ``` ## Example of Simpson's Paradox ```{r} # linear regression lm(y1 ~ x1) ``` ## Example of Simpson's Paradox ```{r} # linear regression lm(y2 ~ x2) ``` ## Example of Simpson's Paradox **Summary:** Simpson's Paradox is the changing of the direction of a relationship with the introduction of another variable. The relationship between Price and Number of pages in a book changes with the introduction of the variable Type of Book (Hardcover, Paperback). See the R Markdown document [SimpsonsParadox](http://rpubs.com/esuess/SimpsonsParadox) available on [RPubs.com/esuess](http://rpubs.com/esuess).