2024-01-24
The authors of our book make many important observations about Data Science at the beginning of Chapter 9.
The sample mean \(\bar{x}\) converges to the population mean \(\mu\) as the sample size \(n\) increases.
Simulation assuming the population parameters are known.
By the Central Limit Theorem, the sampling distribution of \(\bar{x}\) is approximately \(N(\mu, \frac{\sigma^2}{n})\).
See Chapter 7 for examples of these simulations.
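A minimal base R sketch of such a simulation (the population values \(\mu = 50\), \(\sigma = 10\), the sample size \(n = 25\), and the number of simulated samples are assumed here purely for illustration):

```r
# Simulate the sampling distribution of the sample mean when the
# population parameters are known (assumed values for illustration).
set.seed(123)
mu    <- 50     # assumed population mean
sigma <- 10     # assumed population standard deviation
n     <- 25     # sample size
B     <- 10000  # number of simulated samples

xbar <- replicate(B, mean(rnorm(n, mean = mu, sd = sigma)))

# Compare the simulation to the CLT approximation N(mu, sigma^2 / n).
mean(xbar)   # should be close to mu
sd(xbar)     # should be close to sigma / sqrt(n)
hist(xbar, breaks = 50, main = "Simulated sampling distribution of x-bar")
```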
For a ModernDive into sampling and the Central Limit Theorem, try the code in Chapter 7, Sampling.
In this chapter the infer R package is used. This package has a number of modern functions that make it easy to simulate resampling from a data frame.
For further modern code, check out the rsample package from tidymodels.
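As a rough sketch of this kind of resampling from a data frame (the mtcars data, the mpg variable, and the sample size and number of reps are illustrative choices, not from the book), infer's rep_sample_n() draws repeated samples and dplyr verbs summarize them:

```r
library(dplyr)
library(infer)

# Draw 1000 samples of size 25 (without replacement) from a data frame
# and compute the sample mean of mpg within each replicate.
samples <- mtcars %>%
  rep_sample_n(size = 25, reps = 1000) %>%
  summarize(xbar = mean(mpg))

head(samples)

# A related tidymodels workflow: rsample::bootstraps(mtcars, times = 1000)
```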
The bootstrap is a statistical method that allows us to approximate the sampling distribution even without access to the population.
Bootstrapping is a resampling method that uses sampling with replacement.
Note that the main difference between the CLT and the bootstrap is that the CLT describes what happens as the sample size \(n\) goes to infinity, while with the bootstrap the sample size \(n\) remains fixed and the number of resamples \(B\) goes to infinity.
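A minimal base R sketch of this idea (the observed data vector is made up for illustration): the sample size \(n\) stays fixed, and we draw \(B\) resamples with replacement from the observed sample.

```r
# Bootstrap the sample mean: n stays fixed, B resamples with replacement.
set.seed(123)
x_obs <- c(4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4)  # made-up sample
n <- length(x_obs)
B <- 10000

boot_means <- replicate(B, mean(sample(x_obs, size = n, replace = TRUE)))

sd(boot_means)                          # bootstrap SE of the sample mean
quantile(boot_means, c(0.025, 0.975))   # percentile 95% confidence interval
```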
For a ModernDive into bootstrapping and confidence intervals, try the code in Chapter 8, Bootstrapping and Confidence Intervals.
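For the infer workflow, a sketch of a bootstrap percentile confidence interval for a mean might look like the following (the mtcars data, the mpg variable, and the number of reps are illustrative assumptions, not from the book):

```r
library(dplyr)
library(infer)

# Bootstrap distribution of the sample mean.
boot_dist <- mtcars %>%
  specify(response = mpg) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# 95% percentile confidence interval and a plot of the bootstrap distribution.
get_confidence_interval(boot_dist, level = 0.95, type = "percentile")
visualize(boot_dist)
```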
Outliers should never be dropped unless there is a clear rationale. If outliers are dropped, this should be clearly reported.
Statistical models are used to explain the variation in a response variable using one or more explanatory variables.
Linear regression is commonly used to build such models. These models are fit using the least squares algorithm, which, under the standard (Gauss-Markov) assumptions, yields unbiased estimators with minimum variance among linear unbiased estimators.
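A minimal sketch of fitting a linear regression by least squares in R (the mtcars data and its variables are used only as an illustration):

```r
# Fit a simple linear regression of fuel efficiency (mpg) on weight (wt)
# by least squares, then inspect the fit.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)      # least-squares estimates of the intercept and slope
summary(fit)   # standard errors, t-tests, and R-squared
```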
What does the correlation coefficient measure? Answer: the strength and direction of the linear relationship between two numerical variables.
Recall “Correlation does not imply causation.”
The gold standard for establishing causation is a controlled experiment. The authors describe the idea of A/B testing.
Most data collected today is observational, so no designed experiment has been used.
Recall Simpson’s Paradox.
In practice, many p-values are computed, each tested at \(\alpha = 0.05\).
This leads to a much higher overall error rate.
When making multiple comparisons, the overall (family-wise) error rate should be addressed.
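One simple way to address this in R is to adjust the p-values for multiple comparisons; a small sketch with made-up p-values:

```r
# Adjust a set of p-values to control the overall (family-wise) error rate.
pvals <- c(0.001, 0.012, 0.030, 0.045, 0.200)   # made-up p-values
p.adjust(pvals, method = "bonferroni")          # Bonferroni correction
p.adjust(pvals, method = "holm")                # Holm's step-down method
```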
Be sure to read Appendix E at the end of the book.
It includes a very nice summary of fitting Multiple Linear Regression.
A confounding variable is a third variable that is related to both the explanatory variable and the response variable, and it can distort the apparent relationship between them.
### Synthetic data

# Consider book price (y) by number of pages (x) and type of book (z)
z  <- c("hardcover", "hardcover", "hardcover", "hardcover",
        "paperback", "paperback", "paperback", "paperback")
x1 <- c(150, 225, 342, 185)           # hardcover: number of pages
y1 <- c(27.43, 48.76, 50.25, 32.01)   # hardcover: price
x2 <- c(475, 834, 1020, 790)          # paperback: number of pages
y2 <- c(10.00, 15.73, 20.00, 17.89)   # paperback: price
x  <- c(x1, x2)                       # pages for all eight books
y  <- c(y1, y2)                       # prices for all eight books
Summary: Simpson's Paradox is a reversal in the direction of a relationship when another variable is introduced.
The relationship between price and number of pages in a book changes direction with the introduction of the variable type of book (hardcover, paperback).
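A short sketch using the synthetic data above shows the reversal: regressing price on pages alone gives a negative slope, while including type of book in the model (or fitting separate regressions within each type) gives a positive slope for pages.

```r
# Ignoring book type: one regression of price on pages for all eight books.
fit_all <- lm(y ~ x)
coef(fit_all)                  # the slope of x (pages) is negative

# Accounting for book type: put the data in a data frame and add z.
books <- data.frame(price = y, pages = x, type = z)
fit_type <- lm(price ~ pages + type, data = books)
coef(fit_type)                 # the slope of pages is now positive

# Separate regressions within each type of book.
coef(lm(price ~ pages, data = subset(books, type == "hardcover")))
coef(lm(price ~ pages, data = subset(books, type == "paperback")))
```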
See the R Markdown document SimpsonsParadox available on RPubs.com/esuess.