---
title: "Stat. 316: Linear Regression"
author: "Prof. Eric A. Suess"
format:
  html:
    self-contained: true
---

## Relationships between two quantitative variables

Here we have measurements on two quantitative variables for the same group of individuals.

**Response variable:** A variable that measures the outcome of a study.

**Explanatory variable:** A variable that explains changes in the response variable.

**Scatterplot:** A plot that shows the relationship between two quantitative variables measured on the same individuals.

## Scatterplot

For 6 students we have time spent studying and test score. Is there a relationship?

| time studying (x) | test score (y) |
|-------------------|----------------|
| 1                 | 3              |
| 4                 | 6              |
| 6                 | 7              |
| 10                | 9              |
| 8                 | 6              |
| 5                 | 5              |

Make a scatterplot of the data. Do the data look linear?

```{r}
# Enter the data: x = time studying, y = test score
time <- c(1, 4, 6, 10, 8, 5)
score <- c(3, 6, 7, 9, 6, 5)

# Scatterplot with the explanatory variable on the horizontal axis
plot(time, score,
     xlab = "Time Studying",
     ylab = "Test Score",
     main = "Scatterplot of Test Scores vs. Time Studying")
```

## Correlation

The correlation coefficient, $r$, measures the strength and direction of a linear relationship between two quantitative variables.

**Remark:** The correlation does not distinguish between explanatory and response variables, and it is not affected by changes in the unit of measurement of either or both variables.

$$
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
$$

where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$.

## Correlation

For the test score and time studying data, the correlation is

```{r}
cor(time, score)
```

## Least Squares Regression

The least squares regression line is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. In other words, it is the "best fitting line."

The equation of the least squares regression line is

$$
\hat{y} = b_0 + b_1 x
$$

where $b_0$ is the y-intercept and $b_1$ is the slope.

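
To illustrate what "as small as possible" means, the sketch below compares the sum of squared vertical distances for the least squares line with that of two other lines. The helper `sse()` and the two comparison lines are hypothetical choices made only for this illustration; they are not part of the original notes.

```{r}
# Data from the study-time example
time  <- c(1, 4, 6, 10, 8, 5)
score <- c(3, 6, 7, 9, 6, 5)

# Hypothetical helper: sum of squared vertical distances between
# the data points and the line y = b0 + b1 * x
sse <- function(b0, b1) sum((score - (b0 + b1 * time))^2)

# Intercept and slope of the least squares line, as computed by lm()
fit <- coef(lm(score ~ time))
sse_fit <- sse(fit[["(Intercept)"]], fit[["time"]])

# Two arbitrary comparison lines, chosen only for illustration
sse_a <- sse(2, 0.7)
sse_b <- sse(3, 0.5)

# The least squares line should have the smallest value
c(least_squares = sse_fit, line_a = sse_a, line_b = sse_b)
```

Any other choice of intercept and slope passed to `sse()` should also give a value at least as large as the one for the fitted line.
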
## Least Squares Regression

The slope of the least squares regression line is

$$
b_1 = r \frac{s_y}{s_x}
$$

where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$.

The y-intercept of the least squares regression line is

$$
b_0 = \bar{y} - b_1 \bar{x}
$$

## Least Squares Regression

For the test score and time studying data, the least squares regression line is

```{r}
# Fit the regression of score (response) on time (explanatory)
lm(score ~ time)
```

## Least Squares Regression

Rounded to two decimal places, the least squares regression line is

$$
\hat{y} = 2.78 + 0.57 x
$$

## Least Squares Regression

Plot the least squares regression line on the scatterplot.

```{r}
plot(time, score,
     xlab = "Time Studying",
     ylab = "Test Score",
     main = "Scatterplot of Test Scores vs. Time Studying")

# Add the fitted least squares line to the existing plot
abline(lm(score ~ time), col = "red")
```
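
As a check on the formulas for $r$, $b_1$, and $b_0$ given earlier, the following sketch computes each quantity directly from its definition and compares the results with `cor()` and `lm()`. The variable names `dx`, `dy`, `r_hand`, `b1`, and `b0` are introduced here only for illustration.

```{r}
# Data from the study-time example
time  <- c(1, 4, 6, 10, 8, 5)
score <- c(3, 6, 7, 9, 6, 5)

# Correlation from its definition: deviations from the sample means
dx <- time - mean(time)
dy <- score - mean(score)
r_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Slope and intercept from b1 = r * s_y / s_x and b0 = ybar - b1 * xbar
b1 <- r_hand * sd(score) / sd(time)
b0 <- mean(score) - b1 * mean(time)

# Hand-computed values
c(r = r_hand, b0 = b0, b1 = b1)

# Both comparisons should return TRUE
all.equal(r_hand, cor(time, score))
all.equal(c(b0, b1), unname(coef(lm(score ~ time))))
```

If both comparisons return `TRUE`, the hand calculations agree with the built-in functions up to floating-point tolerance.
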