Stat. 316: Linear Regression

Author

Prof. Eric A. Suess

Relationships between two quantitative variables

Here we have measurements on two quantitative variables for the same group of individuals.

Response variable: A variable that measures the outcome of a study.

Explanatory variable: A variable that explains changes in the response variable.

Scatterplot: A plot that shows the relationship between two quantitative variables measured on the same individuals.

Scatterplot

For 6 students we have the time spent studying (x) and the test score (y). Is there a relationship between the two?

time studying (x)   test score (y)
1                   3
4                   6
6                   7
10                  9
8                   6
5                   5

Make a scatterplot of the data. Do the data look linear?

time <- c(1, 4, 6, 10, 8, 5)
score <- c(3, 6, 7, 9, 6, 5)
plot(time, score, xlab = "Time Studying", ylab = "Test Score", main = "Scatterplot of Test Scores vs. Time Studying")

Correlation

The correlation coefficient, \(r\), measures the strength and direction of a linear relationship between two quantitative variables.

Remark: The correlation does not distinguish between explanatory and response variables, and it is not affected by changes in the unit of measurement of either or both variables.

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \]

where \(\bar{x}\) and \(\bar{y}\) are the sample means of \(x\) and \(y\).
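The formula can be checked numerically on the study-time data. The following is a minimal sketch that computes \(r\) term by term; the names `dx`, `dy`, and `r_manual` are introduced here for illustration only.

```r
# Study-time data from the scatterplot example
time  <- c(1, 4, 6, 10, 8, 5)
score <- c(3, 6, 7, 9, 6, 5)

# Deviations of each observation from its sample mean
dx <- time - mean(time)
dy <- score - mean(score)

# Correlation computed directly from the definition
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual

# Compare with R's built-in correlation function
all.equal(r_manual, cor(time, score))
```

Because the deviations are taken from the means, the numerator is positive when large values of \(x\) tend to occur with large values of \(y\).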

For the test score and time studying data, the correlation is

cor(time, score)
[1] 0.8914004
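The two properties of \(r\) noted in the remark can be illustrated with a short sketch. The unit conversions are assumptions made only for this example: the notes do not state the units, so here time is treated as hours and scores as points out of a hypothetical 10.

```r
time  <- c(1, 4, 6, 10, 8, 5)   # assumed to be hours, for illustration
score <- c(3, 6, 7, 9, 6, 5)    # assumed to be points out of 10

r <- cor(time, score)

# Swapping the roles of x and y leaves r unchanged
all.equal(r, cor(score, time))

# Changing units (hours -> minutes, points -> percent) leaves r unchanged
time_min  <- time * 60
score_pct <- score * 10
all.equal(r, cor(time_min, score_pct))
```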

Least Squares Regression

The least squares regression line is the line that makes the sum of the squared vertical distances from the data points to the line as small as possible. In other words, it is the “best-fitting line.”

The equation of the least squares regression line is

\[ \hat{y} = b_0 + b_1 x \]

where \(b_0\) is the y-intercept and \(b_1\) is the slope.

The slope of the least squares regression line is

\[ b_1 = r \frac{s_y}{s_x} \]

where \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\).

The y-intercept of the least squares regression line is

\[ b_0 = \bar{y} - b_1 \bar{x} \]
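As a quick check, the slope and intercept formulas can be applied to the study-time data by hand. This is a sketch using only base R functions (`cor`, `sd`, `mean`); the variable names `b0` and `b1` mirror the symbols in the formulas.

```r
time  <- c(1, 4, 6, 10, 8, 5)
score <- c(3, 6, 7, 9, 6, 5)

r  <- cor(time, score)

# Slope: correlation times the ratio of standard deviations
b1 <- r * sd(score) / sd(time)

# Intercept: forces the line through the point (x-bar, y-bar)
b0 <- mean(score) - b1 * mean(time)

c(intercept = b0, slope = b1)
```

These values should agree with the coefficients that `lm(score ~ time)` reports.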

For the test score and time studying data, the least squares regression line is

lm(score ~ time)

Call:
lm(formula = score ~ time)

Coefficients:
(Intercept)         time  
     2.7838       0.5676  
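The least squares criterion itself can be checked against this fit. The sketch below compares the sum of squared vertical distances for the fitted line with two other lines; the helper `sse` and the two alternative lines are arbitrary illustrative choices, not part of the original notes.

```r
time  <- c(1, 4, 6, 10, 8, 5)
score <- c(3, 6, 7, 9, 6, 5)

# Sum of squared vertical distances from the points to the line b0 + b1 * x
sse <- function(b0, b1) sum((score - (b0 + b1 * time))^2)

fit <- lm(score ~ time)
b   <- unname(coef(fit))   # b[1] is the intercept, b[2] is the slope

sse_fit  <- sse(b[1], b[2])
sse_alt1 <- sse(b[1] + 0.5, b[2])   # same slope, line shifted upward
sse_alt2 <- sse(b[1], b[2] + 0.1)   # same intercept, steeper line

c(fitted = sse_fit, shifted = sse_alt1, steeper = sse_alt2)
```

Any line other than the fitted one gives a larger sum of squares, which is what "best fitting" means here.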

Using the coefficients from the R output, the fitted least squares regression line is

\[ \hat{y} = 2.7838 + 0.5676 x \]
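The coefficients can also be read from the fitted model object rather than retyped. The sketch below extracts them with `coef()` and evaluates the line at a hypothetical study time of 7, a value chosen only for illustration; `x_new` and `y_hat` are names introduced here.

```r
time  <- c(1, 4, 6, 10, 8, 5)
score <- c(3, 6, 7, 9, 6, 5)

fit <- lm(score ~ time)
b   <- coef(fit)
round(b, 4)

# Predicted test score at a hypothetical study time of 7
x_new <- 7
y_hat <- unname(b["(Intercept)"] + b["time"] * x_new)
y_hat

# The same value from R's predict() function
all.equal(y_hat, unname(predict(fit, newdata = data.frame(time = x_new))))
```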

Add the least squares regression line to the scatterplot.

plot(time, score, xlab = "Time Studying", ylab = "Test Score", main = "Scatterplot of Test Scores vs. Time Studying")
abline(lm(score ~ time), col = "red")
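The vertical distances that least squares minimizes can be drawn on the same plot. This is one possible sketch using base graphics; the dashed line style and the plot title are arbitrary choices.

```r
time  <- c(1, 4, 6, 10, 8, 5)
score <- c(3, 6, 7, 9, 6, 5)
fit   <- lm(score ~ time)

plot(time, score, xlab = "Time Studying", ylab = "Test Score",
     main = "Vertical Distances from the Fitted Line")
abline(fit, col = "red")

# Dashed segment from each observed point to its fitted value on the line
segments(x0 = time, y0 = score, x1 = time, y1 = fitted(fit), lty = 2)
```

Each dashed segment is one residual, the observed score minus the fitted score.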