--- title: "CART" author: "Prof. Eric A. Suess" date: "Feburary 8, 2021" output: beamer_presentation: default ioslides_presentation: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ## Introduction Today we will discuss more details about **Regression Trees**. This is the **RT** part of CART. Before discussing Regression Trees, we will discuss a little more about Linear Regression. ## Prediction or Forecasting The author uses the word forecasting and prediction interchangeably. As a Statistician, I have always tried to be specific about which word to use. Prediction is what is done when using Regression, or other similar methods, to predict a mean value or future observation within the range of the data. Forecasting is what is done with time series data when future observations are forecasted beyond the range of the data. ## Prediction or Forecasting See wikipedia for further discussion. - [Prediction](https://en.wikipedia.org/wiki/Prediction#Statistics) - [Forecasting](https://en.wikipedia.org/wiki/Forecasting) I would have titled the chapter Predicting Numeric Values - Regression Methods. This is not that important in the big picture. ## Linear Regression The **simple linear regession model**: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ where $\epsilon_i \sim N(0,\sigma_{\epsilon}^2)$ Because of the distributional assumption on the **error terms** we can do Statistics. So we can **test the statistical significance** of the slope $\beta_1$ and **compute a confidence interval** for the slope $\beta_1$. ## Linear Regression **Parameter estimates** are usually represented by the parameter with a **hat**. So the estimate of the slope in the simple linear regression mode would be $\hat{\beta}_1$ The author introduces the **a** and **b** as the estimates. ## Linear Regression The estimates are produced by minimizing the **Sum of Squares Error** for $\beta_1$ and $\beta_0$. $SSE = \sum (y_i - \hat{y}_i)^2 = \sum \hat{\epsilon}_i^2$ where $\hat{y}_i = \hat{\beta_0} + \hat{\beta}_1 x_i$ ## Linear Regression The estimates are $\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x}$ where $s_y^2 = \frac{\sum(y_i - \bar{y})^2}{n-1}$ and $s_y = \sqrt{s_y^2}$ ## Linear Regression and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\cdot \bar{x}$ So the fitted model would be $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ We could use this equation to draw a line on a scatterplot. ## Final Questions **Question**: What is the relationship between the correlation coefficient and the estimated slope coefficient? **Question**: Can you define clearly what the correlation coefficient $r$ measures? ## Multiple Linear Regression Multiple Linear Regression includes more predictor variables. Matrix notation is usually used to write down the model. $Y = X \cdot \beta + \epsilon$ where - $Y$ is an $n \times 1$ vector of the $y_i$ values - $X$ is the $n \times (p+1)$ design matrix - $\beta$ is an $(p+1) \times 1$ vector of the $\beta_i$ values - $\epsilon$ is an $n \times 1$ vector of the $\epsilon_i$ values ## Multiple Linear Regression The estimates are $\hat{\beta} = (X^T \cdot X)^{-1} X^T Y$ - $^{-1}$ is the inverse - $^T$ is the transpose ## Regression Trees using CART Trees for Numeric Prediction. 
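## Linear Regression in R

As a quick illustration (not from the textbook), the estimates above can be checked in R. This is a minimal sketch using simulated data; the variable names and true coefficients below are made up for the example.

```{r lm-example, echo=TRUE}
# simulate a small data set from a known line
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 2)

# fit the simple linear regression model
fit <- lm(y ~ x)
coef(fit)

# the slope also equals r * s_y / s_x
cor(x, y) * sd(y) / sd(x)

# the matrix formula (X^T X)^{-1} X^T Y gives both estimates
X <- cbind(1, x)
solve(t(X) %*% X) %*% t(X) %*% y
```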
## Regression Trees using CART

Trees for Numeric Prediction.

**Strengths**:

- trees for numeric data
- automatic feature selection
- no model needs to be specified in advance
- may work better than traditional regression
- does not require knowledge of statistics to interpret the results

## Regression Trees using CART

**Weaknesses**:

- not as commonly used
- requires a large amount of training data
- difficult to interpret the effect of the individual predictors/features
- may not be as easy to interpret as a traditional regression model

## Regression Trees using CART

Partitioning is done using a **divide-and-conquer** strategy according to the feature that will result in the greatest increase in homogeneity in the outcome after a split is performed.

So the measurement is on the response variable/target variable $Y$.

## Common Splitting Criteria

**Standard deviation reduction (SDR)**

$SDR = sd(T) - \sum \frac{|T_i|}{|T|} \times sd(T_i)$

where $sd(T)$ is the standard deviation of the $Y$ values that are in the set $T$, and $|T|$ stands for the number of observations in the set $T$.

## Today

Today we will look at the wine data example that tries to create a system to mimic expert ratings of wine.

[Wine Spectator's 100-point Scale](http://www.winespectator.com/display/show/id/scoring-scale)

- White Wine
- Red Wine

## Measuring performance with MAE

Since we are not performing Classification with Regression Trees in this example, we cannot use a Confusion Matrix.

We will look at the **correlation** between the *test values* and the *predicted values*.

Another way to measure the error is to use the **Mean Absolute Error** (MAE)

$MAE = \frac{1}{n} \sum |\epsilon_i|$

or the **Mean Squared Error** (MSE)

$MSE = \frac{1}{n} \sum \epsilon^2_i$
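## Computing SDR

As a small illustration (the toy values below are made up, not from the wine data), the SDR for one candidate binary split can be computed directly from the formula above.

```{r sdr-example, echo=TRUE}
# toy response values and a candidate split into two child nodes
y <- c(1, 1, 2, 2, 3, 4, 5, 5)
left  <- y[1:4]   # observations sent to the left child
right <- y[5:8]   # observations sent to the right child

# SDR = sd(T) - sum(|T_i|/|T| * sd(T_i))
sd(y) - (length(left) / length(y)) * sd(left) -
        (length(right) / length(y)) * sd(right)
```

## Regression Tree and MAE in R

A sketch of fitting a regression tree and computing the correlation and MAE on a test set. It uses simulated data so it runs on its own; the in-class wine example would use the wine data instead. It assumes the `rpart` package is installed.

```{r tree-mae-example, echo=TRUE}
library(rpart)

# simulate data with a step in x1 so a tree has something to find
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 5 + 3 * (d$x1 > 0.5) + 2 * d$x2 + rnorm(n, sd = 0.5)

train <- d[1:150, ]
test  <- d[151:200, ]

m <- rpart(y ~ x1 + x2, data = train)
p <- predict(m, test)

cor(p, test$y)           # correlation of predicted vs. test values
mean(abs(test$y - p))    # MAE
```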