--- title: "CART" author: "Prof. Eric A. Suess" date: "Feburary 8, 2021" output: beamer_presentation: default ioslides_presentation: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ## Introduction Today we will discuss more details about **Regression Trees**. This is the **RT** part of CART. Before discussing Regression Trees, we will discuss a little more about Linear Regression. ## Prediction or Forecasting The author uses the word forecasting and prediction interchangeably. As a Statistician, I have always tried to be specific about which word to use. Prediction is what is done when using Regression, or other similar methods, to predict a mean value or future observation within the range of the data. Forecasting is what is done with time series data when future observations are forecasted beyond the range of the data. ## Prediction or Forecasting See wikipedia for further discussion. - [Prediction](https://en.wikipedia.org/wiki/Prediction#Statistics) - [Forecasting](https://en.wikipedia.org/wiki/Forecasting) I would have titled the chapter Predicting Numeric Values - Regression Methods. This is not that important in the big picture. ## Linear Regression The **simple linear regession model**: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ where $\epsilon_i \sim N(0,\sigma_{\epsilon}^2)$ Because of the distributional assumption on the **error terms** we can do Statistics. So we can **test the statistical significance** of the slope $\beta_1$ and **compute a confidence interval** for the slope $\beta_1$. ## Linear Regression **Parameter estimates** are usually represented by the parameter with a **hat**. So the estimate of the slope in the simple linear regression mode would be $\hat{\beta}_1$ The author introduces the **a** and **b** as the estimates. ## Linear Regression The estimates are produced by minimizing the **Sum of Squares Error** for $\beta_1$ and $\beta_0$. $SSE = \sum (y_i - \hat{y}_i)^2 = \sum \hat{\epsilon}_i^2$ where $\hat{y}_i = \hat{\beta_0} + \hat{\beta}_1 x_i$ ## Linear Regression The estimates are $\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x}$ where $s_y^2 = \frac{\sum(y_i - \bar{y})^2}{n-1}$ and $s_y = \sqrt{s_y^2}$ ## Linear Regression and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\cdot \bar{x}$ So the fitted model would be $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ We could use this equation to draw a line on a scatterplot. ## Final Questions **Question**: What is the relationship between the correlation coefficient and the estimated slope coefficient? **Question**: Can you define clearly what the correlation coefficient $r$ measures? ## Multiple Linear Regression Multiple Linear Regression includes more predictor variables. Matrix notation is usually used to write down the model. $Y = X \cdot \beta + \epsilon$ where - $Y$ is an $n \times 1$ vector of the $y_i$ values - $X$ is the $n \times (p+1)$ design matrix - $\beta$ is an $(p+1) \times 1$ vector of the $\beta_i$ values - $\epsilon$ is an $n \times 1$ vector of the $\epsilon_i$ values ## Multiple Linear Regression The estimates are $\hat{\beta} = (X^T \cdot X)^{-1} X^T Y$ - $^{-1}$ is the inverse - $^T$ is the transpose ## Regression Trees using CART Trees for Numeric Prediction. 
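## Linear Regression in R

As a quick illustration (not from the textbook), the estimates above can be checked in R. This is a minimal sketch using simulated data; the variable names and true coefficients below are made up for the example.

```{r lm-example, echo=TRUE}
# simulate a small data set from a known line
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 2)

# fit the simple linear regression model
fit <- lm(y ~ x)
coef(fit)

# the slope also equals r * s_y / s_x
cor(x, y) * sd(y) / sd(x)

# the matrix formula (X^T X)^{-1} X^T Y gives both estimates
X <- cbind(1, x)
solve(t(X) %*% X) %*% t(X) %*% y
```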
## Regression Trees using CART

Trees for Numeric Prediction.

**Strengths**:

- trees for numeric data
- automatic feature selection
- no model needs to be specified in advance
- may work better than traditional regression
- does not require knowledge of statistics to interpret the results

## Regression Trees using CART

**Weaknesses**:

- not as commonly used
- requires a large amount of training data
- difficult to interpret the effect of the individual predictors/features
- may not be as easy to interpret as a traditional regression model

## Regression Trees using CART

Partitioning is done using a **divide-and-conquer** strategy according to the feature that will result in the greatest increase in homogeneity in the outcome after a split is performed.

So the measurement is on the response variable/target variable $Y$.

## Common Splitting Criteria

**Standard deviation reduction (SDR)**

$SDR = sd(T) - \sum \frac{|T_i|}{|T|} \times sd(T_i)$

where $sd(T)$ is the standard deviation of the $Y$ values that are in the set $T$, and $|T|$ stands for the number of observations in the set $T$.

## Today

Today we will look at the wine data example that tries to create a system to mimic expert ratings of wine.

[Wine Spectator's 100-point Scale](http://www.winespectator.com/display/show/id/scoring-scale)

- White Wine
- Red Wine

## Measuring performance with MAE

Since we are not performing Classification with Regression Trees in this example, we cannot use a Confusion Matrix.

We will look at the **correlation** between the *test values* and the *predicted values*.

Another way to measure the error is to use the **Mean Absolute Error** (MAE)

$MAE = \frac{1}{n} \sum |\epsilon_i|$

or the **Mean Squared Error** (MSE)

$MSE = \frac{1}{n} \sum \epsilon^2_i$
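## Computing SDR

As a small illustration (the toy values below are made up, not from the wine data), the SDR for one candidate binary split can be computed directly from the formula above.

```{r sdr-example, echo=TRUE}
# toy response values and a candidate split into two child nodes
y <- c(1, 1, 2, 2, 3, 4, 5, 5)
left  <- y[1:4]   # observations sent to the left child
right <- y[5:8]   # observations sent to the right child

# SDR = sd(T) - sum(|T_i|/|T| * sd(T_i))
sd(y) - (length(left) / length(y)) * sd(left) -
        (length(right) / length(y)) * sd(right)
```

## Regression Tree and MAE in R

A sketch of fitting a regression tree and computing the correlation and MAE on a test set. It uses simulated data so it runs on its own; the in-class wine example would use the wine data instead. It assumes the `rpart` package is installed.

```{r tree-mae-example, echo=TRUE}
library(rpart)

# simulate data with a step in x1 so a tree has something to find
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 5 + 3 * (d$x1 > 0.5) + 2 * d$x2 + rnorm(n, sd = 0.5)

train <- d[1:150, ]
test  <- d[151:200, ]

m <- rpart(y ~ x1 + x2, data = train)
p <- predict(m, test)

cor(p, test$y)           # correlation of predicted vs. test values
mean(abs(test$y - p))    # MAE
```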