2024-01-29
Today we will briefly discuss Regression methods and the use of Regression for Classification.
Having taken a Regression class, what do you already know about Regression?
The main idea with Regression is to model the relationship between a numeric dependent variable and one or more independent variables, in order to make numeric predictions.
The main idea with Logistic Regression is to model the relationship between a binary (0-1) dependent variable and one or more independent variables, in order to make classifications.
Read over the first half of Chapter 6; this is review.
We will try the predicting medical expenses example.
We will try the predicting insurance policy churn example.
In R, the lm() function is used to fit linear regression models, and it knows about dummy variables. No extra work is needed to include categorical variables in a regression model. When a categorical variable is stored as a factor in R, the lm() function creates the dummy variables automatically.
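A minimal sketch of the dummy variable handling described above. The data are simulated, and the column names (age, smoker, region, expenses) are assumptions made for illustration, not the actual medical expenses file.

```r
# Simulated data; column names and coefficients are illustrative assumptions.
set.seed(1)
n <- 200
d <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region = factor(sample(c("northeast", "northwest", "southeast"),
                         n, replace = TRUE))
)
d$expenses <- 2000 + 250 * d$age + 20000 * (d$smoker == "yes") +
  rnorm(n, sd = 3000)

# The factors go into the formula as-is; lm() builds the dummy columns itself.
fit <- lm(expenses ~ age + smoker + region, data = d)
coef_names <- names(coef(fit))
print(coef_names)

# model.matrix() shows the 0-1 dummy columns that lm() used internally.
design <- model.matrix(fit)
print(head(design))
```

Each factor with k levels contributes k - 1 dummy columns; the first level (alphabetically, by default) is the reference category and is absorbed into the intercept.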
In R, the glm() function is used to fit logistic regression models by setting family = binomial.
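A minimal sketch of a logistic regression fit with glm(). The data are simulated, the column names (tenure, plan, churn) are hypothetical, and the 0.5 cutoff is one common choice rather than a requirement.

```r
# Simulated churn data; names and coefficients are illustrative assumptions.
set.seed(2)
n <- 500
d <- data.frame(
  tenure = runif(n, 0, 10),
  plan   = factor(sample(c("basic", "premium"), n, replace = TRUE))
)
# The true churn probability follows a logistic curve in this simulation.
p <- plogis(1 - 0.5 * d$tenure + 0.8 * (d$plan == "basic"))
d$churn <- rbinom(n, size = 1, prob = p)

# family = binomial is what makes glm() fit a logistic regression.
fit <- glm(churn ~ tenure + plan, data = d, family = binomial)

# type = "response" returns predicted probabilities rather than log-odds.
prob <- predict(fit, newdata = d, type = "response")

# Turn probabilities into 0-1 classifications with a 0.5 cutoff.
pred_class <- ifelse(prob > 0.5, 1, 0)
print(table(predicted = pred_class, actual = d$churn))
```

The model itself produces probabilities; the classification comes from the cutoff applied afterwards, which can be moved if one kind of error is more costly than the other.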
Later we will discuss the glmnet() function for fitting lasso regression models.
In the preceding Chapter, Trees were used for Classification.
Later in this Chapter, Trees are used for Numeric Prediction.
One type of tree for prediction is CART, Classification and Regression Trees.
This is a bit of a misnomer: Linear Regression methods are not used. Predictions are made based on the average value of the examples that reach a leaf.
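The leaf-average behavior can be checked directly with rpart. This is a sketch on simulated data (the variables x1, x2, and y are made up); it compares each prediction with the mean outcome of the training examples that fall in the same leaf.

```r
library(rpart)

# Simulated data: the outcome jumps from about 10 to about 20 at x1 = 0.5.
set.seed(3)
n <- 300
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- ifelse(d$x1 < 0.5, 10, 20) + rnorm(n)

# method = "anova" requests a regression tree (numeric outcome).
tree <- rpart(y ~ x1 + x2, data = d, method = "anova")

# tree$where records which leaf each training example ended up in.
leaf <- tree$where
leaf_means <- tapply(d$y, leaf, mean)

# The prediction for each example should equal its leaf's average outcome.
pred <- predict(tree, newdata = d)
expected <- as.numeric(leaf_means[as.character(leaf)])
max_diff <- max(abs(pred - expected))
print(max_diff)
```

The largest difference should be zero up to floating-point rounding, which confirms that no regression equation is involved in the predictions.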
A second type of tree for prediction is known as Model Trees.
These were developed later and are less widely used, but may be more powerful.
At each leaf node, a multiple linear regression model is built from the examples reaching that node.
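To illustrate the difference between the two kinds of leaves, here is a hand-built sketch rather than a real model tree algorithm. It assumes a single fixed split at x < 5 (a real model tree would search for its splits), uses one predictor for brevity, and runs on simulated data.

```r
# Simulated data: a different straight line on each side of x = 5.
set.seed(4)
n <- 400
d <- data.frame(x = runif(n, 0, 10))
d$y <- ifelse(d$x < 5, 1 + 2 * d$x, 30 - 3 * d$x) + rnorm(n, sd = 0.5)

# Hypothetical fixed split; chosen by hand for this illustration.
left  <- d[d$x <  5, ]
right <- d[d$x >= 5, ]

# Model tree style: fit a linear regression to the examples in each leaf.
fit_left  <- lm(y ~ x, data = left)
fit_right <- lm(y ~ x, data = right)

predict_model_tree <- function(newdata) {
  ifelse(newdata$x < 5,
         predict(fit_left,  newdata = newdata),
         predict(fit_right, newdata = newdata))
}

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Regression tree style: predict the leaf average for every example.
pred_means <- ifelse(d$x < 5, mean(left$y), mean(right$y))
pred_model <- predict_model_tree(d)

rmse_means <- rmse(d$y, pred_means)
rmse_model <- rmse(d$y, pred_model)
print(c(leaf_means = rmse_means, leaf_models = rmse_model))
```

With the same split, the per-leaf regressions can follow the trend inside each leaf, while the leaf averages cannot. This is one sense in which Model Trees may be more powerful.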
Trees can make predictions and can be considered as an alternative to regression modeling.
The data are partitioned using a divide-and-conquer strategy according to the feature that will result in the greatest increase in homogeneity in the outcome after a split is performed.
For Classification Trees, entropy is used to measure homogeneity.
For Numeric Decision Trees, statistics such as the standard deviation are used.
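Both homogeneity measures can be sketched in a few lines. The helper names entropy() and sdr() are hypothetical, the data are made up, and the standard deviation reduction shown here (parent standard deviation minus the size-weighted standard deviations of the children) is one common way to score a numeric split, not necessarily what a given package computes internally.

```r
# Entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Standard deviation reduction for a candidate split of a numeric outcome.
sdr <- function(y, groups) {
  parts <- split(y, groups)
  weighted_sd <- sum(sapply(parts, function(v) length(v) / length(y) * sd(v)))
  sd(y) - weighted_sd
}

# A 50/50 node is as impure as a two-class node can be; a pure node is not.
h_mixed <- entropy(c("yes", "yes", "no", "no"))
h_pure  <- entropy(c("yes", "yes", "yes", "yes"))
print(c(mixed = h_mixed, pure = h_pure))

# Made-up numeric outcome, already sorted by the candidate splitting feature.
y <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7)
split_a <- ifelse(seq_along(y) <= 9, "left", "right")
split_b <- ifelse(seq_along(y) <= 7, "left", "right")

sdr_a <- sdr(y, split_a)
sdr_b <- sdr(y, split_b)
print(c(split_a = sdr_a, split_b = sdr_b))
```

The split with the larger reduction leaves more homogeneous groups behind, so a tree grown with this criterion would prefer it.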
Today we will fit a multiple linear regression model for the medical expenses data.
We will look at the application of Regression Trees to the wine rating data.
The rpart package will be used.