---
title: "Stat652: Quiz01"
author: "Prof. Eric A. Suess"
format:
  html:
    self-contained: true
---

**Instructions:** For Problem 1, you can complete the questions in an Excel spreadsheet or in the R Quarto notebook; submit either a .xlsx file or both the .qmd and .html files. For Problem 2, run the provided R Quarto notebook and answer the questions asked; submit both the .qmd and .html files. Also submit the report.html file generated by the DataExplorer package for the Auto data.

## Problem 1: kNN

Complete Problem 7, parts a, b, and c, from the Section 2.4 exercises in ISL.

Do parts a, b, and c without normalization or scaling.

Re-do parts a, b, and c using either normalization or scaling.

Do the results differ?

## ISLR Problem 7

# Compute the distances **without** normalization or scaling.

#### a)

#### b)

#### c)

# Compute the distances **with** scaling.

#### a)

#### b)

#### c)

# Answer Table:

| k   | raw   | scale | min_max |
|-----|-------|-------|---------|
| 1   |       |       |         |
| 3   |       |       |         |

## Problem 2: subset regression

Use the Auto dataset from the *ISLR2* package. The goal is to predict the miles per gallon (mpg) of a car based on the other variables in the dataset.

Use all possible subsets regression from the *olsrr* package and the *leaps* package to find the best subset of predictors for the mpg variable.

Use the adjusted R-squared and AIC to determine the best model(s).

Use the best model to predict the mpg of a car with the following characteristics: cylinders = 6, displacement = 200, horsepower = 100, weight = 3100, year = 1999, origin = 1.

No training and test sets are used in this example. It only illustrates the use of all possible subsets and best subset regression.

Note the use of the *pacman* package to load the necessary libraries. The function p_load() checks whether each package is installed, installs any that are missing, and then loads them all.

```{r}
library(pacman)
p_load(tidyverse, ISLR2, skimr, DataExplorer, olsrr, leaps)
```

## AutoEDA: Automatically explore the dataset.
Remove the name column from the Auto dataset because it is a unique identifier and not a predictor.

```{r}
Auto <- Auto |> select(-name)
```

Automatically generate a report of the dataset.

```{r}
skimr::skim(Auto)
```

```{r}
Auto |> DataExplorer::create_report()
```

## All possible subsets regression

To use the *olsrr* function *ols_step_all_possible()* the model must be created using the *lm()* function. The *ols_step_all_possible()* function will return a list of models with the adjusted R-squared and AIC for each model. The *plot()* function can be used to visualize the results.

```{r}
model <- lm(mpg ~ ., data = Auto)
summary(model)
```

### [olsrr](https://olsrr.rsquaredacademy.com/) package

```{r}
k <- ols_step_all_possible(model)
k
```

```{r}
# plot
plot(k)
```

```{r}
which.max(k$adjr)
which.min(k$aic)
```

Find the model with the highest adjusted R-squared and the lowest AIC.

```{r}
x <- which.max(k$adjr)
x

k |> filter(mindex == 120)
```

### Question 1:

**Question:** What is the best model for the Auto data based on the adjusted R-squared?

**Answer:** Type your answer here.

```{r}
k |>
  group_by(n) |>
  reframe('index' = mindex, max_adjr = max(adjr), min_aic = min(aic)) |>
  arrange(desc(max_adjr), min_aic) |>
  head(10)
```

Instead of running all regression models, the *ols_step_best_subset()* function can be used to find the best subset of predictors for the mpg variable. The *plot()* function can be used to visualize the results.

```{r}
model <- lm(mpg ~ ., data = Auto)

k <- ols_step_best_subset(model)
k

plot(k)
```

### leaps package

The leaps package uses a different approach to find the best subset of predictors for the mpg variable. The *regsubsets()* function is used to find the best subset of predictors for the mpg variable using the leaps algorithm.
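To make concrete what a search over all subsets involves, here is a minimal base-R sketch. It is not meant to mirror the internals of *olsrr* or *leaps*; it simply fits one `lm()` per subset and records adjusted R-squared and AIC. It assumes the built-in `mtcars` data with four predictors as a stand-in for Auto, and the names `dat`, `subsets`, and `results` are illustrative only.

```r
# Stand-in data (assumption): built-in mtcars, so no extra packages are needed.
dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]
predictors <- setdiff(names(dat), "mpg")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record the two selection criteria.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = dat)
  data.frame(
    n          = length(vars),
    predictors = paste(vars, collapse = " "),
    adjr       = summary(fit)$adj.r.squared,
    aic        = AIC(fit)
  )
}))

nrow(results)                        # 2^4 - 1 = 15 candidate models
results[which.max(results$adjr), ]   # best by adjusted R-squared
results[which.min(results$aic), ]    # best by AIC
```

With p predictors there are 2^p - 1 candidate models, which is why an exhaustive fit is only practical for a small number of predictors. The AIC values from base R's `AIC()` need not match those reported by other packages exactly, so compare rankings rather than raw numbers.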
```{r}
model2 <- lm(mpg ~ ., data = Auto)
summary(model2)
```

```{r}
Best_Subset <- regsubsets(mpg ~ .,
                          data = Auto,
                          nbest = 1,       # 1 best model for each number of predictors
                          nvmax = NULL,    # NULL for no limit on number of variables
                          force.in = NULL, force.out = NULL,
                          method = "exhaustive")

summary_best_subset <- summary(Best_Subset)
as.data.frame(summary_best_subset$outmat)
```

```{r}
which.max(summary_best_subset$adjr2)
```

```{r}
summary_best_subset$which[6,]
```

Run the regression model with the best predictors.

```{r}
best.model <- lm(mpg ~ cylinders + displacement + horsepower + weight + year + origin, data = Auto)
summary(best.model)
```

### Question 2:

**Question:** What variable(s) are not included in the best model? Are there any variables in the best model that you would drop from the model and why?

**Answer:** Type your answer here.

```{r}

```

**Note:** No training and test sets are used in this example. It only illustrates the use of all possible subsets and best subset regression.
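The prediction requested in Problem 2 could be sketched with `predict()` as follows. This is one possible approach, not a required solution. It refits the six-predictor model so the block runs on its own, and it assumes the *ISLR2* package is installed; the names `fit`, `new_car`, and `pred` are illustrative.

```r
library(ISLR2)  # provides the Auto data

# Refit the six-predictor model so this block is self-contained.
fit <- lm(mpg ~ cylinders + displacement + horsepower + weight + year + origin,
          data = Auto)

# The car described in Problem 2, as a one-row data frame.
new_car <- data.frame(cylinders = 6, displacement = 200, horsepower = 100,
                      weight = 3100, year = 1999, origin = 1)

# Point prediction together with a 95% prediction interval.
pred <- predict(fit, newdata = new_car, interval = "prediction")
pred
```

In the Auto data, `year` is stored as a two-digit model year (70 to 82), so `year = 1999` lies far outside the observed range; treat the resulting value as an extrapolation.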