---
title: "Stat652: Quiz01"
author: "Prof. Eric A. Suess"
format: 
  html:
    self-contained: true
---

**Instructions:** For Problem 1, complete the questions in either an Excel spreadsheet or the R Quarto notebook, and submit either a .xlsx file or both the .qmd and .html files. For Problem 2, run the provided R Quarto notebook, answer the questions asked, and submit both the .qmd and .html files. Also submit the report.html file generated by the DataExplorer package for the Auto data.

## Problem 1: kNN

Complete Problem 7, parts a, b, and c, from Section 2.4 (Exercises) of ISL.
First do parts a, b, and c without normalization or scaling, then redo them using either normalization or scaling. Do the results differ?
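The effect of scaling on nearest-neighbor distances can be sketched in base R. The example below uses a small, hypothetical training set (it is not the table from the ISL exercise) whose two predictors are on very different scales, and compares Euclidean distances on the raw, z-scored, and min-max scaled values. The names `train`, `query`, and `euclid()` are assumptions made for illustration.

```r
# Hypothetical data (NOT the ISL table): x2 is on a much larger scale than x1
train <- data.frame(
  x1 = c(1, 2.4, 4, 3),
  x2 = c(260, 300, 100, 400)
)
query <- c(x1 = 2.5, x2 = 250)

# Euclidean distance from each row of X to the point q
euclid <- function(X, q) sqrt(rowSums(sweep(as.matrix(X), 2, q)^2))

# 1) Raw distances: dominated by x2 because of its larger scale
d_raw <- euclid(train, query)

# 2) z-score scaling, using the training means and standard deviations
center <- colMeans(train)
spread <- apply(train, 2, sd)
d_z <- euclid(scale(train, center = center, scale = spread),
              (query - center) / spread)

# 3) min-max scaling, using the training minimums and maximums
lo <- apply(train, 2, min)
hi <- apply(train, 2, max)
train_mm <- sweep(sweep(as.matrix(train), 2, lo), 2, hi - lo, "/")
d_mm <- euclid(train_mm, (query - lo) / (hi - lo))

# Compare the three sets of distances side by side
round(cbind(raw = d_raw, scale = d_z, min_max = d_mm), 3)

# Index of the nearest neighbor (k = 1) under each version
c(raw = which.min(d_raw), scale = which.min(d_z), min_max = which.min(d_mm))
```

In this made-up example the nearest neighbor is point 1 on the raw values but point 2 after either kind of scaling, which is the sort of difference the question asks you to look for. Note that the query point is scaled with the training statistics, not its own.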

## ISLR Problem 7

# Compute the distances **without** normalization or scaling.

#### a)

#### b)

#### c)

# Compute the distances **with** scaling.

#### a)

#### b)

#### c)

# Answer Table:

| k   | raw   | scale | min_max |
|-----|-------|-------|---------|
| 1   |       |       |         |
| 3   |       |       |         |

## Problem 2: subset regression

Use the Auto dataset from the ISLR2 package. The goal is to predict the miles per gallon (mpg) of a car based on the other variables in the dataset. Use all possible subsets regression from the *olsrr* package and best subset regression from the *leaps* package to find the best subset of predictors for the mpg variable. Use the adjusted R-squared and AIC to determine the best model(s). Use the best model to predict the mpg of a car with the following characteristics: cylinders = 6, displacement = 200, horsepower = 100, weight = 3100, year = 1999, origin = 1. No training and test sets are used in this example; it only illustrates the use of all possible subsets and best subset regression.

Note the use of the *pacman* package to load the necessary libraries. The function p_load() checks whether each package is installed, installs any that are missing, and then loads them all.

```{r}
library(pacman)
p_load(tidyverse, ISLR2, skimr, DataExplorer, olsrr, leaps)
```
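For readers without *pacman*, the check, install, and load behavior described above can be approximated in base R. This is only a rough sketch of the idea, not pacman's actual implementation, and the helper name `load_or_install()` is hypothetical.

```r
# Hypothetical helper: approximates the "install if missing, then load" idea
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE when the package is not installed
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # character.only = TRUE lets library() accept the name as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demonstrated on packages that ship with R, so nothing is installed here
loaded <- load_or_install(c("stats", "utils"))
```

Calling `load_or_install(c("tidyverse", "ISLR2", "skimr", "DataExplorer", "olsrr", "leaps"))` would then play the same role as the `p_load()` call above, assuming a CRAN mirror is configured.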


## AutoEDA: Automatically explore the dataset.

Remove the name column from the Auto dataset because it identifies the vehicle and is not a predictor.

```{r}
Auto <- Auto |> select(-name)
```

Automatically generate a summary and a report of the dataset.

```{r}
skimr::skim(Auto)
```

```{r}
Auto |> DataExplorer::create_report()
```

## All possible subsets regression

To use the *olsrr* function *ols_step_all_possible()*, first fit the full model with *lm()*. The function then fits every subset of the predictors and reports fit statistics, including the adjusted R-squared and AIC, for each model. The *plot()* function can be used to visualize the results.

```{r}
model <- lm(mpg ~ ., data = Auto)
summary(model)
```
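To make clear what "all possible subsets" means, the search can be sketched by hand in base R: with 7 candidate predictors there are 2^7 - 1 = 127 non-empty subsets, and each one gets its own *lm()* fit. The object names (`auto_df`, `subsets`, `fit_one()`, `all_fits`) are assumptions for illustration, and AIC values from `stats::AIC()` are not guaranteed to match the numbers a package reports, since AIC definitions can differ by a constant.

```r
library(ISLR2)

# Work on a copy so the notebook's own objects are left untouched
auto_df <- subset(ISLR2::Auto, select = -name)
predictors <- setdiff(names(auto_df), "mpg")

# Enumerate every non-empty subset of the predictors
subsets <- unlist(
  lapply(seq_along(predictors),
         function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model and collect the two criteria used in this quiz
fit_one <- function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = auto_df)
  data.frame(
    n = length(vars),
    predictors = paste(vars, collapse = " + "),
    adjr = summary(fit)$adj.r.squared,
    aic = AIC(fit)
  )
}

all_fits <- do.call(rbind, lapply(subsets, fit_one))

nrow(all_fits)                           # number of models fitted
all_fits[which.max(all_fits$adjr), ]     # highest adjusted R-squared
all_fits[which.min(all_fits$aic), ]      # lowest AIC
```

The number of fits doubles with every additional predictor, which is why an exhaustive search is only practical for a modest number of variables.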

### [olsrr](https://olsrr.rsquaredacademy.com/) package

```{r}
k <- ols_step_all_possible(model)
k
```

```{r}
# plot
plot(k)
```

```{r}
which.max(k$adjr)

which.min(k$aic)
```

Find the model with the highest adjusted R-squared and the lowest AIC.

```{r}
x <- which.max(k$adjr)
x
k |> filter(mindex == x)  # use the computed index instead of a hard-coded value
```

### Question 1:

**Question:** What is the best model for the Auto data based on the adjusted R-squared?  

**Answer:** Type your answer here.

```{r}
k |> group_by(n) |> 
  reframe('index' = mindex, max_adjr = max(adjr), min_aic = min(aic)) |> 
  arrange(desc(max_adjr), min_aic) |> 
  head(10)
```

Instead of running all regression models, the *ols_step_best_subset()* function can be used to find the best subset of predictors for the mpg variable.  The *plot()* function can be used to visualize the results.

```{r}
model <- lm(mpg ~ ., data = Auto)

k <- ols_step_best_subset(model)
k

plot(k)
```

### leaps package

The *leaps* package takes a different approach to the same problem. Its *regsubsets()* function searches for the best subset of predictors of each size for the mpg variable, without fitting every model separately.

```{r}
model2 <- lm(mpg ~ ., data = Auto)
summary(model2)
```


```{r}
Best_Subset <- regsubsets(mpg ~ .,
               data = Auto,
               nbest = 1,      # 1 best model for each number of predictors
               nvmax = NULL,    # NULL for no limit on number of variables
               force.in = NULL, force.out = NULL,
               method = "exhaustive")

summary_best_subset <- summary(Best_Subset)

as.data.frame(summary_best_subset$outmat)
```

```{r}
which.max(summary_best_subset$adjr2)
```

```{r}
summary_best_subset$which[6,]
```
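Rather than reading the selected variables off the output and typing them by hand, the formula can be built from the selection matrix. The sketch below assumes all predictors are numeric (as they are in Auto once `name` is dropped), so the column names of `$which` match the variable names. The object names are hypothetical and are chosen so they do not overwrite anything in this notebook.

```r
library(ISLR2)
library(leaps)

auto_df <- subset(ISLR2::Auto, select = -name)

# Best model of each size, then the size with the highest adjusted R-squared
fit_subsets <- regsubsets(mpg ~ ., data = auto_df, nvmax = NULL)
fit_summary <- summary(fit_subsets)
best_size <- which.max(fit_summary$adjr2)

# Logical row: TRUE for every term included in the chosen model
selected <- fit_summary$which[best_size, ]
chosen <- names(selected)[selected & names(selected) != "(Intercept)"]

# Build mpg ~ <chosen predictors> and refit with lm()
best_formula <- reformulate(chosen, response = "mpg")
best_fit <- lm(best_formula, data = auto_df)
best_formula
```

If the data contained factor predictors, the column names in `$which` would be dummy-variable names and this shortcut would need adjusting.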

Run the regression model with the best predictors.

```{r}
best.model <- lm(mpg ~ cylinders + displacement + horsepower + weight + year + origin, data = Auto)

summary(best.model)
```
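The prediction requested in the problem statement can be obtained with *predict()* and a one-row data frame. The sketch below refits the six-predictor formula shown above on its own copy of the data; `six_var_fit` and `new_car` are hypothetical names. Keep in mind that `year` in Auto is recorded as a two-digit model year, so it is worth checking its range before interpreting a prediction at year = 1999.

```r
library(ISLR2)

auto_df <- subset(ISLR2::Auto, select = -name)

# Same formula as the model fitted above
six_var_fit <- lm(
  mpg ~ cylinders + displacement + horsepower + weight + year + origin,
  data = auto_df
)

# One-row data frame; column names must match the predictors in the formula
new_car <- data.frame(
  cylinders = 6, displacement = 200, horsepower = 100,
  weight = 3100, year = 1999, origin = 1
)

# Point prediction with a 95% prediction interval
pred <- predict(six_var_fit, newdata = new_car, interval = "prediction")
pred

# Observed range of year, to judge how far the new value extrapolates
range(auto_df$year)
```

A prediction interval is used here rather than a confidence interval because the question concerns a single new car, not the mean mpg of all such cars.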

### Question 2:

**Question:** Which variable(s) are not included in the best model? Are there any variables in the best model that you would drop, and why?

**Answer:** Type your answer here.

```{r}

```

**Note:** No training and test sets are used in this example. It only illustrates the use of all possible subsets and best subset regression.