--- title: "Palmer Penguins, visualization, ML Classification, kNN, Confussion Matrix, Accuracy" subtitle: "Welcome prospective Data Science Students!" author: - name: "Prof. Eric A. Suess" affiliation: "Department of Statistics and Biostatistics, CSU East Bay" date: "`r format(Sys.time(), '%d %B %Y')`" format: html: self-contained: true --- ## Welcome students! This is a quick tutorial on how to visualize the Palmer Penguins dataset, and then use kNN to classify the species of penguins. We will then use a confusion matrix to evaluate the accuracy of our model. ## QR Code for this page ```{r} #| label: qrcode library(qrcode) plot(qr_code("https://rpubs.com/esuess/kNN2")) ``` ## Load the R libraries we will be using. ```{r} #| message = FALSE library(palmerpenguins) library(DT) library(gt) library(naniar) # library(devtools) # devtools::install_github("cmartin/ggConvexHull") library(ggConvexHull) library(tidyverse) library(plotly) library(tidymodels) library(dials) library(tune) library(yardstick) ``` ## Load the data We drop the two *categorical* variables, `island` and `sex`. We will use the `species` variable as our response variable. ```{r} data(penguins) datatable(penguins) penguins <- penguins |> select(-c("island","sex")) datatable(penguins) ``` ## How many penguins are there? ```{r} penguins |> select(species) |> group_by(species) |> count() |> pivot_wider(names_from = species, values_from = n) |> gt() ``` ## How many missing values are there? ```{r} vis_miss(penguins) n_var_miss(penguins) gg_miss_var(penguins) ``` ```{r} #| message = FALSE library(skimr) skim(penguins) ``` ## Visualize the data ```{r} penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point() ``` Which species is this penguin? This is a? ![Gentoo](Brown_Bluff-2016-Tabarin_Peninsula%E2%80%93Gentoo_penguin_(Pygoscelis_papua)_03.jpg) [Wikipedia](https://en.wikipedia.org/wiki/Gentoo_penguin) ```{r} penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() ``` ```{r} penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() + facet_wrap(~species) ``` ```{r} peng_convex <- penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point() + geom_convexhull(alpha = 0.3, aes(fill = species)) peng_convex peng_convex |> ggplotly() ``` ## Split the data into training and testing sets When applying [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) we start by randomly splitting the data into a training set and a testing set. We will use the training set to build our model, and then use the testing set to evaluate the accuracy of our model. 
Here is the split with `initial_split()`, stratified by `species` so that the species proportions are preserved in both sets:

```{r}
set.seed(123)
penguin_split <- initial_split(penguins, prop = 0.8, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

datatable(penguin_train)
datatable(penguin_test)
```

## Define a recipe for data preprocessing

This recipe normalizes the numeric predictors and removes rows with missing values.

```{r}
penguin_recipe <- recipe(species ~ ., data = penguin_train) |>
  step_normalize(all_numeric_predictors()) |>
  step_naomit(all_predictors())
```

## Define a 5-fold cross-validation procedure

```{r}
# Define a 5-fold cross-validation procedure
set.seed(234)
penguin_folds <- vfold_cv(penguin_train, v = 5, strata = species)
```

## Build a kNN model for Classification

The $k$ nearest neighbor model [kNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) is a simple model that classifies a new observation by finding the $k$ closest observations in the training set and then taking a majority vote among those $k$ neighbors. The kNN model is non-parametric, meaning that it does not assume a particular distribution for the data. It is also a **lazy learner**, meaning that it *does not build a model*, but rather stores the training data and uses it directly to classify new observations. Because it is so simple, kNN is often used as a baseline model to compare against more complex models.

The kNN model measures distances. How do we measure distance in one dimension? We use the absolute value.

$$d(x,y) = |x-y|$$

How do we measure distance in two dimensions? We use the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).

$$d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}$$

How do we measure distance in $p$ dimensions?

$$d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_p-y_p)^2}$$

A short numeric sketch of this computation follows.
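To make the formula concrete, here is a small sketch in R. The two vectors are hypothetical bill length and bill depth measurements (in mm), chosen just for illustration.

```{r}
# Euclidean distance between two hypothetical penguins measured on
# bill length (mm) and bill depth (mm)
x <- c(39.1, 18.7)
y <- c(46.5, 17.9)
sqrt(sum((x - y)^2))

# dist() computes the same pairwise distance from a matrix of observations
dist(rbind(x, y))
```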
## The kNN model

We use the kNN model for classification with the training data. We will tune the model to find the best value of $k$.

```{r}
# Setting weight_func = "rectangular" gives unweighted (classic) kNN
penguin_knn <- nearest_neighbor(neighbors = tune(), weight_func = "rectangular") |>
  set_mode("classification") |>
  set_engine("kknn")
```

```{r}
# Create a parameter object for the number of neighbors
neighbor_param <- neighbors(range = c(4, 15))

# Generate a regular grid of 12 values from the parameter object
# You can use other grid_* functions to generate different types of grids
neighbor_grid <- grid_regular(neighbor_param, levels = 12)

# Print the grid
neighbor_grid
```

```{r}
#| message: false
# Tune the number of neighbors using accuracy as the metric
set.seed(345)
penguin_tune <- tune_grid(
  penguin_knn,
  penguin_recipe,
  resamples = penguin_folds,
  grid = neighbor_grid,
  metrics = metric_set(accuracy)
)
```

```{r}
# Find the best number of neighbors
best_k <- penguin_tune |>
  select_best(metric = "accuracy")
best_k
```

```{r}
# Finalize the model specification with the best k
final_knn <- finalize_model(penguin_knn, best_k)
final_knn
```

```{r}
# Fit the final model on the training data
final_fit <- workflow() |>
  add_model(final_knn) |>
  add_recipe(penguin_recipe) |>
  fit(data = penguin_train)
```

```{r}
# Predict on the testing data
final_pred <- final_fit |>
  predict(new_data = penguin_test) |>
  bind_cols(penguin_test)

datatable(final_pred)
```

```{r}
# Evaluate the prediction accuracy
final_acc <- final_pred |>
  metrics(truth = species, estimate = .pred_class) |>
  filter(.metric == "accuracy")
final_acc
```

```{r}
# Compute and print the confusion matrix
final_cm <- final_pred |>
  conf_mat(truth = species, estimate = .pred_class)

# Visualize the confusion matrix as a heatmap
autoplot(final_cm, type = "heatmap")
```

## Summary

1. We have loaded some data on penguins into R.
2. We have visualized the data.
3. We have cleaned the data.
4. We have split the data into training and testing sets.
5. We have normalized the data.
6. We have used cross-validation to tune the model.
7. We have built kNN models for classification and picked the best one.
8. We have evaluated the accuracy of the final model on the testing set.
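## Bonus: classify a new penguin

As a final illustration, here is a minimal sketch of classifying a single new penguin with the fitted workflow. The measurements (and the `new_penguin` name) are hypothetical values made up for illustration; `year` is included because it remained in the data as a numeric predictor.

```{r}
# A minimal sketch: classify one new penguin with the fitted workflow.
# All measurement values here are hypothetical, chosen for illustration.
new_penguin <- tibble(
  bill_length_mm    = 45,
  bill_depth_mm     = 15,
  flipper_length_mm = 220,
  body_mass_g       = 5000,
  year              = 2009
)

predict(final_fit, new_data = new_penguin)
```

The workflow applies the same recipe used in training, so the new measurements are normalized with the training means and standard deviations before the $k$ nearest neighbors are found.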