Palmer Penguins, visualization, ML Classification, kNN, Confusion Matrix, Accuracy

Welcome prospective Data Science Students!

Author
Affiliation

Prof. Eric A. Suess

Department of Statistics and Biostatistics, CSU East Bay

Published

November 5, 2023

Welcome students!

This is a quick tutorial on how to visualize the Palmer Penguins dataset and then use kNN to classify the species of the penguins. We will then use accuracy and a confusion matrix to evaluate our model on a testing set.

QR Code for this page

library(qrcode)
plot(qr_code("https://rpubs.com/esuess/kNN2"))

Load the R libraries we will be using.

library(palmerpenguins)
library(DT)
library(gt)
library(naniar)

# library(devtools)
# devtools::install_github("cmartin/ggConvexHull")

library(ggConvexHull)

library(tidyverse)
library(plotly)
library(tidymodels)
library(dials)
library(tune)
library(yardstick)

Load the data

We drop two of the categorical variables, island and sex, and use species as our response variable.

data(penguins)
datatable(penguins)
penguins <- penguins |> select(-c("island","sex"))
datatable(penguins)

How many penguins of each species are there?

penguins |> select(species) |> 
            group_by(species) |>
            count() |> 
            pivot_wider(names_from = species, values_from = n) |> 
            gt()
Adelie Chinstrap Gentoo
152 68 124

How many missing values are there?

vis_miss(penguins)

n_var_miss(penguins)
[1] 4
gg_miss_var(penguins)

library(skimr)
skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1 FALSE 3 Ade: 152, Gen: 124, Chi: 68

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂

Visualize the data

penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point() 
Warning: Removed 2 rows containing missing values (`geom_point()`).

Which species is this penguin?

It is a Gentoo penguin (see the Gentoo penguin article on Wikipedia).

penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() + facet_wrap(~species)
Warning: Removed 2 rows containing missing values (`geom_point()`).

peng_convex <- penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + 
                geom_point() +
                geom_convexhull(alpha = 0.3, aes(fill = species))
peng_convex
Warning: Removed 2 rows containing non-finite values (`stat_convex_hull()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

peng_convex |> ggplotly()
Warning: Removed 2 rows containing non-finite values (`stat_convex_hull()`).

Split the data into training and testing sets

When applying machine learning we start by randomly splitting the data into a training set and a testing set. We use the training set to build our model and the testing set to evaluate its accuracy.

set.seed(123)
penguin_split <- initial_split(penguins, prop = 0.8, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

datatable(penguin_train)
datatable(penguin_test)
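Because we used strata = species, the species proportions should be close to the same in the training and testing sets. A quick check with dplyr (already loaded through the tidyverse), shown here only as an illustration:

# Compare the species proportions in the training and testing sets
penguin_train |> count(species) |> mutate(prop = n / sum(n))
penguin_test |> count(species) |> mutate(prop = n / sum(n))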

Define a recipe for data preprocessing

This recipe will normalize the numeric predictors and remove rows with missing predictor values.

penguin_recipe <- recipe(species ~ ., data = penguin_train)  |> 
  step_normalize(all_numeric())  |> 
  step_naomit(all_predictors())
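If you are curious what the preprocessing actually produces, you can prep() the recipe and bake() it on the training data. This is only an inspection step, not part of the modeling workflow below, which applies the recipe automatically.

# Inspect the preprocessed training data: predictors are centered and scaled,
# and rows with missing predictor values have been dropped
penguin_recipe |> prep() |> bake(new_data = NULL) |> skim()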

Define a 5-fold cross-validation procedure

# Define a 5-fold cross-validation procedure
set.seed(234)
penguin_folds <- vfold_cv(penguin_train, v = 5, strata = species)
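Each of the 5 folds holds out roughly one fifth of the training data for assessment. As a sanity check, you can pull out one split and look at its analysis and assessment sets (analysis() and assessment() are rsample helpers loaded with tidymodels):

# Sizes of the analysis (fitting) and assessment (validation) sets for the first fold
penguin_folds$splits[[1]]
dim(analysis(penguin_folds$splits[[1]]))
dim(assessment(penguin_folds$splits[[1]]))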

Build a kNN model for Classification

The \(k\) nearest neighbor model (kNN) classifies a new observation by finding the \(k\) closest observations in the training set and taking a majority vote of their classes. kNN is non-parametric, meaning that it does not assume a particular distribution for the data, and it is a lazy learner: rather than fitting a model, it stores the training data and uses it directly to classify new observations. Because it is so simple, kNN is often used as a baseline to compare against more complex models.

The kNN model measures distances. How do we measure distance in one dimension? We use the absolute value.

\[d(x,y) = |x-y|\]

How do we measure distance in two dimensions? We use the Euclidean distance.

\[d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}\]

How do we measure distance in \(p\) dimensions?

\[d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_p-y_p)^2}\]
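To make this concrete, here is a minimal by-hand sketch of a kNN vote for a single new penguin using two predictors and the Euclidean distance above. The new penguin's measurements are made up for illustration, and the predictors are not normalized here; the recipe defined above handles that in the real model.

# By-hand kNN vote on two predictors (illustration only; the new measurements are made up)
complete <- penguins |> drop_na()
new_bird <- c(bill_length_mm = 45, bill_depth_mm = 17)

d <- sqrt((complete$bill_length_mm - new_bird["bill_length_mm"])^2 +
          (complete$bill_depth_mm  - new_bird["bill_depth_mm"])^2)

k <- 5
nearest <- order(d)[1:k]          # row indices of the k closest penguins
table(complete$species[nearest])  # the majority vote gives the predicted species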

The kNN model

We use the kNN model for classification with the training data.

We will tune the model to find the best value of \(k\).

# The kknn engine's default weight function is used here; setting weight_func = "rectangular" would give unweighted (equal-weight) kNN
penguin_knn <- nearest_neighbor(neighbors = tune()) |> 
  set_mode("classification") |> 
  set_engine("kknn") 
# Create a parameter object for the number of neighbors, k from 4 to 15
neighbor_param <- neighbors(range = c(4, 15))
# Generate a regular grid of 12 values from the parameter object
# You can use other grid_* functions to generate different types of grids
neighbor_grid <- grid_regular(neighbor_param, levels = 12)

# Print the grid
neighbor_grid
# A tibble: 12 × 1
   neighbors
       <int>
 1         4
 2         5
 3         6
 4         7
 5         8
 6         9
 7        10
 8        11
 9        12
10        13
11        14
12        15
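As noted in the comment above, grid_regular() is just one choice; for example, you could sample a random grid from the same parameter object. This is only a sketch and is not used in the tuning below.

# Alternative: a random grid of candidate k values
set.seed(456)
grid_random(neighbor_param, size = 5)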
# Tune the number of neighbors using accuracy as the metric
set.seed(345)
penguin_tune <- tune_grid(
  penguin_knn,
  penguin_recipe,
  resamples = penguin_folds,
  grid = neighbor_grid,
  metrics = metric_set(accuracy)
)
# Find the best number of neighbors
best_k <- penguin_tune  |> 
  select_best("accuracy")
best_k
# A tibble: 1 × 2
  neighbors .config              
      <int> <chr>                
1         5 Preprocessor1_Model02
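Before finalizing the model it can be useful to look at the cross-validated accuracy for every value of \(k\) that was tried, not just the winner. Both helpers below come from the tune package:

# Cross-validated accuracy for each candidate number of neighbors
collect_metrics(penguin_tune)
# Plot accuracy versus the number of neighbors
autoplot(penguin_tune)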
final_knn <- finalize_model(penguin_knn, best_k)
final_knn 
K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = 5

Computational engine: kknn 
# Fit the final model on the training data
final_fit <- workflow()  |> 
  add_model(final_knn)  |> 
  add_recipe(penguin_recipe)  |> 
  fit(data = penguin_train)
# Predict on the testing data
final_pred <- final_fit  |> 
  predict(new_data = penguin_test)  |> 
  bind_cols(penguin_test)

datatable(final_pred)
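If you also want class probabilities rather than only the predicted class, predict() accepts type = "prob". A quick sketch, not used in the evaluation below:

# Predicted probability of each species for the test penguins
final_fit |> 
  predict(new_data = penguin_test, type = "prob") |> 
  bind_cols(penguin_test) |> 
  head()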
# Evaluate the prediction accuracy
final_acc <- final_pred  |> 
  metrics(truth = species, estimate = .pred_class)  |> 
  filter(.metric == "accuracy")

final_acc 
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.986
# Compute and print the confusion matrix
final_cm <- final_pred  |> 
  conf_mat(truth = species, estimate = .pred_class)

# Visualize the confusion matrix as a heatmap
autoplot(final_cm, type = "heatmap")
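Overall accuracy can hide how well each individual species is classified. The yardstick summary() method for a confusion matrix returns a broader set of metrics, such as sensitivity and specificity, macro-averaged across the three species, and sens() computes sensitivity directly from the predictions:

# A broader set of classification metrics derived from the confusion matrix
summary(final_cm)

# Macro-averaged sensitivity computed directly from the predictions
final_pred |> sens(truth = species, estimate = .pred_class)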

Summary

  1. We have loaded some data on penguins into R.
  2. We have visualized the data.
  3. We have cleaned the data.
  4. We have split the data into training and testing sets.
  5. We have normalized the data.
  6. We have used cross-validation to tune the model.
  7. We have built kNN models for classification and picked the best one.
  8. We have evaluated the accuracy of the model on the testing set.