library(qrcode)
plot(qr_code("https://rpubs.com/esuess/kNN2"))
Palmer Penguins, visualization, ML Classification, kNN, Confusion Matrix, Accuracy
Welcome prospective Data Science students!
This is a quick tutorial on how to visualize the Palmer Penguins dataset, and then use kNN to classify the species of penguins. We will then use a confusion matrix to evaluate the accuracy of our model.
QR Code for this page
Load the R libraries we will be using.
library(palmerpenguins)
library(DT)
library(gt)
library(naniar)
# library(devtools)
# devtools::install_github("cmartin/ggConvexHull")
library(ggConvexHull)
library(tidyverse)
library(plotly)
library(tidymodels)
library(dials)
library(tune)
library(yardstick)
Load the data
We drop the two categorical variables, island and sex. We will use the species variable as our response variable.
data(penguins)
datatable(penguins)
penguins <- penguins |> select(-c("island","sex"))
datatable(penguins)
How many penguins are there?
penguins |> select(species) |>
  group_by(species) |>
  count() |>
  pivot_wider(names_from = species, values_from = n) |>
  gt()
Adelie | Chinstrap | Gentoo
---|---|---
152 | 68 | 124
How many missing values are there?
vis_miss(penguins)
n_var_miss(penguins)
[1] 4
gg_miss_var(penguins)
library(skimr)
skim(penguins)
Name | penguins |
Number of rows | 344 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
Visualize the data
penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).
Which species is this penguin?
This is a…? See Wikipedia.
penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).
penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() + facet_wrap(~species)
Warning: Removed 2 rows containing missing values (`geom_point()`).
peng_convex <- penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  geom_convexhull(alpha = 0.3, aes(fill = species))
peng_convex
Warning: Removed 2 rows containing non-finite values (`stat_convex_hull()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
peng_convex |> ggplotly()
Warning: Removed 2 rows containing non-finite values (`stat_convex_hull()`).
Split the data into training and testing sets
When applying Machine Learning we start by randomly splitting the data into a training set and a testing set. We will use the training set to build our model, and then use the testing set to evaluate the accuracy of our model.
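Under the hood, a random split is just a random draw of row indices. Below is a minimal base-R sketch of an unstratified 80/20 split (illustration only; the tutorial's initial_split() call additionally stratifies by species so both sets keep the same class proportions):

```r
# Illustration only: an unstratified 80/20 split by hand in base R.
set.seed(1)                 # any seed; this value is arbitrary
n <- 344                    # number of rows in penguins
train_idx <- sample(n, size = floor(0.8 * n))
length(train_idx)           # 275 rows for training
n - length(train_idx)       # 69 rows for testing
```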
set.seed(123)
penguin_split <- initial_split(penguins, prop = 0.8, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
datatable(penguin_train)
datatable(penguin_test)
Define a recipe for data preprocessing
This recipe will normalize the numeric predictors and remove rows with missing values.
penguin_recipe <- recipe(species ~ ., data = penguin_train) |>
  step_normalize(all_numeric()) |>
  step_naomit(all_predictors())
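As a quick sanity check on what step_normalize() does: each numeric column is centered and scaled, i.e. \((x - \bar{x})/s\), so the transformed column has mean 0 and standard deviation 1. A base-R sketch with made-up values:

```r
# Center and scale a numeric vector, as step_normalize() does per column
x <- c(32.1, 39.2, 44.5, 48.5, 59.6)   # made-up bill lengths in mm
z <- (x - mean(x)) / sd(x)
round(mean(z), 10)   # 0
round(sd(z), 10)     # 1
```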
Define a 5-fold cross-validation procedure
# Define a 5-fold cross-validation procedure
set.seed(234)
penguin_folds <- vfold_cv(penguin_train, v = 5, strata = species)
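The idea behind vfold_cv() is simply to assign every training row to one of 5 folds; each fold then serves once as the assessment set while the other four are used for fitting. A base-R sketch of unstratified fold assignment (vfold_cv() additionally stratifies by species):

```r
# Illustration only: assign each of n rows to one of 5 folds at random
set.seed(234)
n <- 275                                 # rows in the training set
fold <- sample(rep(1:5, length.out = n)) # shuffled fold labels
table(fold)                              # 55 rows per fold
```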
Build a kNN model for Classification
The \(k\)-nearest neighbor model (kNN) classifies a new observation by finding the \(k\) closest observations in the training set and taking a majority vote of their classes. kNN is non-parametric: it does not assume a particular distribution for the data. It is also a lazy learner: instead of fitting a model, it simply stores the training data and uses it directly to classify new observations. Because it is so simple, kNN is often used as a baseline against which more complex models are compared.
The kNN model measures distances. How do we measure distance in one dimension? We use the absolute value.
\[d(x,y) = |x-y|\] How do we measure distance in two dimensions? We use the Euclidean distance.
\[d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}\]
How do we measure distance in \(p\) dimensions?
\[d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_p-y_p)^2}\]
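Putting the distance formula and the majority vote together, here is a from-scratch sketch of kNN classification in base R on a toy two-cluster dataset. This is for intuition only; the tutorial itself uses the kknn engine via parsnip:

```r
# Euclidean distance in p dimensions, matching the formula above
euclid <- function(x, y) sqrt(sum((x - y)^2))

euclid(3, 7)              # one dimension: |3 - 7| = 4
euclid(c(0, 0), c(3, 4))  # two dimensions: 5 (the 3-4-5 triangle)

# Minimal kNN classifier: distances, then majority vote among the k nearest
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  d <- apply(train_x, 1, euclid, y = new_x)  # distance to every training row
  votes <- table(train_y[order(d)[1:k]])     # classes of the k nearest
  names(votes)[which.max(votes)]             # majority vote
}

# Toy data: two well-separated clusters
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
train_y <- c("A", "A", "A", "B", "B", "B")
knn_predict(train_x, train_y, c(1.5, 1.5), k = 3)  # "A"
```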
The kNN model
We use the kNN model for classification with the training data.
We will tune the model to find the best value of \(k\).
# Using the "rectangular" weight function is the same as unweighted kNN
penguin_knn <- nearest_neighbor(neighbors = tune()) |>
  set_mode("classification") |>
  set_engine("kknn")
# Create a parameter object for the number of neighbors
neighbor_param <- neighbors(range = c(4, 15))

# Generate a regular grid of 12 values from the parameter object
# You can use other grid_* functions to generate different types of grids
neighbor_grid <- grid_regular(neighbor_param, levels = 12)

# Print the grid
neighbor_grid
# A tibble: 12 × 1
neighbors
<int>
1 4
2 5
3 6
4 7
5 8
6 9
7 10
8 11
9 12
10 13
11 14
12 15
# Tune the number of neighbors using accuracy as the metric
set.seed(345)
penguin_tune <- tune_grid(
  penguin_knn,
  penguin_recipe,
  resamples = penguin_folds,
  grid = neighbor_grid,
  metrics = metric_set(accuracy)
)
# Find the best number of neighbors
best_k <- penguin_tune |>
  select_best("accuracy")
best_k
# A tibble: 1 × 2
neighbors .config
<int> <chr>
1 5 Preprocessor1_Model02
final_knn <- finalize_model(penguin_knn, best_k)
final_knn
K-Nearest Neighbor Model Specification (classification)
Main Arguments:
neighbors = 5
Computational engine: kknn
# Fit the final model on the training data
final_fit <- workflow() |>
  add_model(final_knn) |>
  add_recipe(penguin_recipe) |>
  fit(data = penguin_train)
# Predict on the testing data
final_pred <- final_fit |>
  predict(new_data = penguin_test) |>
  bind_cols(penguin_test)
datatable(final_pred)
# Evaluate the prediction accuracy
final_acc <- final_pred |>
  metrics(truth = species, estimate = .pred_class) |>
  filter(.metric == "accuracy")
final_acc
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.986
# Compute and print the confusion matrix
final_cm <- final_pred |>
  conf_mat(truth = species, estimate = .pred_class)
# Visualize the confusion matrix as a heatmap
autoplot(final_cm, type = "heatmap")
Summary
- We have loaded some data on penguins into R.
- We have visualized the data.
- We have cleaned the data.
- We have split the data into training and testing sets.
- We have normalized the data.
- We have used cross-validation to tune the model.
- We have built kNN models for classification and picked the best one.
- We have evaluated the accuracy of the model on the testing set.