Palmer Penguins, visualization, ML Classification, kNN, Confusion Matrix, Accuracy
Welcome, prospective Data Science students!
This is a quick tutorial on how to visualize the Palmer Penguins dataset and then use kNN to classify penguin species. We will then use a confusion matrix to evaluate the accuracy of our model.
QR Code for this page
library(qrcode)
plot(qr_code("https://rpubs.com/esuess/kNN2"))
Load the R libraries we will be using.
library(palmerpenguins)
library(DT)
library(gt)
library(naniar)
# library(devtools)
# devtools::install_github("cmartin/ggConvexHull")
library(ggConvexHull)
library(tidyverse)
library(plotly)
library(tidymodels)
library(dials)
library(tune)
library(yardstick)

Load the data
We drop two of the categorical variables, island and sex, and keep species as our response variable.
data(penguins)
datatable(penguins)

penguins <- penguins |> select(-c("island","sex"))
datatable(penguins)

How many penguins are there?
penguins |> select(species) |>
group_by(species) |>
count() |>
pivot_wider(names_from = species, values_from = n) |>
gt()

| Adelie | Chinstrap | Gentoo |
|---|---|---|
| 152 | 68 | 124 |
How many missing values are there?
vis_miss(penguins)

n_var_miss(penguins)
[1] 4
gg_miss_var(penguins)

library(skimr)
skim(penguins)

| Data summary |  |
|---|---|
| Name | penguins |
| Number of rows | 344 |
| Number of columns | 5 |
| Column type frequency: factor | 1 |
| Column type frequency: numeric | 4 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
Visualize the data
penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point()

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Which species is this penguin?

This is a ...? (Photo: Wikipedia)
penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point()

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() + facet_wrap(~species)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
peng_convex <- penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point() +
geom_convexhull(alpha = 0.3, aes(fill = species))
peng_convex

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_convex_hull()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
peng_convex |> ggplotly()

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_convex_hull()`).
Split the data into training and testing sets
When applying machine learning, we start by randomly splitting the data into a training set and a testing set. We will use the training set to build our model and the testing set to evaluate its accuracy.
set.seed(123)
penguin_split <- initial_split(penguins, prop = 0.8, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
datatable(penguin_train)

datatable(penguin_test)
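Because we stratified the split on species, the class proportions in the two sets should be close to those in the full data. A quick optional check (this counting snippet is an illustrative addition, not part of the original workflow):

# Optional check: the stratified split should preserve the species proportions
penguin_train |> count(species) |> mutate(prop = n / sum(n))
penguin_test |> count(species) |> mutate(prop = n / sum(n))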
Define a recipe for data preprocessing
This recipe normalizes the numeric predictors and removes rows with missing predictor values.
penguin_recipe <- recipe(species ~ ., data = penguin_train) |>
step_normalize(all_numeric()) |>
step_naomit(all_predictors())
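To see exactly what the recipe does to the data, you can prep it on the training set and bake out the processed rows. This is an optional sanity check using the standard recipes functions prep() and bake():

# Optional: inspect the preprocessed training data produced by the recipe
penguin_recipe |>
  prep() |>
  bake(new_data = NULL) |>
  head()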
Define a 5-fold cross-validation procedure
# Define a 5-fold cross-validation procedure
set.seed(234)
penguin_folds <- vfold_cv(penguin_train, v = 5, strata = species)

Build a kNN model for Classification
The \(k\) nearest neighbor model (kNN) classifies a new observation by finding the \(k\) closest observations in the training set and assigning the class chosen by a majority vote of those neighbors. kNN is non-parametric: it does not assume a particular distribution for the data. It is also a lazy learner: rather than fitting a model, it simply stores the training data and uses it directly to classify new observations. Because it is so simple, kNN is often used as a baseline against which more complex models are compared.
The kNN model relies on measuring distances. How do we measure distance in one dimension? We use the absolute value.
\[d(x,y) = |x-y|\]
How do we measure distance in two dimensions? We use the Euclidean distance.
\[d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}\]
How do we measure distance in \(p\) dimensions?
\[d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_p-y_p)^2}\]
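To make the idea concrete, here is a minimal hand-rolled sketch of a kNN vote in base R. The object names (peng, train_x, train_y, new_x) are purely illustrative; the tidymodels workflow below handles the scaling, distance computation, and voting for us.

# Illustrative only: classify one penguin by a k = 5 majority vote in base R
peng <- na.omit(penguins)
train_x <- scale(peng[, c("bill_length_mm", "bill_depth_mm",
                          "flipper_length_mm", "body_mass_g")])
train_y <- peng$species
new_x <- train_x[1, ]                      # pretend the first penguin is new
# Euclidean distance from the new penguin to every other penguin
d <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), 4, byrow = TRUE))^2))
k <- 5
nearest <- order(d)[2:(k + 1)]             # k closest penguins, skipping itself
table(train_y[nearest])                    # the majority vote
names(which.max(table(train_y[nearest])))  # predicted species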
The kNN model
We use the kNN model for classification with the training data.
We will tune the model to find the best value of \(k\).
# Using the "rectangular" weight function is the same as unweighted kNN
penguin_knn <- nearest_neighbor(neighbors = tune()) |>
set_mode("classification") |>
set_engine("kknn") # Create a parameter object for the number of neighbors
# Create a parameter object for the number of neighbors
# Consider values of k from 4 to 15
neighbor_param <- neighbors(range = c(4, 15))
# Generate a regular grid of 12 values from the parameter object
# You can use other grid_* functions to generate different types of grids
neighbor_grid <- grid_regular(neighbor_param, levels = 12)
# Print the grid
neighbor_grid

# A tibble: 12 × 1
neighbors
<int>
1 4
2 5
3 6
4 7
5 8
6 9
7 10
8 11
9 12
10 13
11 14
12 15
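As the comment above notes, other grid_* helpers are available. For example, a random grid of candidate values could be drawn instead (an optional alternative, not used below; the seed and size are arbitrary):

# Optional alternative: a random grid of 8 candidate values
set.seed(42)
grid_random(neighbor_param, size = 8)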
# Tune the number of neighbors using accuracy as the metric
set.seed(345)
penguin_tune <- tune_grid(
penguin_knn,
penguin_recipe,
resamples = penguin_folds,
grid = neighbor_grid,
metrics = metric_set(accuracy)
)
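Before selecting a single value of k, it is worth looking at how the cross-validated accuracy varies across the whole grid. This optional step uses the standard tune helpers collect_metrics() and autoplot():

# Optional: cross-validated accuracy for every candidate value of k
collect_metrics(penguin_tune)
# Plot accuracy against the number of neighbors
autoplot(penguin_tune)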
# Find the best number of neighbors
best_k <- penguin_tune |>
select_best(metric = "accuracy")
best_k

# A tibble: 1 × 2
neighbors .config
<int> <chr>
1 5 Preprocessor1_Model02
final_knn <- finalize_model(penguin_knn, best_k)
final_knn

K-Nearest Neighbor Model Specification (classification)
Main Arguments:
neighbors = 5
Computational engine: kknn
# Fit the final model on the training data
final_fit <- workflow() |>
add_model(final_knn) |>
add_recipe(penguin_recipe) |>
fit(data = penguin_train)

# Predict on the testing data
final_pred <- final_fit |>
predict(new_data = penguin_test) |>
bind_cols(penguin_test)
datatable(final_pred)
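If you also want the model's class probabilities rather than just the hard class labels, you can predict with type = "prob". This is an optional addition; final_prob is a new object name introduced here for illustration.

# Optional: predicted class probabilities, one column per species
final_prob <- final_fit |>
  predict(new_data = penguin_test, type = "prob") |>
  bind_cols(penguin_test)
datatable(final_prob)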
# Evaluate the prediction accuracy
final_acc <- final_pred |>
metrics(truth = species, estimate = .pred_class) |>
filter(.metric == "accuracy")
final_acc

# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.986
# Compute and print the confusion matrix
final_cm <- final_pred |>
conf_mat(truth = species, estimate = .pred_class)
# Visualize the confusion matrix as a heatmap
autoplot(final_cm, type = "heatmap")
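Beyond overall accuracy, the confusion matrix can be summarised into per-class metrics such as sensitivity and specificity. This optional step uses yardstick's summary() method for conf_mat objects:

# Optional: accuracy, kappa, sensitivity, specificity, etc. from the confusion matrix
summary(final_cm)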
Summary
- We have loaded some data on penguins into R.
- We have visualized the data.
- We have cleaned the data.
- We have split the data into training and testing sets.
- We have normalized the data.
- We have used cross-validation to tune the model.
- We have built kNN models for classification and picked the best one.
- We have evaluated the accuracy of the model on the testing set.