--- title: "Palmer Penguins, visualization, ML Classification, kNN, Confussion Matrix, Accuracy" subtitle: "Welcome prospective Data Science Students!" author: - name: "Prof. Eric A. Suess" affiliation: "Department of Statistics and Biostatistics, CSU East Bay" date: "`r format(Sys.time(), '%d %B %Y')`" format: html: self-contained: true --- ## Welcome students! This is a quick tutorial on how to visualize the Palmer Penguins dataset, and then use kNN to classify the species of penguins. We will then use a confusion matrix to evaluate the accuracy of our model. ## QR Code for this page ```{r} #| label: qrcode library(qrcode) plot(qr_code("https://rpubs.com/esuess/kNN2")) ``` ## Load the R libraries we will be using. ```{r} #| message = FALSE library(palmerpenguins) library(DT) library(gt) library(naniar) # library(devtools) # devtools::install_github("cmartin/ggConvexHull") library(ggConvexHull) library(tidyverse) library(plotly) library(tidymodels) library(dials) library(tune) library(yardstick) ``` ## Load the data We drop the two *categorical* variables, `island` and `sex`. We will use the `species` variable as our response variable. ```{r} data(penguins) datatable(penguins) penguins <- penguins |> select(-c("island","sex")) datatable(penguins) ``` ## How many penguins are there? ```{r} penguins |> select(species) |> group_by(species) |> count() |> pivot_wider(names_from = species, values_from = n) |> gt() ``` ## How many missing values are there? ```{r} vis_miss(penguins) n_var_miss(penguins) gg_miss_var(penguins) ``` ```{r} #| message = FALSE library(skimr) skim(penguins) ``` ## Visualize the data ```{r} penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point() ``` Which species is this penguin? This is a? ![Gentoo](Brown_Bluff-2016-Tabarin_Peninsula%E2%80%93Gentoo_penguin_(Pygoscelis_papua)_03.jpg) [Wikipedia](https://en.wikipedia.org/wiki/Gentoo_penguin) ```{r} penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() ``` ```{r} penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() + facet_wrap(~species) ``` ```{r} peng_convex <- penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point() + geom_convexhull(alpha = 0.3, aes(fill = species)) peng_convex peng_convex |> ggplotly() ``` ## Split the data into training and testing sets When applying [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) we start by randomly splitting the data into a training set and a testing set. We will use the training set to build our model, and then use the testing set to evaluate the accuracy of our model. 
Here is the split with `initial_split()`, stratified by `species` so that the species proportions are preserved in both sets:

```{r}
set.seed(123)
penguin_split <- initial_split(penguins, prop = 0.8, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

datatable(penguin_train)
datatable(penguin_test)
```

## Define a recipe for data preprocessing

This recipe normalizes the numeric predictors and removes rows with missing values.

```{r}
penguin_recipe <- recipe(species ~ ., data = penguin_train) |>
  step_normalize(all_numeric_predictors()) |>
  step_naomit(all_predictors())
```

## Define a 5-fold cross-validation procedure

```{r}
# Define a 5-fold cross-validation procedure
set.seed(234)
penguin_folds <- vfold_cv(penguin_train, v = 5, strata = species)
```

## Build a kNN model for Classification

The $k$ nearest neighbor model [kNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) is a simple model that classifies a new observation by finding the $k$ closest observations in the training set and then taking a majority vote among those $k$ neighbors. The kNN model is non-parametric, meaning that it does not assume a particular distribution for the data. It is also a **lazy learner**, meaning that it *does not build a model*, but rather stores the training data and uses it directly to classify new observations. Because it is so simple, kNN is often used as a baseline model to compare against more complex models.

The kNN model measures distances. How do we measure distance in one dimension? We use the absolute value.

$$d(x,y) = |x-y|$$

How do we measure distance in two dimensions? We use the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).

$$d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}$$

How do we measure distance in $p$ dimensions?

$$d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_p-y_p)^2}$$

A short numeric sketch of this computation follows.
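To make the formula concrete, here is a small sketch in R. The two vectors are hypothetical bill length and bill depth measurements (in mm), chosen just for illustration.

```{r}
# Euclidean distance between two hypothetical penguins measured on
# bill length (mm) and bill depth (mm)
x <- c(39.1, 18.7)
y <- c(46.5, 17.9)
sqrt(sum((x - y)^2))

# dist() computes the same pairwise distance from a matrix of observations
dist(rbind(x, y))
```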
## The kNN model

We use the kNN model for classification with the training data. We will tune the model to find the best value of $k$.

```{r}
# Setting weight_func = "rectangular" gives unweighted (classic) kNN
penguin_knn <- nearest_neighbor(neighbors = tune(), weight_func = "rectangular") |>
  set_mode("classification") |>
  set_engine("kknn")
```

```{r}
# Create a parameter object for the number of neighbors
neighbor_param <- neighbors(range = c(4, 15))

# Generate a regular grid of 12 values from the parameter object
# You can use other grid_* functions to generate different types of grids
neighbor_grid <- grid_regular(neighbor_param, levels = 12)

# Print the grid
neighbor_grid
```

```{r}
#| message: false
# Tune the number of neighbors using accuracy as the metric
set.seed(345)
penguin_tune <- tune_grid(
  penguin_knn,
  penguin_recipe,
  resamples = penguin_folds,
  grid = neighbor_grid,
  metrics = metric_set(accuracy)
)
```

```{r}
# Find the best number of neighbors
best_k <- penguin_tune |>
  select_best(metric = "accuracy")
best_k
```

```{r}
# Finalize the model specification with the best k
final_knn <- finalize_model(penguin_knn, best_k)
final_knn
```

```{r}
# Fit the final model on the training data
final_fit <- workflow() |>
  add_model(final_knn) |>
  add_recipe(penguin_recipe) |>
  fit(data = penguin_train)
```

```{r}
# Predict on the testing data
final_pred <- final_fit |>
  predict(new_data = penguin_test) |>
  bind_cols(penguin_test)

datatable(final_pred)
```

```{r}
# Evaluate the prediction accuracy
final_acc <- final_pred |>
  metrics(truth = species, estimate = .pred_class) |>
  filter(.metric == "accuracy")
final_acc
```

```{r}
# Compute and print the confusion matrix
final_cm <- final_pred |>
  conf_mat(truth = species, estimate = .pred_class)

# Visualize the confusion matrix as a heatmap
autoplot(final_cm, type = "heatmap")
```

## Summary

1. We have loaded some data on penguins into R.
2. We have visualized the data.
3. We have cleaned the data.
4. We have split the data into training and testing sets.
5. We have normalized the data.
6. We have used cross-validation to tune the model.
7. We have built kNN models for classification and picked the best one.
8. We have evaluated the accuracy of the final model on the testing set.
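## Bonus: classify a new penguin

As a final illustration, here is a minimal sketch of classifying a single new penguin with the fitted workflow. The measurements (and the `new_penguin` name) are hypothetical values made up for illustration; `year` is included because it remained in the data as a numeric predictor.

```{r}
# A minimal sketch: classify one new penguin with the fitted workflow.
# All measurement values here are hypothetical, chosen for illustration.
new_penguin <- tibble(
  bill_length_mm    = 45,
  bill_depth_mm     = 15,
  flipper_length_mm = 220,
  body_mass_g       = 5000,
  year              = 2009
)

predict(final_fit, new_data = new_penguin)
```

The workflow applies the same recipe used in training, so the new measurements are normalized with the training means and standard deviations before the $k$ nearest neighbors are found.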