--- title: "Palmer Penguins, visualization, ML Classification, kNN, Confussion Matrix, Accuracy" subtitle: "Welcome prospective Data Science Students!" author: - name: "Prof. Eric A. Suess" affiliation: "Department of Statistics and Biostatistics, CSU East Bay" date: "`r format(Sys.time(), '%d %B %Y')`" format: html: embed-resources: true --- ## Welcome students! This is a quick tutorial on how to visualize the Palmer Penguins dataset, and then use kNN to classify the species of penguins. We will then use a confusion matrix to evaluate the accuracy of our model. ## QR Code for this page ```{r} #| label: qrcode library(qrcode) plot(qr_code("https://rpubs.com/esuess/kNN")) ``` ## Load the R libraries we will be using. ```{r} #| message = FALSE library(palmerpenguins) library(DT) library(gt) library(naniar) # library(devtools) # devtools::install_github("cmartin/ggConvexHull") library(ggConvexHull) library(tidyverse) library(plotly) library(tidymodels) library(yardstick) ``` ## Load the data We drop the two *categorical* variables, `island` and `sex`. We will use the `species` variable as our response variable. ```{r} data(penguins) datatable(penguins) penguins <- penguins |> select(-c("island","sex")) datatable(penguins) ``` ## How many penguins are there? ```{r} penguins |> select(species) |> group_by(species) |> count() |> pivot_wider(names_from = species, values_from = n) |> gt() ``` ## How many missing values are there? ```{r} vis_miss(penguins) n_var_miss(penguins) gg_miss_var(penguins) ``` ```{r} #| message = FALSE library(skimr) skim(penguins) ``` ## Drop the missing values We will be using the kNN algorithm, so we need to remove the rows of data with missing values. ```{r} penguins <- penguins |> drop_na() datatable(penguins) ``` ## Visualize the data ```{r} penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point() ``` Which species is this penguin? This is a? ![Gentoo](Brown_Bluff-2016-Tabarin_Peninsula%E2%80%93Gentoo_penguin_(Pygoscelis_papua)_03.jpg) [Wikipedia](https://en.wikipedia.org/wiki/Gentoo_penguin) ```{r} penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() ``` ```{r} penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() + facet_wrap(~species) ``` ```{r} peng_convex <- penguins |> ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point() + geom_convexhull(alpha = 0.3, aes(fill = species)) peng_convex peng_convex |> ggplotly() ``` ## Split the data into training and testing sets When applying [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) we start by randomly splitting the data into a training set and a testing set. We will use the training set to build our model, and then use the testing set to evaluate the accuracy of our model. ```{r} set.seed(123) penguin_split <- initial_split(penguins, prop = 0.8, strata = species) penguin_train <- training(penguin_split) penguin_test <- testing(penguin_split) datatable(penguin_train) datatable(penguin_test) ``` ## Build a kNN model for Classification The $k$ nearest neighbor model [kNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) is a simple model that classifies a new observation by finding the $k$ closest observations in the training set, and then classifying the new observation by the majority vote of the k closest observations. The kNN model is a non-parametric model, meaning that it does not assume a particular distribution for the data. 
## The kNN model

We use the kNN model for classification with the training data.

```{r}
# Using the "rectangular" weight function is the same as unweighted kNN
knn_model <- nearest_neighbor(weight_func = "rectangular", neighbors = 4) |>
  set_mode("classification") |>
  set_engine("kknn") |>
  fit(species ~ ., data = penguin_train)

knn_model
```

## Make predictions on the training set

```{r}
knn_predictions <- predict(knn_model, penguin_train) |>
  bind_cols(penguin_train)
```

## Confusion Matrix: count the correctly classified penguins in the training set

```{r}
conf_m <- conf_mat(knn_predictions, truth = species, estimate = .pred_class)
conf_m

autoplot(conf_m, type = "heatmap")
```

## Accuracy of the model on the training data

```{r}
accuracy(knn_predictions, truth = species, estimate = .pred_class)
```

## Make predictions on the testing set

```{r}
knn_predictions <- predict(knn_model, penguin_test) |>
  bind_cols(penguin_test)
```

## Confusion Matrix: count the correctly classified penguins in the testing set

```{r}
conf_m <- conf_mat(knn_predictions, truth = species, estimate = .pred_class)
conf_m

autoplot(conf_m, type = "heatmap")
```

## Accuracy of the model on the testing data

```{r}
accuracy(knn_predictions, truth = species, estimate = .pred_class)
```

## Summary

1. We have loaded some data on penguins into R.
2. We have visualized the data.
3. We have cleaned the data.
4. We have split the data into training and testing sets.
5. We have neglected to scale or normalize the data. (See this second analysis, which normalizes the data and uses cross-validation to tune the model to pick the best $k$: [kNN2](https://rpubs.com/esuess/kNN2). A minimal sketch of the normalization step follows after this list.)
6. We have built a kNN model for classification.
7. We have evaluated the accuracy of the model on the training set.
8. We have evaluated the accuracy of the model on the testing set.
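As noted in point 5, the predictors are on very different scales, so body mass dominates the Euclidean distances that kNN computes. Below is a minimal sketch, assuming the same training/testing split as above, of how the predictors could be normalized with a `recipes` recipe inside a `workflows` workflow. The object names (`knn_spec`, `penguin_recipe`, `knn_workflow`, `knn_fit`) are illustrative; the fully tuned version is in the [kNN2](https://rpubs.com/esuess/kNN2) analysis linked above.

```{r}
# A sketch: normalize the numeric predictors before fitting kNN.
# step_normalize() centers and scales each predictor using the
# training data, so no single variable dominates the distances.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 4) |>
  set_mode("classification") |>
  set_engine("kknn")

penguin_recipe <- recipe(species ~ ., data = penguin_train) |>
  step_normalize(all_numeric_predictors())

knn_workflow <- workflow() |>
  add_recipe(penguin_recipe) |>
  add_model(knn_spec)

knn_fit <- knn_workflow |>
  fit(data = penguin_train)

# Accuracy on the testing set with normalized predictors
predict(knn_fit, penguin_test) |>
  bind_cols(penguin_test) |>
  accuracy(truth = species, estimate = .pred_class)
```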