---
title: "R Notebook"
output: html_notebook
---

```{r}
library(pacman)
# visdat provides vis_miss(), which is used below to plot missing values.
# The glmnet and klaR engine packages used in Step 5 must also be installed.
p_load(NHANES, tidyverse, tidymodels, discrim, visdat)
```

### Step 1: Access the data.

```{r}
data(NHANES)
NHANES
```

### Step 2: Clean the data.

Prepare the data and make a recipe for applying the steps needed to preprocess the data.

First, remove the ID and SleepHrsNight columns, move SleepTrouble to the front, and drop all of the rows where the y-variable SleepTrouble is missing.

```{r}
NHANES_SleepTrouble <- NHANES %>%
  select(-ID, -SleepHrsNight) %>%
  select(SleepTrouble, everything()) %>%
  drop_na(SleepTrouble)

NHANES_SleepTrouble
```

Summarize the y-variable.

```{r}
NHANES_SleepTrouble %>%
  group_by(SleepTrouble) %>%
  summarize(n = n()) %>%
  mutate(freq = n / sum(n))
```

Make the first split, with 80% of the data going into the training data set.

```{r}
NHANES_SleepTrouble_split <- initial_split(NHANES_SleepTrouble, prop = 0.8)
NHANES_SleepTrouble_split
```

Training data.

```{r}
NHANES_SleepTrouble_split %>%
  training()
```

```{r}
# Plot the pattern of missing values in the training data.
NHANES_SleepTrouble_split %>%
  training() %>%
  vis_miss()
```

Create the recipe for applying the preprocessing. Note the use of step_nzv(), which removes any columns that have very low variability; the use of step_impute_knn() (named step_knnimpute() in older versions of recipes), which fills in missing cells using the values from the most similar rows (the k nearest neighbors); and the use of step_corr(), which removes highly correlated input features.

```{r}
NHANES_SleepTrouble_recipe <- training(NHANES_SleepTrouble_split) %>%
  recipe(SleepTrouble ~ .) %>%
  step_nzv(all_predictors()) %>%
  # step_impute_knn() replaces the deprecated step_knnimpute().
  step_impute_knn(all_predictors()) %>%
  step_corr(all_numeric()) %>%
  prep()

summary(NHANES_SleepTrouble_recipe)
tidy(NHANES_SleepTrouble_recipe)
```

```{r}
# Apply the prepared recipe to the testing data.
NHANES_SleepTrouble_testing <- NHANES_SleepTrouble_recipe %>%
  bake(new_data = testing(NHANES_SleepTrouble_split))

NHANES_SleepTrouble_testing
```

```{r}
# Extract the preprocessed training data from the prepared recipe.
NHANES_SleepTrouble_training <- juice(NHANES_SleepTrouble_recipe)
NHANES_SleepTrouble_training
```

### Step 3: Train a model on the data.

Set up the models.
```{r}
NHANES_SleepTrouble_glm <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(SleepTrouble ~ ., data = NHANES_SleepTrouble_training)
```

```{r}
predict(NHANES_SleepTrouble_glm, NHANES_SleepTrouble_testing)
```

```{r}
NHANES_SleepTrouble_glm %>%
  predict(NHANES_SleepTrouble_testing) %>%
  bind_cols(NHANES_SleepTrouble_testing)
```

### Step 4: Evaluate the models.

```{r}
NHANES_SleepTrouble_glm %>%
  predict(NHANES_SleepTrouble_testing) %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  metrics(truth = SleepTrouble, estimate = .pred_class)
```

```{r}
NHANES_SleepTrouble_glm %>%
  predict(NHANES_SleepTrouble_testing) %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  conf_mat(truth = SleepTrouble, estimate = .pred_class)
```

```{r}
NHANES_SleepTrouble_glm %>%
  predict(NHANES_SleepTrouble_testing, type = "prob") %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  # "Yes" is the second factor level of SleepTrouble, so mark it as the event;
  # yardstick otherwise treats the first level ("No") as the event.
  roc_curve(SleepTrouble, .pred_Yes, event_level = "second") %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal()
```

```{r}
NHANES_SleepTrouble_glm %>%
  predict(NHANES_SleepTrouble_testing, type = "prob") %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  roc_auc(SleepTrouble, .pred_Yes, event_level = "second")
```

### Step 5: Improve the model.

GLM model using regularization (an elastic net penalty fitted with glmnet).
```{r}
# mixture = 0.5 blends the lasso and ridge penalties equally (elastic net).
NHANES_SleepTrouble_glmnet <- logistic_reg(penalty = 0.001, mixture = 0.5) %>%
  set_engine("glmnet") %>%
  set_mode("classification") %>%
  fit(SleepTrouble ~ ., data = NHANES_SleepTrouble_training)

summary(NHANES_SleepTrouble_glmnet)
```

```{r}
predict(NHANES_SleepTrouble_glmnet, NHANES_SleepTrouble_testing)
```

```{r}
NHANES_SleepTrouble_glmnet %>%
  predict(NHANES_SleepTrouble_testing) %>%
  bind_cols(NHANES_SleepTrouble_testing)
```

```{r}
NHANES_SleepTrouble_glmnet %>%
  predict(NHANES_SleepTrouble_testing) %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  metrics(truth = SleepTrouble, estimate = .pred_class)
```

```{r}
NHANES_SleepTrouble_glmnet %>%
  predict(NHANES_SleepTrouble_testing) %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  conf_mat(truth = SleepTrouble, estimate = .pred_class)
```

```{r}
NHANES_SleepTrouble_glmnet %>%
  predict(NHANES_SleepTrouble_testing, type = "prob") %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  # "Yes" is the second factor level of SleepTrouble, so mark it as the event.
  roc_curve(SleepTrouble, .pred_Yes, event_level = "second") %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal()
```

```{r}
NHANES_SleepTrouble_glmnet %>%
  predict(NHANES_SleepTrouble_testing, type = "prob") %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  roc_auc(SleepTrouble, .pred_Yes, event_level = "second")
```

Naive Bayes model using kernel smoothing.
```{r}
# smoothness controls how smooth the kernel density estimates are.
NHANES_SleepTrouble_nb <- naive_Bayes(smoothness = .1) %>%
  set_engine("klaR") %>%
  set_mode("classification") %>%
  fit(SleepTrouble ~ ., data = NHANES_SleepTrouble_training)

summary(NHANES_SleepTrouble_nb)
```

```{r}
predict(NHANES_SleepTrouble_nb, NHANES_SleepTrouble_testing)
```

```{r}
NHANES_SleepTrouble_nb %>%
  predict(NHANES_SleepTrouble_testing) %>%
  bind_cols(NHANES_SleepTrouble_testing)
```

```{r}
NHANES_SleepTrouble_nb %>%
  predict(NHANES_SleepTrouble_testing) %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  metrics(truth = SleepTrouble, estimate = .pred_class)
```

```{r}
NHANES_SleepTrouble_nb %>%
  predict(NHANES_SleepTrouble_testing) %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  conf_mat(truth = SleepTrouble, estimate = .pred_class)
```

```{r}
NHANES_SleepTrouble_nb %>%
  predict(NHANES_SleepTrouble_testing, type = "prob") %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  # "Yes" is the second factor level of SleepTrouble, so mark it as the event.
  roc_curve(SleepTrouble, .pred_Yes, event_level = "second") %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal()
```

```{r}
NHANES_SleepTrouble_nb %>%
  predict(NHANES_SleepTrouble_testing, type = "prob") %>%
  bind_cols(NHANES_SleepTrouble_testing) %>%
  roc_auc(SleepTrouble, .pred_Yes, event_level = "second")
```
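
Compare the models.

The evaluation steps are repeated for each of the three models, so a small helper can make the comparison easier to read. The chunk below is only a sketch: `compare_auc()` is a hypothetical helper that is not part of the original notebook. It assumes that each model is a fitted parsnip classifier, that the outcome column is named SleepTrouble, and that "Yes" is its second factor level.

```{r}
library(tidymodels)

# Hypothetical helper (not part of the original notebook): returns one row of
# test-set ROC AUC per fitted model, sorted from highest to lowest.
compare_auc <- function(models, test_data) {
  purrr::imap(models, function(model, name) {
    model %>%
      predict(test_data, type = "prob") %>%
      bind_cols(test_data) %>%
      # Assumption: "Yes" is the second factor level of SleepTrouble.
      roc_auc(SleepTrouble, .pred_Yes, event_level = "second") %>%
      mutate(model = name, .before = 1)
  }) %>%
    bind_rows() %>%
    arrange(desc(.estimate))
}
```

With the objects created above, the helper could be called as `compare_auc(list(glm = NHANES_SleepTrouble_glm, glmnet = NHANES_SleepTrouble_glmnet, nb = NHANES_SleepTrouble_nb), NHANES_SleepTrouble_testing)`. Because no random seed is set before the split, the exact values will vary from run to run.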