Stat. 652 Midterm

Author

Prof. Eric A. Suess

Midterm

For the titanic data set try the following machine learning classification algorithms.

Use the training and test datasets from the titanic R package.

You should note that the titanic_train has the Survived variable and the titanic_test does not. So to select your best model you need to use the titanic_train dataset to train and test your models. So that means you will need to select a training dataset from titanic_train and select a testing dataset (this would be a validation dataset) from titanic_train to evaluate the models you try.

I have not demonstrated the use of cross-validation, once you are comfortable running all of the models see if you can figure out how to use cross-validation to pick the best model.

Once you have picked the best model you should do the following:

  1. Re-run your chosen model on the full titanic_train dataset.
  2. Then produce predictions for the titanic_test dataset. This is what you would submit in a .csv to kaggle in a competition.

Build classification models for the Survived variable. Pick a model scoring function and determine which model is the best. I would suggest making a confusion matrix and computing the accuracy or kappa.

  1. Null Model
  2. kNN (the sample code given did not scale or normalize, if you use this model you need to do that.)
  3. Boosted C5.0
  4. Random Forest
  5. Logistic Regression using regularization
  6. Naive Bayes

Extra Credit:

Make one plot containing all of the ROC curves for the algorithms trained.

library(tidyverse)
library(tidymodels)

Data

Loading

library(titanic)
data(titanic_train)
data(titanic_test)

Exploration

library(naniar)
naniar::gg_miss_var(titanic_train)

naniar::gg_miss_var(titanic_test)

Prep

titanic_train <- titanic_train |>
    mutate(
        Survived = as.factor(Survived),
        Sex = as.factor(Sex),
        Embarked = as.factor(Embarked),
        Pclass = as.factor(Pclass),
        SibSp = as.double(SibSp),
        Parch = as.double(Parch)
    )
titanic_test <- titanic_test |>
    mutate(
        Sex = as.factor(Sex),
        Embarked = as.factor(Embarked),
        Pclass = as.factor(Pclass),
        SibSp = as.double(SibSp),
        Parch = as.double(Parch)
    )
# Training/validation split.
set.seed(108)
parts <- titanic_train |> initial_split(prop = 0.8, strata = Survived)
train <- parts |> training()
val <- parts |> testing()
rec <- recipe(Survived ~ ., data = train) |>
    update_role(PassengerId, Name, Ticket, Cabin, new_role = "id") |>
    step_impute_mean(all_numeric_predictors()) |>
    step_nzv(all_predictors()) |>
    step_normalize(all_numeric_predictors())

preprocess <- rec 

Models

Null model

The null model just predicts not survived, and has a validation accuracy of 61.5%.

model_null <- null_model() |>
    set_engine("parsnip") |>
    set_mode("classification")

fit_null <- workflow() |>
    add_formula(Survived ~ .) |>
    add_model(model_null) |>
    last_fit(parts)
→ A | warning: Novel levels found in column "Name": "McCarthy, Mr. Timothy J", "Hewlett, Mrs.
               (Mary D Kingcome) ", "Williams, Mr. Charles Eugene", "Fynney, Mr. Joseph J",
               "Sloper, Mr. William Thompson", "Asplund, Mrs. Carl Oscar (Selma Augusta Emilia
               Johansson)", "Uruchurtu, Don. Manuel E", "Meyer, Mr. Edgar Joseph", "Holverson,
               Mr. Alexander Oskar", "Kraeff, Mr. Theodor", "Arnold-Franchi, Mrs. Josef
               (Josefine Franchi)", "Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)",
               "Sirayanian, Mr. Orsen", "Harris, Mr. Henry Birkhardt", "Skoog, Master.
               Harald", "Andersson, Miss. Erna Alexandra", "Staneff, Mr. Ivan", "Moutal, Mr.
               Rahamin Haim", …, "Carlsson, Mr. Frans Olof", and "Petroff, Mr. Nedelio".
               ℹ The levels have been removed, and values have been coerced to <NA>., Novel levels found in column "Ticket": "17463", "248706", "244373", "113788",
               "PC 17601", "349253", "349237", "2926", "2669", "3101281", "349208", "374746",
               "113059", "343275", "7540", "349207", "312991", "349249", …, "695", and
               "349212".
               ℹ The levels have been removed, and values have been coerced to <NA>., Novel levels found in column "Cabin": "E46", "A6", "C110", "E33", "C49", "B80",
               "C93", "D7", "C106", "C124", "B35", "C104", "C111", "D21", "A14", "C86", "B41",
               "B50", "C90", and "B101".
               ℹ The levels have been removed, and values have been coerced to <NA>.
There were issues with some computations   A: x1
There were issues with some computations   A: x1
fit_null |> collect_metrics()
# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.615 Preprocessor1_Model1
2 roc_auc     binary         0.5   Preprocessor1_Model1
3 brier_class binary         0.237 Preprocessor1_Model1
fit_null |> collect_predictions() |> conf_mat(truth = Survived, estimate = .pred_class)
          Truth
Prediction   0   1
         0 110  69
         1   0   0

k-NN

For k nearest neighbors, after doing a grid search of k = 1 to 10 along with 10-fold cross validation, the optimal k value was found to be 7, yielding an accuracy of 79.3%.

The k nearest neighbors model found k = 8 to be the optimal v

model_knn <- nearest_neighbor(neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

workflow_knn <- workflow() |>
    add_recipe(preprocess) |>
    add_model(model_knn)

fit_knn <- workflow_knn |>
    tune_grid(
        resamples = vfold_cv(train, v = 10, strata = Survived),
        grid = expand_grid(neighbors = 1:10),
        control = control_grid(save_pred = T)
    )

best_knn <- fit_knn |> select_best(metric = "accuracy")
best_knn
# A tibble: 1 × 2
  neighbors .config              
      <int> <chr>                
1         7 Preprocessor1_Model07
workflow_knn_final <- workflow_knn |> finalize_workflow(best_knn)

fit_knn_final <- workflow_knn_final |> last_fit(parts)

fit_knn_final |> collect_metrics()
# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.793 Preprocessor1_Model1
2 roc_auc     binary         0.808 Preprocessor1_Model1
3 brier_class binary         0.167 Preprocessor1_Model1
fit_knn_final |>
    collect_predictions() |>
    conf_mat(truth = Survived, estimate = .pred_class)
          Truth
Prediction  0  1
         0 98 25
         1 12 44

Boosted C5.0

Wasn’t able to get C5.0 to work with tidymodels tuning.

model_c50 <- boost_tree(trees = 10) |>
    set_engine("C5.0") |>
    set_mode("classification")

workflow_c50 <- workflow() |>
    add_recipe(preprocess) |>
    add_model(model_c50)

fit_c50 <- workflow_c50 |> fit(train)
c50 code called exit with value 1
fit_c50
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_impute_mean()
• step_nzv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

Call:
C5.0.default(x = x, y = y, trials = 10, control = C50::C5.0Control(minCases
 = 2, sample = 0))

Classification Tree
Number of samples: 712 
Number of predictors: 7 

Number of boosting iterations: 10 

Non-standard options: attempt to group attributes

Random forest

The random forest model has an accuracy of 81.6%.

model_rf <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |>
    set_engine("randomForest") |>
    set_mode("classification")

workflow_rf <- workflow() |>
    add_recipe(rec) |>
    add_model(model_rf)

fit_rf <- workflow_rf |>
    tune_grid(
        resamples = vfold_cv(train, v = 10, strata = Survived),
        grid = grid_latin_hypercube(
            mtry(range = c(1, 7)),
            trees(), min_n(),
            size = 10
        ),
        control = control_grid(save_pred = T)
    )
Warning: `grid_latin_hypercube()` was deprecated in dials 1.3.0.
ℹ Please use `grid_space_filling()` instead.
best_rf <- fit_rf |> select_best(metric = "accuracy")
best_rf
# A tibble: 1 × 4
   mtry trees min_n .config              
  <int> <int> <int> <chr>                
1     4  1156     5 Preprocessor1_Model06
workflow_rf_final <- workflow_rf |> finalize_workflow(best_rf)

fit_rf_final <- workflow_rf_final |> last_fit(parts)

fit_rf_final |> collect_metrics()
# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.816 Preprocessor1_Model1
2 roc_auc     binary         0.853 Preprocessor1_Model1
3 brier_class binary         0.143 Preprocessor1_Model1
fit_rf_final |>
    collect_predictions() |>
    conf_mat(truth = Survived, estimate = .pred_class)
          Truth
Prediction   0   1
         0 104  27
         1   6  42

Logistic regression using regularization

The logistic regression model obtained an accuracy of 78.8%.

model_lr <- logistic_reg(penalty = tune(), mixture = tune()) |>
    set_engine("glmnet") |>
    set_mode("classification")

# Extra preprocessing since logistic regression only works with numeric or dummy
# variables.
preprocess_lr <- rec |>
    step_dummy(Pclass, Sex, Embarked) 

workflow_lr <- workflow() |>
    add_recipe(preprocess_lr) |>
    add_model(model_lr)

fit_lr <- workflow_lr |>
    tune_grid(
        resamples = vfold_cv(train, v = 10, strata = Survived),
        grid = grid_latin_hypercube(penalty(), mixture(), size = 10),
        control = control_grid(save_pred = T)
    )

best_lr <- fit_lr |> select_best(metric = "accuracy")
best_lr
# A tibble: 1 × 3
  penalty mixture .config              
    <dbl>   <dbl> <chr>                
1  0.0382  0.0125 Preprocessor1_Model01
workflow_lr_final <- workflow_lr |> finalize_workflow(best_lr)

fit_lr_final <- workflow_lr_final |> last_fit(parts)

fit_lr_final |> collect_metrics()
# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.788 Preprocessor1_Model1
2 roc_auc     binary         0.856 Preprocessor1_Model1
3 brier_class binary         0.149 Preprocessor1_Model1
fit_lr_final |>
    collect_predictions() |>
    conf_mat(truth = Survived, estimate = .pred_class)
          Truth
Prediction   0   1
         0 101  29
         1   9  40

Naive Bayes

The Naive Bayes model obtained an accuracy of 78.2%.

# Required for Naive Bayes implementation.
library(agua)

Attaching package: 'agua'
The following object is masked from 'package:workflowsets':

    rank_results
library(discrim)

Attaching package: 'discrim'
The following object is masked from 'package:dials':

    smoothness
model_nb <- naive_Bayes(smoothness = tune(), Laplace = tune()) |>
    set_engine("klaR") |>
    set_mode("classification")

workflow_nb <- workflow() |>
    add_recipe(preprocess) |>
    add_model(model_nb)

fit_nb <- workflow_nb |>
    tune_grid(
        resamples = vfold_cv(train, v = 10, strata = Survived),
        grid = grid_latin_hypercube(smoothness(), Laplace(), size = 10),
        control = control_grid(save_pred = T)
    )

best_nb <- fit_nb |> select_best(metric = "accuracy")
best_nb
# A tibble: 1 × 3
  smoothness Laplace .config              
       <dbl>   <dbl> <chr>                
1      0.578    1.51 Preprocessor1_Model07
workflow_nb_final <- workflow_nb |> finalize_workflow(best_nb)

fit_nb_final <- workflow_nb_final |> last_fit(parts)

fit_nb_final |> collect_metrics()
# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.782 Preprocessor1_Model1
2 roc_auc     binary         0.815 Preprocessor1_Model1
3 brier_class binary         0.174 Preprocessor1_Model1
fit_nb_final |>
    collect_predictions() |>
    conf_mat(truth = Survived, estimate = .pred_class)
          Truth
Prediction  0  1
         0 99 28
         1 11 41

Final

The model with the highest accuracy was the random forest model, so that will be used to fit on the entire training set.

There is an error in the next code chunk. Needs to be updated.

model_final <- model_rf |> finalize_model(best_rf)

workflow_final <- workflow() |>
    add_recipe(preprocess) |>
    add_model(model_final)

fit_final <- workflow_final |> fit(titanic_train)

pred_final <- fit_final |> predict(new_data = preprocess |> bake(titanic_test) )

titanic_test |>
    dplyr::select(PassengerId) |>
    mutate(Survived = pred_final$.pred_class)