library(tidyverse)
library(tidymodels)
Stat. 652 Midterm
Midterm
For the titanic data set try the following machine learning classification algorithms.
Use the training and test datasets from the titanic R package.
You should note that the titanic_train has the Survived variable and the titanic_test does not. So to select your best model you need to use the titanic_train dataset to train and test your models. So that means you will need to select a training dataset from titanic_train and select a testing dataset (this would be a validation dataset) from titanic_train to evaluate the models you try.
I have not demonstrated the use of cross-validation, once you are comfortable running all of the models see if you can figure out how to use cross-validation to pick the best model.
Once you have picked the best model you should do the following:
- Re-run your chosen model on the full titanic_train dataset.
- Then produce predictions for the titanic_test dataset. This is what you would submit in a .csv to kaggle in a competition.
Build classification models for the Survived variable. Pick a model scoring function and determine which model is the best. I would suggest making a confusion matrix and computing the accuracy or kappa.
- Null Model
- kNN (the sample code given did not scale or normalize, if you use this model you need to do that.)
- Boosted C5.0
- Random Forest
- Logistic Regression using regularization
- Naive Bayes
Extra Credit:
Make one plot containing all of the ROC curves for the algorithms trained.
Data
Loading
library(titanic)
data(titanic_train)
data(titanic_test)
Exploration
library(naniar)
::gg_miss_var(titanic_train) naniar
::gg_miss_var(titanic_test) naniar
Prep
<- titanic_train |>
titanic_train mutate(
Survived = as.factor(Survived),
Sex = as.factor(Sex),
Embarked = as.factor(Embarked),
Pclass = as.factor(Pclass),
SibSp = as.double(SibSp),
Parch = as.double(Parch)
)<- titanic_test |>
titanic_test mutate(
Sex = as.factor(Sex),
Embarked = as.factor(Embarked),
Pclass = as.factor(Pclass),
SibSp = as.double(SibSp),
Parch = as.double(Parch)
)
# Training/validation split.
set.seed(108)
<- titanic_train |> initial_split(prop = 0.8, strata = Survived)
parts <- parts |> training()
train <- parts |> testing() val
<- recipe(Survived ~ ., data = train) |>
rec update_role(PassengerId, Name, Ticket, Cabin, new_role = "id") |>
step_impute_mean(all_numeric_predictors()) |>
step_nzv(all_predictors()) |>
step_normalize(all_numeric_predictors())
<- rec preprocess
Models
Null model
The null model just predicts not survived, and has a validation accuracy of 61.5%.
<- null_model() |>
model_null set_engine("parsnip") |>
set_mode("classification")
<- workflow() |>
fit_null add_formula(Survived ~ .) |>
add_model(model_null) |>
last_fit(parts)
→ A | warning: Novel levels found in column "Name": "McCarthy, Mr. Timothy J", "Hewlett, Mrs.
(Mary D Kingcome) ", "Williams, Mr. Charles Eugene", "Fynney, Mr. Joseph J",
"Sloper, Mr. William Thompson", "Asplund, Mrs. Carl Oscar (Selma Augusta Emilia
Johansson)", "Uruchurtu, Don. Manuel E", "Meyer, Mr. Edgar Joseph", "Holverson,
Mr. Alexander Oskar", "Kraeff, Mr. Theodor", "Arnold-Franchi, Mrs. Josef
(Josefine Franchi)", "Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)",
"Sirayanian, Mr. Orsen", "Harris, Mr. Henry Birkhardt", "Skoog, Master.
Harald", "Andersson, Miss. Erna Alexandra", "Staneff, Mr. Ivan", "Moutal, Mr.
Rahamin Haim", …, "Carlsson, Mr. Frans Olof", and "Petroff, Mr. Nedelio".
ℹ The levels have been removed, and values have been coerced to <NA>., Novel levels found in column "Ticket": "17463", "248706", "244373", "113788",
"PC 17601", "349253", "349237", "2926", "2669", "3101281", "349208", "374746",
"113059", "343275", "7540", "349207", "312991", "349249", …, "695", and
"349212".
ℹ The levels have been removed, and values have been coerced to <NA>., Novel levels found in column "Cabin": "E46", "A6", "C110", "E33", "C49", "B80",
"C93", "D7", "C106", "C124", "B35", "C104", "C111", "D21", "A14", "C86", "B41",
"B50", "C90", and "B101".
ℹ The levels have been removed, and values have been coerced to <NA>.
There were issues with some computations A: x1
There were issues with some computations A: x1
|> collect_metrics() fit_null
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.615 Preprocessor1_Model1
2 roc_auc binary 0.5 Preprocessor1_Model1
3 brier_class binary 0.237 Preprocessor1_Model1
|> collect_predictions() |> conf_mat(truth = Survived, estimate = .pred_class) fit_null
Truth
Prediction 0 1
0 110 69
1 0 0
k-NN
For k nearest neighbors, after doing a grid search of k = 1 to 10 along with 10-fold cross validation, the optimal k value was found to be 7, yielding an accuracy of 79.3%.
The k nearest neighbors model found k = 8 to be the optimal v
<- nearest_neighbor(neighbors = tune()) |>
model_knn set_engine("kknn") |>
set_mode("classification")
<- workflow() |>
workflow_knn add_recipe(preprocess) |>
add_model(model_knn)
<- workflow_knn |>
fit_knn tune_grid(
resamples = vfold_cv(train, v = 10, strata = Survived),
grid = expand_grid(neighbors = 1:10),
control = control_grid(save_pred = T)
)
<- fit_knn |> select_best(metric = "accuracy")
best_knn best_knn
# A tibble: 1 × 2
neighbors .config
<int> <chr>
1 7 Preprocessor1_Model07
<- workflow_knn |> finalize_workflow(best_knn)
workflow_knn_final
<- workflow_knn_final |> last_fit(parts)
fit_knn_final
|> collect_metrics() fit_knn_final
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.793 Preprocessor1_Model1
2 roc_auc binary 0.808 Preprocessor1_Model1
3 brier_class binary 0.167 Preprocessor1_Model1
|>
fit_knn_final collect_predictions() |>
conf_mat(truth = Survived, estimate = .pred_class)
Truth
Prediction 0 1
0 98 25
1 12 44
Boosted C5.0
Wasn’t able to get C5.0 to work with tidymodels tuning.
<- boost_tree(trees = 10) |>
model_c50 set_engine("C5.0") |>
set_mode("classification")
<- workflow() |>
workflow_c50 add_recipe(preprocess) |>
add_model(model_c50)
<- workflow_c50 |> fit(train) fit_c50
c50 code called exit with value 1
fit_c50
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_mean()
• step_nzv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Call:
C5.0.default(x = x, y = y, trials = 10, control = C50::C5.0Control(minCases
= 2, sample = 0))
Classification Tree
Number of samples: 712
Number of predictors: 7
Number of boosting iterations: 10
Non-standard options: attempt to group attributes
Random forest
The random forest model has an accuracy of 81.6%.
<- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |>
model_rf set_engine("randomForest") |>
set_mode("classification")
<- workflow() |>
workflow_rf add_recipe(rec) |>
add_model(model_rf)
<- workflow_rf |>
fit_rf tune_grid(
resamples = vfold_cv(train, v = 10, strata = Survived),
grid = grid_latin_hypercube(
mtry(range = c(1, 7)),
trees(), min_n(),
size = 10
),control = control_grid(save_pred = T)
)
Warning: `grid_latin_hypercube()` was deprecated in dials 1.3.0.
ℹ Please use `grid_space_filling()` instead.
<- fit_rf |> select_best(metric = "accuracy")
best_rf best_rf
# A tibble: 1 × 4
mtry trees min_n .config
<int> <int> <int> <chr>
1 4 1156 5 Preprocessor1_Model06
<- workflow_rf |> finalize_workflow(best_rf)
workflow_rf_final
<- workflow_rf_final |> last_fit(parts)
fit_rf_final
|> collect_metrics() fit_rf_final
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.816 Preprocessor1_Model1
2 roc_auc binary 0.853 Preprocessor1_Model1
3 brier_class binary 0.143 Preprocessor1_Model1
|>
fit_rf_final collect_predictions() |>
conf_mat(truth = Survived, estimate = .pred_class)
Truth
Prediction 0 1
0 104 27
1 6 42
Logistic regression using regularization
The logistic regression model obtained an accuracy of 78.8%.
<- logistic_reg(penalty = tune(), mixture = tune()) |>
model_lr set_engine("glmnet") |>
set_mode("classification")
# Extra preprocessing since logistic regression only works with numeric or dummy
# variables.
<- rec |>
preprocess_lr step_dummy(Pclass, Sex, Embarked)
<- workflow() |>
workflow_lr add_recipe(preprocess_lr) |>
add_model(model_lr)
<- workflow_lr |>
fit_lr tune_grid(
resamples = vfold_cv(train, v = 10, strata = Survived),
grid = grid_latin_hypercube(penalty(), mixture(), size = 10),
control = control_grid(save_pred = T)
)
<- fit_lr |> select_best(metric = "accuracy")
best_lr best_lr
# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0.0382 0.0125 Preprocessor1_Model01
<- workflow_lr |> finalize_workflow(best_lr)
workflow_lr_final
<- workflow_lr_final |> last_fit(parts)
fit_lr_final
|> collect_metrics() fit_lr_final
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.788 Preprocessor1_Model1
2 roc_auc binary 0.856 Preprocessor1_Model1
3 brier_class binary 0.149 Preprocessor1_Model1
|>
fit_lr_final collect_predictions() |>
conf_mat(truth = Survived, estimate = .pred_class)
Truth
Prediction 0 1
0 101 29
1 9 40
Naive Bayes
The Naive Bayes model obtained an accuracy of 78.2%.
# Required for Naive Bayes implementation.
library(agua)
Attaching package: 'agua'
The following object is masked from 'package:workflowsets':
rank_results
library(discrim)
Attaching package: 'discrim'
The following object is masked from 'package:dials':
smoothness
<- naive_Bayes(smoothness = tune(), Laplace = tune()) |>
model_nb set_engine("klaR") |>
set_mode("classification")
<- workflow() |>
workflow_nb add_recipe(preprocess) |>
add_model(model_nb)
<- workflow_nb |>
fit_nb tune_grid(
resamples = vfold_cv(train, v = 10, strata = Survived),
grid = grid_latin_hypercube(smoothness(), Laplace(), size = 10),
control = control_grid(save_pred = T)
)
<- fit_nb |> select_best(metric = "accuracy")
best_nb best_nb
# A tibble: 1 × 3
smoothness Laplace .config
<dbl> <dbl> <chr>
1 0.578 1.51 Preprocessor1_Model07
<- workflow_nb |> finalize_workflow(best_nb)
workflow_nb_final
<- workflow_nb_final |> last_fit(parts)
fit_nb_final
|> collect_metrics() fit_nb_final
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.782 Preprocessor1_Model1
2 roc_auc binary 0.815 Preprocessor1_Model1
3 brier_class binary 0.174 Preprocessor1_Model1
|>
fit_nb_final collect_predictions() |>
conf_mat(truth = Survived, estimate = .pred_class)
Truth
Prediction 0 1
0 99 28
1 11 41
Final
The model with the highest accuracy was the random forest model, so that will be used to fit on the entire training set.
There is an error in the next code chunk. Needs to be updated.
<- model_rf |> finalize_model(best_rf)
model_final
<- workflow() |>
workflow_final add_recipe(preprocess) |>
add_model(model_final)
<- workflow_final |> fit(titanic_train)
fit_final
<- fit_final |> predict(new_data = preprocess |> bake(titanic_test) )
pred_final
|>
titanic_test ::select(PassengerId) |>
dplyrmutate(Survived = pred_final$.pred_class)