Statistics 652 - Midterm

Author

Prof. Eric A. Suess

Published

February 10, 2024

Midterm

Question 1

Clearly state the 5 steps for applying Machine Learning to a problem.

Answer:

Summarize your answer to the question here.

Type your answer here.

Question 2

For the Titanic data set, try the machine learning classification algorithms listed below.

Use the training and test datasets from the titanic R package.

Note that titanic_train contains the Survived variable and titanic_test does not. To select your best model, you must therefore both train and evaluate your models using titanic_train. That means splitting titanic_train into a training dataset and a held-out testing dataset (strictly speaking, a validation dataset) on which you evaluate the models you try.
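One way to make such a split is sketched below. The 80/20 ratio, the seed, and the names train_set and valid_set are assumptions chosen for illustration, not requirements of the exam.

```r
library(titanic)
data(titanic_train)

set.seed(652)  # arbitrary seed, used only so the split is reproducible
n <- nrow(titanic_train)

# Randomly pick 80% of the row indices for training (the ratio is an assumption).
train_idx <- sample(n, size = floor(0.8 * n))

train_set <- titanic_train[train_idx, ]   # used to fit the models
valid_set <- titanic_train[-train_idx, ]  # held out to compare the models

# The two pieces should partition titanic_train.
nrow(train_set) + nrow(valid_set) == n
```

Every model should be fit on train_set and scored on valid_set so that the comparison between models is fair.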

I have not demonstrated the use of cross-validation. Once you are comfortable running all of the models, see if you can figure out how to use cross-validation to pick the best model.
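As a rough illustration of the idea, the sketch below runs 5-fold cross-validation by hand in base R. It uses a plain, unregularized logistic regression from glm() purely as a stand-in for whichever model is being evaluated. The number of folds, the seed, the predictors, and the 0.5 cutoff are all assumptions.

```r
library(titanic)
data(titanic_train)

set.seed(652)  # arbitrary seed for reproducibility
k <- 5         # number of folds (an assumption)

# Assign every row to one of k folds at random, with near-equal fold sizes.
folds <- sample(rep(1:k, length.out = nrow(titanic_train)))

cv_accuracy <- sapply(1:k, function(i) {
  fit_data <- titanic_train[folds != i, ]  # k - 1 folds used for fitting
  held_out <- titanic_train[folds == i, ]  # remaining fold used for scoring

  # Stand-in model: unregularized logistic regression on three predictors.
  fit <- glm(Survived ~ Sex + Pclass + Fare, data = fit_data, family = binomial)

  prob <- predict(fit, newdata = held_out, type = "response")
  mean((prob > 0.5) == (held_out$Survived == 1))  # accuracy on the held-out fold
})

cv_accuracy        # one accuracy per fold
mean(cv_accuracy)  # cross-validated estimate of accuracy
```

Repeating the same loop with the same folds for each candidate model gives one cross-validated score per model, and those scores can then be compared directly.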

Once you have picked the best model you should do the following:

  1. Re-run your chosen model on the full titanic_train dataset.
  2. Then produce predictions for the titanic_test dataset. This is what you would submit in a .csv to Kaggle in a competition.
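The two steps above might look like the following sketch. A simple glm() logistic regression again stands in for the chosen model, and only Sex and Pclass are used because neither has missing values in titanic_test. The file name submission.csv is hypothetical, and the PassengerId and Survived columns follow the format of the Kaggle Titanic competition.

```r
library(titanic)
data(titanic_train)
data(titanic_test)

# Step 1: refit the chosen model (here a stand-in) on all of titanic_train.
final_fit <- glm(Survived ~ Sex + Pclass, data = titanic_train, family = binomial)

# Step 2: predict survival for the passengers in titanic_test.
test_prob <- predict(final_fit, newdata = titanic_test, type = "response")

submission <- data.frame(
  PassengerId = titanic_test$PassengerId,
  Survived    = as.integer(test_prob > 0.5)  # 0.5 cutoff is an assumption
)

# Write the predictions to a .csv file (hypothetical file name).
write.csv(submission, "submission.csv", row.names = FALSE)
```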

Build classification models for the Survived variable using each of the algorithms listed below. Pick an appropriate model scoring function (i.e., a metric) and determine which model is the best. I suggest making a confusion matrix and computing the accuracy or kappa.

  1. Null Model
  2. kNN (the sample code given did not scale or normalize the predictors; if you use this model, you need to do that.)
  3. Boosted C5.0
  4. Random Forest
  5. Logistic Regression using regularization
  6. Naive Bayes
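To make the suggested metrics concrete, the sketch below builds a confusion matrix with table() and computes accuracy and Cohen's kappa from it, using kappa = (p_obs - p_exp) / (1 - p_exp). The helper accuracy_and_kappa() is a hypothetical name introduced here, and the split settings are assumptions. It is applied to a null model that always predicts the majority class of the training data.

```r
library(titanic)
data(titanic_train)

# Hypothetical helper: confusion matrix, accuracy, and Cohen's kappa
# for 0/1 predictions against 0/1 actual values.
accuracy_and_kappa <- function(predicted, actual) {
  lv <- c(0, 1)
  cm <- table(Predicted = factor(predicted, levels = lv),
              Actual    = factor(actual, levels = lv))
  n <- sum(cm)
  p_obs <- sum(diag(cm)) / n                      # observed agreement (accuracy)
  p_exp <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  list(confusion = cm,
       accuracy  = p_obs,
       kappa     = (p_obs - p_exp) / (1 - p_exp))
}

set.seed(652)  # arbitrary seed; the 80/20 split is an assumption
n <- nrow(titanic_train)
train_idx <- sample(n, size = floor(0.8 * n))
train_set <- titanic_train[train_idx, ]
valid_set <- titanic_train[-train_idx, ]

# Null model: always predict the most common class in the training data.
majority  <- as.integer(names(which.max(table(train_set$Survived))))
null_pred <- rep(majority, nrow(valid_set))

null_result <- accuracy_and_kappa(null_pred, valid_set$Survived)
null_result$confusion
null_result$accuracy
null_result$kappa  # a constant prediction agrees with the truth only by chance
```

Kappa is useful alongside accuracy because the null model can reach a fairly high accuracy on imbalanced classes while its kappa stays at zero.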

Extra Credit:

Make one plot containing all of the ROC curves for the algorithms trained.
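A minimal base R sketch of overlaying ROC curves on one plot follows. The helper roc_points() is a hypothetical name, and the two glm() fits are stand-ins for the trained algorithms; any model that produces a predicted probability of survival could be passed through the same helper. Tied scores are not grouped, which is a simplification.

```r
library(titanic)
data(titanic_train)

# Hypothetical helper: ROC coordinates from scores and 0/1 actual values.
# Observations are sorted by decreasing score and the cutoff is lowered one
# observation at a time (tied scores are not grouped, a simplification).
roc_points <- function(score, actual) {
  actual <- actual[order(score, decreasing = TRUE)]
  data.frame(fpr = c(0, cumsum(actual == 0) / sum(actual == 0)),
             tpr = c(0, cumsum(actual == 1) / sum(actual == 1)))
}

set.seed(652)  # arbitrary seed; the 80/20 split is an assumption
n <- nrow(titanic_train)
train_idx <- sample(n, size = floor(0.8 * n))
train_set <- titanic_train[train_idx, ]
valid_set <- titanic_train[-train_idx, ]

# Two stand-in models that each return a probability of survival.
fit_a <- glm(Survived ~ Sex, data = train_set, family = binomial)
fit_b <- glm(Survived ~ Sex + Pclass + Fare, data = train_set, family = binomial)

roc_a <- roc_points(predict(fit_a, valid_set, type = "response"), valid_set$Survived)
roc_b <- roc_points(predict(fit_b, valid_set, type = "response"), valid_set$Survived)

# Draw the first curve, then add the others to the same axes.
plot(roc_a$fpr, roc_a$tpr, type = "l", col = "blue",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curves on the validation data")
lines(roc_b$fpr, roc_b$tpr, col = "red")
abline(0, 1, lty = 3)  # reference line for random guessing
legend("bottomright", legend = c("Model A", "Model B"),
       col = c("blue", "red"), lty = 1)
```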

Data

library(titanic)

data(titanic_train)
data(titanic_test)

head(titanic_train)
head(titanic_test)

Model 0 Null Model

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:

Model 1 kNN

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:
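As a possible starting point, the sketch below shows min-max normalization ahead of kNN, using knn() from the class package. The choice of predictors, k = 5, the seed, and the 80/20 split are assumptions, and treating Pclass as a plain number is a simplification. The minimum and range are computed from the training rows only and then applied to the validation rows.

```r
library(titanic)
library(class)  # provides knn()
data(titanic_train)

set.seed(652)  # arbitrary seed for reproducibility
n <- nrow(titanic_train)
train_idx <- sample(n, size = floor(0.8 * n))

# Numeric predictors with no missing values in titanic_train (an assumed choice).
features <- c("Pclass", "SibSp", "Parch", "Fare")
x_train <- titanic_train[train_idx, features]
x_valid <- titanic_train[-train_idx, features]

# Min-max scaling: learn the minimum and range from the training rows only.
mins   <- sapply(x_train, min)
ranges <- sapply(x_train, max) - mins
x_train_s <- scale(x_train, center = mins, scale = ranges)
x_valid_s <- scale(x_valid, center = mins, scale = ranges)

y_train <- factor(titanic_train$Survived[train_idx], levels = c(0, 1))
y_valid <- factor(titanic_train$Survived[-train_idx], levels = c(0, 1))

# Classify each validation row by a vote among its 5 nearest training rows.
pred <- knn(x_train_s, x_valid_s, cl = y_train, k = 5)
mean(pred == y_valid)  # validation accuracy
```

Without the scaling step, Fare would dominate the distance calculation because its range is far larger than that of the other predictors.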

Model 2 Boosted C5.0

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:

Model 3 Random Forest

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:

Model 4 Logistic Regression using regularization

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:

Model 5 Naive Bayes

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments: