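# Load the titanic package and its training and test data sets
library(titanic)
data(titanic_train)
data(titanic_test)
# Preview the first rows of each data set
head(titanic_train)
head(titanic_test)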
Statistics 652 - Midterm
Question 1
Clearly state the 5 steps for applying Machine Learning to a problem.
Answer:
Summarize your answer to the question here.
Type your answer here.
Question 2
For the titanic data set, try the following machine learning classification algorithms.
Use the training and test datasets from the titanic R package.
Note that titanic_train contains the Survived variable and titanic_test does not. To select your best model, you therefore need to use titanic_train both to train and to test your models: split titanic_train into a training dataset and a testing dataset (effectively a validation dataset), and evaluate the models you try on the latter.
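One way to carve a validation set out of titanic_train is sketched below. The 80/20 ratio and the seed are assumptions made for illustration, not requirements of the question.

```r
library(titanic)
data(titanic_train)

set.seed(652)  # arbitrary seed, only so the split is reproducible

# Hold out 20% of titanic_train as a validation set (the ratio is an assumption).
n <- nrow(titanic_train)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train_set <- titanic_train[train_idx, ]
valid_set <- titanic_train[-train_idx, ]

# Check that the survival rate is similar in both pieces.
prop.table(table(train_set$Survived))
prop.table(table(valid_set$Survived))
```

Models are then fitted on train_set and scored on valid_set; titanic_test is not touched until the final predictions.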
I have not demonstrated the use of cross-validation. Once you are comfortable running all of the models, see if you can figure out how to use cross-validation to pick the best model.
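As a rough sketch of what k-fold cross-validation could look like in base R: the rows are assigned to k folds, and each fold is held out once while a model is fitted on the rest. The choice of k = 5 and the plain logistic regression used as a stand-in model are assumptions for illustration only.

```r
library(titanic)
data(titanic_train)

set.seed(652)
k <- 5  # number of folds (an assumption; 5 or 10 are common choices)

# Randomly assign each row of titanic_train to one of k folds.
fold <- sample(rep(seq_len(k), length.out = nrow(titanic_train)))

cv_accuracy <- sapply(seq_len(k), function(i) {
  fit_rows  <- titanic_train[fold != i, ]
  held_rows <- titanic_train[fold == i, ]

  # Stand-in model: unregularized logistic regression on two complete predictors.
  fit  <- glm(Survived ~ Sex + factor(Pclass), data = fit_rows, family = binomial)
  prob <- predict(fit, newdata = held_rows, type = "response")
  pred <- as.integer(prob > 0.5)

  mean(pred == held_rows$Survived)
})

cv_accuracy        # accuracy on each held-out fold
mean(cv_accuracy)  # cross-validated estimate used to compare models
```

Repeating the same loop with each candidate model, on the same fold assignment, gives comparable estimates for picking the best one.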
Once you have picked the best model, you should do the following:
- Re-run your chosen model on the full titanic_train dataset.
- Then produce predictions for the titanic_test dataset. This is what you would submit as a .csv file in a Kaggle competition.
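The two final steps above might look like the following sketch. The logistic regression is only a stand-in for whichever model was chosen, the file name is hypothetical, and the two-column PassengerId/Survived layout is assumed from the Kaggle Titanic competition.

```r
library(titanic)
data(titanic_train)
data(titanic_test)

# Stand-in for the chosen model, refitted on all of titanic_train.
final_fit <- glm(Survived ~ Sex + factor(Pclass),
                 data = titanic_train, family = binomial)

# Predicted survival probabilities for the unlabeled test passengers.
test_prob <- predict(final_fit, newdata = titanic_test, type = "response")

submission <- data.frame(
  PassengerId = titanic_test$PassengerId,
  Survived    = as.integer(test_prob > 0.5)
)

out_file <- "titanic_submission.csv"  # hypothetical file name
write.csv(submission, out_file, row.names = FALSE)
head(submission)
```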
Build classification models for the Survived variable. Pick an appropriate model scoring function (i.e., a metric) and use it to determine which model is best. I suggest making a confusion matrix and computing the accuracy or kappa.
- Null Model
- kNN (the sample code given did not scale or normalize the predictors; if you use this model, you need to do that.)
- Boosted C5.0
- Random Forest
- Logistic Regression using regularization
- Naive Bayes
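The confusion matrix, accuracy, and kappa suggested above can be computed in base R as sketched here. The label vectors are made up purely for illustration; in practice they would be the validation-set outcomes and a model's predictions.

```r
# Made-up labels purely for illustration; replace with validation-set values.
actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), levels = c(0, 1))

conf_mat <- table(Predicted = predicted, Actual = actual)
conf_mat

n <- sum(conf_mat)
accuracy <- sum(diag(conf_mat)) / n

# Kappa: agreement beyond what the row and column totals alone would give.
expected    <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, kappa = cohen_kappa)  # accuracy 0.7, kappa 0.4
```

Kappa is useful here because the classes are unbalanced: a model that always predicts the majority class can have a respectable accuracy but a kappa of zero.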
Extra Credit:
Make one plot containing all of the ROC curves for the algorithms trained.
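One possible approach, using only base R graphics, is sketched below. The helper roc_points() is hypothetical (written here, not taken from a package), and the two logistic regressions are stand-ins; any model that returns predicted probabilities for the validation rows can be added to the list.

```r
library(titanic)
data(titanic_train)

set.seed(652)
idx <- sample(nrow(titanic_train), floor(0.8 * nrow(titanic_train)))
train_set <- titanic_train[idx, ]
valid_set <- titanic_train[-idx, ]

# Hypothetical helper: ROC points from predicted probabilities and 0/1 labels.
roc_points <- function(prob, actual) {
  thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE))
  data.frame(
    fpr = sapply(thresholds, function(t) mean(prob[actual == 0] >= t)),
    tpr = sapply(thresholds, function(t) mean(prob[actual == 1] >= t))
  )
}

# Two stand-in models; substitute the probabilities from your own models.
fit_a <- glm(Survived ~ Sex, data = train_set, family = binomial)
fit_b <- glm(Survived ~ Sex + factor(Pclass) + Fare,
             data = train_set, family = binomial)

probs <- list(
  "Logistic: Sex" = predict(fit_a, valid_set, type = "response"),
  "Logistic: Sex + Pclass + Fare" = predict(fit_b, valid_set, type = "response")
)

# Diagonal reference line, then one ROC curve per model on the same axes.
plot(c(0, 1), c(0, 1), type = "l", lty = 2, col = "grey",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curves on the validation set")
for (i in seq_along(probs)) {
  pts <- roc_points(probs[[i]], valid_set$Survived)
  lines(pts$fpr, pts$tpr, col = i + 1, lwd = 2)
}
legend("bottomright", legend = names(probs),
       col = seq_along(probs) + 1, lwd = 2)
```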
Data
Model 0 Null Model
Answer:
Summarize your answer to the question here. All code and comments should be below and your written answer above.
Code and Comments:
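A minimal sketch of a null model, assuming an 80/20 split of titanic_train: it ignores every predictor and always predicts the most common class in the training rows, which gives the baseline the other models must beat.

```r
library(titanic)
data(titanic_train)

set.seed(652)
idx <- sample(nrow(titanic_train), floor(0.8 * nrow(titanic_train)))
train_set <- titanic_train[idx, ]
valid_set <- titanic_train[-idx, ]

# Most common value of Survived in the training rows.
majority_class <- as.integer(names(which.max(table(train_set$Survived))))

# The null model predicts that class for every validation row.
null_pred <- rep(majority_class, nrow(valid_set))

table(Predicted = factor(null_pred, levels = c(0, 1)),
      Actual    = factor(valid_set$Survived, levels = c(0, 1)))

null_accuracy <- mean(null_pred == valid_set$Survived)
null_accuracy
```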
Model 1 kNN
Answer:
Summarize your answer to the question here. All code and comments should be below and your written answer above.
Code and Comments:
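A possible starting point for kNN with scaled predictors, using knn() from the class package. The predictor set, the 80/20 split, and k = 5 are illustrative assumptions. The validation rows are standardized with the training rows' means and standard deviations so that no information leaks from the validation set.

```r
library(titanic)
library(class)  # provides knn()
data(titanic_train)

set.seed(652)
idx <- sample(nrow(titanic_train), floor(0.8 * nrow(titanic_train)))

# Numeric predictors without missing values (illustrative choice;
# Age is left out here because it contains NAs).
X <- data.frame(
  Pclass = titanic_train$Pclass,
  Male   = as.integer(titanic_train$Sex == "male"),
  SibSp  = titanic_train$SibSp,
  Parch  = titanic_train$Parch,
  Fare   = titanic_train$Fare
)
y <- factor(titanic_train$Survived)

# Standardize using training means and standard deviations only,
# then apply the same transformation to the validation rows.
X_train <- scale(X[idx, ])
X_valid <- scale(X[-idx, ],
                 center = attr(X_train, "scaled:center"),
                 scale  = attr(X_train, "scaled:scale"))

knn_pred <- knn(train = X_train, test = X_valid, cl = y[idx], k = 5)  # k is arbitrary

table(Predicted = knn_pred, Actual = y[-idx])
mean(knn_pred == y[-idx])
```

Without scaling, Fare would dominate the distance calculation simply because its values are much larger than those of the other predictors.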
Model 2 Boosted C5.0
Answer:
Summarize your answer to the question here. All code and comments should be below and your written answer above.
Code and Comments:
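A sketch of a boosted C5.0 fit with the C50 package, under an assumed 80/20 split and an illustrative predictor set. Setting trials above 1 is what turns on boosting; the value 10 is arbitrary.

```r
library(titanic)
library(C50)  # provides C5.0()
data(titanic_train)

set.seed(652)
idx <- sample(nrow(titanic_train), floor(0.8 * nrow(titanic_train)))

# Illustrative predictors; character columns are converted to factors.
X <- data.frame(
  Pclass = factor(titanic_train$Pclass),
  Sex    = factor(titanic_train$Sex),
  SibSp  = titanic_train$SibSp,
  Parch  = titanic_train$Parch,
  Fare   = titanic_train$Fare
)
y <- factor(titanic_train$Survived)

# trials = 10 requests up to 10 boosting iterations (arbitrary choice).
c50_fit  <- C5.0(x = X[idx, ], y = y[idx], trials = 10)
c50_pred <- predict(c50_fit, newdata = X[-idx, ], type = "class")

table(Predicted = c50_pred, Actual = y[-idx])
mean(c50_pred == y[-idx])
```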
Model 3 Random Forest
Answer:
Summarize your answer to the question here. All code and comments should be below and your written answer above.
Code and Comments:
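A sketch of a random forest fit with the randomForest package. The split, the predictor set, and ntree = 500 are assumptions for illustration; predictors with missing values (such as Age) are left out because randomForest() does not accept NAs without imputation.

```r
library(titanic)
library(randomForest)  # provides randomForest()
data(titanic_train)

set.seed(652)
idx <- sample(nrow(titanic_train), floor(0.8 * nrow(titanic_train)))

# Illustrative predictors with no missing values.
X <- data.frame(
  Pclass = factor(titanic_train$Pclass),
  Sex    = factor(titanic_train$Sex),
  SibSp  = titanic_train$SibSp,
  Parch  = titanic_train$Parch,
  Fare   = titanic_train$Fare
)
y <- factor(titanic_train$Survived)  # a factor response requests classification

rf_fit  <- randomForest(x = X[idx, ], y = y[idx], ntree = 500)
rf_pred <- predict(rf_fit, newdata = X[-idx, ])

table(Predicted = rf_pred, Actual = y[-idx])
mean(rf_pred == y[-idx])
```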
Model 4 Logistic Regression using regularization
Answer:
Summarize your answer to the question here. All code and comments should be below and your written answer above.
Code and Comments:
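A sketch of regularized logistic regression with the glmnet package. The lasso penalty (alpha = 1), the predictor set, and the use of lambda.min are illustrative assumptions; cv.glmnet() chooses the penalty strength by cross-validation within the training rows.

```r
library(titanic)
library(glmnet)  # provides cv.glmnet()
data(titanic_train)

set.seed(652)
idx <- sample(nrow(titanic_train), floor(0.8 * nrow(titanic_train)))

# glmnet needs a numeric matrix; model.matrix() expands factors into dummies.
# The first column (the intercept) is dropped because glmnet adds its own.
X <- model.matrix(~ factor(Pclass) + Sex + SibSp + Parch + Fare,
                  data = titanic_train)[, -1]
y <- titanic_train$Survived

# alpha = 1 is the lasso penalty; alpha = 0 would give ridge.
cv_fit <- cv.glmnet(X[idx, ], y[idx], family = "binomial", alpha = 1)

# Class predictions at the penalty with the lowest cross-validated error.
glmnet_pred <- as.integer(predict(cv_fit, newx = X[-idx, ],
                                  s = "lambda.min", type = "class")[, 1])

table(Predicted = glmnet_pred, Actual = y[-idx])
mean(glmnet_pred == y[-idx])
```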
Model 5 Naive Bayes
Answer:
Summarize your answer to the question here. All code and comments should be below and your written answer above.
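A sketch of a Naive Bayes fit using naiveBayes() from the e1071 package, which is one of several R implementations. The split and the predictor set are assumptions for illustration; numeric predictors are modeled as Gaussian within each class.

```r
library(titanic)
library(e1071)  # provides naiveBayes()
data(titanic_train)

set.seed(652)
idx <- sample(nrow(titanic_train), floor(0.8 * nrow(titanic_train)))

# Illustrative predictors: two categorical, three numeric.
X <- data.frame(
  Pclass = factor(titanic_train$Pclass),
  Sex    = factor(titanic_train$Sex),
  SibSp  = titanic_train$SibSp,
  Parch  = titanic_train$Parch,
  Fare   = titanic_train$Fare
)
y <- factor(titanic_train$Survived)

nb_fit  <- naiveBayes(x = X[idx, ], y = y[idx])
nb_pred <- predict(nb_fit, newdata = X[-idx, ])  # default type is "class"

table(Predicted = nb_pred, Actual = y[-idx])
mean(nb_pred == y[-idx])
```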