Assignments
Homework 5: (not collected)
Read the following chapters in Machine Learning with R, 4ed, by Lantz.
Read: Chapter 7 Neural Networks
Read: Chapter 7 SVMs
Read: Chapter 9 Clustering
Read: Chapter 8 Market Basket Analysis Using Association Rules
Problems:
Perform the ANN analysis on the concrete data. Using an R Quarto Notebook, produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
Perform the SVM analysis on the OCR letter data. Using an R Quarto Notebook, produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
Perform the Cluster analysis on the sns data. Using an R Quarto Notebook, produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
Perform the Association analysis on the groceries data. Using an R Quarto Notebook, produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
Final: (due in Canvas by Sunday March 10, 2024 or end of finals week.)
The Final will be about implementing a machine learning feature selection algorithm based on Random Forests called the Boruta Algorithm.
Unzip the R Project. See the Final_part04.html.
Answer two questions in the Final_part04.html file. The questions are:
- What are the important variables identified by the Boruta algorithm from the Ozone data?
- What are the important variables identified by the Boruta algorithm from the titanic training data?
Project: (due in Canvas by Sunday March 10, 2024 or end of finals week.)
Run all of the R code in "A predictive modeling case study" from the tidymodels Welcome website in a self-contained R Quarto Notebook.
- Build the PENALIZED LOGISTIC REGRESSION model for the hotel data. In this case study, explain how the recipe and workflow functions are used to prepare the data for the model. Also, explain how tune_grid is used.
- Build the TREE-BASED ENSEMBLE model for the hotel data.
- Compare the ROC curves for the two models and explain which model is better at classifying a hotel booking as with children or without children.
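One way to put a number on an ROC comparison is the area under the curve (AUC). The following is a minimal base R sketch, not the case study's own code (which uses tidymodels functions). The labels and predicted probabilities are made up for illustration, and the helper name auc is hypothetical.

```r
# Hypothetical true labels (1 = children, 0 = no children) and predicted
# probabilities of the "children" class from two models. All values are made up.
truth   <- c(1, 0, 1, 1, 0, 0, 1, 0)
model_a <- c(0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3)
model_b <- c(0.6, 0.5, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3)

# AUC via the rank (Mann-Whitney) formulation: the probability that a randomly
# chosen positive case gets a higher score than a randomly chosen negative case.
auc <- function(truth, score) {
  r  <- rank(score)          # average ranks, so ties are handled
  n1 <- sum(truth == 1)      # number of positives
  n0 <- sum(truth == 0)      # number of negatives
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_a <- auc(truth, model_a)
auc_b <- auc(truth, model_b)
print(c(model_a = auc_a, model_b = auc_b))
```

A larger AUC means the model ranks positive cases above negative cases more often, which is one reasonable basis for saying which model classifies better.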
Homework04: (complete by Monday February 26, 2024)
Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework04.qmd using your own last name and first name in the filename.
You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.
Upload two files to Canvas: your .pdf or .doc file and your .qmd file. Do not submit a .zip file.
- Read: mdsr2e Chapter 10, Chapter 11
- Machine Learning with R, 4ed, Chapter 5, first half of Chapter 7. To access the book, go to CSUEB Library Databases A-Z > Safari Books Online, register, and open the book.
- Problems:
- 11.7 Exercises: Problem 6a, Run Models 5. Neural network using training and test datasets, as described in part c of the problem.
- 11.7 Exercises: Problem 6b, Run Models 4. Random Forest, 6. LASSO using training and test datasets, as described in part c of the problem.
Midterm: (due in Canvas by Friday February 23, 2024)
The Midterm is about determining which classification algorithm is best for classifying passengers on the titanic for survival.
- Midterm-2024.zip Updated: See Examples_Tidymodels for the new tidymodels code.
Unzip the R Project. See the Midterm.html.
For the Midterm, the process of developing a model using the training data is described. Final predictions will be made with the testing data, which does not include the labels. This is how Kaggle submissions are made.
The old tidymodels code is provided. This code should be updated to the new tidymodels workflow.
- See Rebecca Barter’s blog post Tidymodels: tidy machine learning in R.
- See Olivier Gimenez’s blog post Experimenting with machine learning in R with tidymodels and the Kaggle titanic dataset.
- See Jan Kirenz’s blog post Data Science with Tidymodels, Workflows and Recipes.
Homework03: (complete by Monday February 19, 2024)
Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework03.qmd using your own last name and first name in the filename.
You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.
Upload two files to Canvas: your .html file (rendered with self-contained: true) and your .qmd file. DO NOT submit a .zip file.
- Read: mdsr2e Chapter 10, Chapter 11
- Machine Learning with R, 4ed, Chapter 4, Chapter 5. To access the book, go to CSUEB Library Databases A-Z > Safari Books Online, register, and open the book.
- Problems:
- 11.7 Exercises: Problem 6a, Run Models 3. Decision Tree, using C5.0, 4. Random Forest, 6. Naive Bayes, using training and test datasets, as described in part c of the problem.
- (Optional: This is a problem in the first homework in Stat. 653. If you are not taking that class, you might consider doing this problem for this class, Stat. 652.) Perform the SMS spam filtering analysis from Lantz, Machine Learning with R, 4ed. Produce a report explaining the data, the analysis, and the findings. Organize your report using the Five Steps. Be sure to include:
- Show the prediction that the algorithm produced.
- Give the Accuracy of the predictions.
- Include the confusion matrix.
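The accuracy and confusion matrix items above can be sketched in base R as follows. The labels are made up for illustration and are not output from the SMS analysis, and the object names are hypothetical.

```r
# Hypothetical true and predicted labels for eight messages (made up).
actual    <- factor(c("ham", "ham", "spam", "ham", "spam", "spam", "ham", "ham"),
                    levels = c("ham", "spam"))
predicted <- factor(c("ham", "ham", "spam", "spam", "spam", "ham", "ham", "ham"),
                    levels = c("ham", "spam"))

# Confusion matrix: rows are the predictions, columns are the true labels.
conf_mat <- table(predicted = predicted, actual = actual)
print(conf_mat)

# Accuracy: the share of predictions that fall on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

The off-diagonal cells show which kind of mistake the classifier makes, which accuracy alone does not reveal.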
Quiz: (due in Canvas by Friday February 9, 2024)
Instructions: For problem 1, you can complete the questions in an Excel spreadsheet or in an R Quarto Notebook. Submit either a .xlsx file or both a .qmd and a .html file. For problem 2, run the provided R Quarto Notebook, answering the questions asked. Submit both a .qmd and a .html file.
Use the following Quarto Notebook to answer the questions in the quiz. lastname_firstname_Stat652_Quiz01.qmd
A nice blog post from yuza-Blog to read is glmulti best model, along with the YouTube video glmulti. The video is a good introduction to another way to do model selection.
- Complete 2.4 Exercises Problem 7 a, b, c from the ISL.
Do parts a, b, and c without normalization or scaling. Re-do parts a, b, and c using either normalization or scaling. Do the results differ?
- Run the best subset regression R code using the olsrr (from rsquaredacademy) and leaps packages. This question demonstrates automating the model selection process by fitting all possible regressions and picking the best model using a criterion such as adjusted R-squared or AIC.
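The idea of fitting all possible regressions and ranking them by a criterion can be sketched in base R without olsrr or leaps. This is only an illustration of the idea, not the quiz code: it assumes the built-in mtcars data with mpg as the response, and the four candidate predictors are chosen arbitrarily.

```r
# Candidate predictors for mpg in the built-in mtcars data (arbitrary choice).
candidates <- c("wt", "hp", "disp", "qsec")

# Enumerate every non-empty subset of the candidate predictors.
subsets <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record AIC and adjusted R-squared.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  data.frame(model  = paste(vars, collapse = " + "),
             aic    = AIC(fit),
             adj_r2 = summary(fit)$adj.r.squared)
}))

# Sort so that the model with the lowest AIC comes first.
results <- results[order(results$aic), ]
print(head(results, 3))
```

With p candidate predictors there are 2^p - 1 models to fit, so exhaustive search is only practical for a modest number of predictors.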
Homework02b: (complete by Monday February 12, 2024)
Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework02.qmd using your own last name and first name in the filename.
You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.
Upload two files to Canvas: your .html file (rendered with self-contained: true) and your .qmd file. DO NOT submit a .zip file.
Read: mdsr2e Chapter 10, Chapter 11
Machine Learning with R, 4ed, Chapter 5, first half of Chapter 6.
To access the book, go to CSUEB Library Databases A-Z > Safari Books Online, register, and open the book.
Problems:
11.7 Exercises: Problem 6b, Run Models 1. Null Model, 2. Multiple Linear Regression, 3. Decision Tree, using CART, from the R package rpart, using training and test datasets, as described in part c of the problem.
Hints: For Problem 6b, explore the dataset before attempting to fit the models. You will need to deal with the missing values before applying some or all of the models. Which models do not work with missing data?
Homework02a: (complete by Monday February 5, 2024)
Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework02.qmd using your own last name and first name in the filename.
You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.
Upload two files to Canvas: your .html file (rendered with self-contained: true) and your .qmd file. DO NOT submit a .zip file.
Read: mdsr2e Chapter 10, Chapter 11
Machine Learning with R, 4ed, Chapter 3, second half of Chapter 6.
To access the book, go to CSUEB Library Databases A-Z > Safari Books Online, register, and open the book.
Problems:
10.6 Exercises: Problem 3
Hints: The HELPrct data are in the mosaicData R package. Note that this problem does not ask you to use training and testing datasets. Use the full dataset to fit the model.
11.7 Exercises: Problem 4
11.7 Exercises: Problem 6a, Run Models 1. Null Model, 2. Logistic Regression, 7. kNN, using training and test datasets, as described in part c of the problem.
Hints: For Problem 6a, explore the dataset before attempting to fit the models. You will need to deal with the missing values before applying some or all of the models. Which models do not work with missing data?
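A minimal base R sketch of the two preparation steps mentioned above (checking for missing values, then making training and test datasets) might look like the following. The data frame is simulated for illustration and is not the exercise dataset. Dropping incomplete rows is just one assumed option, and the 75/25 split proportion is also an assumption.

```r
# Small simulated data frame with a few missing values (not the exercise data).
set.seed(652)
dat <- data.frame(
  y  = rbinom(20, 1, 0.5),
  x1 = rnorm(20),
  x2 = rnorm(20)
)
dat$x1[c(3, 11)] <- NA
dat$x2[7] <- NA

# Count missing values per column before deciding how to handle them.
print(colSums(is.na(dat)))

# One simple option: keep only the complete rows (imputation is another).
dat_complete <- dat[complete.cases(dat), ]

# Random 75/25 split into training and test datasets.
n         <- nrow(dat_complete)
train_idx <- sample(n, size = floor(0.75 * n))
train     <- dat_complete[train_idx, ]
test      <- dat_complete[-train_idx, ]
```

Dropping rows can discard a lot of data when missingness is spread across many columns, so it is worth looking at the per-column counts first.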
Homework01: (complete by Monday January 29, 2024)
Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework01.qmd using your own last name and first name in the filename.
You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.
Upload two files to Canvas: your .html file (rendered with self-contained: true) and your .qmd file. DO NOT submit a .zip file.
Read:
- mdsr2e Chapter 9
- Machine Learning with R, 4ed, first half of Chapter 6 on linear regression and logistic regression.
To access the book, go to CSUEB Library Databases A-Z > O’Reilly Online Learning E-books (formerly Safari Books Online), register, and open the book.
Problems:
9.9 Exercises: Problem 2, Problem 3
9.10 Supplemental exercises: Problem 2