Assignments


Homework 5: (not collected)

Read the following Chapters in Machine Learning with R, 4ed, Lantz.


Final: (due in Canvas by Sunday March 15, 2026 or end of finals week.)

The Final is about implementing the Boruta algorithm, a machine learning feature selection algorithm based on Random Forests. There are also two further questions asking you to use an AI tool you are comfortable with (Google AI Studio, ChatGPT, Mistral, Microsoft Copilot, or another) to prepare a response to the prompt provided.

Unzip the R Project. See the Final_part04.html.

Answer two questions in the Final_part04.qmd file.


Project: (due in Canvas by Friday March 13, 2026 or before the end of finals week.)

Complete your project in an R Project folder.

Upload two files to Canvas: your .pdf or .doc file and your .qmd file. Do not submit a .zip file. Your .qmd file should have a table of contents that clearly shows the project you worked on, and each question should be included in the table of contents.
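A Quarto header along the following lines would produce the table of contents described above. This is a minimal sketch, not a required template: the title and author values are placeholders, and the html block only matters if you also render an HTML notebook with embed-resources: true, as Projects 1 and 3 ask.

```yaml
---
title: "Stat 652 Project 1: LASSO"   # placeholder; name the project you chose
author: "Firstname Lastname"          # placeholder
format:
  html:
    toc: true               # table of contents built from the section headings
    embed-resources: true   # bundle images and CSS into a single .html file
  pdf:
    toc: true               # same table of contents in the .pdf you upload
---
```

With toc: true, each section heading in the .qmd becomes a table-of-contents entry, so giving every question its own heading is enough to list it.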

Complete one of the following projects:

Project 1 LASSO

Run all of the R code in the 5 A predictive modeling case study from the tidymodels Welcome website in an R Quarto Notebook with embed-resources: true. Organize the analysis and report using the Five Steps. Be sure to include:

  1. Build the PENALIZED LOGISTIC REGRESSION model for the hotel data. Explain how the recipe and workflow functions are used in this case study to prepare the data for the model. Also, explain how tune_grid() is used.
  2. Build the TREE-BASED ENSEMBLE model for the hotel data.
  3. Compare the ROC Curve for the two models and explain which model is better for classifying a hotel booking as with children or no children.

Project 2 Use your own dataset for Classification

Pick a dataset of interest to you and apply the 5-step process for data science to it. You can use any dataset you like, including one you find on your own, but it should have a binary response variable and at least 100 observations. The goal is to build the best model to predict the binary response variable using the other variables in the dataset. You should try to fit all of the classification algorithms from the class and compare their performance using appropriate metrics such as accuracy, precision, recall, F1 score, and ROC AUC. Report the name of the best model. Organize the analysis and report using the Five Steps. Provide your predictions for the test data in a .csv file.

Project 3 AutoML (install H2O; requires a Java install)

Run all of the code on the H2O AutoML: Automatic Machine Learning page from the H2O.ai website in an R Quarto Notebook with embed-resources: true. This code runs all of the H2O ML models automatically. Read the Model Explainability page and run the function. The data is the Higgs dataset from the UCI ML Repository. Organize the analysis and report using the Five Steps. Be sure to include:

Project 4 install Google Antigravity or Qoder

Use Agentic Programming to build a Shiny App front end for the Tidymodels code from the Midterm that builds models to predict the survival of passengers on the Titanic, compares them, and picks the best model. Use either Google Antigravity or Alibaba Qoder to write the code for you. You can use the code from the Midterm as a starting point; include the folder of provided R Tidymodels code in your Project Folder. Organize the analysis and report using the Five Steps.

In a report, explain how you used the Agentic Programming tool to write the code for you. Include the prompt you used and the code that was generated. Also, include a discussion of the results of the models that were generated by the Agentic Programming tool. Which model was best? How did you determine which model was best? What are the implications of your findings?

Example output:


Homework04: (complete by Monday March 9, 2026)

Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework04.qmd using your own last name and first name in the filename.

You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.

Upload two files to Canvas: your .pdf or .doc file and your .qmd file. Do not submit a .zip file.

  • Read: mdsr2e Chapter 10, Chapter 11
  • Machine Learning with R, 4th ed, Chapter 5, first half of Chapter 7. To access the book, go to CSUEB Library Databases A-Z > Safari Books Online, register, and open the book.
  • Problems:
  • 11.7 Exercises: Problem 6a, Run Models 5. Neural network using training and test datasets, as described in part c of the problem.
  • 11.7 Exercises: Problem 6b, Run Models 4. Random Forest, 6. LASSO using training and test datasets, as described in part c of the problem.

Midterm: (due in Canvas by Friday February 27, 2026)

The Midterm is about determining which classification algorithm is best for classifying passengers on the Titanic for survival.

Unzip the R Project. See the Midterm.html to read the assigned problems.

For the Midterm, the process of developing a model using the training data is described. Final predictions will be made with the testing data, which does not include the labels. This is how Kaggle submissions are made.

The old tidymodels code is provided. This code should be updated to the new tidymodels workflows.


Homework03: (complete by Monday March 2, 2026)

Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework03.qmd using your own last name and first name in the filename.

You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.

Upload two files to Canvas: your self-contained: true .html file and your .qmd file. DO NOT submit a .zip file.

  • Read: mdsr2e Chapter 10, Chapter 11
  • Machine Learning with R, 4th ed, Chapter 4, Chapter 5. To access the book, go to CSUEB Library Databases A-Z > Safari Books Online, register, and open the book.
  • Problems:
  • 11.7 Exercises: Problem 6a, Run Models 3. Decision Tree, using C5.0, 4. Random Forest, 6. Naive Bayes, using training and test datasets, as described in part c of the problem.
  • (Optional: This used to be a problem on the first homework assignment in Stat. 653, when I used to teach that class. So if you are not taking that class from me, you might consider doing this problem for this class, Stat. 652.) Perform the SMS spam filtering analysis from Lantz, Machine Learning with R, 4ed. Produce a report explaining the data, the analysis, and the findings. Organize your report using the Five Steps. Be sure to include:
    1. Show the prediction that the algorithm produced.
    2. Give the Accuracy of the predictions.
    3. Include the confusion matrix.

Homework02b: (complete by Monday February 23, 2026)

Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework02b.qmd using your own last name and first name in the filename.

You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.

Upload two files to Canvas: your self-contained: true .html file and your .qmd file. DO NOT submit a .zip file.

  • Read: mdsr2e Chapter 10, Chapter 11

  • Machine Learning with R, 4ed, Chapter 5, first half of Chapter 6. To access the book, go to CSUEB Library Databases A-Z > Safari Books Online, register, and open the book.

  • Problems:

  • 11.7 Exercises: Problem 6b, Run Models 1. Null Model, 2. Multiple Linear Regression, 3. Decision Tree, using CART, from the R package rpart, using training and test datasets, as described in part c of the problem.

Hints: For Problem 6b, explore the dataset before attempting to fit the models. You will need to deal with the missing values before applying some or all of the models. Which models do not work with missing data?


Quiz: (due in Canvas by Friday February 13, 2026)

Instructions: For problem 1, you can complete the questions in an Excel spreadsheet or in an R Quarto Notebook. For problem 2, run the provided R Quarto Notebook and answer the questions asked. Submit either an .xlsx file for problem 1 plus both a .qmd and an .html file for problem 2, or a .qmd and an .html file containing your code, output, and answers for both problems.

Use the following Quarto Notebook to answer the questions in the quiz. Download this .qmd file into a new R Project directory that you create on your computer: lastname_firstname_Stat652_Quiz01.qmd

Before submitting your files in Canvas, use an AI such as ChatGPT or Gemini to check whether you have followed the guidelines. Here is a suggested prompt to use after uploading the Homework_Guidelines.qmd file and your .qmd and .html files for the assignment to the AI.

Prompt: “Please review my homework that is in the .html and .qmd files. Please compare these files to the guidelines provided in the guidelines.qmd file. Please score each part of each requirement on a 5-point scale. Please provide a summary of the completeness of my homework and anything that needs to be fixed.”

Upload two files to Canvas: your self-contained: true .html file and your .qmd file. DO NOT submit a .zip file.

  1. Complete 2.4 Exercises Problem 7 a, b, c from the ISL.

Do parts a, b, and c without normalization or scaling. Re-do parts a, b, and c using either normalization or scaling. Do the results differ?

  2. Run the best subset regression R code using the olsrr (from rsquaredacademy) and leaps packages. This question demonstrates automating the model selection process by fitting all possible regressions and picking the best model using a criterion such as adjusted R-squared or AIC.

A nice blog post to read from yuza-Blog is glmulti best model, along with the YouTube video glmulti. The video is a good introduction to another way to do model selection.


Homework02a: (complete by Monday February 16, 2026)

Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework02a.qmd using your own last name and first name in the filename.

You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.

Before submitting your files in Canvas, use an AI such as ChatGPT or Gemini to check whether you have followed the guidelines for the homework. Here is a suggested prompt to use after uploading the Homework_Guidelines.qmd file and your .qmd and .html files for the assignment to the AI.

Prompt: “Please review my homework that is in the .html and .qmd files. Please compare these files to the guidelines provided in the guidelines.qmd file. Please score each part of each requirement on a 5-point scale. Please provide a summary of the completeness of my homework and anything that needs to be fixed.”

Upload two files to Canvas: your self-contained: true .html file and your .qmd file. DO NOT submit a .zip file.

  • Read: mdsr2e Chapter 10, Chapter 11

  • Lantz, Machine Learning with R, 4ed, Chapter 3, second half of Chapter 6. To access the book, go to CSUEB Library Databases A-Z > Safari Books Online, register, and open the book.

  • Problems:

  • 10.6 Exercises: Problem 3

Hints: The HELPrct data is in the mosaicData R package. Note that this problem does not ask you to use a training and testing dataset. Proceed without the testing dataset and use the full dataset to fit the model.

  • 11.7 Exercises: Problem 4

  • 11.7 Exercises: Problem 6a, Run Models 1. Null Model, 2. Logistic Regression, 7. kNN, using training and test datasets, as described in part c of the problem.

Hints: For Problem 6a, explore the dataset before attempting to fit the models. You will need to deal with the missing values before applying some or all of the models. Which models do not work with missing data?


Homework01: (complete by Monday February 2, 2026)

Using the provided Quarto Project, rename the file lastname_firstname_Stat652_Homework01.qmd using your own last name and first name in the filename.

  • Stat652_Homework01.zip Updated: I have updated the .qmd file in the .zip. There is currently a problem installing the R package mosaic. Since the .qmd file only uses the R package mosaicData, there is no reason to load mosaic. I have removed the mosaic package from the p_load() function and changed it to load only the mosaicData package in the .qmd file.

You should plan to come to class on Monday next week to ask questions and you will have until Friday to turn in this homework through Canvas.

Before submitting your files in Canvas, use an AI such as ChatGPT or Gemini to check whether you have followed the guidelines for the homework. Here is a suggested prompt to use after uploading the Homework_Guidelines.qmd file and your .qmd and .html files for the assignment to the AI.

Prompt: “Please review my homework that is in the .html and .qmd files. Please compare these files to the guidelines provided in the guidelines.qmd file. Please score each part of each requirement on a 5-point scale. Please provide a summary of the completeness of my homework and anything that needs to be fixed.”

Upload two files to Canvas: your embed-resources: true .html file and your .qmd file. DO NOT submit a .zip file.

  • Read:
    • mdsr3e Chapter 9
    • Lantz, Machine Learning with R, 4ed, first half of Chapter 6 on linear regression and logistic regression.
    • To access the book, go to CSUEB Library Databases A-Z > O’Reilly Online Learning E-books (formerly Safari Books Online), register, and open the book.
  • Problems:
    • 9.9 Exercises: Problem 2, Problem 3
    • 9.10 Supplemental exercises: Problem 2. Hint: This problem does not have a given dataset, so there is no code to run. Answer the questions in writing.