# Stat 652: Statistical Learning

Department of Statistics and Biostatistics, CSU East Bay

**Spring 2024:**

**Week 7:**

**Next Week:** There is no class on Monday, March 4 or Wednesday, March 6, because next week is Finals week for this course. I will hold my usual office hours on Wednesday and Thursday next week for questions about the Final, the Project, or any other assignments you are working on.

**Final:** The Project and Final may be turned in no later than the end of the day on Sunday, March 10.

- On Monday this week we will complete the discussion of Rule-Based Decision Trees and Tuning from last week. We will discuss Neural Networks and Clustering on Wednesday. If you are interested, Chapter 7 of Lantz 4ed discusses SVMs and Chapter 8 discusses Association Rules.

**Homework:** The solutions to Homework 3 have been posted in Canvas.

**Presentation:** images.pdf

- ANN.html
- ANN.qmd
- MLwR4e Chapter_07.R
- MLwR_v2_07.r
- MLwR_v2_07_h2o.r
- concrete.csv

**R Project:** Chap07.zip

- **Spotlight blog:** Deep Learning in R
- **YouTube:** Neural Networks Demystified Video

**Image, Translation, Speech:**

- **Microsoft:** Cognitive Services Translator Speech API
- **Google:** Cloud Services CoLab
- **Amazon:** AWS Deep Learning SageMaker
- **h2o:** Deep Learning Deep Learning with R Tutorials
- **Nvidia:** NVidia’s GPUs - The Engine of Deep Learning
- **Baidu:** Baidu Research
- TensorFlow
- PyTorch

**Presentation:**

- SVM.html
- SVM.qmd
- MLwR4e Chapter_07.R
- MLwR_v2_07.r
- MLwR_v2_07_h2o.r
- letterdata.csv

**R Project:** Chap07.zip

- **Website Spotlight:** UCI MLR MNIST dataset
- **Kaggle Competition:** Digit Recognizer
- **Spotlight paper:** EMNIST
- **Website Spotlight:** EMNIST Dataset
- **Website Spotlight:** cedar
- **Website Spotlight:** IMAGENET
- **Website Spotlight:** Fashion MNIST

**Presentation:** Transactional Data and Association Rules

- Association.html
- Association.qmd
- MLwR4e Chapter_08.R
- MLwR_v2_08.r
- groceries.csv

**R Project:** Chap08.zip

- **Spotlight book:** Hands on Machine Learning
- **Spotlight book:** Dive into Deep Learning

**Week 6:**

**Midterm:** The Midterm has been posted. See the Homework link. The Titanic data is used. See Examples_Tidymodels for the new tidymodels code.

**Homework:** Homework 4 has been posted.

**Presentation:** images.pdf

**Further Examples:** This R Project contains new examples that use the workflows package, which is part of the tidymodels collection of packages.

**Presentation:**

- Tuning.html
- Tuning.qmd

**R Project:** Chap11.zip This code takes a long time to run. Try this after the Chap05 code.

**Presentation:**

- Rules.html
- Rules.qmd
- mushrooms.csv

**R Project:** Chap05.zip

- **Software Spotlight:** BigML Are You Ready for Big Machine Learning?
- **Software Spotlight:** DataRobot

**Week 5:**

**Homework:** Homework 3 has been posted.

**Midterm:** The Midterm has been posted. See the Homework link. The Titanic data is used.

- See Rebecca Barter’s blog post Tidymodels: tidy machine learning in R.
- See Olivier Gimenez’s blog post Experimenting with machine learning in R with tidymodels and the Kaggle titanic dataset.
- See Jan Kirenz’s blog post Data Science with Tidymodels, Workflows and Recipes.

**Presentation:**

**Quarto Project:** Chap06.zip updated Logistic Regression code, compare ROCs

**Presentation:**

- **Website Spotlight:** Rseek
- **Website Spotlight:** METACRAN
- **Website Spotlight:** RDocumentation
- **Website Spotlight:** rdrr.io

**Presentation:** images.pdf

**Presentation:**

- Naive Bayes SMS spam filtering.html
- Naive Bayes SMS spam filtering.qmd

**Notes:** BayesNotes.pdf

- sms_spam.csv
- MLwR4e Chapter_04.R
- MLwR_v2_04.r

**R Project:** Chap04.zip

**R Notebook:** NB.Rmd

- What is a VCorpus? StackExchange
- tm Vignettes
**Hint:** Recall from class that some people running R on Windows had a fonts problem. To solve the problem we added a line to the code that assigns the third DTM to the first, since all of the steps used to create the first DTM are also applied when creating the third DTM.

- compare the result
- sms_dtm
- sms_dtm2
- sms_dtm3
- sms_dtm <- sms_dtm3

**Homework Solutions:** TidyModels Examples: All of the code below is a work in progress; I still need to add my comments to my notebooks. It is made available as an extra set of examples using the new tidymodels packages.

- TidyModels

- **Spotlight blog:** A Gentle Introduction to tidymodels
- **Books Spotlight:** wikibooks

**Week 4:**

**Announcements:**

- This week we will start by running the code from last week for Linear Regression and kNN.

**Homework:** Homework 2b has been posted.

- Before running any ML algorithms on the NHANES data, you should investigate what is in the dataset; in particular, which variables are numeric and which are categorical. You should also check how much data is missing from each variable.
- NHANES_missing-values.Rmd
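
As a minimal sketch of that first look at the data, assuming only base R: the `nhanes_toy` data frame below is a made-up stand-in (its column names and values are illustrative, not taken from the real NHANES data), so substitute the data frame you actually loaded. It separates numeric from categorical variables and counts the missing values in each. This is one possible approach and is not necessarily what the NHANES_missing-values.Rmd notebook does.

```r
# Toy stand-in for the NHANES data; column names and values are hypothetical.
nhanes_toy <- data.frame(
  Age      = c(34, 51, NA, 27, 45),
  BMI      = c(22.1, NA, 30.4, NA, 27.8),
  Gender   = factor(c("female", "male", "male", NA, "female")),
  Diabetes = factor(c("No", "No", "Yes", "No", NA))
)

# Which variables are numeric, and which are categorical (here: factors)?
is_num           <- sapply(nhanes_toy, is.numeric)
numeric_vars     <- names(nhanes_toy)[is_num]
categorical_vars <- names(nhanes_toy)[!is_num]

# Count and percentage of missing values per variable.
n_missing   <- colSums(is.na(nhanes_toy))
pct_missing <- round(100 * colMeans(is.na(nhanes_toy)), 1)

# Collect everything in one table, most-missing variables first.
missing_summary <- data.frame(
  variable    = names(nhanes_toy),
  type        = ifelse(is_num, "numeric", "categorical"),
  n_missing   = n_missing,
  pct_missing = pct_missing,
  row.names   = NULL
)
print(missing_summary[order(-missing_summary$n_missing), ])
```

The same calls (`sapply(df, is.numeric)`, `colSums(is.na(df))`, `colMeans(is.na(df))`) should work unchanged on the full dataset; character columns, if any, would also land in the categorical group here.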

**Quiz:** Quiz 1 has been posted under Assignments.

**Datasets:**

**Presentation:** images.pdf

- CART.html
- CART.qmd
- MLwR4e Chapter_06.R
- MLwR_v2_06.r
- whitewines.csv
- redwines.csv

**Quarto Project:** Chap06.zip

- **Spotlight blog:** Beginner’s guide to machine learning in R (with step-by-step tutorial)
- **Spotlight blog:** yuzaR-blog
- **YouTube:** R package reviews | glmulti | Find The Best Model !
- **Spotlight Software:** Tidyverse
- **Spotlight Software:** Tidymodels rsample
- **Spotlight Software:** caret
- **Spotlight Software:** easystats report performance
- **Website Spotlight:** UC Business Analytics R Programming Guide Very Nice!!!
- **Website Spotlight:** UC-r Logistic Regression Tutorial
- **Website Spotlight:** UC-r Resampling Methods Tutorial
- **Spotlight blog:** How to perform Logistic Regression in R
- **Website Spotlight:** Generalized Linear Models
- **YouTube:** The tradeoff between Sensitivity and Specificity
- **YouTube:** ROC Curves Video **See minute 7.**
- **DataCamp:** Brett Lantz Supervised Learning in R: Classification
- **DataCamp:** Machine Learning with Tree-Based Models in R
- **DataCamp:** Nina John Supervised Learning in R: Regression
- **DataCamp:** Sergey Fogelson Extreme Gradient Boosting with XGBoost
- **YouTube:** GBM
- **Spotlight blog:** statistics.com Lift and Persuasion

**Week 3:**

**Reminder:** On Mondays, class is online on Zoom. See Canvas > Zoom for the link to the Zoom meeting.

**Reminder:** You should finish the assignment by the due date on the class website. You can ask questions that day, then fix anything that needs to be updated and submit by the end of the week. Does this make sense? I expect you to finish the homework by the due date; it would be best to turn it in soon after, but no later than the date on Canvas. I want everyone to reflect on their work and finalize it after asking questions and letting me suggest some ideas if needed. I do not want to be asked about extending the due date, as everyone has gotten 4 extra days to submit.

**Homework:** Homework 2a has been posted.

- The following presentations are based on the chapters in Lantz, Machine Learning with R, Fourth Edition.

**Presentation:** images.pdf

- Regression.html
- Regression.qmd
- MLwR4e Chapter_06.R
- MLwR_v2_06.r
- challenger.csv
- insurance.csv

**Quarto Project:** Chap06.zip

- **Website Spotlight:** UC-r Linear Regression Tutorial
- **Website Spotlight:** UC-r Linear Model Selection Tutorial
- **Spotlight blog:** Decision Trees - An Intuitive Introduction

**Presentation:** images.pdf

**Presentation:**

- kNN_diagnosing_breast_cancer.html
- kNN_diagnosing_breast_cancer.qmd
- wisc_bc_data.csv
- MLwR4e Chapter_03.R
- MLwR_v2_03.r

**Quarto Project:** Chap03.zip

**R Notebook:** Chap03-RNotebook

- **Spotlight blog:** Tidymodels: tidy machine learning in R
- **Spotlight Youtube:** Jelena Ilic: Modeling, Tidyverse Way

**Week 2:**

**CSU Faculty Strike:** No in-person class Monday. The strike is over. We will meet Wednesday on campus.

- I have reviewed the 4th ed. of the Lantz book and plan to use it as the reference instead of the 2nd ed. The new version of the book has better code available than the previous edition. I will be updating the course materials to reflect this change. Please download the code for the 4th ed. of the book and use it for the course.

**Assignment:** Homework 1 has been posted.

**Presentation:**

**Quarto Notebook:**

- Ch09 Statistical Foundations.html using the *infer* R package
- Ch09 Statistical Foundations.qmd

**Spotlight Software:**

- **Spotlight Blog post:** Prime Hints For Running A Data Project In R
- **Spotlight blog:** Data Science Central
- **Spotlight blog:** The 10 Statistical Techniques Data Scientists Need to Master
- **Spotlight blog:** MACHINE LEARNING TRENDS IN 2023
- **Spotlight blog:** 10 top AI and machine learning trends for 2024
- **Spotlight blog:** 2023 emerging AI and Machine Learning trends
- **RStudio::global 2023:** rstudio::conf 2023
- **RStudio::global 2022:** rstudio::conf 2022
- **RStudio::global 2021:** rstudio::global 2021
- **RStudio::conf 2019:** rstudio::conf 2019

**Spotlight Books:**

**Week 1:**

**Classroom:** Section 3 Zoom and Art and Education, RM 285, Zoom links in Canvas.

**Book:** mdsr2e

**Assignment:** Homework 1 has been posted.

**Presentation:**

**Spotlight Software:**

**Week 0:**

**Learning R:**

**Learning Python:**

**Learn SQL:**

**Excellent References:**

**Data Science:**

- Socviz
- r4ds
- ModernDive
- Yarrr!
- R Data Science Essentials
- Python Data Science Essentials
- Deep Learning Made Easy with R
- Doing Data Science
- Data Science from Scratch
- What is Data Science? (fast easy read)
- Ethics and Data Science (fast easy read)
- Data Driven (fast easy read)
- R Markdown: The Definitive Guide

**Reading related to the Digital Economy:**

- The Second Machine Age: Work, Progress and Prosperity in a Time of Brilliant Technologies
- Race Against the Machine
- Wired For Innovation
- Strategies for e-business success
- Understanding the Digital Economy

**More Big Picture:**

- Fourth Paradigm of Science: Data-Intensive Scientific Discovery
- McKinsey Global Institute Big Data: The next frontier for innovation, competition, and productivity