Stat 452 Introduction to Statistical Learning
Department of Statistics and Biostatistics
California State University, East Bay
Spring 2020
Week 16: Finals week
- I will have my usual office hours MW 2-3pm.
- The scheduled final for the class is on Monday, 8-10 PM. I will be online starting at 8 PM and will stay on until all questions have been answered. I will log into Office Hours.
Week 15:
- Quiz: I listed Wednesday for the quiz, so we will have the quiz next time. It will be a short list of questions that will be useful for the final.
- Final: The final for the class will be take-home. You can complete it at home individually. I will be available during the scheduled exam time for questions. Final topics: Ch. 6 Trees, Random Forests, Ch. 7 ANN and SVM, Ch. 8 Market Basket Analysis, Ch. 9 Clustering. You should be able to communicate clearly what the Holdout Rule is and what a prediction is.
- Project: Your project should be completed before the end of the quarter. The last day you can submit it is Friday May 15.
- Presentation: Transactional Data and Association Rules Association Association.pdf
- MLwRv2_08.r
- groceries.csv
- R Project: Chap8.zip
- Presentation: Clustering Clustering.pdf
- Review and other topics
- Presentation: Validation Validation.pdf
- Presentation: Ensembles Ensembles.pdf
- Presentation: Selection Selection.pdf
- Project: Just as a reminder, you should randomize your data using the sample() function. Taking the top part of your data for training and using the bottom for testing is generally a bad idea. The author of our book does this to simplify things, but he claims to have pre-randomized the rows of the datasets. Either pre-randomize and say you did that or use the sample() function to randomly select the training and test data.
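As a reminder of what the random split for the Project can look like, here is a minimal sketch. It assumes the built-in mtcars data as a stand-in for your project data, a 75/25 split, and a simple linear model; all three are illustrative choices, not requirements. It also shows the holdout idea: fit on the training rows only, then make predictions for the held-out test rows.

```r
# Hypothetical example: mtcars stands in for your own project data.
set.seed(123)                                     # make the random split reproducible

n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.75 * n))    # random 75% of the row indices

train <- mtcars[train_idx, ]                      # rows used to fit the model
test  <- mtcars[-train_idx, ]                     # held-out rows, never used for fitting

# Fit on the training data only, then predict the held-out rows.
fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Compare the predictions to the actual values in the test set.
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

If you pre-randomize the rows instead, say so in your write-up; either way the test rows should not be used when fitting the model.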
Week 14:
- Quiz: There will be a quiz on Wednesday on terminology.
- R Project: Chap7.zip Updated: Now with h2o code working. Using a different activation function.
- h2O Deep Learning Booklet: Deep Learning Booklet
- h2O Deep Learning Tutorial: h2o Deep Learning Tutorial
- h2O Performance Tuning Guide: h2o Performance Guide
- Presentation: SVM SVM.pdf images.pdf
- MLwR_v2_07.r
- MLwR_v2_07_h2o.r
- letterdata.csv
- Website Spotlight: MNIST
- Website Spotlight: UCI MLR nminst dataset
- kaggle Competition: Digit Recognizer
- Spotlight paper: EMNIST
- Website Spotlight: EMNIST Dataset
- Spotlight Website: cedar
- Website Spotlight: IMAGENET
Week 13:
- Resume Suggestion: Considering everything these days, it might be good to add a line to your resume: "Willing and able to work remotely."
- ASA: ASA Virtual Undergraduate Career Fair
- Homework: Homework 8 has been posted.
- Presentation: ANN ANN.pdf images.pdf
- MLwR_v2_07.r
- MLwR_v2_07_h2o.r
- concrete.csv
- R Project: Chap7.zip
- Spotlight blog: Deep Learning with MXNet
- Spotlight blog: Deep Learning in R
- YouTube: 3Blue1Brown
- YouTube: Neural Networks Demystified Video
- Image, Translation, Speech:
- Microsoft: Cognitive Services Translator Speech API
- Google: Cloud Services TensorFlow
- Amazon: AWS Deep Learning
- h2o: Deep Learning Deep Learning with R Tutorials
- Nvidia: NVidia's GPUs - The Engine of Deep Learning
- Baidu: Baidu Research
- IBM: IBM Watson
- Spotlight YouTube:
- sentdex: Machine Learning
- Google Developers: {ML} Recipes
Week 12:
- Homework: Homework 7 has been posted.
- Project: The first part of the Project has been posted.
- Blackboard: I plan to start a Discussion Forum for each class going forward. Please give it a try. I would like to collect everyone's questions through the Discussion Forum so everyone can see the questions being asked and the answers.
- Presentation: Evaluation of Logistic Regression Models (This includes more advanced topics related to Logistic Regression. If you are considering applying to an MS Statistics program, you might take a closer look at this. This material would be covered in a Categorical Data class.)
- Presentation: Tuning Tuning.pdf
- MLwR_v2_11.r
- credit.csv
- R Project: Chap11.zip
- Presentation: Predicting wine quality using Random Forests
- Website Spotlight: Netflix Prize
- Website Spotlight: kaggle Titanic Google Cloud & YouTube - 8M Video Understanding
- Website Spotlight: Numerai
- Website Spotlight: Quantopian
- Website Spotlight: Data Science Game
Week 11:
- Homework: Homework 6 has been posted.
- Presentation: CART CART.pdf
- MLwR_v2_06.r
- whitewines.csv
- redwines.csv
- R Project: Chap06.zip
- Presentation: LogisticRegression Logistic.pdf
- Spotlight blog: How to perform Logistic Regression in R
- Presentation: Evaluation Evaluation.pdf
- MLwR_v2_06_Logistic.r
- challenger.csv
- credit.csv
- R Project: Chap06.zip
- Spotlight blog: Evaluating Logistic Regression Models
- Website Spotlight: R Data Analysis Examples: Logit Regression
- Website Spotlight: Generalized Linear Models
- YouTube: ROC Curves Video
- YouTube: The tradeoff between Sensitivity and Specificity
Week 10:
- Spring Break: Next week is Spring Break so there will be no class on Monday and Wednesday next week.
- Homework: Homework 5 has been posted.
- Software Spotlight: RStudio Cloud. Instead of dealing with Java to run RWeka (on Windows you need the Windows Offline (64-bit) version), you could use RStudio Cloud.
- Presentation: Regression.html Regression.pdf images.pdf
- MLwR_v2_06.r
- challenger.csv
- insurance.csv
- R Project: Chap06.zip
- Website Spotlight: UC-r Linear Regression Tutorial
- Website Spotlight: UC-r Linear Model Selection Tutorial
- Software Spotlight: Salford Systems Minitab
- Software Spotlight: Togaware onepager
Week 9:
- Office Hours: Office Hours on Monday from 2-3pm are cancelled. We are interviewing a candidate for a tenure track faculty position.
- This week we will be starting our online class meetings. In preparation for these meetings, please get Zoom working on your computer. Here is the link to the CSUEB Zoom login page. I would also suggest having R and RStudio installed on the computer you plan to use for class.
- We will go over the Midterm in class on Monday.
- Homework: Homework 4 has been postponed.
- Finish going over Chapter 5 code.
- Presentation: Rules.html Rules.pdf images.pdf
- MLwR_v2_05_update2.r The update includes the use of training and test data with the application of the Rules based methods.
- mushrooms.csv
- R Project: Chap05.zip
- R Notebook: Chap05.Rmd
Week 8:
- The university is now closed for the rest of the week, March 11-15. THE LATEST INFORMATION ON CORONAVIRUS (COVID-19)
- There will be no class today, March 11, 2020.
- We will have class on Monday as regularly scheduled. The class discussion and office hours will be conducted using Zoom Video Conference. Please give Zoom Video Conference a try before Monday. To join, log into Blackboard, go to the class, click on the Video Conference link on the left, and then click on the first link, Zoom Video Conference. You will need a microphone and headphones to be able to speak and hear. Today will also be a test for me to experience class and office hours online. Today I will be in my office hours online from 2 to 3pm. See you online! Let's give it a try.
- On Monday we will start by discussing how we will proceed until Wednesday April 8, when the decision on whether to continue online will be made. Then we will go over the Midterm solution and try the Classification Tree on the credit data.
- Homework: Homework 4 has been posted.
- Midterm: I plan to discuss the midterm solutions on Wednesday in class.
- Presentation: Classification Classification.pdf images.pdf
- Presentation: Classification2 Classification2.pdf
- MLwR_v2_05_update2.r
- credit.csv
- R Project: Chap05.zip
- R Notebook: Chap05.Rmd
- Website Spotlight: UC-r Regression Trees Tutorial
- Books Spotlight: wikibooks
- Software Spotlight: BigML Are You Ready for Big Machine Learning?
- Software Spotlight: DataRobot
Week 7:
- On Monday we will finish our discussion of the Naive Bayes code and we will go over the Quiz solution in preparation for the Midterm on Wednesday. The Quiz solution has been posted in Blackboard.
- Midterm: The Midterm will be on Wednesday in class. The Midterm will cover Chapters 1-4.
- Homework: Homework 4 has been posted. The material for this Homework will be presented after the Midterm, next week.
Week 6:
- Quiz: There will be a quiz in class on Wednesday February 26 covering Chapters 1-4.
- On Monday we will continue our discussion of Naive Bayes and run the R code from last week. Chap04.zip
- Guest Speaker: At the Data Science Workgroup Meeting on Friday February 28, 2:10 to 3pm, Darren Keeley, from PG&E, will be speaking about his experiences there doing data visualization and other things. flyer
Week 5:
- Office Hours: Cancelled today. Unfortunately, I have a department meeting during my office hours today. If you have time, please come to the talk at 4pm that will be given by a candidate for a tenure track faculty position in our department.
- Homework: Homework 3 has been postponed until next week.
- Quiz: Next week, there will be a quiz in class on Wednesday February 26 covering Chapters 1-4.
- Presentation: Naive Bayes.html Naive Bayes.pdf images.pdf
- Presentation: Naive Bayes SMS spam filtering Naive Bayes SMS spam filtering.pdf
- Notes: BayesNotes.pdf
- sms_spam.csv
- MLwR_v2_04.r
- R Project: Chap04.zip
- R Notebook: NB.Rmd
- What is a VCorpus? StackExchange
- tm Vignettes
- Hint: Recall from class that some people running R on Windows had a fonts problem. To solve the problem, we added a line to the code that assigns the third DTM to the first. This works because all of the steps used to create the first DTM are also applied when creating the third DTM.
> # compare the result
> sms_dtm
> sms_dtm2
> sms_dtm3
> # workaround: replace the first DTM with the third
> sms_dtm <- sms_dtm3
- Website Spotlight: UC-r Naive Bayes Tutorial
- Website Spotlight: GormAnalysis - Introduction to Naive Bayes
- Website Spotlight: Rseek
- Website Spotlight: METACRAN
- Website Spotlight: crantastic!
- Website Spotlight: RDocumentation
Week 4:
- Homework: Homework 3 has been posted.
- LibCal: LibCal Learn Excel.
- Presentation: kNN_diagnosing_breast_cancer kNN_diagnosing_breast_cancer.pdf
- wisc_bc_data.csv
- MLwR_v2_03.r
- R Project: Chap03.zip
- R Notebook: Chap03-RNotebook
Week 3:
- Announcement: CSU STUDENT RESEARCH COMPETITION (SRC)
- Announcement: CAL STATE EAST BAY STUDENT RESEARCH SYMPOSIUM
- Homework: Homework 2 has been posted.
- Presentation: kNN kNN.pdf images.pdf
Week 2:
- Presentation: IntroR IntroR.pdf
- usedcars.csv
- MLwR_v2_02.r
- R Project: Chap02.zip
- Rpubs: https://rpubs.com/esuess/Chap02
- Spotlight blog: Analytics Vidhya
- Spotlight blog: A Complete Tutorial to learn Data Science in R from Scratch
- Spotlight blog: Machine Learning basics for a newbie
- Spotlight blog: Essentials of Machine Learning Algorithms (with Python and R code)
- Spotlight blog: Introduction to k-nearest neighbors: Simplified
- Spotlight blog: New Year Resolutions for a Data Scientist
- Spotlight blog: 25 Open Datasets for Deep Learning Every Data Scientist Must Work With
- Book Spotlight: Quick-R
- Book Spotlight: RDM
- Book Spotlight: StatSoft Textbook kNN Naive Bayes Classifier
- Book Spotlight: Tom Mitchell Chapter 3
- Book Spotlight: Hands On Machine Learning with R
- Website Spotlight: caret parsnip
- Website Spotlight: scikit learn Naive Bayes
- Spotlight blog: 6 Easy Steps to Learn Naive Bayes Algorithm
- Cheat Sheets:
- Google Machine Learning course: Machine Learning Crash Course
- New AI course: fast.ai
Week 1:
- Or you can read the book through the CSUEB library > Databases > Safari Books Online
- Book images images.pdf
- Presentation: Welcome Welcome.pdf
- Homework: Homework 1 has been posted.
- Spotlight Software: python Anaconda jupyter scikit-learn
- Spotlight Software: TensorFlow Keras RStudio Tensorflow RStudio Keras h2O.ai
- Spotlight Blog post: Prime Hints For Running A Data Project In R
Excellent References:
Machine Learning:
Data Science:
- r4ds
- ModernDive
- YaRrr!
- R Data Science Essentials
- Python Data Science Essentials
- Doing Data Science
- Data Science from Scratch
- What is Data Science? (fast easy read)
- Data Driven (fast easy read)
- A Simple Introduction to Data Science
Learning R:
- Data Camp: Introduction to R
- Data Camp: Machine Learning with Tree-Based Models in R
- RProgramming.net
- Introduction to MRO
- R-Exercises
- R Markdown: The Definitive Guide
Learning Python:
Learning SQL:
Reading related to the Digital Economy:
- The Second Machine Age: Work, Progress and Prosperity in a Time of Brilliant Technologies
- Race Against the Machine
- Wired For Innovation
- Strategies for e-business success
- Understanding the Digital Economy
- Introduction to AI for Marketing and Product Innovation
More Big Picture:
- Fourth Paradigm of Science: Data-Intensive Scientific Discovery
- McKinsey Global Institute Big Data: The next frontier for innovation, competition, and productivity
- leada The Data Analytics Handbook
- Data Analysts + Data Scientists
- CEO's + Managers
- Researchers + Academics
- Big Data Edition
- The Master Algorithm
- Pedro Domingos: "The Master Algorithm" | Talks at Google