Stat 650 Advanced R for Data Science
Department of Statistics and Biostatistics
California State University, East Bay
Fall 2020
Week 8: Finals week
- I will hold my usual office hours, MW 2-3pm.
- I will log into class on MW at noon and at 8pm to answer questions. If you have questions, please log into the class at noon or 8pm. Once the last person's questions are answered, I will sign out of the Zoom call and will not be available online.
- Next week Stat. 651 will start.
Week 7:
- Final: The final has been posted, see the Homework link.
- Homework: Homework 6 has been updated.
- Quiz: There will be a Quiz in class on Monday September 28, 2020. We will work on the first question in class and the rest of the quiz will be completed at home. Due Wednesday September 30, 2020.
- Homework Solutions: The solutions to Homework 3, 4 and 5 have been posted in Blackboard.
- Next week is Finals Week for the class. There will be no class on Monday. I will hold regular office hours. The solution to the quiz will be posted.
- Final: There will be a take-home final exam given on Wednesday September 30, 2020. Due Friday October 9, 2020.
- Project: Due Sunday October 11, 2020 by midnight.
- Student Evaluations: Please fill out the student evaluation for Stat. 650; you should receive an email about it. Here is a video about how to complete the Student Evaluations of Learning Experience (SELE).
- On Monday we will discuss tidyr's unite() and separate().
- Homework: Homework 6 has been posted. All homework is due by Friday October 9, 2020.
- RNotebook:
- RStudio Cheatsheets: RStudio Cheatsheets
- Spotlight Tidyverse packages:
- Presentation:
- RNotebook:
- RNotebook:
- UseR 2017 purrr tutorial:
- RProject:
- Spotlight Youtube Video:
- Spotlight Tutorial:
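For the unite() and separate() discussion listed above, here is a minimal sketch of what the two functions (both from the tidyr package) do. The `sales` data frame and its column names are made up for illustration and are not course data.

```r
library(tidyr)

# Hypothetical data: one column holding "year-month" strings
sales <- data.frame(
  period = c("2020-08", "2020-09", "2020-10"),
  units  = c(12, 15, 9),
  stringsAsFactors = FALSE
)

# separate() splits one column into several at the separator
sales_split <- separate(sales, period, into = c("year", "month"), sep = "-")

# unite() does the reverse: it pastes several columns into one
sales_united <- unite(sales_split, "period", year, month, sep = "-")

print(sales_split)
print(sales_united)
```

Note that separate() returns the new columns as character by default, so year and month would need converting before any arithmetic.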
Week 6:
- Practice Quiz 1: There will be a Practice Quiz in class this week on Wednesday.
- Quiz 1: There will be a Quiz in class on September 28, 2020. The quiz will be similar to the Practice Quiz 1 but with different data.
- Homework: Homework 5 has been posted.
- Organization and neatness. Interview questions:
- Can you manage your time?
- Are you an organized person?
- Can you share with me a project you worked on where you used your data analysis skills and produced a report?
- Can you explain any difficulties you overcame in completing your project?
- Would you be willing to share an example of your work?
- Project: The project has been posted, see the Homework link. Airline On-Time Performance Data
- Monday we will look at an example of JSON formatted data. See the analytics.bart.gov website, which is a good example of dynamically updated data. Download the JSON files and read them into Excel. You will need to be on Windows and have the newest Excel; sorry, this needs the newest Excel 365.
- RNotebook:
- BART.Rmd
- BART.nb.html
- BART.docx
- BART.pdf
- BART.zip updated
- Spotlight R Package:
- YouTube Video
- Practice Quiz:
- Wednesday we will take a further look at the examples of spread() and gather() in the book, and at the newer functions in the tidyr package, pivot_longer() and pivot_wider(). We will start to discuss looping.
- RNotebook:
- RNotebook:
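As a companion to the pivot_longer() and pivot_wider() item above, here is a small sketch of a round trip between wide and long form. The `wide` table is hypothetical; the book's own examples use different data.

```r
library(tidyr)

# Hypothetical wide table: one row per stock, one column per year
wide <- data.frame(
  stock  = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# pivot_longer() (the newer counterpart of gather()) stacks the year
# columns into a name column and a value column
long <- pivot_longer(wide, cols = c(`2019`, `2020`),
                     names_to = "year", values_to = "price")

# pivot_wider() (the newer counterpart of spread()) undoes that step
wide_again <- pivot_wider(long, names_from = year, values_from = price)

print(long)
print(wide_again)
```

The long form has one row per stock and year, which is usually the shape ggplot2 and the dplyr grouping verbs expect.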
Midterm Information:
I have received a number of emails about not being able to knit the midterm when fordgobike01.Rmd is run separately to create the data frames and a second .Rmd file then reads those data frames from the Global Environment to answer the Midterm questions. It turns out that RStudio does not allow this through the Preview/Knit button: knitting runs the document in a new R session that does not see the Global Environment, and relying on the Global Environment is considered non-reproducible. Here is a link to a Stack Overflow post about this topic. To fix this problem, I suggest you move the code you used to download the data into an R script and then source() that script at the start of your R Notebook to load the data. You can also save the data as an .Rds file and load the .Rds file in the notebook, using the readr package functions write_rds() and read_rds().
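The .Rds approach described above could look roughly like the sketch below. The `trips` data frame and the file name are placeholders, not the actual Midterm data; in practice the data frame would come from your own wrangling script.

```r
library(readr)

# Stand-in for the data frame built by the wrangling script
# (hypothetical contents, for illustration only)
trips <- data.frame(
  duration_sec = c(300, 1250, 640),
  user_type    = c("Subscriber", "Customer", "Subscriber"),
  stringsAsFactors = FALSE
)

# In the wrangling script: save the finished data frame once
rds_path <- file.path(tempdir(), "trips.rds")  # hypothetical file name
write_rds(trips, rds_path)

# At the top of the notebook: read it back, so the notebook no longer
# depends on objects left in the Global Environment
trips_loaded <- read_rds(rds_path)

head(trips_loaded)
```

In a real project the file would live in the project folder rather than tempdir(), so that both the script and the notebook can find it.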
Suggestions:
- Make sure you name your file correctly. Lastname and Firstname are the names you have in the University computer systems.
- If you are having problems visualizing the missing data, try running your code on subsets of your data. I would suggest chunks of approximately 25,000 observations.
- DO NOT submit your data to Blackboard.
- DO NOT submit a .zip file. At this point I would suggest submitting a .Rmd file containing the solutions to the questions, an R script (.R) containing your data wrangling steps, and a .docx or .pdf file that I will read.
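One way to work through the data in chunks of about 25,000 observations, as suggested above, is sketched here with base R only. The simulated data frame `x` is an assumption made so the example runs on its own; substitute your own data frame.

```r
set.seed(650)

# Simulated stand-in for a large data frame with some missing values
n <- 60000
x <- data.frame(a = rnorm(n), b = rnorm(n))
x$b[sample(n, 500)] <- NA

# Label each row with a chunk number: rows 1-25000 are chunk 1, and so on
chunk_size <- 25000
chunk_id   <- ceiling(seq_len(nrow(x)) / chunk_size)

# split() returns a list with one data frame per chunk
chunks <- split(x, chunk_id)

# Count missing values per column within each chunk
missing_by_chunk <- sapply(chunks, function(d) colSums(is.na(d)))
print(missing_by_chunk)
```

Each element of `chunks` is an ordinary data frame, so it can be passed one at a time to whatever missing-data plot you are using.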
Week 5:
- Homework Solutions: The solutions to Homework 1 and 2 have been posted in Blackboard.
- Midterm: The midterm is due in two weeks, Friday September 25. Please note that I do not want to see too much of my code from fordgobike01.Rmd as the first thing in your Midterm notebook. None of the questions ask about reading in the data. You might start with head(lyftbaywheels) and then go on to question 1; put the code that reads in the data into a separate R script. One comment: the Amelia package may not run well because of the size of our dataset. Consider taking a random sample of the data, or try the visdat package instead.
- Project: The project has been posted, see the Homework link. Airline On-Time Performance Data
- Presentation:
- Spotlight Websites:
- RNotebook:
- Spotlight Website: UC Business Analytics R Programming Guide. Check out the tidyr chapter.
- R Notebook:
- StockData01.Rmd
- StockData01.nb.html
- StockData02.Rmd (pivot_wider() and pivot_longer())
- StockData02.html
- Software Spotlight: Meld. When you are working with several versions of a program someone else has written, it is very useful to view the code side by side. I use Meld a lot for this purpose. If you find a different program you like, let me know.
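For the Midterm suggestion above about taking a random sample when a package struggles with the full dataset, a minimal sketch using dplyr's slice_sample() follows. The data frame `big` is simulated for illustration, and the sample size of 5,000 is an arbitrary choice, not a course requirement.

```r
library(dplyr)

set.seed(650)  # makes the random sample repeatable when the notebook is re-knit

# Simulated stand-in for a large dataset
big <- data.frame(id = seq_len(100000), value = rnorm(100000))

# Draw 5,000 rows at random, without replacement
small <- slice_sample(big, n = 5000)

dim(small)
```

Setting the seed matters here: without it, every knit would draw a different sample and the reported results would change from run to run.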
Week 4:
- Homework: Homework 4 has been posted.
- Today we will go over how to create an R Project. I would recommend doing your work for the Midterm in an R Project.
- Spotlight Blog post: Prime Hints For Running A Data Project In R
- Demo the Midterm code.
Week 3:
- Homework: Homework 3 has been posted.
- On Monday we will start by going over the Data Wrangling code. Note that the sawapi website has ended. As an alternative, let's try the CDECRetrieve package, which was created by an alumnus of the MS Statistics program. If you try to install this package you will get a message that it is not available. Welcome to open source software. We will have to install this package from GitHub. See if you can get the example from the website working.
- Midterm: The Midterm will be a take-home exam, and the data are posted. See the Homework link.
- Presentation:
- RNotebook:
- Multiple_Tables.Rmd
- Multiple_Tables.nb.html
- Multiple_Tables.docx
- Multiple_Tables.pdf (Note: skimr cannot be used when knitting to a .pdf file, so that code needs to be removed.)
- Presentation:
- Scripts for merging nycflights13:
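For the CDECRetrieve item above, installing a package from GitHub could be sketched as follows. The repository path "FlowWest/CDECRetrieve" is my assumption and should be checked against the package's website; the install lines are left commented out so that running the block changes nothing on your machine.

```r
pkg  <- "CDECRetrieve"
repo <- "FlowWest/CDECRetrieve"  # assumed GitHub path, verify before use

# requireNamespace() returns FALSE when the package is not installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is not installed. To install it from GitHub, run:")
  message('  install.packages("remotes")')
  message('  remotes::install_github("', repo, '")')
  # install.packages("remotes")     # one-time helper package from CRAN
  # remotes::install_github(repo)   # builds and installs from GitHub
}
```

The "not available" message appears because install.packages() only looks in CRAN-style repositories by default, while remotes::install_github() downloads the source from GitHub and builds it locally.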
Week 2:
- Homework: Homework 2 has been posted.
- RNotebook: Suggestions for completing your homework for the class. It should be easy to read.
- Presentation:
- RNotebook:
- Reference book: r4ds
- Join the class Slack channel. Get a Slack account using your horizon.csueastbay.edu account. You should be able to find the workspace csueb-ds-650.slack.com. Just copy this link into a browser and then log into Slack.
- R Cheatsheets:
- Presentation:
- Project: The Project will be posted at the end of next week.
- Midterm: The Midterm will be a take-home.
- RNotebook:
- RNotebook: with answers to the questions
Week 1:
- Presentation:
- Homework: Homework 1 has been posted.
- RNotebook:
- Spotlight Software:
References:
Learning R:
- Data Camp: Introduction to R
- Data Camp: Machine Learning with Tree-Based Models in R
- Pluralsight: TryR
- RProgramming.net
- Introduction to MRO
- TidyTuesday
Learning Python:
Learning SQL:
Other classes. What is the difference between Statistical Learning, Machine Learning, Data Science, Data Mining, KDD, etc.?
- Stanford Statistical Learning Class
- Stanford Machine Learning Class Machine Learning Open Classroom
- CMU Machine Learning
- CalTech Learning from Data
- Galvanize
- General Assembly
- Simplilearn
- Microsoft Data Science
- The Open Source Data Science Masters
- IBM Cognitive Class
- Intel AI Developer Program
Excellent References:
Data Science:
- r4ds
- ModernDive
- Yarrr!
- R Markdown: The Definitive Guide
- R Data Science Essentials
- Python Data Science Essentials
- Doing Data Science
- Data Science from Scratch
- Data Driven (fast easy read)
- A Simple Introduction to Data Science
Reading related to the Digital Economy:
- The Second Machine Age: Work, Progress and Prosperity in a Time of Brilliant Technologies
- Race Against the Machine
- Wired For Innovation
- Strategies for e-business success
- Understanding the Digital Economy
- The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power
- AI for Marketing and Product Innovation: Powerful New Tools for Predicting Trends, Connecting with Customers, and Closing Sales machineVantage Videos
More Big Picture:
- Fourth Paradigm of Science: Data-Intensive Scientific Discovery
- McKinsey Global Institute Big Data: The next frontier for innovation, competition, and productivity
- leada The Data Analytics Handbook
- Data Analysts + Data Scientists
- CEO's + Managers
- Researchers + Academics
- Big Data Edition