Stat 694: Applied Research in Statistics and Biostatistics
Department of Statistics and Biostatistics, CSU East Bay
Fall 2023:
Week 15:
- If anyone is ready to give their class presentation for their class project, please let me know.
- Today we are going to go over some simulation and calculation examples for the confidence interval, power, effect size, and likelihood ratio test.
- Simulation of the confidence level
- Power and Effect Size
- Likelihood Ratio Test examples
- Distribution of the p-value under the null hypothesis.
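The first and last items above can be sketched in a few lines of code. This is a minimal sketch, not the class notebook: it uses only the Python standard library, and the function names, sample size, mean, standard deviation, and seed are all assumptions chosen for illustration. It estimates how often a nominal 95% interval for the mean covers the true mean, and it generates p-values from a two-sided z-test when the null hypothesis is true.

```python
import math
import random
import statistics


def ci_covers(sample, mu, z=1.96):
    """Return True if the z-based interval for the mean contains mu."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return xbar - z * se <= mu <= xbar + z * se


def simulate_coverage(n=50, mu=10.0, sigma=2.0, reps=5000, seed=694):
    """Share of simulated normal samples whose interval covers the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        hits += ci_covers(sample, mu)
    return hits / reps


def simulate_p_values(n=50, reps=5000, seed=694):
    """Two-sided z-test p-values for H0: mu = 0, with H0 true and sigma = 1 known."""
    rng = random.Random(seed)
    norm = statistics.NormalDist()
    p_values = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = statistics.fmean(sample) * math.sqrt(n)
        p_values.append(2 * (1 - norm.cdf(abs(z))))
    return p_values


if __name__ == "__main__":
    print("estimated coverage:", simulate_coverage())
    p = simulate_p_values()
    print("share of p-values below 0.05:", sum(v < 0.05 for v in p) / len(p))
```

The estimated coverage should land close to the nominal 95%, and because the p-value is uniformly distributed under the null hypothesis, roughly 5% of the simulated p-values should fall below 0.05.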
Week 14:
- Class tomorrow, Friday Dec. 1, 2023 will be online from noon to 2pm.
- Today: We will discuss the class project, what to prepare for the class presentation and what to submit at the end of the course to show your progress on your project.
- Class Presentations: Your presentation can be a PowerPoint presentation or a Quarto Notebook presentation. It can be a well-organized demo of your code and notebook, or it can be a discussion of the findings of your data analysis. I leave the kind of presentation up to you. The best presentations are the ones that leave people with a small number of memorable take-away ideas. Finally, say a few words about further work that you plan on pursuing.
- Project Submission: You should submit a summary report of the progress you have made on your project. 1) Your report should start with a summary of what you have worked on and summarize the data you have used in your project. 2) Your report should explain in sentences what you have done. If you have manipulated your data, explain how you have worked with your data. If you have developed models, explain what models you have worked on and what the results were. 3) Explain your findings. Did your approach to working on your project work? What did you learn? 4) Suggest how you would proceed to improve your work and what your next steps will be.
- Today: After discussing the class project we will discuss some topics related to conducting hypothesis testing and reporting in R and JASP.
- Running Hypothesis Tests:
- Effect Size:
- Power:
- Reporting:
- Performance:
- Doing Hypothesis Testing in R and beyond: Questions: Should we be doing hypothesis testing in R or Python by programming? Should we be doing hypothesis testing on a calculator? Should we be using a spreadsheet program? Should we be using point-and-click software? Should we be using a combination of these approaches? What is the best way to do hypothesis testing? What is the best way to report the results of hypothesis testing?
- Quarto Notebook:
Week 13:
- Class Presentations: After the Thanksgiving Holiday week we will start to have project presentations. Your presentation can be a PowerPoint presentation, it could be a well-organized demo of your code and notebook, or it could be an overall discussion of the findings of your data analysis. I leave the kind of presentation up to you. The best presentations are the ones that leave people with a small number of memorable take-away ideas. Finally, say a few words about further work that you plan on pursuing.
- Quarto Notebook: kNN
- Presentation: kNN
- Presentation: k-NN Diagnosing Breast Cancer
- Presentation: Clustering
Excellent References:
Machine Learning with R:
- Machine Learning with R Read online from the University Library Databases A-Z > O > O-Reilly Online Learning E-Books
- Introduction to Statistical Learning
- Elements of Statistical Learning
- Computer Age Statistical Inference
- Hands-On Machine Learning with R
- tidymodels
- mlr3 book
- Introduction to Machine Learning (I2ML)
- Interpretable Machine Learning
Machine Learning with Python:
- w3schools Machine Learning in Python
- Real Python Machine Learning Tutorials
- Introduction to Machine Learning with Python
- scikit-learn
- pycaret
- pycaret book
- Python Machine Learning Read online from the University Library Databases A-Z > O > O-Reilly Online Learning E-Books
Big Picture:
- Fourth Paradigm of Science: Data-Intensive Scientific Discovery
- McKinsey Global Institute Big Data: The next frontier for innovation, competition, and productivity
- Data Scientist: The Sexiest Job of the 21st Century
- Is Data Scientist Still the Sexiest Job of the 21st Century?
- Data Driven, Data Jujitsu, Building Data Science Teams, Ethics and Data Science
LLMs
- Here is a nice introduction to LLMs and ChatGPT from Packt Publishing that is on the O’Reilly platform, which is available online from the University Library. Introduction to ChatGPT and OpenAI Read online from the University Library Databases A-Z > O > O-Reilly Online Learning E-Books
Week 12:
- No class. Holiday.
Week 11:
- Today: I will be giving an introductory discussion to a visiting group of interested Data Science students from a local Community College during the first hour of the class. I will be logging into Zoom during my discussion for our class.
- Quarto Notebook: kNN
- Holiday: No class on Friday, November 20th because of the university holiday.
- Today: We will finish our discussion about p-values and the problems with statistical significance in research.
- Reproducibility:
Week 10:
- Today: We will take a second look at how p-values behave with small sample sizes.
- Reference:
- RNotebook:
Week 9:
- Today: We will discuss problems with the term “statistically significant” and some issues around an effort to rename it “statistically discernible”. We will continue to discuss what an API is and test out the tidytransit R package to download data for BART.
- References:
- Sifting the evidence—what’s wrong with significance tests?
- Scientific method: Statistical errors
- Scientists rise up against statistical significance
- The Significant Problem of P Values
- ASA p-value statement
- ASA II: The ASA’s P-value Project: Why it’s Doing More Harm than Good
- Statistically discernible
Week 8:
- Today: We will take a look at the BART website and try to download data from the BART website using the new API.
- R packages:
- BART links:
- Then we will take a look at the BART Analytics website to see how data is commonly stored in json files on the internet. Finally we will discuss what an API is and try to test out an example of using an API.
- In the past we looked at an example of JSON formatted data. See the analytics bart.gov website. This is a good example of dynamically updated data. Download the JSON files and read them into Excel.
- RNotebook:
- BART.Rmd
- BART.nb.html
- BART.docx
- BART.pdf
- BART.zip updated
- Last thing today is to take a look at the Parking Meter data from DataSF. For next time get an account.
Week 7:
- Today: Visualization using ggplot2.
- NYC ASA Event: ASA VIRTUAL TRAVELING COURSE - DATA VISUALIZATION WITH R flyer
- Spotlight Youtube:
- Spotlight Podcasts:
- Spotlight blogs:
- Spotlight books:
- Data Visualization using ggplot
- Stat. 651 Presentation:
- RNotebook:
- RNotebook:
- RNotebook:
- RNotebook: Maps, storms on a map
Week 6:
- Class today, Friday September 29, 2023 will be online from noon to 2pm.
- Today: Project Stand-up. Today everyone gives an informal summary of their project and progress, and receives suggestions for next steps.
- The main suggestion is to focus on getting started on a small reasonable idea that is possible to complete in the time allowed.
- Assignment: Homework 5 has been posted.
- Gantt Charts ganttrify
- Today: Visualization using ggplot2.
- Spotlight Website: TED The best Hans Rosling talks you’ve ever seen
- Spotlight Website: Hans Rosling’s 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four
- Today: we will experiment with the ggplot2 commands for Data Visualization.
- Books:
- Stat. 651 Presentation:
- RNotebook:
- RNotebook:
- RNotebook:
- RNotebook: Maps, storms on a map
Week 5:
- Today: we will experiment with the dplyr commands for Data Wrangling. We will discuss the pipe |> and its RStudio keyboard shortcut, Ctrl+Shift+M.
- Question: How many ways are there to make change for a dollar? Develop a way to count the number of ways to make change. First think about what it means to make change.
- Quarto Notebook:
- Question: Use Bing Chat to ask for the R code to count the number of ways to make change for a dollar. See this Bing Chat history for an example of what is possible, Bing Chat.
- Quarto Notebook: with answers to the questions
- Questions: StarWars_questions.txt
- Assignment: Homework 4 has been posted.
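One way to approach the change-making question above is dynamic programming: build up the number of ways to reach each amount, adding one coin denomination at a time so that the order of coins does not matter. The sketch below is only one possible answer, written in Python rather than R; the function name and the default coin set (penny, nickel, dime, quarter, half-dollar) are assumptions, and the count depends on which coins you decide to allow.

```python
def count_change(amount, coins=(1, 5, 10, 25, 50)):
    """Count the unordered ways to make `amount` cents from the given coins."""
    # ways[t] holds the number of ways to make t cents with the coins seen so far.
    ways = [1] + [0] * amount
    for coin in coins:
        for total in range(coin, amount + 1):
            ways[total] += ways[total - coin]
    return ways[amount]


if __name__ == "__main__":
    print(count_change(100))                   # pennies through half-dollars
    print(count_change(100, (1, 5, 10, 25)))   # no half-dollar
```

With pennies through half-dollars the count is 292; dropping the half-dollar gives 242. Deciding which coins count is part of thinking about what it means to make change.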
Week 4:
- This week’s class meeting will be online only. I will not be in the classroom. Please make arrangements to log into the Zoom meeting during class this week on Friday at noon.
- Today: Begin to look at Data Wrangling with the Tidyverse and Prompt Engineering.
- Book: mdsr3e
- Book: r4ds2e
- Presentation:
- Quarto Notebook:
- Quarto Notebook: with answers to the questions
- Questions: StarWars_questions.txt
- Update: There is a new R package called starwarsdb. Update the above presentation and RMarkdown documents.
- Access Google Colab in your university Colab account. Add the CoLab Google App. Under Runtime > Change Runtime Type, you can select R.
- Test out some R code running on Colab or start to work with Python.
- Learn R or Python: Statology
- What is prompt engineering? Try the following prompt: “Acting as a Professor of Statistics and Machine Learning, present a tutorial introducing the ideas of Prompt Engineering, give examples of good prompts for teaching the ideas of linear regression and logistic regression, give further examples introducing the Naive Bayes Classifier and the ideas behind Decision Trees. Give some Python code that can be run on an example dataset for Naive Bayes.”
- Online Course: IBM CognitiveClass.ai The Art of Prompt Engineering
- Assignment: Homework 3 has been posted.
Week 3:
- For those that are not registered for the class, but are registered CSU East Bay MS Statistics or Biostatistics students, I have posted an Announcement in the Statistics Majors Canvas group. Please let me know if you are able to find the Announcement with the links and please let me know if you are able to view them.
- Assignment: Homework 1 and Homework 2 have been posted.
- Today: We will discuss the use of Git and GitHub. Please create a GitHub account if you do not already have one.
- We will create a first GitHub Repository.
- We will create an RProject with git.
- Next week we will create an RProject with version control on GitHub.
- We will generate an SSH key in RStudio and use it to give RStudio access to GitHub. In RStudio: Tools > Global Options > Git/SVN > Create SSH Key… > Create > Copy Key. In GitHub: Settings (pull-down menu, upper right) > SSH and GPG keys > New SSH key > Paste your key > Add SSH key.
- Finally we will download some files into our version controlled RProject and Commit changes and Push update to GitHub. Download the ob_prob7_suess files into your RProject, commit them and push them to your GitHub Repository.
- Happy Git and GitHub for R users
- Today: We will install and try the DataExplorer R package for AutoEDA.
Week 2:
- See the Assignment link for the Homework assignments.
- Start today by logging into O’Reilly Online Learning E-Books from the library website. A-Z Databases Note there are books from Apress, Packt, Manning, O’Reilly, CRC, etc.
- How does using the O’Reilly Online Learning E-Books differ from using other similar platforms, such as Packt or LinkedIn Learning?
- Demo the differences between R scripts and R Projects (always use R Projects). Introduce Quarto.
- Install the following R packages that contain datasets: palmerpenguins, fueleconomy, nycflights13
- Big Data websites: EPA Download Fuel Economy Data, BTS Airline On-Time Statistics, Lyft BayWheels
- Install the following R package for AutoEDA: DataExplorer
- Download the following files onto your computer and create an R Project containing these files.
- Read r4ds2e
Week 1:
- Welcome to the Data Science Workgroup.
- We will be meeting on Fridays at noon.
- Discuss topics of discussion for the semester.
- Guest speakers?
- What will be your research project for the semester? A first draft will be due next week.
- I have begun working on the COVID19 Data Hub. I am looking for people who can help monitor the GitHub comments and test the software on a regular basis.
- I have begun experimenting with ChatGPT and Bing Chat for learning about Data Science and thinking about how to use AI responsibly for learning.
Spring 2023:
Week 7:
- There will be no meeting next week on Fri. March 10.
- Today we will discuss an example related to Recommendation Engines. We will discuss Market Basket Analysis, Association Rules, and Transactional Data.
- Presentation: Transactional Data and Association Rules
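The three standard measures used to rank association rules (support, confidence, and lift) can be computed by hand on a tiny set of transactions. The sketch below is only an illustration in plain Python: the five baskets are made up, the function names are assumptions, and it is not the code from the presentation. In R the arules package is a common choice for this kind of analysis.

```python
# Hypothetical market-basket data: each transaction is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimated P(rhs | lhs): support of the combined itemset over support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence of the rule relative to how common rhs is on its own."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


if __name__ == "__main__":
    rule = ({"diapers"}, {"beer"})
    print("support:", support(rule[0] | rule[1], transactions))
    print("confidence:", confidence(*rule, transactions))
    print("lift:", lift(*rule, transactions))
```

For the rule {diapers} -> {beer} in this made-up data, the support is 0.6, the confidence is 0.75, and the lift is 1.25. A lift above 1 suggests the two items appear together more often than they would if they were bought independently.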
Week 6:
- Try out Google CoLab for R. Access CoLab from your university Google account. https://colab.research.google.com/notebook#create=true&language=r
- Here is a CoLab notebook that simulates the solution to an interesting probability question and runs the simulation in parallel: Google interview question.ipynb
Week 5:
- Today we will explore the nycflights13 dataset. We will explore it using the DataExplorer R package. And we will clean up some of the columns of data and then merge the data into one large dataframe.
- We will also take a look at the openxlsx and dbplyr R packages. We will see how to write the 5 data tables to an Excel Workbook and see how to export the data into an SQLite database.
- R project:
- Software Spotlight:
Week 4:
- Today we will discuss AutoEDA.
- What is EDA?
how many rows, n, number of observations
how many columns, number of variables
how many numeric variables, how many categorical
examine the amount of missing data NA
what are the summary statistics for each numeric variable
what are the summary statistics for each categorical variable
visualize the numeric data, is the data symmetric or skewed, is your data normally distributed
visualize the categorical data
what numeric variables are correlated? make scatterplots
Are there any outliers in variables?
variable names, do they follow good practice
clean the variable names so they contain no spaces, for example with clean_names() from the R janitor package
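The first few items on the checklist above can be automated in a few lines, which is the idea behind AutoEDA. The sketch below is a rough illustration using only the Python standard library. The four-row table is made up, `summarize` and `clean_name` are hypothetical helper names, and `clean_name` only loosely imitates what janitor's clean_names() does in R.

```python
import re
import statistics


def clean_name(name):
    """Lower-case a column name and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def summarize(rows):
    """Report row/column counts, missing values, and basic per-column summaries."""
    columns = list(rows[0].keys())
    report = {"n_rows": len(rows), "n_cols": len(columns), "columns": {}}
    for col in columns:
        values = [r[col] for r in rows]
        present = [v for v in values if v is not None]  # None marks a missing value
        info = {"missing": len(values) - len(present)}
        if present and all(isinstance(v, (int, float)) for v in present):
            info["type"] = "numeric"
            info["mean"] = statistics.fmean(present)
            info["median"] = statistics.median(present)
        else:
            info["type"] = "categorical"
            info["counts"] = {v: present.count(v) for v in sorted(set(present))}
        report["columns"][clean_name(col)] = info
    return report


# Made-up rows for illustration only.
rows = [
    {"Species": "Adelie", "Body Mass (g)": 3750},
    {"Species": "Adelie", "Body Mass (g)": None},
    {"Species": "Gentoo", "Body Mass (g)": 5000},
    {"Species": "Chinstrap", "Body Mass (g)": 3800},
]
report = summarize(rows)

if __name__ == "__main__":
    print(report)
```

Here the column "Body Mass (g)" is reported under the cleaned name body_mass_g, with one missing value and a median of 3800. Visualization, correlation, and outlier checks are left to the plotting tools and AutoEDA packages.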
- Question: What are you interested in learning about this semester and, if you are interested in working on a project, what is the topic of a project you are interested in working on this semester.
- I would like you to read the arXiv.org paper The Landscape of R Packages for Automated Exploratory Data Analysis and familiarize yourself with the author’s GitHub mstaniak autoEDA-resources. The GitHub gives the names and links to all of the main AutoEDA R packages that are available.
- Another nice paper about autoEDA in R is the SmartEDA: An R Package for Automated Exploratory Data Analysis.
- Start by installing the following R packages:
- To start learning about missing data I would like you to read over the following book: Flexible Imputation of Missing Data
- Start by installing the following R packages:
- Quarto Notebook:
- Linux News: Dat Linux 1
- If you have an interest in learning about Linux along with learning about Data Science you might consider installing Dat Linux on an old computer. I would not recommend installing it on your main computer unless you are already familiar with Linux.
- All of the main tools for Data Science can be easily found and installed using Dat Linux.
- There is a very nice list of books that are available online.
- So this is a great place to start learning Linux along with Data Science.
Week 3:
- Today we will discuss what topics to work on this semester.
- Learn more basic R and Python code from statology.
- Work with a large dataset from Lyft BayWheels or BART Ridership.
- The world is changing with Large Language Models (LLMs).
- What can we do with LLMs? Many new things. Given a text prompt we can generate images, generate movies, write papers, explain things, learn things, etc.
- open.ai
- google.com
- Let’s test out RTutor, which is based on GPT-3. Try some prompts to analyze some data.
- It would be good if we all learn something about how to be a capable Prompt Engineer this semester for doing Statistics and Data Science.
- We need to learn about Deep Learning Models, Attention and Stable Diffusion.
- Follow me on Twitter @esuess for re-tweets about Data Science. (Moving to Mastodon, but not active yet.)
Weeks 1 & 2:
- Unfortunately I have had conflicts on the first two Fridays this semester. I am hoping that we can start next week.
- It would be good to hear what people are interested in discussing this semester related to Data Science.
- My current interest is in testing out ChatGPT to see how helpful it is for learning about Probability, Statistics, Data Science, Machine Learning, and beyond.
- If you are interested in getting started with ChatGPT, create an account and start testing it out by entering questions. The most effective way I have learned so far is to start each question with, “Acting as a statistician, explain *****.” This seems to work quite well. I am interested in hearing what your experience is like.
- I have also been testing RTutor.ai with the datasets that are provided and I have tested a few datasets I am interested in. I would suggest finding a homework problem and asking RTutor to create the R code to compute the answers to the problem and then create the Report.
Fall 2022:
Week 16:
- Class today, Friday December 2, 2022 will be online from noon to 2pm.
- Class Presentations will be online today. To do the presentation you will be made a co-host and you can share slides or code.
- If you would rather present in person, please let me know and we can do that next Friday December 16 from the classroom.
Week 15:
- Class tomorrow, Friday November 4, 2022 will be online from noon to 2pm.
- Class Presentations: After the Thanksgiving week we will start to have project presentations. Your presentation can be a PowerPoint presentation, it could be a well-organized demo of your code and notebook, or it could be a discussion of the findings of your data analysis. I leave the kind of presentation up to you. The best presentations are the ones that leave people with a small number of memorable take-away ideas. Finally, say a few words about further work that you plan on pursuing.
- Presentation: kNN
- Presentation: k-NN Diagnosing Breast Cancer
- Presentation: Clustering
Excellent References:
Machine Learning with R:
- Machine Learning with R Read online from the University Library Databases A-Z > O > O-Reilly Online Learning E-Books
- Introduction to Statistical Learning
- Elements of Statistical Learning
- Computer Age Statistical Inference
- Hands-On Machine Learning with R
- tidymodels
- mlr3 book
- Introduction to Machine Learning (I2ML)
- Interpretable Machine Learning
Machine Learning with Python:
- w3schools Machine Learning in Python
- Real Python Machine Learning Tutorials
- Introduction to Machine Learning with Python
- scikit-learn
- pycaret
- pycaret book
- Python Machine Learning Read online from the University Library Databases A-Z > O > O-Reilly Online Learning E-Books
Big Picture:
- Fourth Paradigm of Science: Data-Intensive Scientific Discovery
- McKinsey Global Institute Big Data: The next frontier for innovation, competition, and productivity
- Data Scientist: The Sexiest Job of the 21st Century
- Is Data Scientist Still the Sexiest Job of the 21st Century?
- Data Driven, Data Jujitsu, Building Data Science Teams, Ethics and Data Science
Week 14:
- No class. Holiday.
Week 13:
- Class tomorrow, Friday November 4, 2022 will be online from noon to 2pm.
- Topics: Continue the discussion from last week. Bootstrap.
- Spotlight book: Modern Dive See Chapter 7 for a discussion about Sampling and Chapter 8 for an introduction to Bootstrapping.
- Spotlight book: islr
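The bootstrap idea from this week's topics can be sketched briefly: resample the observed data with replacement many times, recompute the statistic each time, and use the spread of those values to build an interval. This is a minimal sketch in standard-library Python, not code from either book. The ten data values, the function names, the number of resamples, and the seed are all assumptions for illustration, and the percentile cutoffs are picked by simple index arithmetic rather than a refined quantile rule.

```python
import random
import statistics


def bootstrap_means(data, reps=2000, seed=694):
    """Means of `reps` resamples of `data`, each drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(reps)]


def percentile_interval(values, level=0.95):
    """Crude percentile interval: cut off (1 - level) / 2 of the values in each tail."""
    ordered = sorted(values)
    alpha = (1 - level) / 2
    lo = ordered[int(alpha * len(ordered))]
    hi = ordered[int((1 - alpha) * len(ordered)) - 1]
    return lo, hi


# Made-up sample for illustration only.
data = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]
boot = bootstrap_means(data)
interval = percentile_interval(boot)

if __name__ == "__main__":
    print("sample mean:", statistics.fmean(data))
    print("bootstrap 95% percentile interval:", interval)
```

The same pattern works for other statistics, such as the median or a correlation: replace `statistics.fmean` with the statistic of interest.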
Week 12:
- Class tomorrow, Friday October 28, 2022 will be online from noon to 2pm.
- Topics: This week we will begin discussing Machine Learning. We will discuss Training/Testing Datasets, Classification and Prediction algorithms, and Accuracy. We will introduce kNN and Clustering.
- Statistical Machine Learning:
- Presentation: Welcome
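The training/testing split, kNN classification, and accuracy introduced this week fit together as shown in the sketch below. It is an illustration only, in standard-library Python: the two-class data are simulated from two made-up clusters, and the function names, the 75/25 split, k = 3, and the seed are assumptions rather than anything taken from the presentation.

```python
import math
import random
from collections import Counter


def knn_predict(train, point, k=3):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Share of test points whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


def make_data(n_per_class=100, seed=694):
    """Simulate two well-separated 2-D clusters labeled 'A' and 'B'."""
    rng = random.Random(seed)
    data = []
    for label, center in (("A", (0.0, 0.0)), ("B", (4.0, 4.0))):
        for _ in range(n_per_class):
            x = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((x, label))
    rng.shuffle(data)
    return data


data = make_data()
split = int(0.75 * len(data))            # 75% training, 25% testing
train, test = data[:split], data[split:]

if __name__ == "__main__":
    print("test accuracy:", accuracy(train, test, k=3))
```

The accuracy is computed on the held-out test set only, because the training points were used to make the predictions. Trying several values of k and comparing the test accuracy is a natural next experiment.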
Week 11:
- Class tomorrow, Friday October 21, 2022 will be online from noon to 2pm.
- Progress Reports.
Week 10:
- Class tomorrow, Friday October 21, 2022 will be online from noon to 2pm.
- Today: We will begin with progress reports. We will also start to take a look at how to use the API from the DataSF website. We will take a look at the Parking Meters data.
Week 9:
- Class tomorrow, Friday October 14, 2022 will be online from noon to 2pm.
- Today: Today we will continue to look at the BART Analytics website bart.gov and we will build a dashboard similar to the one online using the json files. We will make all of the plots using ggplot2 and then use the R packages patchwork and plotly to build a single plot with all of the subplots.
Week 8:
- Class tomorrow, Friday October 7, 2022 will be online from noon to 2pm.
- Spotlight Youtube:
- Spotlight Podcasts:
- Spotlight blogs:
- Spotlight books:
- Data Visualization using ggplot
- Today: Today we will start by testing a few ggplot2 plots. Then we will take a look at the BART Analytics website to see how data is commonly stored in json files on the internet. Finally we will discuss what an API is and test out an example of using an API.
- Today we will look at an example of JSON formatted data. See the analytics bart.gov website. This is a good example of dynamically updated data. Download the JSON files and read them into Excel. You will need to be on Windows and have the newest Excel.
- RNotebook:
- BART.Rmd
- BART.nb.html
- BART.docx
- BART.pdf
- BART.zip updated
- Last thing today is to take a look at the Parking Meter data from DataSF. For next time get an account.
Week 7:
- Next Week: Visualization using ggplot2.
- Spotlight Website: TED The best Hans Rosling talks you’ve ever seen
- Spotlight Website: Hans Rosling’s 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four
- Today: we will experiment with the ggplot2 commands for Data Visualization.
- Books:
- Stat. 651 Presentation:
- RNotebook:
- RNotebook:
- RNotebook:
- RNotebook: Maps, storms on a map
Week 6:
- Class today, Friday September 22, 2022 will be online from noon to 2pm.
- Today: Project Stand-up. Today everyone gave a summary of their project and progress, and received suggestions for next steps.
- The main suggestion was to focus on getting started on a small reasonable idea that is possible to complete in the time allowed.
- A main observation is that most publicly available data is not raw data (measurements on individuals) but summarized counts and means. So it is very difficult to produce predictions or forecasts when the raw data is not available.
Week 5:
- Class today, Friday September 16, 2022 will be online from noon to 2pm.
- Today: we will experiment with the dplyr commands for Data Wrangling. We will discuss the pipe %>% and its RStudio keyboard shortcut, Ctrl+Shift+M.
- RNotebook: Updated
- Homework05 has been posted.
- Gantt Charts ganttrify
Week 4:
- Class today, Friday September 8, 2022, will be in class and online from noon to 2pm.
- Today: Begin to look at Data Wrangling with the Tidyverse.
- Presentation:
- RNotebook:
- RNotebook: with answers to the questions
- Update: There is a new R package called starwarsdb. Update the above presentation and RMarkdown documents.
- r4ds
- Homework04 has been posted.
Week 3:
- Class today, Friday September 2, 2022, will be online from noon to 2pm.
- Today: We will discuss the use of Git and GitHub. Please create a GitHub account if you do not already have one.
- We will create a first GitHub Repository.
- We will create an RProject with git.
- Next week we will create an RProject with version control on GitHub.
- We will generate an SSH key in RStudio and use it to give RStudio access to GitHub. In RStudio: Tools > Global Options > Git/SVN > Create SSH Key… > Create > Copy Key. In GitHub: Settings (pull-down menu, upper right) > SSH and GPG keys > New SSH key > Paste your key > Add SSH key.
- Finally we will download some files into our version controlled RProject and Commit changes and Push update to GitHub. Download the ob_prob7_suess files into your RProject, commit them and push them to your GitHub Repository.
- Happy Git and GitHub for R users
- Today: We will install and try the DataExplorer R package for AutoEDA.
Week 2:
- See the Assignment link for the Homework assignments.
- Start today by logging into O’Reilly Online Learning E-Books from the library website. A-Z Databases Note there are books from Apress, Packt, Manning, O’Reilly, CRC, etc.
- How does using the O’Reilly Online Learning E-books differ from using other similar platforms, such as Packt or LinkedIn Learning?
- Demo the differences between R scripts and R Projects (always use R Projects). Introduce Quarto.
- Install the following R packages that contain datasets: palmerpenguins, fueleconomy, nycflights13
- Install the following R package for AutoEDA: DataExplorer
- Download the following files onto your computer and create an R Project containing these files.
Week 1:
- Welcome to the Data Science Workgroup. book
- We will be meeting on Fridays at noon.
- Discuss topics of discussion for the semester.
- Guest speakers?
- What will be your research project for the semester? A first draft will be due next week.
- I have begun working on the COVID19 Data Hub. I am looking for people who can help monitor the GitHub comments and test the software on a regular basis.
Week 0:
- Welcome to Statistics 694.