Statistics 650: Homework


Final: (due Sunday October 11, 2020 by midnight)

Complete the final in an R Notebook, using the usual file-naming structure. Submit your .docx or .pdf and your .Rmd files.

Return of the List Column: Episode P

For the final I would like you to return to the starwars dataset from the dplyr package.

  1. Show a table of the starwars dataset. What type are the films, vehicles, and starships variables? Show the first element of each variable.
  2. Using the example from the purrr package, compute the slope of the regression mass ~ height for each of the two species, Human and Droid. (Hint: Be sure to use drop_na(mass, height) to remove any rows with missing values before trying to fit the regression.)
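
The split-fit-extract pattern that question 2 refers to can be sketched as follows. This is a minimal sketch, assuming the example meant is the one in the purrr documentation that fits a regression to each group of the built-in mtcars data; the object names by_cyl, fits, and slopes are made up for illustration. It does not use the starwars data, so the variables and groups still have to be changed for the final.

```r
library(purrr)

# Split the built-in mtcars data into one data frame per cylinder count.
by_cyl <- split(mtcars, mtcars$cyl)

# Fit the same regression to every piece; the result is a list of lm objects.
fits <- map(by_cyl, ~ lm(mpg ~ wt, data = .x))

# Pull the slope (the coefficient on wt) out of each fit
# as a named numeric vector, one entry per group.
slopes <- map_dbl(fits, ~ coef(.x)[["wt"]])
slopes
```

The list of fitted models is itself a list, which is why map() and map_dbl() are used instead of ordinary vectorized functions.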

Homework 6: updated 2020-09-30 (not collected)


Homework 5: (due Sunday October 11, 2020 by midnight)

Problems:

  1. Exercises 5.7

Project

(due Sunday October 11, 2020 by midnight)

Instructions: This is a take-home project. Your work is to be completed individually. You may use your book, Google, and questions can be asked of the instructor through Slack (so everyone in the class can see what the questions are and what the answers are) or by email. You are not to share code with others in the class.

Using an R Notebook produce your solutions to the following questions. Start by creating an R Notebook Lastname_Firstname_Stat650_Project.Rmd. Answer the questions in order.

All data come from somewhere

There are two parts of the project.

Part I: Create an R dataset similar to nycflights13 for the flights out of the Bay Area in 2019. The three airports to use are San Francisco, Oakland, and San Jose.

Part II: Answer the same questions, from the book, using the data from the Bay Area in 2019.

Part I:

Introduction:

The questions for Part I are designed to be similar to the questions you might be asked after you have "come up to speed" in your new Data Science job. Your job is to fully replicate an existing dataset with data from a different location. Your assignment is to download the relevant data and then analyze it so that it can be compared with the current results from the data that are already available.

The job: (a different job from the Midterm)

Suppose you are working for a travel company and you have been tasked with creating a dataset for the Bay Area similar to the already available nycflights13 dataset. The task is to find the original website for the Airline On-Time Performance Data and create a dataset called baflights19.

You have two weeks to get the data and code to answer the same questions and some new questions.

The current code:

You have already analyzed the nycflights13 as part of the Homework. This code will be helpful in this assignment.

You may start with the baflights19.zip. This R Project contains an R Script that downloads the data using the anyflights R package. Update: the newly posted .zip file contains the data. Currently the anyflights package has some open issues on GitHub, so the function to download the data does not work on all platforms.

The original airlines data that Hadley Wickham created for the ASA Data Competition is available on the Data Expo: Airline on-time performance website.

Questions to answer:

These questions should be answered in order, in your R Notebook.

  1. The US Government website Airline On-Time Performance Data is where the data can be downloaded. What government agency hosts this website and how can you download the data? Download the data for the Bay Area airports, SFO, OAK, and SJC, for the available months in 2019. Try to download the same columns as in the flights dataframe in nycflights13. Can you do this? If not, what can you download? What has to be done to produce the same variables and data for the Bay Area airports?
  2. Now use the anyflights R package to create the same dataframes as in the nycflights13 dataset, but for the Bay Area airports in 2019. Name the dataset baflights19. You may start with the baflights19.zip. Update: the newly posted .zip file contains the data. Currently the anyflights package has some open issues on GitHub, so the function to download the data does not work on all platforms. This part is complete. Run the fs::dir_ls("data") command to see that the files are in the data subdirectory.
  3. Once you have your data downloaded, develop your code for the first month of data. The last step will be to include all of the data and perform an overall analysis for 2019. The data include all flights that departed from the Bay Area, that is, all flights departing from San Francisco (SFO), Oakland (OAK), and San Jose (SJC). How many departing flights were there in January 2019? How many departing flights were there from each airport in January 2019?
  4. Compare the variables that are available in the baflights19 flights dataframe with the variables in the nycflights13 dataframe. Make a table of the variables that are in both datasets, with a description of each variable. Hint: In RStudio see Help > RMarkdown Quick Reference > Tables. Report any differences in the variables.
  5. Answer Exercises 4.2, 4.3, 4.4 (you may only be able to answer part of 4.4) on page 89 of the book, changing nycflights13 to baflights19. Answer all of the questions for the Bay Area in 2019.
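
The counting in question 3 and the variable comparison in question 4 both come down to a few dplyr and base R calls. The following is a minimal sketch on a made-up five-row data frame, not the real data: flights_toy and nyc_cols_toy are hypothetical stand-ins, and the column names are assumed to follow the nycflights13 flights dataframe.

```r
library(dplyr)

# Hypothetical stand-in for a flights data frame: one row per departing flight.
flights_toy <- tibble(
  month  = c(1, 1, 1, 2, 1),
  origin = c("SFO", "OAK", "SFO", "SJC", "SJC")
)

# Keep January only, then count all flights and flights per origin airport.
jan        <- filter(flights_toy, month == 1)
n_jan      <- nrow(jan)
by_airport <- count(jan, origin)

# Hypothetical list of column names from the other dataset.
nyc_cols_toy <- c("month", "origin", "dest", "air_time")

# Variables present in both, and variables missing from the toy data.
in_both  <- intersect(names(flights_toy), nyc_cols_toy)
only_nyc <- setdiff(nyc_cols_toy, names(flights_toy))
```

With the real data, names() on each flights dataframe gives the two vectors to compare, and the result can be written up as an R Markdown table.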

Part II.

Questions to answer.

  1. Answer Exercises 4.6 and 4.7 for the Bay Area 2019 data. (Most of the questions asked cannot be answered due to the lack of data.)

For 4.6 there should be enough data to look for outliers in the wind_speed variable.

For 4.7 there should be enough data to look at visib over the months.


Homework 4:

(due Monday September 28, 2020)

Problems:

  1. Exercises 5.1, 5.4

Midterm

(due Friday September 25, 2020 by midnight)

Instructions: This is a take-home midterm. Your work is to be completed individually. You may use your book, Google, and questions can be asked of the instructor through Slack (so everyone in the class can see what the questions are and what the answers are). You are not to share code with others in the class.

Using an R Notebook produce your solutions to the following questions. Start by downloading the fordgobike01.Rmd and renaming it Lastname_Firstname_Stat650_Midterm_lyftbaywheels.Rmd. Get the code to work and, once you have created the dataset and dataframe for the 2017-2018 years, remove all of the code that is not related to the questions you are asked to answer.

Introduction:

The questions on the Midterm are designed to be similar to questions you might be asked during the first week of a new Data Science job. The questions relate to "coming up to speed" on a current project being worked on by others in the company. Your job is to fully understand the problem and the current state of the code. Your assignment is to update and add to the code to answer some new questions.

The new job:

Suppose you have been hired at a new Bike Share company. Your new company has been looking at the public data available from the Lyft Bay Wheels company for insights about when people use their bikes. You have one week to get the current code working and update it and to add to the code to answer some new questions. See the Questions to answer: below.

The current code:

The current code is hosted on the instructor's Meetup group, The East Bay R Language Beginners Group. Suppose you were hired at the Meetup.

  1. Read over the presentation to begin understanding the data and the work that has been done so far.
  2. The data that will be analyzed is the System Data. Familiarize yourself with the variables in the data.
  3. The data is shared using the General Bikeshare Feed Specification (GBFS). The .csv files you will be downloading from the Lyft Bay Wheels website were created from this API. We will not be using the API on the Midterm; it will be discussed later.
  4. Get the code running. Download the fordgobike01.Rmd and rename the file Lastname_Firstname_Stat650_Midterm_lyftbaywheels.Rmd. As you are getting the code to run, clean the code up. That is, remove any code chunks that are not related to the questions being asked or that do not run. (Hint: The ggmap code does not run without a Google Maps API key. This will be discussed in Stat. 651.) As you understand what the code does, add comments to the file to more clearly explain what each chunk of code does. If there are any redundant parts of the code, remove them.

Questions to answer:

These questions should be answered in your R Notebook. Each question should start on a new page. All answers need to be computed using R code.

  1. Explain what the GBFS is.
  2. Explain any difficulties you encountered getting the code to work.
  3. The analysis is to work with the data since Lyft BayWheels started, so start with the data since May 2019. Modify the code to download the data to be analyzed. How many bike rentals were there before the COVID-19 lockdown in CA? How many bike rentals have there been after the lockdown? How many bike rentals have there been since the beginning of Lyft BayWheels?
  4. There is a part of the code that uses the as.integer() function for some reason. Explain what this function is being used for in the code.
  5. In 2020, what month had the highest number of riders? What month had the lowest number of riders? Interpret any seasonal patterns.
  6. What start station had the highest number of rides? That is, which start station was used most to start rides?
  7. Using the Amelia R package and the missmap() function, determine the rate of missing data in the month of June 2020. Or try the visdat package and the vis_miss() function. Or check out the naniar R package. (This might not work on your computer if you have too little RAM.) If you cannot get your code to run, sample the data first.
  8. What type of rider uses Lyft BayWheels more: Subscribers or Customers?
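
For question 7, the rate of missing data can also be computed directly, and sampling rows first keeps memory use down before a plotting function such as missmap() or vis_miss() is called. The following is a minimal sketch on a made-up data frame: rides_toy and its column names are hypothetical and only stand in for one month of ride data.

```r
library(dplyr)

# Hypothetical stand-in for one month of ride data, with some missing values.
rides_toy <- tibble(
  start_station_id = c(10, NA, 12, 13, NA, 15, 16, 17),
  user_type = c("Subscriber", "Customer", NA, "Subscriber",
                "Customer", "Subscriber", "Customer", "Subscriber")
)

# Proportion of missing values in each column, and over the whole table.
miss_rate_by_col  <- colMeans(is.na(rides_toy))
miss_rate_overall <- mean(is.na(rides_toy))

# Draw a random subset of rows; a smaller table is easier to plot
# when the full data do not fit comfortably in RAM.
set.seed(650)
rides_sample <- slice_sample(rides_toy, n = 4)
```

The sample size of 4 only suits the toy data; for a real month of rides a few thousand rows is a more sensible starting point.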

In progress: I am working on an update to the code below for the solutions to the Midterm.

I have updated the R Notebooks for 2020. Note that FordGoBikes has a new sponsor, Lyft, and the service is now called Lyft Bay Wheels. DO NOT USE the code below: it is not for direct use on the midterm, but you are welcome to see what it does. The code to be used is in the fordgobike01.Rmd file. We will discuss the updates when we go over the solution to the midterm later in the class.


Homework 3:

(due Monday September 21, 2020)

Using an R Notebook produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat650_hw3.Rmd. Then knit the .Rmd file to either Lastname_Firstname_Stat650_hw3.docx or Lastname_Firstname_Stat650_hw3.pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

Upload one file to Blackboard.

Problems:

  1. Run all of the code in Section 4.2 in your R Notebook.
  2. Exercises 4.1, 4.2, 4.3, 4.4, 4.5

Homework 2:

(due Wednesday September 9, 2020, because Monday September 7, 2020 is a holiday.)

Using an R Notebook produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat650_hw2.Rmd. Then knit the .Rmd file to either Lastname_Firstname_Stat650_hw2.docx or Lastname_Firstname_Stat650_hw2.pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

Upload one file to Blackboard.

Problems:

  1. Exercises 2.2, 2.3
  2. Exercise 3.1

Read:


Homework 1:

(due Monday August 31, 2020)

Using an R Notebook produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat650_hw1.Rmd. Then knit the .Rmd file to either Lastname_Firstname_Stat650_hw1.docx or Lastname_Firstname_Stat650_hw1.pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

Upload one file to Blackboard.

Problems:

  1. Appendix B Exercises B.1, B.2, B.3, B.4
  2. Appendix D Exercises D.1, D.6, D.11