To run this notebook, first run the R script in your Midterm R Project to create the lyftbaywheels dataframe for 2019 and 2020.
Note that the script saves the dataframe as .Rds, .csv, and .feather files in the data_final subdirectory. Any of these can be read into the Global Environment, where you answer the questions from the Midterm.
Read the data into the Global Environment so the R Notebook will Preview and Knit.
library(pacman)
p_load(tidyverse, tictoc, feather, naniar, DataExplorer)
Try running your data wrangling R script. Or try using the Source > Source as Local Job …
What is the advantage of using the source() function to run your R script separately?
This only needs to run once: after you have created the .Rds file, you do not need to run it again. Note the eval=FALSE chunk option, which prevents this R code chunk from running when you knit your R Notebook.
source('lyftbaywheels01-update2020-ver03.R')
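The "run once" rule can also be enforced in code by sourcing the script only when its output file is missing. The sketch below is a minimal illustration, not the project's actual setup: the temporary directory, the toy script, and the file name toy_final.Rds are all hypothetical stand-ins so that the example runs anywhere.

```r
# Hypothetical stand-ins for the data_final directory and the wrangling script.
out_dir  <- tempfile("data_final_")
dir.create(out_dir)
rds_path <- file.path(out_dir, "toy_final.Rds")

# The toy "wrangling script" just saves a tiny data frame to rds_path.
script <- tempfile(fileext = ".R")
writeLines("saveRDS(data.frame(x = 1:3), rds_path)", script)

# Source the script only if its output does not exist yet.
run_count <- 0
for (i in 1:2) {
  if (!file.exists(rds_path)) {
    source(script, local = TRUE)
    run_count <- run_count + 1
  }
}

toy <- readRDS(rds_path)
print(run_count)  # the script ran on the first pass only
```

In the real project the same guard would wrap the source() call above, with the path to the .Rds file in data_final.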
Test out some different ways to read a dataframe from the hard drive. The winner is feather!
tic()
lyftbaywheels1 <- readRDS("data_final/lyftbaywheels_final.Rds")
toc()
3.42 sec elapsed
tic()
lyftbaywheels2 <- read_feather("data_final/lyftbaywheels_final.feather")
toc()
0.954 sec elapsed
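The comparison can be reproduced on any machine with a toy dataframe. This is a rough sketch under stated assumptions: the columns ride_id and duration are invented, the files go to temporary paths, and base R's system.time() stands in for tic()/toc(). Timings depend on the data and the disk, so no expected numbers are shown.

```r
library(feather)

# Toy data frame standing in for lyftbaywheels_final (hypothetical columns).
set.seed(1)
toy <- data.frame(
  ride_id  = 1:100000,
  duration = runif(100000, 60, 3600)
)

# Write the same data in both formats to temporary files.
rds_file     <- tempfile(fileext = ".Rds")
feather_file <- tempfile(fileext = ".feather")
saveRDS(toy, rds_file)
write_feather(toy, feather_file)

# Time each read; system.time() is the base R counterpart of tic()/toc().
rds_time     <- system.time(from_rds <- readRDS(rds_file))
feather_time <- system.time(from_feather <- read_feather(feather_file))

print(rds_time["elapsed"])
print(feather_time["elapsed"])
```

Part of the difference is that saveRDS() compresses by default, so readRDS() has to decompress the file while reading.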
Your solution to the Midterm should show you have “come up to speed” on the current project of working with and analyzing the LyftBayWheels data.
Answers to the questions:
1. Explain what the GBFS is.
Answer: GBFS stands for General Bikeshare Feed Specification, a standardized data feed for bike share system availability. It is maintained by the North American Bike Share Association.
2. Explain any difficulties you encountered getting the code to work.
Answer: The answer to this question varies.
I have had many difficulties working with the FordGoBike data. They started with errors caused by the start_id and end_id variables not being read in correctly by read_csv(). Tracking the problem down to a type conflict took some effort, and converting column types was a learning experience. Dealing with the day of the week variable was another difficulty. The day() function returns a day number, but it is the day of the calendar month, so it was not useful. The wday() function turned out to be the right tool, but its label option was necessary to get day names (Mon, Tue, Wed, ...) instead of numbers.
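Both problems described above can be shown on toy inputs. This is a sketch, not the code from the original script: the CSV text and the date are made up, and the column names start_station_id and end_station_id are taken from later in this notebook.

```r
library(readr)
library(lubridate)

# Declaring column types up front avoids type conflicts between files.
csv_text <- "start_station_id,end_station_id\n3,5\n7,9\n"
rides <- read_csv(
  csv_text,
  col_types = cols(
    start_station_id = col_integer(),
    end_station_id   = col_integer()
  )
)

# 2019-05-01 fell on a Wednesday.
d <- as.Date("2019-05-01")

day(d)                 # day of the calendar month: 1
wday(d)                # day of the week as a number (Sunday = 1 by default)
wday(d, label = TRUE)  # day of the week as a labelled, ordered factor
```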
Other difficulties:
a. Running out of memory on the computer. The main way to deal with this is to reduce the amount of data imported, for example by working with only the May, June, and July 2018 data rather than all of the data.
b. Dealing with the missing data. Replacing the age values greater than 100 may change the type of the variable to character, and then it needs to be converted back to integer.
c. There were some mistakes in the code. For example, when fixing the type problems with the June, July, August 2018 data files, end_station_id was mistakenly assigned the values of start_station_id.
d. libcurl on Windows.
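For the age problem in item b, one way to avoid the type change altogether is to replace implausible ages with a typed missing value. A minimal sketch with invented values, assuming the column is called age:

```r
library(dplyr)

# Toy data; the ages are invented for illustration.
riders <- tibble(age = c(25L, 34L, 119L, 41L, 130L))

# NA_integer_ (rather than "NA" or "") keeps the column an integer,
# so no conversion back from character is needed afterwards.
cleaned <- riders %>%
  mutate(age = if_else(age > 100L, NA_integer_, age))

class(cleaned$age)
```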
3. The analysis is to work with the data since Lyft BayWheels started, so start with the data since May 2019. Modify the code to download the data to be analyzed. How many bike rentals were there before the COVID-19 lockdown in CA? How many bike rentals have there been after the lockdown? How many bike rentals have there been since the beginning of Lyft BayWheels?
Answer: See the R script that contains the data wrangling steps.
lyftbaywheels1 %>% count()
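The count above gives the total number of rentals. The before/after split can be sketched as follows, under two loudly stated assumptions: the ride start timestamp is in a column called start_time (a hypothetical name), and the cutoff is California's statewide stay-at-home order of 2020-03-19. The five toy rides are invented.

```r
library(dplyr)

# Toy rides; `start_time` is an assumed column name.
rides <- tibble(
  start_time = as.POSIXct(c(
    "2019-05-10 08:00:00", "2019-12-24 17:30:00",
    "2020-03-01 09:15:00", "2020-03-25 12:00:00",
    "2020-07-04 10:45:00"
  ), tz = "UTC")
)

# Assumed cutoff: the statewide stay-at-home order of 2020-03-19.
lockdown <- as.POSIXct("2020-03-19", tz = "UTC")

# Label each ride by period, then count rides per period.
counts <- rides %>%
  mutate(period = if_else(start_time < lockdown, "before", "after")) %>%
  count(period)

counts
nrow(rides)  # total rides since the start of the data
```

A different cutoff, such as the earlier Bay Area shelter-in-place order, only requires changing the lockdown value.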
4. There is a part of the code that uses the as.integer() function for some reason. Explain what this function is being used for in the code.
Answer: The as.integer() function is used in the fordgobike01.Rmd code to change the type of start_station_id and end_station_id in the June, July, August 2018 dataframes from character to integer, matching their type in the earlier dataframes.
Note that the read_bulk function from the readbulk R package does not have this problem.
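The role of as.integer() can be illustrated with two toy monthly dataframes whose station id columns disagree in type. This is a sketch, not the code from fordgobike01.Rmd; the dataframe names and values are invented.

```r
library(dplyr)

# Earlier month: ids were read as integers.
may_2018 <- tibble(start_station_id = c(3L, 7L),
                   end_station_id   = c(5L, 9L))

# Later month: the same columns were read as character.
june_2018 <- tibble(start_station_id = c("3", "11"),
                    end_station_id   = c("7", "5"))

# bind_rows(may_2018, june_2018) would fail at this point, because
# integer and character columns cannot be combined.

# Convert each column from itself; across() avoids copy-paste slips
# such as assigning start_station_id to end_station_id.
june_2018 <- june_2018 %>%
  mutate(across(c(start_station_id, end_station_id), as.integer))

all_rides <- bind_rows(may_2018, june_2018)
all_rides
```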
5. In 2020, what month had the highest number of riders? What month had the lowest number of riders? Interpret any seasonal patterns.
Answer: The Age variable is created using this code:
lyftbaywheels1 %>% head()
6. What start station had the highest number of rides? That is, which start station was used most to start rides?
Answer: July had the highest number of riders and January had the lowest. There seem to be more riders in the summer than in the winter. These counts are the number of records (observations/rows). See the plot below.
lyftbaywheels1 %>% ggplot(aes(x=as.factor(month))) + geom_bar() + facet_grid(year ~ .)
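The bar chart can be backed by a table of monthly counts, from which the busiest and quietest months can be read off directly. A minimal sketch on invented rows, assuming the year and month columns used in the plot; note that it counts rides (rows), not distinct riders.

```r
library(dplyr)

# Toy data with the same year and month columns used in the plot.
rides <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2020, 2020),
  month = c(1, 7, 7, 7, 3, 3)
)

# Number of rides per month in 2020.
monthly <- rides %>%
  filter(year == 2020) %>%
  count(month, name = "n_rides")

# Months with the most and the fewest rides.
busiest  <- monthly %>% slice_max(n_rides, n = 1)
quietest <- monthly %>% slice_min(n_rides, n = 1)

busiest
quietest
```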

7. Using the Amelia R package and the missmap() function, determine the rate of missing data in the month of June 2020. Or try the visdat package and the vis_miss() function. Or check out the naniar R package. (This might not work on your computer if you have too little RAM.) If you cannot get your code to run, sample the data first.
Answer: Write your answer.
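One possible way to approach the missing-data question is sketched below with naniar. The toy dataframe june_2020 and its column rideable_type are hypothetical stand-ins; with the real data you would first filter to June 2020 and, if memory is tight, sample rows before plotting.

```r
library(dplyr)
library(naniar)

# Toy stand-in for the June 2020 rides (hypothetical values).
june_2020 <- tibble(
  start_station_id = c(3L, NA, 7L, NA, 11L, 5L, NA, 9L),
  rideable_type    = c("electric_bike", "docked_bike", NA, "docked_bike",
                       "electric_bike", "docked_bike", "docked_bike", NA)
)

# Sampling rows first keeps memory use down on large data.
sampled <- june_2020 %>% slice_sample(n = 6)

# Missing count and percentage per variable.
miss_var_summary(june_2020)

# Overall proportion of missing cells.
overall_rate <- mean(is.na(june_2020))
overall_rate

# vis_miss(sampled) would draw the missingness plot for the sample.
```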
lyftbaywheels1 %>% select(start_station_id, start_station_name) %>%
count(start_station_id, start_station_name) %>%
arrange(desc(n))
8. Which type of rider uses Lyft BayWheels more: Subscribers or Customers?
Answer: Calculations are after the tables.
Age:
Mean age by gender:
lyftbaywheels1 %>% head()
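The rider-type comparison and the mean age by gender can be computed along these lines. This is a sketch on invented rows; the column names user_type, member_gender, and age are assumptions about the wrangled dataframe and may differ in your data.

```r
library(dplyr)

# Toy riders; column names and values are assumptions for illustration.
riders <- tibble(
  user_type     = c("Subscriber", "Customer", "Subscriber",
                    "Subscriber", "Customer"),
  member_gender = c("Female", "Male", "Male", "Female", NA),
  age           = c(30L, 45L, 28L, 40L, NA)
)

# Rides per rider type, most frequent first.
type_counts <- riders %>% count(user_type, sort = TRUE)

# Mean age by gender, ignoring rows with missing gender or age.
age_by_gender <- riders %>%
  filter(!is.na(member_gender)) %>%
  group_by(member_gender) %>%
  summarise(mean_age = mean(age, na.rm = TRUE), n = n())

type_counts
age_by_gender
```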
