---
title: 'Stat. 450 Section 1 or 2: Homework 4'
output:
  word_document: default
  pdf_document: default
  html_notebook: default
  html_document:
    df_print: paged
---

**Prof. Eric A. Suess**

So how should you complete your homework for this class?

- First thing to do is type all of your information about the problems you do in the text part of your R Notebook.
- Second thing to do is type all of your R code into R chunks that can be run.
- If you load the tidyverse in an R Notebook chunk, be sure to include `message = FALSE` in the chunk header, so `{r message = FALSE}`.
- Last thing is to spell check your R Notebook. Edit > Check Spelling... or hit the F7 key.

Homework 4:

Read: Chapter 5

Do 5.4.1 Exercise 4

Do 5.5.2 Exercise 1, 4

Do 5.6.7 Exercise 1

```{r message=FALSE}
library(tidyverse)
```

# 5.4.1

## 4.

Yes. The contains() helper function picks out all of the variables in the dataset whose names contain the word TIME, even though every column name is lowercase. The helper is not case sensitive by default.

```{r}
library(nycflights13)
flights
```

```{r}
flights %>% select(contains("TIME"))
```

The select() helpers are not case sensitive by default, even though R itself is case sensitive. To change the default, set ignore.case = FALSE. With case-sensitive matching no columns are shown, because every column name in flights is lowercase and none contains the uppercase string TIME.

```{r}
flights %>% select(contains("TIME", ignore.case = FALSE))
```

\newpage

# 5.5.2

## 1.

Minutes since midnight.

```{r}
flights
```

Convert dep_time and sched_dep_time to minutes since midnight. Both are stored as HHMM integers, so 517 means 5:17 am.

dep_time %/% 100 * 60

This gives the whole hours since midnight, converted to minutes.

dep_time %% 100

This gives the remainder in minutes.

```{r}
flights %>% 
  mutate(
    # hours converted to minutes, plus the leftover minutes
    dep_time_mins = (dep_time %/% 100) * 60 + (dep_time %% 100),
    sched_dep_time_mins = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100)
  )
```

\newpage

## 4.

Ten most delayed flights. There are no ties in these 10.

```{r}
flights %>% 
  arrange(desc(dep_delay)) %>% 
  head(10)
```

\newpage

# 5.6.7

## 1.

Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.

1. median and mean of dep_delay time in minutes.
2. sd of dep_delay time in minutes.
3. median and mean of arr_delay time in minutes.
4. sd of arr_delay time in minutes.
5. Is the distribution of arr_delay symmetric or skewed? Same question for dep_delay?

Which is more important: arrival delay or departure delay?

**Arrival delay** is more important.

```{r}
flights %>% 
  select(dep_delay, arr_delay) %>% 
  summarize(
    n = n(),
    dep_delay_median = median(dep_delay, na.rm = TRUE),
    dep_delay_mean = mean(dep_delay, na.rm = TRUE),
    dep_delay_sd = sd(dep_delay, na.rm = TRUE),
    arr_delay_median = median(arr_delay, na.rm = TRUE),
    arr_delay_mean = mean(arr_delay, na.rm = TRUE),
    arr_delay_sd = sd(arr_delay, na.rm = TRUE)
  )
```

What proportion of flights are on time or arrive early?

Approximately 60% of all flights arrive on time or early.

```{r}
flights %>% 
  summarize(flt_ontime = mean(arr_delay <= 0, na.rm = TRUE))
```

Which carrier/airline has the best on-time rate?

```{r}
flights %>% 
  group_by(carrier) %>% 
  summarize(flt_ontime = mean(arr_delay <= 0, na.rm = TRUE)) %>% 
  # highest on-time rate first, so the best carrier is in the top row
  arrange(desc(flt_ontime))
```

What proportion of flights are 10 mins or more late?

```{r}
flights %>% 
  summarize(flt_late10 = mean(arr_delay >= 10, na.rm = TRUE))
```

```{r}
flights %>% 
  group_by(carrier) %>% 
  summarize(flt_late10 = mean(arr_delay >= 10, na.rm = TRUE)) %>% 
  arrange(flt_late10)
```

What proportion of flights are 30 mins or more late?

```{r}
flights %>% 
  summarize(flt_late30 = mean(arr_delay >= 30, na.rm = TRUE))
```

```{r}
flights %>% 
  group_by(carrier) %>% 
  summarize(flt_late30 = mean(arr_delay >= 30, na.rm = TRUE)) %>% 
  arrange(flt_late30)
```
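
The proportions above all use the same idea: a comparison such as arr_delay <= 0 returns a logical vector, and the mean of a logical vector is the share of TRUE values. Here is a minimal sketch of that idea in base R. The vector `delays` and its values are made up for illustration and are not taken from flights.

```{r}
# Hypothetical arrival delays in minutes (assumed values, not real data).
# NA stands for a flight with no recorded arrival, e.g. a cancellation.
delays <- c(-5, 0, -1, -10, 12, 45, 3, 7, NA)

# The comparison gives TRUE/FALSE per flight; the NA stays NA.
on_time <- delays <= 0

# mean() counts TRUE as 1 and FALSE as 0. With na.rm = TRUE the NA is
# dropped first, so the denominator is the 8 flights with a recorded delay.
prop_on_time <- mean(on_time, na.rm = TRUE)

# The same pattern gives the share of flights 10 or more minutes late.
prop_late10 <- mean(delays >= 10, na.rm = TRUE)

prop_on_time
prop_late10
```

Without na.rm = TRUE both means would be NA, which is why every summarize() call above includes it. Note that this makes each proportion a share of flights with a recorded arrival delay, not of all scheduled flights.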