---
title: "Transformation Pipes"
author: "Prof. Eric A. Suess"
output:
  word_document: default
  html_document:
    df_print: paged
  pdf_document: default
  html_notebook: default
---

# Chapter 4 Data Transformation

The 5 verbs of data wrangling

- Pick observations by their values (**filter()**).
- Reorder the rows (**arrange()**).
- Pick variables by their names (**select()**).
- Create new variables with functions of existing variables (**mutate()**).
- Collapse many values down to a single summary (**summarise()**).

Each of these verbs can be combined with **group_by()**, which makes the verb operate group by group instead of on the whole dataset.

```{r message=FALSE}
library(nycflights13)
library(tidyverse)
```

We will continue to work with the *flights* dataset from the nycflights13 package.

```{r}
flights
```

Here we rewrite the code from the Transformation presentation to use the pipe `%>%`. Note that when using pipes you do not include the data in the next function call; it is piped into the function as its first argument. The functions in the tidyverse are designed to work this way.

## filter()

```{r echo=TRUE}
flights %>%
  filter(month == 1, day == 1)
```

## arrange()

```{r echo=TRUE}
flights %>%
  arrange(year, month, day)
```

## arrange()

```{r echo=TRUE}
flights %>%
  arrange(desc(dep_delay))
```

## select()

```{r echo=TRUE}
flights %>%
  select(year, month, day)
```

## select()

```{r echo=TRUE}
flights %>%
  select(time_hour, air_time, everything())
```

## mutate()

```{r echo=TRUE}
flights %>%
  select(year:day, ends_with("delay"), distance, air_time) %>%
  mutate(gain = dep_delay - arr_delay,
         speed = distance / air_time * 60)
```

## summarise()

```{r echo=TRUE}
# Mean departure delay over all flights
flights %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

# Mean departure delay for each day
flights %>%
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```

## Combining multiple operations with the pipe %>%

```{r echo=TRUE}
delay <- flights %>%
  group_by(dest) %>%
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

delay
```

## Combining multiple operations with the pipe %>%

```{r echo=TRUE, eval=FALSE}
delay %>%
  ggplot(mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
```

## Combining multiple operations with the pipe %>%

It looks like delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there's more ability to make up delays in the air?

> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Combining multiple operations with the pipe %>%

Does this code read better? This is the same code as above!

```{r echo=TRUE}
delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
```
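
## Combining multiple operations with the pipe %>%

The point that the data is not repeated in the next function call can be checked directly. The sketch below compares a nested call with its piped equivalent. It assumes the dplyr and nycflights13 packages are installed; the object names `nested` and `piped` are illustrative only and are not part of the original presentation.

```{r echo=TRUE}
library(dplyr)
library(nycflights13)

# Without the pipe: the data frame is passed explicitly as the first argument,
# and the calls have to be read from the inside out.
nested <- arrange(filter(flights, month == 1, day == 1), desc(dep_delay))

# With the pipe: the left-hand side becomes the first argument of the next call,
# so the steps read from top to bottom.
piped <- flights %>%
  filter(month == 1, day == 1) %>%
  arrange(desc(dep_delay))

# Check whether the two forms produce the same tibble.
identical(nested, piped)
```

Both versions run the same two verbs in the same order; the pipe only changes how the calls are written, which is why longer chains tend to be easier to follow in piped form.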