---
title: "Transformation"
author: "Prof. Eric A. Suess"
output:
  html_document:
    df_print: paged
  pdf_document: default
  html_notebook: default
  word_document: default
---

# Chapter 4 Data Transformation

The five verbs of data wrangling:

- Pick observations by their values (**filter()**).
- Reorder the rows (**arrange()**).
- Pick variables by their names (**select()**).
- Create new variables with functions of existing variables (**mutate()**).
- Collapse many values down to a single summary (**summarise()**).

All five can be combined with **group_by()**, which makes them operate group by group instead of on the whole dataset.

```{r message=FALSE}
library(nycflights13)
library(tidyverse)
```

We will continue to work with the *flights* dataset, which is in the nycflights13 package.

```{r}
flights
```

Change the code from the Transformation presentation to use the pipe `%>%`.

## filter()

```{r echo=TRUE}
# Keep only the flights that departed on January 1
filter(flights, month == 1, day == 1)
```

## arrange()

```{r echo=TRUE}
# Sort by year, then month, then day
arrange(flights, year, month, day)
```

## arrange()

```{r echo=TRUE}
# Sort by departure delay, longest delays first
arrange(flights, desc(dep_delay))
```

## select()

```{r echo=TRUE}
# Keep only the three date columns
select(flights, year, month, day)
```

## select()

```{r echo=TRUE}
# Move time_hour and air_time to the front, then keep everything else
select(flights, time_hour, air_time, everything())
```

## mutate()

```{r echo=TRUE}
# Work with a narrower table so the new columns are easy to see
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)

mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
```

## summarise()

```{r echo=TRUE}
# One overall mean departure delay
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

# Mean departure delay for each day
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
```

## Combining multiple operations using functions and assignment <-

```{r echo=TRUE}
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
# Drop destinations with few flights and Honolulu (HNL), a distance outlier
delay <- filter(delay, count > 20, dest != "HNL")
```

## Combining multiple operations using functions and assignment <-, note the ggplot "piping" using the +

```{r echo=TRUE, eval=FALSE}
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
```

It looks like delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there's more ability to make up delays in the air?

> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Combining multiple operations with the pipe %>%

Does this code read better?

```{r echo=TRUE}
delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

head(delays)
```
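
The exercise above asks for the earlier function calls to be rewritten with the pipe. As one possible sketch of that translation (not the only way to do it), the chunk below combines the `filter()` and `arrange()` examples in both styles and checks that they agree. The object names `nested` and `piped` are made up for this illustration.

```{r echo=TRUE}
library(nycflights13)
library(dplyr)

# Nested-call form: read from the inside out
nested <- arrange(filter(flights, month == 1, day == 1), desc(dep_delay))

# Piped form: the same two steps, read top to bottom
piped <- flights %>%
  filter(month == 1, day == 1) %>%
  arrange(desc(dep_delay))

# The pipe only changes how the calls are written, so the results should match
identical(nested, piped)
```

The pipe passes the left-hand side as the first argument of the next function, so `flights %>% filter(month == 1)` is another way of writing `filter(flights, month == 1)`. The same pattern applies to the `select()`, `mutate()`, and `summarise()` chunks.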