--- title: "Combining Multiple Tables" output: pdf_document: default html_notebook: default word_document: default --- Today we will start to look at the nycflights13 data tables. You can think of this dataset as if it was a spreadsheet with multiple spreadsheets within a workbook. Take a look at the *flights* table and the *airlines* table. Do the two tables have a common variable that can be used as a key to match the rows of the tables? ```{r message=FALSE} library(tidyverse) library(nycflights13) library(skimr) flights %>% head() airlines %>% head() airports %>% head() ``` ```{r} flightsJoined <- flights %>% inner_join(airlines, by = c("carrier" = "carrier") ) flightsJoined %>% head() glimpse(flightsJoined) ``` A package that I like for summarizing the variables in a dataframe is the *skim* function from the *skimr* package. What do you think? ```{r} skim(flightsJoined) ``` Check that the new column *name* has been added. ```{r} flightsJoined %>% select(carrier, name, flight, origin, dest) %>% head() ``` How big are the dataframes? Note the base R finction *nrow* just give a number as output, using the dplyr function *count()* give a dataframe. ```{r} nrow(flights) nrow(flightsJoined) flights %>% count() flightsJoined %>% count() ``` Suppose we are interested in the flights from NYC airports to the West Coast. The Pacific Time Zone is time zone -8 in the airports table. (I do not know why there are more airports that given in the book.) ```{r} airportsPT <- airports %>% filter(tz == -8) airportsPT %>% head() airportsPT %>% count() ``` Now if we are interested in the flights from NYC to the west coast, find the airports in the Pacific Time Zone and join the airportPT we will get the flights to the west coast. ```{r} nycDestPT <- flights %>% inner_join(airportsPT, by = c("dest" = "faa") ) nycDestPT %>% count() ``` If we *left_join* we will get all of the rows in *flights*. Check out the *map* function, it applies a function to all columns. ```{r} nycDest <- flights %>% left_join(airportsPT, by = c("dest" = "faa") ) nycDest %>% count() nycDest %>% head() nycDest %>% map(~sum(is.na(.))) ```