Combining Multiple Tables

Today we will start to look at the nycflights13 data tables.

You can think of this dataset as if it was a spreadsheet with multiple spreadsheets within a workbook.

Take a look at the flights table and the airlines table. Do the two tables have a common variable that can be used as a key to match the rows of the tables?

library(tidyverse)
library(nycflights13)
library(skimr)
flights %>% head()

airlines %>% head()

airports %>% head()

flightsJoined <- flights %>%
  inner_join(airlines, by = c("carrier" = "carrier") )
flightsJoined %>% head()

glimpse(flightsJoined)

Observations: 336,776
Variables: 20
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559...
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559...
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -...
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 70...
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 70...
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, ...
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", ...
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, ...
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS"...
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "...
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO", "ORD", "PBI", "...
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 345, 361, 257, 44, ...
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, 1005, 2475, 2565,...
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 10, 5, 10, 1...
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:...
$ name           <chr> "United Air Lines Inc.", "United Air Lines Inc.", "American Airlines Inc.", "J...

A package that I like for summarizing the variables in a dataframe is the skim function from the skimr package. What do you think?

skim(flightsJoined)

Skim summary statistics
 n obs: 336776 
 n variables: 20 

── Variable type:character ───────────────────────────────────────────────────────────────────────────────
 variable missing complete      n min max empty n_unique
  carrier       0   336776 336776   2   2     0       16
     dest       0   336776 336776   3   3     0      105
     name       0   336776 336776   9  27     0       16
   origin       0   336776 336776   3   3     0        3
  tailnum    2512   334264 336776   5   6     0     4043

── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────
       variable missing complete      n    mean      sd   p0  p25  p50  p75 p100     hist
       arr_time    8713   328063 336776 1502.05  533.26    1 1104 1535 1940 2400 ▁▁▃▇▆▆▇▆
            day       0   336776 336776   15.71    8.77    1    8   16   23   31 ▇▇▇▇▆▇▇▇
       dep_time    8255   328521 336776 1349.11  488.28    1  907 1401 1744 2400 ▁▁▇▆▆▇▆▂
         flight       0   336776 336776 1971.92 1632.47    1  553 1496 3465 8500 ▇▅▂▃▂▁▁▁
          month       0   336776 336776    6.55    3.41    1    4    7   10   12 ▇▅▇▃▅▇▅▇
 sched_arr_time       0   336776 336776 1536.38  497.46    1 1124 1556 1945 2359 ▁▁▂▇▆▇▇▆
 sched_dep_time       0   336776 336776 1344.25  467.34  106  906 1359 1729 2359 ▁▃▇▆▆▇▇▂
           year       0   336776 336776 2013       0    2013 2013 2013 2013 2013 ▁▁▁▇▁▁▁▁

── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────
  variable missing complete      n    mean     sd  p0 p25 p50  p75 p100     hist
  air_time    9430   327346 336776  150.69  93.69  20  82 129  192  695 ▇▇▂▃▁▁▁▁
 arr_delay    9430   327346 336776    6.9   44.63 -86 -17  -5   14 1272 ▇▁▁▁▁▁▁▁
 dep_delay    8255   328521 336776   12.64  40.21 -43  -5  -2   11 1301 ▇▁▁▁▁▁▁▁
  distance       0   336776 336776 1039.91 733.23  17 502 872 1389 4983 ▆▇▂▂▁▁▁▁
      hour       0   336776 336776   13.18   4.66   1   9  13   17   23 ▁▃▇▆▅▇▇▂
    minute       0   336776 336776   26.23  19.3    0   8  29   44   59 ▇▂▃▃▅▂▃▅

── Variable type:POSIXct ─────────────────────────────────────────────────────────────────────────────────
  variable missing complete      n        min        max     median n_unique
 time_hour       0   336776 336776 2013-01-01 2013-12-31 2013-07-03     6936

Check that the new column name has been added.

flightsJoined %>% select(carrier, name, flight, origin, dest) %>%
  head()

How big are the dataframes? Note the base R finction nrow just give a number as output, using the dplyr function count() give a dataframe.

nrow(flights)

[1] 336776

nrow(flightsJoined)

[1] 336776

flights %>% count()

flightsJoined %>% count()

Suppose we are interested in the flights from NYC airports to the West Coast.

The Pacific Time Zone is time zone -8 in the airports table. (I do not know why there are more airports that given in the book.)

airportsPT <- airports %>% filter(tz == -8)
airportsPT %>% head()

airportsPT %>% count()

Now if we are interested in the flights from NYC to the west coast, find the airports in the Pacific Time Zone and join the airportPT we will get the flights to the west coast.

nycDestPT <- flights %>% inner_join(airportsPT, by = c("dest" = "faa") )
nycDestPT %>% count()

If we left_join we will get all of the rows in flights.

Check out the map function, it applies a function to all columns.

nycDest <- flights %>% left_join(airportsPT, by = c("dest" = "faa") )
nycDest %>% count()

nycDest %>% head()

nycDest %>% map(~sum(is.na(.)))

$year
[1] 0

$month
[1] 0

$day
[1] 0

$dep_time
[1] 8255

$sched_dep_time
[1] 0

$dep_delay
[1] 8255

$arr_time
[1] 8713

$sched_arr_time
[1] 0

$arr_delay
[1] 9430

$carrier
[1] 0

$flight
[1] 0

$tailnum
[1] 2512

$origin
[1] 0

$dest
[1] 0

$air_time
[1] 9430

$distance
[1] 0

$hour
[1] 0

$minute
[1] 0

$time_hour
[1] 0

$name
[1] 290452

$lat
[1] 290452

$lon
[1] 290452

$alt
[1] 290452

$tz
[1] 290452

$dst
[1] 290452

$tzone
[1] 290452

LS0tCnRpdGxlOiAiQ29tYmluaW5nIE11bHRpcGxlIFRhYmxlcyIKb3V0cHV0OgogIGh0bWxfbm90ZWJvb2s6IGRlZmF1bHQKICB3b3JkX2RvY3VtZW50OiBkZWZhdWx0CiAgcGRmX2RvY3VtZW50OiBkZWZhdWx0Ci0tLQoKVG9kYXkgd2Ugd2lsbCBzdGFydCB0byBsb29rIGF0IHRoZSBueWNmbGlnaHRzMTMgZGF0YSB0YWJsZXMuCgpZb3UgY2FuIHRoaW5rIG9mIHRoaXMgZGF0YXNldCBhcyBpZiBpdCB3YXMgYSBzcHJlYWRzaGVldCB3aXRoIG11bHRpcGxlIHNwcmVhZHNoZWV0cyB3aXRoaW4gYSB3b3JrYm9vay4KClRha2UgYSBsb29rIGF0IHRoZSAqZmxpZ2h0cyogdGFibGUgYW5kIHRoZSAqYWlybGluZXMqIHRhYmxlLiAgRG8gdGhlIHR3byB0YWJsZXMgaGF2ZSBhIGNvbW1vbiB2YXJpYWJsZSB0aGF0IGNhbiBiZSB1c2VkIGFzIGEga2V5IHRvIG1hdGNoIHRoZSByb3dzIG9mIHRoZSB0YWJsZXM/CgpgYGB7ciBtZXNzYWdlPUZBTFNFfQpsaWJyYXJ5KHRpZHl2ZXJzZSkKbGlicmFyeShueWNmbGlnaHRzMTMpCmxpYnJhcnkoc2tpbXIpCgpmbGlnaHRzICU+JSBoZWFkKCkKYWlybGluZXMgJT4lIGhlYWQoKQphaXJwb3J0cyAlPiUgaGVhZCgpCmBgYAoKYGBge3J9CmZsaWdodHNKb2luZWQgPC0gZmxpZ2h0cyAlPiUKICBpbm5lcl9qb2luKGFpcmxpbmVzLCBieSA9IGMoImNhcnJpZXIiID0gImNhcnJpZXIiKSApCgpmbGlnaHRzSm9pbmVkICU+JSBoZWFkKCkKCmdsaW1wc2UoZmxpZ2h0c0pvaW5lZCkKYGBgCgpBIHBhY2thZ2UgdGhhdCBJIGxpa2UgZm9yIHN1bW1hcml6aW5nIHRoZSB2YXJpYWJsZXMgaW4gYSBkYXRhZnJhbWUgaXMgdGhlICpza2ltKiBmdW5jdGlvbiBmcm9tIHRoZSAqc2tpbXIqIHBhY2thZ2UuICBXaGF0IGRvIHlvdSB0aGluaz8KCmBgYHtyfQpza2ltKGZsaWdodHNKb2luZWQpCmBgYAoKQ2hlY2sgdGhhdCB0aGUgbmV3IGNvbHVtbiAqbmFtZSogaGFzIGJlZW4gYWRkZWQuCgpgYGB7cn0KZmxpZ2h0c0pvaW5lZCAlPiUgc2VsZWN0KGNhcnJpZXIsIG5hbWUsIGZsaWdodCwgb3JpZ2luLCBkZXN0KSAlPiUKICBoZWFkKCkKCmBgYAoKSG93IGJpZyBhcmUgdGhlIGRhdGFmcmFtZXM/ICBOb3RlIHRoZSBiYXNlIFIgZmluY3Rpb24gKm5yb3cqIGp1c3QgZ2l2ZSBhIG51bWJlciBhcyBvdXRwdXQsIHVzaW5nIHRoZSBkcGx5ciBmdW5jdGlvbiAqY291bnQoKSogZ2l2ZSBhIGRhdGFmcmFtZS4KCmBgYHtyfQpucm93KGZsaWdodHMpCgpucm93KGZsaWdodHNKb2luZWQpCgpmbGlnaHRzICU+JSBjb3VudCgpCgpmbGlnaHRzSm9pbmVkICU+JSBjb3VudCgpCmBgYAoKU3VwcG9zZSB3ZSBhcmUgaW50ZXJlc3RlZCBpbiB0aGUgZmxpZ2h0cyBmcm9tIE5ZQyBhaXJwb3J0cyB0byB0aGUgV2VzdCBDb2FzdC4KClRoZSBQYWNpZmljIFRpbWUgWm9uZSBpcyB0aW1lIHpvbmUgLTggaW4gdGhlIGFpcnBvcnRzIHRhYmxlLiAgKEkgZG8gbm90IGtub3cgd2h5IHRoZXJlIGFyZSBtb3JlIGFpcnBvcnRzIHRoYXQgZ2l2ZW4gaW4gdGhlIGJvb2suKQoKYGBge3J9CmFpcnBvcnRzUFQgPC0gYWlycG9ydHMgJT4lIGZpbHRlcih0eiA9PSAtOCkKCmFpcnBvcnRzUFQgJT4lIGhlYWQoKQoKYWlycG9ydHNQVCAlPiUgY291bnQoKQpgYGAKCgpOb3cgaWYgd2UgYXJlIGludGVyZXN0ZWQgaW4gdGhlIGZsaWdodHMgZnJvbSBOWUMgdG8gdGhlIHdlc3QgY29hc3QsIGZpbmQgdGhlIGFpcnBvcnRzIGluIHRoZSBQYWNpZmljIFRpbWUgWm9uZSBhbmQgam9pbiB0aGUgYWlycG9ydFBUIHdlIHdpbGwgZ2V0IHRoZSBmbGlnaHRzIHRvIHRoZSB3ZXN0IGNvYXN0LgoKCmBgYHtyfQpueWNEZXN0UFQgPC0gZmxpZ2h0cyAlPiUgaW5uZXJfam9pbihhaXJwb3J0c1BULCBieSA9IGMoImRlc3QiID0gImZhYSIpICkKCm55Y0Rlc3RQVCAlPiUgY291bnQoKQpgYGAKCklmIHdlICpsZWZ0X2pvaW4qIHdlIHdpbGwgZ2V0IGFsbCBvZiB0aGUgcm93cyBpbiAqZmxpZ2h0cyouCgpDaGVjayBvdXQgdGhlICptYXAqIGZ1bmN0aW9uLCBpdCBhcHBsaWVzIGEgZnVuY3Rpb24gdG8gYWxsIGNvbHVtbnMuCgpgYGB7cn0KbnljRGVzdCA8LSBmbGlnaHRzICU+JSBsZWZ0X2pvaW4oYWlycG9ydHNQVCwgYnkgPSBjKCJkZXN0IiA9ICJmYWEiKSApCgpueWNEZXN0ICU+JSBjb3VudCgpCgpueWNEZXN0ICU+JSBoZWFkKCkKCm55Y0Rlc3QgJT4lIG1hcCh+c3VtKGlzLm5hKC4pKSkKYGBgCgoKCgo=