Today we will start to look at the nycflights13 data tables.
You can think of this dataset as if it was a spreadsheet with multiple spreadsheets within a workbook.
Take a look at the flights table and the airlines table. Do the two tables have a common variable that can be used as a key to match the rows of the tables?
library(tidyverse)
library(nycflights13)
library(skimr)
flights %>% head()
airlines %>% head()
airports %>% head()
flightsJoined <- flights %>%
inner_join(airlines, by = c("carrier" = "carrier") )
flightsJoined %>% head()
glimpse(flightsJoined)
Observations: 336,776
Variables: 20
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559...
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559...
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -...
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 70...
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 70...
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, ...
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", ...
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, ...
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS"...
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "...
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO", "ORD", "PBI", "...
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 345, 361, 257, 44, ...
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, 1005, 2475, 2565,...
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 10, 5, 10, 1...
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:...
$ name <chr> "United Air Lines Inc.", "United Air Lines Inc.", "American Airlines Inc.", "J...
A package that I like for summarizing the variables in a dataframe is the skim function from the skimr package. What do you think?
skim(flightsJoined)
Skim summary statistics
n obs: 336776
n variables: 20
── Variable type:character ───────────────────────────────────────────────────────────────────────────────
variable missing complete n min max empty n_unique
carrier 0 336776 336776 2 2 0 16
dest 0 336776 336776 3 3 0 105
name 0 336776 336776 9 27 0 16
origin 0 336776 336776 3 3 0 3
tailnum 2512 334264 336776 5 6 0 4043
── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
arr_time 8713 328063 336776 1502.05 533.26 1 1104 1535 1940 2400 ▁▁▃▇▆▆▇▆
day 0 336776 336776 15.71 8.77 1 8 16 23 31 ▇▇▇▇▆▇▇▇
dep_time 8255 328521 336776 1349.11 488.28 1 907 1401 1744 2400 ▁▁▇▆▆▇▆▂
flight 0 336776 336776 1971.92 1632.47 1 553 1496 3465 8500 ▇▅▂▃▂▁▁▁
month 0 336776 336776 6.55 3.41 1 4 7 10 12 ▇▅▇▃▅▇▅▇
sched_arr_time 0 336776 336776 1536.38 497.46 1 1124 1556 1945 2359 ▁▁▂▇▆▇▇▆
sched_dep_time 0 336776 336776 1344.25 467.34 106 906 1359 1729 2359 ▁▃▇▆▆▇▇▂
year 0 336776 336776 2013 0 2013 2013 2013 2013 2013 ▁▁▁▇▁▁▁▁
── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
air_time 9430 327346 336776 150.69 93.69 20 82 129 192 695 ▇▇▂▃▁▁▁▁
arr_delay 9430 327346 336776 6.9 44.63 -86 -17 -5 14 1272 ▇▁▁▁▁▁▁▁
dep_delay 8255 328521 336776 12.64 40.21 -43 -5 -2 11 1301 ▇▁▁▁▁▁▁▁
distance 0 336776 336776 1039.91 733.23 17 502 872 1389 4983 ▆▇▂▂▁▁▁▁
hour 0 336776 336776 13.18 4.66 1 9 13 17 23 ▁▃▇▆▅▇▇▂
minute 0 336776 336776 26.23 19.3 0 8 29 44 59 ▇▂▃▃▅▂▃▅
── Variable type:POSIXct ─────────────────────────────────────────────────────────────────────────────────
variable missing complete n min max median n_unique
time_hour 0 336776 336776 2013-01-01 2013-12-31 2013-07-03 6936
Check that the new column name has been added.
flightsJoined %>% select(carrier, name, flight, origin, dest) %>%
head()
How big are the dataframes? Note the base R finction nrow just give a number as output, using the dplyr function count() give a dataframe.
nrow(flights)
[1] 336776
nrow(flightsJoined)
[1] 336776
flights %>% count()
flightsJoined %>% count()
Suppose we are interested in the flights from NYC airports to the West Coast.
The Pacific Time Zone is time zone -8 in the airports table. (I do not know why there are more airports that given in the book.)
airportsPT <- airports %>% filter(tz == -8)
airportsPT %>% head()
airportsPT %>% count()
Now if we are interested in the flights from NYC to the west coast, find the airports in the Pacific Time Zone and join the airportPT we will get the flights to the west coast.
nycDestPT <- flights %>% inner_join(airportsPT, by = c("dest" = "faa") )
nycDestPT %>% count()
If we left_join we will get all of the rows in flights.
Check out the map function, it applies a function to all columns.
nycDest <- flights %>% left_join(airportsPT, by = c("dest" = "faa") )
nycDest %>% count()
nycDest %>% head()
nycDest %>% map(~sum(is.na(.)))
$year
[1] 0
$month
[1] 0
$day
[1] 0
$dep_time
[1] 8255
$sched_dep_time
[1] 0
$dep_delay
[1] 8255
$arr_time
[1] 8713
$sched_arr_time
[1] 0
$arr_delay
[1] 9430
$carrier
[1] 0
$flight
[1] 0
$tailnum
[1] 2512
$origin
[1] 0
$dest
[1] 0
$air_time
[1] 9430
$distance
[1] 0
$hour
[1] 0
$minute
[1] 0
$time_hour
[1] 0
$name
[1] 290452
$lat
[1] 290452
$lon
[1] 290452
$alt
[1] 290452
$tz
[1] 290452
$dst
[1] 290452
$tzone
[1] 290452