October 16, 2019

Relational data

nycflights

library(tidyverse)
library(nycflights13)

flights

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

table1

airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

table2

airports
## # A tibble: 1,458 x 8
##    faa   name                    lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                 <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport      41.1  -80.6  1044    -5 A     America/New_…
##  2 06A   Moton Field Municipa…  32.5  -85.7   264    -6 A     America/Chic…
##  3 06C   Schaumburg Regional    42.0  -88.1   801    -6 A     America/Chic…
##  4 06N   Randall Airport        41.4  -74.4   523    -5 A     America/New_…
##  5 09J   Jekyll Island Airport  31.1  -81.4    11    -5 A     America/New_…
##  6 0A9   Elizabethton Municip…  36.4  -82.2  1593    -5 A     America/New_…
##  7 0G6   Williams County Airp…  41.5  -84.5   730    -5 A     America/New_…
##  8 0G7   Finger Lakes Regiona…  42.9  -76.8   492    -5 A     America/New_…
##  9 0P2   Shoestring Aviation …  39.8  -76.6  1000    -5 U     America/New_…
## 10 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/Los_…
## # … with 1,448 more rows

table3

planes
## # A tibble: 3,322 x 9
##    tailnum  year type       manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>      <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed win… EMBRAER       EMB-1…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  3 N103US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  5 N10575   2002 Fixed win… EMBRAER       EMB-1…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  7 N107US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## # … with 3,312 more rows

Working with Relational data

  • Mutating joins, which add new variables to one data frame from matching observations in another.
  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
  • Set operations, which treat observations as if they were set elements.

Keys

The variable(s) used to connect two tables is called a key.

  • A primary key uniquely identifies an observation in its own table.
  • A foreign key uniquely identifies an observation in another table.
  • a surrogate key can be created when a unique identifier is not available.

Understanding joins

From Section 13.4.1

Understanding joins

Understanding joins

Outer joins

Outer joins