Some of the code from Chapter 4, Section 1.

This chapter introduces dplyr, which we will use all year.

The main idea of data wrangling with dplyr is its five verbs.

select() # take a subset of columns

filter() # take a subset of rows

mutate() # add new columns or modify existing ones

arrange() # sort the rows

summarize() # aggregate the data across rows
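
As a minimal sketch of how the five verbs chain together, the example below runs them on a small made-up table. The `pets` data and its column names are invented for illustration and are not part of the chapter.

```r
library(dplyr)

# Toy data, invented for illustration only
pets <- tibble(
  name    = c("Rex", "Tom", "Ada", "Bo"),
  species = c("dog", "cat", "dog", "cat"),
  kg      = c(30, 4, 20, 6),
  age     = c(5, 3, 8, 2)
)

result <- pets %>%
  select(name, species, kg) %>%  # keep three of the four columns
  filter(kg > 5) %>%             # keep rows where kg is above 5
  mutate(lb = kg * 2.2) %>%      # add a derived column
  arrange(desc(kg))              # sort with the heaviest first

# Aggregate the remaining rows: one output row per species
by_species <- result %>%
  group_by(species) %>%
  summarize(n = n(), mean_kg = mean(kg))

result
by_species
```

Each verb takes a data frame and returns a data frame, which is why the steps can be piped one after another.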

The dplyr package is part of the tidyverse. We load the tidyverse here, along with mdsr; install both packages first if you do not already have them.

library(mdsr)
library(tidyverse)

Star Wars dataset

data("starwars")
glimpse(starwars)
Observations: 87
Variables: 13
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 228, 1…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, 84…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", NA, "b…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "light…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue", "red…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, 41.9,…
$ gender     <chr> "male", NA, NA, "male", "female", "male", "female", NA, "male", …
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "Tatooi…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human", "…
$ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "The Empire Stri…
$ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imperia…
$ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1", <>,…

select()

starwars %>% select(name, species)

filter()

starwars %>% 
  filter(species == "Droid")

select()

starwars %>% 
  select(name, ends_with("color"))

mutate()

starwars %>% 
  mutate(bmi = mass / (height / 100)^2) %>%  # mass in kg, height in cm
  select(name:mass, bmi)
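
To make the BMI formula concrete, here is the same arithmetic for a single character, using the mass and height shown for Luke Skywalker in the glimpse() output above. The variable names are chosen for this sketch only.

```r
# Values taken from the first row of the glimpse() output
mass_kg   <- 77
height_cm <- 172

# BMI = mass in kg divided by the square of height in metres
height_m <- height_cm / 100
bmi <- mass_kg / height_m^2

round(bmi, 2)
```

Dividing the height by 100 converts centimetres to metres before squaring, which is what the mutate() call does for every row at once.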

arrange()

starwars %>% 
  arrange(desc(mass))

summarize()

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1)
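
The sketch below repeats the group_by() and summarize() pattern on a small made-up table so the effect of na.rm = TRUE and of the final filter() is easy to see. The `toy` data and the `mean_mass_raw` column are assumptions added for illustration.

```r
library(dplyr)

# Toy data, invented for illustration; one mass is missing
toy <- tibble(
  species = c("Human", "Human", "Droid", "Droid", "Wookiee"),
  mass    = c(77, NA, 75, 32, 112)
)

mass_by_species <- toy %>%
  group_by(species) %>%
  summarize(
    n             = n(),                       # rows per species
    mean_mass     = mean(mass, na.rm = TRUE),  # ignores the missing value
    mean_mass_raw = mean(mass)                 # NA if any value is missing
  ) %>%
  filter(n > 1)                                # drop species seen only once

mass_by_species
```

With na.rm = TRUE the Human mean is computed from the one known mass, while the raw mean is NA. The Wookiee row is removed because its group has a single member.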

Questions

Develop the R code to answer the following questions.

  1. How many films are in the dataset? The output below lists 7 distinct films.
starwars %>% select(films) %>%
  unlist() %>%
  unique()
[1] "Revenge of the Sith"     "Return of the Jedi"      "The Empire Strikes Back"
[4] "A New Hope"              "The Force Awakens"       "Attack of the Clones"   
[7] "The Phantom Menace"     
  2. Are there more Droids or Humans in the Star Wars movies? There are 5 Droids and 35 Humans, so there are more Humans.
starwars %>% select(species) %>%
  filter(species=="Droid" | species=="Human") %>%
  group_by(species) %>%
  summarize(n=n())
  3. Which of the Star Wars movies was Luke Skywalker in?
starwars %>% filter(name=="Luke Skywalker") %>%
  select(films) %>%
  unlist()
                   films1                    films2                    films3 
    "Revenge of the Sith"      "Return of the Jedi" "The Empire Strikes Back" 
                   films4                    films5 
             "A New Hope"       "The Force Awakens" 
  4. Pose a question and answer it by wrangling the starwars dataset.

What is the distribution of heights? What is the distribution of heights by gender?

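# Histogram of all heights (rows with a missing height are dropped with a warning)
starwars %>% ggplot(aes(x = height)) +
  geom_histogram()

# Density-scaled histogram, outlined by gender
starwars %>% ggplot(aes(x = height, color = gender)) +
  geom_histogram(aes(y = after_stat(density)))

# Kernel density estimate of height for each gender
starwars %>% ggplot(aes(x = height, color = gender)) +
  geom_density()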
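
The first question asks for a count, while unique() on its own only lists the titles. As a rough sketch, the code below counts distinct values in a list-column on a small made-up table shaped like starwars; the `cast` data is invented for illustration.

```r
library(dplyr)

# Toy data with a list-column, invented for illustration
cast <- tibble(
  name  = c("A", "B", "C"),
  films = list(
    c("A New Hope", "Return of the Jedi"),
    c("A New Hope"),
    c("The Force Awakens", "Return of the Jedi")
  )
)

# Flatten the list-column, drop duplicates, then count what is left
film_titles <- cast %>%
  pull(films) %>%
  unlist() %>%
  unique()
n_films <- length(film_titles)

# lengths() gives the number of films per character without flattening
appearances <- cast %>%
  mutate(n_films = lengths(films)) %>%
  select(name, n_films)

n_films
appearances
```

Applying the same length() step to the starwars pipeline would return the number of distinct films directly.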

Presidential examples

Try out the code in Chapter 4 Section 1 using the presidential data set.

presidential
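
One possible starting point is sketched below. It assumes the presidential data frame that ships with ggplot2, with columns name, start, end, and party; the `term_days` column is a name invented for this sketch.

```r
library(dplyr)
library(ggplot2)  # provides the presidential data frame

# Republican presidents, with time in office in days, longest first
republicans <- presidential %>%
  filter(party == "Republican") %>%
  mutate(term_days = as.numeric(end - start)) %>%  # Date difference in days
  select(name, start, end, term_days) %>%
  arrange(desc(term_days))

# Number of rows for each party
by_party <- presidential %>%
  group_by(party) %>%
  summarize(n = n())

republicans
by_party
```

The exact rows depend on the installed ggplot2 version, since later releases added more recent presidents to the dataset.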

Star Wars API and R package

More Star Wars stuff you might find interesting.

  • Check out the Star Wars website.
  • Check out the Star Wars API, SWAPI.
  • And check out the R package rwars.

rwars package

This package connects to SWAPI to pull data from the API.

If the package does not install from CRAN, you can install it from GitHub.

library(devtools)
install_github("ironholds/rwars")
library(rwars)
planet_schema <- get_planet_schema()
names(planet_schema)
[1] "properties"  "$schema"     "type"        "required"    "description"
[6] "title"      

rwars package

Get an individual starship - an X-wing.

The request goes to a live API, so it may time out instead of returning the data.

x_wing <- get_starship(12)
x_wing
$name
[1] "X-wing"

$model
[1] "T-65 X-wing"

$manufacturer
[1] "Incom Corporation"

$cost_in_credits
[1] "149999"

$length
[1] "12.5"

$max_atmosphering_speed
[1] "1050"

$crew
[1] "1"

$passengers
[1] "0"

$cargo_capacity
[1] "110"

$consumables
[1] "1 week"

$hyperdrive_rating
[1] "1.0"

$MGLT
[1] "100"

$starship_class
[1] "Starfighter"

$pilots
$pilots[[1]]
[1] "https://swapi.co/api/people/1/"

$pilots[[2]]
[1] "https://swapi.co/api/people/9/"

$pilots[[3]]
[1] "https://swapi.co/api/people/18/"

$pilots[[4]]
[1] "https://swapi.co/api/people/19/"


$films
$films[[1]]
[1] "https://swapi.co/api/films/2/"

$films[[2]]
[1] "https://swapi.co/api/films/3/"

$films[[3]]
[1] "https://swapi.co/api/films/1/"


$created
[1] "2014-12-12T11:19:05.340000Z"

$edited
[1] "2014-12-22T17:35:44.491233Z"

$url
[1] "https://swapi.co/api/starships/12/"
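
Every field in the result comes back as a character string, and related records come back as URLs. The sketch below shows one way to convert them, working on a cut-down copy of the list printed above so it runs without calling the API. The helper variable names are invented for this sketch.

```r
# A cut-down copy of the printed result (only a few fields kept)
x_wing <- list(
  name            = "X-wing",
  cost_in_credits = "149999",
  length          = "12.5",
  pilots          = list(
    "https://swapi.co/api/people/1/",
    "https://swapi.co/api/people/9/"
  )
)

# Numeric fields arrive as strings, so convert them before doing arithmetic
cost        <- as.numeric(x_wing$cost_in_credits)
ship_length <- as.numeric(x_wing$length)

# Pull the numeric id from the end of each pilot URL
pilot_urls <- unlist(x_wing$pilots)
pilot_ids  <- as.integer(sub(".*/([0-9]+)/$", "\\1", pilot_urls))

cost
ship_length
pilot_ids
```

The extracted ids could then be passed to the package's lookup functions to fetch each pilot, assuming the API is reachable.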

Alternative API that can be accessed via an R package

The compstatr R package gives direct access to the crime reports published on the St. Louis Metropolitan Police Department’s website.

library(compstatr)
cs_last_update()
[1] "August 2019"
i <- cs_create_index()
aug19 <- cs_get_data(year = 2019, month = "August", index = i)
aug19

The ukpolice R package downloads data from the UK Police public data API.

library(ukpolice)
library(ggplot2)
library(dplyr)
tv_ss <- ukc_stop_search_force("thames-valley", date = "2018-12")
tv_ss2 <- tv_ss %>% 
  filter(!is.na(officer_defined_ethnicity) & outcome != "" ) %>%
  group_by(officer_defined_ethnicity, outcome) %>%
  summarise(n = n()) %>%
  mutate(perc = n/sum(n))
p1 <- ggplot(tv_ss2, aes(x = outcome, y = perc,
                             group = outcome, fill = outcome)) + 
  geom_col(position = "dodge") + 
  scale_y_continuous(labels = scales::percent,
                     breaks = seq(0.25, 0.8, by = 0.25)) + 
  scale_x_discrete(labels = scales::wrap_format(15)) + 
  theme(legend.position = "none", axis.text.x = element_text(size = 8)) + 
  labs(x = "Outcome", 
       y = "Percentage of stop and searches resulting in outcome",
       title = "Stop and Search Outcomes by Police-Reported Ethnicity",
       subtitle = "Thames Valley Police Department, December 2018",
       caption = "(c) Evan Odell | 2019 | CC-BY-SA") + 
  facet_wrap(~officer_defined_ethnicity)
p1
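
The percentage step in the code above relies on how summarise() handles groups: after grouping by two variables, the result is still grouped by the first one, so sum(n) is taken within each ethnicity. The sketch below shows this on a small made-up table; the `stops` data is invented, and `.groups = "drop_last"` simply spells out the default behavior.

```r
library(dplyr)

# Toy data, invented for illustration
stops <- tibble(
  ethnicity = c("A", "A", "A", "B", "B"),
  outcome   = c("Arrest", "Nothing found", "Nothing found",
                "Arrest", "Nothing found")
)

shares <- stops %>%
  group_by(ethnicity, outcome) %>%
  summarise(n = n(), .groups = "drop_last") %>%  # still grouped by ethnicity
  mutate(perc = n / sum(n)) %>%                  # share within each ethnicity
  ungroup()

shares
```

Because the shares are computed within each ethnicity, they add up to 1 inside every facet of the plot rather than across the whole dataset.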

Alternatively, you could use the other ukpolice R package, which is available through GitHub.

And here is a nice blog post about crime in SF: Using R for Crime Analysis.
