---
title: "Data Wrangling R with Answers"
author: "Prof. Eric A. Suess"
output: html_document
---

Some of the code from Chapter 4, Section 1. In this chapter dplyr is introduced. We will be using dplyr all year.

The main idea of data wrangling with dplyr is the five verbs.

- **select()**: take a subset of columns
- **filter()**: take a subset of rows
- **mutate()**: add or modify existing columns
- **arrange()**: sort the rows
- **summarize()**: aggregate the data across rows

The dplyr package is part of the tidyverse. We load the tidyverse here; install it first if it is not already available.

```{r message=FALSE}
library(mdsr)
library(tidyverse)
```

# Star Wars dataset

```{r}
data("starwars")
glimpse(starwars)
```

# select()

```{r}
# Keep only the name and species columns
starwars %>% 
  select(name, species)
```

# filter()

```{r}
# Keep only the rows for Droids
starwars %>% 
  filter(species == "Droid")
```

# select()

```{r}
# Keep name plus every column whose name ends in "color"
starwars %>% 
  select(name, ends_with("color"))
```

# mutate()

```{r}
# Add a body mass index column (height is in cm, mass in kg)
starwars %>% 
  mutate(bmi = mass / ((height / 100) ^ 2)) %>%
  select(name:mass, bmi)
```

# arrange()

```{r}
# Sort from heaviest to lightest
starwars %>% 
  arrange(desc(mass))
```

# summarize()

```{r}
# Count characters and average mass per species, keeping species with more than one character
starwars %>%
  group_by(species) %>%
  summarize(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1)
```

# Questions

Develop the R code to answer the following questions.

1. How many films are in the dataset?

```{r}
# films is a list-column, so flatten it before taking the distinct titles
films <- starwars %>% 
  select(films) %>% 
  unlist() %>% 
  unique()

films          # the distinct film titles
length(films)  # the number of films
```

2. Are there more Droids or Humans in the Star Wars movies?

There are 5 Droids and 35 Humans. So more Humans.

```{r}
starwars %>% 
  select(species) %>% 
  filter(species == "Droid" | species == "Human") %>% 
  group_by(species) %>% 
  summarize(n = n())
```

3. Which of the Star Wars movies was Luke Skywalker in?

```{r}
starwars %>% 
  filter(name == "Luke Skywalker") %>% 
  select(films) %>% 
  unlist()
```

4. Pose a question and answer it by wrangling the starwars dataset.

What was the distribution of heights? What was the distribution of heights by gender?

```{r}
# Overall distribution of height
starwars %>% 
  ggplot(aes(x = height)) +
  geom_histogram()

# Height by gender, on the density scale
# (after_stat(density) replaces the deprecated ..density.. notation)
starwars %>% 
  ggplot(aes(x = height, color = gender)) +
  geom_histogram(aes(y = after_stat(density)))

starwars %>% 
  ggplot(aes(x = height, color = gender)) +
  geom_density(aes(y = after_stat(density)))
```

# Presidential examples

Try out the code in Chapter 4, Section 1 using the presidential data set.

```{r}
presidential
```

## Star Wars API and R package

More Star Wars stuff you might find interesting.

- Check out the [Star Wars](https://www.starwars.com/) website.
- Check out the Star Wars API [swapi](https://swapi.co/). **Discontinued**
- And check out the R package [starwarsdb](https://github.com/gadenbuie/starwarsdb).

## starwarsdb package

This package contains 9 data tables.

```{r}
library(starwarsdb)
data(package = "starwarsdb")  # list the data sets shipped with the package
schema
```

## starwarsdb package

Get an individual starship - an X-wing. Hopefully it won't time out and will actually bring the data back.

```{r}
# Pull the properties recorded in the schema for the films endpoint
X <- schema %>% 
  filter(endpoint == "films") %>% 
  pull(properties)
```

```{r}
films_vehicles

films_vehicles %>% 
  filter(vehicle == "X-wing")
```

```{r}
pilots

pilots %>% 
  filter(vehicle == "X-wing")
```

```{r}
people
```

```{r}
species
```

```{r}
# Characters who are taller than the average height of their species
people %>% 
  right_join(species, by = c("species" = "name")) %>%
  select(name, species, height, average_height) %>%
  mutate(differ = height - average_height) %>%
  filter(differ > 0) %>%
  arrange(desc(differ))
```

## Alternative API that can be accessed via an R package

The [ukpolice](https://github.com/evanodell/ukpolice) R package downloads data from the UK Police public data API.

```{r}
library(ukpolice)
library(ggplot2)
library(dplyr)

# Stop-and-search records for Thames Valley, July 2020
tv_ss <- ukc_stop_search_force("thames-valley", date = "2020-07")

# Count outcomes within each officer-defined ethnicity and convert to shares
tv_ss2 <- tv_ss %>% 
  filter(!is.na(officer_defined_ethnicity) & outcome != "") %>%
  group_by(officer_defined_ethnicity, outcome) %>%
  summarise(n = n(), .groups = "drop_last") %>%  # stay grouped by ethnicity
  mutate(perc = n / sum(n))

theme_set(theme_bw())

p1 <- ggplot(tv_ss2, aes(x = outcome, y = perc,
                         group = outcome, fill = outcome)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent,
                     breaks = seq(0.25, 0.8, by = 0.25)) +
  scale_x_discrete(labels = scales::wrap_format(20)) +
  theme(legend.position = "none",
        axis.text.x = element_text(size = 7, angle = 45, hjust = 1)) +
  labs(x = "Outcome",
       y = "Percentage of stop and searches resulting in outcome",
       title = "Stop and Search Outcomes by Police-Reported Ethnicity",
       subtitle = "Thames Valley Police Department, July 2020",
       caption = "(c) Evan Odell | CC-BY-SA") +
  facet_wrap(~officer_defined_ethnicity)

p1
```

Here is a nice blog post about crime in SF: [Using R for Crime Analysis](https://wetlands.io/maps/Crime-Analysis-Using-R.html).
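
The `perc` column in the stop-and-search chunk depends on `summarise()` dropping only the last grouping variable, so that `sum(n)` is taken within each ethnicity group rather than over the whole table. The sketch below shows the same pattern on a small made-up table that can be checked by hand. The `toy` data and its values are invented for illustration and do not come from the police API.

```{r}
library(dplyr)

# Hypothetical toy data (invented values): two groups, a few outcomes each
toy <- tibble(
  group   = c("A", "A", "A", "B", "B"),
  outcome = c("x", "x", "y", "x", "y")
)

toy_perc <- toy %>%
  group_by(group, outcome) %>%
  summarise(n = n(), .groups = "drop_last") %>%  # result is still grouped by `group`
  mutate(perc = n / sum(n)) %>%                  # share of each outcome within its group
  ungroup()

toy_perc
```

Because the result is still grouped by `group` when `mutate()` runs, the shares add up to 1 within each group. Calling `ungroup()` before `mutate()` would instead give shares of the whole table.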