Prof. Eric A. Suess
So how should you complete your homework for this class?
Read: Chapters 1, 2, 3
Download and install the current version of R and RStudio.
Do 3.2.4 Exercises 1, 2, 3, 4, 5
Do 3.3.1 Exercises 1, 2, 3, 4, 6
Do 3.5.1 Exercises 1, 2, 4
We see an empty gray panel. ggplot(data = mpg) only creates the base layer of a ggplot2 plot; no geom has been added yet, so nothing is drawn.
library(tidyverse)
ggplot(data = mpg)
By viewing the mpg dataframe we see there are 234 rows and 11 columns.
mpg
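The row and column counts can also be checked without printing the whole dataframe. This is a minimal sketch using base R functions; loading only ggplot2 is a choice made here to keep the example self-contained, since mpg ships with that package.

```r
library(ggplot2)  # mpg is a dataset that ships with ggplot2

dim(mpg)   # number of rows, then number of columns
nrow(mpg)  # rows only
ncol(mpg)  # columns only
```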
The variable drv has levels: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive
glimpse(mpg)
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "...
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quattro", "a4...
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 3.1...
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2008, 1999, 1...
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8...
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", "manual(m...
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4", "4", "4"...
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15, 17, 16, 1...
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 24, 25, 23, 2...
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p"...
$ class <chr> "compact", "compact", "compact", "compact", "compact", "compact", "compac...
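The values of drv can also be listed directly instead of being read off the glimpse() output. A small sketch using base R; the choice of unique() and table() here is just one possible way to do it.

```r
library(ggplot2)  # provides the mpg dataset

# Distinct values of drv, sorted
sort(unique(mpg$drv))

# Number of cars for each type of drive train
table(mpg$drv)
```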
Scatterplot of y = hwy versus x = cyl. The average highway miles per gallon goes down as the number of cylinders increases.
ggplot(mpg, aes(y = hwy, x = cyl)) +
  geom_point()
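The claim about the average can be checked numerically by summarising hwy within each value of cyl. This is a sketch that assumes dplyr is available; the name hwy_by_cyl is only illustrative.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Mean highway mpg and number of cars for each number of cylinders
hwy_by_cyl <- mpg %>%
  group_by(cyl) %>%
  summarise(mean_hwy = mean(hwy), n = n())

hwy_by_cyl
```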
This is not useful because many observations fall on the same point in the plot. Plotting two categorical variables against each other in a scatterplot hides how many observations are in each combination. It would be better to make a contingency table.
ggplot(mpg, aes(y = class, x = drv)) +
  geom_point()
count(mpg, drv, class)
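count() returns the counts in long form. As a sketch of two alternatives, base R table() lays the same counts out as a two-way contingency table, and geom_count() keeps the scatterplot but sizes each point by the number of observations at that position. The names class_by_drv and p_count are illustrative only.

```r
library(ggplot2)  # provides the mpg dataset

# Two-way contingency table: rows are class, columns are drv
class_by_drv <- table(class = mpg$class, drv = mpg$drv)
class_by_drv

# Same information as a plot: point size shows the count in each cell
p_count <- ggplot(mpg, aes(x = drv, y = class)) +
  geom_count()
p_count
```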
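If color is inside aes() as a mapping, it needs a variable from the dataframe to give the points different colors, for example drv or class. Inside aes(), "blue" is treated as a constant value with a single level, so all of the points get the same default color and the legend shows one entry labeled "blue". To change all of the points to blue, color needs to be outside of aes().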
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
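The difference between the first and third plots can also be inspected without looking at the figures. The sketch below uses ggplot2's layer_data() to pull out the colour that is actually used for the points; the names p_inside and p_outside are illustrative.

```r
library(ggplot2)

# "blue" inside aes(): treated as data, so a default scale picks the colour
p_inside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# "blue" outside aes(): used directly as the point colour
p_outside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the computed data for the first layer,
# including the colour used to draw each point
unique(layer_data(p_inside)$colour)
unique(layer_data(p_outside)$colour)
```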
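The categorical variables are the ones with type <chr> in the glimpse() output: manufacturer, model, trans, drv, fl, and class. The remaining variables are numeric, with type <dbl> or <int>.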
mpg
?mpg
glimpse(mpg)
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", ...
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quattro", "a...
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 3....
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2008, 1999, ...
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, ...
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", "manual(...
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4", "4", "4...
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15, 17, 16, ...
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 24, 25, 23, ...
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p...
$ class <chr> "compact", "compact", "compact", "compact", "compact", "compact", "compa...
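The column types can also be checked programmatically rather than read from the printout. A minimal sketch using base R; is_chr is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset

# TRUE for each column stored as character
is_chr <- vapply(mpg, is.character, logical(1))

names(mpg)[is_chr]   # character (categorical) columns
names(mpg)[!is_chr]  # numeric columns
```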
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
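With a continuous variable, color uses a gradient, with lighter colors for higher values, and size draws bigger points for higher values. A continuous variable cannot be used for shape; this gives an error. With a categorical variable, each level gets its own distinct color or shape.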
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy, color = displ))
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy, size = displ))
# ggplot(data = mpg) +
#   geom_point(mapping = aes(x = cyl, y = hwy, shape = displ)) # gives an error
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy, shape = drv))
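The same variable can be mapped to more than one aesthetic, here size and shape. This is redundant and not good practice! ggplot2 also warns that using size for a discrete variable is not advised.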
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy, size = drv, shape = drv))
What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
The expression is evaluated for each row, and the points are colored by the resulting TRUE and FALSE values of the inequality.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy, colour = displ < 5))
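To see what is being mapped to colour, the same expression can be evaluated outside the plot. A small sketch; under_5 is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset

# The same inequality, evaluated row by row, gives a logical vector
under_5 <- mpg$displ < 5

class(under_5)  # type of the resulting vector
table(under_5)  # how many cars fall in each of the two groups
```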
displ is a continuous variable, so facet_wrap() makes a separate panel for each distinct value of displ.
ggplot(data = mpg) +
  geom_point(mapping = aes(y = hwy, x = cyl)) +
  facet_wrap(~ displ, nrow = 2)
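The number of panels in the plot above matches the number of distinct values of the faceting variable, which can be counted directly. A minimal sketch in base R:

```r
library(ggplot2)  # provides the mpg dataset

# One facet panel is drawn for each distinct value of displ
length(unique(mpg$displ))
```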
The missing cells in the plot mean there is no data available for that combination of values of the variables.
Switch x and y in the second plot to see the relationship.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
ggplot(data = mpg) +
  geom_point(mapping = aes(y = drv, x = cyl))
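The empty facets can be confirmed by tabulating drv against cyl and listing the combinations with no observations. This is a sketch using base R; drv_by_cyl is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset

# Counts of cars for every combination of drv and cyl
drv_by_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_by_cyl

# Combinations with zero cars correspond to the empty facets
subset(as.data.frame(drv_by_cyl), Freq == 0)
```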
Compare to 3.3.1 Exercise 1.
Faceting makes it easier to see where the data for each class lies, compared with using class for color in a single plot.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_wrap(~ class, nrow = 2)