---
title: "ExploratoryDataAnalysis"
author: "Prof. Eric A. Suess"
date: "September 11, 2019"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
  word_document: default
---

Today we will discuss Exploratory Data Analysis (EDA). This is the process of exploring your data using visualization, transformation, and modeling (we will discuss modeling in more detail later).

```{r message=FALSE}
library(tidyverse)
```

Let's take a look at the *diamonds* data set and the variable *carat*.

```{r}
diamonds
```

```{r}
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```

```{r}
diamonds %>%
  count(cut_width(carat, 0.5))
```

Looking at the smaller diamonds.

```{r}
# Keep only diamonds under 3 carats, then plot that subset
smaller <- diamonds %>%
  filter(carat < 3)

smaller %>%
  ggplot(mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```

Look at *carat* by *cut*.

```{r}
smaller %>%
  ggplot(mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)
```

Looking for *typical values*.

```{r}
smaller %>%
  ggplot(mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)
```

Looking for *unusual values*. Let's look at the *y* variable.

```{r}
diamonds %>%
  ggplot(mapping = aes(x = y)) +
  geom_histogram(binwidth = 0.5)
```

Are there outliers?

```{r}
# Zoom in on the y-axis so that rare bins become visible
diamonds %>%
  ggplot(mapping = aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
```

Let's find the outliers.

```{r}
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
unusual
```

Remove the outliers.

```{r}
diamonds2 <- diamonds %>%
  filter(between(y, 3, 20))
```

It is better to convert them to **NA**, which means not available.

```{r}
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```

Scatterplots.

```{r}
# ggplot2 warns about the rows it drops because y is NA
diamonds2 %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_point()
```

```{r}
# na.rm = TRUE suppresses that warning
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)
```

Categorical variable.
*cut*

```{r}
diamonds %>%
  ggplot(mapping = aes(x = cut)) +
  geom_bar()
```

Continuous variable. *price*

```{r}
# after_stat(density) replaces the deprecated ..density.. notation
diamonds %>%
  ggplot(mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```

Putting them together in one plot.

```{r}
diamonds %>%
  ggplot(mapping = aes(x = cut, y = price)) +
  geom_boxplot()
```

For a different data set. *mpg*

```{r}
mpg %>%
  ggplot(mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
```

Re-order the classes by median *hwy*.

```{r}
mpg %>%
  ggplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot()
```

Flip the axes.

```{r}
mpg %>%
  ggplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  coord_flip()
```
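
Returning to the two ways of handling the unusual *y* values earlier in the notebook: the sketch below is one possible way to check what each approach does to the data. The object names `dropped` and `flagged` are illustrative and do not appear elsewhere in the notebook. It assumes the `diamonds` data set that ships with ggplot2.

```{r}
library(tidyverse)

# Option 1: drop every row whose y value falls outside [3, 20]
dropped <- diamonds %>%
  filter(between(y, 3, 20))

# Option 2: keep every row, but mark the implausible y values as missing
flagged <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# Rows lost by filtering
rows_lost <- nrow(diamonds) - nrow(dropped)

# Values set to NA by the replacement
values_flagged <- sum(is.na(flagged$y))

rows_lost
values_flagged

# The NA approach leaves the row count unchanged
nrow(flagged) == nrow(diamonds)
```

Filtering discards the other measurements recorded for those diamonds (such as *price* and *carat*), while the **NA** approach keeps them available for plots that do not involve *y*.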