---
title: "ExploratoryDataAnalysis"
author: "Prof. Eric A. Suess"
date: "September 11, 2019"
---

Today we will discuss Exploratory Data Analysis (EDA): the process of exploring your data through visualization, transformation, and modeling. We will discuss modeling in more detail later.

```{r message=FALSE}
library(tidyverse)
```

Let's take a look at the *diamonds* data set and the variable *carat*.

```{r}
diamonds
```

```{r}
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```

```{r}
diamonds %>% 
  count(cut_width(carat, 0.5))
```

Now look only at the smaller diamonds, those under 3 carats, with a narrower bin width.

```{r}
smaller <- diamonds %>% 
  filter(carat < 3)

smaller %>% 
  ggplot(mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```

Look at the distribution of carat by cut, with one frequency polygon per cut.

```{r}
smaller %>% 
  ggplot(mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)
```

Looking for *typical values*. A very narrow bin width shows which carat values are most common.

```{r}
smaller %>% 
  ggplot(mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)
```
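
The same question can be checked with a table instead of a plot. This is a minimal sketch, assuming the same cutoff as above (carat < 3); the name `common_carats` is made up for this example.

```{r}
library(tidyverse)

# Count how often each distinct carat value occurs among the smaller
# diamonds, most frequent first. `common_carats` is a hypothetical name.
common_carats <- diamonds %>% 
  filter(carat < 3) %>% 
  count(carat, sort = TRUE)

# Show the ten most common carat values.
head(common_carats, 10)
```

The values at the top of this table should line up with the tallest bars in the histogram.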

Looking for *unusual values*. Let's look at the *y* variable.

```{r}
diamonds %>% 
  ggplot(mapping = aes(x = y)) +
  geom_histogram(binwidth = 0.5)
```

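Are there outliers? The unusual values are so rare that their bars cannot be seen at full scale, so zoom in on the small counts with `coord_cartesian()`.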

```{r}
diamonds %>% 
  ggplot(mapping = aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
```

Let's find the outliers.

```{r}
unusual <- diamonds %>% 
  filter(y < 3 | y > 20) %>% 
  select(price, x, y, z) %>% 
  arrange(y)
unusual
```

One option is to drop the rows that contain outliers.

```{r}
diamonds2 <- diamonds %>% 
  filter(between(y, 3, 20))
```

It is usually better to convert the unusual values to **NA**, which means not available, so that the rest of each row is kept.

```{r}
diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```
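
A quick check that the recoding did what we wanted. This is a sketch using the same rule as above; the name `diamonds_na` is made up here so that `diamonds2` is not overwritten.

```{r}
library(tidyverse)

# Same recoding rule as above, stored under a hypothetical name.
diamonds_na <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# No rows are dropped: this comparison should be TRUE.
nrow(diamonds_na) == nrow(diamonds)

# Number of y values that were recoded to NA.
diamonds_na %>% 
  summarise(n_missing = sum(is.na(y)))
```

The count of missing values should match the number of rows in `unusual`.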

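Scatterplots. ggplot2 leaves the missing values out of the plot and warns about it. Setting `na.rm = TRUE`, as in the second plot, suppresses the warning.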

```{r}
diamonds2 %>% 
  ggplot(mapping = aes(x = x, y = y)) +
  geom_point()
```

```{r}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point(na.rm = TRUE)
```

A categorical variable: *cut*.

```{r}
diamonds %>% 
  ggplot(mapping = aes(x = cut)) +
  geom_bar()
```

A continuous variable: *price*, plotted as a density so that cuts with very different counts can be compared.

```{r}
diamonds %>% 
  ggplot(mapping = aes(x = price, y = after_stat(density))) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```

Putting the two together in one plot: boxplots of price for each cut.

```{r}
diamonds %>% 
  ggplot(mapping = aes(x = cut, y = price)) +
  geom_boxplot()
```
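
The numbers behind the boxes can also be computed directly. This is a minimal sketch, not part of the original notes; the name `price_by_cut` and the column names are made up for this example.

```{r}
library(tidyverse)

# Quartiles of price within each cut. These correspond to the lower edge,
# middle line, and upper edge of each box in the boxplot.
price_by_cut <- diamonds %>% 
  group_by(cut) %>% 
  summarise(
    n      = n(),
    q1     = quantile(price, 0.25, names = FALSE),
    median = quantile(price, 0.50, names = FALSE),
    q3     = quantile(price, 0.75, names = FALSE),
    iqr    = IQR(price)
  )
price_by_cut
```

Comparing the `median` column across cuts is a quick way to confirm what the boxplot appears to show.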

Now a different data set, *mpg*: highway mileage (*hwy*) by vehicle *class*.

```{r}
mpg %>% 
  ggplot(mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
```

Reorder the classes by their median *hwy*.

```{r}
mpg %>% 
  ggplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot()
```
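
An alternative sketch, assuming you would rather do the reordering in the data than inside `aes()`: `fct_reorder()` from forcats (loaded with the tidyverse) does the same job. The name `mpg_ordered` is made up for this example.

```{r}
library(tidyverse)

# Turn class into a factor whose levels are ordered by median hwy.
mpg_ordered <- mpg %>% 
  mutate(class = fct_reorder(class, hwy, .fun = median))

# Same boxplot as above; the x axis is now labelled simply "class".
mpg_ordered %>% 
  ggplot(mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
```

Because the ordering is stored in the data, any later plot or table built from `mpg_ordered` uses the same order.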

Flip the coordinates so the class names are easier to read.

```{r}
mpg %>% 
  ggplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  coord_flip()
```