---
title: "Basic Statistics and Hypothesis Testing in R"
author: "Prof. Eric A. Suess"
date: "November 28, 2018"
output:
  word_document: default
  html_notebook: default
  pdf_document: default
---

If you want to learn about Statistics using base R a nice website is the [Quick-R](https://www.statmethods.net/index.html) website, see [Statistics > t-tests](https://www.statmethods.net/stats/ttest.html)

These are some example of basic statistics and hypothesis testing in R.  Most of the code here is from base R.

We will use the *mtcars* data set.

```{r, warning=FALSE}
library(tidyverse)
```


```{r}
mtcars
```

# Summary Statistics

```{r}
mtcars %>% summarize(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
```

# Subsets and statistics.

```{r}
mtcars %>% group_by(vs) %>%
  summarize(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
```

Note that the t.test function does not work well with the tidyverse.  There is a new package called *infer* that works with the tidyverse.  And if you are interested chech out the *broom* package.

I like using the formula interface when doing hypothesis testing.

## t test

```{r}
?t.test

with(mtcars, boxplot(mpg ~ vs))

output1 <- with(mtcars, t.test(mpg ~vs))

output1

summary(output1)

output1$statistic
output1$p.value
```

## ANOVA

```{r}
?aov

with(mtcars, boxplot(mpg ~cyl))

output2 <- with(mtcars, aov(mpg ~ cyl))

output2

summary(output2)
```

## Linear Regression

```{r}
?lm

attach(mtcars)

plot(mpg ~ wt)

output3 <-lm(mpg ~ wt)

output3

summary(output3)


plot(mpg ~ wt)
abline(lm(mpg ~ wt))

detach(mtcars)

```

## Using ggplot

```{r}
mtcars %>% ggplot(aes(x = wt, y = mpg)) +
  geom_point() + 
  geom_smooth(method=lm) +
  geom_smooth()
```


If you want to learn Hypothesis Testing using modern R code check out the book [moderndive](https://moderndive.com).  See [Chapter 10](https://moderndive.com/10-hypothesis-testing.html).  The authors of this book are working on a new package called [infer R package](https://infer.netlify.com/).

```{r}
library(infer)
```

The two sample t test example from the website.

```{r}
library(nycflights13)
library(dplyr)
library(stringr)
library(infer)

set.seed(2017)
fli_small <- flights %>% 
  sample_n(size = 500) %>% 
  mutate(half_year = case_when(
    between(month, 1, 6) ~ "h1",
    between(month, 7, 12) ~ "h2"
  )) %>% 
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>% 
  select(arr_delay, dep_delay, half_year, 
         day_hour, origin, carrier)
```

```{r}
obs_t <- fli_small %>%
  specify(arr_delay ~ half_year) %>%
  calculate(stat = "t", order = c("h1", "h2"))
```

```{r}
obs_t <- fli_small %>% 
  t_stat(formula = arr_delay ~ half_year, order = c("h1", "h2"))
```


```{r}
t_null_perm <- fli_small %>%
  # alt: response = arr_delay, explanatory = half_year
  specify(arr_delay ~ half_year) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("h1", "h2"))
```


```{r}
visualize(t_null_perm) +
  shade_p_value(obs_stat = obs_t, direction = "two_sided")
```

Randomized p-value

```{r}
t_null_perm %>% 
  get_p_value(obs_stat = obs_t, direction = "two_sided")
```


Theoretical p-value

```{r}
t_null_theor <- fli_small %>%
  # alt: response = arr_delay, explanatory = half_year
  specify(arr_delay ~ half_year) %>%
  hypothesize(null = "independence") %>%
  # generate() ## Not used for theoretical
  calculate(stat = "t", order = c("h1", "h2"))
```

```{r}
visualize(t_null_theor, method = "theoretical") +
  shade_p_value(obs_stat = obs_t, direction = "two_sided")
```

Overlay

```{r}
visualize(t_null_perm, method = "both") +
  shade_p_value(obs_stat = obs_t, direction = "two_sided")
```

Compute the Theoretical p-value

```{r}
fli_small %>% 
  t_test(formula = arr_delay ~ half_year,
         alternative = "two_sided",
         order = c("h1", "h2")) %>% 
  dplyr::pull(p_value)
```