---
title: "Stat. 652 - NHANES"
author: "Prof. Eric A. Suess"
format: 
  html:
    embed-resources: true
---

We begin by exploring the NHANES data before applying any machine learning models.

1. Explore the variables.
2. Check for missing data.
3. Check for outliers.
4. Use AutoEDA.


```{r}
library(pacman)

p_load(NHANES, skimr, Amelia, naniar, DataExplorer, tidyverse, forcats)
```

Note the use of the *data()* function, which loads the dataset into your current environment.

What are the dimensions of the dataset? Is this a big dataset?

```{r}
help(NHANES)

data("NHANES")

NHANES
```
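
To answer the dimension question directly, here is a minimal sketch using base R and *dplyr*. Comparing the number of distinct `ID` values with the row count is an extra check added here for illustration; it assumes the `ID` column identifies a participant, as described on the help page.

```{r}
library(NHANES)
library(dplyr)

dim(NHANES)      # number of rows, then number of columns
glimpse(NHANES)  # one line per variable: name, type, first few values

# Assumption: ID identifies a participant. If this is smaller than
# nrow(NHANES), some participants appear in more than one row.
n_ids <- n_distinct(NHANES$ID)
n_ids
```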

The *skim()* function computes summary statistics for every variable in the dataset.

Note that for many of the categorical variables the first level is also the most frequent one. This can be a problem when you are interested in predicting the lower-frequency category, because some modeling and metric functions pay special attention to the first level of a factor.

```{r}
skim(NHANES)
```

To switch the order of the categories, use the *forcats* R package and the *fct_relevel()* function.

```{r}
# Counts with the original level order
NHANES |> 
  count(SleepTrouble)

# Counts after moving "Yes" to the first level.
# Relevel before counting, so that a grouping variable is not modified
# inside a grouped pipeline.
NHANES |> 
  mutate(SleepTrouble = fct_relevel(SleepTrouble, "Yes")) |> 
  count(SleepTrouble)
```

```{r}
NHANES <- NHANES |> 
  mutate(SleepTrouble = fct_relevel(SleepTrouble, "Yes")) 
```
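
A quick way to confirm the new level order is to inspect the levels directly. This is a minimal sketch that works on a copy of the column (the name `sleep_trouble` is only illustrative); calling *fct_relevel()* again is harmless when "Yes" is already first.

```{r}
library(NHANES)
library(forcats)

# Move "Yes" to the front; the remaining levels keep their order.
sleep_trouble <- fct_relevel(NHANES$SleepTrouble, "Yes")
levels(sleep_trouble)

# Counts in level order, with missing values shown if there are any.
table(sleep_trouble, useNA = "ifany")
```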


## Visualize the missing values in the NHANES dataframe

```{r}
Amelia::missmap(NHANES)
```

```{r}
naniar::vis_miss(NHANES)
```

```{r}
naniar::gg_miss_var(NHANES)
```
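
The plots above can be complemented with a numeric table. A minimal sketch, using *naniar*'s `miss_var_summary()` and `prop_miss()`; sorting with `arrange()` is done explicitly here rather than relying on a default ordering.

```{r}
library(NHANES)
library(naniar)
library(dplyr)

# One row per variable: count and percentage of missing values.
miss_summary <- miss_var_summary(NHANES) |> 
  arrange(desc(n_miss))

head(miss_summary, 10)

# Overall proportion of missing cells in the data frame.
prop_miss(NHANES)
```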

```{r}
NHANES_Sleep <- NHANES |> select(starts_with("Sleep"))
head(NHANES_Sleep)

naniar::gg_miss_var(NHANES_Sleep)
```
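
The outline at the top also lists an outlier check. One possible sketch, assuming the common boxplot rule (values more than 1.5 times the IQR beyond the quartiles) applied to `SleepHrsNight`. The rule and the choice of variable are assumptions made for illustration; other definitions of an outlier may suit the data better.

```{r}
library(NHANES)
library(dplyr)

# Assumption: boxplot rule, 1.5 * IQR beyond the first and third quartiles.
q <- quantile(NHANES$SleepHrsNight, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[[2]] - q[[1]])
lower <- q[[1]] - fence
upper <- q[[2]] + fence

# Tabulate the values that fall outside the fences.
sleep_outliers <- NHANES |> 
  filter(!is.na(SleepHrsNight),
         SleepHrsNight < lower | SleepHrsNight > upper) |> 
  count(SleepHrsNight)

sleep_outliers
```

Flagged values are candidates for a closer look, not automatic deletions.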


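The *create_report()* function from *DataExplorer* builds an automated EDA report. Each call below names its own output file (the file names are only examples), so the second report does not overwrite the first.

```{r}
# Report with the numeric response SleepHrsNight
create_report(NHANES, y = "SleepHrsNight", output_file = "report_SleepHrsNight.html")
```

```{r}
# Report with the categorical response SleepTrouble
create_report(NHANES, y = "SleepTrouble", output_file = "report_SleepTrouble.html")
```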
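
The full report can take a while to render. As a lighter alternative, some of the individual *DataExplorer* functions can be called on their own. This is a minimal sketch, and which functions to pick is a judgment call.

```{r}
library(NHANES)
library(DataExplorer)

# Basic counts: rows, columns, missing values, memory usage.
intro <- introduce(NHANES)
intro

plot_missing(NHANES)    # percentage of missing values per variable
plot_histogram(NHANES)  # histograms of the continuous variables
```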