---
title: "Stat. 652 - NHANES"
author: "Prof. Eric A. Suess"
format:
  html:
    embed-resources: true
---

We begin by examining the NHANES data before applying any machine learning models.

1. Explore the variables.
2. Check for missing data.
3. Check for outliers.
4. Use AutoEDA.

```{r}
# pacman::p_load() installs any missing packages and then attaches them all
library(pacman)
p_load(NHANES, skimr, Amelia, naniar, DataExplorer, tidyverse, forcats)
```

Note the use of the *data()* function, which loads the data into your current environment. What are the dimensions of your dataset? Is this a big dataset?

```{r}
help(NHANES)    # open the documentation for the dataset
data("NHANES")  # load the dataset into the current environment
NHANES          # print the tibble; the header shows its dimensions
```

The *skim()* function computes summary statistics for the entire dataset. Note that for several of the categorical variables, such as *SleepTrouble*, the first level has the highest frequency. This can be a problem when you are interested in predicting the lower-frequency category.

```{r}
skim(NHANES)
```

To switch the order of the categories, use the *fct_relevel()* function from the *forcats* R package.

```{r}
# Counts with the original level order
NHANES |>
  select(SleepTrouble) |>
  group_by(SleepTrouble) |>
  summarize(n = n())

# Counts with "Yes" moved to the first level.
# Relevel before grouping: mutate() cannot modify a grouping variable.
NHANES |>
  select(SleepTrouble) |>
  mutate(SleepTrouble = fct_relevel(SleepTrouble, "Yes")) |>
  group_by(SleepTrouble) |>
  summarize(n = n())
```

```{r}
# Make the new level order permanent
NHANES <- NHANES |>
  mutate(SleepTrouble = fct_relevel(SleepTrouble, "Yes"))
```

## Visualize the missing values in the NHANES dataframe

```{r}
Amelia::missmap(NHANES)
```

```{r}
naniar::vis_miss(NHANES)
```

```{r}
naniar::gg_miss_var(NHANES)
```

```{r}
# Restrict attention to the sleep-related variables
NHANES_Sleep <- NHANES |>
  select(starts_with("Sleep"))

head(NHANES_Sleep)

naniar::gg_miss_var(NHANES_Sleep)
```

```{r}
create_report(NHANES, y = "SleepHrsNight")
```

```{r}
create_report(NHANES, y = "SleepTrouble")
```
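
The outline above lists checking for outliers, but none of the code does so directly. One possible approach is the common 1.5 × IQR boxplot rule applied to a single numeric variable. The following is a minimal sketch under that assumption, not a prescribed method: the helper `flag_iqr_outliers()` and its multiplier `k` are hypothetical names introduced here for illustration, and *SleepHrsNight* is used only because it is one of the response variables in the reports above.

```{r}
library(NHANES)
library(dplyr)

data("NHANES")

# Hypothetical helper (not from any package): returns TRUE where x lies
# outside the k * IQR fences around the quartiles; NA values stay NA.
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < (q[1] - fence) | x > (q[2] + fence)
}

# Count missing values and flagged values for one numeric variable
NHANES |>
  summarize(
    n_missing  = sum(is.na(SleepHrsNight)),
    n_outliers = sum(flag_iqr_outliers(SleepHrsNight), na.rm = TRUE)
  )

# A boxplot draws its whiskers with the same 1.5 * IQR convention
boxplot(NHANES$SleepHrsNight, horizontal = TRUE, xlab = "SleepHrsNight")
```

Values flagged this way are only candidates for a closer look. Whether a value is a data error or a legitimate extreme observation still has to be judged from the variable's definition in `help(NHANES)`.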