AutoEDA

June 8, 2020

Introduction

Today we are going to learn about the automated exploration of the variables in a data set. This type of data analysis is usually called Automated Exploratory Data Analysis (AutoEDA).

AutoEDA gives a quick view into the high-level details of a data set. When presented with a new data set, running some AutoEDA algorithms can be a very useful way to learn about the data. It helps answer questions such as:

How many rows of data are in the data set?
Which variables are numeric? Which are categorical and need to be coded as factors in R?
How many missing observations are there in the data set? In each of the variables? Is imputation possible?
What are the distributions of each variable? Are there outliers?

Are there any variables that are completely missing? Any that take just one value?
What are the relationships between the variables/columns? Can the number of columns be reduced (for example, with PCA)?
Are there any groups in the data? Can the observations/rows be put into subgroups (for example, with clustering)?
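Before turning to a dedicated package, it helps to see what a few of these questions look like in base R. The sketch below is only an illustration: the choice of the built-in airquality data set, the three clusters, and the seed are assumptions made for this example, not part of the course material.

```r
# airquality ships with R and contains some missing values (illustrative choice).
df <- airquality

# How many rows and columns are in the data set?
dim(df)

# What type is each variable, and which ones are numeric?
sapply(df, class)
sapply(df, is.numeric)

# How many missing observations are there overall and in each variable?
sum(is.na(df))
colSums(is.na(df))

# What do the distributions look like? Are there outliers in Ozone?
summary(df)
boxplot.stats(df$Ozone)$out

# Are any variables completely missing, or do any take just one value?
all_missing <- sapply(df, function(x) all(is.na(x)))
one_value   <- sapply(df, function(x) length(unique(x[!is.na(x)])) == 1)

# Relationships between the columns, using complete rows only.
complete <- na.omit(df)
round(cor(complete), 2)

# Can the number of columns be reduced? PCA on standardized columns.
pca <- prcomp(complete, scale. = TRUE)
summary(pca)

# Are there groups in the rows? k-means with 3 clusters (an arbitrary choice).
set.seed(650)
clusters <- kmeans(scale(complete), centers = 3)
table(clusters$cluster)
```

Each question takes one or two lines here, but the code has to be adapted for every new data set, which is the work AutoEDA tries to remove.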

In the Stat. 650 courses we will learn how to write code to answer these questions. For now we would like to get to the answers with the least amount of code. AutoEDA can be used to do this.

We will be using the Tidyverse in R to write the code. Note that some of the main AutoEDA R packages use the Tidyverse also.

arXiv.org

What is arXiv.org?

Answer: It is a preprint service. This means it hosts draft research papers before they are formally published. It has also become one of the main places where researchers put their research papers.

AutoEDA

I would like you to read the arXiv.org paper "The Landscape of R Packages for Automated Exploratory Data Analysis" and familiarize yourself with the author's GitHub repository, mstaniak/autoEDA-resources. The repository gives the names of, and links to, all of the main AutoEDA R packages that are available.

Using AutoEDA for data exploration is much faster than writing individual function calls for each variable/column in a data set. It can be used to summarize all of the variables, numeric and categorical, and to correlate variables with the target variable in the data set.
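To make the idea concrete, here is a minimal base R sketch of what an AutoEDA function does: one call that summarizes every column, and one call that correlates the numeric columns with a target. The function names quick_overview and target_correlations are hypothetical helpers written for this example; they are not taken from any AutoEDA package, and treating Sepal.Length in iris as the target is only an assumption for illustration.

```r
# Summarize every column of a data frame in a single call.
# (quick_overview is a hypothetical helper, not a package function.)
quick_overview <- function(df) {
  data.frame(
    variable  = names(df),
    type      = vapply(df, function(x) class(x)[1], character(1)),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    n_unique  = vapply(df, function(x) length(unique(x)), integer(1)),
    row.names = NULL
  )
}

# Correlate every other numeric column with a chosen target column.
# (target_correlations is also a hypothetical helper.)
target_correlations <- function(df, target) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  predictors   <- setdiff(numeric_cols, target)
  vapply(df[predictors],
         function(x) cor(x, df[[target]], use = "complete.obs"),
         numeric(1))
}

overview <- quick_overview(iris)
print(overview)

cors <- target_correlations(iris, "Sepal.Length")
print(round(sort(cors, decreasing = TRUE), 2))
```

The AutoEDA packages go much further than this (plots, reports, handling of categorical targets), but the principle is the same: loop over the columns once instead of writing a call per column.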

GitHub

Nearly everyone who works in Data Science uses GitHub or at least accesses it for information. You should get an account and start learning how it is used.

GitHub is where developers post their code, share their work with others, and distribute their code.

You can also Star GitHub projects. This is a good way to give some initial feedback to a developer.

Note that GitHub can also be used to host a website on the internet.

R Packages

There are many R Packages on GitHub. So what is an R Package?

It is good to read the book R packages by Hadley Wickham. If you read the first 3 chapters of the book you will have a good idea about how to create a package.

Note that in RStudio there is a way to create an R Project that sets up everything you need for a package, including a Hello World R function.

Installing R Packages from GitHub

There are two well-known ways to install packages from GitHub.
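The slide does not name the two ways, so the following is an assumption: two commonly used options are the install_github() function from the remotes package and the function of the same name from the devtools package. The repository tidyverse/dplyr is used only as an illustration of the "owner/repository" form; any other package repository is installed the same way. Running this code needs an internet connection.

```r
# Option 1 (assumed): the lightweight remotes package.
install.packages("remotes")                 # one-time setup from CRAN
remotes::install_github("tidyverse/dplyr")  # "owner/repository" on GitHub

# Option 2 (assumed): devtools, which offers the same function
# alongside many other package development tools.
install.packages("devtools")                # one-time setup from CRAN
devtools::install_github("tidyverse/dplyr")
```

Either call downloads the source from GitHub and builds it locally, so the development version can differ from the release on CRAN.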

CRAN and ROpenSci

The R community contributes and maintains the R packages that are hosted on CRAN.

There is also rOpenSci, which hosts R packages that can be installed. Often these are packages that are still being developed and may one day be submitted to CRAN.

AutoEDA

There are many packages on CRAN and on GitHub that can be used to perform automated EDA. My favorites are:

AutoEDA

Let's give some of the AutoEDA packages a try.

Let's take a look at some further data sets in R.

fueleconomy - common, vehicles

nycflights13 - airlines, airports, flights, planes, weather

Bike Share Data
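A quick way to see how large the packaged data sets above are is to load them and report their dimensions. This is a sketch only: data_dims is a hypothetical helper written for this example, and the calls are guarded so that nothing is attempted unless the fueleconomy and nycflights13 packages are already installed.

```r
# Report the number of rows and columns of named data sets in a package.
# (data_dims is a hypothetical helper, not part of either package.)
data_dims <- function(pkg, names) {
  env <- new.env()
  utils::data(list = names, package = pkg, envir = env)  # load into env
  rows <- vapply(names, function(n) nrow(get(n, envir = env)), integer(1))
  cols <- vapply(names, function(n) ncol(get(n, envir = env)), integer(1))
  data.frame(package = pkg, data_set = names, rows = rows, cols = cols,
             row.names = NULL)
}

# Only run for packages that are installed on this machine.
if (requireNamespace("fueleconomy", quietly = TRUE)) {
  print(data_dims("fueleconomy", c("common", "vehicles")))
}
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(data_dims("nycflights13",
                  c("airlines", "airports", "flights", "planes", "weather")))
}
```

Knowing the sizes first is useful because some AutoEDA reports can be slow on the larger tables.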

Covid-19

Here is a blog post about Top 100 R resources on Novel COVID-19 Coronavirus.

Which of these R packages are on CRAN or on GitHub?

CRAN COVID-19 Data Hub
GitHub nCov2019 for studying COVID-19 coronavirus outbreak