Sampling

June 3, 2020

Introduction

Today we are going to get started with the Tidyverse. We will use the sample_n() function from the dplyr R package and make some plots with the ggplot2 R package. Both packages are part of the Tidyverse.

Filenames are important

Everyone has a resume. Is your resume file named resume.docx? If so, there is a lower chance you will be selected for a job you apply to. (That is just my opinion. I only have a little anecdotal evidence of this.) Be sure to give your resume a name that people can find on their computer. How about lastname_firstname_resume_date.docx? This file will be much easier to find in a directory full of resumes. And most importantly, it will not get overwritten by another resume.docx file that someone else submits.

Filenames

Please read over Jenny Bryan's presentation "naming things".

She has excellent ideas about how to name files. The section on machine-readable file names is something to pay attention to.

Your homework files should not be named homework1.Rmd, for example. I suggest lastname_firstname_Stat694_hw1.Rmd and lastname_firstname_Stat694_hw1.pdf.
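One way to see why a consistent naming pattern is machine readable is to build and then split a file name in R. This is only a small sketch: the student name below is hypothetical, and the underscore pattern is the one suggested above.

```r
# Hypothetical student details; replace with your own.
last_name  <- "smith"
first_name <- "jane"
course     <- "Stat694"
assignment <- "hw1"

# Build the file name from its parts, separated by underscores.
file_name <- paste0(
  paste(last_name, first_name, course, assignment, sep = "_"),
  ".Rmd"
)
file_name

# Because the delimiter is consistent, the parts can be recovered in code.
parts <- strsplit(tools::file_path_sans_ext(file_name), "_")[[1]]
parts
```

Because every field sits in the same position, a directory full of homework files can be sorted or filtered by name, course, or assignment without opening any of them.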

R Project

If you are going to work as a Data Scientist you should start doing all of your work in R Projects. Not only will this make interacting with the file system on your computer easier, it will help you keep track of where all your data files are. Hint: Create a /data subdirectory in your R Projects and put your data files there.

File > New Project …

Let's try this now.

Read Chapter 8 of the r4ds book.

Read this R-bloggers blog post, RStudio Projects and Working Directories: A Beginner’s Guide.
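The hint about a data subdirectory might look like the following in practice. This is a minimal sketch that assumes the working directory is the root of your R Project (which is the case when the project is open in RStudio). The file name mtcars_example.csv is made up for illustration.

```r
# Create the data subdirectory if it does not already exist.
dir.create("data", showWarnings = FALSE)

# Write a small built-in data set to the subdirectory.
# The file name is hypothetical; use your own data file here.
csv_path <- file.path("data", "mtcars_example.csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read it back using a path relative to the project root.
cars <- read.csv(csv_path)
dim(cars)
```

Using a relative path like data/mtcars_example.csv means the code keeps working when the whole project folder is moved or shared.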

R Notebooks

If you are going to work as a Data Scientist you should start doing all of your work in R Notebooks. (Or in Python notebooks.)

R Notebooks blend your R code chunks with the R output and the written text that explains your work. This structure is what makes Reproducible Research possible.

Data sets

Today we are going to look at a number of data sets in R and in R packages, and we will look at where some of the data sets come from.
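As a starting point, the data() function can list the data sets that come with a package. The sketch below assumes the ggplot2 package is installed, and mtcars is just one example of a built-in data set.

```r
# List the data sets in the built-in datasets package.
base_sets <- data(package = "datasets")$results
head(base_sets[, c("Item", "Title")])

# List the data sets shipped with ggplot2 (assumes ggplot2 is installed).
gg_sets <- data(package = "ggplot2")$results
gg_sets[, "Item"]

# The help page for a data set usually describes where it came from.
help("mtcars")
```

The Source section of a data set's help page is often the quickest way to find out where the data originally came from.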

Sampling and Stratified Sampling

Everyone should know how to take a random sample from a data file. It is easy. Today we will try the sample_n() function.

Being able to sample a data file is important for many reasons. One of the main reasons is to downsample the data file so we can develop our code more efficiently. Using a smaller sample of a bigger data file will take less time to process.
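Here is a minimal sketch of both kinds of sampling using sample_n(). It uses the built-in iris data set as a stand-in for a bigger data file. The seed value and the sample sizes are arbitrary choices for illustration.

```r
library(dplyr)

set.seed(694)  # arbitrary seed so the sample is reproducible

# Simple random sample: 10 rows from the 150 rows of iris.
iris_sample <- sample_n(iris, 10)
nrow(iris_sample)

# Stratified sample: 5 rows from each Species.
# On a grouped data frame, sample_n() samples within each group.
iris_strat <- iris %>%
  group_by(Species) %>%
  sample_n(5) %>%
  ungroup()
count(iris_strat, Species)
```

In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(n = ...), which works the same way on grouped data. sample_n() still works, so either can be used here.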