Statistics 694: Homework
Homework 5:
(due Friday December 4, 2020)
- Run the code in mpg-auto.zip using the random_forest model. Does the Random Forest perform better than the Linear Regression?
Homework 4:
(due Friday November 6, 2020)
- Install the DataExplorer R package and run the main function create_report(). See the Introduction to DataExplorer. Run a report of mtcars and another dataset of your choice. I would suggest the storms dataset from the nasaweather R package. Describe the first two Principal Components for each dataset that you produce a report for.
- Find a blog post that demonstrates the use of clustering in R. Run the code and produce a plot of the clusters determined by the algorithm.
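To help with describing the first two Principal Components, the loadings can also be inspected directly. The following is a minimal sketch using base R's prcomp() on mtcars. It assumes the variables are centered and scaled; it is not necessarily the exact computation used inside the DataExplorer report.

```r
# Principal components of mtcars using base R (no extra packages needed).
# Assumption: every variable is centered and scaled before the decomposition.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:2], 3))

# Loadings (variable weights) of the first two components.
# Variables with large absolute loadings drive that component.
print(round(pca$rotation[, 1:2], 2))
```

Reading down each column of loadings shows which variables move together, which is a reasonable starting point for a written description of each component.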
Homework 3:
(due Wednesday November 4, 2020)
Provide a summary of your class project in an R Notebook. Include a direct reference to the data you are using. If you have used R code to analyze your data, show the code and the results in your R Notebook. If you have specific questions, please include them at the top of your summary.
Homework 2:
- Associate CoLab with your university Google email.
- Start an R Notebook on CoLab and write up a specific summary of your research plan.
- Give your Notebook an appropriate filename: lastname_firstname_Stat 694_ProjPlan
- In text chunks, state exactly what your project will entail.
- Give a direct link to the data you will be using.
- Explain any algorithms or model(s) you will be implementing.
- Share the notebook with me and submit a link to your notebook in Blackboard.
- Hints: Find a dataset that is already prepared. If you are proposing to create your own dataset by scraping the web, be sure to check the legal requirements for accessing the website you are trying to scrape. If the website has an API, you should use it rather than scraping the site. Also, check to see if someone has created an R package to access the API. Check out kaggle or data.world for possible data. If you are interested in working with time-dependent data, check out fpp3. If you are interested in working with text data, check out tidytextmining. If you are interested in modeling in R, check out the new book tmwr.
Homework 1:
(due Wednesday September 4, 2020)
- Project Proposal: Write up a proposal for your project idea. Give as much detail as you can about the topic, data, planned analysis, etc. For those enrolled in the class, please submit your proposal in Blackboard. If you are not enrolled, please send it to me in an email with the subject RE: Stat 694 project proposal.
Please disregard everything that is below.
Homework 4:
Read:
- Happy Git and GitHub for the useR Chapter 1 and 6
Complete the following problems: Upload your individual files in Blackboard. Please do not zip your files together before submitting; I cannot see your work directly if it is zipped.
- In your GitHub account create a Repository. Configure R Studio to be used with git and GitHub. Create an .Rmd test file and commit and push the file. Give the link to your GitHub Repository. Submit the link to your file on your GitHub.
- Run the code in 08_Chicago_Crime.zip. Was there a spike in crime during the time of the protests a few weeks ago? Turn in a .Rmd and .docx file.
- Install the gutenbergr R package, download a book from Project Gutenberg, and view it.
- Download the cat-and-dogs data from kaggle. Show that you can run the following .R code: cats-and-dogs-dir-ver02.R. Turn in a .Rmd and .docx file.
- Create a disk.frame from the Fannie Mae Single-Family Loan Performance data set. Start by using only one year of data. This may be difficult to do.
Homework 3:
Read:
- r4ds Section II Wrangle, Chapters 9 - 11
Complete the following problems: Upload your individual files in Blackboard. Please do not zip your files together before submitting; I cannot see your work directly if it is zipped.
- Install SQLite DB Browser and test it out with the .sqlite databases created using the R code in dbplyr04.zip. Nothing to turn in for this problem.
- Read Chapter 12 Section 3 Pivoting in r4ds. Do 12.3.3 Exercise 1. Turn in lastname_firstname_Stat694_hw3_prob2.Rmd and lastname_firstname_Stat694_hw3_prob2.pdf or .docx
- Read Chapter 13 in r4ds. Do 13.5.1 Exercise 3. Turn in lastname_firstname_Stat694_hw3_prob3.Rmd and lastname_firstname_Stat694_hw3_prob3.pdf or .docx
- Download the Chicago Crime data since 2001 as a .csv file. Load it into a data.table and export it to a .sqlite file. Try to access the database from the SQLite DB Browser. Nothing to turn in for this problem.
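The data.table to .sqlite step in the last problem might look something like the sketch below. It assumes the data.table, DBI, and RSQLite packages are installed. The mtcars data stands in for the crime file, and the table name and output path are hypothetical choices for illustration only.

```r
library(data.table)
library(DBI)

# Stand-in data: for the assignment this would instead be
# fread("<path to the downloaded Chicago Crime .csv>").
crimes <- as.data.table(mtcars, keep.rownames = "model")

# Hypothetical output file; any path ending in .sqlite works.
db_path <- file.path(tempdir(), "example.sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Write the table, then read back a row count to confirm the export worked.
dbWriteTable(con, "crimes", crimes, overwrite = TRUE)
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM crimes")$n
print(n_rows)

dbDisconnect(con)
```

The resulting file at db_path can then be opened from the SQLite DB Browser.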
Project:
Submit in Blackboard.
I would like to receive your Idea, Plan & Development proposal for your class Project next week.
The Project should be something that you are interested in working on beyond this class assignment.
One thing that can be very valuable in the interview process is being able to discuss and show your own work, where you have a genuine interest and motivation. This might come up in a question such as, "Please tell me about any projects or research you have worked on outside of a class."
In the end you could write a blog post, or upload your work to GitHub, or post a Notebook on Kaggle.
For the class Project you will be asked to present an R Notebook explaining what you have worked on.
Some ideas.
- Pick a kaggle competition to work on.
- Find an R package on GitHub or RSciNet and figure out how to use it.
- Learn and show the use of the Unix command line to work with data.
- Learn how to write a Unix shell script.
- Pick an R package to learn about and demonstrate its usage.
Homework 2:
Read:
- r4ds Chapter 7
- arXiv.org paper The Landscape of R Packages for Automated Exploratory Data Analysis and familiarize yourself with the author's GitHub mstaniak autoEDA-resources
- arXiv.org paper DriveML: An R Package for Driverless Machine Learning
- R packages Chapters 1, 2, 3
Complete the following problems: Upload your files in Blackboard. Please do not zip your files together before submitting; I cannot see your work directly if it is zipped.
- Open a kaggle account and find their Micro Courses. Open a GitHub account. Nothing to turn in for this problem.
- Load the mlbench R package. Using the DataExplorer R package, run a report of the BostonHousing data using medv as the target variable. Name your report file lastname_firstname_Stat694_hw2_prob1_BostonHousing_report.html and convert it to a .pdf.
- The nycflights13.zip R Project contains R Tidyverse code to merge all of the many data tables in the nycflights data set into one dataframe/tibble. Run the code to create an overall dataframe. From the code, determine which variable(s) have issues with their recorded data values. Turn in lastname_firstname_Stat694_hw2_prob2.Rmd and lastname_firstname_Stat694_hw2_prob2.pdf or .docx
- Download the Lyft Bay Wheels trip data for 2020. There are 5 months of data available. Using R code, download the data files, unzip the data files, read all of the data files into R, and use the DataExplorer R package to summarize all 5 months of data together. If your computer is not able to process all of the data, downsample the data. Turn in lastname_firstname_Stat694_hw2_prob1_BostonHousing_report.html (convert to a .pdf) and lastname_firstname_Stat694_hw2_prob3.pdf or .docx You may use the posted code to develop your .Rmd file. Do not turn in the instructor's .Rmd file. You can use some of the code and add your own comments.
- Install the snedata R package from GitHub. Download the MNIST data set and plot a few of the 28 x 28 images. Turn in lastname_firstname_Stat694_hw2_prob4.pdf or .docx
- Download and install the instructor's R Package MyMeanSDPackage.tar.gz. Nothing to turn in for this problem.
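For the trip data problem, the read-everything-then-downsample pattern can be sketched in base R as follows. The five small CSV files written here are made-up stand-ins for the unzipped monthly files, and the column names, folder name, and row cap are assumptions for illustration. For the real data, download.file() and unzip() from base R would come first.

```r
# Stand-in for the unzipped monthly files: write five tiny CSVs to a temp folder.
data_dir <- file.path(tempdir(), "trips")
dir.create(data_dir, showWarnings = FALSE)
for (m in 1:5) {
  month_df <- data.frame(month = m, duration_sec = c(300, 600, 900) * m)
  write.csv(month_df,
            file.path(data_dir, sprintf("trips_2020%02d.csv", m)),
            row.names = FALSE)
}

# Read every CSV in the folder and stack them into one data frame.
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
trips <- do.call(rbind, lapply(csv_files, read.csv))

# Downsample when the combined data is larger than the machine can handle.
# max_rows is an arbitrary cap chosen for this example.
set.seed(694)
max_rows <- 10
if (nrow(trips) > max_rows) {
  trips <- trips[sample(nrow(trips), max_rows), ]
}
print(dim(trips))

# DataExplorer::create_report(trips)  # uncomment once DataExplorer is installed
```

Setting a seed before sampling keeps the downsampled report reproducible from one run to the next.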
Homework 1:
Read:
- r4ds Chapters 1, 2, 3, 6, 8
Complete the following problems: Upload your 2 files in Blackboard. Please do not zip your files together before submitting; I cannot see your work directly if it is zipped.
- Install R and R Studio. Open an R Studio Cloud account. Nothing to turn in for this problem.
- Try not to make a mess of your work. Use an R Project or multiple R Projects if you are working with multiple different datasets. Create your project within your class directory in a subdirectory Homework. Nothing to turn in for this problem.
- Install the Tidyverse. Use options(Ncpus = 8), changing 8 to the number of CPU cores on your machine. Nothing to turn in for this problem.
- Save the mtcars dataset to a .zip file using R code. Unzip the file in the directory you are working in. Turn in lastname_firstname_Stat694_hw1_prob3.pdf or .docx
- Familiarize yourself with the Fannie Mae Single-Family Loan Performance Data. Download the Acquisition Data and the Performance Data sample files. Write R code to read in these two sample files. Take a 20% sample of each data set. Turn in lastname_firstname_Stat694_hw1_prob4.pdf or .docx.
- Download the Fannie Mae Single-Family Loan Performance data set. Nothing to turn in for this problem.
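Two of the steps above can be sketched in base R: setting Ncpus from the detected core count, and taking a 20% random sample of rows. Here mtcars stands in for the loan sample files, and sample_rows() is a hypothetical helper written for this example, not part of any package.

```r
# Set Ncpus from the detected core count instead of hard-coding 8.
# detectCores() can return NA on some systems, so fall back to 1.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L
options(Ncpus = n_cores)
# install.packages("tidyverse")  # uncomment to install; uses Ncpus for parallel builds

# Hypothetical helper: keep a random fraction of the rows of a data frame.
sample_rows <- function(df, frac = 0.2) {
  keep <- sample(nrow(df), size = ceiling(frac * nrow(df)))
  df[keep, , drop = FALSE]
}

# mtcars stands in for the Acquisition and Performance sample files.
set.seed(694)
mtcars_20 <- sample_rows(mtcars, frac = 0.2)
print(nrow(mtcars_20))
```

The same helper could be applied to each loan file after it has been read in.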
Everything below this line is from past offerings of Stat. 694 and from previous Data Science Workgroup meetings.
Homework 1:
Develop a plan for a Data Science Project to be completed during the class. Your project should be related to a topic of interest to you and should, hopefully, be related to a career opportunity you are planning to pursue.
Start a Slack discussion directly with the instructor about your project.
Do not post your ideas in the general or random channels in Slack.
Describe the following:
- Brainstorm if you do not have an idea. Try to come up with an idea.
Look at some job descriptions. BAJobs.com
Look at some data competitions. Kaggle
Look at some sources of data. data.world
- Once you have an idea, describe the idea as clearly as possible.
- Make a list of steps you plan to do to complete your project. The list might start with identifying a source for the data you plan to use. Describe the project of your work.
- State whether you plan to produce a written report or a blog post or an App. A written report should be produced using an R Notebook using good reproducible research practices. A blog post should be posted on your blog. An App could be shared on shiny.org
RStudio resources:
- RStudio
- RStudio Cheatsheets
- RStudio Cheatsheet: IDE
- RStudio Cheatsheet: R Markdown
- RStudio Shiny
- RStudio Cheatsheets: shiny
- RStudio: Learn Shiny
- RStudio Packages
- RStudio Cheatsheets: Package Development
Fall 2018 / Spring 2019
Homework 4:
(due Friday Oct. 26, 2018)
- Login to the University Library > Database A-Z > Safari. Find a book or video series related to your Data Science Project. Email me your selection.
Homework 3:
- Turn in Homework 1 through slack.
- Get a data.world account and try out their SQL Tutorial.
- If you are interested in SQL, try to update my sqldf2 R code to use the tidyverse and to use dbplyr with sqlite. This would be part of an interesting project.
- Try the Google Dataset Search engine to find some relevant data for your project, if you do not have a dataset yet.
Homework 2:
- Go through the video or written Shiny tutorials.
Homework 1:
Develop a plan for a Data Science Project to be completed during the class. Your project should be related to a topic of interest to you and should, hopefully, be related to a career opportunity you are planning to pursue.
Start a slack discussion directly with the instructor about your project.
Do not post your ideas in the general or random channels in slack.
Describe the following:
- Brainstorm if you do not have an idea. Try to come up with an idea.
Look at some job descriptions. BAJobs.com
Look at some data competitions. Kaggle
Look at some sources of data. data.world
- Once you have an idea, describe the idea as clearly as possible.
- Make a list of steps you plan to do to complete your project. The list might start with identifying a source for the data you plan to use. Describe the project of your work.
- State whether you plan to produce a written report or a blog post or an App. A written report should be produced using an R Notebook using good reproducible research practices. A blog post should be posted on your blog. An App could be shared on shiny.org
RStudio resources:
- RStudio
- RStudio Cheatsheets
- RStudio Cheatsheet: IDE
- RStudio Cheatsheet: R Markdown
- RStudio Shiny
- RStudio Cheatsheets: shiny
- RStudio: Learn Shiny
- RStudio Packages
- RStudio Cheatsheets: Package Development