Homework - Prof. Suess

Statistics 452: Homework

Project Part II:

(due Friday May 15, 2020)

Using an R Notebook, produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat452_project.Rmd. Then knit the .Rmd file to Lastname_Firstname_Stat452_project.docx. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 452 Project: Give a title"

author: "Your name"

date: "TBA"

In a single .docx or .pdf submit your final report.
Include your project proposal at the start of your file.
Your Project Report should include:
- A title, your name, Class and Section. This is not a homework assignment. Think about trying to write something you would present to a potential future employer.
- Introduction, The Five Steps, Conclusions, Appendix.
- The main part of your Project Report should be a maximum of 10 pages. Not all of your output should be shown in the Report. If your output runs to pages of numbers, such as all of the trees produced by a decision tree algorithm, suppress it and do not show it.
- Appendix: Include the code and output you used for your project.

Homework 10:

Not collected.

Read: Chapter 9 Clustering

The header of your R Notebooks should include

title: "Stat. 452 Homework 10"

author: "Your name"

date: "May 11, 2020"

Upload one file to Blackboard.

Produce one .docx or .pdf file. Try to do your work using a Project in R.

Perform the Cluster analysis on the sns data. Produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
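
As a rough illustration of the kind of cluster analysis asked for above, the following minimal sketch runs k-means on R's built-in iris data. It is not the sns analysis itself: the data set, the choice of 3 clusters, and the seed are all assumptions made for the example.

```r
# Minimal k-means sketch on a built-in data set (iris), NOT the course's sns data.
set.seed(452)                         # k-means starts from random centers

features_z <- scale(iris[, 1:4])      # z-score standardize the numeric columns

# centers = 3 is an assumption for this example; nstart = 25 tries 25 random starts
clusters <- kmeans(features_z, centers = 3, nstart = 25)

clusters$size                         # number of observations in each cluster
round(clusters$centers, 2)            # cluster centers on the standardized scale

# Compare the clusters with the known species labels (only possible for labeled data)
table(cluster = clusters$cluster, species = iris$Species)
```

Standardizing first keeps any one variable from dominating the distance calculation. Looking at the cluster sizes and centers is usually the starting point for interpreting what each cluster represents.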

Homework 9:

Not collected.

Read: Chapter 8 Market Basket Analysis

The header of your R Notebooks should include

title: "Stat. 452 Homework 9"

author: "Your name"

date: "May 4, 2020"

Upload one file to Blackboard.

Produce one .docx or .pdf file. Try to do your work using a Project in R.

Perform the Association analysis on the groceries data. Produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
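
Association rules are usually judged by their support, confidence, and lift. The sketch below computes these three measures by hand in base R for one rule on a tiny, made-up set of transactions. The transactions and the helper function support() are hypothetical and only meant to show what the numbers mean; the homework itself would normally use a dedicated package on the real data.

```r
# Toy transactions (hypothetical, NOT the course's groceries data).
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("bread", "butter", "milk")
)

# support(): fraction of transactions that contain every item in `items`.
support <- function(items) {
  mean(vapply(transactions, function(tr) all(items %in% tr), logical(1)))
}

# Measures for the rule {bread} -> {butter}
supp_rule  <- support(c("bread", "butter"))    # how often both items occur together
confidence <- supp_rule / support("bread")     # P(butter | bread)
lift       <- confidence / support("butter")   # confidence relative to butter's base rate

c(support = supp_rule, confidence = confidence, lift = lift)
```

A lift near 1 suggests the two items occur together about as often as expected by chance; values well above 1 point to rules that may be worth reporting.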

Homework 8:

(due Monday April 27, 2020)

Read: Chapter 7 Neural Networks and SVMs

The header of your R Notebooks should include

title: "Stat. 452 Homework 8"

author: "Your name"

date: "April 27, 2020"

Upload one file to Blackboard.

Produce one .docx or .pdf file. Try to do your work using a Project in R.

Perform the ANN analysis on the concrete data. Produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
Develop an ANN for the redwines.csv data from Homework 5.
- Organize your report using the Five Steps.
Read the blog post Multilabel classification with neuralnet package and run the code.
Optional: Read Chapter 1 and 2 of Deep Learning with R and try to run the Rnotebook for Chapter 2. See the Source Code download link.
Optional: If you want to get started with Tensorflow.
- Read Leon Jenssen's example Building a simple neural network using Keras and Tensorflow and try the R code in this Rnotebook iris01.Rmd.
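
The general shape of an ANN analysis (rescale, split, train, evaluate) can be sketched as follows. This example uses the nnet package, a recommended package normally bundled with R, rather than the neuralnet package mentioned above, and it uses the built-in mtcars data instead of the concrete or redwines data. The predictors, the hidden layer size of 3, and the seed are assumptions for illustration only.

```r
library(nnet)   # single-hidden-layer networks; swapped in here for neuralnet

# Rescale every column to the range [0, 1] before training
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
cars_norm <- as.data.frame(lapply(mtcars, normalize))

# Random 24 / 8 split into training and testing rows
set.seed(452)
train_idx  <- sample(nrow(cars_norm), 24)
cars_train <- cars_norm[train_idx, ]
cars_test  <- cars_norm[-train_idx, ]

# linout = TRUE requests a linear output unit, which suits a numeric target
fit <- nnet(mpg ~ wt + hp + disp, data = cars_train,
            size = 3, linout = TRUE, trace = FALSE)

# Evaluate with the correlation between predicted and actual values
pred <- as.numeric(predict(fit, cars_test))
cor(pred, cars_test$mpg)
```

Because the weights start at random values, results change from run to run unless the seed is fixed.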

Homework 7:

(due Monday April 20, 2020)

Using an R Notebook, produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat452_hw7.Rmd. Then knit the .Rmd file to Lastname_Firstname_Stat452_hw7.docx. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 452 Homework 7"

author: "Your name"

date: "April 20, 2020"

Upload one file to Blackboard.

Read: Chapter 11
Read: Chapter 7 Neural Networks

Produce one .docx or .pdf file. Try to do your work using a Project in R.

Perform the Logistic Regression analysis of the credit data. Produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
Perform the Random Forest analysis of the credit data. Produce a report using an Rnotebook explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
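
For the logistic regression part, a minimal sketch of the workflow might look like the following. It uses the built-in mtcars data with the binary variable am as the target, not the credit data; the single predictor, the 0.5 cutoff, and the seed are assumptions for the example.

```r
# Logistic regression sketch on mtcars (NOT the credit data).
set.seed(452)
train_idx <- sample(nrow(mtcars), 24)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# family = binomial fits a logistic regression for the 0/1 target am
fit <- glm(am ~ wt, data = train, family = binomial)

# type = "response" returns predicted probabilities rather than log-odds
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)      # classify with a 0.5 cutoff (an assumption)

conf     <- table(actual = test$am, predicted = pred)   # confusion matrix
accuracy <- mean(pred == test$am)

conf
accuracy
```

With a data set this small, R may warn about fitted probabilities of 0 or 1; the credit data is much larger, so this is less likely to occur there.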

Project Part I:

(due Friday April 24, 2020)

Part 1: Choose a data set and a type of algorithm you would like to investigate. Write a short plan for your investigation.
- Give the name and source of the data set you plan to work with. You should try to find a data set that is not small.
- State what algorithm you plan to use and for what purpose: classification, prediction, clustering, or other.
- Write your plan and submit it through Blackboard.

Homework 6:

(due Monday April 13, 2020)

The header of your R Notebooks should include

title: "Stat. 452 Homework 6"

author: "Your name"

date: "April 13, 2020"

Upload one file to Blackboard.

Read: Chapter 6
Read: Chapter 10

Do the following:

Perform the Regression Tree based analysis of the redwine data. Produce a report using an Rnotebook explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
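
A regression tree analysis can be sketched with the rpart package, a recommended package normally bundled with R. The example below uses the built-in mtcars data rather than the redwine data, and the split sizes and seed are assumptions. It compares the tree's mean absolute error (MAE) with that of a baseline that always predicts the training mean.

```r
library(rpart)

# Regression tree sketch on mtcars (NOT the redwine data).
set.seed(452)
train_idx <- sample(nrow(mtcars), 24)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# method = "anova" grows a regression tree for a numeric target
fit  <- rpart(mpg ~ ., data = train, method = "anova")
pred <- predict(fit, test)

# Mean absolute error of the tree, and of always predicting the training mean
mae_tree     <- mean(abs(pred - test$mpg))
mae_baseline <- mean(abs(mean(train$mpg) - test$mpg))

c(tree = mae_tree, baseline = mae_baseline)
```

Printing the fitted object (fit) lists the splits, which is usually enough for a report; long tree printouts can go in an appendix.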

Homework 5:

(due Monday April 6, 2020)

The header of your R Notebooks should include

title: "Stat. 452 Homework 5"

author: "Your name"

date: "April 6, 2020"

Upload one file to Blackboard.

Read: Chapter 6
Read: Chapter 10

Do the following:

(If you can get RWeka to work with Java 64 bit, that would be good. Otherwise, try this using the RStudio Cloud.) Perform the Rule based analysis of the mushroom data. Produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
- Be sure to include:
- Show the prediction that the algorithm produced.
- Give the Accuracy of the predictions.
- Include the confusion matrix.
Perform the Linear Regression analysis of the insurance data. Produce a report using an Rnotebook explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
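
For the linear regression part, the basic fit, summary, and predict steps can be sketched as below. The example uses the built-in mtcars data, not the insurance data, and the choice of predictors and the new observation are assumptions for illustration.

```r
# Linear regression sketch on mtcars (NOT the insurance data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)                          # estimated intercept and slopes
r2 <- summary(fit)$r.squared       # proportion of variance explained
r2

# Predict for a hypothetical new car: 3000 lbs (wt is in 1000 lbs), 110 hp
new_car  <- data.frame(wt = 3, hp = 110)
pred_new <- predict(fit, newdata = new_car)
pred_new
```

The full summary(fit) output also gives standard errors and p-values for each coefficient, which are worth discussing in the findings.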

Homework 4:

(due Monday March 23, 2020)

The header of your R Notebooks should include

title: "Stat. 452 Homework 4"

author: "Your name"

date: "March 23, 2020"

Upload one file to Blackboard.

Read: Chapter 5
Read: Chapter 10

Do the following:

Perform the Tree based analysis of the credit data. Produce a report explaining the data, the analysis, and the findings.
- Be sure to include:
- Show the prediction that the algorithm produced.
- Give the Accuracy of the predictions.
- Include the confusion matrix.
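
The three items above (predictions, accuracy, confusion matrix) can be produced along the lines of the following sketch. It uses a classification tree from the rpart package on the built-in iris data, which is an assumption for illustration; the credit data analysis may use a different tree algorithm.

```r
library(rpart)

# Classification tree sketch on iris (NOT the credit data).
set.seed(452)
indx <- sample(1:nrow(iris), as.integer(0.9 * nrow(iris)))
iris_train <- iris[indx, ]
iris_test  <- iris[-indx, ]

# method = "class" grows a classification tree for a factor target
fit  <- rpart(Species ~ ., data = iris_train, method = "class")
pred <- predict(fit, iris_test, type = "class")   # predicted class labels

# Confusion matrix: rows are actual classes, columns are predicted classes
conf <- table(actual = iris_test$Species, predicted = pred)

# Accuracy: correctly classified cases (the diagonal) over all test cases
accuracy <- sum(diag(conf)) / sum(conf)

pred
conf
accuracy
```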

Homework 3:

(due Monday February 24, 2020)

Using an R Notebook, produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat452_hw3.Rmd. Then knit the .Rmd file to Lastname_Firstname_Stat452_hw3.docx. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 452 Homework 3"

author: "Your name"

date: "February 17, 2020"

Upload one file to Blackboard.

Read: Chapter 4
Read: Chapter 10

Student Question: Is the denominator of Bayes Rule on page 97 of the First Edition of the book correct? Answer: No. The multiplication rule for independent events does not hold. The independence that is assumed in Naive Bayes is class-conditional independence. This means the words are independent given that the class is spam or ham, not unconditionally.
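
A small numeric check can make the answer above concrete. The probabilities below are made up for illustration (they are not from the book). Two words W1 and W2 are assumed independent given the class, and the code shows that they are then not independent unconditionally, so dividing by P(W1) * P(W2) does not give a valid posterior.

```r
# Hypothetical probabilities, chosen only for illustration.
prior      <- c(spam = 0.2,  ham = 0.8)    # P(class)
p_w1_given <- c(spam = 0.4,  ham = 0.05)   # P(W1 | class)
p_w2_given <- c(spam = 0.5,  ham = 0.1)    # P(W2 | class)

# Class-conditional independence: P(W1, W2 | class) = P(W1 | class) * P(W2 | class)
joint_by_class <- prior * p_w1_given * p_w2_given

# Correct denominator: P(W1, W2), summed over the classes
p_w1_w2 <- sum(joint_by_class)

# Denominator that wrongly assumes unconditional independence: P(W1) * P(W2)
p_w1 <- sum(prior * p_w1_given)
p_w2 <- sum(prior * p_w2_given)
naive_denominator <- p_w1 * p_w2

posterior_correct <- joint_by_class["spam"] / p_w1_w2
posterior_wrong   <- joint_by_class["spam"] / naive_denominator

c(p_w1_w2 = p_w1_w2, naive_denominator = naive_denominator)
c(correct = unname(posterior_correct), wrong = unname(posterior_wrong))
```

With these numbers the correct denominator is 0.044 while the product P(W1) * P(W2) is 0.0216, and the "posterior" computed with the wrong denominator exceeds 1, so it cannot be a probability.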

Student Question: How do I randomize a data set? The author gives only examples of data sets that have already been randomized. Answer: See Step 2 of the Example given in Chapter 5, pages 132-133 / 139-140. Read Chapter 10, the section on The holdout method.

Do the following:

Perform the SMS spam filtering analysis. Produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
- Be sure to include:
- Show the prediction that the algorithm produced.
- Give the Accuracy of the predictions.
- Include the confusion matrix.
Find an interesting dataset that is appropriate for applying the naive Bayes algorithm, load the data into R, and proceed to classify the data using naive Bayes. (You can find an example dataset anywhere you want. One suggestion is to try the first example from the e1071 package naiveBayes function, see Rdocumentation. This example uses the HouseVotes84 data from the mlbench package, see RDocumentation. I do think everyone should try this example. Print out the dataset and see that it is full of Y and N. Also, note the NAs.)
Find a Google Sheets Add-ons app that can perform Sentiment Analysis. See if you can figure out what algorithm is used. (This problem is to look in the Google Sheets Add-ons and not in the Google Chrome AppStore.)

Hint: For Problem 2 you will need to take a random sample of the original data to make the training dataset and then use the remaining data to make the testing dataset. Use the following code. Replace launch with the name of your dataset.

> indx <- sample(1:nrow(launch), as.integer(0.9*nrow(launch)))
> indx

> launch_train <- launch[indx,]
> launch_test <- launch[-indx,]

Hint: For Problem 2 you need to find a dataset, I would suggest looking on the UCI ML Repository.
You will need to add the names yourself or find one that has the names included.
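
The random split from the hint above can be tried end to end on a built-in data set. The sketch below uses iris in place of your own data and adds a set.seed() call so the same split is produced every time the notebook is knit; the seed value is an arbitrary assumption.

```r
# Holdout split sketch on iris; replace iris with the name of your data set.
set.seed(452)                                  # makes the random split reproducible

n    <- nrow(iris)
indx <- sample(1:n, as.integer(0.9 * n))       # row numbers for the training set

iris_train <- iris[indx, ]                     # 90% of the rows
iris_test  <- iris[-indx, ]                    # the remaining 10%

nrow(iris_train)
nrow(iris_test)

# Check that the class proportions look similar in the two sets
prop.table(table(iris_train$Species))
prop.table(table(iris_test$Species))
```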

Homework 2:

(due Monday February 10, 2020)

Using an R Notebook, produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat452_hw2.Rmd. Then knit the .Rmd file to Lastname_Firstname_Stat452_hw2.docx. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 452 Homework 2"

author: "Your name"

date: "February 10, 2020"

Upload one file to Blackboard.

Read: Chapter 3

Do the following:

Perform the cancer diagnosis kNN analysis. Produce a report explaining the data, the analysis, and the findings.
- Organize your report using the Five Steps.
- Be sure to include:
  1. Show the prediction that the algorithm produced.
  2. Give the Accuracy of the predictions. See Page 318 (or 299).
  3. Include the confusion matrix.
Find an interesting dataset from the UCI ML Repository that is appropriate for applying the kNN algorithm and load the data into R and proceed to classify the data using kNN.
Do problem 7a,b,c, see page 54, in An Introduction to Statistical Learning.

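Hint: For Problem 2 you will need to take a random sample of the original data to make the training dataset and then use the remaining data to make the testing dataset. Use the following code. Replace launch with the name of your dataset.

> indx <- sample(1:nrow(launch), as.integer(0.9*nrow(launch)))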
> indx

> launch_train <- launch[indx,]
> launch_test <- launch[-indx,]

Hint: For Problem 2 you need to find a dataset, I would suggest looking on the UCI ML Repository.
You will need to add the names yourself or find one that has the names included.

For example, the Wine Data Set is a good one to try. Note that the Alcohol target variable has 3 classes. You can use kNN with a target variable with 3 classes.
Your data file should look like this after adding the column names. wine.csv
Another example is the Bank Marketing Data Set, which contains a .csv file.
To use this dataset with kNN you need to remove all of the non-numeric variables. Try to open the data using R Environment > Import Dataset. Try From Text (base)... or the others if that does not work.
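
The steps described above for Problem 2 (keep only the numeric variables, rescale them, split the data, and classify with kNN) can be sketched as follows. The example uses the built-in iris data and the knn() function from the class package, a recommended package normally bundled with R. The choice of k = 5, the min-max rescaling, and the seed are assumptions for illustration.

```r
library(class)   # provides knn()

# kNN sketch on iris; its target (Species) has 3 classes.
numeric_cols <- sapply(iris, is.numeric)        # TRUE for the numeric variables only

# Rescale each numeric variable to [0, 1] so no variable dominates the distance
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features  <- as.data.frame(lapply(iris[, numeric_cols], normalize))
labels    <- iris$Species

set.seed(452)
indx <- sample(1:nrow(features), as.integer(0.9 * nrow(features)))

pred <- knn(train = features[indx, ],
            test  = features[-indx, ],
            cl    = labels[indx],
            k     = 5)                          # k = 5 is an arbitrary choice

conf     <- table(actual = labels[-indx], predicted = pred)   # confusion matrix
accuracy <- mean(pred == labels[-indx])

conf
accuracy
```

Trying a few different values of k and comparing the accuracy is a simple way to see how sensitive the results are to that choice.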

Homework 1:

(due Monday February 3, 2020)

Using an R Notebook, produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat452_hw1.Rmd. Then knit the .Rmd file to Lastname_Firstname_Stat452_hw1.docx. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 452 Homework 1"

author: "Your name"

date: "February 3, 2020"

Upload one file to Blackboard.

Read: Preface, Chapter 1 and Chapter 2.
Download and install the current version of R and RStudio.
Register with Packt Publishing and get the data files for the book.

Do the following:

Do a Google search on the following terms and develop a working definition of each.
- Statistical Learning
- Statistical Machine Learning
- Machine Learning
- Predictive Analytics
- Artificial Intelligence
- Deep Learning
Run all of the code from Chapter 2 to become familiar with R. (If you have experience with R, this will get you familiar with the code from the author.) Show some of the relevant output from R and discuss what you have learned from the data.
Download this book and become familiar with the materials on the websites.
- An Introduction to Statistical Learning
- What do the authors of the Introduction book say about Statistical Learning?
- (Optional: If you are interested start watching the 15 hours of video.)