Statistics 653: Homework


Final

(due Friday May 14, 2021)

This is a take-home Final Exam. You may ask questions of the Professor or search Google. This exam is to be completed independently.

Part 1: Make an R notebook with the file name lastname_firstname_Stat653_final_part1.Rmd and do the following:

  1. What is a tokenizer? (Note: The unnest_tokens() function in the tidytext R package uses the tokenizers R package.)
  2. What is the formula for calculating the TF-IDF? Explain the two parts of the calculation. What is the advantage of using the TF-IDF over just using word counts?
  3. Read the excellent blog post Demonstration of tidytext using Darwin’s "On the Origin of Species". Run all of the code in this blog post in your R Notebook and explain each step presented. Discuss any differences you see in the sentiment analysis performed using different lexicons.
  4. Read the excellent blog post by Julia Silge, GENDER ROLES WITH TEXT MINING AND N-GRAMS. Run all of the code in this blog post in your R Notebook and explain each step presented. What are the differences discovered in gender roles between the authors? Would this analysis be possible without using bigrams? Another interesting blog post by Julia is She Giggles, He Gallops.

This next topic is not part of the Final! This is just for your information if you are curious.

Part 2: Make an R notebook with the file name lastname_firstname_Stat653_final_part2.Rmd and do the following:

  1. From the text2vec website, read the Collocations section. Run all of the code. Explain collocations.

Homework 6:

(due Monday May 10, 2021)


Homework 5:

(due Monday May 3, 2021)

Using an R Notebook, produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat652_hw5.Rmd. Then knit the .Rmd file to either Lastname_Firstname_Stat652_hw5.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 5"

author: "Your name"

date: "May 3, 2021"

Upload one file to Blackboard.

Problems:

  1. Run the R code from Chapter 6.

Project

(due Friday May 14, 2021)

Run the LDA algorithm on a different set of books from Section 6.2 Example: the great library heist.

Download some books of interest to you from Gutenberg to use as the data. Run the code on the new books. Explain the topics.

Alternatively, you can try LDA on songs from Genius or on papers from Arxiv. If you are interested in trying topic modeling with songs you will need to download the lyrics from many songs. If you are interested in trying topic modeling with papers from Arxiv you will need to download the pdf files for the papers and then extract the text from the papers. Or you will need to download a very large collection of abstracts from different areas of research.


Midterm:

(due Monday May 3, 2021)

Instructions: This is a take-home midterm. Your work is to be completed individually. You may use your book and Google, and you may ask questions of the instructor. You are not to share code with others in the class.

Using an R Notebook, produce your solutions to questions 1 - 4. For question 5, turn in an updated ver01 of the code provided. Submit your .Rmd and a .docx or .pdf.

  1. What is the formula to compute the TF-IDF? Give an example of what the TF-IDF is used for in text mining.
  2. See Section 3.4 of the tidytext book. Download 4 other related books from Gutenberg. Compute the TF-IDF of the words in each book. Make bar graphs of the top 15 words in each book. Comment.
  3. For the books you have selected for the previous problem, conduct the same analysis using bigrams.
  4. See Section 5.3 of the tidytext book. Download 4 papers from Arxiv or 4 songs from Genius. Perform a sentiment analysis of the papers or songs.
  5. Download the Large Movie Review Dataset. Re-run the code from the Quiz using the training data only to see if having 12,500 training reviews gives more accuracy. Your assignment is to run the code in ver01 and write clear explanations above each R code chunk. If your computer is fast enough, use ver02, which uses the entire Large Dataset. See the code below. Please turn in a separate .Rmd and .docx or .pdf for this question.

Question 5:



Quiz:

(due Monday April 26, 2021)

What does vectorization mean? Explain.

Build Document-Term Matrices from the IMDB movie review data, using the text2vec R package. For each DTM described and computed on the website, build a cross-validated logistic regression classifier for the sentiment variable using the cv.glmnet() R function. The data is available in the text2vec package.

 > library(text2vec)
 > data("movie_review")

From the mdsr2e book, read Chapter 11, especially Section 11.5, to learn about regularization.

Run all of the code from the text2vec Vectorization tutorial. Compare the results of the models using the prediction Accuracy computed using the test data. Add the following code after each model.

```{r}
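# Convert predicted probabilities to 0/1 class labels using a 0.5 cutoff.
preds_01 <- ifelse(preds > 0.5, 1, 0)
# Accuracy: proportion of test reviews classified correctly.
mean(preds_01 == test$sentiment)
# Confusion matrix of predicted versus actual sentiment.
library(gmodels)
CrossTable(preds_01, test$sentiment,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))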
```
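To check that you understand what the accuracy code computes, here is a minimal sketch that applies the same 0.5 cutoff to a small made-up example. The values in `preds` and `actual` are invented for illustration and do not come from the movie review data. Base R's `table()` stands in for `CrossTable()` so that the sketch needs no extra packages.

```r
# Toy predicted probabilities and true labels (made up for illustration).
preds  <- c(0.91, 0.20, 0.65, 0.40, 0.08, 0.77)
actual <- c(1,    0,    0,    1,    0,    1)

# Convert probabilities to 0/1 class labels with a 0.5 cutoff.
preds_01 <- ifelse(preds > 0.5, 1, 0)

# Accuracy: proportion of predictions that match the true labels.
accuracy <- mean(preds_01 == actual)
accuracy

# Confusion matrix: rows are predicted labels, columns are actual labels.
conf_mat <- table(predicted = preds_01, actual = actual)
conf_mat
```

Four of the six toy predictions match the true labels, so the accuracy here is 4/6. The off-diagonal cells of the table count the two kinds of misclassification.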

Which vectorization method produced the best classification model?

(Extra credit:) Implement the Naive Bayes Classifier for each vectorization method. How does the Naive Bayes classifier compare with cross-validated logistic regression?

(Extra extra credit:) Implement cross-validated logistic regression, Naive Bayes, and a feedforward neural network model using tidymodels, for each vectorization method. How do the models compare?




Homework 4:

(due Monday April 19, 2021)

Using an R Notebook, produce your solutions to the following questions. Start by making an R Notebook with file name Lastname_Firstname_Stat652_hw4.Rmd. Then knit the .Rmd file to either Lastname_Firstname_Stat652_hw4.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 4"

author: "Your name"

date: "April 19, 2021"

Upload one file to Blackboard.

Problems:

  1. Run the R code from Chapter 5 (05-document-term-matrices.Rmd). The code related to downloading financial reports does not work.
  2. Run the code using aRxiv and genius presented in class.

Homework 3:

(due Monday April 12, 2021)

Using R Notebooks, produce your output from running the code from the book in Chapters 3 and 4.
Then knit the .Rmd file to either Lastname_Firstname_Stat652_hw3.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 3"

author: "Your name"

date: "April 12, 2021"

Upload one file to Blackboard.

Problems:

  1. Run the R code from Chapter 3 (03-tf-idf.Rmd).
  2. Run the code from frequencies.R on the harrypotter data.
  3. Run the R code from Chapter 4 (04-word-combinations.Rmd). Use the R Project I have provided (ngrams.zip). Some of the code needs to be changed to make a Word document.
  4. Make a network graph of one of the Harry Potter books or choose a different book from Gutenberg.

Homework 2:

(due Monday April 5, 2021)

Using R Notebooks produce your output from running the code from the book in Chapter 2.
Then knit the .Rmd file to either Lastname_Firstname_Stat652_hw2.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 2"

author: "Your name"

date: "April 5, 2021"

Problems:

  1. Run the R code from Chapter 2. Copy the code directly from the book into your own R Notebook.
  2. Try out the sentimentr package on a set of tweets you find on Twitter. Just use the Twitter Search.

Homework 1:

(due Monday March 22, 2021)

Using R Notebooks produce your output from running the code from the book in Chapter 1.
Then knit the .Rmd file to either Lastname_Firstname_Stat652_hw1.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 1"

author: "Your name"

date: "March 22, 2021"

Problems:

  1. Run the R code from Chapter 1. Copy the code directly from the book into your own R Notebook.
  2. Install the R package harrypotter and run the code from the UC-r tutorial Text Mining: Sentiment Analysis.

Fix:

The default mirror for the Gutenberg website that is used by the gutenbergr R package is not working. Here is the link to MIRRORS.ALL. The following code can be used to try a different mirror. Try this R Notebook: gutenberg_test.Rmd

> hgwells <- gutenberg_download(c(35, 36, 5230, 159), mirror = "http://gutenberg.readingroo.ms/")
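If one mirror fails, you can try several in turn. The following is a minimal sketch of that idea, not part of gutenbergr: `try_mirrors()` is a hypothetical helper written for this page, and it assumes that a failed download either raises an error or comes back empty. Pick additional mirror URLs from MIRRORS.ALL yourself.

```r
# Hypothetical helper: try each mirror in turn and return the first
# successful download. `download_fn` is any function that accepts a
# `mirror` argument, such as gutenbergr::gutenberg_download().
try_mirrors <- function(mirrors, download_fn, ...) {
  for (m in mirrors) {
    result <- tryCatch(
      download_fn(..., mirror = m),
      error = function(e) NULL   # treat any error as "this mirror failed"
    )
    # Assumption: a failed download may also come back empty,
    # so require at least one row before accepting the result.
    if (!is.null(result) && NROW(result) > 0) {
      return(result)
    }
  }
  stop("No mirror returned any data.")
}

# Usage (requires the gutenbergr package and a network connection):
# library(gutenbergr)
# mirrors <- c("http://gutenberg.readingroo.ms/")  # add more from MIRRORS.ALL
# hgwells <- try_mirrors(mirrors, gutenberg_download, c(35, 36, 5230, 159))
```

Passing the download function as an argument keeps the helper independent of any one package, so you can check it with a stand-in function before going to the network.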

There are a few places in the UC-r Harry Potter code where the sentiment values should be used, but the index variable is mistakenly used in their place. To see this, view the variables in the dataframe that is being used.