Assignments

Final (due Friday May 12, 2023)

This is a take-home Final Exam. You may ask questions of the Professor or search Google. This exam is to be completed independently.

Part 1: Make a Quarto notebook with the file name lastname_firstname_Stat653_final_part1.qmd and do the following:

What is a tokenizer? (Note: The unnest_tokens() function in the tidytext R package uses the tokenizers R package.)
What is the formula for calculating the TF-IDF? Explain the two parts of the calculation. What is the advantage of using the TF-IDF over just using word counts?
Read the excellent blog post Demonstration of tidytext using Darwin’s “On the Origin of Species”. Run all of the code in this blog post in your R Notebook and explain each step presented. Discuss any differences you see in the sentiment analysis performed using different lexicons.
Read the excellent blog post by Julia Silge, GENDER ROLES WITH TEXT MINING AND N-GRAMS. Run all of the code in this blog post in your R Notebook and explain each step presented. What are the differences discovered in gender roles between the authors? Would this analysis be possible without using bigrams? Another interesting blog post by Julia is She Giggles, He Gallops.

Suggestions: My first suggestion is to open the .txt file you download from the Gutenberg website and look at it closely. On what line does the text of the book start and on what line does the book end? How are the Chapters started? Depending on which version of the book you select, the Chapters may be indicated differently than just by a number.
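As a rough illustration of the suggestion above, here is a minimal base R sketch that finds where a book starts and ends and where the chapters begin. Everything in it is an assumption for illustration: the vector book is a made-up stand-in for what readLines() would return for your downloaded file, and the "*** START OF" / "*** END OF" markers and the "CHAPTER I." heading style should be checked against the actual file, since editions differ.

```r
# Toy stand-in for readLines() on a downloaded Gutenberg file (contents are made up)
book <- c(
  "The Project Gutenberg eBook of An Example Book",
  "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE BOOK ***",
  "CHAPTER I.",
  "It was a quiet morning.",
  "CHAPTER II.",
  "The afternoon was louder.",
  "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE BOOK ***",
  "Licence text follows here."
)

# Line numbers of the start and end markers (assumed marker wording)
start_line <- grep("^\\*\\*\\* ?START OF", book)
end_line   <- grep("^\\*\\*\\* ?END OF", book)

# Keep only the lines strictly between the two markers
body <- book[(start_line + 1):(end_line - 1)]

# Flag chapter headings that look like "CHAPTER I." or "Chapter 12"
is_heading <- grepl("^chapter [ivxlc0-9]+", body, ignore.case = TRUE)
chapter_lines <- which(is_heading)

# Running chapter number for every line of the body
chapter_id <- cumsum(is_heading)

print(chapter_lines)
print(chapter_id)
```

If the headings in your edition look different (for example, a title with no number), the regular expression in grepl() is the piece to adjust.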

Part 2: Make a Quarto notebook with the file name lastname_firstname_Stat653_final_part2.qmd and do the following:

From the text2vec website, read the Collocations section. Run all of the code. Explain collocations.

Suggestions: Start by downloading the text8.zip file into your R Project. Then change the line that has ~/ in it to

txt = readLines("text8")

The tokenizers R package no longer includes stopwords, so you should change all uses of stopwords to

stopwords = tm::stopwords("en")

The line of code where GloVe$new() is used also needs to be changed. Before:

glove = GloVe$new(50, vocabulary = vocabulary_with_phrases, x_max = 50)

Fixed:

glove = GloVe$new(rank = 50, x_max = 50)

Hopefully this fixes the issues with the code. If not, please send me an email with your .qmd file and I will try to help.

Where to download movie scripts.

Spotlight blog: WHERE TO DOWNLOAD MOVIE SCRIPTS: 10 GREAT SITES
Spotlight website: Archive.org
awesomefilm
simplyscripts

Homework06: (not collected)

Read: Chapters 7, 8, and 9

Homework05: (complete by Monday May 1, 2023)

Using Quarto Notebooks, produce your output from running the code from the book in Chapter 6. Then Render the .qmd file to either Lastname_Firstname_Stat652_hw5.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 5"

author: "Your name"

date: "May 1, 2023"

Read: Chapter 6

Problems:

Run the R code from Chapter 6.

Project (complete by Monday May 8, 2023)

Run the LDA algorithm on a different set of books from Section 6.2 Example: the great library heist.

Download some books of interest to you from Gutenberg to use as the data. Run the code on the new books. Explain the Topics.

Alternatively, you can try LDA on songs from Genius or on papers from Arxiv. If you are interested in trying topic modeling with songs you will need to download the lyrics from many songs. If you are interested in trying topic modeling with papers from Arxiv you will need to download the pdf files for the papers and then extract the text from the papers. Or you will need to download a very large collection of abstracts from different areas of research.

Problem: The geniusr package does not seem to be working currently. For the Project, use Arxiv and download abstracts.
Hint: To use geniusr you need to register and set up a project with an API key.

Midterm: (complete by Monday May 1, 2023)

Instructions: This is a take-home midterm. Your work is to be completed individually. You may use your book and Google, and you may ask the instructor questions. You are not to share code with others in the class.

Using a Quarto Notebook, produce your solutions to questions 1 - 4. For question 5, turn in an updated ver01 of the code provided. Submit your .qmd and a .docx or .pdf.

1. What is the formula to compute the TF-IDF? Give an example of what the TF-IDF is used for in text mining.
2. See Section 3.4 of the tidytext book. Download 4 other related books from Gutenberg. Compute the TF-IDF of the words in each book. Make bar graphs of the top 15 words in each book. Comment.
3. For the books you have selected for the previous problem, conduct the same analysis using bigrams.
4. See Section 5.3 of the tidytext book. Download 4 abstracts from Arxiv or 4 songs from Genius. Perform a sentiment analysis of the abstracts or songs.
5. Download the Large Movie Review Dataset. Re-run the code from the Quiz using the training data only to see if having 12,500 training reviews gives more accuracy. Your assignment is to run the code in ver01 and write clear explanations above each R code chunk. If your computer is fast enough, use ver02, which uses the entire Large dataset. See the code below. Please turn in a separate .qmd and .docx or .pdf for this question.

Problem: The geniusr package does not seem to be working currently. For question 4, use Arxiv and download 4 abstracts.
Hint: To use geniusr you need to register and set up a project with an API key.

Question 5:

R Project: Midterm.zip
Spotlight Website: Large Movie Review Dataset
Spotlight Paper: Learning Word Vectors for Sentiment Analysis

Quiz: (complete by Monday April 24, 2023)

What does vectorization mean? Explain.

Build Document-Term Matrices from the IMDB movie review data, using the text2vec R package. For each DTM described and computed on the website, build a cross-validated logistic regression classifier using the cv.glmnet() R function for the sentiment variable. The data is available in the text2vec package.

> library(text2vec)
> data("movie_review")

From the mdsr2e book, read Chapter 11, in particular Section 11.5, to learn about regularization.

Run all of the code from the text2vec Vectorization tutorial. Compare the results of the models using the prediction accuracy computed using the test data. Add the following code after each model.

> preds_01 <- ifelse(preds > 0.5, 1, 0)
> mean(preds_01 == test$sentiment)
> library(gmodels)
> CrossTable(preds_01, test$sentiment,
>       prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
>       dnn = c('predicted', 'actual'))
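As a rough illustration of what these accuracy lines compute, here is a minimal base R sketch on made-up values. The vectors preds and actual are toy data, not output from the tutorial, and table() is used here as a simple stand-in for gmodels::CrossTable().

```r
# Made-up predicted probabilities and true labels, for illustration only
preds  <- c(0.91, 0.20, 0.65, 0.40, 0.55)
actual <- c(1, 0, 0, 0, 1)

# Turn probabilities into 0/1 class predictions with a 0.5 cutoff
preds_01 <- ifelse(preds > 0.5, 1, 0)

# Accuracy is the proportion of predictions that match the true labels
accuracy <- mean(preds_01 == actual)

# Cross-tabulate predicted against actual classes
confusion <- table(predicted = preds_01, actual = actual)

print(accuracy)
print(confusion)
```

In this toy example four of the five predictions match, so the accuracy is 0.8; the one mismatch is a review predicted positive that was actually negative.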

Which vectorization method produced the best classification model?

(Extra credit:) Implement the Naive Bayes Classifier for each vectorization method. How does the Naive Bayes classifier compare with cross-validated logistic regression?

(Extra extra credit:) Implement cross-validated logistic regression, Naive Bayes, and a feedforward neural network model using tidymodels, for each vectorization method. How do the models compare?

Spotlight Website: Large Movie Review Dataset

Homework04: (complete by Monday April 17, 2023)

Using Quarto Notebooks, produce your output from running the code from the book in Chapter 5. Then Render the .qmd file to either Lastname_Firstname_Stat652_hw4.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 4"

author: "Your name"

date: "April 17, 2023"

Read: Chapter 5

Read: Chapter 15 in Modern Data Science with R (mdsr2e)

Read: Chapters 19 and 20 in R for Data Science, 2nd edition (r4ds)

Problems:

Run the R code from Chapter 5. Copy the code into separate code chunks and explain what the code is doing directly above each code chunk. The code related to downloading financial reports does not work.
Run the code using aRxiv and geniusr presented in class.
Try the Tweets code for your tweets from Homework02. Here is the link to the RStudio stringr CheatSheet.

Comments:

Please see the instructions and what I have said in class. You should not be downloading the author’s .Rmd files from their website and trying to run them. There are many formatting options added to the R code chunks that keep the code from Rendering as you expect.

For example, echo = FALSE hides the code in the rendered output, and eval = FALSE prevents the code from being run when rendered.

This is why the instructions are to copy the code from the book (or, if you want, from the author’s .Rmd file) into a new .qmd file without the formatting options that the authors used to write their book. Or you can go through the .Rmd file and remove anything in the curly braces other than the r, so that each chunk starts with ```{r}
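If you would rather not remove the chunk options by hand, the idea can be sketched in base R as follows. This is only an illustration: strip_chunk_options() is a hypothetical helper written for this example, not a function from any package, rmd_lines is a made-up stand-in for what readLines() would return for an .Rmd file, and it assumes every chunk in the file is an R chunk.

```r
# Three backticks, built programmatically so this example is easy to embed
fence <- strrep("`", 3)

# Hypothetical helper: reset every R chunk header to a plain {r} header
strip_chunk_options <- function(lines) {
  # Matches a header such as <fence>{r setup, echo = FALSE} at the start of a line
  pattern <- paste0("^", fence, "\\{r[^}]*\\}")
  sub(pattern, paste0(fence, "{r}"), lines)
}

# Made-up stand-in for readLines() on an author's .Rmd file
rmd_lines <- c(
  paste0(fence, "{r setup, echo = FALSE, eval = FALSE}"),
  "library(tidytext)",
  fence
)

cleaned <- strip_chunk_options(rmd_lines)
print(cleaned)
```

Only the chunk header lines are changed; the code inside the chunk and the closing fence are left alone. You still need to add your own explanation above each chunk.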

Also, you need to add your comments about what is going on in the code. You should not have the text from the book in your Notebook, you should have your own text.

Homework03: (complete by Monday April 10, 2023)

Using Quarto Notebooks, produce your output from running the code from the book in Chapters 3 and 4. Then Render the .qmd file to either Lastname_Firstname_Stat652_hw3.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 3"

author: "Your name"

date: "April 10, 2023"

Read: Chapter 3 and Chapter 4

Problems:

Run the R code from Chapter 3 (03-tf-idf.Rmd). Copy the code into separate code chunks and explain what the code is doing directly above each code chunk.
Run the code from frequencies.R on the harrypotter data. Copy the code into separate code chunks and explain what the code is doing directly above each code chunk.
Run the R code from Chapter 4 (04-word-combinations.Rmd). Use the R Project I have provided (ngrams.zip). Some of the code needs to be changed to make a Word Document. Copy the code into separate code chunks and explain what the code is doing directly above each code chunk.
Make a network graph of one of the Harry Potter books or choose a different book from Gutenberg.

Homework02: (complete by Monday April 3, 2023)

Using Quarto Notebooks, produce your output from running the code from the book in Chapter 2. Then Render the .qmd file to either Lastname_Firstname_Stat652_hw2.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 2"

author: "Your name"

date: "April 3, 2023"

Read: Chapter 2 and Chapter 3

Problems:

Run the R code from Chapter 2. Copy the code into separate code chunks and explain what the code is doing directly above each code chunk.
Try out the sentimentr package on a set of Tweets you find on Twitter. Just use the Twitter Search. Alternatively, ask ChatGPT to write 10 positive and 10 negative Tweets related to a topic of interest to you.

Homework01: (complete by Monday March 19, 2023)

Using Quarto Notebooks, produce your output from running the code from the book in Chapter 1. Then Render the .qmd file to either Lastname_Firstname_Stat652_hw1.docx or .pdf. Use your own last name and first name in the filename. At the top of your first page you should include Name, Class, Section, and homework assignment.

The header of your R Notebooks should include

title: "Stat. 653 Homework 1"

author: "Your name"

date: "March 19, 2023"

Read: Chapter 1

Problems:

Run the R code from Chapter 1. Copy the code directly from the book into your own Quarto Notebook.
Install the R package harrypotter and run the code from the UC-r Text Mining: Sentiment Analysis.