Stat653 - Prof. Suess

Stat 653 Statistical Natural Language Processing

Department of Statistics and Biostatistics

California State University, East Bay

Spring 2021



Week 8: Finals week

I will have my usual office hours MW 2-3pm.
I will log in to class at noon on Monday and Wednesday and answer questions. Please do not wait until the end of class time to log in; once the last question is answered I will log out.

Week 7:

Evaluations: Please do respond to the class evaluation. I would like to hear your feedback about the class: the topics, R, etc.
Final: The final for the class will be take-home. You can complete it at home individually. I will be available during the scheduled exam time for questions.
Project: Your project should be completed before the end of the quarter. The last day you can submit is Friday May 14.
R project: This includes an introduction to the tokenizer package.
- POS.zip
Spotlight blog: Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger
R code: text2vec Word embeddings text2vec - GloVe
- Glove.zip
Intel Developer Program NLP
Presentation: word2vec
Spotlight blog: 8 Excellent Pretrained Models to get you Started with Natural Language Processing (NLP)
Spotlight blog: learnmonkey 5 Real-World Sentiment Analysis Examples
Spotlight conference: Google I/O 2021

Week 6:

Homework: Homework 5 has been posted.
Case Studies: R Projects
- NASA.zip
- usenet.zip
Spotlight website: 20 Newsgroups
Spotlight Databases:
- sqlite
- mysql
- elastic
- neo4j
- mongoDB
- crate.io
Twitter:
- Help Center
- Docs
R packages:
- Rtweet
- streamR
Python packages:
- twint
- twitterscraper
Spotlight blog: How to Scrape Tweets from Twitter with Python Twint
Spotlight YouTube: Scrape Tweets with No Limitation and No API Key with Twint
Case Studies: R Projects
- Twitter.zip updated to include an rtweet example. I am still waiting to be approved.
- Twitter_twint_zip
Spotlight book: Mastering Text Mining with R
CRAN taskview: CRAN Task View: Natural Language Processing
Presentation:

Week 5:

Resume Suggestion: Considering everything these days, it might be good to add a line to your resume: "Willing and able to work remotely."
Midterm: There will be a take-home Midterm given this week.
Quiz: The take-home Quiz has been posted. See the Homework link. I have provided updated code that uses all of the updated Large Dataset.
Project: The Project will be given this week.
R code: 06-topic-models.Rmd
Spotlight github: TidyMuller
Spotlight blog: Here is a very nice blog post to follow to get started Text Crunching and Data Munging by Josephine Lukito.
Spotlight github: Topic Models Learning and R Resources
Spotlight github: Latent Dirichlet Allocation Using Gibbs Sampling
Case Studies: R Projects
R packages:
Spotlight blog: A beginner's guide to collecting and mapping Twitter data using R
Spotlight blog: Setting Up Twitter for Text mining in R.
Spotlight blog: Practice getting data from the Twitter API
Spotlight paper: Sentiment Analysis of Global Warming Using Twitter Data

Week 4:

Quiz: The take-home Quiz has been posted. See the Homework link.
Homework: Homework 3's due date has been extended until next week. Due Tuesday April 20.
Spotlight Conference: Nvidia GTC21
Homework: Homework 4 has been posted.
This is the point in the course where we will discuss accessing text based data from APIs and data scraping.
We have used gutenbergr. The code for tm.plugin.webmining is not currently working. We will try aRxiv and genius. Next week we will look at scraping text from Wikipedia and downloading data from Twitter.
R Code: Note that the code that downloads financial articles is currently not working.
- 05-document-term-matrices.Rmd
R Code:
- aRxiv.html
- aRxiv.pdf
- aRxiv.Rmd
- songs.html
- songs.pdf
- songs.Rmd
Spotlight blog: Sentify
Spotlight paper: MusicMood: Predicting the mood of music from song lyrics using machine learning
Spotlight website: FMA
Spotlight dataset: Million Song Dataset musixmatch
Spotlight R package: rmusix
Spotlight dataset: IMDb Datasets
Spotlight data: Internet Archive > Television
Spotlight Chapter: Read mdsr2e Chapter 19
Presentation:
- Injesting.html
- Injesting.pdf
- Injesting.Rmd
Presentation:
- TopicModeling.html
- TopicModeling.pdf
- TopicModeling.Rmd
Spotlight paper: topicmodels
Spotlight blog: A gentle introduction to topic modeling using R
YouTube Video: Prof. David Blei - Probabilistic Topic Models and User Behavior
Spotlight blog: Intuitive Guide to Latent Dirichlet Allocation
Spotlight paper: Latent Dirichlet Allocation
Coursera class: Text Mining and Analytics
R packages:
- lda
- text2vec
- quanteda
- spacyr
- Syuzhet
Python packages:
- NLTK
- spacy
- gensim
Spotlight Software:
- DocFetcher Search files fast using an index of the files in a directory on your computer. The index scores the filename and the contents of the files.
- Meld Compare text files
- atom Text editor
- Visual Studio Code Text editor
- Sublime Text Text editor

Week 3:

Homework: Homework 3 has been posted.
Github: tidy-text-mining
Spotlight book: Speech and Language Processing This is a bit more advanced book. In Chapter 3 there is a very nice presentation of n-grams and in Chapter 4 there is a very nice presentation of naive Bayes.
Google search some n-grams: Google Search Search Terms: Gelato, Gelato Trader Joes, Gelato Italy
Google search books: Google books Search Terms: Suess, Trumbo
Google search n-grams in books: Google N-gram Viewer Search Terms: Suess, Trumbo, Seuss
Google Trends: Google Trends Search Terms: Eric Suess (and do a regional search for California)
Presentation:
Presentation:
- DTM.html
- DTM.pdf
- DTM.Rmd
Spotlight Website: Business Science
Communication Software:
- loom
- slack Please send me an email from your university account with the subject line "Please add me to the Stat 650 Slack channel."
Presentation: images.pdf
- Naive Bayes.html
- Naive Bayes.pdf
Presentation:
- Naive Bayes SMS spam filtering.html
- Naive Bayes SMS spam filtering.pdf
Notes: BayesNotes.pdf
- sms_spam.csv
- MLwR_v2_04.r
- R Project: Chap04.zip
- R Notebook: NB.Rmd
- What is a VCorpus? StackExchange
- tm Vignettes
- Hint: Recall from class that some people running R on Windows had a fonts problem. To solve the problem we added a line to the code assigning the third DTM to the first. This works because all of the steps used to create the first DTM are also done for the third DTM.

> # compare the result
> sms_dtm
> sms_dtm2
> sms_dtm3
> sms_dtm <- sms_dtm3
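
The hint above can be sketched roughly as follows. This is a minimal illustration, not the class code: the two messages are made up (they are not from sms_spam.csv), stemming is left out to keep the example short, and the cleaning steps are assumed to follow the usual tm pattern of either cleaning the corpus first or passing the same steps to DocumentTermMatrix() through its control list.

```r
# Minimal sketch with made-up messages (an assumption, not the course data).
library(tm)

msgs <- c("WINNER!! Claim your 2 free prizes now",
          "Are we meeting for lunch today?")
sms_corpus <- VCorpus(VectorSource(msgs))

# Route 1: clean the corpus step by step, then build the first DTM.
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

# Route 2: build the third DTM from the raw corpus, with the same
# cleaning steps requested through the control list.
sms_dtm3 <- DocumentTermMatrix(sms_corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = function(x) removeWords(x, stopwords()),
  removePunctuation = TRUE
))

# Compare the results, then continue with the third DTM under the first name.
sms_dtm
sms_dtm3
sms_dtm <- sms_dtm3
```

After the last line, any later code that refers to sms_dtm uses the matrix built by the second route, so the rest of the script does not need to change.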

Week 2:

Spring Break: Next week is Spring Break so there will be no class on Monday and Wednesday next week.
Twitter: Twitter Search
Homework: Homework 2 has been posted.
Homework: There is a problem with the default Gutenberg mirror for our location. See the Homework link for a fix.
Presentation:
Sentiment:
- sentimentr.R
Presentation:
Word and Document Frequencies:
- frequencies.R
Presentation:
- ngrams.html
- ngrams.pdf
- ngrams.Rmd
- R Project: ngrams.zip
Presentation:

Week 1:

Monday Video: The Welcome video has been moved to Blackboard under Course Materials.
Office Hours: Office Hours Monday from 2-3pm are cancelled. We are interviewing a candidate for a tenure track faculty position.
Book: Text Mining with R, A Tidy Approach
Github: tidy-text-mining
Reference: r4ds
Homework: Homework 1 has been posted.
Harry Potter books
- harrypotter.R
UC-r Text Mining:
- UC Business Analytics R Programming Guide
- Text Mining: Creating Tidy Text
Presentation:
Presentation:
Sentiment:
- sentiment.R

Learning R:

Data Camp: Introduction to R
Data Camp: Machine Learning with Tree-Based Models in R
RProgramming.net
Introduction to MRO
R-Exercises

Learning Python:

Learning SQL:

W3SQL

Excellent References:

Data Science:

Reading related to AI and ML for Marketing:

Reading related to the Digital Economy:

More Big Picture:

To post: