Stat 653 Statistical Natural Language Processing
Department of Statistics and Biostatistics
California State University, East Bay
Spring 2021
Course Description | Homework | Important Dates | Software
Syllabus | Handouts | Links
Blackboard | Podcasts | Data | Online Books
Week 8: Finals week
- I will have my usual office hours MW 2-3pm.
- I will log in to class at noon on Monday and Wednesday and answer questions. Please do not wait until the end of class time to log in; once the last question is answered, I will log out.
Week 7:
- Evaluations: Please respond to the class evaluation. I would like to hear your feedback about the class: topics, R, etc.
- Final: The final for the class will be take-home. You can complete it at home individually. I will be available during the scheduled exam time for questions.
- Project: Your project should be completed before the end of the quarter. The last day you can submit is Friday May 14.
- R project: This includes an introduction to the tokenizer package.
- Spotlight blog: Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger
- R code: text2vec Word embeddings text2vec - GloVe
- Intel Developer Program NLP
- Presentation: word2vec
- Spotlight blog: 8 Excellent Pretrained Models to get you Started with Natural Language Processing (NLP)
- Spotlight blog: learnmonkey 5 Real-World Sentiment Analysis Examples
- Spotlight conference: Google I/O 2021
Week 6:
- Homework: Homework 5 has been posted.
- Case Studies: R Projects
- Spotlight website: 20 Newsgroups
- Spotlight Databases:
- Twitter:
- R packages:
- Python packages:
- Spotlight blog: How to Scrape Tweets from Twitter with Python Twint
- Spotlight YouTube: Scrape Tweets with No Limitation and No API Key with Twint
- Case Studies: R Projects
- Twitter.zip updated to include an rtweet example. I am still waiting to be approved.
- Twitter_twint_zip
- Spotlight book: Mastering Text Mining with R
- CRAN taskview: CRAN Task View: Natural Language Processing
- Presentation:
Week 5:
- Resume Suggestion: Considering everything these days, it might be good to add a line to your resume: "Willing and able to work remotely."
- Midterm: There will be a take-home Midterm given this week.
- Quiz: The take-home Quiz has been posted. See the Homework link. I have provided updated code that uses all of the updated Large Dataset.
- Project: The Project will be given this week.
- R code: 06-topic-models.Rmd
- Spotlight github: TidyMuller
- Spotlight blog: Here is a very nice blog post to follow to get started: Text Crunching and Data Munging by Josephine Lukito.
- Spotlight github: Topic Models Learning and R Resources
- Spotlight github: Latent Dirichlet Allocation Using Gibbs Sampling
- Case Studies: R Projects
- R packages:
- Spotlight blog: A beginner's guide to collecting and mapping Twitter data using R
- Spotlight blog: Setting Up Twitter for Text mining in R.
- Spotlight blog: Practice getting data from the Twitter API
- Spotlight paper: Sentiment Analysis of Global Warming Using Twitter Data
Week 4:
- Quiz: The take-home Quiz has been posted. See the Homework link.
- Homework: The due date for Homework 3 has been extended until next week. It is now due Tuesday, April 20.
- Spotlight Conference: Nvidia GTC21
- Homework: Homework 4 has been posted.
- This is the point in the course where we will discuss accessing text-based data from APIs and data scraping.
- We have used gutenbergr. The code for tm.plugin.webmining is not working currently, so we will try aRxiv and genius. Next week we will look at scraping text from Wikipedia and downloading data from Twitter.
- R Code: Note that the code that downloads financial articles is currently not working.
- R Code:
- Spotlight blog: Sentify
- Spotlight paper: MusicMood: Predicting the mood of music from song lyrics using machine learning
- Spotlight website: FMA
- Spotlight dataset: Million Song Dataset musixmatch
- Spotlight R package: rmusix
- Spotlight dataset: IMDb Datasets
- Spotlight data: Internet Archive > Television
- Spotlight Chapter: Read mdsr2e Chapter 19
- Presentation:
- Presentation:
- Spotlight paper: topicmodels
- Spotlight blog: A gentle introduction to topic modeling using R
- YouTube Video: Prof. David Blei - Probabilistic Topic Models and User Behavior
- Spotlight blog: Intuitive Guide to Latent Dirichlet Allocation
- Spotlight paper: Latent Dirichlet Allocation
- Coursera class: Text Mining and Analytics
- R packages:
- Python packages:
- Spotlight Software:
- DocFetcher: Search files fast using an index of the files in a directory on your computer. The index scores the filename and the contents of the files.
- Meld: Compare text files
- Atom: Text editor
- Visual Studio Code: Text editor
- Sublime Text: Text editor
Week 3:
- Homework: Homework 3 has been posted.
- Github: tidy-text-mining
- Spotlight book: Speech and Language Processing. This is a somewhat more advanced book. Chapter 3 has a very nice presentation of n-grams, and Chapter 4 has a very nice presentation of naive Bayes.
- Google search some n-grams: Google Search. Search terms: Gelato, Gelato Trader Joes, Gelato Italy
- Google search books: Google Books. Search terms: Suess, Trumbo
- Google search n-grams in books: Google N-gram Viewer. Search terms: Suess, Trumbo, Seuss
- Google Trends: Google Trends. Search terms: Eric Suess (and do a regional search for California)
- Presentation:
- Presentation:
- Spotlight Website: Business Science
- Communication Software:
- Presentation: images.pdf
- Presentation:
- Notes: BayesNotes.pdf
- sms_spam.csv
- MLwR_v2_04.r
- R Project: Chap04.zip
- R Notebook: NB.Rmd
- What is a VCorpus? StackExchange
- tm Vignettes
- Hint: Recall from class that some people running R on Windows had a fonts problem. To solve the problem, we added a line to the code that assigns the third DTM to the first. This works because all of the steps used to create the first DTM are also applied when creating the third DTM.
> # compare the results
> sms_dtm
> sms_dtm2
> sms_dtm3
> # replace the first DTM with the third
> sms_dtm <- sms_dtm3
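The idea behind the hint can be tried on a small scale. This is only a sketch, assuming the tm package is installed: the three toy messages and the names `toy_corpus`, `dtm_steps`, and `dtm_control` are made up for illustration and are not the class's SMS data or code. Stemming is left out so that SnowballC is not needed.

```r
library(tm)

# Hypothetical toy messages standing in for the SMS data.
toy_text <- c("Free entry in 2 a weekly competition!",
              "Are you coming to class on Monday?",
              "WINNER!! You have won a 1000 cash prize.")
toy_corpus <- VCorpus(VectorSource(toy_text))

# Route 1: clean the corpus step by step, then build the DTM.
toy_clean <- tm_map(toy_corpus, content_transformer(tolower))
toy_clean <- tm_map(toy_clean, removeNumbers)
toy_clean <- tm_map(toy_clean, removeWords, stopwords())
toy_clean <- tm_map(toy_clean, removePunctuation)
toy_clean <- tm_map(toy_clean, stripWhitespace)
dtm_steps <- DocumentTermMatrix(toy_clean)

# Route 2: ask DocumentTermMatrix() to do the cleaning through `control`,
# passing stop word removal as a function.
dtm_control <- DocumentTermMatrix(toy_corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = function(x) removeWords(x, stopwords()),
  removePunctuation = TRUE
))

# Compare the two results: dimensions first, then the sets of terms.
print(dim(dtm_steps))
print(dim(dtm_control))
print(setequal(Terms(dtm_steps), Terms(dtm_control)))
```

If the two matrices agree on your machine, using the second in place of the first should not change the later steps. If they differ, inspect the terms that appear in only one of them before swapping.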
Week 2:
- Spring Break: Next week is Spring Break so there will be no class on Monday and Wednesday next week.
- Twitter: Twitter Search
- Homework: Homework 2 has been posted.
- Homework: There is a problem with the default Gutenberg mirror for our location. See the Homework link for a fix.
- Presentation:
- Sentiment:
- Presentation:
- Word and Document Frequencies:
- Presentation:
- ngrams.html
- ngrams.pdf
- ngrams.Rmd
- R Project: ngrams.zip
- Presentation:
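For a quick feel for the n-grams topic listed above, here is a minimal base R sketch that counts bigrams in a toy sentence. It is not the contents of ngrams.Rmd; the sentence and the variable names are made up for illustration, and splitting on whitespace is a simplifying assumption in place of a real tokenizer.

```r
# Toy text; any single character string would do.
text <- "the cat sat on the mat and the cat slept"

# Lowercase and split on whitespace to get word tokens.
tokens <- strsplit(tolower(text), "\\s+")[[1]]

# Pair each token with the one that follows it to form bigrams.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Count the bigrams and sort from most to least frequent.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```

The same pattern extends to trigrams by pasting three shifted copies of `tokens` together.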
Week 1:
- Monday Video: The Welcome video has been moved to Blackboard under Course Materials.
- Office Hours: Office Hours Monday from 2-3pm are cancelled. We are interviewing a candidate for a tenure-track faculty position.
- Book: Text Mining with R, A Tidy Approach
- Github: tidy-text-mining
- Reference: r4ds
- Homework: Homework 1 has been posted.
- Harry Potter books
- UC-r Text Mining:
- Presentation:
- Presentation:
- Sentiment:
Learning R:
- Data Camp: Introduction to R
- Data Camp: Machine Learning with Tree-Based Models in R
- RProgramming.net
- Introduction to MRO
- R-Exercises
Learning Python:
Learning SQL:
Excellent References:
Data Science:
- r4ds
- ModernDive
- Yarrr!
- R Data Science Essentials
- Python Data Science Essentials
- Doing Data Science
- Data Science from Scratch
- Data Driven (fast easy read)
- A Simple Introduction to Data Science
- R Markdown: The Definitive Guide
Reading related to AI and ML for Marketing:
- AI for Marketing and Product Innovation: Powerful New Tools for Predicting Trends, Connecting with Customers, and Closing Sales
- MachineVantage AI Videos
Reading related to the Digital Economy:
- The Second Machine Age: Work, Progress and Prosperity in a Time of Brilliant Technologies
- Race Against the Machine
- Wired For Innovation
- Strategies for e-business success
- Understanding the Digital Economy
More Big Picture:
- Fourth Paradigm of Science: Data-Intensive Scientific Discovery
- McKinsey Global Institute Big Data: The next frontier for innovation, competition, and productivity