Stat 653: Statistical Natural Language Processing
Department of Statistics and Biostatistics, CSU East Bay
Spring 2023:
Week 8:
- Due date changed: For some reason the folders in Canvas were not visible until late on Friday. I have extended the due dates for the Midterm and the last homework until Monday. Sorry for the confusion.
Week 7:
- Final: The final for the class will be take-home. You can complete it at home individually. I will be available during the scheduled exam time for questions.
- Project: Your project should be completed before the end of the quarter. The last day you can submit is Friday May 12.
- R project: This includes an introduction to the tokenizer package. POS.zip
- Spotlight blog: Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger
- R code: text2vec Word embeddings text2vec - GloVe Glove.zip
- Intel Developer Program
- Presentation: word2vec
- Spotlight blog: 8 Excellent Pretrained Models to get you Started with Natural Language Processing (NLP)
- Spotlight blog: MonkeyLearn 5 Real-World Sentiment Analysis Examples
- Spotlight Transformers: Hugging Face NLP Course Deep RL Course
- Presentation:
- Spotlight conference: Google I/O 2023
- What are LLMs? What is a Large Language Model (LLM)?
- Warning: “Godfather of artificial intelligence” leaves Google to talk about the tech’s potential dangers
- Warning: Titans of AI Andrew Ng and Yann LeCun oppose call for pause on powerful AI systems
- Spotlight book: Deep Learning This is a somewhat more advanced book.
- Spotlight talk: Deep Learning
- Spotlight Website: deeplearning.ai
Week 6:
- Homework: Homework 5 and Homework 6 have been posted.
- Case Studies: R Projects
- Spotlight website: 20 Newsgroups
- Spotlight Databases:
- Twitter:
- R packages:
- Python packages:
- Mastodon:
- R packages:
- Case Studies: R Projects
- Twitter.zip has been updated to include an rtweet example. I am still waiting for my Twitter API access to be approved.
- Case Studies: R Projects
- Mastodon.zip
- Spotlight book: Mastering Text Mining with R
- CRAN taskview: CRAN Task View: Natural Language Processing
- Spotlight Course: Google Machine Learning
Week 5:
- Quiz: The take-home Quiz has been posted. See the Homework link. I have provided updated code that uses the full updated Large Dataset.
- Midterm: There will be a take-home Midterm given this week.
- Project: The Project will be given this week.
- Presentation:
- R code: 06-topic-models.Rmd
- Spotlight paper: topicmodels
- Spotlight blog: A gentle introduction to topic modeling using R
- YouTube Video: Prof. David Blei - Probabilistic Topic Models and User Behavior
- Spotlight blog: Intuitive Guide to Latent Dirichlet Allocation
- Spotlight paper: Latent Dirichlet Allocation
- Coursera class: Text Mining and Analytics
- R packages:
- Python packages:
- Spotlight Software:
- DocFetcher: Search files fast by building an index of the files in a directory on your computer. The index covers both the filenames and the contents of the files.
- Meld: Compare text files
- Visual Studio Code: Text editor
- Sublime Text: Text editor
- Spotlight Github: TidyMuller
- Spotlight blog: Here is a very nice blog post to follow to get started: Text Crunching and Data Munging, by Josephine Lukito.
- Spotlight Github: Topic Models Learning and R Resources
- Spotlight Github: Latent Dirichlet Allocation Using Gibbs Sampling
- Case Studies: R Projects These will be posted by Wednesday.
- R packages:
- Spotlight blog: A beginner’s guide to collecting and mapping Twitter data using R
- Spotlight blog: Setting Up Twitter for Text mining in R
- Spotlight blog: Practice getting data from the Twitter API
- Spotlight paper: Sentiment Analysis of Global Warming Using Twitter Data
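The topic-model material listed for this week (Latent Dirichlet Allocation and the Gibbs sampling repository) can be illustrated with a small collapsed Gibbs sampler. This is a minimal sketch in plain Python, not the code from 06-topic-models.Rmd or from the topicmodels package. The function name `lda_gibbs`, the hyperparameter values, and the four toy documents are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (illustrative sketch)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return vocab, doc_topic, topic_word

# Hypothetical toy corpus, not course data.
docs = [s.split() for s in [
    "cat dog pet cat dog",
    "dog pet cat pet",
    "stock market trade stock",
    "market trade stock price",
]]
vocab, doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, row in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: row[v], reverse=True)[:3]
    print("topic", k, [vocab[v] for v in top])
```

The sampler only keeps count tables, so the per-document topic mixture can be read off `doc_topic` and the per-topic word distribution off `topic_word` after smoothing with `alpha` and `beta`. On a corpus this small the result depends on the random seed.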
Week 4:
- Quiz: The take-home Quiz has been posted. See the Homework link.
- Homework: Homework 4 has been posted.
- This is the point in the course where we discuss accessing text-based data through APIs and data scraping. We have already used gutenbergr. The code for tm.plugin.webmining is currently not working, so we will try aRxiv and geniusr instead. Next week we will look at scraping text from Wikipedia and downloading data from Mastodon.
- R Code: Note that the code that downloads financial articles from Chapter 5 is currently not working.
- Spotlight book: Notes for “Text Mining with R: A Tidy Approach”
- Spotlight blog: BloombergGPT
- R Code:
- aRxiv.html
- aRxiv.qmd
- songs.html
- songs.qmd This code does not work yet because the geniusr package has changed.
- Spotlight blog: Shiny Contest Sentify
- Spotlight paper: MusicMood: Predicting the mood of music from song lyrics using machine learning
- Spotlight website: FMA
- Spotlight dataset: Million Song Dataset musixmatch
- Spotlight R package: rmusix
- Spotlight dataset: IMDb Datasets
- Spotlight data: Internet Archive Television
- Spotlight Chapter: Read mdsr2e Chapter 19
- Presentation:
Week 3:
- Assignment: Homework 3 has been posted.
- Github: tidy-text-mining Code > Download ZIP
- Spotlight book: Speech and Language Processing This is a somewhat more advanced book. Chapter 3 has a very nice presentation of n-grams, and Chapter 4 has a very nice presentation of naive Bayes.
- Google Search some n-grams: Google Search Search Terms: Gelato, Gelato Trader Joes, Gelato Italy
- Google search books: Google Books Search Terms: Suess, Trumbo
- Google search n-grams in books: Google N-gram Viewer Search Terms: Suess, Trumbo, Seuss
- Google Trends: Google Trends Search Terms: Eric Suess (and do a regional search California)
- Google Adwords: Google Adwords
- Presentation:
- Presentation:
- Spotlight Website: Business Science
- Presentation: images.pdf
- Presentation:
- Notes: BayesNotes.pdf
- sms_spam.csv
- MLwR_v2_04.r
- R Project: Chap04.zip
- R Notebook: NB.Rmd
- What is a VCorpus? StackExchange
- tm Vignettes
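The naive Bayes material for this week (BayesNotes.pdf, sms_spam.csv, NB.Rmd) can be summarized in a few lines of code. This is a minimal sketch in plain Python of a multinomial naive Bayes classifier with add-one (Laplace) smoothing, not the course's R code. The function names `train` and `predict` and the four toy messages are assumptions made for illustration, and the messages are not taken from sms_spam.csv.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs is a list of (label, text) pairs. Returns the counts needed for prediction."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for label, text in docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: fraction of training documents in this class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Log likelihood with add-one smoothing over the vocabulary.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy messages for illustration only.
training = [
    ("spam", "win a free prize now"),
    ("spam", "free cash claim your prize"),
    ("ham", "are we meeting for lunch today"),
    ("ham", "see you at the meeting"),
]
model = train(training)
print(predict("free prize", *model))
print(predict("lunch meeting", *model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from driving a class probability to zero.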
Week 2:
- Spring Break: Next week is Spring Break, so there will be no class on Monday or Wednesday. (I will be catching up on my grading.)
- Twitter: Twitter Search
- Mastodon: Mastodon Search
- Assignment: Homework 2 has been posted.
- Presentation:
- Sentiment:
- Presentation:
- Word and Document Frequencies:
- Presentation:
- ngrams.html
- ngrams.qmd
- R Project: ngrams.zip
- Presentation:
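The n-gram material for this week (ngrams.qmd) comes down to sliding a window of n words over the tokenized text and counting the results. This is a minimal sketch in plain Python rather than the tidytext code used in class. The helper names `tokenize` and `ngrams` and the example sentence are assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]

def ngrams(tokens, n):
    """Return all n-grams as tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example text, not course data.
# Note: this simple version lets n-grams cross sentence boundaries.
text = "the cat sat on the mat. the cat ate."
tokens = tokenize(text)
bigram_counts = Counter(ngrams(tokens, 2))

for gram, count in bigram_counts.most_common(3):
    print(" ".join(gram), count)
```

Changing the second argument of `ngrams` to 3 gives trigrams. A list shorter than n simply yields no n-grams.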
Week 1:
- Book: From the University Library > Databases A-Z > O > O’Reilly, then log in using your university login. Find Text Mining with R, A Tidy Approach.
- Book: Text Mining with R, A Tidy Approach
- Github: tidy-text-mining
- Reference: r4ds
- Assignment: Homework 1 has been posted.
- Harry Potter books:
- harrypotter.R
- Quarto Project: 01-HarryPotter.zip
- harrypotter.R - Colab
- UC-r Text Mining:
- Presentation:
- Presentation:
- Sentiment:
- Software Spotlight: Posit Cloud Instead of dealing with Java to run rJava (on Windows you need the Windows Offline (64-bit) version), you could use Posit Cloud.
- Google Colab to run R: https://colab.to/r
Week 0:
Learning R:
Learning Python:
Learn SQL:
Excellent References:
Data Science:
- Socviz
- r4ds
- ModernDive
- Yarrr!
- R Data Science Essentials
- Python Data Science Essentials
- Deep Learning Made Easy with R
- Doing Data Science
- Data Science from Scratch
- What is Data Science? (fast easy read)
- Ethics and Data Science (fast easy read)
- Data Driven (fast easy read)
- R Markdown: The Definitive Guide
Reading related to the Digital Economy:
- The Second Machine Age: Work, Progress and Prosperity in a Time of Brilliant Technologies
- Race Against the Machine
- Wired For Innovation
- Strategies for e-business success
- Understanding the Digital Economy
More Big Picture:
- Fourth Paradigm of Science: Data-Intensive Scientific Discovery
- McKinsey Global Institute Big Data: The next frontier for innovation, competition, and productivity
Music: