---
title: "Naive Bayes2"
author: "Prof. Eric A. Suess"
date: "February 24, 2021"
output:
  beamer_presentation: default
  ioslides_presentation: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Introduction

Today we will work on implementing the Naive Bayes analysis of the SMS data presented in the book.

We will also discuss writing the reports for the class.

## Example - filtering SMS

Spam filtering for SMS might be harder than for Email. The messages are shorter.

Working with text data requires a new set of tools for data analysis. In R there are a variety of packages.

- [tm](http://cran.r-project.org/web/packages/tm/index.html)
- [tidytext](https://cran.r-project.org/web/packages/tidytext/index.html)
- [Tidy Text Mining with R](https://www.tidytextmining.com/)
- [Introduction to tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html)
- [sentimentr](https://github.com/trinker/sentimentr)
- [rtweet](https://github.com/ropensci/rtweet)
- [text2vec](http://text2vec.org)

## Text Mining in R

[Journal of Statistical Software](http://www.jstatsoft.org/)

- [Text Mining Infrastructure in R](http://www.jstatsoft.org/v25/i05/paper)

[The R Journal](http://journal.r-project.org/)

- [RTextTools: A Supervised Learning Package for Text Classification](http://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf)
- [RcmdrPlugin.temis, a Graphical Integrated Text Mining Solution in R](http://journal.r-project.org/archive/2013-1/bouchetvalat-bastin.pdf)

## bag-of-words

Today we will go through the code from the book to use naive Bayes to **classify** SMS messages.

We will need to read in text data and count words.

We will need to apply the naive Bayes algorithm to classify the messages.

The idea with *bag-of-words* is that the words in the messages are considered separately and frequency is used. The *order* of the words is *not taken into consideration*.

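
The bag-of-words idea can be sketched in a few lines of base R. This is only an illustration, not the book's code: the two messages are made up, and the names `messages`, `tokens`, `vocab`, and `bow` are hypothetical. Each message becomes a row of word counts, and the position of each word within the message is discarded.

```r
# Two made-up messages (illustrative only, not taken from the SMS data)
messages <- c("Free prize call now", "call me now please call")

# Lowercase each message and split on whitespace: one token vector per message
tokens <- strsplit(tolower(messages), "\\s+")

# Vocabulary: every distinct word seen in any message
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary word in each message; word order is thrown away
bow <- t(vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = vocab))),
  integer(length(vocab))
))
colnames(bow) <- vocab
rownames(bow) <- paste0("msg", seq_along(messages))

print(bow)
```

Shuffling the words inside a message would leave its row unchanged, which is exactly what "order is not taken into consideration" means.
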

For the data preparation we will use the **tm** package to process the messages.

The book's original calls to **tolower** and **Dictionary** no longer work as written in current versions of **tm** (for example, **tolower** now has to be wrapped in **content_transformer()**). We will use the updated commands.

## Wordclouds

To compare the training and test datasets we will include wordclouds, made with the **wordcloud()** function from the **wordcloud** package. We will also use them to see if there is any difference in the commonly used words in ham and spam.

## Naive Bayes

To implement the naive Bayes algorithm we need to load the **e1071** package and use the **naiveBayes()** and **predict()** functions.

## Does the Laplace estimator help?

The last part of the code tries to improve the model performance.

To try to improve the model, the **Laplace estimator** is used.

In the book **laplace = 1** is used. Can you use 1.5? Does 2 help more?

## Code Writing

Google's R Style Guide

- [google R code](https://google.github.io/styleguide/Rguide.xml)
- [R style guide](http://adv-r.had.co.nz/Style.html)
- [The Tidyverse style guide](https://style.tidyverse.org/)

## Reports

- [CS 6375](http://www.hlt.utdallas.edu/~vgogate/ml/2012s/projects.html)
- [CS 391L Machine Learning Project Report Format](http://www.cs.utexas.edu/~mooney/cs391L/paper-template.html)
- [CS 229 Machine Learning Final Reports](http://cs229.stanford.edu/projects2012.html)

## Sentiment Analysis of Twitter Data using R

Here are a few interesting blog posts about connecting to Twitter and performing Sentiment Analysis.

- [Mining Twitter Data with R](https://sites.google.com/site/miningtwitter/questions/sentiment/sentiment)
- [Sentiment Analysis on Twitter Data : Text Analytics Tutorial](https://mkmanu.wordpress.com/2014/08/05/sentiment-analysis-on-twitter-data-text-analytics-tutorial/)
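
## Laplace estimator by hand

To make the earlier Laplace question concrete, here is a minimal base R sketch of add-alpha smoothing for a single Yes/No word feature. The counts are invented for illustration, and `cond_prob()` is a hypothetical helper, not a function from **e1071**. It assumes the usual additive-smoothing formula, which is what the `laplace` argument is generally understood to apply to categorical features.

```r
# Hypothetical training counts: messages per class, and how many contain the word
n_class   <- c(spam = 20, ham = 80)
with_word <- c(spam = 0,  ham = 5)   # the word never appears in spam

# Additive (Laplace) smoothing: add `laplace` to the count of every level.
# n_levels = 2 because the feature is Yes/No (word present or absent).
cond_prob <- function(count, total, laplace = 0, n_levels = 2) {
  (count + laplace) / (total + laplace * n_levels)
}

# Estimated P(word present | class) for several smoothing values
smoothed <- sapply(c(0, 1, 1.5, 2), function(a) cond_prob(with_word, n_class, a))
colnames(smoothed) <- paste0("laplace=", c(0, 1, 1.5, 2))

print(round(smoothed, 4))
```

With `laplace = 0` the spam estimate is exactly zero, so a single unseen word would veto the spam class. Any positive value removes that zero, and larger values pull the estimates further toward 0.5. Whether 1.5 or 2 classifies the SMS test set better than 1 still has to be checked empirically.
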