---
title: "Naive Bayes"
author: "Prof. Eric A. Suess"
date: "2/18/2026"
format:
  revealjs:
    embed-resources: true
---

## Introduction

Today we will work on implementing the Naive Bayes analysis of the SMS data presented in the book. We will also discuss writing the reports for the class.

## Example - filtering SMS

Spam filtering for SMS may be harder than for email because the messages are shorter.

Working with text data requires a new set of tools for data analysis. In R there is a variety of packages.

- [tm](http://cran.r-project.org/web/packages/tm/index.html)
- [tidytext](https://cran.r-project.org/web/packages/tidytext/index.html)
  - [Tidy Text Mining with R](https://www.tidytextmining.com/)
  - [Introduction to tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html)
- [sentimentr](https://github.com/trinker/sentimentr)
- [rtweet](https://github.com/ropensci/rtweet)
- [text2vec](http://text2vec.org)

## Text Mining in R

[Journal of Statistical Software](http://www.jstatsoft.org/)

- [Text Mining Infrastructure in R](http://www.jstatsoft.org/v25/i05/paper)

[The R Journal](http://journal.r-project.org/)

## bag-of-words

Today we will go through the code from the book to use naive Bayes to **classify** SMS messages. We will need to read in text data, count words, and apply the naive Bayes algorithm to classify the messages.

The idea behind *bag-of-words* is that the words in each message are considered separately and only their frequencies are used; the *order* of the words is *not taken into consideration*.

For the data preparation we will use the **tm** package to process the messages. There is a problem with **tolower** and **Dictionary**, so we will use the updated commands.

## Wordclouds

To compare the training and test datasets we will include wordclouds to see if there is any difference in the commonly used words in ham and spam, using the **wordcloud** package and the **wordcloud** function.
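As a sketch of the preparation steps above, the **tm** pipeline might look like the following. The toy `sms_raw` data frame here is a hypothetical stand-in for the book's SMS dataset (assumed to have columns `type` and `text`); the `content_transformer(tolower)` wrapper is the updated command that works around the **tolower** problem.

```r
library(tm)        # text mining infrastructure
library(wordcloud) # wordclouds for ham vs. spam

# Hypothetical stand-in for the book's SMS data: columns `type` and `text`
sms_raw <- data.frame(
  type = c("ham", "spam"),
  text = c("Are we still meeting for lunch today",
           "WINNER Claim your free prize now call today"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(VectorSource(sms_raw$text))

# tolower() is not a tm transformation, so it must be wrapped in
# content_transformer() -- otherwise tm_map() breaks the corpus
corpus_clean <- tm_map(corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# bag-of-words: rows are messages, columns are word counts
# (word order is discarded)
sms_dtm <- DocumentTermMatrix(corpus_clean)

# wordclouds to compare common words in ham and spam
wordcloud(subset(sms_raw, type == "ham")$text,  min.freq = 1)
wordcloud(subset(sms_raw, type == "spam")$text, min.freq = 1)
```

In practice the document-term matrix would then be split into training and test sets before modeling.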
## Naive Bayes

To implement the naive Bayes algorithm we need to load the **e1071** package and use the **naiveBayes()** and **predict()** functions.

## Does the Laplace estimator help?

The last part of the code tries to improve model performance by using the **Laplace estimator**. In the book **laplace = 1** is used. Can you use 1.5? Does 2 help more?
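A minimal sketch of the modeling step, assuming the features have already been converted to "Yes"/"No" factors for word presence as in the book. The randomly generated `sms_train`/`sms_test` data here are hypothetical placeholders, so the predictions are not meaningful; the point is the **e1071** API and the `laplace` argument.

```r
library(e1071)

# Hypothetical stand-in for the book's binary word-indicator matrices:
# rows are messages, columns are "Yes"/"No" factors for word presence
set.seed(1)
train_labels <- factor(rep(c("ham", "spam"), each = 50))
make_feats <- function(n) {
  data.frame(
    free  = factor(sample(c("Yes", "No"), n, replace = TRUE),
                   levels = c("No", "Yes")),
    prize = factor(sample(c("Yes", "No"), n, replace = TRUE),
                   levels = c("No", "Yes"))
  )
}
sms_train <- make_feats(100)
sms_test  <- make_feats(20)

# naive Bayes without smoothing, then with a Laplace estimator;
# try laplace = 1.5 or 2 and compare the resulting confusion matrices
sms_model  <- naiveBayes(sms_train, train_labels)
sms_model2 <- naiveBayes(sms_train, train_labels, laplace = 1)

sms_pred <- predict(sms_model2, sms_test)
table(sms_pred)
```

With the real data, comparing `table(sms_pred, test_labels)` across different `laplace` values is one way to answer whether the Laplace estimator helps.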