---
title: "Naive Bayes"
author: "Prof. Eric A. Suess"
date: "2/18/2026"
format:
  revealjs:
    embed-resources: true
---

## Introduction

Today we will work on implementing the Naive Bayes analysis of the SMS data presented in the book. We will also discuss writing the reports for the class.

## Example - filtering SMS

Spam filtering for SMS may be harder than for email because the messages are shorter.

Working with text data requires a new set of tools for data analysis. In R there is a variety of packages.

- [tm](http://cran.r-project.org/web/packages/tm/index.html)
- [tidytext](https://cran.r-project.org/web/packages/tidytext/index.html)
  - [Tidy Text Mining with R](https://www.tidytextmining.com/)
  - [Introduction to tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html)
- [sentimentr](https://github.com/trinker/sentimentr)
- [rtweet](https://github.com/ropensci/rtweet)
- [text2vec](http://text2vec.org)

## Text Mining in R

[Journal of Statistical Software](http://www.jstatsoft.org/)

- [Text Mining Infrastructure in R](http://www.jstatsoft.org/v25/i05/paper)

[The R Journal](http://journal.r-project.org/)

## bag-of-words

Today we will go through the code from the book to use naive Bayes to **classify** SMS messages. We will need to read in text data, count words, and apply the naive Bayes algorithm to classify the messages.

The idea behind *bag-of-words* is that the words in each message are considered separately and only their frequencies are used; the *order* of the words is *not taken into consideration*.

For the data preparation we will use the **tm** package to process the messages. There is a problem with **tolower** and **Dictionary**, so we will use the updated commands.

## Wordclouds

To compare the training and test datasets we will include wordclouds to see if there is any difference in the commonly used words in ham and spam, using the **wordcloud** package and the **wordcloud** function.
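As a sketch of the preparation steps above, the **tm** pipeline might look like the following. The toy `sms_raw` data frame here is a hypothetical stand-in for the book's SMS dataset (assumed to have columns `type` and `text`); the `content_transformer(tolower)` wrapper is the updated command that works around the **tolower** problem.

```r
library(tm)        # text mining infrastructure
library(wordcloud) # wordclouds for ham vs. spam

# Hypothetical stand-in for the book's SMS data: columns `type` and `text`
sms_raw <- data.frame(
  type = c("ham", "spam"),
  text = c("Are we still meeting for lunch today",
           "WINNER Claim your free prize now call today"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(VectorSource(sms_raw$text))

# tolower() is not a tm transformation, so it must be wrapped in
# content_transformer() -- otherwise tm_map() breaks the corpus
corpus_clean <- tm_map(corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# bag-of-words: rows are messages, columns are word counts
# (word order is discarded)
sms_dtm <- DocumentTermMatrix(corpus_clean)

# wordclouds to compare common words in ham and spam
wordcloud(subset(sms_raw, type == "ham")$text,  min.freq = 1)
wordcloud(subset(sms_raw, type == "spam")$text, min.freq = 1)
```

In practice the document-term matrix would then be split into training and test sets before modeling.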
## Naive Bayes

To implement the naive Bayes algorithm we need to load the **e1071** package and use the **naiveBayes()** and **predict()** functions.

## Does the Laplace estimator help?

The last part of the code tries to improve model performance by using the **Laplace estimator**. In the book **laplace = 1** is used. Can you use 1.5? Does 2 help more?
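A minimal sketch of the modeling step, assuming the features have already been converted to "Yes"/"No" factors for word presence as in the book. The randomly generated `sms_train`/`sms_test` data here are hypothetical placeholders, so the predictions are not meaningful; the point is the **e1071** API and the `laplace` argument.

```r
library(e1071)

# Hypothetical stand-in for the book's binary word-indicator matrices:
# rows are messages, columns are "Yes"/"No" factors for word presence
set.seed(1)
train_labels <- factor(rep(c("ham", "spam"), each = 50))
make_feats <- function(n) {
  data.frame(
    free  = factor(sample(c("Yes", "No"), n, replace = TRUE),
                   levels = c("No", "Yes")),
    prize = factor(sample(c("Yes", "No"), n, replace = TRUE),
                   levels = c("No", "Yes"))
  )
}
sms_train <- make_feats(100)
sms_test  <- make_feats(20)

# naive Bayes without smoothing, then with a Laplace estimator;
# try laplace = 1.5 or 2 and compare the resulting confusion matrices
sms_model  <- naiveBayes(sms_train, train_labels)
sms_model2 <- naiveBayes(sms_train, train_labels, laplace = 1)

sms_pred <- predict(sms_model2, sms_test)
table(sms_pred)
```

With the real data, comparing `table(sms_pred, test_labels)` across different `laplace` values is one way to answer whether the Laplace estimator helps.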