---
title: "Naive Bayes2"
author: "Prof. Eric A. Suess"
date: "February 24, 2021"
output:
  beamer_presentation: default
  ioslides_presentation: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Introduction

Today we will work on implementing the Naive Bayes analysis of the SMS data presented in the book.  

We will also discuss writing the reports for the class.


## Example - filtering SMS

Spam filtering for SMS can be harder than for email, because the messages are shorter and give the classifier fewer words to work with.

Working with text data requires a new set of tools for data analysis.  R offers a variety of packages and resources for this.

- [tm](http://cran.r-project.org/web/packages/tm/index.html)
- [tidytext](https://cran.r-project.org/web/packages/tidytext/index.html)
- [Tidy Text Mining with R](https://www.tidytextmining.com/)
- [Introduction to tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html)
- [sentimentr](https://github.com/trinker/sentimentr)
- [rtweet](https://github.com/ropensci/rtweet)
- [text2vec](http://text2vec.org)

## Text Mining in R

[Journal of Statistical Software](http://www.jstatsoft.org/)

- [Text Mining Infrastructure in R](http://www.jstatsoft.org/v25/i05/paper)

[The R Journal](http://journal.r-project.org/)

- [RTextTools: A Supervised Learning Package for Text Classification](http://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf)
- [RcmdrPlugin.temis, a Graphical Integrated Text Mining Solution in R](http://journal.r-project.org/archive/2013-1/bouchetvalat-bastin.pdf)


## Bag-of-words

Today we will go through the code from the book that uses naive Bayes to **classify** SMS messages.  This requires reading in the text data, counting words, and then applying the naive Bayes algorithm to classify the messages.

The idea of *bag-of-words* is that each message is reduced to the set of words it contains and how often each word occurs.  The *order* of the words is *not taken into consideration*.
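
As a minimal sketch of this idea in base R (the helper `bag_of_words()` and the toy messages are made up for illustration and are not the book's code), two messages with the same words in a different order produce the same word counts:

```r
# Hypothetical helper: lowercase a message, split on non-letters, count words
bag_of_words <- function(message) {
  words <- strsplit(tolower(message), "[^a-z]+")[[1]]
  table(words[words != ""])
}

a <- bag_of_words("Call me now")
b <- bag_of_words("now call me")

a                 # word counts for the first message
identical(a, b)   # compares the two bags; word order has been discarded
```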

For the data preparation we will use the **tm** package to process the messages.

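The book's original calls to **tolower** and **Dictionary** no longer work as written in newer versions of **tm**: **tolower** needs to be wrapped in **content_transformer()** inside **tm_map()**, and **Dictionary()** is no longer available.  We will use the updated commands.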
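
A rough sketch of what the updated commands can look like, assuming a recent version of **tm** and using two made-up messages in place of the SMS data (the object names are illustrative, not necessarily those in the book):

```r
library(tm)

# Made-up messages standing in for the SMS data
sms_text <- c("FREE prize! Call now", "Are we meeting for lunch today?")
sms_corpus <- VCorpus(VectorSource(sms_text))

# tolower() is a base R function, not a tm transformation,
# so it is wrapped in content_transformer()
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)

sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

# Instead of Dictionary(), pass a plain character vector of terms
sms_terms <- findFreqTerms(sms_dtm, 1)
sms_dtm_dict <- DocumentTermMatrix(
  sms_corpus_clean,
  control = list(dictionary = sms_terms)
)
```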

## Wordclouds

To compare the training and test datasets, we will draw wordclouds and check whether the commonly used words in ham and spam differ.

We will use the **wordcloud()** function from the **wordcloud** package.
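
A small sketch of a call to **wordcloud()**, assuming the package is installed. The words and frequencies below are invented for illustration; in the analysis they would come from the SMS messages.

```r
library(wordcloud)

# Invented word frequencies, for illustration only
spam_words <- c("free", "call", "prize", "txt", "now", "win")
spam_freq  <- c(25, 20, 12, 10, 8, 5)

# min.freq drops rare words; random.order = FALSE places
# the most frequent words near the center
wordcloud(spam_words, spam_freq, min.freq = 5, random.order = FALSE)
```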

## Naive Bayes

To implement the naive Bayes algorithm we need to load the **e1071** package and use the **naiveBayes()** and **predict()** functions.
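
A toy sketch of the two functions, assuming **e1071** is installed. The tiny "Yes"/"No" data frames are made up to stand in for the word-indicator columns built from the SMS data, and the object names are illustrative.

```r
library(e1071)

# Made-up training data: does the message contain "free"? "call"?
train <- data.frame(
  free = factor(c("Yes", "Yes", "No", "No", "No", "Yes"), levels = c("No", "Yes")),
  call = factor(c("Yes", "No", "No", "Yes", "No", "Yes"), levels = c("No", "Yes"))
)
train_labels <- factor(c("spam", "spam", "ham", "ham", "ham", "spam"))

# Fit the classifier, then predict the class of two new messages
sms_classifier <- naiveBayes(train, train_labels)

test <- data.frame(
  free = factor(c("Yes", "No"), levels = c("No", "Yes")),
  call = factor(c("Yes", "No"), levels = c("No", "Yes"))
)
sms_pred <- predict(sms_classifier, test)
table(sms_pred)
```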


## Does the Laplace estimator help?

The last part of the code tries to improve the model's performance by applying the **Laplace estimator**.  The book uses

**laplace = 1**

Can you use 1.5?

Does 2 help more?
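
To see what different values do, here is a sketch of add-*laplace* smoothing for a single word with two levels (present or absent). The helper `laplace_prob()` and the counts are hypothetical; the formula is the usual (count + laplace) / (class total + laplace × number of levels).

```r
# Hypothetical helper: smoothed estimate of P(word present | class)
laplace_prob <- function(count, class_total, laplace = 0, n_levels = 2) {
  (count + laplace) / (class_total + laplace * n_levels)
}

# Made-up counts: a word that appears in 0 of 20 ham messages
laplace_values <- c(0, 1, 1.5, 2)
smoothed <- sapply(laplace_values, function(a) laplace_prob(0, 20, a))
data.frame(laplace = laplace_values, prob = smoothed)
```

Nothing in this formula requires a whole number, so 1.5 is usable. Larger values pull the estimate further away from zero; whether that improves the classifier has to be checked on the test data.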

## Code Writing

R style guides

- [Google's R Style Guide](https://google.github.io/styleguide/Rguide.xml)
- [R style guide](http://adv-r.had.co.nz/Style.html)
- [The Tidyverse style guide](https://style.tidyverse.org/)

## Reports

- [CS 6375](http://www.hlt.utdallas.edu/~vgogate/ml/2012s/projects.html)
- [CS 391L Machine Learning Project Report Format](http://www.cs.utexas.edu/~mooney/cs391L/paper-template.html)
- [CS 229 Machine Learning Final Reports](http://cs229.stanford.edu/projects2012.html)

## Sentiment Analysis of Twitter Data using R

Here are a few interesting blog posts about connecting to Twitter and performing sentiment analysis.

- [Mining Twitter Data with R](https://sites.google.com/site/miningtwitter/questions/sentiment/sentiment)
- [Sentiment Analysis on Twitter Data : Text Analytics Tutorial](https://mkmanu.wordpress.com/2014/08/05/sentiment-analysis-on-twitter-data-text-analytics-tutorial/)