2024-02-14
Today we will work on implementing the Naive Bayes analysis of the SMS data presented in the book.
We will also discuss writing the reports for the class.
Spam filtering for SMS might be harder than for email. The messages are shorter, so each one gives the classifier fewer words to work with.
Working with text data requires a new set of tools for data analysis. R has a variety of packages for this.
Today we will go through the code from the book that uses naive Bayes to classify SMS messages. This means reading in the text data, counting words, and then applying the naive Bayes algorithm to classify the messages.
The idea behind bag-of-words is that each message is treated as an unordered collection of words: only which words appear, and how often, is used. The order of the words is not taken into consideration.
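As a small illustration of bag-of-words, here is a sketch in base R. The two messages are made up for this example and are not taken from the SMS data, and the variable names are our own.

```r
# Two made-up messages (hypothetical examples, not from the SMS dataset)
messages <- c("free prize call now", "call me when you are free")

# Split each message into lowercase words
words <- strsplit(tolower(messages), "\\s+")

# Vocabulary: every distinct word across all messages
vocab <- sort(unique(unlist(words)))

# One row per message, one column per word, cells hold word counts
bow <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
rownames(bow) <- c("msg1", "msg2")
print(bow)

# Reversing the word order of a message gives exactly the same counts
shuffled <- rev(words[[1]])
same <- all(table(factor(shuffled, levels = vocab)) == bow["msg1", ])
print(same)
```

The last two lines show the point about word order: the counts for a message do not change when its words are rearranged.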
For the data preparation we will use the tm package to process the messages.
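There is a problem with tolower and Dictionary when the book's code is run with current versions of tm: tolower has to be wrapped in content_transformer() when it is passed to tm_map(), and the Dictionary() function is no longer available. We will use the updated commands.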
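The following is a minimal sketch of the data preparation with the updated commands, assuming a recent version of tm is installed. The two messages and all variable names are made up for illustration; the book's code works on the full SMS dataset instead.

```r
library(tm)

# Two made-up messages standing in for the SMS text (hypothetical data)
sms_text <- c("FREE prize!!! Call 0800 now", "Are you free to call me later?")
corpus <- VCorpus(VectorSource(sms_text))

# tolower() is a base R function, so wrap it in content_transformer()
corpus_clean <- tm_map(corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# Document-term matrix: one row per message, one column per word
dtm <- DocumentTermMatrix(corpus_clean)

# Instead of Dictionary(), pass a plain character vector of terms
freq_terms <- findFreqTerms(dtm, lowfreq = 2)
dtm_freq <- DocumentTermMatrix(corpus_clean,
                               control = list(dictionary = freq_terms))
inspect(dtm_freq)
```

On the real data the lowfreq cutoff would be larger; 2 is used here only because the example has two messages.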
To compare the training and test datasets, and to see if there is any difference in the commonly used words in ham and spam, we will draw word clouds.
For this we will use the wordcloud() function from the wordcloud package.
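A minimal sketch of a call to wordcloud() is given here, assuming the wordcloud package is installed. The words and frequencies are made up for illustration and are not computed from the SMS data.

```r
library(wordcloud)

# Made-up word frequencies (hypothetical, not computed from the SMS data)
spam_words <- c("free", "prize", "call", "txt", "claim", "win")
spam_freq  <- c(12, 9, 8, 6, 5, 4)

# More frequent words are drawn larger; min.freq drops the rarest words
set.seed(1)
wordcloud(words = spam_words, freq = spam_freq,
          min.freq = 5, scale = c(3, 0.5), random.order = FALSE)
```

Drawing one cloud for spam and one for ham side by side makes it easy to see which words dominate each class.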
To implement the naive Bayes algorithm we need to load the e1071 package and use the naiveBayes() and predict() functions.
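The two functions can be tried on a toy example first. This is only a sketch, assuming e1071 is installed; the data frame, the labels, and the variable names are made up and stand in for the word indicators built from the SMS messages.

```r
library(e1071)

# Toy training data (hypothetical): does a message contain each word?
yn <- c("No", "Yes")
train <- data.frame(
  free = factor(c("Yes", "Yes", "No", "No", "Yes", "No"), levels = yn),
  call = factor(c("Yes", "No", "Yes", "No", "Yes", "No"), levels = yn)
)
train_type <- factor(c("spam", "spam", "ham", "ham", "spam", "ham"))

# Train the classifier on the word indicators and the known labels
classifier <- naiveBayes(train, train_type)

# Two new messages: one contains both words, one contains neither
new_msgs <- data.frame(
  free = factor(c("Yes", "No"), levels = yn),
  call = factor(c("Yes", "No"), levels = yn)
)

# predict() returns the most likely class for each new message
pred <- predict(classifier, new_msgs)
print(pred)
```

Note that the factor levels in the new data are set to match the training data, so that "Yes" and "No" mean the same thing in both.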
The last part of the code tries to improve the model performance by using the Laplace estimator, which adds a small value to every count so that a word never seen in one class does not force that class's probability to zero. In the book
laplace = 1
is used.
Can you use 1.5?
Does 2 help more?
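To get a feel for these questions, the effect of different laplace values on a single probability estimate can be worked out by hand. This base R sketch uses the usual additive smoothing formula, (count + laplace) / (total + laplace * number of levels); we assume naiveBayes() applies the same form to categorical features. The counts are made up.

```r
# Hypothetical counts: how many spam training messages contain a word
count_yes <- 0    # the word never appeared in spam
n_spam    <- 20   # number of spam training messages
n_levels  <- 2    # each word feature is either "Yes" or "No"

# Smoothed estimate of P(word = "Yes" | spam) for a given laplace value
smoothed <- function(laplace) {
  (count_yes + laplace) / (n_spam + laplace * n_levels)
}

laplace_values <- c(0, 1, 1.5, 2)
estimates <- sapply(laplace_values, smoothed)
print(data.frame(laplace = laplace_values, estimate = estimates))
```

Larger values pull the estimate further away from zero. Whether that helps classification has to be checked by refitting the model and comparing the predictions on the test data.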
Google’s R Style Guide
Here are a few interesting blog posts about connecting to Twitter and performing sentiment analysis.