Moving beyond word counts leads to using n-grams.
2-grams or bigrams can be used to examine which words tend to follow other words.
bigrams can be used to see negation, for example, the use of the word not before another word, such as happy. Only using word counts can miss the negation, such as not happy, in sentiment analysis.
The unnets_tokens() function has an option token which change set to “ngrams” with n = 2 to get bigrams.
The separate() and unite() functions can be very useful when working with n-grams.
Now we need to filter out stop_words from two sets or words, the first word and the second word in a bigram.
bigrams can be useful to analyze text data where one of the words in the bigram is the same for the filter value. In the book the authors look at Streets in Jane Austin’s books.
To do this we can filter on the second word of the bigram “street”.
The td-idf values can be computed for bigrams also.
A very useful application of bigrams is to look for negations in text when performing Sentiment Analysis.
Example: The use of the word not before another word, such as happy. Only using word counts can miss the negation, such as not happy, in sentiment analysis.
bigrams can be used to locate words that should have the opposite sentiment score.
In the book the AFINN lexicon is used in the example given, this lexion give numeric sentiment values, with positive and negative numbers.
The example creates
> negation_words <- c("not", "no", "never", "without")
and used this like the stop_words