---
title: "Frequencies"
author: "Prof. Eric A. Suess"
date: "March 22, 2021"
output:
  beamer_presentation: default
  ioslides_presentation: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Introduction

A central question in text mining and natural language processing is how to quantify what a document is about.

## Word Frequencies

Can we do this by looking at the words that make up the document?

One measure of how important a word may be is its term frequency (tf), how frequently a word occurs in a document, as we examined in Chapter 1.

## Document Frequencies

Another approach is to look at a term's inverse document frequency (idf), which decreases the weight of commonly used words and increases the weight of words that appear rarely in a collection of documents.

## TF-IDF

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.

$$\mathrm{idf}(\text{term}) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)$$

For example, in a corpus of six documents, a term appearing in two of them has $\mathrm{idf} = \ln(6/2) \approx 1.10$, while a term appearing in all six has $\mathrm{idf} = \ln(6/6) = 0$.

The Wikipedia page about [tf-idf](https://en.wikipedia.org/wiki/Tf–idf) is useful.

## Zipf's law

Zipf's law states that the frequency with which a word appears is inversely proportional to its rank; it is an example of a power law.

The Wikipedia page about [Zipf's law](https://en.wikipedia.org/wiki/Zipf's_law) is very helpful.

## TF-IDF

The point of tf-idf is that it identifies words that are important to one document within a collection of documents. The next slides sketch how to compute these quantities in R.
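
## Example: term frequencies in R

As a minimal sketch, assuming the `dplyr`, `tidytext`, and `janeaustenr` packages (example data not referenced elsewhere in these slides) are installed. The chunk is not evaluated when the slides are knit.

```{r, echo=TRUE, eval=FALSE}
library(dplyr)
library(tidytext)
library(janeaustenr)

# One row per (novel, word), with n = raw term count,
# sorted so the most frequent words come first
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

head(book_words)
```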
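
## Example: tf-idf with bind_tf_idf()

Continuing from `book_words` on the previous slide, tidytext's `bind_tf_idf()` adds tf, idf, and tf-idf columns (the idf uses the natural log, matching the formula above). Words with the highest tf-idf are the ones most distinctive of a single novel.

```{r, echo=TRUE, eval=FALSE}
# Requires book_words from the previous slide
# bind_tf_idf(term, document, n) adds tf, idf, and tf_idf columns
book_tf_idf <- book_words %>%
  bind_tf_idf(word, book, n)

# The most distinctive words across the collection
book_tf_idf %>%
  arrange(desc(tf_idf))
```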
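
## Example: checking Zipf's law

One way to check Zipf's law on the same `book_words` data, assuming `ggplot2` is also installed: rank words within each novel and plot frequency against rank on log-log axes, where an inverse-proportional relationship appears as a roughly straight line with slope near $-1$.

```{r, echo=TRUE, eval=FALSE}
library(ggplot2)

# Rank words within each novel; term_frequency is the
# proportion of that novel's words accounted for by each word
freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank = row_number(),
         term_frequency = n / sum(n)) %>%
  ungroup()

ggplot(freq_by_rank, aes(rank, term_frequency, color = book)) +
  geom_line(alpha = 0.8) +
  scale_x_log10() +
  scale_y_log10()
```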