---
title: "Document Term Matrix"
author: "Prof. Eric A. Suess"
date: "April 7, 2021"
output:
  beamer_presentation: default
  ioslides_presentation: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Document Term Matrix (DTM)

The Document Term Matrix (DTM) is one of the most common storage formats for text-based data.

The DTM is based on the bag-of-words model.  Each distinct word is a feature (a column) in the data set.  Because most words do not appear in most documents, this leads to **sparse matrices**.
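
As a minimal sketch of the bag-of-words idea in base R: each document is reduced to word counts and word order is discarded. The helper name `bag_of_words` and the two sentences are made up for illustration.

```r
# Bag-of-words: count the tokens in a document and ignore their order.
# `bag_of_words` is a hypothetical helper, not part of any package.
bag_of_words <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]   # drop empty strings from the split
  table(tokens)
}

a <- bag_of_words("Dog bites man")
b <- bag_of_words("Man bites dog")

# Same words in a different order give the same bag.
identical(a, b)
```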

## Document Term Matrix (DTM)

Some of the most popular R packages and Python libraries work with DTMs.

- R: tm, quanteda, [CRAN Task View: NLP](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html)
- Python: NLTK, Gensim, spaCy

## DTM

- each row represents one document (such as a book or article)
- each column represents one term
- each value (typically) contains the number of appearances of that term in that document
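
The layout above can be sketched with base R alone, assuming three toy documents invented for illustration. Real workflows would normally use a package such as tm or quanteda and a sparse storage class instead of a dense matrix.

```r
# Toy corpus (made-up example text).
docs <- c(
  doc1 = "the cat sat on the mat",
  doc2 = "the dog sat",
  doc3 = "stocks rose sharply"
)

# Split each document into lower-case tokens.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# One (document, term) pair per token occurrence.
pairs <- data.frame(
  document = rep(names(docs), lengths(tokens)),
  term     = unlist(tokens, use.names = FALSE)
)

# Cross-tabulate: rows are documents, columns are terms, cells are counts.
dtm <- unclass(table(pairs$document, pairs$term))

dtm["doc1", "the"]   # number of times "the" appears in doc1
mean(dtm == 0)       # share of zero cells, a simple measure of sparsity
```

Even with three short documents, more than half of the cells are zero, which is why DTMs are usually stored in a sparse format.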

## DTM

Converting between tidy format and a DTM with tidytext

     > tidy()     # DTM to tidy format
     > cast_dtm() # tidy format to DTM
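
A round trip between the two formats might look like the following sketch. It assumes the dplyr, tidytext, and tm packages are installed; the two documents and the object names are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Toy corpus (made-up example text).
docs <- tibble(
  document = c("doc1", "doc2"),
  text     = c("the cat sat on the mat", "the dog sat")
)

# Tidy format: one row per (document, word) with its count n.
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  count(document, word)

# Tidy format -> DocumentTermMatrix (the class used by the tm package).
dtm <- word_counts %>%
  cast_dtm(document, word, n)

# DocumentTermMatrix -> tidy format with columns document, term, count.
tidied <- tidy(dtm)
```

`tidy()` returns only the non-zero cells, so `tidied` has as many rows as `word_counts`.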
     
## Problem

Section 5.3.1 of *Text Mining with R*, "Example: mining financial articles", does not run!