The Document Term Matrix (DTM) is one of the most common data storage formats for text-based data.
The DTM is based on the bag-of-words model. Each word is a feature in the data set, and any single document contains only a small share of all words, so most cells are zero. This leads to sparse matrices.
April 7, 2021
Some of the most popular R libraries and Python packages use the DTM as their input format.
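To make the bag-of-words idea concrete, here is a minimal sketch in plain Python (standard library only, not any particular package's API). The three toy documents are made up for illustration. It builds one row per document and one column per distinct word, then measures how many cells are zero.

```python
from collections import Counter

# Toy corpus: hypothetical documents, for illustration only
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

# Bag-of-words: split on whitespace and ignore word order
tokenized = [doc.split() for doc in docs]

# Vocabulary: one column (feature) per distinct word, sorted for a stable order
vocab = sorted({word for tokens in tokenized for word in tokens})

# DTM: one row per document, one column per term, cell = term count
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocab])

# Sparsity: share of cells that are zero
n_cells = len(dtm) * len(vocab)
n_zero = sum(cell == 0 for row in dtm for cell in row)
sparsity = n_zero / n_cells

print(vocab)
for row in dtm:
    print(row)
print(f"sparsity: {sparsity:.2f}")
```

Even with three short documents, more than half of the cells are zero. With a real vocabulary of thousands of words the share is far higher, which is why DTMs are normally stored in a sparse format rather than as nested lists like this.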
tidytext to DTM
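> tidy()      # DTM to tidy format (one row per document-term pair)
> cast_dtm()  # tidy format to DTM (tidytext also has cast_dfm() and cast_sparse())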
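The two directions can be sketched in plain Python to show what each conversion does to the data. This is an illustration of the idea, not the tidytext API: the function names `cast_to_dtm` and `tidy_from_dtm` and the sample rows are hypothetical.

```python
# Tidy (long) format: one row per (document, term, count); zero counts are not stored
tidy_rows = [
    ("doc1", "cat", 1),
    ("doc1", "the", 2),
    ("doc2", "dog", 1),
    ("doc2", "the", 1),
]


def cast_to_dtm(rows):
    """Long triples -> (documents, terms, dense matrix of counts)."""
    documents = sorted({d for d, _, _ in rows})
    terms = sorted({t for _, t, _ in rows})
    row_of = {d: i for i, d in enumerate(documents)}
    col_of = {t: j for j, t in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in documents]
    for d, t, n in rows:
        matrix[row_of[d]][col_of[t]] = n
    return documents, terms, matrix


def tidy_from_dtm(documents, terms, matrix):
    """Dense matrix -> long triples, keeping non-zero cells only."""
    return [
        (d, t, matrix[i][j])
        for i, d in enumerate(documents)
        for j, t in enumerate(terms)
        if matrix[i][j] != 0
    ]


documents, terms, matrix = cast_to_dtm(tidy_rows)
round_trip = tidy_from_dtm(documents, terms, matrix)

print(terms)       # column order of the matrix
print(matrix)      # one row per document
print(round_trip)  # back in long format
```

The long format only lists the non-zero cells, so converting a DTM to tidy form and back loses nothing as long as every document has at least one term.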
5.3.1 Example: mining financial articles (does not run!)