--- title: "Correlation" author: "Prof. Eric A. Suess" format: revealjs --- ## Counting and correlating pairs of words - Tokenizing by n-gram is a useful way to explore pairs of adjacent words. - Tidy data is a useful structure for comparing between variables or grouping by rows, but it can be challenging to compare between rows: for example, to count the number of times that two words appear within the same document, or to see how correlated they are. - Most operations for finding pairwise counts or correlations need to turn the data into a **wide matrix** first. ## Wide Format to examine correlation ![](images/widyr.jpg) ## Phi Coeffient - "We may instead want to examine correlation among words, which indicates how often they appear together relative to how often they appear separately." - See the Wikipedia page about the [Phi Coefficient](https://en.wikipedia.org/wiki/Phi_coefficient)