---
title: "Association Rules"
author: "Prof. Eric A. Suess"
date: "March 3, 2021"
output:
  beamer_presentation: default
  ioslides_presentation: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Introduction

Do you ever make "impulse purchases"?

Why is the product always "nearby" and "easily available"?

Why do online retailer websites have a "wishlist" and a "shopping cart"?

## Introduction

**Recommendation systems** have traditionally been based on the subjective experience of marketers.

With online retailers, machine learning is now used to learn the patterns of **purchasing behavior**.

Barcode scanners, computerized inventory systems, and online shopping generate a lot of **transactional data** for **data mining**.

## Introduction

Do you know what an [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit) is?

Answer: **Stock Keeping Unit**

## Introduction

In this chapter we will learn about methods for identifying associations among items in transactional data. This is known as **market basket analysis**.

## Understanding association rules

The result of **market basket analysis** is a set of **association rules**. For example,

{peanut butter, jelly} -> {bread}

Association rules are learned from subsets of **itemsets**.

## Understanding association rules

Association rules were developed in the context of **Big Data**, **database science**, and **data mining** for **knowledge discovery in databases** (KDD).

*Looking for the needle in the haystack.*

Association rule learning is **unsupervised**, so there is **no need** for the algorithm to be **trained**. There is also no objective measure of performance for such rule learners.

## Apriori algorithm

The complexity of transactional data is what makes association rule mining a challenging task.

**Transactional datasets** are typically **extremely large**, both in the **number of transactions** and in the **number of features** or items for sale.

The number of potential itemsets grows exponentially with the number of items for sale: with $k$ items there are $2^k - 1$ possible itemsets. The good news is that many itemsets are rare.

By **ignoring rare itemsets**, it is possible to limit the search for rules.

## Apriori algorithm

The most widely used algorithm is the **Apriori algorithm**. It employs a simple *a priori* belief as a guideline for reducing the association rule space: *all subsets of a frequent itemset must also be frequent*. This is the **Apriori property**.

See the paper [Fast algorithms for mining association rules](http://www.vldb.org/conf/1994/P487.PDF), Agrawal and Srikant (1994).

Or [A comparison of association rule discovery and bayesian network causal inference algorithms](https://www.semanticscholar.org/paper/A-Comparison-of-Association-Rule-Discovery-and-Bowes-Neufeld/1d943d6cf426f5c1b47c00bcf4d7dea7b8dddf6e/pdf), Bowes et al.

## Measuring rule interest

Whether or not an association rule is deemed **interesting** is determined by two statistical measures:

- support, $P(X)$
- confidence, $P(Y|X)$

## Measuring rule interest

By providing **minimum thresholds** for each of these metrics and applying the Apriori principle, it is easy to limit the number of rules reported.

## Measuring rule interest - support

The **support** of an itemset measures how frequently it occurs in the data:

$$support(X) = \frac{count(X)}{N}$$

where $N$ is the number of transactions in the database and $count(X)$ is the number of transactions that contain the itemset $X$.
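## Measuring rule interest - support

Here is a minimal sketch of the support calculation in base R, using a made-up list of five transactions (the data and the `itemset_support()` helper are illustrative, not from the text):

```{r support-demo, echo=TRUE}
transactions <- list(
  c("bread", "peanut butter", "jelly"),
  c("bread", "butter"),
  c("peanut butter", "jelly"),
  c("bread", "peanut butter", "jelly", "milk"),
  c("milk")
)

# support(X) = count(X) / N
itemset_support <- function(itemset, trans) {
  sum(sapply(trans, function(t) all(itemset %in% t))) / length(trans)
}

itemset_support(c("peanut butter", "jelly"), transactions)  # 3/5 = 0.6
```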
## Measuring rule interest - confidence

A rule's **confidence** is a measurement of its predictive power or accuracy. It is defined as the support of the itemset containing both $X$ and $Y$ divided by the support of the itemset containing only $X$:

$$confidence(X \rightarrow Y) = \frac{support(X, Y)}{support(X)}$$

The confidence tells us the proportion of transactions containing the item or itemset $X$ that also contain the item or itemset $Y$.

## Measuring rule interest - confidence

Note that $X \rightarrow Y$ is not the same as $Y \rightarrow X$.

**Rules** that have both **high support** and **high confidence** are referred to as **strong rules**.

## Building a set of rules with the Apriori principle

The **Apriori principle** states that all subsets of a frequent itemset must also be frequent.

The **Apriori algorithm** uses this principle to exclude potential association rules prior to actually evaluating them.

The **process of creating rules** occurs in two phases (see the sketch after the Example slides):

- find all itemsets that meet a minimum **support threshold**
- create rules from these itemsets that meet a minimum **confidence threshold**

## Example

The author gives an example of market basket analysis that uses **transaction data** to identify frequently purchased groceries with association rules.

**Recommendation system**

## Example

The example uses:

- **unstructured data**
- **NoSQL**
- a **sparse matrix**

We will use the R packages

- [arules](https://www.jstatsoft.org/article/view/v014i15/v14i15.pdf)
- [arulesViz](https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf)

## Example

**Lift** measures how much more likely an item is to be purchased, relative to its typical purchase rate, given that we know another item has been purchased:

$$lift(X \rightarrow Y) = \frac{confidence(X \rightarrow Y)}{support(Y)}$$

Note that $lift(X \rightarrow Y) = lift(Y \rightarrow X)$, so lift is symmetric.
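## Example

Below is a minimal sketch of the two-phase process using the **arules** package and its bundled `Groceries` transaction data. The thresholds are illustrative choices, not fixed values, and the chunk is set to `eval=FALSE` so the slides knit without the package installed.

```{r apriori-demo, echo=TRUE, eval=FALSE}
library(arules)

data(Groceries)  # sparse transaction data bundled with arules

# Phase 1: find all itemsets meeting the minimum support threshold.
# Phase 2: build rules from them meeting the confidence threshold.
rules <- apriori(Groceries,
                 parameter = list(support = 0.006,
                                  confidence = 0.25,
                                  minlen = 2))
```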
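## Example

Each rule carries its support, confidence, and lift, so sorting by **lift** surfaces the rules whose items co-occur far more often than chance would predict (continuing the sketch above, assuming the `rules` object from the previous chunk):

```{r lift-demo, echo=TRUE, eval=FALSE}
# Show the five rules with the highest lift
inspect(sort(rules, by = "lift")[1:5])
```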