Association Analysis

One of the popular methods of data mining is algorithms for finding association rules that help find patterns between related events. An example of such a rule is a statement that a customer who buys product A is likely to buy product B with the probability of 75%.

For the first time, the association analysis was used to find typical purchase patterns made in supermarkets, so sometimes it is named the market basket analysis.

Suppose there is a database that consists of purchasing transactions. Each transaction is a set of goods purchased by a customer per visit. Let I = {i1, i2, i3, ...in} is a set of goods named elements. Let D be a set of transactions where each transaction T is a set of elements from I,  I. Each transaction is a binary vector where t[k]=1, if ik element is present in the transaction, otherwise t[k]=0. The T transaction contains X, a set of elements from I, if  T. An association rule is an implication  Y, where  I,  I and  . The  Y rule has the support s, if s% transactions are of D, contain X  Y, s( Y) = s( Y). The rule confidence shows what is the probability that X follows from Y. The  Y rule is true with the confidence c, if c% transactions of D, containing X, also include Y, c( Y) = s( Y)/s(X).

For example, consider the statement: 75% of transactions containing bread also contain milk. 3% of the total of all transactions contain both items. 75% is the confidence level, 3% is the support.

In other words, the purpose of association analysis is discovering the following dependencies: if the transaction contains a certain itemset X, it can be concluded that another itemset Y must also appear in this transaction. Establishing these relationships enables the user to find very simple and intuitive rules.

Algorithms for finding association rules are intended for finding all rules. Support and confidence of these rules must be greater than predefined thresholds named minimum support and minimum confidence, respectively.

The task of finding association rules is divided into two subtasks:

  1. Finding all itemsets that satisfy the support threshold. Such itemsets are named frequent.

  2. Generating rules from the found itemsets, according to p.1, with the confidence that satisfies the confidence threshold.

Values for the minimum support and minimum confidence parameters are selected in such way as to limit the number of found rules. If support has a great value, the algorithms find rules that are well-known to analysts or so trivial that there is not point in this analysis. On the other hand, the low value of support generates a great number of rules that requires substantial computational resources. However, most interesting rules are found at the low value of support threshold. Although, too low of a value generates statistically unreasonable rules.

See also:

Library of Methods and Models | Association Analysis | ISmAssociationRules