Association rules: apriori algorithm

пятница, 10 апреля 2009, Александр Краковецкий

Association Rules Overview

In data mining, association rule mining is a popular and well researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal et al. introduced association rules for discovering regularities between products in large scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} => {beef} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements. In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection and bioinformatics.

Apriori Algorithm Overview

The Apriori algorithm is used to analyze a list of transactions for items that are frequently purchased together. Considering a transaction where the sale of software is increased by the sale of e-books, Support and Confidence are two measures used to describe market based analysis association rules created with an Apriori algorithm. E.g. a Support measure of 1% and a Confidence measure of 50% means that 1% of transactions analyzed contain purchases of e-books and software and 50% of customers who bought an e-book also bought a software.

A set of items is known as an itemset. An itemset which contains k items is known as a k-itemset. E.g. a set of items {Books, CD, DVD, Video} is a 4-itemset.

The number of transactiobns that contain an itemset is known as the Frequency or Support Count of the itemset. If the number of transactions containing an itemset satisfies the minimum support count specified then the itemset is known as a Frequent Itemset. E.g. the 2-itemset {Books, DVD} has a support count of 5 in the database of transactions below.

The database below contains 9 transactions. Find the support count and confidence for the the 2-itemset {Books, DVD}. Using the market based analysis apriori algorithm create an assocation data mining rule between {Books} and {DVD}.

Firstly the number of transactions that contain the 2-itemset {Books, DVD} is 5. The number of transactions containing the itemset {Books} is 6.

Consequently the support for the 2-itemset {Books, DVD} is (5/9) * (100%) = 55.6%

The confidence for the 2-itemset {Books, DVD} is = (Support Count({Books, DVD}) / Support Count({Books}) * (100%) .

Consequently the confidence for the 2-itemset {Books, DVD} is = ((5/6) * 100%) = 83.3%

Transaction 1: {Books, CD, DVD}
Transaction 2: {CD, Games}
Transaction 3: {CD, DVD}
Transaction 4: {Books, CD, Games}
Transaction 5: {Books, DVD}
Transaction 6: {CD, DVD}
Transaction 7: {Books, DVD}
Transaction 8: {Books, CD, DVD, Video}
Transaction 9: {Books, CD, DVD}

The Apriori Algorithm basically finds the support count and confidence of itemsets eliminating those itemsets that do not meet a minimum support count and confidence measure from a final list of rules created. Considering the list of transactions above, the algorithm will perform the following steps for a minimum support count of 3:

The APriori algorithm creates a list of unique items in a 1-itemset Candidate Itemset corresponding to {Books, CD, DVD, Games, Video}
The support count of each item in the list above is obtained and any item that does not satisfy the minimum support count is eliminated from further analysis creating a 1-itemset Frequent Itemset
The 1-itemset frequent itemset is joined with itself to create a 2-itemset candidate itemset.
The steps taken for the 1-itemset candidate itemset is repeated for the 2-itemset candidate itemset.
The steps above are repeated until a frequent itemset is empty and no new candidate itemsets can be generated.

A confidence measure is created for each rule generated from the frequent itemsets.

1. Apriori Algorithm for Market Basket Analysis C# Implementation

Source: http://www.kdkeys.net/forums/thread/2043.aspx

You can download source code from the codeplex site.

2. Apriori Algorithm with JavaScript

Source: http://allmybrain.com/2007/11/12/implementing-the-apriori-data-mining-algorithm-with-javascript/

You can download source code from the codeplex site.

3. Association Rules Demo 1.0 in Russian

Program for association rules mining in Russian without source code from BaseGroup Labs. Binaries are available on the codeplex site.

References

Downloads

P.S. I will update materials and references about the apriori algorithm. You also may write your materials and references in the comments or send to the msugvn[at]gmail.com.

Поиск по сайту