http://www.hmwu.idv.tw
Han-Ming Wu (吳漢銘), Department of Statistics, National Chengchi University
Association Analysis
Association Rules
C04
Market Basket Analysis
http://www.analyticsvidhya.com/blog/2014/08/effective-cross-selling-market-basket-analysis/
https://blogs.adobe.com/digitalmarketing/wp-content/uploads/2013/08/pic1.jpg
Application Examples
Market Basket Analysis
Market Basket Analysis is a data mining approach used
to find associations and correlations between the different items that customers place in their shopping baskets.
to help store owners increase profit by guiding how products are arranged and marketed.
Retailers leverage Market Basket Analysis
to provide a window into consumer shopping behavior, revealing how consumers select products, make spending tradeoffs, and group items in a shopping cart.
to understand how baskets are built, which helps them merchandise more effectively by using market basket dynamics in pricing and promotion decisions.
R. Agrawal, T. Imieliński and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” The ACM SIGMOD International Conference on Management of Data, pp. 207-216, May 1993.
(cited 19,551 times)
Association Rule Mining
The ideas of Association Rule Learning (also called Association Rule Mining) come from the market basket analysis.
AR mining:
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Transaction ID (TID)   Items
1                      Bread, Peanuts, Milk, Fruit, Jam
2                      Bread, Jam, Soda, Chips, Milk, Fruit
3                      Steak, Jam, Soda, Chips, Bread
4                      Jam, Soda, Peanuts, Milk, Fruit
5                      Jam, Soda, Chips, Milk, Bread
6                      Fruit, Soda, Chips, Milk
7                      Fruit, Soda, Peanuts, Milk
8                      Fruit, Peanuts, Cheese, Yogurt
Rules (obtained by mining the transactions):
{bread} ⇒ {milk}
{soda} ⇒ {chips}
{bread} ⇒ {jam}
Association Rule Mining
Formalizing the problem:
Transaction Database T: a set of transactions T = {t1, t2, …, tn}.
Each transaction contains a set of items I (an itemset).
An itemset is a collection of items I = {i1 , i2 , …, im}.
k-itemset: an itemset that contains k items.
Association rules are rules presenting association or correlation between itemsets.
An association rule has the form A ⇒ B, where A and B are two disjoint itemsets, referred to respectively as the lhs (left-hand side, the antecedent) and rhs (right-hand side, the consequent) of the rule.
Definition: Frequent Itemset
Support count (σ)
Frequency of occurrence of an itemset.
σ({Milk, Bread}) = 3, σ({Soda, Chips}) = 4.
Support (s)
How frequently the rule occurs:
the fraction of transactions that contain both itemsets A and B.
Support(A ⇒ B) = P(A ∩ B)
s({Milk, Bread}) = 3/8; s({Soda, Chips}) = 4/8
Frequent itemset:
An itemset whose support reaches the minimum support threshold: s(itemset) ≥ minsup.
Items that frequently appear together.
The strength of the association.
Note: in P(A ∩ B), ∩ means "and": the transaction contains every item of A and every item of B.
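The support count and support defined above can be checked directly on the eight example transactions. The following is a minimal base R sketch; the helper names `support_count()` and `support()` are made up for this illustration and are not part of any package.

```r
# The eight example transactions (TID 1 to 8) from the table above.
transactions <- strsplit(c(
  "Bread,Peanuts,Milk,Fruit,Jam",
  "Bread,Jam,Soda,Chips,Milk,Fruit",
  "Steak,Jam,Soda,Chips,Bread",
  "Jam,Soda,Peanuts,Milk,Fruit",
  "Jam,Soda,Chips,Milk,Bread",
  "Fruit,Soda,Chips,Milk",
  "Fruit,Soda,Peanuts,Milk",
  "Fruit,Peanuts,Cheese,Yogurt"
), ",")

# Hypothetical helper: number of transactions containing every item of the itemset.
support_count <- function(itemset, db) {
  sum(vapply(db, function(tr) all(itemset %in% tr), logical(1)))
}

# Hypothetical helper: the same count as a fraction of all transactions.
support <- function(itemset, db) support_count(itemset, db) / length(db)

support_count(c("Milk", "Bread"), transactions)  # 3
support_count(c("Soda", "Chips"), transactions)  # 4
support(c("Milk", "Bread"), transactions)        # 0.375
```

An itemset is then frequent when `support(itemset, transactions) >= minsup` for whatever threshold is chosen.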
Confidence and Lift
Confidence (c):
the percentage of cases containing A that also contain B.
confidence(A ⇒ B) = P(B | A) = P(A ∩ B)/P(A)
A rule is kept when confidence(A ⇒ B) ≥ minconf (minimum confidence).
Lift:
the ratio of confidence to the percentage of cases containing B.
lift(A ⇒ B) = P(B | A)/P(B)
= confidence(A ⇒ B)/P(B)
= P(A ∩ B)/(P(A)P(B))
lift(A ⇒ B) = 1: A and B are independent; A does not raise the likelihood that B occurs.
lift(A ⇒ B) > 1: A raises the likelihood of B; the larger the lift, the stronger the association.
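As a quick numerical check of these two definitions, the sketch below computes confidence and lift on the eight example transactions in base R. The function names `support()`, `confidence()` and `lift()` are illustrative helpers defined here, not functions from arules.

```r
# The eight example transactions (TID 1 to 8).
transactions <- strsplit(c(
  "Bread,Peanuts,Milk,Fruit,Jam",
  "Bread,Jam,Soda,Chips,Milk,Fruit",
  "Steak,Jam,Soda,Chips,Bread",
  "Jam,Soda,Peanuts,Milk,Fruit",
  "Jam,Soda,Chips,Milk,Bread",
  "Fruit,Soda,Chips,Milk",
  "Fruit,Soda,Peanuts,Milk",
  "Fruit,Peanuts,Cheese,Yogurt"
), ",")

# Fraction of transactions that contain every item of the itemset.
support <- function(itemset, db) {
  mean(vapply(db, function(tr) all(itemset %in% tr), logical(1)))
}

# confidence(A => B) = P(A and B) / P(A)
confidence <- function(lhs, rhs, db) support(c(lhs, rhs), db) / support(lhs, db)

# lift(A => B) = confidence(A => B) / P(B)
lift <- function(lhs, rhs, db) confidence(lhs, rhs, db) / support(rhs, db)

confidence("Bread", "Milk", transactions)  # 0.75
lift("Bread", "Milk", transactions)        # 1
lift("Soda", "Chips", transactions)        # about 1.33
```

In this small data set {Bread} ⇒ {Milk} has confidence 0.75 but lift 1, because Milk already appears in 75% of all transactions, while {Soda} ⇒ {Chips} has lift above 1.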
NOTE: There are many other interestingness measures, such as chi-square, conviction, gini and leverage. An introduction to over 20 measures can be found in Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. In KDD ’02: Proceedings of the 8th
Find a Rule
Rules originating from the same itemset have identical support but can have different confidence.
Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold.
confidence ≥ minconf threshold.
The following rules are binary partitions of the same itemset, {Milk, Bread, Jam}:
• {Bread, Jam} ⇒ {Milk}, s=0.375, c=3/4=0.75
• {Milk, Jam} ⇒ {Bread}, s=0.375, c=0.75
• {Bread} ⇒ {Milk, Jam}, s=0.375, c=0.75
• {Jam} ⇒ {Bread, Milk}, s=0.375, c=0.6
• {Milk} ⇒ {Bread, Jam}, s=0.375, c=0.5
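The figures for these rules can be reproduced by enumerating every binary partition of {Milk, Bread, Jam}. This is only a base R sketch; `support()` and the `rules` data frame are names introduced for the illustration.

```r
# The eight example transactions (TID 1 to 8).
transactions <- strsplit(c(
  "Bread,Peanuts,Milk,Fruit,Jam",
  "Bread,Jam,Soda,Chips,Milk,Fruit",
  "Steak,Jam,Soda,Chips,Bread",
  "Jam,Soda,Peanuts,Milk,Fruit",
  "Jam,Soda,Chips,Milk,Bread",
  "Fruit,Soda,Chips,Milk",
  "Fruit,Soda,Peanuts,Milk",
  "Fruit,Peanuts,Cheese,Yogurt"
), ",")

support <- function(itemset, db) {
  mean(vapply(db, function(tr) all(itemset %in% tr), logical(1)))
}

itemset <- c("Milk", "Bread", "Jam")

# Every non-empty proper subset of the itemset can serve as a left-hand side.
lhs_list <- unlist(
  lapply(1:(length(itemset) - 1), function(k) combn(itemset, k, simplify = FALSE)),
  recursive = FALSE
)

rules <- data.frame(
  lhs = vapply(lhs_list, paste, character(1), collapse = ", "),
  rhs = vapply(lhs_list, function(l) paste(setdiff(itemset, l), collapse = ", "),
               character(1)),
  # All rules from one itemset share the support of that itemset.
  support = support(itemset, transactions),
  confidence = vapply(lhs_list,
                      function(l) support(itemset, transactions) / support(l, transactions),
                      numeric(1))
)
print(rules)
```

The enumeration yields six rules. The sixth, {Milk, Bread} ⇒ {Jam}, is not in the list above; by the same counts its confidence is 3/3 = 1.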
Mining Association Rules
Brute-force approach:
List all possible association rules.
Compute the support and confidence for each rule.
Prune rules that fail the minsup and minconf thresholds.
Brute-force approach is computationally prohibitive!
Two step approach:
Step (1): Frequent Itemset Generation:
Generate all itemsets whose support >= minsup
Step (2): Rule Generation:
Generate high confidence rules from frequent itemset.
Each rule is a binary partitioning of a frequent itemset.
Frequent itemset generation is computationally expensive.
Step (1): Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets.
Source: (1) Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) (2) Tan,Steinbach, Kumar Introduction to Data Mining
> no.item <- 5
> sum(choose(no.item, 0:no.item))
[1] 32
> 2^no.item
[1] 32
Total Number of Possible Association Rules
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules: R = 3^d − 2^(d+1) + 1
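The number of possible rules can be checked by direct counting: choose k items for the left-hand side, then a non-empty right-hand side from the remaining d − k items. The sketch below compares that double sum with the standard closed form 3^d − 2^(d+1) + 1; the function names `count_rules()` and `closed_form()` are invented for the illustration.

```r
# Count rules A => B with A and B non-empty and disjoint (requires d >= 2):
# k items on the left-hand side, j of the remaining d - k items on the right.
count_rules <- function(d) {
  total <- 0
  for (k in 1:(d - 1)) {
    for (j in 1:(d - k)) {
      total <- total + choose(d, k) * choose(d - k, j)
    }
  }
  total
}

# Closed form of the same count.
closed_form <- function(d) 3^d - 2^(d + 1) + 1

count_rules(6)  # 602
closed_form(6)  # 602
```

Even for d = 6 items there are 602 candidate rules, which is why listing all rules by brute force quickly becomes impractical.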
Frequent Itemset Generation Strategies
Reduce the number of candidates (M).
Complete search: M = 2^d.
Use pruning techniques to reduce M.
Reduce the number of transactions (N).
Reduce size of N as the size of itemset increases.
Reduce the number of comparisons (NM).
Use efficient data structures to store the candidates or transactions.
No need to match every candidate against every transaction.
Reducing the Number of Candidates:
Apriori Principle
Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent.
Apriori principle holds due to the anti-monotone property of support measure: support of an itemset never exceeds the support of its subsets.
Apriori Principle
Source: (1) Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) (2) Tan, Steinbach, Kumar, Introduction to Data Mining
Illustrating Apriori Principle
Minimum support count = 4

1-itemsets
Item      Count
Bread     4
Peanuts   4
Milk      6
Fruit     6
Jam       5
Soda      6
Chips     4
Steak     1
Cheese    1
Yogurt    1

2-itemsets
Itemset          Count
Bread, Jam       4
Peanuts, Fruit   4
Milk, Fruit      5
Milk, Jam        4
Milk, Soda       5
Fruit, Soda      4
Jam, Soda        4
Soda, Chips      4

3-itemsets
Itemset             Count
Milk, Fruit, Soda   4
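The counts in this illustration can be reproduced with a few lines of base R. The sketch below counts k-item combinations level by level and builds the larger itemsets only from items that are themselves frequent, which is the saving the Apriori principle allows. The names `support_count()`, `frequent_k()` and `min_count` are made up for this example.

```r
# The eight example transactions (TID 1 to 8).
transactions <- strsplit(c(
  "Bread,Peanuts,Milk,Fruit,Jam",
  "Bread,Jam,Soda,Chips,Milk,Fruit",
  "Steak,Jam,Soda,Chips,Bread",
  "Jam,Soda,Peanuts,Milk,Fruit",
  "Jam,Soda,Chips,Milk,Bread",
  "Fruit,Soda,Chips,Milk",
  "Fruit,Soda,Peanuts,Milk",
  "Fruit,Peanuts,Cheese,Yogurt"
), ",")
min_count <- 4

support_count <- function(itemset, db) {
  sum(vapply(db, function(tr) all(itemset %in% tr), logical(1)))
}

# Count every k-item combination drawn from `pool`; keep the frequent ones.
frequent_k <- function(k, pool) {
  cands <- combn(pool, k, simplify = FALSE)
  counts <- vapply(cands, support_count, integer(1), db = transactions)
  names(counts) <- vapply(cands, paste, character(1), collapse = ", ")
  counts[counts >= min_count]
}

items <- sort(unique(unlist(transactions)))
f1 <- frequent_k(1, items)      # frequent single items
f2 <- frequent_k(2, names(f1))  # pairs built only from frequent items
f3 <- frequent_k(3, names(f1))  # triples built only from frequent items

choose(length(items), 2)  # 45 candidate pairs without pruning
choose(length(f1), 2)     # 21 candidate pairs after dropping infrequent items
f3
```

Item names inside each itemset come out in alphabetical order here, so the single frequent 3-itemset is printed as "Fruit, Milk, Soda".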
Definition of Apriori Algorithm
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.
Apriori uses a "bottom up" approach, where frequent
subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are
tested against the data.
Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or records of visits to a website).
Steps To Perform Apriori Algorithm
Step (1): Scan the transaction database to get the support S of each 1-itemset, compare S with minsup, and get the set of frequent 1-itemsets, L1.
Step (2): Join Lk-1 with Lk-1 to generate a set of candidate k-itemsets. Use the Apriori property to prune the infrequent k-itemsets from this set.
Step (3): Scan the transaction database to get the support S of each candidate k-itemset in the final set, compare S with minsup, and get the set of frequent k-itemsets, Lk.
Step (4): Is the candidate set empty (Null)? If NO, go back to Step (2). If YES, continue with Step (5).
Step (5): For each frequent itemset f, generate all nonempty subsets of f.
Step (6): For every nonempty subset s of f, output the rule “s ⇒ (f − s)” if the confidence C of the rule (= support of f / support of s) ≥ minconf.
Apriori Algorithm
Let k=1
Generate frequent itemsets of length 1.
Repeat until no new frequent itemsets are identified:
Generate length (k+1) candidate itemsets from length k frequent itemsets.
Prune candidate itemsets containing subsets of length k that are infrequent.
Count the support of each candidate by scanning the DB.
Eliminate candidates that are infrequent, leaving only those that are frequent.
Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Source: Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Example of Apriori Run
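Transaction database (the pruning in this run corresponds to a minimum support count of 2):
TID   Items
1     A, C, D
2     B, C, E
3     A, B, C, E
4     B, E

1st scan: C1 (candidate 1-itemsets with support counts)
Itemset   Sup.
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1 (frequent 1-itemsets)
Itemset   Sup.
{A}       2
{B}       3
{C}       3
{E}       3

C2 (candidate 2-itemsets generated from L1)
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan: C2 with support counts
Itemset   Sup.
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2 (frequent 2-itemsets)
Itemset   Sup.
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (candidate 3-itemsets generated from L2)
{B, C, E}

3rd scan: L3 (frequent 3-itemsets)
Itemset     Sup.
{B, C, E}   2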
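The run above can be replayed with a small base R function. This is a simplified sketch rather than the optimized algorithm: the join step takes unions of two frequent itemsets instead of matching sorted prefixes, and the function name `apriori_sketch()` and its arguments are hypothetical. The minimum support count of 2 is an assumption read off the example.

```r
apriori_sketch <- function(db, min_count) {
  count <- function(s) sum(vapply(db, function(tr) all(s %in% tr), logical(1)))
  key <- function(s) paste(sort(s), collapse = ",")

  candidates <- as.list(sort(unique(unlist(db))))  # C1: all single items
  levels <- list()                                 # levels[[k]] holds Lk with counts
  while (length(candidates) > 0) {
    counts <- vapply(candidates, count, integer(1))  # one scan of the database
    frequent <- candidates[counts >= min_count]
    if (length(frequent) == 0) break
    levels[[length(levels) + 1]] <-
      setNames(counts[counts >= min_count], vapply(frequent, key, character(1)))

    k <- length(frequent[[1]]) + 1
    # Join step (simplified): unions of two frequent itemsets with exactly k items.
    joined <- list()
    for (a in frequent) for (b in frequent) {
      u <- sort(union(a, b))
      if (length(u) == k) joined[[key(u)]] <- u
    }
    # Prune step: drop candidates that have an infrequent (k-1)-subset.
    frequent_keys <- names(levels[[length(levels)]])
    candidates <- unname(Filter(function(u) {
      subsets <- combn(u, k - 1, simplify = FALSE)
      all(vapply(subsets, key, character(1)) %in% frequent_keys)
    }, joined))
  }
  levels
}

db <- list(c("A", "C", "D"), c("B", "C", "E"), c("A", "B", "C", "E"), c("B", "E"))
result <- apriori_sketch(db, min_count = 2)
result[[2]]  # L2: A,C = 2; B,C = 2; B,E = 3; C,E = 2
result[[3]]  # L3: B,C,E = 2
```

At the third level the union-based join proposes {A,B,C}, {A,C,E} and {B,C,E}; the prune step removes the first two because {A,B} and {A,E} are not frequent, leaving the same C3 as in the example.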
Step (2): Rule Generation
Given a frequent itemset {L}, find all non-empty subsets {f} ⊂ {L} such that the association rule {f} ⇒ {L – f} satisfies the minimum confidence.
Create the rule {f} ⇒ {L – f}.
If L={A,B,C,D} is a frequent itemset, candidate rules:
{ABC} ⇒ {D}, {ABD} ⇒ {C}, {ACD} ⇒ {B}, {BCD} ⇒ {A}, {A} ⇒ {BCD}, {B} ⇒ {ACD},{C} ⇒ {ABD}, {D} ⇒ {ABC}, {AB} ⇒ {CD}, {AC} ⇒ {BD}, {AD} ⇒ {BC},
{BC} ⇒ {AD}, {BD} ⇒ {AC}, {CD} ⇒ {AB}.
If |L| = k, then there are 2^k – 2 candidate association rules (ignoring {L} ⇒ {∅} and {∅} ⇒ {L}).
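The candidate rules for L = {A,B,C,D} can be listed mechanically by taking every non-empty proper subset of L as a left-hand side. A short base R sketch (the variable names are illustrative):

```r
L <- c("A", "B", "C", "D")

# Every non-empty proper subset of L is a possible left-hand side.
lhs_list <- unlist(
  lapply(1:(length(L) - 1), function(k) combn(L, k, simplify = FALSE)),
  recursive = FALSE
)

# The right-hand side is whatever remains of L.
rules <- vapply(lhs_list, function(lhs) {
  sprintf("{%s} => {%s}", paste(lhs, collapse = ""),
          paste(setdiff(L, lhs), collapse = ""))
}, character(1))

rules
length(rules)    # 14
2^length(L) - 2  # 14
```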
Generate Rules from Frequent Itemsets
Confidence does not have an anti-monotone property
c({ABC} ⇒ {D}) can be larger or smaller than c({AB} ⇒ {D})
But confidence of rules generated from the same itemset has an anti-monotone property
e.g., L = {A,B,C,D}:
c({ABC} ⇒ {D}) >= c({AB} ⇒ {CD}) >= c({A} ⇒ {BCD})
Confidence is anti-monotone with respect to the number of items on the right hand side of the rule.
We can apply this property to prune the rule generation.
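One way to see why this ordering holds, written out for L = {A,B,C,D} with σ denoting the support count defined earlier: all rules from the same itemset share the numerator σ(L), and moving items from the left-hand side to the right-hand side can only increase the denominator.

```latex
\begin{aligned}
c(\{A,B,C\} \Rightarrow \{D\})   &= \frac{\sigma(\{A,B,C,D\})}{\sigma(\{A,B,C\})}, \\
c(\{A,B\} \Rightarrow \{C,D\})   &= \frac{\sigma(\{A,B,C,D\})}{\sigma(\{A,B\})}, \\
c(\{A\} \Rightarrow \{B,C,D\})   &= \frac{\sigma(\{A,B,C,D\})}{\sigma(\{A\})}, \\
\sigma(\{A,B,C\}) \le \sigma(\{A,B\}) \le \sigma(\{A\})
  &\;\Longrightarrow\;
  c(\{A,B,C\} \Rightarrow \{D\}) \ge c(\{A,B\} \Rightarrow \{C,D\}) \ge c(\{A\} \Rightarrow \{B,C,D\}).
\end{aligned}
```

The inequality between the support counts is the anti-monotone property of support: a subset is contained in at least as many transactions as any of its supersets.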
confidence(A ⇒ B) = P(B | A) = P(A ∩ B)/P(A)
Rule Generation for Apriori Algorithm
Source: (1) Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) (2) Tan, Steinbach, Kumar, Introduction to Data Mining
Rule Generation for Apriori Algorithm
A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
Join({CD} ⇒ {AB}, {BD} ⇒ {AC}) would produce the candidate rule {D} ⇒ {ABC}
Prune rule {D} ⇒ {ABC} if its subset {AD} ⇒ {BC} does not have high confidence.
Apriori Advantages/Disadvantages
Advantages
Uses large itemset property.
Easily parallelized.
Easy to implement.
Disadvantages
Assumes transaction database is memory resident.
Requires many database scans.
Challenges in AR Mining
Apriori scans the data base multiple times.
Most often, there is a high number of candidates.
Support counting for candidates can be time expensive.
Several methods try to improve on these points by:
Reducing the number of scans of the database.
Shrinking the number of candidates.
Counting the support of candidates more efficiently.
Choose an Appropriate minsup and Pattern Evaluation
Choose an Appropriate minsup
If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
If minsup is set too low, it is computationally expensive and the number of itemsets is very large
A single minimum support threshold may not be effective.
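The effect of the threshold can be seen even on the eight-transaction example. The sketch below counts frequent itemsets by brute force for three minimum support counts; `count_frequent()` is a hypothetical helper, and brute force is only feasible because this data set has ten items.

```r
# The eight example transactions (TID 1 to 8).
transactions <- strsplit(c(
  "Bread,Peanuts,Milk,Fruit,Jam",
  "Bread,Jam,Soda,Chips,Milk,Fruit",
  "Steak,Jam,Soda,Chips,Bread",
  "Jam,Soda,Peanuts,Milk,Fruit",
  "Jam,Soda,Chips,Milk,Bread",
  "Fruit,Soda,Chips,Milk",
  "Fruit,Soda,Peanuts,Milk",
  "Fruit,Peanuts,Cheese,Yogurt"
), ",")

# Number of itemsets whose support count reaches min_count.
count_frequent <- function(min_count, db) {
  items <- sort(unique(unlist(db)))
  total <- 0
  # No itemset longer than the longest transaction can have support.
  for (k in 1:max(lengths(db))) {
    cands <- combn(items, k, simplify = FALSE)
    counts <- vapply(cands, function(s) {
      sum(vapply(db, function(tr) all(s %in% tr), logical(1)))
    }, integer(1))
    total <- total + sum(counts >= min_count)
  }
  total
}

# Frequent itemsets at minimum support counts 6, 4 and 2.
sapply(c(6, 4, 2), count_frequent, db = transactions)
```

At a count of 6 only the three most common single items survive, at 4 there are 16 frequent itemsets, and lowering the threshold to 2 makes the number grow further.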
Pattern Evaluation
Association rule algorithms tend to produce too many rules, many of which are uninteresting or redundant.
(Redundant if {A,B,C} ⇒ {D} and {A,B} ⇒ {D} have the same support and confidence.)
Interestingness measures can be used to prune/rank the derived patterns.
In the original formulation of association rules, support and confidence are the only measures used.
R Package: arules
arules: Mining Association Rules and Frequent Itemsets
Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
Also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.
apriori{arules}:
The Apriori algorithm employs level-wise search for frequent itemsets.
The defaults: (1) supp=0.1, the minimum support of rules; (2) conf=0.8, the minimum confidence of rules; and (3) maxlen=10, which is the maximum length of rules.
eclat{arules}:
The ECLAT algorithm finds frequent itemsets with equivalence classes, depth-first search and set intersection instead of counting.
interestMeasure{arules}: more than twenty measures for selecting interesting association rules can be calculated.
Other R packages:
arulesViz: A package for visualizing association rules based on package arules.
arulesSequences: provides functions for mining sequential patterns.
arulesNBMiner: implements an algorithm for mining negative binomial (NB) frequent itemsets and NB-precise rules.
http://lyle.smu.edu/IDA/arules/
https://cran.r-project.org/web/packages/arules/index.html
http://michael.hahsler.net/research/arules_RUG_2015/demo/
(arules: Association Rule Mining with R — A Tutorial, Michael Hahsler, Mon Sep 21 10:51:59 2015)
apriori{arules}: Mining Associations with Apriori
Description
Mine frequent itemsets, association rules or association hyperedges using the Apriori algorithm. The Apriori algorithm employs level-wise search for frequent itemsets. The implementation of Apriori used includes some improvements (e.g., a prefix tree and item sorting).
Usage
apriori(data, parameter = NULL, appearance = NULL, control = NULL)
Arguments
data: object of class transactions or any data structure which can be coerced into transactions (e.g., a binary matrix or data.frame).
parameter: object of class APparameter or named list. The default behavior is to mine rules with support 0.1, confidence 0.8, and maxlen 10.
appearance: object of class APappearance or named list. With this argument item appearance can be restricted (implements rule templates). By default all items can appear unrestricted.
control: object of class APcontrol or named list. Controls the algorithmic performance of the mining algorithm (item sorting, etc.)
Note: Apriori only creates rules with one item in the RHS!
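A small usage sketch of `apriori()` on the eight example transactions from the beginning of these slides, assuming the arules package is installed. The thresholds are chosen for illustration only: a support of 0.45 on eight transactions amounts to a count of at least 4, and 0.3 to a count of at least 3. The object names `baskets`, `trans`, `itemsets` and `rules` are arbitrary.

```r
library(arules)

# The eight example transactions (TID 1 to 8) as a list of character vectors.
baskets <- strsplit(c(
  "Bread,Peanuts,Milk,Fruit,Jam",
  "Bread,Jam,Soda,Chips,Milk,Fruit",
  "Steak,Jam,Soda,Chips,Bread",
  "Jam,Soda,Peanuts,Milk,Fruit",
  "Jam,Soda,Chips,Milk,Bread",
  "Fruit,Soda,Chips,Milk",
  "Fruit,Soda,Peanuts,Milk",
  "Fruit,Peanuts,Cheese,Yogurt"
), ",")
trans <- as(baskets, "transactions")

# Mine frequent itemsets instead of rules by changing the target.
itemsets <- apriori(trans,
                    parameter = list(support = 0.45, target = "frequent itemsets"),
                    control = list(verbose = FALSE))
length(itemsets)

# Mine rules, restricting the right-hand side to a single chosen item.
rules <- apriori(trans,
                 parameter = list(support = 0.3, confidence = 0.7, minlen = 2),
                 appearance = list(rhs = "Milk", default = "lhs"),
                 control = list(verbose = FALSE))
inspect(sort(rules, by = "lift"))
```

Setting `minlen = 2` skips rules with an empty left-hand side, and the `appearance` argument acts as the rule template described above.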
eclat{arules}: Mining Associations with Eclat
Description
Mine frequent itemsets with the Eclat algorithm. This algorithm uses simple intersection operations for equivalence class clustering along with bottom-up lattice traversal.
Usage
eclat(data, parameter = NULL, control = NULL)
Arguments
data: object of class transactions or any data structure which can be coerced into transactions (e.g., binary matrix, data.frame).
parameter: object of class ECparameter or named list (default values are: support 0.1 and maxlen 5)
control: object of class ECcontrol or named list for algorithmic controls.
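For comparison, a sketch of `eclat()` on the same eight example transactions, again assuming arules is installed. Eclat returns frequent itemsets only; rules can then be derived with `ruleInduction()`. The thresholds 0.45 and 0.7 and the object names are illustrative choices, not recommendations.

```r
library(arules)

# The eight example transactions (TID 1 to 8) as a list of character vectors.
baskets <- strsplit(c(
  "Bread,Peanuts,Milk,Fruit,Jam",
  "Bread,Jam,Soda,Chips,Milk,Fruit",
  "Steak,Jam,Soda,Chips,Bread",
  "Jam,Soda,Peanuts,Milk,Fruit",
  "Jam,Soda,Chips,Milk,Bread",
  "Fruit,Soda,Chips,Milk",
  "Fruit,Soda,Peanuts,Milk",
  "Fruit,Peanuts,Cheese,Yogurt"
), ",")
trans <- as(baskets, "transactions")

# Frequent itemsets with support of at least 0.45 (a count of 4 or more here).
itemsets <- eclat(trans,
                  parameter = list(support = 0.45),
                  control = list(verbose = FALSE))
inspect(sort(itemsets, by = "support")[1:3])

# Derive rules from the itemsets found by Eclat.
eclat_rules <- ruleInduction(itemsets, trans, confidence = 0.7)
inspect(eclat_rules)
```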
Case Study 0:
Adult Data Set
> # The AdultUCI dataset contains the questionnaire data of the “Adult” database (originally called the “Census Income” Database) with 48842 observations on 15 variables.
> library(arules)
> data(AdultUCI)
> head(AdultUCI)
  age        workclass fnlwgt education education-num     marital-status        occupation
1  39        State-gov  77516 Bachelors            13      Never-married      Adm-clerical
2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse   Exec-managerial
3  38          Private 215646   HS-grad             9           Divorced Handlers-cleaners
4  53          Private 234721      11th             7 Married-civ-spouse Handlers-cleaners
5  28          Private 338409 Bachelors            13 Married-civ-spouse    Prof-specialty
6  37          Private 284582   Masters            14 Married-civ-spouse   Exec-managerial
   relationship  race    sex capital-gain capital-loss hours-per-week native-country income
1 Not-in-family White   Male         2174            0             40  United-States  small
2       Husband White   Male            0            0             13  United-States  small
3 Not-in-family White   Male            0            0             40  United-States  small
4       Husband Black   Male            0            0             40  United-States  small
5          Wife Black Female            0            0             40           Cuba  small
6          Wife White Female            0            0             40  United-States  small
> data(Adult)
> ?Adult  # see how to create transactions from AdultUCI
> Adult
transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)
> class(Adult)
[1] "transactions"
attr(,"package")
[1] "arules"
> ?transactions
Adult Data Set (transactions form)
> str(Adult)
Formal class 'transactions' [package "arules"] with 3 slots
  ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. ..@ i       : int [1:612200] 1 10 25 32 35 50 59 61 63 65 ...
  .. .. ..@ p       : int [1:48843] 0 13 26 39 52 65 78 91 104 117 ...
  .. .. ..@ Dim     : int [1:2] 115 48842
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ factors : list()
  ..@ itemInfo   :'data.frame': 115 obs. of 3 variables:
  .. ..$ labels   : chr [1:115] "age=Young" "age=Middle-aged" "age=Senior" "age=Old" ...
  .. ..$ variables: Factor w/ 13 levels "age","capital-gain",..: 1 1 1 1 13 13 13 13 13 13 ...
  .. ..$ levels   : Factor w/ 112 levels "10th","11th",..: 111 63 92 69 30 54 65 82 90 91 ...
  ..@ itemsetInfo:'data.frame': 48842 obs. of 1 variable:
  .. ..$ transactionID: chr [1:48842] "1" "2" "3" "4" ...
> inspect(Adult[1:2])
  items                               transactionID
1 {age=Middle-aged,
   workclass=State-gov,
   education=Bachelors,
   marital-status=Never-married,
   occupation=Adm-clerical,
   relationship=Not-in-family,
   race=White,
   sex=Male,
   capital-gain=Low,
   capital-loss=None,
   hours-per-week=Full-time,
   native-country=United-States,
   income=small}                      1
2 {age=Senior,
   workclass=Self-emp-not-inc,
   education=Bachelors,
   marital-status=Married-civ-spouse,
   occupation=Exec-managerial,
   relationship=Husband,
   race=White,
   sex=Male,
   capital-gain=None,
   capital-loss=None,
   hours-per-week=Part-time,
   native-country=United-States,
   income=small}
Adult Data Set
> summary(Adult)
transactions as itemMatrix in sparse format with
 48842 rows (elements/itemsets/transactions) and
 115 columns (items) and a density of 0.1089939

most frequent items:
           capital-loss=None            capital-gain=None native-country=United-States                   race=White
                       46560                        44807                        43832                        41762
           workclass=Private                      (Other)
                       33906                       401333

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13
   19   971  2067 15623 30162

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   12.00   13.00   12.53   13.00   13.00

includes extended item information - examples:
           labels variables      levels
1       age=Young       age       Young
2 age=Middle-aged       age Middle-aged
3      age=Senior       age      Senior

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3
How to Create Transactions Data
Transactions can be created by coercion from lists containing transactions, and also from matrices and data.frames. However, you will need to prepare your data first: association rule mining works on items only and cannot use continuous variables directly.
Source: arules help page for the transactions class (?transactions).
> # creating transactions from a list
> a.list <- list(
+   c("a","b","c"),
+   c("a","b"),
+   c("a","b","d"),
+   c("c","e"),
+   c("a","b","d","e")
+ )
> names(a.list) <- paste0("Customer", c(1:5))
> a.list
$Customer1
[1] "a" "b" "c"

$Customer2
[1] "a" "b"

$Customer3
[1] "a" "b" "d"

$Customer4
[1] "c" "e"

$Customer5
[1] "a" "b" "d" "e"

# avoid "no method or default for coercing"
# (product_by_user is not defined in this section)
library(Matrix)
productTD <- as(product_by_user$Product, "transactions")
inspect(productTD[1:5])
Coerce a List Into Transactions
> alist.trans <- as(a.list, "transactions")
> summary(alist.trans) # analyze transactions
transactions as itemMatrix in sparse format with
 5 rows (elements/itemsets/transactions) and
 5 columns (items) and a density of 0.56

most frequent items:
      a       b       c       d       e (Other)
      4       4       2       2       2       0

element (itemset/transaction) length distribution:
sizes
2 3 4
2 2 1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    2.0     2.0     3.0     2.8     3.0     4.0

includes extended item information - examples:
  labels
1      a
2      b
3      c

includes extended transaction information - examples:
  transactionID
1     Customer1
2     Customer2
3     Customer3
> image(alist.trans)
Creating Transactions from a Matrix
> a.matrix <- matrix(c(
+   1,1,1,0,0,
+   1,1,0,0,0,
+   1,1,0,1,0,
+   0,0,1,0,1), ncol = 5)  # the 20 values fill a 4 x 5 matrix column by column
> dimnames(a.matrix) <- list(
+   paste("Customer", letters[1:4]),
+   paste0("Item", c(1:5)))
> a.matrix
           Item1 Item2 Item3 Item4 Item5
Customer a     1     0     0     0     0
Customer b     1     1     0     1     1
Customer c     1     1     1     0     0
Customer d     0     0     1     0     1
>
> amatirx.trans <- as(a.matrix, "transactions")
> amatirx.trans
transactions in sparse format with
 4 transactions (rows) and
 5 items (columns)
> inspect(amatirx.trans)
    items                     transactionID
[1] {Item1}                   Customer a
[2] {Item1,Item2,Item4,Item5} Customer b
[3] {Item1,Item2,Item3}       Customer c
[4] {Item3,Item5}             Customer d
> summary(amatirx.trans)
transactions as itemMatrix in sparse format with
 4 rows (elements/itemsets/transactions) and
 5 columns (items) and a density of 0.5

most frequent items:
  Item1   Item2   Item3   Item5   Item4 (Other)
      3       2       2       2       1       0

element (itemset/transaction) length distribution:
sizes
1 2 3 4
1 1 1 1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    1.75    2.50    2.50    3.25    4.00

includes extended item information - examples:
  labels
1  Item1
2  Item2
3  Item3

includes extended transaction information - examples:
  transactionID
1    Customer a
2    Customer b
3    Customer c
More Examples
> # creating transactions from data.frame
> a.df <- data.frame(
+   age = as.factor(c(6, 8, NA, 9, 16)),
+   grade = as.factor(c("A", "C", "F", NA, "C")),
+   pass = c(TRUE, TRUE, FALSE, TRUE, TRUE))
> # note: factors are translated to
> # logicals and NAs are ignored
> a.df
   age grade  pass
1    6     A  TRUE
2    8     C  TRUE
3 <NA>     F FALSE
4    9  <NA>  TRUE
5   16     C  TRUE
> adf.trans <- as(a.df, "transactions")
> inspect(adf.trans)
    items                 transactionID
[1] {age=6,grade=A,pass}  1
[2] {age=8,grade=C,pass}  2
[3] {grade=F}             3
[4] {age=9,pass}          4
[5] {age=16,grade=C,pass} 5
> as(adf.trans, "data.frame")
                  items transactionID
1  {age=6,grade=A,pass}             1
2  {age=8,grade=C,pass}             2
3             {grade=F}             3
4          {age=9,pass}             4
5 {age=16,grade=C,pass}             5
> # creating transactions from (IDs, items)
> a.df2 <- data.frame(
+   TID = c(1, 1, 2, 2, 2, 3),
+   item = c("a", "b", "a", "b", "c", "b"))
> a.df2
  TID item
1   1    a
2   1    b
3   2    a
4   2    b
5   2    c
6   3    b
> a.df2.s <- split(a.df2[, "item"], a.df2[,"TID"])
> a.df2.s
$`1`
[1] a b
Levels: a b c

$`2`
[1] a b c
Levels: a b c

$`3`
[1] b
Levels: a b c

> adf2.trans <- as(a.df2.s, "transactions")
> inspect(adf2.trans)
    items   transactionID
[1] {a,b}   1
[2] {a,b,c} 2
[3] {b}     3
Example: Create Transactions Data
> data(AdultUCI)
> summary(AdultUCI)
> # remove attributes
> AdultUCI[["fnlwgt"]] <- NULL
> AdultUCI[["education-num"]] <- NULL
Source: arules help page for the Adult data set (?Adult).
Example: Create Transactions Data
> # map metric attributes
> AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
+   labels = c("Young", "Middle-aged", "Senior", "Old"))
> AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
+   c(0, 25, 40, 60, 168)),
+   labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
> AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
+   c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
+   labels = c("None", "Low", "High"))
> AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
+   c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
+   labels = c("None", "Low", "High"))
>
> summary(AdultUCI[c("age", "hours-per-week", "capital-gain", "capital-loss")])
          age           hours-per-week   capital-gain  capital-loss
 Young      : 9627   Part-time : 5913   None:44807    None:46560
 Middle-aged:24671   Full-time :28577   Low : 2345    Low : 1166
 Senior     :12741   Over-time :12676   High: 1690    High: 1116
 Old        : 1803   Workaholic: 1676
>
> # create transactions
> MyAdult <- as(AdultUCI, "transactions")
> MyAdult
transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)
Example: Create Transactions Data
> summary(MyAdult)
transactions as itemMatrix in sparse format with
 48842 rows (elements/itemsets/transactions) and
 115 columns (items) and a density of 0.1089939

most frequent items:
           capital-loss=None            capital-gain=None native-country=United-States
                       46560                        44807                        43832
                  race=White            workclass=Private                      (Other)
                       41762                        33906                       401333

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13
   19   971  2067 15623 30162

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   12.00   13.00   12.53   13.00   13.00

includes extended item information - examples:
           labels variables      levels
1       age=Young       age       Young
2 age=Middle-aged       age Middle-aged
3      age=Senior       age      Senior

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3
> inspect(MyAdult[1:2])
    items                          transactionID
[1] {age=Middle-aged,
     workclass=State-gov,
     education=Bachelors,
     marital-status=Never-married,
     occupation=Adm-clerical,
     relationship=Not-in-family,
     race=White,
     sex=Male,
     capital-gain=Low,
     capital-loss=None,
     hours-per-week=Full-time,
     native-country=United-States,
     income=small}                 1
Case Study 1: Groceries Data Set
Description: The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.
Format: Object of class transactions.
> library(arules)
> data(Groceries)
> ?Groceries
> str(Groceries)
Formal class 'transactions' [package "arules"] with 3 slots
..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
.. .. ..@ i : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
.. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
.. .. ..@ Dim : int [1:2] 169 9835
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : NULL
.. .. ..@ factors : list()
..@ itemInfo :'data.frame': 169 obs. of 3 variables:
.. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
.. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
.. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
Groceries@itemInfo
summary, inspect
> summary(Groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14
  19   20   21   22   23   24   26   27   28   29   32
  14    9   11    4    6    1    1    1    1    3    1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000

includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage
> inspect(Groceries[1:4])
  items
1 {citrus fruit,semi-finished bread,margarine,ready soups}
2 {tropical fruit,yogurt,coffee}
3 {whole milk}
4 {pip fruit,yogurt,cream cheese ,meat spreads}
Apply apriori
> rule0 <- apriori(Groceries)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 983

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
The default behavior is to mine rules with minimum support of 0.1,
minimum confidence of 0.8,
maximum of 10 items (maxlen), and
a maximal time for subset checking of 5 seconds (maxtime).
Apply apriori With Different Arguments
> rule1 <- apriori(Groceries, parameter=list(support=0.005, confidence=0.64))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
       0.64    0.1    1 none FALSE            TRUE   0.005      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 49

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [120 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [4 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> inspect(rule1)
  lhs                                             rhs          support     confidence lift
1 {butter,whipped/sour cream}                  => {whole milk} 0.006710727 0.6600000  2.583008
2 {pip fruit,whipped/sour cream}               => {whole milk} 0.005998983 0.6483516  2.537421
3 {pip fruit,root vegetables,other vegetables} => {whole milk} 0.005490595 0.6750000  2.641713
4 {tropical fruit,root vegetables,yogurt}      => {whole milk} 0.005693950 0.7000000  2.739554
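The lift column can be cross-checked by hand against the definition lift = confidence / P(rhs). The sketch below uses only numbers that already appear in the output: the item count of whole milk (2513 of 9835 transactions, from summary(Groceries)) and the four reported confidences. The variable names are illustrative.

```r
n_transactions <- 9835
whole_milk_count <- 2513  # from the "most frequent items" line of summary(Groceries)

# Support of the right-hand side {whole milk}.
rhs_support <- whole_milk_count / n_transactions

# Confidence values as printed by inspect(rule1).
confidence <- c(0.6600000, 0.6483516, 0.6750000, 0.7000000)

# lift = confidence / P(rhs)
lift <- confidence / rhs_support
round(lift, 6)
```

The results agree with the printed lift values (2.583008, 2.537421, 2.641713, 2.739554) up to the rounding of the displayed confidences.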
Class 'rules'
> str(rule1)
Formal class 'rules' [package "arules"] with 4 slots
  ..@ lhs    :Formal class 'itemMatrix' [package "arules"] with 3 slots
  .. .. ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. .. .. ..@ i       : int [1:10] 25 30 15 30 15 19 22 14 19 29
  .. .. .. .. ..@ p       : int [1:5] 0 2 4 7 10
  .. .. .. .. ..@ Dim     : int [1:2] 169 4
  .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. ..@ factors : list()
  .. .. ..@ itemInfo   :'data.frame': 169 obs. of 3 variables:
  .. .. .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
  .. .. .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
  .. .. .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
  .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
  ..@ rhs    :Formal class 'itemMatrix' [package "arules"] with 3 slots
  .. .. ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. .. .. ..@ i       : int [1:4] 24 24 24 24
  .. .. .. .. ..@ p       : int [1:5] 0 1 2 3 4
  .. .. .. .. ..@ Dim     : int [1:2] 169 4
  .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. ..@ factors : list()
  .. .. ..@ itemInfo   :'data.frame': 169 obs. of 3 variables:
  .. .. .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
  .. .. .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
  .. .. .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
  .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
  ..@ quality:'data.frame': 4 obs. of 3 variables:
  .. ..$ support   : num [1:4] 0.00671 0.006 0.00549 0.00569
  .. ..$ confidence: num [1:4] 0.66 0.648 0.675 0.7
  .. ..$ lift      : num [1:4] 2.58 2.54 2.64 2.74
  ..@ info   :List of 4
  .. ..$ data : symbol Groceries
> rule1@quality
      support confidence     lift
1 0.006710727  0.6600000 2.583008
2 0.005998983  0.6483516 2.537421
3 0.005490595  0.6750000 2.641713
Select Top AR by Support
> rule2 <- apriori(Groceries, parameter=list(support=0.001, confidence=0.5))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.5    0.1    1 none FALSE            TRUE   0.001      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 9

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [157 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.01s].
writing ... [5668 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
>
> rule2.sorted_sup <- sort(rule2, by="support")
> inspect(rule2.sorted_sup[1:5])
     lhs                                      rhs          support    confidence lift
1472 {other vegetables,yogurt}             => {whole milk} 0.02226741 0.5128806  2.007235
1467 {tropical fruit,yogurt}               => {whole milk} 0.01514997 0.5173611  2.024770
1449 {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423  1.984385
1469 {root vegetables,yogurt}              => {whole milk} 0.01453991 0.5629921  2.203354
1454 {pip fruit,other vegetables}          => {whole milk} 0.01352313 0.5175097  2.025351
Select a Subset of Rules
> # Select a subset of rules using partial matching on the items
> # in the right-hand-side and a quality measure
> rule2.sub <- subset(rule2, subset = rhs %pin% "whole milk" & lift > 1.3)
> rule2.sub
set of 2679 rules
>
> # Display the top 3 support rules
> inspect(head(rule2.sub, n = 3, by = "support"))
     lhs                                      rhs          support    confidence lift
1472 {other vegetables,yogurt}             => {whole milk} 0.02226741 0.5128806  2.007235
1467 {tropical fruit,yogurt}               => {whole milk} 0.01514997 0.5173611  2.024770
1449 {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423  1.984385
>
> # Display the first 3 rules
> inspect(rule2.sub[1:3])
  lhs                 rhs          support     confidence lift
1 {honey}          => {whole milk} 0.001118454 0.7333333  2.870009
3 {cocoa drinks}   => {whole milk} 0.001321810 0.5909091  2.312611
4 {pudding powder} => {whole milk} 0.001321810 0.5652174  2.212062
>
> # Get labels for the first 3 rules
> labels(rule2.sub[1:3])
[1] "{honey} => {whole milk}" "{cocoa drinks} => {whole milk}"
[3] "{pudding powder} => {whole milk}"
> labels(rule2.sub[1:3], itemSep = " + ", setStart = "", setEnd="", ruleSep = " ---> ")
[1] "honey ---> whole milk" "cocoa drinks ---> whole milk"
[3] "pudding powder ---> whole milk"
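The filter used above combines a partial match on the right-hand side with a threshold on a quality measure. A rough base R analogue on a plain data frame is sketched below. It is not how arules implements `%pin%`. The data frame `rules_df` and the cut-off 2.5 are hypothetical, and the rows are copied from the inspect() outputs in this section.

```r
# Rule labels and lifts copied from the inspect() outputs in this section.
rules_df <- data.frame(
  lhs  = c("{honey}", "{cocoa drinks}", "{soda,popcorn}", "{flour,baking powder}"),
  rhs  = c("{whole milk}", "{whole milk}", "{salty snack}", "{sugar}"),
  lift = c(2.870009, 2.312611, 16.69779, 16.40807),
  stringsAsFactors = FALSE
)

# Partial match on the right-hand side ("milk" is a substring of "whole milk"),
# combined with a lift threshold, in the spirit of rhs %pin% ... & lift > ...
keep <- grepl("milk", rules_df$rhs, fixed = TRUE) & rules_df$lift > 2.5
milk_rules <- rules_df[keep, ]

print(milk_rules)
```

Only the {honey} rule passes both conditions here: the {cocoa drinks} rule matches on the item but falls below the lift cut-off, and the other two rules have a different right-hand side.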
Select Top AR by Confidence, Lift
> rule2.sorted_con <- sort(rule2, by="confidence")
> inspect(rule2.sorted_con[1:5])
     lhs                                           rhs          support     confidence lift
113  {rice,sugar}                               => {whole milk} 0.001220132 1          3.913649
258  {canned fish,hygiene articles}             => {whole milk} 0.001118454 1          3.913649
1487 {root vegetables,butter,rice}              => {whole milk} 0.001016777 1          3.913649
1646 {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521 1          3.913649
1670 {butter,soft cheese,domestic eggs}         => {whole milk} 0.001016777 1          3.913649
>
> rule2.sorted_lift <- sort(rule2, by="lift")
> inspect(rule2.sorted_lift[1:5])
    lhs                                   rhs              support     confidence lift
53  {Instant food products,soda}       => {hamburger meat} 0.001220132 0.6315789  18.99565
37  {soda,popcorn}                     => {salty snack}    0.001220132 0.6315789  16.69779
444 {flour,baking powder}              => {sugar}          0.001016777 0.5555556  16.40807
327 {ham,processed cheese}             => {white bread}    0.001931876 0.6333333  15.04549
55  {whole milk,Instant food products} => {hamburger meat} 0.001525165 0.5000000  15.03823
sort(x, decreasing = TRUE, na.last = NA, by = "support", order = FALSE, ...)
## S4 method for signature 'associations'
head(x, n = 6L, by = NULL, decreasing = TRUE, ...)
## S4 method for signature 'associations'
tail(x, n = 6L, by = NULL, decreasing = TRUE, ...)
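The signatures above order associations by a quality measure, in decreasing order by default. The same idea can be sketched in base R on a plain data frame. The example below is an analogue for illustration only, not the arules implementation. The data frame `top_lift` is hypothetical and holds the ids, supports and lifts of the five top-lift rules listed above. It orders by support and uses lift to break the tie between rules 53 and 37.

```r
# Ids, supports and lifts of the five top-lift rules from the listing above.
top_lift <- data.frame(
  id      = c(53, 37, 444, 327, 55),
  support = c(0.001220132, 0.001220132, 0.001016777, 0.001931876, 0.001525165),
  lift    = c(18.99565, 16.69779, 16.40807, 15.04549, 15.03823)
)

# Decreasing order by support; ties are broken by decreasing lift.
ord <- order(-top_lift$support, -top_lift$lift)
by_support <- top_lift[ord, ]
print(by_support)

# Equivalent of taking the top n rows after sorting.
first_two <- head(by_support, n = 2)
print(first_two)
```

The rule with the highest lift (id 53) drops to third place once support is the primary key, which is why it is worth inspecting the rules under more than one ordering.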
Select Top Frequent Itemsets
> rule.freq_item <- apriori(Groceries, parameter=list(support=0.001, target="frequent itemsets"), control=list(sort=-1))
Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen            target   ext
         NA    0.1    1 none FALSE            TRUE   0.001      1     10 frequent itemsets FALSE
Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE   -1    TRUE
Absolute minimum support count: 9
...
checking subsets of size 1 2 3 4 5 6 done [0.02s].
writing ... [13492 set(s)] done [0.00s].
creating S4 object ... done [0.00s].
> rule.freq_item
set of 13492 itemsets
> inspect(rule.freq_item[1:5])
  items              support
1 {whole milk}       0.2555160
2 {other vegetables} 0.1934926
3 {rolls/buns}       0.1839349
4 {soda}             0.1743772
5 {yogurt}           0.1395018
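The support of {whole milk} in this listing ties back to the lift column of the rule listings through lift(X => Y) = confidence(X => Y) / support(Y). The lines below are a small arithmetic check in base R. The variable names are made up for illustration, and the numbers are copied from the outputs in this section.

```r
# Support of {whole milk}, from the frequent-itemset listing.
sup_whole_milk <- 0.2555160

# Confidence of {other vegetables,yogurt} => {whole milk}, from the rule listing.
conf_rule <- 0.5128806

# lift = confidence / support of the right-hand side.
lift_rule <- conf_rule / sup_whole_milk

# A rule with confidence 1 and rhs {whole milk} reaches the largest possible lift
# for that right-hand side.
lift_max <- 1 / sup_whole_milk

cat("lift of the yogurt rule    :", round(lift_rule, 4), "\n")
cat("lift at confidence equal 1 :", round(lift_max, 4), "\n")
```

Both values agree with the printed lifts up to rounding, which also explains why every confidence-1 rule with {whole milk} on the right shows the same lift.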
Frequent k-itemsets
> rule.fi_eclat <- eclat(Groceries, parameter=list(minlen=1, maxlen=3, support=0.001, target="frequent itemsets"), control=list(sort=-1))
Eclat ...
> rule.fi_eclat
set of 9969 itemsets
> inspect(rule.fi_eclat[1:5])
  items                       support
1 {whole milk,honey}          0.001118454
2 {whole milk,cocoa drinks}   0.001321810
3 {whole milk,pudding powder} 0.001321810
4 {tidbits,rolls/buns}        0.001220132
5 {tidbits,soda}              0.001016777
> rule.fi_eclat <- eclat(Groceries, parameter=list(minlen=3, maxlen=5, support=0.001, target="frequent itemsets"), control=list(sort=-1))
Eclat ...
> rule.fi_eclat
set of 10344 itemsets
> inspect(rule.fi_eclat[1:5])
  items                                         support
1 {liver loaf,whole milk,yogurt}                0.001016777
2 {tropical fruit,other vegetables,curd cheese} 0.001016777
3 {whole milk,curd cheese,rolls/buns}           0.001016777
4 {other vegetables,whole milk,curd cheese}     0.001220132
5 {other vegetables,whole milk,cleaner}         0.001016777
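To make "frequent k-itemsets" concrete, the sketch below enumerates every k-itemset over the eight toy transactions from the introductory slide of this deck and keeps those that meet a minimum support. It is a brute-force illustration in base R, not the Eclat or Apriori algorithm. The helper names `support_of` and `frequent_k` are hypothetical, and the threshold 0.5 is chosen only to keep the output short.

```r
# The eight toy transactions from the introductory example.
trans <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk"),
  c("Fruit", "Soda", "Peanuts", "Milk"),
  c("Fruit", "Peanuts", "Cheese", "Yogurt")
)

# Support of an itemset: share of transactions that contain every item in it.
support_of <- function(itemset) {
  mean(vapply(trans, function(t) all(itemset %in% t), logical(1)))
}

# Enumerate all k-itemsets and keep those with support >= minsup.
frequent_k <- function(k, minsup) {
  items <- sort(unique(unlist(trans)))
  cand  <- combn(items, k, simplify = FALSE)
  sup   <- vapply(cand, support_of, numeric(1))
  keep  <- sup >= minsup
  data.frame(
    items   = vapply(cand[keep], paste, character(1), collapse = ","),
    support = sup[keep],
    stringsAsFactors = FALSE
  )
}

f1 <- frequent_k(1, minsup = 0.5)   # frequent 1-itemsets
f2 <- frequent_k(2, minsup = 0.5)   # frequent 2-itemsets
print(f1)
print(f2)
```

Exhaustive enumeration like this is only workable for a handful of items. With the 169 items of Groceries the number of candidates grows far too quickly, which is the reason for using apriori() or eclat() with `minlen` and `maxlen` as in the calls above.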