Association Rules

(1)

http://www.hmwu.idv.tw

Han-Ming Wu (吳漢銘), Department of Statistics, National Chengchi University

Association Analysis

Association Rules

C04

(2)

Market Basket Analysis


http://www.analyticsvidhya.com/blog/2014/08/effective-cross-selling-market-basket-analysis/

https://blogs.adobe.com/digitalmarketing/wp-content/uploads/2013/08/pic1.jpg

(3)

Application Examples

(4)

Market Basket Analysis

• Market Basket Analysis is a data mining approach that
  • finds associations and correlations between the different items that customers place in their shopping baskets;
  • helps retailers improve profit by guiding how products are arranged and marketed.

• Retailers use Market Basket Analysis
  • as a window into consumer shopping behavior, revealing how consumers select products, make spending tradeoffs, and group items in a shopping cart;
  • to understand how baskets are built, so that they can merchandise more effectively by using market basket dynamics in pricing and promotion decisions.

R. Agrawal, T. Imieliński and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” The ACM SIGMOD International Conference on Management of Data, pp. 207-216, May 1993.

(Cited 19,551 times)

(5)

Association Rule Mining

• The ideas of Association Rule Learning (also called Association Rule Mining) come from market basket analysis.
• AR mining:
  • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Transaction ID (TID)   Items
1                      Bread, Peanuts, Milk, Fruit, Jam
2                      Bread, Jam, Soda, Chips, Milk, Fruit
3                      Steak, Jam, Soda, Chips, Bread
4                      Jam, Soda, Peanuts, Milk, Fruit
5                      Jam, Soda, Chips, Milk, Bread
6                      Fruit, Soda, Chips, Milk
7                      Fruit, Soda, Peanuts, Milk
8                      Fruit, Peanuts, Cheese, Yogurt

Example rules obtained by mining these transactions:

{bread} ⇒ {milk}
{soda} ⇒ {chips}
{bread} ⇒ {jam}

(6)

Association Rule Mining

• Formalizing the problem:
  • Items: the collection of all items is I = {i₁, i₂, …, i_m}.
  • Transaction database T: a set of transactions T = {t₁, t₂, …, t_n}.
  • Each transaction contains a set of items drawn from I.
  • An itemset is a collection of items from I; a k-itemset is an itemset that contains k items.
• Association rules are rules presenting association or correlation between itemsets.
• An association rule has the form A ⇒ B, where A and B are two disjoint itemsets, referred to respectively as the lhs (left-hand side, the antecedent) and the rhs (right-hand side, the consequent) of the rule.

(7)

Definition: Frequent Itemset

• Support count (σ)
  • Frequency of occurrence of an itemset.
  • σ({Milk, Bread}) = 3, σ({Soda, Chips}) = 4.
• Support (s)
  • How frequently the rule occurs: the fraction of transactions that contain both itemsets A and B.
  • Support(A ⇒ B) = P(A ∩ B), where "∩" means "and": the transaction contains every item of A and every item of B.
  • s({Milk, Bread}) = 3/8; s({Soda, Chips}) = 4/8.
• Frequent itemset:
  • An itemset with s(itemset) ≥ minsup, the minimum support threshold.
  • Items that frequently appear together.
  • Support measures the strength of the association.
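As a minimal sketch of these definitions in base R, the support count and support can be computed directly from the eight example transactions listed earlier. The helper names `support_count` and `support`, and the threshold `minsup = 0.5`, are illustrative assumptions, not part of any package.

```r
# The eight example transactions (TID 1 to 8).
transactions <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk"),
  c("Fruit", "Soda", "Peanuts", "Milk"),
  c("Fruit", "Peanuts", "Cheese", "Yogurt")
)

# Support count: number of transactions containing every item of the itemset.
support_count <- function(itemset, trans) {
  sum(vapply(trans, function(tr) all(itemset %in% tr), logical(1)))
}

# Support: support count divided by the number of transactions.
support <- function(itemset, trans) {
  support_count(itemset, trans) / length(trans)
}

sigma_mb <- support_count(c("Milk", "Bread"), transactions)  # 3
s_mb     <- support(c("Milk", "Bread"), transactions)        # 3/8
s_sc     <- support(c("Soda", "Chips"), transactions)        # 4/8

# Frequent under a hypothetical minimum support of 0.5?
minsup <- 0.5
frequent_sc <- s_sc >= minsup   # TRUE
frequent_mb <- s_mb >= minsup   # FALSE
```

With this threshold, {Soda, Chips} is frequent while {Milk, Bread} is not, which shows how strongly the outcome depends on the choice of minsup.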

(8)

Confidence and Lift

• Confidence (c):
  • The percentage of cases containing A that also contain B.
  • confidence(A ⇒ B) = P(B | A) = P(A ∩ B) / P(A)
  • A rule is kept when confidence(A ⇒ B) ≥ minconf (minimum confidence).
• Lift:
  • The ratio of confidence to the percentage of cases containing B.
  • lift(A ⇒ B) = P(B | A) / P(B)
                = confidence(A ⇒ B) / P(B)
                = P(A ∩ B) / (P(A) P(B))
  • lift(A ⇒ B) = 1: A and B are independent; A does not raise the likelihood of B.
  • lift(A ⇒ B) > 1: A raises the likelihood of B; the larger the lift, the stronger the association.
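The two measures can be checked by hand on the eight example transactions. The sketch below uses base R only; the helper names `p`, `confidence` and `lift` are illustrative and are not the arules functions.

```r
transactions <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk"),
  c("Fruit", "Soda", "Peanuts", "Milk"),
  c("Fruit", "Peanuts", "Cheese", "Yogurt")
)

# p(): fraction of transactions that contain every item of the itemset.
p <- function(itemset) {
  mean(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# confidence(A => B) = P(A and B) / P(A)
confidence <- function(lhs, rhs) p(c(lhs, rhs)) / p(lhs)

# lift(A => B) = confidence(A => B) / P(B)
lift <- function(lhs, rhs) confidence(lhs, rhs) / p(rhs)

conf_bm <- confidence("Bread", "Milk")  # (3/8) / (4/8) = 0.75
lift_bm <- lift("Bread", "Milk")        # 0.75 / (6/8) = 1
conf_sc <- confidence("Soda", "Chips")  # (4/8) / (6/8) = 2/3
lift_sc <- lift("Soda", "Chips")        # (2/3) / (4/8) = 4/3
```

Note that {Bread} ⇒ {Milk} has the higher confidence but a lift of exactly 1, because milk appears in 6 of the 8 baskets anyway. {Soda} ⇒ {Chips} has lower confidence but a lift above 1.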

NOTE: There are many other interestingness measures, such as chi-square, conviction, gini and leverage. An introduction to over 20 measures can be found in Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. In KDD ’02: Proceedings of the 8th

(9)

Find a Rule

• Rules originating from the same itemset have identical support but can have different confidence.
• Given a set of transactions T, the goal of association rule mining is to find all rules having
  • support ≥ minsup threshold,
  • confidence ≥ minconf threshold.

The following rules are binary partitions of the same itemset, {Milk, Bread, Jam}:

• {Bread, Jam} ⇒ {Milk}, s=0.375, c=3/4=0.75
• {Milk, Jam} ⇒ {Bread}, s=0.375, c=0.75
• {Bread} ⇒ {Milk, Jam}, s=0.375, c=0.75
• {Jam} ⇒ {Bread, Milk}, s=0.375, c=0.6
• {Milk} ⇒ {Bread, Jam}, s=0.375, c=0.5
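The support and confidence values listed for the itemset {Milk, Bread, Jam} can be reproduced with a few lines of base R. This is only a sketch: `count` is an illustrative helper, and the left-hand sides are typed in by hand in the same order as the five rules on the slide.

```r
transactions <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk"),
  c("Fruit", "Soda", "Peanuts", "Milk"),
  c("Fruit", "Peanuts", "Cheese", "Yogurt")
)

# Number of transactions containing every item of the itemset.
count <- function(itemset) {
  sum(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

itemset <- c("Milk", "Bread", "Jam")

# Left-hand sides of the five listed rules; each right-hand side is the rest of the itemset.
lhs_list <- list(c("Bread", "Jam"), c("Milk", "Jam"), "Bread", "Jam", "Milk")

# Every rule shares the same support: that of the whole itemset.
rule_support <- count(itemset) / length(transactions)   # 3/8 = 0.375

# Confidence differs because the denominator count(lhs) differs.
rule_confidence <- vapply(lhs_list,
                          function(lhs) count(itemset) / count(lhs),
                          numeric(1))
# 0.75 0.75 0.75 0.60 0.50
```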

(10)

Mining Association Rules

• Brute-force approach:
  • List all possible association rules.
  • Compute the support and confidence for each rule.
  • Prune rules that fail the minsup and minconf thresholds.
  • The brute-force approach is computationally prohibitive!
• Two-step approach:
  Step (1): Frequent Itemset Generation:
  • Generate all itemsets whose support ≥ minsup.
  Step (2): Rule Generation:
  • Generate high-confidence rules from each frequent itemset.
  • Each rule is a binary partitioning of a frequent itemset.
• Frequent itemset generation is computationally expensive.

(11)

Step (1): Frequent Itemset Generation

• Given d items, there are 2^d possible candidate itemsets.

> no.item <- 5
> sum(choose(no.item, 0:no.item))
[1] 32
> 2^no.item
[1] 32

Source: (1) Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano); (2) Tan, Steinbach, Kumar, Introduction to Data Mining

(12)

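Total Number of Possible Association Rules

• Given d unique items:
  • Total number of itemsets = 2^d
  • Total number of possible association rules:
    R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d-k, j) ] = 3^d − 2^(d+1) + 1
  • For example, d = 6 gives R = 602 rules.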
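The number of possible rules is commonly given in closed form as 3^d − 2^(d+1) + 1. As a hedged check, the sketch below counts rules directly (choose k items for the left-hand side, then any non-empty subset of the remaining d − k items for the right-hand side) and compares the result with the closed form. The function names are illustrative.

```r
# Direct count: sum over the size k of the left-hand side.
count_rules <- function(d) {
  stopifnot(d >= 2)   # at least one item is needed on each side
  total <- 0
  for (k in 1:(d - 1)) {
    # choose(d, k) left-hand sides, each with 2^(d-k) - 1 non-empty right-hand sides
    total <- total + choose(d, k) * sum(choose(d - k, 1:(d - k)))
  }
  total
}

# Closed form of the same count.
closed_form <- function(d) 3^d - 2^(d + 1) + 1

rules_6 <- count_rules(6)   # 602
agree   <- vapply(2:10, function(d) count_rules(d) == closed_form(d), logical(1))
```

Even for d = 6 items there are already 602 candidate rules, which is why the brute-force approach does not scale.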

(13)

Frequent Itemset Generation Strategies

• Reduce the number of candidates (M).
  • Complete search: M = 2^d.
  • Use pruning techniques to reduce M.
• Reduce the number of transactions (N).
  • Reduce the size of N as the size of the itemset increases.
• Reduce the number of comparisons (NM).
  • Use efficient data structures to store the candidates or transactions.
  • No need to match every candidate against every transaction.

(14)

Reducing the Number of Candidates: Apriori Principle

• Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent. Equivalently, if an itemset is infrequent, every superset of it is also infrequent and can be pruned without counting.
• The Apriori principle holds due to the anti-monotone property of the support measure: the support of an itemset never exceeds the support of its subsets.
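The anti-monotone property can be observed on the eight example transactions. The following base R sketch (with an illustrative helper `count`) grows an itemset one item at a time and then checks the pruning direction for the infrequent item Steak.

```r
transactions <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk"),
  c("Fruit", "Soda", "Peanuts", "Milk"),
  c("Fruit", "Peanuts", "Cheese", "Yogurt")
)

count <- function(itemset) {
  sum(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# Growing an itemset can only keep or lower its support count.
chain <- list("Milk", c("Milk", "Fruit"), c("Milk", "Fruit", "Soda"))
chain_counts <- vapply(chain, count, integer(1))   # 6 5 4

# Pruning direction: Steak occurs once, so no 2-itemset containing Steak
# can occur more than once.
items <- unique(unlist(transactions))
steak_pairs <- lapply(setdiff(items, "Steak"), function(it) c("Steak", it))
max_pair_count <- max(vapply(steak_pairs, count, integer(1)))   # 1
```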

(15)

Apriori Principle


(16)

Illustrating Apriori Principle

Minimum support count = 4

1-itemsets

Item      Count
Bread     4
Peanuts   4
Milk      6
Fruit     6
Jam       5
Soda      6
Chips     4
Steak     1
Cheese    1
Yogurt    1

2-itemsets

Item             Count
Bread, Jam       4
Peanuts, Fruit   4
Milk, Fruit      5
Milk, Jam        4
Milk, Soda       5
Fruit, Soda      4
Jam, Soda        4
Soda, Chips      4

3-itemsets

Item                Count
Milk, Fruit, Soda   4

(17)

Definition of Apriori Algorithm

• The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
• Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
• Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or records of website visits).

(18)

Steps To Perform Apriori Algorithm

1. Scan the transaction database to get the support S of each 1-itemset, compare S with minsup, and get the set of frequent 1-itemsets, L₁.
2. Join L_{k-1} with L_{k-1} to generate a set of candidate k-itemsets. Use the Apriori property to prune the infrequent k-itemsets from this set.
3. Scan the transaction database to get the support S of each candidate k-itemset in the final set, compare S with minsup, and get the set of frequent k-itemsets, L_k.
4. Is the candidate set empty (Null)? If NO, return to step 2. If YES, continue with step 5.
5. For each frequent itemset f, generate all nonempty subsets of f.
6. For every nonempty subset s of f, output the rule "s ⇒ (f − s)" if the confidence C of the rule, C = (support of f) / (support of s), is ≥ minconf.

(19)

Apriori Algorithm

• Let k = 1.
• Generate frequent itemsets of length 1.
• Repeat until no new frequent itemsets are identified:
  • Generate length (k+1) candidate itemsets from length k frequent itemsets.
  • Prune candidate itemsets containing subsets of length k that are infrequent.
  • Count the support of each candidate by scanning the DB.
  • Eliminate candidates that are infrequent, leaving only those that are frequent.
• Join step: C_k is generated by joining L_{k-1} with itself.
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

Source: Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
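The loop above can be sketched in a few dozen lines of base R. This is a teaching sketch under simplifying assumptions (itemsets stored as sorted character vectors, support given as an absolute count, no prefix tree); the function name `apriori_sketch` is hypothetical and this is not the implementation used by the arules package. It is run on the eight example transactions with a minimum support count of 4.

```r
apriori_sketch <- function(trans, min_count) {
  count <- function(itemset) {
    sum(vapply(trans, function(tr) all(itemset %in% tr), logical(1)))
  }
  key <- function(itemset) paste(itemset, collapse = ",")

  # k = 1: frequent 1-itemsets, items kept in sorted order.
  items <- sort(unique(unlist(trans)))
  level <- Filter(function(s) count(s) >= min_count, as.list(items))
  frequent <- list()

  while (length(level) > 0) {
    frequent <- c(frequent, level)
    k <- length(level[[1]])
    level_keys <- vapply(level, key, character(1))

    # Join step: merge two frequent k-itemsets that share their first k - 1 items.
    candidates <- list()
    for (i in seq_len(length(level) - 1)) {
      for (j in (i + 1):length(level)) {
        a <- level[[i]]
        b <- level[[j]]
        if (identical(a[-k], b[-k])) {
          candidates[[length(candidates) + 1]] <- sort(union(a, b))
        }
      }
    }

    # Prune step: drop candidates that have an infrequent k-subset.
    candidates <- Filter(function(cand) {
      subsets <- combn(cand, k, simplify = FALSE)
      all(vapply(subsets, key, character(1)) %in% level_keys)
    }, candidates)

    # Support counting: keep candidates that reach the minimum count.
    level <- Filter(function(cand) count(cand) >= min_count, candidates)
  }

  data.frame(itemset = vapply(frequent, key, character(1)),
             count = vapply(frequent, count, integer(1)),
             stringsAsFactors = FALSE)
}

transactions <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk"),
  c("Fruit", "Soda", "Peanuts", "Milk"),
  c("Fruit", "Peanuts", "Cheese", "Yogurt")
)

freq <- apriori_sketch(transactions, min_count = 4)
```

On this data the result contains 7 frequent 1-itemsets, 8 frequent 2-itemsets and the single frequent 3-itemset {Fruit, Milk, Soda}, matching the tables on the "Illustrating Apriori Principle" slide.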

(20)

Example of Apriori Run

Transaction database:

TID   Items
1     A, C, D
2     B, C, E
3     A, B, C, E
4     B, E

1st scan, C₁:

Itemset   Sup.
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L₁:

Itemset   Sup.
{A}       2
{B}       3
{C}       3
{E}       3

C₂ (candidates generated from L₁):

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan, C₂ with support counts:

Itemset   Sup.
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L₂:

Itemset   Sup.
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C₃:

Itemset
{B, C, E}

3rd scan, L₃:

Itemset     Sup.
{B, C, E}   2
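The step from L₂ to C₃ in this example can be traced in base R. The sketch assumes a minimum support count of 2, which is what the pruning in the tables implies ({D} with count 1 is dropped), and stores itemsets as sorted character vectors. Variable names are illustrative.

```r
# Frequent 2-itemsets from the example.
L2 <- list(c("A", "C"), c("B", "C"), c("B", "E"), c("C", "E"))
k <- 2
key <- function(itemset) paste(itemset, collapse = "")

# Join step: combine pairs of 2-itemsets that share their first item.
joined <- list()
for (i in seq_len(length(L2) - 1)) {
  for (j in (i + 1):length(L2)) {
    if (identical(L2[[i]][-k], L2[[j]][-k])) {
      joined[[length(joined) + 1]] <- sort(union(L2[[i]], L2[[j]]))
    }
  }
}
# Only {B, C} and {B, E} share a prefix, giving {B, C, E}.

# Prune step: every 2-subset of a candidate must itself be in L2.
L2_keys <- vapply(L2, key, character(1))
C3 <- Filter(function(cand) {
  all(vapply(combn(cand, k, simplify = FALSE), key, character(1)) %in% L2_keys)
}, joined)

# 3rd scan: count the surviving candidate against the database.
db <- list(c("A", "C", "D"), c("B", "C", "E"), c("A", "B", "C", "E"), c("B", "E"))
C3_counts <- vapply(C3, function(cand) {
  sum(vapply(db, function(tr) all(cand %in% tr), logical(1)))
}, integer(1))
```

Candidates such as {A, B, C} are never generated, because {A, B} is not in L₂. This is where Apriori saves database scans compared with counting every 3-itemset.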

(21)

Step (2): Rule Generation

• Given a frequent itemset L, find all non-empty proper subsets f ⊂ L such that the association rule f ⇒ L − f satisfies the minimum confidence.
  • Create the rule f ⇒ L − f.
• If L = {A,B,C,D} is a frequent itemset, the candidate rules are:
  {ABC} ⇒ {D}, {ABD} ⇒ {C}, {ACD} ⇒ {B}, {BCD} ⇒ {A},
  {A} ⇒ {BCD}, {B} ⇒ {ACD}, {C} ⇒ {ABD}, {D} ⇒ {ABC},
  {AB} ⇒ {CD}, {AC} ⇒ {BD}, {AD} ⇒ {BC}, {BC} ⇒ {AD}, {BD} ⇒ {AC}, {CD} ⇒ {AB}.
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L ⇒ ∅ and ∅ ⇒ L).
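The 2^k − 2 count can be confirmed by enumerating every binary partition of a frequent itemset. The helper `candidate_rules` below is an illustrative base R sketch, not an arules function.

```r
# Enumerate all rules lhs => rhs where lhs and rhs are non-empty
# and together make up the itemset L.
candidate_rules <- function(L) {
  k <- length(L)
  stopifnot(k >= 2)
  rules <- character(0)
  for (m in 1:(k - 1)) {                       # size of the left-hand side
    for (lhs in combn(L, m, simplify = FALSE)) {
      rhs <- setdiff(L, lhs)
      rules <- c(rules, paste0("{", paste(lhs, collapse = ""), "} => {",
                               paste(rhs, collapse = ""), "}"))
    }
  }
  rules
}

rules_abcd <- candidate_rules(c("A", "B", "C", "D"))
n_rules <- length(rules_abcd)   # 14 = 2^4 - 2
```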

(22)

Generate Rules from Frequent Itemsets

• Confidence does not have an anti-monotone property in general:
  • c({ABC} ⇒ {D}) can be larger or smaller than c({AB} ⇒ {D}).
• But the confidence of rules generated from the same itemset does have an anti-monotone property:
  • e.g., L = {A,B,C,D}: c({ABC} ⇒ {D}) ≥ c({AB} ⇒ {CD}) ≥ c({A} ⇒ {BCD})
  • Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule.
  • The reason: all three rules share the numerator P(ABCD), while the denominator P(lhs) can only grow as items move from the left-hand side to the right-hand side.
• We can apply this property to prune the rule generation.

Recall: confidence(A ⇒ B) = P(B | A) = P(A ∩ B)/P(A)
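A small numeric check of this property, using the four-transaction database from the Apriori run example and its frequent itemset {B, C, E}. The sketch is base R only and the helper names are illustrative.

```r
db <- list(c("A", "C", "D"), c("B", "C", "E"), c("A", "B", "C", "E"), c("B", "E"))

count <- function(itemset) {
  sum(vapply(db, function(tr) all(itemset %in% tr), logical(1)))
}

itemset <- c("B", "C", "E")

# Confidence of lhs => (itemset minus lhs): the numerator is always count(itemset).
conf <- function(lhs) count(itemset) / count(lhs)

c_bc_e <- conf(c("B", "C"))   # {B,C} => {E}:   2/2 = 1
c_b_ce <- conf("B")           # {B}   => {C,E}: 2/3
c_c_be <- conf("C")           # {C}   => {B,E}: 2/3
```

Moving an item from the left-hand side to the right-hand side never raises the confidence. If {B,C} ⇒ {E} had failed minconf, both {B} ⇒ {C,E} and {C} ⇒ {B,E} could have been skipped.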

(23)

Rule Generation for Apriori Algorithm


(24)

Rule Generation for Apriori Algorithm

• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
• Join({CD} ⇒ {AB}, {BD} ⇒ {AC}) would produce the candidate rule {D} ⇒ {ABC}.
• Prune the rule {D} ⇒ {ABC} if its subset {AD} ⇒ {BC} does not have high confidence.

(25)

Apriori Advantages/Disadvantages

• Advantages
  • Uses the large itemset property.
  • Easily parallelized.
  • Easy to implement.
• Disadvantages
  • Assumes the transaction database is memory resident.
  • Requires many database scans.
• Challenges in AR mining
  • Apriori scans the database multiple times.
  • Most often, there is a high number of candidates.
  • Support counting for candidates can be time-consuming.
• Several methods try to improve on these points by
  • reducing the number of scans of the database,
  • shrinking the number of candidates,
  • counting the support of candidates more efficiently.

(26)

Choose an Appropriate minsup and Pattern Evaluation

• Choose an appropriate minsup
  • If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
  • If minsup is set too low, mining is computationally expensive and the number of itemsets is very large.
  • A single minimum support threshold may not be effective.
• Pattern evaluation
  • Association rule algorithms tend to produce too many rules, many of which are uninteresting or redundant. (A rule is redundant if {A,B,C} ⇒ {D} and {A,B} ⇒ {D} have the same support and confidence.)
  • Interestingness measures can be used to prune or rank the derived patterns.
  • In the original formulation of association rules, support and confidence are the only measures used.

(27)

R Package: arules

• arules: Mining Association Rules and Frequent Itemsets
  • Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
  • Also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.
• apriori{arules}:
  • The Apriori algorithm employs level-wise search for frequent itemsets.
  • The defaults: (1) supp=0.1, the minimum support of rules; (2) conf=0.8, the minimum confidence of rules; and (3) maxlen=10, the maximum length of rules.
• eclat{arules}:
  • The ECLAT algorithm finds frequent itemsets with equivalence classes, depth-first search and set intersection instead of counting.
• interestMeasure{arules}: calculates more than twenty measures for selecting interesting association rules.
• Other R packages:
  • arulesViz: a package for visualizing association rules based on package arules.
  • arulesSequences: provides functions for mining sequential patterns.
  • arulesNBMiner: implements an algorithm for mining negative binomial (NB) frequent itemsets and NB-precise rules.

http://lyle.smu.edu/IDA/arules/

https://cran.r-project.org/web/packages/arules/index.html

http://michael.hahsler.net/research/arules_RUG_2015/demo/

(arules: Association Rule Mining with R — A Tutorial, Michael Hahsler, Mon Sep 21 10:51:59 2015)

(28)

apriori{arules}: Mining Associations with Apriori

• Description
  • Mine frequent itemsets, association rules or association hyperedges using the Apriori algorithm. The Apriori algorithm employs level-wise search for frequent itemsets. The implementation of Apriori used includes some improvements (e.g., a prefix tree and item sorting).
• Usage

apriori(data, parameter = NULL, appearance = NULL, control = NULL)

• Arguments
  • data: object of class transactions or any data structure which can be coerced into transactions (e.g., a binary matrix or data.frame).
  • parameter: object of class APparameter or named list. The default behavior is to mine rules with support 0.1, confidence 0.8, and maxlen 10.
  • appearance: object of class APappearance or named list. With this argument item appearance can be restricted (implements rule templates). By default all items can appear unrestricted.
  • control: object of class APcontrol or named list. Controls the algorithmic performance of the mining algorithm (item sorting, etc.).
• Note: Apriori only creates rules with one item in the RHS!

(29)

eclat{arules}: Mining Associations with Eclat

• Description
  • Mine frequent itemsets with the Eclat algorithm. This algorithm uses simple intersection operations for equivalence class clustering along with bottom-up lattice traversal.
• Usage

eclat(data, parameter = NULL, control = NULL)

• Arguments
  • data: object of class transactions or any data structure which can be coerced into transactions (e.g., binary matrix, data.frame).
  • parameter: object of class ECparameter or named list (default values are: support 0.1 and maxlen 5).
  • control: object of class ECcontrol or named list for algorithmic controls.
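The "intersection instead of counting" idea can be illustrated in base R with transaction-id lists (tid-lists). This is a simplified sketch of the vertical data layout only, using the four-transaction database from the Apriori run example; it is not the Eclat implementation that arules interfaces to, and the variable names are illustrative.

```r
db <- list(c("A", "C", "D"), c("B", "C", "E"), c("A", "B", "C", "E"), c("B", "E"))

# Vertical layout: for each item, the ids of the transactions that contain it.
items <- sort(unique(unlist(db)))
tidlists <- lapply(items, function(it) {
  which(vapply(db, function(tr) it %in% tr, logical(1)))
})
names(tidlists) <- items

# The support count of an itemset is the size of the intersection of its tid-lists,
# so no further pass over the transactions is needed.
tid_be  <- intersect(tidlists[["B"]], tidlists[["E"]])   # 2 3 4
tid_bce <- intersect(tid_be, tidlists[["C"]])            # 2 3

count_be  <- length(tid_be)    # 3
count_bce <- length(tid_bce)   # 2
```

The tid-list of {B, C, E} is obtained from the already computed tid-list of {B, E}, which is how a depth-first search can extend an itemset one item at a time.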

(30)

Case Study 0: Adult Data Set

> # The AdultUCI dataset contains the questionnaire data of the "Adult" database
> # (originally called the "Census Income" Database) with 48842 observations on the 15 variables.
> library(arules)
> data(AdultUCI)
> head(AdultUCI)
  age        workclass fnlwgt education education-num     marital-status        occupation
1  39        State-gov  77516 Bachelors            13      Never-married      Adm-clerical
2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse   Exec-managerial
3  38          Private 215646   HS-grad             9           Divorced Handlers-cleaners
4  53          Private 234721      11th             7 Married-civ-spouse Handlers-cleaners
5  28          Private 338409 Bachelors            13 Married-civ-spouse    Prof-specialty
6  37          Private 284582   Masters            14 Married-civ-spouse   Exec-managerial
   relationship  race    sex capital-gain capital-loss hours-per-week native-country income
1 Not-in-family White   Male         2174            0             40  United-States  small
2       Husband White   Male            0            0             13  United-States  small
3 Not-in-family White   Male            0            0             40  United-States  small
4       Husband Black   Male            0            0             40  United-States  small
5          Wife Black Female            0            0             40           Cuba  small
6          Wife White Female            0            0             40  United-States  small
> data(Adult)
> ?Adult # see how to create transactions from AdultUCI
> Adult
transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)
> class(Adult)
[1] "transactions"
attr(,"package")
[1] "arules"
> ?transactions

(31)

Adult Data Set (transactions form)

> str(Adult)
Formal class 'transactions' [package "arules"] with 3 slots
  ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. ..@ i       : int [1:612200] 1 10 25 32 35 50 59 61 63 65 ...
  .. .. ..@ p       : int [1:48843] 0 13 26 39 52 65 78 91 104 117 ...
  .. .. ..@ Dim     : int [1:2] 115 48842
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ factors : list()
  ..@ itemInfo   :'data.frame': 115 obs. of 3 variables:
  .. ..$ labels   : chr [1:115] "age=Young" "age=Middle-aged" "age=Senior" "age=Old" ...
  .. ..$ variables: Factor w/ 13 levels "age","capital-gain",..: 1 1 1 1 13 13 13 13 13 13 ...
  .. ..$ levels   : Factor w/ 112 levels "10th","11th",..: 111 63 92 69 30 54 65 82 90 91 ...
  ..@ itemsetInfo:'data.frame': 48842 obs. of 1 variable:
  .. ..$ transactionID: chr [1:48842] "1" "2" "3" "4" ...
> inspect(Adult[1:2])
  items                                transactionID
1 {age=Middle-aged,
   workclass=State-gov,
   education=Bachelors,
   marital-status=Never-married,
   occupation=Adm-clerical,
   relationship=Not-in-family,
   race=White,
   sex=Male,
   capital-gain=Low,
   capital-loss=None,
   hours-per-week=Full-time,
   native-country=United-States,
   income=small}                       1
2 {age=Senior,
   workclass=Self-emp-not-inc,
   education=Bachelors,
   marital-status=Married-civ-spouse,
   occupation=Exec-managerial,
   relationship=Husband,
   race=White,
   sex=Male,
   capital-gain=None,
   capital-loss=None,
   hours-per-week=Part-time,
   native-country=United-States,
   income=small}

(32)

Adult Data Set

> summary(Adult)
transactions as itemMatrix in sparse format with
 48842 rows (elements/itemsets/transactions) and
 115 columns (items) and a density of 0.1089939

most frequent items:
           capital-loss=None            capital-gain=None native-country=United-States
                       46560                        44807                        43832
                  race=White            workclass=Private                      (Other)
                       41762                        33906                       401333

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13
   19   971  2067 15623 30162

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   12.00   13.00   12.53   13.00   13.00

includes extended item information - examples:
           labels variables      levels
1       age=Young       age       Young
2 age=Middle-aged       age Middle-aged
3      age=Senior       age      Senior

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3

(33)

How to Create Transactions Data

• Transactions can be created by coercion from lists containing transactions, but also from matrices and data.frames. However, you will need to prepare your data first. Association rule mining can only use items and does not work with continuous variables.

http://127.0.0.1:18470/library/arules/html/transactions-class.html

> # creating transactions from a list
> a.list <- list(
+     c("a","b","c"),
+     c("a","b"),
+     c("a","b","d"),
+     c("c","e"),
+     c("a","b","d","e")
+ )
> names(a.list) <- paste0("Customer", c(1:5))
> a.list
$Customer1
[1] "a" "b" "c"

$Customer2
[1] "a" "b"

$Customer3
[1] "a" "b" "d"

$Customer4
[1] "c" "e"

$Customer5
[1] "a" "b" "d" "e"

# avoid "no method or default for coercing"
library(Matrix)
productTD <- as(product_by_user$Product, "transactions")
inspect(productTD[1:5])

(34)

Coerce a List Into Transactions

> alist.trans <- as(a.list, "transactions")
> summary(alist.trans) # analyze transactions
transactions as itemMatrix in sparse format with
 5 rows (elements/itemsets/transactions) and
 5 columns (items) and a density of 0.56

most frequent items:
      a       b       c       d       e (Other)
      4       4       2       2       2       0

element (itemset/transaction) length distribution:
sizes
2 3 4
2 2 1

  labels
1      a
2      b
3      c

  transactionID
1     Customer1
2     Customer2
3     Customer3
> image(alist.trans)

(35)

Creating Transactions from a Matrix

> a.matrix <- matrix(c(
+     1,1,1,0,0,
+     1,1,0,0,0,
+     1,1,0,1,0,
+     0,0,1,0,1), ncol = 5)
> dimnames(a.matrix) <- list(
+     paste("Customer", letters[1:4]),
+     paste0("Item", c(1:5)))
> a.matrix
           Item1 Item2 Item3 Item4 Item5
Customer a     1     0     0     0     0
Customer b     1     1     0     1     1
Customer c     1     1     1     0     0
Customer d     0     0     1     0     1
>
> amatirx.trans <- as(a.matrix, "transactions")
> amatirx.trans
transactions in sparse format with
 4 transactions (rows) and
 5 items (columns)
> inspect(amatirx.trans)
    items                     transactionID
[1] {Item1}                   Customer a
[2] {Item1,Item2,Item4,Item5} Customer b
[3] {Item1,Item2,Item3}       Customer c
[4] {Item3,Item5}             Customer d
> summary(amatirx.trans)
  Item1   Item2   Item3   Item5   Item4 (Other)
      3       2       2       2       1       0

element (itemset/transaction) length distribution:
sizes
1 2 3 4
1 1 1 1

  labels
1  Item1
2  Item2
3  Item3

includes extended transaction information - examples:
  transactionID
1    Customer a
2    Customer b
3    Customer c

(36)

More Examples

> # creating transactions from data.frame
> a.df <- data.frame(
+     age = as.factor(c(6, 8, NA, 9, 16)),
+     grade = as.factor(c("A", "C", "F", NA, "C")),
+     pass = c(TRUE, TRUE, FALSE, TRUE, TRUE))
> # note: factors are translated to
> # logicals and NAs are ignored
> a.df
   age grade  pass
1    6     A  TRUE
2    8     C  TRUE
3 <NA>     F FALSE
4    9  <NA>  TRUE
5   16     C  TRUE
> adf.trans <- as(a.df, "transactions")
> inspect(adf.trans)
    items                 transactionID
[1] {age=6,grade=A,pass}  1
[2] {age=8,grade=C,pass}  2
[3] {grade=F}             3
[4] {age=9,pass}          4
[5] {age=16,grade=C,pass} 5
> as(adf.trans, "data.frame")
                  items transactionID
1  {age=6,grade=A,pass}             1
2  {age=8,grade=C,pass}             2
3             {grade=F}             3
4          {age=9,pass}             4
5 {age=16,grade=C,pass}             5

> # creating transactions from (IDs, items)
> a.df2 <- data.frame(
+     TID = c(1, 1, 2, 2, 2, 3),
+     item = c("a", "b", "a", "b", "c", "b"))
> a.df2
  TID item
1   1    a
2   1    b
3   2    a
4   2    b
5   2    c
6   3    b
> a.df2.s <- split(a.df2[, "item"], a.df2[,"TID"])
> a.df2.s
$`1`
[1] a b
Levels: a b c

$`2`
[1] a b c
Levels: a b c

$`3`
[1] b
Levels: a b c

> adf2.trans <- as(a.df2.s, "transactions")
> inspect(adf2.trans)
    items   transactionID
[1] {a,b}   1
[2] {a,b,c} 2
[3] {b}     3
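An alternative view of the same (TID, item) data, sketched in base R without arules: cross-tabulating the two columns yields a binary incidence matrix with one row per transaction and one column per item. The object names `counts` and `incidence` are illustrative.

```r
a.df2 <- data.frame(
  TID  = c(1, 1, 2, 2, 2, 3),
  item = c("a", "b", "a", "b", "c", "b"))

# Cross-tabulate transaction ids against items, then reduce counts to TRUE/FALSE.
counts <- table(a.df2$TID, a.df2$item)
incidence <- unclass(counts) > 0

# Items per transaction and transactions per item.
basket_sizes <- rowSums(incidence)   # 2 3 1
item_counts  <- colSums(incidence)   # a = 2, b = 3, c = 1
```

Row "2" has TRUE in columns a, b and c, which corresponds to the transaction {a,b,c} shown by inspect(). A matrix of this shape is the kind of input from which, as noted earlier, transactions can be created by coercion.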

(37)

Example: Create Transactions Data


> data(AdultUCI)

> summary(AdultUCI)

> # remove attributes

> AdultUCI[["fnlwgt"]] <- NULL

> AdultUCI[["education-num"]] <- NULL

http://127.0.0.1:18470/library/arules/html/Adult.html

(38)

Example: Create Transactions Data

> # map metric attributes
> AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
+     labels = c("Young", "Middle-aged", "Senior", "Old"))
> AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
+     c(0, 25, 40, 60, 168)),
+     labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
> AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
+     c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
+     labels = c("None", "Low", "High"))
> AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
+     c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
+     labels = c("None", "Low", "High"))
>
> summary(AdultUCI[c("age", "hours-per-week", "capital-gain", "capital-loss")])
          age          hours-per-week   capital-gain  capital-loss
 Young      : 9627   Part-time : 5913   None:44807    None:46560
 Middle-aged:24671   Full-time :28577   Low : 2345    Low : 1166
 Senior     :12741   Over-time :12676   High: 1690    High: 1116
 Old        : 1803   Workaholic: 1676
>
> # create transactions
> MyAdult <- as(AdultUCI, "transactions")
> MyAdult
transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)

(39)

Example: Create Transactions Data

> summary(MyAdult)
           capital-loss=None            capital-gain=None native-country=United-States
                       46560                        44807                        43832
                  race=White            workclass=Private                      (Other)
                       41762                        33906                       401333

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13
   19   971  2067 15623 30162

           labels variables      levels
1       age=Young       age       Young
2 age=Middle-aged       age Middle-aged
3      age=Senior       age      Senior

  transactionID
1             1
2             2
3             3
> inspect(MyAdult[1:2])
    items                           transactionID
[1] {age=Middle-aged,
     workclass=State-gov,
     education=Bachelors,
     marital-status=Never-married,
     occupation=Adm-clerical,
     relationship=Not-in-family,
     race=White,
     sex=Male,
     capital-gain=Low,
     capital-loss=None,
     hours-per-week=Full-time,
     native-country=United-States,
     income=small}                  1

(40)

Case Study 1: Groceries Data Set

• Description: The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.
• Format: Object of class transactions.

> library(arules)
> data(Groceries)
> ?Groceries
> str(Groceries)
Formal class 'transactions' [package "arules"] with 3 slots
  ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. ..@ i       : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
  .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
  .. .. ..@ Dim     : int [1:2] 169 9835
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ factors : list()
  ..@ itemInfo   :'data.frame': 169 obs. of 3 variables:
  .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
  .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
  .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
  ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables

Groceries@itemInfo

(41)

summary, inspect

> summary(Groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14
  19   20   21   22   23   24   26   27   28   29   32
  14    9   11    4    6    1    1    1    1    3    1

       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage
> inspect(Groceries[1:4])
  items
1 {citrus fruit,semi-finished bread,margarine,ready soups}
2 {tropical fruit,yogurt,coffee}
3 {whole milk}
4 {pip fruit,yogurt,cream cheese ,meat spreads}

(42)

Apply apriori

> rule0 <- apriori(Groceries)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 983

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

The default behavior is to mine rules with
• a minimum support of 0.1,
• a minimum confidence of 0.8,
• a maximum of 10 items (maxlen), and
• a maximal time for subset checking of 5 seconds (maxtime).

(43)

Apply apriori With Different Arguments

> rule1 <- apriori(Groceries, parameter=list(support=0.005, confidence=0.64))
Apriori
checking subsets of size 1 2 3 4 done [0.00s].
> inspect(rule1)
  lhs                                             rhs          support     confidence lift
1 {butter,whipped/sour cream}                  => {whole milk} 0.006710727 0.6600000  2.583008
2 {pip fruit,whipped/sour cream}               => {whole milk} 0.005998983 0.6483516  2.537421
3 {pip fruit,root vegetables,other vegetables} => {whole milk} 0.005490595 0.6750000  2.641713
4 {tropical fruit,root vegetables,yogurt}      => {whole milk} 0.005693950 0.7000000  2.739554

(44)

Class 'rules'

> str(rule1)
Formal class 'rules' [package "arules"] with 4 slots
  ..@ lhs    :Formal class 'itemMatrix' [package "arules"] with 3 slots
  .. .. ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. .. .. ..@ i       : int [1:10] 25 30 15 30 15 19 22 14 19 29
  .. .. .. .. ..@ p       : int [1:5] 0 2 4 7 10
  .. .. .. .. ..@ Dim     : int [1:2] 169 4
  .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. ..@ factors : list()
  .. .. ..@ itemInfo   :'data.frame': 169 obs. of 3 variables:
  .. .. .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
  .. .. .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
  .. .. .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
  .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
  ..@ rhs    :Formal class 'itemMatrix' [package "arules"] with 3 slots
  .. .. ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. .. .. ..@ i       : int [1:4] 24 24 24 24
  .. .. .. .. ..@ p       : int [1:5] 0 1 2 3 4
  .. .. .. .. ..@ Dim     : int [1:2] 169 4
  .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. ..@ factors : list()
  .. .. ..@ itemInfo   :'data.frame': 169 obs. of 3 variables:
  .. .. .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
  .. .. .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
  .. .. .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
  .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
  ..@ quality:'data.frame': 4 obs. of 3 variables:
  .. ..$ support   : num [1:4] 0.00671 0.006 0.00549 0.00569
  .. ..$ confidence: num [1:4] 0.66 0.648 0.675 0.7
  .. ..$ lift      : num [1:4] 2.58 2.54 2.64 2.74
  ..@ info   :List of 4
  .. ..$ data : symbol Groceries
> rule1@quality
      support confidence     lift
1 0.006710727  0.6600000 2.583008
2 0.005998983  0.6483516 2.537421
3 0.005490595  0.6750000 2.641713

(45)

Select Top AR by Support

> rule2 <- apriori(Groceries, parameter=list(support=0.001, confidence=0.5))
Apriori
checking subsets of size 1 2 3 4 5 6 done [0.01s].
>
> rule2.sorted_sup <- sort(rule2, by="support")
> inspect(rule2.sorted_sup[1:5])
     lhs                                      rhs          support    confidence lift
1472 {other vegetables,yogurt}             => {whole milk} 0.02226741 0.5128806  2.007235
1467 {tropical fruit,yogurt}               => {whole milk} 0.01514997 0.5173611  2.024770
1449 {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423  1.984385
1469 {root vegetables,yogurt}              => {whole milk} 0.01453991 0.5629921  2.203354
1454 {pip fruit,other vegetables}          => {whole milk} 0.01352313 0.5175097  2.025351

(46)

Select a Subset of Rules

> # Select a subset of rules using partial matching on the items
> # in the right-hand-side and a quality measure
> rule2.sub <- subset(rule2, subset = rhs %pin% "whole milk" & lift > 1.3)
> rule2.sub
set of 2679 rules
>
> # Display the top 3 support rules
> inspect(head(rule2.sub, n = 3, by = "support"))
     lhs                                      rhs          support    confidence lift
1472 {other vegetables,yogurt}             => {whole milk} 0.02226741 0.5128806  2.007235
1467 {tropical fruit,yogurt}               => {whole milk} 0.01514997 0.5173611  2.024770
1449 {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423  1.984385
>
> # Display the first 3 rules
> inspect(rule2.sub[1:3])
  lhs                 rhs          support     confidence lift
1 {honey}          => {whole milk} 0.001118454 0.7333333  2.870009
3 {cocoa drinks}   => {whole milk} 0.001321810 0.5909091  2.312611
4 {pudding powder} => {whole milk} 0.001321810 0.5652174  2.212062
>
> # Get labels for the first 3 rules
> labels(rule2.sub[1:3])
[1] "{honey} => {whole milk}"          "{cocoa drinks} => {whole milk}"
[3] "{pudding powder} => {whole milk}"
> labels(rule2.sub[1:3], itemSep = " + ", setStart = "", setEnd="", ruleSep = " ---> ")
[1] "honey ---> whole milk"          "cocoa drinks ---> whole milk"
[3] "pudding powder ---> whole milk"

(47)

Select Top AR by Confidence, Lift

> rule2.sorted_con <- sort(rule2, by="confidence")
> inspect(rule2.sorted_con[1:5])
     lhs                                          rhs          support     confidence lift
113  {rice,sugar}                              => {whole milk} 0.001220132 1          3.913649
258  {canned fish,hygiene articles}            => {whole milk} 0.001118454 1          3.913649
1487 {root vegetables,butter,rice}             => {whole milk} 0.001016777 1          3.913649
1646 {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521 1          3.913649
1670 {butter,soft cheese,domestic eggs}        => {whole milk} 0.001016777 1          3.913649
>
> rule2.sorted_lift <- sort(rule2, by="lift")
> inspect(rule2.sorted_lift[1:5])
    lhs                                   rhs                support     confidence lift
53  {Instant food products,soda}       => {hamburger meat}   0.001220132 0.6315789  18.99565
37  {soda,popcorn}                     => {salty snack}      0.001220132 0.6315789  16.69779
444 {flour,baking powder}              => {sugar}            0.001016777 0.5555556  16.40807
327 {ham,processed cheese}             => {white bread}      0.001931876 0.6333333  15.04549
55  {whole milk,Instant food products} => {hamburger meat}   0.001525165 0.5000000  15.03823

sort(x, decreasing = TRUE, na.last = NA, by = "support", order = FALSE, ...)

## S4 method for signature 'associations'
head(x, n = 6L, by = NULL, decreasing = TRUE, ...)

## S4 method for signature 'associations'
tail(x, n = 6L, by = NULL, decreasing = TRUE, ...)

(48)

Select Top Frequent Itemsets

> rule.freq_item <- apriori(Groceries, parameter=list(support=0.001, target="frequent itemsets"),
+     control=list(sort=-1))
Apriori

 confidence minval smax arem  aval originalSupport support minlen maxlen            target   ext
         NA    0.1    1 none FALSE            TRUE   0.001      1     10 frequent itemsets FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE   -1    TRUE

Absolute minimum support count: 9

...
checking subsets of size 1 2 3 4 5 6 done [0.02s].
writing ... [13492 set(s)] done [0.00s].
> rule.freq_item
set of 13492 itemsets
> inspect(rule.freq_item[1:5])
  items              support
1 {whole milk}       0.2555160
2 {other vegetables} 0.1934926
3 {rolls/buns}       0.1839349
4 {soda}             0.1743772
5 {yogurt}           0.1395018

(49)

Frequent k-itemsets

> rule.fi_eclat <- eclat(Groceries, parameter=list(minlen=1, maxlen=3, support=0.001,
+     target="frequent itemsets"), control=list(sort=-1))
Eclat ...
> rule.fi_eclat
> inspect(rule.fi_eclat[1:5])
  items                       support
1 {whole milk,honey}          0.001118454
2 {whole milk,cocoa drinks}   0.001321810
3 {whole milk,pudding powder} 0.001321810
4 {tidbits,rolls/buns}        0.001220132
5 {tidbits,soda}              0.001016777

> rule.fi_eclat <- eclat(Groceries, parameter=list(minlen=3, maxlen=5, support=0.001,
+     target="frequent itemsets"), control=list(sort=-1))
Eclat ...
> rule.fi_eclat
> inspect(rule.fi_eclat[1:5])
  items                                        support
1 {liver loaf,whole milk,yogurt}               0.001016777
2 {tropical fruit,other vegetables,curd cheese} 0.001016777
3 {whole milk,curd cheese,rolls/buns}          0.001016777
4 {other vegetables,whole milk,curd cheese}    0.001220132
5 {other vegetables,whole milk,cleaner}        0.001016777