Association Rules


Academic year: 2021


http://www.hmwu.idv.tw

Han-Ming Wu (吳漢銘), Department of Statistics, National Chengchi University

Association Analysis

Association Rules

C04

Market Basket Analysis

http://www.analyticsvidhya.com/blog/2014/08/effective-cross-selling-market-basket-analysis/

https://blogs.adobe.com/digitalmarketing/wp-content/uploads/2013/08/pic1.jpg

Application Examples

Market Basket Analysis

Market Basket Analysis is a data mining approach used

to find associations and correlations between the different items that customers place in their shopping baskets, and

to help store owners improve their profits by controlling the ordering of products and marketing.

Retailers use Market Basket Analysis

to gain a window into consumer shopping behavior, revealing how consumers select products, make spending tradeoffs, and group items in a shopping cart, and

to understand how baskets are built, which helps them merchandise more effectively by using market basket dynamics in pricing and promotion decisions.

R. Agrawal, T. Imieliński and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207-216, May 1993.

(Cited 19,551 times.)

Association Rule Mining

The ideas behind Association Rule Learning (also called Association Rule Mining) come from market basket analysis.

AR mining:

Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

TID  Items
1    Bread, Peanuts, Milk, Fruit, Jam
2    Bread, Jam, Soda, Chips, Milk, Fruit
3    Steak, Jam, Soda, Chips, Bread
4    Jam, Soda, Peanuts, Milk, Fruit
5    Jam, Soda, Chips, Milk, Bread
6    Fruit, Soda, Chips, Milk
7    Fruit, Soda, Peanuts, Milk
8    Fruit, Peanuts, Cheese, Yogurt

Example rules obtained by mining these transactions:

{bread} ⇒ {milk}
{soda} ⇒ {chips}
{bread} ⇒ {jam}

Association Rule Mining

Formalizing the problem:

Transaction Database T: a set of transactions T = {t1, t2, …, tn}.

Each transaction contains a set of items I (an itemset).

An itemset is a collection of items I = {i1 , i2 , …, im}.

k-itemset: an itemset that contains k items.

Association rules are rules presenting association or correlation between itemsets.

An association rule has the form A ⇒ B, where A and B are two disjoint itemsets, referred to respectively as the lhs (left-hand side, the antecedent) and rhs (right-hand side, the consequent) of the rule.

Definition: Frequent Itemset

Support count (σ)

Frequency of occurrence of an itemset.

σ({Milk, Bread}) = 3, σ({Soda, Chips}) = 4.

Support (s)

The frequency with which the rule occurs.

The fraction of transactions that contain both itemsets A and B.

Support(A ⇒ B) = P(A ∩ B), where A ∩ B means that A and B both occur in the transaction.

s({Milk, Bread}) = 3/8; s({Soda, Chips}) = 4/8

Frequent itemset:

An itemset whose support reaches the minimum support threshold: s(itemset) ≥ minsup.

Items that frequently appear together.

The strength of the association.
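The support count and support defined above can be checked directly against the eight example transactions. The following is a minimal base-R sketch; the helper names `support_count` and `support` are illustrative assumptions, not functions from the slides or from any package.

```r
# The eight example transactions from the market basket table.
transactions <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk"),
  c("Fruit", "Soda", "Peanuts", "Milk"),
  c("Fruit", "Peanuts", "Cheese", "Yogurt")
)

# Support count (hypothetical helper): number of transactions that
# contain every item of the itemset.
support_count <- function(itemset, trans) {
  sum(vapply(trans, function(t) all(itemset %in% t), logical(1)))
}

# Support (hypothetical helper): support count divided by the
# number of transactions.
support <- function(itemset, trans) {
  support_count(itemset, trans) / length(trans)
}

print(support_count(c("Milk", "Bread"), transactions))  # 3
print(support(c("Soda", "Chips"), transactions))        # 0.5
```

With minsup = 0.5, {Soda, Chips} would count as a frequent itemset in this data while {Milk, Bread} would not.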

Confidence and Lift

Confidence (c):

The fraction of transactions containing A that also contain B.

confidence(A ⇒ B) = P(B | A) = P(A ∩ B)/P(A)

A rule is kept when confidence(A ⇒ B) ≥ minconf (minimum confidence).

Lift:

The ratio of the confidence to the fraction of transactions containing B.

lift(A ⇒ B) = P(B | A)/P(B)

= confidence(A ⇒ B)/P(B)

= P(A ∩ B)/(P(A)P(B))

lift(A ⇒ B) = 1: A and B are independent; A does not raise the likelihood that B occurs.

lift(A ⇒ B) > 1: the larger the lift, the more A raises the likelihood of B and the stronger the association.
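As a worked illustration of confidence and lift, the sketch below evaluates two of the example rules on the eight market basket transactions. It uses base R only; the helper names `support`, `confidence` and `lift` are assumptions made for this example.

```r
# The eight example transactions from the market basket table.
transactions <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk"),
  c("Fruit", "Soda", "Peanuts", "Milk"),
  c("Fruit", "Peanuts", "Cheese", "Yogurt")
)

# Fraction of transactions containing every item of the itemset.
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# confidence(A => B) = P(A and B) / P(A)
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

# lift(A => B) = confidence(A => B) / P(B)
lift <- function(lhs, rhs) confidence(lhs, rhs) / support(rhs)

print(confidence("Bread", "Milk"))  # (3/8) / (4/8) = 0.75
print(lift("Bread", "Milk"))        # 0.75 / (6/8) = 1
print(lift("Soda", "Chips"))        # (4/6) / (4/8) = 1.333...
```

In this small sample {Bread} ⇒ {Milk} has a high confidence but a lift of exactly 1, so buying bread does not change the likelihood of buying milk, whereas {Soda} ⇒ {Chips} has a lift above 1.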

NOTE: There are many other interestingness measures, such as chi-square, conviction, gini and leverage. An introduction to over 20 measures can be found in Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. In KDD ’02: Proceedings of the 8th


Find a Rule

Rules originating from the same itemset have identical support but can have different confidence.

Given a set of transactions T, the goal of association rule mining is to find all rules having

support ≥ minsup threshold.

confidence ≥ minconf threshold.

The following rules are binary partitions of the same itemset {Milk, Bread, Jam}:

{Bread, Jam} ⇒ {Milk}, s=0.375, c=3/4=0.75
{Milk, Jam} ⇒ {Bread}, s=0.375, c=0.75
{Bread} ⇒ {Milk, Jam}, s=0.375, c=0.75
{Jam} ⇒ {Bread, Milk}, s=0.375, c=0.6
{Milk} ⇒ {Bread, Jam}, s=0.375, c=0.5
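The support and confidence values of these partitions can be reproduced by enumerating every split of {Milk, Bread, Jam} into a non-empty left-hand side and a non-empty right-hand side. This is a base-R sketch under that reading; the helper `count` and the data frame layout are assumptions for illustration.

```r
# The eight example transactions from the market basket table.
transactions <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk"),
  c("Fruit", "Soda", "Peanuts", "Milk"),
  c("Fruit", "Peanuts", "Cheese", "Yogurt")
)
n <- length(transactions)

# Support count of an itemset (hypothetical helper).
count <- function(itemset) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

itemset <- c("Milk", "Bread", "Jam")
rules <- list()
for (k in 1:(length(itemset) - 1)) {
  # Every k-item subset is a candidate left-hand side.
  for (lhs in combn(itemset, k, simplify = FALSE)) {
    rhs <- setdiff(itemset, lhs)
    rules[[length(rules) + 1]] <- data.frame(
      lhs = paste(lhs, collapse = ","),
      rhs = paste(rhs, collapse = ","),
      support = count(itemset) / n,            # same for every partition
      confidence = count(itemset) / count(lhs),
      stringsAsFactors = FALSE
    )
  }
}
rules <- do.call(rbind, rules)
print(rules)
```

Every row has support 0.375, while the confidence depends on how often the left-hand side occurs on its own.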


Mining Association Rules

Brute-force approach:

List all possible association rules.

Compute the support and confidence for each rule.

Prune rules that fail the minsup and minconf thresholds.

Brute-force approach is computationally prohibitive!

Two step approach:

Step (1): Frequent Itemset Generation:

Generate all itemsets whose support >= minsup

Step (2): Rule Generation:

Generate high-confidence rules from each frequent itemset.

Each rule is a binary partitioning of a frequent itemset.

Frequent itemset generation is computationally expensive.


Step (1): Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets.

Source: (1) Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) (2) Tan, Steinbach, Kumar, Introduction to Data Mining

> no.item <- 5
> sum(choose(no.item, 0:no.item))
[1] 32
> 2^no.item
[1] 32

Total Number of Possible Association Rules

Given d unique items:

Total number of itemsets = 2^d

Total number of possible association rules:
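The formula itself did not survive in this copy of the slide. The standard count, as given in Tan, Steinbach and Kumar's Introduction to Data Mining (a source this deck cites), chooses k items for the left-hand side and then any non-empty subset of the remaining d - k items for the right-hand side:

```latex
R \;=\; \sum_{k=1}^{d-1} \left[ \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} \right]
  \;=\; 3^{d} - 2^{d+1} + 1
```

For d = 5 this gives 3^5 - 2^6 + 1 = 180 possible rules, and for d = 6 it gives 602.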


Frequent Itemset Generation Strategies

Reduce the number of candidates (M).

Complete search: M = 2^d.

Use pruning techniques to reduce M.

Reduce the number of transactions (N).

Reduce size of N as the size of itemset increases.

Reduce the number of comparisons (NM).

Use efficient data structures to store the candidates or transactions.

No need to match every candidate against every transaction.


Reducing the Number of Candidates: Apriori Principle

Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent.

Apriori principle holds due to the anti-monotone property of support measure: support of an itemset never exceeds the support of its subsets.
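Written out, the anti-monotone property of support says that adding items to an itemset can never increase its support. With s(·) denoting support as defined earlier:

```latex
\forall X, Y :\quad X \subseteq Y \;\Longrightarrow\; s(X) \,\ge\, s(Y)
```

In the eight-transaction example, s({Milk}) = 6/8 while s({Milk, Bread}) = 3/8. The contrapositive is what makes pruning possible: if X is infrequent, every superset of X is infrequent as well.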

Apriori Principle

Source: (1) Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) (2) Tan,Steinbach, Kumar Introduction to Data Mining

Illustrating Apriori Principle

Minimum Support = 4

1-itemsets

Item     Count
Bread    4
Peanuts  4
Milk     6
Fruit    6
Jam      5
Soda     6
Chips    4
Steak    1
Cheese   1
Yogurt   1

2-itemsets

Itemset         Count
Bread, Jam      4
Peanuts, Fruit  4
Milk, Fruit     5
Milk, Jam       4
Milk, Soda      5
Fruit, Soda     4
Jam, Soda       4
Soda, Chips     4

3-itemsets

Itemset            Count
Milk, Fruit, Soda  4

Definition of Apriori Algorithm

The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.

Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.

Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or records of website visits).

Steps To Perform Apriori Algorithm

Scan the transaction database to get the support S of each 1-itemset, compare S with minsup, and get the set of frequent 1-itemsets, L1.

Join Lk-1 with Lk-1 to generate a set of candidate k-itemsets. Use the Apriori property to prune the infrequent k-itemsets from this set.

Scan the transaction database to get the support S of each candidate k-itemset in the final set, compare S with minsup, and get the set of frequent k-itemsets, Lk.

Is the candidate set empty? If NO, return to the join step; if YES, continue.

For each frequent itemset f, generate all nonempty subsets of f.

For every nonempty subset s of f, output the rule “s ⇒ (f-s)” if the confidence C of the rule (= support of f / support of s) is >= minconf.

Apriori Algorithm

Let k=1

Generate frequent itemsets of length 1.

Repeat until no new frequent itemsets are identified:

Generate length (k+1) candidate itemsets from length k frequent itemsets.

Prune candidate itemsets containing subsets of length k that are infrequent.

Count the support of each candidate by scanning the DB.

Eliminate candidates that are infrequent, leaving only those that are frequent.

Join Step: Ck is generated by joining Lk-1 with itself.

Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.


Source: Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Example of Apriori Run

Transaction database

TID  Items
1    A, C, D
2    B, C, E
3    A, B, C, E
4    B, E

1st scan: C1

Itemset  Sup.
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1

Itemset  Sup.
{A}      2
{B}      3
{C}      3
{E}      3

C2 (candidates generated from L1)

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2

Itemset  Sup.
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2

Itemset  Sup.
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3

Itemset
{B, C, E}

3rd scan: L3

Itemset    Sup.
{B, C, E}  2
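The run above can be retraced with a compact level-wise search. The sketch below is a simplified base-R illustration, not the implementation used by any package: the function name `apriori_itemsets` is hypothetical, the join step takes unions of pairs of frequent k-itemsets rather than the prefix-based join, and the minimum support count of 2 is an assumption inferred from the tables (where {D} with count 1 is dropped and {A} with count 2 is kept).

```r
# The four-transaction example database.
trans <- list(
  c("A", "C", "D"),
  c("B", "C", "E"),
  c("A", "B", "C", "E"),
  c("B", "E")
)

# Hypothetical helper: level-wise frequent itemset search.
apriori_itemsets <- function(trans, min_count) {
  count <- function(s) {
    sum(vapply(trans, function(t) all(s %in% t), logical(1)))
  }
  items <- sort(unique(unlist(trans)))
  # L1: frequent 1-itemsets.
  Lk <- Filter(function(s) count(s) >= min_count, as.list(items))
  frequent <- list()
  k <- 1
  while (length(Lk) > 0) {
    frequent <- c(frequent, Lk)
    # Join step: unions of two frequent k-itemsets that have k + 1 items.
    cand <- list()
    if (length(Lk) > 1) {
      for (i in 1:(length(Lk) - 1)) {
        for (j in (i + 1):length(Lk)) {
          u <- sort(union(Lk[[i]], Lk[[j]]))
          if (length(u) == k + 1) cand[[paste(u, collapse = ",")]] <- u
        }
      }
    }
    # Prune step: every k-item subset of a candidate must be frequent.
    keys <- vapply(Lk, paste, character(1), collapse = ",")
    cand <- Filter(function(u) {
      subs <- combn(u, k, simplify = FALSE)
      all(vapply(subs, function(s) paste(s, collapse = ",") %in% keys,
                 logical(1)))
    }, cand)
    # Count step: scan the data and keep the frequent candidates.
    Lk <- unname(Filter(function(u) count(u) >= min_count, cand))
    k <- k + 1
  }
  frequent
}

freq <- apriori_itemsets(trans, min_count = 2)
print(vapply(freq, paste, character(1), collapse = ","))
```

The pruning step discards {A, B, C} and {A, C, E} before any counting, because {A, B} and {A, E} are not in L2; only {B, C, E} is counted in the third scan.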

Step (2): Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the association rule f ⇒ L - f satisfies the minimum confidence.

Create the rule f ⇒ L - f.

If L = {A,B,C,D} is a frequent itemset, the candidate rules are:

{ABC} ⇒ {D}, {ABD} ⇒ {C}, {ACD} ⇒ {B}, {BCD} ⇒ {A}, {A} ⇒ {BCD}, {B} ⇒ {ACD}, {C} ⇒ {ABD}, {D} ⇒ {ABC}, {AB} ⇒ {CD}, {AC} ⇒ {BD}, {AD} ⇒ {BC}, {BC} ⇒ {AD}, {BD} ⇒ {AC}, {CD} ⇒ {AB}.

If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L ⇒ ∅ and ∅ ⇒ L).
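The count of 2^k - 2 candidate rules for L = {A,B,C,D} can be checked by enumeration. This is a short base-R sketch; the string format of the rules is chosen only for readability.

```r
L <- c("A", "B", "C", "D")

# For each size k of the left-hand side (1 to |L| - 1), list every
# k-item subset of L and pair it with the remaining items.
rules <- unlist(lapply(1:(length(L) - 1), function(k) {
  vapply(combn(L, k, simplify = FALSE), function(lhs) {
    rhs <- setdiff(L, lhs)
    paste0("{", paste(lhs, collapse = ""), "} => {",
           paste(rhs, collapse = ""), "}")
  }, character(1))
}))

print(rules)
print(length(rules))     # 14
print(2^length(L) - 2)   # 14
```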


Generate Rules from Frequent Itemsets

Confidence does not have an anti-monotone property

c({ABC} ⇒ {D}) can be larger or smaller than c({AB} ⇒ {D})

But confidence of rules generated from the same itemset has an anti-monotone property

e.g., L = {A,B,C,D}:

c({ABC} ⇒ {D}) >= c({AB} ⇒ {CD}) >= c({A} ⇒ {BCD})

Confidence is anti-monotone with respect to the number of items on the right hand side of the rule.

We can apply this property to prune the rule generation.
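The reason the property holds for rules drawn from one itemset L is that the numerator of the confidence is always s(L), so only the support of the left-hand side varies. A short derivation, with s(·) denoting support:

```latex
c(X \Rightarrow L \setminus X) = \frac{s(L)}{s(X)}, \qquad
X' \subset X \subset L
\;\Longrightarrow\; s(X') \ge s(X)
\;\Longrightarrow\;
c(X' \Rightarrow L \setminus X') = \frac{s(L)}{s(X')}
\;\le\; \frac{s(L)}{s(X)} = c(X \Rightarrow L \setminus X)
```

Moving items from the left-hand side to the right-hand side can therefore only lower the confidence, so once a rule fails minconf, every rule obtained from it by moving more items to the right can be skipped.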

Recall: confidence(A ⇒ B) = P(B | A) = P(A ∩ B)/P(A)

Rule Generation for Apriori Algorithm


Source: (1) Prof. Pier Luca Lanzi, Association Rule Basics, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) (2) Tan,Steinbach, Kumar Introduction to Data Mining


Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.

Join({CD} ⇒ {AB}, {BD} ⇒ {AC}) would produce the candidate rule {D} ⇒ {ABC}.

Prune rule {D} ⇒ {ABC} if its subset {AD} ⇒ {BC} does not have high confidence.

Apriori Advantages/Disadvantages

Advantages

Uses large itemset property.

Easily parallelized.

Easy to implement.

Disadvantages

Assumes transaction database is memory resident.

Requires many database scans.

Challenges in AR Mining

Apriori scans the database multiple times.

Most often, there is a high number of candidates.

Support counting for candidates can be time-consuming.

Several methods try to improve on these points by

reducing the number of scans of the database,

shrinking the number of candidates, and

counting the support of candidates more efficiently.

Choose an Appropriate minsup and Pattern Evaluation

Choose an Appropriate minsup

If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)

If minsup is set too low, it is computationally expensive and the number of itemsets is very large

A single minimum support threshold may not be effective.

Pattern Evaluation

Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant.

(Redundant if {A,B,C} ⇒ {D} and {A,B} ⇒ {D} have the same support and confidence.)

Interestingness measures can be used to prune/rank the derived patterns.

In the original formulation of association rules, support and confidence are the only measures used.

R Package: arules

arules: Mining Association Rules and Frequent Itemsets

Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

Also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.

apriori{arules}:

The Apriori algorithm employs level-wise search for frequent itemsets.

The defaults: (1) supp=0.1, the minimum support of rules; (2) conf=0.8, the minimum confidence of rules; and (3) maxlen=10, which is the maximum length of rules.

eclat{arules}:

The ECLAT algorithm finds frequent itemsets with equivalence classes, depth-first search and set intersection instead of counting.

interestMeasure{arules}: more than twenty measures for selecting interesting association rules can be calculated.

Other R packages:

arulesViz: A package for visualizing association rules based on package arules.

arulesSequences: provides functions for mining sequential patterns.

arulesNBMiner: implements an algorithm for mining negative binomial (NB) frequent itemsets and NB-precise rules.


http://lyle.smu.edu/IDA/arules/

https://cran.r-project.org/web/packages/arules/index.html

http://michael.hahsler.net/research/arules_RUG_2015/demo/

(arules: Association Rule Mining with R — A Tutorial, Michael Hahsler, Mon Sep 21 10:51:59 2015)


apriori{arules}: Mining Associations with Apriori

Description

Mine frequent itemsets, association rules or association hyperedges using the Apriori algorithm. The Apriori algorithm employs level-wise search for frequent itemsets. The implementation of Apriori used includes some improvements (e.g., a prefix tree and item sorting).

Usage

apriori(data, parameter = NULL, appearance = NULL, control = NULL)

Arguments

data: object of class transactions or any data structure which can be coerced into transactions (e.g., a binary matrix or data.frame).

parameter: object of class APparameter or named list. The default behavior is to mine rules with support 0.1, confidence 0.8, and maxlen 10.

appearance: object of class APappearance or named list. With this argument item appearance can be restricted (implements rule templates). By default all items can appear unrestricted.

control: object of class APcontrol or named list. Controls the algorithmic performance of the mining algorithm (item sorting, etc.)

Note: Apriori only creates rules with one item in the RHS!


eclat{arules}: Mining Associations with Eclat

Description

Mine frequent itemsets with the Eclat algorithm. This algorithm uses simple intersection operations for equivalence class clustering along with bottom-up lattice traversal.

Usage

eclat(data, parameter = NULL, control = NULL)

Arguments

data: object of class transactions or any data structure which can be coerced into transactions (e.g., binary matrix, data.frame).

parameter: object of class ECparameter or named list (default values are: support 0.1 and maxlen 5)

control: object of class ECcontrol or named list for algorithmic controls.


Case Study 0: Adult Data Set

> # The AdultUCI dataset contains the questionnaire data of the “Adult” database (originally called the “Census Income” Database) with 48842 observations on 15 variables.
> library(arules)
> data(AdultUCI)
> head(AdultUCI)
  age        workclass fnlwgt education education-num     marital-status        occupation
1  39        State-gov  77516 Bachelors            13      Never-married      Adm-clerical
2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse   Exec-managerial
3  38          Private 215646   HS-grad             9           Divorced Handlers-cleaners
4  53          Private 234721      11th             7 Married-civ-spouse Handlers-cleaners
5  28          Private 338409 Bachelors            13 Married-civ-spouse    Prof-specialty
6  37          Private 284582   Masters            14 Married-civ-spouse   Exec-managerial
   relationship  race    sex capital-gain capital-loss hours-per-week native-country income
1 Not-in-family White   Male         2174            0             40  United-States  small
2       Husband White   Male            0            0             13  United-States  small
3 Not-in-family White   Male            0            0             40  United-States  small
4       Husband Black   Male            0            0             40  United-States  small
5          Wife Black Female            0            0             40           Cuba  small
6          Wife White Female            0            0             40  United-States  small
> data(Adult)
> ?Adult # see how to create transactions from AdultUCI
> Adult
transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)
> class(Adult)
[1] "transactions"
attr(,"package")
[1] "arules"
> ?transactions

Adult Data Set (transactions form)

> str(Adult)
Formal class 'transactions' [package "arules"] with 3 slots
  ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. ..@ i       : int [1:612200] 1 10 25 32 35 50 59 61 63 65 ...
  .. .. ..@ p       : int [1:48843] 0 13 26 39 52 65 78 91 104 117 ...
  .. .. ..@ Dim     : int [1:2] 115 48842
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ factors : list()
  ..@ itemInfo   :'data.frame': 115 obs. of 3 variables:
  .. ..$ labels   : chr [1:115] "age=Young" "age=Middle-aged" "age=Senior" "age=Old" ...
  .. ..$ variables: Factor w/ 13 levels "age","capital-gain",..: 1 1 1 1 13 13 13 13 13 13 ...
  .. ..$ levels   : Factor w/ 112 levels "10th","11th",..: 111 63 92 69 30 54 65 82 90 91 ...
  ..@ itemsetInfo:'data.frame': 48842 obs. of 1 variable:
  .. ..$ transactionID: chr [1:48842] "1" "2" "3" "4" ...
> inspect(Adult[1:2])
  items                                                              transactionID
1 {age=Middle-aged, workclass=State-gov, education=Bachelors, marital-status=Never-married, occupation=Adm-clerical, relationship=Not-in-family, race=White, sex=Male, capital-gain=Low, capital-loss=None, hours-per-week=Full-time, native-country=United-States, income=small} 1
2 {age=Senior, workclass=Self-emp-not-inc, education=Bachelors, marital-status=Married-civ-spouse, occupation=Exec-managerial, relationship=Husband, race=White, sex=Male, capital-gain=None, capital-loss=None, hours-per-week=Part-time, native-country=United-States, income=small}

Adult Data Set

> summary(Adult)
transactions as itemMatrix in sparse format with
 48842 rows (elements/itemsets/transactions) and
 115 columns (items) and a density of 0.1089939

most frequent items:
           capital-loss=None            capital-gain=None native-country=United-States                   race=White
                       46560                        44807                        43832                        41762
           workclass=Private                      (Other)
                       33906                       401333

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13
   19   971  2067 15623 30162

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   12.00   13.00   12.53   13.00   13.00

includes extended item information - examples:
           labels variables      levels
1       age=Young       age       Young
2 age=Middle-aged       age Middle-aged
3      age=Senior       age      Senior

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3

How to Create Transactions Data

Transactions can be created by coercion from lists containing transactions, but also from matrices and data.frames. However, you will need to prepare your data first. Association rule mining can only use items and does not work with continuous variables.

http://127.0.0.1:18470/library/arules/html/transactions-class.html

> # creating transactions from a list
> a.list <- list(
+   c("a","b","c"),
+   c("a","b"),
+   c("a","b","d"),
+   c("c","e"),
+   c("a","b","d","e")
+ )
> names(a.list) <- paste0("Customer", c(1:5))
> a.list
$Customer1
[1] "a" "b" "c"

$Customer2
[1] "a" "b"

$Customer3
[1] "a" "b" "d"

$Customer4
[1] "c" "e"

$Customer5
[1] "a" "b" "d" "e"

# avoid "no method or default for coercing"
library(Matrix)
productTD <- as(product_by_user$Product, "transactions")
inspect(productTD[1:5])

Coerce a List Into Transactions

> alist.trans <- as(a.list, "transactions")
> summary(alist.trans) # analyze transactions
transactions as itemMatrix in sparse format with
 5 rows (elements/itemsets/transactions) and
 5 columns (items) and a density of 0.56

most frequent items:
      a       b       c       d       e (Other)
      4       4       2       2       2       0

element (itemset/transaction) length distribution:
sizes
2 3 4
2 2 1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    2.0     2.0     3.0     2.8     3.0     4.0

includes extended item information - examples:
  labels
1      a
2      b
3      c

includes extended transaction information - examples:
  transactionID
1     Customer1
2     Customer2
3     Customer3
> image(alist.trans)

Creating Transactions from a Matrix

> # note: matrix() fills column by column by default (byrow = FALSE),
> # so the rows printed below differ from the rows as typed
> a.matrix <- matrix(c(
+   1,1,1,0,0,
+   1,1,0,0,0,
+   1,1,0,1,0,
+   0,0,1,0,1), ncol = 5)
> dimnames(a.matrix) <- list(
+   paste("Customer", letters[1:4]),
+   paste0("Item", c(1:5)))
> a.matrix
           Item1 Item2 Item3 Item4 Item5
Customer a     1     0     0     0     0
Customer b     1     1     0     1     1
Customer c     1     1     1     0     0
Customer d     0     0     1     0     1
>
> amatirx.trans <- as(a.matrix, "transactions")
> amatirx.trans
transactions in sparse format with
 4 transactions (rows) and
 5 items (columns)
> inspect(amatirx.trans)
    items                     transactionID
[1] {Item1}                   Customer a
[2] {Item1,Item2,Item4,Item5} Customer b
[3] {Item1,Item2,Item3}       Customer c
[4] {Item3,Item5}             Customer d
> summary(amatirx.trans)
transactions as itemMatrix in sparse format with
 4 rows (elements/itemsets/transactions) and
 5 columns (items) and a density of 0.5

most frequent items:
  Item1   Item2   Item3   Item5   Item4 (Other)
      3       2       2       2       1       0

element (itemset/transaction) length distribution:
sizes
1 2 3 4
1 1 1 1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    1.75    2.50    2.50    3.25    4.00

includes extended item information - examples:
  labels
1  Item1
2  Item2
3  Item3

includes extended transaction information - examples:
  transactionID
1    Customer a
2    Customer b
3    Customer c

More Examples

> # creating transactions from data.frame
> a.df <- data.frame(
+   age = as.factor(c(6, 8, NA, 9, 16)),
+   grade = as.factor(c("A", "C", "F", NA, "C")),
+   pass = c(TRUE, TRUE, FALSE, TRUE, TRUE))
> # note: factors are translated to
> # logicals and NAs are ignored
> a.df
   age grade  pass
1    6     A  TRUE
2    8     C  TRUE
3 <NA>     F FALSE
4    9  <NA>  TRUE
5   16     C  TRUE
> adf.trans <- as(a.df, "transactions")
> inspect(adf.trans)
    items                 transactionID
[1] {age=6,grade=A,pass}  1
[2] {age=8,grade=C,pass}  2
[3] {grade=F}             3
[4] {age=9,pass}          4
[5] {age=16,grade=C,pass} 5
> as(adf.trans, "data.frame")
                  items transactionID
1  {age=6,grade=A,pass}             1
2  {age=8,grade=C,pass}             2
3             {grade=F}             3
4          {age=9,pass}             4
5 {age=16,grade=C,pass}             5

> # creating transactions from (IDs, items)
> a.df2 <- data.frame(
+   TID = c(1, 1, 2, 2, 2, 3),
+   item = c("a", "b", "a", "b", "c", "b"))
> a.df2
  TID item
1   1    a
2   1    b
3   2    a
4   2    b
5   2    c
6   3    b
> a.df2.s <- split(a.df2[, "item"], a.df2[,"TID"])
> a.df2.s
$`1`
[1] a b
Levels: a b c

$`2`
[1] a b c
Levels: a b c

$`3`
[1] b
Levels: a b c

> adf2.trans <- as(a.df2.s, "transactions")
> inspect(adf2.trans)
    items   transactionID
[1] {a,b}   1
[2] {a,b,c} 2
[3] {b}     3

Example: Create Transactions Data


> data(AdultUCI)

> summary(AdultUCI)

> # remove attributes

> AdultUCI[["fnlwgt"]] <- NULL

> AdultUCI[["education-num"]] <- NULL

http://127.0.0.1:18470/library/arules/html/Adult.html

Example: Create Transactions Data

> # map metric attributes
> AdultUCI[["age"]] <- ordered(cut(AdultUCI[[ "age"]], c(15, 25, 45, 65, 100)),
+   labels = c("Young", "Middle-aged", "Senior", "Old"))
> AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
+   c(0, 25, 40, 60, 168)),
+   labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
> AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
+   c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
+   labels = c("None", "Low", "High"))
> AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
+   c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
+   labels = c("None", "Low", "High"))
>
> summary(AdultUCI[c("age", "hours-per-week", "capital-gain", "capital-loss")])
          age          hours-per-week    capital-gain  capital-loss
 Young      : 9627   Part-time : 5913   None:44807    None:46560
 Middle-aged:24671   Full-time :28577   Low : 2345    Low : 1166
 Senior     :12741   Over-time :12676   High: 1690    High: 1116
 Old        : 1803   Workaholic: 1676
>
> # create transactions
> MyAdult <- as(AdultUCI, "transactions")
> MyAdult
transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)

Example: Create Transactions Data

> summary(MyAdult)
transactions as itemMatrix in sparse format with
 48842 rows (elements/itemsets/transactions) and
 115 columns (items) and a density of 0.1089939

most frequent items:
           capital-loss=None            capital-gain=None native-country=United-States
                       46560                        44807                        43832
                  race=White            workclass=Private                      (Other)
                       41762                        33906                       401333

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13
   19   971  2067 15623 30162

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   12.00   13.00   12.53   13.00   13.00

includes extended item information - examples:
           labels variables      levels
1       age=Young       age       Young
2 age=Middle-aged       age Middle-aged
3      age=Senior       age      Senior

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3
> inspect(MyAdult[1:2])
    items                            transactionID
[1] {age=Middle-aged,
     workclass=State-gov,
     education=Bachelors,
     marital-status=Never-married,
     occupation=Adm-clerical,
     relationship=Not-in-family,
     race=White,
     sex=Male,
     capital-gain=Low,
     capital-loss=None,
     hours-per-week=Full-time,
     native-country=United-States,
     income=small}                   1

Case Study 1: Groceries Data Set

Description: The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.

Format: Object of class transactions.

> library(arules)
> data(Groceries)
> ?Groceries
> str(Groceries)
Formal class 'transactions' [package "arules"] with 3 slots
  ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. ..@ i       : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
  .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
  .. .. ..@ Dim     : int [1:2] 169 9835
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ factors : list()
  ..@ itemInfo   :'data.frame': 169 obs. of 3 variables:
  .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
  .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
  .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
  ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables

The item information is stored in Groceries@itemInfo.

summary, inspect

> summary(Groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14
  19   20   21   22   23   24   26   27   28   29   32
  14    9   11    4    6    1    1    1    1    3    1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000

includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage
> inspect(Groceries[1:4])
  items
1 {citrus fruit,semi-finished bread,margarine,ready soups}
2 {tropical fruit,yogurt,coffee}
3 {whole milk}
4 {pip fruit,yogurt,cream cheese ,meat spreads}

Apply apriori

> rule0 <- apriori(Groceries)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 983

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

The default behavior is to mine rules with a minimum support of 0.1, a minimum confidence of 0.8, a maximum of 10 items (maxlen), and a maximal time for subset checking of 5 seconds (maxtime).
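The "Absolute minimum support count" in the log is the relative support threshold expressed as a number of transactions. The base-R sketch below reproduces the reported 983 under the assumption that the product of threshold and transaction count is truncated to an integer; the thresholds 0.005 and 0.001 are included only as further illustrative values.

```r
# Number of transactions in Groceries, as reported by summary(Groceries).
n_transactions <- 9835

# Relative minimum support thresholds (0.1 is the apriori() default).
minsup <- c(0.1, 0.005, 0.001)

# Absolute support counts, assuming truncation to an integer.
abs_count <- floor(minsup * n_transactions)
print(abs_count)
```

At the default threshold an item must occur in at least 983 of the 9835 baskets, which is why only 8 of the 169 items survive the first pass and no rule reaches the output.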

Apply apriori With Different Arguments

> rule1 <- apriori(Groceries, parameter=list(support=0.005, confidence=0.64))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
       0.64    0.1    1 none FALSE            TRUE   0.005      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 49

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [120 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [4 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> inspect(rule1)
  lhs                                            rhs          support     confidence lift
1 {butter,whipped/sour cream}                 => {whole milk} 0.006710727 0.6600000  2.583008
2 {pip fruit,whipped/sour cream}              => {whole milk} 0.005998983 0.6483516  2.537421
3 {pip fruit,root vegetables,other vegetables} => {whole milk} 0.005490595 0.6750000  2.641713
4 {tropical fruit,root vegetables,yogurt}     => {whole milk} 0.005693950 0.7000000  2.739554
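The lift column can be re-derived from the confidence column and the frequency of the right-hand side, since lift(A ⇒ B) = confidence(A ⇒ B)/P(B). The sketch below uses only numbers printed in this document: whole milk occurs in 2513 of the 9835 transactions according to summary(Groceries). It is a consistency check in base R, not a call into arules.

```r
n_transactions <- 9835   # from summary(Groceries)
n_whole_milk   <- 2513   # "whole milk" count from summary(Groceries)

# Confidence values of the four rules, as printed by inspect(rule1).
confidence <- c(0.6600000, 0.6483516, 0.6750000, 0.7000000)

# lift = confidence / P(rhs), with P(whole milk) = 2513 / 9835.
lift <- confidence / (n_whole_milk / n_transactions)
print(round(lift, 6))
```

All four rules have a lift of roughly 2.5 to 2.7, meaning that whole milk is about two and a half times as likely in baskets containing the left-hand side as in baskets overall.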

Class 'rules'

> str(rule1)
Formal class 'rules' [package "arules"] with 4 slots
  ..@ lhs    :Formal class 'itemMatrix' [package "arules"] with 3 slots
  .. .. ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. .. .. ..@ i       : int [1:10] 25 30 15 30 15 19 22 14 19 29
  .. .. .. .. ..@ p       : int [1:5] 0 2 4 7 10
  .. .. .. .. ..@ Dim     : int [1:2] 169 4
  .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. ..@ factors : list()
  .. .. ..@ itemInfo   :'data.frame': 169 obs. of 3 variables:
  .. .. .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
  .. .. .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
  .. .. .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
  .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
  ..@ rhs    :Formal class 'itemMatrix' [package "arules"] with 3 slots
  .. .. ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. .. .. ..@ i       : int [1:4] 24 24 24 24
  .. .. .. .. ..@ p       : int [1:5] 0 1 2 3 4
  .. .. .. .. ..@ Dim     : int [1:2] 169 4
  .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. ..@ factors : list()
  .. .. ..@ itemInfo   :'data.frame': 169 obs. of 3 variables:
  .. .. .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
  .. .. .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
  .. .. .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
  .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
  ..@ quality:'data.frame': 4 obs. of 3 variables:
  .. ..$ support   : num [1:4] 0.00671 0.006 0.00549 0.00569
  .. ..$ confidence: num [1:4] 0.66 0.648 0.675 0.7
  .. ..$ lift      : num [1:4] 2.58 2.54 2.64 2.74
  ..@ info   :List of 4
  .. ..$ data : symbol Groceries
> rule1@quality
      support confidence     lift
1 0.006710727  0.6600000 2.583008
2 0.005998983  0.6483516 2.537421
3 0.005490595  0.6750000 2.641713

Select Top AR by Support

45/71

> rule2 <- apriori(Groceries, parameter=list(support=0.001, confidence=0.5))

Apriori

Parameter specification:

 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.5    0.1    1 none FALSE            TRUE   0.001      1     10  rules FALSE

Algorithmic control:

 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 9

set item appearances ...[0 item(s)] done [0.00s].

set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].

sorting and recoding items ... [157 item(s)] done [0.00s].

creating transaction tree ... done [0.00s].

checking subsets of size 1 2 3 4 5 6 done [0.01s].

writing ... [5668 rule(s)] done [0.00s].

creating S4 object ... done [0.00s].

>

> rule2.sorted_sup <- sort(rule2, by="support")

> inspect(rule2.sorted_sup[1:5])

     lhs                                      rhs          support    confidence lift
1472 {other vegetables,yogurt}             => {whole milk} 0.02226741 0.5128806  2.007235
1467 {tropical fruit,yogurt}               => {whole milk} 0.01514997 0.5173611  2.024770
1449 {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423  1.984385
1469 {root vegetables,yogurt}              => {whole milk} 0.01453991 0.5629921  2.203354
1454 {pip fruit,other vegetables}          => {whole milk} 0.01352313 0.5175097  2.025351
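The relative supports printed by `inspect()` can be turned back into transaction counts, which are often easier to interpret. The following is a minimal base-R sketch that uses only figures printed in the log and listing above (9835 transactions, minimum support 0.001, the three largest supports); the variable names are illustrative and not part of `arules`.

```r
n_transactions <- 9835    # from the "set transactions" line of the log
min_support    <- 0.001   # support threshold passed to apriori()

# 0.001 * 9835 = 9.835; truncating gives 9, which is consistent with the
# reported "Absolute minimum support count: 9".
min_count <- floor(min_support * n_transactions)

# Supports of the three top-ranked rules, copied from the listing above
top_support <- c(0.02226741, 0.01514997, 0.01464159)

# Number of transactions that contain every item of each rule
top_count <- round(top_support * n_transactions)
top_count
```

Read this way, the top rule {other vegetables,yogurt} => {whole milk} is backed by a few hundred baskets, while a rule sitting at the 0.001 threshold is backed by fewer than ten.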

Select a Subset of Rules

> # Select a subset of rules using partial matching on the items

> # in the right-hand-side and a quality measure

> rule2.sub <- subset(rule2, subset = rhs %pin% "whole milk" & lift > 1.3)

> rule2.sub

set of 2679 rules

>

> # Display the top 3 rules by support

> inspect(head(rule2.sub, n = 3, by = "support"))

     lhs                                      rhs          support    confidence lift
1472 {other vegetables,yogurt}             => {whole milk} 0.02226741 0.5128806  2.007235
1467 {tropical fruit,yogurt}               => {whole milk} 0.01514997 0.5173611  2.024770
1449 {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423  1.984385

>

> # Display the first 3 rules

> inspect(rule2.sub[1:3])

  lhs                 rhs          support     confidence lift
1 {honey}          => {whole milk} 0.001118454 0.7333333  2.870009
3 {cocoa drinks}   => {whole milk} 0.001321810 0.5909091  2.312611
4 {pudding powder} => {whole milk} 0.001321810 0.5652174  2.212062

>

> # Get labels for the first 3 rules

> labels(rule2.sub[1:3])

[1] "{honey} => {whole milk}" "{cocoa drinks} => {whole milk}"

[3] "{pudding powder} => {whole milk}"

> labels(rule2.sub[1:3], itemSep = " + ", setStart = "", setEnd="", ruleSep = " ---> ")

[1] "honey ---> whole milk" "cocoa drinks ---> whole milk"

[3] "pudding powder ---> whole milk"

Select Top AR by Confidence, Lift

> rule2.sorted_con <- sort(rule2, by="confidence")

> inspect(rule2.sorted_con[1:5])

     lhs                                           rhs          support     confidence lift
113  {rice,sugar}                               => {whole milk} 0.001220132 1          3.913649
258  {canned fish,hygiene articles}             => {whole milk} 0.001118454 1          3.913649
1487 {root vegetables,butter,rice}              => {whole milk} 0.001016777 1          3.913649
1646 {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521 1          3.913649
1670 {butter,soft cheese,domestic eggs}         => {whole milk} 0.001016777 1          3.913649

>

> rule2.sorted_lift <- sort(rule2, by="lift")

> inspect(rule2.sorted_lift[1:5])

    lhs                                   rhs              support     confidence lift
53  {Instant food products,soda}       => {hamburger meat} 0.001220132 0.6315789  18.99565
37  {soda,popcorn}                     => {salty snack}    0.001220132 0.6315789  16.69779
444 {flour,baking powder}              => {sugar}          0.001016777 0.5555556  16.40807
327 {ham,processed cheese}             => {white bread}    0.001931876 0.6333333  15.04549
55  {whole milk,Instant food products} => {hamburger meat} 0.001525165 0.5000000  15.03823

sort(x, decreasing = TRUE, na.last = NA, by = "support", order = FALSE, ...)

## S4 method for signature 'associations'

head(x, n = 6L, by = NULL, decreasing = TRUE, ...)

## S4 method for signature 'associations'

tail(x, n = 6L, by = NULL, decreasing = TRUE, ...)
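The lift column in these listings can be checked by hand, since lift(X => Y) = confidence(X => Y) / support(Y). The sketch below is a small base-R check under that definition; it uses the support of {whole milk} (0.2555160) reported in the frequent-itemset listing of these slides and three confidence values copied from rules with {whole milk} on the right-hand side. The variable names are illustrative.

```r
# Support of the consequent {whole milk}, taken from the frequent-itemset listing
support_whole_milk <- 0.2555160

# Confidence and reported lift of three rules with rhs = {whole milk}:
#   {rice,sugar}, {other vegetables,yogurt}, {honey}
confidence    <- c(1, 0.5128806, 0.7333333)
reported_lift <- c(3.913649, 2.007235, 2.870009)

# Lift is confidence divided by the support of the right-hand side
lift <- confidence / support_whole_milk
round(lift, 6)

# TRUE if the recomputed values agree with the printed ones up to rounding
lift_matches <- isTRUE(all.equal(lift, reported_lift, tolerance = 1e-5))
lift_matches
```

This also explains why every confidence-1 rule with {whole milk} as consequent shows the same lift of 3.913649: the value is simply 1 / 0.2555160.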

Select Top Frequent Itemsets

> rule.freq_item <- apriori(Groceries, parameter=list(support=0.001, target="frequent itemsets"), control=list(sort=-1))

Apriori

Parameter specification:

 confidence minval smax arem  aval originalSupport support minlen maxlen            target   ext
         NA    0.1    1 none FALSE            TRUE   0.001      1     10 frequent itemsets FALSE

Algorithmic control:

 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE   -1    TRUE

Absolute minimum support count: 9

...

checking subsets of size 1 2 3 4 5 6 done [0.02s].

writing ... [13492 set(s)] done [0.00s].

creating S4 object ... done [0.00s].

> rule.freq_item

set of 13492 itemsets

> inspect(rule.freq_item[1:5])

  items              support
1 {whole milk}       0.2555160
2 {other vegetables} 0.1934926
3 {rolls/buns}       0.1839349
4 {soda}             0.1743772
5 {yogurt}           0.1395018

Frequent k-itemsets

> rule.fi_eclat <- eclat(Groceries, parameter=list(minlen=1, maxlen=3, support=0.001, target="frequent itemsets"), control=list(sort=-1))

Eclat ...

> rule.fi_eclat

set of 9969 itemsets

> inspect(rule.fi_eclat[1:5])

  items                       support
1 {whole milk,honey}          0.001118454
2 {whole milk,cocoa drinks}   0.001321810
3 {whole milk,pudding powder} 0.001321810
4 {tidbits,rolls/buns}        0.001220132
5 {tidbits,soda}              0.001016777

> rule.fi_eclat <- eclat(Groceries, parameter=list(minlen=3, maxlen=5, support=0.001, target="frequent itemsets"), control=list(sort=-1))

Eclat ...

> rule.fi_eclat

set of 10344 itemsets

> inspect(rule.fi_eclat[1:5])

  items                                         support
1 {liver loaf,whole milk,yogurt}                0.001016777
2 {tropical fruit,other vegetables,curd cheese} 0.001016777
3 {whole milk,curd cheese,rolls/buns}           0.001016777
4 {other vegetables,whole milk,curd cheese}     0.001220132
5 {other vegetables,whole milk,cleaner}         0.001016777
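To make concrete what a "frequent k-itemset" is, the following base-R sketch enumerates all itemsets of exactly k items by brute force and keeps those whose support reaches a threshold. It is only an illustration on the six-basket toy example from the beginning of these slides, not how `eclat()` or `apriori()` work internally; the helper names `itemset_support` and `frequent_k_itemsets` are hypothetical.

```r
# Toy transactions: the six baskets from the introductory example
baskets <- list(
  c("Bread", "Peanuts", "Milk", "Fruit", "Jam"),
  c("Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"),
  c("Steak", "Jam", "Soda", "Chips", "Bread"),
  c("Jam", "Soda", "Peanuts", "Milk", "Fruit"),
  c("Jam", "Soda", "Chips", "Milk", "Bread"),
  c("Fruit", "Soda", "Chips", "Milk")
)

# Relative support: share of baskets that contain every item of the itemset
itemset_support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Brute-force search: test every combination of k distinct items
frequent_k_itemsets <- function(baskets, k, min_support) {
  items <- sort(unique(unlist(baskets)))
  candidates <- combn(items, k, simplify = FALSE)
  support <- vapply(candidates, itemset_support, numeric(1), baskets = baskets)
  keep <- support >= min_support
  data.frame(
    items = vapply(candidates[keep], paste, character(1), collapse = ","),
    support = support[keep],
    stringsAsFactors = FALSE
  )
}

# Frequent 2-itemsets that occur in at least half of the baskets
freq2 <- frequent_k_itemsets(baskets, k = 2, min_support = 0.5)
freq2[order(-freq2$support), ]
```

Restricting `k` here plays the same role as `minlen` and `maxlen` in the `eclat()` calls above. Exhaustive enumeration is only practical for toy data, because the number of candidate itemsets grows combinatorially with the number of items.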
