Frequent Pattern Mining (頻繁樣式的勘測)

- Mining Association Rules

Ming-Yen Lin, IECS.FCU

Outline

• What is frequent pattern mining?

• Frequent pattern mining methods

• Constraint-based frequent pattern mining

• Sequential patterns

• Applications of frequent patterns

• Research problems in frequent pattern mining

Frequent Pattern Mining

• Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]

• Frequent pattern mining (F.P.M.): finding the regularities in data

– What products were often purchased together? Beer and diapers?!

– What are the subsequent purchases after buying a PC?

– What kinds of DNA are sensitive to this new drug?

– Can we automatically classify web documents?

F.P.M. Is a Fundamental Data Mining Task

• It is the foundation of many data mining tasks

– Association rules, correlation, causality

– sequential patterns, temporal/cyclic association, partial periodicity

– spatial and multimedia patterns – associative classification – cluster analysis

– iceberg cube, …

• Broad applications

– Market basket analysis – cross-marketing – catalog design – sales campaign analysis

– Web log (click stream) analysis, DNA sequence analysis, …

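Basic Concepts: Frequent Itemsets

• Itemset X = {x₁, …, x_k}

– e.g. {A,C}, {B,E,F}, {C,E}

• Support of an itemset: the fraction of transactions that contain it

– s(A) = 3/4

• Frequent itemset: an itemset that satisfies the minimum support (m.s.)

• Frequent pattern mining: find all the frequent patterns

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Pattern = set of items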
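The support definition above can be sketched in a few lines of Python. This is only an illustration on the slide's four transactions; the names `transactions` and `support` are made up for the sketch.

```python
# Transactions from the slide's table (Transaction-id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)

print(support({"A"}, transactions))       # s(A) = 3/4
print(support({"A", "C"}, transactions))  # s({A, C}) = 2/4
```

With m.s. = 50%, {A} and {A, C} are frequent under this definition, while {B, E, F} (support 1/4) is not.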

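Basic Concepts: Association Rules

• Association rule mining: once the frequent itemsets are found, determine the relationships among the "interesting" itemsets

– Confidence, c: the conditional probability that a transaction containing X also contains Y

– Support, s: the probability that a transaction contains X∪Y

• m.s. = 50%, m.c. = 50%

– A → C (50%, 66.7%) – C → A (50%, 100%)

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

[Figure: Venn diagram of "Customer buys beer" and "Customer buys diaper"; the overlap is "Customer buys both"]

m.s.: minimum support; m.c.: minimum confidence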

Mining Association Rules (Example)

Min. support 50%

Min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A → C:

support = support({A}∪{C}) = 50%

confidence = support({A}∪{C})/support({A}) = 66.7%
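The support and confidence of a rule can be checked with a short Python sketch. It assumes the four-transaction table of this example; `support` and `rule_metrics` are illustrative names, not part of any tool mentioned in these slides.

```python
# The four transactions of the example (ids 10, 20, 30, 40).
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset, db):
    """Fraction of transactions containing every item of itemset."""
    return sum(1 for t in db if set(itemset) <= t) / len(db)

def rule_metrics(x, y, db):
    """Return (support, confidence) of the rule X -> Y."""
    s_xy = support(set(x) | set(y), db)
    return s_xy, s_xy / support(x, db)

for x, y in [("A", "C"), ("C", "A")]:
    s, c = rule_metrics({x}, {y}, db)
    print(f"{x} -> {y}: support {s:.1%}, confidence {c:.1%}")
```

Both rules pass m.s. = 50% and m.c. = 50%; the confidences differ only because {A} (75%) and {C} (50%) have different supports.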

Types of Association Rules

• Boolean vs. quantitative

– buys(x, "SQLServer") ^ buys(x, "DMBook") → buys(x, "DM Software") [0.2%, 60%]

– age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]

• Single-dimensional vs. multi-dimensional

• Single-level vs. multiple-level

– What brands of beers are associated with what brands of diapers?

Extensions and Applications of Association Rules

• Correlation, causality analysis & mining interesting rules

• Maxpatterns and frequent closed itemsets

• Constraint-based mining

• Sequential patterns

• Association-based classification

• Computing iceberg cubes

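Frequent Pattern Mining Methods

• The Apriori method, its variations and improvements

– Mining methods that do not generate candidate patterns

• Max-patterns and closed patterns

– Compact representations

• Multi-dimensional and multiple-level frequent patterns

• Interestingness: correlation and causality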

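Apriori: Candidate Generation-and-Test

• Generate candidates, then test them

– Generate the candidate itemsets of length (k+1) from the frequent itemsets of length k

– Count their support against the database and check whether they are frequent

• Candidate condition: every subset of a candidate must be frequent

– Property: anti-monotone

– Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}

– If {beer, diaper, nuts} is frequent → {beer, diaper} must be frequent

– No superset of a non-frequent itemset can be frequent (so it is never a candidate: it need not be generated or counted)

– This eliminates many useless combinations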

Apriori Example

m.s. = 2 (sup is given as a count, not as a percentage)

Database

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C₁

Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L₁

Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C₂ (generated from L₁)

Itemset: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C₂ with counts

Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L₂

Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C₃ (generated from L₂)

Itemset: {B, C, E}

3rd scan → L₃

Itemset     sup
{B, C, E}   2

Apriori Algorithm Details

L₁ = {frequent items}; k = 1;

repeat:

    if L_k = ∅, stop;

    C_{k+1} = candidates generated from L_k;

    for each transaction t in database D:

        add one to the support count of every candidate in C_{k+1} that is contained in t

    L_{k+1} = the candidates in C_{k+1} that satisfy the minimum support

    k = k + 1;

Answer: the union of all L_k;

• L_k: the frequent itemsets of size k

• C_k: the candidate itemsets of size k
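The loop above can be written as a small runnable sketch. This is a plain, unoptimized rendering for illustration: candidates are formed by unioning pairs of frequent k-itemsets and counted with a linear scan, not with the hash tree discussed later. The function name `apriori` and its `min_count` parameter are choices made for this sketch.

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {frozenset: support count} for all frequent itemsets in db."""
    # L1: frequent single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    answer = dict(level)
    k = 1
    while level:
        # C_{k+1}: unions of two frequent k-itemsets that have size k+1
        # and whose k-subsets are all frequent (anti-monotone pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(
                frozenset(s) in level for s in combinations(u, k)
            ):
                candidates.add(u)
        # One database scan: count each candidate contained in a transaction.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        answer.update(level)
        k += 1
    return answer

# Database of the Apriori example slide, m.s. = 2 (count).
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(db, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)
```

On the example database this yields four frequent 1-itemsets, four frequent 2-itemsets, and {B, C, E} with count 2.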

Apriori Key Detail I

• C_{k+1} = candidates generated from L_k;

– Step 1: self-joining L_k

– Step 2: pruning (eliminating the impossible candidates)

• Example: L₃ = {abc, abd, acd, ace, bcd}, kept in sorted order

– Self-joining: L₃ * L₃

• abcd: from abc and abd

• acde: from acd and ace

– Pruning:

• acde is removed because ade is not in L₃

• C₄ = {abcd}
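The join and prune steps can be sketched as follows, under the assumption that itemsets are kept as sorted tuples. `generate_candidates` is an illustrative name; it returns both the joined list and the pruned list so the two steps can be inspected separately.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Join sorted k-itemsets sharing their first k-1 items, then prune."""
    frequent_k = sorted(tuple(sorted(s)) for s in frequent_k)
    known = set(frequent_k)
    k = len(frequent_k[0])
    joined, kept = [], []
    for p, q in combinations(frequent_k, 2):
        if p[:-1] == q[:-1]:  # same (k-1)-prefix; p[-1] < q[-1] by sort order
            joined.append(p + (q[-1],))
    for c in joined:
        # Keep c only if every k-subset of it is a known frequent itemset.
        if all(s in known for s in combinations(c, k)):
            kept.append(c)
    return joined, kept

L3 = ["abc", "abd", "acd", "ace", "bcd"]
joined, C4 = generate_candidates(L3)
print(["".join(c) for c in joined])  # after self-join
print(["".join(c) for c in C4])      # after pruning
```

For the slide's L₃ the join produces abcd and acde, and pruning drops acde because ade is missing from L₃.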

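Apriori Key Detail II

• Adding one to the count of every candidate contained in t: where is the difficulty?

– The total number of candidates is very large

– Each transaction t may contain more than one candidate

• Method:

– Organize the candidates carefully (store them in a hash tree)

– Find all the candidates that may be contained in t and test them

[Figure: hash tree of candidate 3-itemsets with subset function 1,4,7 / 2,5,8 / 3,6,9. Candidates: 2 3 4, 5 6 7, 1 4 5, 1 3 6, 1 2 4, 4 5 7, 1 2 5, 4 5 8, 1 5 9, 3 4 5, 3 5 6, 3 5 7, 6 8 9, 3 6 7, 3 6 8. Transaction 1 2 3 5 6 is matched by splitting it as 1 + 2 3 5 6, then 1 2 + 3 5 6 and 1 3 + 5 6.]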
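As a simpler stand-in for the hash tree, the counting step can be illustrated by enumerating the k-subsets of a transaction and looking each one up in a Python dictionary. This is not the hash-tree method itself, only a sketch of the same task (finding every candidate contained in t); the candidate list is the one recoverable from the figure.

```python
from itertools import combinations

candidates = [
    (2, 3, 4), (5, 6, 7), (1, 4, 5), (1, 3, 6), (1, 2, 4),
    (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
support_count = {c: 0 for c in candidates}

def count_transaction(transaction, support_count, k=3):
    """Increment the count of every candidate k-itemset contained in transaction."""
    for subset in combinations(sorted(transaction), k):
        if subset in support_count:  # dictionary lookup replaces the tree walk
            support_count[subset] += 1

count_transaction([1, 2, 3, 5, 6], support_count)
matched = sorted(c for c, n in support_count.items() if n > 0)
print(matched)
```

Enumerating all k-subsets costs C(|t|, k) lookups per transaction, which is why long transactions motivate a structure that only visits subsets leading to stored candidates.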

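Challenges of the Mining Methods

Challenges (these are also the directions in which Apriori has been varied and improved)

• Too many scans of the entire database

– Reduce the number of scans

• Still too many candidates

– Reduce the number of candidates

• Counting candidate support is tedious

– Speed up the counting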

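Other Methods

• Improvements of Apriori

– DIC – DHP – Partition – Sampling – ...

• Generate no candidates: compress the database, then mine it

– FP-Growth, H-mine

• Use intersections of items (the database is stored in a vertical layout)

– Eclat/MaxEclat, VIPER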
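The vertical-layout idea can be illustrated with transaction-id sets: each item maps to the set of transactions that contain it, and the support of an itemset is the size of the intersection. This sketch only shows that idea on the Apriori example database; it is not an implementation of Eclat or VIPER, and the names `tidlists` and `support_count` are made up.

```python
# Horizontal layout: transaction id -> items (the Apriori example database).
db = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}

# Vertical layout: item -> set of transaction ids (tid-list).
tidlists = {}
for tid, items in db.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def support_count(itemset):
    """Support count of an itemset = size of the intersection of its tid-lists."""
    return len(set.intersection(*(tidlists[i] for i in itemset)))

print(support_count(["B", "E"]))       # tids shared by B and E
print(support_count(["B", "C", "E"]))  # tids shared by B, C and E
```

No database scan is needed after the tid-lists are built; counting becomes set intersection.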

Association Rules Visualization: Pane Graph

Association Rules Visualization: Rule Graph

Max-patterns

• A frequent pattern {a₁, …, a₁₀₀} implies $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ frequent sub-patterns!

• Max-pattern: a frequent pattern that has no frequent proper super-pattern

– BCDE, ACD are max-patterns – BCD is not a max-pattern

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

m.s. = 2
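The max-pattern definition can be checked by brute force on the three-transaction table. The sketch below enumerates every itemset, which is only feasible for a toy database; it is meant as an illustration of the definition, not as a mining method.

```python
from itertools import combinations

db = [set("ABCDE"), set("BCDE"), set("ACDF")]
min_count = 2

items = sorted(set().union(*db))
# All frequent itemsets, by exhaustive enumeration.
frequent = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if sum(1 for t in db if set(c) <= t) >= min_count
]
# A max-pattern is a frequent itemset with no frequent proper superset.
max_patterns = [x for x in frequent if not any(x < y for y in frequent)]
print(sorted("".join(sorted(p)) for p in max_patterns))
```

BCD is frequent but is contained in the frequent itemset BCDE, so it does not appear in the result.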

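Frequent Closed Patterns

• Conf(ac → d) = 100% ⇒ it is enough to record acd

• A frequent itemset X is a closed pattern if there is no item y (outside X) such that every transaction containing X also contains y

– "acdf" is a frequent closed pattern (for m.s. ≤ 2); acd alone is not closed, because both transactions that contain acd (10 and 40) also contain f

• A compact representation of the frequent patterns

• Reduces the number of patterns and rules

TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f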
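The closedness test in the definition can be sketched by computing the closure of an itemset, i.e. the items shared by every transaction that contains it. The helper names `closure` and `is_closed` are illustrative; the data is the five-transaction table of this slide.

```python
db = {
    10: set("acdef"),
    20: set("abe"),
    30: set("cef"),
    40: set("acdf"),
    50: set("cef"),
}

def closure(itemset, db):
    """Items shared by every transaction that contains itemset."""
    covering = [t for t in db.values() if set(itemset) <= t]
    return set.intersection(*covering) if covering else set()

def is_closed(itemset, db):
    """An itemset is closed when its closure adds no further item."""
    return closure(itemset, db) == set(itemset)

print(sorted(closure("ac", db)))   # every transaction with a and c also has d and f
print(is_closed("acd", db), is_closed("acdf", db))
```

On this table the closure of ac (and of acd) is {a, c, d, f}, which is why Conf(ac → d) is 100%.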

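Multiple-Level Association Rules

• Items often form a concept hierarchy

• Support thresholds should be flexible: lower levels should use a lower support

• Encode the transaction database according to dimensions and levels

• Explore mining across multiple levels

Level 1: Milk [support = 10%]

Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%

Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%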

Multi-Dimensional Association

• Single-dimensional rules

buys(X, "milk") ⇒ buys(X, "bread")

• Multi-dimensional rules: ≥ 2 dimensions or predicates

– Inter-dimension assoc. rules (no repeated predicates)

age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")

– Hybrid-dimension assoc. rules (repeated predicates)

age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")

• Categorical attributes

– finite number of possible values, no ordering among values

• Quantitative attributes

– numeric, implicit ordering among values

Distance-based Association Rules

• Binning does not necessarily capture the semantics of interval data

• Distance-based partitioning gives a more meaningful discretization, considering:

– density/number of points in an interval – "closeness" of points in an interval

Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
7          [0,10]                   [7,20]                 [7,7]
20         [11,20]                  [22,50]                [20,22]
22         [21,30]                  [51,53]                [50,53]
50         [31,40]
51         [41,50]
53         [51,60]
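The difference between equi-depth and distance-based partitioning can be reproduced on the six prices of the table. The sketch below uses a simple gap rule for the distance-based case (start a new interval when the gap to the previous value exceeds `max_gap`); that rule and its threshold are assumptions made for illustration, since the slide does not specify the clustering procedure.

```python
prices = [7, 20, 22, 50, 51, 53]

def equi_depth(values, depth):
    """Bins holding `depth` consecutive sorted values each."""
    v = sorted(values)
    return [(v[i], v[min(i + depth, len(v)) - 1]) for i in range(0, len(v), depth)]

def distance_based(values, max_gap):
    """Start a new bin whenever the gap to the previous value exceeds max_gap."""
    v = sorted(values)
    bins = [[v[0], v[0]]]
    for x in v[1:]:
        if x - bins[-1][1] > max_gap:
            bins.append([x, x])
        else:
            bins[-1][1] = x
    return [tuple(b) for b in bins]

print(equi_depth(prices, depth=2))
print(distance_based(prices, max_gap=5))  # max_gap = 5 is an assumed threshold
```

Equi-depth places 22 and 50 in the same interval although they are far apart, whereas the gap rule keeps 20 and 22 together and 50 to 53 together.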

Interestingness Measure: Correlations

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading!

– The overall percentage of students eating cereal is 75%, which is higher than 66.7%.

• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although its support and confidence are lower

• Measure of dependent/correlated events: lift

              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

$\mathrm{corr}_{A,B} = \dfrac{P(A \cup B)}{P(A)\,P(B)}$
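The lift of both rules can be computed directly from the contingency table. In the sketch, P(A∪B) is read as in the support definition earlier in these slides, i.e. the probability that a transaction contains both A and B; the variable and function names are illustrative.

```python
n = 5000
basketball = 3000
cereal = 3750
basketball_and_cereal = 2000
basketball_and_not_cereal = 1000

def lift(n_ab, n_a, n_b, n):
    """P(A and B) / (P(A) * P(B)), computed from counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

lift_cereal = lift(basketball_and_cereal, basketball, cereal, n)
lift_not_cereal = lift(basketball_and_not_cereal, basketball, n - cereal, n)
print(round(lift_cereal, 3), round(lift_not_cereal, 3))
```

A lift below 1 indicates negative correlation and a lift above 1 positive correlation, so the second rule is the informative one even though its support and confidence are lower.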

Constraint-based Frequent Pattern Mining

• Finding all the patterns in a database autonomously? Unrealistic!

– The patterns could be too many but not focused!

• Data mining should be an interactive process

– User directs what to be mined using a data mining query language (or a graphical user interface)

• Constraint-based mining

– User flexibility: provides constraints on what to be mined

– System optimization: explores such constraints for efficient mining

LexMiner

• Ming-Yen Lin and Suh-Yin Lee, "A Fast Lexicographic Algorithm for Association Rule Mining in Web Applications," Proceedings of the ICDCS Workshop on Knowledge Discovery and Data Mining in the World-Wide Web (ICDCS00), Taipei, Taiwan, R.O.C., pp. F7-F14, 2000.

Apriori Uses a Hash Tree to Store Candidates

• Excess Comparison, e.g. T₁ = {1, 2, 3, 4, 5, 6}

• Duplicate Counting Avoidance, e.g. T₂ = {1, 3, 4, 200, 401, 403}

• Large Storage Requirement

• High Splitting Cost

[Figure: hash tree with hash function = x MOD 199, size of each leaf = 5, d = depth (levels d = 1, 2, 3), showing interior nodes, leaves and empty leaves. Leaves hold candidates such as (1,3,4), (1,3,9), (1,401,403), (200,401,403), (200,401,555), (1,2,83), (1,2,95), (1,2,96), (2,3,4), (2,3,5), (198,201,203), (198,400,555).]

LexTree: Lexicographically Ordered Tree

• Intrinsic Property: Lexicographic Order

– Items in each transaction, eg. {7, 11, 20, 29, 37}

– k-itemsets, eg. (1, 3, 4) < (1, 3, 10) < (1, 4, 10)

• Storing itemsets: by lexicographic order

• LexTree: compact, hierarchical tree

– candidate LexTree: efficient, redundant-free support counting

– frequent LexTree: effective candidate generation

• LexMiner Algorithm

LexTree Structure

[Figure: LexTree storing the 3-itemsets (1, 3, 4), (1, 3, 10), (1, 3, 11), (1, 4, 10), (1, 4, 11), (1, 4, 15), (1, 10, 15), (3, 4, 7), (3, 4, 10), (3, 4, 11), (3, 7, 11), (3, 7, 20), (3, 11, 20), (4, 7, 10), (4, 10, 15), (7, 11, 20); # marks a support field]

Internal node: Item_ID | sibling | next

Leaf node: Item_ID | sibling | support

LexTree Construction

Insert candidates (1, 3, 4), (1, 3, 10), (1, 3, 11), (1, 4, 10)

(1) Insert candidate (1, 3, 4)

(2) Insert candidate (1, 3, 10)

(3) Insert candidate (1, 3, 11)

(4) Insert candidate (1, 4, 10)

LexTree Construction (Cont.)

(5) Insert candidate (1, 10, 15) after (1, 4, 11), (1, 4, 15) inserted

(6) Insert candidate (3, 4, 7)

[Figure: the tree after each insertion, with the pointers Last[1], Last[2], Last[3] marked and leaf supports initialized to 0]

Notations

• D : The database of transactions

• T : A transaction, T = {x₁, x₂, …, x_p, …, x_m}

• x₁, x₂, …, x_k : Items

• minsup : The minimum support specified by the user

• X : k-itemset, X = (x₁, x₂, …, x_k)

• X.support : The support of itemset X

• C_k : The set of candidate k-itemsets

• L_k : The set of frequent k-itemsets

• Γ_Ck : The candidate k-itemset LexTree

• Γ_Lk : The frequent k-itemset LexTree

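Algorithm LexMiner

[Flowchart, read as steps]

1. Find L₁. If L₁ = ∅, go to step 6.

2. Find L₂; set k = 3.

3. If L_{k-1} = ∅, go to step 6.

4. Build the frequent LexTree Γ_Lk-1 from L_{k-1}; generate C_k from Γ_Lk-1; store C_k in the candidate LexTree Γ_Ck.

5. ∀T ∈ D, Find_and_increment(T, Γ_Ck); L_k = {X ∈ C_k | X.support ≥ minsup}; k++; go to step 3.

6. Answer = ∪_k L_k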

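Algorithm Find_and_Increment

[Flowchart: Find_and_Increment(tp, cp), where tp points into the ordered item list of a transaction and cp points to a node of the candidate LexTree. The loop runs while (not end_of_list) and (cp ≠ null) and compares item[tp] with cp.ID; the three outcomes (<, >, =) select among "advance tp", "advance cp" and "advance tp and cp". On a match at a leaf node, cp.support++. On a match at an internal node (cp ≠ leaf), the search continues recursively through Find_and_Increment(tp+1, cp.next) and Find_and_Increment(tp+1, cp.sibling).]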

Example 1: #Comparison Minimized

At pass 3, T₁ = {7, 11, 20, 29, 37}

Intrinsically ordered 3-itemsets in T₁

{ 7, 11} × {20, 29, 37}

{ 7, 20} × {29, 37}

{ 7, 29} × {37}

{11, 20} × {29, 37}

{11, 29} × {37}

{20, 29} × {37}

[Figure: candidate 3-itemset LexTree, Γ_C3, with the visited nodes labeled Na to Nf]

Example 2: Fast Support Counting

At pass 3, T₂ = {3, 4, 7, 10, 11}

Intrinsically ordered 3-itemsets in T₂

{3, 4} × {7, 10, 11}

{3, 7} × {10, 11}

{3, 10} × {11}

{4, 7} × {10, 11}

{4, 10} × {11}

{7, 10} × {11}

[Figure: candidate 3-itemset LexTree, Γ_C3, with the visited nodes labeled Na to Nl]
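The counting idea of the two examples can be imitated with a prefix tree built from nested Python dictionaries. This is a simplified stand-in: the dictionaries replace the sibling/next links of the LexTree nodes, and the functions `build_tree`, `find_and_increment` and `supports` are written for this sketch rather than taken from the LexMiner paper.

```python
candidates = [
    (1, 3, 4), (1, 3, 10), (1, 3, 11), (1, 4, 10), (1, 4, 11), (1, 4, 15),
    (1, 10, 15), (3, 4, 7), (3, 4, 10), (3, 4, 11), (3, 7, 11), (3, 7, 20),
    (3, 11, 20), (4, 7, 10), (4, 10, 15), (7, 11, 20),
]

def build_tree(itemsets):
    """Nested dicts keyed by item; a complete itemset ends in a [count] cell."""
    root = {}
    for itemset in itemsets:
        node = root
        for item in itemset[:-1]:
            node = node.setdefault(item, {})
        node[itemset[-1]] = [0]  # leaf: mutable support counter
    return root

def find_and_increment(items, node):
    """Count every stored itemset that is a subsequence of the sorted items."""
    for i, item in enumerate(items):
        child = node.get(item)
        if child is None:
            continue  # no candidate continues with this item
        if isinstance(child, list):
            child[0] += 1
        else:
            find_and_increment(items[i + 1:], child)

def supports(node, prefix=()):
    """Flatten the tree back into {itemset: count}."""
    out = {}
    for item, child in node.items():
        if isinstance(child, list):
            out[prefix + (item,)] = child[0]
        else:
            out.update(supports(child, prefix + (item,)))
    return out

tree = build_tree(candidates)
find_and_increment([3, 4, 7, 10, 11], tree)    # T2 from Example 2
find_and_increment([7, 11, 20, 29, 37], tree)  # T1 from Example 1
counted = {c for c, n in supports(tree).items() if n > 0}
print(sorted(counted))
```

Because both the transaction and the stored itemsets are in lexicographic order, each candidate is reached along exactly one path and is counted at most once per transaction.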

Efficient Candidate Generation

• Common prefixed L_{k-1}: linked by sibling

• Join: C_k = L_{k-1} × L_{k-1}, then

• Prune: candidate itemset having any subset that is not in Γ_Lk-1

– Searching in Γ_Lk-1: similar technique for find_and_increment

insert into C_k

select p[1], p[2], …, p[k-1], q[k-1]

from L_{k-1} p, L_{k-1} q

where p[1] = q[1], …, p[k-2] = q[k-2], p[k-1] < q[k-1];

FP-growth

• J. Han, J. Pei, and Y. Yin: "Mining frequent patterns without candidate generation". In Proc. ACM-SIGMOD'2000, pp. 1-12, Dallas, TX, May 2000.

• Compress DB into a tree (FP-tree)

• Find frequent itemsets in FP-tree

Construction of FP-tree from a Transaction Database

min_support = 0.5

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemset (single item pattern)

2. Order frequent items in frequency descending order

3. Scan DB again, construct FP-tree

Header Table

Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3

FP-tree (root-to-leaf paths)

{} → f:4 → c:3 → a:3 → m:2 → p:2

{} → f:4 → c:3 → a:3 → b:1 → m:1

{} → f:4 → b:1

{} → c:1 → b:1 → p:1
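The three construction steps can be sketched in Python as follows. The `Node` class and the `paths` helper are illustrative, and the header-node links of a full FP-tree are omitted. One assumption is stated in the code: frequency alone does not fix the order of tied items (f/c and a/b/m/p), so the slide's header-table order is passed in explicitly as a tie-break.

```python
from collections import Counter

db = {
    100: "facdgimp",
    200: "abcflmo",
    300: "bfhjow",
    400: "bcksp",
    500: "afcelpmn",
}
min_count = 3  # min_support = 0.5 of 5 transactions

class Node:
    def __init__(self, item, parent=None):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

# Step 1: one scan to count single items.
counts = Counter(item for items in db.values() for item in items)
# Step 2: frequency-descending order; ties follow the slide's header table
# (an explicit assumption of this sketch).
slide_order = "fcabmp"
header = sorted((i for i in counts if counts[i] >= min_count),
                key=lambda i: (-counts[i], slide_order.index(i)))

# Step 3: second scan, inserting each transaction's ordered frequent items.
root = Node(None)
for items in db.values():
    node = root
    for item in (i for i in header if i in items):
        node = node.children.setdefault(item, Node(item, node))
        node.count += 1

def paths(node, prefix=""):
    """List every root-to-leaf path as 'item:count' steps."""
    if not node.children:
        return [prefix.strip()]
    return [p for c in node.children.values()
            for p in paths(c, f"{prefix} {c.item}:{c.count}")]

for p in paths(root):
    print(p)
```

Transactions that share a prefix of ordered frequent items share the same tree path, which is where the compression comes from.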

Mining Frequent Patterns with FP-trees

• Idea: Frequent pattern growth

– Recursively grow frequent patterns by pattern and database partition

• Method

– For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree

– Repeat the process on each newly created conditional FP-tree

– Until the resulting FP-tree is empty, or it contains only one path: a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

From FP-tree to Conditional Pattern-Base

• Starting at the frequent item header table in the FP-tree

• Traverse the FP-tree by following the link of each frequent item p

• Accumulate all of transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases

item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

[Figure: the FP-tree with its header table (f 4, c 4, a 3, b 3, m 3, p 3)]

Transformed Prefix Paths

• Derive the transformed prefix paths of item p

– For each item p in the tree, collect p's prefix path with count = p's frequency

– Why only prefix path? Why this count? Complete?

Conditional pattern bases

item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

[Figure: the FP-tree with its header table (f 4, c 4, a 3, b 3, m 3, p 3)]
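The conditional pattern bases in the table can be reproduced with a short sketch. Instead of following node links in the FP-tree, it reads the prefixes straight from the ordered frequent-item lists of the five transactions; on this example that gives the same bases, since each tree path is a merged prefix of those lists. The function name is illustrative.

```python
from collections import Counter

# (ordered) frequent items of transactions 100..500
ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_base(item, ordered_transactions):
    """Count the prefixes that precede item in the ordered transactions."""
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:  # an empty prefix contributes nothing
                base[prefix] += 1
    return dict(base)

for item in "cabmp":
    print(item, conditional_pattern_base(item, ordered))
```

Only the prefix is collected because the items that follow p in a path are handled when their own pattern bases are mined.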

From Conditional Pattern-Bases to Conditional FP-trees

• For each pattern-base

– Accumulate the count for each item in the base

– Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

→ m-conditional FP-tree: {} → f:3 → c:3 → a:3

→ All frequent patterns related to m: m, fm, cm, am, fcm, fam, cam, fcam

[Figure: the FP-tree with its header table (f 4, c 4, a 3, b 3, m 3, p 3)]

Recursion: Mining Each Conditional FP-tree Until …

m-conditional FP-tree: {} → f:3 → c:3 → a:3

Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3

Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3

Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3

A Special Case: Single FP-tree Path

• Suppose a (conditional) FP-tree T has a single path P

• The complete set of frequent patterns of T can be generated by enumeration of all the combinations of the sub-paths of P

m-conditional FP-tree: {} → f:3 → c:3 → a:3

→ All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
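The single-path enumeration amounts to taking every subset of the path's items and appending the conditioning item. A minimal sketch, with an illustrative function name:

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """All patterns formed by a subset of the path's items plus the suffix."""
    patterns = []
    for k in range(len(path) + 1):
        for combo in combinations(path, k):
            patterns.append("".join(combo) + suffix)
    return patterns

patterns = single_path_patterns("fca", "m")
print(patterns)
```

A path with n items therefore yields 2ⁿ patterns for its suffix, here 2³ = 8.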

A More General (Special) Case: Single Prefix Path in FP-tree

• Suppose a (conditional) FP-tree T has a shared single prefix-path P

• Mining can be decomposed into two parts

– Reduction of the single prefix path into one node

– Concatenation of the mining results of the two parts

[Figure: a tree whose single prefix path {} → a₁:n₁ → a₂:n₂ → a₃:n₃ leads to a branching part with nodes b₁:m₁, C₁:k₁, C₂:k₂, C₃:k₃, shown as the prefix path {} → a₁:n₁ → a₂:n₂ → a₃:n₃ plus a reduced tree rooted at r₁ that keeps the branching part]

Sequential Patterns

• Transaction databases, time-series databases and sequence databases

• Frequent patterns vs. sequential patterns

• Applications of sequential pattern mining

– Customer shopping sequences

• First buy computer, then CD-ROM, and then digital camera, within 3 months.

– Medical treatments, natural disasters (e.g., earthquakes), science and engineering processes, stocks, etc.

– Telephone calling patterns, Weblog click streams

– DNA sequences and gene structures

What Is Sequential Pattern Mining?

• Given a set of sequences, find all the frequent subsequences

A sequence database

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

A sequence: <(ef)(ab)(df)cb>

An element may contain a set of items.

Items within an element are unordered and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
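The subsequence relation used above can be made concrete with a small Python sketch. Sequences are modeled as lists of item sets; the `seq` helper is a hypothetical convenience for writing them, and `is_subsequence` uses a leftmost (greedy) match.

```python
def is_subsequence(pattern, sequence):
    """True if each pattern element is a subset of a distinct, later sequence element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True

def seq(*elements):
    """Helper (hypothetical): seq('a', 'bc') -> [{'a'}, {'b', 'c'}]."""
    return [set(e) for e in elements]

db = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

print(is_subsequence(seq("a", "bc", "d", "c"), db[10]))
support = sum(is_subsequence(seq("ab", "c"), s) for s in db.values())
print(support)  # number of sequences containing <(ab)c>
```

Support is counted per sequence, so a pattern that occurs several times inside one sequence still contributes only one.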

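Why Is Sequential Pattern Mining Difficult?

• A huge number of possible sequential patterns are hidden in databases

• A mining algorithm should

– find all the patterns that satisfy the minimum support (frequency) threshold

– be highly efficient and scalable, with few database scans

– be able to incorporate various kinds of user-specified constraints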

Basic Property of Sequential Patterns: Apriori

• Basic property: Apriori (Agrawal & Srikant '94)

– If a sequence S is not frequent

– then none of the super-sequences of S is frequent

– e.g., <hb> is infrequent → <hab> and <(ah)b> are infrequent too

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Given support threshold min_sup = 2

GSP Algorithm Details

L₁ = {frequent sequences of length 1}; k = 1;

repeat:

    if L_k = ∅, stop;

    C_{k+1} = candidates generated from L_k;

    for each data sequence t in database D:

        add one to the support count of every candidate in C_{k+1} that is contained in t

    L_{k+1} = the candidates in C_{k+1} that satisfy the minimum support

    k = k + 1;

Answer: the union of all L_k;

• L_k: the frequent sequences of size k

• C_k: the candidate sequences of size k

Finding Length-1 Sequential Patterns

• Initial candidates: all singleton sequences

– <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>

• Scan database once, count support for candidates

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

min_sup = 2

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1

Generating Length-2 Candidates

Candidates of the form <xy> (two elements):

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

Candidates of the form <(xy)> (one element):

       <a>    <b>      <c>      <d>      <e>      <f>
<a>           <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                    <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                             <(cd)>   <(ce)>   <(cf)>
<d>                                      <(de)>   <(df)>
<e>                                               <(ef)>
<f>

51 length-2 candidates

Without the Apriori property, 8*8 + 8*7/2 = 92 candidates

Apriori prunes 44.57% of the candidates
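The candidate counts can be re-derived with a short sketch that enumerates both candidate forms. The function name is illustrative.

```python
from itertools import combinations, product

def length2_candidates(items):
    """<xy> for every ordered pair (x may equal y) plus <(xy)> for every unordered pair."""
    sequential = [f"<{x}{y}>" for x, y in product(items, repeat=2)]
    same_element = [f"<({x}{y})>" for x, y in combinations(items, 2)]
    return sequential + same_element

with_apriori = length2_candidates("abcdef")       # the 6 frequent length-1 patterns
without_apriori = length2_candidates("abcdefgh")  # all 8 items
pruned = 1 - len(with_apriori) / len(without_apriori)
print(len(with_apriori), len(without_apriori), f"{pruned:.2%}")
```

For n frequent items the count is n² + n(n−1)/2, which gives 36 + 15 for n = 6 and 64 + 28 for n = 8.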

Finding Length-2 Sequential Patterns

• Scan database one more time, collect support count for each length-2 candidate

• There are 19 length-2 candidates which pass the minimum support threshold

– They are length-2 sequential patterns

Generating Length-3 Candidates and Finding Length-3 Patterns

• Generate

– Self-join the length-2 sequential patterns

• Based on the Apriori property

• <ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate

• <(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate

– 46 candidates are generated; prune the impossible ones

• Find length-3 sequential patterns

– Scan database once more, collect support counts for candidates

– 19 out of 46 candidates pass support threshold

The GSP Mining Process

1st scan: 8 cand., 6 length-1 seq. pat.

2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all

3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all (e.g. <abb> <aab> <aba> <baa> <bab> …)

4th scan: 8 cand., 6 length-4 seq. pat. (e.g. <abba> <(bd)bc> …)

5th scan: 1 cand., 1 length-5 seq. pat. (<(bd)cba>)

[Figure legend: candidates that cannot pass the support threshold; candidates not in the DB at all]

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

min_sup = 2

GSP Bottlenecks

• Too many candidates

– 1,000 frequent length-1 sequences generate $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates

• Too many scans of the entire database

• The real difficulty: mining long sequential patterns

– An exponential (astronomical) number of short candidates

– A length-100 sequential pattern needs $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$ candidate sequences!

Applications of Frequent Patterns

• Association-based classification

• Iceberg cube computation

• Database compression by fascicles and frequent patterns

• Mining sequential patterns (GSP, PrefixSpan, SPADE, etc.)

• Mining partial periodicity, cyclic associations, etc.

• Mining frequent structures, trends, etc

Research Problems in Frequent Pattern Mining

• Multi-dimensional gradient analysis: patterns regarding changes and differences

• Mining fault-tolerant associations

– “3 out of 4 courses excellent” leads to A in data mining

• Fascicles and database compression by frequent pattern mining

• Partial periodic patterns

• DNA sequence analysis and pattern classification


Tools: Association Rule Mining

• Free

– ARTool, http://www.cs.umb.edu/~laur/ARtool/

– Apriori, http://fuzzy.cs.uni-magdeburg.de/~borgelt/#Software

• Commercial

– IBM Intelligent Miner for Data, http://www.software.ibm.com/data/intelli-mine/

– DBMiner 2.0, http://www.dbminer.com/

– Clementine, http://www.spss.com/clementine/

Apriori (1/2)

• apriori [options] infile outfile [appfile]

– infile: file to read transactions from

– outfile: file to write association rules

– appfile: file stating item appearances (optional)

• options:

-t# target type (s: item sets, r: rules (default), h: hyperedges)

-m# minimal number of items per set/rule/hyperedge (default: 1)

-n# maximal number of items per set/rule/hyperedge (default: 5)

-s# minimal support of a set/rule/hyperedge (default: 10%)

-c# minimal confidence of a rule/hyperedge (default: 80%)

Apriori (2/2) options

-b/f/r# blank characters, field and record separators (default: " \t\r", " \t", "\n")

-o use original definition of the support of a rule (body & head)

-p# output format for support/confidence (default: "%.1f%%")

-x extended support output (print both rule support types)

-a print absolute support (number of transactions)

-e# additional rule evaluation measure (default: none)

(# always means a number, a letter, or a string that specifies the parameter of the option.)

Apriori Input Format

• text file (field and record separators and blanks)

– Record separators: delimit the records (lines)

– Field separators: delimit the fields (or columns), i.e. the words

– Blanks: fill fields (columns), e.g. to align them.

• Example

1,2,3
1,4,5
2,3,4
1,2,3,4
2,3
1,2,4
4,5
1,2,3,4
3,4,5
1,2,3

Item Appearances File

• item may appear only in rule bodies (antecedents):

– i in b body a ante antecedent

• item may appear only in rule heads (consequents):

– o out h head c cons consequent

• item may appear in rule bodies (antecedents) or in rule heads (consequents):

– io inout bh b&h ac a&c both

• item may appear neither in rule bodies (antecedents) nor in rule heads (consequents):

– n neither none ign ignore -

• Example 1: Generate only rules with item "x" in the consequent.

in
x out

Sample Command

• apriori test1.tab test.rul

• apriori -b"(" -f, -r")" test2.tab test2.rul

• apriori -f ",.;:" -l test3.tab test3.rul

• apriori test1.tab -

Sample input data shown on the slide:

0 1 0 2 0 3 1 1 1 4 1 5 2 2 2 3 2 4 3 1 3 2 3 3 3 4 4 2 4 3 5 1 5 2 5 4 6 4 6 5 7 1 7 2 7 3 7 4 8 3 8 4 8 5 9 1 9 2 9 3

(0,1)(0,2)(0,3)(1,1)(1,4)(1,5)(2,2)(2,3)(2,4)(3,1)(3,2)(3,3)(3,4)(4,2)(4,3)...

1,2,3 1,4,5 2.3.4 1,2,3,4 2:3 1,2,4 4,5 1,2,3,4 3;4;5 1,2,3

Example 2 (item appearances file): Item "x" may appear only in a rule head (consequent), item "y" only in a rule body (antecedent); appearance of all other items is not restricted.

both
x head
y body

Sample Output

3 <- 2 (70.0%, 85.7%)
2 <- 3 (70.0%, 85.7%)
2 <- 1 (60.0%, 83.3%)
4 <- 5 (30.0%, 100.0%)
3 <- 2 1 (50.0%, 80.0%)
2 <- 3 1 (40.0%, 100.0%)
4 <- 3 5 (10.0%, 100.0%)
4 <- 1 5 (10.0%, 100.0%)
2 <- 3 4 1 (20.0%, 100.0%)
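The percentages in the sample output can be re-derived from the ten example transactions of the input-format slide. The sketch below is independent of the apriori program; it assumes, as the numbers suggest and the -o option description hints, that the first percentage is the support of the rule body alone and the second is the confidence. The function `rule` is illustrative.

```python
data = """1,2,3
1,4,5
2,3,4
1,2,3,4
2,3
1,2,4
4,5
1,2,3,4
3,4,5
1,2,3"""
transactions = [set(line.split(",")) for line in data.splitlines()]

def rule(head, body):
    """Return (support of the body, confidence) in percent, rounded to one decimal."""
    with_body = [t for t in transactions if set(body) <= t]
    with_both = [t for t in with_body if head in t]
    return (round(100 * len(with_body) / len(transactions), 1),
            round(100 * len(with_both) / len(with_body), 1))

print("3 <- 2", rule("3", ["2"]))
print("4 <- 5", rule("4", ["5"]))
print("3 <- 2 1", rule("3", ["2", "1"]))
```

Under that reading every line of the sample output is reproduced, for example 7 of 10 transactions contain item 2 and 6 of those 7 also contain item 3.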
