
Frequent Pattern Mining

- Mining Association Rules

2 Ming-Yen Lin, IECS.FCU

Outline

• What is frequent pattern mining?
• Frequent pattern mining methods
• Constraint-based frequent pattern mining
• Sequential patterns
• Applications of frequent patterns
• Research problems in frequent pattern mining

Frequent pattern mining

• Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]
• Frequent pattern mining: finding the regularities in data
– What products were often purchased together? Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?

F.P.M.: frequent pattern mining

F.P.M. is a fundamental data mining task

• Foundation for many data mining tasks
– Association rules, correlation, causality
– Sequential patterns, temporal/cyclic association, partial periodicity
– Spatial and multimedia patterns
– Associative classification
– Cluster analysis
– Iceberg cube, …
• Broad applications
– Market basket analysis
– Cross-marketing
– Catalog design
– Sales campaign analysis
– Web log (click stream) analysis, DNA sequence analysis, …

Basic concept: frequent itemsets

• Itemset X = {x_1, …, x_k}
– e.g. {A,C}, {B,E,F}, {C,E}
• Support of an itemset: the fraction of transactions that contain it
– s(A) = 3/4
• Frequent itemset: an itemset that satisfies the minimum support (m.s.)
• Frequent pattern mining: find all frequent patterns (here, pattern = set of items)

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F
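The support definition above can be sketched in a few lines of Python. This is only an illustration on the four transactions of this slide; the names `transactions`, `support` and `is_frequent` are made up for the example.

```python
from fractions import Fraction

# Transactions from the table on this slide, keyed by transaction id.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return Fraction(hits, len(db))

def is_frequent(itemset, db, min_support):
    """True if the itemset reaches the minimum support threshold."""
    return support(itemset, db) >= min_support

print(support({"A"}, transactions))  # 3/4, as on the slide
```

Exact fractions are used so that the comparison with a threshold such as 1/2 has no rounding issues.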

Basic concept: association rules

• Association rule mining: once the frequent itemsets are found, determine the "interesting" relationships among them
– Support, s: the probability that a transaction contains X ∪ Y
– Confidence, c: the conditional probability that a transaction containing X also contains Y
• With m.s. = 50% and m.c. = 50%:
– A → C (50%, 66.7%)
– C → A (50%, 100%)

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

m.s.: minimum support; m.c.: minimum confidence

Mining association rules (example)

Min. support 50%, min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A → C:
support = support({A}∪{C}) = 50%
confidence = support({A}∪{C})/support({A}) = 66.7%
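As a small check of the arithmetic in this example, the sketch below recomputes support and confidence for the rule A → C from the four transactions. The function names are illustrative only, and an antecedent that never occurs would cause a division by zero.

```python
def support(itemset, db):
    """Fraction of transactions containing every item of itemset."""
    return sum(1 for t in db if set(itemset) <= t) / len(db)

def rule_metrics(antecedent, consequent, db):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), db)
    return both, both / support(antecedent, db)

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
s, c = rule_metrics({"A"}, {"C"}, db)
print(f"A -> C: support {s:.1%}, confidence {c:.1%}")
# prints: A -> C: support 50.0%, confidence 66.7%
```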

Categories of association rules

• Boolean vs. quantitative
– buys(x, "SQLServer") ∧ buys(x, "DMBook") → buys(x, "DM Software") [0.2%, 60%]
– age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
• Single-dimensional vs. multi-dimensional
• Single-level vs. multiple-level
– What brands of beers are associated with what brands of diapers?

Extensions and applications of association rules

• Correlation, causality analysis & mining interesting rules

• Maxpatterns and frequent closed itemsets

• Constraint-based mining

• Sequential patterns

• Association-based classification

• Computing iceberg cubes

Frequent pattern mining methods

• Apriori and its variations and improvements
– Mining methods that do not generate candidate patterns
• Max-patterns and closed patterns
– Compact representations
• Multi-dimensional, multi-level frequent patterns
• Interestingness: correlation and causality

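Apriori: Candidate Generation-and-test

• Generate candidates, then test them
– From the frequent itemsets of length k, generate the candidate itemsets of length k+1
– Count each candidate's support against the database to check whether it is frequent
• Candidate condition: every subset of a candidate must be frequent
– Property: anti-monotone
– Any transaction containing {beer, diaper, nuts} also contains {beer, diaper}
– If {beer, diaper, nuts} is frequent, then {beer, diaper} must be frequent
– No superset of a non-frequent itemset can be frequent (so it is never a candidate: it need not be generated or counted)
– This rules out many useless combinations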

Apriori example

m.s. = 2 (sup is given as a count, not as a percentage)

Database
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan: C2 with counts
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2): {B, C, E}

3rd scan: L3
Itemset     sup
{B, C, E}   2

Apriori algorithm details

L_1 = {frequent items}; k = 1;
repeat:
    if L_k = ∅, stop;
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database D:
        increment the support count of every candidate in C_{k+1} that is contained in t;
    L_{k+1} = the candidates in C_{k+1} that satisfy the minimum support;
    k = k + 1;
Answer: the union of all L_k;

• L_k: frequent itemsets of size k
• C_k: candidate itemsets of size k
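The level-wise loop can be sketched in Python as follows. This is a minimal illustration, not an optimized implementation: candidates are counted by a plain scan instead of a hash tree, and the database is the four-transaction example ({A,C,D}, {B,C,E}, {A,B,C,E}, {B,E}) used in the Apriori example of this deck with a minimum support count of 2. The function name `apriori` and its parameters are chosen for the sketch.

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    # L1: count single items in one pass over the database.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have k+1 items and
        # whose k-subsets are all frequent (anti-monotone pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
        # One database pass: count every candidate contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(db, min_count=2)
```

On this database the sketch finds four frequent 1-itemsets, four frequent 2-itemsets and the single 3-itemset {B, C, E} with count 2.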

Apriori key detail I

• C_{k+1} is generated from L_k
– Step 1: self-joining L_k
– Step 2: pruning (removing the impossible candidates)
• Example: L_3 = {abc, abd, acd, ace, bcd}, kept in sorted order
– Self-joining: L_3 * L_3
    • abcd: from abc and abd
    • acde: from acd and ace
– Pruning: acde is removed because ade is not in L_3
• C_4 = {abcd}
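The join and prune steps can be reproduced on this L_3 example with a short sketch. It assumes each itemset is kept as a sorted tuple of items; the helper name `generate_candidates` is illustrative.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Join sorted k-itemsets that share their first k-1 items, then prune."""
    frequent = sorted(tuple(sorted(s)) for s in frequent_k)
    known = set(frequent)
    joined = []
    for i, p in enumerate(frequent):
        for q in frequent[i + 1:]:
            if p[:-1] == q[:-1]:          # same (k-1)-prefix
                joined.append(p + (q[-1],))
    kept = []
    for cand in joined:
        k = len(cand) - 1
        # Keep the candidate only if every k-subset is itself frequent.
        if all(sub in known for sub in combinations(cand, k)):
            kept.append(cand)
    return joined, kept

L3 = ["abc", "abd", "acd", "ace", "bcd"]
joined, C4 = generate_candidates(L3)
print(joined)  # [('a','b','c','d'), ('a','c','d','e')]
print(C4)      # [('a','b','c','d')]
```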

Apriori key detail II

• Incrementing the count of every candidate contained in t: where is the difficulty?
– The total number of candidates is very large
– Each t may contain more than one candidate
• Method:
– Organize the candidates carefully (store them in a hash tree)
– Find all candidates that could be contained in t, then test them

[Figure: hash tree of candidate 3-itemsets whose subset function hashes items into the branches 1,4,7 / 2,5,8 / 3,6,9, probed with the transaction 1 2 3 5 6 (split as 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6)]

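Challenges of the mining methods

Challenges (that is, the directions in which Apriori has been varied and improved)
• Too many scans of the whole database
– Reduce the number of scans
• Still too many candidates
– Reduce the number of candidates
• Counting candidate support is costly
– Count faster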

Other methods

• Improvements of Apriori
– DIC
– DHP
– Partition
– Sampling
– ...
• No candidate generation: compress the database, then mine it
– FP-Growth, H-mine
• Intersecting item lists (database stored in vertical layout)
– Eclat/MaxEclat, VIPER

Visualization of association rules: Pane Graph

[Figure: pane graph]

Visualization of association rules: Rule Graph

[Figure: rule graph]

Max-patterns

• A frequent pattern {a_1, …, a_100} has $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ frequent sub-patterns!
• Max-pattern: a frequent pattern that has no frequent proper super-pattern
– BCDE and ACD are max-patterns
– BCD is not a max-pattern

m.s. = 2

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
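The max-pattern definition can be checked on the three transactions of this slide with a brute-force sketch. Enumerating every itemset is only viable for a toy database; the function names are illustrative.

```python
from itertools import combinations

def frequent_itemsets(db, min_count):
    """Brute-force enumeration of frequent itemsets (toy databases only)."""
    items = sorted(set().union(*db))
    found = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in db if set(combo) <= t)
            if count >= min_count:
                found[frozenset(combo)] = count
    return found

def max_patterns(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

db = [set("ABCDE"), set("BCDE"), set("ACDF")]
maximal = max_patterns(frequent_itemsets(db, min_count=2))
```

With m.s. = 2 the result contains exactly BCDE and ACD; BCD is frequent but is excluded because BCDE is a frequent proper superset.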

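Frequent closed patterns

• Conf(ac → d) = 100% ⇒ it is enough to record acd
• A frequent itemset X is a closed pattern if there is no item y (outside X) such that every transaction containing X also contains y
– In the transactions below, acd occurs only in TIDs 10 and 40, and both of them also contain f, so the closed pattern here is "acdf" rather than "acd"
• A compact representation of frequent patterns
• Reduces the number of patterns and rules

TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f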
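The closedness test can be phrased as a closure computation: intersect all transactions that contain X and compare the result with X. The sketch below applies it to the five transactions of this slide; `closure` and `is_closed` are illustrative names, and an itemset that occurs in no transaction simply yields an empty closure here.

```python
def closure(itemset, db):
    """Items shared by every transaction that contains itemset."""
    covering = [t for t in db if set(itemset) <= t]
    return set.intersection(*covering) if covering else set()

def is_closed(itemset, db):
    """X is closed if no extra item occurs in every transaction containing X."""
    return closure(itemset, db) == set(itemset)

# TIDs 10, 20, 30, 40, 50 from the table on this slide.
db = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]
print(sorted(closure("ac", db)))  # ['a', 'c', 'd', 'f']
```

On this data the closure of ac is {a, c, d, f}, because transactions 10 and 40 both also contain d and f.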

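Multi-level association rules

• Items usually have a concept hierarchy
• Support thresholds should be flexible: lower levels should get lower support thresholds
• Encode the transaction database by dimension and level
• Explore multi-level mining

Example hierarchy: Milk [support = 10%], with children 2% Milk [support = 6%] and Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%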

Multi-dimensional association

• Single-dimensional rules
– buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules: ≥ 2 dimensions or predicates
– Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
– Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
• Categorical attributes
– Finite number of possible values, no ordering among values
• Quantitative attributes
– Numeric, implicit ordering among values

Distance-based Association Rules

• Binning does not necessarily capture the semantics of interval data
• Distance-based partitioning gives a more meaningful discretization because it considers
– density/number of points in an interval
– "closeness" of points in an interval

Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
7          [0,10]                   [7,20]                 [7,7]
20         [11,20]                  [22,50]                [20,22]
22         [21,30]                  [51,53]                [50,53]
50         [31,40]
51         [41,50]
53         [51,60]
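The equi-depth and distance-based columns of the price table can be reproduced with the sketch below. The distance-based rule used here (start a new interval when the gap to the previous value exceeds `max_gap`) and the value `max_gap=5` are assumptions made for illustration; the slides do not specify the clustering rule.

```python
def equi_depth(values, depth):
    """Consecutive groups of `depth` sorted values, reported as (min, max)."""
    ordered = sorted(values)
    groups = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    return [(g[0], g[-1]) for g in groups]

def distance_based(values, max_gap):
    """Start a new interval whenever the gap to the previous value exceeds max_gap."""
    ordered = sorted(values)
    intervals = [[ordered[0], ordered[0]]]
    for v in ordered[1:]:
        if v - intervals[-1][1] <= max_gap:
            intervals[-1][1] = v          # close enough: extend current interval
        else:
            intervals.append([v, v])      # too far: open a new interval
    return [tuple(i) for i in intervals]

prices = [7, 20, 22, 50, 51, 53]
print(equi_depth(prices, 2))       # [(7, 20), (22, 50), (51, 53)]
print(distance_based(prices, 5))   # [(7, 7), (20, 22), (50, 53)]
```

The equi-depth bins mix distant prices such as 22 and 50, while the gap-based intervals keep close prices together.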

Interestingness measure: correlations

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading!
– The overall percentage of students eating cereal is 75%, which is higher than 66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although its support and confidence are lower
• Measure of dependent/correlated events: lift

$\mathrm{corr}_{A,B} = \dfrac{P(A \cup B)}{P(A)\,P(B)}$

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
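The lift values for the two rules can be computed directly from the contingency table. In this sketch P(A ∪ B) is read, as elsewhere in these slides, as the probability that a record contains both A and B; the function name `lift` is illustrative.

```python
def lift(n_both, n_a, n_b, n_total):
    """P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_both / n_total) / ((n_a / n_total) * (n_b / n_total))

# Counts from the table: 5000 students, 3000 play basketball,
# 3750 eat cereal, 1250 do not.
basketball_cereal = lift(2000, 3000, 3750, 5000)
basketball_not_cereal = lift(1000, 3000, 1250, 5000)
print(round(basketball_cereal, 3), round(basketball_not_cereal, 3))  # 0.889 1.333
```

A value below 1 (basketball and cereal) indicates negative correlation; a value above 1 (basketball and not cereal) indicates positive correlation.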

Constraint-based frequent pattern mining

• Finding all the patterns in a database autonomously? Unrealistic!
– The patterns could be too many but not focused!
• Data mining should be an interactive process
– The user directs what is to be mined using a data mining query language (or a graphical user interface)
• Constraint-based mining
– User flexibility: provides constraints on what is to be mined
– System optimization: explores such constraints for efficient mining

LexMiner

• Ming-Yen Lin and Suh-Yin Lee, "A Fast Lexicographic Algorithm for Association Rule Mining in Web Applications," Proceedings of the ICDCS Workshop on Knowledge Discovery and Data Mining in the World-Wide Web (ICDCS00), Taipei, Taiwan, R.O.C., pp. F7-F14, 2000.

Apriori uses a hash tree to store candidates

• Excess comparisons, e.g. T1 = {1, 2, 3, 4, 5, 6}
• Duplicate counting avoidance, e.g. T2 = {1, 3, 4, 200, 401, 403}
• Large storage requirement
• High splitting cost

[Figure: hash tree with hash function = x MOD 199, size of each leaf = 5, d = depth (levels d = 1, 2, 3); legend: leaf, interior node, empty leaf. Leaves shown: (1,2,83) (1,2,95) (1,2,96); (1,3,4) (1,3,9) (1,401,403) (200,401,403) (200,401,555); (2,3,4) (2,3,5); (198,201,203) (198,400,555)]


LexTree: Lexicographically Ordered Tree

• Intrinsic Property: Lexicographic Order

– Items in each transaction, eg. {7, 11, 20, 29, 37}

– k-itemsets, eg. (1, 3, 4) < (1, 3, 10) < (1, 4, 10)

• Storing itemsets: by lexicographic order

• LexTree: compact, hierarchical tree

– candidate LexTree: efficient, redundant-free support counting

– frequent LexTree: effective candidate generation

• LexMiner Algorithm

LexTree Structure

Candidate 3-itemsets stored in the tree:
(1, 3, 4), (1, 3, 10), (1, 3, 11), (1, 4, 10), (1, 4, 11), (1, 4, 15), (1, 10, 15), (3, 4, 7), (3, 4, 10), (3, 4, 11), (3, 7, 11), (3, 7, 20), (3, 11, 20), (4, 7, 10), (4, 10, 15), (7, 11, 20)

• Internal node: Item_ID, sibling, next
• Leaf node: Item_ID, sibling, support

[Figure: the LexTree for these itemsets, rooted at Root with first-level items 1, 3, 4, 7; "#" marks a support field and a separate symbol marks a null link]
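The idea of storing candidates in lexicographic order and counting each contained candidate exactly once can be illustrated with a prefix tree. The sketch below uses plain nested dictionaries rather than the sibling/next linked nodes of the LexTree, so it is an approximation of the idea and not the LexMiner data structure; all names are made up for the example.

```python
def build_trie(candidates):
    """Nested dicts keyed by item; the key None at a leaf holds the support count."""
    root = {}
    for itemset in sorted(candidates):
        node = root
        for item in itemset:
            node = node.setdefault(item, {})
        node[None] = 0
    return root

def count_transaction(node, items, start=0):
    """Increment every stored itemset contained in the sorted list `items`.

    Returns how many stored itemsets were found in this transaction.
    """
    if None in node:                      # reached a complete candidate
        node[None] += 1
        return 1
    hits = 0
    for pos in range(start, len(items)):
        child = node.get(items[pos])
        if child is not None:             # only follow items present in the tree
            hits += count_transaction(child, items, pos + 1)
    return hits

candidates = [(1, 3, 4), (1, 3, 10), (1, 3, 11), (1, 4, 10), (1, 4, 11),
              (1, 4, 15), (1, 10, 15), (3, 4, 7), (3, 4, 10), (3, 4, 11),
              (3, 7, 11), (3, 7, 20), (3, 11, 20), (4, 7, 10), (4, 10, 15),
              (7, 11, 20)]
trie = build_trie(candidates)
found = count_transaction(trie, [3, 4, 7, 10, 11])
```

For the transaction {3, 4, 7, 10, 11} the walk finds five candidates: (3,4,7), (3,4,10), (3,4,11), (3,7,11) and (4,7,10). Because both the transaction and the tree are ordered, no candidate is visited twice.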

LexTree Construction

Inserting the candidates (1, 3, 4), (1, 3, 10), (1, 3, 11), (1, 4, 10):
(1) Insert candidate (1, 3, 4)
(2) Insert candidate (1, 3, 10)
(3) Insert candidate (1, 3, 11)
(4) Insert candidate (1, 4, 10)

[Figure: the tree after each of the four insertions, with the pointers Last[1], Last[2], Last[3] and leaf supports initialized to 0]

LexTree Construction (Cont.)

(5) Insert candidate (1, 10, 15), after (1, 4, 11) and (1, 4, 15) have been inserted
(6) Insert candidate (3, 4, 7)

[Figure: the tree after steps (5) and (6), with the pointers Last[1], Last[2], Last[3]]

Notations

• D: the database of transactions
• T: a transaction, T = {x_1, x_2, …, x_p, …, x_m}
• x_1, x_2, …, x_k: items
• minsup: the minimum support specified by the user
• X: a k-itemset, X = (x_1, x_2, …, x_k)
• X.support: the support of itemset X
• C_k: the set of candidate k-itemsets
• L_k: the set of frequent k-itemsets
• Γ_{Ck}: the candidate k-itemset LexTree
• Γ_{Lk}: the frequent k-itemset LexTree

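Algorithm LexMiner

[Flowchart, read as steps:]
1. Find L_1. If L_1 = ∅, go to step 5.
2. Find L_2; set k = 3.
3. If L_{k-1} = ∅, go to step 5.
4. Build the frequent LexTree Γ_{Lk-1} from L_{k-1}; generate C_k from Γ_{Lk-1}; store C_k in the candidate LexTree Γ_{Ck}; for all T ∈ D, call Find_and_increment(T, Γ_{Ck}); set L_k = {X ∈ C_k | X.support ≥ minsup}; k++; go to step 3.
5. Answer = ∪_k L_k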

Algorithm Find_and_Increment

[Flowchart: while (not end_of_list) and (cp ≠ null), compare item[tp] with cp.ID (<, =, >). At a leaf node the branches are: advance tp; advance cp; cp.support++, then advance tp and cp. At an internal node the branches are: advance tp and call Find_and_Increment(tp, cp); advance cp and call Find_and_Increment(tp, cp); call Find_and_Increment(tp+1, cp.sibling) and Find_and_Increment(tp+1, cp.next).]

Example 1: #Comparison Minimized

• At pass 3, T1 = {7, 11, 20, 29, 37}
• Intrinsically ordered 3-itemsets in T1:
    {7, 11} × {20, 29, 37}
    {7, 20} × {29, 37}
    {7, 29} × {37}
    {11, 20} × {29, 37}
    {11, 29} × {37}
    {20, 29} × {37}

[Figure: candidate 3-itemset LexTree Γ_{C3}, with the visited nodes labelled Na to Nf]

Example 2: Fast Support Counting

• At pass 3, T2 = {3, 4, 7, 10, 11}
• Intrinsically ordered 3-itemsets in T2:
    {3, 4} × {7, 10, 11}
    {3, 7} × {10, 11}
    {3, 10} × {11}
    {4, 7} × {10, 11}
    {4, 10} × {11}
    {7, 10} × {11}

[Figure: candidate 3-itemset LexTree Γ_{C3}, with the visited nodes labelled Na to Nl]

Efficient Candidate Generation

• Itemsets in L_{k-1} with a common prefix are linked by sibling pointers
• Join: C_k = L_{k-1} × L_{k-1}

    insert into C_k
    select p[1], p[2], …, p[k-1], q[k-1]
    from L_{k-1} p, L_{k-1} q
    where p[1] = q[1], …, p[k-2] = q[k-2], p[k-1] < q[k-1];

• Then prune: remove any candidate itemset that has a subset not in Γ_{Lk-1}
– Searching in Γ_{Lk-1} uses a technique similar to Find_and_increment

FP-growth

• J. Han, J. Pei, and Y. Yin: "Mining frequent patterns without candidate generation". In Proc. ACM-SIGMOD'2000, pp. 1-12, Dallas, TX, May 2000.
• Compress DB into a tree (FP-tree)
• Find frequent itemsets in FP-tree

Construction of FP-tree from a Transaction Database

min_support = 0.5

TID   Items bought                (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree

Header Table
Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3

FP-tree (item:count, indentation shows parent and child):
{}
    f:4
        c:3
            a:3
                m:2
                    p:2
                b:1
                    m:1
        b:1
    c:1
        b:1
            p:1
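The three construction steps can be sketched as follows. The class `Node`, the function `build_fp_tree` and the `tie_order` parameter are illustrative; `tie_order="fcabmp"` is an assumption that reproduces the slide's ordering among items with equal frequency (f before c, and a, b, m, p in that order), since the slide does not state a tie-breaking rule. A threshold of min_support = 0.5 on five transactions is taken as a count of at least 3.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_count, tie_order):
    """Return (root, header); header maps each item to the list of its nodes."""
    # Scan 1: frequency of every item.
    freq = Counter(item for t in db for item in set(t))
    keep = {i for i, c in freq.items() if c >= min_count}
    # Descending frequency; ties broken by position in tie_order (assumption).
    rank = lambda i: (-freq[i], tie_order.index(i))
    root, header = Node(None, None), {}
    # Scan 2: insert each transaction's ordered frequent items as a path.
    for t in db:
        node = root
        for item in sorted(keep & set(t), key=rank):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
root, header = build_fp_tree(db, min_count=3, tie_order="fcabmp")
```

The resulting root has the two children f:4 and c:1, and the header lists for b and p contain three and two nodes respectively, matching the tree on the slide.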

Mining Frequent Patterns with FP-trees

• Idea: Frequent pattern growth
– Recursively grow frequent patterns by pattern and database partition
• Method
– For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path; a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

From FP-tree to Conditional Pattern-Base

• Start at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

[Figure: the FP-tree and header table (f 4, c 4, a 3, b 3, m 3, p 3)]
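The conditional pattern bases in the table can be reproduced without walking the tree, because every FP-tree path is a merged prefix of the ordered transactions. The sketch below takes that shortcut: it collects, for each ordered transaction containing the item, the prefix in front of it. This is an equivalent illustration, not the node-link traversal described on the slide, and the function name is made up.

```python
from collections import Counter

def conditional_pattern_base(ordered_db, item):
    """Prefix of `item` in each ordered transaction, with multiplicities."""
    base = Counter()
    for t in ordered_db:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:                    # an empty prefix contributes nothing
                base[prefix] += 1
    return dict(base)

# Ordered frequent items of TIDs 100 to 500.
ordered_db = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
print(conditional_pattern_base(ordered_db, "m"))  # {'fca': 2, 'fcab': 1}
```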

Transformed Prefix Paths

• Derive the transformed prefix paths of item p
– For each node of item p in the tree, collect p's prefix path with count = p's frequency at that node
– Why only prefix path? Why this count? Complete?

Conditional pattern bases
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

[Figure: the FP-tree and header table (f 4, c 4, a 3, b 3, m 3, p 3)]

From Conditional Pattern-Bases to Conditional FP-trees

• For each pattern-base
– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns related to m: m, fm, cm, am, fcm, fam, cam, fcam

[Figure: the FP-tree and header table (f 4, c 4, a 3, b 3, m 3, p 3)]

Recursion: Mining Each Conditional FP-tree Until …

• m-conditional FP-tree: {} → f:3 → c:3 → a:3
• Cond. pattern base of "am": (fc:3); am-conditional FP-tree: {} → f:3 → c:3
• Cond. pattern base of "cm": (f:3); cm-conditional FP-tree: {} → f:3
• Cond. pattern base of "cam": (f:3); cam-conditional FP-tree: {} → f:3

A Special Case: Single FP-tree Path

• Suppose a (conditional) FP-tree T has a single path P
• The complete set of frequent patterns of T can be generated by enumeration of all the combinations of the sub-paths of P

m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
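The single-path case amounts to enumerating every subset of the nodes on the path and appending the suffix item. A minimal sketch for the m-conditional FP-tree follows; the function name and its arguments are illustrative, and each pattern's support is taken as the smallest count among the chosen nodes and the suffix.

```python
from itertools import combinations

def single_path_patterns(path, suffix_item, suffix_count):
    """All combinations of nodes on a single path, each extended by the suffix.

    `path` is a list of (item, count) pairs from the root downwards.
    """
    patterns = {}
    for k in range(len(path) + 1):
        for combo in combinations(path, k):
            items = "".join(item for item, _ in combo) + suffix_item
            patterns[items] = min([c for _, c in combo] + [suffix_count])
    return patterns

patterns = single_path_patterns([("f", 3), ("c", 3), ("a", 3)], "m", 3)
print(sorted(patterns))
# ['am', 'cam', 'cm', 'fam', 'fcam', 'fcm', 'fm', 'm']
```

A path of n nodes yields 2^n patterns, here 2^3 = 8, all with support 3.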

A More General (Special) Case: Single Prefix Path in FP-tree

• Suppose a (conditional) FP-tree T has a shared single prefix-path P
• Mining can be decomposed into two parts
– Reduction of the single prefix path into one node
– Concatenation of the mining results of the two parts

[Figure: a tree with the single prefix path {} → a1:n1 → a2:n2 → a3:n3 followed by a multi-branch part (nodes b1:m1, C1:k1, C2:k2, C3:k3) is split into the prefix path {} → a1:n1 → a2:n2 → a3:n3 plus the multi-branch part rooted at r1]

Sequential patterns

• Transaction databases, time-series databases, and sequence databases
• Frequent patterns vs. sequential patterns
• Applications of sequential pattern mining
– Customer shopping sequences
    • First buy computer, then CD-ROM, and then digital camera, within 3 months.
– Medical treatment, natural disasters (e.g., earthquakes), science and engineering processes, stocks, etc.
– Telephone calling patterns, Weblog click streams
– DNA sequences and gene structures

What is sequential pattern mining?

• Given a set of sequences, find all frequent subsequences

A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

• A sequence: <(ef)(ab)(df)cb>
• An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
• Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
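The subsequence relation and the support count can be sketched as below. The string encoding (single-letter items, parentheses around multi-item elements) follows the slide's notation; the helper names `parse`, `is_subsequence` and `support` are illustrative.

```python
def parse(seq):
    """Parse 'a(abc)(ac)d(cf)' into a list of item sets, one per element."""
    elements, i = [], 0
    while i < len(seq):
        if seq[i] == "(":
            j = seq.index(")", i)
            elements.append(set(seq[i + 1:j]))
            i = j + 1
        else:
            elements.append({seq[i]})
            i += 1
    return elements

def is_subsequence(pattern, sequence):
    """Greedy match: each pattern element must be a subset of a later element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, db):
    """Number of sequences in db that contain the pattern."""
    p = parse(pattern)
    return sum(is_subsequence(p, parse(s)) for s in db.values())

db = {10: "a(abc)(ac)d(cf)", 20: "(ad)c(bc)(ae)",
      30: "(ef)(ab)(df)cb", 40: "eg(af)cbc"}
print(support("(ab)c", db))  # 2 (sequences 10 and 30)
```

Matching each element at the earliest possible position is sufficient, because an earlier match never rules out a later one.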

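What makes sequential pattern mining hard?

• A huge number of possible sequential patterns are hidden in the database
• A mining algorithm should
– find all patterns that satisfy the minimum support (frequency) threshold
– be highly efficient and scalable, with few database scans
– be able to incorporate various kinds of user-specified constraints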

Basic property of sequential patterns: Apriori

• Basic property: Apriori (Agrawal & Srikant '94)
– If a sequence S is not frequent, then none of the super-sequences of S is frequent
– e.g. <hb> is infrequent → <hab> and <(ah)b> are also infrequent

Given support threshold min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

GSP algorithm details

L_1 = {frequent sequences of length 1}; k = 1;
repeat:
    if L_k = ∅, stop;
    C_{k+1} = candidates generated from L_k;
    for each sequence t in database D:
        increment the support count of every candidate in C_{k+1} that is contained in t;
    L_{k+1} = the candidates in C_{k+1} that satisfy the minimum support;
    k = k + 1;
Answer: the union of all L_k;

• L_k: frequent sequences of size k
• C_k: candidate sequences of size k

Finding length-1 sequential patterns

• Initial candidates: all singleton sequences
– <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan database once, count support for candidates

min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1

Generating length-2 candidates

Candidates of the form <xy> (6 × 6 = 36):

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

Candidates of the form <(xy)> (6 × 5 / 2 = 15):

       <a>   <b>      <c>      <d>      <e>      <f>
<a>          <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                   <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                            <(cd)>   <(ce)>   <(cf)>
<d>                                     <(de)>   <(df)>
<e>                                              <(ef)>
<f>

51 length-2 candidates
Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
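The counts on this slide can be re-derived with a short sketch: count in how many sequences each item occurs, keep the items with support of at least 2, and apply the candidate-count formula n·n + n·(n−1)/2. The five sequences are those of the GSP example database in this deck; the function names are illustrative.

```python
db = ["(bd)cb(ac)", "(bf)(ce)b(fg)", "(ah)(bf)abf", "(be)(ce)d", "a(bd)bcb(ade)"]
min_sup = 2

def item_support(db):
    """Number of sequences in which each item occurs at least once."""
    counts = {}
    for seq in db:
        for item in set(seq) - set("()"):
            counts[item] = counts.get(item, 0) + 1
    return counts

def n_length2_candidates(n_items):
    """n*n sequences <xy> plus n*(n-1)/2 single-element sequences <(xy)>."""
    return n_items * n_items + n_items * (n_items - 1) // 2

counts = item_support(db)
frequent = sorted(i for i, c in counts.items() if c >= min_sup)
with_pruning = n_length2_candidates(len(frequent))   # 51
without_pruning = n_length2_candidates(len(counts))  # 92
pruned = (without_pruning - with_pruning) / without_pruning
print(with_pruning, without_pruning, f"{pruned:.2%}")  # 51 92 44.57%
```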

Finding length-2 sequential patterns

• Scan database one more time, collect support count for each length-2 candidate

• There are 19 length-2 candidates which pass the minimum support threshold

– They are length-2 sequential patterns

Generating length-3 candidates and finding length-3 patterns

• Generate
– Self-join the length-2 sequential patterns
    • Based on the Apriori property
    • <ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate
    • <(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate
– 46 candidates are generated; prune the impossible ones
• Find the length-3 sequential patterns
– Scan database once more, collect support counts for candidates
– 19 out of 46 candidates pass support threshold

The GSP Mining Process

min_sup = 2

• 1st scan: 8 cand., 6 length-1 seq. pat.
    <a> <b> <c> <d> <e> <f> <g> <h>
• 2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all
    <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
• 3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all
    <abb> <aab> <aba> <baa> <bab> …
• 4th scan: 8 cand., 6 length-4 seq. pat.
    <abba> <(bd)bc> …
• 5th scan: 1 cand., 1 length-5 seq. pat.
    <(bd)cba>

[Figure legend: candidates that cannot pass the support threshold; candidates not in DB at all]

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Bottlenecks of GSP

• Too many candidates
– 1,000 frequent length-1 sequences generate $1000 \times 1000 + \dfrac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates
• Too many scans of the whole database
• The real difficulty: mining long sequential patterns
– An exponential (astronomical) number of short candidates
– A length-100 sequential pattern needs $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$ candidate sequences!

Applications of frequent patterns

• Association-based classification

• Iceberg cube computation

• Database compression by fascicles and frequent patterns

• Mining sequential patterns (GSP, PrefixSpan, SPADE, etc.)

• Mining partial periodicity, cyclic associations, etc.

• Mining frequent structures, trends, etc

Research problems in frequent pattern mining

• Multi-dimensional gradient analysis: patterns regarding changes and differences
• Mining fault-tolerant associations
– "3 out of 4 courses excellent" leads to A in data mining
• Fascicles and database compression by frequent pattern mining
• Partial periodic patterns
• DNA sequence analysis and pattern classification

Tools: Association Rule Mining

• Free
– ARTool, http://www.cs.umb.edu/~laur/ARtool/
– Apriori, http://fuzzy.cs.uni-magdeburg.de/~borgelt/#Software
• Commercial
– IBM Intelligent Miner for Data, http://www.software.ibm.com/data/intelli-mine/
– DBMiner 2.0, http://www.dbminer.com/
– clementine, http://www.spss.com/clementine/

Apriori (1/2)

• apriori [options] infile outfile [appfile]
– infile: file to read transactions from
– outfile: file to write association rules to
– appfile: file stating item appearances (optional)
• options:
    -t#   target type (s: item sets, r: rules (default), h: hyperedges)
    -m#   minimal number of items per set/rule/hyperedge (default: 1)
    -n#   maximal number of items per set/rule/hyperedge (default: 5)
    -s#   minimal support of a set/rule/hyperedge (default: 10%)
    -c#   minimal confidence of a rule/hyperedge (default: 80%)

Apriori (2/2) options

    -b/f/r#   blank characters, field and record separators (default: " \t\r", " \t", "\n")
    -o        use original definition of the support of a rule (body & head)
    -p#       output format for support/confidence (default: "%.1f%%")
    -x        extended support output (print both rule support types)
    -a        print absolute support (number of transactions)
    -e#       additional rule evaluation measure (default: none)

(# always means a number, a letter, or a string that specifies the parameter of the option.)

Apriori Input Format

• Text file (structured by field separators, record separators and blanks)
– Record separators: separate records (lines)
– Field separators: separate fields (or columns), i.e. words
– Blanks: fill fields (columns), e.g. to align them
• Example (one transaction per record):
    1,2,3
    1,4,5
    2,3,4
    1,2,3,4
    2,3
    1,2,4
    4,5
    1,2,3,4
    3,4,5
    1,2,3

Item Appearances File

• Item may appear only in rule bodies (antecedents):
– i in b body a ante antecedent
• Item may appear only in rule heads (consequents):
– o out h head c cons consequent
• Item may appear in rule bodies (antecedents) or in rule heads (consequents):
– io inout bh b&h ac a&c both
• Item may appear neither in rule bodies (antecedents) nor in rule heads (consequents):
– n neither none ign ignore -
• Example 1: Generate only rules with item "x" in the consequent.
    in
    x out

Sample Command

• apriori test1.tab test.rul
• apriori -b"(" -f, -r")" test2.tab test2.rul
• apriori -f ",.;:" -l test3.tab test3.rul
• apriori test1.tab -

Sample input data shown on the slide, in three formats:
– 0 1 0 2 0 3 1 1 1 4 1 5 2 2 2 3 2 4 3 1 3 2 3 3 3 4 4 2 4 3 5 1 5 2 5 4 6 4 6 5 7 1 7 2 7 3 7 4 8 3 8 4 8 5 9 1 9 2 9 3
– (0,1)(0,2)(0,3)(1,1)(1,4)(1,5)(2,2)(2,3)(2,4)(3,1)(3,2)(3,3)(3,4)(4,2)(4,3)...
– 1,2,3 1,4,5 2.3.4 1,2,3,4 2:3 1,2,4 4,5 1,2,3,4 3;4;5 1,2,3

Example 2 (item appearances file): Item "x" may appear only in a rule head (consequent), item "y" only in a rule body (antecedent); appearance of all other items is not restricted.
    both
    x head
    y body

Sample Output

3 <- 2 (70.0%, 85.7%)
2 <- 3 (70.0%, 85.7%)
2 <- 1 (60.0%, 83.3%)
4 <- 5 (30.0%, 100.0%)
3 <- 2 1 (50.0%, 80.0%)
2 <- 3 1 (40.0%, 100.0%)
4 <- 3 5 (10.0%, 100.0%)
4 <- 1 5 (10.0%, 100.0%)
2 <- 3 4 1 (20.0%, 100.0%)
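The percentages in this output can be cross-checked by hand. The sketch below assumes two things that are inferred from the numbers rather than stated on the slides: that the input was the ten transactions shown under "Apriori Input Format", and that the first percentage is the support of the rule body alone (the -o option description suggests the body & head definition is not the default). The function name `rule_stats` is made up for the check.

```python
lines = ["1,2,3", "1,4,5", "2,3,4", "1,2,3,4", "2,3",
         "1,2,4", "4,5", "1,2,3,4", "3,4,5", "1,2,3"]
db = [set(line.split(",")) for line in lines]

def rule_stats(head, body, db):
    """Return (body support %, confidence %) for the rule head <- body."""
    body, head = set(body), set(head)
    n_body = sum(1 for t in db if body <= t)
    n_rule = sum(1 for t in db if body | head <= t)
    return round(100 * n_body / len(db), 1), round(100 * n_rule / n_body, 1)

print(rule_stats({"3"}, {"2"}, db))        # (70.0, 85.7)
print(rule_stats({"3"}, {"2", "1"}, db))   # (50.0, 80.0)
```

Under these assumptions every rule in the sample output is reproduced, for example 4 <- 5 gives (30.0, 100.0).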

Applications: a review

• Market basket analysis
• Cross-marketing
• Catalog design
• Sales campaign analysis
• Web log (click stream) analysis
• DNA sequence analysis
• ...
