Frequent Pattern Mining
- Mining Association Rules
2 Ming-Yen Lin, IECS.FCU
Outline
• What is frequent pattern mining?
• Frequent pattern mining methods
• Constraint-based frequent pattern mining
• Sequential patterns
• Applications of frequent patterns
• Research problems in frequent pattern mining
Frequent Pattern Mining
• Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]
• Frequent pattern mining: finding regularities in data
– What products were often purchased together? — Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
F.P.M. : frequent pattern mining
F.P.M. Is a Fundamental Data Mining Task
• The foundation of many data mining tasks
– Association rules, correlation, causality
– sequential patterns, temporal/cyclic association, partial periodicity
– spatial and multimedia patterns – associative classification – cluster analysis
– iceberg cube, …
• Broad applications
– Market basket analysis – cross-marketing – catalog design – sales campaign analysis
– Web log (click stream) analysis, DNA sequence analysis, …
Basic Concepts: Frequent Itemsets
• Itemset X = {x_1, …, x_k}
– e.g. {A,C}, {B,E,F}, {C,E}
• Support of an itemset: the fraction of transactions that contain it – s({A}) = 3/4
• Frequent itemset: an itemset that meets the minimum support (m.s.)
• Frequent pattern mining: find all frequent patterns
Pattern = set of items

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F
Basic Concepts: Association Rules
• Association rule mining: once the frequent itemsets are found, determine the "interesting" relationships among them
– Support, s: the probability that a transaction contains X∪Y
– Confidence, c: the conditional probability that a transaction containing X also contains Y
• m.s. = 50%, m.c. = 50%
– A → C (50%, 66.7%) – C → A (50%, 100%)
m.s.: minimum support; m.c.: minimum confidence

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
Mining Association Rules (Example)
Min. support 50%
Min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

Rule A → C:
support = support({A}∪{C}) = 50%
confidence = support({A}∪{C})/support({A}) = 66.7%
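The support and confidence figures in this example can be checked with a few lines of Python. This is only a minimal sketch over the four-transaction table; the names `transactions`, `support` and `confidence` are illustrative and not part of the original slides.

```python
# Toy database from the slide: transaction-id -> items bought.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction containing lhs also contains rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A"}))            # 0.75
print(support({"A", "C"}))       # 0.5
print(confidence({"A"}, {"C"}))  # about 0.667
print(confidence({"C"}, {"A"}))  # 1.0
```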
Types of Association Rules
• Boolean vs. quantitative
– buys(x, “SQLServer”) ^ buys(x, “DMBook”) → buys(x, “DM Software”) [0.2%, 60%]
– age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multi-dimensional
• Single-level vs. multi-level
– What brands of beers are associated with what brands of diapers?
Extensions and Applications of Association Rules
• Correlation, causality analysis & mining interesting rules
• Maxpatterns and frequent closed itemsets
• Constraint-based mining
• Sequential patterns
• Association-based classification
• Computing iceberg cubes
Frequent Pattern Mining Methods
• Apriori and its variations and improvements
– Mining methods that do not generate candidate patterns
• Max-patterns and closed patterns
– Compact representations
• Multi-dimensional, multi-level frequent patterns
• Interestingness: correlation and causality
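Apriori: Candidate Generation-and-Test
• Generate candidates, then test them
– Generate candidate itemsets of length (k+1) from the frequent itemsets of length (k)
– Count each candidate's support against the database and check whether it is frequent
• Candidate condition: every subset of a candidate must be frequent
– Property: anti-monotone
– Any transaction containing {beer, diaper, nuts} also contains {beer, diaper}
– If {beer, diaper, nuts} is frequent → {beer, diaper} must be frequent
– No superset of a non-frequent itemset can be frequent (so it never becomes a candidate and need not be generated or counted)
– This eliminates many useless combinations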
Apriori Example
m.s. = 2 (support is shown as a count, not a percentage)

Database
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1            L1
Itemset   sup            Itemset   sup
{A}       2              {A}       2
{B}       3              {B}       3
{C}       3              {C}       3
{D}       1              {E}       3
{E}       3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2            L2
Itemset   sup            Itemset   sup
{A, B}    1              {A, C}    2
{A, C}    2              {B, C}    2
{A, E}    1              {B, E}    3
{B, C}    2              {C, E}    2
{B, E}    3
{C, E}    2

C3 (generated from L2): {B, C, E}

3rd scan → L3
Itemset     sup
{B, C, E}   2
Apriori Algorithm Details
L_1 = {frequent items}; k = 1;
repeat until L_k = ∅:
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database D:
        add one to the support count of every candidate in C_{k+1} that is contained in t;
    L_{k+1} = the candidates in C_{k+1} that satisfy the minimum support;
    k = k + 1;
Answer: the union of all L_k
• L_k: frequent itemsets of size k
• C_k: candidate itemsets of size k
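The level-wise loop can be sketched in Python as follows. This is a simplified illustration, not the implementation behind the slides: candidates are formed as unions of two frequent k-itemsets (rather than a sorted self-join), counting is a plain nested loop instead of a hash tree, and the names `apriori` and `min_count` are assumptions. It is run on the four-transaction example database with a minimum support count of 2.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    answer = dict(level)
    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have k+1 items,
        # kept only if every k-subset is frequent (anti-monotone pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
        # One database scan: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        answer.update(level)
        k += 1
    return answer

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(db, min_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The printed result lists the four frequent 1-itemsets, the four frequent 2-itemsets and {B, C, E} with count 2, matching L1, L2 and L3 of the example.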
Apriori Key Detail I
• C_{k+1} = candidates generated from L_k
– Step 1: self-joining L_k
– Step 2: pruning (remove the impossible candidates)
• Example: L3 = {abc, abd, acd, ace, bcd}, kept in sorted order
– Self-joining: L3*L3
  • abcd: from abc and abd
  • acde: from acd and ace
– Pruning:
  • remove acde because ade is not in L3
• C4 = {abcd}
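The two steps can be written out directly. The sketch below assumes itemsets are stored as sorted tuples; `generate_candidates` is a hypothetical helper name, and it returns both the joined list and the pruned list so the effect of each step is visible on the L3 example.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Build C(k+1) from sorted k-itemsets (tuples) by self-join, then prune."""
    frequent = sorted(frequent_k)
    frequent_set = set(frequent)
    k = len(frequent[0])
    joined, pruned = [], []
    for i, p in enumerate(frequent):
        for q in frequent[i + 1:]:
            # Join step: p and q share the first k-1 items and differ in the last.
            if p[:-1] == q[:-1]:
                joined.append(p + (q[-1],))
    for cand in joined:
        # Prune step: every k-subset of the candidate must itself be frequent.
        if all(sub in frequent_set for sub in combinations(cand, k)):
            pruned.append(cand)
    return joined, pruned

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
joined, C4 = generate_candidates(L3)
print(["".join(c) for c in joined])  # ['abcd', 'acde']
print(["".join(c) for c in C4])      # ['abcd']
```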
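Apriori Key Detail II
• Adding one to the count of every candidate contained in t: why is this hard?
– The total number of candidates is very large
– Each transaction t may contain more than one candidate
• Method:
– Organize the candidates carefully (store them in a hash tree)
– Find all candidates that could be contained in t and test them
[Figure: hash tree of candidate 3-itemsets with subset function branches 1,4,7 / 2,5,8 / 3,6,9; leaves hold candidates 2 3 4, 5 6 7, 1 4 5, 1 3 6, 1 2 4, 4 5 7, 1 2 5, 4 5 8, 1 5 9, 3 4 5, 3 5 6, 3 5 7, 6 8 9, 3 6 7, 3 6 8. Transaction 1 2 3 5 6 is expanded as 1 + 2 3 5 6, 1 2 + 3 5 6 and 1 3 + 5 6.]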
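Challenges of Frequent Pattern Mining
Challenges (which are also the directions in which Apriori has been varied and improved)
• Too many scans of the whole database
– Reduce the number of scans
• Still too many candidates
– Reduce the number of candidates
• Counting candidate support is costly
– Count faster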
Other Methods
• Improvements of Apriori
– DIC – DHP – Partition – Sampling – ...
• No candidate generation: compress the database, then mine it
– FP-Growth, H-mine
• Intersecting item lists (database stored in vertical layout)
– Eclat/MaxEclat, VIPER
Visualizing Association Rules: Pane Graph
Visualizing Association Rules: Rule Graph
Max-patterns
• Frequent pattern {a1, …, a100} → $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ frequent sub-patterns!
• Max-pattern: a frequent pattern that has no frequent proper super-pattern
– BCDE, ACD are max-patterns – BCD is not a max-pattern
m.s. = 2

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
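The max-pattern definition can be verified on this three-transaction database by brute force. The sketch below enumerates every itemset, which is only feasible because there are six items; it is meant as a check of the definition, not as a mining method, and all names in it are illustrative.

```python
from itertools import combinations

db = {10: set("ABCDE"), 20: set("BCDE"), 30: set("ACDF")}
min_count = 2

def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if set(itemset) <= items)

# Brute-force enumeration is fine for six items; real miners avoid it.
items = sorted(set().union(*db.values()))
frequent = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if count(c) >= min_count
]
# A max-pattern is a frequent itemset with no frequent proper superset.
max_patterns = [p for p in frequent if not any(p < q for q in frequent)]
print(sorted("".join(sorted(p)) for p in max_patterns))  # ['ACD', 'BCDE']
```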
Frequent Closed Patterns
• Conf(ac → d) = 100% ⇒ recording acd alone is enough
• A frequent itemset X is a closed pattern if there is no item y outside X such that every transaction containing X also contains y
– “acdf” is a frequent closed pattern
• A compact representation of frequent patterns
• Reduces the number of patterns and rules

TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f
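The closedness test follows directly from the definition: take all transactions that contain X and intersect them; X is closed when that intersection adds no item. The sketch below is illustrative only (the helper names `cover`, `closure` and `is_closed` are assumptions) and expects an itemset that occurs in at least one transaction.

```python
db = {
    10: set("acdef"),
    20: set("abe"),
    30: set("cef"),
    40: set("acdf"),
    50: set("cef"),
}

def cover(itemset):
    """Transactions that contain every item of `itemset`."""
    return [items for items in db.values() if set(itemset) <= items]

def closure(itemset):
    """All items shared by every transaction that contains `itemset`."""
    return set.intersection(*cover(itemset))

def is_closed(itemset):
    # Closed: no extra item y appears in every transaction containing X.
    return closure(itemset) == set(itemset)

print(sorted(closure("ac")))  # ['a', 'c', 'd', 'f']
print(is_closed("acdf"))      # True
print(is_closed("cef"))       # True
```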
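Multi-level Association Rules
• Items usually form a concept hierarchy
• Support thresholds should be flexible: lower levels should use lower support
• Encode the transaction database according to dimension and level
• Explore multi-level mining

Concept hierarchy example:
Level 1: Milk [support = 10%]
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]

Uniform support:   Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support:   Level 1 min_sup = 5%, Level 2 min_sup = 3%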
Multi-dimensional Association
• Single-dimensional rules
buys(X, “milk”) ⇒ buys(X, “bread”)
• Multi-dimensional rules: ≥ 2 dimensions or predicates
– Inter-dimension association rules (no repeated predicates)
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
– Hybrid-dimension association rules (repeated predicates)
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
• Categorical attributes
– finite number of possible values, no ordering among values
• Quantitative attributes
– numeric, implicit ordering among values
Distance-based Association Rules
• Binning does not necessarily capture the semantics of interval data
• Distance-based partitioning gives a more meaningful discretization by considering:
– density/number of points in an interval – “closeness” of points in an interval

Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
7          [0,10]                   [7,20]                 [7,7]
20         [11,20]                  [22,50]                [20,22]
22         [21,30]                  [51,53]                [50,53]
50         [31,40]
51         [41,50]
53         [51,60]
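The equi-depth and distance-based columns of the price table can be reproduced with a short sketch. The slides do not say how "distance" is measured, so the gap rule below (start a new interval when two neighbouring prices differ by more than `max_gap`) and the value `max_gap=5` are assumptions chosen for illustration.

```python
prices = [7, 20, 22, 50, 51, 53]

def equi_depth(values, depth):
    """Split sorted values into consecutive groups of `depth` points."""
    values = sorted(values)
    groups = [values[i:i + depth] for i in range(0, len(values), depth)]
    return [(g[0], g[-1]) for g in groups]

def distance_based(values, max_gap):
    """Start a new interval whenever the gap to the previous point exceeds max_gap."""
    values = sorted(values)
    groups = [[values[0]]]
    for v in values[1:]:
        if v - groups[-1][-1] > max_gap:
            groups.append([v])
        else:
            groups[-1].append(v)
    return [(g[0], g[-1]) for g in groups]

print(equi_depth(prices, depth=2))        # [(7, 20), (22, 50), (51, 53)]
print(distance_based(prices, max_gap=5))  # [(7, 7), (20, 22), (50, 53)]
```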
Interestingness Measure: Correlations
• play basketball ⇒ eat cereal [40%, 66.7%] is misleading!
– The overall percentage of students eating cereal is 75% which is higher than 66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although its support and confidence are lower
• Measure of dependent/correlated events: lift

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000

$\mathrm{corr}_{A,B} = \frac{P(A \cup B)}{P(A)\,P(B)}$
Here P(A∪B) is the probability that a transaction contains both A and B.
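Plugging the contingency table into the lift formula shows why the first rule is misleading. The sketch below is a plain arithmetic check; the function name `lift` and the variable names are illustrative.

```python
total = 5000
basketball = 3000
cereal = 3750
both = 2000                   # play basketball and eat cereal
basketball_not_cereal = 1000  # play basketball and do not eat cereal

def lift(n_ab, n_a, n_b, n):
    """P(A and B) / (P(A) * P(B)) computed from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# basketball => cereal: below 1, so the two events are negatively correlated.
print(round(lift(both, basketball, cereal, total), 3))  # 0.889
# basketball => not cereal: above 1, so the events are positively correlated.
print(round(lift(basketball_not_cereal, basketball, total - cereal, total), 3))  # 1.333
```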
Constraint-based Frequent Pattern Mining
• Finding all the patterns in a database autonomously? Unrealistic!
– The patterns could be too many but not focused!
• Data mining should be an interactive process
– User directs what to be mined using a data mining query language (or a graphical user interface)
• Constraint-based mining
– User flexibility: provides constraints on what to be mined
– System optimization: explores such constraints for efficient mining
LexMiner
• Ming-Yen Lin and Suh-Yin Lee, "A Fast Lexicographic Algorithm for Association Rule Mining in Web Applications," Proceedings of the ICDCS Workshop on Knowledge Discovery and Data Mining in the World-Wide Web (ICDCS00), Taipei, Taiwan, R.O.C., pp. F7-F14, 2000.
Apriori Uses a Hash Tree to Store Candidates
• Excess comparison, e.g. T1 = {1, 2, 3, 4, 5, 6}
• Duplicate counting avoidance, e.g. T2 = {1, 3, 4, 200, 401, 403}
• Large storage requirement
• High splitting cost
[Figure: hash tree of candidate 3-itemsets, hash function = x MOD 199, size of each leaf = 5, d = depth (d = 1, 2, 3). Leaves hold candidates such as (1,3,4), (1,3,9), (1,401,403), (200,401,403), (200,401,555), (1,2,83), (1,2,95), (1,2,96), (2,3,4), (2,3,5), (198,201,203), (198,400,555). Legend: leaf, interior node, empty leaf.]
LexTree: Lexicographically Ordered Tree
• Intrinsic Property: Lexicographic Order
– Items in each transaction, e.g. {7, 11, 20, 29, 37}
– k-itemsets, e.g. (1, 3, 4) < (1, 3, 10) < (1, 4, 10)
• Storing itemsets: by lexicographic order
• LexTree: compact, hierarchical tree
– candidate LexTree: efficient, redundancy-free support counting
– frequent LexTree: effective candidate generation
• LexMiner Algorithm
LexTree Structure
[Figure: LexTree storing the 3-itemsets (1, 3, 4), (1, 3, 10), (1, 3, 11), (1, 4, 10), (1, 4, 11), (1, 4, 15), (1, 10, 15), (3, 4, 7), (3, 4, 10), (3, 4, 11), (3, 7, 11), (3, 7, 20), (3, 11, 20), (4, 7, 10), (4, 10, 15), (7, 11, 20)]
Internal node: Item_ID, sibling, next
Leaf node: Item_ID, sibling, support
Legend: # = support; a separate symbol marks a null link
LexTree Construction
Insert candidates (1, 3, 4), (1, 3, 10), (1, 3, 11), (1, 4, 10)
[Figure: four snapshots of the LexTree with the pointers Last[1], Last[2], Last[3]: (1) after inserting candidate (1, 3, 4); (2) after inserting candidate (1, 3, 10); (3) after inserting candidate (1, 3, 11); (4) after inserting candidate (1, 4, 10). Every new leaf starts with support 0.]
LexTree Construction (Cont.)
[Figure: two further snapshots of the LexTree with the pointers Last[1], Last[2], Last[3]: (5) after inserting candidate (1, 10, 15), once (1, 4, 11) and (1, 4, 15) have been inserted; (6) after inserting candidate (3, 4, 7).]
Notations
• D: The database of transactions
• T: A transaction, T = {x_1, x_2, …, x_p, …, x_m}; x_1, x_2, …, x_k: items
• minsup: The minimum support specified by the user
• X: k-itemset, X = (x_1, x_2, …, x_k)
• X.support: The support of itemset X
• C_k: The set of candidate k-itemsets
• L_k: The set of frequent k-itemsets
• Γ_{C_k}: The candidate k-itemset LexTree
• Γ_{L_k}: The frequent k-itemset LexTree
Algorithm LexMiner
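[Flowchart, reconstructed as steps]
1. Find L_1. If L_1 = ∅, go to step 5.
2. Find L_2; set k = 3.
3. If L_{k-1} = ∅, go to step 5.
4. Build the frequent LexTree Γ_{L_{k-1}} from L_{k-1}; generate C_k from Γ_{L_{k-1}}; store C_k in the candidate LexTree Γ_{C_k}; ∀T ∈ D, Find_and_increment(T, Γ_{C_k}); L_k = {X ∈ C_k | X.support ≥ minsup}; k++; go back to step 3.
5. Answer = ∪_k L_k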
Algorithm Find_and_Increment
[Flowchart, reconstructed as an outline]
Find_and_Increment(tp, cp): while (not end_of_list) and (cp ≠ null), compare item[tp] with cp.ID.
• Internal node (cp ≠ leaf):
– item[tp] < cp.ID: tp advanced, Find_and_Increment(tp, cp)
– item[tp] > cp.ID: cp advanced, Find_and_Increment(tp, cp)
– item[tp] = cp.ID: Find_and_Increment(tp+1, cp.sibling) and Find_and_Increment(tp+1, cp.next)
• Leaf node:
– item[tp] < cp.ID: advance tp
– item[tp] > cp.ID: advance cp
– item[tp] = cp.ID: cp.support++, then advance tp and cp
Example 1: #Comparison Minimized
At pass 3, T1 = {7, 11, 20, 29, 37}
Intrinsically ordered 3-itemsets in T1:
{ 7, 11} × {20, 29, 37}
{ 7, 20} × {29, 37}
{ 7, 29} × {37}
{11, 20} × {29, 37}
{11, 29} × {37}
{20, 29} × {37}
[Figure: candidate 3-itemset LexTree Γ_{C_3}, with the visited nodes marked Na to Nf]
Example 2: Fast Support Counting
At pass 3, T2 = {3, 4, 7, 10, 11}
Intrinsically ordered 3-itemsets in T2:
{3, 4} × {7, 10, 11}
{3, 7} × {10, 11}
{3, 10} × {11}
{4, 7} × {10, 11}
{4, 10} × {11}
{7, 10} × {11}
[Figure: candidate 3-itemset LexTree Γ_{C_3}, with the visited nodes marked Na to Nl]
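The idea behind both examples, following only those ordered 3-itemsets of a transaction that actually exist in the candidate tree, can be sketched with an ordered prefix tree. This is not the LexTree of the paper: nested Python dicts stand in for the sibling/next linked nodes, and `build_tree`, `find_and_increment` and `supports` are hypothetical names. Transactions are assumed to be sorted in ascending item order.

```python
def build_tree(candidates):
    """Nested dicts keyed by item; the last item of a candidate maps to a one-element count list."""
    root = {}
    for cand in sorted(candidates):
        node = root
        for item in cand[:-1]:
            node = node.setdefault(item, {})
        node[cand[-1]] = [0]
    return root

def find_and_increment(transaction, start, node, depth, k):
    """Visit only the ordered k-subsets of the transaction that exist in the tree."""
    for pos in range(start, len(transaction)):
        child = node.get(transaction[pos])
        if child is None:
            continue
        if depth == k:
            child[0] += 1  # leaf reached: one candidate matched
        else:
            find_and_increment(transaction, pos + 1, child, depth + 1, k)

def supports(node, prefix=()):
    """Flatten the tree back into {candidate: count}."""
    out = {}
    for item, child in node.items():
        if isinstance(child, list):
            out[prefix + (item,)] = child[0]
        else:
            out.update(supports(child, prefix + (item,)))
    return out

candidates = [(1, 3, 4), (1, 3, 10), (1, 3, 11), (1, 4, 10), (1, 4, 11), (1, 4, 15),
              (1, 10, 15), (3, 4, 7), (3, 4, 10), (3, 4, 11), (3, 7, 11), (3, 7, 20),
              (3, 11, 20), (4, 7, 10), (4, 10, 15), (7, 11, 20)]
tree = build_tree(candidates)
for t in ([7, 11, 20, 29, 37], [3, 4, 7, 10, 11]):  # T1 and T2 from the examples
    find_and_increment(t, 0, tree, 1, 3)
matched = sorted(c for c, n in supports(tree).items() if n > 0)
print(matched)
```

T1 matches only (7, 11, 20), while T2 matches (3, 4, 7), (3, 4, 10), (3, 4, 11), (3, 7, 11) and (4, 7, 10); each matched candidate is counted exactly once.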
Efficient Candidate Generation
• Common-prefixed L_{k-1}: linked by sibling
• Join: C_k = L_{k-1} × L_{k-1}, then
• Prune: remove any candidate itemset that has a subset not in Γ_{L_{k-1}}
– Searching in Γ_{L_{k-1}}: a technique similar to Find_and_Increment

insert into C_k
select p[1], p[2], …, p[k-1], q[k-1]
from L_{k-1} p, L_{k-1} q
where p[1] = q[1], …, p[k-2] = q[k-2], p[k-1] < q[k-1];
FP-growth
• J. Han, J. Pei, and Y. Yin: “Mining frequent patterns without candidate generation”. In Proc. ACM-SIGMOD’2000, pp. 1-12, Dallas, TX, May 2000.
• Compress DB into a tree (FP-tree)
• Find frequent itemsets in FP-tree
Construction of FP-tree from a Transaction Database
min_support = 0.5

TID   Items bought                (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree

Header Table
Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3

FP-tree (each header table entry links to the nodes of its item):
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
Mining Frequent Patterns with FP-trees
• Idea: Frequent pattern growth
– Recursively grow frequent patterns by pattern and database partition
• Method
– For each frequent item, construct its conditional pattern- base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP- tree
– Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
From FP-tree to Conditional Pattern-Base
• Starting at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all of the transformed prefix paths of item p to form p’s conditional pattern base

Conditional pattern bases
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
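The conditional pattern bases above can be reproduced by building the FP-tree from the five ordered transactions and walking from each item's nodes up to the root. The sketch below is a minimal illustration: the `Node` class, the list-based header table (in place of node-links) and the function names are assumptions, and the item order f, c, a, b, m, p is taken from the slide rather than recomputed.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each ordered frequent-item list as a path; shared prefixes share nodes."""
    root = Node(None, None)
    header = {}  # item -> list of nodes carrying that item
    for items in ordered_transactions:
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(item, header):
    """Prefix path of every node for `item`, weighted by that node's count."""
    base = {}
    for node in header.get(item, []):
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            prefix = "".join(reversed(path))
            base[prefix] = base.get(prefix, 0) + node.count
    return base

# Ordered frequent items of transactions 100 to 500.
ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
root, header = build_fp_tree(ordered)
for item in "cabmp":
    print(item, conditional_pattern_base(item, header))
```

Each prefix path carries the count of the node it ends at, not the counts of the nodes along the path, which is why m's base is fca:2, fcab:1 even though f appears four times in the tree.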
Transformed Prefix Paths
• Derive the transformed prefix paths of item p
– For each item p in the tree, collect p’s prefix path with count = p’s frequency
– Why only prefix path? Why this count? Complete?
From Conditional Pattern-Bases to Conditional FP-trees
• For each pattern-base
– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the pattern base
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns related to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree Until …
m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of “am”: (fc:3); am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3); cm-conditional FP-tree: {} → f:3
Cond. pattern base of “cam”: (f:3); cam-conditional FP-tree: {} → f:3
A Special Case: Single FP-tree Path
• Suppose a (conditional) FP-tree T has a single path P
• The complete set of frequent patterns of T can be generated by enumeration of all the combinations of the sub-paths of P
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
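For a single-path tree the enumeration is just "every subset of the path, plus the suffix item". A small sketch for the m-conditional FP-tree, whose path is f, c, a (the function name `single_path_patterns` is illustrative):

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """Every combination of the items on a single path, each extended with the suffix."""
    return [
        "".join(combo) + suffix
        for k in range(len(path) + 1)
        for combo in combinations(path, k)
    ]

print(single_path_patterns("fca", "m"))
# ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']
```

A path of n items yields 2^n patterns, here 2^3 = 8.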
A More General (Special) Case: Single Prefix Path in FP-tree
• Suppose a (conditional) FP-tree T has a shared single prefix-path P
• Mining can be decomposed into two parts
– Reduction of the single prefix path into one node – Concatenation of the mining results of the two parts
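[Figure: an FP-tree whose single prefix path {} → a1:n1 → a2:n2 → a3:n3 leads into a branching part with nodes b1:m1, C1:k1, C2:k2, C3:k3. The tree is decomposed into the single prefix path {} → a1:n1 → a2:n2 → a3:n3 plus a tree rooted at r1 that holds the branching part b1:m1, C1:k1, C2:k2, C3:k3.]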
Sequential Patterns
• Transaction databases, time-series databases, and sequence databases
• Frequent patterns vs. sequential patterns
• Applications of sequential pattern mining
– Customer purchase sequences
  • First buy computer, then CD-ROM, and then digital camera, within 3 months.
– Medical treatments, natural disasters (e.g., earthquakes), science and engineering processes, stocks, etc.
– Telephone calling patterns, Weblog click streams
– DNA sequences and gene structures
What Is Sequential Pattern Mining?
• Given a set of sequences, find all frequent subsequences

A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items.
Items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
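The subsequence relation and the support of a sequence can be checked with a short sketch. A sequence is represented here as a list of strings, one string per element, so <a(abc)(ac)d(cf)> becomes ["a", "abc", "ac", "d", "cf"]; this representation and the names `is_subsequence` and `support` are assumptions made for illustration.

```python
def is_subsequence(pattern, sequence):
    """Greedy left-to-right match: each pattern element must be a subset of a later element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a strictly later element
    return True

db = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

def support(pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for seq in db.values() if is_subsequence(pattern, seq))

print(is_subsequence(["a", "bc", "d", "c"], db[10]))  # True
print(support(["ab", "c"]))                           # 2
```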
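What Makes Sequential Pattern Mining Hard?
• A huge number of possible sequential patterns may be hidden in a database
• A mining algorithm should
– find all patterns that satisfy the minimum support (frequency) threshold
– be highly efficient and scalable, with few database scans
– work together with the various constraints specified by users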
Basic Property of Sequential Patterns: Apriori
• Basic property: Apriori (Agrawal & Srikant’94)
– If a sequence S is not frequent
– then none of the super-sequences of S is frequent
– e.g., <hb> is infrequent → <hab> and <(ah)b> are also infrequent

Given support threshold min_sup = 2
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
GSP Algorithm Details
L_1 = {frequent sequences of length 1}; k = 1;
repeat until L_k = ∅:
    C_{k+1} = candidates generated from L_k;
    for each data sequence t in database D:
        add one to the support count of every candidate in C_{k+1} that is contained in t;
    L_{k+1} = the candidates in C_{k+1} that satisfy the minimum support;
    k = k + 1;
Answer: the union of all L_k
• L_k: frequent sequences of size k
• C_k: candidate sequences of size k
Finding Length-1 Sequential Patterns
• Initial candidates: all singleton sequences
– <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan database once, count support for candidates
min_sup = 2 (same sequence database as on the Apriori property slide)

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
Generating Length-2 Candidates
Candidates made of two single-item elements (6 × 6 = 36):

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

Candidates made of one two-item element (6 × 5 / 2 = 15):

       <a>   <b>      <c>      <d>      <e>      <f>
<a>          <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                   <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                            <(cd)>   <(ce)>   <(cf)>
<d>                                     <(de)>   <(df)>
<e>                                              <(ef)>
<f>

51 length-2 candidates
Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
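The candidate counts can be reproduced by generating the length-2 candidates explicitly, once from the six frequent items and once from all eight items. This is a small counting sketch; `length2_candidates` is a hypothetical helper name.

```python
from itertools import combinations, product

def length2_candidates(items):
    """<xy> for every ordered pair (including <xx>) plus <(xy)> for every unordered pair."""
    sequential = [f"<{x}{y}>" for x, y in product(items, repeat=2)]
    same_element = [f"<({x}{y})>" for x, y in combinations(items, 2)]
    return sequential + same_element

frequent_items = "abcdef"  # the six length-1 sequential patterns
all_items = "abcdefgh"     # all eight items, i.e. without pruning <g> and <h>
with_pruning = len(length2_candidates(frequent_items))
without_pruning = len(length2_candidates(all_items))
print(with_pruning, without_pruning)                # 51 92
print(f"{1 - with_pruning / without_pruning:.2%}")  # 44.57%
```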
Finding Length-2 Sequential Patterns
• Scan database one more time, collect support count for each length-2 candidate
• There are 19 length-2 candidates which pass the minimum support threshold
– They are length-2 sequential patterns
Generating Length-3 Candidates and Finding Length-3 Sequential Patterns
• Generating candidates
– Self-join the length-2 sequential patterns
  • Based on the Apriori property
  • <ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate
  • <(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate
– 46 candidates are generated; the impossible ones are pruned
• Finding length-3 sequential patterns
– Scan database once more, collect support counts for candidates
– 19 out of 46 candidates pass support threshold
The GSP Mining Process
min_sup = 2 (same sequence database as on the Apriori property slide)
1st scan: 8 cand., 6 length-1 seq. pat.: <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all: <abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 seq. pat.: <abba> <(bd)bc> …
5th scan: 1 cand., 1 length-5 seq. pat.: <(bd)cba>
Legend: candidates that cannot pass the support threshold; candidates not in DB at all
GSP Bottlenecks
• Too many candidates
– 1,000 frequent length-1 sequences generate $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates
• Too many scans of the whole database
• The real difficulty: mining long sequential patterns
– An exponential (astronomical) number of short candidates
– A length-100 sequential pattern needs $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$ candidate sequences!
Applications of Frequent Patterns
• Association-based classification
• Iceberg cube computation
• Database compression by fascicles and frequent patterns
• Mining sequential patterns (GSP, PrefixSpan, SPADE, etc.)
• Mining partial periodicity, cyclic associations, etc.
• Mining frequent structures, trends, etc
Research Problems in Frequent Pattern Mining
• Multi-dimensional gradient analysis: patterns regarding changes and differences
• Mining fault-tolerant associations
– “3 out of 4 courses excellent” leads to A in data mining
• Fascicles and database compression by frequent pattern mining
• Partial periodic patterns
• DNA sequence analysis and pattern classification
Tools: Association Rule Mining
• Free
– ARTool, http://www.cs.umb.edu/~laur/ARtool/
– Apriori, http://fuzzy.cs.uni-magdeburg.de/~borgelt/#Software
• Commercial
– IBM Intelligent Miner for Data, http://www.software.ibm.com/data/intelli-mine/
– DBMiner 2.0, http://www.dbminer.com/
– Clementine, http://www.spss.com/clementine/
Apriori (1/2)
• apriori [options] infile outfile [appfile]
– infile: file to read transactions from
– outfile: file to write association rules
– appfile: file stating item appearances (optional)
• options:
  -t#  target type (s: item sets, r: rules (default), h: hyperedges)
  -m#  minimal number of items per set/rule/hyperedge (default: 1)
  -n#  maximal number of items per set/rule/hyperedge (default: 5)
  -s#  minimal support of a set/rule/hyperedge (default: 10%)
  -c#  minimal confidence of a rule/hyperedge (default: 80%)
Apriori (2/2) options
  -b/f/r#  blank characters, field and record separators (default: " \t\r", " \t", "\n")
  -o       use original definition of the support of a rule (body & head)
  -p#      output format for support/confidence (default: "%.1f%%")
  -x       extended support output (print both rule support types)
  -a       print absolute support (number of transactions)
  -e#      additional rule evaluation measure (default: none)
  (# always means a number, a letter, or a string that specifies the parameter of the option.)
Apriori Input Format
• text file (field and record separators and blanks)
– Record separators: separate records (lines)
– Field separators: separate fields (or columns), i.e. words
– Blanks: fill fields (columns), e.g. to align them.
• Example
  1,2,3
  1,4,5
  2,3,4
  1,2,3,4
  2,3
  1,2,4
  4,5
  1,2,3,4
  3,4,5
  1,2,3
Item Appearances File
• item may appear only in rule bodies (antecedents):
– i in b body a ante antecedent
• item may appear only in rule heads (consequents):
– o out h head c cons consequent
• item may appear in rule bodies (antecedents) or in rule heads (consequents):
– io inout bh b&h ac a&c both
• item may appear neither in rule bodies (antecedents) nor in rule heads (consequents):
– n neither none ign ignore -
• Example 1: Generate only rules with item "x" in the consequent.
  in
  x out
Sample Command
• apriori test1.tab test.rul
• apriori -b"(" -f, -r")" test2.tab test2.rul
• apriori -f ",.;:" -l test3.tab test3.rul
• apriori test1.tab -
Sample input data shown on the slide:
  0 1 0 2 0 3 1 1 1 4 1 5 2 2 2 3 2 4 3 1 3 2 3 3 3 4 4 2 4 3 5 1 5 2 5 4 6 4 6 5 7 1 7 2 7 3 7 4 8 3 8 4 8 5 9 1 9 2 9 3
  (0,1)(0,2)(0,3)(1,1)(1,4)(1,5)(2,2)(2,3)(2,4)(3,1)(3,2)(3,3)(3,4)(4,2)(4,3)...
  1,2,3 1,4,5 2.3.4 1,2,3,4 2:3 1,2,4 4,5 1,2,3,4 3;4;5 1,2,3
Item appearances file, Example 2: Item "x" may appear only in a rule head (consequent), item "y" only in a rule body (antecedent); appearance of all other items is not restricted.
  both
  x head
  y body
Sample Output
3 <- 2 (70.0%, 85.7%)
2 <- 3 (70.0%, 85.7%)
2 <- 1 (60.0%, 83.3%)
4 <- 5 (30.0%, 100.0%)
3 <- 2 1 (50.0%, 80.0%)
2 <- 3 1 (40.0%, 100.0%)
4 <- 3 5 (10.0%, 100.0%)
4 <- 1 5 (10.0%, 100.0%)
2 <- 3 4 1 (20.0%, 100.0%)