
Determining the Sample Size When Using Sampling to Discover Frequent Itemsets in Data Mining


Academic year: 2021


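Abstract:

Data mining discovers information in databases that supports management decisions. Advances in information technology have let enterprises collect large volumes of data, and data mining has become an active research area in recent years. Although the data are easy to obtain, analyzing them in bulk is time-consuming and costly. An alternative is to treat the database as a population, draw a sample from it, and mine the sample. Sampling reduces the time spent reading the database, but it also introduces sampling error.

Association rule discovery is a central topic in data mining, and finding frequent itemsets is its main task. For the case where frequent itemsets are found by sampling, we study the relationship between the sample size and the probability of correctly and effectively finding frequent itemsets, analyze how sensitive the sample size is to the problem parameters, and on that basis propose how an appropriate sample size should be chosen in practice.

Keywords: data mining, association rules, sampling, sampling error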

Report content:

Introduction, research objectives, and literature review:

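In the sampling methods proposed by Toivonen [1] and others, the sample size is derived from the requirement that, with probability at least $1 - \alpha$, the error between the estimate $\hat p$ of an itemset's support in the sampled database and the true parameter $p$ does not exceed $\varepsilon$. In the frequent-itemset problem, however, what we actually need to decide is whether $p$ is larger or smaller than the user-specified minimum support $p_0$; the exact value of an itemset's support $p$ is of little interest.

We therefore propose instead to require that the gap between $\hat p$ and $p_0$ be large enough to decide directly between $p > p_0$ and $p < p_0$. That is, we control the gap between $\hat p$ and $p_0$ rather than the error between $\hat p$ and $p$, and use it to determine an appropriate sample size with which the sample correctly and effectively identifies high (low) frequency itemsets. Through computer simulation we observe how the sample size changes when the specified sampling-error level is attained and how it responds to the statistical parameters, carry out a sensitivity analysis, and on that basis propose how an appropriate sample size should be chosen in practice so that the total number of samples is minimized.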

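The Sampling Algorithm proposed by Toivonen in 1996 [2] applies sampling to association rule mining. The idea derives from the Partition Algorithm of Savasere et al. (1995) [3]: transactions are drawn at random from the database so that the sample fits in main memory, and the sampled transactions are then scanned to mine association rules. The main idea is to find the frequent itemsets from the sample.

With this method the full database needs to be scanned only about once, which greatly reduces I/O and CPU cost and improves performance, but the reliability and accuracy of the mined association rules are subject to error. To improve the reliability and accuracy of the final result, Toivonen determines the sample size from a given acceptable error and a maximum probability of exceeding that error.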

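Given an acceptable value $\varepsilon$ for the error $e(I, s)$ and a maximum probability $\delta$ of exceeding it, the required sample size $n$ must satisfy

$$P\{e(I, s) \ge \varepsilon\} \le \delta$$

By the Chernoff bounds [4], the sample size that satisfies $P\{e(I, s) \ge \varepsilon\} \le \delta$ is

$$n \ge \frac{1}{2\varepsilon^2} \ln\!\left(\frac{2}{\delta}\right)$$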
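The bound above can be evaluated directly. The following is a minimal sketch, assuming only the formula $n \ge \frac{1}{2\varepsilon^2}\ln(2/\delta)$; the function name `chernoff_sample_size` and the example values of `eps` and `delta` are illustrative choices, not values from the report.

```python
import math

def chernoff_sample_size(eps: float, delta: float) -> int:
    """Smallest integer n with n >= ln(2/delta) / (2 * eps**2)."""
    if eps <= 0 or not 0 < delta < 1:
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Illustrative values only: halving eps roughly quadruples n.
for eps in (0.01, 0.005):
    print(eps, chernoff_sample_size(eps, delta=0.05))
```

Note that the itemset support does not appear anywhere in the function: the bound yields the same sample size whatever the support is.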

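In fact, we find that the sample size determined by the Chernoff bounds is overly conservative. Moreover, it does not take the itemset support $p$ into account: whatever the magnitude of $p$, the required sample size is the same. This is unreasonable, so a sample size determined only by the user-specified $\varepsilon$ and $\delta$ is not the best sample size. Zaki et al. [5] likewise used Chernoff bounds to determine the sample size. The sample sizes $n$ obtained in this way are all very large, to the point that the sample size may exceed the number of transactions in the original database, so the results are not usable.

Other researchers, such as S. D. Lee et al. in 1998, applied sampling to the problem of maintaining association rules. Their DELI Algorithm [6] uses sampling to estimate the difference between the association rules of the original database and those of the database after it has been changed, and uses this difference to decide whether the current association rules need to be updated. If the difference is small, the rules of the original database still apply to the changed database; otherwise the rules are updated. This approach reduces the cost of updating the association rules every time.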

2zα/2

sp(1 − p)

n > 2zα/2

sp(1 − p) n


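From the above, the narrower the confidence interval, the larger the required sample size and the more accurate the result. Varying the sample size therefore controls the maximum width of the confidence interval. For example, if the width of the confidence interval is not to exceed $p/5$, the required sample size is obtained by solving the inequality

$$2 z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}} \le \frac{p}{5}$$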
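As a worked instance of this width requirement, the short derivation below solves the inequality for $n$. It assumes the requirement is that the full interval width $2 z_{\alpha/2}\sqrt{p(1-p)/n}$ be at most $p/5$; the numeric values $z_{\alpha/2} = 1.96$ and $p = 0.02$ are chosen here for illustration only.

```latex
\begin{align*}
2 z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \le \frac{p}{5}
  &\iff \frac{p(1-p)}{n} \le \frac{p^{2}}{100\, z_{\alpha/2}^{2}} \\
  &\iff n \ge \frac{100\, z_{\alpha/2}^{2}\,(1-p)}{p} \\[4pt]
\text{e.g. } z_{\alpha/2}=1.96,\ p=0.02:\quad
  n &\ge \frac{100 \times 3.8416 \times 0.98}{0.02} = 18823.84,
  \quad\text{so } n \ge 18824 .
\end{align*}
```

The required sample size grows like $1/p$ as the support shrinks, which illustrates why a rule based only on interval width tends to give large samples for low-support itemsets.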

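Similarly, the sample size determined by the DELI Algorithm is a rather conservative choice. Moreover, because it determines the sample size by controlling the confidence interval, it may be unable to avoid wrong decisions caused by sampling. When the minimum support $p_0$ falls inside the confidence interval formed for $p$, $\hat p$ and $p_0$ are not significantly different, and the itemset cannot be classified as either a high-frequency or a low-frequency itemset; it falls into a gray zone. When many itemsets are in this situation and their number exceeds a preset tolerance, the DELI Algorithm concludes that the association rules must be updated by scanning the whole database.

In fact, sampling is used in the DELI Algorithm when transactions have been added to or deleted from the database so that the association rules may need updating: the need is first assessed by sampling, which saves the cost of scanning the whole database each time. But because the chosen sample size does not consider the relative position of $\hat p$ and $p_0$, the practical benefit is limited.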

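Research method:

The distinguishing feature of this study is that, when sampling is used to discover association rules in data mining, it defines a reasonable objective function for the sampling error and uses it to decide the best sample size. We analyze the sensitivity of this sample size to the problem parameters under ideal conditions, provide a mechanism for deciding a reasonable sample size in practice, and analyze the statistical properties of this mechanism. In fact, discovering frequent itemsets is a problem in statistical inference that combines estimation of a parameter and testing of a hypothesis. A reasonable sample size must satisfy

$$P\left\{\, p_0 \notin \hat p \pm c_{\alpha/2}\sqrt{\mathrm{Var}(\hat p)} \,\right\} \ge 1 - \beta$$

Here $\hat p \pm c_{\alpha/2}\sqrt{\mathrm{Var}(\hat p)}$ is the confidence interval for the support $p$ formed from the sample estimate $\hat p$, and $c_{\alpha/2}$ depends on the sampling distribution of $\hat p$. $(1 - \alpha)$ is the confidence level, that is, the probability that the confidence interval correctly covers the support $p$, and $(1 - \beta)$ is the probability that $\hat p$ and the minimum support $p_0$ are significantly different. When both events occur together (the confidence interval covers $p$, and $\hat p$ differs significantly from $p_0$), the sample correctly and effectively identifies high-frequency or low-frequency itemsets.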


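Simulation experiment:

We use computer simulation to examine how the first-stage sample size $n_1$ affects $\hat n$ and the actual probability of correctly and effectively discovering high (low) frequency itemsets. We also use simulation to analyze how to set the best first-stage sample size, that is, how to minimize the total sample size while attaining the specified sampling-error level.

This study proposes that the sample size $n$ with which sampling effectively and correctly identifies high (low) frequency itemsets must satisfy

$$P\left\{\, p_0 \notin \hat p \pm z_{\alpha/2}\sqrt{\mathrm{Var}(\hat p)} \,\right\} \ge 1 - \beta$$

where $p$ is the itemset support and $p_0$ is the minimum support. Deriving the relationship between $n$ and $p$, $p_0$, $\alpha$, $\beta$ from this condition, the required sample size must satisfy

$$n \ge \frac{p(1-p)\,(z_{\alpha/2} + z_{\beta})^2}{(p - p_0)^2}$$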
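The sample-size bound can be computed in a few lines. This is a minimal sketch, assuming $p$ denotes the itemset support and $p_0$ (written `p0`) the minimum support, with the quantiles $z_{\alpha/2} = 1.96$ and $z_\beta = 1.645$ that the report quotes for $\alpha = \beta = 0.05$; the function name `required_n` is hypothetical.

```python
Z_ALPHA_2 = 1.96   # z_{alpha/2} for alpha = 0.05, as quoted in the report
Z_BETA = 1.645     # z_{beta} for beta = 0.05, as quoted in the report

def required_n(p: float, p0: float,
               z_a: float = Z_ALPHA_2, z_b: float = Z_BETA) -> float:
    """Right-hand side of n >= p(1-p)(z_a + z_b)^2 / (p - p0)^2 (unrounded)."""
    if p == p0:
        raise ValueError("the bound is undefined when p equals p0")
    return p * (1.0 - p) * (z_a + z_b) ** 2 / (p - p0) ** 2

# The bound grows quickly as the support approaches the minimum support.
for p in (0.005, 0.01, 0.015):
    print(p, round(required_n(p, p0=0.02), 1))
```

Unlike the Chernoff-based rule, this bound depends on the support through both $p(1-p)$ and the gap $|p - p_0|$.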

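Summarizing the above, the steps of the experiment are as follows:

1. Fix $p$, $p_0$, $\alpha$, $\beta$, and $n_1$, where $p$ is the itemset support and $p_0$ the minimum support.

2. Draw $\hat p(n_1) \sim N\!\left(p, \frac{p(1-p)}{n_1}\right)$.

3. Substitute the $\hat p(n_1)$ obtained from the first-stage sample into the formula for $\hat n$, and compute the total sample size $N_i$ of this run, defined as

$$\hat n = \left\lfloor \frac{\hat p(n_1)\,(1 - \hat p(n_1))\,(z_{\alpha/2} + z_{\beta})^2}{(\hat p(n_1) - p_0)^2} \right\rfloor + 1$$

$$N_i = \max(\hat n, n_1) = \begin{cases} n_1, & \text{if } n_1 > \hat n \\ \hat n, & \text{if } n_1 \le \hat n \end{cases}$$

4. Accumulate the total sample size $N_i$ from step 3 and its square $N_i^2$, so that after $r$ repetitions the mean $E[N]$ and the variance $\mathrm{Var}[N]$ of the total sample size can be estimated.

5. Repeat steps 2 to 4 until $r$ runs have been performed.

6. After $r$ repetitions, compute the summary statistics:

$$\hat E[N] = \frac{1}{r} \sum_{i=1}^{r} N_i$$

$$S^2 = \widehat{\mathrm{Var}}[N] = \frac{\sum_{i=1}^{r} N_i^2 - r \times \bar N^2}{r - 1}$$

$$\sqrt{\widehat{\mathrm{Var}}[\bar N]} = \sqrt{\frac{S^2}{r}}$$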
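The six steps above can be sketched as a short Monte Carlo loop. This is an illustrative sketch, not the report's original program: the function names, the seed, the number of repetitions, and the choice to redraw when $\hat p(n_1)$ equals $p_0$ exactly are assumptions made here. As in step 2, $\hat p(n_1)$ is drawn from the normal approximation, so it can occasionally fall outside $[0, 1]$.

```python
import math
import random

def n_hat(p_hat, p0, z_a=1.96, z_b=1.645):
    """Step 3: floor(p_hat(1-p_hat)(z_a+z_b)^2 / (p_hat-p0)^2) + 1."""
    return math.floor(p_hat * (1 - p_hat) * (z_a + z_b) ** 2
                      / (p_hat - p0) ** 2) + 1

def simulate_total_size(p, p0, n1, r, seed=0):
    """Return (mean of N, variance of N, standard error of the mean)."""
    rng = random.Random(seed)
    sd = math.sqrt(p * (1 - p) / n1)
    total = total_sq = 0                       # exact integer accumulators
    done = 0
    while done < r:                            # step 5: repeat r times
        p_hat = rng.gauss(p, sd)               # step 2: normal approximation
        if p_hat == p0:
            continue                           # assumption: redraw a zero gap
        n_i = max(n_hat(p_hat, p0), n1)        # step 3: N_i = max(n_hat, n1)
        total += n_i                           # step 4: accumulate N_i
        total_sq += n_i * n_i                  #         and N_i^2
        done += 1
    mean = total / r                           # step 6: estimated E[N]
    var = max((total_sq - r * mean ** 2) / (r - 1), 0.0)   # S^2
    return mean, var, math.sqrt(var / r)

mean, var, se = simulate_total_size(p=0.005, p0=0.02, n1=1000, r=5000)
print(round(mean), round(math.sqrt(var), 1), round(se, 2))
```

Because $N_i$ is never smaller than $n_1$, the estimated mean is bounded below by $n_1$; the variance comes entirely from runs in which $\hat p(n_1)$ lands close to $p_0$ and $\hat n$ becomes large.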

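To find the best first-stage sample size $n_1$, the one that minimizes the total sample size, we vary $n_1$ while holding the other parameters $p$, $p_0$, $\alpha$, $\beta$ fixed, repeat experiment steps 1 to 6, and look for the $n_1$ that minimizes $\hat E[N] + 2\sqrt{\widehat{\mathrm{Var}}[N]}$.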
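The search for the best $n_1$ can be sketched as a grid search over candidate first-stage sizes, scoring each by the estimated mean plus two standard deviations of the total sample size. The sketch below is self-contained and illustrative: the helper names, the grid, the seed, and the small number of repetitions are assumptions chosen to keep it fast, and `p0` stands for the minimum support.

```python
import math
import random
import statistics

Z_SUM = 1.96 + 1.645   # z_{alpha/2} + z_{beta} for alpha = beta = 0.05

def mean_plus_2sd(p, p0, n1, r, rng):
    """Score one n1: mean of N plus twice the sample standard deviation of N."""
    sd = math.sqrt(p * (1 - p) / n1)
    sizes = []
    while len(sizes) < r:
        p_hat = rng.gauss(p, sd)               # first-stage estimate
        if p_hat == p0:
            continue                           # assumption: redraw a zero gap
        n_hat = math.floor(p_hat * (1 - p_hat) * Z_SUM ** 2
                           / (p_hat - p0) ** 2) + 1
        sizes.append(max(n_hat, n1))           # total sample size of this run
    return statistics.mean(sizes) + 2 * statistics.stdev(sizes)

def best_first_stage(p, p0, grid, r=2000, seed=1):
    """Return the n1 in grid with the smallest score, plus all scores."""
    rng = random.Random(seed)
    scores = {n1: mean_plus_2sd(p, p0, n1, r, rng) for n1 in grid}
    return min(scores, key=scores.get), scores

best, scores = best_first_stage(0.005, 0.02, range(200, 2001, 200))
print(best, round(scores[best]))
```

Small values of $n_1$ are penalized by the heavy right tail of $\hat n$, and large values by the floor $N_i \ge n_1$, so the score typically has an interior minimum. With few repetitions the location of that minimum is noisy.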

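This study uses random sampling in the computer simulation. The parameter ranges are as follows: minimum support $p_0$ = 0.02, 0.1, 0.2, 0.3, 0.4; itemset support $p$ = 0.02 ± 3 × 0.005, 0.1 ± 0.05, 0.2 ± 2 × 0.05, 0.3 ± 2 × 0.05, and 0.4 ± 2 × 0.05; $\alpha = \beta = 0.05$, with $z_{\alpha/2} = 1.96$ and $z_{\beta} = 1.645$; first-stage sample size $n_1$ from 200 to 50,000 in increments of 200. The parameter combinations formed from these ranges define the scope of the simulation. For each combination we examine how the first-stage sample size $n_1$ affects $\hat n$ and the actual probability of correctly and effectively discovering high (low) frequency itemsets, and we look for the best first-stage sample size, the one that minimizes $\hat E[N] + 2\sqrt{\widehat{\mathrm{Var}}[N]}$.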

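Simulation results:

The simulation results are presented as tables of the experimental data obtained for each parameter combination. We use the smallest $\hat E[N] + 2\sqrt{\widehat{\mathrm{Var}}[N]}$ as the criterion for choosing the best first-stage sample size $n_1$. This best first-stage sample size ensures that the probability of the sample effectively discovering high (low) frequency itemsets reaches the preset thresholds $(1 - \alpha)$ and $(1 - \beta)$, and that with high probability the total sample size does not exceed $\hat E[N] + 2\sqrt{\widehat{\mathrm{Var}}[N]}$.

We first simulate the case $p = 0.02 \pm 3 \times 0.005$, $p_0 = 0.02$, $\alpha = 0.05$, $\beta = 0.05$. All experimental data are generated by the simulation.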

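Table 1: $p_0 = 0.02$, $\alpha = \beta = 0.05$, $\hat n_{\{\hat p(n_1) = p\}} = 287$

| $p$ | $n_1$ | $\hat E[N]$ | $(\sqrt{\widehat{\mathrm{Var}}[\bar N]})$ | $\sqrt{\widehat{\mathrm{Var}}[N]}$ | $\hat E[N] + 2\sqrt{\widehat{\mathrm{Var}}[N]}$ | $\hat\alpha$ | $\hat\beta$ |
|---|---|---|---|---|---|---|---|
| 0.005 | 400 | 871 | (130.8) | 41363.3 | 83598 | 0.040 | 0 |
| 0.005 | 600 | 699 | (1.31) | 414.1 | 1527 | 0.040 | 0 |
| 0.005 | 800 | 835 | (0.59) | 185.6 | 1206 | 0.040 | 0 |
| 0.005 | 1000 | 1011 | (0.28) | 89.0 | 1189 | 0.039 | 0 |
| 0.005 | 1200 | 1203 | (0.14) | 42.8 | 1288 | 0.048 | 0 |
| 0.005 | 1400 | 1401 | (0.07) | 20.7 | 1442 | 0.050 | 0 |
| 0.005 | 1600 | 1600 | (0.02) | 7.4 | 1615 | 0.051 | 0 |
| 0.005 | 1800 | 1800 | (0.02) | 6.7 | 1813 | 0.051 | 0 |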

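In Table 1, $\hat n_{\{\hat p(n_1) = p\}} = 287$ is the theoretical sample size obtained by setting $\hat p(n_1) = p$ and substituting into

$$\hat n = \left\lfloor \frac{\hat p(n_1)\,(1 - \hat p(n_1))\,(z_{\alpha/2} + z_{\beta})^2}{(\hat p(n_1) - p_0)^2} \right\rfloor + 1$$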

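Table 1 shows that the first-stage sample size that minimizes the estimated mean total sample size $\hat E[N]$ is $n_1 = 600$, where $\hat E[N] = 696$ is smallest, but its variance $\widehat{\mathrm{Var}}[N]$ is still very large.

To understand better the relationship between the best first-stage sample size $n_1$ (the one minimizing the total sample size) and the theoretical value $\hat n$, together with the corresponding behavior of $\hat E[N]$, $\sqrt{\widehat{\mathrm{Var}}[\bar N]}$, $\sqrt{\widehat{\mathrm{Var}}[N]}$, $(1 - \hat\alpha)$, and $(1 - \hat\beta)$, we take the parameter combinations with minimum support $p_0 = 0.02$, $\alpha = \beta = 0.05$, and the different supports $p = p_0 \pm 3 \times 0.005$ from the simulation range. The results are summarized in the following table.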

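Table 2: $p_0 = 0.02$, $\alpha = \beta = 0.05$; best first-stage sample size minimizing the total sample size

| $p$ | $\hat n_{\{\hat p(n_1) = p\}}$ | $n_1$ | $\hat E[N]$ | $(\sqrt{\widehat{\mathrm{Var}}[\bar N]})$ | $\sqrt{\widehat{\mathrm{Var}}[N]}$ | $\hat E[N] + 2\sqrt{\widehat{\mathrm{Var}}[N]}$ | $\hat\alpha$ | $\hat\beta$ | $n_1 / \hat n_{\{\hat p(n_1) = p\}}$ |
|---|---|---|---|---|---|---|---|---|---|
| 0.005 | 287 | 1000 | 1011 | (0.3) | 89.0 | 1189 | 0.04 | 0 | 3.48 |
| 0.01 | 1287 | 3200 | 3269 | (1.5) | 480.1 | 4229 | 0.04 | 0 | 2.49 |
| 0.015 | 7681 | 17600 | 17891 | (7.2) | 2270.6 | 22432 | 0.04 | 0 | 2.29 |
| 0.025 | 12671 | 25800 | 26243 | (10.5) | 3331.2 | 32905 | 0.04 | 0 | 2.04 |
| 0.03 | 3782 | 7800 | 7892 | (2.5) | 801.3 | 9495 | 0.04 | 0 | 2.06 |
| 0.035 | 1951 | 3800 | 3853 | (1.3) | 424.1 | 4701 | 0.04 | 0 | 1.95 |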

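The $\hat\beta$ and $\hat\alpha$ columns of Table 2 also show that with this sample size the probability $(1 - \hat\beta)$ of effectively discovering high (low) frequency itemsets reaches the expected threshold of 95%, and that the confidence interval for the support estimate $\hat p$ built from the combined first-stage and second-stage samples attains a confidence level of $(1 - \hat\alpha) \approx 95\%$. In other words, the best first-stage sample sizes obtained from the simulation all attain the specified sampling-error level.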

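We can also see that the ratio $n_1 / \hat n_{\{\hat p(n_1) = p\}}$ between the best first-stage sample size $n_1$ found in each simulation and the theoretical sample size $\hat n_{\{\hat p(n_1) = p\}}$ averages 2.38. That is, to minimize the total sample size, the user can first make a rough estimate of $p$, substitute the parameter values $(p, p_0, \alpha, \beta)$ into

$$\hat n = \left\lfloor \frac{\hat p(n_1)\,(1 - \hat p(n_1))\,(z_{\alpha/2} + z_{\beta})^2}{(\hat p(n_1) - p_0)^2} \right\rfloor + 1$$

to obtain the theoretical sample size $\hat n$, and then use 2.38 times $\hat n$ as a suitable first-stage sample size.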
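The average ratio quoted above can be checked against the figures of Table 2. This is a small sketch that only re-reads the table: each tuple holds the support $p$, the theoretical sample size, and the best first-stage size $n_1$ as listed there; the variable names are illustrative.

```python
# (p, theoretical n_hat with p_hat = p, best first-stage n1), from Table 2
table2 = [
    (0.005, 287, 1000),
    (0.01, 1287, 3200),
    (0.015, 7681, 17600),
    (0.025, 12671, 25800),
    (0.03, 3782, 7800),
    (0.035, 1951, 3800),
]

ratios = [n1 / n_theory for _, n_theory, n1 in table2]
average = sum(ratios) / len(ratios)

for (p, _, _), ratio in zip(table2, ratios):
    print(p, round(ratio, 2))
print("average ratio:", round(average, 2))
```

The individual ratios range from just under 2 to about 3.5, so the average is best read as a rule of thumb for choosing $n_1$ rather than as a constant.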

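Next, we simulate the itemset supports $p = 0.2 \pm 2 \times 0.05$ and $p = 0.3 \pm 2 \times 0.05$, with minimum supports $p_0 = 0.2$ and $0.3$ respectively and $\alpha = \beta = 0.05$. The simulation results are summarized in the following tables.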

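Table 3: $p_0 = 0.2$, $\alpha = \beta = 0.05$; best first-stage sample size minimizing the total sample size

| $p$ | $\hat n_{\{\hat p(n_1) = p\}}$ | $n_1$ | $\hat E[N]$ | $(\sqrt{\widehat{\mathrm{Var}}[\bar N]})$ | $\sqrt{\widehat{\mathrm{Var}}[N]}$ | $\hat E[N] + 2\sqrt{\widehat{\mathrm{Var}}[N]}$ | $\hat\alpha$ | $\hat\beta$ | $n_1 / \hat n_{\{\hat p(n_1) = p\}}$ |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 117 | 400 | 401 | (0.04) | 11.18 | 423 | 0.050 | 0 | 3.42 |
| 0.15 | 663 | 1600 | 1617 | (0.50) | 159.51 | 1936 | 0.048 | 0 | 2.41 |
| 0.25 | 975 | 2400 | 2410 | (0.41) | 131.18 | 2673 | 0.048 | 0 | 2.46 |
| 0.3 | 273 | 600 | 606 | (0.19) | 58.55 | 723 | 0.041 | 0 | 2.20 |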

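Table 4: $p_0 = 0.3$, $\alpha = \beta = 0.05$; best first-stage sample size minimizing the total sample size

| $p$ | $\hat n_{\{\hat p(n_1) = p\}}$ | $n_1$ | $\hat E[N]$ | $(\sqrt{\widehat{\mathrm{Var}}[\bar N]})$ | $\sqrt{\widehat{\mathrm{Var}}[N]}$ | $\hat E[N] + 2\sqrt{\widehat{\mathrm{Var}}[N]}$ | $\hat\alpha$ | $\hat\beta$ | $n_1 / \hat n_{\{\hat p(n_1) = p\}}$ |
|---|---|---|---|---|---|---|---|---|---|
| 0.2 | 208 | 400 | 420 | (0.37) | 117.91 | 656 | 0.040 | 0 | 1.92 |
| 0.25 | 975 | 2000 | 2055 | (1.16) | 367.55 | 2790 | 0.040 | 0 | 2.05 |
| 0.35 | 1183 | 2800 | 2820 | (0.69) | 218.14 | 3256 | 0.042 | 0 | 2.37 |
| 0.4 | 312 | 600 | 617 | (0.44) | 137.69 | 893 | 0.041 | 0 | 1.92 |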

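From the summaries in Tables 3 and 4 it is easy to see that, for every parameter combination, the smaller the gap $|p - p_0|$ between the itemset support $p$ and the minimum support $p_0$, the more samples are needed to discover high (low) frequency itemsets effectively. In addition, the ratio between the best first-stage sample size $n_1$ (the one that minimizes $\hat E[N] + 2\sqrt{\widehat{\mathrm{Var}}[N]}$ in the simulation) and the theoretical sample size $\hat n_{\{\hat p(n_1) = p\}}$ averages $n_1 / \hat n_{\{\hat p(n_1) = p\}} \approx 2.33$. That is, to minimize the total sample size, the user can substitute the parameter values $(p, p_0, \alpha, \beta)$ into

$$\hat n = \left\lfloor \frac{\hat p(n_1)\,(1 - \hat p(n_1))\,(z_{\alpha/2} + z_{\beta})^2}{(\hat p(n_1) - p_0)^2} \right\rfloor + 1$$

to obtain the theoretical sample size $\hat n_{\{\hat p(n_1) = p\}}$, and then use 2.33 times that value as a suitable first-stage sample size, which keeps the total sample size at its minimum.

The $\hat\beta$ and $\hat\alpha$ columns of Tables 3 and 4 likewise show that with this sample size the probability $(1 - \hat\beta)$ of effectively discovering high (low) frequency itemsets reaches the expected threshold of 95% in every case, and that the confidence interval for the support built from the combined first-stage and second-stage samples attains a confidence level of $(1 - \hat\alpha) \approx 95\%$. In other words, the best first-stage sample sizes obtained from the simulation all minimize the total sample size while attaining the specified sampling-error level.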

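Conclusion:

Using random sampling, this study examined how sensitive the sample size needed to discover high (low) frequency itemsets effectively and correctly is to the statistical parameters, and how this sample size depends on the gap $|p - p_0|$ between the itemset support $p$ and the minimum support $p_0$. It also examined the statistical properties of the total sample size that results when, as is necessary in practice, a first-stage sample is drawn and the estimated support is substituted into the sample-size formula. Because these statistical properties may not be tractable analytically, computer simulation is a feasible alternative. It shows how the first-stage sample size affects the total sample size and the actual probability of correctly and effectively discovering high (low) frequency itemsets, and how to choose the best first-stage sample size so that the total sample size is minimized while the specified sampling-error level is attained.

The simulation results for the combinations of the parameters $p$, $p_0$, $\alpha$, $\beta$ have been analyzed and summarized to give users a practical reference for deciding the sample size when random sampling is used to discover high (low) frequency itemsets, so that the sampling error is small enough to infer the relative position of the itemset support $p$ and $p_0$. In summary, we obtain the following conclusions and practical steps as a reference for the user's decision: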

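(1) The sample size $n$ with which sampling effectively and correctly identifies high (low) frequency itemsets must satisfy

$$P\left\{\, p_0 \notin \hat p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \,\right\} \ge 1 - \beta,$$

where $p$ is the itemset support and $p_0$ the minimum support. From this it follows that, for the probability of failure to stay below the preset threshold $\beta$, the required sample size must satisfy

$$n \ge \frac{p(1-p)\,(z_{\alpha/2} + z_{\beta})^2}{(p - p_0)^2}$$

(2) In practice, the sampling-error thresholds $\alpha$ and $\beta$ are set by the user according to the accuracy and effectiveness required, and the minimum support $p_0$ is set according to the level of high (low) frequency itemsets the user wants to discover. The sample-size formula, however, involves the unknown support $p$. We can therefore first draw a pilot sample of size $n_0$, obtain the estimate $\hat p(n_0)$ of the support $p$, and substitute $\hat p(n_0)$ for $p$ in the formula to obtain the sample size

$$\hat n_0 = \left\lfloor \frac{\hat p(n_0)\,(1 - \hat p(n_0))\,(z_{\alpha/2} + z_{\beta})^2}{(\hat p(n_0) - p_0)^2} \right\rfloor + 1$$

(3) The analysis of the simulation results shows that the best first-stage sample size $n_1$, the one that minimizes the total sample size, is related to the formula value $\hat n_0$ by an average ratio of $n_1 / \hat n_0 \approx 2.34$. The user then draws a first-stage sample of size $n_1 = (2.34)\,\hat n_0$, decides whether a second-stage sample is needed, and finally combines the samples drawn in the two stages to form the confidence interval, from which high (low) frequency itemsets are identified.

(4) Alternatively, the user can take the support estimate $\hat p(n_0)$ from the pilot sample and, based on the gap $|\hat p(n_0) - p_0|$ between it and the minimum support $p_0$, consult the $p$, $p_0$, $\alpha$, $\beta$ combinations in the simulation range of this study. Interpolating in the simulation results gives an acceptable best first-stage sample size $n_1$ that attains the preset sampling-error level while minimizing the total sampling cost.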
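Steps (2) and (3) can be sketched end to end as follows. This is a hedged illustration, not the report's implementation: the in-memory list `db` of 0/1 flags (1 if a transaction contains the itemset), sampling with replacement, the cap `max_total`, the handling of a zero gap, and the three-way decision rule on the final confidence interval are all assumptions introduced here. `p0` stands for the minimum support.

```python
import math
import random

Z_A, Z_B = 1.96, 1.645   # z_{alpha/2} and z_{beta} for alpha = beta = 0.05

def n_hat(p_hat, p0, cap):
    """Sample-size formula; capped, and set to cap if the gap is zero (assumption)."""
    if p_hat == p0:
        return cap
    n = math.floor(p_hat * (1 - p_hat) * (Z_A + Z_B) ** 2 / (p_hat - p0) ** 2) + 1
    return min(n, cap)

def draw(db, k, rng):
    """Number of sampled transactions containing the itemset (with replacement)."""
    return sum(rng.choices(db, k=k))

def two_stage_decision(db, p0, n0, rng, factor=2.34, max_total=1_000_000):
    # Step (2): pilot sample of size n0 gives p_hat(n0) and the formula value.
    pilot = draw(db, n0, rng) / n0
    n1 = min(max_total, math.ceil(factor * n_hat(pilot, p0, max_total)))
    # Step (3): first-stage sample of size n1, about 2.34 times the formula value.
    hits, total = draw(db, n1, rng), n1
    extra = n_hat(hits / n1, p0, max_total) - n1
    if extra > 0:                              # second stage only if required
        hits += draw(db, extra, rng)
        total += extra
    # Combine both stages and form the confidence interval for the support.
    p_hat = hits / total
    half = Z_A * math.sqrt(p_hat * (1 - p_hat) / total)
    if p_hat - half > p0:
        label = "frequent"
    elif p_hat + half < p0:
        label = "infrequent"
    else:
        label = "undecided"                    # p0 lies inside the interval
    return label, p_hat, total

# Synthetic example: 100,000 transactions, true support 0.2, minimum support 0.3.
db = [1] * 20_000 + [0] * 80_000
print(two_stage_decision(db, p0=0.3, n0=200, rng=random.Random(7)))
```

In a real setting `db` would be replaced by random access to the transaction database, and the same procedure would be applied to each candidate itemset.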


References:

1. Toivonen, H., “Sampling large databases for association rules,” In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB’96), Morgan Kaufmann, 1996.

2. Toivonen, H., “Sampling large databases for association rules,” In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB’96), Morgan Kaufmann, 1996.

3. Savasere, A., E. Omiecinski, and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases,” In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), pp. 432-444, 1995.

4. Hagerup, T., and C. Rüb, “A guided tour of Chernoff bounds,” Information Processing Letters, pp. 305-308, North-Holland, 1989/90.

5. Zaki, M. J., S. Parthasarathy, W. Li, and M. Ogihara, “Evaluation of sampling for data mining of association rules,” In Proceedings of the 7th Workshop on Research Issues in Data Engineering, 1997.

6. Lee, S. D., D. W. Cheung, and B. Kao, “Is sampling useful in data mining? A case in the maintenance of discovered association rules,” Data Mining and Knowledge Discovery, vol. 2, pp. 233-262, 1998.
