Frequent Closed Itemsets Maintenance - Design of an Efficient Algorithm for Maintaining Frequent Closed Itemsets

Given an original database D and newly inserted transactions d, Section 2 discussed four cases of candidate itemsets for the updated database D⁺. By pre-storing the previously mined frequent itemsets FI_D, a typical incremental mining algorithm can handle these four cases efficiently in two steps: (a) updating O against d and (b) rescanning P against D. Following this idea, we can use two similar steps, (a) updating CO against d and (b) rescanning CP against D, to find FCI_D+ and thereby maintain the association rules. However, directly obtaining CO = {x | x ∈ FI_D and x ∈ CI_D+} and CP = {x | x ∈ FI_d − FI_D and x ∈ CI_D+} is impractical because CI_D+ is unknown before D⁺ is processed. In the following, we use the pre-stored information FCI_D from D and the information FCI_d obtained from d to approach CO and CP.

4.1 Joint closed itemsets

Lemma 1: If x ∈ CI_D ∪ CI_d, then x ∈ CI_D+.

Proof: We prove the lemma by contradiction. If x ∉ CI_D+, there must exist a proper superset y of x such that y.sup_D+ = x.sup_D+, i.e., y.sup_D*|D| + y.sup_d*|d| = x.sup_D*|D| + x.sup_d*|d|. Since y is a superset of x, y.sup_D ≤ x.sup_D and y.sup_d ≤ x.sup_d, so the equality forces y.sup_D = x.sup_D and y.sup_d = x.sup_d. Hence x is closed in neither D nor d, contradicting the assumption that x ∈ CI_D ∪ CI_d. Thus, x ∈ CI_D+.

Let FCI_d-D denote FCI_d − FCI_D. Since FCI_D is pre-stored mining information, we only need to find FCI_d from d to determine FCI_d-D. According to Lemma 1, we have FCI_D ⊆ CI_D ⊆ CI_D+ and FCI_d-D ⊆ CI_d ⊆ CI_D+, so the itemsets in FCI_D and FCI_d-D are all closed in D⁺. If an incremental mining algorithm can use FCI_D and FCI_d to obtain CO and CP, the problem of maintaining association rules in a dense database can be handled efficiently. We first discuss the differences between FCI_D and CO and between FCI_d-D and CP. For example, given D = {ABCE, CD, BCE}, d = {ABCDE, CDE} and minsup = 0.6, we have FI_D = {B, C, E, BC, BE, CE, BCE}, FI_d = {C, D, E, CD, CE, DE, CDE}, FCI_D = {C, BCE} and FCI_d = {CDE}. By definition, FCI_d-D = {CDE}, CO = {C, CE, BCE} and CP = {CD, CDE}. As this example shows, some closed itemsets in CI_D+ (here CE) belong to neither CI_D nor CI_d, so FCI_D and FCI_d-D may differ from CO and CP. The following lemmas are used to derive the set of joint closed itemsets (JCI), which are closed itemsets for D⁺ that cannot be determined from FCI_D and FCI_d-D alone.
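The worked example with D = {ABCE, CD, BCE} and d = {ABCDE, CDE} can be checked by brute force. The following is a minimal sketch, not the maintenance algorithm itself: it enumerates every itemset over the five items and applies the definitions of frequent and closed itemsets directly. The helper names (`count`, `frequent`, `closed`, `fmt`) are hypothetical, and it assumes an itemset is frequent when its relative support is at least minsup.

```python
from fractions import Fraction
from itertools import combinations

D = [frozenset("ABCE"), frozenset("CD"), frozenset("BCE")]   # original database
d = [frozenset("ABCDE"), frozenset("CDE")]                   # inserted transactions
MINSUP = Fraction(3, 5)   # minsup = 0.6, kept exact to avoid float rounding

ITEMS = sorted(set().union(*D, *d))
# Every non-empty itemset over the items of D and d (31 candidates here).
UNIVERSE = [frozenset(c) for r in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, r)]

def count(x, db):
    """Number of transactions in db that contain itemset x."""
    return sum(1 for t in db if x <= t)

def frequent(db):
    """Itemsets whose relative support in db is at least MINSUP (assumed >=)."""
    return {x for x in UNIVERSE if count(x, db) >= MINSUP * len(db)}

def closed(db):
    """Itemsets that occur in db and have no proper superset of equal support."""
    return {x for x in UNIVERSE
            if count(x, db) > 0
            and not any(x < y and count(y, db) == count(x, db) for y in UNIVERSE)}

def fmt(itemsets):
    """Sorted list of strings such as 'BCE', for readable output."""
    return sorted("".join(sorted(x)) for x in itemsets)

FI_D, FI_d = frequent(D), frequent(d)
FCI_D, FCI_d = FI_D & closed(D), FI_d & closed(d)
CI_Dplus = closed(D + d)

CO = FI_D & CI_Dplus             # closed in D+ and generated from FI_D
CP = (FI_d - FI_D) & CI_Dplus    # closed in D+ and generated from FI_d - FI_D

print("FCI_D:", fmt(FCI_D), " FCI_d:", fmt(FCI_d))
print("CO:", fmt(CO), " CP:", fmt(CP))
```

The exhaustive enumeration is exponential in the number of items, so it is only meant for checking small examples like this one.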

Lemma 2: If x ∈ JCI, then x ∈ CI_D+.

Proof: If x ∈ JCI, x must fall into one of the following two cases.

Case 1: If x ∈ CI_D ∪ CI_d, then x ∈ CI_D+ according to Lemma 1.

Case 2: If x ∉ CI_D ∪ CI_d, there exist y ∈ CI_D and z ∈ CI_d such that x ⊂ y, x ⊂ z, and x is closed by both y and z. We prove this case by contradiction. If x ∉ CI_D+, there must exist a proper superset x’ of x such that x’.sup_D+ = x.sup_D+, i.e., x’.sup_D*|D| + x’.sup_d*|d| = x.sup_D*|D| + x.sup_d*|d| = y.sup_D*|D| + z.sup_d*|d|. Thus x’ ⊆ y and x’ ⊆ z (because x’.sup_D = y.sup_D and x’.sup_d = z.sup_d), so x’ ⊆ y ∩ z. Since x ∈ JCI means x = y ∩ z, this contradicts x’ being a proper superset of x. Thus, x ∈ CI_D+.

Lemma 3: If x ∈ CI_D+, then x ∈ CI_D ∪ CI_d ∪ JCI.

Proof: If x ∈ CI_D+ and x ∉ CI_D ∪ CI_d, x must be closed by a proper superset in both D and d. Assume y is the itemset that closes x in D and z is the itemset that closes x in d. Then x.sup_D+ * |D⁺| = y.sup_D * |D| + z.sup_d * |d|. If y ⊆ z, x falls under Case 1 of Lemma 2, contradicting the assumption that x ∉ CI_D; if z ⊆ y, x also falls under Case 1 of Lemma 2, contradicting the assumption that x ∉ CI_d. Thus y ⊈ z and z ⊈ y. According to Case 2 of Lemma 2, there must exist x’ = y ∩ z with x’ ∈ CI_D+. If x ⊂ x’, x is closed by x’ (because x’.sup_D+ = x.sup_D+), contradicting the assumption that x ∈ CI_D+. Thus, x = x’ and x ∈ JCI.

Theorem 1: CI_D+ = CI_D ∪ CI_d ∪ JCI.

Proof: According to Lemmas 1 and 2, we have (CI_D ∪ CI_d ∪ JCI) ⊆ CI_D+. On the other hand, according to Lemma 3, we have CI_D+ ⊆ (CI_D ∪ CI_d ∪ JCI). Thus, CI_D+ = CI_D ∪ CI_d ∪ JCI.
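Theorem 1 can be sanity-checked on the running example D = {ABCE, CD, BCE}, d = {ABCDE, CDE}. The sketch below is illustrative only. It assumes, based on the proofs of Lemmas 2 and 3, that JCI consists of the non-empty intersections y ∩ z with y ∈ CI_D and z ∈ CI_d; this excerpt does not spell out that definition, and the helper names are hypothetical.

```python
from itertools import combinations

D = [frozenset("ABCE"), frozenset("CD"), frozenset("BCE")]   # original database
d = [frozenset("ABCDE"), frozenset("CDE")]                   # inserted transactions

ITEMS = sorted(set().union(*D, *d))
UNIVERSE = [frozenset(c) for r in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, r)]

def count(x, db):
    """Number of transactions in db that contain itemset x."""
    return sum(1 for t in db if x <= t)

def closed(db):
    """Itemsets that occur in db and have no proper superset of equal support."""
    return {x for x in UNIVERSE
            if count(x, db) > 0
            and not any(x < y and count(y, db) == count(x, db) for y in UNIVERSE)}

def fmt(itemsets):
    """Sorted list of strings such as 'BCE', for readable output."""
    return sorted("".join(sorted(x)) for x in itemsets)

CI_D, CI_d, CI_Dplus = closed(D), closed(d), closed(D + d)

# Assumed reading of JCI: non-empty pairwise intersections of closed itemsets.
JCI = {y & z for y in CI_D for z in CI_d if y & z}

# Theorem 1: the closed itemsets of D+ are exactly CI_D, CI_d and JCI together.
assert CI_Dplus == CI_D | CI_d | JCI

# Closed itemsets of D+ that appear in neither CI_D nor CI_d.
only_joint = CI_Dplus - (CI_D | CI_d)
print("JCI:", fmt(JCI))
print("closed only via JCI:", fmt(only_joint))
```

On this example the only itemset contributed purely by JCI is CE = BCE ∩ CDE, which is the itemset missing from FCI_D and FCI_d-D in the example above.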

4.2 The effect of intersectional closed itemsets

Given an original database and the newly inserted transactions, four cases of joint closed itemsets may arise, as shown in Figure 4-1:

Figure 4-1: Four cases of JCI

                              Incremental batch d
                              FCI_d        NCI_d
Original DB D     FCI_D       FFJCI        FNJCI
                  NCI_D       NFJCI        NNJCI

The case of FFJCI: A closed itemset is frequent in both the original database and the newly inserted transactions.

The case of FNJCI: A closed itemset is frequent in the original database but infrequent in the newly inserted transactions.

The case of NFJCI: A closed itemset is infrequent in the original database but frequent in the newly inserted transactions.

The case of NNJCI: A closed itemset is infrequent in both the original database and the newly inserted transactions.

Since the closed itemsets in FFJCI are frequent in both the original database and the new transactions, they remain frequent in D⁺, where the support is the size-weighted average of the supports in D and d. Similarly, the closed itemsets in NNJCI remain infrequent after the new transactions are inserted. FFJCI and NNJCI therefore do not affect the final association rules, whereas FNJCI may remove existing association rules and NFJCI may add new ones.
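The weighted-average argument for FFJCI and NNJCI can be written out explicitly. The derivation below treats supports as relative frequencies, as in the proofs of Section 4.1, and assumes that an itemset is frequent when its support is at least minsup.

```latex
\begin{align*}
x.\mathrm{sup}_{D^+}
  &= \frac{x.\mathrm{sup}_D \cdot |D| + x.\mathrm{sup}_d \cdot |d|}{|D| + |d|}, \\
x \in \mathrm{FFJCI}
  &\;\Rightarrow\;
  x.\mathrm{sup}_{D^+} \ge
  \frac{\mathit{minsup} \cdot |D| + \mathit{minsup} \cdot |d|}{|D| + |d|}
  = \mathit{minsup}, \\
x \in \mathrm{NNJCI}
  &\;\Rightarrow\;
  x.\mathrm{sup}_{D^+} <
  \frac{\mathit{minsup} \cdot |D| + \mathit{minsup} \cdot |d|}{|D| + |d|}
  = \mathit{minsup}.
\end{align*}
```

For FNJCI and NFJCI one term lies above the threshold and the other below, so the outcome depends on the actual supports and on the sizes of D and d.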

According to Theorem 1, the following theorems are derived to obtain CO and CP from FCI_D, FCI_d and JCI.

Theorem 2: CO = {x | x ∈ FCI_D ∪ FFJCI ∪ FNJCI}.

Proof: By definition, CO collects the closed itemsets for D⁺ that are generated from FI_D. According to Theorem 1, CO = {x | x ∈ FI_D and x ∈ CI_D+} = {x | x ∈ FI_D and x ∈ CI_D ∪ CI_d ∪ JCI} = {x | x ∈ FCI_D ∪ FFJCI ∪ FNJCI}.

Theorem 3: CP = {x | x ∈ (FCI_d − FFJCI) ∪ NFJCI}.

Proof: By definition, CP collects the closed itemsets for D⁺ that are generated from FI_d − FI_D. As in Theorem 2, FCI_d ∪ FFJCI ∪ NFJCI is the set of closed itemsets for D⁺ that are generated from FI_d. Thus CP = {x | x ∈ FI_d − FI_D and x ∈ CI_D+} = {x | x ∈ (FCI_d ∪ FFJCI ∪ NFJCI) − (FCI_D ∪ FFJCI ∪ FNJCI)} = {x | x ∈ (FCI_d ∪ FFJCI ∪ NFJCI) − FFJCI} = {x | x ∈ (FCI_d − FFJCI) ∪ NFJCI}.
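Theorems 2 and 3 can be compared against the definitions of CO and CP on the running example. The sketch below is a brute-force check under stated assumptions, not the proposed algorithm: JCI is taken to be the non-empty intersections y ∩ z with y ∈ CI_D and z ∈ CI_d, frequent means relative support at least minsup, and all helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

D = [frozenset("ABCE"), frozenset("CD"), frozenset("BCE")]   # original database
d = [frozenset("ABCDE"), frozenset("CDE")]                   # inserted transactions
MINSUP = Fraction(3, 5)   # minsup = 0.6, kept exact to avoid float rounding

ITEMS = sorted(set().union(*D, *d))
UNIVERSE = [frozenset(c) for r in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, r)]

def count(x, db):
    """Number of transactions in db that contain itemset x."""
    return sum(1 for t in db if x <= t)

def is_frequent(x, db):
    """Relative support of x in db is at least MINSUP (assumed >=)."""
    return count(x, db) >= MINSUP * len(db)

def closed(db):
    """Itemsets that occur in db and have no proper superset of equal support."""
    return {x for x in UNIVERSE
            if count(x, db) > 0
            and not any(x < y and count(y, db) == count(x, db) for y in UNIVERSE)}

def fmt(itemsets):
    """Sorted list of strings such as 'BCE', for readable output."""
    return sorted("".join(sorted(x)) for x in itemsets)

CI_D, CI_d, CI_Dplus = closed(D), closed(d), closed(D + d)
JCI = {y & z for y in CI_D for z in CI_d if y & z}   # assumed reading of JCI

# The four cases of Figure 4-1.
FFJCI = {x for x in JCI if is_frequent(x, D) and is_frequent(x, d)}
FNJCI = {x for x in JCI if is_frequent(x, D) and not is_frequent(x, d)}
NFJCI = {x for x in JCI if not is_frequent(x, D) and is_frequent(x, d)}
NNJCI = JCI - FFJCI - FNJCI - NFJCI

FI_D = {x for x in UNIVERSE if is_frequent(x, D)}
FI_d = {x for x in UNIVERSE if is_frequent(x, d)}
FCI_D, FCI_d = FI_D & CI_D, FI_d & CI_d

CO_by_definition = FI_D & CI_Dplus
CP_by_definition = (FI_d - FI_D) & CI_Dplus
CO_by_theorem_2 = FCI_D | FFJCI | FNJCI
CP_by_theorem_3 = (FCI_d - FFJCI) | NFJCI

print("FFJCI:", fmt(FFJCI), " FNJCI:", fmt(FNJCI))
print("NFJCI:", fmt(NFJCI), " NNJCI:", fmt(NNJCI))
print("CO matches:", CO_by_definition == CO_by_theorem_2)
print("CP matches:", CP_by_definition == CP_by_theorem_3)
```

Note that the check of Theorem 3 needs NFJCI, which this sketch obtains from the full CI_D; in an incremental setting only FCI_D is stored, so that part would not be available so cheaply.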

In contrast to the definitions of CO and CP, Theorems 2 and 3 provide a convenient way to obtain CO and CP. For CO, FFJCI and FNJCI can be obtained by processing the pre-stored mining information FCI_D against d. For CP, however, NFJCI has to be generated from NCI_D, which is usually unknown in a typical incremental mining process, so the cost is unacceptably high. As a result, the following theorem is derived to obtain CP.

Theorem 4: CP = {x | x ∈ FI_d − cover(FFJCI, FI_d) and x ∈ CI_D+}.

Proof: By definition, FFJCI covers the itemsets that are included in both FI_d and FI_D. Thus CP = {x | x ∈ FI_d − FI_D and x ∈ CI_D+} = {x | x ∈ FI_d − cover(FFJCI, FI_d) and x ∈ CI_D+}, where cover(FFJCI, FI_d) denotes the itemsets in FI_d that are covered by FFJCI.

Since FFJCI has already been obtained during CO generation, we only need to find FI_d and remove the itemsets in FI_d that are covered by FFJCI; the remaining itemsets are the candidates for CP. This appears to be a better way to generate CP, because checking the closure property of {FI_d − cover(FFJCI, FI_d)} in D⁺ costs less than generating NFJCI.
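The CP generation described by Theorem 4 can be illustrated on the running example. In the sketch below, "covered by FFJCI" is interpreted as "being a subset of some itemset in FFJCI"; that interpretation, the definition of JCI as non-empty intersections of closed itemsets, and all helper names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

D = [frozenset("ABCE"), frozenset("CD"), frozenset("BCE")]   # original database
d = [frozenset("ABCDE"), frozenset("CDE")]                   # inserted transactions
MINSUP = Fraction(3, 5)   # minsup = 0.6, kept exact to avoid float rounding

ITEMS = sorted(set().union(*D, *d))
UNIVERSE = [frozenset(c) for r in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, r)]

def count(x, db):
    """Number of transactions in db that contain itemset x."""
    return sum(1 for t in db if x <= t)

def is_frequent(x, db):
    """Relative support of x in db is at least MINSUP (assumed >=)."""
    return count(x, db) >= MINSUP * len(db)

def is_closed(x, db):
    """x occurs in db and no proper superset has the same support."""
    return count(x, db) > 0 and not any(
        x < y and count(y, db) == count(x, db) for y in UNIVERSE)

def fmt(itemsets):
    """Sorted list of strings such as 'BCE', for readable output."""
    return sorted("".join(sorted(x)) for x in itemsets)

CI_D = {x for x in UNIVERSE if is_closed(x, D)}
CI_d = {x for x in UNIVERSE if is_closed(x, d)}
JCI = {y & z for y in CI_D for z in CI_d if y & z}   # assumed reading of JCI
FFJCI = {x for x in JCI if is_frequent(x, D) and is_frequent(x, d)}

FI_D = {x for x in UNIVERSE if is_frequent(x, D)}
FI_d = {x for x in UNIVERSE if is_frequent(x, d)}

def cover(closed_sets, itemsets):
    """Itemsets that are subsets of at least one member of closed_sets."""
    return {x for x in itemsets if any(x <= y for y in closed_sets)}

covered = cover(FFJCI, FI_d)
candidates = FI_d - covered                       # candidates for CP
CP = {x for x in candidates if is_closed(x, D + d)}

print("covered:", fmt(covered))
print("candidates:", fmt(candidates))
print("CP:", fmt(CP))
```

On this example the covered itemsets coincide with FI_d ∩ FI_D, so the candidates are exactly FI_d − FI_D, and only the closure check in D⁺ remains.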
