
We also propose another algorithm, called Diff_ET2, which simultaneously processes both the original extended database ED and the updated extended database UE. As stated previously, only Cases 3 and 4 need further database rescanning to determine the support counts of an itemset A. For Case 3, we have to rescan the affected transactions in ED and UE to decide whether A is frequent or not. As for Case 4, the following lemma provides an effective pruning strategy to reduce the number of candidate itemsets and so avoid unnecessary database rescans.

Lemma 1. If an affected itemset A ∉ L (the set of frequent itemsets in ED) and δUE⁻(A) ≤ δED⁻(A), then A ∉ L′ (the set of frequent itemsets in UE), where δED⁻(A) and δUE⁻(A) denote the support counts of A over the affected transactions of ED and UE, respectively.

Proof. If A ∉ L, then δED(A) < |ED| × ms. Note that |UE| = |ED| and δUE⁻(A) ≤ δED⁻(A). Hence, δUE(A) = δED(A) − δED⁻(A) + δUE⁻(A) ≤ δED(A) < |ED| × ms = |UE| × ms. Thus, A ∉ L′. □

Thus, in Case 4, we scan the affected transactions in ED and in UE, respectively, to count the occurrences of A. Only if the support count of A over the affected transactions of UE is greater than that over the affected transactions of ED, i.e., δUE⁻(A) > δED⁻(A), do we have to scan UE to decide whether A is frequent or not.
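The two rules above, the Case 3 support adjustment and the Lemma 1 pruning test, can be sketched as a pair of helper functions. This is a minimal illustration; the function and variable names are ours, not the paper's.

```python
# Hypothetical sketch of the per-itemset update and pruning rules.
# count_ed: support count of A in ED; count_ed_affected / count_ue_affected:
# counts of A over the affected transactions of ED and UE (δED⁻(A), δUE⁻(A)).

def new_support(count_ed, count_ed_affected, count_ue_affected):
    """Case 3: support of A in UE, obtained by removing A's occurrences in
    the affected transactions of ED and adding those in the affected
    transactions of UE."""
    return count_ed - count_ed_affected + count_ue_affected

def can_prune(in_L, count_ed_affected, count_ue_affected):
    """Lemma 1: an affected itemset A not in L stays infrequent whenever its
    count over the affected transactions of UE does not exceed its count
    over the affected transactions of ED."""
    return (not in_L) and count_ue_affected <= count_ed_affected
```

For item "I" in Example 6 below, `can_prune(False, 1, 0)` returns True, so UE need not be rescanned for it.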

Example 6. Consider Figure 13 and let ms = 60% (3 transactions). Item “I” is not frequent in ED; hence, {I} is not in L. Scanning the affected transactions, we have δED⁻({I}) = 1 and δUE⁻({I}) = 0, so δUE⁻({I}) ≤ δED⁻({I}); item “I” is therefore infrequent in UE according to Lemma 1. Consequently, we do not need to scan the transactions in UE to determine whether item “I” is frequent or not. This reduces the number of candidates before scanning UE.

[Figure omitted: transaction tables of the original extended database (ED) and the updated extended database (UE).]

Figure 13. Example of mining generalized association rules caused by item reclassification.

Based on the aforementioned concept, the main process of Diff_ET2 is presented as follows.

First, add all generalized items in the old and new item taxonomies T and T′ into the original database DB to form ED and UE, respectively, and identify the affected primitive items, the affected generalizations, and the affected transactions. Next, let the set of candidate 1-itemsets C1 be the set of items in the new item taxonomy T′, i.e., all items in T′ are candidate 1-itemsets. Then load the original frequent 1-itemsets L1 and divide C1 into three subsets C1ᵘ, C1ᵃ, and C1ⁿ, where C1ᵘ consists of the unaffected 1-itemsets in L1, C1ᵃ contains the affected 1-itemsets, and C1ⁿ contains the unaffected 1-itemsets not in L1. According to Case 2, every itemset in C1ⁿ remains infrequent, so C1ⁿ is pruned directly.

After dividing C1, we first compute the support counts of each 1-itemset A in C1ᵃ over the affected transactions of ED and UE, letting the values be δED⁻(A) and δUE⁻(A), respectively. Then, for each 1-itemset A that is in both L1 and C1ᵃ, i.e., A ∈ C1ᵃ ∩ L1, calculate the new support count δUE(A) = δED(A) − δED⁻(A) + δUE⁻(A) (Case 3). According to Lemma 1, for any candidate A ∉ L1 with δUE⁻(A) > δED⁻(A), count the occurrences of A over the unaffected transactions of UE and add them to δUE⁻(A) (Case 4); every other candidate not in L1 is pruned. The 1-itemsets in C1ᵘ, together with those in C1ᵃ whose supports reach ms, form the new frequent 1-itemsets L1′. For the next cycle we generate candidate 2-itemsets C2 from L1′ and repeat the same procedure for generating L2′, until no frequent k-itemsets Lk′ are created.

The Diff_ET2 algorithm is shown in Figure 14.

Input: (1) DB: the database; (2) ms: the minimum support setting; (3) T: the old item taxonomies; (4) T′: the new item taxonomies; (5) L: the set of original frequent itemsets.

Output: L′: the set of new frequent itemsets with respect to T′.

Steps:

1. Add generalized items in T into the original database DB to form ED;
2. Add generalized items in T′ into the original database DB to form UE;
3. AP ← iden_AP(T, T′); /* Identifying affected primitive items */
4. AG ← iden_AG(T, T′); /* Identifying affected generalizations */
5. Call iden_AT(ED, UE, AP); /* Identifying affected transactions */
6. k ← 1;
7. repeat
8.   if k = 1 then generate C1 from T′;
9.   else Ck ← apriori-gen(Lk−1′);
10.  Delete any candidate in Ck that consists of an item and its ancestor;
11.  Load the original frequent itemsets Lk;
12.  Divide Ck into three subsets Ckᵘ, Ckᵃ, and Ckⁿ; /* Ckᵘ consists of unaffected itemsets in Lk, Ckᵃ contains affected itemsets, and Ckⁿ contains unaffected itemsets not in Lk */
13.  Count the occurrences δED⁻(A) of each itemset A in Ckᵃ over the affected transactions of ED;
14.  Count the occurrences δUE⁻(A) of each itemset A in Ckᵃ over the affected transactions of UE;
15.  for each itemset A ∈ Ckᵃ ∩ Lk do /* Case 3 */
16.    Calculate δUE(A) = δED(A) − δED⁻(A) + δUE⁻(A);
17.  for each itemset A ∉ Lk with δUE⁻(A) > δED⁻(A) do /* Case 4 & Lemma 1 */
18.    Count the occurrences of A over the unaffected transactions of UE;
19.    Add the count into δUE⁻(A);
20.  end for
21.  Lk′ ← {A ∈ Ckᵃ | supUE(A) ≥ ms} ∪ Ckᵘ;
22. until Lk′ = ∅;
23. L′ ← ⋃k Lk′;

Figure 14. Algorithm Diff_ET2.
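The main loop of Figure 14 can be sketched as a runnable program on hypothetical data structures. In this sketch, ED and UE are lists of transaction item sets, `affected` is the set of affected items (AP ∪ AG), `ancestors` maps an item to its generalizations, ms is a fraction, and `L_old` maps each original frequent itemset to its support count in ED. The names and representations are our assumptions, not the authors' implementation; in particular, affected transactions are found by a direct ED/UE comparison rather than by iden_AT, and the candidate join simplifies apriori-gen.

```python
from itertools import combinations

def diff_et2(ED, UE, affected, ancestors, ms, L_old):
    n = len(ED)                     # |UE| = |ED|: only item labels changed
    minsup = ms * n
    aff = [i for i in range(n) if ED[i] != UE[i]]   # affected transactions
    ED_aff = [ED[i] for i in aff]
    UE_aff = [UE[i] for i in aff]
    UE_rest = [UE[i] for i in range(n) if i not in set(aff)]

    def count(A, db):
        return sum(1 for t in db if A <= t)

    items = sorted({i for t in UE for i in t})
    L_new, Lk, k = {}, None, 1
    while True:
        if k == 1:
            Ck = [frozenset([i]) for i in items]        # step 8
        else:                                           # step 9 (simplified join)
            Ck = sorted({a | b for a, b in combinations(Lk, 2)
                         if len(a | b) == k}, key=sorted)
        # Step 10: drop candidates containing an item and one of its ancestors.
        Ck = [A for A in Ck
              if not any(anc in A for i in A for anc in ancestors.get(i, ()))]
        Lk = set()
        for A in Ck:
            if not (A & affected):      # unaffected itemset
                if A in L_old:          # Case 1: remains frequent
                    Lk.add(A)
                    L_new[A] = L_old[A]
                continue                # Case 2: remains infrequent
            d_ed = count(A, ED_aff)
            d_ue = count(A, UE_aff)
            if A in L_old:              # Case 3: adjust the old support
                sup = L_old[A] - d_ed + d_ue
            elif d_ue > d_ed:           # Case 4: Lemma 1 cannot prune A
                sup = count(A, UE_rest) + d_ue
            else:                       # Lemma 1: prune without rescanning UE
                continue
            if sup >= minsup:
                Lk.add(A)
                L_new[A] = sup
        if not Lk:                      # step 22
            return L_new
        k += 1
```

On a small synthetic database where one transaction gains item "Y" after reclassification, only the affected candidates are recounted, and unaffected infrequent items such as "Z" are never touched.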

3.4.1. An example for Diff_ET2

Consider Figure 13 and let ms = 60% (3 transactions). The set of original frequent itemsets is L = {A, D, E, H, AE, AH, DH, EH, AEH}. Note that item “K” is no longer interesting and is deleted from the old taxonomy; therefore, we do not add its ancestors to UE.

The Diff_ET2 algorithm first divides all 1-itemsets in C1 into three subsets: C1ᵘ = {D, E, H}, the unaffected 1-itemsets in L1; C1ᵃ = {A, I}, the affected generalized 1-itemsets, where “A” is frequent in L1 while “I” is not; and C1ⁿ = {B, C, F, G}, the unaffected 1-itemsets not in L1. Since the 1-itemsets in C1ᵘ are frequent and their supports do not change in UE, we do not need to process them; we only need to process the generalized 1-itemsets {A, I}.

According to Case 2, every itemset in C1ⁿ is infrequent, so C1ⁿ is pruned directly. Next, Transaction 1 in ED and UE is scanned, since that transaction is affected by item “L”. We then subtract the support count of item “A” over the affected transactions of ED and add its support count over the affected transactions of UE, i.e., δUE({A}) = δED({A}) − δED⁻({A}) + δUE⁻({A}). Because δED⁻({I}) = 1 and δUE⁻({I}) = 0, i.e., δUE⁻({I}) ≤ δED⁻({I}), item “I” is infrequent in UE according to Lemma 1. Therefore, we do not need to scan the transactions in UE to determine whether item “I” is frequent or not. After comparing the support of “A” with ms, the new set of frequent 1-itemsets L1′ is {A, D, E, H}.

Next, we use L1′ to generate candidate 2-itemsets C2, obtaining C2 = {AE, AH, DE, DH, EH}. However, only {AE} and {AH} undergo support counting, because the procedure for generating frequent 2-itemsets is the same as that for generating L1′. The new set of frequent 2-itemsets L2′ is {AH, DH, EH}. Finally, we use the same approach to generate L3′, obtaining L3′ = ∅.
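Candidate generation with ancestor pruning (step 10 of Figure 14) can be sketched as below. Note that the ancestor map here is an assumption on our part: we suppose item D is a descendant of generalization A, which would explain why the pair AD is absent from C2 in the walkthrough above.

```python
from itertools import combinations

# Sketch of candidate generation with ancestor pruning. The ancestor relation
# {"D": {"A"}} is a hypothetical reading of Figure 13's taxonomy, not a fact
# stated in the text.

def gen_candidates(Lk_prev, ancestors):
    k = len(next(iter(Lk_prev))) + 1
    joined = {a | b for a, b in combinations(Lk_prev, 2) if len(a | b) == k}
    # Delete any candidate that contains an item together with its ancestor.
    return {A for A in joined
            if not any(anc in A for i in A for anc in ancestors.get(i, ()))}

L1_new = [frozenset(i) for i in "ADEH"]
C2 = gen_candidates(L1_new, {"D": {"A"}})
# C2 = {AE, AH, DE, DH, EH}; AD is pruned because D's ancestor A is present.
```

An itemset pairing an item with its own generalization carries no extra information, which is why such candidates are deleted before counting.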
