
We also propose another algorithm, called Diff_ET2, which simultaneously processes both the original extended database ED and the updated extended database UE. As stated previously, only Cases 3 and 4 need further database rescanning to determine the support counts of an itemset A. For Case 3, we have to rescan the affected transactions in ED and UE to decide whether A is frequent or not. As for Case 4, the following lemma provides an effective pruning strategy to reduce the number of candidate itemsets and so avoid unnecessary database rescans.

Lemma 1. If an affected itemset A ∉ L (the set of frequent itemsets in ED) and δUE⁻(A) ≤ δED⁻(A), then A ∉ L′ (the set of frequent itemsets in UE), where δED⁻(A) and δUE⁻(A) denote the support counts of A over the affected transactions of ED and UE, respectively.

Proof. If A ∉ L, then δED(A) < |ED| × ms. Note that |UE| = |ED| and δUE⁻(A) ≤ δED⁻(A). Hence, δUE(A) = δED(A) − δED⁻(A) + δUE⁻(A) ≤ δED(A) < |ED| × ms = |UE| × ms. Thus, A ∉ L′. □

Thus, in Case 4, we scan the affected transactions in ED and in UE, respectively, to count the occurrences of A. Only if the support count of A over the affected transactions of UE is greater than that over the affected transactions of ED, i.e., δUE⁻(A) > δED⁻(A), do we have to scan UE to decide whether A is frequent or not.
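The two rules above, the Case 3 support adjustment and the Lemma 1 pruning test, can be sketched as a pair of helper functions. This is a minimal illustration; the function and variable names are ours, not the paper's.

```python
# Hypothetical sketch of the per-itemset update and pruning rules.
# count_ed: support count of A in ED; count_ed_affected / count_ue_affected:
# counts of A over the affected transactions of ED and UE (δED⁻(A), δUE⁻(A)).

def new_support(count_ed, count_ed_affected, count_ue_affected):
    """Case 3: support of A in UE, obtained by removing A's occurrences in
    the affected transactions of ED and adding those in the affected
    transactions of UE."""
    return count_ed - count_ed_affected + count_ue_affected

def can_prune(in_L, count_ed_affected, count_ue_affected):
    """Lemma 1: an affected itemset A not in L stays infrequent whenever its
    count over the affected transactions of UE does not exceed its count
    over the affected transactions of ED."""
    return (not in_L) and count_ue_affected <= count_ed_affected
```

For item "I" in Example 6 below, `can_prune(False, 1, 0)` returns True, so UE need not be rescanned for it.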

Example 6. Consider Figure 13 and let ms = 60% (3 transactions). Item “I” is not frequent in ED; hence, {I} is not in L. Scanning the affected transactions, we have δED⁻({I}) = 1 and δUE⁻({I}) = 0, so δUE⁻({I}) ≤ δED⁻({I}); item “I” is therefore infrequent in UE according to Lemma 1. Consequently, we do not need to scan the transactions in UE to determine whether item “I” is frequent or not. This reduces the number of candidates before scanning UE.

[Figure omitted: transaction tables of the original extended database (ED) and the updated extended database (UE).]

Figure 13. Example of mining generalized association rules caused by item reclassification.

Based on the aforementioned concept, the main process of Diff_ET2 is presented as follows.

First, add all generalized items in the old and new item taxonomies T and T′ into the original database DB to form ED and UE, respectively, and identify the affected primitive items, the affected generalizations, and the affected transactions. Next, let the set of candidate 1-itemsets C1 be the set of items in the new item taxonomy T′, i.e., all items in T′ are candidate 1-itemsets. Then load the original frequent 1-itemsets L1 and divide C1 into three subsets C1ᵘ, C1ᵃ, and C1ⁿ, where C1ᵘ consists of the unaffected 1-itemsets in L1, C1ᵃ contains the affected 1-itemsets, and C1ⁿ contains the unaffected 1-itemsets not in L1. According to Case 2, every itemset in C1ⁿ remains infrequent, so C1ⁿ is pruned directly.

After dividing C1, we first compute the support counts of each 1-itemset A in C1ᵃ over the affected transactions of ED and UE, letting the values be δED⁻(A) and δUE⁻(A), respectively. Then, for each 1-itemset A that is in both L1 and C1ᵃ, i.e., A ∈ C1ᵃ ∩ L1, calculate the new support count δUE(A) = δED(A) − δED⁻(A) + δUE⁻(A) (Case 3). According to Lemma 1, for any candidate A ∉ L1 with δUE⁻(A) > δED⁻(A), count the occurrences of A over the unaffected transactions of UE and add them to δUE⁻(A) (Case 4); every other candidate not in L1 is pruned. The 1-itemsets in C1ᵘ, together with those in C1ᵃ whose supports reach ms, form the new frequent 1-itemsets L1′. For the next cycle we generate candidate 2-itemsets C2 from L1′ and repeat the same procedure for generating L2′, until no frequent k-itemsets Lk′ are created.

The Diff_ET2 algorithm is shown in Figure 14.

Input: (1) DB: the database; (2) ms: the minimum support setting; (3) T: the old item taxonomies; (4) T′: the new item taxonomies; (5) L: the set of original frequent itemsets.

Output: L′: the set of new frequent itemsets with respect to T′.

Steps:

1. Add generalized items in T into the original database DB to form ED;
2. Add generalized items in T′ into the original database DB to form UE;
3. AP ← iden_AP(T, T′); /* Identifying affected primitive items */
4. AG ← iden_AG(T, T′); /* Identifying affected generalizations */
5. Call iden_AT(ED, UE, AP); /* Identifying affected transactions */
6. k ← 1;
7. repeat
8.   if k = 1 then generate C1 from T′;
9.   else Ck ← apriori-gen(Lk−1′);
10.  Delete any candidate in Ck that consists of an item and its ancestor;
11.  Load the original frequent itemsets Lk;
12.  Divide Ck into three subsets Ckᵘ, Ckᵃ, and Ckⁿ; /* Ckᵘ consists of unaffected itemsets in Lk, Ckᵃ contains affected itemsets, and Ckⁿ contains unaffected itemsets not in Lk */
13.  Count the occurrences δED⁻(A) of each itemset A in Ckᵃ over the affected transactions of ED;
14.  Count the occurrences δUE⁻(A) of each itemset A in Ckᵃ over the affected transactions of UE;
15.  for each itemset A ∈ Ckᵃ ∩ Lk do /* Case 3 */
16.    Calculate δUE(A) = δED(A) − δED⁻(A) + δUE⁻(A);
17.  for each itemset A ∉ Lk with δUE⁻(A) > δED⁻(A) do /* Case 4 & Lemma 1 */
18.    Count the occurrences of A over the unaffected transactions of UE;
19.    Add the count into δUE⁻(A);
20.  end for
21.  Lk′ ← {A ∈ Ckᵃ | supUE(A) ≥ ms} ∪ Ckᵘ;
22. until Lk′ = ∅;
23. L′ ← ⋃k Lk′;

Figure 14. Algorithm Diff_ET2.
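The main loop of Figure 14 can be sketched as a runnable program on hypothetical data structures. In this sketch, ED and UE are lists of transaction item sets, `affected` is the set of affected items (AP ∪ AG), `ancestors` maps an item to its generalizations, ms is a fraction, and `L_old` maps each original frequent itemset to its support count in ED. The names and representations are our assumptions, not the authors' implementation; in particular, affected transactions are found by a direct ED/UE comparison rather than by iden_AT, and the candidate join simplifies apriori-gen.

```python
from itertools import combinations

def diff_et2(ED, UE, affected, ancestors, ms, L_old):
    n = len(ED)                     # |UE| = |ED|: only item labels changed
    minsup = ms * n
    aff = [i for i in range(n) if ED[i] != UE[i]]   # affected transactions
    ED_aff = [ED[i] for i in aff]
    UE_aff = [UE[i] for i in aff]
    UE_rest = [UE[i] for i in range(n) if i not in set(aff)]

    def count(A, db):
        return sum(1 for t in db if A <= t)

    items = sorted({i for t in UE for i in t})
    L_new, Lk, k = {}, None, 1
    while True:
        if k == 1:
            Ck = [frozenset([i]) for i in items]        # step 8
        else:                                           # step 9 (simplified join)
            Ck = sorted({a | b for a, b in combinations(Lk, 2)
                         if len(a | b) == k}, key=sorted)
        # Step 10: drop candidates containing an item and one of its ancestors.
        Ck = [A for A in Ck
              if not any(anc in A for i in A for anc in ancestors.get(i, ()))]
        Lk = set()
        for A in Ck:
            if not (A & affected):      # unaffected itemset
                if A in L_old:          # Case 1: remains frequent
                    Lk.add(A)
                    L_new[A] = L_old[A]
                continue                # Case 2: remains infrequent
            d_ed = count(A, ED_aff)
            d_ue = count(A, UE_aff)
            if A in L_old:              # Case 3: adjust the old support
                sup = L_old[A] - d_ed + d_ue
            elif d_ue > d_ed:           # Case 4: Lemma 1 cannot prune A
                sup = count(A, UE_rest) + d_ue
            else:                       # Lemma 1: prune without rescanning UE
                continue
            if sup >= minsup:
                Lk.add(A)
                L_new[A] = sup
        if not Lk:                      # step 22
            return L_new
        k += 1
```

On a small synthetic database where one transaction gains item "Y" after reclassification, only the affected candidates are recounted, and unaffected infrequent items such as "Z" are never touched.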

3.4.1. An example for Diff_ET2

Consider Figure 13 and let ms = 60% (3 transactions). The set of original frequent itemsets is L = {A, D, E, H, AE, AH, DH, EH, AEH}. Note that item “K” is no longer interesting and is deleted from the old taxonomy; therefore, we do not add its ancestors to UE.

The Diff_ET2 algorithm first divides all 1-itemsets in C1 into three subsets: C1ᵘ = {D, E, H}, the unaffected 1-itemsets in L1; C1ᵃ = {A, I}, the affected generalized 1-itemsets, where “A” is frequent in L1 while “I” is not; and C1ⁿ = {B, C, F, G}, the unaffected 1-itemsets not in L1. Since the 1-itemsets in C1ᵘ are frequent and their supports do not change in UE, we do not need to process them; we only need to process the generalized 1-itemsets {A, I}.

According to Case 2, every itemset in C1ⁿ is infrequent, so C1ⁿ is pruned directly. Next, Transaction 1 in ED and UE is scanned, since that transaction is affected by item “L”. We then subtract the support count of item “A” over the affected transactions of ED and add its support count over the affected transactions of UE, i.e., δUE({A}) = δED({A}) − δED⁻({A}) + δUE⁻({A}). Because δED⁻({I}) = 1 and δUE⁻({I}) = 0, i.e., δUE⁻({I}) ≤ δED⁻({I}), item “I” is infrequent in UE according to Lemma 1. Therefore, we do not need to scan the transactions in UE to determine whether item “I” is frequent or not. After comparing the support of “A” with ms, the new set of frequent 1-itemsets L1′ is {A, D, E, H}.

Next, we use L1′ to generate candidate 2-itemsets C2, obtaining C2 = {AE, AH, DE, DH, EH}. However, only {AE} and {AH} undergo support counting, because the procedure for generating frequent 2-itemsets is the same as that for generating L1′. The new set of frequent 2-itemsets L2′ is {AH, DH, EH}. Finally, we use the same approach to generate L3′, obtaining L3′ = ∅.
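Candidate generation with ancestor pruning (step 10 of Figure 14) can be sketched as below. Note that the ancestor map here is an assumption on our part: we suppose item D is a descendant of generalization A, which would explain why the pair AD is absent from C2 in the walkthrough above.

```python
from itertools import combinations

# Sketch of candidate generation with ancestor pruning. The ancestor relation
# {"D": {"A"}} is a hypothetical reading of Figure 13's taxonomy, not a fact
# stated in the text.

def gen_candidates(Lk_prev, ancestors):
    k = len(next(iter(Lk_prev))) + 1
    joined = {a | b for a, b in combinations(Lk_prev, 2) if len(a | b) == k}
    # Delete any candidate that contains an item together with its ancestor.
    return {A for A in joined
            if not any(anc in A for i in A for anc in ancestors.get(i, ()))}

L1_new = [frozenset(i) for i in "ADEH"]
C2 = gen_candidates(L1_new, {"D": {"A"}})
# C2 = {AE, AH, DE, DH, EH}; AD is pruned because D's ancestor A is present.
```

An itemset pairing an item with its own generalization carries no extra information, which is why such candidates are deleted before counting.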
