Determining Frequent Itemsets from Current SFI-forest

Chapter 2 Online Mining of Frequent Itemsets in Data Streams

2.3 The Proposed Algorithm: DSM-FI

2.3.3 Determining Frequent Itemsets from Current SFI-forest

Once the SFI-forest, which contains all the frequent items of the data stream generated so far, has been constructed, all the frequent itemsets can be derived by traversing the SFI-forest according to the Apriori principle. For this purpose, we propose an efficient mechanism called top-down frequent itemset selection (todoFIS), shown in Figure 2-8. It is especially useful for mining long frequent itemsets. The method is described as follows.

Assume that there are k frequent items, namely e₁, e₂, …, e_k, in the current FI-list, and that each item e_i (i = 1, 2, …, k) has an associated e_i.OFI-list whose size is denoted by |e_i.OFI-list|. The items o₁, o₂, …, o_j within the e_i.OFI-list are denoted by e_i.o₁, e_i.o₂, …, e_i.o_j, respectively, where j equals |e_i.OFI-list|. For each entry e_i in the current FI-list, the DSM-FI algorithm first generates a maximal candidate itemset with (j+1) items, i.e., (e_i e_i.o₁ e_i.o₂ … e_i.o_j), by combining the item-prefix e_i with all frequent items in the e_i.OFI-list. Then, DSM-FI uses the following scheme to count its estimated support.

First, we start with the frequent item e_i.o_l (1 ≤ l ≤ j) whose estimated support is the smallest, and traverse the paths containing e_i.o_l via the node-links of the e_i.SFI-tree to count the estimated support of the candidate (e_i e_i.o₁ e_i.o₂ … e_i.o_j). If the estimated support of the candidate is greater than or equal to (s−ε)⋅e_i.CL, then it is a frequent itemset. All subsets of this frequent itemset are also frequent according to the Apriori principle. Hence, the complete set of frequent itemsets stored in the e_i.SFI-tree can be generated by enumerating all the subsets of the frequent (j+1)-itemset (e_i e_i.o₁ e_i.o₂ … e_i.o_j). On the other hand, if the estimated support of the candidate (j+1)-itemset is less than the threshold (s−ε)⋅e_i.CL, then it is not a frequent itemset. In that case, we apply the same mechanism to the subsets of the (j+1)-itemset, level by level, down to the candidate 3-itemsets. The test can stop at 3-itemsets because all frequent 2-itemsets can be generated directly by combining the item e_i with the frequent items of the e_i.OFI-list.
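The top-down test for a single FI-list entry can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DSM-FI implementation: the names `top_down_maximal`, `count_support`, and `transactions` are hypothetical, the support oracle counts over a small made-up transaction list instead of traversing an e_i.SFI-tree via node-links, the threshold is passed in as a plain number standing for (s−ε)⋅e_i.CL, and every generated subset is assumed to keep the item-prefix.

```python
def top_down_maximal(prefix, frequent_ofi_items, esup, threshold):
    """Maximal frequent itemsets containing `prefix`, found top-down."""
    maximal = []
    # Start from the single maximal candidate: prefix plus every OFI-list item.
    frontier = {frozenset([prefix, *frequent_ofi_items])}
    while frontier:
        next_frontier = set()
        for cand in frontier:
            if any(cand <= m for m in maximal):
                continue  # covered by a frequent superset (Apriori principle)
            # 1- and 2-itemsets are accepted without counting, since the
            # OFI-list items passed in are assumed frequent with the prefix.
            if len(cand) <= 2 or esup(cand) >= threshold:
                maximal.append(cand)
            else:
                # Assumption: decompose into subsets that keep the prefix.
                for item in cand - {prefix}:
                    next_frontier.add(cand - {item})
        frontier = next_frontier
    return maximal


# Hypothetical data, unrelated to Figure 2-7.
transactions = [
    {"c", "e", "f"}, {"c", "e", "f"}, {"c", "e", "f", "g"},
    {"c", "g"}, {"c", "g"}, {"c", "e", "g"},
]


def count_support(itemset):
    """Stand-in for the SFI-tree traversal: count containing transactions."""
    return sum(1 for t in transactions if itemset <= t)


result = top_down_maximal("c", ["e", "f", "g"], count_support, threshold=3)
print(sorted("".join(sorted(m)) for m in result))  # ['cef', 'cg']
```

In this run the 4-itemset (cefg) has support 1 and is rejected, (cef) reaches the threshold, and (cg) is the only 2-itemset not already covered by (cef).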

Note that a (j+1)-itemset can be decomposed into C(j+1, j) = j+1 j-itemsets. We decompose one candidate j-itemset from the (j+1)-itemset at a time, and use the same scheme described above to count the estimated support of this candidate j-itemset. Finally, all the maximal frequent itemsets are maintained in a temporary MFI-list, called MFI_temp-list, for efficient generation of the set of all frequent itemsets. Once such an MFI_temp-list is obtained, all the frequent itemsets can be generated efficiently by enumerating the subsets of the maximal frequent itemsets in the current MFI_temp-list, without any further candidate generation or support counting. Note that if the user requests only the set of all maximal frequent itemsets found so far, the DSM-FI algorithm can output them efficiently by scanning the MFI_temp-list.

Example 2-3. Let the minimum support threshold s be 0.5. Therefore, an itemset X is frequent in Figure 2-7 if X.esup ≥ s⋅X.CL. Note that s⋅X.CL = 0.5⋅6 = 3 in this case. The online mining steps of DSM-FI algorithm are described as follows.

(1) First of all, DSM-FI starts the frequent itemset mining scheme from the first frequent item a (from left to right). At this moment, only item a is a frequent itemset, since the estimated supports of items c, d, e, and f in the a.OFI-list are all less than s⋅a.CL = 3. Now, DSM-FI stores the maximal frequent 1-itemset (a) into the MFI_temp-list.

(2) Next, DSM-FI starts on the second entry c for frequent itemset mining. DSM-FI generates a candidate maximal 3-itemset (cef), and traverses the c.SFI-tree to count its estimated support. As a result, the candidate (cef) is a maximal frequent itemset, since its estimated support is 3 and it is not a subset of any other frequent itemset in the MFI_temp-list. Now, DSM-FI stores the maximal frequent itemset (cef) into the MFI_temp-list.

(3) Next, DSM-FI starts on the third entry d and generates a candidate maximal 2-itemset (df). DSM-FI stores the itemset (df) into the MFI_temp-list without traversing the d.SFI-tree, because (df) is a frequent 2-itemset and is not a subset of any other maximal frequent itemset stored in the MFI_temp-list.

(4) On the fourth entry f, the DSM-FI algorithm generates one frequent 1-itemset (f) directly, since the f.OFI-list is empty. DSM-FI does not store it into the MFI_temp-list, because (f) is a subset of the already generated maximal frequent itemset (cef).

(5) Finally, on the fifth entry e, DSM-FI generates a frequent 2-itemset (ef) directly. However, (ef) is a subset of the maximal frequent itemset (cef) stored in the MFI_temp-list, so the DSM-FI algorithm does not store it into the MFI_temp-list.
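The bookkeeping performed on the MFI_temp-list in Example 2-3 can be replayed with a short Python sketch. It is only an illustration: the helper name `update_mfi` is hypothetical, itemsets are modelled as `frozenset`s of single-character items, and the five itemsets are fed in the order in which the example produces them.

```python
def update_mfi(mfi_temp, itemset):
    """Insert `itemset` unless a stored itemset already covers it."""
    itemset = frozenset(itemset)
    if any(itemset <= m for m in mfi_temp):
        return mfi_temp  # already stored, or a subset of a stored itemset
    # Drop stored itemsets that the newcomer makes non-maximal.
    kept = [m for m in mfi_temp if not m <= itemset]
    kept.append(itemset)
    return kept


mfi_temp = []
# Itemsets found for entries a, c, d, f, and e, in that order.
for found in ["a", "cef", "df", "f", "ef"]:
    mfi_temp = update_mfi(mfi_temp, found)

print(["".join(sorted(m)) for m in mfi_temp])  # ['a', 'cef', 'df']
```

The itemsets (f) and (ef) are rejected because (cef) already covers them, which leaves (a), (cef), and (df) in the list.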

Algorithm todoFIS

Input: A current SFI-forest, the current window identifier N, a minimum support threshold s, and a maximum support error threshold ε.

Output: A set of all frequent itemsets.

MFI_temp-list = ∅; /* MFI_temp-list is a temporary list used to store the set of maximal frequent itemsets */
foreach entry e in the current FI-list do
    construct a maximal candidate itemset E with size |E|; /* |E| = 1 + |e.OFI-list| */
    repeat
        count E.esup by traversing the e.SFI-tree;
        if E.esup ≥ (s−ε)⋅E.CL then
            if E ∉ MFI_temp-list and E is not a subset of any other pattern in the MFI_temp-list then
                add E into the MFI_temp-list;
                remove E’s subsets from the MFI_temp-list;
            end if
        else /* E is not a frequent itemset */
            enumerate E into itemsets with size |E|−1;
        end if
    until todoFIS finds the set of all frequent itemsets with respect to entry e;
end for

Figure 2-8. Algorithm todoFIS

After processing all the entries in the FI-list, the MFI_temp-list generated by the DSM-FI algorithm contains the set of current maximal frequent itemsets: {(a), (cef), (df)}. Therefore, the set of all frequent itemsets can be generated by enumerating the subsets of these maximal itemsets. Consequently, the set of all frequent itemsets in Figure 2-7 is {(a), (cef), (ce), (cf), (ef), (c), (e), (f), (df), (d)}.
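As a check on this final step, the enumeration of non-empty subsets can be sketched in Python. The function name `enumerate_frequent` is hypothetical and the code illustrates only the subset enumeration. It does not access the SFI-forest and performs no support counting.

```python
from itertools import combinations


def enumerate_frequent(maximal_itemsets):
    """All non-empty subsets of the given maximal frequent itemsets."""
    frequent = set()
    for m in maximal_itemsets:
        items = sorted(m)
        for size in range(1, len(items) + 1):
            # combinations() yields each size-`size` subset exactly once.
            frequent.update(combinations(items, size))
    return frequent


all_frequent = enumerate_frequent(["a", "cef", "df"])
print(len(all_frequent))  # 10
```

The item f is a subset of both (cef) and (df), so collecting the results in a set keeps it from being reported twice.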
