Chapter 3 Online Mining of Frequent Itemsets over Stream Sliding Windows
3.3 The Proposed Algorithm: MFI-TransSW
In this section, we propose an efficient single-pass algorithm, called MFI-TransSW (Mining Frequent Itemsets over a Transaction-sensitive Sliding Window), to mine the set of all frequent itemsets in data streams with a transaction-sensitive sliding window. The algorithm uses a bit-sequence representation of items to reduce the time and memory needed to slide the window.
3.3.1 Bit-Sequence Representation
In the proposed MFI-TransSW algorithm, for each item X in the current transaction-sensitive sliding window TransSW, a bit-sequence with w bits, denoted as Bit(X), is constructed, where w is the sliding window size. If item X is in the i-th transaction of the current TransSW, the i-th bit of Bit(X) is set to 1; otherwise, it is set to 0. This process is called bit-sequence transform.
For example, in Figure 3-2, the first sliding window TransSW1 consists of three transactions, <T1, (acd)>, <T2, (bce)>, and <T3, (abce)>, whereas TransSW2 consists of the transactions <T2, (bce)>, <T3, (abce)>, and <T4, (be)>. Because item a appears in the 1st and 3rd transactions of TransSW1, the bit-sequence of a, Bit(a), is 101. Similarly, Bit(b) = 011, Bit(c) = 111, Bit(d) = 100, and Bit(e) = 011.
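The bit-sequence transform described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `bit_sequences` and the choice to store each Bit(X) as a Python integer whose leftmost (most significant) of w bits stands for the oldest transaction are assumptions made for this example.

```python
def bit_sequences(window):
    """Map each item to a w-bit integer; the leftmost bit is the oldest transaction.

    `window` is a list of transactions, each an iterable of items (hypothetical layout).
    """
    w = len(window)
    bits = {}
    for i, transaction in enumerate(window):  # i = 0 is the oldest transaction
        for item in transaction:
            # Set the (i+1)-th bit counted from the left.
            bits[item] = bits.get(item, 0) | (1 << (w - 1 - i))
    return bits


trans_sw1 = ["acd", "bce", "abce"]  # T1, T2, T3 of TransSW1
bits = bit_sequences(trans_sw1)
for item in sorted(bits):
    print(f"Bit({item}) = {bits[item]:03b}")
```

With the three transactions of TransSW1 as input, the printed values agree with the bit-sequences listed in the example above.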
3.3.2 The MFI-TransSW Algorithm
The MFI-TransSW algorithm consists of three phases: the window initialization phase, the window sliding phase, and the frequent itemsets generation phase.
3.3.2.1 Window Initialization Phase
The phase is activated while the number of transactions generated so far in a transaction data stream is less than or equal to a user-predefined sliding window size w. In this phase, each item in the new incoming transaction is transformed into its bit-sequence representation.
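For instance, in Figure 3-3, the first sliding window TransSW1 contains three transactions: T1, T2, and T3. The figure shows the bit-sequences of the items of TransSW1 as each transaction arrives during the window initialization phase; the resulting bit-sequences of the full window also appear in the TransSW1 row of Figure 3-4.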
tid   Items    Bit-sequences in current TransSW1
T1    (acd)    Bit(a) = 100, Bit(c) = 100, Bit(d) = 100
T2    (bce)    Bit(a) = 100, Bit(c) = 110, Bit(d) = 100, Bit(b) = 010, Bit(e) = 010
T3    (abce)   Bit(a) = 101, Bit(c) = 111, Bit(d) = 100, Bit(b) = 011, Bit(e) = 011
Figure 3-3. Bit-sequences of items in window initialization phase of TransSW

Window-id   Transactions                              Bit-sequences of items
TransSW1    <T1, (acd)>, <T2, (bce)>, <T3, (abce)>    Bit(a) = 101, Bit(c) = 111, Bit(d) = 100, Bit(b) = 011, Bit(e) = 011
TransSW2    <T2, (bce)>, <T3, (abce)>, <T4, (be)>     Bit(a) = 010, Bit(c) = 110, Bit(d) = 000, Bit(b) = 111, Bit(e) = 111
Figure 3-4. Bit-sequences of items after sliding TransSW1 to TransSW2
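The window initialization phase can be illustrated incrementally: while the window is not yet full, the transaction arriving at position i sets the i-th bit of each of its items. The sketch below is a hedged illustration under that reading; the helper `add_transaction`, the constant `W`, and the `snapshots` list are hypothetical names introduced here, and bit-sequences are again stored as integers whose leftmost bit is the oldest transaction.

```python
W = 3  # sliding window size w (value taken from the running example)


def add_transaction(bits, items, position, w=W):
    """Set the position-th bit (1-based, counted from the left) for every item."""
    for item in items:
        bits[item] = bits.get(item, 0) | (1 << (w - position))


bits = {}
snapshots = []  # state of all bit-sequences after each incoming transaction
for position, items in enumerate(["acd", "bce", "abce"], start=1):
    add_transaction(bits, items, position)
    snapshots.append({x: format(b, f"0{W}b") for x, b in bits.items()})

for tid, snapshot in enumerate(snapshots, start=1):
    print(f"T{tid}:", ", ".join(f"Bit({x}) = {v}" for x, v in snapshot.items()))
```

Each printed row corresponds to one row of the initialization table for TransSW1 (T1, T2, T3).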
3.3.2.2 Window Sliding Phase
The phase is activated after the sliding window TransSW becomes full. A new incoming transaction is appended to the sliding window, and the oldest transaction is removed from the window.
Because of the bit-sequence representation, the oldest transaction can be removed cheaply: the MFI-TransSW algorithm applies a bitwise left shift to the bit-sequence of every item in the current sliding window, which discards the bit of the aged transaction.
After the window slides, a pruning method called Item-Prune is used to reduce memory usage: an item X in the current transaction-sensitive sliding window is dropped if sup(X)TransSW = 0, i.e., if Bit(X) contains no 1 bits.
For example, in Figure 3-2, before the fourth transaction <T4, (be)> is processed, the first transaction T1 must be removed from the current window by a bitwise left shift on the bit-sequences of all items. Hence, Bit(a) is modified from 101 to 010. Similarly, Bit(c) = 110, Bit(d) = 000, Bit(b) = 110, and Bit(e) = 110. Then, the new transaction <T4, (be)> is processed by bit-sequence transform, which sets the last bit of Bit(b) and Bit(e) to 1, giving Bit(b) = 111 and Bit(e) = 111. The result is shown in Figure 3-4. Note that item d is dropped since Bit(d) = 000, i.e., sup(d)TransSW = 0.
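The sliding step in this example can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function name `slide` is hypothetical, bit-sequences are stored as Python integers, and a mask is applied after the shift because Python integers are unbounded and would otherwise keep the shifted-out bit.

```python
W = 3  # sliding window size w (value taken from the running example)


def slide(bits, new_items, w=W):
    """Shift out the oldest transaction, add the new one, then apply Item-Prune."""
    mask = (1 << w) - 1  # keep only the w low-order bits after shifting
    # 1. Bitwise left shift removes the oldest transaction (leftmost bit).
    shifted = {x: (b << 1) & mask for x, b in bits.items()}
    # 2. Bit-sequence transform of the new transaction: set the rightmost bit.
    for x in new_items:
        shifted[x] = shifted.get(x, 0) | 1
    # 3. Item-Prune: drop every item whose support in the window is 0.
    return {x: b for x, b in shifted.items() if b != 0}


trans_sw1 = {"a": 0b101, "b": 0b011, "c": 0b111, "d": 0b100, "e": 0b011}
trans_sw2 = slide(trans_sw1, "be")  # process <T4, (be)>
for item in sorted(trans_sw2):
    print(f"Bit({item}) = {trans_sw2[item]:03b}")
```

Item d no longer appears in the result because its bit-sequence becomes 000 after the shift.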
Algorithm MFI-TransSW
Input: TDS (a transaction data stream), s (a user-defined minimum support threshold in the range [0, 1]), and w (the user-specified sliding window size).
Output: a set of frequent itemsets, FI-Output.
Begin
  TransSW = NULL; /* TransSW consists of w transactions */
  Repeat:
    for each incoming transaction Ti in TDS do
      if TransSW = FULL then
        Do bitwise left shift on bit-sequences of all items in TransSW;
      end if
      for each item X in Ti do
        Do bit-sequence transform(X);
      end for
    end for
    for each bit-sequence Bit(X) in TransSW do
      if sup(X)TransSW = 0 then
        Drop X from TransSW;
      end if
    end for
    /* The following is the frequent itemsets generation phase. The phase is performed only when requested by users. */
    FI1 = {frequent 1-itemsets};
    for (k = 2; FIk−1 ≠ NULL; k++) do
      CIk = CIGA(FIk−1);
      Do bitwise AND to find the supports of CIk;
      FIk = {ck ∈ CIk | sup(ck)TransSW ≥ w⋅s};
    end for
    FI-Output = ∪k FIk;
End
Figure 3-5. Algorithm MFI-TransSW
Figure 3-6. Steps of frequent itemsets generation in TransSW2
3.3.2.3 Frequent Itemsets Generation Phase
The phase is performed only when the up-to-date set of frequent itemsets is requested. In this phase, the MFI-TransSW algorithm uses a level-wise method to generate the set of candidate itemsets CIk (candidate itemsets with k items) from the already known frequent itemsets FIk−1 (frequent itemsets with k-1 items) according to the Apriori property [3] (see footnote 1). This step is called CIGA (Candidate Itemset Generation using Apriori property). Then, the proposed algorithm uses the bitwise AND operation to compute the support (the number of 1 bits) of these candidates in order to find the frequent k-itemsets FIk. The candidate-generation-then-testing process stops when no new candidates with k+1 items (CIk+1) are generated. The MFI-TransSW algorithm is shown in Figure 3-5.
Footnote 1: The Apriori property is a downward closure property, i.e., if a pattern is frequent, all of its sub-patterns will also be frequent.
For instance, consider the bit-sequences of TransSW2 in Figure 3-4, and let the minimum support threshold s be 0.6. Hence, an itemset X is frequent if sup(X)TransSW ≥ 0.6⋅3 = 1.8. In the following, we discuss the steps of frequent itemset mining in TransSW2. The generated patterns are shown in Figure 3-2.
First, the MFI-TransSW algorithm generates three candidate 2-itemsets, (bc), (be), and (ce), by combining the frequent 1-itemsets (b), (c), and (e), where Bit(b) = 111, i.e., sup(b) = 3, Bit(c) = 110, i.e., sup(c) = 2, and Bit(e) = 111, i.e., sup(e) = 3. The 1-itemset (a) is infrequent, since Bit(a) = 010, i.e., sup(a) = 1 < 1.8. Bitwise AND operations are then used to count the supports of the candidates, and all three turn out to be frequent. Because Bit(bc) = Bit(b) AND Bit(c) = 110, the support of the candidate 2-itemset (bc) is 2, i.e., sup(bc) = 2. Similarly, sup(be) = 3 and sup(ce) = 2. Second, MFI-TransSW generates one candidate 3-itemset (bce) according to the Apriori property and uses the bitwise AND operation to obtain sup(bce) = 2, i.e., Bit(bc) AND Bit(be) AND Bit(ce) = 110. Because no new candidates are generated, the generation-then-test process stops. Hence, there are seven frequent itemsets, (b), (c), (e), (bc), (be), (ce), and (bce), generated by the MFI-TransSW algorithm in TransSW2. The process is shown in Figure 3-6.
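The level-wise generation-then-test process walked through above can be sketched in Python. This is an illustrative sketch and not the authors' implementation: the names `support` and `mine_frequent_itemsets` are hypothetical, the candidate step is a plain Apriori-style join and subset check standing in for CIGA, and the support of a candidate is obtained by ANDing the bit-sequences of its individual items, which gives the same result as ANDing the bit-sequences of its frequent sub-itemsets.

```python
from itertools import combinations


def support(bit_seq):
    """Support = number of 1 bits in the bit-sequence."""
    return bin(bit_seq).count("1")


def mine_frequent_itemsets(bits, w, s):
    """Return {itemset: bit-sequence} for every itemset with support >= w*s."""
    min_count = w * s
    # FI_1: frequent 1-itemsets.
    level = {frozenset([x]): b for x, b in bits.items() if support(b) >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets and keep a k-itemset
        # only if all of its (k-1)-subsets are frequent (Apriori property).
        candidates = set()
        for p, q in combinations(list(level), 2):
            union = p | q
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Support counting with bitwise AND.
        next_level = {}
        for cand in candidates:
            seq = (1 << w) - 1  # start from all ones
            for x in cand:
                seq &= bits[x]
            if support(seq) >= min_count:
                next_level[cand] = seq
        frequent.update(next_level)
        level = next_level
        k += 1
    return frequent


trans_sw2 = {"a": 0b010, "b": 0b111, "c": 0b110, "e": 0b111}
frequent = mine_frequent_itemsets(trans_sw2, w=3, s=0.6)
for itemset in sorted(frequent, key=lambda i: (len(i), sorted(i))):
    name = "".join(sorted(itemset))
    print(f"({name}): Bit = {frequent[itemset]:03b}, sup = {support(frequent[itemset])}")
```

The loop ends when a level produces no frequent itemsets, which mirrors the stopping condition of the generation-then-test process.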