

Chapter 3 Online Mining of Frequent Itemsets over Stream Sliding Windows

3.3 The Proposed Algorithm: MFI-TransSW

In this section, we propose an efficient single-pass algorithm, called MFI-TransSW (Mining Frequent Itemsets over a Transaction-sensitive Sliding Window), to mine the set of all frequent itemsets in data streams with a transaction-sensitive sliding window. The algorithm uses a bit-sequence representation of items to reduce the time and memory needed to slide the window.

3.3.1 Bit-Sequence Representation

In the proposed MFI-TransSW algorithm, for each item X in the current transaction-sensitive sliding window TransSW, a bit-sequence with w bits, denoted as Bit(X), is constructed. If an item X is in the i-th transaction of the current TransSW, the i-th bit of Bit(X) is set to 1; otherwise, it is set to 0. This process is called bit-sequence transform.

For example, in Figure 3-2, the first sliding window TransSW1 consists of three transactions, <T1, (acd)>, <T2, (bce)>, and <T3, (abce)>, while the second sliding window TransSW2 consists of <T2, (bce)>, <T3, (abce)>, and <T4, (be)>. Because item a appears in the 1st and 3rd transactions of TransSW1, the bit-sequence of a, Bit(a), is 101. Similarly, Bit(b) = 011, Bit(c) = 111, Bit(d) = 100, and Bit(e) = 011.
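The bit-sequence transform can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names `bit_sequences` and `show` are hypothetical, and storing each Bit(X) as an integer whose leftmost bit stands for the oldest transaction is an assumption made so that the printed strings match the example above.

```python
def bit_sequences(window):
    """Map each item to a w-bit integer; the i-th bit from the left marks the i-th transaction."""
    w = len(window)
    bits = {}
    for i, transaction in enumerate(window):  # i = 0 is the oldest transaction
        for item in transaction:
            # Set the bit for this transaction, counting from the most significant bit.
            bits[item] = bits.get(item, 0) | (1 << (w - 1 - i))
    return bits


def show(bits, w):
    """Render every bit-sequence as a zero-padded binary string of length w."""
    return {item: format(b, f"0{w}b") for item, b in sorted(bits.items())}


# TransSW1 from the example: T1 = (acd), T2 = (bce), T3 = (abce).
trans_sw1 = ["acd", "bce", "abce"]
result = show(bit_sequences(trans_sw1), 3)
print(result)
```

With TransSW1 as input, the sketch yields the same five bit-sequences as the worked example.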


3.3.2 The MFI-TransSW Algorithm

The MFI-TransSW algorithm consists of three phases: the window initialization phase, the window sliding phase, and the frequent itemsets generation phase.

3.3.2.1 Window Initialization Phase

The phase is activated while the number of transactions generated so far in a transaction data stream is less than or equal to a user-predefined sliding window size w. In this phase, each item in the new incoming transaction is transformed into its bit-sequence representation.

For instance, in Figure 3-3, the first sliding window TransSW1 contains three transactions: T1, T2, and T3. The bit-sequences of items of TransSW1 in the window initialization phase are shown in Figure 3-4.

tid   Items    Bit-sequences in current TransSW1
T1    (acd)    Bit(a) = 100, Bit(c) = 100, Bit(d) = 100
T2    (bce)    Bit(a) = 100, Bit(c) = 110, Bit(d) = 100, Bit(b) = 010, Bit(e) = 010
T3    (abce)   Bit(a) = 101, Bit(c) = 111, Bit(d) = 100, Bit(b) = 011, Bit(e) = 011

Figure 3-3. Bit-sequences of items in window initialization phase of TransSW

Window-id   Transactions                             Bit-sequences of items
TransSW1    <T1, (acd)>, <T2, (bce)>, <T3, (abce)>   Bit(a) = 101, Bit(c) = 111, Bit(d) = 100, Bit(b) = 011, Bit(e) = 011
TransSW2    <T2, (bce)>, <T3, (abce)>, <T4, (be)>    Bit(a) = 010, Bit(c) = 110, Bit(d) = 000, Bit(b) = 111, Bit(e) = 111

Figure 3-4. Bit-sequences of items after sliding TransSW1 to TransSW2
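The row-by-row growth of the bit-sequences during window initialization (Figure 3-3) can be reproduced with the following sketch. It assumes a fixed window size w = 3 and 1-based transaction positions. The helper name `add_during_initialization` and the `snapshots` list are hypothetical, introduced only for illustration.

```python
W = 3  # assumed sliding window size, as in the running example


def add_during_initialization(bits, transaction, position, w=W):
    """Set the bit of the position-th (1-based) transaction for every item it contains."""
    for item in transaction:
        # Items not seen before start from an all-zero bit-sequence.
        bits[item] = bits.get(item, 0) | (1 << (w - position))
    return bits


bits = {}
snapshots = []  # state of all bit-sequences after each incoming transaction
for position, transaction in enumerate(["acd", "bce", "abce"], start=1):
    add_during_initialization(bits, transaction, position)
    snapshots.append({item: format(b, f"0{W}b") for item, b in bits.items()})

for tid, snapshot in zip(["T1", "T2", "T3"], snapshots):
    print(tid, snapshot)
```

Each entry of `snapshots` corresponds to one row of Figure 3-3: items that have not yet appeared have no bit-sequence, and bits for transactions that have not yet arrived remain 0.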


3.3.2.2 Window Sliding Phase

The phase is activated after the sliding window TransSW becomes full. A new incoming transaction is appended to the sliding window, and the oldest transaction is removed from the window.

To remove the oldest transaction efficiently, MFI-TransSW exploits the bit-sequence representation: it applies a bitwise left shift to the bit-sequence of every item in the current sliding window, which discards the bit that belongs to the aged transaction.

After sliding the window, a pruning method called Item-Prune is applied to reduce memory usage: an item X in the current transaction-sensitive sliding window is dropped if sup(X)TransSW = 0, i.e., if Bit(X) contains no 1 bits.

For example, in Figure 3-2, before the fourth transaction <T4, (be)> is processed, the first transaction T1 must be removed from the current window using bitwise left shift on the set of items. Hence, Bit(a) is modified from 101 to 010. Similarly, Bit(c)=110, Bit(d)=000, Bit(b)=110, and Bit(e)=110. Then, the new transaction <T4, (be)> is processed by bit-sequence transform. The result is shown in Figure 3-4. Note that item d is dropped since Bit(d)=000, i.e., sup(d)TransSW = 0.
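The slide from TransSW1 to TransSW2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `slide` is hypothetical, bit-sequences are stored as Python integers, and an explicit w-bit mask is added because Python integers do not overflow on a left shift.

```python
def slide(bits, new_transaction, w):
    """Slide the window by one transaction: left shift, bit-sequence transform, then Item-Prune."""
    mask = (1 << w) - 1  # keeps only the w low-order bits after the shift
    # The left shift drops the oldest transaction (leftmost bit) and frees the rightmost bit.
    shifted = {item: (b << 1) & mask for item, b in bits.items()}
    # Bit-sequence transform of the new transaction: set the rightmost (newest) bit.
    for item in new_transaction:
        shifted[item] = shifted.get(item, 0) | 1
    # Item-Prune: drop every item whose support in the window is now 0.
    return {item: b for item, b in shifted.items() if b != 0}


trans_sw1_bits = {"a": 0b101, "b": 0b011, "c": 0b111, "d": 0b100, "e": 0b011}
trans_sw2_bits = slide(trans_sw1_bits, "be", w=3)
print({item: format(b, "03b") for item, b in trans_sw2_bits.items()})
```

Applied to the bit-sequences of TransSW1 and the new transaction <T4, (be)>, the sketch leaves a, b, c, and e in the window and drops d, in line with the example above.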

Algorithm MFI-TransSW

Input: TDS (a transaction data stream), s (a user-defined minimum support threshold in the range of [0, 1]), and w (the user-specified sliding window size).

Output: a set of frequent itemsets, FI-Output.

Begin
    TransSW = NULL; /* TransSW consists of w transactions */
    Repeat:
        for each incoming transaction Ti in TDS do
            if TransSW = FULL then
                Do bitwise left shift on the bit-sequences of all items in TransSW;
            end if
            for each item X in Ti do
                Do bit-sequence transform(X);
            end for
        end for
        for each bit-sequence Bit(X) in TransSW do
            if sup(X) = 0 then
                Drop X from TransSW;
            end if
        end for
        /* The following is the frequent itemsets generation phase. The phase is performed only when requested by users. */
        FI1 = {frequent 1-itemsets};
        for (k = 2; FIk−1 ≠ NULL; k++) do
            CIk = CIGA(FIk−1);
            Do bitwise AND to find the supports of CIk;
            FIk = {ck ∈ CIk | sup(ck)TransSW ≥ w⋅s};
        end for
        FI-Output = ∪k FIk;
End

Figure 3-5. Algorithm MFI-TransSW


Figure 3-6. Steps of frequent itemsets generation in TransSW2

3.3.2.3 Frequent Itemsets Generation Phase

The phase is performed only when the up-to-date set of frequent itemsets is requested. In this phase, the MFI-TransSW algorithm uses a level-wise method to generate the set of candidate itemsets CIk (candidate itemsets with k items) from the already known frequent itemsets FIk−1 (frequent itemsets with k−1 items) according to the Apriori property [3]. The Apriori property is a downward closure property: if a pattern is frequent, all of its sub-patterns are also frequent. This step is called CIGA (Candidate Itemset Generation using Apriori property). Then, the proposed algorithm uses the bitwise AND operation to compute the support (the number of 1 bits) of these candidates in order to find the frequent k-itemsets FIk. The candidate-generation-then-testing process stops when no new candidates with k+1 items (CIk+1) are generated. The MFI-TransSW algorithm is shown in Figure 3-5.

For instance, consider the bit-sequences of TransSW2 in Figure 3-4, and let the minimum support threshold s be 0.6. Hence, an itemset X is frequent if sup(X)TransSW ≥ 0.6⋅3 = 1.8, i.e., if X appears in at least two of the three transactions. In the following, we discuss the steps of frequent itemset mining in TransSW2. The generated patterns are shown in Figure 3-2.

First, the MFI-TransSW algorithm generates three candidate 2-itemsets, (bc), (be), and (ce), by combining the frequent 1-itemsets (b), (c), and (e), where Bit(b) = 111, i.e., sup(b) = 3, Bit(c) = 110, i.e., sup(c) = 2, and Bit(e) = 111, i.e., sup(e) = 3. The 1-itemset (a) is infrequent, since Bit(a) = 010, i.e., sup(a) = 1. Bitwise AND operations are then used to count the supports of these candidates, and all three turn out to be frequent. Because Bit(bc) = Bit(b) AND Bit(c) = 110, the support of the candidate 2-itemset (bc) is 2, i.e., sup(bc) = 2. Similarly, sup(be) = 3 and sup(ce) = 2. Second, MFI-TransSW generates one candidate 3-itemset (bce) according to the Apriori property and uses the bitwise AND operation to count its support: Bit(bc) AND Bit(be) AND Bit(ce) = 110, i.e., sup(bce) = 2. Because no new candidates are generated, the generation-then-test process stops. Hence, MFI-TransSW finds seven frequent itemsets in TransSW2: (b), (c), (e), (bc), (be), (ce), and (bce). The process is shown in Figure 3-6.
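The level-wise generation on TransSW2 can be checked with a short sketch. It is one possible reading of the CIGA step and the bitwise AND support counting, not the authors' implementation: the names `support` and `mine_frequent_itemsets` are hypothetical, candidates are formed by joining pairs of frequent (k−1)-itemsets, and the input assumes that item d has already been removed by Item-Prune.

```python
from itertools import combinations


def support(bitseq):
    """Support of an itemset is the number of 1 bits in its bit-sequence."""
    return bin(bitseq).count("1")


def mine_frequent_itemsets(bits, w, s):
    """Level-wise mining: join frequent (k-1)-itemsets, prune by Apriori, count with bitwise AND."""
    min_count = w * s
    # Level 1: frequent 1-itemsets taken directly from the item bit-sequences.
    level = {frozenset([x]): b for x, b in bits.items() if support(b) >= min_count}
    frequent = {}
    k = 2
    while level:
        frequent.update(level)
        candidates = {}
        for x, y in combinations(level, 2):
            union = x | y
            if len(union) != k:
                continue
            # Apriori property: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in level for sub in combinations(union, k - 1)):
                candidates[union] = level[x] & level[y]  # bitwise AND gives Bit(union)
        level = {c: b for c, b in candidates.items() if support(b) >= min_count}
        k += 1
    return {"".join(sorted(itemset)): support(b) for itemset, b in frequent.items()}


trans_sw2_bits = {"a": 0b010, "b": 0b111, "c": 0b110, "e": 0b111}
frequent_itemsets = mine_frequent_itemsets(trans_sw2_bits, w=3, s=0.6)
print(frequent_itemsets)
```

For TransSW2 with s = 0.6, the sketch reports the itemsets (b), (c), (e), (bc), (be), (ce), and (bce) together with their supports, and stops once no 4-itemset candidate can be formed.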