
Chapter 2 Problem Definition and Background

2.3 Definition of Mining Sequential Patterns

An input database DS contains customer-transactions, which differ slightly from the transactions in section 2.1.1. Each customer-transaction consists of the following fields: customer-id (CID), transaction-id (TID), and the items purchased in the transaction (called an itemset). The concepts of TID, items, and itemsets here are the same as in section 2.2. The difference is that each transaction in DS belongs to some customer. Figure 2-3 shows an example of the transaction database DS.

Customer ID (CID)   Transaction ID (TID)   Itemset
1                   1                      (a, b, d)
2                   2                      (b)
1                   3                      (b, c, d)
2                   4                      (a, b, c)
3                   5                      (a, b)
1                   6                      (b, c, d)
3                   7                      (b, c, d)

Fig 2-3. An example of an input database DS

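A sequence is an ordered list of itemsets and is denoted as $S = \langle s_1 s_2 s_3 \ldots s_k \rangle$, where $s_j$ is an itemset. A sequence $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ is contained in another sequence $\beta = \langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.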

All the transactions of a customer can be viewed as a sequence, where each transaction corresponds to a set of items, and the list of transactions, ordered by increasing transaction-id, corresponds to a sequence. We call such a sequence a customer-sequence. Figure 2-4 shows the customer-sequences in Figure 2-3.

CID   Sequence
1     <(a, b, d)(b, c, d)(b, c, d)>
2     <(b)(a, b, c)>
3     <(a, b)(b, c, d)>

Fig 2-4. The customer-sequences in Fig 2-3

The absolute support of a sequence S is defined as the number of customer-sequences containing S. Sequential patterns, also called frequent sequences, are the sequences whose support is more than a user-defined minimum support.
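The definitions above can be illustrated with a minimal Python sketch. It groups the rows of Fig 2-3 into customer-sequences, tests containment with a greedy left-to-right scan, and counts absolute support. The function names (`customer_sequences`, `pattern`, `contains`, `absolute_support`) are hypothetical helpers chosen for this illustration and are not part of the original work.

```python
from collections import defaultdict

# Rows of Fig 2-3 as (CID, TID, items); each character stands for one item.
DS = [(1, 1, "abd"), (2, 2, "b"), (1, 3, "bcd"), (2, 4, "abc"),
      (3, 5, "ab"), (1, 6, "bcd"), (3, 7, "bcd")]

def customer_sequences(rows):
    """Group transactions by customer, ordered by increasing TID."""
    seqs = defaultdict(list)
    for cid, _tid, items in sorted(rows, key=lambda r: r[1]):
        seqs[cid].append(frozenset(items))
    return dict(seqs)

def pattern(*itemsets):
    """Build a sequence from strings such as "ab" (the itemset {a, b})."""
    return [frozenset(s) for s in itemsets]

def contains(beta, alpha):
    """True if alpha is contained in beta.

    Each itemset of alpha is matched to the earliest later itemset of beta
    that is a superset of it; taking the earliest match never rules out a
    valid assignment for the remaining itemsets.
    """
    pos = 0
    for a in alpha:
        while pos < len(beta) and not a <= beta[pos]:
            pos += 1
        if pos == len(beta):
            return False
        pos += 1
    return True

def absolute_support(seqs, alpha):
    """Number of customer-sequences that contain alpha."""
    return sum(contains(beta, alpha) for beta in seqs.values())

seqs = customer_sequences(DS)
print(absolute_support(seqs, pattern("b", "c")))   # 3: every customer buys b, then c
```

With a minimum support of 2, for example, the sequence <(a)(d)> has support 2 (customers 1 and 3) and so does not exceed the threshold, while <(b)(c)> does.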

Chapter 3

New-Moment: Mining Closed Frequent Itemsets

The goal of New-Moment is to improve the Moment algorithm. We first introduce the Moment algorithm in section 3.1 and then our proposed algorithm, New-Moment, in section 3.2.

3.1 Related Work: Moment Algorithm

The Moment algorithm [20, 21] mines closed frequent itemsets over a data stream under the sliding window model. It uses a closed enumeration tree (CET) to maintain the closed frequent itemsets in the current window. Besides closed frequent itemsets, the CET also maintains some boundary tree nodes. Figure 3-1 shows the CET in the first window. Assume that the window size is 4; the first four incoming transactions are listed on the left of the figure.

[Figure 3-1 diagram not reproduced. Its legend distinguishes node types, including "Infrequent gateway node" and "Closed node"; one surviving node label is (a, b, c): 2.]

Fig 3-1. CET in the first sliding window

There are four types of tree nodes for CET:

(1) infrequent gateway nodes

A node nI that represents itemset I is an infrequent gateway node if i) I is an infrequent itemset, ii) nI’s parent, nJ, is frequent, and iii) I is the result of joining I’s parent, J, with one of J’s frequent siblings. In Figure 3-1, the tree node (d) is an infrequent gateway node.

(2) unpromising gateway nodes

A node nI is an unpromising gateway node if i) I is a frequent itemset and ii) there exists a closed frequent itemset J such that J ⊃ I and J has the same support as I does. In Figure 3-1, the tree nodes (a, c) and (b) are unpromising gateway nodes.

(3) intermediate nodes

A node nI is an intermediate node if i) I is a frequent itemset, ii) nI has a child node nJ such that J has the same support as I does, and iii) nI is not an unpromising gateway node.

In Figure 3-1, the tree node (a) is an intermediate node because its child (a, b) has the same support as (a) does.

(4) closed nodes

These nodes represent closed frequent itemsets in the current window. A closed node can be an internal node or a leaf node. In Figure 3-1, (c), (a, b), and (a, b, c) are closed nodes.

Besides closed nodes, Moment keeps three types of boundary nodes. These nodes are the most likely candidates to become new closed nodes in the next window. Moment keeps them to speed up modification of the closed enumeration tree.

There are three steps in the Moment algorithm:

(1) Building the closed enumeration tree (CET)

When the total number of transactions coming from the data stream does not exceed the window size N, Moment just saves these transactions in its sliding window. As soon as the window is full, Moment builds an initial closed enumeration tree (CET). Figure 3-1 shows the tree in the first window.

Moment adopts a depth-first procedure to generate all possible candidate itemsets in the window and check their supports. In the procedure, if a node is found to be infrequent, it is marked as an infrequent gateway node and Moment does not explore its descendants further.

If a node is a frequent itemset but not a closed frequent itemset, it is marked as an unpromising gateway node. Moment does not explore its descendants either, because they do not contain any closed frequent itemsets. To check whether a node is a closed node, Moment uses the support of the node and the sum of the tids of the transactions that contain it (tid_sum). Take the nodes (a, c) and (a, b, c) in Figure 3-1 as an example. The support of (a, c) is the same as that of (a, b, c). The tid_sum of (a, c) is 7 (the third and the fourth transactions in the window), which is equal to the tid_sum of (a, b, c). The two itemsets therefore occur in exactly the same transactions, so by the definition of closed frequent itemsets, (a, c) is not a closed node.
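The support and tid_sum comparison can be sketched in a few lines of Python. This is only an illustration: the window contents below are an assumption, taken from the transaction listing that accompanies the later figures of the running example (TIDs 1 to 4), and the helper names are hypothetical.

```python
# Assumed contents of the first window (TID -> itemset).
window = {
    1: {"c", "d"},
    2: {"a", "b"},
    3: {"a", "b", "c"},
    4: {"a", "b", "c"},
}

def support_and_tid_sum(itemset, window):
    """Return (support, tid_sum) of an itemset in the window."""
    tids = [tid for tid, items in window.items() if itemset <= items]
    return len(tids), sum(tids)

def occurs_in_same_transactions(subset, superset, window):
    """For subset ⊂ superset, equal (support, tid_sum) pairs indicate that
    both itemsets occur in the same transactions, so subset is not closed."""
    return support_and_tid_sum(subset, window) == support_and_tid_sum(superset, window)

print(support_and_tid_sum({"a", "c"}, window))        # (2, 7): TIDs 3 and 4
print(support_and_tid_sum({"a", "b", "c"}, window))   # (2, 7): TIDs 3 and 4
```

Because every transaction that contains the superset also contains the subset, equal supports already force the two tid sets to coincide; the tid_sum serves as an additional inexpensive key when candidates are looked up.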

If a node is found to be neither an infrequent gateway node nor an unpromising gateway node, Moment explores its descendants. The nodes that are intermediate nodes or closed nodes are maintained in the CET.

(2) Updating the CET

The initial closed enumeration tree is built when the number of incoming transactions from the data stream is equal to the window size. After that, whenever a new transaction comes from the data stream, Moment updates the CET to maintain the closed frequent itemsets in the current window. There are two steps for updating the CET:

Adding the new transaction coming from the data stream

Fig 3-2. Adding the new transaction with tid = 5

In Figure 3-2, a new transaction T (tid = 5) is added to the sliding window. Moment traverses the parts of the CET that are related to transaction T. For each related node nI in depth-first order, Moment updates its support and tid_sum. Whenever a node is updated, Moment checks if it needs to change its node type.

In Figure 3-2, the node (d) becomes a new frequent node, so Moment generates the new candidate nodes (a, d) and (c, d). From the node properties, Moment knows that (a, d) is an infrequent gateway node and (c, d) is a new closed node. By checking the supports of the nodes (a), (a, c), and (c), Moment modifies them to closed nodes.

Deleting the oldest transaction in the window

In Figure 3-3, the transaction with tid = 1 is deleted. As when adding the new transaction, Moment updates the support and tid_sum of each related node in the CET. By checking the support of each node, Moment modifies its node type.

In Figure 3-3, node (c) becomes an unpromising gateway node because it is contained in node (a, c) and the supports of (c) and (a, c) are the same. The sub-tree of node (c), namely (c, d), is then deleted. The node (d) becomes a new infrequent gateway node.

Moment maintains a huge number of boundary nodes to speed up the procedure of updating the CET, which lowers the cost for a node to change its type. However, we find that those boundary nodes are unnecessary overhead. In our proposed algorithm, New-Moment, we reduce the number of tree nodes and use an efficient structure to store the information of the sliding window.

3.2 Our Proposed Algorithm: New-Moment Algorithm

We use bit-vectors to store the information of a sliding window. Because bit-vectors make it efficient to count supports and to modify the transactions in the window, New-Moment maintains only the closed frequent itemsets in each sliding window. The new closed enumeration tree (New-CET) is composed of the bit-vectors of 1-itemsets, the closed frequent itemsets in the current sliding window, and a hash table.

3.2.1 Bit-Vector

Definition of Bit-Vector: For a specified item i and a given window w of the sliding window model in a data stream, a bit-vector is used to store the occurrences of item i in the transactions of w. Each bit of a bit-vector represents a transaction in w. If the item i occurs in a transaction of w, the corresponding bit is set to one; otherwise it is set to zero.

Figure 3-4 shows an example input database, with the first three sliding windows displayed next to it. These windows are marked from window #1 to window #3. The size of the sliding window is assumed to be 4. The example of Figure 3-4 is used throughout the rest of this chapter.

Fig 3-4. An example database and the first three sliding windows

By the definition of bit-vector, each window in Figure 3-4 can be transformed into one bit-vector per item.

The bit-vectors of all items in each window are listed in Table 3-1. The leftmost bit represents the oldest transaction and the rightmost bit represents the most recent transaction.

Item   Window #1   Window #2   Window #3
a      0111        1111        1110
b      0111        1110        1101
c      1011        0111        1111
d      1000        0001        0010

Table 3-1. The bit-vectors of all items in each window in Figure 3-4
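As an illustration, the following Python sketch builds the item bit-vectors of a window and reproduces Table 3-1. The transaction stream is an assumption: TIDs 1 to 5 follow the transaction listing shown with the later figures of this chapter, and TID 6 = (b, c) is inferred from the Window #3 column of Table 3-1. The helper names are hypothetical.

```python
# Assumed transaction stream for TID = 1..6 (see the lead-in for its origin).
stream = [
    {"c", "d"},        # TID 1
    {"a", "b"},        # TID 2
    {"a", "b", "c"},   # TID 3
    {"a", "b", "c"},   # TID 4
    {"a", "c", "d"},   # TID 5
    {"b", "c"},        # TID 6 (inferred from Table 3-1)
]

def item_bit_vectors(window, items):
    """One integer bit-vector per item; the leftmost (most significant)
    bit corresponds to the oldest transaction in the window."""
    vectors = {}
    for item in items:
        bits = 0
        for transaction in window:          # oldest transaction first
            bits = (bits << 1) | int(item in transaction)
        vectors[item] = bits
    return vectors

def fmt(bits, n):
    """Render a bit-vector as an n-character binary string."""
    return format(bits, f"0{n}b")

N = 4
for start in range(3):                      # windows #1, #2, #3
    vectors = item_bit_vectors(stream[start:start + N], "abcd")
    print(f"Window #{start + 1}:", {i: fmt(b, N) for i, b in vectors.items()})
```

Storing each vector as a plain integer keeps the later operations (shifting, bitwise AND, counting ones) to single machine-level operations for small windows.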

3.2.2 Window Sliding with Bit-Vector

When the number of transactions in a data stream exceeds the size of a window, the window has to slide. Bit-vectors make the window sliding process efficient. We can separate the sliding process into two steps:

(1) Delete the oldest transaction

The only thing a bit-vector needs to do is to left-shift by one bit. Take item a as an example.

By Table 3-1, a’s bit-vector is 0111 in the first window. When the transaction with TID = 1 is deleted, a’s bit-vector becomes 1110. Now the leftmost bit represents the transaction with TID = 2, and the rightmost bit is meaningless and reserved for the next step.

(2) Append the incoming transaction

After the oldest transaction is deleted, the rightmost bit of each bit-vector is set according to the incoming transaction. The bit-vectors of the items contained in the incoming transaction set their rightmost bit to one; the others set their rightmost bit to zero. Take items a and b as examples.

a’s bit-vector is 1110 after deleting the oldest transaction. The incoming transaction (a, c, d) (TID = 5) contains a, so a’s bit-vector becomes 1111. b’s bit-vector is also 1110 after deleting the oldest transaction. The incoming transaction does not contain b, so b’s bit-vector remains 1110.
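The two sliding steps can be sketched as one small Python function operating on integer bit-vectors. The values used are the Window #1 column of Table 3-1, and the incoming transaction (a, c, d) with TID = 5 is taken from the running example; the function name `slide` is a hypothetical helper.

```python
def slide(bits, n, item_in_incoming):
    """Slide one item's bit-vector over a window of size n.

    Step 1: left-shift by one bit and mask to n bits, dropping the oldest transaction.
    Step 2: set the rightmost bit if the incoming transaction contains the item.
    """
    shifted = (bits << 1) & ((1 << n) - 1)
    return shifted | int(item_in_incoming)

N = 4
window1 = {"a": 0b0111, "b": 0b0111, "c": 0b1011, "d": 0b1000}   # Table 3-1, Window #1
incoming = {"a", "c", "d"}                                        # transaction with TID = 5

window2 = {item: slide(bits, N, item in incoming) for item, bits in window1.items()}
print({item: format(bits, "04b") for item, bits in window2.items()})
```

The mask `(1 << n) - 1` is needed because Python integers are unbounded; in a fixed-width representation the shifted-out bit would simply be lost.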

3.2.3 Counting Support with Bit-Vector

The concept of a bit-vector can be extended to itemsets. Assume there are two itemsets X and Y with corresponding bit-vectors BITX and BITY. The bit-vector of the itemset Z = X ∪ Y can be obtained by a bitwise AND of BITX and BITY.

For example, the bit-vector of itemset (a, b) in the first window (Window #1) is 0111, which is obtained by a bitwise AND of the bit-vectors of items a (0111) and b (0111). That means (a, b) occurs in the transactions with TID = 2, TID = 3, and TID = 4, that is, the second, third, and fourth transactions in the first sliding window.

The support of each itemset can be obtained by counting how many bits in its bit-vector are set to one. For example, the support of itemset (a, b) is 3.
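A minimal Python sketch of support counting with bit-vectors follows, using the Window #1 column of Table 3-1 as input. The helper names `itemset_bits` and `support` are hypothetical.

```python
from functools import reduce
from operator import and_

# Item bit-vectors of Window #1 (Table 3-1); leftmost bit = oldest transaction.
window1 = {"a": 0b0111, "b": 0b0111, "c": 0b1011, "d": 0b1000}

def itemset_bits(itemset, item_vectors):
    """Bit-vector of a non-empty itemset: bitwise AND of its items' bit-vectors."""
    return reduce(and_, (item_vectors[item] for item in itemset))

def support(itemset, item_vectors):
    """Support of an itemset: number of bits set to one in its bit-vector."""
    return bin(itemset_bits(itemset, item_vectors)).count("1")

for itemset in ("ab", "ac", "abc"):
    bits = itemset_bits(itemset, window1)
    print(itemset, format(bits, "04b"), support(itemset, window1))
```

For Window #1 this yields support 3 for (a, b) and support 2 for both (a, c) and (a, b, c), which agrees with the values used when the New-CET is built in section 3.2.4.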

3.2.4 Building the New Closed Enumeration Tree (New-CET)

To improve the efficiency of the CET in Moment, we propose a new closed enumeration tree (New-CET). New-CET is basically a lexicographical tree. There are three important parts in New-CET:

(1) Bit-vectors of all items (1-itemsets)

Moment maintains an independent sliding window for counting the support of each node in the CET. Instead of using an independent sliding window to store the current N transactions, New-Moment maintains the information of these transactions in the bit-vectors of all items.

(2) Closed frequent itemsets in current window

Each closed frequent itemset only maintains its support.

(3) Hash table

To check whether a frequent itemset is closed, we need a hash table that stores all closed frequent itemsets with their supports as keys. Whenever a new frequent itemset is generated, we can judge whether it is closed by hashing its support into the hash table. How the support information is used to judge whether a frequent itemset is closed is introduced in section 2.2.

Building New-CET is almost the same as building CET. The major difference is that New-CET retains only the bit-vectors of items and the closed frequent itemsets, and that bit-vectors are used to count the supports of generated candidates.

When the total number of incoming transactions is less than the size of the sliding window, New-Moment only records the item information as introduced in section 3.2.1. When the window becomes full, New-Moment traverses the lexicographical tree in depth-first order to generate all possible candidates and check their supports. Because each candidate is generated from its parent and its parent’s frequent siblings, we can obtain its support by the method introduced in section 3.2.3. Then, for each frequent candidate, we use the hash table to check whether it is closed. If the candidate is closed, it is inserted into the hash table. If the candidate is not closed, the node is not maintained in New-CET. Figure 3-5 shows the pseudo code of building New-CET.

Build (nI, N, S)
  if support(nI) ≥ S · N then
    if leftcheck(nI) = false then
      foreach frequent sibling nK of nI do
        generate a new child nIK for nI;
        bitwise AND BITI and BITK to obtain BITIK;
      foreach child nI′ of nI do
        Build(nI′, N, S);
      if ∄ a child nI′ of nI such that support(nI′) = support(nI) then
        retain nI as a closed frequent itemset;
        insert nI into the hash table;

Fig 3-5. Pseudo code of building New-CET

Here nI is a tree node, N is the window size, and S is the minimum support. Each nI has a corresponding bit-vector BITI that stores the information of the sliding window. Apart from the bit-vectors of items, the BITI of a node nI exists only while the support of a new candidate is being counted.
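The Build procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `leftcheck` is not defined in this section, so it is assumed here to return true when the hash table already holds a closed itemset with the same support that is a proper superset of the candidate; siblings are taken to be the nodes to the right in lexicographic order; and the minimum support is passed as an absolute count (`min_count`, corresponding to S · N).

```python
def popcount(bits):
    return bin(bits).count("1")

def build_closed(item_vectors, min_count):
    """Return {itemset (tuple): support} for the closed frequent itemsets."""
    closed = {}
    by_support = {}   # hash table: support -> closed itemsets with that support

    def leftcheck(itemset, support):
        # Assumed semantics: an already-found closed superset has the same support.
        items = set(itemset)
        return any(items < set(other) for other in by_support.get(support, []))

    def build(itemset, bits, right_siblings):
        support = popcount(bits)
        if support < min_count or leftcheck(itemset, support):
            return
        # Join with each frequent sibling; the child's bit-vector is the bitwise AND.
        children = [(itemset + sib_items[-1:], bits & sib_bits)
                    for sib_items, sib_bits in right_siblings
                    if popcount(sib_bits) >= min_count]
        for i, (child_items, child_bits) in enumerate(children):
            build(child_items, child_bits, children[i + 1:])
        # Closed if no child has the same support as the node itself.
        if all(popcount(child_bits) != support for _, child_bits in children):
            closed[itemset] = support
            by_support.setdefault(support, []).append(itemset)

    roots = [((item,), bits) for item, bits in sorted(item_vectors.items())]
    for i, (itemset, bits) in enumerate(roots):
        build(itemset, bits, roots[i + 1:])
    return closed

# Window #1 of the running example (Table 3-1), Minsup = 2, window size = 4.
window1 = {"a": 0b0111, "b": 0b0111, "c": 0b1011, "d": 0b1000}
print(build_closed(window1, 2))
```

On Window #1 the sketch retains (a, b) with support 3, (a, b, c) with support 2, and (c) with support 3, and it discards (a, c) through the hash-table check. Candidate bit-vectors live only on the recursion stack, which mirrors the remark that BITI exists only while supports are being counted.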

Figure 3-6 shows the New-CET in the first window of the previous example when new candidates are generated from item a. For simplicity, the hash table is not displayed. From the bit-vectors of items, we know that items a, b, and c are frequent items. Taking item a as an example, new candidates (a, b) and (a, c) are generated. By a bitwise AND of the bit-vectors of items a and b, we obtain that the support of (a, b) is 3. In the same way, the support of (a, c) is 2 and the support of (a, b, c) is 2. For generating candidates below item a, the bit-vectors of (a, b), (a, c), and (a, b, c) are temporarily kept in memory.

Window #1 (Minsup = 2, Window Size = 4)

(a): <0111>   (b): <0111>   (c): <1011>   (d): <1000>
  (a, b): <0111>
    (a, b, c): <0011>
  (a, c): <0011>

Fig 3-6. New-CET in the first window after generating new candidates from item a

Figure 3-7 shows the New-CET after checking whether each frequent candidate is closed. The tree nodes with squares are closed frequent itemsets. By checking supports against the hash table, we know that frequent itemset (a, c) is not closed, so New-Moment eliminates this node; the other frequent candidates are marked as closed frequent itemsets. Although item a is not closed, New-Moment still maintains the bit-vector of item a. After the sub-tree of item a is checked, the bit-vectors in this sub-tree are eliminated. New-Moment only keeps the supports of closed frequent itemsets.

Window #1 (Minsup = 2, Window Size = 4)

(a): <0111>   (b): <0111>   (c): <1011>   (d): <1000>
  (a, b): 3
    (a, b, c): 2

Figure 3-8 shows the New-CET when Build is done. The sub-trees of items b and c are generated in the same way as the sub-tree of item a. Item c is a new closed frequent itemset.

Window #1 (Minsup = 2, Window Size = 4)

(a): <0111>   (b): <0111>   (c): <1011>   (d): <1000>
  (a, b): 3
    (a, b, c): 2

Fig 3-8. New-CET in the first window (Window #1)

3.2.5 Deleting the Oldest Transaction in Window Sliding

Deleting the oldest transaction is the first step of window sliding. All bit-vectors of items are first left-shifted by one bit, and the items in the deleted transaction are recorded. These items can be identified by observing the leftmost bit of each bit-vector before left-shifting. After the bit-vectors of items are modified, New-Moment begins to modify the New-CET.
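The bookkeeping described above can be sketched in Python: the items of the deleted transaction are read off the leftmost bits before the shift, and every bit-vector is then left-shifted. The function name `delete_oldest` is a hypothetical helper, and the input is the Window #1 column of Table 3-1.

```python
def delete_oldest(item_vectors, n):
    """Return (items of the deleted transaction, shifted bit-vectors).

    An item belongs to the oldest transaction exactly when the leftmost
    of the n bits in its bit-vector is set before the shift.
    """
    top_bit = 1 << (n - 1)
    mask = (1 << n) - 1
    deleted_items = sorted(item for item, bits in item_vectors.items()
                           if bits & top_bit)
    shifted = {item: (bits << 1) & mask for item, bits in item_vectors.items()}
    return deleted_items, shifted

window1 = {"a": 0b0111, "b": 0b0111, "c": 0b1011, "d": 0b1000}
deleted, shifted = delete_oldest(window1, 4)
print(deleted)                                           # items whose sub-trees must be re-checked
print({i: format(b, "04b") for i, b in shifted.items()})
```

For Window #1 the deleted items are c and d, so only the sub-trees rooted at c and d need to be revisited by the subsequent traversal.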

Deleting the oldest transaction can cause only one kind of change: closed frequent itemsets in the New-CET become non-closed frequent itemsets or infrequent itemsets. To detect this, New-Moment traverses the New-CET again to check the supports of the existing nodes. Because only subsets of the deleted transaction can become infrequent, only the sub-trees of the items in the deleted transaction need to be checked. The traversal, called function Delete, is almost the same as building the initial New-CET. The difference is that Delete generates the entire lexicographical tree including the itemsets whose supports are (S · N – 1). This is because some closed frequent itemsets whose supports were (S · N) in the previous window have support (S · N – 1) after the deletion.

Figure 3-9 shows the New-CET after deleting the oldest transaction. In the above example, the deleted transaction is (c, d). Only the sub-trees of items c and d need to be checked. We find that item c is no longer a closed frequent itemset. Item d is infrequent and we do not need to check its sub-tree. Figure 3-10 shows the pseudo code of deleting the oldest transaction after left-shifting all bit-vectors of 1-itemsets.

TID   Itemset
1     c, d
2     a, b
3     a, b, c
4     a, b, c

Before left-shifting: (a): <0111>   (b): <0111>   (c): <1011>   (d): <1000>
  (a, b): 3
    (a, b, c): 2

After left-shifting: (a): <1110>   (b): <1110>   (d): <0000>

Minsup = 2 Window Size = 4

Fig 3-9. New-CET after deleting the oldest transaction

Delete (nI, N, S)
  if nI is not relevant to the deleted transaction then
    return;
  else if support(nI) ≥ (S · N – 1) then
    foreach sibling nK of nI whose support ≥ (S · N – 1) do
      generate a new child nIK for nI;
      bitwise AND BITI and BITK to obtain BITIK;
    foreach child nI′ of nI do
      Delete(nI′, N, S);
    if support(nI) ≥ S · N then
      if leftcheck(nI) = false then
        if nI is closed frequent itemset in previous sliding window then
          update the support of nI;
          update nI in the hash table;
        else
          retain nI as a closed frequent itemset;
          insert nI into the hash table;
      else //leftcheck(nI) = true
        if nI is closed frequent itemset in previous sliding window then
          mark nI as non-closed frequent itemset;
          eliminate nI from the hash table;
    else //support(nI) < S · N
      if nI is closed frequent itemset in previous sliding window then
        mark nI as non-closed itemset;
        eliminate nI from the hash table;

Fig 3-10. Pseudo code of deleting the oldest transaction in window sliding

3.2.6 Appending the Incoming Transaction in Window Sliding

Appending the incoming transaction is the second step of window sliding. The rightmost bit of each item’s bit-vector is set according to whether the item is contained in the incoming transaction. After the bit-vectors of items are modified, New-Moment begins to modify the New-CET. Only the sub-trees of the items in the inserted transaction need to be checked.

The method of traversing the New-CET for adding a new transaction, called function Append, is the same as function Build. The only difference is that the supports of existing closed frequent itemsets in the hash table need to be modified. Figure 3-11 shows the pseudo code of appending the incoming transaction after setting the rightmost bit in the bit-vector of each 1-itemset.

Append (nI, N, S)
  if support(nI) ≥ S · N then
    if leftcheck(nI) = false then
      foreach frequent sibling nK of nI do
        generate a new child nIK for nI;
        bitwise AND BITI and BITK to obtain BITIK;
      foreach child nI′ of nI do
        Append(nI′, N, S);
      if ∄ a child nI′ of nI such that support(nI′) = support(nI) then
        if nI is closed frequent itemset in previous sliding window then
          update the support of nI;
          update nI in the hash table;
        else
          retain nI as a closed frequent itemset;
          insert nI into the hash table;

Fig 3-11. Pseudo code of appending the incoming transaction in window sliding

Figure 3-12 shows the New-CET after appending the incoming transaction. This is also the New-CET in the second sliding window.
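The state of the second window can be cross-checked with a brute-force Python sketch that applies the definition of a closed frequent itemset directly (a frequent itemset with no proper superset of equal support). This is deliberately not the incremental Append procedure; it is only a reference computation for small examples. The window contents (TIDs 2 to 5) are taken from the transaction listing of the running example, and the function name is hypothetical.

```python
from itertools import combinations

def closed_frequent_itemsets(transactions, min_count):
    """Enumerate all itemsets, keep the frequent ones, then keep those
    that have no proper superset with the same support."""
    items = sorted(set().union(*transactions))
    support = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = sum(set(combo) <= t for t in transactions)
            if count >= min_count:
                support[combo] = count
    return {
        itemset: s
        for itemset, s in support.items()
        if not any(set(itemset) < set(other) and s == s_other
                   for other, s_other in support.items())
    }

# Window #2: transactions with TID = 2..5.
window2 = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c"}, {"a", "c", "d"}]
print(closed_frequent_itemsets(window2, 2))
```

For Window #2 with a minimum support count of 2, the reference computation gives (a): 4, (a, b): 3, (a, c): 3, and (a, b, c): 2, so (a, c) appears as a closed frequent itemset with support 3. The enumeration is exponential in the number of items and is suitable only for checking toy examples.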

TID   Itemset
1     c, d
2     a, b
3     a, b, c
4     a, b, c
5     a, c, d

Before appending: (a): <1110>   (b): <1110>   (c): <0110>   (d): <0000>
  (a, b): 3
    (a, b, c): 2

Window #2 (after appending): (a): <1111>   (b): <1110>   (c): <0111>   (d): <0001>
  (a, c): 3

Minsup = 2 Window Size = 4

Fig 3-12. New-CET after appending the incoming transaction (Window #2)

Chapter 4

Incremental SPAM (IncSPAM): Mining Sequential Patterns

Sequential pattern mining is more complicated than mining frequent itemsets, especially in a stream environment. In previous research, there is no general processing model for handling a data stream whose unit is a transaction. Incremental SPAM (IncSPAM) provides a suitable sliding window model for a data stream. It receives transactions from the data stream and uses a new bit-vector concept, the Customer Bit-Vector Array with Sliding Window (CBASW), to store the information of items for each customer. IncSPAM then uses a lexicographic sequence tree to maintain the sequential patterns in the current window. For
