
Chapter 1 Introduction

1.3 Organization of Thesis

The remainder of this thesis is organized as follows. Chapter 2 describes basic definitions and terminology for itemsets, sequences, and the sliding window model. Chapter 3 presents the New-Moment algorithm for mining closed frequent itemsets. Chapter 4 introduces the IncSPAM algorithm for mining sequential patterns. Chapter 5 describes the experiments and performance measurements. Chapter 6 gives the conclusion and future work.

Chapter 2

Problem Definition and Background

In this chapter we introduce the basic problem definitions. Section 2.1 defines the data stream environment and the sliding window model. Sections 2.2 and 2.3 then define closed frequent itemsets and sequential patterns.

2.1 The Sliding Window Model in Data Streams

2.1.1 Data Stream Environment

Fig 2-1. Processing model of data stream environment

A data stream DS = [T1, T2, …, TM) is an infinite transaction set. In a data stream environment, the input is a continuous data stream and each transaction can be scanned only once. Because memory is limited and each transaction is scanned only once (one-pass), a summary data structure is needed to store compact information about the data stream. In other words, one-pass algorithms for mining data streams have to sacrifice some correctness in their analytical results by allowing some counting error. Hence traditional multi-pass techniques for mining static databases are not feasible in the data stream environment. Figure 2-1 shows a processing model of data streams [19].

2.1.2 A Sliding Window Model

Some data stream applications emphasize the importance of the latest transactions. A sliding window model is suitable for this kind of problem. In its basic form, a sliding window keeps the latest N transactions in the data stream; N is called the window size. The data stream mining engine in Figure 2-1 only mines patterns in the current sliding window. Whenever a new transaction arrives, the sliding window eliminates the oldest transaction and appends the incoming transaction. This process is called window sliding. The mining engine also updates the summary data structure according to the changes in the sliding window.

Figure 2-2 shows the sliding window in an input data stream.
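As a minimal sketch of window sliding, the Python fragment below keeps the latest N transactions in a double-ended queue. The function name `slide` and the sample stream are illustrative assumptions and do not come from the thesis.

```python
from collections import deque


def slide(window, transaction, n):
    """Append `transaction` to `window`, evicting the oldest one if the window is full.

    Returns the evicted transaction, or None while the window is still filling.
    """
    evicted = window.popleft() if len(window) == n else None
    window.append(transaction)
    return evicted


# Hypothetical stream of five transactions and a window of size N = 4.
window = deque()
stream = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"b"}, {"a", "d"}]
evicted = [slide(window, t, 4) for t in stream]
```

Only the fifth arrival causes an eviction here; a mining engine would use the evicted and the appended transaction to update its summary data structure.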


Fig 2-2. A sliding window model in a data stream

2.2 Definition of Mining Closed Frequent Itemsets

I = {i1, i2, i3, …, in} is a set of literals, called items. An itemset is a set of items. An itemset X with k items is written X = (x1, x2, …, xk) and is called a k-itemset. Let DI be a database consisting of a set of transactions. Each transaction T consists of a set of items from I, i.e., T ⊆ I, and a transaction id (TID) that represents its time order in the database. An itemset X is said to be contained in a transaction T if X ⊆ T. The support of an itemset X is the number of transactions containing X. An itemset X is a frequent itemset if the support of X is no less than S · |DI|, where S is a user-specified minimum support threshold.

As an example, let I = {a, b, c, d}, DI = {(a, b, c), (b, c, d), (a, b, c), (b, c)}, S = 0.5. The set of frequent itemsets is F = {(a): 2, (b): 4, (c): 4, (a, b): 2, (a, c): 2, (b, c): 4, (a, b, c): 2}. The number following the colon represents the support of the itemset.
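The example can be checked with a brute-force enumeration. The sketch below is only an illustration of the definition, not an efficient mining algorithm; the function name `frequent_itemsets` is an assumption, and an itemset is treated as frequent when its support is at least S times the number of transactions, which is what the example implies.

```python
from itertools import combinations


def frequent_itemsets(db, min_sup):
    """Enumerate every itemset and keep those with support >= min_sup * |db|."""
    threshold = min_sup * len(db)
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            # Support = number of transactions that contain the candidate.
            support = sum(1 for t in db if set(candidate) <= t)
            if support >= threshold:
                result[candidate] = support
    return result


DI = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b", "c"}, {"b", "c"}]
F = frequent_itemsets(DI, 0.5)
```

With these four transactions the result contains exactly the seven itemsets listed in F above.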

The total number of all the frequent itemsets sometimes is too large and it is difficult to retrieve useful information. For reducing the number of output patterns, the concept of closed frequent itemsets [4] is proposed.

Definition of Closed Frequent Itemset. A frequent itemset X is closed if there is no frequent itemset X′ such that (1) X ⊂ X′ and (2) for every transaction T, X ⊆ T → X′ ⊆ T.

In the above example, the set of closed frequent itemsets C = {(b, c): 4, (a, b, c): 2}, C ⊆F.

Consider itemsets (b), (c), and (b, c). The supports of itemsets (b) and (c) are equal to the support of itemset (b, c); furthermore, itemsets (b) and (c) are subsets of (b, c). That means itemsets (b) and (c) occur in exactly the same transactions as itemset (b, c). By the definition of closed frequent itemset, (b) and (c) are not closed frequent itemsets and (b, c) is a closed frequent itemset.

All frequent itemsets can be obtained from closed frequent itemsets without losing support information. In the above example, we know that (b) and (c) are frequent itemsets by observing closed frequent itemset (b, c). Supports of (b) and (c) are the same as (b, c).

Supports of frequent itemsets can be used to judge if a frequent itemset is closed.
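The support-based test can be sketched as follows: a frequent itemset is closed when no proper superset among the frequent itemsets has the same support. The function name `closed_itemsets` is an illustrative assumption; the input is the set F from the example above.

```python
def closed_itemsets(frequent):
    """Keep the frequent itemsets that have no proper superset with equal support."""
    closed = {}
    for x, sup_x in frequent.items():
        has_equal_superset = any(
            set(x) < set(y) and sup_x == sup_y for y, sup_y in frequent.items()
        )
        if not has_equal_superset:
            closed[x] = sup_x
    return closed


F = {
    ("a",): 2, ("b",): 4, ("c",): 4,
    ("a", "b"): 2, ("a", "c"): 2, ("b", "c"): 4,
    ("a", "b", "c"): 2,
}
C = closed_itemsets(F)
```

Because a superset can never have a larger support than its subset, equal support is enough to conclude that both occur in the same transactions.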

2.3 Definition of Mining Sequential Patterns

An input database DS contains customer-transactions. These customer-transactions are slightly different from the transactions in section 2.1.1. Each customer-transaction consists of the following fields: customer-id (CID), transaction-id (TID), and the items purchased in the transaction (called an itemset). The concepts of TID, items, and itemsets here are the same as in section 2.2. The difference is that each transaction in DS belongs to some customer. Figure 2-3 shows an example of the transaction database DS.

Customer ID (CID)    Transaction ID (TID)    Itemset
1                    1                       (a, b, d)
2                    2                       (b)
1                    3                       (b, c, d)
2                    4                       (a, b, c)
3                    5                       (a, b)
1                    6                       (b, c, d)
3                    7                       (b, c, d)

Fig 2-3. An example of an input database DS

A sequence is an ordered list of itemsets and is denoted as S = 〈s1s2s3…sk〉, where sj is an itemset. A sequence α = 〈a1a2a3…an〉 is contained in another sequence β = 〈b1b2b3…bm〉 if there exist integers i1 < i2 < i3 < … < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, …, an ⊆ b_in.

All the transactions of a customer can be viewed as a sequence, where each transaction corresponds to a set of items, and the list of transactions, ordered by increasing transaction-id, corresponds to a sequence. We call such a sequence a customer-sequence. Figure 2-4 shows the customer-sequences in Figure 2-3.

CID    Sequence
1      <(a, b, d)(b, c, d)(b, c, d)>
2      <(b)(a, b, c)>
3      <(a, b)(b, c, d)>

Fig 2-4. The customer-sequences in Fig 2-3

The absolute support of a sequence S is defined as the number of customer-sequences containing S. Sequential patterns are the sequences whose supports are more than a user-defined minimum support, also called frequent sequences.
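The containment test and the absolute support can be sketched with a greedy left-to-right match. The function names `contains` and `absolute_support` and the three query sequences are illustrative assumptions; the customer-sequences are those of Figure 2-4.

```python
def contains(beta, alpha):
    """Return True if sequence `alpha` is contained in sequence `beta`.

    Each itemset of alpha is matched greedily to the earliest remaining
    itemset of beta that is a superset of it.
    """
    pos = 0
    for itemset in alpha:
        while pos < len(beta) and not itemset <= beta[pos]:
            pos += 1
        if pos == len(beta):
            return False
        pos += 1  # the next itemset of alpha must match strictly later
    return True


def absolute_support(sequence, customer_sequences):
    """Number of customer-sequences that contain `sequence`."""
    return sum(1 for cs in customer_sequences if contains(cs, sequence))


customer_sequences = [
    [{"a", "b", "d"}, {"b", "c", "d"}, {"b", "c", "d"}],  # CID 1
    [{"b"}, {"a", "b", "c"}],                             # CID 2
    [{"a", "b"}, {"b", "c", "d"}],                        # CID 3
]
sup_b_c = absolute_support([{"b"}, {"c"}], customer_sequences)
sup_a_bc = absolute_support([{"a"}, {"b", "c"}], customer_sequences)
sup_ab_d = absolute_support([{"a", "b"}, {"d"}], customer_sequences)
```

For instance, <(a)(b, c)> is not contained in the sequence of customer 2, because the only itemset containing a is the last one and nothing follows it.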

Chapter 3

New-Moment: Mining Closed Frequent Itemsets

The goal of New-Moment is to improve the Moment algorithm. We first introduce the Moment algorithm in section 3.1 and then present our proposed algorithm, New-Moment, in section 3.2.

3.1 Related Work: Moment Algorithm

The Moment algorithm [20, 21] mines closed frequent itemsets over a sliding window in a data stream. It uses a closed enumeration tree (CET) to maintain the closed frequent itemsets in the current window. The CET maintains not only closed frequent itemsets but also some boundary tree nodes. Figure 3-1 shows the CET in the first window. Assume that the window size is 4 and that the first four incoming transactions are listed on the left of the figure.


Fig 3-1. CET in the first sliding window

There are four types of tree nodes for CET:

(1) infrequent gateway nodes

A node nI that represents itemset I is an infrequent gateway node if i) I is an infrequent itemset, ii) nI’s parent, nJ, is frequent, and iii) I is the result of joining I’s parent, J, with one of J’s frequent siblings. In Figure 3-1, the tree node (d) is an infrequent gateway node.

(2) unpromising gateway nodes

A node nI is an unpromising gateway node if i) I is a frequent itemset and ii) there exists a closed frequent itemset J such that J ⊃ I, and J has the same support as I does. In Figure 3-1, the tree nodes (a, c) and (b) are unpromising gateway nodes.

(3) intermediate nodes

A node nI is an intermediate node if i) I is a frequent itemset, ii) nI has a child node nJ such that J has the same support as I does, and iii) nI is not an unpromising gateway node. In Figure 3-1, the tree node (a) is an intermediate node because its child (a, b) has the same support as (a) does.

(4) closed nodes

These nodes represent closed frequent itemsets in the current window. A closed node can be an internal node or a leaf node. In Figure 3-1, (c), (a, b), and (a, b, c) are closed nodes.

Besides closed nodes, Moment keeps three types of boundary nodes. These nodes are the most likely candidates to become new closed nodes in the next window. Moment keeps them to speed up modification of the closed enumeration tree.

There are three steps in Moment algorithm:

(1) Building the closed enumeration tree (CET)

When the total number of transactions received from the data stream does not exceed the window size N, Moment just saves these transactions in its sliding window. As soon as the window is full, Moment builds an initial closed enumeration tree (CET). Figure 3-1 shows the tree in the first window.

Moment adopts a depth-first procedure to generate all possible candidate itemsets in the window and check their supports. In the procedure, if a node is found to be infrequent, it is marked as an infrequent gateway node and Moment does not explore its descendants further.

If a node is a frequent itemset but not a closed frequent itemset, the node is marked as an unpromising gateway node. Moment does not explore its descendants either, because they do not contain any closed frequent itemsets. Moment uses the support of a node and the sum of the tids of the transactions that contain the node (tid_sum) to check whether the node is a closed node. Take the nodes (a, c) and (a, b, c) in Figure 3-1 as an example. The support of (a, c) is the same as that of (a, b, c). The tid_sum of (a, c) is 7 (the third and the fourth transactions in the window), which is equal to the tid_sum of (a, b, c). By the definition of closed frequent itemsets, (a, c) is therefore not a closed node.
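The support and tid_sum comparison can be illustrated with a small sketch. The window contents below are an assumption: they are the four transactions implied by the Window #1 column of Table 3-1, and the function name `support_and_tid_sum` is hypothetical.

```python
# Assumed first window (tid -> transaction), inferred from Table 3-1, Window #1.
window = {
    1: {"c", "d"},
    2: {"a", "b"},
    3: {"a", "b", "c"},
    4: {"a", "b", "c"},
}


def support_and_tid_sum(itemset, window):
    """Return (support, tid_sum) of `itemset` over the transactions in `window`."""
    tids = [tid for tid, transaction in window.items() if itemset <= transaction]
    return len(tids), sum(tids)


ac = support_and_tid_sum({"a", "c"}, window)
abc = support_and_tid_sum({"a", "b", "c"}, window)
ab = support_and_tid_sum({"a", "b"}, window)
```

Here (a, c) and its superset (a, b, c) share both values, so (a, c) is not closed, whereas (a, b) has a larger support than (a, b, c).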

If a node is found to be neither an infrequent node nor an unpromising gateway node, Moment explores its descendants. The nodes that are intermediate nodes or closed nodes are maintained in the CET.

(2) Updating the CET

The initial closed enumeration tree is built when the number of incoming transactions from the data stream is equal to the window size. After that, whenever a new transaction arrives from the data stream, Moment updates the CET to maintain the closed frequent itemsets in the current window. There are two steps for updating the CET:

Adding the new transaction coming from the data stream

Fig 3-2. Adding the new transaction with tid = 5

In Figure 3-2, a new transaction T (tid = 5) is added to the sliding window. Moment traverses the parts of the CET that are related to transaction T. For each related node nI in depth-first order, Moment updates its support and tid_sum. Whenever a node is updated, Moment checks if it needs to change its node type.

In Figure 3-2, the node (d) becomes a new frequent node, so Moment generates the new candidate nodes (a, d) and (c, d). From the node properties, Moment knows that (a, d) is an infrequent gateway node and (c, d) is a new closed node. By checking the supports of the nodes (a), (a, c), and (c), Moment changes them to closed nodes.

Deleting the oldest transaction in the window

In Figure 3-3, the transaction with tid = 1 is deleted. Like adding the new transaction, Moment updates support and tid_sum of each node in the CET. By checking the support of each node, Moment modifies its node type.

In Figure 3-3, node (c) becomes an unpromising gateway node because it is contained in node (a, c) and the supports of (c) and (a, c) are the same. Then the sub-tree of node (c), namely (c, d), is deleted. The node (d) becomes a new infrequent gateway node.

Moment maintains a huge number of boundary nodes to speed up the procedure of updating the CET, so that the cost for a node to change its type is small. However, we find that those boundary nodes are unnecessary overhead. In our proposed algorithm, New-Moment, we reduce the number of tree nodes and use an efficient structure to store the information of the sliding window.

3.2 Our Proposed Algorithm: New-Moment Algorithm

We use bit-vectors to store the information of a sliding window. Because bit-vectors are efficient for counting support and for modifying the transactions in the window, New-Moment maintains only the closed frequent itemsets in each sliding window. The new closed enumeration tree (New-CET) is composed of the bit-vectors of 1-itemsets, the closed frequent itemsets in the current sliding window, and a hash table.

3.2.1 Bit-Vector

Definition of Bit-Vector: For a specified item i and a given window w of the sliding window model in a data stream, a bit-vector is used to store the occurrences of item i in the transactions of w. Each bit of a bit-vector represents a transaction in w. If the item i occurs in a transaction of w, the corresponding bit is set to one; otherwise it is set to zero.

Figure 3-4 shows an example of input database and the first three sliding windows are displayed next to it. These windows are marked from window #1 to window #3. It is assumed that the size of sliding window is 4. The example of figure 3-4 will be used in the following context.

Fig 3-4. An example database and the first three sliding windows

Each window in Figure 3-4 can be transformed into bit-vectors by the definition of bit-vector. The bit-vectors of all items in each window are listed in Table 3-1. The leftmost bit represents the oldest transaction and the rightmost bit represents the most recent transaction.

     Window #1    Window #2    Window #3
a    0111         1111         1110
b    0111         1110         1101
c    1011         0111         1111
d    1000         0001         0010

Table 3-1. The bit-vectors of all items in each window in Figure 3-4
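A minimal sketch of how such bit-vectors could be built is given below, using Python integers as bit-vectors. The function name `bit_vectors` is hypothetical, and the four transactions are an assumption: they are the window implied by the Window #1 column of Table 3-1.

```python
def bit_vectors(window, items):
    """Build one bit-vector (as an int) per item; the leftmost bit is the oldest transaction."""
    vectors = {}
    for item in items:
        bits = 0
        for transaction in window:  # iterate from oldest to newest
            bits = (bits << 1) | (item in transaction)
        vectors[item] = bits
    return vectors


# Assumed contents of the first window, oldest transaction first.
window_1 = [{"c", "d"}, {"a", "b"}, {"a", "b", "c"}, {"a", "b", "c"}]
vectors = bit_vectors(window_1, "abcd")
as_text = {item: format(bits, "04b") for item, bits in vectors.items()}
```

Formatting each integer with four binary digits reproduces the Window #1 column of Table 3-1.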

3.2.2 Window Sliding with Bit-Vector

When the number of transactions in a data stream exceeds the size of a window, window sliding is performed. Bit-vectors make the window sliding process efficient. We can separate the sliding process into two steps:

(1) Delete the oldest transaction

The only thing a bit-vector needs to do is to left-shift by one bit. Take item a as an example. a’s bit-vector is 1010 in the first window. If the transaction with TID = 1 is deleted, a’s bit-vector becomes 0100. Now the leftmost bit represents the transaction with TID = 2, and the rightmost bit is meaningless and reserved for the next step.

(2) Append the incoming transaction

After the oldest transaction is deleted, the rightmost bit of each bit-vector is set according to the incoming transaction. The bit-vector of each item contained in the incoming transaction sets its rightmost bit to one; the others set their rightmost bit to zero. Take item a as an example. a’s bit-vector is 0100 after deleting the oldest transaction. The incoming transaction (b, d) (TID = 5) does not contain a, so a’s bit-vector is still 0100. b’s bit-vector is 1110 after deleting the oldest transaction. The incoming transaction contains b, so b’s bit-vector is 1111 after appending the incoming transaction.
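Both steps can be sketched as one shift-and-set operation on integer bit-vectors. The function name `slide` is hypothetical, and b's initial bit-vector 1111 is an assumption chosen so that it becomes 1110 after the left shift, as stated in the text.

```python
def slide(vectors, incoming, n):
    """Delete the oldest transaction and append `incoming` for a window of n bits."""
    mask = (1 << n) - 1  # keeps only the n bits that belong to the window
    return {
        # Step 1: the left shift drops the oldest bit.
        # Step 2: the rightmost bit is set if the item occurs in the incoming transaction.
        item: ((bits << 1) & mask) | (item in incoming)
        for item, bits in vectors.items()
    }


before = {"a": 0b1010, "b": 0b1111}  # b's value is assumed, see lead-in
after = slide(before, {"b", "d"}, 4)
as_text = {item: format(bits, "04b") for item, bits in after.items()}
```

An item that appears for the first time in the incoming transaction, such as d here, would need a new bit-vector; that bookkeeping is left out of the sketch.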

3.2.3 Counting Support with Bit-Vector

The concept of bit-vector can be extended to itemsets. For example, the bit-vector of itemset (a, b) in the first window is 1010. That means (a, b) occurs in the transactions with TID = 1 and TID = 3.

Assume there are two itemsets X and Y with corresponding bit-vectors BITX and BITY. The bit-vector of the itemset Z = X ∪ Y can be obtained by a bitwise AND of BITX and BITY.

For example, the bit-vector of itemset (a, b) in the first window (Window #1) is 1010 which can be obtained by bitwise AND the bit-vectors of items a and b. That means (a, b) occurs in the first and the third transactions in the first sliding window. By bitwise AND between

The support of each itemset can be obtained by counting how many bits in the bit-vector are set to one. For example, the support of itemset (a, b) is 2.
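The bitwise AND and the bit count can be sketched as follows. The function names are hypothetical, and b's bit-vector 1111 is an assumption consistent with the running example, in which a is 1010 and (a, b) is 1010.

```python
def itemset_vector(itemset, vectors, n):
    """Bitwise AND of the bit-vectors of all items in `itemset` (n-bit window)."""
    bits = (1 << n) - 1  # start from all ones, the identity of AND
    for item in itemset:
        bits &= vectors[item]
    return bits


def support(bits):
    """Support = number of bits set to one."""
    return bin(bits).count("1")


vectors = {"a": 0b1010, "b": 0b1111}  # b's value is assumed, see lead-in
bit_ab = itemset_vector(("a", "b"), vectors, 4)
sup_ab = support(bit_ab)
```

The result 1010 has two bits set, which gives the support of 2 mentioned above.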

3.2.4 Building the New Closed Enumeration Tree (New-CET)

For improving the efficiency of CET in Moment, we propose a new closed enumeration tree (New-CET). New-CET is basically a lexicographical tree. There are three important parts in New-CET:

(1) Bit-vectors of all items (1-itemsets)

Moment maintains an independent sliding window for counting the support of each node in the CET. Instead of using an independent sliding window to store the current N transactions, New-Moment maintains the information of these transactions in the bit-vectors of all items.

(2) Closed frequent itemsets in current window

Each closed frequent itemset only maintains its support.

(3) Hash table

For checking whether a frequent itemset is closed or not, we need a hash table to store all closed frequent itemsets with their supports as keys. Whenever a new frequent itemset is generated, we can judge if this frequent itemset is closed by hashing its support to the hash table. How to utilize the information of support to judge if a frequent itemset is closed is introduced in section 2.2.

Building New-CET is almost the same as building CET. The major difference is that New-CET only retains bit-vectors of items and closed frequent itemsets and bit-vectors are used to count supports of generated candidates.

When the total number of incoming transactions is less than the size of the sliding window, New-Moment only records all item information as introduced in section 3.2.1. When the window becomes full, New-Moment uses a depth-first procedure to generate all possible candidates and check their supports. Because each candidate is generated from its parent and its parent’s frequent siblings, we can obtain the supports by the method introduced in section 3.2.3. Then, for each frequent candidate, we use the hash table to check whether the candidate is closed. If the candidate is closed, it is inserted into the hash table. If the candidate is not closed, the node is not maintained in the New-CET. Figure 3-5 shows the pseudo code for building the New-CET.

Build (nI, N, S)

  if support(nI) ≥ S · N then
    if leftcheck(nI) = false then
      foreach frequent sibling nK of nI do
        generate a new child nIK for nI;
        bitwise AND BITI and BITK to obtain BITIK;
      foreach child nI′ of nI do
        Build(nI′, N, S);
      if ∄ a child nI′ of nI such that support(nI′) = support(nI) then
        retain nI as a closed frequent itemset;
        insert nI into the hash table;

Fig 3-5. Pseudo code of building New-CET

nI is a tree node, N is the window size, and S is the minimum support. Each nI has a corresponding bit-vector BITI that stores the information of the sliding window. Apart from the bit-vectors of items, the BITI of a node nI exists only while the support of a new candidate is being counted.
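The Build procedure of Fig 3-5 can be sketched in Python as below. This is an interpretation under stated assumptions, not the thesis implementation: `left_check` is assumed to look for an already-found closed itemset that is a proper superset with the same support, siblings are assumed to be the lexicographically later ones, the hash table is modelled as a dict keyed by support, and all function names are hypothetical. The item bit-vectors are the Window #1 column of Table 3-1.

```python
def popcount(bits):
    """Number of one-bits, i.e. the support of a bit-vector."""
    return bin(bits).count("1")


def left_check(itemset, support, closed_by_support):
    """True if a known closed itemset is a proper superset with equal support."""
    return any(set(itemset) < set(other)
               for other in closed_by_support.get(support, []))


def build(itemset, bits, right_siblings, min_count, closed_by_support):
    support = popcount(bits)
    if support < min_count or left_check(itemset, support, closed_by_support):
        return
    # One child per frequent right sibling; its bit-vector is a bitwise AND.
    children = [(itemset + sib_items[-1:], bits & sib_bits)
                for sib_items, sib_bits in right_siblings]
    has_equal_child = False
    for i, (child_items, child_bits) in enumerate(children):
        frequent_right = [c for c in children[i + 1:] if popcount(c[1]) >= min_count]
        build(child_items, child_bits, frequent_right, min_count, closed_by_support)
        has_equal_child |= popcount(child_bits) == support
    if not has_equal_child:
        # Retain the node as closed and record it under its support.
        closed_by_support.setdefault(support, []).append(itemset)


def build_new_cet(item_vectors, n, min_sup):
    """Return {closed itemset: support} for one window of n transactions."""
    min_count = min_sup * n
    closed_by_support = {}
    nodes = [((item,), bits) for item, bits in sorted(item_vectors.items())]
    for i, (items, bits) in enumerate(nodes):
        frequent_right = [s for s in nodes[i + 1:] if popcount(s[1]) >= min_count]
        build(items, bits, frequent_right, min_count, closed_by_support)
    return {items: sup for sup, group in closed_by_support.items() for items in group}


window_1 = {"a": 0b0111, "b": 0b0111, "c": 0b1011, "d": 0b1000}
closed = build_new_cet(window_1, n=4, min_sup=0.5)
```

On this window the sketch retains (a, b) with support 3, (a, b, c) with support 2, and (c) with support 3, while (a), (b), and (a, c) are rejected.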

Figure 3-6 shows the New-CET in the first window of the previous example when new candidates are generated from item a. For simplicity, the hash table is not displayed. From the bit-vectors of items, we know that items a, b, and c are frequent items. Taking item a as an example, new candidates (a, b) and (a, c) are generated. By a bitwise AND of the bit-vectors of items a and b, we obtain that the support of (a, b) is 3. In the same way, the support of (a, c) is 2 and the support of (a, b, c) is 2. For generating candidates below item a, the bit-vectors of (a, b), (a, c), and (a, b, c) are temporarily maintained in memory.

Window #1

(a): <0111> (b): <0111> (c): <1011> (d): <1000>

(a, b): <0111>

(a, b, c): <0011>

(a, c): <0011>

Minsup = 2 Window Size = 4

Fig 3-6. New-CET in the first window after generating new candidates from item a

Figure 3-7 shows the New-CET after checking if each frequent candidate is closed. The tree nodes with squares are closed frequent itemsets. By checking support with hash table, we can know that frequent itemset (a, c) is not closed. So New-Moment eliminates this node and other frequent candidates are marked as closed frequent itemsets. Although item a is not closed, New-Moment still maintains the bit-vector of item a. After the sub-tree of item a is checked, the bit-vectors in this sub-tree are eliminated. New-Moment only keeps the supports of closed frequent itemsets.

Window #1

(a): <0111> (b): <0111> (c): <1011> (d): <1000>

(a, b): 3

(a, b, c): 2

Minsup = 2 Window Size = 4

Figure 3-8 shows the New-CET when Build is done. The sub-tree generations of item b and c are the same as item a. Item c is a new closed frequent itemset.

Window #1

(a): <0111> (b): <0111> (c): <1011> (d): <1000>

(a, b): 3

(a, b, c): 2

Minsup = 2

Window Size = 4

Fig 3-8. New-CET in the first window (Window #1)

3.2.5 Deleting the Oldest Transaction in Window Sliding

Deleting the oldest transaction is the first step of window sliding. New-Moment first records the items contained in the deleted transaction, which can be found by observing the leftmost bit of each item’s bit-vector before left-shifting, and then left-shifts all bit-vectors of items by one bit. After the bit-vectors of items are modified, New-Moment begins to modify the New-CET.
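This step can be sketched as follows, again with integers as bit-vectors. The function name `delete_oldest` is hypothetical; the input values are the Window #1 column of Table 3-1.

```python
def delete_oldest(vectors, n):
    """Return (items of the deleted transaction, left-shifted bit-vectors)."""
    top = 1 << (n - 1)   # mask of the leftmost (oldest) bit
    mask = (1 << n) - 1  # keeps the result within n bits
    # An item belongs to the deleted transaction if its leftmost bit is one.
    deleted_items = {item for item, bits in vectors.items() if bits & top}
    shifted = {item: (bits << 1) & mask for item, bits in vectors.items()}
    return deleted_items, shifted


window_1 = {"a": 0b0111, "b": 0b0111, "c": 0b1011, "d": 0b1000}
deleted_items, shifted = delete_oldest(window_1, 4)
as_text = {item: format(bits, "04b") for item, bits in shifted.items()}
```

Only the sub-trees of the items in `deleted_items` would then need to be re-checked in the New-CET.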

There is only one case to handle when deleting the oldest transaction: closed frequent itemsets in the New-CET may become non-closed frequent itemsets or infrequent itemsets. To check for this, New-Moment traverses the New-CET again and checks the supports of the existing nodes. Because only subsets of the deleted transaction can become infrequent, only the sub-trees of the items in the deleted transaction need to be checked. The traversal, called function Delete, is almost the same as building the initial New-CET. The difference is that Delete generates the entire lexicographical tree, including the itemsets whose supports are (S · N – 1). This is because supports of some
