
Institute of Computer Science and Engineering


Mining of Closed Frequent Itemsets and Sequential Patterns in Data

Streams Using Bit-Vector Based Method

Student: Chin-Chuan Ho

Advisor: Prof. Suh-Yin Lee



Mining of Closed Frequent Itemsets and Sequential Patterns in Data

Streams Using Bit-Vector Based Method

Student: Chin-Chuan Ho

Advisor: Suh-Yin Lee

National Chiao Tung University

Institute of Computer Science and Engineering

Master's Thesis

A Thesis

Submitted to the Institute of Computer Science and Engineering, College of Computer Science,

National Chiao Tung University, in Partial Fulfillment of the Requirements

for the Degree of Master

in

Computer Science

July 2006

Hsinchu, Taiwan, Republic of China

Student: Chin-Chuan Ho    Advisor: Prof. Suh-Yin Lee

Institute of Computer Science and Engineering, National Chiao Tung University

Abstract (Chinese)

Mining meaningful patterns in a data stream environment is an important problem that arises in many applications, such as sensor networks and stock analysis. Because of the constraints of the data stream environment, the mining task becomes harder. In the first part of this thesis we propose the New-Moment algorithm for mining closed frequent itemsets in data streams; New-Moment uses bit-vectors and a compact closed enumeration tree to greatly improve the performance of the original Moment algorithm. In the second part we propose the IncSPAM algorithm for mining sequential patterns over streams, which provides a brand-new sliding window model. IncSPAM builds on the SPAM algorithm and a memory-indexing method to incrementally maintain the most recent patterns. Experiments show that our approaches can efficiently mine meaningful patterns in a data stream environment.


Mining of Closed Frequent Itemsets and Sequential Patterns in

Data Streams Using Bit-Vector Based Method

Student: Chin-Chuan Ho Advisor: Suh-Yin Lee

Institute of Computer Science and Engineering

College of Computer Science

National Chiao Tung University

Abstract

Mining a data stream is an important data mining problem with broad applications, such as sensor networks and stock analysis. It is a difficult problem because of several limitations of the data stream environment. In the first part of this thesis, we propose New-Moment to mine closed frequent itemsets. New-Moment uses bit-vectors and a compact lexicographical tree to improve the performance of the Moment algorithm. In the second part, we propose IncSPAM to mine sequential patterns with a new sliding window model. IncSPAM is based on SPAM and utilizes a memory-indexing technique to incrementally maintain the sequential patterns in the current sliding window. Experiments show that our approaches are efficient for mining patterns in a data stream.


Acknowledgment

I greatly appreciate the kind guidance of my advisor, Prof. Suh-Yin Lee. Without her graceful suggestions and encouragement, I could not have completed this thesis.

Besides, I want to express my thanks to all the members of the Information System Laboratory for their suggestions and instruction, especially Mr. Hua-Fu Li. Finally I would like to express my appreciation to my parents. This thesis is dedicated to them.


Table of Contents

Abstract (Chinese)

Abstract (English)

Acknowledgment

Table of Contents

List of Figures

List of Tables

Chapter 1 Introduction

1.1 Overview and Motivation

1.2 Related Work

1.2.1 Mining of Frequent Itemsets

1.2.2 Mining of Sequential Patterns

1.3 Organization of Thesis

Chapter 2 Problem Definition and Background

2.1 The Sliding Window Model in Data Streams

2.1.1 Data Stream Environment

2.1.2 A Sliding Window Model

2.2 Definition of Mining Closed Frequent Itemsets

2.3 Definition of Mining Sequential Patterns

Chapter 3 New-Moment: Mining Closed Frequent Itemsets

3.1 Related Work: Moment Algorithm

3.2 Our Proposed Algorithm: New-Moment Algorithm

3.2.1 Bit-Vector

3.2.2 Window Sliding with Bit-Vector

3.2.3 Counting Support with Bit-Vector

3.2.4 Building the New Closed Enumeration Tree (New-CET)

3.2.5 Deleting the Oldest Transaction in Window Sliding

3.2.6 Appending the Incoming Transaction in Window Sliding

Chapter 4 Incremental SPAM (IncSPAM): Mining Sequential Patterns

4.1 A New Concept of Sliding Window for Sequences

4.2 Customer Bit-Vector Array with Sliding Window (CBASW)

4.3 Index Set ρ-idx

4.4 Maintaining Information of Items in Window Sliding with CBASW and ρ-idx

4.5 Building of Lexicographical Sequence Tree

4.6.2 Counting Support in I-step

4.7 The Entire Process of Incremental SPAM (IncSPAM)

4.8 Weight of Customer-Sequence

Chapter 5 Performance Measurement

5.1 Performance Measurement of New-Moment

5.1.1 Different Minimum Support

5.1.2 Different Sliding Window Size

5.1.3 Different Number of Items

5.2 Performance Measurement of IncSPAM

5.2.1 Different Minimum Support

5.2.2 Different Sliding Window Size

5.2.3 Different Number of Customers

Chapter 6 Conclusion and Future Work

6.1 Conclusion of New-Moment

6.2 Conclusion of IncSPAM

6.3 Future Work


List of Figures

Fig 2-1. Processing model of data stream environment

Fig 2-2. A sliding window model in a data stream

Fig 2-3. An example of an input database DS

Fig 2-4. The customer-sequences in Fig 2-3

Fig 3-1. CET in the first sliding window

Fig 3-2. Adding the new transaction with tid = 5

Fig 3-3. Deleting the transaction with tid = 1

Fig 3-4. An example database and the first three sliding windows

Fig 3-5. Pseudo code of building New-CET

Fig 3-6. New-CET in the first window after generating new candidates from item a

Fig 3-7. New-CET in the first window after checking closed frequent itemsets

Fig 3-8. New-CET in the first window (Window #1)

Fig 3-9. New-CET after deleting the oldest transaction

Fig 3-10. Pseudo code of deleting the oldest transaction in window sliding

Fig 3-11. Pseudo code of appending the incoming transaction in window sliding

Fig 3-12. New-CET after appending the incoming transaction (Window #2)

Fig 4-1. An example for the new concept of sliding window

Fig 4-2. An example of CBASW

Fig 4-3. An example of index sets

Fig 4-4. An example of window sliding in a CBASW

Fig 4-5. A lexicographic sequence tree example

Fig 4-6. Main function of Incremental SPAM

Fig 4-7. Reducing the generated candidates

Fig 4-8. The pseudo code of function MaintainTree

Fig 4-9. The pseudo code of function Generate

Fig 4-10. The pseudo code of function Update

Fig 4-11. The lexicographic sequence tree when the third transaction comes in

Fig 4-12. The lexicographic sequence tree after the fourth transaction comes in

Fig 4-13. The lexicographic sequence tree after the fifth transaction comes in

Fig 4-14. The lexicographic sequence tree after the sixth transaction comes in

Fig 4-15. The transactions of a customer with no recent records in a data stream

Fig 4-16. An example of calculating the weights of customers

Fig 4-17. When a new transaction with TID = 8 comes in

Fig 4-18. The lexicographic sequence tree when the third transaction comes in (with the concept of customer weight)

Fig 4-19. An example of support updating in IncSPAM (Case 1)

Fig 4-20. An example of support updating in Incremental SPAM (Case 2)

Fig 4-21. The pseudo code of function UpdateSupport

Fig 5-1. Memory usage with different minimum support (T10I8D200K) (New-Moment and Moment)

Fig 5-2. Loading time of the first window with different minimum support (T10I8D200K) (New-Moment and Moment)

Fig 5-3. Average time of window sliding with different minimum support (T10I8D200K) (New-Moment and Moment)

Fig 5-4. Memory usage with different minimum support (T15I12D200K) (New-Moment and Moment)

Fig 5-5. Loading time of the first window with different minimum support (T15I12D200K) (New-Moment and Moment)

Fig 5-6. Average time of window sliding with different minimum support (T15I12D200K) (New-Moment and Moment)

Fig 5-7. Memory usage with different sliding window size (New-Moment and Moment)

Fig 5-8. Loading time of the first window with different sliding window size (New-Moment and Moment)

Fig 5-9. Average time of window sliding with different sliding window size (New-Moment and Moment)

Fig 5-10. Memory usage with different number of items (New-Moment and Moment)

Fig 5-11. Loading time of the first window with different number of items (New-Moment and Moment)

Fig 5-12. Average time of window sliding with different number of items (New-Moment and Moment)

Fig 5-13. Memory usage with different minimum support (IncSPAM)

Fig 5-14. Relationship between maximum number of tree nodes and memory usage (IncSPAM)

Fig 5-15. Average time of window sliding with different minimum support (IncSPAM)

Fig 5-16. Memory usage with different window size (IncSPAM)

Fig 5-17. Average sliding time with different window size (IncSPAM)

Fig 5-18. Memory usage with different number of customers (IncSPAM)

Fig 5-19. Average sliding time with different number of customers (IncSPAM)


List of Tables

Table 3-1. The bit-vectors of all items in each window in Figure 3-4

Table 4-1. Bit-vectors of all items for all customers

Table 5-1. Parameters of testing data for New-Moment


Chapter 1

Introduction

1.1 Overview and Motivation

Much work on mining frequent itemsets and sequential patterns focuses on static databases. In recent years, dynamic environments have become more and more important in many applications. Such a dynamic environment is called a data stream. However, there are some inherent limitations in a streaming environment. For example, data mining in sensor networks faces limitations different from traditional data mining, e.g. battery power and the capability of sensor CPUs [13].

Data streams have the characteristics described below [14] [19]: (1) the size of the input data is unbounded; (2) the usage of main memory is limited; (3) input data can be handled only once; (4) the arrival rate is fast; (5) the system cannot control the order in which data arrive; (6) analytical results generated by algorithms should be instantly available when users request them; (7) errors in the analytical results should be bounded within a range that users can tolerate.

Given the conditions above, many researchers adopt one of three models of time span: the landmark model, the sliding window model, and the damped window model [9].

The landmark model handles data in a time interval. The starting time point, called the landmark, is set by the user, and the end time point is the current time point, so the end point changes as time goes by. If the landmark is set to the time point at which the first transaction arrives, this model covers all the available data.

In the sliding window model, a window with length w is given. If the current time point is t, this model handles data in the range [t − w, t]. When time advances to the next time point, the window eliminates the passed data and takes in the new data; this process is called window sliding. Hence a sliding window always covers a range of the newest data in a data stream.

The damped window model also considers recent data important, but unlike the sliding window model it does not eliminate passed data. In this model all available data are kept, but the user defines a weighting function that decreases the weight of data exponentially into the past.
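As a minimal illustration of the damped window model's exponentially decreasing weight (the decay factor below is an assumed parameter for the sketch, not a value fixed by this thesis):

```python
def damped_weight(age, decay=0.5):
    """Weight of a transaction that is `age` time units old.

    `decay` is a user-chosen factor in (0, 1); with 0.5 the weight
    halves per time unit, decreasing exponentially into the past.
    """
    return decay ** age

# The newest transaction has full weight; older ones fade out.
weights = [damped_weight(age) for age in range(4)]  # [1.0, 0.5, 0.25, 0.125]
```

Unlike the sliding window model, no transaction is ever dropped; its influence simply becomes negligible as its weight decays.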

The three models have their own advantages and disadvantages. The sliding window model keeps the latest information in the data stream; this characteristic is useful when real-time patterns are needed, as in daily or weekly stock analysis.

In this thesis, we propose New-Moment, an algorithm that mines closed frequent itemsets, and IncSPAM, an algorithm that mines sequential patterns in data streams. These two algorithms efficiently retrieve useful patterns from data streams.

1.2 Related Work

1.2.1 Mining of Frequent Itemsets

In the sliding window model, an efficient algorithm, Moment, was proposed in [20, 21]. Moment uses a compact data structure, the closed enumeration tree (CET), to maintain a dynamically selected set of itemsets over a sliding window. The selected itemsets consist of the closed frequent itemsets and a boundary between the closed frequent itemsets and the rest of the itemsets. The CET covers all necessary information because any status change of an itemset (e.g. from infrequent to frequent) must pass through the boundary in the CET. Whenever a window slide occurs, Moment updates the counts of the related nodes in the CET and modifies the CET. The experiments on Moment show that the boundary in the CET is stable, so the updating cost is low. Moment outperforms the algorithms Charm [8] and ZIGZAG [7] in running time.

However, Moment maintains many boundary nodes besides the closed frequent itemsets. When the number of boundary nodes is much larger than the number of closed frequent itemsets, the memory usage of Moment becomes inefficient.

Our proposed algorithm, New-Moment, only maintains the closed frequent itemsets and uses bit-vectors to store the information in the window. Experiments show that the memory usage of New-Moment is much less than that of Moment, while the running time of the two algorithms is almost the same.

There is much research on mining frequent itemsets over data streams. Manku et al. [10] propose the lossy counting algorithm to mine frequent itemsets over an entire data stream. Jin et al. [15] propose the hCount algorithm to maintain frequent items in the streaming environment. Li et al. [19] propose DSM-FI, a projection-based, single-pass algorithm to mine frequent itemsets in the landmark model over a data stream. Chang et al. [22] propose an algorithm for mining frequent itemsets in the sliding window model. Chang et al. [16] propose the estDec algorithm, which uses a decay function to reduce the weight of old transactions. Research on mining maximal frequent itemsets and closed frequent itemsets over data streams is scarce. Li et al. [27] propose DSM-MFI to mine maximal frequent itemsets in the sliding window model in a data stream.

1.2.2 Mining of Sequential Patterns

There is also much research on mining sequential patterns in a static database. Agrawal et al. [2] introduce the concept of sequential patterns; they use an apriori method that is not efficient enough. Pei et al. [6] provide an efficient algorithm, PrefixSpan, to mine sequential patterns by prefix-projected pattern growth. Lin et al. [12] use memory indexing to decrease the time of mining sequential patterns, under the assumption that the entire sequence database can be loaded into main memory. The algorithm SPAM is provided in [11]; it uses a lexicographic sequence tree to check all possible frequent sequences and a bitmap representation to count supports efficiently.

Besides general sequential patterns, closed sequential patterns are also studied. Yan et al. [17] provide CloSpan to mine closed sequential patterns in large datasets. Chen et al. [23] mine multiple-level sequential patterns, using a concept hierarchy to represent the relationships between items. Because sequence databases change over time, incremental mining of sequential patterns is also a research issue; Yen et al. [3] and Cheng et al. [24] provide research in this area.

Research on mining sequential patterns in data streams is not as plentiful as in static databases. Teng et al. [18] provide FTP-DS to mine temporal patterns in a data stream; regression-based analysis of frequent patterns is its main feature for improving performance. Chen et al. [25] mine sequential patterns across many data streams. Marascu et al. [28] use the SMDS algorithm to mine web usage sequences.

1.3 Organization of Thesis

The remainder of this thesis is organized as follows. Basic definitions and terminology for itemsets, sequences, and the sliding window model are described in Chapter 2. The New-Moment algorithm for mining closed frequent itemsets is presented in Chapter 3. The IncSPAM algorithm for mining sequential patterns is introduced in Chapter 4. The experiments and performance measurements are described in Chapter 5. Conclusions and future work are given in Chapter 6.


Chapter 2

Problem Definition and Background

In this chapter we introduce the basic definitions of the problems. We introduce the data stream environment and the sliding window model in section 2.1. Next we describe the definitions of closed frequent itemsets and sequential patterns in sections 2.2 and 2.3.

2.1 The Sliding Window Model in Data Streams

2.1.1 Data Stream Environment

Fig 2-1. Processing model of data stream environment

A data stream DS = T1, T2, …, TM, … is an infinite set of transactions. In a data stream environment, the input is the continuous data stream, and each transaction can be scanned only once. Due to the limited memory and the one-pass scan of each transaction, a summary data structure is needed to store compact information about the data stream. In other words, one-pass algorithms for mining data streams have to sacrifice some correctness of their analytical results by allowing bounded counting error. Hence traditional multi-pass techniques for mining static databases are not feasible in the data stream environment. Figure 2-1 shows a processing model of data streams [19].

2.1.2 A Sliding Window Model

Some applications in data streams emphasize the importance of the latest transactions. A

sliding window model is suitable to solve this kind of problems. In the basic concept, a sliding

window keeps the latest N transactions in the data streams; N is called a window size. The mining data streams engine in Figure 2-1 only mines patterns in the current sliding window. Whenever a new transaction is coming, the sliding window eliminates the oldest transaction and appends the incoming transaction. This process is called window sliding. The mining data streams engine also modifies the summary data structure by the changes of sliding window. Figure 2-2 shows the sliding window in an input data stream.


Fig 2-2. A sliding window model in a data stream

2.2 Definition of Mining Closed Frequent Itemsets

I = {i1, i2, i3, …, in} is a set of literals, called items. An itemset is a set of items. A transaction database is a database holding a set of transactions. Each transaction T consists of a set of items from I, i.e., T ⊆ I, and a transaction id (TID) that represents the time order in the database. An itemset X is said to be contained in a transaction T if X ⊆ T. The support of an itemset X is the number of transactions containing X. An itemset X is a frequent itemset if the support of X is no less than a user-specified minimum support threshold S (given as a fraction of the number of transactions).

As an example, let I = {a, b, c, d}, DI = {(a, b, c), (b, c, d), (a, b, c), (b, c)}, and S = 0.5. The set of frequent itemsets is F = {(a): 2, (b): 4, (c): 4, (a, b): 2, (a, c): 2, (b, c): 4, (a, b, c): 2}. The number following each colon represents the support of the itemset.

The total number of frequent itemsets is sometimes too large, making it difficult to retrieve useful information. To reduce the number of output patterns, the concept of closed frequent itemsets [4] was proposed.

Definition of Closed Frequent Itemset. A frequent itemset X is closed if there is no frequent itemset X′ such that (1) X ⊂ X′ and (2) for every transaction T, X ⊆ T ⇒ X′ ⊆ T (i.e., X′ occurs in exactly the transactions in which X occurs).

In the above example, the set of closed frequent itemsets is C = {(b, c): 4, (a, b, c): 2}, C ⊆ F. Observe the itemsets (b), (c), and (b, c). The supports of itemsets (b) and (c) are equal to the support of itemset (b, c); furthermore, itemsets (b) and (c) are subsets of (b, c). That means itemsets (b) and (c) occur in exactly the same transactions as itemset (b, c). By the definition of a closed frequent itemset, (b) and (c) are not closed frequent itemsets, while (b, c) is a closed frequent itemset.

All frequent itemsets can be obtained from the closed frequent itemsets without losing support information. In the above example, we know that (b) and (c) are frequent itemsets by observing the closed frequent itemset (b, c), and their supports are the same as that of (b, c). Conversely, the supports of frequent itemsets can be used to judge whether a frequent itemset is closed.
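These definitions can be sketched by brute-force enumeration on the toy database above (exponential in the transaction width, for illustration only; this is not the thesis's mining algorithm):

```python
from itertools import combinations

def closed_frequent_itemsets(db, min_count):
    """Enumerate frequent itemsets of db and keep only the closed ones."""
    support = {}
    for txn in db:
        items = sorted(set(txn))
        for k in range(1, len(items) + 1):
            for cand in combinations(items, k):
                support[cand] = support.get(cand, 0) + 1
    frequent = {x: s for x, s in support.items() if s >= min_count}
    # x is closed iff no frequent proper superset has the same support
    closed = {
        x: s for x, s in frequent.items()
        if not any(set(x) < set(y) and s == sy for y, sy in frequent.items())
    }
    return frequent, closed

DI = [("a", "b", "c"), ("b", "c", "d"), ("a", "b", "c"), ("b", "c")]
F, C = closed_frequent_itemsets(DI, 2)  # S = 0.5 over 4 transactions
# C == {("b", "c"): 4, ("a", "b", "c"): 2}
```

The result matches the worked example: F has seven frequent itemsets, of which only (b, c) and (a, b, c) are closed.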


2.3 Definition of Mining Sequential Patterns

An input database DS contains customer-transactions. These customer-transactions are slightly different from the transactions in section 2.1.1: each customer-transaction consists of a customer-id (CID), a transaction-id (TID), and the items purchased in the transaction (called an itemset). The concepts of TID, item, and itemset here are the same as in section 2.2. The difference is that each transaction in DS belongs to some customer. Figure 2-3 shows an example of the transaction database DS.

Customer ID (CID)  Transaction ID (TID)  Itemset
1                  1                     (a, b, d)
2                  2                     (b)
1                  3                     (b, c, d)
2                  4                     (a, b, c)
3                  5                     (a, b)
1                  6                     (b, c, d)
3                  7                     (b, c, d)

Fig 2-3. An example of an input database DS

A sequence is an ordered list of itemsets and is denoted as S = 〈s1 s2 s3 … sk〉, where each sj is an itemset. A sequence α = 〈a1 a2 a3 … an〉 is contained in another sequence β = 〈b1 b2 b3 … bm〉 if there exist integers 1 ≤ i1 < i2 < … < in ≤ m such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin.

All the transactions of a customer can be viewed as a sequence, where each transaction corresponds to a set of items, and the list of transactions, ordered by increasing transaction-id, corresponds to a sequence. We call such a sequence a customer-sequence. Figure 2-4 shows the customer-sequences in Figure 2-3.


CID  Sequence
1    <(a, b, d)(b, c, d)(b, c, d)>
2    <(b)(a, b, c)>
3    <(a, b)(b, c, d)>

Fig 2-4. The customer-sequences in Fig 2-3

The absolute support of a sequence S is defined as the number of customer-sequences containing S. Sequential patterns, also called frequent sequences, are the sequences whose supports are no less than a user-defined minimum support.
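The containment and support definitions above can be sketched as follows; greedy earliest matching suffices, because each itemset of α only needs some later itemset of β to contain it (function names are ours):

```python
def contains(beta, alpha):
    """True if sequence alpha (a list of itemsets) is contained in beta."""
    i = 0
    for b in beta:
        if i < len(alpha) and set(alpha[i]) <= set(b):
            i += 1
    return i == len(alpha)

def absolute_support(db, alpha):
    """Number of customer-sequences in db containing alpha."""
    return sum(contains(seq, alpha) for seq in db)

# The customer-sequences of Fig 2-4:
db = [
    [("a", "b", "d"), ("b", "c", "d"), ("b", "c", "d")],  # CID 1
    [("b",), ("a", "b", "c")],                            # CID 2
    [("a", "b"), ("b", "c", "d")],                        # CID 3
]
# <(a)(c, d)> is contained in the sequences of customers 1 and 3 only.
```

For instance, `absolute_support(db, [("a",), ("c", "d")])` is 2: customer 2 buys a only in its last transaction, so no later transaction can supply (c, d).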


Chapter 3

New-Moment: Mining Closed Frequent Itemsets

The goal of New-Moment is to improve the Moment algorithm. We first introduce the Moment algorithm in section 3.1, and then our proposed algorithm, New-Moment, in section 3.2.

3.1 Related Work: Moment Algorithm

The Moment algorithm [20, 21] mines closed frequent itemsets under the sliding window model in a data stream. It uses a closed enumeration tree (CET) to maintain the closed frequent itemsets in the current window. The CET maintains not only closed frequent itemsets but also some boundary tree nodes. Figure 3-1 shows the CET in the first window; the window size is assumed to be 4, and the first four incoming transactions are listed on the left of the figure.

[Figure: the CET in the first window (minsup = 2, window size = 4) over the transactions TID 1: (c, d), TID 2: (a, b), TID 3: (a, b, c), TID 4: (a, b, c). Legend: infrequent gateway node, unpromising gateway node, intermediate node, closed node.]

Fig 3-1. CET in the first sliding window


Each node in the CET has one of the following four types.

(1) infrequent gateway nodes

A node nI that represents itemset I is an infrequent gateway node if i) I is an infrequent itemset, ii) nI's parent, nJ, is frequent, and iii) I is the result of joining I's parent, J, with one of J's frequent siblings. In Figure 3-1, the tree node (d) is an infrequent gateway node.

(2) unpromising gateway nodes

A node nI is an unpromising gateway node if i) I is a frequent itemset and ii) there exists a closed frequent itemset J such that I ⊂ J and J has the same support as I does. In Figure 3-1, the tree nodes (a, c) and (b) are unpromising gateway nodes.

(3) intermediate nodes

A node nI is an intermediate node if i) I is a frequent itemset, ii) nI has a child node nJ such that J has the same support as I does, and iii) nI is not an unpromising gateway node. In Figure 3-1, the tree node (a) is an intermediate node because its child (a, b) has the same support as (a) does.

(4) closed nodes

These nodes represent closed frequent itemsets in the current window. A closed node can be an internal node or a leaf node. In Figure 3-1, (c), (a, b), and (a, b, c) are closed nodes.

Besides the closed nodes, Moment keeps the three types of boundary nodes. These nodes are the most likely candidates to become new closed nodes in the next window, and Moment keeps them to speed up the modification of the closed enumeration tree.

There are three steps in Moment algorithm:

(1) Building the closed enumeration tree (CET)

When the total number of transactions coming from the data stream does not exceed the window size N, Moment just saves these transactions in its sliding window. As soon as the window is full, Moment builds an initial closed enumeration tree (CET). Figure 3-1 shows the tree in the first window.

Moment adopts a depth-first procedure to generate all possible candidate itemsets in the window and check their supports. In the procedure, if a node is found to be infrequent, it is marked as an infrequent gateway node and Moment does not explore its descendants further.

If a node is a frequent itemset but not a closed frequent itemset, the node is marked as an unpromising gateway node. Moment does not explore its descendants either, because they contain no closed frequent itemsets. Moment uses the support of a node together with the sum of the tids of the transactions containing it (tid_sum) to check whether the node is a closed node. Take the nodes (a, c) and (a, b, c) in Figure 3-1 as an example. The support of (a, c) is the same as that of (a, b, c), and the tid_sum of (a, c) is 7 (the third and the fourth transactions in the window), which is equal to the tid_sum of (a, b, c). By the definition of closed frequent itemsets, (a, c) is not a closed node.

If a node is found to be neither an infrequent node nor an unpromising gateway node, Moment explores its descendants. The nodes that are intermediate nodes or closed nodes are maintained in the CET.
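The support/tid_sum closedness test above can be sketched on the Window #1 transactions (the `stats` helper is our own illustration, not part of Moment's published interface):

```python
def stats(itemset, window):
    """Return (support, tid_sum) of itemset over window = [(tid, txn), ...]."""
    tids = [tid for tid, txn in window if itemset <= txn]
    return len(tids), sum(tids)

window1 = [(1, {"c", "d"}), (2, {"a", "b"}),
           (3, {"a", "b", "c"}), (4, {"a", "b", "c"})]

# (a, c) and (a, b, c) agree on both support (2) and tid_sum (3 + 4 = 7),
# so they occur in exactly the same transactions: (a, c) is not closed.
```

Matching (support, tid_sum) pairs are a cheap proxy for "occurs in the same transactions" that avoids storing full transaction lists per node.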

(2) Updating the CET

The initial closed enumeration tree is built when the number of incoming transactions from the data stream reaches the window size. After that, whenever a new transaction comes from the data stream, Moment updates the CET to maintain the closed frequent itemsets in the current window. There are two steps for updating the CET:


Adding the new transaction to the window

[Figure: the CET after the new transaction (a, c, d) with tid = 5 arrives (minsup = 2, window size = 4). Updated nodes include (a): 4, (c): 4, (d): 2, and (a, c): 3; (a, d): 1 is an infrequent gateway node and (c, d): 2 is a new closed node.]

Fig 3-2. Adding the new transaction with tid = 5

In Figure 3-2, a new transaction T (tid = 5) is added to the sliding window. Moment traverses the parts of the CET that are related to transaction T. For each related node nI, in depth-first order, Moment updates its support and tid_sum. Whenever a node is updated, Moment checks whether it needs to change its node type.

In Figure 3-2, the node (d) becomes a new frequent node, so Moment generates the new candidate nodes (a, d) and (c, d). By the node properties, Moment knows that (a, d) is an infrequent gateway node and (c, d) is a new closed node. By checking the supports of the nodes (a), (a, c), and (c), Moment changes them to closed nodes.

Deleting the oldest transaction in the window

[Figure: the CET after the transaction with tid = 1 is deleted (minsup = 2, window size = 4); node (c): 3 becomes an unpromising gateway node and (d): 1 becomes an infrequent gateway node.]

Fig 3-3. Deleting the transaction with tid = 1

In Figure 3-3, the transaction with tid = 1 is deleted. As with adding a new transaction, Moment updates the support and tid_sum of each related node in the CET and, by checking the support of each node, modifies its node type.

In Figure 3-3, node (c) becomes an unpromising gateway node because it is contained in node (a, c) and the supports of (c) and (a, c) are the same. The subtree of node (c), which contains (c, d), is then deleted, and the node (d) becomes a new infrequent gateway node.

Moment maintains a large number of boundary nodes to speed up the procedure of updating the CET, so the cost for a node to change its type is small. However, we find that those boundary nodes are an unnecessary overhead. In our proposed algorithm, New-Moment, we reduce the number of tree nodes and utilize an efficient structure to store the information of the sliding window.

3.2 Our Proposed Algorithm: New-Moment Algorithm

We use bit-vectors to store the information of a sliding window. Because bit-vectors are efficient for counting supports and for updating the transactions in the window, New-Moment only maintains the closed frequent itemsets in each sliding window. The new closed enumeration tree (New-CET) is composed of the bit-vectors of all 1-itemsets, the closed frequent itemsets in the current sliding window, and a hash table.

3.2.1 Bit-Vector

Definition of Bit-Vector: For a specified item i and a given window w of the sliding window model in a data stream, a bit-vector is used to store the occurrences of item i in the transactions of w. Each bit of a bit-vector represents a transaction in w; the bit is set to one if the corresponding transaction contains item i, and to zero otherwise.

Figure 3-4 shows an example of the input database, with the first three sliding windows displayed next to it. These windows are marked from Window #1 to Window #3, and the size of the sliding window is assumed to be 4. The example of Figure 3-4 will be used in the following context.

TID  Itemsets
1    c, d
2    a, b
3    a, b, c
4    a, b, c
5    a, c, d
6    b, c

Window size N = 4; Window #1 covers TID 1–4, Window #2 covers TID 2–5, and Window #3 covers TID 3–6.

Fig 3-4. An example database and the first three sliding windows

Each window in Figure 3-4 can be transformed into bit-vectors by the definition above. The bit-vectors of all items in each window are listed in Table 3-1. The leftmost bit represents the oldest transaction and the rightmost bit the most recent one.

Item  Window #1  Window #2  Window #3
a     0111       1111       1110
b     0111       1110       1101
c     1011       0111       1111
d     1000       0001       0010

Table 3-1. The bit-vectors of all items in each window in Figure 3-4

3.2.2 Window Sliding with Bit-Vector


Bit-vectors are efficient in the window sliding process, which can be separated into two steps:

(1) Delete the oldest transaction

The only thing a bit-vector needs to do is to left-shift by one bit. Take item a as an example: a's bit-vector is 0111 in the first window. When the transaction with TID = 1 is deleted, a's bit-vector becomes 1110. Now the leftmost bit represents the transaction with TID = 2, and the rightmost bit is meaningless for the moment, reserved for the next step.

(2) Append the incoming transaction

After deleting the oldest transaction, the rightmost bit of each bit-vector is set according to the incoming transaction: the bit-vectors of the items contained in the incoming transaction set their rightmost bit to one, and the others set it to zero. The incoming transaction (a, c, d) with TID = 5 contains a, so a's bit-vector becomes 1111; it does not contain b, so b's bit-vector stays 1110 after the shift. Both values match Window #2 in Table 3-1.
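The two sliding steps can be sketched with plain integer bit operations, using the Window #1 values from Table 3-1 (a minimal illustration; variable names are ours):

```python
W = 4                  # window size
MASK = (1 << W) - 1    # keep only W bits

def slide(bitvec, in_new_txn):
    """One window slide for a single item's bit-vector."""
    bitvec = (bitvec << 1) & MASK  # (1) delete oldest: left-shift one bit
    if in_new_txn:                 # (2) append incoming transaction:
        bitvec |= 1                #     set the rightmost bit if item present
    return bitvec

# Item a in Window #1 is 0111; TID 5 = (a, c, d) contains a -> 1111.
# Item b in Window #1 is 0111; TID 5 does not contain b     -> 1110.
```

Masking with `MASK` discards the bit of the evicted oldest transaction, so each item's state stays within W bits per slide.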

3.2.3 Counting Support with Bit-Vector

The concept of the bit-vector can be extended to itemsets. Assume two itemsets X and Y with corresponding bit-vectors BITX and BITY. The bit-vector of the itemset Z = X ∪ Y can be obtained by the bitwise AND of BITX and BITY. For example, the bit-vector of itemset (a, b) in the first window (Window #1) is 0111, obtained by the bitwise AND of the bit-vectors of items a (0111) and b (0111). That means (a, b) occurs in the second, third, and fourth transactions of the first sliding window. By bitwise AND between bit-vectors, the bit-vector of any generated candidate itemset can be obtained.

The support of each itemset is obtained by counting how many bits of its bit-vector are set to one. For example, the support of itemset (a, b) in Window #1 is 3.
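Support counting by bitwise AND plus popcount can be sketched on the Window #1 bit-vectors from Table 3-1 (the `support` helper name is ours):

```python
def support(*bitvecs):
    """AND the item bit-vectors together and count the set bits."""
    v = bitvecs[0]
    for bv in bitvecs[1:]:
        v &= bv
    return bin(v).count("1")

a, b, c = 0b0111, 0b0111, 0b1011  # Window #1 bit-vectors of items a, b, c

# support(a, b) == 3: (a, b) occurs in transactions 2, 3, 4 of Window #1.
```

This is why New-Moment can afford to recount candidate supports on demand instead of storing boundary nodes: one AND and one popcount per joined item.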

3.2.4 Building the New Closed Enumeration Tree (New-CET)

To improve the efficiency of the CET in Moment, we propose a new closed enumeration tree (New-CET). New-CET is basically a lexicographical tree. There are three important parts in New-CET:

(1) Bit-vectors of all items (1-itemsets)

Moment maintains an independent sliding window for counting the support of each node in the CET. Instead of an independent sliding window storing the current N transactions, New-Moment maintains the information of these transactions in the bit-vectors of all items.

(2) Closed frequent itemsets in the current window

Each closed frequent itemset only maintains its support.

(3) Hash table

To check whether a frequent itemset is closed, we need a hash table that stores all closed frequent itemsets with their supports as keys. Whenever a new frequent itemset is generated, we can judge whether it is closed by hashing its support into the hash table. How the support information is used to judge whether a frequent itemset is closed is introduced in section 2.2.
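A sketch of the support-keyed hash table check, assuming (as the description above implies) that a frequent itemset is non-closed exactly when some already-known closed itemset with the same support is a proper superset of it. The helper names are ours.

```python
closed = {}  # support -> list of closed frequent itemsets (frozensets)

def is_closed(itemset, sup):
    """Non-closed iff a same-support closed proper superset already exists."""
    return not any(itemset < other for other in closed.get(sup, []))

def insert_closed(itemset, sup):
    closed.setdefault(sup, []).append(itemset)

insert_closed(frozenset("abc"), 2)  # (a, b, c) with support 2
insert_closed(frozenset("ab"), 3)   # (a, b) with support 3
```

With these entries, (a, c) with support 2 is judged non-closed because (a, b, c) has the same support, matching the running example.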

Building New-CET is almost the same as building CET. The major difference is that New-CET retains only the bit-vectors of items and the closed frequent itemsets, and the bit-vectors are used to count the supports of generated candidates.

When the total number of incoming transactions is less than the size of the sliding window, New-Moment only records the item information as introduced in section 3.2.1. When the window becomes full, New-Moment starts to generate all possible candidates and check their supports. Because the candidates are generated from a parent node and the parent's frequent siblings, we can obtain their supports by the method introduced in section 3.2.3. Then, for each frequent candidate, the hash table is used to check whether the candidate is closed. If the candidate is closed, it is inserted into the hash table; if not, the node is not maintained in New-CET. Figure 3-5 shows the pseudo code of building New-CET.

Build (n_I, N, S)
1:  if support(n_I) ≥ S · N then
2:    if leftcheck(n_I) = false then
3:      foreach frequent sibling n_K of n_I do
4:        generate a new child n_IK for n_I;
5:        bitwise AND BIT_I and BIT_K to obtain BIT_IK;
6:    foreach child n_I′ of n_I do
7:      Build(n_I′, N, S);
8:    if ∄ a child n_I′ of n_I such that support(n_I′) = support(n_I) then
9:      retain n_I as a closed frequent itemset;
10:     insert n_I into the hash table;

Fig 3-5. Pseudo code of building New-CET

n_I is a tree node, N is the window size, and S is the minimum support. Each n_I has a corresponding bit-vector BIT_I that stores the information of the sliding window. Except for the bit-vectors of items, the BIT_I of a node n_I exists only while counting the support of a new candidate.

Figure 3-6 shows the New-CET in the first window of the previous example when generating new candidates from item a. For simplicity, the hash table is not displayed. From the bit-vectors of items, we know that items a, b, and c are frequent. Taking item a as an example, new candidates (a, b) and (a, c) are generated. By bitwise ANDing the bit-vectors of items a and b, we obtain that the support of (a, b) is 3. In the same way, the support of (a, c) is 2 and the support of (a, b, c) is 2. For generating candidates below item a, the bit-vectors of (a, b), (a, c), and (a, b, c) are temporarily maintained in memory.


Fig 3-6. New-CET in the first window after generating new candidates from item a

Figure 3-7 shows the New-CET after checking whether each frequent candidate is closed. The tree nodes marked with squares are closed frequent itemsets. By checking supports with the hash table, we find that the frequent itemset (a, c) is not closed, so New-Moment eliminates this node; the other frequent candidates are marked as closed frequent itemsets. Although item a is not closed, New-Moment still maintains the bit-vector of item a. After the sub-tree of item a is checked, the bit-vectors in this sub-tree are eliminated; New-Moment keeps only the supports of closed frequent itemsets.

Fig 3-7. New-CET in the first window after checking closed frequent itemsets

Figure 3-8 shows the New-CET when Build is done. The sub-tree generation of items b and c is the same as for item a. Item c is a new closed frequent itemset.


Fig 3-8. New-CET in the first window (Window #1)

3.2.5 Deleting the Oldest Transaction in Window Sliding

Deleting the oldest transaction is the first step of window sliding. First, all bit-vectors of items are left-shifted by one bit, and the items contained in the deleted transaction are recorded; they can be identified by observing the leftmost bit of each bit-vector before left-shifting. After the bit-vectors of items are modified, New-Moment begins to modify the New-CET.
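Using the Window #1 bit-vectors from Figure 3-6, this step can be sketched as follows (the variable names are ours): the leftmost bit of each item's vector, read before the shift, identifies the items of the transaction being dropped.

```python
N = 4
items = {"a": 0b0111, "b": 0b0111, "c": 0b1011, "d": 0b1000}  # Window #1

# Items whose leftmost bit is one belong to the transaction being deleted.
deleted_txn = {i for i, bv in items.items() if (bv >> (N - 1)) & 1}

# Then left-shift every bit-vector by one bit.
items = {i: (bv << 1) & ((1 << N) - 1) for i, bv in items.items()}
```

Here deleted_txn is {c, d}, matching the example where the oldest transaction is (c, d), and the shifted vectors match Figure 3-9.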

There is only one case to handle when the oldest transaction is deleted: closed frequent itemsets in the New-CET may become non-closed or infrequent. To check this situation, New-Moment traverses the New-CET again and checks the supports of the existing nodes. Because only subsets of the deleted transaction can become infrequent, only the sub-trees of the items in the deleted transaction need to be checked. The traversing method, the function Delete, is almost the same as building the initial New-CET. The difference is that Delete generates the entire lexicographical tree including the itemsets whose supports are (S · N – 1), because the support of some closed frequent itemsets in the previous window may have been (S · N) and drops to (S · N – 1) after the deletion.

(31)

Figure 3-9 shows the New-CET after deleting the oldest transaction. In the above example, the deleted transaction is (c, d), so only the sub-trees of items c and d need to be checked. We find that item c is no longer a closed frequent itemset. Item d is infrequent, so its sub-tree does not need to be checked. Figure 3-10 shows the pseudo code of deleting the oldest transaction after left-shifting all bit-vectors of 1-itemsets.

Fig 3-9. New-CET after deleting the oldest transaction (c, d)

Delete (n_I, N, S)
1:  if n_I is not relevant to the deleted transaction then
2:    return;
3:  else if support(n_I) ≥ (S · N – 1) then
4:    foreach sibling n_K of n_I whose support ≥ (S · N – 1) do
5:      generate a new child n_IK for n_I;
6:      bitwise AND BIT_I and BIT_K to obtain BIT_IK;
7:    foreach child n_I′ of n_I do
8:      Delete(n_I′, N, S);
9:    if support(n_I) ≥ S · N then
10:     if leftcheck(n_I) = false then
11:       if n_I is a closed frequent itemset in the previous sliding window then
12:         update the support of n_I;
13:         update n_I in the hash table;
14:       else
15:         retain n_I as a closed frequent itemset;
16:         insert n_I into the hash table;
17:     else //leftcheck(n_I) = true
18:       if n_I is a closed frequent itemset in the previous sliding window then
19:         mark n_I as a non-closed frequent itemset;
20:         eliminate n_I from the hash table;
21:   else //support(n_I) < S · N
22:     if n_I is a closed frequent itemset in the previous sliding window then
23:       mark n_I as a non-closed itemset;
24:       eliminate n_I from the hash table;

Fig 3-10. Pseudo code of deleting the oldest transaction

3.2.6 Appending the Incoming Transaction in Window Sliding

Appending the incoming transaction is the second step of window sliding. The rightmost bits of all item bit-vectors are set according to the items contained in the incoming transaction. After the bit-vectors of items are modified, New-Moment begins to modify the New-CET. Only the sub-trees of the items in the inserted transaction need to be checked.

The method of traversing the New-CET to add a new transaction, the function Append, is the same as the function Build. A small difference is that the supports of existing closed frequent itemsets in the hash table need to be modified. Figure 3-11 shows the pseudo code of appending the incoming transaction after setting the rightmost bit of each 1-itemset's bit-vector.

Append (n_I, N, S)
1:  if support(n_I) ≥ S · N then
2:    if leftcheck(n_I) = false then
3:      foreach frequent sibling n_K of n_I do
4:        generate a new child n_IK for n_I;
5:        bitwise AND BIT_I and BIT_K to obtain BIT_IK;
6:    foreach child n_I′ of n_I do
7:      Append(n_I′, N, S);
8:    if ∄ a child n_I′ of n_I such that support(n_I′) = support(n_I) then
9:      if n_I is a closed frequent itemset in the previous sliding window then
10:       update the support of n_I;
11:       update n_I in the hash table;
12:     else
13:       retain n_I as a closed frequent itemset;
14:       insert n_I into the hash table;

Fig 3-11. Pseudo code of appending the incoming transaction

Figure 3-12 shows the New-CET after appending the incoming transaction. This is also the New-CET in the second sliding window.

Fig 3-12. New-CET in the second sliding window (Window #2)

Chapter 4

Incremental SPAM (IncSPAM): Mining Sequential Patterns

Sequential pattern mining is more complicated than frequent itemset mining, especially in the stream environment. In previous research, there is no general processing model that handles a data stream with a transaction unit. Incremental SPAM (IncSPAM) provides a suitable sliding window model for a data stream. It receives transactions from the data stream and uses a new bit-vector representation, the Customer Bit-Vector Array with Sliding Window (CBASW), to store the information of items for each customer. IncSPAM then uses a lexicographic sequence tree to maintain the sequential patterns in the current window. To speed up the maintenance process, IncSPAM uses index sets to store, for each tree node, the first occurring positions in all customer-sequences. Whenever a new transaction arrives, the CBASWs and the lexicographic tree are modified incrementally, so each transaction can be processed in a few seconds. Finally, a weight function is adopted in this sliding window model; it judges the importance of a sequence and ensures the correctness of the sliding window model.

4.1 A New Concept of Sliding Window for Sequences

The original sliding window model keeps the latest transactions in a data stream. In the mining of sequential patterns, the transactions in a data stream belong to many customers. IncSPAM keeps the latest N transactions for each customer, where N is called the window size; each customer maintains his own sliding window. Figure 4-1 shows an example of this sliding window model.

Transaction Database:

TID | CID | Itemset
 1  |  1  | (a, b, d)
 2  |  2  | (b)
 3  |  1  | (b, c, d)
 4  |  2  | (a, b, c)
 5  |  3  | (a, b)
 6  |  1  | (b, c, d)
 7  |  3  | (b, c, d)

Sequence Database (N = 2; only the latest N transactions of each customer are kept in its window):

CID | Customer-Sequence
 1  | <(a, b, d) (b, c, d) (b, c, d)>
 2  | <(b) (a, b, c)>
 3  | <(a, b) (b, c, d)>

Fig 4-1. An example for the new concept of sliding window

In Figure 4-1, the mining system has received 7 transactions. Assume that the window size of each customer is 2, i.e., each customer keeps the latest 2 transactions in the data stream. There are three transactions belonging to customer #1: the transactions with TID = 1, 3, and 6. Only the transactions with TID = 3 and 6 are stored in the sliding window of customer #1. With the same concept, the sliding windows of customer #2 and customer #3 are also displayed in Figure 4-1. This example will be used throughout the introduction of IncSPAM.

4.2 Customer Bit-Vector Array with Sliding Window (CBASW)

IncSPAM also uses bit-vectors to store the information of the sliding window. The concept is almost the same as in section 3.2.1; the difference is that each customer has his own bit-vectors for all items, storing the information of his own sliding window. Table 4-1 shows the bit-vectors of each customer in Figure 4-1. All these bit-vectors are collected into one data structure per customer.

Item | Customer #1 | Customer #2 | Customer #3
  a  |     00      |     01      |     10
  b  |     11      |     11      |     11
  c  |     11      |     01      |     01
  d  |     11      |     00      |     01

Table 4-1. Bit-vectors of all items for all customers

Definition of Customer Bit-Vector Array with Sliding Window (CBASW): for each customer-sequence c, we keep the latest N transactions, where N is called the window size. The bit-vector of each item i contains N bits that represent the occurrences of i in the latest N transactions of c.
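A CBASW can be sketched as one map from items to N-bit integers per customer. Using the Table 4-1 values (N = 2), the support of a 1-sequence is then the number of customers whose bit-vector for that item is non-zero; the layout and helper names below are ours.

```python
N = 2
cbasw = {  # customer id -> {item -> N-bit bit-vector}, values from Table 4-1
    1: {"a": 0b00, "b": 0b11, "c": 0b11, "d": 0b11},
    2: {"a": 0b01, "b": 0b11, "c": 0b01, "d": 0b00},
    3: {"a": 0b10, "b": 0b11, "c": 0b01, "d": 0b01},
}

def support_1seq(item):
    """Number of customer-sequences that contain the item at least once."""
    return sum(1 for bvs in cbasw.values() if bvs[item] != 0)
```

For example, support_1seq("a") is 2, matching the support of <(a)> in the running example.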

Figure 4-2 shows an example of CBASW. Each bar with a customer id is a CBASW which contains all bit-vectors of all items for a customer.


Fig 4-2. An example of CBASW

A lexicographical tree is needed in the mining process of Incremental SPAM. Although the CBASWs keep all item (1-itemset) information for all customers, they are not efficient enough for building and modifying the lexicographical tree. In the next section, the concept of the Index Set is introduced to speed up the process.


4.3 Index Set ρ-idx

Building and modifying a lexicographical tree for mining sequential patterns is more complicated than for mining frequent itemsets, and the number of candidate sequences is huge. The Index Set of a sequence stores only the first occurring positions of the sequence in all customer-sequences; the memory usage of these positions is less than that of bit-vectors.

Definition of Index Set: for a sequence ρ, the first occurring position of ρ in a customer-sequence c is recorded as ρ-pos_c; if ρ is not in c, ρ-pos_c = 0. The collection of these ρ-pos values in the order of customer id is called the index set ρ-idx. For convenience, ρ-pos_c can be written as ρ-idx[c].

Take the CBASWs in Figure 4-2 as an example: the index sets of the 1-sequences (items) are listed in Figure 4-3. Each number in an array represents the first position of the sequence ρ in one customer-sequence. By counting the number of non-zero positions in an index set, we obtain the support of the sequence. Take the 1-sequence <(a)> as an example: the number of non-zero positions in <(a)>-idx is 2, which means that <(a)> exists in two customer-sequences (CID = 2 and CID = 3), so the support of <(a)> is 2. The support of each sequence is recorded after the index set.
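The ρ-idx of a 1-sequence can be derived directly from the bit-vectors. A sketch under the conventions above, with our own helper names: position 1 is the oldest transaction and 0 means the item is absent.

```python
N = 2  # window size of the running example

def first_pos(bitvec):
    """1-based position of the first set bit, scanning from the oldest side."""
    for p in range(1, N + 1):
        if (bitvec >> (N - p)) & 1:
            return p
    return 0

# Bit-vectors of item a for customers 1..3 (Table 4-1): 00, 01, 10.
a_idx = [first_pos(bv) for bv in (0b00, 0b01, 0b10)]   # [0, 2, 1]
a_support = sum(1 for p in a_idx if p != 0)            # 2
```

This reproduces <(a)>-idx = [0, 2, 1] with support 2 from the example.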

Fig 4-3. Index sets of all 1-sequences: <(a)>-idx = [0, 2, 1]: 2, <(b)>-idx = [1, 1, 1]: 3, <(c)>-idx = [1, 2, 2]: 3, <(d)>-idx = [1, 0, 2]: 2

4.4 Maintaining Information of Items in Window Sliding with CBASW and ρ-idx

Whenever a new transaction arrives from the data stream, its CID is checked to find which CBASW needs to be modified, and each bit-vector in that CBASW is updated by the incoming transaction. As in section 3.2.2, if the number of transactions of the customer exceeds the window size, window sliding is performed.

The window sliding process is the same as in section 3.2.2. Each bit-vector left-shifts one bit to eliminate the oldest transaction and sets its rightmost bit according to the incoming transaction: the bit-vectors of the items in the incoming transaction set their rightmost bits to one; the others set them to zero. Figure 4-4 shows an example of this sliding process.

In Figure 4-4, we assume that the system has received 5 transactions and only the CBASW of customer #1 is observed. The first CBASW shows the situation before the new transaction with TID = 6 arrives. The second CBASW shows the result after left-shifting each bit-vector by one bit. The third CBASW shows the final result after setting the rightmost bits according to the incoming transaction.


Fig 4-4. An example of window sliding in a CBASW

The ρ-idx of each item (1-sequence) is maintained according to the CBASWs. Whenever a window slide is performed on the CBASW of a customer c, each ρ-idx[c] is decreased by one. If a ρ-idx[c] becomes zero after the decrease, the bit-vector of ρ is checked to find the new first occurring position; if ρ no longer exists in this customer-sequence, ρ-idx[c] is set to zero.
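This maintenance step for one ρ-idx[c] entry can be sketched as follows (N = 2 and the helper names are ours), applied after the customer's bit-vectors have already been shifted:

```python
N = 2

def first_pos(bitvec):
    """1-based position of the first set bit, scanning from the oldest side."""
    for p in range(1, N + 1):
        if (bitvec >> (N - p)) & 1:
            return p
    return 0

def slide_idx(pos, shifted_bitvec):
    """Decrease a first-occurrence position after this customer's window
    slides; rescan the (already shifted) bit-vector if it reaches zero."""
    pos -= 1
    if pos <= 0:
        pos = first_pos(shifted_bitvec)  # 0 if rho no longer occurs
    return pos
```

For instance, slide_idx(1, 0b00) returns 0, matching <(a)>-idx[1] dropping from 1 to 0 in the example below.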

In the first CBASW of Figure 4-4, <(a)>-idx[1] is 1. After window sliding, <(a)>-idx[1] is decreased to 0. In the third CBASW of Figure 4-4, <(a)>-idx[1] is 0 because the sequence <(a)> no longer exists in the customer-sequence of customer #1. The support of the sequence <(a)> also decreases to 2.


4.5 Building of Lexicographical Sequence Tree

Section 4.4 introduced the algorithm for maintaining item information in a data stream. After the items of the incoming transaction are processed, they are used to generate all possible sequences, whose supports are then checked. The SPAM algorithm [11] presents the concept of a lexicographical sequence tree to achieve this goal.

Assume that there is a lexicographical ordering ≤ of the items I in the data stream. If item i occurs before item j in the ordering, we denote this by i ≤_I j. This ordering can be extended to sequences by defining s_a ≤ s_b if s_a is a subsequence of s_b; if s_a is not a subsequence of s_b, there is no relationship in this ordering.

The root of the tree is labeled with ∅. Recursively, if n is a node in the tree, then n's children are all nodes n′ such that n ≤ n′ and ∀m ∈ T: n′ ≤ m ⇒ n ≤ m. Figure 4-5 shows an example of a lexicographic sequence tree (modified from [11]): a sub-tree of the sequence tree for two items a and b, down to the fourth level.

Fig 4-5. A sub-tree of the lexicographic sequence tree for items a and b to the fourth level

Each sequence in the sequence tree is either a sequence-extended sequence or an itemset-extended sequence. A sequence-extended sequence is generated by adding a 1-itemset to the end of its parent's sequence in the lexicographical tree, like <(a)(a)(a)> and <(a)(a)(b)> generated from <(a)(a)> in Figure 4-5. An itemset-extended sequence is generated by adding an item to the last itemset of the parent's sequence, where the order of the added item is greater than that of any item in the last itemset, like <(a)(a, b)> generated from <(a)(a)> in Figure 4-5.

If we generate sequences by traversing the tree, each node can generate sequence-extended children and itemset-extended children. We refer to the process of generating sequence-extended sequences as the sequence-extension step (S-step) and the process of generating itemset-extended sequences as the itemset-extension step (I-step). Thus each node n in the tree has two sets: S_n, the set of candidate items considered for possible S-step extensions of node n, and I_n, the set of candidate items considered for possible I-step extensions.

4.6 Counting Support with ρ-idx

For a sequence ρ, ρ-idx stores the first occurring positions of ρ in all customer-sequences. In the lexicographical tree of Incremental SPAM, each node representing a sequence ρ uses ρ-idx to count the support of ρ. We introduce the method for the two extension steps separately.

4.6.1 Counting Support in S-step

Assume a sequence α and an appended 1-itemset β, and let γ be the S-extended sequence generated from α and β. Our goal is to use α-idx and β-idx to generate γ-idx and count the support of γ.


In the first step, α-idx[c] and β-idx[c] are checked for each customer c. If either α-idx[c] or β-idx[c] is zero, γ-idx[c] is set to zero; that means γ cannot exist in the customer-sequence of c. If both α-idx[c] and β-idx[c] are non-zero for some c, γ may exist in this customer-sequence, and the corresponding position values have to be compared. There are two cases for α-idx[c] and β-idx[c]:

(Case 1) α-idx[c] < β-idx[c]: α appears before β, so γ does exist in the customer-sequence of c, and γ-idx[c] is set to β-idx[c].

(Case 2) α-idx[c] ≥ β-idx[c]: In this case there is not enough information to judge whether γ exists, and the CBASW of customer c needs to be checked further. We denote the bit-vector of item i in the CBASW of customer c as CBASW_c(i). A left-shift operation is performed on CBASW_c(β) by α-idx[c] bits. If the result is a non-zero bit-vector, γ exists in the customer-sequence of c; letting h be the position of the first non-zero bit in the result, the first position of γ, γ-idx[c], is set to α-idx[c] + h. Otherwise γ does not exist in the customer-sequence of c and γ-idx[c] is set to zero.

Finally, the support of γ is the number of non-zero positions in γ-idx.
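The S-step position computation can be sketched as follows; N = 4 and the helper names are ours, with positions 1-based as defined in section 4.3.

```python
N = 4

def first_pos(bitvec):
    """1-based position of the first set bit, scanning from the oldest side."""
    for p in range(1, N + 1):
        if (bitvec >> (N - p)) & 1:
            return p
    return 0

def s_step_pos(alpha_pos, beta_pos, beta_bitvec):
    """First position of the S-extended sequence in one customer-sequence."""
    if alpha_pos == 0 or beta_pos == 0:
        return 0
    if alpha_pos < beta_pos:            # Case 1: beta already after alpha
        return beta_pos
    # Case 2: drop the first alpha_pos transactions of beta's vector, rescan.
    shifted = (beta_bitvec << alpha_pos) & ((1 << N) - 1)
    h = first_pos(shifted)
    return alpha_pos + h if h else 0
```

For example, with alpha_pos = 2 and beta occurring at positions 1 and 3 (bit-vector 1010), the result is position 3, the first occurrence of beta after alpha.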

4.6.2 Counting Support in I-step

Assume a sequence α, an appended item T, and an I-extended sequence γ generated by α and T. Our goal is to use α-idx and T-idx to generate γ-idx and count the support of γ.

As in the S-step, α-idx[c] and T-idx[c] are checked for each customer c. If either α-idx[c] or T-idx[c] is zero, γ-idx[c] is set to zero. If both α-idx[c] and T-idx[c] are non-zero for some c, the corresponding position values have to be compared. There are also two cases for α-idx[c] and T-idx[c] in the I-step:

(Case 1) α-idx[c] = T-idx[c]: The last itemset of α and the item T first occur in the same transaction, so γ exists in the customer-sequence of c and γ-idx[c] is set to α-idx[c].

(Case 2) α-idx[c] ≠ T-idx[c]: A further check in the CBASW of c is performed. Assume X is the last itemset of γ. A bit-vector BIT_X is obtained by bitwise ANDing CBASW_c(x1), CBASW_c(x2), …, CBASW_c(xk), where the xi are the items contained in X. Then BIT_X is left-shifted (α-idx[c] – 1) bits. If the result is non-zero, γ exists in this customer-sequence; letting h be the position of the first non-zero bit in the result, the first position of γ, γ-idx[c], is set to (α-idx[c] – 1) + h. Otherwise γ does not exist in the customer-sequence of c and γ-idx[c] is set to zero.

As in section 4.6.1, the support of γ can be counted by the number of non-zero positions.
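Under the same conventions, the I-step can be sketched as follows. We assume here that the bit-vector of the extended last itemset is the AND of the bit-vectors of all its items, including the appended item T; the helper names are ours.

```python
N = 4

def first_pos(bitvec):
    """1-based position of the first set bit, scanning from the oldest side."""
    for p in range(1, N + 1):
        if (bitvec >> (N - p)) & 1:
            return p
    return 0

def i_step_pos(alpha_pos, last_itemset_bitvecs):
    """First position of the I-extended sequence in one customer-sequence.
    last_itemset_bitvecs: bit-vectors of every item in X ∪ {T}."""
    if alpha_pos == 0:
        return 0
    bit_x = (1 << N) - 1
    for bv in last_itemset_bitvecs:
        bit_x &= bv
    # Keep positions >= alpha_pos: drop the first (alpha_pos - 1) bits.
    shifted = (bit_x << (alpha_pos - 1)) & ((1 << N) - 1)
    h = first_pos(shifted)
    return (alpha_pos - 1) + h if h else 0
```

For example, with alpha_pos = 2 and item bit-vectors 0111 and 1011, the extended itemset co-occurs at positions 3 and 4, so the result is 3.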

4.7 The Entire Process of Incremental SPAM (IncSPAM)

Finally, we introduce the entire IncSPAM algorithm for mining sequential patterns. Figure 4-6 shows the main function of IncSPAM.

IncSPAM (S, d, N)
1: foreach incoming transaction from the data stream do
2:   find out which customer c the incoming transaction belongs to;
3:   update the CBASW of this customer by the incoming transaction;
4:   store all the frequent 1-sequences to F;
5:   MaintainTree(c, F);

Fig 4-6. Main function of Incremental SPAM

The CBASW of each customer is modified in lines 1 to 4. After the modification of the CBASWs is finished, the function MaintainTree is called; it maintains the sequential patterns dynamically in a lexicographic sequence tree. There are three cases in the incremental mining of sequential patterns. Assume that a new transaction ω belonging to customer c arrives and the lexicographical tree T is updated to T′:


A pattern which is frequent in T is still frequent in T′: we only need to update its ρ-idx and support.

A pattern which is not in T appears in T′: a new pattern is generated because of the incoming transaction. By the Apriori [1] property, the prefix of the new pattern must also be frequent, so we only need to generate candidates from the leaf nodes of T. There are two ways to reduce the number of candidates to be generated: (1) only the items in the incoming transaction are considered for appending to the leaf nodes, because a new pattern must end with these items; (2) the incoming transaction belongs to a specific customer c, so the generated candidates must begin with items in the customer-sequence of c. Figure 4-7 shows an example after sliding the CBASW of customer #3; the incoming transaction is TID = 7.

In Figure 4-7, the appended items are b, c, and d, and the sub-trees of items a, b, c, and d need to generate candidates; the others do not.

Fig 4-7. Reducing the generated candidates

A pattern which is in T does not exist in T′: the pattern becomes infrequent because of window sliding. We directly delete the node and its sub-tree.


MaintainTree (c, F)
1: foreach tree node n whose representing item i is in F do
2:   if i exists in the customer-sequence of c then
3:     Generate(c, n);
4:   else //i does not exist in c
5:     Update(c, n);

Fig 4-8. The pseudo code of function MaintainTree

Figure 4-8 shows the pseudo code of MaintainTree. The function Generate, shown in Figure 4-9, uses the S-step and I-step to generate all possible children of each tree node under the principles mentioned above. If a child does not exist in the lexicographical tree, Generate creates a new tree node for it; if the child is already in the tree, Generate only updates its index set and support. The function Update, shown in Figure 4-10, is simpler than Generate: it does not generate children but only checks each tree node to update its index set and support. The updating of the index set and the support is done in the function UpdateSupport.

Generate (c, n)
1: foreach existing child n′ of n do
2:   UpdateSupport(c, n′);
3:   if the support of n′ < S then
4:     eliminate n′ and its sub-tree;
5: generate candidates of n by S-step and I-step;
6: foreach generated candidate x of n do
7:   count the support of x;
8:   if the support of x ≥ S then
9:     x is a child of n;
10: foreach child n′ of n do
11:   Generate(c, n′);

Fig 4-9. The pseudo code of function Generate

Update (c, n)
1: foreach existing child n′ of n do
2:   UpdateSupport(c, n′);
3:   if the support of n′ < S then
4:     eliminate n′ and its sub-tree;
5: foreach child n′ of n do
6:   Update(c, n′);

Fig 4-10. The pseudo code of function Update

We use the previous example to show the process of IncSPAM. Assume three transactions have been received by IncSPAM. Figure 4-11 shows the CBASWs and the lexicographic sequence tree; the sequential patterns are marked with squares, and each tree node maintains an index set recording its support. In Figure 4-11, only the 1-sequence <(b)> is frequent, so the tree does not contain longer sequential patterns.


Fig 4-11. The lexicographic sequence tree when the third transaction comes in

When the fourth transaction (a, b, c) arrives, the CBASW of customer 2 is modified, and the 1-sequences <(a)> and <(c)> become new sequential patterns. By the extension methods, S-step and I-step, longer candidates are generated. IncSPAM checks the support of each candidate using its index set and keeps the sequential patterns in the lexicographic sequence tree.
