Introduction - A Study of Efficient Frequent Pattern Mining Algorithms for Streaming Data

1.1 Background

Data mining, also referred to as knowledge discovery in databases, is the process of extracting non-trivial, implicit, previously unknown, and potentially useful information or knowledge from large amounts of data. Typical data mining tasks include association mining, sequential pattern mining, classification, and clustering. These tasks help us find interesting patterns and regularities in the data.

Traditional data mining techniques assume that the target databases are disk resident or can fit into main memory. Under this assumption, and because of the complexity of mining tasks, almost all data mining algorithms scan the data several times.

Recently, the database and knowledge discovery communities have focused on a new model of data processing in which data arrive in the form of continuous streams. Such data are often referred to as data streams or streaming data. The new data model addresses the data explosion from two new perspectives. First, the arrival rate and volume of data streams exceed our capacity to store them. For example, the network traffic information of a router, though extremely important, is often impossible to record. Second, data stream processing is subject to real-time constraints. In general, the need to process the data in a timely manner prohibits rescanning the data from secondary storage. For example, a network intrusion must be detected in real time if the damage is to be prevented. The new model captures a large class of important current applications, such as discovering patterns in the data generated by sensor networks, analyzing the transactional behavior of transaction flows in retail chains, mining user traversal behavior from Web records and click-streams, protecting network security, detecting terrorist activities in a timely manner, monitoring call records in telecommunications, analyzing stock and business data, and so on [6, 33].

To facilitate the following discussion, we first introduce the streaming data model in more detail. A data stream assumes that the data elements arrive in some order. Moreover, the amount of data is often huge and cannot be held in main memory or even on disk. This means that once a new data element arrives, it must be processed quickly. In general, a data element stays in main memory for only a short period. Once a data element is removed from main memory, it cannot be accessed again. In other words, we can look at the data only once.

Data mining over streaming data brings many new challenges [6]. The first challenge is how to perform data mining tasks on data streams. Most existing data mining algorithms require scanning datasets multiple times, such as the Apriori algorithm for association rule mining, k-means for clustering, and C4.5 for decision tree construction. The new data model limits us to a single look at the data, that is, to at most one scan. Further, because memory is small relative to the large amount of streaming data, we can store only a concise summary or a part of the data stream. Therefore, getting precise results from data streams is commonly impossible or very difficult. The challenge is how to design efficient algorithms that produce approximate results with high accuracy and confidence. The second challenge is how to understand the changes in data streams. Data streams bring us much new and useful information to explore, such as whether and when the underlying distribution of a continuous data stream has changed. An example is finding products in a retail chain that have recently become very popular in certain regions but were relatively unpopular for a long time before. In conclusion, how to perform data mining tasks, how to discover new knowledge, and how to mine changes in data streams make stream mining very challenging.
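To make the single-scan, bounded-memory constraint concrete, the following is a minimal sketch of approximate frequency counting over a stream. It uses the well-known Misra-Gries summary, which is chosen here only for illustration and is not one of the algorithms proposed in this dissertation. The function name misra_gries and the parameter k are assumptions of this sketch.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Summarize a stream in one pass using at most k - 1 counters.

    Illustrative only (hypothetical helper, not from the dissertation).
    Each stored count underestimates the true count by at most n / k,
    where n is the number of elements read from the stream.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:  # each element is seen exactly once
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    clicks = ["a", "b", "a", "c", "a", "d", "a", "b"]
    print(misra_gries(clicks, k=3))
```

The summary never holds more than k - 1 counters regardless of the stream length, which is the kind of trade-off described above: exact counts are given up in exchange for one pass and bounded space.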

1.2 Research Objectives and Contributions

The research objective of this dissertation is to investigate efficient and scalable algorithms for mining frequent itemsets, path traversal patterns, and the changes of items over continuous data streams.

The first research issue of this dissertation is the online mining of frequent itemsets over data streams. We propose the DSM-FI (Data Stream Mining for Frequent Itemsets) algorithm to find the set of all frequent itemsets over the entire history of the data streams. The proposed algorithm uses an effective projection method to extract the essential information from each incoming transaction of the data streams, and constructs a summary data structure based on the prefix tree. DSM-FI uses a top-down pattern selection approach to find the complete set of frequent itemsets. Experiments show that DSM-FI outperforms BTS (Buffer-Trie-SetGen), a state-of-the-art single-pass algorithm, by one order of magnitude in discovering the set of all frequent itemsets over a landmark window of data streams. For mining frequent itemsets in data streams with a sliding window, we propose an online algorithm, called MFI-TransSW (Mining Frequent Itemsets over a Transaction-sensitive Sliding Window), to mine the set of frequent itemsets in streaming data with a transaction-sensitive sliding window. Moreover, another single-pass algorithm, called MFI-TimeSW (Mining Frequent Itemsets over a Time-sensitive Sliding Window) and based on MFI-TransSW, is proposed to mine the set of frequent itemsets in a time-sensitive sliding window. The proposed algorithms use an effective bit-sequence representation of items to reduce the time and memory needed to slide the windows. Experiments show that the proposed algorithms not only attain highly accurate mining results, but also run significantly faster and consume less memory than existing algorithms for mining recent frequent itemsets over data streams.
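As a rough illustration of how a bit-sequence representation can make window sliding cheap, the sketch below keeps one integer per item, with bit i set when the item occurs in the i-th most recent transaction of the window. This is one plausible reading of the idea and not the MFI-TransSW implementation itself. The class name BitSequenceWindow and its methods are assumptions of this sketch.

```python
from typing import Dict, Iterable


class BitSequenceWindow:
    """Transaction-sensitive sliding window with one bit-sequence per item.

    Hypothetical illustration: bit 0 is the newest transaction and bit
    window_size - 1 is the oldest transaction still inside the window.
    """

    def __init__(self, window_size: int) -> None:
        if window_size < 1:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self.mask = (1 << window_size) - 1
        self.bits: Dict[str, int] = {}
        self.count = 0  # number of transactions currently in the window

    def add(self, transaction: Iterable[str]) -> None:
        """Slide the window by one transaction."""
        # Shift every sequence; the mask discards the expired oldest bit.
        for item in self.bits:
            self.bits[item] = (self.bits[item] << 1) & self.mask
        # Record the new transaction in bit 0.
        for item in set(transaction):
            self.bits[item] = self.bits.get(item, 0) | 1
        # Forget items that no longer occur anywhere in the window.
        self.bits = {i: b for i, b in self.bits.items() if b}
        self.count = min(self.count + 1, self.window_size)

    def support(self, itemset: Iterable[str]) -> int:
        """Count window transactions that contain every item of itemset."""
        items = list(itemset)
        if not items:
            return self.count
        combined = self.mask
        for item in items:
            combined &= self.bits.get(item, 0)
        return bin(combined).count("1")


if __name__ == "__main__":
    window = BitSequenceWindow(window_size=3)
    for t in (["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]):
        window.add(t)
    print(window.support(["a", "b"]))
```

In this sketch, sliding costs one shift per tracked item, and the support of an itemset is the number of set bits in the bitwise AND of its items' sequences, so no stored transaction has to be rescanned.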

The second research issue of this dissertation is change mining of data streams. We define a new problem, the online mining of changes of items across two data streams, and propose a one-pass algorithm, called MFC-append (Mining Frequency Changes of append-only data streams), to mine the set of frequent frequency changed items, vibrated frequency changed items, and stable frequency changed items across two continuous append-only data streams.

Furthermore, a single-pass algorithm, called MFC-dynamic (Mining Frequency Changes of dynamic data streams) based on MFC-append, is proposed to mine the changes across two dynamic data streams. A new summary data structure, called Change-Sketch, is developed to compute the frequency changes between two data streams as fast as possible. Theoretical analysis and experimental results show that our algorithms meet the major performance requirements, namely single-pass, bounded space requirement, and real-time computing, in mining streaming data.

The third research issue of this dissertation is the online mining of all path traversal patterns over Web click-streams. We propose the first single-pass algorithm, called DSM-PLW (Data Stream Mining for Path traversal patterns in a Landmark Window), to discover the path traversal patterns over Web click-streams with a user-defined minimum support constraint. Moreover, we propose the first online algorithm, called DSM-TKP (Data Stream Mining for Top-K Path traversal patterns), to mine the set of top-K path traversal patterns without a user-specified minimum support threshold. Experiments on real click-streams show that both algorithms successfully mine maximal reference sequences with linear scalability.

All the proposed algorithms are verified by experiments on continuous streams with various characteristics. In these experiments, which include comprehensive comparisons, the proposed algorithms outperform several related algorithms, and they all show excellent linear scalability with respect to the size of the streaming data.

1.3 Organization of this Thesis

The rest of this dissertation is organized as follows. In Chapter 2, we describe efficient one-pass algorithms for mining frequent itemsets and maximal frequent itemsets in a landmark window of data streams. Efficient single-pass algorithms for mining frequent itemsets over stream sliding windows are described in Chapter 3. Chapter 4 addresses the problem of mining changes of items over append-only and dynamic data streams. Efficient algorithms for mining path traversal patterns with a user-specified minimum support constraint over Web click-streams are introduced in Chapter 5. The problem of mining top-K path traversal patterns is discussed in Chapter 6. Finally, the conclusions and future work are given in Chapter 7.
