Organization of the Thesis - A Study on Mining High Utility Itemsets in Data Stream Environments

Chapter 1 Introduction

1.3 Organization of the Thesis

The remainder of this thesis is organized as follows. Chapter 2 describes basic definitions and terminology concerning utility itemsets and the sliding window model. Chapter 3 presents our proposed method for mining high utility itemsets. Chapter 4 describes the experiments and performance results. Chapter 5 gives the conclusion and future work.

Chapter 2 Problem Definition and Background

In this chapter we introduce the basic definitions of the problem. Section 2.1 introduces the data stream environment and the sliding window model. Section 2.2 defines utility itemsets and the problem of mining high utility itemsets. Finally, Section 2.3 describes the transaction-weighted downward closure property.

2.1 Definition and Background of Data Stream

2.1.1 Data Stream

The database and knowledge discovery communities have focused on a new data model in which data arrive in the form of continuous streams, often referred to as data streams or streaming data. The characteristics of data streams are as follows: (1) Continuity: data continuously arrive at a rapid rate. (2) Expiration: data can be read only once. (3) Infinity: the total amount of data is unbounded.

Figure 2-1. Data stream environment (data streams arriving at a data stream mining engine)

Figure 2-1 shows the data stream environment. For the reasons given above, mining patterns in data streams differs from traditional mining of static databases in the following aspects. First, data streams arrive continuously at a rapid rate, so the amount of data is huge. This means that once a new data element arrives, it must be processed quickly. Moreover, once a data element is removed from main memory, it cannot be revisited. Therefore, the ideal is to make a single sequential pass over the data, called a one-pass scan. Second, because memory is small relative to the volume of streaming data, we can store only a concise summary or a portion of the data stream. Finally, due to the limited memory and the one-pass scan, obtaining precise answers from data streams is commonly impossible or very difficult. For these reasons, the traditional multiple-pass techniques for mining static databases are not feasible in the data stream environment. The challenge of mining data streams is to design an efficient algorithm that derives useful patterns under limited memory and execution time.

2.1.2 A Sliding Window Model

Some data stream applications emphasize the importance of recent transactions, and the sliding window model suits this kind of problem. In the sliding window model, knowledge discovery is performed over a fixed number, the window size, of the most recently generated data elements, which are the target of data mining. Once the window is full, window sliding is performed: the oldest data are eliminated and the newest data are appended.

According to the basic unit of window sliding, two types of sliding window, i.e., the transaction-sensitive sliding window (TransSW) and the time-sensitive sliding window (TimeSW), are used in mining data streams. A transaction-sensitive sliding window in the transaction data stream is a window that slides forward for every transaction, whereas a time-sensitive sliding window in the transaction data stream is a window that slides forward for every time unit ($TU_i$), each consisting of a variable number of transactions. Therefore, the window size, w, in TransSW at each slide is a fixed number of transactions, whereas the window size, w, in TimeSW at each slide is a variable number of transactions. The sliding window model is shown in Figure 2-2.

Figure 2-2. The sliding window model
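The two window types described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class names `TransSW` and `TimeSW` follow the abbreviations in the text, but the methods `append`, `append_unit`, and `size` are hypothetical names introduced here, and the window size w is assumed to be at least 1.

```python
from collections import deque


class TransSW:
    """Transaction-sensitive sliding window: keeps the w most recent transactions."""

    def __init__(self, w):
        # A bounded deque drops its oldest element automatically when full.
        self.window = deque(maxlen=w)

    def append(self, transaction):
        """Add one transaction; return the expired transaction, or None."""
        full = len(self.window) == self.window.maxlen
        expired = self.window[0] if full else None
        self.window.append(transaction)
        return expired


class TimeSW:
    """Time-sensitive sliding window: keeps the w most recent time units."""

    def __init__(self, w):
        self.units = deque(maxlen=w)

    def append_unit(self, transactions):
        """Add one time unit (a list of transactions); return the expired unit, or None."""
        full = len(self.units) == self.units.maxlen
        expired = self.units[0] if full else None
        self.units.append(list(transactions))
        return expired

    def size(self):
        """Number of transactions currently in the window (varies per slide)."""
        return sum(len(unit) for unit in self.units)
```

In this sketch a `TransSW` always holds at most w transactions, while the transaction count of a `TimeSW` changes whenever time units of different sizes enter or leave the window.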

2.2 Problem Definition: Mining High Utility Itemsets in a Sliding Window Model

2.2.1 Utility Itemsets

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of n distinct literals called items. $D = \{T_1, T_2, \ldots, T_m\}$ is a set of variable-length transactions, where each transaction $T_i \in D$ is a subset of I. Each transaction also has an associated unique identifier called its TID. In general, a set of items is called an itemset.

The number of items in an itemset is called the length of an itemset. Itemsets of length k are referred to as k-itemsets.

In traditional frequent itemset mining, the quantity of an item in each transaction is always 0 or 1. In the utility mining model, however, the quantity of an item in each transaction, called the local transaction utility, may be an arbitrary number. An additional resource, called the external utility, which can be a measure describing user preference, is defined in a utility table. Figure 2-3 shows an example of a transaction database and a utility table.

Figure 2-3. An example of input transaction database and utility table

Some definitions that lead to the formal definition of the utility mining problem are given in [17]:

1. Local transaction utility represents the quantity of an item in a transaction.

2. External utility is the value associated with an item in the utility table. For example, u(a)=3 and u(b)=10.

The utility of an itemset X, denoted as u(X), is the sum of the utilities of X in all the transactions containing X, and is defined as

$u(X) = \sum_{T_q \in D,\, X \subseteq T_q} u(X, T_q)$,

where $u(X, T_q)$ is the utility of X in transaction $T_q$, i.e., the quantity of each item of X in $T_q$ multiplied by its external utility, summed over the items of X. For example, u(bd)=66+52+28+42=188. The goal of utility mining is to identify high utility itemsets, which derive a large portion of the total utility. If the twelve transactions are the target of data mining and the minimum utility threshold is 120, bd is a high utility itemset.
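The utility computation can be illustrated with a small Python sketch. The external utilities of a and b are the values given in the text; the external utility of d and all transaction quantities below are hypothetical, since the content of Figure 2-3 is not reproduced here. The function names are likewise introduced only for illustration.

```python
# External utility per item. a=3 and b=10 come from the text; d=6 is assumed.
EXTERNAL_UTILITY = {"a": 3, "b": 10, "d": 6}

# Hypothetical database: TID -> {item: quantity (local transaction utility)}.
DATABASE = {
    "T1": {"a": 2, "b": 1, "d": 4},
    "T2": {"b": 3},
    "T3": {"a": 1, "d": 2},
}


def utility_in_transaction(itemset, transaction):
    """Utility of an itemset in one transaction; 0 if the itemset is not contained."""
    if not all(item in transaction for item in itemset):
        return 0
    return sum(transaction[item] * EXTERNAL_UTILITY[item] for item in itemset)


def utility(itemset, database):
    """u(X): sum of the utilities of X over all transactions containing X."""
    return sum(utility_in_transaction(itemset, t) for t in database.values())


print(utility(("b", "d"), DATABASE))  # 34: only T1 contains both b and d
print(utility(("d",), DATABASE))      # 36: T1 contributes 24, T3 contributes 12
```

An itemset would then be reported as a high utility itemset when this value reaches the minimum utility threshold.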

2.2.2 Problem Definition: Mining High Utility Itemsets in a Sliding Window Model

A transaction-sensitive sliding window (TransSW) in the transaction data stream is a window that slides forward for every transaction. The window at each slide has a fixed number, w, of transactions, and w is called the size of the window. The current transaction-sensitive sliding window is $TransSW_{N-w+1} = [T_{N-w+1}, T_{N-w+2}, \ldots, T_N]$, where N-w+1 is the id of the current window. An itemset X is called a high utility itemset if $u(X) \ge ut \times w$, where ut is a user-specified minimum utility threshold in the range [0, 1]. The value $ut \times w$ is the minimum utility in the current transaction-sensitive sliding window.

A time-sensitive sliding window (TimeSW) in the transaction data stream is a window that slides forward for every time unit. Each time unit $TU_i$ consists of a variable number, $|TU_i|$, of transactions, and $|TU_i|$ is also called the size of the time unit. Because the time units differ in size, the window at each slide has a variable number of transactions. The current time-sensitive sliding window is $TimeSW_{N-w+1} = [TU_{N-w+1}, TU_{N-w+2}, \ldots, TU_N]$, where N-w+1 is the id of the current window. An itemset X is called a high utility itemset if $u(X) \ge ut \times |TimeSW_{N-w+1}|$, where ut is a user-specified minimum utility threshold in the range [0, 1] and $|TimeSW_{N-w+1}|$ is the number of transactions in the current time-sensitive sliding window, called the window size. The value $ut \times |TimeSW_{N-w+1}|$ is the minimum utility in the current time-sensitive sliding window.

Later in this thesis, we show that our method can be applied to both the transaction-sensitive and the time-sensitive sliding window models.

2.3 Transaction-Weighted Downward Closure Property

The downward closure property of Apriori cannot be applied to the utility mining model. For example, d is a low utility itemset because u(d)=14*6=84<120, but its superset bd is a high utility itemset because u(bd)=160>120. If candidates are generated from all combinations of items, the computation becomes intolerable. A level-wise approach applicable to utility mining, called the "Transaction-Weighted Downward Closure Property", is proposed in the Two-Phase algorithm [19].

Definition 1. (Transaction Utility) The transaction utility of transaction $T_q$, denoted as $tu(T_q)$, is the sum of the utilities of all items in $T_q$.

Definition 2. (Transaction-Weighted Utilization) The transaction-weighted utilization of an itemset X, denoted as twu(X), is the sum of the transaction utilities of all the transactions containing X. For example, assuming the target of data mining is T1 to T9, twu(bd) is the sum of the transaction utilities of the transactions in T1 to T9 that contain bd.

Definition 3. (High Transaction-Weighted Utilization Itemsets) X is a high transaction-weighted utilization itemset if $twu(X) \ge$ minimum utility. Assume the minimum utility is 120, and thus bd is a high transaction-weighted utilization itemset.

Theorem 1. (Transaction-Weighted Downward Closure Property) Let $I^k$ be a k-itemset and $I^{k-1}$ be a (k-1)-itemset such that $I^{k-1} \subset I^k$. If $I^k$ is a high transaction-weighted utilization itemset, then $I^{k-1}$ is also a high transaction-weighted utilization itemset.
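Definitions 1 and 2 and the closure property of Theorem 1 can be checked on a toy example. The sketch below is illustrative only: the transactions, the external utility of d, and the function names are assumptions made here, not data from the thesis figures.

```python
# Hypothetical external utilities (a and b as in the text, d assumed).
EXTERNAL_UTILITY = {"a": 3, "b": 10, "d": 6}

# Hypothetical transactions: {item: quantity}.
TRANSACTIONS = [
    {"a": 2, "b": 1, "d": 4},  # transaction utility 6 + 10 + 24 = 40
    {"b": 3},                  # transaction utility 30
    {"a": 1, "d": 2},          # transaction utility 3 + 12 = 15
]


def transaction_utility(transaction):
    """tu(T): sum of the utilities of all items in the transaction."""
    return sum(qty * EXTERNAL_UTILITY[item] for item, qty in transaction.items())


def twu(itemset, transactions):
    """twu(X): sum of tu(T) over all transactions T that contain X."""
    items = set(itemset)
    return sum(transaction_utility(t) for t in transactions if items.issubset(t))


print(twu(("b", "d"), TRANSACTIONS))  # 40
print(twu(("b",), TRANSACTIONS))      # 70
print(twu(("d",), TRANSACTIONS))      # 55
```

Every transaction containing bd also contains b and d, so twu(bd) can never exceed twu(b) or twu(d). This is why a subset of a high transaction-weighted utilization itemset is itself a high transaction-weighted utilization itemset.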

Chapter 3 An Efficient Mining of High Utility Itemsets

The goal of our work is to find an efficient method for mining high utility itemsets in a data stream. In Section 3.1 we therefore introduce a related work, the THUI-Mine algorithm. Next, in Section 3.2, we introduce our proposed method for mining high utility itemsets in a transaction-sensitive sliding window model, denoted as MHUI_TransSW. Subsequently, in Section 3.3, we extend this method to the time-sensitive sliding window model, denoted as MHUI_TimeSW.

3.1 Related Work: THUI (Temporal High Utility Itemsets)-Mine Algorithm

THUI-Mine [20] is based on the transaction-weighted downward closure property and extends it with the sliding-window-filtering technique to find the temporal high utility itemsets over a sliding window. In essence, by partitioning the transaction database from the data stream into several partitions, THUI-Mine employs a filtering threshold in each partition to deal with the generation of transaction-weighted utilization itemsets.

For ease of exposition, the processing of a partition is termed a phase of processing. The cumulative information in the prior phase is selectively carried over toward the generation of candidate itemsets in the subsequent phases. The cumulative information that THUI-Mine maintains consists of two summary structures:

1. Progressive transaction-weighted utilization set of itemsets (also called potential candidate 2-itemsets): composed of the following two types of itemsets:

(1) The transaction-weighted utilization itemsets that were carried over from the progressive candidate set in the previous phase and remain as transaction-weighted utilization itemsets after the current partition is taken into consideration.

(2) The transaction-weighted utilization itemsets that were not in the progressive candidate set in the previous phase but are newly selected after the current partition is taken into consideration.

2. $TUP_k(I)$: the transaction-weighted utilization itemsets and their corresponding transaction-weighted utility in each partition $P_k$.

After processing a partition $P_k$, THUI-Mine maintains the potential candidate 2-itemsets $C_2$ and $TUP_k(I)$. Each potential candidate 2-itemset $c \in C_2$ has two attributes: (1) c.start, the identifier of the starting partition when c was added to $C_2$, and (2) twu(c), the transaction-weighted utility of itemset c, which is the sum of the transaction utilities of all the transactions containing c since c was added to $C_2$. Table 3-1 shows the meanings of the symbols used in THUI-Mine. The mining process of THUI-Mine is decomposed into two procedures:

1. The preprocessing procedure: While the window is not yet full, it deals with mining on the original transaction database, e.g., $db^{1,n}$. This procedure is described in Section 3.1.1.

2. The incremental procedure: When the window is full and a new partition arrives, the window needs to slide, so the cumulative information needs to be updated. This procedure is described in Section 3.1.2.

Table 3-1. The meanings of symbols used in THUI-Mine

Symbol          Meaning
$db^{i,j}$      Partition database from $P_i$ to $P_j$
s               Utility threshold in one partition
$TUP_k(I)$      Transactions in $P_k$ that contain itemset I with transaction utility
$Thtw^{i,j}$    The progressive temporal high transaction-weighted utilization 2-itemsets of $db^{i,j}$
$\Delta^-$      The deleted portion of an ongoing database
$D^-$           The unchanged portion of an ongoing database
$\Delta^+$      The added portion of an ongoing database

3.1.1 The Preprocessing Procedure

Figure 3-1 shows an input transaction database and utility table. Let each partition contain three transactions and each window contain nine transactions. Assume the minimum utility is 120 for nine transactions; the filtering threshold is thus s=120/3=40 for each partition.

The first window, $db^{1,3}$, is segmented into three partitions, i.e., {$P_1$, $P_2$, $P_3$}. Each partition is scanned sequentially in the first scan to generate the progressive temporal high transaction-weighted utilization 2-itemsets of $db^{1,3}$, $Thtw^{1,3}$. Figure 3-2 shows the transaction utility of each transaction.

Figure 3-1. An example of input transaction database and utility table

Figure 3-2. The transaction utility of each transaction

After scanning the first 3 transactions, i.e., partition $P_1$, we use the high transaction-weighted utilization 1-itemsets in $P_1$, itemsets {a, b, d}, to generate the potential candidate 2-itemsets {ab, ad, bd}. Itemsets {ab, ad, bd} are newly generated in partition $P_1$, so their start value is 1, the identifier of partition $P_1$. The transaction-weighted utilities of itemsets {ab, ad, bd} in $P_1$ are 0, 42 and 66, respectively. Since twu(ab)=0<40, itemset ab is removed. In contrast, itemsets {ad, bd}, in the shaded portion, have transaction-weighted utilities greater than 40, so they remain as potential candidate 2-itemsets and their information is carried over to the next phase of processing. $TUP_1(I)$ maintains these potential candidate 2-itemsets and their transaction-weighted utilities in partition $P_1$. Figure 3-3 shows the potential candidate 2-itemsets and $TUP_1(I)$ after processing $P_1$.

Figure 3-3. The potential candidate 2-itemsets and $TUP_1(I)$ after processing $P_1$

After processing partition $P_2$, candidate 2-itemsets are decomposed into two types:

(1) Itemsets that are carried over from the previous phase $P_1$. The start value of this kind of itemset is 1. For example, itemsets {ad, bd} are carried over from $P_1$, so ad.start=1 and bd.start=1. Although itemset bd is carried over from $P_1$, it also occurs in $P_2$. The transaction-weighted utility of bd in $P_2$ is maintained in $TUP_2(I)$. In addition, its transaction-weighted utility is accumulated, and twu(bd) becomes 66+52=118.

(2) Itemsets that are newly identified after the current partition, $P_2$, is taken into consideration. The start value of this kind of itemset is 2. For example, itemsets {ab, ae, be} are newly generated after $P_2$ is taken into consideration.

THUI-Mine prunes these itemsets with different filtering thresholds. For an itemset c where c.start=1, the filtering threshold is 2*s=40*2=80. For an itemset c where c.start=2, the filtering threshold is 1*s=40. For example, twu(ad)=42<80, so itemset ad is not carried over to the next partition, whereas twu(bd)=118>80 and twu(ab)=48>40, so itemsets bd and ab are carried over to the next partition. After pruning, four potential candidate 2-itemsets {ab, ae, bd, be}, in the shaded portion, are carried over to the next phase. One of them is carried over from partition $P_1$, and three of them are newly identified in $P_2$. Figure 3-4 shows the potential candidate 2-itemsets and $TUP_2(I)$ after processing $P_2$.

Figure 3-4. The potential candidate 2-itemsets and $TUP_2(I)$ after processing $P_2$
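The pruning rule used in this example can be sketched as follows. The formula in `filtering_threshold` is a generalization assumed here from the two cases in the text (2*s for c.start=1 and 1*s for c.start=2 after the second partition), and keeping itemsets whose twu equals the threshold is also an assumption. The function names and the dictionary layout are hypothetical; the twu values are those quoted in the text.

```python
def filtering_threshold(start, current_partition, s):
    """Threshold grows with the number of partitions seen since the itemset was added."""
    return (current_partition - start + 1) * s


def prune_candidates(candidates, current_partition, s):
    """Keep only the candidate 2-itemsets whose twu reaches their filtering threshold."""
    kept = {}
    for itemset, info in candidates.items():
        threshold = filtering_threshold(info["start"], current_partition, s)
        if info["twu"] >= threshold:
            kept[itemset] = info
    return kept


s = 120 // 3  # per-partition filtering threshold from the example (40)

# Candidates after P1, with the twu values quoted in the text.
after_p1 = {
    "ab": {"start": 1, "twu": 0},
    "ad": {"start": 1, "twu": 42},
    "bd": {"start": 1, "twu": 66},
}
print(sorted(prune_candidates(after_p1, 1, s)))  # ['ad', 'bd']

# Candidates after P2 whose twu is quoted in the text (ae and be are omitted
# because their twu values are not stated).
after_p2 = {
    "ad": {"start": 1, "twu": 42},
    "bd": {"start": 1, "twu": 118},
    "ab": {"start": 2, "twu": 48},
}
print(sorted(prune_candidates(after_p2, 2, s)))  # ['ab', 'bd']
```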

Partition $P_3$ is processed in the same way. Figure 3-5 shows the potential candidate 2-itemsets and $TUP_3(I)$ after processing $P_3$. Observe that seven potential candidate 2-itemsets are left after the preprocessing procedure.

After generating $Thtw^{1,3}$, THUI-Mine employs the scan reduction technique to generate the other candidate itemsets. It uses $Thtw^{1,3}$ to generate candidate 3-itemsets, $C_3$, and subsequently uses candidate (k-1)-itemsets, $C_{k-1}$, to generate candidate k-itemsets, $C_k$ (k = 3, …, n), where $C_n$ is the set of candidate last-itemsets. For instance, abe is constructed because itemsets {ab, ae, be} are in $Thtw^{1,3}$. Candidate 3-itemsets {abe, abc, bce} are the last candidate itemsets in this example.
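The idea of building candidate k-itemsets from (k-1)-itemsets can be sketched with a brute-force Apriori-style check. This is not the exact procedure of THUI-Mine: the function name `generate_candidates` is hypothetical, and the input below is the set {ab, ae, bd, be} carried over after the second partition, used here only because all of its members are named in the text.

```python
from itertools import combinations


def generate_candidates(prev_itemsets, k):
    """Return the k-itemsets all of whose (k-1)-subsets appear in prev_itemsets."""
    prev = {frozenset(itemset) for itemset in prev_itemsets}
    items = sorted(set().union(*prev)) if prev else []
    candidates = []
    for combo in combinations(items, k):
        subsets = combinations(combo, k - 1)
        if all(frozenset(sub) in prev for sub in subsets):
            candidates.append("".join(combo))
    return candidates


print(generate_candidates(["ab", "ae", "bd", "be"], 3))  # ['abe']
```

Here abe is produced because ab, ae and be are all present, whereas abd is rejected because ad is missing. No database scan is needed at this stage; the candidates are verified in one later scan.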

After generating all candidate itemsets, one more scan is needed to find the high utility itemsets. Table 3-2 shows the itemsets generated after the first and second scans of $db^{1,3}$.

Table 3-2. The itemsets generated after first and second scan of $db^{1,3}$

Figure 3-5. The potential candidate 2-itemsets and $TUP_3(I)$ after processing $P_3$

3.1.2 The Incremental Procedure

When partition $P_4$ arrives, window sliding is performed. As depicted in Figure 3-1, the current window moves from $db^{1,3}$ to $db^{2,4}$, i.e., {$P_2$, $P_3$, $P_4$}. Transactions T1, T2 and T3 are deleted from the window, and transactions T10, T11 and T12 are added.

This incremental procedure is decomposed into three sub-steps as follows:

1. Prune the oldest partition and update potential candidate 2-itemsets

In this sub-step, we check the pruned partition, $P_1$, and, for those potential candidate 2-itemsets where c.start=1, reduce the value of the transaction-weighted utility and set c.start=2.

For example, Figure 3-5 shows the potential candidate 2-itemsets after processing $P_3$. Observe that bd.start=1, i.e., bd is in the pruned partition $P_1$. From $TUP_1(I)$ we can observe that twu(bd)=66 in $P_1$. Hence, after pruning $P_1$, twu(bd)=155-66=89 and bd.start=2. Figure 3-6 shows the result after the first sub-step, where $\Delta^- = P_1$.

Figure 3-6. The potential candidate 2-itemsets after performing first sub-step
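The first sub-step can be sketched as a small update over the candidate table. The function name and the dictionary layout are hypothetical; the values for bd are those from the text, while the entry for ac and its twu are made-up filler used to show that later-starting itemsets are left untouched.

```python
def prune_oldest_partition(candidates, tup_oldest, oldest_id):
    """Remove the oldest partition's contribution from candidates that started in it."""
    for itemset, info in candidates.items():
        if info["start"] == oldest_id:
            # Subtract the twu recorded for this itemset in the pruned partition.
            info["twu"] -= tup_oldest.get(itemset, 0)
            info["start"] = oldest_id + 1
    return candidates


candidates = {
    "bd": {"start": 1, "twu": 155},  # values from the text
    "ac": {"start": 3, "twu": 50},   # hypothetical twu, for illustration only
}
tup_1 = {"bd": 66}  # twu of bd recorded in the pruned partition P1

prune_oldest_partition(candidates, tup_1, oldest_id=1)
print(candidates["bd"])  # {'start': 2, 'twu': 89}
```

The per-partition record is what makes this subtraction possible without rescanning the expired transactions.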

2. Append newest partition and update potential candidate 2-itemsets

In this sub-step, the process of adding the new partition, $P_4$, is similar to the processing of a partition in the preprocessing procedure. No new itemsets join the potential candidate 2-itemsets. However, itemsets {bc, bd} are carried over from the previous phase and also appear in $P_4$, so their transaction-weighted utilities are accumulated and maintained in $TUP_4(I)$. Figure 3-7 shows the potential candidate 2-itemsets and $TUP_4(I)$ after processing $P_4$.

Figure 3-7. The potential candidate 2-itemsets and $TUP_4(I)$ after performing second sub-step

3. Use the scan reduction technique to generate all candidate itemsets, as mentioned above; one more database scan then finds the temporal high utility itemsets of $db^{2,4}$. Table 3-3 shows the itemsets generated after the first and second scans of $db^{2,4}$.

Table 3-3. The itemsets generated after first and second scan of $db^{2,4}$

3.1.3 The Drawback of THUI-Mine Algorithm

THUI-Mine may give rise to two possible problems as follows:

1. More false candidate itemsets:

The temporal high transaction-weighted utilization itemsets may contain itemsets that are not truly high utility itemsets, called false candidates. The number of false candidates depends on many factors, such as the characteristics of the data, how the data is partitioned, and the number of partitions. THUI-Mine records the start partition of each potential candidate 2-itemset and then uses different filtering thresholds to prune. In this way, THUI-Mine may overestimate itemsets that are concentrated in the later partitions. For example, seven potential candidate 2-itemsets are maintained in $db^{1,3}$, as shown in Figure 3-5. Observe that ac.start=3 and bc.start=3. Since twu(ac) and twu(bc) exceed the filtering threshold of 40, itemsets ac and bc are added when $P_3$ is taken into consideration. In fact, itemsets ac and bc occur only in $P_3$. In other words, itemsets ac and bc are overestimated.

Next, THUI-Mine uses the scan reduction technique to generate candidate itemsets. Candidate 3-itemsets, $C_3$, are generated from $Thtw^{i,j} \times Thtw^{i,j}$. Subsequently, $C_4$ is generated from $C_3 \times C_3$, where $C_3$ will have a size greater than that of the high transaction-weighted utilization 3-itemsets. In other words, once the size of $Thtw^{i,j}$ increases, it leads to a chain reaction for $C_k$ (k=3,…,n). Later, in the experiments, we show that the situation worsens as the size of a partition or the minimum utility decreases.

2. More memory:

THUI-Mine needs to maintain $TUP_k(I)$ for use in the incremental procedure. The memory required varies with the number of candidate 2-itemsets, which is affected by the many factors mentioned above. Because THUI-Mine has these disadvantages, we propose a new method in the next section to reduce the number of candidate itemsets generated and to decrease the memory used.

Later, in Chapter 4, experiments show that the proposed method outperforms the THUI-Mine algorithm.

3.2 Our Proposed Method: MHUI_TransSW

In this section, we propose an efficient method, called MHUI_TransSW (Mining High Utility Itemsets over a Transaction-sensitive Sliding Window), to mine the set of all high utility itemsets within a transaction-sensitive sliding window. MHUI_TransSW is based on the transaction-weighted downward closure property and additionally uses effective item information, i.e., the TIDlist or Bitvector of all 1-itemsets, to restrict the candidate itemsets generated and thus reduce the time and memory needed. Section 3.2.1 describes the representation of item information, and Section 3.2.2 presents the MHUI_TransSW method.

3.2.1 Representation of Item information (TIDlist or Bitvector of items)

For each item x, item information, i.e. TIDlist(x) or Bitvector(x), maintains the relative placement of all transactions containing x in each sliding window, so that we can reduce the scan of transaction database. Assume the window contains w transactions, i.e. window size is w. The representation of item information is described as follows.

1. Definition of Bitvector(x): For each item x in the current transaction-sensitive sliding window TransSW, a bit-sequence with w bits, denoted as Bitvector(x), is constructed. If an item x is in the i-th transaction of current TransSW, the i-th bit of Bitvector(x) is set to be 1; otherwise, it is set to be 0.

2. Definition of TIDlist(x): For each item x in the current transaction-sensitive sliding
