An efficient algorithm for mining temporal high utility itemsets from data streams

Chun-Jung Chu (a), Vincent S. Tseng (b,*), Tyne Liang (a)

(a) Department of Computer Science, National Chiao Tung University, Hsinchu 300, Taiwan, ROC

(b) Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan, ROC

Received 10 October 2006; received in revised form 19 July 2007; accepted 21 July 2007. Available online 3 August 2007.

Abstract

Utility of an itemset is considered as the value of this itemset, and utility mining aims at identifying the itemsets with high utilities. The temporal high utility itemsets are the itemsets whose support is larger than a pre-specified threshold in the current time window of the data stream. Discovery of temporal high utility itemsets is an important process for mining interesting patterns like association rules from data streams. In this paper, we propose a novel method, namely THUI (Temporal High Utility Itemsets)-Mine, for mining temporal high utility itemsets from data streams efficiently and effectively. To the best of our knowledge, this is the first work on mining temporal high utility itemsets from data streams. The novel contribution of THUI-Mine is that it can effectively identify the temporal high utility itemsets by generating fewer candidate itemsets, such that the execution time can be reduced substantially in mining all high utility itemsets in data streams. In this way, the process of discovering all temporal high utility itemsets under all time windows of data streams can be achieved effectively with less memory space and execution time. This meets the critical requirements on time and space efficiency for mining data streams. Through experimental evaluation, THUI-Mine is shown to significantly outperform other existing methods like the Two-Phase algorithm under various experimental conditions.

© 2007 Elsevier Inc. All rights reserved.

Keywords: Utility mining; Temporal high utility itemsets; Data stream mining; Association rules

1. Introduction

The mining of association rules for finding the relationship between data items in large databases is a well studied technique in the data mining field, with representative methods like Apriori (Agrawal et al., 1993, 1996). The problem of mining association rules can be decomposed into two steps. The first step involves finding all frequent itemsets (also called large itemsets) in databases. Once the frequent itemsets are found, generating association rules is straightforward and can be accomplished in linear time.

An important research issue extended from the mining of association rules is the discovery of temporal association patterns in data streams, due to the wide applications in various domains. Temporal data mining can be defined as the activity of discovering interesting correlations or patterns in large sets of temporal data accumulated for other purposes (Bettini et al., 1996). For a database with a specified transaction window size, we may use an algorithm like Apriori to obtain frequent itemsets from the database. For time-variant data streams, there is a strong demand to develop an efficient and effective method to mine various temporal patterns (Das et al., 1998). However, most methods designed for traditional databases cannot be directly applied to the mining of temporal patterns in data streams because of their high complexity.

* Corresponding author. Tel.: +886 6 2757575x62536; fax: +886 62747076.
E-mail addresses: cjchu@cis.nctu.edu.tw (C.-J. Chu), tsengsm@mail.ncku.edu.tw (V.S. Tseng), tliang@cis.nctu.edu.tw (T. Liang).
The Journal of Systems and Software 81 (2008) 1105–1117. www.elsevier.com/locate/jss
0164-1212/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2007.07.026

In many applications, we would like to mine temporal association patterns from the most recent data in data streams. That is, in temporal data mining, one should not only include new data (i.e., data in the new hour) but also remove the old data (i.e., data in the most obsolete hour) from the mining process. Without loss of generality, consider a typical market-basket application as illustrated in Fig. 1, where the transactional data of customer purchases are shown as time advances.

In Fig. 1, for example, data was accumulated as a function of time. Data obtained prior to some specified time interval in the past becomes useless for reference. People might be most interested in the temporal association patterns in the latest three hours (i.e., db3,5) as shown in Fig. 1. It can be seen that in such a data stream environment it is intrinsically difficult to conduct frequent pattern identification due to the constraints of limited time and memory space. Furthermore, it takes considerable time to find temporal frequent itemsets in different time windows. However, the frequency of an itemset may not be a sufficient indicator of interestingness, because it only reflects the number of transactions in the database that contain the itemset. It does not reveal the utility of an itemset, which can be measured in terms of cost, profit or other expressions of user preferences. On the other hand, frequent itemsets may only contribute a small portion of the overall profit, whereas non-frequent itemsets may contribute a large portion of the profit. In reality, a retail business may be interested in identifying its most valuable customers (customers who contribute a major fraction of the profits to the company). Hence, frequency is not sufficient to answer questions such as whether an itemset is highly profitable, or whether an itemset has a strong impact. Utility mining is thus useful in a wide range of practical applications and was recently studied in Chan et al. (2003), Liu et al. (2005) and Yao et al. (2004). This also motivates our research in developing a new scheme for finding temporal high utility itemsets (THUI) from data streams.

Recently, a utility mining model was defined in Yao et al. (2004). Utility is considered as a measure of how “useful” (e.g., “profitable”) an itemset is. The utility u(X) of an itemset X is defined as the sum of the utilities of X in all transactions containing X. The goal of utility mining is to identify high utility itemsets, which drive a large portion of the total utility. Traditional association rules mining models assume that the utility of each item is always 1 and the sales quantity is either 0 or 1; thus they are only a special case of utility mining, where the utility or the sales quantity of each item could be any number. If u(X) is greater than a utility threshold, X is a high utility itemset. Otherwise, it is a low utility itemset. Table 1 is an example of utility mining in a transaction database. The number in each transaction in Table 1 (panel a) is the sales volume of each item, and the utility of each item is listed in Table 1 (panel b). For example, u({B, D}) = (6 × 10 + 1 × 6) + (1 × 10 + 7 × 6) + (3 × 10 + 2 × 6) = 160. {B, D} is considered as a high utility itemset if the utility threshold is set at 120.
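The definition of u(X) can be checked mechanically. The following Python sketch encodes the transaction database of Table 1 as dictionaries and recomputes u({B, D}). The names `PROFIT`, `DB` and `utility` are illustrative and do not come from the paper, and the unit profit of item E is taken as 5, a value inferred from the transaction utilities reported in Table 2 rather than read from the utility table.

```python
# Unit profit per item (Table 1, panel b). E = 5 is an assumption inferred
# from the transaction utilities, e.g. tu(T1) = 26 * 1 + 1 * 5 = 31.
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}

# Non-zero sales quantities per transaction (Table 1, panel a).
DB = {
    "T1": {"C": 26, "E": 1},
    "T2": {"B": 6, "D": 1, "E": 1},
    "T3": {"A": 12, "D": 1},
    "T4": {"B": 1, "D": 7},
    "T5": {"C": 12, "E": 2},
    "T6": {"A": 1, "B": 4, "E": 1},
    "T7": {"B": 10, "E": 1},
    "T8": {"A": 1, "C": 1, "D": 3, "E": 1},
    "T9": {"A": 1, "B": 1, "C": 27},
    "T10": {"B": 6, "C": 2},
    "T11": {"B": 3, "D": 2},
    "T12": {"B": 2, "C": 1},
}


def utility(itemset, db=DB, profit=PROFIT):
    """u(X): total profit of X over the transactions that contain all of X."""
    total = 0
    for quantities in db.values():
        if all(item in quantities for item in itemset):
            total += sum(quantities[item] * profit[item] for item in itemset)
    return total


THRESHOLD = 120
u_bd = utility({"B", "D"})
print(u_bd, u_bd >= THRESHOLD)  # 160 True
```

Only T2, T4 and T11 contain both B and D, so the sum has the three terms shown in the text.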

However, a high utility itemset may consist of some low utility items. One natural attempt is to adopt the level-wise search scheme used in fast algorithms such as Apriori (Agrawal and Srikant, 1995). However, this algorithm does not apply to the utility mining model. For example, consider u(D) = 84 < 120 as shown in Table 1: D is a low utility item. However, its superset {B, D} is a high utility itemset. If Apriori is used to find high utility itemsets, all combinations of the items must be generated. Moreover, the number of candidates is prohibitively large when discovering a long pattern. The cost of either computation time or memory will be intolerable, regardless of what implementation is applied. The challenge of utility mining lies not only in restricting the size of the candidate set but also in simplifying the computation for calculating the utility. Another challenge of utility mining is how to find temporal high utility itemsets from data streams as time advances.

In this paper, we explore the issue of efficiently mining high utility itemsets in temporal databases like data streams. We propose an algorithm named THUI-Mine that can discover temporal high utility itemsets from data streams efficiently and effectively. The underlying idea of the THUI-Mine algorithm is to integrate the advantages of the Two-Phase algorithm (Liu et al., 2005) and the SWF algorithm (Lee et al., 2001), augmented with incremental mining techniques, for mining temporal high utility itemsets efficiently. The novel contribution of THUI-Mine is that it can efficiently identify the utility itemsets in data streams so that the execution time for mining high utility itemsets can be substantially reduced. That is, THUI-Mine can discover the temporal high utility itemsets in the current time window and also discover the temporal high utility itemsets in the next time window with limited memory space and less computation time by the sliding window filter method. In this way, the process of discovering all temporal high utility itemsets under all time windows of data streams can be achieved effectively with less memory space and execution time. This meets the critical requirements of time and space efficiency for mining data streams. Through experimental evaluation, THUI-Mine is shown to produce fewer candidate itemsets in finding the temporal high utility itemsets, so it outperforms other methods in terms of execution efficiency. It is observed that the average improvement of THUI-Mine over the Two-Phase algorithm reaches about 67%. Moreover, it also achieves high scalability in dealing with large databases. To the best of our knowledge, this is the first work on mining temporal high utility itemsets from data streams.

The rest of this paper is organized as follows: Section 2 provides an overview of the related work. Section 3 describes the proposed approach, THUI-Mine, for finding the temporal high utility itemsets. In Section 4, we describe the experimental results for evaluating the proposed method. The conclusion of the paper is provided in Section 5.

2. Related work

In association rules mining, Apriori (Agrawal and Srikant, 1995), DHP (Park et al., 1997) and partition-based algorithms (Lin and Dunham, 1998; Savasere et al., 1995) were proposed to find frequent itemsets. Many important applications have called for incremental mining due to the increasing use of record-based databases to which data are being added continuously. Many algorithms like FUP (Cheung et al., 1996), FUP2 (Cheung et al., 1997) and UWEP (Ayn et al., 1999) have been proposed to find frequent itemsets in incremental databases. The FUP algorithm updates the association rules in a database when new transactions are added to the database. Algorithm FUP is based on the framework of Apriori and is designed to discover the new frequent itemsets iteratively. The idea is to store the counts of all the frequent itemsets found in a previous mining operation. Using these stored counts and examining the newly added transactions, the overall count of these candidate itemsets is then obtained by scanning the original database. An extension to the work in Cheung et al. (1996) was reported in Cheung et al. (1997), where the authors propose an algorithm FUP2 for updating the existing association rules when transactions are added to and deleted from the database. UWEP (Update With Early Pruning) is an efficient incremental algorithm that counts the original database at most once, and the increment exactly once. In addition, the number of candidates generated and counted is minimized.

In recent years, processing data from data streams has become a popular topic in data mining. A number of algorithms like Lossy Counting (Manku and Motwani, 2002), FTP-DS (Teng et al., 2003) and RAM-DS (Teng et al., 2004) have been proposed to process data in data streams. Lossy Counting divides the incoming stream conceptually into buckets. It uses bucket boundaries and maximal possible error to update or delete the itemsets with frequency for mining frequent itemsets. FTP-DS is a regression-based algorithm for mining frequent temporal patterns from data streams. A wavelet-based algorithm, RAM-DS, performs pattern mining tasks for data streams by exploring both temporal and support count granularities.

Table 1
A transaction database and its utility table

Panel a: Transaction table

Partition  TID    A    B    C    D    E
P1         T1     0    0   26    0    1
P1         T2     0    6    0    1    1
P1         T3    12    0    0    1    0
P2         T4     0    1    0    7    0
P2         T5     0    0   12    0    2
P2         T6     1    4    0    0    1
P3         T7     0   10    0    0    1
P3         T8     1    0    1    3    1
P3         T9     1    1   27    0    0
P4         T10    0    6    2    0    0
P4         T11    0    3    0    2    0
P4         T12    0    2    1    0    0

db1,3 covers P1 to P3 and db2,4 covers P2 to P4; P1 is the deleted portion (Δ−), P2 to P3 is the unchanged portion (D−) and P4 is the added portion (Δ+).

Panel b: The utility table

Item  Profit ($) (per unit)
A     3
B     10
C     1
D     6
E     5

Some algorithms like SWF (Lee et al., 2001) and Moment (Chi et al., 2004) were proposed to find frequent itemsets over a stream sliding window. By partitioning a transaction database into several partitions, algorithm SWF employs a filtering threshold in each partition to deal with the candidate itemset generation. The Moment algorithm uses a closed enumeration tree (CET) to maintain a dynamically selected set of itemsets over a sliding window.

A formal definition of utility mining and a theoretical model, namely MEU, were proposed in Yao et al. (2004), where the utility is defined as the combination of utility information in each transaction and additional resources. Since this model cannot rely on the downward closure property of Apriori to restrict the number of itemsets to be examined, a heuristic is used to predict whether an itemset should be added to the candidate set. However, the prediction usually overestimates, especially at the beginning stages, where the number of candidates approaches the number of all the combinations of items. The examination of all the combinations is impractical, either in computation cost or in memory space cost, whenever the number of items is large or the utility threshold is low. Although this algorithm is not efficient or scalable, it is by far the best one to solve this specific problem.

Another algorithm named Two-Phase was proposed in Liu et al. (2005), which is based on the definition in Yao et al. (2004) and finds high utility itemsets. The Two-Phase algorithm prunes down the number of candidates and can obtain the complete set of high utility itemsets. In the first phase, a model that applies the “transaction-weighted downward closure property” on the search space is used to expedite the identification of candidates. In the second phase, one extra database scan is performed to identify the high utility itemsets. However, this algorithm must rescan the whole database when new transactions are added from data streams. It incurs more cost in I/O and CPU time for finding high utility itemsets. Hence, the Two-Phase algorithm is focused on traditional databases and is not suited for mining data streams.

Although numerous studies exist on high utility itemset mining and data stream analysis as described above, no algorithm has been proposed for finding temporal high utility itemsets in data streams. This motivates our exploration of the issue of efficiently mining high utility itemsets in temporal databases like data streams in this research.

3. Proposed method: THUI-Mine

In this section, we present the THUI-Mine method. Section 3.1 describes the basic concept of THUI-Mine. Section 3.2 gives an example for mining temporal high utility itemsets. The procedure of the THUI-Mine algorithm is provided in Section 3.3.

3.1. Basic concept of THUI-Mine

The goal of utility mining is to discover all the itemsets whose utility values are beyond a user specified threshold in a transaction database. In Yao et al. (2004), the goal of utility mining is defined as the discovery of all high utility itemsets. An itemset X is a high utility itemset if u(X) ≥ ε, where X ⊆ I and ε is the minimum utility threshold; otherwise, it is a low utility itemset. For example, in Table 1, u(A, T8) = 1 × 3 = 3, u({A, C}, T8) = u(A, T8) + u(C, T8) = 1 × 3 + 1 × 1 = 4, and u({A, C}) = u({A, C}, T8) + u({A, C}, T9) = 4 + 30 = 34. If ε = 120, {A, C} is a low utility itemset. However, if an item is a low utility item, its superset may be a high utility itemset. For example, since u(D) = 84 < 120, D is a low utility item, but its superset {B, D} is a high utility itemset since u({B, D}) = 160 > 120. Intuitively, all combinations of items should be processed so that no high utility itemset is lost. However, this will incur intolerable cost in computation time and memory space. A set of terms leading to the formal definition of the utility mining problem can be generally defined as follows by referring to Yao et al. (2004):

• $I = \{i_1, i_2, \ldots, i_m\}$ is a set of items.

• $D = \{T_1, T_2, \ldots, T_n\}$ is a transaction database where each transaction $T_i \in D$ is a subset of $I$.

• $o(i_p, T_q)$, local transaction utility value, represents the quantity of item $i_p$ in transaction $T_q$. For example, $o(A, T_3) = 12$, as shown in Table 1 (panel a).

• $s(i_p)$, external utility, is the value associated with item $i_p$ in the Utility Table. This value reflects the importance of an item, which is independent of transactions. For example, in Table 1 (panel b), the external utility of item A, $s(A)$, is 3.

• $u(i_p, T_q)$, utility, the quantitative measure of utility for item $i_p$ in transaction $T_q$, is defined as $o(i_p, T_q) \times s(i_p)$. For example, $u(A, T_3) = 12 \times 3$ in Table 1.

• $u(X, T_q)$, utility of an itemset $X$ in transaction $T_q$, is defined as $\sum_{i_p \in X} u(i_p, T_q)$, where $X = \{i_1, i_2, \ldots, i_k\}$ is a k-itemset, $X \subseteq T_q$ and $1 \le k \le m$.

• $u(X)$, utility of an itemset $X$, is defined as $\sum_{T_q \in D \wedge X \subseteq T_q} u(X, T_q)$.

Liu et al. (2005) proposed the Two-Phase algorithm for pruning candidate itemsets and simplifying the calculation of utility. First, Phase I overestimates some low utility itemsets, but it never underestimates any itemsets. For the example in Table 1, the transaction utility of transaction Tq, denoted as tu(Tq), is the sum of the utilities of all items in Tq: $tu(T_q) = \sum_{i_p \in T_q} u(i_p, T_q)$. Moreover, the transaction-weighted utilization of an itemset X, denoted as twu(X), is the sum of the transaction utilities of all the transactions containing X: $twu(X) = \sum_{X \subseteq T_q \in D} tu(T_q)$. For example, twu(A) = tu(T3) + tu(T6) + tu(T8) + tu(T9) = 42 + 48 + 27 + 40 = 157 and twu({D, E}) = tu(T2) + tu(T8) = 71 + 27 = 98. In fact, u(A) = u({A}, T3) + u({A}, T6) + u({A}, T8) + u({A}, T9) = 36 + 3 + 3 + 3 = 45 and u({D, E}) = u({D, E}, T2) + u({D, E}, T8) = 11 + 23 = 34. Table 2 gives the transaction utility for each transaction in Table 1. Second, one extra database scan is performed to filter the overestimated itemsets in Phase II. For example, since twu(A) = 157 > 120 but u(A) = 45 < 120, the item {A} is pruned. An itemset whose actual utility meets the threshold is kept as a high utility itemset. Finally, all high utility itemsets are discovered in this way.
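The relation between tu, twu and u can be illustrated with a short Python sketch over the data of Table 1. The helper names (`tu`, `twu`, `utility`) are illustrative, and the unit profit of E is assumed to be 5, the value implied by Table 2. The last check enumerates every non-empty itemset to confirm that twu never underestimates the true utility, which is the property Phase I relies on.

```python
from itertools import combinations

PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}  # E = 5 is inferred, see lead-in

# Transactions T1..T12 of Table 1 (panel a), non-zero quantities only.
DB = [
    {"C": 26, "E": 1}, {"B": 6, "D": 1, "E": 1}, {"A": 12, "D": 1},
    {"B": 1, "D": 7}, {"C": 12, "E": 2}, {"A": 1, "B": 4, "E": 1},
    {"B": 10, "E": 1}, {"A": 1, "C": 1, "D": 3, "E": 1}, {"A": 1, "B": 1, "C": 27},
    {"B": 6, "C": 2}, {"B": 3, "D": 2}, {"B": 2, "C": 1},
]


def tu(transaction):
    """Transaction utility: profit of every item sold in the transaction."""
    return sum(qty * PROFIT[item] for item, qty in transaction.items())


def contains(transaction, itemset):
    return all(item in transaction for item in itemset)


def twu(itemset):
    """Transaction-weighted utilization: sum of tu over transactions containing X."""
    return sum(tu(t) for t in DB if contains(t, itemset))


def utility(itemset):
    """Exact utility u(X)."""
    return sum(t[i] * PROFIT[i] for t in DB if contains(t, itemset) for i in itemset)


print(twu({"A"}), utility({"A"}))            # 157 45
print(twu({"D", "E"}), utility({"D", "E"}))  # 98 34

# twu is an upper bound on u for every non-empty itemset.
never_underestimates = all(
    twu(c) >= utility(c)
    for k in range(1, len(PROFIT) + 1)
    for c in combinations(sorted(PROFIT), k)
)
print(never_underestimates)  # True
```

Because tu(Tq) counts every item in Tq, it is at least u(X, Tq) for any X contained in Tq, so the bound holds term by term.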

We illustrate the detailed process of the Two-Phase algorithm with the following example on db1,3 of Table 1. Suppose the utility threshold is set as 120 with nine transactions in db1,3. In Phase I, the high transaction-weighted utilization 1-itemsets {A, B, C, D, E} are generated since twu(A) = tu(T3) + tu(T6) + tu(T8) + tu(T9) = 42 + 48 + 27 + 40 = 157 > 120, twu(B) = tu(T2) + tu(T4) + tu(T6) + tu(T7) + tu(T9) = 71 + 52 + 48 + 105 + 40 = 316 > 120, twu(C) = tu(T1) + tu(T5) + tu(T8) + tu(T9) = 31 + 22 + 27 + 40 = 120 ≥ 120, twu(D) = tu(T2) + tu(T3) + tu(T4) + tu(T8) = 71 + 42 + 52 + 27 = 192 > 120 and twu(E) = tu(T1) + tu(T2) + tu(T5) + tu(T6) + tu(T7) + tu(T8) = 31 + 71 + 22 + 48 + 105 + 27 = 304 > 120. Then, 10 candidate 2-itemsets {AB, AC, AD, AE, BC, BD, BE, CD, CE, DE} are generated from the high transaction-weighted utilization 1-itemsets {A, B, C, D, E} in the first database scan. In the same way, the high transaction-weighted utilization 2-itemsets {BD, BE} are generated since twu(AB) = tu(T6) + tu(T9) = 48 + 40 = 88 < 120, twu(AC) = tu(T8) + tu(T9) = 27 + 40 = 67 < 120, twu(AD) = tu(T3) + tu(T8) = 42 + 27 = 69 < 120, twu(AE) = tu(T6) + tu(T8) = 48 + 27 = 75 < 120, twu(BC) = tu(T9) = 40 < 120, twu(BD) = tu(T2) + tu(T4) = 71 + 52 = 123 > 120, twu(BE) = tu(T2) + tu(T6) + tu(T7) = 71 + 48 + 105 = 224 > 120, twu(CD) = tu(T8) = 27 < 120, twu(CE) = tu(T1) + tu(T5) + tu(T8) = 31 + 22 + 27 = 80 < 120 and twu(DE) = tu(T2) + tu(T8) = 71 + 27 = 98 < 120. After processing db1,3, the high transaction-weighted utilization itemsets in db1,3 are obtained as {A, B, C, D, E, BD, BE}.

In Phase II, the high transaction-weighted utilization itemsets {A, B, C, D, E, BD, BE} are used to scan db1,3 to find high utility itemsets. The resulting high utility itemsets are {B} and {BE} since u(A) = u({A}, T3) + u({A}, T6) + u({A}, T8) + u({A}, T9) = 45 < 120, u(B) = u({B}, T2) + u({B}, T4) + u({B}, T6) + u({B}, T7) + u({B}, T9) = 220 > 120, u(C) = u({C}, T1) + u({C}, T5) + u({C}, T8) + u({C}, T9) = 66 < 120, u(D) = u({D}, T2) + u({D}, T3) + u({D}, T4) + u({D}, T8) = 72 < 120, u(E) = u({E}, T1) + u({E}, T2) + u({E}, T5) + u({E}, T6) + u({E}, T7) + u({E}, T8) = 35 < 120, u({B, D}) = u({B, D}, T2) + u({B, D}, T4) = 66 + 52 = 118 < 120 and u({B, E}) = u({B, E}, T2) + u({B, E}, T6) + u({B, E}, T7) = 215 > 120.
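The walk-through above can be reproduced with a compact level-wise sketch in Python. This is a minimal illustration of the two phases as described in this section, not the implementation of Liu et al. (2005): the function and variable names are hypothetical, the unit profit of E is assumed to be 5 (the value implied by Table 2), and an itemset is treated as high when its value is at least the threshold. With the quantities of Table 1, transaction T2 contains both B and D, so the sketch reports {B, D} as a Phase I candidate (twu = 123) and discards it in Phase II (u = 118).

```python
from itertools import combinations

PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}  # E = 5 is inferred, see lead-in

# db1,3: transactions T1..T9 of Table 1 (panel a), non-zero quantities only.
DB13 = [
    {"C": 26, "E": 1}, {"B": 6, "D": 1, "E": 1}, {"A": 12, "D": 1},
    {"B": 1, "D": 7}, {"C": 12, "E": 2}, {"A": 1, "B": 4, "E": 1},
    {"B": 10, "E": 1}, {"A": 1, "C": 1, "D": 3, "E": 1}, {"A": 1, "B": 1, "C": 27},
]


def tu(t):
    return sum(q * PROFIT[i] for i, q in t.items())


def twu(itemset, db):
    return sum(tu(t) for t in db if itemset.issubset(t))


def utility(itemset, db):
    return sum(t[i] * PROFIT[i] for t in db if itemset.issubset(t) for i in itemset)


def two_phase(db, threshold):
    # Phase I: level-wise search, pruning on twu (which is downward closed).
    level = [frozenset(i) for i in sorted(PROFIT) if twu(frozenset(i), db) >= threshold]
    high_twu = list(level)
    k = 2
    while level:
        prev = set(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        level = [
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
            and twu(c, db) >= threshold
        ]
        high_twu += level
        k += 1
    # Phase II: one more scan computes exact utilities and filters overestimates.
    return high_twu, [x for x in high_twu if utility(x, db) >= threshold]


def label(itemset):
    return "".join(sorted(itemset))


high_twu, high_utility = two_phase(DB13, 120)
print(sorted(map(label, high_twu)))      # ['A', 'B', 'BD', 'BE', 'C', 'D', 'E']
print(sorted(map(label, high_utility)))  # ['B', 'BE']
```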

Our algorithm THUI-Mine is based on the principle of the Two-Phase algorithm (Liu et al., 2005), and we extend it with the sliding window filtering (SWF) technique and focus on utilizing incremental methods to reduce the candidate itemsets and execution time. In essence, by partitioning a transaction database into several partitions from data streams, algorithm THUI-Mine employs a filtering threshold in each partition to deal with the transaction-weighted utilization itemsets (TWUI) generated. The cumulative information in the prior phases is selectively carried over toward the generation of TWUI in the subsequent phases by THUI-Mine. In the processing of a partition, a progressive transaction-weighted utilization set of itemsets is generated by THUI-Mine. Explicitly, a progressive transaction-weighted utilization set of itemsets is composed of the following two types of TWUI: (1) the TWUI that were carried over from the previous progressive candidate set in the previous phase and remain as TWUI after the current partition is taken into consideration; (2) the TWUI that were not in the progressive candidate set in the previous phase but are newly selected after taking only the current data partition into account. As such, after the processing of a phase, algorithm THUI-Mine outputs a cumulative filter, denoted as CF, which consists of a progressive transaction-weighted utilization set of itemsets with their occurrence counts and the corresponding partial support required.

THUI-Mine is different from other existing methods like Lossy Counting (Manku and Motwani, 2002), which uses bucket boundaries and maximal possible error to update or delete the itemsets with frequency. The CF computes the occurrence counts of itemsets in memory and then deletes itemsets that do not satisfy the utility threshold in every partial database. With these design considerations, algorithm THUI-Mine is shown to have very good performance for mining temporal high utility itemsets from data streams. In Section 3.2, we give an example for mining temporal high utility itemsets from a data stream. The details of the THUI-Mine algorithm are described in Section 3.3.

3.2. An example for mining temporal high utility itemsets

The proposed THUI-Mine algorithm can be best understood through the illustrative transaction database in Table 1 and Fig. 2, where a scenario of generating high utility itemsets from data streams for mining temporal high utility itemsets is given. For real life applications, this illustrative transaction database can be mapped to the customer transactions in a supermarket. We set the utility threshold as 120 with nine transactions. Without loss of generality, the temporal mining problem can be decomposed into two procedures:

1. Pre-processing procedure: This procedure deals with mining on the original transaction database.

2. Incremental procedure: This procedure deals with the update of the high utility itemsets from data streams.

The pre-processing procedure is only utilized for the initial utility mining in the original database, e.g., db1,n. For mining high utility itemsets in db2,n+1, db3,n+2, dbi,j and so on, the incremental procedure is employed. Consider the database in Table 1. Assume that the original transaction database db1,3 is segmented into three partitions, namely, {P1, P2, P3}, in the pre-processing procedure. Each partition is scanned sequentially for the generation of candidate 2-itemsets in the first scan of the database db1,3. Since there are three partitions, the utility threshold of each partition is 120/3 = 40. Such a partial utility threshold is called the filtering threshold in this paper. After scanning the first segment of the three transactions, 1-itemsets {A, B, D, E} are kept to generate 2-itemsets because twu(A) = 42 > 40, twu(B) = 71 > 40, twu(C) = 31 < 40, twu(D) = 113 > 40 and twu(E) = 102 > 40. Then, 2-itemsets {AB, AD, AE, BD, BE, DE} are generated from the 1-itemsets {A, B, D, E} in partition P1 as shown in Fig. 2. In addition, each potential candidate itemset c ∈ C2 has two attributes: (1) c.start, which contains the identity of the starting partition when c was added to C2, and (2) transaction-weighted utility, which is the sum of the transaction utilities of all the transactions containing c since c was added to C2. Itemsets whose transaction-weighted utility is below the filtering threshold are removed. Then, as shown in Fig. 2, only {AD, BD, BE, DE}, marked by ‘‘}’’, remain as temporal high transaction-weighted utilization 2-itemsets (TWU2I) whose information is then carried over to the next phase of processing. Similarly, after scanning partition P2, the temporal high TWU2I are recorded.

Table 2
Transaction utility of the transaction database

TID   Transaction utility   TID   Transaction utility
T1    31                    T7    105
T2    71                    T8    27
T3    42                    T9    40
T4    52                    T10   62
T5    22                    T11   42
T6    48                    T12   21

From Fig. 2, it is noted that since there are also three transactions in P2, the filtering threshold of those itemsets carried over from the previous phase is 40 + 40 = 80 and that of newly identified candidate itemsets is 40. It can be seen from Fig. 2 that we have four temporal high TWU2I in C2 after the processing of partition P2: two of them are carried from P1 to P2 and two of them are newly identified in P2. Finally, partition P3 is processed by algorithm THUI-Mine. The resulting temporal high TWU2I are {AB, AC, BC, BD, BE} as shown in Fig. 2. Note that although itemset {AE} appears in the previous phase P2, it is removed from the temporal high TWU2I once P3 is taken into account, since its transaction-weighted utility then does not meet the filtering threshold for the two partitions P2 and P3, i.e., 75 < 80. However, we do have two new itemsets, i.e., AC and BC, which join C2 as temporal high TWU2I. Consequently, we have five temporal high TWU2I generated by THUI-Mine, where two of them are carried from P1 to P3, one of them is carried from P2 to P3, and two of them are newly identified in P3. Note that only five temporal high TWU2I are generated by THUI-Mine, while 10 candidate itemsets would be generated if the Two-Phase algorithm were used, as mentioned in Section 3.1. After processing P1 to P3, the temporal high TWUI in db1,3 are obtained as {A, B, C, D, E, AB, AC, BC, BD, BE}.
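The partition-by-partition filtering described above can be traced with the following Python sketch. It is a simplified reading of the pre-processing step, assuming equal-sized partitions, a filtering threshold of 40 per partition, and a unit profit of 5 for item E (inferred from Table 2); `preprocess`, `PARTITIONS` and `FILTER` are illustrative names. The 1-itemset filter is omitted because a 2-itemset can only reach the threshold within a partition if both of its items do.

```python
from itertools import combinations

PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}  # E = 5 is inferred, see lead-in

# Partitions P1..P3 of db1,3 (Table 1, panel a), non-zero quantities only.
PARTITIONS = [
    [{"C": 26, "E": 1}, {"B": 6, "D": 1, "E": 1}, {"A": 12, "D": 1}],
    [{"B": 1, "D": 7}, {"C": 12, "E": 2}, {"A": 1, "B": 4, "E": 1}],
    [{"B": 10, "E": 1}, {"A": 1, "C": 1, "D": 3, "E": 1}, {"A": 1, "B": 1, "C": 27}],
]
FILTER = 40  # filtering threshold per partition: 120 / 3


def tu(t):
    return sum(q * PROFIT[i] for i, q in t.items())


def preprocess(partitions, s):
    """Build the cumulative filter: 2-itemset -> (start partition, accumulated twu)."""
    cf = {}
    history = []
    for k, part in enumerate(partitions, start=1):
        for t in part:
            for pair in combinations(sorted(t), 2):
                key = "".join(pair)
                start, acc = cf.get(key, (k, 0))  # new itemsets start at partition k
                cf[key] = (start, acc + tu(t))
        # An itemset must reach s for every partition since its start partition.
        cf = {c: (st, w) for c, (st, w) in cf.items() if w >= s * (k - st + 1)}
        history.append(sorted(cf))
    return cf, history


cf, history = preprocess(PARTITIONS, FILTER)
for k, kept in enumerate(history, start=1):
    print(f"after P{k}: {kept}")
# after P1: ['AD', 'BD', 'BE', 'DE']
# after P2: ['AB', 'AE', 'BD', 'BE']
# after P3: ['AB', 'AC', 'BC', 'BD', 'BE']
```

In this trace {AE} enters at P2 with a transaction-weighted utility of 48, grows to 75 at P3 and is dropped because it falls short of the two-partition threshold of 80.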

After generating temporal high TWU2I from the first scan of database db1,3, we use a scan reduction technique to reduce the number of database scans. In fact, it would take k − 1 database scans to generate the candidate k-itemsets by using temporal high transaction-weighted utilization (k − 1)-itemsets directly. Instead, we use temporal high TWU2I to generate Ck (k = 3, 4, ..., n), where Cn is the last set of candidate itemsets. It can be verified that temporal high TWU2I generated by THUI-Mine can be used to generate the candidate 3-itemsets. Clearly, a C3 can be generated from temporal high TWU2I. For example, the candidate 3-itemset {ABC} is generated from temporal high TWU2I {AB, AC, BC} in db1,3. Moreover, the temporal high TWU2I generated by THUI-Mine are very close to the high utility itemsets. Similarly, all Ck can be stored in main memory, and we can find temporal high utility itemsets together when the second scan of the database db1,3 is performed. Thus, only two scans of the original database db1,3 are required in the pre-processing step. In this way, the number of database scans is reduced effectively. The resulting temporal high utility itemsets are {B} and {BE} since u(B) = 220 > 120 and u({B, E}) = 215 > 120.
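The scan reduction step amounts to deriving every larger candidate set in memory from the 2-itemsets alone, with no database access. A minimal Python sketch of that idea follows; `candidates_from_pairs` is a hypothetical helper, and the join-and-prune rule used here (a k-itemset is kept only if all of its (k − 1)-subsets were kept) is the standard Apriori-style candidate generation, assumed to match what the text describes.

```python
from itertools import combinations


def candidates_from_pairs(pairs):
    """Generate C3, C4, ... from 2-itemsets only, without scanning the database."""
    level = {frozenset(p) for p in pairs}
    result = {}
    k = 3
    while level:
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Keep a candidate only if every (k-1)-subset is in the previous level.
        level = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        if level:
            result[k] = level
        k += 1
    return result


twu2i = ["AB", "AC", "BC", "BD", "BE"]  # temporal high TWU2I of db1,3
candidates = candidates_from_pairs(twu2i)
print({k: sorted("".join(sorted(c)) for c in cs) for k, cs in candidates.items()})
# {3: ['ABC']}
```

Only {ABC} survives because every other 3-item union, such as {ABD}, has a 2-subset ({AD}) that is not among the temporal high TWU2I.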

One important merit of THUI-Mine lies in its incremental procedure. As depicted in Fig. 2, the mining of the database will be moved from db1,3 to db2,4. Thus, some transactions like T1, T2 and T3 are deleted from the mining database and other transactions like T10, T11 and T12 are added. To illustrate it more clearly, this incremental step can be divided into three sub-steps, where Δ− and Δ+ denote the deleted and added portions of the database and D− denotes the unchanged portion: (1) generating temporal high TWU2I in D− = db1,3 − Δ−, (2) generating temporal high TWU2I in db2,4 = D− + Δ+ and (3) scanning the database db2,4 only once for the generation of all temporal high utility itemsets. In the first sub-step, db1,3 − Δ− = D−, we check the pruned partition P1, reduce the value of transaction-weighted utility and set c.start = 2 for those temporal TWU2I where c.start = 1. It can be seen that itemset {BD} is removed. Next, in the second sub-step, we scan the incremental transactions in P4. The process in D− + Δ+ = db2,4 is similar to the operation of scanning partitions, e.g., P2, in the pre-processing step. The new itemset {BD} joins the temporal high TWU2I after the scan of P4. In the third sub-step, we use temporal high TWU2I to generate Ck as mentioned above. Finally, the temporal high TWUI in db2,4 are {B, C, D, E, BC, BD, BE}. By scanning db2,4 only once, THUI-Mine obtains the temporal high utility itemsets {B, BC, BE} in db2,4.
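The three sub-steps can be sketched in Python as follows, restricted to 2-itemsets for brevity. This is an illustrative reading of the procedure rather than the authors' code: the helper names (`add_partition`, `drop_partition`, `prune`) are hypothetical, partitions are assumed to be equal-sized with a filtering threshold of 40 each, and the unit profit of E is assumed to be 5 as implied by Table 2.

```python
from itertools import combinations

PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}  # E = 5 is inferred, see lead-in
PARTS = {  # partitions P1..P4 of Table 1 (panel a), non-zero quantities only
    1: [{"C": 26, "E": 1}, {"B": 6, "D": 1, "E": 1}, {"A": 12, "D": 1}],
    2: [{"B": 1, "D": 7}, {"C": 12, "E": 2}, {"A": 1, "B": 4, "E": 1}],
    3: [{"B": 10, "E": 1}, {"A": 1, "C": 1, "D": 3, "E": 1}, {"A": 1, "B": 1, "C": 27}],
    4: [{"B": 6, "C": 2}, {"B": 3, "D": 2}, {"B": 2, "C": 1}],
}
S = 40  # filtering threshold per partition


def tu(t):
    return sum(q * PROFIT[i] for i, q in t.items())


def pair_twu(k):
    """Transaction-weighted utility of every 2-itemset inside partition k."""
    acc = {}
    for t in PARTS[k]:
        for pair in combinations(sorted(t), 2):
            key = "".join(pair)
            acc[key] = acc.get(key, 0) + tu(t)
    return acc


def prune(cf, last):
    return {c: (st, w) for c, (st, w) in cf.items() if w >= S * (last - st + 1)}


def add_partition(cf, k):
    """Scan partition k: accumulate carried itemsets, start new ones at k."""
    for c, w in pair_twu(k).items():
        start, old = cf.get(c, (k, 0))
        cf[c] = (start, old + w)
    return prune(cf, k)


def drop_partition(cf, k, last):
    """Remove the oldest partition k: subtract its share and move c.start to k + 1."""
    local = pair_twu(k)
    moved = {c: ((k + 1, w - local.get(c, 0)) if st == k else (st, w))
             for c, (st, w) in cf.items()}
    return prune(moved, last)


def utility(itemset, window):
    return sum(t[i] * PROFIT[i] for k in window for t in PARTS[k]
               if all(i in t for i in itemset) for i in itemset)


cf = {}
for k in (1, 2, 3):                  # pre-processing on db1,3
    cf = add_partition(cf, k)
cf = drop_partition(cf, 1, last=3)   # sub-step 1: D- = db1,3 - delta-
after_drop = sorted(cf)
cf = add_partition(cf, 4)            # sub-step 2: db2,4 = D- + delta+
high = sorted(c for c in cf if utility(c, (2, 3, 4)) >= 120)  # sub-step 3
print(after_drop, sorted(cf), high)
# ['AB', 'AC', 'BC', 'BE'] ['BC', 'BD', 'BE'] ['BC', 'BE']
```

{BD} disappears after P1 is dropped (its remaining utility of 52 is below 80) and rejoins with c.start = 4 once T11 is scanned, which mirrors the behaviour described in the text.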

Table 3
Meanings of symbols used

Symbol        Meaning
dbi,j         Partitioned database (D) from Pi to Pj
s             Utility threshold in one partition
|Pk|          Number of transactions in partition Pk
TUPk(I)       Transactions in Pk that contain itemset I with transaction utility
UPk(I)        Transactions in Pk that contain itemset I with utility
|db1,n(I)|    Number of transactions in db1,n that contain itemset I
Ci,j          The progressive candidate sets of dbi,j
Thtwi,j       The progressive temporal high transaction-weighted utilization 2-itemsets of dbi,j
Thui,j        The progressive temporal high utility itemsets of dbi,j
Δ−            The deleted portion of an ongoing database
D−            The unchanged portion of an ongoing database
Δ+            The added portion of an ongoing database

In contrast, the Two-Phase algorithm has to scan the whole database db2,4, and more candidate itemsets, i.e., {BC, BD, BE, CD, CE, DE}, will be generated whenever some transactions are deleted and other transactions are added. Then, the Two-Phase algorithm needs one more database scan than THUI-Mine to obtain temporal high TWU2I. Finally, the Two-Phase algorithm scans the database again to produce temporal high utility itemsets. Hence, more database scans and candidate itemsets are incurred by the Two-Phase algorithm in comparison with THUI-Mine.

3.3. THUI-Mine algorithm

For easier illustration, the meanings of various symbols used are given in Table 3. The pre-processing procedure and the incremental procedure of algorithm THUI-Mine are described in Sections 3.3.1 and 3.3.2, respectively.

3.3.1. Pre-processing procedure of THUI-Mine

The pre-processing procedure of algorithm THUI-Mine is shown in Fig. 3. Initially, the database db1,n is partitioned into n partitions by executing the pre-processing procedure (in Step 2), and CF, the cumulative filter, is empty (in Step 3). Let Thtw1,n be the set of progressive temporal high TWU2I of db1,n. Algorithm THUI-Mine records only Thtw1,n, which is generated by the pre-processing procedure and later used by the incremental procedure. From Step 4 to Step 16, the algorithm processes the partitions one at a time. When partition Pi is processed, each potential candidate 2-itemset is read and saved to CF. The transaction-weighted utilization of an itemset I and its starting partition are recorded in I.twu and I.start, respectively. An itemset with I.twu ≥ s is kept in CF. Next, we select Thtw1,n from the itemsets I ∈ CF and keep I.twu in main memory for the subsequent incremental procedure. By employing the scan reduction technique from Step 19 to Step 26, the candidate h-itemsets of C1,n (h ≥ 3) are generated in main memory. After refreshing I.count = 0 where I.twu = 0 for I ∈ Thtw1,n, we begin the last scan of the database for the pre-processing procedure from Step 28 to Step 31. Finally, the itemsets satisfying the constraint I.u ≥ s · P.count are obtained as the temporal high utility itemsets.
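The cumulative filter described above can be sketched in a few lines. This is an illustrative reading of the text, not the procedure of Fig. 3: the names `preprocess` and `transaction_utility` are hypothetical, transactions are assumed to be dicts from item to quantity, and the pruning rule (accumulated TWU must reach s times the number of partitions since I.start) is an assumption borrowed from sliding-window filtering, since the text only states that itemsets with I.twu ≥ s are kept.

```python
from itertools import combinations


def transaction_utility(tx, profit):
    """Utility of a whole transaction: sum of quantity * unit profit."""
    return sum(qty * profit[item] for item, qty in tx.items())


def preprocess(partitions, profit, s):
    """Build the cumulative filter CF over partitions P1..Pn.
    CF maps a 2-itemset to {"start": first partition, "twu": accumulated TWU}."""
    cf = {}
    for k, partition in enumerate(partitions, start=1):
        # TWU contributed by this partition to every 2-itemset it contains.
        local = {}
        for tx in partition:
            tu = transaction_utility(tx, profit)
            for pair in combinations(sorted(tx), 2):
                local[pair] = local.get(pair, 0) + tu
        for pair, twu in local.items():
            entry = cf.setdefault(pair, {"start": k, "twu": 0})
            entry["twu"] += twu
        # Assumed rule: keep I only if I.twu >= s * (partitions since I.start).
        cf = {p: e for p, e in cf.items() if e["twu"] >= s * (k - e["start"] + 1)}
    return cf


# Hypothetical data: two partitions of two transactions each.
profit = {"A": 1, "B": 10, "C": 5}
p1 = [{"A": 1, "B": 1}, {"B": 2, "C": 1}]
p2 = [{"B": 1, "C": 2}, {"A": 3}]
print(preprocess([p1, p2], profit, s=20))
```

Recording the starting partition with each entry is what later allows the incremental procedure to undo the contribution of deleted partitions without rescanning the unchanged ones.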

3.3.2. Incremental procedure of THUI-Mine

As shown in Table 3, D− indicates the unchanged portion of an ongoing transaction database. The deleted and added portions of an ongoing transaction database are denoted by Δ− and Δ+, respectively. It is worth mentioning that the sizes of Δ+ and Δ−, i.e., |Δ+| and |Δ−| respectively, are not required to be the same. The incremental procedure of THUI-Mine is devised to maintain temporal high utility itemsets efficiently and effectively. This procedure is shown in Fig. 4. As mentioned before, this incremental step can also be divided into three sub-steps: (1) generating temporal high TWU2I in D− = db1,3 − Δ−, (2) generating temporal high TWU2I in db2,4 = D− + Δ+ and (3) scanning the database db2,4 only once for the generation of all temporal high utility itemsets. Initially, after some update activities, old transactions Δ− are removed from the database dbm,n and new transactions Δ+ are added (in Step 6). Note that Δ− ⊂ dbm,n. Denoting the updated database as dbi,j, we have dbi,j = dbm,n − Δ− + Δ+. We denote the unchanged transactions by D− = dbm,n − Δ− = dbi,j − Δ+. After loading Thtwm,n of dbm,n into CF, i.e., every I ∈ Thtwm,n, we start the first sub-step, i.e., generating temporal high TWU2I in D− = dbm,n − Δ−. This sub-step reverses the cumulative processing described in the pre-processing procedure. From Step 8 to Step 16, we prune the occurrences of an itemset I that appeared before partition Pi by deleting the value I.twu where I ∈ CF and I.start < i. Next, from Step 17 to Step 39, similarly to the cumulative processing in Section 3.3.1, the second sub-step generates temporal high TWU2I in dbi,j = D− + Δ+ and employs the scan reduction technique to generate the candidate (h+1)-itemsets of Ci,j. Finally, to generate the temporal high utility itemsets, i.e., Thui,j, in the updated database, we scan dbi,j only once in the incremental procedure. Note that Thtwi,j is kept in main memory for the next round of incremental mining.
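The first two sub-steps of the incremental procedure can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure of Fig. 4: `slide_window` and `pair_twu` are hypothetical names, transactions are dicts from item to quantity, the window is assumed to keep at least one old partition (i ≤ n), and the pruning rule (accumulated TWU at least s times the number of partitions since I.start) is an assumption borrowed from sliding-window filtering.

```python
from itertools import combinations


def pair_twu(partition, profit):
    """Transaction-weighted utilization of every 2-itemset in one partition."""
    twu = {}
    for tx in partition:
        tu = sum(qty * profit[item] for item, qty in tx.items())
        for pair in combinations(sorted(tx), 2):
            twu[pair] = twu.get(pair, 0) + tu
    return twu


def slide_window(cf, deleted, added, m, n, profit, s):
    """Move the filter from db(m,n) to db(i,j), with i = m + len(deleted)
    and j = n + len(added). cf maps a 2-itemset to {"start", "twu"}."""
    cf = {pair: dict(entry) for pair, entry in cf.items()}  # leave input intact
    i = m + len(deleted)

    # Sub-step 1: undo the TWU accumulated from the deleted partitions.
    for k, partition in enumerate(deleted, start=m):
        for pair, twu in pair_twu(partition, profit).items():
            if pair in cf and cf[pair]["start"] <= k:
                cf[pair]["twu"] -= twu
    for entry in cf.values():
        entry["start"] = max(entry["start"], i)
    cf = {p: e for p, e in cf.items() if e["twu"] >= s * (n - e["start"] + 1)}

    # Sub-step 2: accumulate the added partitions, pruning after each one.
    for k, partition in enumerate(added, start=n + 1):
        for pair, twu in pair_twu(partition, profit).items():
            entry = cf.setdefault(pair, {"start": k, "twu": 0})
            entry["twu"] += twu
        cf = {p: e for p, e in cf.items() if e["twu"] >= s * (k - e["start"] + 1)}
    return cf


# Hypothetical data: the window db(1,2) slides to db(2,3).
profit = {"A": 1, "B": 10, "C": 5}
p1 = [{"A": 1, "B": 1}, {"B": 2, "C": 1}]
p3 = [{"B": 2, "C": 1}, {"A": 1, "C": 1}]
old_cf = {("B", "C"): {"start": 1, "twu": 45}}
print(slide_window(old_cf, [p1], [p3], m=1, n=2, profit=profit, s=20))
```

Only the deleted and added partitions are read here; the unchanged portion is represented entirely by the values already stored in the filter.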

4. Experimental evaluation

To evaluate the performance of THUI-Mine, we conducted experiments using synthetic datasets generated via a randomized dataset generator provided by the IBM Quest project (Agrawal and Srikant, 1995). However, the IBM Quest data generator only generates a quantity of 0 or 1 for each item in a transaction. In order to fit the databases into the scenario of utility mining, we randomly generate the quantity of each item in each transaction, ranging from 1 to 5, similar to the model used in Liu et al. (2005). Utility tables are also synthetically created by assigning a utility value to each item randomly, ranging from 1 to 1000. Because it is observed from real-world databases that most items are in the low profit range, we generate the utility values using a log-normal distribution, again following Liu et al. (2005). Fig. 5 shows the utility value distribution of 1000 items.
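The data preparation described above can be sketched as follows. This is only an illustration of the idea: the function names are hypothetical, the log-normal parameters `mu` and `sigma` are assumed values (the text does not give them), and clipping the drawn values to the range 1 to 1000 is likewise an assumption about how the stated range is enforced.

```python
import random


def make_utility_table(num_items, mu=3.0, sigma=1.5, seed=0):
    """Assign each item a log-normally distributed utility value,
    rounded and clipped to the range 1..1000 (mu, sigma are assumed)."""
    rng = random.Random(seed)
    table = {}
    for item in range(num_items):
        value = rng.lognormvariate(mu, sigma)
        table[item] = min(1000, max(1, round(value)))
    return table


def add_quantities(transactions, seed=0):
    """Turn each transaction (a list of items) into item -> quantity,
    with quantities drawn uniformly from 1..5."""
    rng = random.Random(seed)
    return [{item: rng.randint(1, 5) for item in tx} for tx in transactions]


utilities = make_utility_table(1000)
dataset = add_quantities([[3, 17, 256], [17, 42]])
print(min(utilities.values()), max(utilities.values()))
print(dataset)
```

With a log-normal draw, most items receive small utility values and a few receive large ones, which matches the low-profit skew the text attributes to real-world databases.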

The simulation is implemented in C++ and conducted on a machine with a 2.4 GHz CPU and 1 GB of memory. For comparison with the THUI-Mine algorithm, the Two-Phase algorithm is extended to the sliding window scenario. The extended Two-Phase algorithm scans the database according to the set time window and then performs the computation within the time window. This process is repeated over the sliding time window for the database. The main performance metric used is execution time. We recorded the execution time of THUI-Mine in finding temporal high utility itemsets. The comparison of the number of generated itemsets for THUI-Mine, Two-Phase and MEU is presented in Section 4.1. Section 4.2 shows the performance comparison of THUI-Mine and Two-Phase. The results of scale-up experiments are presented in Section 4.3. Section 4.4 shows the performance comparison of THUI-Mine and Two-Phase on another, dense dataset.

4.1. Evaluation on number of generated candidates

[Fig. 5. Utility value distribution: number of items versus utility value (0 to 1000).]

In this experiment, we compare the average number of candidates generated in the first database scan over the sliding windows, with an incremental transaction number of 10K (d10K) and different threshold values, for THUI-Mine, Two-Phase (Liu et al., 2005) and MEU (Yao et al., 2004). Without loss of generality, we set |d| = |Δ+| = |Δ−| for simplicity. Thus, by denoting the original database as db1,n and the new mining database as dbi,j, we have |dbi,j| = |db1,n − Δ− + Δ+| = |D|, where Δ− = db1,i−1 and Δ+ = dbn+1,j. Tables 4 and 5 show the average number of candidates generated by THUI-Mine, Two-Phase and MEU on the two datasets, respectively. The number of items is set at 1000, and the minimum utility threshold varies from 0.2% to 1%. The experimental results show that the number of candidate itemsets generated by THUI-Mine at the first database scan decreases dramatically as the threshold goes up. In particular, when the utility threshold is set to 1%, the number of candidate itemsets is 0 in database T10.I6.D100K.d10K, where T denotes the average size of the transactions and I the average size of the maximal potentially frequent itemsets. The default size of the sliding window is set to 30K. We also varied the size of the sliding window, and the experimental results show that THUI-Mine outperforms the Two-Phase algorithm under different sliding window sizes. Due to space limitations, we only show the representative results with the sliding window size set to 30K. In contrast, the number of candidates generated by Two-Phase is still very large, and that for MEU is always 499,500 because it needs to process all 2-item combinations of the 1000 items. THUI-Mine generates far fewer candidates than Two-Phase and MEU. We obtain similar experimental results for different datasets. For example, only 118 candidate itemsets are generated by THUI-Mine, but 183,921 and 499,500 candidate itemsets are generated by Two-Phase and MEU, respectively, when the utility threshold is set to 1% in dataset T20.I6.D100K.d10K. In the case of dataset T20.I6.D100K.d10K, more candidates are generated because the transactions are longer than those in T10.I6.D100K.d10K. Overall, THUI-Mine always generates far fewer candidates than Two-Phase and MEU for various kinds of databases. Hence, THUI-Mine is verified to be very effective in pruning candidate itemsets to find temporal high utility itemsets.
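The constant MEU figure of 499,500 quoted in this section is simply the number of unordered pairs among 1000 items. The short check below makes that arithmetic explicit; the function name `pair_candidates` is hypothetical.

```python
from math import comb


def pair_candidates(num_items):
    """Number of distinct 2-itemsets over num_items items: n * (n - 1) / 2."""
    return comb(num_items, 2)


print(pair_candidates(1000))
```

Because this count depends only on the number of items, it stays the same for every threshold in Tables 4 and 5.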

4.2. Evaluation of execution efficiency

In this experiment, we compare only the relative performance of Two-Phase and THUI-Mine, since MEU requires far more execution time and is not comparable. Figs. 6 and 7 show the execution times for the two algorithms on datasets T20.I6.D100K.d10K and T10.I6.D100K.d10K, respectively, as the minimum utility threshold is decreased from 1% to 0.2%. It is observed that when the minimum utility threshold is high, only a limited number of high utility itemsets are produced. However, as the minimum utility threshold decreases, the performance difference becomes prominent in that THUI-Mine significantly outperforms Two-Phase. As shown in Figs. 6 and 7, THUI-Mine leads to prominent performance improvement under different transaction sizes. Specifically, THUI-Mine is significantly faster than Two-Phase, and the margin grows as the minimum utility threshold decreases. For example, THUI-Mine is 10 times faster than Two-Phase when the threshold is 0.2% for T20.I6.D100K.d10K. Overall, THUI-Mine spends much less time than Two-Phase, with higher stability, in finding temporal high utility itemsets. This is because the Two-Phase algorithm produces more candidate itemsets and needs more database scans to find high utility itemsets than THUI-Mine. To measure the improvement in execution time of THUI-Mine over the Two-Phase algorithm, we define the Improvement Ratio as the difference between the execution time of Two-Phase and that of THUI-Mine, divided by the execution time of Two-Phase.

Fig. 6. Execution time for Two-Phase and THUI on T20.I6.D100K.d10K. [Plot: execution time (sec) versus minimum utility threshold (%), for Two-Phase and THUI-Mine.]

Table 4
The number of candidate itemsets generated on database T10.I6.D100K.d10K

Threshold (%)   THUI-Mine   Two-Phase   MEU
0.2             3,433       361,675     499,500
0.3             666         303,810     499,500
0.4             161         258,840     499,500
0.6             7           182,710     499,500
0.8             1           129,286     499,500
1               0           91,378      499,500

Table 5
The number of candidate itemsets generated on database T20.I6.D100K.d10K

Threshold (%)   THUI-Mine   Two-Phase   MEU
0.2             27,357      401,856     499,500
0.3             11,659      371,953     499,500
0.4             5,389       337,431     499,500
0.6             1,364       278,631     499,500
0.8             371         229,503     499,500
1               118         183,921     499,500


From the data illustrated in Fig. 6, we see that the Improvement Ratio is about 85.6% with the threshold set to 0.2%. In Fig. 7, the average improvement is about 67% with the minimum utility threshold varied from 0.2% to 1%. THUI-Mine thus substantially reduces the time needed to find high utility itemsets. Moreover, Two-Phase is not suitable for applications in data streams, since it needs more database scans and more execution time to find high utility itemsets. Hence, THUI-Mine meets the requirement of high efficiency in terms of execution time for data stream mining.
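The Improvement Ratio used in this section is the fraction of the Two-Phase execution time that THUI-Mine saves. A minimal sketch of the computation is given below; the function name is hypothetical and the two timings in the example are made-up values chosen only to illustrate a ratio of 85.6%, not measurements from the paper.

```python
def improvement_ratio(time_two_phase, time_thui_mine):
    """(Two-Phase time - THUI-Mine time) / Two-Phase time."""
    if time_two_phase <= 0:
        raise ValueError("the Two-Phase execution time must be positive")
    return (time_two_phase - time_thui_mine) / time_two_phase


# Hypothetical timings in seconds, not taken from Fig. 6 or Fig. 7.
print(f"{improvement_ratio(1000.0, 144.0):.1%}")
```

A ratio of 0 means both algorithms take the same time, and values approaching 1 mean THUI-Mine's time is negligible compared with that of Two-Phase.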

4.3. Scale-up on incremental mining

In this experiment, we investigate the effects of varying the incremental transaction size on the execution time of mining results. To further understand the impact of |d| on the relative performance of THUI-Mine and Two-Phase, we conduct scale-up experiments similar to those described in Lee et al. (2001), with minimum support thresholds set to 0.2% and 0.4%, respectively. Fig. 8 shows the experimental results, where the value on the y-axis corresponds to the ratio of the execution time of THUI-Mine to that of Two-Phase under different values of |d|. It can be seen that the execution time ratio remains stable as the incremental transaction number |d| grows, since the size of |d| has little influence on the performance of THUI-Mine. Moreover, the execution time ratio of the scale-up experiments with minimum support thresholds varied from 0.6% to 1% remains constant at approximately 0.4. This implies that the advantage of THUI-Mine over Two-Phase is stable and that less execution time is taken as the size of the incremental portion increases. This result also indicates that THUI-Mine is useful for mining data streams with large transaction sizes.

4.4. Evaluation on dense data

Typically, synthetic data sets are very sparse. To test other kinds of databases, we also evaluate a dense dataset, the gazelle data set used in Zaki and Hsiao (2005). The gazelle data set comes from click-stream data from a dot-com company named Gazelle.com, a legwear and legcare retailer. This data set was used in the KDD-Cup 2000 competition and is publicly available from www.ecn.purdue.edu/KDDCUP. In order to fit the database into the scenario of utility mining, we again randomly generate the quantity of each item in each transaction, ranging from 1 to 5. The utility tables are also synthetically created by assigning a utility value to each item randomly, ranging from 1 to 1000.

\[
\text{Improvement Ratio} = \frac{(\text{execution time of Two-Phase}) - (\text{execution time of THUI-Mine})}{\text{execution time of Two-Phase}}
\]

Fig. 7. Execution time for Two-Phase and THUI on T10.I6.D100K.d10K. [Plot: execution time (sec) versus minimum utility threshold (%), for Two-Phase and THUI-Mine.]

Fig. 8. Scale-up performance results for THUI vs. Two-Phase. [Plot for T10.I4.D100K.dnK: execution time ratio (THUI-Mine/Two-Phase) versus |d|, the incremental transaction number (K), for thresholds 0.2% and 0.4%.]

[Fig. 9 plot, Gazelle: execution time (sec) versus minimum utility threshold (%), for Two-Phase and THUI-Mine.]

Fig. 9 shows the execution time for the two algorithms as the minimum utility threshold is varied from 0.1% to 0.02%. It is observed that THUI-Mine still spends less time than Two-Phase, with higher stability, in finding temporal high utility itemsets even on dense data. This is because the Two-Phase algorithm produces more candidate itemsets and needs more database scans to find high utility itemsets than THUI-Mine. Hence, this result also indicates that THUI-Mine is effective for mining temporal high utility itemsets on both sparse and dense datasets.

5. Conclusions

In this paper, we addressed the problem of discovering temporal high utility itemsets in data streams. In a stream database setting, memory is often limited and it is hard to store large itemsets in memory. We propose a new algorithm, namely THUI-Mine, which can discover temporal high utility itemsets from data streams efficiently and effectively. The novel contribution of THUI-Mine is that it can effectively identify the temporal high utility itemsets with fewer candidate itemsets, so that the execution time is reduced substantially. In this way, the process of discovering the temporal high utility itemsets in data streams can be achieved effectively with less memory space and execution time. This meets the critical requirements of time and space efficiency for mining data streams.

The experimental results show that THUI-Mine can discover the temporal high utility itemsets with higher performance by generating fewer candidate itemsets than other algorithms under different experimental conditions, including both sparse and dense datasets. Across the experiments, THUI-Mine is 2–10 times faster than Two-Phase, and the performance gain becomes more significant as the minimum utility threshold decreases. For example, THUI-Mine is 10 times faster than Two-Phase when the threshold is 0.2% for dataset T20.I6.D100K.d10K. This performance enhancement comes mainly from the fact that THUI-Mine produces far fewer candidate itemsets. Moreover, the experimental results also show that THUI-Mine is scalable to large databases: the advantage of THUI-Mine over Two-Phase is stable, and less execution time is taken as the size of the incremental portion of the database increases. Hence, THUI-Mine is promising for mining temporal high utility itemsets in data streams. In future work, we would like to extend the concepts proposed in this work to discover other interesting patterns in data streams, such as utility items with negative profit.

References

Agrawal, R., Imielinski, T., Swami, A., 1993. Mining association rules between sets of items in large databases. In: Proceedings of 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, pp. 207–216.

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I., 1996. Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, pp. 307–328.

Agrawal, R., Srikant, R., 1995. Mining sequential patterns. In: Proceedings of the 11th International Conference on Data Engineering, March 1995, pp. 3–14.

Ayn, N.F., Tansel, A.U., Arun, E., 1999. An efficient algorithm to update large itemsets with early pruning. Technical Report BU-CEIS-9908, Dept. CEIS Bilkent University, June 1999.

Ayn, N.F., Tansel, A.U., Arun, E., 1999. An efficient algorithm to update large itemsets with early pruning. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego.

Bettini, C., Wang, X.S., Jajodia, S., 1996. Testing complex temporal relationships involving multiple granularities and its application to data mining. In: Proceedings of the 15th ACM Symposium on Principles of Database Systems, Montreal, Canada, pp. 68–78.

Chan, R., Yang, Q., Shen, Y., 2003. Mining high utility itemsets. In: Proceedings of IEEE ICDM, Florida.

Cheung, D., Han, J., Ng, V., Wong, C.Y., 1996. Maintenance of discovered association rules in large databases: an incremental updating technique. In: Proceedings of 1996 International Conference on Data Engineering, February 1996, pp. 106–114.

Cheung, D., Lee, S.D., Kao, B., 1997. A general incremental technique for updating discovered association rules. In: Proceedings of the International Conference on Database Systems for Advanced Applications, April 1997.

Chi, Y., Wang, H., Yu, P.S., Muntz, R.R., 2004. Moment: maintaining closed frequent itemsets over a stream sliding window. In: Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM'04).

Das, G., Lin, K.I., Mannila, H., Renganathan, G., Smyth, P., 1998. Rule discovery from time series. In: Proceedings of the 4th ACM SIGKDD, August 1998, pp. 16–22.

Lee, C.H., Lin, C.R., Chen, M.S., 2001. Sliding-window filtering: an efficient algorithm for incremental mining. In: International Conference on Information and Knowledge Management (CIKM01), November 2001, pp. 263–270.

Lin, J.L., Dunham, M.H., 1998. Mining association rules: anti-skew algorithms. In: Proceedings of 1998 International Conference on Data Engineering, pp. 486–493.

Liu, Y., Liao, W., Choudhary, A., 2005. A fast high utility itemsets mining algorithm. In: Proceedings of the Utility-Based Data Mining Workshop, August.

Manku, G.S., Motwani, R., 2002. Approximate frequency counts over data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China.

Park, J.S., Chen, M.S., Yu, P.S., 1997. Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering 9 (5), 813–825.

Savasere, A., Omiecinski, E., Navathe, S., 1995. An efficient algorithm for mining association rules in large databases. In: Proceedings of the 21st International Conference on Very Large Data Bases, September 1995, pp. 432–444.

Teng, W.G., Chen, M.S., Yu, P.S., 2003. A regression-based temporal pattern mining scheme for data streams. In: Proceedings of the 29th International Conference on Very Large Data Bases, September 2003, pp. 93–104.


Teng, W.G., Chen, M.S., Yu, P.S., 2004. Resource-aware mining with variable granularities in data streams. In: Proceedings of the 4th SIAM International Conference on Data Mining, Florida, USA.

Yao, H., Hamilton, H.J., Butz, C.J., 2004. A foundational approach to mining itemset utilities from databases. In: Proceedings of the 4th SIAM International Conference on Data Mining, Florida, USA.

Zaki, M.J., Hsiao, C.J., 2005. Efficient algorithm for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering 17 (3), 462–478.

Fig. 2. Temporal high utility itemsets generated from data streams by THUI-Mine.
Fig. 3. Pre-processing procedure of THUI-Mine.