Organization of Thesis - A Study of Efficient Frequent Pattern Mining Algorithms in Temporal Databases

Chapter 1 Introduction

1.3 Organization of Thesis

The remainder of this thesis is organized as follows. In Chapter 2, we describe the EFI-Mine algorithm for mining temporal emerging frequent itemsets from temporal databases efficiently and effectively. The THUI-Mine algorithm for efficiently mining temporal high utility itemsets from temporal databases is introduced in Chapter 3. In Chapter 4, we describe two novel algorithms, namely TP-RUI-Mine (Two-Phase Rare Utility Itemsets) and TRUI-Mine (Temporal Rare Utility Itemsets), for mining temporal rare utility itemsets from temporal databases. In Chapter 5, we present a novel method, namely HUINIV-Mine (High Utility Itemsets with Negative Item Values), for efficiently and effectively mining high utility itemsets from large databases with consideration of negative item values. Finally, the conclusions are given in Chapter 6.

Chapter 2 Mining Temporal Emerging Itemsets from Temporal Databases

2.1 Problem Definition

The mining of association rules for finding the relationships between data items in large databases is a well-studied technique in the data mining field, with representative methods like Apriori [1][2][7]. The problem of mining association rules can be decomposed into two steps.

The first step involves finding all frequent itemsets (also called large itemsets) in the database. Once the frequent itemsets are found, generating association rules is straightforward and can be accomplished in linear time.

An important research issue extended from association rule mining is the discovery of temporal association patterns in temporal databases, owing to its wide range of applications in various domains. Temporal data mining can be defined as the activity of looking for interesting correlations or patterns in large sets of temporal data accumulated for other purposes [6]. For a database with a specified transaction window size, we may use an algorithm such as Apriori to obtain frequent itemsets from the database. For time-variant temporal databases, there is a strong demand for an efficient and effective method to mine various temporal patterns [4]. However, most methods designed for traditional databases cannot be directly applied to mining temporal patterns in temporal databases because of their high complexity.

Without loss of generality, consider a typical market-basket application as illustrated in [30]. The transaction flow in such an application is shown in Figure 2-1, where items a to g stand for items purchased by customers.

Figure 2-1. An example of online transaction flows.

In Figure 2-1, for example, the third customer bought item c during time t=[0, 1), items c, e and g during t=[2, 3), and item g during t=[4, 5). It can be seen that in such a data stream environment it is intrinsically difficult to identify frequent patterns because of the limited time and space. Furthermore, finding frequent itemsets separately in each time window wastes too much time. Therefore, we develop a new scheme to find potential emerging frequent itemsets before the next time window.

Dong and Li [14] define an emerging pattern as an itemset whose support increases significantly between two databases. We view emerging frequent itemsets as a special case of the emerging patterns described by Dong and Li. An Emerging Frequent Itemset (EFI) can be considered as an itemset that is infrequent (i.e., small) in the current database and whose support increases so that it will eventually become frequent (i.e., large) in the new database formed by adding new transactions over time. For example, in the market-basket domain, we may take an interval to be the time between wholesale purchases.

Recognizing the set of items that will emerge or become frequent in the next time period of a given window size may allow the storekeeper to order these emerging items much earlier than usual. Thus, the storekeeper will know which kinds of items will be popular in the next time period and avoid losing the income that their sales could have generated. Although some related issues like mining emerging frequent itemsets [28] and incremental frequent itemsets [9][10][11][25] have been studied, these studies focus on traditional databases and are not suited to temporal databases.

In this chapter, we explore the issue of efficiently mining emerging frequent itemsets in temporal databases like data streams [15][16][17][19]. We propose an algorithm named EFI-Mine that can discover emerging frequent itemsets from temporal databases efficiently and effectively. The EFI-Mine algorithm is based on the concept of the Apriori algorithm [2] for mining frequent itemsets. The novel contribution of EFI-Mine is that it can effectively identify the potential emerging frequent itemsets in temporal databases, so that the execution time for mining frequent itemsets can be substantially reduced. That is, EFI-Mine can discover the itemsets that are infrequent in the current time window but will become frequent with high probability in subsequent time windows. In this way, the process of discovering all frequent itemsets under all time windows of temporal databases can be carried out efficiently with limited memory space. This meets the critical requirements of time and space efficiency for mining temporal databases. Through experimental evaluation, EFI-Mine is shown to deliver high precision in finding the emerging frequent itemsets, and it also achieves high scalability in terms of execution time.

Support Framework for Mining Temporal Patterns

In this chapter, the mining of temporal patterns is explored for illustrative purposes, since not only should the patterns be extracted efficiently and effectively, but the variations of their occurrence frequencies should also be tracked. In market-basket analysis, patterns along with their frequencies are extracted from a sliding window over the transactions, so data expires after a user-specified time window. As time advances, new data is included while obsolete data is discarded. In the task of discovering frequent temporal patterns, only patterns with occurrence frequencies no less than a specified threshold are tracked.

We focus in this chapter on handling the different sliding windows to find emerging frequent itemsets.

An example showing the basic process in transforming transactions into numerical time series, for discovering frequent temporal patterns, is provided as follows.

Example 1: Consider the transaction flows shown in Figure 2-1. Given the window size w=3 and the minimum support value as 40%, occurrence frequencies of the inter-transaction itemset {c, g} from time t=1 to t=5 can be obtained as shown in Table 2-1.

Table 2-1. The support values of the inter-transaction itemset {c, g}.

TxTime    Occurrence(s) of {c,g}    Support
t=1

With the sliding window model, the frequent temporal patterns can be discovered for different time windows. The main goal of our research is to discover interesting emerging itemsets under progressive time windows.

Emerging Frequent Itemsets and Interesting Emerging Itemsets

In a database, the frequent itemsets change when new data are added. As time progresses, we can see many interesting patterns in the changing status of individual itemsets. An itemset that was infrequent may become frequent (large), a frequent itemset may become infrequent (small), and an itemset may remain frequent or infrequent. We define infrequent itemsets that are moving toward being frequent as emerging.

Conversely, frequent itemsets moving toward being infrequent are submerging. An infrequent (frequent) itemset that becomes frequent (infrequent), i.e., whose support rises above (falls below) the minimum support value, is said to have emerged (submerged). The problems we address in this chapter are: 1) How can we identify itemsets that are emerging (submerging)? 2) Which of these itemsets have the potential to emerge (submerge) within the next time window? In this chapter, we focus on finding emerging frequent itemsets.

Building on the notion of emerging itemsets in the incremental scheme, we extend this concept to temporal data mining. Temporal data mining limits the window size available for finding emerging itemsets, so the formulas for finding emerging itemsets must be changed. In the remainder of this section, we give the definitions used in these formulas.

Definition 2.1 dbk is the set of transactions in t=k; e.g., db1 is the set of transactions in t=1.

Definition 2.2 DBi,i+1,…,j is the set of transactions in t=i to j; e.g., DB12345 is the set of transactions in t=1 to 5. We also view DB12345 as the accumulation db1+db2+db3+db4+db5.

Suppose the original database is DBi,i+1,…,j with window size N, where N=j-i+1. Because of the window size limit, we must discard the oldest database dbi when adding a database dbj+1, so the new database is DBi+1,i+2,…,j+1. In our scheme, emerging itemsets are found before a new database is added, so we focus on the database DBi+1,i+2,…,j; the old database dbi is useless for finding emerging itemsets. For example, suppose the original database is DB1234 and the window size limit is 5. If a database db5 is added, the new database is DB12345. Because of the window size limit, when adding a database db6 we must discard the old database db1, so the new database becomes DB23456. Since we want to find potential emerging frequent itemsets before a database is added, we focus on the database DB2345, whose potential emerging frequent itemsets are reflected more accurately in the new database DB23456. In practice, given the nature of a data stream, we first remove db1 from DB1234 and then add db5 to form the database DB2345. We can then find potential emerging frequent itemsets from DB2345 before adding the new database db6 to form DB23456, which conforms to the window size limit. Figure 2-2 shows that potential emerging frequent itemsets are found from the database DB2345. The window size for finding potential emerging itemsets is therefore N-1.

Figure 2-2. Potentially emerging frequent itemsets in DB2345.
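The window bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis implementation: the function names `full_window` and `emerging_window` and the string labels used for the partitions are assumptions introduced here.

```python
def full_window(arrived, n):
    """Return the newest n partitions: the window frequent itemsets are mined from."""
    if n < 1:
        raise ValueError("window size must be at least 1")
    return list(arrived)[-n:]


def emerging_window(arrived, n):
    """Return the newest n-1 partitions, which are scanned for potential
    emerging itemsets before the next partition arrives."""
    if n < 2:
        raise ValueError("window size must be at least 2")
    return list(arrived)[-(n - 1):]


# Partitions db1..db5 have arrived and the window size limit is N = 5.
arrived = ["db1", "db2", "db3", "db4", "db5"]
print(emerging_window(arrived, 5))        # ['db2', 'db3', 'db4', 'db5'], i.e. DB2345
print(full_window(arrived + ["db6"], 5))  # ['db2', 'db3', 'db4', 'db5', 'db6'], i.e. DB23456
```

The oldest partition db1 never appears in either result, which matches the observation that it is useless for finding emerging itemsets.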

The rest of this chapter is organized as follows. Section 2.2 describes the proposed approach, EFI-Mine, for finding the emerging frequent itemsets. In Section 2.3, we describe the experimental results for evaluating the proposed method. The conclusion of the chapter is provided in Section 2.4.

2.2 Mining Temporal Emerging Itemsets

In this section, we give an example of mining temporal emerging itemsets from a data stream. The proposed algorithm, EFI-Mine, is also described in detail in this section.

An example for mining emerging itemsets

Figure 2-3 shows an example of emerging itemsets, adapted from the one proposed by Dong and Li [14], for the special case of EFI. It shows partitions of the space of itemsets, indicating all possible transitions for an itemset X from the original database DB to the new database DB+db.

Figure 2-3 plots the support count in DB (denoted as SCDB) against the support count in db (denoted as SCdb). Each point in the graph depicts an ordered pair (SCdb, SCDB), where the sum of SCdb and SCDB is an itemset's support count in DB+db at some increment interval. If the increment adds nothing to an itemset's support count, then its support count in DB has to be equal to minSCDB+minSCdb in order to reach minSCDB+db. This corresponds to point H in Figure 2-3. Alternatively, if an itemset's support count in db is equal to |db|, then its support count in DB has to be some n, where n>0 and n=minSCDB+minSCdb-|db|, for the itemset to be frequent. This is point C in Figure 2-3. Line HC partitions the space of all itemsets in DB+db into frequent and infrequent. The shaded area in Figure 2-3 represents all the frequent itemsets, and it includes line HC. Specific partitions under HC contain itemsets that are emerging in the current increment. For example, the area defined by ΔHFG represents those itemsets that were frequent in DB, infrequent in db, and are now infrequent in DB+db. These itemsets have therefore submerged. ΔGIC represents itemsets that were infrequent in DB and frequent in db. These itemsets have emerged. Therefore, all itemsets in area ABCG are emerging in the current interval, and all itemsets in area OAGH are submerging.

Figure 2-3. Emerging frequent itemsets.

However, there are too many emerging itemsets in area ABCG. In fact, we should focus on the itemsets with more potential to emerge. To have the potential to emerge in the next increment, the support count of an itemset in DB+db needs to be greater than or equal to 2minSCdb+minSCDB-|db| in the current increment. All points with this value are represented by line RS in Figure 2-4.

For example, if we have a database with |DB|=10,000, |db|=1,000 and minsup=0.2, then the minimum support count for the current increment is 2,200 (2,000 from DB plus 200 from db). If an itemset can gain the maximum incremental support count, a total of 1,000 from db, in the next increment, it needs a support count of at least 1,400 in the current increment to be able to attain the minimum support count of 2,400 ((11,000+1,000)*0.2=2,400) needed to become frequent.
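The arithmetic in this example can be checked with a short Python sketch. The function name `potential_emerging_threshold` and its parameter names are assumptions made for illustration; the formula itself is the bound 2minSCdb+minSCDB-|db| stated for line RS. Exact fractions are used so that a support of 0.2 does not introduce floating-point noise.

```python
from fractions import Fraction


def potential_emerging_threshold(size_DB, size_db, minsup):
    """Smallest current support count in DB+db that still lets an itemset
    become frequent after one more increment of size_db transactions."""
    min_sc_DB = minsup * size_DB  # minimum support count in DB
    min_sc_db = minsup * size_db  # minimum support count in one increment
    return 2 * min_sc_db + min_sc_DB - size_db


minsup = Fraction(1, 5)  # 0.2, kept exact
current_min = minsup * (10000 + 1000)         # minimum count for DB+db
next_min = minsup * (10000 + 1000 + 1000)     # minimum count after the next increment
threshold = potential_emerging_threshold(10000, 1000, minsup)

print(current_min, next_min, threshold)  # 2200 2400 1400

# An itemset at the threshold that gains the maximum 1,000 counts
# in the next increment lands exactly on the new minimum.
assert threshold + 1000 == next_min
```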

By this formula, the itemsets in the band between line RS and line HC all have the potential to become frequent in the next increment. Intersecting areas ABCG and HCSR, we find that the itemsets in GDSC are the most likely to emerge in the next increment.

Figure 2-4. Potentially emerging frequent itemsets.

Algorithm of EFI-Mine

With the window size discussed in Section 2.2.2 and the concepts of emerging itemsets in Section 2.3.1, we set the support value as S and assume the original database is DBi,i+1,…,j-1. According to the scheme described previously, if we want to find frequent itemsets from DBi+1,i+2,…,j+1, we should focus on DBi+1,i+2,…,j after adding database dbj, and find the potential emerging frequent itemsets of the database DBi+1,i+2,…,j+1 before adding the next incremental database dbj+1. This means that dbi is an old database that need not be considered. After adding the new database dbj+1, the new database is DBi+1,i+2,…,j+1, so the window size is N when the database spans dbi+1 to dbj+1; that is, N=(j+1)-(i+1)+1. By the nature of temporal data mining, we set |db|=|dbi|=|dbi+1|=…=|dbj|. In Figure 2-4, the various lines have the following meanings:

Given the nature of the window size in temporal mining, an incremental database both lengthens the original set of transactions and raises the probability that infrequent itemsets become frequent. Because we focus on a window of size N-1 for finding potential emerging frequent itemsets, these formulas should be divided by N-1, based on the number of databases, as follows:

Because line FI does not add new database, it should be divided by (N-1)-1. It means line FI should be divided by N-2 as follows:

Because dbj is one of the N databases in the window, the formula should be divided by N as follows:

$$\text{Line } AK = \mathit{minSC}_{db_j} / N$$

Figure 2-5 illustrates the potentially emerging frequent itemsets in area GDSC with window size limitation. The formula for each line is as mentioned above.

According to these formulas, we can simplify these lines as follows:

$$HC = \frac{S\,[(j-1)-(i+1)+1]\,|db| + S\,|db|}{N-1} = \frac{S\,(N-2)\,|db| + S\,|db|}{N-1} = S\,|db|$$

$$FI = \frac{S\,[(j-1)-(i+1)+1]\,|db|}{N-2} = S\,|db|$$

$$RS = \frac{2S\,|db| + S\,[(j-1)-(i+1)+1]\,|db| - |db|}{N-1} = \frac{2S\,|db| + S\,(N-2)\,|db| - |db|}{N-1} = \frac{(SN-1)\,|db|}{N-1}$$

$$EC = \frac{S\,|db| + S\,[(j-1)-(i+1)+1]\,|db| - |db|}{N-1} = \frac{S\,|db| + S\,(N-2)\,|db| - |db|}{N-1} = \frac{[S\,(N-1)-1]\,|db|}{N-1}$$

$$AK = \frac{S\,|db|}{N}$$
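The simplifications above can be checked numerically. The following Python sketch is illustrative only: the function names `line_thresholds` and `unsimplified_thresholds` are assumptions, and N-2 is substituted for (j-1)-(i+1)+1, which follows from N=(j+1)-(i+1)+1. Exact fractions are used so that both forms can be compared for equality.

```python
from fractions import Fraction


def line_thresholds(S, N, db_size):
    """Simplified thresholds for lines HC, FI, RS, EC and AK.

    S is the support value (a Fraction or a string such as "2/5"),
    N is the window size, and db_size is |db|.
    """
    if N < 3:
        raise ValueError("N must be at least 3 so that N-2 is positive")
    S = Fraction(S)
    return {
        "HC": S * db_size,
        "FI": S * db_size,
        "RS": (S * N - 1) * db_size / (N - 1),
        "EC": (S * (N - 1) - 1) * db_size / (N - 1),
        "AK": S * db_size / N,
    }


def unsimplified_thresholds(S, N, db_size):
    """The same thresholds written as in the first step of each derivation."""
    S = Fraction(S)
    old = S * (N - 2) * db_size  # minimum support count over the N-2 older partitions
    return {
        "HC": (old + S * db_size) / (N - 1),
        "FI": old / (N - 2),
        "RS": (2 * S * db_size + old - db_size) / (N - 1),
        "EC": (S * db_size + old - db_size) / (N - 1),
        "AK": S * db_size / N,
    }


# Example values chosen for illustration: S = 40%, N = 10, |db| = 1000.
simplified = line_thresholds("2/5", 10, 1000)
assert simplified == unsimplified_thresholds("2/5", 10, 1000)
for name, value in simplified.items():
    print(name, float(value))
```

With these example values, HC and FI both equal 400 and AK equals 40, while RS (1000/3) and EC (2600/9) fall below HC, as the figure suggests.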

We could also find potentially emerging frequent itemsets in area HRSC without considering the support count in dbj. However, doing so reduces the accuracy of the potentially emerging frequent itemsets found. Taking dbj into account captures the trend of the itemsets and yields better accuracy. Therefore, itemsets in GDSC are the most likely to emerge in the next increment.

Figure 2-5. Potentially emerging frequent itemsets for temporal patterns.

Figure 2-6 shows the EFI-Mine algorithm, and the processing procedure is outlined below. The basic procedure is like Apriori except for the definition of the minimum support value used to find temporal emerging itemsets from a data stream. With window size N, we not only remove dbi but also add the new database dbj, finding 1-emerging itemsets on the database DBi+1,i+2,…,j and large 1-itemsets on the database dbj in Steps 1 to 3. The purpose is to find the potential emerging frequent itemsets of the database DBi+1,i+2,…,j+1 before the next new database dbj+1 is added. In Steps 4 to 13, we generate k-candidates and find k-emerging itemsets by calculating the support counts described previously. Then, in Steps 14 to 23, we generate k-candidates and find large k-itemsets using the corresponding support count. Finally, those itemsets meeting the constraints S*|db| > c.count ≥ [(S*N)-1]*|db|/(N-1) on DBi+1,i+2,…,j and c.count ≥ S*|db|/N on dbj are obtained as the potentially emerging frequent itemsets.

Figure 2-6. Algorithm of EFI-Mine.
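The final selection constraints of the procedure can be sketched as a single predicate. This is not the algorithm in Figure 2-6, only an illustration of its last filtering step. The function name `is_potentially_emerging` and its parameter names are assumptions, and the sketch further assumes that the count on DBi+1,i+2,…,j is expressed on the same per-partition scale as the thresholds, which the text does not spell out.

```python
from fractions import Fraction


def is_potentially_emerging(window_count, newest_count, S, N, db_size):
    """Check the two constraints on a candidate itemset.

    window_count: c.count on DB_{i+1..j} (assumed to be on the threshold scale)
    newest_count: c.count on the newest partition db_j
    S: support value, N: window size, db_size: |db|
    """
    if N < 2:
        raise ValueError("N must be at least 2")
    S = Fraction(S)
    upper = S * db_size                      # line HC: at or above this, already frequent
    lower = (S * N - 1) * db_size / (N - 1)  # line RS
    newest_min = S * db_size / N             # line AK
    return lower <= window_count < upper and newest_count >= newest_min


# Illustrative values: S = 40%, N = 10, |db| = 1000,
# so the bounds are 1000/3 (RS), 400 (HC) and 40 (AK).
print(is_potentially_emerging(350, 50, "2/5", 10, 1000))  # True
print(is_potentially_emerging(400, 50, "2/5", 10, 1000))  # False: already frequent
print(is_potentially_emerging(300, 50, "2/5", 10, 1000))  # False: below line RS
print(is_potentially_emerging(350, 30, "2/5", 10, 1000))  # False: too rare in db_j
```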

We may use the formulas above to discuss the following situations. Note that whether an itemset is emerging depends on its support count. Given an itemset whose support counts in DBi+1,i+2,…,j-1 and DBi+1,i+2,…,j-1+dbj are $SC_{DB_{i+1,i+2,\ldots,j-1}}$ and $SC_{DB_{i+1,i+2,\ldots,j-1}+db_j}$, respectively, the growth rate of that itemset is […]. The growth rate of an itemset that maintains minimal support is […]. An itemset meeting the […] is an emerging itemset. An itemset needs a support count of at least […] database dbj+1 with expanding one window size. A potential emerging frequent itemset is one that is emerging and meets the following constraint: […]. Hence, we can infer that an itemset that will potentially emerge with expanding n window sizes is an itemset that is currently emerging and […]. Of course, the larger n is, the less accurate the discovery of potential emerging frequent itemsets may be.

To evaluate the performance of EFI-Mine, we conducted experiments using a synthetic dataset generated by the randomized transaction generation algorithm in [3]. The synthetic data generation program takes the parameters shown in Table 2-2, and the parameter values used to generate the datasets are shown in Table 2-3. The simulation is implemented in C++ and conducted on a machine with a 1.4GHz CPU and 512MB of memory. The main performance metrics used are execution time and accuracy. We recorded the execution time that EFI-Mine spends in finding potential emerging frequent itemsets. The accuracy measures the ratio of the number of actual emerging frequent itemsets to the total number of potential emerging frequent itemsets that we found. Hence, the accuracy is defined as follows:

Accuracy = (number of actual emerging frequent itemsets) / (total number of potential emerging frequent itemsets found)
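The accuracy measure can be written as a small Python function. This is a sketch under stated assumptions: the name `accuracy`, the representation of itemsets as frozensets, the choice to return 0.0 when nothing was predicted, and the example itemsets are all introduced here for illustration and are not experimental data.

```python
def accuracy(predicted, actual):
    """Fraction of predicted potential emerging itemsets that actually emerged.

    predicted: itemsets reported as potentially emerging
    actual: itemsets that really became frequent in the next window
    """
    predicted = set(predicted)
    actual = set(actual)
    if not predicted:
        return 0.0  # assumption: define accuracy as 0 when nothing was predicted
    return len(predicted & actual) / len(predicted)


# Hypothetical itemsets over the items a to g.
predicted = {frozenset("ab"), frozenset("c"), frozenset("de"), frozenset("g")}
actual = {frozenset("ab"), frozenset("c"), frozenset("g"), frozenset("f")}
print(accuracy(predicted, actual))  # 0.75
```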

Table 2-2. Parameters of the synthetic datasets.

N    Number of items
T    Average number of items per transaction
C    Number of customers
D    Number of transactions
W    Window size
S    Support value

Table 2-3. Parameter settings of synthetic datasets.

Dataset        N      T    C       D          W
N100T5C1000    100    5    1000    100,000    10

Effects of Varying Support Threshold

The proposed approach is verified experimentally under various measurements. We vary the support threshold from 30% to 70% to investigate its effect on the accuracy. The other parameters are kept fixed at their default values. Figure 2-7 shows the accuracy of EFI-Mine under different support threshold values. The average accuracy of the potential emerging frequent itemsets rises as the support value increases. In particular, the accuracy reaches 100% when the support value is beyond 60%. Hence, EFI-Mine is shown to be very effective in finding the emerging itemsets.

[Plot: Accuracy (%) of EFI-Mine versus support (30% to 70%).]

Figure 2-7. Accuracy under different support values (N100T5C1000, W=10).

Comparisons with Apriori in Execution time

The proposed algorithm is also compared with the well-known Apriori algorithm. We compare the average execution time of Apriori and EFI-Mine under different support values. Both algorithms can find frequent itemsets. However, Apriori can only find frequent itemsets, while EFI-Mine can find frequent itemsets that were infrequent in the past. The Apriori algorithm processes DBi+1,i+2,…,j+1 to find frequent itemsets, while our EFI-Mine algorithm needs to process only the smaller database DBi+1,i+2,…,j to find potentially emerging frequent itemsets.
