
Chapter 2 Mining Temporal Emerging Itemsets from Temporal Databases

2.2 Mining Temporal Emerging Itemsets

In this section, we give an example of mining temporal emerging itemsets from a data stream. The proposed algorithm, EFI-Mine, is also described in detail.

An example for mining emerging itemsets

Figure 2-3 shows an example of emerging itemsets, adapted from the one proposed by Dong and Li [14], for the special case of EFI. It partitions the space of itemsets and indicates all possible transitions of an itemset X from the original database DB to the new database DB+db.

Figure 2-3 plots the support count in DB (denoted $SC_{DB}$) against the support count in db (denoted $SC_{db}$). Each point in the graph is an ordered pair $(SC_{db}, SC_{DB})$, and the sum $SC_{db} + SC_{DB}$ is the itemset's support count in DB+db at some increment interval. If the increment adds nothing to an itemset's support count, then its support count in DB has to equal $minSC_{DB} + minSC_{db}$ in order to reach $minSC_{DB+db}$. This corresponds to point H in Figure 2-3. Alternatively, if an itemset's support count in db equals $|db|$, then its support count in DB has to be $n = minSC_{DB} + minSC_{db} - |db|$, with $n > 0$, for the itemset to be frequent. This is point C in Figure 2-3. Line HC therefore partitions the space of all itemsets in DB+db into frequent and infrequent ones. The shaded area in Figure 2-3, which includes line HC, represents all frequent itemsets.

Specific partitions under HC contain itemsets that are emerging in the current increment. For example, the area ΔHFG contains itemsets that were frequent in DB, are infrequent in db, and are now infrequent in DB+db. These itemsets have therefore submerged. ΔGIC contains itemsets that were infrequent in DB and are frequent in db. These itemsets have emerged. Consequently, all itemsets in area ABCG are emerging in the current interval, and all itemsets in area OAGH are submerging.

Figure 2-3. Emerging frequent itemsets.
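The partition of Figure 2-3 can be sketched in code. The following is a minimal Python sketch and not part of the original work: the function name `classify` and the region labels are hypothetical, and the region boundaries are assumed from the prose description of the figure (line HC is where the two support counts sum to the combined minimum; ABCG and OAGH are the parts below HC in which the itemset is, respectively, frequent or infrequent in db).

```python
def classify(sc_db, sc_DB, min_sc_db, min_sc_DB):
    """Label the point (sc_db, sc_DB) by the region of Figure 2-3 it falls in.

    Region boundaries are assumed from the prose description of the figure.
    """
    frequent_in_total = sc_db + sc_DB >= min_sc_db + min_sc_DB  # on or above line HC
    frequent_in_DB = sc_DB >= min_sc_DB
    frequent_in_db = sc_db >= min_sc_db

    if frequent_in_total:
        # Shaded area; the part that was infrequent in DB has emerged (triangle GIC).
        return "frequent" if frequent_in_DB else "emerged"
    if frequent_in_db:
        return "emerging"    # area ABCG: frequent in db, not yet frequent in DB+db
    if frequent_in_DB:
        return "submerged"   # triangle HFG: was frequent in DB, no longer frequent
    return "submerging"      # remainder of area OAGH
```

For instance, with minimum support counts of 200 in db and 2,000 in DB, the point (400, 1500) is labeled as emerging because it is frequent in db but its total of 1,900 is still below the combined minimum of 2,200.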

However, area ABCG contains too many emerging itemsets, so we should focus on the potentially emerging ones. To have the potential to emerge in the next increment, an itemset's support count in DB+db needs to be greater than or equal to $2\,minSC_{db} + minSC_{DB} - |db|$ in the current increment. All points with this value are represented by line RS in Figure 2-4.

For example, if we have a database with |DB| = 10,000, |db| = 1,000, and minsup = 0.2, then the minimum support count for the current increment is 2,200 (2,000 from DB plus 200 from db). If an itemset can gain the maximum incremental support count, a total of 1,000 from db, in the next increment, it needs a support count of at least 1,400 in the current increment to attain the minimum support count of 2,400 ((11,000 + 1,000) × 0.2 = 2,400) needed to become frequent.
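The arithmetic in this example can be checked with a short Python sketch. The function name `potential_threshold` is hypothetical, and `Fraction` is used only to keep the arithmetic exact; the formula is the line RS threshold stated above, twice the minimum support count of db plus that of DB, minus |db|.

```python
from fractions import Fraction

def potential_threshold(size_DB, size_db, minsup):
    """Support count needed now to be able to become frequent after the next increment."""
    min_sc_DB = minsup * size_DB   # minimum support count contributed by DB
    min_sc_db = minsup * size_db   # minimum support count contributed by one increment
    return 2 * min_sc_db + min_sc_DB - size_db

minsup = Fraction(1, 5)            # minsup = 0.2
size_DB, size_db = 10_000, 1_000

current_min = minsup * (size_DB + size_db)      # needed to be frequent now
next_min = minsup * (size_DB + 2 * size_db)     # needed after one more increment
needed_now = potential_threshold(size_DB, size_db, minsup)

print(current_min, next_min, needed_now)        # prints: 2200 2400 1400
```

The threshold of 1,400 is exactly the next minimum of 2,400 minus the 1,000 transactions that the next increment could contribute at most.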

By this formula, the band between line RS and line HC contains all itemsets that have the potential to become frequent in the next increment. Intersecting area ABCG with band HCSR gives area GDSC, whose itemsets are the most likely to emerge in the next increment.

Figure 2-4. Potentially emerging frequent itemsets.
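Membership in area GDSC can be expressed as two conditions. The sketch below is illustrative only: `is_potentially_emerging` is a hypothetical name, and it assumes that GDSC is the part of the band between lines RS and HC in which the itemset is also frequent in db, that is, the intersection with area ABCG.

```python
def is_potentially_emerging(sc_db, sc_DB, min_sc_db, min_sc_DB, size_db):
    """True if (sc_db, sc_DB) lies in area GDSC of Figure 2-4 (assumed boundaries)."""
    total = sc_db + sc_DB
    hc = min_sc_db + min_sc_DB                 # line HC: frequent in DB+db
    rs = 2 * min_sc_db + min_sc_DB - size_db   # line RS: can still reach HC next time
    in_band = rs <= total < hc                 # band HCSR, not yet frequent
    in_abcg = sc_db >= min_sc_db               # frequent in the current increment
    return in_band and in_abcg
```

With the numbers of the example above (minimum support counts of 200 and 2,000, |db| = 1,000), an itemset with counts (400, 1500) qualifies, while one with counts (100, 1900) does not, because it is infrequent in db even though its total lies inside the band.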

Algorithm of EFI-Mine

Given the window size introduced in Section 2.2.2 and the concepts of emerging itemsets in Section 2.3.1, we set the support value to S and take the original database to be $DB_{i,i+1,\ldots,j-1}$. According to the scheme described previously, to find the frequent itemsets of $DB_{i+1,i+2,\ldots,j+1}$ we focus on $DB_{i+1,i+2,\ldots,j}$: after adding database $db_j$, we find the potential emerging frequent itemsets of $DB_{i+1,i+2,\ldots,j+1}$ before the next incremental database $db_{j+1}$ is added. This means that $db_i$ is an old database that no longer needs to be considered. After $db_{j+1}$ is added, the new database is $DB_{i+1,i+2,\ldots,j+1}$, so the window size is N when the database ranges from $db_{i+1}$ to $db_{j+1}$, that is, $N = (j+1) - (i+1) + 1$. Following the nature of temporal data mining, we set $|db| = |db_i| = |db_{i+1}| = \cdots = |db_j|$. In Figure 2-4, the various lines bear the following meaning:

With a sliding window in temporal mining, an incremental database lengthens the original set of transactions and also raises the probability that infrequent itemsets become frequent. Because we focus on a window of size N-1 when finding potential emerging frequent itemsets, these formulas should be divided by N-1, based on the number of databases, as follows:

Because line FI does not include the new database, it should be divided by (N-1)-1, that is, by N-2, as follows:

Because $db_j$ is one of the N partitions in the window, the formula should be divided by N as follows:

Line AK = $minSC_{db_j} / N$

Figure 2-5 illustrates the potentially emerging frequent itemsets in area GDSC with window size limitation. The formula for each line is as mentioned above.

According to these formulas, we can simplify these lines as follows:

$$HC = \frac{S\,[(j-1)-(i+1)+1]\,|db| + S\,|db|}{N-1} = \frac{S\,(N-2)\,|db| + S\,|db|}{N-1} = S\,|db|$$

$$FI = \frac{S\,[(j-1)-(i+1)+1]\,|db|}{N-2} = S\,|db|$$

$$RS = \frac{2S\,|db| + S\,[(j-1)-(i+1)+1]\,|db| - |db|}{N-1} = \frac{2S\,|db| + S\,(N-2)\,|db| - |db|}{N-1} = \frac{(SN-1)\,|db|}{N-1}$$

$$EC = \frac{S\,|db| + S\,[(j-1)-(i+1)+1]\,|db| - |db|}{N-1} = \frac{S\,|db| + S\,(N-2)\,|db| - |db|}{N-1} = \frac{[S(N-1)-1]\,|db|}{N-1}$$

$$AK = \frac{S\,|db|}{N}$$
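As a sanity check on these simplifications, the following Python sketch evaluates each line in both its expanded and its simplified form. The sample values S = 0.2, N = 10 and |db| = 1000 are arbitrary choices for illustration, and the function name `line_thresholds` is hypothetical.

```python
from fractions import Fraction

def line_thresholds(S, N, size_db):
    """Return (expanded, simplified) per-partition thresholds for each line."""
    old = N - 2  # number of partitions in DB_{i+1..j-1}, i.e. (j-1)-(i+1)+1
    return {
        "HC": ((S * old * size_db + S * size_db) / (N - 1), S * size_db),
        "FI": ((S * old * size_db) / (N - 2), S * size_db),
        "RS": ((2 * S * size_db + S * old * size_db - size_db) / (N - 1),
               (S * N - 1) * size_db / (N - 1)),
        "EC": ((S * size_db + S * old * size_db - size_db) / (N - 1),
               (S * (N - 1) - 1) * size_db / (N - 1)),
        "AK": (S * size_db / N, S * size_db / N),
    }

lines = line_thresholds(Fraction(1, 5), 10, 1000)
for name, (expanded, simplified) in lines.items():
    assert expanded == simplified   # the two forms agree exactly
    print(name, float(simplified))
```

Exact rational arithmetic is used so that the comparison between the two forms is not affected by floating-point rounding.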

We could also take all of area HRSC as potentially emerging frequent itemsets, without regard to the support count in $db_j$. However, this would reduce the accuracy of the potentially emerging frequent itemsets found. Taking $db_j$ into consideration captures the trend of each itemset and yields better accuracy. Therefore, itemsets in GDSC are the most likely to emerge in the next increment.

Figure 2-5. Potentially emerging frequent itemsets for temporal patterns.

Figure 2-6 shows the EFI-Mine algorithm, and its processing procedure is outlined below. The basic procedure is similar to Apriori, except for the definition of the minimum support value used to find temporal emerging itemsets in a data stream. With window size N, we not only remove $db_i$ but also add the new database $db_j$; Steps 1 to 3 find the emerging 1-itemsets on the database $DB_{i+1,i+2,\ldots,j}$ and the large 1-itemsets on the database $db_j$. The purpose is to find the potential emerging frequent itemsets of the database $DB_{i+1,i+2,\ldots,j+1}$ before the next new database $db_{j+1}$ is added. In Steps 4 to 13, we generate k-candidates and find the emerging k-itemsets by calculating support counts as described previously. In Steps 14 to 23, we generate k-candidates and find the large k-itemsets by support count. Finally, the itemsets meeting the constraints $S\,|db| > \text{c.count} \ge \frac{(SN-1)\,|db|}{N-1}$ on $DB_{i+1,i+2,\ldots,j}$ and $\text{c.count} \ge \frac{S\,|db|}{N}$ on $db_j$ are obtained as the potentially emerging frequent itemsets.

Figure 2-6. Algorithm of EFI-Mine.
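The final selection step can be sketched as a filter over candidate counts. This is a minimal sketch of that one step only, not the algorithm of Figure 2-6: the names `select_potential`, `window_count` and `last_count` are hypothetical. It also assumes that the count compared on the windowed database is the average support count per partition (the text normalizes the line formulas by N-1 but does not spell out how c.count is normalized) and that the count on the newest partition is the raw count in that partition.

```python
from fractions import Fraction

def select_potential(candidates, S, N, size_db):
    """Keep itemsets that satisfy both constraints of the final selection step.

    candidates maps an itemset to (window_count, last_count), where
    window_count is assumed to be the per-partition average over the window
    and last_count the count in the newest partition db_j.
    """
    upper = S * size_db                       # line HC: frequent at or above this
    lower = (S * N - 1) * size_db / (N - 1)   # line RS
    last_min = S * size_db / N                # line AK
    return {
        itemset
        for itemset, (window_count, last_count) in candidates.items()
        if lower <= window_count < upper and last_count >= last_min
    }

candidates = {
    frozenset({"a"}): (150, 40),   # inside the band and active in db_j
    frozenset({"b"}): (250, 300),  # already frequent
    frozenset({"c"}): (150, 10),   # too few occurrences in db_j
    frozenset({"d"}): (90, 50),    # below line RS
}
selected = select_potential(candidates, Fraction(1, 5), 10, 1000)
```

With S = 0.2, N = 10 and |db| = 1000, the band runs from 1000/9 (about 111) up to but not including 200, and the minimum count in the newest partition is 20, so only itemset {a} is selected.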

We may use the formulas above to discuss the following situations. Note that whether an itemset is emerging depends on its support count. Given an itemset whose support counts in $DB_{i+1,i+2,\ldots,j-1}$ and $DB_{i+1,i+2,\ldots,j-1}+db_j$ are $SC_{DB_{i+1,i+2,\ldots,j-1}}$ and $SC_{DB_{i+1,i+2,\ldots,j-1}+db_j}$, respectively, the growth rate of that itemset is […]. The growth rate of an itemset that maintains minimal support is […]. An itemset meeting the […] is an emerging itemset. An itemset needs a support count of at least […] database $db_{j+1}$ with expanding one window size. A potential emerging frequent itemset is one that is emerging and meets the following constraint: […]. Hence, we can infer that an itemset that will potentially emerge with expanding n window sizes is an itemset that is currently emerging and […]. Of course, the larger n is, the less accurate the set of potential emerging frequent itemsets found might be.

To evaluate the performance of EFI-Mine, we conducted experiments on synthetic datasets generated with the randomized transaction generation algorithm in [3]. The synthetic data generation program takes the parameters shown in Table 2-2, and the parameter values used to generate the datasets are shown in Table 2-3. The simulation was implemented in C++ and run on a machine with a 1.4 GHz CPU and 512 MB of memory. The main performance metrics are execution time and accuracy. We recorded the execution time that EFI-Mine spends finding potential emerging frequent itemsets. Accuracy is the ratio of the number of actual emerging frequent itemsets to the total number of potential emerging frequent itemsets that we found. Hence, accuracy is defined as follows:

Accuracy = (number of actual emerging frequent itemsets) / (total number of potential emerging frequent itemsets found)
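The accuracy measure can be computed directly from two sets of itemsets. In this sketch the function name `accuracy` and the sample itemsets are hypothetical, and an actual emerging itemset is counted only if it was among those predicted; returning 0 when nothing was predicted is an assumption, since the text does not cover that case.

```python
def accuracy(predicted, actual):
    """Fraction of predicted potential emerging itemsets that actually emerged."""
    if not predicted:
        return 0.0  # assumption: define accuracy as 0 when nothing was predicted
    hits = len(predicted & actual)
    return hits / len(predicted)

predicted = {frozenset("ab"), frozenset("ac"), frozenset("bd"), frozenset("e")}
actual = {frozenset("ab"), frozenset("bd"), frozenset("cf")}
print(accuracy(predicted, actual))  # 2 of the 4 predictions emerged
```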
