
Chapter 4 Incremental Mining Temporal Patterns from Interval-based Database

4.5. Inc_CTMiner Algorithm

4.5.2 Proposed Algorithm: Inc_CTMiner

When a temporal database DB is updated to DB’, each temporal pattern that is frequent in DB’ falls into one of three cases:

Case 1: A pattern is frequent in DB’, and also frequent in DB.

Case 2: A pattern is frequent in DB’, and infrequent in DB but has a frequent pattern in DB as a prefix.

Case 3: A pattern is frequent in DB’, and infrequent in DB, and has no frequent pattern in DB as a prefix.

Case 1 is easy to handle, since the results of the previous mining run are already stored in FPTDB. We obtain the temporal patterns in Case 1 by updating the support of every pattern in FPTDB against DB’ and checking whether it is still frequent. For the example database DB and increment db in Table 4.2, the temporal pattern (A)(D): 2 is frequent in DB, where the notation “pattern : count” denotes a pattern and its associated support, and it remains frequent after the update.

Although we do not preserve any information about infrequent sequences in DB, every temporal pattern in Case 2 has at least one prefix subsequence that is frequent in DB, i.e., the frequent prefix is stored in FPTDB. Hence, we can use every temporal pattern in FPTDB as a prefix to recursively discover the temporal patterns in Case 2. In Case 3, the previous mining results in FPTDB hold no information about the temporal patterns, so we need to scan DB’ for all new frequent 1-slices, and then use each new frequent 1-slice as a prefix to construct a projected database and recursively mine all temporal patterns in Case 3. For example, in Table 4.2, (B)(F): 2 is frequent after the update and has no frequent pattern in DB as a prefix in FPTDB.
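The three cases can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a pattern is modelled as a tuple of slice labels, `frequent_in_db` stands in for the set of patterns stored in FPTDB, and the function name `classify_pattern` is hypothetical. The pattern (A)(F) in the usage lines is an invented Case 2 example and is not taken from Table 4.2.

```python
def classify_pattern(pattern, frequent_in_db):
    """Return 1, 2 or 3 for a pattern that is assumed to be frequent in DB'.

    pattern        -- tuple of slice labels, e.g. ("A", "D")
    frequent_in_db -- set of patterns (tuples) that were frequent in DB
    """
    if pattern in frequent_in_db:
        return 1  # Case 1: already frequent in DB
    # Check every proper, non-empty prefix of the pattern.
    for length in range(len(pattern) - 1, 0, -1):
        if pattern[:length] in frequent_in_db:
            return 2  # Case 2: some prefix was frequent in DB
    return 3  # Case 3: no frequent prefix in DB


# Hypothetical stand-in for the content of FPTDB.
frequent_in_db = {("A",), ("D",), ("E",), ("A", "D")}

case_ad = classify_pattern(("A", "D"), frequent_in_db)  # Case 1
case_af = classify_pattern(("A", "F"), frequent_in_db)  # Case 2 (invented)
case_bf = classify_pattern(("B", "F"), frequent_in_db)  # Case 3
```

Case 1 patterns only need a support update, Case 2 patterns are reached by growing nodes that already exist in FPTDB, and Case 3 patterns require new frequent 1-slices as starting points.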

Before introducing the Inc_CTMiner algorithm, we first give an intuitive approach, Naïve_Method, for incrementally mining temporal patterns. Naïve_Method is also used later as a baseline to assess the merit of Inc_CTMiner. Fig. 4.8 shows the pseudo code. It first determines the extended database, EDB, and uses incision_strategy to transform all event sequences in DB’ into the coincidence representation (Lines 1 and 2, algorithm 4.3). It then calls CPrefixSpan, the sub-procedure of CTMiner, on EDB and stores the mined results in a pattern tree, PTEDB (Line 3, algorithm 4.3). Note that, when mining EDB, the mined results [...] pattern is infrequent in EDB, it may still become frequent in the updated database DB’. For each temporal pattern in FPTDB, we update its support count if it also exists in PTEDB and check whether it is still frequent in DB’ (Lines 4-10, algorithm 4.3). Finally, we verify each remaining pattern in PTEDB against DB-EDB to adjust its support, and output it if it is frequent in DB’ (Lines 11-17, algorithm 4.3).

Algorithm 4.3: Naïve_Method ( DB’, min_sup, FPTDB )

Input: DB’: updated temporal database, min_sup: the minimum support, FPTDB: frequent pattern tree of original DB

Output: FPTDB’ : frequent pattern tree of updated database DB’

Variable: PTEDB : pattern tree of EDB

01: determine EDB ;
02: use incision_strategy to transform DB’ to coincidence representation ;
03: PTEDB ← CPrefixSpan ( EDB, ∅, 1/|EDB|, PTEDB ) ; // sub-procedure of CTMiner
04: for each node α in FPTDB do
05:     if α ∈ PTEDB
06:         update support(α) and delete node α in PTEDB ;
07:     if support(α) ≥ ( min_sup×|DB’| )
08:         insert node α to FPTDB’ ;
09:     else
10:         delete node α and all its descendant nodes in FPTDB ;
11: scan DB-EDB once for updating the support of node in PTEDB ;
12: for each node α in PTEDB do
13:     if support(α) ≥ ( min_sup×|DB’| )
14:         insert node α to FPTDB’ ;
15:     else
16:         delete node α and all its descendant nodes in PTEDB ;
17: Output FPTDB’ ;

Fig. 4.8: Pseudo code of Naïve_Method
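The support-merging part of Naïve_Method (Lines 4-17) can be sketched as follows. This is a simplified reading, not the thesis implementation: pattern trees are flattened into dictionaries, each FPTDB entry is assumed to carry its sequence list as a set of sids, and the scan of the unchanged sequences (Line 11) is abstracted into a caller-supplied function. The names `naive_merge`, `edb_sids` and `count_in_unchanged`, and all data in the usage lines, are hypothetical.

```python
def naive_merge(fpt_db, pt_edb, edb_sids, count_in_unchanged,
                db_new_size, min_sup):
    """Merge old results with the patterns mined from EDB.

    fpt_db             -- {pattern: set of sids that contain it in DB}
    pt_edb             -- {pattern: support count in EDB}, mined with threshold 1
    edb_sids           -- sids of the sequences that belong to EDB
    count_in_unchanged -- callable: support of a pattern in the sequences
                          of DB' that are not in EDB (assumed helper)
    db_new_size        -- |DB'|
    min_sup            -- relative minimum support
    """
    threshold = min_sup * db_new_size
    remaining = dict(pt_edb)  # copy, so the caller's dictionary is untouched
    result = {}

    # Lines 4-10: old frequent patterns. Support in the unchanged sequences
    # comes from the stored sequence list, support in EDB from PTEDB.
    for pattern, sids in fpt_db.items():
        support = len(sids - edb_sids) + remaining.pop(pattern, 0)
        if support >= threshold:
            result[pattern] = support

    # Lines 11-17: patterns found only in EDB still need their support
    # in the unchanged part of the database.
    for pattern, edb_support in remaining.items():
        support = edb_support + count_in_unchanged(pattern)
        if support >= threshold:
            result[pattern] = support
    return result


# Hypothetical toy data: |DB'| = 4, sequences 1 and 4 form EDB.
fpt_db = {("A",): {1, 2, 3}, ("A", "D"): {1, 2, 3}, ("E",): {2, 3}}
pt_edb = {("A",): 2, ("A", "D"): 1, ("B",): 2, ("F",): 1, ("G",): 1}
unchanged_counts = {("F",): 1}
merged = naive_merge(fpt_db, pt_edb, {1, 4},
                     lambda p: unchanged_counts.get(p, 0), 4, 0.5)
```

The sketch makes the cost of the approach visible: `pt_edb` holds every pattern that occurs at least once in EDB, and each of them that is not already in FPTDB triggers a support count over the unchanged sequences.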

To calculate the support of all patterns that are infrequent in DB but frequent in DB’, Naïve_Method keeps the information of the entire candidate set, i.e., it mines EDB with a minimum support count of 1, which corresponds to the threshold 1/|EDB| (Line 3, algorithm 4.3). This awkward approach consumes a large amount of memory and may involve many non-promising database projections. To remedy this problem, we design a more elegant algorithm, Inc_CTMiner, which applies two optimization techniques to avoid unnecessary searches.

Definition 4.7 (Search Space Reduction)

Given a temporal pattern α in DB (a node in FPTDB), when DB is updated to DB’, incre_sid is defined as the set of all sequence IDs in the increment database db, and incre_slice|α is defined as the set of all event slices in db|α. We have two kinds of search space reduction:

1) Sequence-reduction: If {α’s sequence list}∩incre_sid = ∅, then DB|α is identical to DB’|α. The supports of α and of all temporal patterns prefixed with α, i.e., node α and all child nodes of α in FPTDB, are unchanged in DB’. Hence there is no temporal pattern with α as prefix which is infrequent in DB but becomes frequent in DB’. We can stop searching α and all of α’s child nodes in FPTDB.

2) Slice-reduction: If α’s parent node in FPTDB does not insert any node as a child node when DB is updated to DB’, and {α and all of α’s sibling nodes}∩incre_slice|α = ∅, then the supports of α and of all temporal patterns prefixed with α, i.e., node α and all child nodes of α in FPTDB, are unchanged in DB’. Hence there is no temporal pattern with α as prefix which is infrequent in DB but becomes frequent in DB’. We can stop searching α and all child nodes of α in FPTDB.

Now we give an example to illustrate Definition 4.7. Given DB updated with db in Table 4.2 (min_sup = 2) and the corresponding FPTDB in Fig. 4.9, incre_sid = {1, 4}. Since the sequence lists of the nodes (A)(D)(E), (A)(E), (D)(E) and (E) are {2, 3}, and {2, 3}∩incre_sid = {2, 3}∩{1, 4} = ∅, we can stop searching these nodes when discovering FPTDB+db, as shown in Fig. 4.9.

Fig. 4.9: The search space reduction on FPTDB of example database DB in Table 4.2

The sequence_list of node (A)(D) is {1, 2, 3}. Hence, we cannot stop checking and growing the node (A)(D) by sequence-reduction, because {1, 2, 3}∩{1, 4} = {1} ≠ ∅. However, since the parent node of (A)(D), i.e., node (A), does not insert any new child node, and {(A)(D) and (A)(D)’s sibling nodes}∩incre_slice|(A)(D) = {D, E}∩{F, G} = ∅, we can still stop checking and growing node (A)(D) and all its child nodes by slice-reduction, as shown in Fig. 4.9.
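The two tests of Definition 4.7 reduce to set intersections, which the following Python sketch makes explicit. The function names and the flag `parent_gained_child` are assumptions made for illustration; the values in the usage lines are the ones from the example above.

```python
def sequence_reduction(sequence_list, incre_sid):
    """True if no sequence that contains the pattern was touched by db."""
    return not (set(sequence_list) & set(incre_sid))


def slice_reduction(parent_gained_child, node_and_siblings, incre_slice):
    """True if the parent got no new child and none of the slices of the
    node or its siblings occurs in the projected increment."""
    if parent_gained_child:
        return False
    return not (set(node_and_siblings) & set(incre_slice))


def can_stop(sequence_list, incre_sid, parent_gained_child,
             node_and_siblings, incre_slice):
    """Growing a node can stop if either reduction applies."""
    return (sequence_reduction(sequence_list, incre_sid)
            or slice_reduction(parent_gained_child,
                               node_and_siblings, incre_slice))


incre_sid = {1, 4}

# Nodes with sequence list {2, 3}: sequence-reduction applies.
stop_e = sequence_reduction({2, 3}, incre_sid)

# Node (A)(D): sequence-reduction fails, slice-reduction applies.
seq_ad = sequence_reduction({1, 2, 3}, incre_sid)
slice_ad = slice_reduction(False, {"D", "E"}, {"F", "G"})
stop_ad = can_stop({1, 2, 3}, incre_sid, False, {"D", "E"}, {"F", "G"})
```

Both checks need only information that is already attached to a node or derived from db, so neither requires a projection of DB’.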

The search space reduction in Definition 4.7 plays an important role in Inc_CTMiner. As the minimum support decreases and the maintained patterns become longer, many unnecessary searches can be avoided. As observed in our experiments, the search space reduction can skip more than 60% of the nodes in FPTDB, especially when the minimum support is extremely low. This is also the main reason why Inc_CTMiner not only outperforms the other algorithms in runtime, but also consumes less memory. The algorithmic overview and the pseudo code of Inc_CTMiner are shown in Fig. 4.10 and Fig. 4.11, respectively.

Fig. 4.10: An algorithmic overview of Inc_CTMiner

Algorithm 4.4: Inc_CTMiner ( DB’, min_sup, FPTDB )

Input: DB’ : updated temporal database, min_sup: the minimum support, FPTDB : frequent pattern tree of original DB

Output: FPTDB’ : frequent pattern tree of updated database DB’

01: determine EDB ; // initial phase
02: use incision_strategy with interval_extension to transform DB’ into coincidence representation ;
03: NFS ← scan db and check infrequent 1-slices in DB for new frequent 1-slices in DB’ ; // frequent 1-slice in DB’ ∉ FPTDB
04: for each slice b in NFS do // mining phase
05:     insert b into FPTDB’ ;
06:     call Inc_CT ( DB’|b, b, min_sup, FPTDB’ ) ;
07: scan DB’ once to update the support of each node in FPTDB ; // extending phase
08: for each node α in FPTDB whose support ≥ ( min_sup×|DB’| ) do
09:     insert α into FPTDB’ ;
10:     if search_pruning ( α, DB’|α ) = “false” // search space pruning
11:         call Inc_CT ( DB’|α, α, min_sup, FPTDB’ ) ;
12: Output FPTDB’ ;

Procedure Inc_CT ( DB’|α, α, min_sup, FPTDB’ )

13: scan DB’|α once to find every frequent slice c ; // support ≥ ( min_sup×|DB’| )
14: for each slice c do
15:     if c is a “finish slice” then
16:         if exist corresponding start slice in α then // pre-pruning
17:             append c to α to form α’ ;
18:     if c is a “start slice” or “intact slice” then
19:         append c to α to form α’ ;
20: for each α’ not existed in FPTDB do
21:     construct DB’|α’ with insignificant postfix elimination ; // post-pruning
22:     if | DB’|α’ | ≥ ( min_sup×|DB’| ) then
23:         insert α’ into FPTDB’ ;
24:         if search_pruning ( α’, DB’|α’ ) = “false” // search space pruning
25:             call Inc_CT ( DB’|α’, α’, min_sup, FPTDB’ ) ;

Fig. 4.11: Algorithm of Inc_CTMiner

Inc_CTMiner has three phases: the initial phase, the mining phase and the extending phase. The initial phase first uses the incision strategy and considers the interval extension to transform all [...] all new frequent 1-slices in DB’. Notice that, because the infrequent 1-slices in DB are stored, we can find the complete set of new frequent slices in DB’ without rescanning DB (Line 3, algorithm 4.4). Then, in the mining phase, we use each new frequent slice as a prefix to construct a projected database and call the sub-procedure Inc_CT to discover the temporal patterns (Lines 4-6, algorithm 4.4). Finally, in the extending phase, Inc_CTMiner updates the support of every pattern that is frequent in DB. If a pattern is still frequent in DB’, we use the search space reduction of Definition 4.7 (search_pruning in Fig. 4.11) to check whether growing can stop. If not, the sub-procedure Inc_CT is called to discover the temporal patterns (Lines 7-11, algorithm 4.4).

The sub-procedure Inc_CT calls itself recursively and works as follows. For a pattern α as prefix, we scan its projected database DB’|α once to find its locally frequent slices (Line 13, algorithm 4.4) and adopt the pre-pruning and post-pruning strategies to avoid non-promising projections (Lines 14-23, algorithm 4.4). We also use the search space reduction to check whether growing can stop. If not, Inc_CT is called recursively to discover the temporal patterns (Lines 24-25, algorithm 4.4).
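The pre-pruning step of Inc_CT (Lines 14-19) can be illustrated with a small Python sketch. The slice encoding is an assumption made only for this example: a start slice is written with a trailing "+", a finish slice with a trailing "-", and an intact slice without a suffix. The function name `extend_prefix` is hypothetical, and projection, post-pruning and the recursive call are left out.

```python
def extend_prefix(prefix, frequent_slices):
    """Return the candidate patterns obtained by appending one slice.

    prefix          -- tuple of slices, e.g. ("A+",)
    frequent_slices -- locally frequent slices found in the projected database
    """
    candidates = []
    for c in frequent_slices:
        if c.endswith("-"):
            # Pre-pruning: a finish slice is only appended if its
            # start slice already occurs in the prefix.
            if c[:-1] + "+" in prefix:
                candidates.append(prefix + (c,))
        else:
            # Start slices and intact slices are always appended.
            candidates.append(prefix + (c,))
    return candidates


candidates = extend_prefix(("A+",), ["A-", "B-", "B+", "C"])
```

In this example the finish slice "B-" is discarded without building a projected database, because no interval B has been opened in the prefix.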

4.6 Experimental Results and Performance Study

To evaluate the performance of Inc_CTMiner, we compare it with one temporal pattern mining algorithm, CTMiner [8], and one incremental temporal pattern maintenance approach, Naïve_Method. All algorithms were implemented in C++ and tested on a computer with a 3.0 GHz Pentium D processor and 2 GB of main memory. The performance study was conducted on both synthetic and real-world datasets. We perform three kinds of experiments to assess the efficiency of Inc_CTMiner. First, we compare the execution time and memory usage on synthetic datasets at extremely low minimum support. Second, we run Inc_CTMiner under different scenarios to show how the update environment influences performance. Third, we conduct an experiment to observe the scalability of the execution time of Inc_CTMiner. Finally, Inc_CTMiner is applied to a real-world dataset, library lending data, to show the performance and practicability of incremental maintenance for temporal patterns.

4.6.1 Data Generation

The synthetic datasets in the experiments are generated using a synthetic data generation program modified from [1]. Since the original program was designed to generate time point-based data, the generator had to be modified to support interval events and the incremental scenario. The parameter settings of the temporal data generator are shown in Table 4.5.

Table 4.5: Parameters of synthetic data generator

Parameter   Description
| D |       Number of event sequences
| C |       Average size of event sequences
| S |       Average size of potentially frequent sequences
NS          Number of potentially frequent sequences
N           Number of event symbols
Rinc        Ratio of the number of sequences in increment database db to updated database DB’
Rext        Ratio of the number of existed sequences extended to new sequences inserted in increment database db
Rapp        Ratio of the number of intervals of an existed sequence appearing in original database DB to increment database db

The updated database DB’ is generated first and then divided into the original database DB and the increment database db. We create a set of potentially frequent sequences that is used in the generation of event sequences. The number of potentially frequent sequences is NS. A potentially frequent sequence is generated by first picking the size of the sequence from a Poisson distribution with mean equal to | S |. Then, the event intervals in the potentially frequent sequence are chosen randomly from the N event symbols. The durations of event intervals are classified into three categories, long, medium and short, which are normally distributed with an average length of 12, 8 and 4, respectively. For each event interval, we first randomly decide its category and then determine its length by drawing a value from the corresponding distribution. The temporal relations between consecutive intervals are selected randomly to form a potentially frequent sequence. Since we adopt normalized temporal patterns [13], the temporal relations are chosen from the set {before, meets, overlaps, is-finished-by, contains, starts, equal}. After all potentially frequent sequences are determined, [...] sequence, which was picked from a Poisson distribution with mean equal to | C |. Then, each event sequence is generated by assigning a series of potentially frequent sequences.
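The generation of one potentially frequent sequence, as described above, might look like the following Python sketch. Several details are assumptions, because the text does not fix them: the standard deviation of the duration distributions (`std_dev`), the lower bounds of one interval per sequence and one time unit per interval, the encoding of symbols as integers, and all function names. The Poisson sampler uses Knuth's multiplication method so that the example needs only the standard library.

```python
import math
import random

RELATIONS = ["before", "meets", "overlaps", "is-finished-by",
             "contains", "starts", "equal"]
MEAN_DURATION = {"long": 12, "medium": 8, "short": 4}


def poisson(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's method)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def potentially_frequent_sequence(rng, mean_size, num_symbols, std_dev=1.0):
    """Return a list of (symbol, duration, relation_to_previous) triples."""
    size = max(1, poisson(rng, mean_size))  # assumption: never empty
    sequence = []
    for i in range(size):
        symbol = rng.randrange(num_symbols)
        category = rng.choice(list(MEAN_DURATION))
        # assumption: durations are rounded and at least one time unit
        duration = max(1, round(rng.gauss(MEAN_DURATION[category], std_dev)))
        relation = None if i == 0 else rng.choice(RELATIONS)
        sequence.append((symbol, duration, relation))
    return sequence


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sample = potentially_frequent_sequence(rng, mean_size=4, num_symbols=10)
```

The first interval carries no relation; every later interval stores its relation to the preceding one, which is one simple way to represent the relations between consecutive intervals.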

Finally, we partition the updated database DB’ into the original database DB and the increment database db, as in the example in Fig. 4.1. Three parameters are varied to reflect different updating scenarios. Parameter Rinc, called the increment ratio, decides the size of the increment database db. We pick | D | × Rinc sequences randomly into db and place the remaining | D | × (1–Rinc) sequences into DB. Furthermore, we use the extended ratio, Rext, to divide the event sequences in db into “old” sequences, whose sids have appeared in DB, and “new” inserted sequences. In total, | db | × Rext sequences are randomly chosen from db as “old” sequences, which are split further. The splitting of event sequences simulates that some intervals occurred earlier (and are thus in DB), while the remaining intervals are newly appended (and are thus in db).

The splitting is controlled by the third parameter Rapp, the appended ratio. If a sequence with a total of m intervals is to be split, we place the leading m × (1–Rapp) intervals in DB and the remaining m × Rapp intervals in db. For example, a DB’ with Rinc = 20%, Rext = 30% and Rapp = 40% means that 20% of the sequences in DB’ are in db; 30% of the sequences in db have sids occurring in DB; and, for each “old” sequence, (1–40%) = 60% of its intervals occurred before the database update. Note that the calculation is integer-based with the “ceiling” function.
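The arithmetic of the partition can be checked with a short Python sketch. The text does not state which quantity the ceiling is applied to, so the sketch assumes it is applied to the size of db, to the number of “old” sequences, and to the number of leading intervals kept in DB. Ratios are passed as integer percentages to avoid floating-point rounding, and the database size of 1000 sequences is a hypothetical value.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def partition_sizes(num_sequences, r_inc_pct, r_ext_pct):
    """Sizes of the parts of DB' for given Rinc and Rext (in percent)."""
    in_db_increment = ceil_div(num_sequences * r_inc_pct, 100)
    old = ceil_div(in_db_increment * r_ext_pct, 100)
    return {
        "only_in_DB": num_sequences - in_db_increment,
        "db": in_db_increment,
        "old_in_db": old,                    # sids that also occur in DB
        "new_in_db": in_db_increment - old,  # newly inserted sequences
    }


def split_intervals(intervals, r_app_pct):
    """Split an "old" sequence into its DB part and its db part."""
    keep = ceil_div(len(intervals) * (100 - r_app_pct), 100)
    return intervals[:keep], intervals[keep:]


# Rinc = 20%, Rext = 30%, Rapp = 40%, as in the example above.
sizes = partition_sizes(1000, 20, 30)
in_DB, in_db = split_intervals(["i1", "i2", "i3", "i4", "i5"], 40)
```

With these settings, 200 of the 1000 sequences go to db, 60 of them are “old” sequences, and a five-interval “old” sequence keeps its first three intervals in DB while the last two are appended through db.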

4.6.2 Execution Time and Memory Usage on Synthetic
