Related Work - Designing a High-Performance Maintenance Algorithm for Frequent Closed Itemsets

The following sections briefly review previous studies on closed itemset mining and on incremental mining.

2.1 Closed itemsets mining approaches

The major challenge in mining association rules is to reduce the search space and the computation time required for mining frequent itemsets. The Apriori algorithm, the best-known approach, uses level-wise candidate generation to reduce its search space: only the frequent itemsets found in the previous level serve as seeds for generating candidate itemsets in the current level. Many later algorithms [10][29][31][26][5] were based on this property and attempted to further reduce the number of candidate itemsets and the I/O cost. However, the Apriori property does not work well for dense databases, such as census data and DNA sequences, or for a low minimum support. In these settings most generated candidate itemsets are frequent, so the number of frequent itemsets grows exponentially and the performance of an Apriori-like algorithm degrades dramatically.
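The level-wise candidate generation described above can be illustrated with a minimal sketch. It is a simplified, in-memory illustration rather than an optimized implementation; the function name `apriori` and the absolute threshold `min_count` are assumptions introduced here.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items with one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: only frequent (k-1)-itemsets act as seeds for level k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # One pass over the data per level to count surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

if __name__ == "__main__":
    db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
    for itemset, count in sorted(apriori(db, 3).items(), key=lambda p: sorted(p[0])):
        print(sorted(itemset), count)
```

In this toy database every pair of items is frequent, so the candidate {A, B, C} must still be generated and counted before it can be rejected; dense data makes this situation common, which is the weakness noted above.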

Some researchers have therefore developed closed itemset mining algorithms to reduce the number of itemsets generated. Examples include A-close [34], CLOSET [35], CLOSET+ [36] and CHARM [36]. The A-close algorithm is an Apriori-like algorithm that uses a breadth-first search to find frequent closed itemsets directly.

However, breadth-first search may run into difficulties, since many candidates may be generated and the database must be scanned many times. The CLOSET algorithm [35], an extension of the FP-growth algorithm, uses a depth-first (recursive divide-and-conquer) search and a database-projection approach to mine long patterns from the FP-tree (frequent pattern tree), a structure that represents all transactions of the database. However, the CLOSET algorithm may perform poorly on a sparse database or with a low minimum support. An enhancement of CLOSET, the CLOSET+ algorithm, therefore combines various known search strategies and closure-testing strategies to improve performance. The CHARM algorithm uses a dual itemset-tidset search tree and the diffset technique to enumerate closed itemsets from a vertical-layout database. On many dense datasets, CHARM outperforms A-close, CLOSET and CLOSET+.
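To make the notions of a closed itemset and a vertical (tidset) layout concrete, the following sketch tests closedness directly from the definition: an itemset is closed when no proper superset has the same support. This is not the CHARM algorithm (it has no search tree and no diffsets); all function names are hypothetical and chosen only for illustration.

```python
def vertical_layout(transactions):
    """Map each item to its tidset: the ids of the transactions containing it."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def tidset_of(itemset, tidsets, n):
    """Tidset of an itemset = intersection of the tidsets of its items."""
    tids = set(range(n))
    for item in itemset:
        tids &= tidsets.get(item, set())
    return tids

def closure(itemset, tidsets, n):
    """All items that occur in every transaction containing the itemset."""
    tids = tidset_of(itemset, tidsets, n)
    return frozenset(i for i, ts in tidsets.items() if tids <= ts)

def is_closed(itemset, tidsets, n):
    """An itemset is closed exactly when it equals its own closure."""
    return closure(itemset, tidsets, n) == frozenset(itemset)

if __name__ == "__main__":
    db = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"}, {"C"}]
    layout = vertical_layout(db)
    for candidate in [{"A"}, {"A", "B"}, {"C"}, {"A", "C"}]:
        print(sorted(candidate), is_closed(candidate, layout, len(db)))
```

Here {A} is not closed because B appears in every transaction that contains A, so {A} and {A, B} have the same support and only {A, B} needs to be reported.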

2.2 Incremental mining approaches

In real-world applications, a database grows over time such that existing association rules may become invalid or new implicitly valid rules may appear. In these situations, conventional batch-mining algorithms do not utilize previously mined patterns for later maintenance, and may require considerable computation time to re-process the entire updated database to get all up-to-date association rules. Some researchers have developed incremental mining algorithms to maintain association rules without re-processing the entire database whenever the database is updated.

Examples include the FUP-based algorithms [13][14], an adaptive algorithm [30], an incremental mining algorithm based on the concept of pre-large itemsets [22], and an incremental updating technique based on the concept of the negative border [16][32]. The common idea of these studies is that previously mined information, such as the mined frequent itemsets, is stored in advance. When new transactions are inserted, a large portion of the candidate itemsets can be decided by using the pre-stored frequent itemsets; only a small portion of the candidate itemsets obtained from the new transactions needs to be re-processed against the original database. Much computation time can thus be saved. The correctness of this idea is illustrated as follows.

Given an original database and a set of newly inserted transactions, four cases of candidate itemsets may arise, as shown in Figure 2-1:

Case 1: A candidate itemset is frequent in both the original database and the newly inserted transactions.

Case 2: A candidate itemset is frequent in the original database but infrequent in the newly inserted transactions.

Case 3: A candidate itemset is infrequent in the original database but frequent in the newly inserted transactions.

Case 4: A candidate itemset is infrequent in both the original database and the newly inserted transactions.

                               Incremental batch (d)
                               Frequent      Infrequent
Original DB (D)   Frequent     Case 1        Case 2
                  Infrequent   Case 3        Case 4

Figure 2-1: Four cases of candidate itemsets when adding new transactions to existing databases.

Among these cases, candidate itemsets in Case 1 are frequent in both the original database and the new transactions, so they remain frequent in the updated database, where the support is the weighted average of the two supports; similarly, candidate itemsets in Case 4 remain infrequent after the new transactions are inserted. Cases 1 and 4 therefore do not affect the final association rules, whereas Case 2 may remove existing association rules and Case 3 may generate new ones.
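The weighted-average argument can be checked numerically with a small sketch. The function names, the sizes (100 original and 10 new transactions) and the 50% minimum support are illustrative assumptions, not values from the studies cited here.

```python
def classify_case(count_orig, n_orig, count_new, n_new, min_sup):
    """Return the case number (1-4) of a candidate itemset."""
    freq_orig = count_orig >= min_sup * n_orig
    freq_new = count_new >= min_sup * n_new
    if freq_orig and freq_new:
        return 1
    if freq_orig:
        return 2
    if freq_new:
        return 3
    return 4

def updated_support(count_orig, n_orig, count_new, n_new):
    """Support in the updated database: a size-weighted average of both supports."""
    return (count_orig + count_new) / (n_orig + n_new)

if __name__ == "__main__":
    n_orig, n_new, min_sup = 100, 10, 0.5
    # (count in original database, count in new transactions)
    for counts in [(60, 6), (50, 0), (49, 10), (40, 4)]:
        case = classify_case(counts[0], n_orig, counts[1], n_new, min_sup)
        sup = updated_support(counts[0], n_orig, counts[1], n_new)
        print(counts, "case", case, "still frequent:", sup >= min_sup)
```

Because the updated support always lies between the two individual supports, the outcome is fixed in Cases 1 and 4, while in Cases 2 and 3 it depends on the actual counts.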

Cheung and his co-workers proposed an incremental mining algorithm called FUP (Fast UPdate) [13][14], which copes with these four cases efficiently by pre-storing the frequent itemsets previously mined from the original database. It handles Cases 1, 2 and 4 by updating the pre-stored frequent itemsets against the newly inserted transactions, and re-processes against the original database only those Case 3 itemsets for which the stored information is insufficient.
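The way the four cases are handled can be sketched as follows. This is a deliberately simplified illustration of the idea, not the published FUP algorithm, which works level by level and applies further pruning; `fup_update`, `count_in` and the explicit `candidates` list are assumptions made for this example.

```python
def count_in(itemset, transactions):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def fup_update(old_frequent, original_db, new_db, candidates, min_sup):
    """Update frequent itemsets after new_db is appended to original_db.

    old_frequent maps each previously frequent itemset to its stored count
    in original_db. Returns (updated frequent itemsets, rescanned itemsets).
    """
    threshold = min_sup * (len(original_db) + len(new_db))
    updated, rescanned = {}, []
    for itemset in candidates:
        new_count = count_in(itemset, new_db)
        if itemset in old_frequent:
            # Cases 1 and 2: the original count is already stored.
            total = old_frequent[itemset] + new_count
        elif new_count >= min_sup * len(new_db):
            # Case 3: no stored count, so the original database is rescanned.
            rescanned.append(itemset)
            total = count_in(itemset, original_db) + new_count
        else:
            # Case 4: infrequent in both parts, so it cannot become frequent.
            continue
        if total >= threshold:
            updated[itemset] = total
    return updated, rescanned

if __name__ == "__main__":
    fs = frozenset
    original_db = [fs("AB"), fs("AB"), fs("AC"), fs("A")]
    new_db = [fs("C"), fs("BC")]
    old_frequent = {fs("A"): 4, fs("B"): 2, fs("AB"): 2}  # mined with min_sup = 0.5
    candidates = [fs("A"), fs("B"), fs("AB"), fs("C")]
    print(fup_update(old_frequent, original_db, new_db, candidates, 0.5))
```

In this example {A, B} is dropped without touching the original database (Case 2), whereas {C} can only be confirmed as frequent after a rescan (Case 3).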

The performance of the FUP algorithm degrades if many candidate itemsets from the newly inserted transactions belong to Case 3. For example, suppose {A}, {B} and {AB} are all the frequent itemsets previously mined from the original database, and {C}, {D} and {CD} are the three candidate itemsets from some newly inserted transactions. Since no counts were stored for {C}, {D} and {CD}, the final results cannot be determined without re-processing the original database.

To address this, Thomas et al. [32] and Feldman et al. [16] used the concept of the negative border [16] to enlarge the amount of pre-stored mining information in the FUP algorithm and thereby improve maintenance performance. The negative border of the frequent itemsets is formed by excluding the set of frequent itemsets from the set of candidate itemsets generated level by level. In other words, the negative border consists of the itemsets that are candidates but do not have sufficient support.
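The construction of the negative border during level-wise mining can be sketched as below. It is a minimal in-memory illustration under the definition given above; the function name and the absolute threshold `min_count` are hypothetical.

```python
from itertools import combinations

def frequent_and_negative_border(transactions, min_count):
    """Return (frequent, border): itemset -> count for each group.

    The border holds every generated candidate that failed the threshold.
    """
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    candidates = {frozenset([i]) for i in items}
    frequent, border = {}, {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Candidates minus frequent itemsets at this level join the border.
        border.update({c: n for c, n in counts.items() if n < min_count})
        k += 1
        prev = set(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    return frequent, border

if __name__ == "__main__":
    db = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"A"}]
    frequent, border = frequent_and_negative_border(db, 2)
    print("frequent:", {tuple(sorted(s)): n for s, n in frequent.items()})
    print("border:  ", {tuple(sorted(s)): n for s, n in border.items()})
```

Storing the border counts alongside the frequent itemsets means that an itemset such as {C} already has a known count in the original database when new transactions arrive.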

The processing time for Case 3 in the FUP algorithm can be reduced by additionally keeping the negative border of the frequent itemsets. Similarly, Hong et al. [22] proposed the concept of pre-large itemsets to enlarge the amount of pre-stored mining information and improve maintenance performance. Their algorithm does not need to rescan the original database until a certain number of new transactions have been inserted.
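Why a rescan can be postponed may be seen from a short calculation. Assume, for illustration, that itemsets are stored whenever their support reaches a lower threshold S_l, while S_u is the usual (upper) minimum support; these symbols are introduced here and do not appear in the text above. An itemset that was not stored has a count below S_l * n in an original database of n transactions. After t insertions its count is below S_l * n + t, so it cannot reach the upper threshold as long as S_l * n + t <= S_u * (n + t), that is, t <= (S_u - S_l) * n / (1 - S_u). The sketch below evaluates this bound with exact fractions to avoid floating-point rounding.

```python
import math
from fractions import Fraction

def safe_insertions(n_orig, lower_sup, upper_sup):
    """Largest t with t <= (S_u - S_l) * n / (1 - S_u).

    Thresholds may be given as strings such as "0.4" or as Fractions;
    they are converted to exact rationals before the bound is computed.
    """
    s_l = Fraction(str(lower_sup))
    s_u = Fraction(str(upper_sup))
    if not (0 <= s_l <= s_u < 1):
        raise ValueError("expected 0 <= lower_sup <= upper_sup < 1")
    return math.floor((s_u - s_l) * n_orig / (1 - s_u))

if __name__ == "__main__":
    # Hypothetical setting: 100 original transactions, S_l = 40%, S_u = 50%.
    print(safe_insertions(100, "0.4", "0.5"))
```

With these assumed values the bound is 20: an unstored itemset has a count of at most 39, and even if all 20 new transactions contain it, its support of 59/120 stays below 50%.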
