Introduction - Designing a High-Performance Algorithm for Maintaining Frequent Closed Itemsets

Data mining technology has become increasingly important in the field of large databases and data warehouses. It helps discover non-trivial, implicit, previously unknown and potentially useful knowledge, and can thus help managers make good decisions. Among the various types of databases and mined knowledge, mining association rules from transaction databases is one of the most interesting and popular problems. In general, the process of mining association rules can be decomposed into two tasks: (1) finding the frequent itemsets that satisfy a user-specified minimum support threshold in a given database, and (2) generating the interesting association rules that satisfy a user-specified minimum confidence threshold from the frequent itemsets found. Since the first task is far more time-consuming than the second, the major challenge in mining association rules is how to reduce the search space and the computation time of the first task. Several well-known mining approaches have been proposed, such as Apriori [4], DIC [10], DHP [29], Partition [31], Sampling [26], GSP [5] and FP-Growth [20][33].
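The two-task decomposition described above can be sketched in a few lines. This is a deliberately naive, brute-force illustration and not any of the cited algorithms; the function names `frequent_itemsets` and `association_rules`, the toy transactions and the threshold values are all assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Task 1 (brute force): map each frequent itemset to its support count."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            cand = frozenset(candidate)
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_support * n:
                frequent[cand] = count
    return frequent

def association_rules(frequent, min_confidence):
    """Task 2: emit rules lhs -> rhs with confidence count(lhs ∪ rhs) / count(lhs)."""
    rules = []
    for itemset, count in frequent.items():
        for k in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), k):
                lhs = frozenset(lhs_items)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds.
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules

# Hypothetical toy database of five transactions.
transactions = [frozenset(t) for t in ("abc", "ab", "ac", "bc", "abc")]
frequent = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.7)
```

Task 1 enumerates every candidate over the item universe, which is exponential in the number of items, whereas Task 2 only touches the itemsets that are already known to be frequent. That asymmetry is why the search-space reduction effort concentrates on the first task.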

In real-world applications, a database grows over time, so existing association rules may become invalid and new, implicitly valid association rules may appear. Recently, some researchers have developed incremental mining algorithms that maintain association rules without re-processing the entire updated database [13]. The common idea of these studies is that previously mined information, such as the mined frequent itemsets, is stored in advance. When new transactions are inserted, (a) a large portion of the candidate itemsets can be decided using the pre-stored frequent itemsets, and (b) only a small portion of the candidate itemsets obtained from the new transactions, those without sufficient stored information, needs to be re-processed against the original database. Task (a) is responsible for updating previously mined association rules, and Task (b) is responsible for finding new association rules. Much computation time can be saved in this way. However, for a dense database such as census data or DNA sequences, the computation cost of Task (a) becomes tremendous because of the huge number of previously mined frequent itemsets. For example, a frequent 30-itemset (a frequent itemset consisting of 30 items) implies the presence of 2³⁰ − 2 additional frequent itemsets, namely all of its non-empty proper subsets. The performance of classical incremental mining algorithms therefore degrades dramatically. In addition, most incremental mining algorithms require one scan of the original database to deal with Task (b). When the original database is massive, this results in excessive I/O cost. In this study, we therefore attempt to use the concepts of closed itemsets and pre-large itemsets to overcome these two challenges, respectively.
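The count quoted for a frequent 30-itemset can be checked directly: every non-empty proper subset of a frequent itemset is itself frequent, and a 30-item set has 2³⁰ subsets, two of which (the empty set and the set itself) are excluded. A short check, with the variable names chosen only for this illustration:

```python
from math import comb

k = 30  # size of the frequent itemset from the example in the text

# Sum the number of subsets of each size from 1 to k - 1.
implied = sum(comb(k, j) for j in range(1, k))

print(implied)  # 1073741822
```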

In a dense database, many itemsets tend to appear together in the same transactions, so they can be considered as a group. A closed itemset is defined as an itemset that has no proper superset with the same support; the set of closed itemsets can be treated as a lossless compression of all itemsets in the database. It can also reduce the number of redundant rules generated [34]. Therefore, using the set of frequent closed itemsets, instead of the set of all frequent itemsets of the original database, as the pre-stored mining information can increase both the efficiency and the effectiveness of an incremental mining algorithm.

All frequent itemsets and their exact supports can easily be derived from the set of frequent closed itemsets, which is typically orders of magnitude smaller than the set of all frequent itemsets.
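The lossless-compression property can be illustrated on a toy database. The sketch below is a brute-force illustration under assumed names (`support_counts`, `closed_itemsets`, `support_from_closed`) and assumed data; it is not the mining procedure used in this work. It keeps only the itemsets with no proper superset of equal support, then recovers the support of any itemset as the largest support among its closed supersets.

```python
from itertools import combinations

def support_counts(transactions):
    """Support count of every itemset occurring in at least one transaction."""
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for sub in combinations(sorted(t), k):
                key = frozenset(sub)
                counts[key] = counts.get(key, 0) + 1
    return counts

def closed_itemsets(counts):
    """Keep the itemsets that have no proper superset with the same support."""
    return {x: c for x, c in counts.items()
            if not any(x < y and c == cy for y, cy in counts.items())}

def support_from_closed(itemset, closed):
    """Recover a support count as the maximum over closed supersets (0 if none)."""
    return max((c for y, c in closed.items() if itemset <= y), default=0)

# Hypothetical toy database.
transactions = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("d")]
counts = support_counts(transactions)
closed = closed_itemsets(counts)

# Every itemset's support can be reconstructed from the closed itemsets alone.
recovered = {x: support_from_closed(x, closed) for x in counts}
```

Here eight itemsets with non-zero support are represented by three closed itemsets ({a, b}, {a, b, c} and {d}) without losing any support information. On dense data the reduction is usually far larger.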

In general, the number of newly inserted transactions is much smaller than the number of records in the original database. Only the candidate itemsets whose supports in the original database are slightly below the minimum support can become frequent after database maintenance. Pre-large itemsets are defined as the itemsets whose supports lie between a lower support threshold, which is smaller than the given minimum support, and an upper support threshold, which is equal to the given minimum support. Storing the pre-large closed itemsets in addition to the pre-stored frequent closed itemsets can therefore reduce the cost of re-processing the entire database, at the expense of storage space. This is because the pre-large itemsets act as a buffer that prevents closed itemsets from moving directly from infrequent to frequent, and vice versa, during the incremental mining process.
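The buffering role of pre-large itemsets can be sketched as a three-way classification by support. The function names, the threshold values and the treatment of the boundaries (lower threshold inclusive, upper threshold as the minimum support) are assumptions made for this illustration, as are the counts in the example.

```python
def classify(count, n_transactions, lower, upper):
    """Label an itemset as 'large', 'pre-large' or 'small' by its support ratio."""
    support = count / n_transactions
    if support >= upper:   # upper threshold equals the minimum support
        return "large"
    if support >= lower:   # assumed inclusive lower bound
        return "pre-large"
    return "small"

def update(count, n_transactions, new_count, n_new):
    """Merge a stored count with the count observed in the inserted transactions."""
    return count + new_count, n_transactions + n_new

LOWER, UPPER = 0.25, 0.30  # hypothetical thresholds

# An itemset stored as pre-large: 28 occurrences in 100 original transactions.
before = classify(28, 100, LOWER, UPPER)

# Ten transactions are inserted, six of which contain the itemset.
count, total = update(28, 100, 6, 10)
after = classify(count, total, LOWER, UPPER)
```

Because the count of the pre-large itemset was stored, its transition to large (34 of 110 transactions) is decided from the new transactions alone, without rescanning the original database. An itemset that was not stored at all would have required that rescan.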

Although the concept of closed itemsets can effectively reduce the number of itemsets considered, some closed itemsets of the updated database, called joint closed itemsets in this paper, may not be considered by a classical incremental mining algorithm. The major reason is that the joint closed itemsets, which were compressed before the update, cannot be determined by the above-mentioned Tasks (a) and (b). In this paper, we therefore propose a novel incremental mining algorithm called Closed Itemsets Maintaining (CIM), which extends Tasks (a) and (b) so that all frequent closed itemsets of the updated database can be found efficiently. Task (a) of the CIM algorithm is responsible for extracting the joint closed itemsets, which were compressed by the pre-stored frequent closed itemsets of the original database, and updating them against the newly inserted transactions. Task (b) of the CIM algorithm is responsible for generating the candidate itemsets of the updated database that have not been determined in Task (a). Furthermore, based on the concept of pre-large itemsets, we propose the CIM-P algorithm to reduce the cost of Task (b) in the CIM algorithm.
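A small example may help show why such itemsets are missed. The sketch below is not the CIM algorithm; it only illustrates, on assumed toy data and with a hypothetical brute-force helper `closed_itemsets`, that an itemset can be closed in the updated database while being closed in neither the original database nor the inserted transactions. This appears to be the kind of itemset the text calls joint closed, although the formal definition is not given in this section.

```python
from itertools import combinations

def closed_itemsets(transactions):
    """Brute force: itemsets with no proper superset of equal support count."""
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for sub in combinations(sorted(t), k):
                key = frozenset(sub)
                counts[key] = counts.get(key, 0) + 1
    return {x: c for x, c in counts.items()
            if not any(x < y and c == cy for y, cy in counts.items())}

original = [frozenset("abc"), frozenset("abc")]  # hypothetical original database
inserted = [frozenset("abd")]                    # hypothetical new transaction

closed_original = closed_itemsets(original)
closed_inserted = closed_itemsets(inserted)
closed_updated = closed_itemsets(original + inserted)

# Closed itemsets of the updated database found in neither stored set.
missed = {x for x in closed_updated
          if x not in closed_original and x not in closed_inserted}
```

In the original database, {a, b} has the same support as {a, b, c} and is therefore compressed away. After the insertion, {a, b} occurs in three transactions while {a, b, c} occurs in two, so {a, b} becomes closed, yet it appears among neither the pre-stored closed itemsets nor the closed itemsets of the new transaction.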

We also design a bucketing strategy to improve the utilization of the buffer. The buffer consumption can be calculated rigorously from the maximum value of the buckets.
