Motivation - Mining High Average-Utility Itemsets

CHAPTER 1 Introduction

1.1 Motivation

Mining frequent itemsets from a transaction database is a fundamental task in knowledge discovery, underlying association rules [1], sequential patterns [2-4] and classification [5, 6]. Numerous methods have been proposed to discover frequent itemsets. The two best-known families are level-wise algorithms [1, 7-10] and pattern-growth methods [11-15]. These approaches, however, consider only whether an item was bought in a transaction or not. Frequent itemsets thus reveal only how often itemsets occur and do not reflect other factors, such as price or profit.

In some situations, frequent itemsets may contribute only a small portion of the overall profit, while non-frequent ones may contribute a large portion [16]. For example, diamonds may sell less frequently than clothing in a department store, but the former yield a much higher profit per unit sold than the latter. Frequency alone is thus not sufficient to identify items that are highly profitable or have other potential effects.

Utility mining [17] was thus proposed to partially solve the above problem. It may be thought of as an extension of frequent-itemset mining with quantities and profits considered. Utility is used to estimate how “useful” an itemset is. An itemset is said to be useful to a user if it satisfies the utility constraint; that is, the utility of the itemset must be larger than a threshold defined by the user. In practice, the utility value of an itemset can be measured in terms of cost, profit or other measures reflecting user preference. For example, one user may be interested in finding itemsets with good profits, while another may focus on itemsets that cause little pollution during manufacturing.

In utility mining, local transaction utility and external utility are used to measure the utility of an item. The local transaction utility of an item is information stored in the transaction dataset, such as the quantity of the item sold in the transaction. The external utility of an item is defined by the user’s preference, such as its profit, and can be represented by a utility table or a utility function. By combining a transaction dataset with a utility table, the discovered itemsets will better match a user’s expectations than those found by considering the transaction dataset alone [16].
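The combination of a transaction dataset and a utility table can be illustrated with a minimal sketch. The toy transactions, the profit values and the function name `itemset_utility` are assumptions made for illustration only; they do not come from the thesis.

```python
# Hypothetical toy data: each transaction maps an item to the quantity sold
# (the local transaction utility).
transactions = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"B": 1, "C": 3},
]

# Hypothetical utility table: profit per unit (the external utility).
profit = {"A": 5, "B": 2, "C": 10}


def itemset_utility(itemset, transactions, profit):
    """Sum quantity * profit over the items of `itemset`,
    in every transaction that contains the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total


u_a = itemset_utility({"A"}, transactions, profit)        # 1*5 + 2*5 = 15
u_ab = itemset_utility({"A", "B"}, transactions, profit)  # 1*5 + 2*2 = 9
u_c = itemset_utility({"C"}, transactions, profit)        # 1*10 + 3*10 = 40
```

In this toy example item C appears in as many transactions as A or B, yet its utility is far higher because of its unit profit, which frequency alone would not reveal.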

In frequent-itemset mining, the Apriori-like strategy [18] is often adopted to search for frequent itemsets level by level. The basic principle of the Apriori-like strategy is the “downward closure” property (anti-monotone property). That is, any superset of a non-frequent itemset is also non-frequent. This anti-monotone property is used to reduce the search space by pruning non-frequent itemsets early. The “downward closure” principle cannot, however, be directly applied to discover high utility itemsets, because adding items to an itemset can raise its utility. Without the “downward closure” property, the number of candidate itemsets generated at each level approaches the number of all combinations of the items, and the computation time becomes intolerable.
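A small counterexample may make the difference concrete. The sketch below uses hypothetical toy data and helper names (`support`, `utility`) that are not part of the thesis; it only shows that support can never grow when an itemset is extended, whereas utility can.

```python
# Hypothetical toy data: item -> quantity sold, and profit per unit.
transactions = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"B": 1, "C": 3},
]
profit = {"A": 5, "B": 2, "C": 10}


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if all(i in t for i in itemset))


def utility(itemset):
    """Total quantity * profit of the itemset over the transactions containing it."""
    return sum(
        sum(t[i] * profit[i] for i in itemset)
        for t in transactions
        if all(i in t for i in itemset)
    )


subset, superset = {"B"}, {"B", "C"}

sup_small, sup_large = support(subset), support(superset)  # 2 and 1
utl_small, utl_large = utility(subset), utility(superset)  # 6 and 32
```

With a utility threshold of 20, for instance, {B} would be a low utility itemset while its superset {B, C} would be a high utility one, so pruning {B} early would lose a valid result.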

In this thesis, we propose a new way to evaluate the utilities of itemsets. Traditionally, the utility of an itemset is the sum of its utilities in all the transactions that contain it, regardless of its length. The utility of an itemset in a transaction therefore increases with its length. That is, longer itemsets in a transaction have higher utility values.

Using the same minimum threshold to judge itemsets of different lengths is thus not fair. To alleviate the effect of itemset length and identify itemsets of genuinely good utility, the average utility measure is adopted, which reveals the utility effect of combining several items better than the original utility measure does. It is defined as the total utility of an itemset divided by the number of items it contains. The average utility of an itemset is then compared with a threshold to decide whether it is a high average-utility itemset. An algorithm is also proposed to find all the high average-utility itemsets.
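The definition above can be sketched in a few lines. The data and the function names `total_utility` and `average_utility` are illustrative assumptions, not the notation of the thesis.

```python
# Hypothetical toy data: item -> quantity sold, and profit per unit.
transactions = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"B": 1, "C": 3},
]
profit = {"A": 5, "B": 2, "C": 10}


def total_utility(itemset, transactions, profit):
    """Utility of the itemset summed over all transactions that contain it."""
    return sum(
        sum(t[i] * profit[i] for i in itemset)
        for t in transactions
        if all(i in t for i in itemset)
    )


def average_utility(itemset, transactions, profit):
    """Total utility divided by the number of items in the itemset."""
    return total_utility(itemset, transactions, profit) / len(itemset)


au_c = average_utility({"C"}, transactions, profit)        # 40 / 1 = 40.0
au_bc = average_utility({"B", "C"}, transactions, profit)  # 32 / 2 = 16.0
au_ab = average_utility({"A", "B"}, transactions, profit)  # 9 / 2 = 4.5
```

Dividing by the itemset length means that {B, C} is no longer credited merely for containing the profitable item C alongside a cheap one.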

Like two-phase mining for high utility itemsets, the proposed mining algorithm for high average-utility itemsets uses average-utility upper bounds, which overestimate the actual average utilities of itemsets, so that the downward closure property is satisfied. The average-utility upper bound of an itemset is defined here as the sum, over all transactions that include the itemset, of the maximal item utility in each transaction. Only combinations of itemsets whose average-utility upper bounds exceed the user-defined threshold are added to the candidate set in a level-wise way. The downward closure property is thus maintained: any subset of an itemset with a high average-utility upper bound must also have a high average-utility upper bound. The size of the candidate set is therefore substantially reduced during the level-wise search. Only one database scan is needed to filter out the promising candidates. After that, a second database scan is performed to find the actual average utility of each candidate and decide whether it is a high average-utility itemset.
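The two phases described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm as specified in the thesis: the function name `mine_high_average_utility`, the toy data, and the inclusive comparison (`>=`) with the threshold are all assumptions, and the sketch recomputes upper bounds from an in-memory list rather than trying to minimize database scans.

```python
from itertools import combinations


def mine_high_average_utility(transactions, profit, threshold):
    # Per-transaction item utilities and the maximal item utility of each transaction.
    tx = [{i: q * profit[i] for i, q in t.items()} for t in transactions]
    tmax = [max(u.values()) for u in tx]

    def upper_bound(itemset):
        # Sum of the transaction-maximal utilities over transactions containing the itemset.
        return sum(m for u, m in zip(tx, tmax) if itemset.issubset(u))

    # Phase 1: level-wise candidate generation, pruned by the upper bound.
    items = sorted({i for u in tx for i in u})
    level = [frozenset([i]) for i in items if upper_bound(frozenset([i])) >= threshold]
    candidates = []
    while level:
        candidates.extend(level)
        prev = set(level)
        k = len(level[0]) + 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        level = [
            c for c in joined
            # Downward closure: every (k-1)-subset must have survived the previous level.
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
            and upper_bound(c) >= threshold
        ]

    # Phase 2: compute the actual average utility of each surviving candidate.
    result = {}
    for c in candidates:
        total = sum(sum(u[i] for i in c) for u in tx if c.issubset(u))
        average = total / len(c)
        if average >= threshold:
            result[c] = average
    return result


# Hypothetical toy data: item -> quantity sold, and profit per unit.
transactions = [{"A": 1, "B": 2}, {"A": 2, "C": 1}, {"B": 1, "C": 3}]
profit = {"A": 5, "B": 2, "C": 10}
found = mine_high_average_utility(transactions, profit, threshold=12)
```

On this toy data the first phase discards {A, B} and {A, C} because their upper bounds (5 and 10) fall below 12, while {B} survives as a candidate (upper bound 35) and is rejected only in the second phase, where its actual average utility turns out to be 6. The itemsets finally reported are {A}, {C} and {B, C}.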

Most of the above approaches assume that the database is static and focus on batch mining. In real-world applications, however, the database changes as new records (or transactions) are added. When new records are added, some of the originally frequent or high utility itemsets may become invalid, and some itemsets that were not previously frequent or of high utility may become so in the whole updated database [19-22]. In this situation, conventional batch mining algorithms must re-process the whole updated database to find the frequent or high utility itemsets. Re-processing the whole updated database has the following two disadvantages:

(a) Computation time must be spent on mining the whole updated database. If the original database is large, much of this time is wasted.

(b) The information previously mined from the original database provides no help in the incremental mining process.

To avoid these shortcomings, FUP (Fast UPdate algorithm) [19] was proposed to incrementally maintain the mined results for traditional association rules. The FUP algorithm was based on the Apriori mining algorithm [18] and adopted the pruning techniques of the DHP (Direct Hashing and Pruning) algorithm [23]. It first calculated large itemsets from the newly inserted transactions and compared them with the previous large itemsets from the original database. According to the comparison results, FUP determined whether re-scanning the original database was needed, thus saving time in maintaining association rules.
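The comparison step can be pictured with a simplified sketch for a single itemset under record insertion. It is an illustration of the general case analysis only, not the FUP algorithm itself: the function name `fup_case`, its parameters, and the convention that support counts are retained only for itemsets that were large in the original database are assumptions made here.

```python
def fup_case(old_count, new_count, n_old, n_new, min_sup):
    """Decide how one itemset is handled after inserting n_new transactions.

    old_count: support count in the original database, or None if the itemset
               was not large there (assumed: its count was not kept).
    new_count: support count in the newly inserted transactions.
    min_sup:   minimum support ratio, e.g. 0.5.
    """
    if old_count is not None:
        # Previously large: the retained count allows a direct decision, no re-scan.
        total = old_count + new_count
        return "large" if total >= min_sup * (n_old + n_new) else "small"
    if new_count >= min_sup * n_new:
        # Previously small but large in the new transactions:
        # only a re-scan of the original database can settle it.
        return "rescan"
    # Small in both parts, hence small in the updated database.
    return "small"


still_large = fup_case(6, 1, n_old=10, n_new=4, min_sup=0.5)      # 7 >= 7.0
became_small = fup_case(5, 1, n_old=10, n_new=4, min_sup=0.5)     # 6 < 7.0
needs_rescan = fup_case(None, 3, n_old=10, n_new=4, min_sup=0.5)  # 3 >= 2.0
stays_small = fup_case(None, 1, n_old=10, n_new=4, min_sup=0.5)   # 1 < 2.0
```

Only the third situation requires touching the original database, which is where the time saving comes from.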

In this thesis, we also propose two algorithms to maintain the discovered high average-utility itemsets as records are inserted into or deleted from the database. These algorithms adopt the Apriori strategy to search for high utility itemsets level by level. The “downward closure” property of itemsets, which is the basic principle of the Apriori-like strategy, is used to reduce the search space by pruning low utility itemsets early.

With the “downward closure” property, the number of candidate itemsets generated at each level is greatly reduced. In addition, to handle a database that changes through inserted and deleted records, the concept of the FUP algorithm is adopted to reduce the time otherwise needed to re-process the whole updated database. Finally, the performance of the proposed mining algorithms is evaluated on a real database.
