Reader’s Guide - Mining High Average-Utility Itemsets

CHAPTER 1 Introduction

1.3 Reader’s Guide

The remainder of this thesis is organized as follows. Chapter 2 reviews related research, including the Apriori principle, utility mining, several utility mining methods, and the FUP algorithm. Chapter 3 gives the definition and meaning of high average-utility itemsets, together with the algorithm for mining them and an example that illustrates it. Chapters 4 and 5 propose two incremental utility mining algorithms for record insertion and record deletion, respectively, each with illustrative examples. Chapter 6 presents the experimental results. Finally, Chapter 7 gives the discussion and conclusion.

CHAPTER 2 Review of Related Works

In this chapter, we review research related to this thesis. Section 2.1 describes the Apriori principle. Section 2.2 introduces the general concept of utility mining. Section 2.3 reviews some utility mining methods. Section 2.4 describes the FUP algorithm, which is used for incrementally maintaining association rules.

2.1 Apriori Principle

Agrawal and Srikant proposed the Apriori algorithm [18] to mine association rules from a set of transactions in a level-wise way. In each pass, Apriori employs the downward-closure (anti-monotone) property to prune impossible candidates, thus improving the efficiency of identifying frequent itemsets. This property states that each subset of a frequent itemset must be frequent and each superset of an infrequent itemset must be infrequent. Using this property during mining remarkably reduces the number of itemsets that have to be checked.
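The pruning step can be sketched as follows. This is only a minimal illustration of the downward-closure property, not the implementation in [18]; the function name `generate_candidates` and the toy set of frequent 2-itemsets are our own assumptions.

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Build (k+1)-candidates from frequent k-itemsets (hypothetical helper).

    A candidate is kept only if every one of its k-subsets is frequent,
    which is exactly the downward-closure (anti-monotone) check.
    """
    items = sorted({i for s in frequent_k for i in s})
    candidates = set()
    for s in frequent_k:
        for i in items:
            if i in s:
                continue
            c = frozenset(s | {i})
            if all(frozenset(sub) in frequent_k for sub in combinations(c, k)):
                candidates.add(c)
    return candidates

# Toy input (assumed): the frequent 2-itemsets of some database.
frequent_2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")}
candidates_3 = generate_candidates(frequent_2, 2)
print(sorted("".join(sorted(c)) for c in candidates_3))  # ['ABC']
```

Here {ABD}, {ACD} and {BCD} are discarded without scanning the database, because each of them has at least one infrequent 2-subset.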

2.2 Utility Mining

Utility mining [17], an extension of frequent itemset mining, is based on the measurement of local transaction utility and external utility. Given a transaction database, a utility table and a minimum utility threshold, the goal of utility mining is to discover the itemsets whose utility values are larger than the predefined minimum utility threshold. The utility of an item in a transaction is defined as the product of its quantity (local transaction utility) and its profit (external utility). The utility of an itemset in a transaction is thus the sum of the utilities of all the items of the itemset in the transaction. If the sum of the utilities of an itemset in all the transactions that contain it is larger than the predefined utility threshold, then the itemset is called a high utility itemset.

In utility mining, the downward-closure property no longer holds. As items are added to an itemset, its utility in each transaction that contains it grows monotonically, whereas its frequency decreases monotonically. Because these two monotonic trends pull in opposite directions, the total utility of an itemset may be either larger or smaller than that of its subsets, which makes the downward-closure property invalid in utility mining.
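The definitions above can be illustrated with a small sketch. The database, the profit values and the threshold below are toy assumptions, and `itemset_utility` is a hypothetical helper name; the point is only that a low utility itemset can have a high utility superset.

```python
def itemset_utility(itemset, database, profit):
    """Sum of quantity * profit over the items of `itemset`,
    taken over the transactions that contain the whole itemset."""
    total = 0
    for quantities in database:
        if all(i in quantities for i in itemset):
            total += sum(quantities[i] * profit[i] for i in itemset)
    return total

# Toy data (assumed): each transaction maps a purchased item to its quantity.
profit = {"A": 3, "B": 10}
database = [{"A": 1, "B": 1}, {"A": 1, "B": 2}, {"B": 1}]
threshold = 30

u_a = itemset_utility({"A"}, database, profit)        # 3 + 3 = 6
u_ab = itemset_utility({"A", "B"}, database, profit)  # 13 + 23 = 36
print(u_a > threshold, u_ab > threshold)  # False True
```

{A} is not a high utility itemset although its superset {AB} is, so pruning the supersets of a low utility itemset would lose results.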

2.3 Some Utility Mining Methods

In the past, several mining approaches were proposed for finding high utility itemsets. For example, Barber and Hamilton proposed the Zero pruning (ZP) and Zero subset pruning (ZSP) approaches to exhaustively search for all high utility itemsets in a database [24, 25]. They generated all the itemsets as candidates except those whose local measure values (utilities) were exactly zero. Although ZP and ZSP could discover all high utility itemsets in a database, their computation cost was very high.

Li et al. proposed the FSM, the ShFSM and the DCG methods [26-28] to discover all high utility itemsets by taking advantage of the level-closure property. These methods relied on the critical function of each candidate to remove useless candidates.

Yao then proposed a framework for mining high utility itemsets based on mathematical properties of utility constraints. Two pruning strategies based on utility upper bounds and expected utility upper bounds respectively were adopted to reduce the search space. These pruning strategies were then incorporated into the mining approach Umining and its heuristic successor, Umining_H [29].

Liu et al. then presented a two-phase algorithm for quickly discovering all high utility itemsets [16, 30]. In the first phase, the transaction utility was used as an effective upper bound of each candidate itemset in the transaction, so that the “transaction-weighted downward closure property” could be maintained in the search space to decrease the number of candidate itemsets. In the second phase, an additional database scan was performed to find the real utility values of the remaining candidates and identify the high utility itemsets.
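The upper bound used in the first phase can be sketched as follows. We assume here that the bound of an itemset is the sum of the transaction utilities of all the transactions containing it (usually called the transaction-weighted utility); the function names and the toy data are our own and are not taken from [16, 30].

```python
def transaction_utility(transaction, profit):
    """Total utility of all the items purchased in one transaction."""
    return sum(q * profit[i] for i, q in transaction.items())

def twu(itemset, database, profit):
    """Transaction-weighted utility: an overestimate of the itemset's utility."""
    return sum(transaction_utility(t, profit)
               for t in database if all(i in t for i in itemset))

def utility(itemset, database, profit):
    """Actual utility of the itemset, as found by the second-phase scan."""
    return sum(sum(t[i] * profit[i] for i in itemset)
               for t in database if all(i in t for i in itemset))

# Toy data (assumed): item -> quantity per transaction.
profit = {"A": 3, "B": 10, "C": 1}
database = [{"A": 1, "B": 1}, {"A": 1, "B": 2, "C": 5}, {"B": 1}]

print(twu("A", database, profit), twu("AB", database, profit),
      twu("ABC", database, profit))      # 41 41 28
print(utility("AB", database, profit))   # 36
```

The bound never increases when an item is added to an itemset, so it can be used for level-wise pruning, while the actual utility is checked only for the surviving candidates.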

2.4 The FUP Algorithm

Generally, the following four cases (illustrated in Figure 2-1) may arise while considering an original database and newly inserted transactions:

Case 1: An itemset is large (frequent) in the original database and in the newly inserted transactions;

Case 2: An itemset is large in the original database, but is not large (small) in the newly inserted transactions;

Case 3: An itemset is not large in the original database, but is large in the newly inserted transactions;

Case 4: An itemset is not large in the original database and in the newly inserted transactions.

Figure 2-1: Four cases arising from adding new transactions to an existing database.

Cases 1 and 4 will not affect the existing rules (results). Case 2 may remove existing rules (results), and Case 3 may add new rules (results).

Cheung et al. proposed the FUP algorithm [19] to incrementally maintain the mined results of association rules when new transactions were inserted. In FUP, large itemsets with their counts in preceding runs were recorded for later use in maintenance. As new transactions were added, FUP first scanned them to generate candidate 1-itemsets, and then compared these itemsets with the previous ones to decide which case each itemset belonged to. The corresponding process was then executed for each of the four cases as follows.

Case 1: An itemset was large in the original database and in the newly inserted transactions. In this case, the itemset was always large in the updated database, and the only thing to do was to re-calculate the counts of the itemset in the updated database.

Case 2: An itemset was large in the original database, but was not large in the newly inserted transactions. In this case, the counts of the itemset were re-calculated and the itemset was then checked against the minsup to decide whether it was a large itemset.

Case 3: An itemset was not large in the original database, but was large in the newly inserted transactions. In this case, a database rescan was needed to determine the counts of the itemset in the original database. Also, the counts of the itemset were re-calculated and the itemset was then checked against the minsup to decide whether it was a large itemset.

Case 4: An itemset was not large in the original database and in the newly inserted transactions. The itemset was always small in the updated database. Nothing had to be done for this case.

The four cases and their results in FUP are summarized in Table 2-1.

Table 2-1: Four cases and their results in FUP for record insertion.

Cases: Original - New      Results

Case 1: Large - Large      Always large

Case 2: Large - Small      Decided from the existing information

Case 3: Small - Large      Decided by rescanning the original database

Case 4: Small - Small      Always small

After all the large 1-itemsets for an entire updated database were found, candidate 2-itemsets from the newly inserted transactions were generated and the same procedure was used to find all large 2-itemsets. This procedure was repeated until all large itemsets had been found.
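The handling of the four insertion cases at one level can be sketched as follows. This is a simplified illustration and not the original FUP implementation in [19]; the function names, the relative `minsup` and the toy data are assumptions, and the candidates are taken as the old large 1-itemsets together with the items in the new transactions.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def fup_insert(candidates, old_large, original_db, new_db, minsup):
    """old_large maps each previously large itemset to its count in original_db."""
    n_total = len(original_db) + len(new_db)
    updated = {}
    for s in candidates:
        new_count = count(s, new_db)
        large_in_new = new_count >= minsup * len(new_db)
        if s in old_large:
            # Cases 1 and 2: the old count is known, so no rescan is needed.
            total = old_large[s] + new_count
        elif large_in_new:
            # Case 3: the original database must be rescanned for this itemset.
            total = count(s, original_db) + new_count
        else:
            # Case 4: always small in the updated database.
            continue
        if total >= minsup * n_total:
            updated[s] = total
    return updated

# Toy data (assumed), with minsup = 50%.
original_db = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B"}]
old_large = {frozenset("A"): 3, frozenset("B"): 3}
new_db = [{"B", "C"}, {"C", "D"}, {"B", "C"}]
candidates = [frozenset(i) for i in "ABCD"]
result = fup_insert(candidates, old_large, original_db, new_db, minsup=0.5)
print({"".join(s): c for s, c in result.items()})  # {'B': 5, 'C': 4}
```

In this run {A} falls under Case 2 and is removed, {B} under Case 1, {C} under Case 3 (the only itemset that needs a rescan), and {D} under Case 4.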

Also, the following four cases (illustrated in Figure 2-2) may arise while considering an original database and deleted transactions:

Case 1: An itemset is large (frequent) in the original database and in the deleted transactions;

Case 2: An itemset is large in the original database, but is not large (small) in the deleted transactions;

Case 3: An itemset is not large in the original database, but is large in the deleted transactions;

Case 4: An itemset is not large in the original database and in the deleted transactions.

Figure 2-2: Four cases arising from deleting transactions from an existing database.

Cases 2 and 3 will not affect the existing rules (results). Case 1 may remove existing rules (results), and Case 4 may add new rules (results).

FUP first scanned the deleted transactions to generate candidate 1-itemsets, and then compared these itemsets with the previous ones to decide which case each itemset belonged to. The corresponding process was then executed for each of the four cases as follows.

Case 1: An itemset was large in the original database and in the deleted transactions. In this case, the counts of the itemset were re-calculated and the itemset was then checked against the minsup to decide whether it was a large itemset.

Case 2: An itemset was large in the original database, but was not large in the deleted transactions. In this case, the itemset was always large in the updated database, and the only thing to do was to re-calculate the counts of the itemset in the updated database.

Case 3: An itemset was not large in the original database, but was large in the deleted transactions. The itemset was always small in the updated database. Nothing had to be done for this case.

Case 4: An itemset was not large in the original database and in the deleted transactions. In this case, a database rescan was needed to determine the counts of the itemset in the original database. Also, the counts of the itemset were re-calculated and the itemset was then checked against the minsup to decide whether it was a large itemset.

The four cases and their results in FUP are summarized in Table 2-2.

Table 2-2: Four cases and their results in FUP for record deletion.

Cases: Original - Deleted  Results

Case 1: Large - Large      Decided from the existing information

Case 2: Large - Small      Always large

Case 3: Small - Large      Always small

Case 4: Small - Small      Decided by rescanning the original database

After all the large 1-itemsets for an entire updated database were found, candidate 2-itemsets from the deleted transactions were generated and the same procedure was used to find all large 2-itemsets. This procedure was repeated until all large itemsets had been found.
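The deletion cases can be sketched in the same way. Again this is only a simplified illustration under our own assumptions (function names, relative `minsup`, toy data), not the original implementation.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def fup_delete(candidates, old_large, original_db, deleted_db, minsup):
    """old_large maps each previously large itemset to its count in original_db."""
    n_left = len(original_db) - len(deleted_db)
    updated = {}
    for s in candidates:
        del_count = count(s, deleted_db)
        large_in_deleted = del_count >= minsup * len(deleted_db)
        if s in old_large:
            # Cases 1 and 2: subtract the deleted count from the known count.
            remaining = old_large[s] - del_count
        elif large_in_deleted:
            # Case 3: always small in the updated database.
            continue
        else:
            # Case 4: the original database must be rescanned for this itemset.
            remaining = count(s, original_db) - del_count
        if remaining >= minsup * n_left:
            updated[s] = remaining
    return updated

# Toy data (assumed), with minsup = 50%; the last two transactions are deleted.
original_db = [{"A", "B"}, {"A", "C"}, {"A", "C"}, {"A"}, {"B", "D"}, {"B"}]
old_large = {frozenset("A"): 4, frozenset("B"): 3}
deleted_db = [{"B", "D"}, {"B"}]
candidates = [frozenset(i) for i in "ABCD"]
result = fup_delete(candidates, old_large, original_db, deleted_db, minsup=0.5)
print({"".join(s): c for s, c in result.items()})  # {'A': 4, 'C': 2}
```

Here {B} (Case 1) is removed, {A} (Case 2) stays large, {D} (Case 3) is skipped, and {C} (Case 4) becomes large after the rescan.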

CHAPTER 3 Mining High Average-Utility Itemsets

In this thesis, we would like to find high average-utility itemsets instead of traditional high utility itemsets. This is reasonable and can effectively reduce the number of candidates. The average utility of an itemset is first defined below.

Traditionally, the utility of an itemset is the summation of the utilities of the itemset in all the transactions regardless of its length. Thus, the utility of an itemset in a transaction will increase along with the increase of its length. That is, longer itemsets in a transaction result in higher utility values. For example, assume a transaction is given as shown in Table 3-1. There are five items in the transaction, respectively denoted A to E. The value attached to each item is the quantity sold in the transaction.

Table 3-1: A transaction as the example.

TID A B C D E

tx 1 1 4 1 0

Assume the predefined profit of each item is given in Table 3-2. According to the two tables, the utility of the 1-itemset {A} in the transaction is calculated as 1*3, which is 3. The utility of the 2-itemset {AB} in the transaction is calculated as 1*3+1*10, which is 13. Similarly, the utility of the 3-itemset {ABC} is calculated as 1*3+1*10+4*1, which is 17. Accordingly, the utility of the 3-itemset {ABC} is larger than that of the 2-itemset {AB}, which is in turn larger than that of the 1-itemset {A}. Longer itemsets thus result in higher utility values. This property is obvious, since a longer itemset includes more items than any of its proper subsets. The effect, however, makes it harder to judge whether an itemset is really better than its subsets.

Table 3-2: The predefined profit values of the items.

Item Profit

A 3

B 10

C 1

D 6

E 5

We give another example to illustrate the idea. Assume there are five transactions and only two items, A and B, in the data set shown in Table 3-3. Assume that both items are sold in the same quantity whenever they are purchased and that their profits are also the same. Thus, the utility values of both items are the same in a transaction if they are purchased. Let the utility value of a purchased item in a transaction be X.

Table 3-3: Five transactions as an example.

TID A B

t1 X 0

t2 0 X

t3 0 X

t4 X X

t5 X X

For the first transaction in Table 3-3, item A is purchased and its utility is thus X, while B is not purchased and its utility is thus 0. Over the five transactions, the support of A is 0.6 and its utility is 3X, and the support of B is 0.8 and its utility is 4X. The support of the 2-itemset {AB} is only 0.4, but its utility is 4X, which does not decrease along with its lower support value. Moreover, the utility (4X) of selling A and B together is not better than the total utility (7X) of selling A and B individually. This is because the length of the itemset {AB}, which is 2, is not considered when the utility of the itemset is calculated. The average utility measure is thus adopted in this thesis to reveal the utility effect of combining several items better than the original utility measure does. It is defined as the total utility of an itemset divided by the number of items it contains. In this example, the utility of {AB} is divided by 2, which gives 2X. The average utility of an itemset is then compared with a threshold to decide whether it is a high average-utility itemset. As expected, under the same threshold the itemsets mined in the proposed way will be fewer than those mined in the original way. Our proposed approach can thus be executed under a larger threshold than the original one, and therefore with a more significant and relevant criterion. The approach for mining useful itemsets under the proposed criterion is stated below.
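The figures quoted for Table 3-3 can be checked with a short sketch. We set X = 1 purely for illustration, and the helper names are our own.

```python
X = 1  # assumed unit utility of a purchased item

# Table 3-3: utility of each purchased item in each transaction.
database = [{"A": X}, {"B": X}, {"B": X}, {"A": X, "B": X}, {"A": X, "B": X}]

def containing(itemset):
    """Transactions that contain every item of the itemset."""
    return [t for t in database if all(i in t for i in itemset)]

def support(itemset):
    return len(containing(itemset)) / len(database)

def utility(itemset):
    return sum(t[i] for t in containing(itemset) for i in itemset)

def average_utility(itemset):
    """Total utility divided by the number of items in the itemset."""
    return utility(itemset) / len(itemset)

for s in ("A", "B", "AB"):
    print(s, support(s), utility(s), average_utility(s))
# A 0.6 3 3.0
# B 0.8 4 4.0
# AB 0.4 4 2.0
```

With the plain utility measure {AB} ties with {B} at 4X, whereas its average utility of 2X is below that of either single item.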

3.1 The Proposed Algorithm for Mining High Average-Utility Itemsets

In the proposed algorithm, the anti-monotone property is used to reduce, level by level, the number of itemsets to be scanned. The algorithm has two phases. In phase 1, the average-utility upper bound is used to overestimate the average utility of itemsets; it is an overestimated value rather than the actual average-utility value. The average-utility upper bound ensures the anti-monotone property: each subset of an itemset with a high average-utility upper bound must also have a high upper bound, and each superset of an itemset with a low average-utility upper bound must also have a low upper bound. Itemsets with low average-utility upper bounds can thus be pruned level by level, which decreases the time needed to scan the database. In phase 2, the database only needs to be scanned once more to check whether the candidates from phase 1 actually have high average utility.

The proposed algorithm first finds all the candidate average-utility 1-itemsets C1. The 1-itemsets whose average-utility upper bounds are larger than or equal to the minimum average-utility threshold are put in C1. Candidate average-utility 2-itemsets C2 are then formed from C1. The algorithm checks all the candidates in C2 by comparing their average-utility upper bounds with the minimum average-utility threshold, and removes the itemsets whose upper bounds are below the threshold. The same procedure is repeated until all the candidate itemsets have been found. The actual average-utility value of each candidate average-utility itemset is then calculated. If this value is larger than or equal to the minimum average-utility threshold, the itemset is put in the set of high average-utility itemsets, H. The details of the proposed mining algorithm are described below.

Two-phase algorithm for mining high average-utility itemsets:

INPUT:

1. A set of m items $I = \{i_1, i_2, \ldots, i_j, \ldots, i_m\}$, each $i_j$ with a profit value $p_j$, $j = 1$ to $m$;

2. A transaction database $D = \{T_1, T_2, \ldots, T_n\}$, in which each transaction includes a subset of items with quantities;

3. The minimum average-utility threshold $\lambda$.

OUTPUT: A set of high average-utility itemsets.

STEP 1: Calculate the utility value $u_{jk}$ of each item $i_j$ in each transaction $T_k$ as $u_{jk} = q_{jk} \times p_j$, where $q_{jk}$ is the quantity of $i_j$ in $T_k$, for $j = 1$ to $m$ and $k = 1$ to $n$.

STEP 2: Find the maximal utility value $mu_k$ in each transaction $T_k$ as $mu_k = \max\{u_{1k}, u_{2k}, \ldots, u_{mk}\}$ for $k = 1$ to $n$.

STEP 3: Calculate the average-utility upper bound $ub_j$ of each item $i_j$ as the summation of the maximal utilities of the transactions which include $i_j$. That is:

$$ub_j = \sum_{i_j \in T_k} mu_k.$$

STEP 4: Check whether the average-utility upper bound of an item $i_j$ is larger than or equal to $\lambda$. If $i_j$ satisfies the above condition, put it in the set of candidate average-utility 1-itemsets, $C_1$. That is:

$$C_1 = \{\, i_j \mid ub_j \ge \lambda,\ 1 \le j \le m \,\}.$$

STEP 5: Set $r = 1$, where $r$ is used to represent the number of items in the current candidate average-utility itemsets to be processed.

STEP 6: Generate the candidate set $C_{r+1}$ from $C_r$ such that all the r-subitemsets of each candidate in $C_{r+1}$ are contained in $C_r$.

STEP 7: Calculate the average-utility upper bound $ub_s$ of each candidate average-utility (r+1)-itemset $s$ as the summation of the maximal utilities of the transactions which include $s$. That is:

$$ub_s = \sum_{s \subseteq T_k} mu_k.$$

STEP 8: Check whether the average-utility upper bound of each candidate (r+1)-itemset $s$ is larger than or equal to $\lambda$. If $s$ does not satisfy the above condition, remove it from $C_{r+1}$. That is:

$$\mathit{New\_C}_{r+1} = \{\, s \mid ub_s \ge \lambda,\ s \in \mathit{original\_C}_{r+1} \,\}.$$

STEP 9: If $C_{r+1}$ is null, do the next step; otherwise, set $r = r + 1$ and repeat STEPs 6 to 9.

STEP 10: For each candidate average-utility itemset $s$, calculate its actual average-utility value $au_s$ as follows:

$$au_s = \frac{\sum_{s \subseteq T_k} \sum_{i_j \in s} u_{jk}}{|s|},$$

where $u_{jk}$ is the utility value of each item $i_j$ in transaction $T_k$ and $|s|$ is the number of items in $s$.

STEP 11: Check whether the actual average-utility value $au_s$ of each candidate average-utility itemset $s$ is larger than or equal to $\lambda$. If $s$ satisfies the above condition, put it in the set of high average-utility itemsets, $H$. That is:

$$H = \{\, s \mid au_s \ge \lambda,\ s \in C \,\},$$

where $C$ is the set of all the candidate average-utility itemsets.
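The eleven steps above can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in the experiments; the function name, the data layout (each transaction as a mapping from item to quantity) and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def mine_high_average_utility(database, profit, threshold):
    """Return a dict mapping each high average-utility itemset to its value."""
    # STEP 1: utility of each purchased item in each transaction.
    utilities = [{i: q * profit[i] for i, q in t.items() if q > 0}
                 for t in database]
    # STEP 2: maximal utility in each transaction.
    max_util = [max(u.values(), default=0) for u in utilities]

    def upper_bound(s):
        # STEPs 3 and 7: sum of maximal utilities of transactions containing s.
        return sum(mu for u, mu in zip(utilities, max_util)
                   if all(i in u for i in s))

    # STEPs 4-5: candidate 1-itemsets, r = 1.
    level = {frozenset([i]) for i in profit
             if upper_bound(frozenset([i])) >= threshold}
    candidates = set(level)
    r = 1
    while level:
        # STEP 6: join C_r with itself; keep a candidate only if all of its
        # r-subitemsets are in C_r.
        joined = {a | b for a in level for b in level if len(a | b) == r + 1}
        joined = {c for c in joined
                  if all(frozenset(sub) in level for sub in combinations(c, r))}
        # STEP 8: prune by the average-utility upper bound.
        level = {s for s in joined if upper_bound(s) >= threshold}
        candidates |= level
        r += 1  # STEP 9
    # STEPs 10-11: actual average utility of every candidate.
    high = {}
    for s in candidates:
        total = sum(sum(u[i] for i in s) for u in utilities
                    if all(i in u for i in s))
        if total / len(s) >= threshold:
            high[s] = total / len(s)
    return high

# Toy data (assumed): item -> quantity per transaction.
profit = {"A": 3, "B": 10, "C": 1}
database = [{"A": 1, "B": 1}, {"A": 2, "B": 2}, {"B": 1, "C": 4}, {"C": 1}]
result = mine_high_average_utility(database, profit, threshold=15)
print({"".join(sorted(s)): v for s, v in result.items()})
```

On this toy input, {A} survives phase 1 because its upper bound is 30, but its actual average utility is only 9, so phase 2 removes it; {B} (40) and {AB} (19.5) remain.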

3.2 An Example

In this section, an example is given to demonstrate the proposed mining algorithm based on the average-utility of items. This is a simple example to show how the proposed algorithm can be easily used to find out the high average-utility itemsets from a set of transactions.

Assume the ten transactions shown in Table 3-4 are used for mining. Each transaction consists of two features, transaction identification (TID) and items purchased.

Table 3-4: The set of ten transaction data for this example.

Also assume that the predefined profit value for each single item is defined in Table 3-5.

Table 3-5: The predefined profit values of the items.

Item Profit

Moreover, the minimum average-utility threshold is set to 45.4, which is 20% of the total utility. In order to find the high average-utility itemsets from the data in Table 3-4, the proposed mining algorithm proceeds as follows.

STEP 1: The utility value of each item occurring in each transaction in Table 3-4 is calculated. Take item B in transaction 7 as an example. The quantity of item B in transaction 7 is 2, and its profit is 10. The utility value of B is thus calculated as 2*10, which is 20. The utility values of all the items in each transaction are shown in Table 3-6.

Table 3-6: The utility values of all the items in each transaction.

TID A B C D E

t1 3 10 4 6 0

t2 0 10 0 18 0

t3 6 0 0 6 0

t4 0 0 1 0 0

t5 3 20 0 6 15

t6 3 10 1 6 5

t7 0 20 3 0 5

t8 0 0 0 6 10

t9 21 0 1 6 0

t10 0 10 1 6 5

STEP 2: The utility values of the items in each transaction are compared and the maximal utility value in the transaction is found. Take transaction 1 as an example. It can be observed from Table 3-6 that the utility value of B is 10, which is the maximal in transaction 1. The maximal utility value in each transaction is shown in Table 3-7.
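STEP 2 can be reproduced directly from the utility values in Table 3-6. The sketch below only re-computes the row maxima and the total utility from that table; the variable names are our own.

```python
# Utility values copied from Table 3-6 (columns A, B, C, D, E).
utility_table = {
    "t1": [3, 10, 4, 6, 0],
    "t2": [0, 10, 0, 18, 0],
    "t3": [6, 0, 0, 6, 0],
    "t4": [0, 0, 1, 0, 0],
    "t5": [3, 20, 0, 6, 15],
    "t6": [3, 10, 1, 6, 5],
    "t7": [0, 20, 3, 0, 5],
    "t8": [0, 0, 0, 6, 10],
    "t9": [21, 0, 1, 6, 0],
    "t10": [0, 10, 1, 6, 5],
}

# STEP 2: maximal utility value in each transaction.
max_utility = {tid: max(row) for tid, row in utility_table.items()}
print(max_utility["t1"])  # 10

# Total utility of the database; 20% of it is the threshold used in this example.
total_utility = sum(sum(row) for row in utility_table.values())
print(total_utility, total_utility * 20 / 100)  # 227 45.4
```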

Table 3-7: The maximal utility values in each transaction of all the given
