CHAPTER 2 Review of Related Works
2.3 Mining High Average-Utility Itemsets
Traditionally, the utility of an itemset is the sum of its utilities over all the transactions in which it appears, regardless of its length. The utility of an itemset in a transaction therefore grows with its length: longer itemsets yield higher utility values. Using the same minimum utility threshold to judge itemsets of different lengths is thus unfair. To alleviate the effect of itemset length and identify truly useful itemsets, the average-utility measure was proposed; it reveals the utility effect of combining several items better than the original utility measure [13]. It is defined as the total utility of an itemset divided by the number of items it contains. The average utility of an itemset is then compared with a threshold to decide whether it is a high average-utility itemset.
Assume the five transactions shown in Table 2-2 are used for mining high average-utility itemsets. Each transaction consists of two features, transaction identification and purchased items. Also assume the profits of the items are shown in Table 2-3 and the user-specified minimum average-utility threshold is set at 20%.
Table 2-2: The set of five transaction data in this example
Table 2-3: The profits of the items in this example
Item    Profit
A       1
B       6
C       8
D       4
The utility value of each item occurring in each transaction in Table 2-2 is calculated first. Take item {A} in transaction 2 as an example. The quantity of item {A} in transaction 2 is 2, and its profit is 1. The utility value of {A} is thus calculated as 2×1, which is 2. The utility values of all items in each transaction are shown in Table 2-4.
Table 2-4: The utility values of all items in each transaction
TID A B C D
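As a minimal sketch of this calculation, the following Python snippet computes item utilities as quantity × profit, using the profits from Table 2-3. The quantities of the first transaction (C:2, B:1, D:2) are those quoted in the HAUP-tree example later in this section; the names `PROFIT` and `item_utilities` are illustrative and do not come from the original work.

```python
# Profits from Table 2-3.
PROFIT = {"A": 1, "B": 6, "C": 8, "D": 4}


def item_utilities(quantities, profit=PROFIT):
    """Return {item: quantity * profit} for one transaction."""
    return {item: qty * profit[item] for item, qty in quantities.items()}


# First transaction as quoted in the text: C:2, B:1, D:2.
t1 = item_utilities({"C": 2, "B": 1, "D": 2})
print(t1)                # {'C': 16, 'B': 6, 'D': 8}
print(max(t1.values()))  # 16, the maximal utility value of the transaction

# Item A with quantity 2, as in the worked example above.
print(item_utilities({"A": 2}))  # {'A': 2}
```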
First, the utility values of all items in each transaction are compared to find the maximal utility value in that transaction. For example, Table 2-4 shows that the utility value of {C} in the first transaction is 16, which is also the maximal utility value in that transaction. Transactions 2 to 5 are then processed in the same way to find their maximal utility values. The average-utility upper bound of an item is the sum of the maximal utility values of the transactions in which it appears. Item {A} appears in transactions 2 and 4, so its average-utility upper bound is 8 + 16 (= 24). The average-utility upper bounds of the other items, {B} to {D}, are calculated in the same way and shown in Table 2-5.
Table 2-5: The average-utility upper bounds of all items
Candidate Itemset    AUUB
A                    24
B                    60
C                    76
D                    48
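The upper-bound calculation can be sketched in a few lines of Python. Because the contents of Table 2-4 are not reproduced here, the utility rows below are hypothetical and serve only to show the mechanics; `auub_per_item` is an illustrative name, not a function from the original work.

```python
from collections import defaultdict


def auub_per_item(utility_rows):
    """utility_rows: one {item: utility} dict per transaction.

    Returns {item: sum of the maximal utilities of the transactions
    that contain the item}.
    """
    auub = defaultdict(int)
    for row in utility_rows:
        if not row:
            continue  # an empty transaction contributes nothing
        max_utility = max(row.values())
        for item in row:
            auub[item] += max_utility
    return dict(auub)


# Hypothetical utility rows (NOT the data of Table 2-4).
rows = [
    {"B": 6, "C": 16, "D": 8},  # maximal utility 16
    {"A": 2, "B": 6, "C": 8},   # maximal utility 8
    {"A": 3, "C": 16},          # maximal utility 16
]
result = auub_per_item(rows)
print(result)  # A: 8+16, B: 16+8, C: 16+8+16, D: 16
```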
Second, the average-utility upper bounds of the 1-itemsets are checked against the minimum average-utility value, which is the total utility of the database multiplied by the user-specified minimum average-utility threshold. The total of all the utility values in Table 2-4 is 127, so the minimum average-utility value is 127×0.2 (= 25.4). In this example, items {B}, {C} and {D} satisfy the condition and form the candidate set of average-utility 1-itemsets. An Apriori-like approach then generates and tests candidate itemsets level by level. After that, the actual average-utility value of each candidate itemset is calculated. Note that the total utility of an itemset must be divided by the number of items it contains to obtain its average utility. The average utility of an itemset is then compared with the minimum average-utility value to decide whether it is a high average-utility itemset.
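The pruning step can be checked directly against the numbers in the text. The sketch below uses the upper bounds of Table 2-5, the total utility 127, and the 20% threshold; the variable names are illustrative.

```python
# Average-utility upper bounds from Table 2-5.
AUUB = {"A": 24, "B": 60, "C": 76, "D": 48}

total_utility = 127  # sum of all utility values in Table 2-4
ratio = 0.2          # user-specified minimum average-utility threshold (20%)

# Minimum average-utility value, approximately 25.4.
min_avg_utility = total_utility * ratio

# Keep the items whose upper bound reaches the minimum average-utility value.
candidates = sorted(item for item, bound in AUUB.items() if bound >= min_avg_utility)
print(candidates)  # ['B', 'C', 'D']
```

Item {A} is pruned because its upper bound, 24, is below 25.4.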
In this example, the final results are then shown in Table 2-6.
Table 2-6: High average-utility itemsets
High Average-Utility Itemset    Average-Utility Value
B                               36
C                               72
BC                              46
CD                              32
BCD                             29.33
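The actual average utility of a candidate itemset can be sketched as follows. The utility rows are again hypothetical (Table 2-4 is not reproduced here), so the printed values are not those of Table 2-6; `average_utility` is an illustrative name.

```python
def average_utility(itemset, utility_rows):
    """Total utility of `itemset` over the transactions that contain all of
    its items, divided by the number of items in the itemset."""
    items = set(itemset)
    total = 0
    for row in utility_rows:
        if items.issubset(row):  # the transaction contains every item
            total += sum(row[item] for item in items)
    return total / len(items)


# Hypothetical utility rows (NOT the data of Table 2-4).
rows = [
    {"B": 6, "C": 16, "D": 8},
    {"A": 2, "B": 6, "C": 8},
    {"A": 3, "C": 16},
]
print(average_utility(["C"], rows))            # 40.0
print(average_utility(["B", "C"], rows))       # (22 + 14) / 2 = 18.0
print(average_utility(["B", "C", "D"], rows))  # 30 / 3 = 10.0
```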
Lin et al. proposed the high average-utility pattern tree (HAUP-tree) algorithm, which integrates the high average-utility mining algorithm with an FP-tree-like approach to construct a condensed tree structure for efficiently mining high average-utility patterns [23]. It builds the HAUP tree tuple by tuple, from the first transaction to the last one, and keeps the average-utility upper bound in each node instead of the count. In addition, each node at the end of a path in the tree stores the average-utility upper bound of the item in the node as well as the quantities of its preceding items in the path.
An illustrative example is given below to demonstrate the construction process. Assume the quantitative database and the profit table are those shown in Table 2-2 and Table 2-3, respectively. The derived average-utility upper bounds of the 1-itemsets were shown in Table 2-5, and the minimum average-utility value was already calculated as 25.4. The occurrence frequencies of the items are also calculated from Table 2-4, which are {A:2, B:4, C:5, D:3}. As found above, the items {B}, {C} and {D} form the candidate set of average-utility 1-itemsets since their average-utility upper bounds are larger than or equal to the minimum average-utility value. These items are sorted in descending order of their occurrence frequencies and kept in the Header_Table in that order. The results are shown in Table 2-7.
Table 2-7: The Header_Table constructed in the example
Header_Table
Item    AUUB
C       76
B       60
D       48
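The construction of the Header_Table can be reproduced from the values given in the text: the upper bounds of Table 2-5, the occurrence frequencies {A:2, B:4, C:5, D:3}, and the minimum average-utility value 25.4. The function name `build_header_table` is illustrative; how ties in frequency are broken is not stated in the text, so this sketch simply keeps the input order for ties.

```python
AUUB = {"A": 24, "B": 60, "C": 76, "D": 48}  # Table 2-5
FREQ = {"A": 2, "B": 4, "C": 5, "D": 3}      # occurrence frequencies
MIN_AVG_UTILITY = 25.4                        # minimum average-utility value


def build_header_table(auub, freq, threshold):
    """Keep items whose AUUB reaches the threshold, sorted by descending
    occurrence frequency (stable sort, so ties keep their input order)."""
    kept = [item for item in auub if auub[item] >= threshold]
    kept.sort(key=lambda item: freq[item], reverse=True)
    return [(item, auub[item]) for item in kept]


header_table = build_header_table(AUUB, FREQ, MIN_AVG_UTILITY)
print(header_table)  # [('C', 76), ('B', 60), ('D', 48)]
```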
The items not existing in the Header_Table are removed from the transactions in the quantitative database. The remaining items in each updated transaction are then sorted according to the above order, which will be used to construct the HAUP tree tuple by tuple. The results for the updated transactions are shown in Table 2-8.
Table 2-8: The updated transactions in the quantitative database
TID C B D
In the HAUP tree, each node stores not only the average-utility upper bound of an item but also the quantities of its prefix items in the path. Take the first transaction as an example to illustrate the process of constructing the HAUP tree. Since there is no corresponding path in the HAUP tree for the first updated transaction (C:2, B:1, D:2), a new node is created for each of its items in sequence, linked as the child of its super-item in the transaction, and inserted into the HAUP tree. Each node stores the maximal utility value of the first updated transaction, 16, as its value, together with the quantities of its super-items in the transaction. Transactions 2 to 5 are then inserted into the HAUP tree in the same way, tuple by tuple. The finally constructed HAUP tree is shown in Figure 2-2.
Figure 2-2: The final constructed HAUP tree
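The insertion of one updated transaction into the tree can be sketched as below. This is a simplified illustration rather than the implementation of [23]: the `Node` class and `insert_transaction` function are hypothetical names, the node-links from the Header_Table are omitted, and the sketch assumes that each quan_Ary also records the quantity of the node's own item alongside its prefix items, so that utilities can later be recomputed from the array.

```python
class Node:
    """One HAUP-tree node: an item, its accumulated AUUB, and a quan_Ary."""

    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.children = {}  # item -> child Node
        self.auub = 0       # accumulated maximal transaction utilities
        self.quan_ary = {}  # item -> accumulated quantity (prefix items + own item)


def insert_transaction(root, ordered_items, max_utility):
    """Insert one transaction, given as (item, quantity) pairs already sorted
    in Header_Table order, together with its maximal item utility."""
    node = root
    prefix = []
    for item, qty in ordered_items:
        prefix.append((item, qty))
        child = node.children.get(item)
        if child is None:  # no matching path yet: create a new node
            child = Node(item, parent=node)
            node.children[item] = child
        child.auub += max_utility
        for p_item, p_qty in prefix:
            child.quan_ary[p_item] = child.quan_ary.get(p_item, 0) + p_qty
        node = child
    return root


# First updated transaction from the text: (C:2, B:1, D:2), maximal utility 16.
root = Node(None)
insert_transaction(root, [("C", 2), ("B", 1), ("D", 2)], 16)
d_node = root.children["C"].children["B"].children["D"]
print(d_node.auub, d_node.quan_ary)  # 16 {'C': 2, 'B': 1, 'D': 2}
```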
After the HAUP tree is constructed, the desired high average-utility itemsets can be derived by the HAUP-growth mining algorithm [23]. In the HAUP-growth algorithm, the items in the Header_Table are processed one by one, bottom-up, over the tree in Figure 2-2. The items with their quantities are extracted from the quan_Ary arrays in the nodes, and the quantities of the same itemsets are summed together. The itemsets associated with the currently processed item are then generated by a combination approach. After that, the actual average-utility value of each itemset is calculated from the summed quantities multiplied by the profits of its items and divided by the number of items in the itemset. The results are the same as those shown in Table 2-6.
The above approaches assume a static database. Thus, whenever transactions are inserted into or deleted from the original database, the updated database has to be re-processed in a batch way. Hong et al. proposed insertion and deletion approaches for maintaining high average-utility itemsets in dynamic databases based on the FUP concepts [14, 17]. These approaches reduce the time needed to re-process the whole updated database and thus perform better than the batch approach.
CHAPTER 3
The Incremental HAUP-tree Construction Algorithm
In the past, Lin et al. proposed a mining approach with the aid of a tree structure to efficiently implement average utility mining [23]. They designed the high average utility pattern tree (HAUP-tree) structure to keep some related information and then proposed the HAUP-growth algorithm to mine high average utility itemsets from the tree structure. Their approach, however, handles the transactions in a batch way. In real-world applications, transactions may come intermittently. In this paper, we thus attempt to extend their approach and propose an incremental mining approach for efficiently finding high average utility patterns.
The proposed approach consists of two main phases. The first phase maintains the correct HAUP tree while handling newly arriving data, and the second phase runs the previous HAUP-growth algorithm to obtain the desired patterns from the HAUP tree. An incremental HAUP-tree construction algorithm is thus designed here based on the FUP (Fast Update) concepts [7] and the FUFP-tree (Fast Updated Frequent-Pattern tree) algorithm [15], which was originally used for incrementally mining association rules. The proposed incremental construction process handles the transactions tuple by tuple, from the first transaction to the last one. It divides the items in the incoming transactions into four cases to maintain the HAUP tree.
The first case is for items whose average-utility upper-bound values from the new transactions are larger than or equal to the minimum average-utility threshold for the new transactions and which appear in the Header_Table. These items will still be high average-utility upper-bound items for the whole updated database. The second case is for items whose average-utility upper-bound values from the new transactions are smaller than the minimum average-utility threshold for the new transactions but which appear in the Header_Table. These items are not necessarily still high average-utility upper-bound items for the whole updated database. Their final average-utility upper-bound values can, however, be easily calculated from the new transactions and the Header_Table. The third case is for items whose average-utility upper-bound values from the new transactions are larger than or equal to the minimum average-utility threshold for the new transactions but which do not appear in the Header_Table. These items are not necessarily still low average-utility upper-bound items for the whole updated database. Their final average-utility upper-bound values can be decided only after their average-utility upper-bound values in the original database are found by re-scanning it. This case is therefore more time-consuming than the previous two. The fourth case is for items whose average-utility upper-bound values from the new transactions are smaller than the minimum average-utility threshold for the new transactions and which do not appear in the Header_Table. These items are certainly still low average-utility upper-bound items and do not need to be processed. The four cases are handled individually in the proposed incremental HAUP-tree construction algorithm, with the HAUP tree and the Header_Table being updated correspondingly.
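The four cases depend on only two conditions, which the following sketch makes explicit. The function name `classify_item` and its parameters are illustrative and not part of the proposed algorithm's notation.

```python
def classify_item(auub_new, alpha_new, in_header_table):
    """Return the case number (1 to 4) of an item.

    auub_new:        the item's average-utility upper bound in the new transactions
    alpha_new:       the minimum average-utility threshold for the new transactions
    in_header_table: whether the item currently appears in the Header_Table
    """
    high_in_new = auub_new >= alpha_new
    if in_header_table:
        return 1 if high_in_new else 2
    return 3 if high_in_new else 4


print(classify_item(30, 8, True))   # 1: stays a high upper-bound item
print(classify_item(5, 8, True))    # 2: must be checked against the updated threshold
print(classify_item(30, 8, False))  # 3: requires rescanning the original database
print(classify_item(5, 8, False))   # 4: nothing to do
```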
3.1 Notations
Notations used in the proposed algorithm are described as follows:
D: the original database;
N: the set of new transactions;
U: the whole updated database, i.e., D∪N;
I: an item;
T: a transaction;
n: the number of transactions in N;
m: the number of items;
qjk: the quantity of item Ij in transaction Tk;
pj: the profit value of Ij;
ujk: the utility value of Ij in Tk;
tuk: the total item utility in Tk;
muk: the maximal item utility in Tk;
tuD: the total utility of transactions in D;
tuN: the total utility of transactions in N;
tuU: the total utility of transactions in U;
λ: the minimum average-utility ratio;
αD: the minimum average-utility threshold for the original database;
αN: the minimum average-utility threshold for the new transactions;
αU: the minimum average-utility threshold for the whole updated database;
AUUBDj: the average-utility upper bound of Ij in D;
AUUBNj: the average-utility upper bound of Ij in N;
AUUBUj: the average-utility upper bound of Ij in U;
quan_Ary: the array attached to a node, whose elements store the quantities of the prefix items of the item in the node;
Insert_Items: the set of items with which the new transactions are reprocessed for updating the HAUP tree;
Rescan_Items: the set of items with which the original transactions are reprocessed for updating the HAUP tree;
Rescan_Transactions: the set of original transactions with items in the Rescan_Items.
3.2 The Incremental HAUP-tree Construction Algorithm
With the above notation, the details of the proposed incremental HAUP-tree construction algorithm are described below.
INPUT:
1. A high average utility pattern tree (HAUP tree) with its Header_Table and the total utility tuD from an original database D with m items;
2. A new set of n transactions, each of which includes a subset of items with quantities;
3. The profit values of the m items;
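4. The minimum average-utility ratio λ.
OUTPUT:
A new HAUP tree with its Header_Table for the updated database.
PROCEDURE:
STEP 1: Calculate the utility value ujk of each item Ij in each new transaction Tk as:
u_{jk} = q_{jk} \times p_{j}.
STEP 2: Find the total item utility tuk and the maximal item utility muk of each new transaction Tk respectively as follows for k = 1 to n:
tu_{k} = \sum_{j=1}^{m} u_{jk} and mu_{k} = \max_{1 \le j \le m} u_{jk}.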
STEP 3: Calculate the total utility tuN from the set of new transactions as:
tu^{N} = \sum_{k=1}^{n} tu_{k}.
STEP 4: Calculate the total utility tuU for the whole updated database U as:
tuU = tuD + tuN.
STEP 5: Calculate the minimum average-utility thresholds (αN and αU) respectively for the new transactions N and for the updated database U as follows:
αN = tuN×λ and αU = tuU×λ,
where tuN is the total utility of all the new transactions and tuU is the total utility of the whole updated database.
STEP 6: Calculate the average-utility upper bound AUUBNj of each item Ij in the set of new transactions N as the summation of the maximal utilities of the new transactions containing Ij. That is:
AUUB_{j}^{N} = \sum_{T_{k} \in N,\ I_{j} \in T_{k}} mu_{k}.
STEP 7: Check whether the average-utility upper bound AUUBNj of each item Ij in the new transactions is larger than or equal to the minimum average-utility threshold αN for the new transactions.
STEP 8: Do the following substeps for each item Ij whose average-utility upper bound AUUBNj from the new transactions is larger than or equal to αN and which appears in the Header_Table (Case 1).
Substep 8-1: Set the new average-utility upper bound AUUBUj of the item Ij in the whole updated database as:
AUUBUj = AUUBDj + AUUBNj,
where AUUBDj is the average-utility upper bound of the item Ij in the original database D and can be found from the Header_Table.
Substep 8-2: Update the average-utility upper bound value of Ij in the Header_Table as AUUBUj .
Substep 8-3: Put Ij in the set of Insert_Items, which will be further processed in STEP 13 to insert the item from the new transactions into the HAUP tree.
STEP 9: Do the following substeps for each item Ij whose average-utility upper bound AUUBNj is smaller than αN but which appears in the Header_Table (Case 2):
Substep 9-1: Calculate the new average-utility upper bound AUUBUj of the item Ij in the whole updated database as:
AUUBUj = AUUBDj + AUUBNj.
Substep 9-2: If AUUBUj ≥ αU, item Ij will still be large after the database is updated; update the AUUB value of Ij in the Header_Table as AUUBUj and add item Ij to the set of Insert_Items, which will be further processed in STEP 13 to insert the item from the new transactions into the HAUP tree.
Substep 9-3: If AUUBUj < αU, item Ij will become small after the database is updated; remove item Ij from the Header_Table, connect each parent node of item Ij directly to the corresponding child node of item Ij, and remove item Ij from the HAUP tree.
STEP 10: Do the following substeps for each item Ij whose average-utility upper bound AUUBNj is larger than or equal to αN but which does not appear in the Header_Table (Case 3):
Substep 10-1: Rescan the original database to find the transactions containing item Ij, find the maximal item utility muk in each found transaction Tk, and calculate AUUBDj as the summation of these maximal item utilities.
Substep 10-2: Calculate the average-utility upper bound AUUBUj of item Ij in the whole updated database as:
AUUBUj = AUUBDj + AUUBNj.
Substep 10-3: If AUUBUj ≥ αU, item Ij will be large after the database is updated; add item Ij to both the set of Insert_Items and the set of Rescan_Items, and put the IDs of the transactions containing item Ij into the set of Rescan_Transactions, which will be further processed in STEP 12 to insert the item from the original database into the HAUP tree.
STEP 11: Sort the items in the set of Rescan_Items in descending order of their AUUBUj values.
STEP 12: Do the following substeps for each item Ij in the set of Rescan_Items according to the sorted order.
Substep 12-1: Insert Ij to the end of the Header_Table.
Substep 12-2: Find the corresponding branch of the HAUP tree for each transaction Tk in the set of Rescan_Transactions with Ij.
Substep 12-3: If Ij is already in the corresponding branch, add the maximal utility value muk of the transaction Tk to the current AUUB value of the node with Ij in the corresponding branch; otherwise, insert Ij at the end of the branch, set its AUUB value to muk, attach an initially empty quan_Ary array (with its prefix items) to the node, and connect the current last brother node (the last node with Ij) in the HAUP tree to it.
Substep 12-4: Add the quantity of each prefix item in Tk to the quantity of the corresponding element in the quan_Ary array attached to the node of Ij in the branch.
STEP 13: Do the following substeps for each inserted transaction Tk with an item Ij existing in the Insert_Items.
Substep 13-1: Find the corresponding branch of the HAUP tree for the inserted transaction Tk with Ij.
Substep 13-2: If Ij is already in the corresponding branch, add the maximal utility value muk of the inserted transaction Tk to the current AUUB value of the node with Ij in the corresponding branch; otherwise, insert Ij at the end of the branch, set its AUUB value to muk, attach an initially empty quan_Ary array (with its prefix items) to the node, and connect the current last brother node (the last node with Ij) in the HAUP tree to it.
Substep 13-3: Add the quantity of each prefix item in Tk to the quantity of the corresponding element in the quan_Ary array attached to the node of Ij in the corresponding branch.
STEP 14: Output the HAUP tree with its Header_Table.
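The bookkeeping of Steps 3 to 10 on the Header_Table can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the tree updates of Steps 9-3, 12 and 13 are omitted, `update_header` and `rescan_auub` are hypothetical names (the latter stands in for rescanning the original database), and header items absent from the new transactions are treated as Case 2 items with an upper bound of 0 in N. The header values and the bound 24 for item A come from Tables 2-7 and 2-5; the new-transaction figures are invented for the example.

```python
def update_header(header, auub_new, alpha_new, alpha_upd, rescan_auub):
    """header:      {item: AUUB in D} for the items in the Header_Table
    auub_new:    {item: AUUB in N}
    alpha_new:   minimum average-utility threshold for N
    alpha_upd:   minimum average-utility threshold for U
    rescan_auub: callable(item) -> AUUB of the item in D (hypothetical helper)
    Returns (new_header, insert_items, rescan_items)."""
    new_header = dict(header)
    insert_items, rescan_items = set(), set()
    for item in set(header) | set(auub_new):
        bound_n = auub_new.get(item, 0)
        high_in_new = bound_n >= alpha_new
        if item in header:
            bound_u = header[item] + bound_n
            if high_in_new or bound_u >= alpha_upd:  # Case 1, or Case 2 still large
                new_header[item] = bound_u
                insert_items.add(item)
            else:                                    # Case 2, becomes small
                del new_header[item]
        elif high_in_new:                            # Case 3
            bound_u = rescan_auub(item) + bound_n
            if bound_u >= alpha_upd:
                new_header[item] = bound_u
                insert_items.add(item)
                rescan_items.add(item)
        # Case 4: not in the Header_Table and low in N, nothing to do
    return new_header, insert_items, rescan_items


ratio = 0.2
tu_d = 127           # total utility of the original database (from the example)
tu_n = 130           # hypothetical total utility of the new transactions
tu_u = tu_d + tu_n   # Step 4
alpha_n = tu_n * ratio  # Step 5
alpha_u = tu_u * ratio

header = {"C": 76, "B": 60, "D": 48}                # Table 2-7
auub_new = {"A": 40, "B": 20, "C": 100, "E": 10}    # hypothetical values
original_auub = {"A": 24}                           # Table 2-5, found by rescanning

new_header, insert_items, rescan_items = update_header(
    header, auub_new, alpha_n, alpha_u, lambda item: original_auub.get(item, 0)
)
print(new_header)  # C -> 176, B -> 80, A -> 64; D is removed
```

In this run, C falls into Case 1, B into Case 2 (still large), D into Case 2 (removed, since 48 is below the updated threshold), A into Case 3, and E into Case 4.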
In Steps 12 and 13, the corresponding branch of a transaction is formed from the large items in the transaction, arranged in the order in which the items appear in the Header_Table.
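Forming the corresponding branch amounts to filtering and reordering a transaction, as in the sketch below. The function name `corresponding_branch` is illustrative; the first call uses the first transaction and the Header_Table order (C, B, D) from Chapter 2, while the second transaction is hypothetical.

```python
def corresponding_branch(transaction, header_order):
    """transaction:  {item: quantity}
    header_order: items in the order in which they appear in the Header_Table
    Returns the (item, quantity) pairs of the large items in that order."""
    return [(item, transaction[item]) for item in header_order if item in transaction]


header_order = ["C", "B", "D"]  # Table 2-7

# First transaction of the running example.
print(corresponding_branch({"B": 1, "C": 2, "D": 2}, header_order))
# [('C', 2), ('B', 1), ('D', 2)]

# Hypothetical transaction containing the small item A, which is dropped.
print(corresponding_branch({"A": 2, "B": 1, "C": 1}, header_order))
# [('C', 1), ('B', 1)]
```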
After Step 14, the HAUP tree has been updated with the new transactions, which can then be integrated into the original database. Based on the updated HAUP tree, the desired high average-utility itemsets can then be found by the HAUP-growth mining approach proposed in [23].
3.3 An Example
In this section, an example is given to demonstrate the proposed incremental HAUP-tree construction algorithm. Assume the quantitative database and the profit table are those shown in Table 2-2 and Table 2-3, respectively. The user-specified minimum average-utility threshold is set at 20%, and the minimum average-utility value was calculated as 25.4 for the original database. The HAUP tree constructed in a batch way was shown in Figure 2-2.
To demonstrate how the incremental algorithm maintains the HAUP tree, suppose the two new transactions shown in Table 3-1 are inserted into the original database. The proposed algorithm carries out the construction phase as follows.