
CHAPTER 3 Mining High Average-Utility Itemsets

3.2 An Example

In this section, an example is given to demonstrate the proposed mining algorithm based on the average-utility of items. It shows how the algorithm finds the high average-utility itemsets from a set of transactions.

Assume the ten transactions shown in Table 3-4 are used for mining. Each transaction consists of two features, transaction identification (TID) and items purchased.

Table 3-4: The set of ten transaction data for this example.

Also assume that the predefined profit value for each single item is defined in Table 3-5.

Table 3-5: The predefined profit values of the items.

Item Profit

Moreover, the minimum average-utility threshold is set at 45.4, which is 20% of the total utility (227). In order to find the high average-utility itemsets from the data in Table 3-4, the proposed mining algorithm proceeds as follows.

STEP 1: The utility value of each item occurring in each transaction in Table 3-4 is calculated. Take item B in transaction 7 as an example. The quantity of item B in transaction 7 is 2, and its profit is 10. The utility value of B is thus calculated as 2*10, which is 20. The utility values of all the items in each transaction are shown in Table 3-6.

Table 3-6: The utility values of all the items in each transaction.

TID A B C D E

t1 3 10 4 6 0

t2 0 10 0 18 0

t3 6 0 0 6 0

t4 0 0 1 0 0

t5 3 20 0 6 15

t6 3 10 1 6 5

t7 0 20 3 0 5

t8 0 0 0 6 10

t9 21 0 1 6 0

t10 0 10 1 6 5

STEP 2: The utility values of the items in each transaction are compared and the maximal utility value in the transaction is found. Take transaction 1 as an example. It can be observed from Table 3-6 that the utility value of B is 10, which is the maximal in transaction 1. The maximal utility value in each transaction is shown in Table 3-7.

Table 3-7: The maximal utility values in each transaction of all the given ten transactions.

TID A B C D E Maximal Utility Value in Transaction

t1 3 10 4 6 0 10

t2 0 10 0 18 0 18

t3 6 0 0 6 0 6

t4 0 0 1 0 0 1

t5 3 20 0 6 15 20

t6 3 10 1 6 5 10

t7 0 20 3 0 5 20

t8 0 0 0 6 10 10

t9 21 0 1 6 0 21

t10 0 10 1 6 5 10
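STEP 2 amounts to a row-wise maximum. The following is a minimal Python sketch over the utility values of Table 3-6; the names `UTILITIES` and `maximal_utilities` are illustrative and do not come from the original algorithm description.

```python
# Utility values copied from Table 3-6; columns are items A, B, C, D, E.
UTILITIES = {
    "t1": (3, 10, 4, 6, 0),
    "t2": (0, 10, 0, 18, 0),
    "t3": (6, 0, 0, 6, 0),
    "t4": (0, 0, 1, 0, 0),
    "t5": (3, 20, 0, 6, 15),
    "t6": (3, 10, 1, 6, 5),
    "t7": (0, 20, 3, 0, 5),
    "t8": (0, 0, 0, 6, 10),
    "t9": (21, 0, 1, 6, 0),
    "t10": (0, 10, 1, 6, 5),
}


def maximal_utilities(utilities):
    """Return the maximal item utility of every transaction (STEP 2)."""
    return {tid: max(row) for tid, row in utilities.items()}


mu = maximal_utilities(UTILITIES)
print(mu["t1"], mu["t2"], mu["t9"])  # 10 18 21
```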

STEP 3: The average-utility upper bound of 1-itemsets is calculated. Take item A as an example. It appears in transactions 1, 3, 5, 6 and 9. The average-utility upper bound of A is thus the total amount of the maximal utility values of these transactions. It is calculated as 10 + 6 + 20 + 10 + 21, which is 67. The upper-bound values of all the items are shown in Table 3-8.

Table 3-8: The average-utility upper bounds of 1-itemsets.
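The upper-bound computation of STEP 3 can be sketched in Python as follows. This is only an illustration: the function name `upper_bound` is hypothetical, and an item is assumed to occur in a transaction exactly when its utility in Table 3-6 is positive.

```python
ITEMS = "ABCDE"
# Utility values copied from Table 3-6; columns are items A, B, C, D, E.
UTILITIES = {
    "t1": (3, 10, 4, 6, 0),
    "t2": (0, 10, 0, 18, 0),
    "t3": (6, 0, 0, 6, 0),
    "t4": (0, 0, 1, 0, 0),
    "t5": (3, 20, 0, 6, 15),
    "t6": (3, 10, 1, 6, 5),
    "t7": (0, 20, 3, 0, 5),
    "t8": (0, 0, 0, 6, 10),
    "t9": (21, 0, 1, 6, 0),
    "t10": (0, 10, 1, 6, 5),
}


def upper_bound(itemset, utilities):
    """Sum the maximal utilities of the transactions containing every item of itemset."""
    total = 0
    for row in utilities.values():
        by_item = dict(zip(ITEMS, row))
        # Assumption: an item is present iff its utility is positive.
        if all(by_item[item] > 0 for item in itemset):
            total += max(row)
    return total


bounds = {item: upper_bound(item, UTILITIES) for item in ITEMS}
print(bounds)  # {'A': 67, 'B': 88, 'C': 72, 'D': 105, 'E': 70}
```

The same function applies unchanged to longer itemsets, for example `upper_bound("AB", UTILITIES)` gives 40.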

STEP 4: The average-utility upper bound of each 1-itemset is checked against the user-defined minimum average-utility threshold, which is 45.4. In this example, the average-utility upper bounds of all the 1-itemsets are larger than the threshold. All the items are thus recorded as candidate average-utility 1-itemsets, C1, shown in Table 3-9.

Table 3-9: The candidate average-utility 1-itemsets, C1.

STEP 5: The variable r is set at 1, where r is used to represent the number of items in the current candidate average-utility itemsets to be processed.

STEP 6: The candidate average-utility 2-itemsets (C2) are then generated from C1. They are {AB}, {AC}, {AD}, {AE}, {BC}, {BD}, {BE}, {CD}, {CE}, {DE}.

STEP 7: The average-utility upper bound of each 2-itemset is calculated. Take the itemset {AB} as an example. It appears in transactions 1, 5 and 6. The average-utility upper bound of {AB} is thus the total amount of the maximal utility values of these transactions as 10 + 20 + 10, which is 40. The upper-bound values of all the 2-itemsets are shown in Table 3-10.

Table 3-10: The average-utility upper bounds of the 2-itemsets.


STEP 8: The average-utility upper bound of each 2-itemset is then checked against the user-defined minimum average-utility threshold. In this example, the upper bounds of the itemsets {AB}, {AC}, {AE} and {CE} do not reach the threshold. These itemsets are thus removed from C2. The remaining candidate average-utility 2-itemsets are shown in Table 3-11.

Table 3-11: The remaining candidate average-utility 2-itemsets, C2.

STEP 9: The candidate average-utility 3-itemsets, C3, and their average-utility upper bounds are generated from C2, as shown in Table 3-12.

Table 3-12: The average-utility upper bounds of the 3-itemsets.


Since the average-utility upper bounds of the two candidate 3-itemsets are both less than the minimum average-utility threshold (45.4), they are removed from C3 and C3 becomes empty. After this step, all the candidate average-utility itemsets are shown in Table 3-13.

Table 3-13: All the candidate average-utility itemsets in the example.

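The level-wise search of STEPs 3 to 9 can be sketched in Python as below. This is a minimal illustration rather than the authors' implementation: the names `upper_bound` and `mine_candidates` are hypothetical, item presence is inferred from a positive utility in Table 3-6, and the join step uses the usual rule that a (k+1)-itemset is a candidate only if all of its k-subsets survived.

```python
from itertools import combinations

ITEMS = "ABCDE"
# Utility values copied from Table 3-6; columns are items A, B, C, D, E.
UTILITIES = {
    "t1": (3, 10, 4, 6, 0), "t2": (0, 10, 0, 18, 0), "t3": (6, 0, 0, 6, 0),
    "t4": (0, 0, 1, 0, 0), "t5": (3, 20, 0, 6, 15), "t6": (3, 10, 1, 6, 5),
    "t7": (0, 20, 3, 0, 5), "t8": (0, 0, 0, 6, 10), "t9": (21, 0, 1, 6, 0),
    "t10": (0, 10, 1, 6, 5),
}
THRESHOLD = 45.4  # 20% of the total utility (227)


def upper_bound(itemset, utilities):
    """Sum the maximal utilities of the transactions containing itemset."""
    total = 0
    for row in utilities.values():
        by_item = dict(zip(ITEMS, row))
        if all(by_item[item] > 0 for item in itemset):
            total += max(row)
    return total


def mine_candidates(utilities, threshold):
    """Keep, level by level, the itemsets whose upper bound reaches the threshold."""
    kept = {}
    level = [frozenset(item) for item in ITEMS]
    k = 1
    while level:
        survivors = {}
        for itemset in level:
            bound = upper_bound(itemset, utilities)
            if bound >= threshold:
                survivors[itemset] = bound
        kept.update(survivors)
        # Join step: every k-subset of a (k+1)-candidate must have survived.
        next_level = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in survivors for sub in combinations(union, k)
            ):
                next_level.add(union)
        level = sorted(next_level, key=sorted)
        k += 1
    return kept


candidates = mine_candidates(UTILITIES, THRESHOLD)
names = sorted("".join(sorted(s)) for s in candidates)
print(names)
# ['A', 'AD', 'B', 'BC', 'BD', 'BE', 'C', 'CD', 'D', 'DE', 'E']
```

On this data the two 3-itemsets reaching the third level are {BCD} and {BDE}, with upper bounds 30 and 40, so neither survives.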

STEP 10: The actual average-utility value au_s of each candidate average-utility itemset s is calculated. Take the itemset {AD} as an example. The actual utility values of items A and D in transaction 1 are 3 and 6, respectively. Since the itemset {AD} contains 2 items, its actual average-utility value in transaction 1 is calculated as (3 + 6) / 2, which is 4.5. The itemset {AD} appears in transactions 1, 3, 5, 6 and 9. The actual average-utility value of {AD} is thus the sum of its actual average-utility values in these transactions. The value is calculated as (9 + 12 + 9 + 9 + 27) / 2, which is 33. The actual average-utility value of each candidate average-utility itemset is shown in Table 3-14.

Table 3-14: The actual average-utility values of the candidate average-utility itemsets.

STEP 11: The actual average-utility value of each candidate average-utility itemset is then compared with the user-defined minimum average-utility threshold. In this example, the actual average-utility values of the itemsets {B}, {D} and {BD} are larger than or equal to the threshold (45.4). They are thus put into the set of high average-utility itemsets, H, as shown in Table 3-15.

Table 3-15: High average-utility itemsets.

High Average-Utility Itemset    Average-Utility

B 80

D 60

BD 51
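STEPs 10 and 11 can be checked with a short Python sketch. The function name `average_utility` is illustrative, the candidate list is the eleven itemsets that survive STEPs 4, 8 and 9, and item presence is again inferred from a positive utility in Table 3-6.

```python
ITEMS = "ABCDE"
# Utility values copied from Table 3-6; columns are items A, B, C, D, E.
UTILITIES = {
    "t1": (3, 10, 4, 6, 0), "t2": (0, 10, 0, 18, 0), "t3": (6, 0, 0, 6, 0),
    "t4": (0, 0, 1, 0, 0), "t5": (3, 20, 0, 6, 15), "t6": (3, 10, 1, 6, 5),
    "t7": (0, 20, 3, 0, 5), "t8": (0, 0, 0, 6, 10), "t9": (21, 0, 1, 6, 0),
    "t10": (0, 10, 1, 6, 5),
}
THRESHOLD = 45.4
CANDIDATES = ["A", "B", "C", "D", "E", "AD", "BC", "BD", "BE", "CD", "DE"]


def average_utility(itemset, utilities):
    """Total utility of itemset in the transactions containing it, divided by |itemset|."""
    total = 0
    for row in utilities.values():
        by_item = dict(zip(ITEMS, row))
        if all(by_item[item] > 0 for item in itemset):
            total += sum(by_item[item] for item in itemset)
    return total / len(itemset)


high = {}
for itemset in CANDIDATES:
    value = average_utility(itemset, UTILITIES)
    if value >= THRESHOLD:  # STEP 11
        high[itemset] = value

print(average_utility("AD", UTILITIES))  # 33.0
print(high)  # {'B': 80.0, 'D': 60.0, 'BD': 51.0}
```

The itemset {BE} narrowly misses the threshold, with an average-utility of 45 against 45.4.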

In this example, three high average-utility itemsets are generated. Note that if the traditional utility criterion is used, the results will be {B}, {D}, {AD}, {BC}, {BD}, {BE} and {DE}. The number of high average-utility itemsets is thus smaller than the number of high utility itemsets. From the perspective of average utility, the utility value of an itemset does not grow simply because the itemset becomes longer. The item combination in a high average-utility itemset therefore reflects a combination that is genuinely profitable.

CHAPTER 4 Incremental Utility Mining Algorithm for Record Insertion

The proposed incremental average-utility mining algorithm is based on the four cases of the FUP concept, adapted to average-utility itemsets. It consists of two phases. In the first phase, the average-utility upper bound is used to overestimate the average-utility of itemsets. The upper bound is an overestimate rather than the actual utility value, but it satisfies the anti-monotone property, which is used to decrease the number of itemsets to be scanned level by level. The itemsets whose average-utility upper bounds are larger than or equal to the user-defined threshold are called “high upper-bound average-utility itemsets”; the others are called “low upper-bound average-utility itemsets”. Each subset of a high upper-bound average-utility itemset is certainly a high upper-bound average-utility itemset, and each superset of a low upper-bound average-utility itemset is certainly a low upper-bound average-utility itemset. Many low upper-bound average-utility itemsets can thus be pruned level by level, which decreases the time needed to scan the database.
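The two properties that make the upper bound usable for pruning can be checked by brute force on the data of Table 3-6. The Python sketch below is only an illustration under the assumption that an item occurs in a transaction exactly when its utility is positive; all function names are hypothetical.

```python
from itertools import combinations

ITEMS = "ABCDE"
UTILITIES = [  # rows t1..t10 of Table 3-6, columns A..E
    (3, 10, 4, 6, 0), (0, 10, 0, 18, 0), (6, 0, 0, 6, 0), (0, 0, 1, 0, 0),
    (3, 20, 0, 6, 15), (3, 10, 1, 6, 5), (0, 20, 3, 0, 5), (0, 0, 0, 6, 10),
    (21, 0, 1, 6, 0), (0, 10, 1, 6, 5),
]


def covering(itemset):
    """Rows in which every item of itemset has a positive utility."""
    rows = [dict(zip(ITEMS, row)) for row in UTILITIES]
    return [r for r in rows if all(r[item] > 0 for item in itemset)]


def upper_bound(itemset):
    return sum(max(r.values()) for r in covering(itemset))


def average_utility(itemset):
    total = sum(sum(r[item] for item in itemset) for r in covering(itemset))
    return total / len(itemset)


def bound_properties_hold():
    for k in range(1, len(ITEMS) + 1):
        for s in combinations(ITEMS, k):
            if average_utility(s) > upper_bound(s):
                return False  # the bound would underestimate the actual value
            for extra in ITEMS:
                if extra not in s and upper_bound(s + (extra,)) > upper_bound(s):
                    return False  # a superset would have a larger bound
    return True


print(bound_properties_hold())  # True
```

Both checks follow from the definitions: the average of an itemset's utilities in a transaction cannot exceed that transaction's maximal utility, and a superset is contained in no more transactions than the itemset itself.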

The proposed algorithm first scans the new transactions to obtain all candidate 1-itemsets with their average-utility upper bounds. The candidate 1-itemsets are then checked against the minimum average-utility threshold to decide whether they are high upper-bound average-utility itemsets for the new transactions.

For each 1-itemset in the set of high upper-bound average-utility itemsets from the original database, if it appears in the set of high upper-bound average-utility 1-itemsets in the new transactions, it belongs to Case 1 (Large - Large) which is similar to that mentioned in Table 2-1. Thus, it is still a high upper-bound average-utility itemset for the whole updated database. The updated average-utility upper bound of the itemset can easily be obtained by using addition.

For each 1-itemset in the set of high upper-bound average-utility itemsets from the original database, if it does not appear in the set of high upper-bound average-utility 1-itemsets in the new transactions, it belongs to Case 2 (Large - Small). The updated average-utility upper bound of the itemset is thus re-calculated and checked against the minimum average-utility threshold to determine whether it is a high upper-bound average-utility itemset in the updated database.

For each itemset in the set of high upper-bound average-utility itemsets in the new transactions, if it does not appear in the set of high upper-bound average-utility itemsets in the original database, it belongs to Case 3 (Small - Large). A database rescan is needed to determine the average-utility upper bound of the itemset for the original database. The upper-bound value is then re-calculated and checked against the minimum average-utility threshold to determine whether it is a high upper-bound average-utility itemset in the updated database.

All the high upper-bound average-utility 1-itemsets for the whole updated database are then formed.

Next, candidate 2-itemsets based on the high upper-bound average-utility 1-itemsets from the new transactions are generated. The same procedure is repeated, each time with one more item added, until no high upper-bound average-utility itemsets are formed. After the first phase, all the high upper-bound average-utility itemsets for the whole updated database are formed.

Then the second phase begins. In this phase, the actual average-utility values of the high upper-bound average-utility itemsets are calculated. Also, these itemsets are checked against the minimum average-utility threshold to determine whether they are actually high or not. All the actual high average-utility itemsets can thus be found.

Our incremental utility mining algorithm can reduce the time to re-process the whole updated database when compared with conventional batch utility mining algorithms. The details of the proposed incremental average-utility mining algorithm are described below.

4.1 Notation

Notation used in this algorithm is described as follows:

D : the original database;

N : the set of new transactions;

U : the entire updated database, i.e., D ∪ N;

d : the number of transactions in D;

n : the number of transactions in N;

λ : the minimum average-utility ratio;

α^D: the minimum average-utility threshold defined in the original database;

α^N: the minimum average-utility threshold for the new transactions;

α^U: the minimum average-utility threshold for the updated database;

HU_k^D: the set of high upper-bound average-utility k-itemsets in the original database;

HU^D: the set of high upper-bound average-utility itemsets in the original database;

HU_k^N: the set of high upper-bound average-utility k-itemsets in the new transactions;

HU^N: the set of high upper-bound average-utility itemsets in the new transactions;

HU_k^U: the set of high upper-bound average-utility k-itemsets in the updated database;

HU^U: the set of high upper-bound average-utility itemsets in the updated database;

H^D: the set of high average-utility itemsets in the original database;

H^U: the set of high average-utility itemsets in the updated database;

mu_k: the maximal utility value in transaction T_k;

s: an itemset;

ub^D(s): the average-utility upper bound of itemset s in the original database;

ub^N(s): the average-utility upper bound of itemset s in the new transactions;

ub^U(s): the average-utility upper bound of itemset s in the updated database;

au^D(s): the actual average-utility value of itemset s in the original database;

au^N(s): the actual average-utility value of itemset s in the new transactions;

au^U(s): the actual average-utility value of itemset s in the updated database;

4.2 The Proposed Incremental Utility Mining Algorithm for Record Insertion

INPUT: The profit values of the items, the minimum average-utility ratio λ, an original database D with its minimum average-utility threshold α^D (= total utility × λ), its high upper-bound average-utility itemsets (HU^D) and high average-utility itemsets (H^D), and a set of new transactions N = {T_1, T_2, …, T_n}.

OUTPUT: A set of high average-utility itemsets (H^U) for the updated database U (= D ∪ N).

STEP 1: Calculate the minimum average-utility thresholds (α^N and α^U) for the new transactions N and for the updated database U, respectively, as follows:

α^N = α^D × n / d and α^U = α^D × (1 + n / d),

where α^D is the minimum average-utility threshold for the original database, d is the number of transactions in the original database, and n is the number of new transactions.
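As a small numerical illustration of STEP 1, the Python sketch below assumes that both thresholds scale with the number of transactions, i.e. α^N = α^D × n / d and α^U = α^D × (1 + n / d). This reading is an assumption; it makes α^U equal to α^D + α^N, which is what Case 1 requires. The function name `incremental_thresholds` is hypothetical, and the input values are those of the example in Section 4.3.

```python
def incremental_thresholds(alpha_d, d, n):
    """Return (alpha_n, alpha_u) under the assumed proportional scaling."""
    alpha_n = alpha_d * n / d        # threshold for the n new transactions
    alpha_u = alpha_d * (1 + n / d)  # threshold for the d + n updated transactions
    return alpha_n, alpha_u


alpha_n, alpha_u = incremental_thresholds(45.4, d=10, n=3)
print(round(alpha_n, 2), round(alpha_u, 2))  # 13.62 59.02
```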

STEP 2: Calculate the utility value u_jk of each item I_j in each new transaction T_k as u_jk = q_jk * p_j, where q_jk is the quantity of I_j in T_k, p_j is the profit value of I_j, j = 1 to m and k = 1 to n.

STEP 3: Find the maximal item-utility value mu_k in each new transaction T_k as mu_k = max{u_1k, u_2k, …, u_mk}, k = 1 to n.

STEP 4: Set k = 1, where k records the number of items in the itemsets currently being processed.

STEP 5: Generate the candidate k-itemsets and calculate their average-utility upper bounds from the new transactions. The average-utility upper bound ub^N(s) of each candidate k-itemset s is set as the summation of the maximal item-utilities of the transactions which include s. That is:

ub^N(s) = Σ_{T_k ∈ N, s ⊆ T_k} mu_k.

STEP 6: Check whether the average-utility upper bound of each candidate k-itemset s from the new transactions is larger than or equal to α^N. If s satisfies the above condition, put it in the set of high upper-bound average-utility k-itemsets for the new transactions, HU_k^N.
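STEPs 5 and 6 can be illustrated for k = 1 on the new transactions of Section 4.3. The Python sketch below is not the authors' code: the function names are hypothetical, item presence is inferred from a positive utility in Table 4-5, and the threshold value 13.62 assumes α^N = 45.4 × 3 / 10.

```python
# Utility values of the new transactions, copied from Table 4-5.
NEW_UTILITIES = {
    "t11": {"A": 3, "B": 10, "C": 1, "D": 12, "E": 0},
    "t12": {"A": 0, "B": 30, "C": 4, "D": 6, "E": 0},
    "t13": {"A": 6, "B": 0, "C": 2, "D": 0, "E": 5},
}
ALPHA_N = 13.62  # assumed threshold for the new transactions


def upper_bound(itemset, utilities):
    """STEP 5: sum of mu_k over the transactions that include itemset."""
    return sum(
        max(row.values())
        for row in utilities.values()
        if all(row[item] > 0 for item in itemset)
    )


def high_upper_bound(candidates, utilities, threshold):
    """STEP 6: keep the candidates whose upper bound reaches the threshold."""
    result = {}
    for itemset in candidates:
        bound = upper_bound(itemset, utilities)
        if bound >= threshold:
            result[itemset] = bound
    return result


hu_1_n = high_upper_bound("ABCDE", NEW_UTILITIES, ALPHA_N)
print(hu_1_n)  # {'A': 18, 'B': 42, 'C': 48, 'D': 42}
```

Item E occurs only in t13, whose maximal utility is 6, so it is not a high upper-bound itemset in the new transactions.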

STEP 7: For each k-itemset s in the set of high upper-bound average-utility itemsets (HU_k^D) from the original database, if it appears in the set of high upper-bound average-utility k-itemsets (HU_k^N) in the new transactions, do the following substeps.

Substep 7-1: Set the newly updated average-utility upper bound of itemset s as:

ub^U(s) = ub^D(s) + ub^N(s).

Substep 7-2: Put s in the set of updated high upper-bound average-utility k-itemsets, HU_k^U.

STEP 8: For each k-itemset s in the set of high upper-bound average-utility itemsets (HU_k^D) from the original database, if it does not appear in the set of high upper-bound average-utility k-itemsets (HU_k^N) in the new transactions, do the following substeps.

Substep 8-1: Set the updated average-utility upper bound of itemset s as:

ub^U(s) = ub^D(s) + ub^N(s).

Substep 8-2: Check whether the average-utility upper bound of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high upper-bound average-utility k-itemsets, HU_k^U.

STEP 9: For each k-itemset s in the set of high upper-bound average-utility itemsets (HU_k^N) in the new transactions, if it does not appear in the set of high upper-bound average-utility k-itemsets (HU_k^D) in the original database, do the following substeps.

Substep 9-1: Rescan the original database to determine the average-utility upper bound (ub^D(s)) of itemset s.

Substep 9-2: Set the updated average-utility upper bound of itemset s as:

ub^U(s) = ub^D(s) + ub^N(s).

Substep 9-3: Check whether the average-utility upper bound of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high upper-bound average-utility k-itemsets, HU_k^U.
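The three cases handled in STEPs 7 to 9 can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: `update_upper_bounds` and `rescan` are hypothetical names, the rescan of the original database is abstracted as a callable, and the thresholds 13.62 and 59.02 assume α^N = α^D × n / d and α^U = α^D × (1 + n / d). The original upper bounds are taken from Table 4-3 and the new ones are computed from Table 4-5.

```python
def update_upper_bounds(ub_d, ub_n, alpha_n, alpha_u, rescan):
    """Merge k-itemset upper bounds following Cases 1 to 3.

    ub_d   -- upper bounds of the itemsets in HU_k^D
    ub_n   -- upper bounds of all candidate k-itemsets in the new transactions
    rescan -- callable returning ub^D(s) for an itemset absent from HU_k^D
    """
    hu_n = {s for s, bound in ub_n.items() if bound >= alpha_n}
    hu_u = {}
    for s, bound in ub_d.items():
        total = bound + ub_n.get(s, 0)
        if s in hu_n:
            hu_u[s] = total  # Case 1 (STEP 7): high in D and in N
        elif total >= alpha_u:
            hu_u[s] = total  # Case 2 (STEP 8): high in D only, re-checked
    for s in hu_n - set(ub_d):
        total = rescan(s) + ub_n[s]  # Case 3 (STEP 9): ub^D(s) needs a rescan
        if total >= alpha_u:
            hu_u[s] = total
    return hu_u


def no_rescan(itemset):
    raise LookupError(f"unexpected rescan for {itemset}")


UB_D = {"A": 67, "B": 88, "C": 72, "D": 105, "E": 70}  # 1-itemsets of Table 4-3
UB_N = {"A": 18, "B": 42, "C": 48, "D": 42, "E": 6}    # computed from Table 4-5
hu_1_u = update_upper_bounds(UB_D, UB_N, 13.62, 59.02, no_rescan)
print(hu_1_u)  # {'A': 85, 'B': 130, 'C': 120, 'D': 147, 'E': 76}
```

In this run items A to D fall into Case 1, item E falls into Case 2 and still passes, and no 1-itemset requires a rescan.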

STEP 10: Generate the candidate (k+1)-itemsets from the set of high upper-bound average-utility k-itemsets (HU_k^N) in the new transactions. If any k-sub-itemset of a candidate (k+1)-itemset is not contained in the set of updated high upper-bound average-utility k-itemsets (HU_k^U), remove the candidate from the candidate set.
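The candidate generation of STEP 10 can be sketched in Python as below. The function name `next_candidates` is hypothetical, and the example sets assume the outcome of the first level under the thresholds α^N = 13.62 and α^U = 59.02, namely HU_1^N = {A, B, C, D} and HU_1^U = {A, B, C, D, E}.

```python
from itertools import combinations


def next_candidates(hu_k_n, hu_k_u, k):
    """Join HU_k^N itemsets into (k+1)-itemsets whose k-subsets are all in HU_k^U."""
    candidates = set()
    for a, b in combinations(hu_k_n, 2):
        union = frozenset(a) | frozenset(b)
        if len(union) != k + 1:
            continue
        if all(frozenset(sub) in hu_k_u for sub in combinations(union, k)):
            candidates.add(union)
    return candidates


hu_1_n = {frozenset("A"), frozenset("B"), frozenset("C"), frozenset("D")}
hu_1_u = hu_1_n | {frozenset("E")}
c2 = next_candidates(hu_1_n, hu_1_u, k=1)
print(sorted("".join(sorted(s)) for s in c2))
# ['AB', 'AC', 'AD', 'BC', 'BD', 'CD']
```

Because the join starts from HU_k^N, no candidate containing E is generated here, even though E remains in HU_1^U.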

STEP 11: Set k = k+1.

STEP 12: Repeat STEPs 5 to 11 until no new candidate itemsets are generated.

STEP 13: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it appears in the set of high upper-bound average-utility itemsets (HU^D) of the original database, do the following substeps.

Substep 13-1: Calculate the actual average-utility value of each itemset s for the new transactions as:

au^N(s) = ( Σ_{T_k ∈ N, s ⊆ T_k} Σ_{I_j ∈ s} u_jk ) / |s|,

where u_jk is the utility value of each item I_j in transaction T_k and |s| is the number of items in s.

Substep 13-2: Set the new actual average-utility value of s in the updated database as:

au^U(s) = au^D(s) + au^N(s).

Substep 13-3: Check whether the actual average-utility value of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high average-utility itemsets, H^U.

STEP 14: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it does not appear in the set of high upper-bound average-utility itemsets (HU^D) of the original database, do the following substeps.

Substep 14-1: Calculate the actual average-utility value of each itemset s for the new transactions as:

au^N(s) = ( Σ_{T_k ∈ N, s ⊆ T_k} Σ_{I_j ∈ s} u_jk ) / |s|,

where u_jk is the utility value of each item I_j in transaction T_k and |s| is the number of items in s.

Substep 14-2: Rescan the original database to determine the actual average-utility value au^D(s) of itemset s.

Substep 14-3: Set the new actual average-utility value of s in the updated database as:

au^U(s) = au^D(s) + au^N(s).

Substep 14-4: Check whether the actual average-utility value of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high average-utility itemsets, H^U.

After Step 14, the final updated high average-utility itemsets for the updated database can then be found.
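The second phase (STEPs 13 and 14) can be sketched in Python as follows. This is an illustration under stated assumptions: the names `second_phase` and `rescan` are hypothetical, the stored au^D values come from Table 4-3, the new-transaction utilities come from Table 4-5, and the threshold 59.02 assumes α^U = 45.4 × (1 + 3 / 10). Only a few itemsets that pass the first phase under these assumptions are used.

```python
# Utility values of the new transactions, copied from Table 4-5.
NEW_UTILITIES = {
    "t11": {"A": 3, "B": 10, "C": 1, "D": 12, "E": 0},
    "t12": {"A": 0, "B": 30, "C": 4, "D": 6, "E": 0},
    "t13": {"A": 6, "B": 0, "C": 2, "D": 0, "E": 5},
}
# Stored au^D(s) of itemsets in HU^D, copied from Table 4-3.
AU_D = {"A": 36, "B": 80, "C": 11, "D": 60, "E": 40, "BD": 51}


def average_utility(itemset, utilities):
    """au^N(s): utility of itemset in the transactions containing it, over |itemset|."""
    total = sum(
        sum(row[item] for item in itemset)
        for row in utilities.values()
        if all(row[item] > 0 for item in itemset)
    )
    return total / len(itemset)


def second_phase(hu_u, au_d, new_utilities, alpha_u, rescan):
    """Return the itemsets of HU^U whose updated average-utility reaches alpha_u."""
    high = {}
    for s in hu_u:
        au_n = average_utility(s, new_utilities)
        # STEP 13 reuses the stored au^D(s); STEP 14 obtains it by rescanning D.
        base = au_d[s] if s in au_d else rescan(s)
        au_u = base + au_n
        if au_u >= alpha_u:
            high[s] = au_u
    return high


high = second_phase(["A", "B", "C", "D", "E", "BD"], AU_D, NEW_UTILITIES,
                    59.02, rescan=lambda s: 0)
print(high)  # {'B': 120.0, 'D': 78.0, 'BD': 80.0}
```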

4.3 An Example

In this section, an example is given to demonstrate the proposed incremental average-utility mining algorithm for record insertion. This simple example shows how the proposed algorithm can be used to efficiently find high average-utility itemsets from incrementally arriving transaction data with fewer rescans of the original database.

Assume the original database includes 10 transactions, shown in Table 4-1. Each transaction consists of its transaction identification (TID) and items purchased. The numbers represent the quantities purchased.

Table 4-1: The set of ten transaction data in the original database.

Also assume that the profit value of each item is defined in Table 4-2.

Table 4-2: The predefined profit values of the items.

Item Profit

Suppose the minimum average-utility ratio λ is set at 20%. The minimum average-utility threshold is thus calculated as the total utility value multiplied by 20%, which is 45.4. Using the batch mining algorithm on the original database, the set of high upper-bound average-utility itemsets generated in Phase 1 is shown in Table 4-3. The average-utility upper bound and the actual average-utility value of each high upper-bound average-utility itemset are also recorded in Table 4-3.

Table 4-3: The average-utility upper bounds and the actual average-utility values of the high upper-bound average-utility itemsets from the original database.

Itemset    Average-Utility Upper Bound    Average-Utility

A 67 36

B 88 80

C 72 11

D 105 60

E 70 40

AD 67 33

BC 50 29.5

BD 68 51

BE 60 45

CD 51 15.5

DE 50 29.5

Assume the three new transactions shown in Table 4-4 are inserted after the initial data set is processed. The proposed incremental average-utility mining algorithm proceeds as follows.

Table 4-4: The three newly inserted transactions.

TID A B C D E

t11 1 1 1 2 0

t12 0 3 4 1 0

t13 2 0 2 0 1

STEP 1: The minimum average-utility thresholds (α^N and α^U) for the newly inserted transactions N and for the updated database U, respectively, are calculated. In this example, there are 3 newly inserted transactions and thus 13 (10 + 3) transactions in the updated database. According to the formulas, α^N and α^U are calculated as follows:

α^N = 45.4 × 3 / 10 = 13.62 and α^U = 45.4 × (1 + 3 / 10) = 59.02.

STEP 2: The utility value of each item occurring in each newly inserted transaction is calculated. Take item {D} in transaction 11 as an example. The quantity of item {D} in transaction 11 is 2, and its profit is 6. The utility value of {D} is thus calculated as 2*6, which is 12. The utility values of all the items in each newly inserted transaction are shown in Table 4-5.

Table 4-5: The utility values of all the items in the newly inserted transactions.

TID A B C D E

t11 3 10 1 12 0

t12 0 30 4 6 0

t13 6 0 2 0 5
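The utility values above, and the maximal utility of each new transaction needed in the next step, can be reproduced with a short Python sketch. The profit values used here are an assumption: they are inferred by dividing the utilities of Table 4-5 by the quantities of Table 4-4 (the text confirms only the profits of B and D). The function names are illustrative.

```python
# Assumed profit values, inferred from Tables 4-4 and 4-5.
PROFITS = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}
# Quantities of the newly inserted transactions, copied from Table 4-4.
NEW_QUANTITIES = {
    "t11": {"A": 1, "B": 1, "C": 1, "D": 2, "E": 0},
    "t12": {"A": 0, "B": 3, "C": 4, "D": 1, "E": 0},
    "t13": {"A": 2, "B": 0, "C": 2, "D": 0, "E": 1},
}


def item_utilities(quantities, profits):
    """Utility of each item in each transaction: quantity times profit."""
    return {
        tid: {item: q * profits[item] for item, q in row.items()}
        for tid, row in quantities.items()
    }


def maximal_utilities(utilities):
    """Maximal item utility of each transaction."""
    return {tid: max(row.values()) for tid, row in utilities.items()}


utilities = item_utilities(NEW_QUANTITIES, PROFITS)
mu = maximal_utilities(utilities)
print(utilities["t11"])  # {'A': 3, 'B': 10, 'C': 1, 'D': 12, 'E': 0}
print(mu)                # {'t11': 12, 't12': 30, 't13': 6}
```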

STEP 3: The utility values of the items in a transaction are compared and the maximal utility
