
CHAPTER 4 Incremental Utility Mining Algorithm for Record Insertion

4.2 The Proposed Incremental Utility Mining Algorithm for Record Insertion

INPUT: The profit values of the items, the minimum average-utility ratio, an original database D with its minimum average-utility threshold α^D (= total utility * minimum average-utility ratio), its high upper-bound average-utility itemsets (HU^D) and high average-utility itemsets (H^D), and a set of new transactions N = {T_1, T_2, …, T_n}.

OUTPUT: A set of high average-utility itemsets (H^U) for the updated database U (= D ∪ N).

STEP 1: Calculate the minimum average-utility thresholds α^N and α^U for the new transactions N and for the updated database U, respectively, as follows:

α^N = α^D * (n / d) and α^U = α^D * ((d + n) / d),

where α^D is the minimum average-utility threshold for the original database, d is the number of transactions in the original database, and n is the number of new transactions.

STEP 2: Calculate the utility value u_jk of each item I_j in each new transaction T_k as u_jk = q_jk * p_j, where q_jk is the quantity of I_j in T_k, p_j is the profit value of I_j, j = 1 to m and k = 1 to n.

STEP 3: Find the maximal item-utility value mu_k in each new transaction T_k as mu_k = max{u_1k, u_2k, …, u_mk}, k = 1 to n.

STEP 4: Set k = 1, where k records the number of items in the itemsets currently being processed.

STEP 5: Generate the candidate k-itemsets and calculate their average-utility upper bounds from the new transactions. The average-utility upper bound ub^N(s) of each candidate k-itemset s is set as the summation of the maximal item-utilities of the new transactions which include s. That is:

ub^N(s) = Σ_{T_k ∈ N, s ⊆ T_k} mu_k.

STEP 6: Check whether the average-utility upper bound of each candidate k-itemset s from the new transactions is larger than or equal to α^N. If s satisfies the above condition, put it in the set of high upper-bound average-utility k-itemsets for the new transactions, HU_k^N.

STEP 7: For each k-itemset s in the set of high upper-bound average-utility itemsets (HU_k^D) from the original database, if it appears in the set of high upper-bound average-utility k-itemsets (HU_k^N) in the new transactions, do the following substeps.

Substep 7-1: Set the newly updated average-utility upper bound of itemset s as:

ub^U(s) = ub^D(s) + ub^N(s).

Substep 7-2: Put s in the set of updated high upper-bound average-utility k-itemsets, HU_k^U.

STEP 8: For each k-itemset s in the set of high upper-bound average-utility itemsets (HU_k^D) from the original database, if it does not appear in the set of high upper-bound average-utility k-itemsets (HU_k^N) in the new transactions, do the following substeps.

Substep 8-1: Set the updated average-utility upper bound of itemset s as: ub^U(s) = ub^D(s) + ub^N(s).

Substep 8-2: Check whether the updated average-utility upper bound of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high upper-bound average-utility k-itemsets, HU_k^U.

STEP 9: For each k-itemset s in the set of high upper-bound average-utility itemsets (HU_k^N) in the new transactions, if it does not appear in the set of high upper-bound average-utility k-itemsets (HU_k^D) in the original database, do the following substeps.

Substep 9-1: Rescan the original database to determine the average-utility upper bound (ub^D(s)) of itemset s.

Substep 9-2: Set the updated average-utility upper bound of itemset s as:

ub^U(s) = ub^D(s) + ub^N(s).

Substep 9-3: Check whether the updated average-utility upper bound of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high upper-bound average-utility k-itemsets, HU_k^U.

STEP 10: Generate the candidate (k+1)-itemsets from the set of high upper-bound average-utility k-itemsets (HU_k^N) in the new transactions. If any k-sub-itemset of a candidate (k+1)-itemset is not contained in the set of updated high upper-bound average-utility k-itemsets (HU_k^U), remove the candidate from the candidate set.

STEP 11: Set k = k+1.

STEP 12: Repeat STEPs 5 to 11 until no new candidate itemsets are generated.

STEP 13: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it appears in the set of high upper-bound average-utility itemsets (HU^D) of the original database, do the following substeps.

Substep 13-1: Calculate the actual average-utility value of each itemset s for the new transactions as:

au^N(s) = Σ_{T_k ∈ N, s ⊆ T_k} ( Σ_{I_j ∈ s} u_jk ) / |s|,

where u_jk is the utility value of each item I_j in transaction T_k and |s| is the number of items in s.

Substep 13-2: Set the new actual average-utility value of s in the updated database as:

au^U(s) = au^D(s) + au^N(s).

Substep 13-3: Check whether the actual average-utility value of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high average-utility itemsets, H^U.

STEP 14: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it does not appear in the set of high upper-bound average-utility itemsets (HU^D) of the original database, do the following substeps.

Substep 14-1: Calculate the actual average-utility value of each itemset s for the new transactions as:

au^N(s) = Σ_{T_k ∈ N, s ⊆ T_k} ( Σ_{I_j ∈ s} u_jk ) / |s|,

where u_jk is the utility value of each item I_j in transaction T_k and |s| is the number of items in s.

Substep 14-2: Rescan the original database to determine the actual average-utility value au^D(s) of itemset s in the original database.

Substep 14-3: Set the new actual average-utility value of s in the updated database as:

au^U(s) = au^D(s) + au^N(s).

Substep 14-4: Check whether the actual average-utility value of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high average-utility itemsets, H^U.

After Step 14, the final updated high average-utility itemsets for the updated database can then be found.
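Step 7 places an itemset in HU_k^U without comparing it against α^U, whereas Steps 8 and 9 perform that check. The short derivation below sketches why the check can be skipped in Step 7. It assumes the Step 1 thresholds take the form α^N = α^D * n / d and α^U = α^D * (d + n) / d, which is the form that reproduces the values 13.62 and 59.02 in the example of Section 4.3.

```latex
% Assumption: s is in both HU_k^D and HU_k^N, so
% ub^D(s) >= alpha^D and ub^N(s) >= alpha^N.
\begin{align*}
ub^{U}(s) &= ub^{D}(s) + ub^{N}(s) \\
          &\ge \alpha^{D} + \alpha^{N} \\
          &= \alpha^{D} + \alpha^{D}\,\frac{n}{d}
           = \alpha^{D}\,\frac{d+n}{d}
           = \alpha^{U}.
\end{align*}
```

By the same argument with the inequalities reversed, an itemset that is in neither HU_k^D nor HU_k^N has ub^U(s) < α^U, which is why only itemsets in at least one of the two sets need to be examined.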

4.3 An Example

In this section, an example is given to demonstrate the proposed incremental average-utility mining algorithm for record insertion. This simple example shows how the proposed algorithm can find high average-utility itemsets from incrementally arriving transaction data with fewer rescans of the original database.

Assume the original database includes 10 transactions, shown in Table 4-1. Each transaction consists of its transaction identification (TID) and the items purchased. The numbers represent the quantities purchased.

Table 4-1: The set of ten transaction data in the original database.

Also assume that the profit value of each item is defined in Table 4-2.

Table 4-2: The predefined profit values of the items.

Item Profit

Suppose the minimum average-utility ratio is set at 20%. Thus, the minimum average-utility threshold is calculated as the total utility value multiplied by 20%, which is 45.4. Using the batch mining algorithm for the original database, the set of high upper-bound average-utility itemsets generated in Phase 1 is shown in Table 4-3. The average-utility upper bound and the actual average-utility value of each high upper-bound average-utility itemset are also recorded in Table 4-3.

Table 4-3: The average-utility upper bounds and the actual average-utility values of the high upper-bound average-utility itemsets from the original database.

Itemset   Average-Utility Upper Bound   Average-Utility
A         67                            36
B         88                            80
C         72                            11
D         105                           60
E         70                            40
AD        67                            33
BC        50                            29.5
BD        68                            51
BE        60                            45
CD        51                            15.5
DE        50                            29.5

Assume the three new transactions shown in Table 4-4 are inserted after the initial data set is processed. The proposed incremental average-utility mining algorithm proceeds as follows.

Table 4-4: The three newly inserted transactions.

TID A B C D E

t11 1 1 1 2 0

t12 0 3 4 1 0

t13 2 0 2 0 1

STEP 1: The minimum average-utility thresholds α^N and α^U for the newly inserted transactions N and for the updated database U, respectively, are calculated. In this example, there are 3 newly inserted transactions and thus 13 (10+3) transactions in the updated database. According to the formulas, α^N and α^U are calculated as follows:

α^N = 45.4 * (3 / 10) = 13.62 and α^U = 45.4 * ((10 + 3) / 10) = 59.02.
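The threshold update of Step 1 can be checked with a few lines of Python. This is a minimal sketch, assuming the thresholds scale with the number of transactions as α^N = α^D * n / d and α^U = α^D * (d + n) / d, which reproduces the values 13.62 and 59.02 quoted in the later steps. The function name `incremental_thresholds` is hypothetical and not part of the original algorithm description.

```python
def incremental_thresholds(alpha_d: float, d: int, n: int) -> tuple[float, float]:
    """Return (alpha_N, alpha_U) derived from the original threshold alpha_D.

    alpha_d: minimum average-utility threshold of the original database
    d:       number of transactions in the original database
    n:       number of newly inserted transactions
    """
    if d <= 0:
        raise ValueError("the original database must contain at least one transaction")
    alpha_n = alpha_d * n / d          # threshold for the new transactions
    alpha_u = alpha_d * (d + n) / d    # threshold for the updated database
    return alpha_n, alpha_u


# Values from the example: alpha_D = 45.4, d = 10, n = 3.
alpha_n, alpha_u = incremental_thresholds(45.4, d=10, n=3)
print(round(alpha_n, 2), round(alpha_u, 2))  # 13.62 59.02
```

Because both thresholds are derived from α^D alone, no pass over the original database is needed at this point.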

STEP 2: The utility value of each item occurring in each newly inserted transaction is calculated. Take item {D} in transaction 11 as an example. The quantity of item {D} in transaction 11 is 2, and its profit is 6. The utility value of {D} is thus calculated as 2*6, which is 12. The utility values of all the items in each newly inserted transaction are shown in Table 4-5.

Table 4-5: The utility values of all the items in the newly inserted transactions.

TID A B C D E

t11 3 10 1 12 0

t12 0 30 4 6 0

t13 6 0 2 0 5

STEP 3: The utility values of the items in a transaction are compared and the maximal utility value in the transaction is found. Take transaction 12 as an example. It can be observed from Table 4-5 that the utility value of {B} is 30, which is the maximal in transaction 12. The maximal utility value in each transaction is shown in Table 4-6.

Table 4-6: The maximal utility values in the newly inserted transactions.

TID   A   B    C   D    E   Maximal Utility Value in a Transaction
t11   3   10   1   12   0   12
t12   0   30   4   6    0   30
t13   6   0    2   0    5   6
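Steps 2 and 3 amount to a per-transaction multiplication followed by a maximum. The sketch below illustrates this on the three new transactions. The quantities are copied from Table 4-4; the profit values are an assumption, inferred by dividing the utilities in Table 4-5 by the quantities in Table 4-4 (for example, 12 / 2 = 6 for item D). The function names are hypothetical.

```python
# Profit per item: inferred from Tables 4-4 and 4-5 (assumption, see lead-in).
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}

# Purchased quantities of the new transactions, copied from Table 4-4.
NEW_TRANSACTIONS = {
    "t11": {"A": 1, "B": 1, "C": 1, "D": 2, "E": 0},
    "t12": {"A": 0, "B": 3, "C": 4, "D": 1, "E": 0},
    "t13": {"A": 2, "B": 0, "C": 2, "D": 0, "E": 1},
}


def item_utilities(quantities, profit):
    """STEP 2: u_jk = q_jk * p_j for every item that occurs in the transaction."""
    return {item: q * profit[item] for item, q in quantities.items() if q > 0}


def maximal_utility(utilities):
    """STEP 3: mu_k is the largest item utility in the transaction."""
    return max(utilities.values(), default=0)


utilities = {tid: item_utilities(q, PROFIT) for tid, q in NEW_TRANSACTIONS.items()}
max_utils = {tid: maximal_utility(u) for tid, u in utilities.items()}
print(max_utils)  # {'t11': 12, 't12': 30, 't13': 6}
```

Items with quantity 0 are dropped from the utility dictionaries, so membership in a dictionary doubles as the test for whether a transaction contains an item.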

STEP 4: k is set to 1, where k is used to record the number of items in the itemsets currently being processed.

STEP 5: The average-utility upper bounds of the 1-itemsets in the newly inserted transactions are first calculated. Take item {A} as an example. It appears in transactions 11 and 13. The average-utility upper bound of {A} is thus the sum of the maximal utility values of these transactions. It is calculated as 12+6 (=18) in the example. The upper-bound values of all the items in the new transactions are shown in Table 4-7.

Table 4-7: The average-utility upper bounds of the 1-itemsets in the new transactions.

1-Itemset Average-Utility Upper Bound

A 18

B 42

C 48

D 42

E 6

STEP 6: The average-utility upper bounds of the 1-itemsets are checked against the minimum average-utility threshold α^N (which is 13.62) for the new transactions. In this example, the upper bounds of the four 1-itemsets {A}, {B}, {C} and {D} are larger than α^N. The four items are then put in the set of high upper-bound average-utility 1-itemsets for the new transactions, HU₁^N, which is shown in Table 4-8.

Table 4-8: The set of high upper-bound average-utility 1-itemsets for the new transactions, HU₁^N.

1-Itemset Average-Utility Upper Bound

A 18

B 42

C 48

D 42
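The upper-bound computation and the threshold test of Steps 5 and 6 can be sketched as follows. The item utilities are copied from Table 4-5 with zero entries omitted, and the threshold 13.62 is the α^N from Step 1. The function names `upper_bound` and `high_upper_bound` are hypothetical; itemsets are represented as tuples of item names for illustration only.

```python
# Item utilities of the new transactions, copied from Table 4-5 (zeros omitted).
NEW_UTILITIES = {
    "t11": {"A": 3, "B": 10, "C": 1, "D": 12},
    "t12": {"B": 30, "C": 4, "D": 6},
    "t13": {"A": 6, "C": 2, "E": 5},
}


def upper_bound(itemset, transactions):
    """Sum the maximal item utility of every transaction that contains the itemset."""
    items = set(itemset)
    return sum(
        max(utilities.values())
        for utilities in transactions.values()
        if items.issubset(utilities)
    )


def high_upper_bound(candidates, transactions, threshold):
    """Keep the candidates whose upper bound reaches the threshold."""
    bounds = {c: upper_bound(c, transactions) for c in candidates}
    return {c: ub for c, ub in bounds.items() if ub >= threshold}


candidates = [("A",), ("B",), ("C",), ("D",), ("E",)]
hu1_new = high_upper_bound(candidates, NEW_UTILITIES, threshold=13.62)
print(hu1_new)  # {('A',): 18, ('B',): 42, ('C',): 48, ('D',): 42}
```

The same two functions apply unchanged to longer itemsets; for instance `upper_bound(("A", "C"), NEW_UTILITIES)` gives 18, the value used for {AC} later in the example.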

STEP 7: For each 1-itemset s in the set of high upper-bound average-utility itemsets (HU₁^D) from the original database, if it appears in the set of high upper-bound average-utility 1-itemsets (HU₁^N) in the new transactions, the following substeps are done. In this case, the four 1-itemsets {A}, {B}, {C} and {D} are then processed.

Substep 7-1: The newly updated average-utility upper bound of each itemset is calculated. Take {A} as an example. Its average-utility upper bounds in the original database and in the new transactions are 67 and 18, respectively. As a result, the average-utility upper bound of {A} in the updated database is calculated as 67+18, which is 85. The average-utility upper bounds for the other three items can be easily calculated in the same way.

Substep 7-2: The four 1-itemsets, {A}, {B}, {C} and {D}, are then put into the set of updated high upper-bound average-utility 1-itemsets, HU₁^U.

STEP 8: For each 1-itemset s in the set of high upper-bound average-utility itemsets (HU₁^D) from the original database, if it does not appear in the set of high upper-bound average-utility 1-itemsets (HU₁^N) in the new transactions, the following substeps are done. In this case, only {E} satisfies the condition.

Substep 8-1: The updated average-utility upper bound of {E} is calculated as 70+6, which is 76.

Substep 8-2: The updated average-utility upper bound of the itemset {E} is larger than α^U, which is 59.02. Itemset {E} is thus put in the set of updated high upper-bound average-utility 1-itemsets, HU₁^U.

STEP 9: In this case, every 1-itemset in the set of high upper-bound average-utility 1-itemsets (HU₁^N) for the new transactions also appears in the set of high upper-bound average-utility 1-itemsets (HU₁^D) for the original database, so this step is skipped.

After Step 9, all the updated high upper-bound average-utility 1-itemsets are shown in Table 4-9.

Table 4-9: The set of all the updated high upper-bound average-utility 1-itemsets, HU₁^U.

1-Itemset Average-Utility Upper Bound

A 85

B 130

C 120

D 147

E 76

STEP 10: The candidate 2-itemsets are generated from the set of high upper-bound average-utility 1-itemsets (HU₁^N) in the new transactions. If any 1-sub-itemset of a candidate 2-itemset is not contained in the set of updated high upper-bound average-utility 1-itemsets (HU₁^U), the candidate is removed from the candidate set. In this case, the candidate 2-itemsets are {AB}, {AC}, {AD}, {BC}, {BD} and {CD}.
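Candidate generation with sub-itemset pruning can be sketched as below. The text does not state how k-itemsets are joined, so the Apriori-style join on a shared (k-1)-prefix used here is an assumption, and the function name `generate_candidates` is hypothetical. Itemsets are written as strings of single-letter items, as in the tables.

```python
from itertools import combinations


def generate_candidates(hu_n_k, hu_u_k):
    """Join k-itemsets of HU_k^N and prune candidates that have a
    k-sub-itemset outside HU_k^U."""
    joinable = sorted(tuple(sorted(s)) for s in hu_n_k)
    allowed = {tuple(sorted(s)) for s in hu_u_k}
    k = len(joinable[0]) if joinable else 0
    candidates = set()
    for a, b in combinations(joinable, 2):
        if a[:-1] == b[:-1]:                      # assumed prefix join
            cand = a + (b[-1],)
            if all(sub in allowed for sub in combinations(cand, k)):
                candidates.add(cand)
    return sorted(candidates)


# k = 1: HU_1^N = {A, B, C, D}, HU_1^U = {A, B, C, D, E}
print(generate_candidates(["A", "B", "C", "D"], ["A", "B", "C", "D", "E"]))
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

# k = 2: HU_2^N = {AC, BC, BD, CD}, HU_2^U = {AD, BC, BD, BE, CD}
print(generate_candidates(["AC", "BC", "BD", "CD"], ["AD", "BC", "BD", "BE", "CD"]))
# [('B', 'C', 'D')]
```

Joining only itemsets from HU_k^N keeps the candidate set tied to the new transactions, while pruning against HU_k^U uses the information already merged from the original database.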

STEP 11: k is set to 2, where k is used to record the number of items in the itemsets currently being processed.

STEP 12: The average-utility upper bounds of 2-itemsets in the newly inserted transactions are calculated. The upper-bound values of all the 2-itemsets in the new transactions are shown in Table 4-10.

Table 4-10: The average-utility upper bounds of 2-itemsets in the new transactions.

STEP 13: The average-utility upper bounds of the 2-itemsets are checked against the minimum average-utility threshold α^N (which is 13.62) for the new transactions. In this example, the upper bounds of the four 2-itemsets {AC}, {BC}, {BD} and {CD} are larger than α^N. The four itemsets are then put in the set of high upper-bound average-utility 2-itemsets for the new transactions, HU₂^N, which is shown in Table 4-11.

Table 4-11: The set of high upper-bound average-utility 2-itemsets for the new transactions, HU₂^N.

2-Itemset Average-Utility Upper Bound

STEP 14: For each 2-itemset s in the set of high upper-bound average-utility itemsets (HU₂^D) from the original database, if it appears in the set of high upper-bound average-utility 2-itemsets (HU₂^N) in the new transactions, the following substeps are done. In this case, the three 2-itemsets {BC}, {BD} and {CD} are then processed.

Substep 14-1: The newly updated average-utility upper bounds of itemsets {BC}, {BD} and {CD} are calculated, which are 92, 110 and 93, respectively.

Substep 14-2: The three 2-itemsets, {BC}, {BD} and {CD}, are then put into the set of updated high upper-bound average-utility 2-itemsets, HU₂^U.

STEP 15: For each 2-itemset s in the set of high upper-bound average-utility itemsets (HU₂^D) from the original database, if it does not appear in the set of high upper-bound average-utility 2-itemsets (HU₂^N) in the new transactions, the following substeps are done. In this case, the three 2-itemsets {AD}, {BE} and {DE} are then processed.

Substep 15-1: The newly updated average-utility upper bounds of itemsets {AD}, {BE} and {DE} are calculated, which are 79, 60 and 50, respectively.

Substep 15-2: The updated average-utility upper bounds of the itemsets {AD} and {BE} are larger than α^U, which is 59.02. Itemsets {AD} and {BE} are thus put in the set of updated high upper-bound average-utility 2-itemsets, HU₂^U. The updated upper bound of {DE} (50) is smaller than α^U, so {DE} is not kept.

STEP 16: For each 2-itemset s in the set of high upper-bound average-utility itemsets (HU₂^N) in the new transactions, if it does not appear in the set of high upper-bound average-utility 2-itemsets (HU₂^D) in the original database, the following substeps are done. In this case, the 2-itemset {AC} is then processed.

Substep 16-1: The original database is rescanned to determine the average-utility upper bound of itemset {AC}, which is 41.

Substep 16-2: The updated average-utility upper bound of itemset {AC} is calculated. The average-utility upper bounds of {AC} in the original database and new transactions are 41 and 18, respectively. Thus, the updated average-utility upper bound of {AC} is calculated as 41+18, which is 59.

Substep 16-3: The updated average-utility upper bound of the itemset {AC} is smaller than α^U, which is 59.02. Thus, {AC} is not put in HU₂^U.

After Step 16, all the updated high upper-bound average-utility 2-itemsets are shown in Table 4-12.

Table 4-12: The set of all the updated high upper-bound average-utility 2-itemsets, HU₂^U.

2-Itemset Average-Utility Upper Bound

AD 79

BC 92

BD 110

BE 60

CD 93
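The three cases used to merge the upper bounds (itemset in both sets, only in the original set, only in the new set) can be sketched as one function and checked against Table 4-12. The original-database bounds are the 2-itemset rows of Table 4-3. The new-transaction bounds are those implied by the example (18 for {AC} is stated; the others follow from differences such as 79 - 67 = 12 for {AD}). The rescan is replaced by a hypothetical stub that returns only the value 41 reported for {AC}; all function names are illustrative.

```python
def update_upper_bounds(hu_d, ub_n, hu_n, alpha_u, rescan_ub_d):
    """Merge upper bounds of k-itemsets into HU_k^U.

    hu_d:        {itemset: ub^D} for the itemsets in HU_k^D
    ub_n:        {itemset: ub^N} for itemsets counted in the new transactions
    hu_n:        set of itemsets in HU_k^N
    rescan_ub_d: callable returning ub^D(s) from a rescan of the original database
    """
    hu_u = {}
    for s, ub_d in hu_d.items():
        total = ub_d + ub_n.get(s, 0)       # ub^N is 0 if s never occurs in N
        if s in hu_n or total >= alpha_u:   # in HU_k^N: kept without a check
            hu_u[s] = total
    for s in sorted(hu_n):
        if s not in hu_d:                   # only case that needs a rescan
            total = rescan_ub_d(s) + ub_n[s]
            if total >= alpha_u:
                hu_u[s] = total
    return hu_u


HU2_D = {"AD": 67, "BC": 50, "BD": 68, "BE": 60, "CD": 51, "DE": 50}
UB2_N = {"AC": 18, "AD": 12, "BC": 42, "BD": 42, "CD": 42}
HU2_N = {"AC", "BC", "BD", "CD"}


def rescan_stub(itemset):
    """Hypothetical stand-in for rescanning the original database."""
    return {"AC": 41}[itemset]


hu2_u = update_upper_bounds(HU2_D, UB2_N, HU2_N, 59.02, rescan_stub)
print(hu2_u)  # {'AD': 79, 'BC': 92, 'BD': 110, 'BE': 60, 'CD': 93}
```

Only {AC} triggers the rescan callback, which illustrates where the incremental approach saves work compared with re-mining the whole updated database.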

STEP 17: The candidate 3-itemsets are generated from the set of high upper-bound average-utility 2-itemsets (HU₂^N) in the new transactions. If any 2-sub-itemset of a candidate 3-itemset is not contained in the set of updated high upper-bound average-utility 2-itemsets (HU₂^U), the candidate is removed from the candidate set. In this case, the only candidate 3-itemset is {BCD}.

STEP 18: k is set to 3, where k is used to record the number of items in the itemsets currently being processed.

STEP 19: Repeat STEPs 5 to 11 until no new candidate itemsets are generated.

The set of all the updated high upper-bound average-utility itemsets is shown in Table 4-13.

Table 4-13: The set of all the updated high upper-bound average-utility itemsets, HU^U.

STEP 20: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it appears in the set of high upper-bound average-utility itemsets (HU^D) of the original database, the following substeps are done.

Substep 20-1: The actual average-utility value of each itemset s for the new transactions is calculated. Take the itemset {BD} as an example. The actual utility values of items {B} and {D} in transaction 11 are 10 and 12, respectively. Since the itemset {BD} contains 2 items, its actual average-utility value in transaction 11 is calculated as (10 + 12) / 2, which is 11. The itemset {BD} appears in transactions 11 and 12; in the same way, its value in transaction 12 is (30 + 6) / 2, which is 18. The actual average-utility value of {BD} in the new transactions is thus the sum of these per-transaction values, calculated as 11 + 18, which is 29.

Substep 20-2: The new actual average-utility value of s in the updated database is set. Again take the itemset {BD} as an example. The actual average-utility values of itemset {BD} in the original database and in the new transactions are 51 and 29, respectively. Thus, the new actual average-utility value of itemset {BD} in the updated database is calculated as 51 + 29, which is 80.

The actual average-utility value of each high upper-bound average-utility itemset s in HU^U of the updated database that also appears in the set of high upper-bound average-utility itemsets (HU^D) of the original database is shown in Table 4-14.

Table 4-14: The actual average-utility value of each high upper-bound average-utility itemset s in HU^U of the updated database appearing in the set of high upper-bound average-utility itemsets (HU^D) of the original database.

Substep 20-3: Check whether the actual average-utility value of each itemset is larger than or equal to α^U, which is 59.02. In this example, the average-utility values of itemsets {B}, {D} and {BD} are larger than α^U. These itemsets are thus put in the set of high average-utility itemsets, H^U.

STEP 21: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it does not appear in the set of high upper-bound average-utility itemsets (HU^D) of the original database, the following substeps are done. In this case, itemset {BCD} is then processed.

Substep 21-1: The actual average-utility value of itemset {BCD} for the new transactions is calculated, which is 21.

Substep 21-2: The original database is rescanned to determine the actual average-utility value of itemset {BCD} in the original database, which is 18.

Substep 21-3: The updated actual average-utility value of itemset {BCD} is calculated as 18 + 21, which is 39.

Substep 21-4: The actual average-utility value of itemset {BCD} is smaller than α^U, which is 59.02. Thus, itemset {BCD} is not a high average-utility itemset.

After Step 21, all the high average-utility itemsets for the updated database have been found, as shown in Table 4-15.

Table 4-15: All the high average-utility itemsets for the updated database, H^U.

High Average-Utility Itemset   Average-Utility
B                              120
D                              78
BD                             80
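The final phase (Steps 20 and 21) can be reproduced with the short sketch below. The original-database values come from Table 4-3, the new-transaction utilities from Table 4-5, and the list of updated high upper-bound itemsets from Tables 4-9 and 4-12 plus {BCD}. The rescan of the original database is replaced by a hypothetical stub that returns only the value 18 reported in Substep 21-2; the function names are illustrative, not part of the original algorithm.

```python
# Actual average-utility values in the original database, copied from Table 4-3.
AU_ORIGINAL = {"A": 36, "B": 80, "C": 11, "D": 60, "E": 40, "AD": 33,
               "BC": 29.5, "BD": 51, "BE": 45, "CD": 15.5, "DE": 29.5}

# Item utilities of the new transactions, copied from Table 4-5 (zeros omitted).
NEW_UTILITIES = [
    {"A": 3, "B": 10, "C": 1, "D": 12},   # t11
    {"B": 30, "C": 4, "D": 6},            # t12
    {"A": 6, "C": 2, "E": 5},             # t13
]

# Updated high upper-bound itemsets: Tables 4-9 and 4-12, plus {BCD}.
HU_UPDATED = ["A", "B", "C", "D", "E", "AD", "BC", "BD", "BE", "CD", "BCD"]
ALPHA_U = 59.02


def average_utility_new(itemset):
    """au^N(s): per-transaction mean of the item utilities, summed over the
    new transactions that contain every item of s."""
    total = 0.0
    for utilities in NEW_UTILITIES:
        if all(item in utilities for item in itemset):
            total += sum(utilities[item] for item in itemset) / len(itemset)
    return total


def rescan_original(itemset):
    """Hypothetical stand-in for a rescan; only the value for {BCD} is known."""
    return {"BCD": 18}[itemset]


def high_average_utility_itemsets():
    result = {}
    for s in HU_UPDATED:
        # Itemsets already in HU^D reuse the stored value; others need a rescan.
        au_d = AU_ORIGINAL[s] if s in AU_ORIGINAL else rescan_original(s)
        au_u = au_d + average_utility_new(s)
        if au_u >= ALPHA_U:
            result[s] = au_u
    return result


print(high_average_utility_itemsets())  # {'B': 120.0, 'D': 78.0, 'BD': 80.0}
```

The output matches Table 4-15, and {BCD} is the only itemset for which the rescan stub is called.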

CHAPTER 5 Incremental Utility Mining Algorithm for Record Deletion

In this chapter, an incremental utility mining algorithm is proposed to maintain the discovered high average-utility itemsets for record deletion. The proposed algorithm first scans the deleted transactions to obtain all candidate 1-itemsets with their average-utility upper bounds. The candidate 1-itemsets are then checked against the minimum average-utility threshold to decide whether they are high upper-bound average-utility itemsets for the deleted transactions.

For each 1-itemset in the set of high upper-bound average-utility itemsets from the original database, if it appears in the set of high upper-bound average-utility 1-itemsets in the
