The Proposed Incremental Utility Mining Algorithm for Record Deletion

CHAPTER 5 Incremental Utility Mining Algorithm for Record Deletion

5.2 The Proposed Incremental Utility Mining Algorithm for Record Deletion

INPUT: The profit values of the items, the minimum average-utility ratio λ, an original database D with its minimum average-utility threshold α^D (= total utility × λ), the high upper-bound average-utility itemsets (HU^D) and the high average-utility itemsets (H^D) of D, and a set of deleted transactions R = {T1, T2, …, Tn}.

OUTPUT: A set of high average-utility itemsets (H^U) for the updated database U (= D-R).

STEP 1: Calculate the minimum average-utility thresholds (α^R and α^U) respectively for the deleted transactions R and for the updated database U as follows:

α^R = α^D × (r / d) and α^U = α^D × ((d - r) / d),

where α^D is the minimum average-utility threshold for the original database, d is the number of transactions in the original database, and r is the number of deleted transactions.

STEP 2: Calculate the utility value u_jk of each item I_j in each deleted transaction T_k as u_jk = q_jk × p_j, where q_jk is the quantity of I_j in T_k, p_j is the profit value of I_j, j = 1 to m and k = 1 to n.

STEP 3: Find the maximal item-utility value mu_k in each deleted transaction T_k as mu_k = max{u_1k, u_2k, …, u_mk}, k = 1 to n.

STEP 4: Set k = 1, where k records the number of items in the itemsets currently being processed.

STEP 5: Generate the candidate k-itemsets and calculate their average-utility upper bounds from the deleted transactions. The average-utility upper bound ub^R(s) of each candidate k-itemset s is set as the sum of the maximal item-utilities of the deleted transactions which include s. That is:

ub^R(s) = Σ_{T_k ∈ R, s ⊆ T_k} mu_k.

STEP 6: Check whether the average-utility upper bound of each candidate k-itemset s from the deleted transactions is larger than or equal to α^R. If s satisfies the above condition, put it in the set of high upper-bound average-utility k-itemsets for the deleted transactions, HU_k^R.

STEP 7: For each k-itemset s in the set of high upper-bound average-utility itemsets (HU_k^D) from the original database, if it appears in the set of high upper-bound average-utility k-itemsets (HU_k^R) in the deleted transactions, do the following substeps.

Substep 7-1: Set the updated average-utility upper bound of itemset s as: ub^U(s) = ub^D(s) - ub^R(s).

Substep 7-2: Check whether the updated average-utility upper bound of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high upper-bound average-utility k-itemsets, HU_k^U.

STEP 8: For each k-itemset s in the set of high upper-bound average-utility itemsets (HU_k^D) from the original database, if it does not appear in the set of high upper-bound average-utility k-itemsets (HU_k^R) in the deleted transactions, do the following substeps.

Substep 8-1: Set the updated average-utility upper bound of itemset s as: ub^U(s) = ub^D(s) - ub^R(s).

Substep 8-2: Put s in the set of updated high upper-bound average-utility k-itemsets, HU_k^U.

STEP 9: For each candidate k-itemset s, if it does not appear in the set of high upper-bound average-utility k-itemsets (HU_k^D) in the original database and does not appear in the set of high upper-bound average-utility k-itemsets (HU_k^R) in the deleted transactions, do the following substeps.

Substep 9-1: Rescan the original database to determine the average-utility upper bound (ub^D(s)) of itemset s.

Substep 9-2: Set the updated average-utility upper bound of itemset s as: ub^U(s) = ub^D(s) - ub^R(s).

Substep 9-3: Check whether the updated average-utility upper bound of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high upper-bound average-utility k-itemsets, HU_k^U.

STEP 10: Generate the candidate (k+1)-itemsets from the set of updated high upper-bound average-utility k-itemsets (HU_k^U) in the updated database. If any k-sub-itemset of a candidate (k+1)-itemset is not contained in the set of updated high upper-bound average-utility k-itemsets (HU_k^U), remove the candidate from the candidate set.

STEP 11: Set k = k+1.

STEP 12: Repeat STEPs 5 to 11 until no new candidate itemsets are generated.

STEP 13: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it appears in the set of high upper-bound average-utility itemsets (HU^D) of the original database, do the following substeps.

Substep 13-1: Calculate the actual average-utility value of each itemset s for the deleted transactions as:

au^R(s) = Σ_{T_k ∈ R, s ⊆ T_k} ( Σ_{I_j ∈ s} u_jk ) / |s|,

where u_jk is the utility value of each item I_j in transaction T_k and |s| is the number of items in s.

Substep 13-2: Set the actual average-utility value of s in the updated database as:

au^U(s) = au^D(s) - au^R(s).

Substep 13-3: Check whether the actual average-utility value of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high average-utility itemsets, H^U.

STEP 14: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it does not appear in the set of high upper-bound average-utility itemsets (HU^D) of the original database, do the following substeps.

Substep 14-1: Calculate the actual average-utility value of each itemset s for the deleted transactions as:

au^R(s) = Σ_{T_k ∈ R, s ⊆ T_k} ( Σ_{I_j ∈ s} u_jk ) / |s|,

where u_jk is the utility value of each item I_j in transaction T_k and |s| is the number of items in s.

Substep 14-2: Rescan the original database to determine the actual average-utility value au^D(s) of s, since s is not in HU^D and no stored value is available for it.

Substep 14-3: Set the actual average-utility value of s in the updated database as:

au^U(s) = au^D(s) - au^R(s).

Substep 14-4: Check whether the actual average-utility value of itemset s is larger than or equal to α^U. If it satisfies the above condition, put it in the set of updated high average-utility itemsets, H^U.

After Step 14, the final updated high average-utility itemsets for the updated database can then be found.
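Step 8 puts s into HU_k^U without checking it against α^U. A short derivation suggests why this is safe, assuming (as the numbers in the example of Section 5.3 indicate) that α^R = α^D × (r / d) and α^U = α^D × ((d - r) / d). If s is in HU_k^D, then ub^D(s) ≥ α^D, and if s is not in HU_k^R, then ub^R(s) < α^R, so:

```latex
\begin{align*}
ub^{U}(s) &= ub^{D}(s) - ub^{R}(s) \\
          &> \alpha^{D} - \alpha^{R} \\
          &= \alpha^{D} - \alpha^{D}\cdot\frac{r}{d}
           = \alpha^{D}\cdot\frac{d-r}{d}
           = \alpha^{U}.
\end{align*}
```

By the same reasoning, an itemset that is in HU_k^R but not in HU_k^D satisfies ub^U(s) < α^U, which would explain why the algorithm contains no step for that case.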

5.3 An Example

In this section, an example is given to demonstrate the proposed incremental average-utility mining algorithm for record deletion. It shows how the proposed algorithm finds the high average-utility itemsets after records are deleted without completely rescanning the database. Assume the original database includes 10 transactions, shown in Table 5-1. Each transaction consists of its transaction identification (TID) and the items purchased. The numbers represent the quantities purchased.

Table 5-1: The set of ten transaction data in the original database.

TID   A   B   C   D   E
t1    1   1   4   1   0
t2    0   1   0   3   0
t3    2   0   0   1   0
t4    0   0   1   0   0
t5    1   2   0   1   3
t6    1   1   1   1   1
t7    0   2   3   0   1
t8    0   0   0   1   2
t9    7   0   1   1   0
t10   0   1   1   1   1

Also assume that the profit value of each item is defined in Table 5-2.

Table 5-2: The predefined profit values of the items.

Item   Profit
A      3
B      10
C      1
D      6
E      5

Suppose the minimum average-utility ratio λ is set at 20%. The total utility of the ten transactions is 227, so the minimum average-utility threshold α^D is calculated as 227 multiplied by 20%, which is 45.4.
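The threshold can be checked directly from Tables 5-1 and 5-2. The following is a minimal Python sketch; the names `PROFIT`, `QUANTITIES` and `total_utility` are illustrative and do not come from the original implementation.

```python
# Profit per item (Table 5-2) and purchased quantities per transaction (Table 5-1).
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}
ITEMS = "ABCDE"
QUANTITIES = [
    (1, 1, 4, 1, 0),  # t1
    (0, 1, 0, 3, 0),  # t2
    (2, 0, 0, 1, 0),  # t3
    (0, 0, 1, 0, 0),  # t4
    (1, 2, 0, 1, 3),  # t5
    (1, 1, 1, 1, 1),  # t6
    (0, 2, 3, 0, 1),  # t7
    (0, 0, 0, 1, 2),  # t8
    (7, 0, 1, 1, 0),  # t9
    (0, 1, 1, 1, 1),  # t10
]


def total_utility(rows, profit):
    """Sum quantity * profit over every item of every transaction."""
    return sum(q * profit[i] for row in rows for i, q in zip(ITEMS, row))


total = total_utility(QUANTITIES, PROFIT)
alpha_d = total * 0.20            # minimum average-utility ratio of 20%
print(total, round(alpha_d, 2))   # 227 45.4
```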

Using the batch mining algorithm for the original database, the set of high upper-bound average-utility itemsets generated in phase 1 is shown in Table 5-3. The average-utility upper bound and the actual average-utility value of each high upper-bound average-utility itemset are also recorded in Table 5-3.

Table 5-3: The average-utility upper bounds and the actual average-utility values of the high upper-bound average-utility itemsets from the original database.

Assume the last three transactions of the original database, shown in Table 5-4, are deleted after the initial data set is processed. The proposed incremental average-utility mining algorithm for record deletion proceeds as follows.

Table 5-4: The three deleted transactions.

TID   A   B   C   D   E
t8    0   0   0   1   2
t9    7   0   1   1   0
t10   0   1   1   1   1

STEP 1: The minimum average-utility thresholds (α^R and α^U) respectively for the deleted transactions R and for the updated database U are calculated. In this example, there are 3 deleted transactions and thus 7 (10-3) transactions in the updated database.

According to the formulas, α^R and α^U are calculated as follows:

α^R = 45.4 × (3 / 10) = 13.62 and α^U = 45.4 × ((10 - 3) / 10) = 31.78.
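The two thresholds split α^D in proportion to the number of deleted and remaining transactions. A small Python sketch of this step follows; the function name `split_threshold` is hypothetical and used only for illustration.

```python
def split_threshold(alpha_d: float, d: int, r: int) -> tuple[float, float]:
    """Return (alpha_r, alpha_u) when r of the d original transactions are deleted."""
    if d <= 0 or not 0 <= r <= d:
        raise ValueError("need d > 0 and 0 <= r <= d")
    alpha_r = alpha_d * r / d          # threshold for the deleted transactions
    alpha_u = alpha_d * (d - r) / d    # threshold for the updated database
    return alpha_r, alpha_u


alpha_r, alpha_u = split_threshold(45.4, d=10, r=3)
print(round(alpha_r, 2), round(alpha_u, 2))   # 13.62 31.78
```

The two values always add up to α^D, which is what makes the subtraction ub^U(s) = ub^D(s) - ub^R(s) comparable against α^U.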

STEP 2: The utility value of each item occurring in each deleted transaction is calculated.

Take item {A} in transaction 9 as an example. The quantity of item {A} in transaction 9 is 7, and its profit is 3. The utility value of {A} is thus calculated as 7*3, which is 21. The utility values of all the items in each deleted transaction are shown in Table 5-5.

Table 5-5: The utility values of all the items in the deleted transactions.

TID   A    B    C   D   E
t8    0    0    0   6   10
t9    21   0    1   6   0
t10   0    10   1   6   5

STEP 3: The utility values of the items in a transaction are compared and the maximal utility value in the transaction is found. Take transaction 9 as an example. It can be observed from Table 5-5 that the utility value of {A} is 21, which is the maximum in transaction 9. The maximal utility value in each transaction is shown in Table 5-6.

Table 5-6: The maximal utility values in the deleted transactions.

TID   A    B    C   D   E    Maximal Utility Value in a Transaction
t8    0    0    0   6   10   10
t9    21   0    1   6   0    21
t10   0    10   1   6   5    10
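Steps 2 and 3 amount to a multiplication per item followed by a maximum per transaction. The sketch below reproduces Tables 5-5 and 5-6 in Python; the helper names `item_utilities` and `max_item_utility` are illustrative assumptions, and items with zero quantity are simply left out of each dictionary.

```python
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}   # Table 5-2
DELETED = {                                          # non-zero quantities, Table 5-4
    "t8": {"D": 1, "E": 2},
    "t9": {"A": 7, "C": 1, "D": 1},
    "t10": {"B": 1, "C": 1, "D": 1, "E": 1},
}


def item_utilities(quantities, profit):
    """Step 2: u_jk = q_jk * p_j for each purchased item."""
    return {item: q * profit[item] for item, q in quantities.items()}


def max_item_utility(utilities):
    """Step 3: mu_k, the largest item utility in the transaction."""
    return max(utilities.values(), default=0)


utilities = {tid: item_utilities(q, PROFIT) for tid, q in DELETED.items()}
mu = {tid: max_item_utility(u) for tid, u in utilities.items()}
print(utilities["t9"])   # {'A': 21, 'C': 1, 'D': 6}
print(mu)                # {'t8': 10, 't9': 21, 't10': 10}
```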

STEP 4: k is set to 1, where k is used to record the number of items in the itemsets currently being processed.

STEP 5: The average-utility upper bounds of the 1-itemsets in the deleted transactions are first calculated. Take item {C} as an example. It appears in transactions 9 and 10.

The average-utility upper bound of {C} is thus the total amount of the maximal utility values of these transactions. It is calculated as 21+10 (=31) in the example.

The upper-bound values of all the items in the deleted transactions are shown in Table 5-7.

Table 5-7: The average-utility upper bounds of the 1-itemsets in the deleted transactions.

STEP 6: The average-utility upper bounds of the 1-itemsets are checked against the minimum average-utility threshold α^R (which is 13.62) for the deleted transactions. In this example, the upper bounds of the four 1-itemsets {A}, {C}, {D} and {E} are larger than α^R. The four items are then put in the set of high upper-bound average-utility 1-itemsets for the deleted transactions, HU₁^R, which is shown in Table 5-8.

Table 5-8: The set of high upper-bound average-utility 1-itemsets for the deleted transactions, HU₁^R.

1-Itemset   Average-Utility Upper Bound
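Steps 5 and 6 can be sketched in a few lines of Python. The values below are derived from Table 5-5 by the rule stated in Step 5; the function name `upper_bound` and the string encoding of itemsets (for example "CD" for {CD}) are assumptions made for this illustration.

```python
# Item utilities of the deleted transactions (Table 5-5, zero entries omitted).
DELETED = [
    {"D": 6, "E": 10},                   # t8
    {"A": 21, "C": 1, "D": 6},           # t9
    {"B": 10, "C": 1, "D": 6, "E": 5},   # t10
]
ALPHA_R = 13.62


def upper_bound(itemset, transactions):
    """Sum of maximal item utilities over the transactions that contain itemset."""
    return sum(
        max(t.values()) for t in transactions if all(i in t for i in itemset)
    )


ub_r = {item: upper_bound(item, DELETED) for item in "ABCDE"}
hu1_r = sorted(item for item, ub in ub_r.items() if ub >= ALPHA_R)
print(ub_r)    # {'A': 21, 'B': 10, 'C': 31, 'D': 41, 'E': 20}
print(hu1_r)   # ['A', 'C', 'D', 'E']
```

Item {B} falls below α^R because it occurs only in transaction 10, whose maximal utility is 10.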

STEP 7: For each 1-itemset s in the set of high upper-bound average-utility itemsets (HU₁^D) from the original database, if it appears in the set of high upper-bound average-utility 1-itemsets (HU₁^R) in the deleted transactions, the following substeps are done. In this case, the four 1-itemsets {A}, {C}, {D} and {E} are then processed.

Substep 7-1: The updated average-utility upper bound of each itemset is calculated. Take {A} as an example. Its average-utility upper bounds in the original database and in the deleted transactions are 67 and 21, respectively. As a result, the average-utility upper bound of {A} in the updated database is calculated as 67-21, which is 46. The average-utility upper bounds for the other three items can be easily calculated in the same way.

Substep 7-2: The updated average-utility upper bounds of the itemsets {A}, {C}, {D} and {E} are larger than α^U, which is 31.78. These four itemsets are thus put in the set of updated high upper-bound average-utility 1-itemsets, HU₁^U.

STEP 8: For each 1-itemset s in the set of high upper-bound average-utility itemsets (HU₁^D) from the original database, if it does not appear in the set of high upper-bound average-utility 1-itemsets (HU₁^R) in the deleted transactions, the following substeps are done. In this case, only {B} satisfies the condition.

Substep 8-1: The updated average-utility upper bound of itemset {B} is calculated as 88-10, which is 78.

Substep 8-2: Itemset {B} is then put into the set of updated high upper-bound average-utility 1-itemsets, HU₁^U.

STEP 9: In this case, all five candidate 1-itemsets appear in the set of high upper-bound average-utility 1-itemsets (HU₁^D) in the original database, so no candidate is missing from both HU₁^D and HU₁^R. This step is thus skipped.

After Step 9, all the updated high upper-bound average-utility 1-itemsets are shown in Table 5-9.

Table 5-9: The set of all the updated high upper-bound average-utility 1-itemsets, HU₁^U.

1-Itemset   Average-Utility Upper Bound
A           46
B           78
C           41
D           64
E           50
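The values in Table 5-9 can be cross-checked by computing both upper bounds from Tables 5-1 and 5-2 and subtracting. This Python sketch recomputes ub^D from the full database only as a check; the incremental algorithm itself reads ub^D from the stored results instead. The helper names are illustrative.

```python
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}
ROWS = [  # quantities of items A..E in t1..t10 (Table 5-1)
    (1, 1, 4, 1, 0), (0, 1, 0, 3, 0), (2, 0, 0, 1, 0), (0, 0, 1, 0, 0),
    (1, 2, 0, 1, 3), (1, 1, 1, 1, 1), (0, 2, 3, 0, 1), (0, 0, 0, 1, 2),
    (7, 0, 1, 1, 0), (0, 1, 1, 1, 1),
]


def utilities(row):
    """Item utilities of one transaction, zero quantities omitted."""
    return {i: q * PROFIT[i] for i, q in zip("ABCDE", row) if q > 0}


def upper_bound(item, transactions):
    """Sum of maximal item utilities over the transactions containing item."""
    return sum(max(t.values()) for t in transactions if item in t)


original = [utilities(r) for r in ROWS]
deleted = original[-3:]               # t8, t9 and t10 (Table 5-4)
ub_u = {i: upper_bound(i, original) - upper_bound(i, deleted) for i in "ABCDE"}
print(ub_u)   # {'A': 46, 'B': 78, 'C': 41, 'D': 64, 'E': 50}
```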

STEP 10: The candidate 2-itemsets are generated from the set of updated high upper-bound average-utility 1-itemsets (HU₁^U) of the updated database. In this case, the 2-itemsets are {AB}, {AC}, {AD}, {AE}, {BC}, {BD}, {BE}, {CD}, {CE} and {DE}.

STEP 11: k is set to 2, where k is used to record the number of items in the itemsets currently being processed.

STEP 12: The average-utility upper bounds of 2-itemsets in the deleted transactions are calculated. The upper-bound values of all the 2-itemsets in the deleted transactions are shown in Table 5-10.

Table 5-10: The average-utility upper bounds of 2-itemsets in the deleted transactions.

STEP 13: The average-utility upper bounds of the 2-itemsets are checked against the minimum average-utility threshold α^R (which is 13.62) for the deleted transactions. In this example, the upper bounds of the four 2-itemsets {AC}, {AD}, {CD} and {DE} are larger than α^R. The four itemsets are then put in the set of high upper-bound average-utility 2-itemsets for the deleted transactions, HU₂^R, which is shown in Table 5-11.

Table 5-11: The set of high upper-bound average-utility 2-itemsets for the deleted transactions, HU₂^R.

2-Itemset   Average-Utility Upper Bound
AC          21
AD          21
CD          31
DE          20

STEP 14: For each 2-itemset s in the set of high upper-bound average-utility itemsets (HU₂^D) from the original database, if it appears in the set of high upper-bound average-utility 2-itemsets (HU₂^R) in the deleted transactions, the following substeps are done. In this case, the three 2-itemsets {AD}, {CD} and {DE} are then processed.

Substep 14-1: The updated average-utility upper bounds of itemsets {AD}, {CD} and {DE} are calculated, which are 46, 20 and 30, respectively.

Substep 14-2: Only the updated average-utility upper bound of itemset {AD} is larger than α^U, which is 31.78. Itemset {AD} is thus put in the set of updated high upper-bound average-utility 2-itemsets, HU₂^U.

STEP 15: For each 2-itemset s in the set of high upper-bound average-utility itemsets (HU₂^D) from the original database, if it does not appear in the set of high upper-bound average-utility 2-itemsets (HU₂^R) in the deleted transactions, the following substeps are done. In this case, the three 2-itemsets {BC}, {BD} and {BE} are then processed.

Substep 15-1: The updated average-utility upper bounds of itemsets {BC}, {BD} and {BE} are calculated, which are 40, 58 and 50, respectively.

Substep 15-2: The three 2-itemsets, {BC}, {BD} and {BE}, are then put into the set of updated high upper-bound average-utility 2-itemsets, HU₂^U.

STEP 16: For each candidate 2-itemset s, if it does not appear in the set of high upper-bound average-utility 2-itemsets (HU₂^D) in the original database and does not appear in the set of high upper-bound average-utility 2-itemsets (HU₂^R) in the deleted transactions, the following substeps are done. In this case, the 2-itemsets {AB}, {AE} and {CE} are then processed.

Substep 16-1: The original database is rescanned to determine the average-utility upper bounds of itemsets {AB}, {AE} and {CE}, which are 40, 30 and 40, respectively.

Substep 16-2: The updated average-utility upper bounds of itemsets {AB}, {AE} and {CE} are calculated. Take itemset {CE} as an example. The average-utility upper bounds of {CE} in the original database and in the deleted transactions are 40 and 10, respectively. Thus, the updated average-utility upper bound of {CE} is calculated as 40-10, which is 30.

Substep 16-3: The updated average-utility upper bounds of the itemsets {AB}, {AE} and {CE} are 40, 30 and 30, respectively. Only the upper bound of {AB} is larger than α^U (= 31.78), so {AB} is put into the set of updated high upper-bound average-utility 2-itemsets, HU₂^U.

After Step 16, all the updated high upper-bound average-utility 2-itemsets are shown in Table 5-12.

Table 5-12: The set of all the updated high upper-bound average-utility 2-itemsets, HU₂^U.

2-Itemset   Average-Utility Upper Bound
AB          40
AD          46
BC          40
BD          58
BE          50
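The three cases handled for the 2-itemsets (in both HU₂^D and HU₂^R, in HU₂^D only, in neither) can be summarized in one routine. The Python sketch below is an illustration under stated assumptions, not the original implementation: the function name `update_level` is hypothetical, the stored bounds in `ub_d_stored` and the deleted-transaction bounds of {BC}, {BD}, {BE} and {CE} are recomputed from Tables 5-1, 5-2 and 5-6 because the corresponding tables are not reproduced here, and the rescan is mocked by a lookup of the values reported in Substep 16-1.

```python
ALPHA_R, ALPHA_U = 13.62, 31.78


def update_level(candidates, ub_d_stored, ub_r, rescan):
    """Apply the three cases to the candidate k-itemsets of one level.

    ub_d_stored: upper bounds kept for the itemsets in HU_k^D.
    ub_r:        upper bounds computed from the deleted transactions.
    rescan:      callable returning ub^D(s) from a scan of the original database.
    """
    hu_u = {}
    for s in candidates:
        r_bound = ub_r.get(s, 0)
        in_d, in_r = s in ub_d_stored, r_bound >= ALPHA_R
        if in_d and in_r:                  # in both sets: subtract, then check
            bound = ub_d_stored[s] - r_bound
        elif in_d:                         # in HU_k^D only: subtract, keep unchecked
            hu_u[s] = ub_d_stored[s] - r_bound
            continue
        elif not in_r:                     # in neither set: rescan, subtract, check
            bound = rescan(s) - r_bound
        else:                              # in HU_k^R only: no step applies
            continue
        if bound >= ALPHA_U:
            hu_u[s] = bound
    return hu_u


candidates = ["AB", "AC", "AD", "AE", "BC", "BD", "BE", "CD", "CE", "DE"]
ub_d_stored = {"AD": 67, "BC": 50, "BD": 68, "BE": 60, "CD": 51, "DE": 50}
ub_r = {"AC": 21, "AD": 21, "BC": 10, "BD": 10, "BE": 10,
        "CD": 31, "CE": 10, "DE": 20}
rescanned = {"AB": 40, "AE": 30, "CE": 40}   # values from Substep 16-1
hu2_u = update_level(candidates, ub_d_stored, ub_r, rescanned.__getitem__)
print(hu2_u)   # {'AB': 40, 'AD': 46, 'BC': 40, 'BD': 58, 'BE': 50}
```

Only three of the ten candidates reach the rescan branch, which is where the saving over batch mining comes from in this example.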

STEP 17: The candidate 3-itemsets are generated from the set of updated high upper-bound average-utility 2-itemsets (HU₂^U) of the updated database. If any 2-sub-itemset of a candidate 3-itemset is not contained in the set of updated high upper-bound average-utility 2-itemsets (HU₂^U), the candidate is removed from the candidate set. In this case, the only candidate 3-itemset is {ABD}.
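Candidate generation with sub-itemset pruning follows the usual level-wise pattern. A minimal Python sketch, assuming itemsets are represented as `frozenset` objects and using the hypothetical function name `generate_candidates`:

```python
from itertools import combinations


def generate_candidates(hu_k):
    """Join k-itemsets into (k+1)-itemsets and keep only those whose
    every k-sub-itemset is itself in hu_k."""
    if not hu_k:
        return set()
    k = len(next(iter(hu_k)))
    joined = {a | b for a in hu_k for b in hu_k if len(a | b) == k + 1}
    return {
        c for c in joined
        if all(frozenset(sub) in hu_k for sub in combinations(c, k))
    }


hu2_u = {frozenset(s) for s in ("AB", "AD", "BC", "BD", "BE")}   # Table 5-12
candidates_3 = generate_candidates(hu2_u)
print(candidates_3 == {frozenset("ABD")})   # True
```

For instance, {BCD} is joined from {BC} and {BD} but is discarded because {CD} is not in HU₂^U.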

STEP 18: k is set to 3, where k is used to record the number of items in the itemsets currently being processed.

STEP 19: Repeat STEPs 5 to 11 until no new candidate itemsets are generated.

The set of all the updated high upper-bound average-utility itemsets is shown in Table 5-13.

Table 5-13: The set of all the updated high upper-bound average-utility itemsets, HU^U.

STEP 20: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it appears in the set of high upper-bound average-utility itemsets (HU^D) of the original database, the following substeps are done.

Substep 20-1: The actual average-utility value of each itemset s for the deleted transactions is calculated. Take the itemset {BD} as an example. The actual utility values of items {B} and {D} in transaction 10 are 10 and 6, respectively. Since the itemset {BD} contains 2 items, its actual average-utility value in transaction 10 is calculated as (10 + 6) / 2, which is 8. The actual average-utility value of itemset s is the sum of its actual average-utility values over the transactions containing s. Since itemset {BD} appears only in transaction 10, the value is 8.

Substep 20-2: The actual average-utility value of s in the updated database is set. Again take the itemset {BD} as an example. The actual average-utility values of itemset {BD} in the original database and in the deleted transactions are 51 and 8, respectively. Thus, the actual average-utility value of itemset {BD} in the updated database is calculated as (51-8), which is 43.
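The calculation for {BD} can be reproduced with a short Python sketch. The function name `average_utility` is an illustrative assumption; the item utilities come from Table 5-5 and the value 51 for the original database is the one stated above.

```python
# Item utilities of the deleted transactions (Table 5-5, zero entries omitted).
DELETED = [
    {"D": 6, "E": 10},                   # t8
    {"A": 21, "C": 1, "D": 6},           # t9
    {"B": 10, "C": 1, "D": 6, "E": 5},   # t10
]


def average_utility(itemset, transactions):
    """Sum, over the transactions containing itemset, of the utilities of
    its items divided by the number of items in itemset."""
    size = len(itemset)
    return sum(
        sum(t[i] for i in itemset) / size
        for t in transactions
        if all(i in t for i in itemset)
    )


au_r = average_utility("BD", DELETED)   # only t10 contains both B and D
au_u = 51 - au_r                        # au^D({BD}) = 51, as stated in the text
print(au_r, au_u)                       # 8.0 43.0
```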

The actual average-utility value of each high upper-bound average-utility itemset s in HU^U of the updated database appearing in the set of high upper-bound average-utility itemsets (HU^D) of the original database is shown in Table 5-14.

Table 5-14: The actual average-utility value of each high upper-bound average-utility itemset s in HU^U of the updated database appearing in the set of high upper-bound average-utility itemsets (HU^D) of the original database.

Substep 20-3: Check whether the actual average-utility value of each itemset is larger than or equal to α^U, which is 31.78. In this example, the average-utility values of itemsets {B}, {D}, {BD} and {BE} are larger than α^U. These itemsets are thus put in the set of high average-utility itemsets, H^U.

STEP 21: For each high upper-bound average-utility itemset s in HU^U of the updated database, if it does not appear in the set of high upper-bound average-utility itemsets (HU^D) of the original database, the following substeps are done. In this case, itemsets {AB} and {ABD} are then processed.

Substep 21-1: The actual average-utility values of itemsets {AB} and {ABD} for the deleted transactions are calculated, which are 0 and 0.

Substep 21-2: The original database is rescanned to determine the actual average-utility values of itemsets {AB} and {ABD} in the original database, which are 24.5 and 22.33.

Substep 21-3: The updated actual average-utility values of itemsets {AB} and {ABD} are 24.5 and 22.33, respectively.

Substep 21-4: The actual average-utility values of these two itemsets are smaller than α^U. Thus, nothing has to be done.

After Step 21, all the high average-utility itemsets for the updated database have been found, as shown in Table 5-15.

Table 5-15: All the high average-utility itemsets for the updated database, H^U.

High Average-Utility Itemset   Average-Utility
B                              70
D                              42
BD                             43
BE                             37.5
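As a final cross-check, Table 5-15 can be reproduced by computing the actual average-utility values directly from Tables 5-1 and 5-2. In this Python sketch the list `HU_U` is assembled from Table 5-9, Table 5-12 and the 3-itemset {ABD}, and au^D is recomputed from the full database only for verification; the incremental algorithm uses the stored values wherever they exist. All helper names are illustrative.

```python
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}
ROWS = [  # quantities of items A..E in t1..t10 (Table 5-1)
    (1, 1, 4, 1, 0), (0, 1, 0, 3, 0), (2, 0, 0, 1, 0), (0, 0, 1, 0, 0),
    (1, 2, 0, 1, 3), (1, 1, 1, 1, 1), (0, 2, 3, 0, 1), (0, 0, 0, 1, 2),
    (7, 0, 1, 1, 0), (0, 1, 1, 1, 1),
]
ALPHA_U = 31.78
HU_U = ["A", "B", "C", "D", "E", "AB", "AD", "BC", "BD", "BE", "ABD"]


def utilities(row):
    """Item utilities of one transaction, zero quantities omitted."""
    return {i: q * PROFIT[i] for i, q in zip("ABCDE", row) if q > 0}


def average_utility(itemset, transactions):
    """Sum of per-transaction average utilities of itemset."""
    return sum(
        sum(t[i] for i in itemset) / len(itemset)
        for t in transactions
        if all(i in t for i in itemset)
    )


original = [utilities(r) for r in ROWS]
deleted = original[-3:]               # t8, t9 and t10
h_u = {}
for s in HU_U:
    au_u = average_utility(s, original) - average_utility(s, deleted)
    if au_u >= ALPHA_U:
        h_u[s] = au_u
print(h_u)   # {'B': 70.0, 'D': 42.0, 'BD': 43.0, 'BE': 37.5}
```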

CHAPTER 6 Experimental Results

Experiments were conducted to evaluate the performance of the proposed approach. All the experiments were performed on an Intel Core 2 Duo E6550 (2.33 GHz) PC with 2 GB of main memory, running the Windows XP Professional operating system. The proposed algorithm was implemented in Visual C# 9.0.

A real data set from a major grocery chain store in America was used for the experiments. There were 21,556 transactions and 1,559 distinct items in the database. Each transaction consisted of the products sold and their quantities. The average transaction length was 4.03. The total utility from all the transactions in the dataset was 104,450,739.
