Proposed Methods - Mining Temporal Rare Utility Itemsets in Large Databases Using Relative

Chapter 4 Mining Temporal Rare Utility Itemsets in Large Databases Using Relative

4.2 Proposed Methods

In this section, we present the TP-RUI-Mine and TRUI-Mine methods and describe their basic concepts. We then give an example of mining temporal rare utility itemsets. Finally, the procedure of the TRUI-Mine algorithm is described at the end of the section.

The goal of our algorithms is to discover temporal rare itemsets from temporal databases.

The approach combines two concepts, utility mining and significantly rare itemsets, which we describe in turn.

Basic Concept of Utility Mining

The goal of utility mining is to discover all the itemsets in a transaction database whose utility values reach a user-specified threshold. In [35], these itemsets are called high utility itemsets. An itemset X is a high utility itemset if u(X) ≥ ε, where X ⊆ I and ε is the minimum utility threshold; otherwise, it is a low utility itemset. For example, in Table 4-1, u(A, T8) = 1×3 = 3, u({A, C}, T8) = u(A, T8) + u(C, T8) = 1×3 + 1×1 = 4, and u({A, C}) = u({A, C}, T8) + u({A, C}, T9) = 4 + 30 = 34. If ε = 130, {A, C} is a low utility itemset.
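The definitions of u(X, Tq) and u(X) can be sketched in a few lines of Python. This is only an illustration: the database `DB` and the unit profits `PROFIT` below are hypothetical and are not the contents of Table 4-1, and the function names are chosen here for readability.

```python
from typing import Dict, Iterable

# Hypothetical data (NOT Table 4-1): quantity of each item per transaction.
DB: Dict[str, Dict[str, int]] = {
    "T1": {"A": 1, "C": 1},
    "T2": {"A": 2, "B": 1, "C": 4},
    "T3": {"B": 3, "D": 2},
}
# Hypothetical per-unit profit of each item.
PROFIT: Dict[str, int] = {"A": 3, "B": 10, "C": 1, "D": 6}


def u_in_transaction(itemset: Iterable[str], tid: str) -> int:
    """u(X, Tq): sum of quantity * profit over the items of X in Tq.

    Returns 0 when Tq does not contain every item of X.
    """
    items = frozenset(itemset)
    tx = DB[tid]
    if not items.issubset(tx):
        return 0
    return sum(tx[i] * PROFIT[i] for i in items)


def u(itemset: Iterable[str]) -> int:
    """u(X): sum of u(X, Tq) over all transactions in the database."""
    items = frozenset(itemset)
    return sum(u_in_transaction(items, tid) for tid in DB)


print(u_in_transaction({"A", "C"}, "T1"))  # 1*3 + 1*1
print(u({"A", "C"}))                       # utility over T1 and T2
```

With a minimum utility threshold ε, an itemset X is then reported as a high utility itemset exactly when `u(X) >= ε`.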

However, a low utility item may have a high utility superset. For example, u(D) = 84 < 130, so D is a low utility item, but its superset {B, D} is a high utility itemset because u({B, D}) = 160 > 130. Hence, every combination of items would have to be processed to guarantee that no high utility itemset is lost, and the cost of such an exhaustive search in computation time and memory is intolerable.

Liu et al. [21] proposed the Two-Phase algorithm for pruning candidate itemsets and simplifying the calculation of utility. First, Phase I overestimates some low utility itemsets, but it never underestimates any itemset. The transaction utility of transaction Tq, denoted as tu(Tq), is the sum of the utilities of all items in Tq: $tu(T_q) = \sum_{i_p \in T_q} u(i_p, T_q)$. The transaction-weighted utilization of an itemset X, denoted as twu(X), is the sum of the transaction utilities of all the transactions containing X: $twu(X) = \sum_{X \subseteq T_q \in D} tu(T_q)$. For example, twu(A) is obtained by adding up tu(Tq) for every transaction Tq in Table 4-1 that contains A.

Table 4-2 gives the transaction utility for each transaction in Table 4-1. Second, one extra database scan is performed in Phase II to filter out the overestimated itemsets. For example, twu(A) = 157 > 130 but u(A) = 45 < 130, so item {A} is pruned. An itemset whose actual utility reaches the threshold is kept as a high utility itemset. All of the high utility itemsets are identified in this way.

Table 4-2. Transaction utility of the database.

TID    Transaction Utility    TID    Transaction Utility
T1     31                     T7     105
T2     71                     T8     27
T3     42                     T9     40
T4     52                     T10    62
T5     22                     T11    42
T6     48                     T12    21
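The two phases described above can be sketched as follows. This is a simplified illustration rather than the implementation of [21]: the database and unit profits are hypothetical (they are not Table 4-1), and Phase I enumerates every itemset exhaustively instead of generating candidates level by level, which is only viable for a toy example.

```python
from itertools import combinations

# Hypothetical database (NOT Table 4-1): item -> quantity per transaction.
DB = {
    "T1": {"A": 1, "C": 1},
    "T2": {"A": 2, "B": 1, "C": 4},
    "T3": {"B": 3, "D": 2},
}
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6}  # hypothetical unit profits


def tu(tid):
    """Transaction utility: total utility of all items in the transaction."""
    return sum(q * PROFIT[i] for i, q in DB[tid].items())


def twu(itemset):
    """Transaction-weighted utilization: sum of tu over transactions containing the itemset."""
    return sum(tu(t) for t, tx in DB.items() if set(itemset).issubset(tx))


def u(itemset):
    """Actual utility of the itemset over the whole database."""
    return sum(sum(tx[i] * PROFIT[i] for i in itemset)
               for tx in DB.values() if set(itemset).issubset(tx))


def two_phase(min_util):
    items = sorted(PROFIT)
    # Phase I: keep every itemset whose twu reaches the threshold.
    # twu(X) >= u(X), so this may overestimate but never misses a high utility itemset.
    candidates = [frozenset(c)
                  for k in range(1, len(items) + 1)
                  for c in combinations(items, k)
                  if twu(c) >= min_util]
    # Phase II: compute exact utilities and drop the overestimated candidates.
    return {c: u(c) for c in candidates if u(c) >= min_util}


print(two_phase(40))
```

In this toy database, item D survives Phase I because its twu reaches the threshold, is pruned in Phase II because its actual utility does not, and its superset {B, D} is still reported, which mirrors the behaviour described for D and {B, D} in the text.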

Basic Concept of Significant Rare Utility Itemsets

In this chapter, we use the relative utility threshold (RUT) to identify the association rules containing significantly rare itemsets, i.e., itemsets that have high confidence with regard to specific data. A significantly rare itemset is one whose frequency in the database does not satisfy the utility threshold but which appears together with specific data in a high proportion of its occurrences. To identify significantly rare itemsets with existing high utility itemset discovery algorithms such as the Two-Phase algorithm [21], we would have to set the utility threshold, generate the high utility itemsets whose members satisfy that threshold, and apply the specified confidence to all rules that can be produced from those itemsets. In some cases, however, the significantly rare itemsets are never discovered during the computation of the high utility itemsets. For example, suppose that data items a, b and c have supports of 25%, 35% and 30%, respectively, and that the user has set the minimum utility threshold to 35%. Then a and c cannot be members of a high utility itemset, since they do not satisfy the minimum utility threshold. However, the itemset {a, b, c} may have a support of 23%, so that about 90% of a’s occurrences come together with b and c. With the existing discovery methods, the itemset {a, b, c} is not discovered because it does not satisfy the minimum utility threshold of 35%.

To discover such significantly rare itemsets that are rarely discovered using the existing utility mining methods, our algorithms set and utilize two minimum utility thresholds. The two utility thresholds are defined as the first utility threshold and the second utility threshold.

Both of the utility thresholds are defined as follows:

Definition 4.1. 1st utility threshold: Critical value of the user-specified utility threshold used in the process of high utility itemsets discovery.

Definition 4.2. 2nd utility threshold: Critical value of the user-specified utility threshold used in the process of rare utility itemsets discovery.

The first utility threshold and the second utility threshold are set so that the condition “1st utility threshold > 2nd utility threshold” is satisfied. In addition to the utility thresholds, our algorithms use the relative utility threshold (RUT), which considers the relative frequency between data items. RUT is a measure applied to rare itemsets that satisfy the second utility threshold but not the first. Using RUT, we identify the significantly rare itemsets. RUT is defined as follows:

Definition 4.3. Relative Utility Threshold (RUT): RUT(i1, i2, …, ik) = max(threshold(i1, i2, …, ik)/threshold(i1), threshold(i1, i2, …, ik)/threshold(i2),…, threshold(i1, i2, …, ik)/threshold(ik))

RUT is between 0 and 1, and is determined by selecting the largest of the confidence values of the candidate itemset against each of its data items. A high RUT value implies that the user selects itemsets whose items co-occur in a high percentage of cases.
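Definition 4.3 can be sketched as a one-line maximum over ratios. The sketch assumes that threshold(·) in the definition is read as an occurrence count, which is how the worked example later in this section evaluates it (e.g., {B, D}/{B} = 2/5); the function name `rut` and its argument layout are illustrative choices, not part of the original algorithm.

```python
from typing import Dict


def rut(itemset_count: int, item_counts: Dict[str, int]) -> float:
    """Relative utility threshold of an itemset (Definition 4.3).

    itemset_count: occurrences of the whole itemset (assumed reading of threshold(i1..ik)).
    item_counts:   occurrences of each member item (assumed reading of threshold(ij)).
    """
    if any(c <= 0 for c in item_counts.values()):
        raise ValueError("every member item must occur at least once")
    # The largest ratio corresponds to the rarest member item.
    return max(itemset_count / c for c in item_counts.values())


# Counts taken from the db^1,3 example in this section: {B, D} occurs 2 times,
# B occurs 5 times and D occurs 4 times.
print(rut(2, {"B": 5, "D": 4}))  # max(2/5, 2/4)
```

Because the maximum is taken, the value is driven by the rarest member item, which is what lets a rare item qualify when it almost always appears together with a more common one.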

If we define RUT and discover the high utility itemsets using the utility threshold, we are able to discover high utility itemsets that reflect the different frequencies of items. For example, consumers in a supermarket buy food processors or cooking pans less frequently than bread or milk, but the former transactions are more profitable. With the existing methods, which use only a single utility threshold, we would have to lower the utility threshold to discover associations involving food processors or cooking pans, and numerous unnecessary utility itemsets satisfying the low utility threshold would be produced. By using RUT, we can discover rare utility itemsets and prevent the generation of unnecessary utility itemsets.

Our algorithms TP-RUI-Mine and TRUI-Mine are based on the principles of the Two-Phase algorithm [21] and THUI-Mine [32]. We combine these with the concept of the significantly rare itemset and focus on incremental methods that improve the response time by reducing the number of candidate itemsets and the CPU and I/O cost. In essence, by partitioning a transaction database from temporal databases into several partitions, TRUI-Mine employs a filtering threshold in each partition to deal with the transaction-weighted utilization itemsets generated. The cumulative information from the prior phases is selectively carried over toward the generation of transaction-weighted utilization itemsets in the subsequent phases. In the processing of a partition, TRUI-Mine generates a progressive transaction-weighted utilization set of itemsets, which is composed of the following two types of transaction-weighted utilization itemsets: (1) those that were carried over from the progressive candidate set of the previous phase and remain transaction-weighted utilization itemsets after the current partition is taken into consideration; and (2) those that were not in the progressive candidate set of the previous phase but are newly selected after taking only the current data partition into account. After the processing of a phase, TRUI-Mine outputs a cumulative filter, denoted by CF, which consists of a progressive transaction-weighted utilization set of itemsets, their occurrence counts and the corresponding partial utility threshold required. Temporal rare utility itemsets can then be generated using RUT. With these design considerations, TRUI-Mine is shown to perform well in mining temporal rare utility itemsets from temporal databases.

We also propose a second algorithm, TP-RUI-Mine, which is based on the principle of the Two-Phase algorithm [21] and uses the same concept and process as TRUI-Mine for generating temporal rare utility itemsets. However, because it follows the Two-Phase principle, TP-RUI-Mine generates many more candidate itemsets than TRUI-Mine. Both theory and our experimental results indicate that TRUI-Mine is more efficient than TP-RUI-Mine. Hence, we only show the processes of TRUI-Mine in detail.

An Example for Mining Temporal Rare Utility Itemsets

The proposed TRUI-Mine algorithm is best understood through the illustrative transaction database in Table 4-1 and Figure 4-1, which give a scenario of generating high utility itemsets from temporal databases for mining temporal rare utility itemsets. We set the first utility threshold at 130 and the second utility threshold at 90 over nine transactions. Because of the way utility mining filters itemsets, the second utility threshold should serve as the initial threshold. If the first utility threshold were used as the initial threshold instead, we might lose some utility itemsets that could be rare utility itemsets. In addition, we set RUT = 0.6 to find temporal rare utility itemsets. In fact, TRUI-Mine can discover not only temporal high utility itemsets but also temporal rare utility itemsets. Without loss of generality, the temporal mining problem can be divided into two procedures:

1. Preprocessing procedure: This procedure deals with mining on the original transaction database.

2. Incremental procedure: The procedure deals with the update of the high utility itemsets and rare utility itemsets from temporal databases.

Figure 4-1. Temporal rare utility itemsets generated by TRUI-Mine.

The preprocessing procedure is only utilized for the initial utility mining in the original database, e.g., db^1,n. For mining high utility itemsets and rare utility itemsets in db^2,n+1, db^3,n+2, db^i,j, and so on, the incremental procedure is employed. Consider the database in Table 4-1. Assume that the original transaction database db^1,3 is segmented into three partitions, i.e., {P1, P2, P3}, in the preprocessing procedure. Each partition is scanned sequentially for the generation of candidate 2-itemsets in the first scan of the database db^1,3. After scanning the first segment of 3 transactions, i.e., partition P1, the 2-itemsets {AB, AD, AE, BD, BE, DE} are generated as shown in Figure 4-1. In addition, each potential candidate itemset c ∈ C2 has two attributes: (1) c.start, which contains the identity of the starting partition when c was added to C2; and (2) the transaction-weighted utility, which is the sum of the transaction utilities of all the transactions containing c since c was added to C2. Since there are three partitions, the second utility threshold of each partition is 90 / 3 = 30. Such a partial utility threshold is called the “filtering threshold” in this chapter. Itemsets whose transaction-weighted utility is below the filtering threshold are removed. Then, as shown in Figure 4-1, only {AD, BD, BE, DE}, marked by “◎”, remain as temporal high transaction-weighted utilization 2-itemsets, whose information is then carried over to the next phase of processing. Similarly, after scanning partition P2, the temporal high transaction-weighted utilization 2-itemsets are recorded.

From Figure 4-1, it is noted that since there are also 3 transactions in P2, the filtering threshold of the itemsets carried over from the previous phase is 30 + 30 = 60, and that of newly identified candidate itemsets is 30. It can be seen from Figure 4-1 that we have 5 temporal high transaction-weighted utilization 2-itemsets in C2 after the processing of partition P2: 3 of them are carried from P1 to P2 and 2 of them are newly identified in P2. Note that, although it appeared in the previous phase P1, itemset {AD} is removed from the temporal high transaction-weighted utilization 2-itemsets once P2 is taken into account, since its transaction-weighted utility does not meet the filtering threshold (i.e., 42 < 60). Finally, partition P3 is processed by TRUI-Mine. The resulting temporal high transaction-weighted utilization 2-itemsets are {AB, AC, AE, BC, BD, BE, DE}, as shown in Figure 4-1. After the processing of partition P3, two new itemsets, AC and BC, join C2 as temporal high transaction-weighted utilization 2-itemsets. Consequently, TRUI-Mine generates 7 temporal high transaction-weighted utilization 2-itemsets: 3 of them are carried from P1 to P3, 2 of them are carried from P2 to P3, and 2 of them are newly identified in P3. After processing P1 to P3, the temporal high transaction-weighted utilization itemsets in db^1,3 are {A, B, C, D, E, AB, AC, AE, BC, BD, BE, DE}.
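The partition-by-partition filtering just described can be sketched as below. This is a minimal illustration under stated assumptions: the three partitions and their transaction utilities are hypothetical (they are not P1 to P3 of Table 4-1), each transaction is given directly as a pair (items, transaction utility), and the name `progressive_filter` is chosen here for readability. The only rule taken from the text is that an itemset added in partition `start` must reach s × (number of partitions seen since `start`), e.g., 30 + 30 = 60 for an itemset carried from P1 into P2.

```python
from itertools import combinations


def progressive_filter(partitions, s):
    """Return {2-itemset: (start partition, accumulated twu)} after all partitions.

    partitions: list of partitions, each a list of (items, transaction_utility).
    s: filtering threshold per partition (e.g. 90 / 3 = 30 in the text).
    """
    cf = {}  # cumulative filter: pair -> (start, twu)
    for p, partition in enumerate(partitions, start=1):
        for items, tu in partition:
            for pair in combinations(sorted(items), 2):
                # A pair not yet in CF starts in the current partition.
                start, twu = cf.get(pair, (p, 0))
                cf[pair] = (start, twu + tu)
        # Keep a pair only if its twu reaches s for every partition since its start.
        cf = {pair: (start, twu) for pair, (start, twu) in cf.items()
              if twu >= s * (p - start + 1)}
    return cf


# Hypothetical partitions: (items in the transaction, transaction utility).
P1 = [({"A", "D"}, 42), ({"B", "E"}, 35)]
P2 = [({"B", "E"}, 40), ({"A", "B"}, 31)]
P3 = [({"A", "B", "C"}, 50), ({"B", "E"}, 20)]

print(progressive_filter([P1, P2, P3], s=30))
```

In this toy run the pair (A, D) passes the threshold of 30 after P1 but is dropped after P2 because 42 < 60, which is the same kind of removal the text describes for {AD}.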

After generating temporal high transaction-weighted utilization 2-itemsets from the first scan of database db^1,3, we employ the scan reduction technique and use the temporal high transaction-weighted utilization 2-itemsets to generate Ck (k = 3, 4, ..., n), where Cn is the last set of candidate itemsets. It can be verified that the temporal high transaction-weighted utilization 2-itemsets generated by TRUI-Mine can be used to generate the candidate 3-itemsets C3. For example, the candidate 3-itemsets {ABC}, {ABE} and {BDE} are generated from the temporal high transaction-weighted utilization 2-itemsets {AB, AC, BC}, {AB, AE, BE} and {BD, BE, DE} in db^1,3. Similarly, all Ck can be stored in main memory, and when the second scan of the database db^1,3 is performed we can find both the temporal high utility itemsets, using the first utility threshold, and the temporal rare candidate itemsets, whose utility lies between the second and the first utility threshold. Thus, only two scans of the original database db^1,3 are required in the preprocessing step. The resulting temporal high utility itemsets are {B} and {BE} because u(B) = 330 > 130 and u({B, E}) = 215 > 130. In addition, the temporal rare candidate itemset is {BD} because u({B, D}) = 118 lies between 90 (second utility threshold) and 130 (first utility threshold). The individual relative utility thresholds of {B, D} are {B, D}/{B} = 2/5 = 0.4 and {B, D}/{D} = 2/4 = 0.5, so the maximum relative utility threshold of {B, D} is 0.5. However, RUT(B, D) = 0.5 < 0.6. Hence, no temporal rare utility itemset is found in the database db^1,3.
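The generation of Ck from the 2-itemsets alone can be sketched as follows. This is one simple way to realise the idea, assumed here for illustration: a k-itemset is accepted as a candidate when every 2-subset of it is among the temporal high transaction-weighted utilization 2-itemsets. The function name `candidates_from_pairs` is hypothetical; the input is the set of 2-itemsets of db^1,3 listed in the text.

```python
from itertools import combinations


def candidates_from_pairs(pairs, k):
    """Candidate k-itemsets whose 2-subsets all belong to `pairs`."""
    pair_set = {frozenset(p) for p in pairs}
    items = sorted(set().union(*pair_set))
    return [c for c in combinations(items, k)
            if all(frozenset(q) in pair_set for q in combinations(c, 2))]


# Temporal high transaction-weighted utilization 2-itemsets of db^1,3 (from the text).
pairs = ["AB", "AC", "AE", "BC", "BD", "BE", "DE"]

print(candidates_from_pairs(pairs, 3))  # [('A', 'B', 'C'), ('A', 'B', 'E'), ('B', 'D', 'E')]
print(candidates_from_pairs(pairs, 4))  # []
```

No database scan is needed for this step, which is why all Ck can be built in main memory before the second scan.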

One important merit of TRUI-Mine lies in its incremental procedure. As depicted in Figure 4-1, the mining database is moved from db^1,3 to db^2,4. Thus, some transactions, i.e., T1, T2, and T3, are deleted from the mining database and other transactions, i.e., T10, T11, and T12, are added. For clarity, this incremental step can be divided into three sub-steps: (1) generating temporal high transaction-weighted utilization 2-itemsets in D⁻ = db^1,3 − ∆⁻, (2) generating temporal high transaction-weighted utilization 2-itemsets in db^2,4 = D⁻ + ∆⁺, and (3) scanning the database db^2,4 only once for the generation of all temporal high utility itemsets and temporal rare utility itemsets. In the first sub-step, db^1,3 − ∆⁻ = D⁻, we check the pruned partition P1, reduce the value of the transaction-weighted utility and set c.start = 2 for those temporal transaction-weighted utilization 2-itemsets where c.start = 1. It can be seen that itemsets {BD, DE} are removed. Next, in the second sub-step, we scan the incremental transactions in P4. The process in D⁻ + ∆⁺ = db^2,4 is similar to the operation of scanning partitions, e.g., P2, in the preprocessing step. The new itemset {BD} joins the temporal high transaction-weighted utilization 2-itemsets after the scan of P4. In the third sub-step, we use the temporal high transaction-weighted utilization 2-itemsets to generate Ck as mentioned above. Finally, the temporal high transaction-weighted utilization itemsets in db^2,4 are {A, B, C, D, E, AE, BC, BD, BE}. Note that instead of the 10 candidate 2-itemsets that would be generated if TP-RUI-Mine were used, only 4 temporal high transaction-weighted utilization 2-itemsets are generated by TRUI-Mine. By scanning db^2,4 only once, TRUI-Mine obtains the temporal high utility itemsets {B, BE} in db^2,4 because u(B) = 270 > 130 and u({B, E}) = 150 > 130. In addition, the temporal rare candidate itemsets are {BC} and {BD} because u({B, C}) = 120 and u({B, D}) = 94 lie between 90 (second utility threshold) and 130 (first utility threshold). The individual relative utility thresholds of {B, C} are {B, C}/{B} = 3/7 ≈ 0.43 and {B, C}/{C} = 3/5 = 0.6. The individual relative utility thresholds of {B, D} are {B, D}/{B} = 2/7 ≈ 0.29 and {B, D}/{D} = 2/3 ≈ 0.67. So the maximum relative utility thresholds of {B, C} and {B, D} are 0.6 and 0.67, respectively. Since RUT(B, C) = 0.6 ≥ 0.6 and RUT(B, D) = 0.67 > 0.6, TRUI-Mine obtains the temporal rare utility itemsets {BC, BD} in the database db^2,4.
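The final decision for each candidate in the example can be sketched as a small classifier. The utilities and occurrence counts below are the ones quoted in the text for db^1,3 and db^2,4; the function name `classify`, its labels, and the reading of "between" as second threshold ≤ u < first threshold are assumptions made for this illustration.

```python
def classify(utility, itemset_count, item_counts,
             first=130, second=90, min_rut=0.6):
    """Label an itemset as 'high', 'rare' or 'neither'.

    utility:       u(X) in the current database window.
    itemset_count: number of transactions containing X.
    item_counts:   number of transactions containing each member item of X.
    """
    if utility >= first:
        return "high"
    if utility >= second:
        # Rare candidate: apply the relative utility threshold (Definition 4.3).
        rut = max(itemset_count / c for c in item_counts)
        return "rare" if rut >= min_rut else "neither"
    return "neither"


# db^2,4: u(B) = 270; u({B, C}) = 120 with counts 3, 7, 5; u({B, D}) = 94 with counts 2, 7, 3.
print(classify(270, 7, [7]))     # high
print(classify(120, 3, [7, 5]))  # rare
print(classify(94, 2, [7, 3]))   # rare
# db^1,3: u({B, D}) = 118 with counts 2, 5, 4, so RUT = 0.5 < 0.6.
print(classify(118, 2, [5, 4]))  # neither
```

The same candidate {B, D} is rejected in db^1,3 and accepted in db^2,4, which shows how the result changes as the window of the temporal database moves.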

The example above confirms that items C and D, although they are rare data items that do not satisfy the first utility threshold, occur together with item B in a high proportion of their occurrences. TRUI-Mine can therefore discover temporal rare utility itemsets that are not included in the temporal high utility itemsets but are still significant in terms of the relative utility threshold, in addition to the temporal high utility itemsets themselves.

TRUI-Mine Algorithm

For easier illustration, the meanings of various symbols used are given in Table 4-3. The preprocessing procedure and the incremental procedure of algorithm TRUI-Mine are described as follows.

Preprocessing procedure of TRUI-Mine

The preprocessing procedure of algorithm TRUI-Mine is shown in Figure 4-2. Initially, the database db^1,n is partitioned into n partitions (Step 2), and CF, the cumulative filter, is empty (Step 3). Let Thtw^i,j be the set of progressive temporal high transaction-weighted utilization 2-itemsets of db^i,j. TRUI-Mine records only Thtw^1,n, which is generated by the preprocessing procedure, for use by the incremental procedure. From Step 4 to Step 16, the algorithm processes one partition at a time for all partitions. When partition Pi is processed, each potential candidate 2-itemset is read and saved to CF. The transaction-weighted utility of an itemset I and its starting partition are recorded in I.twu and I.start, respectively. An itemset with I.twu ≥ s is kept in CF. Next, we select Thtw^1,n from the itemsets I ∈ CF and keep I.twu in main memory for the subsequent incremental procedure. By employing the scan reduction technique from Step 19 to Step 26, the candidate sets Ch^1,n (h ≥ 3) are generated in main memory. After refreshing I.count = 0 where I.twu = 0 and where I ∈ Thtw^1,n, we begin the last scan of the database for the preprocessing procedure from Step 28 to Step 31. Finally, the itemsets satisfying the constraints I.u ≥ s×P.count and I.RUT ≥ RUT are obtained as the temporal high utility itemsets and temporal rare utility itemsets.

Table 4-3. Meanings of symbols used.

Symbol         Meaning
db^i,j         Partitioned database (D) from Pi to Pj
s              Second utility threshold in one partition
F              First utility threshold
RUT            Relative utility threshold
|Pk|           Number of transactions in partition Pk
TUPk(I)        Transactions in Pk that contain itemset I with transaction utility
UPk(I)         Transactions in Pk that contain itemset I with utility
|db^1,n(I)|    Number of transactions in db^1,n that contain itemset I
C^i,j          The progressive candidate sets of db^i,j
Thtw^i,j       The progressive temporal high transaction-weighted utilization 2-itemsets of db^i,j
Thu^i,j        The progressive temporal high utility itemsets of db^i,j
Tru^i,j        The progressive temporal rare utility itemsets of db^i,j
∆⁻             The deleted portion of an ongoing database
D⁻             The unchanged portion of an ongoing database
∆⁺             The added portion of an ongoing database

Figure 4-2. Preprocessing procedure of TRUI-Mine.

Incremental procedure of TRUI-Mine

As shown in Table 4-3, D⁻ indicates the unchanged portion of an ongoing transaction database.

The deleted and added portions of an ongoing transaction database are denoted by ∆⁻ and ∆⁺, respectively. It is worth mentioning that the sizes of ∆⁺ and ∆⁻, i.e., | ∆⁺ | and | ∆⁻ | respectively, are not required to be the same. The incremental procedure of TRUI-Mine is devised to maintain temporal high utility itemsets efficiently and effectively. This procedure is shown in Figure 4-3. As mentioned before, this incremental step can also be divided into three sub-steps: (1) generating temporal high transaction-weighted utilization 2-itemsets in D⁻ =
