CHAPTER 1 Introduction
1.3 Organization of Thesis
The rest of this thesis is organized as follows. Chapter 2 reviews the background and related work, including association-rule mining with multiple minimum supports and utility mining. Chapter 3 gives the problem definitions for the proposed utility mining with minimum constraints and describes the experiments for the proposed TPMmin algorithm for finding high utility itemsets. Chapter 4 then states the problem definitions for the proposed utility mining with maximum constraints and the experiments related to the proposed TPMmax approach. Finally, the conclusions and future work are given and discussed in Chapter 5.
CHAPTER 2
Review of Related Works
In this chapter, some related studies on association-rule mining, association-rule mining with multiple criteria, and utility mining are briefly reviewed.
2.1 Association-Rule Mining
Data mining techniques are used to extract useful information from various types of data. Among the published techniques, association-rule mining is one of the important issues in the field of data mining because it considers the co-occurrence relationship of items in transactions [3][4]. For example, assume that the product combination “{milk, bread}” occurs with high frequency in a supermarket, which indicates that most customers usually buy the two products together in that supermarket. To find such information, Agrawal et al. proposed a well-known mining approach named Apriori [3][4]. The process of the Apriori algorithm can be divided into two stages: (1) finding frequent itemsets and (2) generating association rules. In the first stage, candidate itemsets were generated and then counted by scanning the transaction data. If the count of an itemset in the transaction database was larger than or equal to a pre-defined threshold value (called the minimum support threshold), the itemset was identified as a frequent one. Itemsets containing only one item were processed first. Frequent itemsets containing only single items were then combined to form candidate itemsets with two items. The above process was then repeated until no candidate itemsets were generated. In the second stage, association rules were derived from the set of frequent itemsets found in the first stage. All possible association combinations for each frequent itemset were formed, and those with calculated confidence larger than or equal to a pre-defined threshold (called the minimum confidence threshold) were output as the association rules. Afterward, many studies based on the framework of the Apriori approach were also published to find association rules in databases [5][9][10][15][16].
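The two stages described above can be illustrated with a minimal Python sketch. It is not the implementation from [3][4]; the function names `apriori` and `association_rules`, the toy baskets, and the two threshold values are assumptions chosen only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Stage 1: return {itemset: support} for every frequent itemset."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for b in baskets if itemset <= b) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Itemsets containing only one item are processed first.
    level = keep_frequent({frozenset([i]) for b in baskets for i in b})
    frequent = dict(level)
    size = 1
    while level:
        size += 1
        # Combine frequent itemsets into candidates that are one item larger,
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # then keep only candidates whose smaller subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))}
        level = keep_frequent(candidates)
        frequent.update(level)
    return frequent


def association_rules(frequent, min_confidence):
    """Stage 2: return (antecedent, consequent, confidence) triples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical toy data, not taken from the thesis.
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"},
           {"milk", "bread"}, {"eggs"}]
frequent = apriori(baskets, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.8)
```

With these toy baskets, {milk}, {bread} and {milk, bread} are frequent, and only the rule {milk} => {bread} reaches the assumed confidence threshold of 0.8.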
2.2 Association-Rule Mining with Multiple Criteria
As mentioned above, association-rule mining uses only a single minimum support threshold to determine whether or not an itemset is frequent in a database [3][4]. However, in practical applications, items may have different criteria for assessing their importance [6][10][11][12][13][15][16][17][20][21][22][26][28][29][30]. That is, different items should have different support requirements. For example, a lower minimum support should be given to the product “LCD TVs”, which has high profit but low frequency in a supermarket, than to another product “Milks”, which has low profit but high frequency.
To address this problem, Liu et al. presented a new issue, namely association-rule mining with multiple minimum supports [26], which allowed users to assign different minimum support requirements to items according to the significance of the items, such as profit or cost. However, since an itemset is composed of several distinct items, how to assign a proper minimum support to the itemset is an important problem. To solve this, Liu et al. designed a minimum constraint to determine the minimum support of an itemset [26]. The main concept of the minimum constraint was that the minimum value of the minimum supports of all items in an itemset was regarded as the minimum support of that itemset. For example, assume there are three items, A, B and C, and the minimum supports of the three items are 0.3, 0.6 and 0.4, respectively. According to Liu et al.’s study [26], the minimum support of the itemset {ABC} is 0.3 because 0.3 is the minimum value of the three minimum supports for the three items, A, B and C. However, under the minimum constraint, the minimum support of an itemset might be lower than that of a subset of the itemset. Continuing the above example, the minimum support of the itemset {ABC} is lower than that of its subset {B}. As this example shows, the downward-closure property in association-rule mining could not be kept in Liu et al.’s problem [26], because a subset such as {B} may fail its own higher threshold even though {ABC} satisfies the lower one. To solve this, an effective strategy was proposed in which all distinct items in a database were sorted in ascending order of their minimum support values. Continuing with the above example, since the minimum supports of the three items A, B and C are 0.3, 0.6 and 0.4, the sorted list A, C and B can be obtained according to their minimum supports. With the help of this strategy, an Apriori-based approach was developed in Liu et al.’s study [26] to effectively find frequent itemsets when items had different minimum supports under the minimum constraint.
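The minimum constraint and the sorting strategy in this example can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and breaking ties by item name is an assumption that the text does not specify.

```python
# Per-item minimum supports from the example in the text.
min_sup = {"A": 0.3, "B": 0.6, "C": 0.4}


def itemset_min_support(itemset):
    """Minimum constraint: the smallest per-item minimum support in the itemset."""
    return min(min_sup[i] for i in itemset)


def sort_items(items):
    """Ascending order of minimum support (ties broken by name, an assumption)."""
    return sorted(items, key=lambda i: (min_sup[i], i))


threshold_abc = itemset_min_support("ABC")
threshold_b = itemset_min_support("B")
order = sort_items(min_sup)

# After sorting, the threshold of an itemset is simply that of its first item.
first_item_threshold = min_sup[sort_items("ABC")[0]]
```

Here `threshold_abc` is 0.3 while `threshold_b` is 0.6, which is the situation in which {ABC} may be frequent although its subset {B} is not, and `order` is the sorted list A, C, B.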
Different from Liu et al.’s study [26], Wang et al. [30] presented a bin-oriented, non-uniform support constraint, which allowed the minimum support value of an itemset to be any function of the minimum support values of the items contained in the itemset. The main concept was that items were first grouped into disjoint sets called bins, and items within the same bin were regarded as non-distinguishable with respect to the specification of a minimum support [30]. However, although their proposed approach is flexible in terms of assigning minimum supports to itemsets, the performance of their mining approach is not good due to its generality [30].
To effectively reduce the time complexity, Lee et al. proposed another viewpoint, namely the maximum constraint, to assign the minimum support requirement of an itemset [20]. Under this constraint, the maximum value of the minimum supports of the items in an itemset is regarded as the minimum support of that itemset. Their proposed algorithm was simple and efficient under the maximum constraint when compared to the previous studies [26][30]. In Lee et al.’s study [20], the experimental results also showed that the number of frequent itemsets found under the maximum constraint was less than that under the minimum constraint, and thus the mined association-rule set could be more compact.
2.3 Utility Mining
In real applications, a transaction in a transaction database usually involves the quantities and profits of items in addition to the items themselves [3][4]. However, because it considers only the co-occurrence relationship of items, association-rule mining is insufficient for handling such data [3][4]. For example, jewels and diamonds have high utility values but may not be frequent products when compared to food and drink in a transaction database. Thus, high-profit but low-frequency itemsets may not be found by traditional association-rule mining approaches. To address this problem, Yao et al. proposed a utility function [31], which considered not only the quantities of the items but also their individual profits in transactions, to find high utility itemsets from a transaction database. According to Yao et al.’s definitions [31], the local transaction utility (quantity) and the external utility (profit) are used to measure the utility of an item. By using a transaction dataset and a utility table together, the discovered itemsets are able to better match a user’s expectations than those found by considering only the transaction dataset itself.
However, the downward-closure property in association-rule mining cannot be directly applied to the utility mining problem [31]. To effectively reduce the search space in mining, Liu et al. proposed a two-phase approach (abbreviated as TP) to efficiently handle the problem of utility mining [27]. In particular, an upper-bound model (called transaction-weighted utilization, TWU) was developed to keep the downward-closure property in mining [27]. The main principle of the model was that the summation of the utility values of all the items in a transaction was regarded as the upper bound of the utility of any itemset in that transaction. The whole process of the mining algorithm could be divided into two phases. In the first phase, the promising itemsets with high utility upper bounds were found from a transaction database by the TWU model [27]. In the second phase, an additional data scan was performed to find the actual utility of each promising itemset and to identify the ones with actual utility values larger than or equal to a predefined threshold (called the minimum utility threshold). Afterward, most existing approaches were based on the framework of the TP algorithm to cope with various applications from the viewpoint of utility mining, such as efficiency improvement of utility mining [1][2][8][23][32], utility mining with negative item profits [7], on-shelf utility mining [18][19], and incremental utility mining [24].
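The reason why the TWU model restores the downward-closure property can be written out in a few lines. The derivation below is a sketch that assumes all item utilities are non-negative; the function-style notation $u(X, T)$, $tu(T)$ and $twu(X)$ is shorthand adopted here for illustration, with $T$ ranging over the transactions of the database.

```latex
\begin{align*}
u(X, T) &= \sum_{i \in X} u(i, T) \;\le\; \sum_{i \in T} u(i, T) = tu(T)
  && \text{for } X \subseteq T, \\
u(X) &= \sum_{T \supseteq X} u(X, T) \;\le\; \sum_{T \supseteq X} tu(T) = twu(X), \\
X \subseteq Y &\;\Longrightarrow\; \{T : Y \subseteq T\} \subseteq \{T : X \subseteq T\}
  \;\Longrightarrow\; twu(Y) \le twu(X).
\end{align*}
```

The second line shows that the transaction-weighted utilization never underestimates the actual utility of an itemset, and the third line shows that it cannot increase when an itemset is extended, so an itemset whose upper bound is below the threshold can be pruned together with all of its supersets.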
As in association-rule mining [3][4], one of the main limitations of utility mining is that all the items are treated uniformly [27][31]. However, in real applications, different items may have different criteria for judging their importance, and thus the utility requirements should vary with different items. Designing a utility-based framework with multiple minimum utilities is therefore a critical issue. In addition, since the downward-closure property in association-rule mining is not kept in utility mining, utility mining is more difficult than association-rule mining. Hence, how to develop an effective model that avoids any loss of information when items have different minimum utilities is another critical issue.
The above issues motivate our exploration of a new issue, multi-criteria utility mining. In addition, two different viewpoints, the minimum constraint and the maximum constraint, are considered in the multi-criteria utility mining problem, and two effective approaches are presented in this thesis to cope with the two problems.
CHAPTER 3
Multi-criteria Utility Mining Using Minimum Constraints
3.1 Introduction
Association-rule mining techniques consider only the co-occurrence frequency of items in transactions, but in retailing, transactions usually involve the profits, costs and sold quantities of items. In addition, the same importance is assumed for all items in databases. Hence, association-rule mining techniques are insufficient for recognizing the actual significance of an item in a database. To address this problem, Yao et al. introduced a utility-based framework [31], which considered both the quantities bought and the profits of items in transactions to recognize the actual utility of itemsets in a database. With the help of the utility function, itemsets with actual utilities larger than or equal to a predefined threshold (called the minimum utility threshold) could be found in databases [31].
However, all items in the utility-based framework are treated uniformly. That is, a single minimum utility is used as the utility requirement for all items in a database. As mentioned in previous studies [6][10][11][12][13][15][16][17][20][21][22][26][28][29][30], a single minimum utility cannot easily reflect the nature of the items, such as their significance. For example, in the real world, since the profit of the item “LCD TV” is obviously higher than that of “Milk”, a single utility requirement cannot easily reflect the importance of the two items. As this example shows, developing a utility-based framework with multiple minimum utilities is a critical issue. In addition, the existing utility mining approaches cannot be directly applied to handle such a utility mining problem. Accordingly, designing a proper mining method that avoids any loss of information in the problem is another critical issue.
For the above reasons, this work presents a new research issue named multi-criteria utility mining, which allows users to define different minimum utilities for the items in a database. In particular, to find all possible interesting information, a minimum constraint is adopted in this work. Based on the existing upper-bound model, an effective sorting strategy is designed to keep the downward-closure property in mining under the minimum constraint. In addition, a two-phase mining approach is developed to cope with the problem of multi-criteria utility mining under minimum constraints, so that the efficiency of finding high utility itemsets can be raised. Finally, the experimental results on synthetic datasets show that the proposed approach performs well in execution efficiency when compared with the state-of-the-art mining approach, TP.
The rest of this chapter is organized as follows. The problem to be solved and the related definitions are described in Section 3.2. The execution details of the proposed two-phase TPMmin algorithm are introduced in Section 3.3. An example is presented in Section 3.4. Finally, the experimental results and discussions are shown in Section 3.5.
3.2 Problem Statement and Definitions
In this section, to clearly explain the problem to be solved, a set of terms related to utility mining with multiple minimum utilities is defined as follows.
Table 3.1: The ten transactions in this example
TID A B C D E F
1 1 0 2 1 1 1
2 0 1 25 0 0 0
3 0 0 0 0 2 1
4 0 1 12 0 0 0
5 2 0 8 0 2 0
6 0 0 4 1 0 1
7 0 0 2 1 0 0
8 3 2 0 0 2 3
9 2 0 0 1 0 0
10 0 0 4 0 2 0
Table 3.2: The individual profit of items in the utility table
Item Profit
A 3
B 10
C 1
D 6
E 5
F 2
Table 3.3: The individual threshold of items
Item Threshold
A 0.20
B 0.40
C 0.25
D 0.15
E 0.20
F 0.15
Definition 1. Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of items that may appear in transactions. An itemset X is a subset of the items in I, $X \subseteq I$. If $|X| = r$, the itemset X is called an r-itemset. For example, the 2-itemset {BC} includes two items, B and C.
Definition 2. A transaction (Trans) consists of a set of purchased items with their quantities. For example, in Table 3.1, the second transaction includes the two items, B and C, and the quantities of the two items are 1 and 25, respectively.
Definition 3. A database D is composed of a set of transactions. That is, $D = \{Trans_1, Trans_2, \ldots, Trans_y, \ldots, Trans_z\}$, where $Trans_y$ is the y-th transaction in D.
Definition 4. The quantity of an item i in a transaction $Trans_y$ is denoted as $q_{yi}$. For example, in Table 3.1, the quantity of item A in the transaction $Trans_1$ is 1.
Definition 5. The external utility $s_i$ is the individual profit of an item i in the utility table. For example, in Table 3.2, the profit of item A in the utility table is 3.
Definition 6. The utility $u_{yi}$ of an item i in $Trans_y$ is the external utility $s_i$ multiplied by the quantity $q_{yi}$ of i in $Trans_y$. That is,
$$u_{yi} = s_i \times q_{yi}.$$
For example, according to Table 3.1 and Table 3.2, the utility of item A in the first transaction can be calculated as 3*1, which is 3.
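Definition 7. The transaction utility $tu_y$ is the sum of the utility values of all items contained in $Trans_y$. That is,
$$tu_y = \sum_{i \in Trans_y} u_{yi}.$$
For example, according to Table 3.1 and Table 3.2, the transaction utility $tu_1$ of the first transaction is 3*1 + 1*2 + 6*1 + 5*1 + 2*1 = 18.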
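Definitions 6 and 7 can be checked against Table 3.1 and Table 3.2 with a short Python sketch. The data are copied from the two tables; the names `ROWS`, `DB`, `item_utility` and `transaction_utility` are hypothetical and serve only this illustration.

```python
ITEMS = "ABCDEF"
ROWS = [  # quantities from Table 3.1, one tuple per transaction (TID 1 to 10)
    (1, 0, 2, 1, 1, 1), (0, 1, 25, 0, 0, 0), (0, 0, 0, 0, 2, 1),
    (0, 1, 12, 0, 0, 0), (2, 0, 8, 0, 2, 0), (0, 0, 4, 1, 0, 1),
    (0, 0, 2, 1, 0, 0), (3, 2, 0, 0, 2, 3), (2, 0, 0, 1, 0, 0),
    (0, 0, 4, 0, 2, 0),
]
PROFIT = dict(zip(ITEMS, (3, 10, 1, 6, 5, 2)))  # external utilities, Table 3.2

# Keep only purchased items (quantity > 0); DB[0] is the first transaction.
DB = [{i: q for i, q in zip(ITEMS, row) if q > 0} for row in ROWS]


def item_utility(item, trans):
    """Definition 6: external utility times the quantity in the transaction."""
    return PROFIT[item] * trans.get(item, 0)


def transaction_utility(trans):
    """Definition 7: sum of the utilities of all items in the transaction."""
    return sum(item_utility(i, trans) for i in trans)


tu = [transaction_utility(t) for t in DB]
total_utility = sum(tu)
```

For the first transaction this gives an item utility of 3 for A and a transaction utility of 18, and the ten transaction utilities sum to 202.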
Definition 8. The utility $u_{yX}$ of an itemset X in $Trans_y$ is the summation of the utilities of all items in X in $Trans_y$. That is,
$$u_{yX} = \sum_{i \in X} u_{yi}, \quad X \subseteq Trans_y.$$
For example, according to the three tables, the utility of the itemset {AE} in the first transaction can be calculated as 3*1 + 5*1, which is 8.
Definition 9. The actual utility $au_X$ of an itemset X in a transaction database D is the summation of the utilities of X in the transactions of D that include X. That is,
$$au_X = \sum_{X \subseteq Trans_y \wedge Trans_y \in D} u_{yX}.$$
For example, according to the three tables, the actual utility $au_{\{AE\}}$ of the itemset {AE} in Table 3.1 can be calculated as 8 + 16 + 19, which is 43.
Definition 10. The actual utility ratio $aur_X$ of an itemset X in D is the summation of the utilities of X in the transactions of D that include X over the summation of the transaction utilities of all transactions in D. That is,
$$aur_X = \frac{\sum_{X \subseteq Trans_y \wedge Trans_y \in D} u_{yX}}{\sum_{Trans_y \in D} tu_y}.$$
For example, in Table 3.1, the summation of the transaction utilities of all transactions can be calculated as 18 + 35 + 12 + 22 + 24 + 12 + 8 + 45 + 12 + 14, which is 202. Since the actual utility $au_{\{AE\}}$ of the itemset {AE} in Table 3.1 is 43, the actual utility ratio $aur_{\{AE\}}$ of {AE} in Table 3.1 can be calculated as 43/202, which is 0.2128.
Definition 11. Let $\lambda_i$ be the predefined individual minimum utility threshold of an item i. Under the minimum constraint, the minimum value of the minimum utilities of all items in X is selected as the minimum utility threshold $\lambda_X$ of X, that is, $\lambda_X = \min_{i \in X} \lambda_i$. Hence, an itemset X is called a high utility itemset (abbreviated as HU) if $aur_X \ge \lambda_X$.
For example, in Table 3.3, since the minimum utilities of the two items, A and E, are both 0.2, the minimum value (= 0.2) is selected as the minimum utility threshold of {AE}. In Table 3.1, the itemset {AE} is a high utility itemset under the minimum constraint because its actual utility ratio (= 0.2128) is larger than or equal to this threshold.
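The chain from the utility of an itemset to the high-utility test can be traced on the running example. The following Python sketch takes its data from Tables 3.1 to 3.3; the helper names are hypothetical, and the checks on {B} and {BC} are computed here for illustration rather than quoted from the thesis.

```python
ITEMS = "ABCDEF"
ROWS = [  # quantities from Table 3.1, one tuple per transaction (TID 1 to 10)
    (1, 0, 2, 1, 1, 1), (0, 1, 25, 0, 0, 0), (0, 0, 0, 0, 2, 1),
    (0, 1, 12, 0, 0, 0), (2, 0, 8, 0, 2, 0), (0, 0, 4, 1, 0, 1),
    (0, 0, 2, 1, 0, 0), (3, 2, 0, 0, 2, 3), (2, 0, 0, 1, 0, 0),
    (0, 0, 4, 0, 2, 0),
]
PROFIT = dict(zip(ITEMS, (3, 10, 1, 6, 5, 2)))                      # Table 3.2
THRESHOLD = dict(zip(ITEMS, (0.20, 0.40, 0.25, 0.15, 0.20, 0.15)))  # Table 3.3

# Keep only purchased items (quantity > 0) in each transaction.
DB = [{i: q for i, q in zip(ITEMS, row) if q > 0} for row in ROWS]
TOTAL_UTILITY = sum(PROFIT[i] * q for t in DB for i, q in t.items())


def itemset_utility(itemset, trans):
    """Utility of an itemset in one transaction that contains it."""
    return sum(PROFIT[i] * trans[i] for i in itemset)


def actual_utility(itemset):
    """Sum of the itemset utility over the transactions that include the itemset."""
    return sum(itemset_utility(itemset, t)
               for t in DB if all(i in t for i in itemset))


def actual_utility_ratio(itemset):
    return actual_utility(itemset) / TOTAL_UTILITY


def min_constraint_threshold(itemset):
    """Minimum constraint: smallest individual threshold among the items."""
    return min(THRESHOLD[i] for i in itemset)


def is_high_utility(itemset):
    return actual_utility_ratio(itemset) >= min_constraint_threshold(itemset)


au_ae = actual_utility("AE")
aur_ae = actual_utility_ratio("AE")
```

On this data, `au_ae` is 43 and `aur_ae` is 43/202, so {AE} passes its threshold of 0.2. The same sketch reports {BC} as a high utility itemset while its subset {B} is not, which illustrates why the downward-closure property cannot be relied on for actual utilities.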
Next, to keep the downward-closure property in this problem, the transaction-weighted utilization (TWU) model is introduced. A set of terms related to this model is described below.
Definition 12. The transaction-weighted utility $twu_X$ of an itemset X in a transaction database D is the summation of the transaction utilities of the transactions including X in D. That is,
$$twu_X = \sum_{X \subseteq Trans_y \wedge Trans_y \in D} tu_y.$$
For example, in Table 3.1, since the itemset {AE} appears in the three transactions, $Trans_1$, $Trans_5$ and $Trans_8$, and the transaction utilities of the three transactions are 18, 24 and 45, the transaction-weighted utility of {AE} can be calculated as 18 + 24 + 45, which is 87.
Definition 13. The transaction-weighted utility ratio $twur_X$ of an itemset X in D is the summation of the transaction utilities of the transactions including X in D over the summation of the transaction utilities of all transactions in D. That is,
$$twur_X = \frac{\sum_{X \subseteq Trans_y \wedge Trans_y \in D} tu_y}{\sum_{Trans_y \in D} tu_y}.$$
For example, in Table 3.1, the transaction-weighted utility ratio of {AE} can be calculated as 87/202, which is 0.4307.
Definition 14. Let $\lambda_i$ be the predefined individual minimum utility threshold of an item i, and let the minimum value of the minimum utilities of all items in X be selected as the minimum utility threshold $\lambda_X$ of X by the minimum constraint. Hence, an itemset X is called a high transaction-weighted utilization itemset (abbreviated as HTWU) if $twur_X \ge \lambda_X$.
For example, in Table 3.1, the itemset {AE} is a high transaction-weighted utilization itemset because its transaction-weighted utility ratio (= 0.4307) is larger than or equal to its minimum utility threshold (= 0.2).
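The TWU quantities for the running example can be computed in the same way. The Python sketch below uses the data of Tables 3.1 to 3.3 with hypothetical helper names, and it also checks by brute force that the transaction-weighted utility is never smaller than the actual utility for any itemset over the six items.

```python
from itertools import combinations

ITEMS = "ABCDEF"
ROWS = [  # quantities from Table 3.1, one tuple per transaction (TID 1 to 10)
    (1, 0, 2, 1, 1, 1), (0, 1, 25, 0, 0, 0), (0, 0, 0, 0, 2, 1),
    (0, 1, 12, 0, 0, 0), (2, 0, 8, 0, 2, 0), (0, 0, 4, 1, 0, 1),
    (0, 0, 2, 1, 0, 0), (3, 2, 0, 0, 2, 3), (2, 0, 0, 1, 0, 0),
    (0, 0, 4, 0, 2, 0),
]
PROFIT = dict(zip(ITEMS, (3, 10, 1, 6, 5, 2)))                      # Table 3.2
THRESHOLD = dict(zip(ITEMS, (0.20, 0.40, 0.25, 0.15, 0.20, 0.15)))  # Table 3.3

# Keep only purchased items (quantity > 0) in each transaction.
DB = [{i: q for i, q in zip(ITEMS, row) if q > 0} for row in ROWS]


def transaction_utility(trans):
    return sum(PROFIT[i] * q for i, q in trans.items())


TOTAL_UTILITY = sum(transaction_utility(t) for t in DB)


def contains(trans, itemset):
    return all(i in trans for i in itemset)


def twu(itemset):
    """Sum of the transaction utilities of the transactions that include the itemset."""
    return sum(transaction_utility(t) for t in DB if contains(t, itemset))


def twur(itemset):
    return twu(itemset) / TOTAL_UTILITY


def is_htwu(itemset):
    """High transaction-weighted utilization test under the minimum constraint."""
    return twur(itemset) >= min(THRESHOLD[i] for i in itemset)


def actual_utility(itemset):
    return sum(PROFIT[i] * t[i] for t in DB if contains(t, itemset) for i in itemset)


all_itemsets = [c for r in range(1, len(ITEMS) + 1)
                for c in combinations(ITEMS, r)]
upper_bound_holds = all(actual_utility(x) <= twu(x) for x in all_itemsets)
```

For {AE} the sketch gives a transaction-weighted utility of 87 and a ratio of 87/202, so {AE} passes the test, and `upper_bound_holds` evaluates to true on this database.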
Problem Statement: Based on the above definitions, the problem to be solved in this work is to find, from a given transaction database D, the itemsets whose actual utility ratios are larger than or equal to their corresponding minimum utility thresholds under the minimum constraint. The details of the proposed TPMmin algorithm are described in the next section.
3.3 The Proposed Algorithm (TPMmin)
In this study, the proposed TPMmin algorithm based on minimum constraints consists of two phases: finding high transaction-weighted utilization itemsets and finding high utility itemsets. The execution process of TPMmin is stated as follows.
INPUT: A set of items, each with a profit value and a minimum utility threshold, and a transaction database D, in which each transaction includes a subset of items with quantities.
OUTPUT: A final set of high utility itemsets (HUs) satisfying their minimum utilities.
Phase 1: Finding high transaction-weighted utilization itemsets (HTWUs)
STEP 1: Sort the items in transactions in ascending order of their minimum utility values.
STEP 2: For each transaction Transy in D, do the following substeps.
(a) Find the utility $u_{yz}$ of each item $i_z$ in $Trans_y$. That is, $u_{yz} = s_z \times q_{yz}$.
STEP 3: Find the total of the transaction utilities of all transactions in D.
STEP 4: For each item i in D, calculate the transaction-weighted utility ratio $twur_i$ of item i as:
$$twur_i = \frac{\sum_{i \in Trans_y \wedge Trans_y \in D} tu_y}{\sum_{Trans_y \in D} tu_y}.$$
STEP 5: Find the smallest value of the minimum utilities of all items in D, and denote it as $\lambda_{min,1}$.
STEP 6: For each item i in D, if the transaction-weighted utility ratio $twur_i$ of i is larger than or equal to the corresponding minimum utility threshold $\lambda_i$ of the item i, put it in the set of high transaction-weighted utilization 1-itemsets, HTWU1.
STEP 7: Set r = 1, where r represents the number of items in the current set of candidate utility r-itemsets (Cr) to be processed.
STEP 8: Generate from the set HTWUr the candidate set Cr+1, in which all the r-sub-itemsets of each candidate must be contained in the set of HTWUr.
STEP 9: For each candidate (r+1)-itemset X in the set Cr+1, find the