The main goal of traditional association rule mining is to find co-occurrence relationships among the items in a database [1][2]. The discovered rules, however, do not give decision makers all the information they need for analysis, such as the quantities of items, even though a transaction in a retail database usually records the quantities of the items bought.
Traditional frequent itemset mining techniques are thus insufficient for such quantitative data. Srikant et al. therefore proposed quantitative association rule mining, in which the value range of each attribute is partitioned into several intervals to find useful quantitative association rules [31]. For example, consider the quantitative rule “{age:[20~29]car:[0, 1]}”, which states that most customers who are 20 to 29 years old usually do not buy cars or buy just one car [31]. However, deciding suitable intervals for the domain values of each attribute is difficult, and the discovered rules are not easy to comprehend.
Fuzzy set theory is widely used in intelligent systems because of its simplicity and its comprehensibility for human reasoning. Kuok et al. proposed fuzzy data mining, which applies the concept of fuzzy set theory to data mining [23]. Quantitative values in transactions are converted into linguistic regions, and the count of a fuzzy itemset in a transaction is then calculated as the product of the membership values of all fuzzy terms of the itemset in that transaction. Information derived from transactions with linguistic regions is simple and easy to comprehend. Hong et al. proposed an effective Apriori-based mining algorithm that adopts the minimum operator to obtain the scalar cardinality of an itemset in a transaction [15]. Hong et al. later proposed an advanced mining approach that considers the trade-off between the number of rules and the computation time: for each item, only the fuzzy term with the highest fuzzy count is kept in the set of fuzzy frequent 1-itemsets, which reduces the number of generated fuzzy rules [16]. Several other Apriori-based fuzzy data mining techniques have been proposed [4][17][19][21][25][26][27][33][34][35]. These approaches spend a lot of time generating a large number of candidates and determining their fuzzy counts in transactions. A method for efficiently finding fuzzy rules in quantitative databases is thus required.
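The two ways of scoring a fuzzy itemset within a single transaction mentioned above, the product and the minimum operator, can be compared with a minimal sketch. The membership values and the function names `count_by_product` and `count_by_minimum` are hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
from math import prod

# Hypothetical membership values of two fuzzy terms in one transaction.
# The numbers are illustrative only and do not come from the thesis.
memberships = {"A.Middle": 0.8, "C.Low": 0.6}


def count_by_product(values):
    """Fuzzy count of an itemset in one transaction using the product."""
    return prod(values)


def count_by_minimum(values):
    """Fuzzy count of an itemset in one transaction using the minimum operator."""
    return min(values)


values = list(memberships.values())
product_count = count_by_product(values)
minimum_count = count_by_minimum(values)
print(f"product: {product_count:.2f}, minimum: {minimum_count:.2f}")
```

For membership values in [0, 1], the minimum is never smaller than the product, so the two operators can rank the same itemset differently against a given threshold.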
CHAPTER 3
Problem Statement and Definitions
To describe the proposed fuzzy mining problem clearly, assume a quantitative transaction database (QDB) with ten transactions as shown in Table 3.1.1, in which each transaction consists of three features - transaction identification (TID), items purchased, and their quantities in the transaction. There are four items in the transactions, denoted as A to D. The value attached to each item in the corresponding slot is the quantity sold in a transaction, and the same membership functions with the three fuzzy regions, Low, Middle and High, are given to the four items for simplicity.
Table 3.1.1: Set of ten quantitative transactions in the example.
TID A B C D
Trans1 11 0 0 0
Trans2 2 0 1 1
Trans3 0 1 0 0
Trans4 8 0 1 2
Trans5 6 0 0 2
Trans6 7 0 1 3
Trans7 0 2 3 0
Trans8 2 0 1 1
Trans9 0 1 0 0
Trans10 10 0 0 0
Figure 3.1.1: Membership functions for the four items in the example.
Using the above example, a set of terms related to the proposed fuzzy data mining approach is defined as follows.
Definition 1.
Let I = {i1, i2, i3, …, im} be a set of items which may appear in transactions. An itemset X is a subset of items; that is, X ⊆ I. If |X| = r, itemset X is called an r-itemset. For example, itemset {AB} contains two items and is called a 2-itemset.
Definition 2.
A quantitative transaction (Trans) is composed of a set of purchased items with their quantities. For example, in Table 3.1.1, the second quantitative transaction {2A, 1C, 1D} contains the items A, C and D, whose quantity values are 2, 1, and 1, respectively.
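As a concrete illustration of the transaction notation, Table 3.1.1 can be encoded as a mapping from TID to purchased items. This is only a sketch: the names `ITEMS`, `ROWS`, `QDB` and `format_transaction` are assumptions of this example, and an item with quantity 0 is treated as not purchased.

```python
# Table 3.1.1 as rows of quantities for items A to D.
ITEMS = ["A", "B", "C", "D"]
ROWS = {
    "Trans1": [11, 0, 0, 0],
    "Trans2": [2, 0, 1, 1],
    "Trans3": [0, 1, 0, 0],
    "Trans4": [8, 0, 1, 2],
    "Trans5": [6, 0, 0, 2],
    "Trans6": [7, 0, 1, 3],
    "Trans7": [0, 2, 3, 0],
    "Trans8": [2, 0, 1, 1],
    "Trans9": [0, 1, 0, 0],
    "Trans10": [10, 0, 0, 0],
}

# Keep only purchased items (quantity > 0) in each quantitative transaction.
QDB = {
    tid: {item: q for item, q in zip(ITEMS, row) if q > 0}
    for tid, row in ROWS.items()
}


def format_transaction(trans):
    """Render a transaction in the {2A, 1C, 1D} style used in the text."""
    return "{" + ", ".join(f"{q}{item}" for item, q in trans.items()) + "}"


print(format_transaction(QDB["Trans2"]))  # {2A, 1C, 1D}
```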
Definition 3.
A quantitative database QDB is composed of a set of quantitative transactions. That is, QDB = {Trans1, Trans2, …, Transy, …, Transz}, where Transy is the y-th quantitative transaction in QDB and z is the number of transactions.
Definition 4.
The quantitative value vyz is the quantity of the z-th item iz in a transaction Transy. For example, in Table 3.1.1, the quantitative value v2,C (or v2,3) of the third item C in Trans2 is 1.
Definition 5.
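The fuzzy set $f_{yz}$ of the quantitative value $v_{yz}$ of the z-th item $i_z$ in a transaction $Trans_y$ can be represented by the given membership functions for the item $i_z$ as:
$$f_{yz} = \left( f_{yz1}/R_{z1} + f_{yz2}/R_{z2} + \cdots + f_{yzl}/R_{zl} + \cdots + f_{yzh}/R_{zh} \right),$$
where h is the number of fuzzy regions of $i_z$, $R_{zl}$ is the l-th fuzzy region of $i_z$ (1 ≤ l ≤ h), and $f_{yzl}$ is the fuzzy membership value of $v_{yz}$ of $i_z$ in the l-th fuzzy region $R_{zl}$.
For example, in Table 3.1.1, the quantitative value (= 1) of item C in Trans2 can be converted to the fuzzy set $f_{2,C}$ (or $f_{2,3}$) by using the given membership functions of item C with the three fuzzy regions shown in Figure 3.1.1.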
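The conversion of a quantity into a fuzzy set can be sketched as follows. Because the exact shapes of the membership functions in Figure 3.1.1 are not restated in the text, the breakpoints 1, 6 and 11 used below are assumptions of this example, as are the names `fuzzify` and `as_fuzzy_set`; the resulting membership values therefore need not match those of the thesis.

```python
REGIONS = ("Low", "Middle", "High")


# Hypothetical membership functions with breakpoints 1, 6 and 11.
# These shapes are an assumption, not the functions of Figure 3.1.1.
def low(v):
    if v <= 1:
        return 1.0
    if v >= 6:
        return 0.0
    return (6 - v) / 5


def middle(v):
    if v <= 1 or v >= 11:
        return 0.0
    if v <= 6:
        return (v - 1) / 5
    return (11 - v) / 5


def high(v):
    if v <= 6:
        return 0.0
    if v >= 11:
        return 1.0
    return (v - 6) / 5


MEMBERSHIP = {"Low": low, "Middle": middle, "High": high}


def fuzzify(quantity):
    """Return {region: membership value} for one quantitative value."""
    return {region: MEMBERSHIP[region](quantity) for region in REGIONS}


def as_fuzzy_set(item, quantity):
    """Format the fuzzy set in the membership/region notation of Definition 5."""
    parts = [f"{m:.1f}/{item}.{r}" for r, m in fuzzify(quantity).items()]
    return "(" + " + ".join(parts) + ")"


print(as_fuzzy_set("A", 8))  # (0.0/A.Low + 0.6/A.Middle + 0.4/A.High)
```

Under these assumed functions the membership values of any quantity between 1 and 11 sum to 1, which keeps the example easy to check by hand.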
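Definition 6.
The scalar cardinality of the fuzzy memberships for the fuzzy term izl in the transactions including izl can be defined as:
$$\sum_{i_{zl} \subseteq Trans_y} f_{yzl},$$
where $f_{yzl}$ is the fuzzy membership value of $v_{yz}$ in the l-th fuzzy region of $i_z$ (Definition 5).
Definition 7.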
The summation of the scalar cardinality values of the fuzzy term izl in the modified transactions including the fuzzy term izl can be defined as:
∑ ⊆ ∈ ∧ ⊆
Definition 8.
Let λ be a pre-defined minimum fuzzy support threshold. A fuzzy itemset X is called a fuzzy frequent itemset (abbreviated as FF) if FFX ≧ λ. For example, in Table 3.1.1, if λ = 1.6, then the fuzzy 1-itemset {C.Low} is a fuzzy frequent 1-itemset because FF1{C.Low} = 2.0, which is larger than λ.
Based on the above definitions, the problem of fuzzy mining to be solved in this study is defined as follows. Assume a quantitative transaction database QDB contains a number of quantitative transactions, each of which records the purchased items and their quantities. The problem is to find all the fuzzy frequent itemsets whose actual fuzzy values are larger than or equal to a predefined minimum fuzzy support threshold λ. In this thesis, a gradual data-reduction approach for fuzzy itemset mining (abbreviated as GDF) and a projection-based fuzzy mining approach (abbreviated as PFA) are proposed to effectively and efficiently discover fuzzy frequent itemsets from a given quantitative transaction database QDB.
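The problem statement can be illustrated with a brute-force sketch that enumerates every fuzzy itemset over Table 3.1.1 and keeps those whose fuzzy count reaches λ. Several things here are assumptions of the example rather than the thesis: the membership functions (breakpoints 1, 6 and 11), the use of the minimum operator to score an itemset in a transaction, the restriction to one region per item, and the names `memberships`, `support` and `fuzzy_frequent_itemsets`. The counts it produces are illustrative and need not equal the values quoted in the text, which depend on Figure 3.1.1. It is neither GDF nor PFA.

```python
from itertools import combinations, product

ITEMS = ["A", "B", "C", "D"]
ROWS = [  # Table 3.1.1, Trans1 to Trans10
    [11, 0, 0, 0], [2, 0, 1, 1], [0, 1, 0, 0], [8, 0, 1, 2], [6, 0, 0, 2],
    [7, 0, 1, 3], [0, 2, 3, 0], [2, 0, 1, 1], [0, 1, 0, 0], [10, 0, 0, 0],
]
REGIONS = ("Low", "Middle", "High")


def memberships(v):
    """Hypothetical membership functions (assumed breakpoints 1, 6, 11)."""
    if v <= 1:
        low, middle, high = 1.0, 0.0, 0.0
    elif v <= 6:
        low, middle, high = (6 - v) / 5, (v - 1) / 5, 0.0
    elif v < 11:
        low, middle, high = 0.0, (11 - v) / 5, (v - 6) / 5
    else:
        low, middle, high = 0.0, 0.0, 1.0
    return dict(zip(REGIONS, (low, middle, high)))


def fuzzify_transaction(row):
    """Map a row of quantities to {(item, region): membership > 0}."""
    terms = {}
    for item, q in zip(ITEMS, row):
        if q > 0:  # quantity 0 means the item was not purchased
            for region, m in memberships(q).items():
                if m > 0:
                    terms[(item, region)] = m
    return terms


FUZZY_DB = [fuzzify_transaction(row) for row in ROWS]


def support(itemset):
    """Sum, over transactions containing every term, of the minimum membership."""
    total = 0.0
    for terms in FUZZY_DB:
        if all(t in terms for t in itemset):
            total += min(terms[t] for t in itemset)
    return total


def fuzzy_frequent_itemsets(min_support):
    """Exhaustively test every itemset that uses at most one region per item."""
    result = {}
    for r in range(1, len(ITEMS) + 1):
        for items in combinations(ITEMS, r):
            for regions in product(REGIONS, repeat=r):
                itemset = tuple(zip(items, regions))
                s = support(itemset)
                if s >= min_support:
                    result[itemset] = s
    return result


frequent = fuzzy_frequent_itemsets(1.6)
for itemset, s in frequent.items():
    label = ", ".join(f"{item}.{region}" for item, region in itemset)
    print(f"{{{label}}}: {s:.1f}")
```

Exhaustive enumeration is only feasible for a toy table like this one; it serves as a reference against which a faster algorithm can be checked.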
CHAPTER 4
The Proposed Algorithms
Data mining is used to extract interesting rules from large databases. Traditional association rule mining was proposed to find relationships between items in a set of data [1][2]. For instance, assume the product combination {bread, milk} is frequent in a transaction database, which means that most customers buy bread and milk together in the store. The most well-known of the early algorithms for association-rule mining is Apriori, which consists of two phases: finding frequent itemsets and finding association rules. In general, however, transactions in retail databases also record the quantities of the items bought. Since traditional frequent itemset mining techniques only consider the occurrence of items in transactions, they cannot handle data with quantitative values. To address this, Srikant et al. proposed quantitative association rules, in which the value range of each attribute is partitioned into several intervals to find useful quantitative rules. For example, consider the quantitative rule “{age:[20~29]car:[0, 1]}”, which states that most customers who are 20 to 29 years old usually do not buy cars or buy just one car [31]. However, determining suitable value ranges for the domain values of each attribute is difficult, and the discovered rules are not easily comprehended by decision makers.
Mined quantitative rules should also be simple and comprehensible to human reasoning. Most previous approaches focus on defining support and confidence thresholds that apply to all the items or itemsets [5][6][8][20][26]. When the minimum support value of an itemset is set as the lowest minimum support of its items, the number of candidate itemsets may be large, and much time may be needed for the mining process. However, in real-world applications, some cheaper items are difficult to discover, because the minimum supports may be set too high and the itemsets containing these items may thus be missing from the results. Therefore, different items may have different criteria for judging their importance, and the support requirements should thus vary for different items.
Hong et al. proposed a new approach, fuzzy data mining, which applies the concept of fuzzy set theory to data mining [14]. The main concept of this method is that quantitative values in transactions are converted into linguistic regions by fuzzy theory [24], and the minimum operator of fuzzy theory is applied to obtain the overlap value (minimum value) of the membership values of different items. Unlike traditional association rules and quantitative rules, fuzzy data mining can find knowledge that is both simple and comprehensible from transactions with linguistic regions. Hong et al. then proposed an effective Apriori-based mining algorithm, which uses the minimum operator to compute the scalar cardinality of an itemset in a transaction in order to find interesting fuzzy association rules [15]. Afterward, Hong et al. also proposed an advanced mining approach, which considers the trade-off between the number of fuzzy rules and the computation time required. The main idea of the trade-off approach is that, for each item, only the fuzzy region with the highest degree value is kept in the set of fuzzy frequent 1-itemsets, so that a large number of fuzzy rules is avoided and the execution efficiency is improved [16].
However, both of Hong et al.'s approaches use a level-wise technique to handle the problem of fuzzy rule mining, and thus they may need to spend considerable time generating a large number of candidates and computing their actual fuzzy counts in transactions.
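The cost of the level-wise technique comes from its candidate-generation step, which can be sketched in a generic Apriori style. This is not a transcription of Hong et al.'s algorithms: the function name `generate_candidates`, the representation of a fuzzy term as an (item, region) pair, and the rule that an itemset may not contain two regions of the same item are assumptions of this example.

```python
from itertools import combinations


def generate_candidates(frequent_prev):
    """Join frequent (k-1)-itemsets into candidate k-itemsets.

    Each itemset is a frozenset of (item, region) fuzzy terms. A candidate is
    kept only if it uses each item at most once (an assumption of this sketch)
    and all of its (k-1)-subsets are frequent.
    """
    frequent_prev = set(frequent_prev)
    candidates = set()
    for a, b in combinations(frequent_prev, 2):
        union = a | b
        if len(union) != len(a) + 1:
            continue  # a and b must differ in exactly one term
        items = [item for item, _ in union]
        if len(set(items)) != len(items):
            continue  # skip itemsets mixing two regions of one item
        subsets = combinations(union, len(union) - 1)
        if all(frozenset(s) in frequent_prev for s in subsets):
            candidates.add(union)
    return candidates


# Hypothetical frequent fuzzy 1-itemsets, for illustration only.
level1 = [
    frozenset({("A", "Low")}), frozenset({("A", "Middle")}),
    frozenset({("A", "High")}), frozenset({("B", "Low")}),
    frozenset({("C", "Low")}), frozenset({("D", "Low")}),
]
level2_candidates = generate_candidates(level1)
print(len(level2_candidates))  # 12
```

Every candidate produced this way still has to be counted against the transactions, which is where the repeated database scans of a level-wise method arise.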
In this thesis, we thus propose two efficient fuzzy mining algorithms, namely a gradual data-reduction approach (also termed the gradual data-reduction fuzzy approach, or GDF) and a projection-based fuzzy mining approach (abbreviated as PFA), to discover fuzzy association rules from a quantitative transaction database. In particular, a new pruning strategy is applied to reduce the number of candidates in each pass: unpromising items are removed early from transactions, so that the data size is gradually reduced from pass to pass and the scanning time needed for mining is saved. The experimental results also show that the number of candidates generated is significantly smaller than that generated by the traditional algorithms developed by Hong et al. [15]. In addition, the proposed GDF and PFA algorithms are also faster than the existing ones.
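The idea of removing unpromising items from transactions can be sketched as follows. This is only an illustration of the general principle and not the GDF procedure itself: it assumes the simplest criterion, namely that a fuzzy term whose total membership over all transactions is below λ is unpromising. Under the minimum operator this is safe, because the count of an itemset in a transaction never exceeds the membership of any of its terms. The function name `reduce_transactions` and the sample data are hypothetical.

```python
def reduce_transactions(fuzzy_db, min_support):
    """Drop fuzzy terms whose total membership is below min_support.

    fuzzy_db is a list of {term: membership} dictionaries. Transactions that
    become empty are removed, so later passes scan less data.
    """
    totals = {}
    for terms in fuzzy_db:
        for term, m in terms.items():
            totals[term] = totals.get(term, 0.0) + m
    promising = {t for t, s in totals.items() if s >= min_support}

    reduced = []
    for terms in fuzzy_db:
        kept = {t: m for t, m in terms.items() if t in promising}
        if kept:
            reduced.append(kept)
    return reduced, promising


# Hypothetical fuzzified transactions, for illustration only.
fuzzy_db = [
    {"A.Low": 0.8, "A.Middle": 0.2, "C.Low": 1.0},
    {"A.Middle": 0.6, "A.High": 0.4, "C.Low": 1.0},
    {"B.Low": 1.0},
    {"A.High": 1.0},
]
reduced_db, promising = reduce_transactions(fuzzy_db, min_support=1.2)
terms_before = sum(len(t) for t in fuzzy_db)
terms_after = sum(len(t) for t in reduced_db)
print(f"terms: {terms_before} -> {terms_after}, "
      f"transactions: {len(fuzzy_db)} -> {len(reduced_db)}")
```

Applying such a reduction before each pass shrinks both the number of terms that can form candidates and the number of transactions that must be scanned.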
The details of the proposed GDF are introduced first in the following subsection.