

CHAPTER 1 Introduction

1.3 Thesis Organization

The remaining parts of the thesis are organized as follows. Related works on frequent itemset mining, sequential pattern mining, weighted frequent itemset mining, and weighted sequential pattern mining are reviewed in CHAPTER 2. Weighted frequent itemset mining with an improved upper-bound model, weighted frequent itemset mining with improved strategies, and the experimental evaluation are discussed in CHAPTER 3. Weighted sequential pattern mining with improved strategies and the experimental evaluation are discussed in CHAPTER 4. Conclusions and future work are stated in CHAPTER 5.

CHAPTER 2

Review of Related Works

In this chapter, some studies on frequent itemset mining, sequential pattern mining, weighted frequent itemset mining and weighted sequential pattern mining are briefly reviewed.

2.1 Frequent Itemset Mining

The main purpose of data mining in knowledge discovery is to extract desired rules or patterns from a set of data. One common type of data mining is to derive association rules from a transaction dataset, such that the presence of some items in a transaction implies the presence of some other items. To address this, Agrawal et al. proposed several mining algorithms based on the concept of large itemsets to find association rules from transaction data. The Apriori algorithm [4] for association-rule mining was the most well-known of the existing algorithms.

The process of association-rule mining could be divided into two main phases. In the first phase, candidate itemsets were generated and counted by scanning the transaction data. If the count of an itemset in the transaction database was larger than or equal to a pre-defined threshold value (called the minimum support threshold), the itemset was identified as a frequent one. Itemsets containing only one item were processed first. Frequent itemsets containing only single items were then combined to form candidate itemsets with two items. The above process was repeated until no more candidate itemsets could be generated. In the second phase, association rules were derived from the set of frequent itemsets found in the first phase. All possible association combinations for each frequent itemset were formed, and those whose calculated confidence was larger than or equal to a pre-defined threshold (called the minimum confidence threshold) were output as association rules.
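To make the two-phase process concrete, the following is a minimal sketch of the first (frequent-itemset) phase in Python; the function name, the data layout, and the use of an absolute minimum support count are illustrative assumptions rather than the notation of [4]:

    from itertools import combinations

    def apriori_frequent_itemsets(transactions, min_support):
        # Phase 1: level-wise generation and counting of candidate itemsets.
        transactions = [frozenset(t) for t in transactions]
        candidates = list({frozenset([i]) for t in transactions for i in t})
        frequent = {}
        level = 1
        while candidates:
            # Count each candidate by scanning the transaction data.
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            current = {c: n for c, n in counts.items() if n >= min_support}
            frequent.update(current)
            # Join frequent itemsets of this level to form candidates with one more item.
            keys = list(current)
            candidates = list({a | b for a, b in combinations(keys, 2)
                               if len(a | b) == level + 1})
            level += 1
        return frequent

In the second phase, for each frequent itemset one would enumerate its proper subsets and output a rule whenever its confidence reaches the minimum confidence threshold.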

As mentioned above, however, the Apriori algorithm may generate many unnecessary candidate itemsets for mining, and it then requires a considerable amount of execution time to calculate the supports of these candidate itemsets, so its execution efficiency is not very good. For this reason, many other algorithms, such as Pincer-Search [22], FP-growth [16], OP [28], and ExAMiner [11], have been developed as superior alternatives for finding frequent itemsets. Unlike the other algorithms, the FP-growth algorithm [16] uses a compact tree structure, called a Frequent-Pattern tree (FP-tree). With the aid of this structure, the algorithm only needs to scan the database twice and does not need to generate any candidate itemsets, and thus the FP-growth algorithm [16] can effectively and efficiently perform frequent itemset mining.
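As a rough illustration of the FP-tree idea (not the complete FP-growth mining procedure of [16]), the sketch below builds the tree with exactly two database scans; the class, field, and function names are assumptions made for illustration:

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item = item
            self.count = 0
            self.parent = parent
            self.children = {}

    def build_fp_tree(transactions, min_support):
        # First scan: count item supports and keep only the frequent items.
        counts = defaultdict(int)
        for t in transactions:
            for item in set(t):
                counts[item] += 1
        frequent = {i: c for i, c in counts.items() if c >= min_support}

        root = FPNode(None, None)
        header = defaultdict(list)   # item -> nodes holding that item (node links)

        # Second scan: insert each transaction with its frequent items ordered
        # by descending support, sharing common prefixes in the tree.
        for t in transactions:
            items = sorted((i for i in set(t) if i in frequent),
                           key=lambda i: (-frequent[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                node = node.children[item]
                node.count += 1
        return root, header

Mining then proceeds by following the header table through conditional pattern bases, so no explicit candidate itemsets are ever generated.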

2.2 Sequential Pattern Mining

In general, the transaction time (or time stamp) of each transaction in a real-world application is usually recorded in the database. The transactions can then be listed as time-series data (called sequence data) in the order in which they occurred. To handle such data with time information, a new research issue, namely sequential pattern mining, was first developed, and three Apriori-based algorithms, AprioriAll, AprioriSome, and DynamicSome, were proposed to find sequential patterns in sequence data. However, these algorithms, which were level-wise techniques, had to scan the data multiple times to complete the sequential pattern mining task. Afterward, several algorithms for sequential pattern mining were proposed to improve efficiency in dealing with large datasets, such as the Apriori-based GSP algorithm [36] and the pattern-growth PrefixSpan algorithm [32].

Different from the Apriori-based algorithms, the PrefixSpan algorithm, which was proposed by Han et al., was a pattern-growth algorithm and adopted a projection technique to efficiently mine sequential patterns in a sequence database. The concept of the projection technique used in the PrefixSpan algorithm was derived from a similar concept in database querying: a query condition is given by the user, and all the records in the database that satisfy the condition are then retrieved efficiently.

By using the projection technique, a set of sequences could continually be divided for mining as the items in a prefix subsequence were extended. The search space for subsequences could thus be effectively reduced, and the efficiency of finding sequential patterns could be improved by the PrefixSpan algorithm when compared with the Apriori-based algorithms, such as AprioriAll, GSP, and so forth.
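The core of the projection technique can be sketched as follows, assuming for simplicity that each sequence is a plain list of items (the actual PrefixSpan algorithm also handles itemset elements and records projection positions); the function name is ours:

    def project(sequences, prefix_item):
        # Keep, for every sequence containing prefix_item, only the suffix that
        # follows its first occurrence; the result is the projected database.
        projected = []
        for seq in sequences:
            if prefix_item in seq:
                suffix = seq[seq.index(prefix_item) + 1:]
                if suffix:
                    projected.append(suffix)
        return projected

Growing the prefix simply means projecting again inside the smaller projected database, so the search space shrinks at every recursive step.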

2.3 Weighted Frequent Itemset Mining

An itemset in association-rule mining only considers the frequency of the itemset in databases, and all the items in the itemsets are assumed to have the same significance. In reality, however, the importance of items in a database may differ according to factors such as profit and cost. For example, LCD TVs may not be purchased frequently but are high-profit products when compared with food or drink items in a database. Therefore, some valuable products may not be discovered by using traditional frequent itemset mining techniques. To handle this problem, Yun et al. proposed weighted itemset mining [45] to find weighted frequent itemsets in transaction databases, with the weights flexibly given by users. The average-weight function in Yun et al.'s study [45] was designed to evaluate the weight of an itemset in a transaction. Different from frequent itemsets, which consider only frequency, the found itemsets with high weight values can serve as auxiliary information for managers when making decisions.

However, the downward-closure property in association-rule mining cannot be kept in the problem of weighted frequent itemset mining with the average-weight function. To address this, Yun et al. proposed an upper-bound model to construct a new downward-closure property [41][44][45], which adopted the maximum weight of a database as the weight upper-bound of each transaction. Two algorithms, one Apriori-based and one FP-growth-based, were also developed to find weighted frequent itemsets in transaction databases, and both algorithms in their study performed well in handling the problem of weighted frequent itemset mining.

2.4 Weighted Sequential Pattern Mining

As mentioned previously for weighted itemset mining, a similar problem of treating all items as equally significant also exists in sequential pattern mining. To deal with this, Yun et al. thus proposed a new research issue, named weighted sequential pattern mining [41][44], to find weighted sequential patterns from a sequence database. Similarly, different weights were given to items by referring to factors such as their profits, their costs, or users' preferences, so that the actual importance of a pattern could be recognized more easily than with traditional sequential pattern mining. Different from the function in weighted itemset mining, the time factor was considered to develop a new average-weight function [41][45], and the new function could be applied to evaluate the weight value of a pattern in a sequence.

Based on that function, however, the downward-closure property in traditional sequential pattern mining could not be kept in weighted sequential pattern mining. To address the problem, a new upper-bound model [41][45], in which the maximum weight in a sequence database was regarded as the upper bound of each sequence, was directly derived from Yun et al.'s proposed model in weighted itemset mining [45]. However, it was observed that a huge number of unpromising subsequences still had to be generated for mining by using the traditional upper-bound model [41][44][45], and its performance was thus not good. For these reasons, this thesis explores the issue of effectively and efficiently mining weighted sequential patterns from a set of sequences.

CHAPTER 3

Weighted Frequent Itemset Mining

In this chapter, we propose two algorithms for mining weighted frequent itemsets. We first propose a projection-based weighted frequent itemset mining algorithm (PWA), in which the transaction maximum weight is adopted for mining weighted frequent itemsets. Next, the projection-based weighted frequent itemset mining algorithm with improved strategies (PWAI), which adopts the tightening and filtering strategies, is developed to further improve performance.

3.1 Problem and Definitions

To understand the problem of weighted frequent itemset mining, consider the transaction database given in Table 3.1, in which each transaction consists of two features: the transaction identification (TID) and the items (or events) purchased. There are eight items in the transactions, denoted as A to H. The predefined weight of each item is shown in Table 3.2.

Table 3.1: Set of five transactions for given example.

Table 3.2: Weights of items given in Table 3.1.


For the formal definition of weighted frequent itemset mining, a set of terms related to the problem of weighted frequent itemset mining [45] is defined below.

Definition 1. An itemset X is a subset of items or events, X ⊆ I. If |X| = r, the itemset X is called an r-itemset. Here, I = {i1, i2, …, im} is the set of items or events that may appear in transactions. For example, the itemset {AB} contains two items and is thus called a 2-itemset. Note that the items in an itemset are sorted in alphabetical order.

Definition 2. A transaction database TDB is composed of a set of transactions. That is,

TDB = {Trans1, Trans2, …, Transy, …, Transz}, where Transy is the y-th transaction in TDB.

Definition 3. The weight of an item i, wi, ranges from 0 to 1. For example, wA = 0.30 in Table 3.2.

Definition 4. The weight of an itemset X, wX, is the sum of the weights of all items in X divided by the number of items in X. That is:

wX = (the sum of wi over all items i in X) / lX,

where lX is the number of items in itemset X. For example, in Table 3.2, the weights of the two items in the itemset {AB} are 0.30 and 0.60, respectively, and the number of items in {AB} is 2. Therefore, w{AB} = (0.30 + 0.60) / 2 = 0.45.
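Definition 4 can be stated in one line of code; the two item weights below are the ones quoted from Table 3.2, and the function name is ours:

    def itemset_weight(itemset, weights):
        # Definition 4: the average of the weights of the items in X.
        return sum(weights[i] for i in itemset) / len(itemset)

    weights = {'A': 0.30, 'B': 0.60}             # weights of A and B from Table 3.2
    print(itemset_weight({'A', 'B'}, weights))   # 0.45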

Based on Definition 4, the weight function is an average-weight function. To obtain the calculation base of the weighted support value of an itemset in a database, the maximum weight in a transaction is regarded as the transaction weight of that transaction. The reason for this is that the weight value of any sub-itemset in a transaction cannot exceed the maximum weight in the transaction. The weighted support of an itemset is further described below.

Definition 5. The transaction maximum weight of a transaction Trans, tmwTrans, is the maximum weight value among those of all items in transaction Trans. For example, in Table 3.1, the second transaction includes two items, B and H, whose weights are 0.60 and 0.95, respectively. Therefore, tmwTrans2 = 0.95.

Definition 6. The total transaction maximum weight of a transaction database TDB, ttmw, is the sum of the transaction maximum weight values of all transactions in TDB. That is:

ttmw = tmwTrans1 + tmwTrans2 + … + tmwTransz.

For example, in Table 3.1, ttmw = 0.50 + 0.95 + 0.60 + 0.95 + 0.50 = 3.50.

Definition 7. The weighted support of an itemset X, wsupX, is the sum of the weights of X in the transactions that include X in TDB divided by the total transaction maximum weight ttmw of the transaction database TDB. That is:

wsupX = (the sum of wX over all transactions in TDB that include X) / ttmw.

Definition 8. Let λ be a pre-defined minimum weighted support threshold. An itemset X is called a weighted frequent itemset (WF) if wsupX ≥ λ. For example, if λ = 30%, then the itemset {AE} is a weighted frequent itemset, since wsup{AE} = 30% ≥ 30%.
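Definitions 5 through 8 can be read as the short computation below, a sketch that assumes each transaction is a set of item symbols and the weight table is a dictionary; the function names are ours:

    def transaction_max_weight(transaction, weights):
        # Definition 5: the largest item weight occurring in the transaction.
        return max(weights[i] for i in transaction)

    def weighted_support(itemset, transactions, weights):
        # Definitions 6 and 7: (weight of X, summed over the transactions
        # that contain X) divided by ttmw.
        itemset = set(itemset)
        w_x = sum(weights[i] for i in itemset) / len(itemset)
        ttmw = sum(transaction_max_weight(t, weights) for t in transactions)
        containing = sum(1 for t in transactions if itemset <= set(t))
        return containing * w_x / ttmw

    def is_weighted_frequent(itemset, transactions, weights, min_wsup):
        # Definition 8: X is a weighted frequent itemset if wsupX >= lambda.
        return weighted_support(itemset, transactions, weights) >= min_wsup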

However, the downward-closure property used in association-rule mining does not hold for the problem of weighted frequent itemset mining. This is because the weight function is an average, and thus the actual weighted supports of itemsets cannot be directly used to find the weighted frequent itemsets in databases. Take the item A in Table 3.1 as an example. Three transactions include this item in Table 3.1, and the weight of item A in Table 3.2 is 0.30. The weighted support value of the itemset {A} can then be calculated as (0.30 + 0.30 + 0.30) / 3.50, which is 25.71%. If λ is set at 30%, then the itemset {A} is not a weighted frequent itemset, but its super-itemset {AE} is a weighted frequent itemset. As this example shows, the problem of weighted frequent itemset mining is more difficult to solve than traditional frequent itemset mining. Yun et al. subsequently proposed an upper-bound model to address this, in which the maximum weight in a database is regarded as the upper bound of the weight value of each transaction to hold the downward-closure property in weighted frequent itemset mining [45]. However, the downward-closure property can also be maintained by using the maximum weight within each transaction, which yields tighter upper bounds. We thus propose an effective transaction maximum weight (TMW) model to tighten the upper bounds of the weight values used when mining itemsets. The relevant terms used in our proposed TMW model are defined as follows.

Definition 9. The transaction-weighted upper bound of an itemset X, twubX, is the sum of the transaction maximum weights of the transactions including X in TDB divided by the total transaction maximum weight ttmw of TDB. That is:

twubX = (the sum of tmwTransy over all transactions Transy in TDB that include X) / ttmw.

For example, in Table 3.1, the itemset {E} appears in the four transactions Trans1, Trans2, Trans3, and Trans5, whose transaction maximum weights are 0.50, 0.95, 0.60, and 0.50, respectively. Therefore, twub{E} = 2.55 / 3.50 = 72.85%.
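Definition 9 replaces the itemset weight in the numerator of Definition 7 by the transaction maximum weight, which the following sketch makes explicit (same assumed data layout as in the previous sketch, names ours):

    def transaction_weighted_upper_bound(itemset, transactions, weights):
        # Definition 9: (sum of tmw over the transactions containing X) / ttmw.
        itemset = set(itemset)
        tmws = [max(weights[i] for i in t) for t in transactions]
        ttmw = sum(tmws)
        bound = sum(tmw for t, tmw in zip(transactions, tmws) if itemset <= set(t))
        return bound / ttmw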

Definition 10. Let λ be a pre-defined minimum weighted support threshold. An itemset X is called a weighted frequent upper-bound itemset (WFUB) if twubX ≥ λ. For example, if λ = 30%, then the itemset {E} is a weighted frequent upper-bound itemset since twub{E} = 72.85% ≥ λ.

Based on the definitions given above, weighted frequent itemset mining considers the individual weights of items in a transaction dataset. The goal is to effectively and efficiently find all the weighted frequent itemsets whose weighted supports are larger than or equal to a predefined minimum weighted support threshold λ in a given transaction database. The details of the proposed PWA are given in the next section.

3.2 The Projection-based Weighted Frequent Itemset Mining Algorithm, PWA

In this section, a new projection-based mining algorithm is proposed to effectively handle the problem of finding weighted frequent itemsets in a transaction database. The improved upper-bound model and the pruning strategy used in the proposed algorithm are developed to speed up its execution. The improved upper-bound model is first described below.

3.2.1 Improved Upper-bound Model

An improved weight upper-bound model is proposed to enhance the traditional weight upper-bound model [45]. The proposed model tightens the upper bounds of the weights of itemsets in the mining process. In the traditional upper-bound model [45], the maximum weight in the whole transaction dataset is used as the upper bound of the weight of each transaction to maintain the downward-closure property for weighted frequent itemset mining. However, the maximum weight within each individual transaction can be used to achieve the same goal. That is, the maximum weight in a transaction can be regarded as the upper bound of the weight of that transaction. To show the completeness of the TMW model, two lemmas are given below to prove that no weighted frequent itemsets are missed in any weighted frequent itemset mining case.

Lemma 3.1: The transaction-weighted upper bound of an itemset X maintains the downward-closure property.

Proof: Let X be a weighted frequent upper-bound itemset and dX be the set of transactions that contain X in a transaction database TDB. If y is a super-itemset of X, then y cannot exist in any transaction in which X is absent, so the transactions containing y form a subset of dX. Therefore, the transaction-weighted upper bound twubX of X is an upper bound of the transaction-weighted upper bound twuby of y. Accordingly, if twubX is less than a predefined minimum weighted support threshold, then y cannot be a weighted frequent upper-bound itemset.

Lemma 3.2: For a transaction database TDB and a predefined minimum weighted support threshold, the set of weighted frequent itemsets WF is a subset of the set of weighted frequent upper-bound itemsets WFUB.

Proof: Let X be a weighted frequent itemset. According to Definitions 7 and 9, the actual weighted support wsupX of X must be less than or equal to its transaction-weighted upper bound twubX. Accordingly, if X is a weighted frequent itemset, then it must be a weighted frequent upper-bound itemset. As a result, X is a member of the set WFUB.
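The inequality used in the proof follows from Definitions 4, 5, 7, and 9, because the average weight of X in a transaction can never exceed that transaction's maximum item weight. Written out, with both sums ranging over the transactions of TDB that include X:

wsupX = (Σ wX) / ttmw ≤ (Σ tmwTrans) / ttmw = twubX.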

Based on Lemmas 3.1 and 3.2, all weighted frequent itemsets in a transaction database can be discovered. The proposed model can thus be used to effectively tighten the upper bounds of weights for itemsets compared to those obtained using the traditional upper-bound model [45]. An example is given below to illustrate how the model improves the upper bounds of weights for itemsets.

According to the traditional upper-bound model [45], the maximum weight value in Table 3.1 is 0.95, which is regarded as the upper bound of each transaction in the dataset. Take item E as an example. It appears in four transactions, namely Trans1, Trans2, Trans3, and Trans5. The upper bounds of the weight for the four transactions are all 0.95. The upper bound of the weight for E can thus be calculated as (0.95 + 0.95 + 0.95 + 0.95), which is 3.8.

Based on the proposed upper-bound model, the upper bound of the weight for E can be tightened. First, the maximum weight in each transaction is found. Take the first transaction Trans1: {ACEF} in Table 3.1 as an example. The transaction includes four items, A, C, E and F, whose weights are 0.30, 0.45, 0.40 and 0.50, respectively. The maximum weight is 0.50, which is regarded as the upper bound of the weight for the transaction Trans1. The other transactions in Table 3.1 can be similarly processed. The maximum weights for the five transactions are 0.50, 0.95, 0.60, 0.95, and 0.50, respectively. The transaction-weighted upper bound of item E can then be calculated as 0.50 + 0.95 + 0.60 + 0.50, which is 2.55.
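The arithmetic of the two models for item E can be checked directly; the values below are taken from the example above, and the variable names are ours:

    # Traditional model: each of the four containing transactions is bounded by
    # the global maximum weight 0.95 of the whole dataset.
    traditional_bound = 4 * 0.95                  # 3.80
    # Proposed TMW model: each containing transaction contributes its own maximum weight.
    tmw_bound = 0.50 + 0.95 + 0.60 + 0.50         # 2.55
    print(traditional_bound, tmw_bound)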

3.2.2 Pruning Strategy for Unpromising Items

In this section, a simple pruning strategy based on the proposed model and a projection-based technique is designed to effectively reduce the number of unpromising itemsets considered during mining. According to Lemmas 3.1 and 3.2, the downward-closure property for weighted frequent itemset mining can be maintained by using the proposed model. Based on the model, any sub-itemset of a weighted frequent upper-bound itemset must be a weighted frequent upper-bound itemset. Conversely, if an itemset has a weighted infrequent upper-bound sub-itemset, then the itemset is not a weighted frequent upper-bound itemset, and it is also not a weighted frequent itemset. In this case, the itemset can be skipped, since it is impossible for it to be a weighted frequent upper-bound itemset. This concept is applied in the pruning strategy to reduce the number of unpromising itemsets in the recursive process.

The proposed pruning strategy in the projection-based algorithm works as follows. First, when all the weighted frequent upper-bound r-itemsets with r items are found, all items in the set of weighted frequent upper-bound r-itemsets are gathered as the pruning information for each prefix r-itemset to be processed. Next, the additional (r+1)-th item of each (r+1)-itemset generated in the next recursive step is checked to see whether it appears in the set of gathered items. If it does, the generated (r+1)-itemset is placed in the set of candidate (r+1)-itemsets; otherwise, it is pruned. An example is given below to illustrate the pruning of unpromising itemsets in the recursive process.
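A minimal sketch of this filtering step is given below (in Python, with hypothetical function and variable names); the concrete item sets correspond to the example that follows:

    def prune_projected_transaction(projected_items, promising_items):
        # Drop items that never appear in a weighted frequent upper-bound itemset
        # sharing the current prefix; such items cannot extend the prefix into a
        # weighted frequent upper-bound itemset.
        return [i for i in projected_items if i in promising_items]

    promising = {'A', 'B', 'C'}   # items gathered from WFUB2,{A} = {{AB}, {AC}}
    print(prune_projected_transaction(list('ABDEF'), promising))   # ['A', 'B']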

For example, consider the transaction {ABDEF}, where the symbols represent items, and the itemsets {AB} and {AC}, which are included in the set of weighted frequent upper-bound 2-itemsets WFUB2,{A} with {A} as their prefix. In this case, only the three items, A, B, and C, are gathered from the set as the pruning information. The next prefix itemset to be processed is {AB}. For the transaction {ABDEF}, which is a projected transaction of {AB}, the items D, E, and F do not appear in the set of gathered items. Therefore, the three items have to be removed from the transaction, and the modified transaction is {AB}. The items are removed because the super-itemsets consisting of the three items and the prefix {A} are not weighted frequent upper-bound itemsets. Moreover, since the number of items kept in the modified

