有效權重資料探勘方法之研究 (A Study on Efficient Approaches for Weighted Data Mining)

(1) National University of Kaohsiung, Department of Computer Science and Information Engineering, Master's Thesis. A Study on Efficient Approaches for Weighted Data Mining (有效權重資料探勘方法之研究). Student: Hong-Yu Lee. Advisors: Dr. Tzung-Pei Hong and Dr. Guo-Cheng Lan. July 2012 (Republic of China year 101).

(2) A Study on Efficient Approaches for Weighted Data Mining. Advisor: Dr. Tzung-Pei Hong, Department of Computer Science and Information Engineering, National University of Kaohsiung. Advisor: Dr. Guo-Cheng Lan, Department of Computer Science and Information Engineering, National Cheng Kung University. Student: Hong-Yu Lee, Department of Computer Science and Information Engineering, National University of Kaohsiung.

Abstract

Weighted data mining has been widely discussed in recent years because of its many practical applications. Unlike traditional association-rule mining, weighted data mining flexibly assigns each item a suitable weight to represent its importance in a database, and weighted frequent itemsets can then be found from the database. However, the downward-closure property of association-rule mining does not hold in weighted data mining. Although the traditional upper-bound model can be applied to restore the property, that model still generates a large number of unpromising candidate itemsets. To address this, we thus develop several efficient methods for mining. I.

(3) weighted frequent itemsets and weighted sequential patterns. For weighted itemset mining, a new upper-bound model, which adopts the maximum weight in a transaction as the upper bound of that transaction, is first proposed to obtain tighter upper bounds for itemsets. In addition, two effective strategies, pruning and filtering, are designed to further improve the model. To utilize the model and strategies effectively, two efficient algorithms are proposed for finding weighted frequent itemsets in databases: a projection-based weighted mining algorithm based on the improved upper-bound approach with the pruning strategy, and a projection-based weighted mining algorithm based on the improved upper-bound approach with both strategies. The proposed concepts for weighted itemset mining are then further extended to the problem of weighted sequential pattern mining. Finally, experimental results on synthetic and real datasets show that the proposed algorithms outperform the traditional weighted mining algorithms under various parameter settings.

Keywords: data mining, association-rule mining, sequential pattern mining, weighted itemset mining, weighted sequential pattern mining, upper bound, pruning strategy. II.

(4) A Study on Efficient Approaches for Weighted Data Mining. Advisor: Dr. Tzung-Pei Hong, Institute of Computer Science and Information Engineering, National University of Kaohsiung. Co-advisor: Dr. Guo-Cheng Lan, Institute of Computer Science and Information Engineering, National Cheng Kung University. Student: Hong-Yu Lee, Institute of Computer Science and Information Engineering, National University of Kaohsiung.

中文摘要 (Chinese Abstract, translated)

Within the field of data mining, weighted data mining has been widely discussed in recent years because of its practicality. Unlike previous association-rule mining methods, weighted mining assigns each item a suitable weight to represent its importance in the dataset, so that itemsets with low frequency but high importance can be found. However, the downward-closure property of association-rule mining does not hold in weighted data mining. Although the upper-bound model used in earlier work can restore the downward-closure property, many unnecessary candidate itemsets are still generated during mining. To solve this problem, we design several efficient methods for weighted itemset mining and weighted sequential pattern mining. For weighted itemset mining, we propose a new upper-bound model that adopts the maximum weight in a transaction as its overestimated value, obtaining more accurate upper bounds than before. In addition, two efficient mechanisms, pruning and filtering, are designed to further improve the upper-bound model. To use the new model and mechanisms efficiently, we first propose a projection-based algorithm combined with the improved upper-bound model, and then another projection-based algorithm combined with the improved upper-bound model and the effective improvement mechanisms, and apply them to find weighted itemsets. Moreover, the concepts proposed for weighted itemset mining can be further extended to weighted sequential pattern mining. Finally, the experimental results show that the performance of the proposed algorithms III.

(5) is better than that of previous weighted mining algorithms under various data and parameter settings.

Keywords: data mining, association-rule mining, sequential pattern mining, weighted itemset mining, weighted sequential pattern mining, upper bound, pruning mechanism. IV.

(6) 誌謝 (Acknowledgements, translated) During my master's studies, I am first and most grateful to my advisor, Professor Tzung-Pei Hong. From him I learned not only how to do research but, more importantly, how to handle problems beyond research; I gained a great deal both in research and in the way I think. For my oral examination, I also thank Professors 李建興, 江明朝, and 林浚瑋, whose comments benefited me greatly and made this thesis more complete. Next, I thank the seniors, classmates, and juniors of the laboratory for their help and care. The concern of seniors Guo-Cheng (國誠), 浚瑋, and 明泰 kept me motivated in my research life. I especially thank senior Guo-Cheng, who always helped us when we were confused about our research, often staying late to discuss with us, and who also looked after our moods when the research road became emotionally difficult. I also thank my classmates 嘉蔚, 郁凱, 佩珊, 怡杏, and 聰榮 for the research discussions; in particular 嘉蔚, with whom I spent the most time in the laboratory while writing this thesis, often staying late together. That period may have been a little hard, but looking back it feels fulfilling. Thanks also to juniors 新程, 宏全, and 俊欽, whose help with many matters allowed us to devote ourselves fully to research and gave me more time to think about research problems. I further thank my parents, whose support and meticulous care allowed me to complete my master's degree without worries. Finally, I thank everyone for their help in making my master's career complete. Hong-Yu Lee, National University of Kaohsiung. V.

(7) Contents
Abstract ........ I
中文摘要 ........ III
誌謝 ........ V
CHAPTER 1 Introduction ........ 1
1.1 Background and Motivation ........ 1
1.2 Contributions ........ 3
1.3 Thesis Organization ........ 4
CHAPTER 2 Review of Related Works ........ 6
2.1 Frequent Itemset Mining ........ 6
2.2 Sequential Pattern Mining ........ 7
2.3 Weighted Frequent Itemset Mining ........ 9
2.4 Weighted Sequential Pattern Mining ........ 10
CHAPTER 3 Weighted Frequent Itemset Mining ........ 12
3.1 Problem and Definitions ........ 12
3.2 The Projection-based Weighted Frequent Itemset Mining Algorithm, PWA ........ 17
3.2.1 Improved Upper-bound Model ........ 17
3.2.2 Pruning Strategy for Unpromising Items ........ 20
3.2.3 The Proposed Weighted Mining Algorithm ........ 22
3.2.4 An Example of PWA ........ 26
VI.

(8) 3.3 Projection-based Weighted Mining Algorithm with Improved Strategies, PWAI ........ 35
3.3.1 Tightening Upper-bound Strategy ........ 35
3.3.2 Filtering Strategy ........ 36
3.3.3 The Proposed Mining Algorithm with Improved Strategies ........ 38
3.3.4 An Example of PWAI ........ 44
3.4 Experimental Evaluation ........ 51
3.4.1 Experimental Datasets ........ 51
3.4.2 Evaluation of the Improved Strategy ........ 53
3.4.3 Efficiency Evaluation ........ 56
3.4.4 Evaluation on A Real Dataset, Foodmart ........ 58
CHAPTER 4 Weighted Sequential Pattern Mining ........ 61
4.1 Problem and Definitions ........ 61
4.2 Projection-based Weighted Sequential Pattern Mining with Improved Strategies, PWSI ........ 66
4.2.1 Improved Upper-bound Model ........ 66
4.2.2 Pruning Strategy for Unpromising Items ........ 67
4.2.3 Tightening Upper-bound Strategy ........ 68
4.2.4 Filtering Strategy ........ 68
4.2.5 The Proposed Mining Algorithm with Improved Strategies ........ 69
4.2.6 An Example of PWSI ........ 75
4.3 Experimental Evaluation ........ 84
4.3.1 Experimental Datasets ........ 84
4.3.2 Evaluation on the Improved Strategy ........ 85
VII.

(9) 4.3.3 Efficiency Evaluation ........ 88
4.3.4 Evaluation on A Real Dataset ........ 90
CHAPTER 5 Conclusions and Future Works ........ 93
References ........ 96
VIII.

(10) List of Tables
Table 3.1: Set of five transactions for given example. ........ 13
Table 3.2: Weights of items given in Table 3.1. ........ 13
Table 3.3: Transaction maximum weights of the five transactions in the example. ........ 27
Table 3.4: The transaction-weighted upper bounds and weighted supports of all candidate 1-itemsets. ........ 29
Table 3.5: The set of the weighted frequent upper-bound 1-itemsets in the example. ........ 30
Table 3.6: The set of the weighted frequent 1-itemsets in the example. ........ 30
Table 3.7: All the modified transactions in this example. ........ 31
Table 3.8: The transaction-weighted upper bounds and the actual weighted supports of all the 2-itemsets with the prefix {A} in this example. ........ 33
Table 3.9: The modified transactions in tdb{A} and their transaction maximum weight values. ........ 34
Table 3.10: The final set of WF in the example. ........ 34
Table 3.11: Set of the weighted frequent itemsets with the five itemsets as their prefix itemsets in this example. ........ 47
Table 3.12: Set of weighted frequent upper-bound itemsets with the five itemsets as their prefix itemsets in this example. ........ 47
Table 3.13: The transaction-weighted upper bounds and the actual weighted supports of all the 2-itemsets with the prefix {A} in this example. ........ 49
Table 3.14: The modified transactions in tdb{A} and their transaction maximum weight values. ........ 50
Table 3.15: Parameters used in the series of experiments. ........ 51
Table 4.1: Set of five sequences for given example. ........ 62
Table 4.2: Weights of items given in Table 4.1. ........ 62
IX.

(11) Table 4.3: Sequence maximum weights of the five sequences in the example. ........ 76
Table 4.4: Sequence-weighted upper bounds and weighted supports of all 1-subsequences in the example. ........ 77
Table 4.5: Set of the weighted frequent upper-bound 1-patterns in the example. ........ 78
Table 4.6: Set of weighted sequential 1-patterns in the example. ........ 78
Table 4.7: All modified sequences and their sequence maximum weights in the example. ........ 79
Table 4.8: Set of weighted sequential patterns with the five patterns as their prefix patterns in the example. ........ 80
Table 4.9: Set of weighted frequent upper-bound patterns with the five patterns as their prefix patterns in the example. ........ 80
Table 4.10: Sequence-weighted upper-bound and actual weighted support values of all 2-subsequences with prefix <A> in the example. ........ 81
Table 4.11: All modified sequences in sdb<A> in the example. ........ 82
Table 4.12: Final set of all weighted sequential patterns (WS) in the example. ........ 83
Table 4.13: Parameters used in the experiment. ........ 84
X.

(12) List of Figures
Figure 3.1: Weight-value distribution of generated transaction datasets. ........ 52
Figure 3.2: Weight-value distribution of the real dataset, Foodmart. ........ 53
Figure 3.3: Comparison of number of weighted frequent upper-bound itemsets required by the three algorithms under various parameter settings. ........ 54
Figure 3.4: Pruning effects of algorithms, PWA and PWAI, under various parameter settings. ........ 55
Figure 3.5: Efficiency comparison of the three algorithms under various parameter settings, min_wsup, D, and T. ........ 56
Figure 3.6: Efficiency improvement rate of PWA and PWAI under various parameter settings. ........ 57
Figure 3.7: Comparison of the numbers of weighted frequent upper-bound itemsets required by the three algorithms under different thresholds. ........ 58
Figure 3.8: The pruning rate of the proposed two algorithms under different thresholds. ........ 59
Figure 3.9: Efficiency comparison of the three algorithms under different thresholds. ........ 59
Figure 3.10: Efficiency improvement of the proposed algorithms under different thresholds. ........ 59
Figure 4.1: Weight-value distribution of generated transaction datasets. ........ 85
Figure 4.2: Comparison of numbers of weighted frequent upper-bound patterns required by the three algorithms under different parameter settings. ........ 86
Figure 4.3: The pruning effect of the proposed algorithms, PWS and PWSI, under various parameters, min_wsup, D, and S. ........ 87
Figure 4.4: Execution efficiency of all algorithms under various parameter settings, min_wsup, D, and S. ........ 89
Figure 4.5: Improvement rates of the PWS and the PWSI under various parameter settings. ........ 90
XI.

(13) Figure 4.6: Comparison of numbers of weighted frequent upper-bound patterns needed by the three algorithms under different minimum weighted support thresholds. ........ 91
Figure 4.7: Pruning effect of the proposed algorithms under different minimum weighted support thresholds. ........ 91
Figure 4.8: Execution efficiency of all the algorithms under different thresholds. ........ 92
Figure 4.9: Pruning effect of PWSI under different thresholds. ........ 92
XII.

(14) CHAPTER 1 Introduction

1.1 Background and Motivation

Data mining can be used to extract useful rules or patterns from a set of data. Traditional data mining, however, considers only the occurrence of items; it does not reflect other factors, such as price or profit. In addition, all items in a dataset are assumed to have the same significance, so the actual significance of an item cannot be easily recognized. For example, some low-frequency events are important, such as high-profit products in a transaction database or attack events in long-term network monitoring data. Such events cannot be easily found using traditional mining techniques. To handle this problem, weighted frequent itemset mining [4][12][40] was proposed, in which weights are assigned to items according to their importance. In the related studies, a weight function was designed to evaluate the weight of an itemset. With such a weight function, however, the downward-closure property of traditional itemset mining cannot be maintained. To address this, Yun et al. [45] designed an upper-bound model, in which the maximum weight among all items in a transaction database is used as the maximum weight of every transaction, to construct a downward-closure property for weighted frequent itemset mining. However, Yun et al.'s WFIM algorithm [45] generates a large number of candidate itemsets for mining, making the algorithm inefficient. 1.
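The traditional upper-bound idea described above can be sketched as follows. This is a minimal illustration with made-up data and function names, not the thesis's WFIM algorithm: because every transaction is over-weighted by the single largest item weight in the whole database, an itemset's overestimated weighted count depends only on how many transactions contain it, so the bound can never grow when items are added, and downward closure is restored.

```python
# Illustrative transactions and item weights (not from the thesis).
transactions = [{"A", "C", "E"}, {"B", "H"}, {"B", "C", "E"}]
weights = {"A": 0.30, "B": 0.60, "C": 0.45, "E": 0.40, "H": 0.95}

# The traditional model uses the maximum weight in the WHOLE database
# as the weight upper bound of every transaction.
max_db_weight = max(weights.values())  # 0.95

def upper_bound(itemset):
    """Overestimated weighted count of an itemset: one max_db_weight per
    transaction that contains it. A superset appears in no more
    transactions than its subsets, so the bound is anti-monotone."""
    return sum(max_db_weight for t in transactions if itemset <= t)

print(round(upper_bound({"C", "E"}), 2))  # 1.9: contained in two transactions
```

The price of a database-wide bound is looseness: every transaction is treated as if it were worth 0.95 here, which is why many unpromising candidates survive; the model in Chapter 3 tightens the bound to each transaction's own maximum weight.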

(15) To find the relationships between items, together with their importance, in a long-term collected dataset, Yun et al. proposed weighted sequential pattern mining [41], in which weights are assigned to items according to preference. Yun et al. also designed an average-weight function to evaluate the weight of a pattern in a sequence. With the average-weight function, however, the downward-closure property of traditional sequential pattern mining cannot be maintained. To handle this, Yun et al. [41][44] designed an upper-bound model, in which the maximum weight among all items in a sequence database is used as the maximum weight of every sequence, to construct a downward-closure property for weighted sequential pattern mining. However, Yun et al.'s WSpan algorithm [41][44] generates a large number of candidate subsequences for mining, which degrades its performance. It is thus critical to develop a suitable model for weighted frequent itemset mining and weighted sequential pattern mining. In this thesis, we first propose a projection-based weighted frequent itemset mining algorithm (PWA) based on an improved upper-bound model, in which the maximum weight in a transaction is regarded as the upper-bound weight of that transaction, to tighten the upper bounds of weights for itemsets. In addition, an effective pruning strategy is adopted to reduce the number of unpromising itemsets in the mining process, thus avoiding unnecessary evaluation and increasing the efficiency of finding weighted frequent itemsets. Next, we propose the projection-based weighted frequent itemset mining algorithm with improved 2.

(16) strategies (PWAI) for mining weighted frequent itemsets, in which two improved strategies, tightening and filtering, further improve the upper bounds of weighted values for itemsets. In the final part of the thesis, the proposed upper-bound model and improved strategies for weighted itemset mining are extended to the problem of weighted sequential pattern mining: we propose two algorithms, the projection-based weighted sequential pattern mining algorithm (PWS) and the projection-based weighted sequential pattern mining algorithm with improved strategies (PWSI). The experimental results show that the proposed algorithms perform well in terms of both pruning effect and execution efficiency on synthetic datasets under various parameter settings.

1.2 Contributions

The major contributions of this thesis are as follows: 1. For weighted itemset mining, two novel algorithms, the projection-based weighted mining algorithm (PWA) and the projection-based weighted mining algorithm with improved strategies (PWAI), are developed for discovering weighted frequent itemsets from transaction databases. In particular, we design a new upper-bound model that adopts the maximum weight of a transaction to tighten the upper bounds of weighted values for itemsets. Correspondingly, the proposed upper-bound model for weighted itemset mining is extended to handle the problem of weighted sequential pattern mining. Two efficient 3.

(17) algorithms, projection-based weighted sequential pattern mining (PWS) and projection-based weighted sequential pattern mining with improved strategies (PWSI), are thus developed for weighted sequential pattern mining. 2. Two effective strategies, tightening and filtering, are designed to obtain more accurate upper bounds of weighted values for itemsets and patterns, so that a large number of unpromising candidates can be pruned and the data size can be recursively reduced compared with the existing algorithms. The proposed algorithms therefore handle the problem of weighted data mining at a lower cost. 3. Based on the developed model and strategies, the experimental results show that the proposed algorithms perform well in terms of both pruning effect and execution efficiency compared with the existing algorithm under various parameter settings, on synthetic datasets generated by a public IBM data generator and on a public real dataset, Foodmart.

1.3 Thesis Organization

The remainder of the thesis is organized as follows. Related works on frequent itemset mining, sequential pattern mining, weighted frequent itemset mining, and weighted sequential pattern mining are reviewed in CHAPTER 2. Weighted frequent itemset mining with an improved upper-bound model, weighted frequent itemset mining with 4.

(18) improved strategies, and the experimental evaluation are discussed in CHAPTER 3. Weighted sequential pattern mining with improved strategies and its experimental evaluation are discussed in CHAPTER 4. Conclusions and future works are stated in CHAPTER 5. 5.

(19) CHAPTER 2 Review of Related Works

In this chapter, studies on frequent itemset mining, sequential pattern mining, weighted frequent itemset mining, and weighted sequential pattern mining are briefly reviewed.

2.1 Frequent Itemset Mining

The main purpose of data mining in knowledge discovery is to extract desired rules or patterns from a set of data. One common type of data mining is to derive association rules from a transaction dataset, such that the presence of some items in a transaction implies the presence of some other items. To achieve this, Agrawal et al. proposed several mining algorithms based on the concept of large itemsets to find association rules from transaction data. The Apriori algorithm [4] was the most well-known of the existing algorithms. The process of association-rule mining could be divided into two main phases. In the first phase, candidate itemsets were generated and counted by scanning the transaction data. If the count of an itemset in the transaction database was larger than or equal to a pre-defined threshold value (called the minimum support threshold), the itemset was identified as a frequent one. Itemsets containing only one item were processed first. Frequent itemsets 6.

(20) containing only single items were then combined to form candidate itemsets with two items. The above process was repeated until no more candidate itemsets were generated. In the second phase, association rules were derived from the set of frequent itemsets found in the first phase. All possible association combinations for each frequent itemset were formed, and those with a calculated confidence larger than or equal to a pre-defined threshold (called the minimum confidence threshold) were output as association rules. As mentioned above, however, the Apriori algorithm may generate many unnecessary candidate itemsets for mining, and it then requires a considerable amount of execution time to calculate the supports of those candidates, so its execution efficiency is not very good. For this reason, many other algorithms, such as Pincer-Search [22], FP-growth [16], OP [28], and ExAMiner [11], have been developed as superior alternatives for finding frequent itemsets. Unlike the other algorithms, the FP-growth algorithm [16] uses a compact tree structure, called a frequent-pattern tree (FP-tree); with its aid, the algorithm only needs to scan the database twice and does not need to generate any candidate sets, so it can undertake frequent itemset mining both effectively and efficiently.

2.2 Sequential Pattern Mining

In general, the transaction time (or time stamp) of each transaction in a real-world 7.

(21) application is usually recorded in the database. The transactions can then be listed as time-series data (called sequence data) in the order in which the transactions occur. To handle such data with time, a new issue, namely sequential pattern mining, was first developed, and three Apriori-based algorithms, AprioriAll, AprioriSome, and DynamicSome, were proposed to find sequential patterns in sequence data. These algorithms, however, were level-wise techniques and had to execute multiple data scans to complete the mining task. Afterward, several algorithms for sequential pattern mining were proposed to improve efficiency on large datasets, such as the Apriori-based algorithm GSP [36] and the pattern-growth algorithm PrefixSpan [32]. Different from the Apriori-based algorithms, the PrefixSpan algorithm, proposed by Han et al., is a pattern-growth algorithm that adopts a projection technique to efficiently mine sequential patterns in a sequence database. The projection technique used in PrefixSpan is similar in concept to a database query: given a query condition, all records in the database that satisfy the condition are retrieved. By using the projection technique, a set of sequences can be repeatedly divided for mining as the items in a prefix subsequence grow. The search space for subsequences is thereby effectively reduced, and the efficiency of finding sequential patterns is thus improved by PrefixSpan when compared with the Apriori-based 8.

(22) algorithms, such as AprioriAll, GSP, and so forth.

2.3 Weighted Frequent Itemset Mining

An itemset in association-rule mining considers only the frequency of the itemset in databases, and all items in the itemsets are assumed to have the same significance. In reality, however, the importance of items in a database may differ according to factors such as profit and cost. For example, LCD TVs may not be bought frequently but are high-profit products compared with food or drink items in a database. Therefore, some useful products may not be discovered using traditional frequent itemset mining techniques. To handle this problem, Yun et al. proposed weighted itemset mining [45] to find weighted frequent itemsets in transaction databases, with the weights flexibly given by users. The average-weight function in Yun et al.'s study [45] was designed to evaluate the weight of an itemset in a transaction. Different from frequent itemsets, which consider only frequency, the found itemsets with high weight values may serve as auxiliary information for managers in making decisions. However, the downward-closure property of association-rule mining cannot be kept in weighted frequent itemset mining with the average-weight function. To address this, Yun et al. proposed an upper-bound model to construct a new downward-closure property [41][44][45], which adopted the maximum weight of a database as the weight 9.

(23) upper-bound of each transaction, and two algorithms, one Apriori-based and one FP-growth-based, were developed to find weighted frequent itemsets in transaction databases; both performed well in handling the problem of weighted frequent itemset mining.

2.4 Weighted Sequential Pattern Mining

As mentioned for weighted itemset mining, the same problem of uniform item significance also exists in sequential pattern mining. To deal with this, Yun et al. proposed a new research issue, named weighted sequential pattern mining [41][44], to find weighted sequential patterns from a sequence database. Similarly, different weights are given to items according to factors such as profit, cost, or user preference, so that the actual importance of a pattern can be recognized more easily than with traditional sequential pattern mining. Different from the function in weighted itemset mining, the time factor was considered in developing a new average-weight function [41][45], which can be applied to identify the weight value of a pattern in a sequence. Based on this function, however, the downward-closure property of traditional sequential pattern mining could not be kept in weighted sequential pattern mining. To address the problem, a new upper-bound model [41][45], in which the maximum weight in a sequence database is regarded as the upper bound of each sequence, was directly derived from Yun et 10.

(24) al.’s proposed model in weighted itemset mining [45]. However, it was observed that a huge amount of unpromising subsequences still had to be generated by using the traditional upper-bound model [41][44][45] for mining, and its performance was thus not good. Based on the reasons, this motivates our exploration of the issue of effectively and efficiently mining weighted sequential patterns from a set of sequences.. 11.

(25) CHAPTER 3 Weighted Frequent Itemset Mining

In this chapter, we propose two algorithms for mining weighted frequent itemsets. We first propose the projection-based weighted frequent itemset mining algorithm (PWA), in which the transaction maximum weight is adopted for mining weighted frequent itemsets. Next, the projection-based weighted frequent itemset mining algorithm with improved strategies (PWAI), which adds the tightening and filtering strategies, is developed to further improve performance.

3.1 Problem and Definitions

To understand the problem of weighted frequent itemset mining, consider the transaction database given in Table 3.1, in which each transaction consists of two features, the transaction identification (TID) and the items purchased (or events). There are eight items in the transactions, respectively denoted A to H. The predefined weight of each item is shown in Table 3.2. 12.

(26) Table 3.1: Set of five transactions for the given example.

TID      Transactions
Trans1   {ACEF}
Trans2   {BH}
Trans3   {BCE}
Trans4   {ACDEGH}
Trans5   {ADEF}

Table 3.2: Weights of the items given in Table 3.1.

Item   Weight
A      0.30
B      0.60
C      0.45
D      0.20
E      0.40
F      0.50
G      0.15
H      0.95

For the formal definition of weighted frequent itemset mining, a set of terms related to the problem of weighted frequent itemset mining [45] is defined below.

Definition 1. An itemset X is a subset of items or events, X ⊆ I. If |X| = r, the itemset X is called an r-itemset. Here I = {i1, i2, …, im} is the set of items or events that may appear in transactions. For example, the itemset {AB} contains two items and is thus called a 2-itemset. Note that the items in an itemset are sorted in alphabetical order.

Definition 2. A transaction database TDB is composed of a set of transactions. That is, TDB = {Trans1, Trans2, …, Transy, …, Transz}, where Transy is the y-th transaction in TDB. 13.
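As a sketch (the variable names are ours, not the thesis's), the example database of Tables 3.1 and 3.2 can be encoded as follows, and the subset relation of Definition 1 can then be checked directly; for instance, the 2-itemset {AE} is contained in three of the five transactions.

```python
# Table 3.1: the five example transactions (TID -> purchased items).
tdb = {
    "Trans1": set("ACEF"),
    "Trans2": set("BH"),
    "Trans3": set("BCE"),
    "Trans4": set("ACDEGH"),
    "Trans5": set("ADEF"),
}

# Table 3.2: the predefined weight of each item.
weights = {"A": 0.30, "B": 0.60, "C": 0.45, "D": 0.20,
           "E": 0.40, "F": 0.50, "G": 0.15, "H": 0.95}

def containing_transactions(itemset):
    """TIDs of the transactions that contain every item of the itemset."""
    return [tid for tid, items in tdb.items() if set(itemset) <= items]

print(containing_transactions("AE"))  # ['Trans1', 'Trans4', 'Trans5']
```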

Definition 3. The weight of an item i, wi, ranges from 0 to 1. For example, wA = 0.30 in Table 3.2.

Definition 4. The weight of an itemset X, wX, is the sum of the weights of all items in X divided by the number of items in X. That is:

    wX = (Σ_{i ∈ X} wi) / lX,

where lX is the number of items in the itemset X. For example, in Table 3.2, the weights of the two items in the itemset {AB} are 0.30 and 0.60, respectively, and the number of items in {AB} is 2. Therefore, w{AB} = (0.30 + 0.60) / 2 = 0.45. As the fourth definition shows, the weight function is an average weight function.

To obtain the calculation base of the weighted support value of an itemset in a database, the maximum weight in a transaction is regarded as the transaction weight of that transaction. The reason for this is that the weight value of any sub-itemset in a transaction has to be less than or equal to the maximum weight in the transaction. The weighted support of an itemset is further described below.

Definition 5. The transaction maximum weight of a transaction Trans, tmwTrans, is the maximum weight value among those of all items in the transaction Trans. For example, in Table 3.1, the second transaction includes two items, B and H, whose weights are 0.60 and 0.95, respectively. Therefore, tmwTrans2 = 0.95.

Definition 6. The total transaction maximum weight of a transaction database TDB,

ttmw, is the sum of the transaction maximum weight values of all transactions in TDB. That is:

    ttmw = Σ_{Transy ∈ TDB} tmwy.

For example, in Table 3.1, the transaction maximum weights of the five transactions are 0.50, 0.95, 0.60, 0.95, and 0.50, respectively. Then ttmw = 0.50 + 0.95 + 0.60 + 0.95 + 0.50 = 3.50.

Definition 7. The weighted support value of an itemset X, wsupX, is the sum of the weight values of X in the transactions including X in TDB divided by the total transaction maximum weight ttmw of the transaction database TDB. That is:

    wsupX = (Σ_{X ⊆ Transy ∧ Transy ∈ TDB} wX) / ttmw.

For example, in Table 3.1, the itemset {AE} appears in the three transactions Trans1, Trans4, and Trans5, and the weight value of {AE} is 0.35. Then wsup{AE} = (0.35 + 0.35 + 0.35) / 3.50 = 30%.

Definition 8. Let λ be a pre-defined minimum weighted support threshold. An itemset X is called a weighted frequent itemset (WF) if wsupX ≥ λ. For example, if λ = 30%, then the itemset {AE} is a weighted frequent itemset, since wsup{AE} = 30% ≥ 30%.

However, the downward-closure property used in association-rule mining does not hold for the problem of weighted frequent itemset mining. This is because the weight function is an average concept, and thus the actual weighted supports of itemsets cannot be directly used to find the weighted frequent itemsets in databases. Take the item A in Table 3.1

as an example. There are three transactions that include this item in Table 3.1, and the weight of the item A in Table 3.2 is 0.30. The weighted support value of the itemset {A} can then be calculated as (0.30 + 0.30 + 0.30) / 3.50, which is 25.71%. If λ is set at 30%, then the itemset {A} is not a weighted frequent itemset, but its super-itemset {AE} is a weighted frequent itemset. As this example shows, the problem of weighted frequent itemset mining is more difficult to solve than traditional frequent itemset mining. Yun et al. subsequently proposed an upper-bound model to address this, in which the maximum weight in a database is regarded as the upper bound of the weight value of each transaction, so as to keep the downward-closure property in weighted frequent itemset mining [45]. However, the downward-closure property can also be kept by using the maximum weight in each transaction. We thus propose an effective transaction maximum weight (TMW) model to tighten the upper bounds of the weight values used when mining itemsets, and the relevant terms used in the proposed TMW model are defined as follows.

Definition 9. The transaction-weighted upper bound of an itemset X, twubX, is the sum of the transaction maximum weights of the transactions including X in TDB divided by the total transaction maximum weight ttmw of TDB. That is:

    twubX = (Σ_{X ⊆ Transy ∧ Transy ∈ TDB} tmwy) / ttmw.

For example, in Table 3.1, item E appears in four transactions, Trans1, Trans3, Trans4, and Trans5, whose transaction maximum weights are 0.50, 0.60, 0.95, and 0.50, respectively.

Therefore, twub{E} = 2.55 / 3.50 = 72.85%.

Definition 10. Let λ be a pre-defined minimum weighted support threshold. An itemset X is called a weighted frequent upper-bound itemset (WFUB) if twubX ≥ λ. For example, if λ = 30%, then the itemset {E} is a weighted frequent upper-bound itemset since twub{E} = 72.85% ≥ λ.

Based on the definitions given above, weighted frequent itemset mining considers the individual weights of the items in a transaction database. The goal is to effectively and efficiently find all the weighted frequent itemsets whose weighted supports are larger than or equal to a predefined minimum weighted support threshold λ in a given transaction database. The details of the proposed PWA are given in the next section.

3.2 The Projection-based Weighted Frequent Itemset Mining Algorithm, PWA

In this section, a new projection-based mining algorithm is proposed to effectively handle the problem of finding weighted frequent itemsets in a transaction database. The improved upper-bound model and the pruning strategy used in the proposed algorithm are developed to help its execution. The improved upper-bound model is first described below.

3.2.1 Improved Upper-bound Model

A weight upper-bound model is proposed to enhance the traditional weight upper-bound model [45]; the proposed model tightens the upper bounds of the weights of itemsets in the mining process. In the traditional upper-bound model [45], the maximum weight in a transaction database is used as the upper bound of the weight of each transaction to maintain the downward-closure property for weighted frequent itemset mining. However, the maximum weight in a transaction can be used to achieve the same goal. That is, the maximum weight in a transaction can be regarded as the upper bound of the weight of that transaction. To show the completeness of the TMW model, two lemmas are given below to prove that no weighted frequent itemsets are missed in any weighted frequent itemset mining case.

Lemma 3.1: The transaction-weighted upper bound of an itemset X keeps the downward-closure property.

Proof: Let X be a weighted frequent upper-bound itemset and dX be the set of transactions that contain X in a transaction database TDB. If Y is a super-itemset of X, then Y cannot exist in any transaction where X is absent. Therefore, the transaction-weighted upper bound twubX of X is an upper bound of the weight value of Y. Accordingly, if twubX is less than a predefined minimum weighted support threshold, then Y cannot be a weighted frequent upper-bound itemset.

Lemma 3.2: For a transaction database TDB and a predefined minimum weighted

support threshold, the set of weighted frequent itemsets WF is a subset of the set of weighted frequent upper-bound itemsets WFUB.

Proof: Let X be a weighted frequent itemset. According to Definitions 7 and 9, the actual weighted support wsupX of X must be less than or equal to its transaction-weighted upper bound twubX. Accordingly, if X is a weighted frequent itemset, then it must be a weighted frequent upper-bound itemset. As a result, X is a member of the set WFUB.

Based on Lemmas 3.1 and 3.2, all weighted frequent itemsets in a transaction database can be discovered. The proposed model can thus be used to effectively tighten the upper bounds of the weights of itemsets compared to those obtained using the traditional upper-bound model [45]. An example is given below to illustrate how the model improves the upper bounds of the weights of itemsets.

According to the traditional upper-bound model [45], the maximum weight value in Table 3.1 is 0.95, which is regarded as the upper bound of each transaction in the database. Take item E as an example. It appears in four transactions, namely Trans1, Trans3, Trans4, and Trans5. The upper bounds of the weights of the four transactions are all 0.95, so the upper bound of the weight of E is calculated as (0.95 + 0.95 + 0.95 + 0.95), which is 3.80.

Based on the proposed upper-bound model, the upper bound of the weight of E can be tightened. First, the maximum weight in each transaction is found. Take the first transaction Trans1: {ACEF} in Table 3.1 as an example. The transaction includes four items, A, C, E and

F, whose weights are 0.30, 0.45, 0.40 and 0.50, respectively. The maximum weight is 0.50, which is regarded as the upper bound of the weight of the transaction Trans1. The other transactions in Table 3.1 can be similarly processed. The maximum weights of the five transactions are 0.50, 0.95, 0.60, 0.95, and 0.50, respectively. The transaction-weighted upper bound of item E can then be calculated as 0.50 + 0.60 + 0.95 + 0.50, which is 2.55.

3.2.2 Pruning Strategy for Unpromising Items

In this section, a simple pruning strategy based on the proposed model and a projection-based technique is designed to effectively reduce the number of unpromising itemsets generated in mining. According to Lemmas 3.1 and 3.2, the downward-closure property for weighted frequent itemset mining can be maintained by using the proposed model. Based on the model, any sub-itemset of a weighted frequent upper-bound itemset must be a weighted frequent upper-bound itemset. In contrast, if an itemset has a sub-itemset that is not a weighted frequent upper-bound itemset, then the itemset is neither a weighted frequent upper-bound itemset nor a weighted frequent itemset. In this case, the itemset can be skipped, since it is impossible for it to be a weighted frequent upper-bound itemset. This concept is applied in the pruning strategy to reduce the number of unpromising itemsets in the recursive process.

The proposed projection-based algorithm proceeds as follows. First, when all the weighted

frequent upper-bound r-itemsets with r items are found, all the items in the set of weighted frequent upper-bound r-itemsets are gathered as the pruning information for each prefix r-itemset to be processed. Next, the additional (r+1)-th item of each generated (r+1)-itemset in the next recursive process is checked for whether it appears in the set of gathered items. If it does, the generated (r+1)-itemset is placed in the set of (r+1)-itemsets; otherwise, it is pruned. An example is given below to illustrate the pruning of unpromising itemsets in the recursive process.

For example, consider the transaction {ABDEF}, where the symbols represent items, and assume the itemsets {AB} and {AC} are included in the set of weighted frequent upper-bound 2-itemsets WFUB2,{A} with {A} as their prefix. In this case, only the three items A, B, and C are gathered from the set as the pruning information. The next prefix itemset to be processed is {AB}, and the transaction {ABDEF} is a projected transaction of {AB}. The items D, E, and F do not appear in the set of gathered items, so the three items have to be removed from the transaction; the modified transaction is {AB}. The items can be removed because the super-itemsets consisting of any of the three items and the prefix {A} are not weighted frequent upper-bound itemsets. Moreover, since the number of items kept in the modified transaction is less than three, which is the number of items in the 3-itemsets to be generated, the modified transaction can be removed from the projected transactions of {AB}. As the example shows, this strategy effectively speeds up the proposed algorithm by

removing unpromising items.

3.2.3 The Proposed Weighted Mining Algorithm

The projection-based weighted itemset mining algorithm with the pruning strategy (PWA) is stated as follows.

INPUT: A set of items, each with a weight; a transaction database TDB, in which each transaction includes a subset of items; a minimum weighted support threshold λ.
OUTPUT: A final set of weighted frequent itemsets, WF.

STEP 1: For each y-th transaction Transy in TDB, find the transaction maximum weight tmwy of the transaction Transy as:

    tmwy = max{wy1, wy2, …, wyj},

where wyj is the weight value wi of the j-th item i in Transy.

STEP 2: Find the total transaction maximum weight ttmw of the transaction database TDB as:

    ttmw = Σ_{Transy ∈ TDB} tmwy.

STEP 3: For each item i in TDB, do the following substeps.
(a) Calculate the transaction-weighted upper bound twubi of the item i as:

    twubi = (Σ_{i ∈ Transy ∧ Transy ∈ TDB} tmwy) / ttmw,

where tmwy is the transaction maximum weight of each Transy in TDB.
(b) Calculate the actual weighted support wsupi of the item i as:

    wsupi = (Σ_{i ∈ Transy ∧ Transy ∈ TDB} wi) / ttmw,

where wi is the weight value of the item i in the weight table.

STEP 4: For each item i in TDB, do the following substeps.
(a) If the transaction-weighted upper bound twubi of i is larger than or equal to the minimum weighted support threshold λ, put it in the set of weighted frequent upper-bound 1-itemsets, WFUB1.
(b) If the actual weighted support wsupi of i is larger than or equal to the minimum weighted support threshold λ, put it in the set of weighted frequent 1-itemsets, WF1.

STEP 5: Set r = 1, where r represents the number of items in the itemsets currently being processed.

STEP 6: Gather the items appearing in the set of WFUB1, and put them in the set of possible items, PIr.

STEP 7: For each y-th transaction Transy in TDB, do the following substeps.
(a) Get each item i in Transy.
(b) Check whether the item i appears in PIr or not. If it does, then keep the item i in the

transaction Transy; otherwise, remove the item i from Transy.
(c) If the number of items kept in the modified transaction Transy is less than r + 1, then remove the modified transaction Transy from TDB; otherwise, keep it in TDB.

STEP 8: Process each itemset X in the set of WFUB1 in alphabetical order by the following substeps.
(a) Find the relevant transactions including X in TDB, and put the transactions in the set of projected transactions tdbX of the itemset X.
(b) Find all the weighted frequent itemsets with X as their prefix itemset by the Finding-WF(X, tdbX, r) procedure. Let the set of returned weighted frequent itemsets be WFX.

STEP 9: Output all the weighted frequent itemsets in the sets WFX.

In STEP 8, the weighted frequent itemsets are found by the Finding-WF(X, tdbX, r) procedure, which is stated below.

The Finding-WF(X, tdbX, r) procedure:
Input: A prefix r-itemset X and its corresponding projected transactions tdbX.
Output: The weighted frequent itemsets with the prefix itemset X.

PSTEP 1: Initialize the temporary itemset table TIX as an empty table, in which each tuple consists of three fields: the itemset, the transaction-weighted upper bound (twub) of the itemset, and the actual weighted support (wsup) of the itemset.

PSTEP 2: For each y-th transaction Transy in tdbX, do the following substeps.
(a) Get each item i located after X in Transy.
(b) Generate the (r+1)-itemset X' composed of the prefix r-itemset X and i. If the itemset X' does not appear in the temporary set of itemsets, put it in the temporary set.
(c) For each itemset X' in the temporary set of itemsets, add the transaction maximum weight tmwy of the y-th transaction Transy and the weight wX' of the itemset X' to the corresponding fields of X' in the TIX table.

PSTEP 3: For each (r+1)-itemset X' in the TIX table, do the following substeps.
(a) If the transaction-weighted upper bound twubX' of the (r+1)-itemset X' is larger than or equal to the minimum weighted support threshold λ, put it in the set of weighted frequent upper-bound (r+1)-itemsets with X as their prefix itemset, WFUB(r+1),X.
(b) If the actual weighted support wsupX' of the (r+1)-itemset X' is larger than or equal to the minimum weighted support threshold λ, put it in the set of weighted frequent

itemsets, WFX.

PSTEP 4: Gather the items appearing in the set WFUB(r+1),X of X, and put them in the set of possible items, PI(r+1),X.

PSTEP 5: Set r = r + 1, where r represents the number of items in the itemsets currently being processed.

PSTEP 6: For each y-th transaction Transy in tdbX, do the following substeps.
(a) Check whether each item i in Transy appears in PIr,X or not. If it does, then keep the item i in Transy; otherwise, remove the item i from Transy.
(b) If the number of items kept in the modified transaction Transy is less than r + 1, remove the modified transaction Transy from tdbX; otherwise, keep it in tdbX.

PSTEP 7: Process each itemset X' in the set of WFUBr,X in alphabetical order by the following substeps.
(a) Find the relevant transactions including X' from tdbX, and then put the transactions including X' in the projected transaction database tdbX' of the prefix X'.
(b) Find all weighted frequent itemsets with X' as their prefix itemset by the Finding-WF(X, tdbX, r) procedure, where X = X' and tdbX = tdbX'.

PSTEP 8: Return the weighted frequent itemsets in the set of WFX.

3.2.4 An Example of PWA

In this section, an example is given to illustrate how the proposed algorithm finds weighted frequent itemsets in a transaction database. Assume there are five transactions in a transaction database, as shown in Table 3.1, and the eight items in the transactions are denoted as A to H, respectively. In addition, assume the individual weights of the items are set as in Table 3.2, and the minimum weighted support threshold λ is set at 30%. According to the given data, the proposed algorithm proceeds as follows.

STEP 1: The transaction maximum weight tmw of each transaction in the transaction database TDB is first found. Take the first transaction Trans1 in Table 3.1 as an example. The transaction includes the four items A, C, E and F, whose weight values are 0.30, 0.45, 0.40, and 0.50, respectively. The maximum weight value among them is 0.50, and this value is regarded as the transaction maximum weight of the transaction Trans1. All the other transactions in Table 3.1 can be processed in the same way, and the results are shown in Table 3.3.

Table 3.3: Transaction maximum weights of the five transactions in the example.

TID     Transaction  tmwy
Trans1  {ACEF}       0.50
Trans2  {BH}         0.95
Trans3  {BCE}        0.60
Trans4  {ACDEGH}     0.95
Trans5  {ADEF}       0.50
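The computation above can be sketched in a few lines (a minimal sketch over the data of Tables 3.1 and 3.2, not the thesis's implementation); it reproduces the tmw column of Table 3.3 and the total ttmw of the next step:

```python
# STEP 1: the transaction maximum weight (tmw) of each transaction is the
# largest weight among its items. STEP 2: ttmw is the sum over all transactions.
weights = {'A': 0.30, 'B': 0.60, 'C': 0.45, 'D': 0.20,
           'E': 0.40, 'F': 0.50, 'G': 0.15, 'H': 0.95}
tdb = [set('ACEF'), set('BH'), set('BCE'), set('ACDEGH'), set('ADEF')]

tmw = [max(weights[i] for i in trans) for trans in tdb]
ttmw = sum(tmw)

print(tmw)   # [0.5, 0.95, 0.6, 0.95, 0.5], as in Table 3.3
print(ttmw)  # 3.5
```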

STEP 2: Because the transaction maximum weights of the five transactions in Table 3.3 are 0.50, 0.95, 0.60, 0.95, and 0.50, the total transaction maximum weight ttmw is found as 0.50 + 0.95 + 0.60 + 0.95 + 0.50 = 3.50.

STEP 3: The transaction-weighted upper bound (twub) and the weighted support (wsup) of each item in TDB are found simultaneously. Take item A in Table 3.3 as an example. Item A appears in three transactions, Trans1, Trans4, and Trans5, and the transaction maximum weights of the three transactions are 0.50, 0.95, and 0.50, respectively. In addition, the weight of item A in Table 3.2 is 0.30, and the total transaction maximum weight ttmw of Table 3.3 is 3.50. Therefore, the transaction-weighted upper bound twubA of item A can be calculated as (0.50 + 0.95 + 0.50) / 3.50, which is 55.71%, and its weighted support wsupA can be calculated as (0.30 + 0.30 + 0.30) / 3.50, which is 25.71%. The same process can be done for all the other items in TDB. The results for the transaction-weighted upper bounds and the weighted supports of all 1-itemsets in TDB are shown in Table 3.4.

Table 3.4: The transaction-weighted upper bounds and weighted supports of all candidate 1-itemsets.

Itemset  twub     wsup
{A}      55.71%   25.71%
{B}      44.28%   34.28%
{C}      58.57%   38.57%
{D}      41.42%   11.42%
{E}      72.85%   45.71%
{F}      28.57%   28.57%
{G}      27.14%    4.28%
{H}      54.28%   54.28%

STEP 4: The weighted frequent upper-bound 1-itemsets (WFUB1) and the weighted frequent 1-itemsets (WF1) can be found simultaneously from Table 3.4. Take the 1-itemset {D} in Table 3.4 as an example. The transaction-weighted upper bound and the weighted support of the 1-itemset {D} in Table 3.4 are 41.42% and 11.42%, respectively. Since the transaction-weighted upper bound twub{D} is larger than or equal to the minimum weighted support threshold 30%, the 1-itemset {D} is a weighted frequent upper-bound 1-itemset, but not a weighted frequent 1-itemset. All the other 1-itemsets in Table 3.4 can be processed in the same way. After this step is finished, the set of weighted frequent upper-bound 1-itemsets (WFUB1) contains the six itemsets {A}, {B}, {C}, {D}, {E}, and {H}, and the set of weighted frequent 1-itemsets (WF1) contains {B}, {C}, {E}, and {H}. The results for the two sets WFUB1 and WF1 are shown in Table 3.5 and Table 3.6, respectively.

Table 3.5: The set of weighted frequent upper-bound 1-itemsets in the example.

Itemset  twub
{A}      55.71%
{B}      44.28%
{C}      58.57%
{D}      41.42%
{E}      72.85%
{H}      54.28%

Table 3.6: The set of weighted frequent 1-itemsets in the example.

Itemset  wsup
{B}      34.28%
{C}      38.57%
{E}      45.71%
{H}      54.28%

STEP 5: The variable r is initially set to 1, where r represents the number of items in the itemsets currently being processed.

STEP 6: In the example, the set of weighted frequent upper-bound 1-itemsets contains {A}, {B}, {C}, {D}, {E}, and {H}. The six items A, B, C, D, E, and H are thus collected from the set of WFUB1 and denoted as PI1 (possible items).

STEP 7: For each transaction in Table 3.3, the items not appearing in the set of PI1 are removed from the transaction. Take the last transaction Trans5 in Table 3.3 as an example. The transaction contains four items, A, D, E, and F, and the set of PI1 contains the six items A, B, C, D, E, and H. Because the last item F in Trans5 does not appear in the set of PI1, the

item F is removed from Trans5, and the transaction is modified to {ADE}. In addition, the transaction maximum weight of the modified transaction Trans5 remains the original value of 0.50. Next, since the number of items kept in the modified transaction (= 3) is larger than or equal to two, the modified transaction is kept in Table 3.3. All the other four transactions in Table 3.3 can be processed in the same way. The modified transactions and their transaction maximum weight values are shown in Table 3.7.

Table 3.7: All the modified transactions in this example.

Transaction  tmwy
{ACE}        0.50
{BH}         0.95
{BCE}        0.60
{ACDEH}      0.95
{ADE}        0.50

STEP 8: Each 1-itemset in the set of WFUB1 is processed in alphabetical order, with the 1-itemset {A} being processed first. In this example, the projected transactions tdb{A} of the prefix item A, that is, the transactions in Table 3.7 in which item A appears, include the three transactions Trans1: {ACE}, Trans4: {ACDEH}, and Trans5: {ADE}, whose transaction maximum weight values are 0.50, 0.95, and 0.50, respectively. According to the information above, the weighted frequent itemsets with the prefix itemset {A}

are then found by using the Finding-WF(X, tdbX, r) procedure with the parameters X = {A}, tdbX = tdb{A}, and r = 1. The procedure proceeds as follows.

PSTEP 1: The temporary itemset table TI{A} is initialized as an empty table, in which each tuple consists of three fields: the itemset, the transaction-weighted upper bound (twub) of the itemset, and the actual weighted support (wsup) of the itemset.

PSTEP 2: For each transaction in the projected transactions tdb{A} of {A}, all possible 2-itemsets with the prefix {A} are generated. Take the first projected transaction {ACE} in tdb{A} as an example. Since there are two items located after the prefix itemset {A} in the transaction, the two 2-itemsets {AC} and {AE} are generated from the transaction {ACE}, and the weight values of {AC} and {AE} are 0.375 and 0.35, respectively. The two 2-itemsets are then put in the TI{A} table, and the transaction maximum weight value (= 0.50) of the transaction {ACE} and the weight values of the two 2-itemsets with the prefix {A} are added to the corresponding fields of the two 2-itemsets in the TI{A} table. The other two projected transactions in tdb{A} can be processed in the same way. The results for the transaction-weighted upper bounds and weighted supports of all the possible 2-itemsets with the prefix {A} are shown in Table 3.8.

Table 3.8: The transaction-weighted upper bounds and the actual weighted supports of all the 2-itemsets with the prefix {A} in this example.

Itemset  twub     wsup
{AC}     41.42%   21.42%
{AD}     41.42%   14.28%
{AE}     55.71%   30.0%
{AH}     27.14%   17.85%

PSTEP 3: All the weighted frequent upper-bound 2-itemsets (WFUB2,{A}) and weighted frequent 2-itemsets (WF{A}) in Table 3.8 can be found simultaneously. The process is the same as that mentioned in STEP 4. In this example, the three 2-itemsets {AC}, {AD} and {AE} are put in the set of WFUB2,{A}, since their transaction-weighted upper bounds satisfy the minimum weighted support threshold (= 30%). However, only the itemset {AE} can be put in the set of WF{A}.

PSTEP 4: In the example, the four items A, C, D and E are collected from the set of weighted frequent upper-bound 2-itemsets with the prefix itemset {A}, and they are then denoted as PI2,{A}.

PSTEP 5: The value of the variable r is updated to 2.

PSTEP 6: For each projected transaction in tdb{A}, the items not appearing in PI2,{A} are removed from the projected transaction. The process is similar to that of STEP 7. After this step, all the modified transactions with their original transaction maximum weight values in tdb{A} are shown in Table 3.9.

Table 3.9: The modified transactions in tdb{A} and their transaction maximum weight values.

TID     Transaction  tmwy
Trans1  {ACE}        0.50
Trans4  {ACDE}       0.95
Trans5  {ADE}        0.50

PSTEP 7: Each itemset in the set of WFUB2,{A} is processed in alphabetical order. The prefix itemset to be processed first is thus {AC}, and the projected transactions {ACE} and {ACDE} in Table 3.9 are put in the set of projected transactions tdb{AC} of the prefix itemset {AC}. The other itemsets in WFUB1 can be recursively processed in the same way until all the 1-itemsets in the set of WFUB1 have been done. The results for all the weighted frequent itemsets are shown in Table 3.10.

Table 3.10: The final set of WF in the example.

Itemset  wsup
{B}      34.28%
{C}      38.57%
{E}      45.71%
{H}      54.28%
{AE}     30.0%
{CE}     36.42%

STEP 9: In this example, the six weighted frequent itemsets in Table 3.10 are output to users.

3.3 Projection-based Weighted Mining Algorithm with Improved Strategies, PWAI

In this section, a projection-based weighted mining algorithm with two effective strategies, tightening and filtering, is proposed to improve the efficiency of the PWA. The tightening strategy is described in Section 3.3.1, and the filtering strategy is described in Section 3.3.2.

3.3.1 Tightening Upper-bound Strategy

In this section, a tightening strategy for mining weighted frequent itemsets is proposed to tighten the upper bounds of transactions in the mining process. The main concept of the strategy is that the unpromising items in transactions can be removed from the transactions according to the items appearing in the set of weighted frequent upper-bound itemsets, and the upper bounds of the transactions can then be reduced to tighten the upper bounds of the weights of itemsets. Since the downward-closure property is kept by the proposed transaction-weighted upper-bound model, the items not appearing in the set of weighted frequent upper-bound itemsets can be safely removed from transactions. An example is given below to describe the procedure of the tightening strategy.

For example, assume a transaction {ABDEF} with five items, respectively denoted as A, B, D, E, and F, whose weights are 0.20, 0.60, 0.30, 0.40, and 0.50, respectively. In addition, assume the weighted frequent upper-bound 1-itemsets include {A}, {D}, {E}, and {F}.

Since item B in {ABDEF} does not appear in the weighted frequent upper-bound 1-itemsets, item B can be removed from the transaction, and the new transaction is {ADEF}. The new upper bound of the weight of the transaction {ADEF} can then be updated as 0.50. As the example describes, the unpromising items can be pruned effectively by using the items appearing in the weighted frequent upper-bound itemsets. Hence, the tightening strategy can be used to improve the performance in finding weighted frequent itemsets.

3.3.2 Filtering Strategy

Different from the tightening strategy in Section 3.3.1, the filtering strategy can further tighten the upper bounds of the weights of itemsets by using the weighted frequent upper-bound 2-itemsets. The reason for this is that the tightening strategy only prunes the unpromising items not appearing in the set of weighted frequent upper-bound 1-itemsets. With the weighted frequent upper-bound 2-itemsets, for a prefix itemset to be processed, each item in the projected transactions of the prefix itemset is checked as to whether the combination of the prefix itemset and the item is a weighted frequent upper-bound 2-itemset or not. If it is, the item is kept in the transaction; otherwise, the item is removed from the transaction. Based on this concept, the upper bounds of the weights of the projected transactions of a prefix itemset can be further tightened to improve the execution efficiency.

To perform the strategy efficiently, the weighted frequent upper-bound 1-itemsets are sorted in alphabetical order and are processed in a bottom-up way. That is, the last 1-itemset among the sorted 1-itemsets is processed first. For a prefix itemset to be processed, each item located after the prefix in a transaction containing the prefix is then checked as to whether the combination of the last item in the prefix and that item is a weighted frequent upper-bound 2-itemset or not. An example is given below to illustrate the concept of the filtering strategy.

For example, assume there is a transaction {ABCDE} with five weighted frequent items, and the weight values of the five items are 0.20, 0.30, 0.40, 0.60, and 0.50, respectively. Also, assume that, owing to the bottom-up processing order, the weighted frequent upper-bound itemsets with each of the other four items as their prefix itemsets have already been found before item A is processed. The weighted frequent upper-bound 2-itemsets with the four items as prefixes can thus be obtained, such as {BC}, {CD}, and {DE}. Next, assume the current prefix itemset is {A}, and the weighted frequent upper-bound 2-itemsets (WFUB2,{A}) with item A as their prefix item, {AB}, {AC}, {AD}, and {AE}, have been found. Now, the itemset {AB} among the four itemsets with {A} as their prefix is processed first. Based on the tightening strategy mentioned in Section 3.3.1, however, since all the five items in the assumed transaction {ABCDE} appear in the set of

WFUB2,{A}, no items can be removed from the transaction {ABCDE}. However, it can be further observed that the last item of the prefix {AB} is item B, and among all the items located after item B in the transaction, only item C is promising by using the information of WFUB2,{B}, since only the item combination {BC} is a weighted frequent upper-bound 2-itemset. Thus, the other two items, D and E, can be removed from the transaction, and the new projected transaction of {AB} is {ABC}. In addition, the transaction maximum weight of the modified transaction is 0.40. From the example, it is observed that the filtering strategy can further remove unpromising items for the current prefix itemsets, so that more unpromising upper-bound itemsets can be avoided in mining. Hence, the proposed algorithm can effectively improve the execution efficiency with the help of the two strategies in finding weighted frequent itemsets.

3.3.3 The Proposed Mining Algorithm with Improved Strategies

The procedure of the proposed mining algorithm with the effective strategies is stated below.

INPUT: A set of items, each with a weight; a transaction database TDB, in which each transaction includes a subset of items; a minimum weighted support threshold λ.

OUTPUT: A final set of weighted frequent itemsets, WF.

STEP 1: For each y-th transaction Transy in TDB, find the transaction maximum weight tmwy of the transaction Transy as:

tmwy = max{wy1, wy2, …, wyj},

where wyj is the weight value wi of the j-th item i in Transy.

STEP 2: Find the total transaction maximum weight ttmw for TDB as:

ttmw = Σ_{Transy ∈ TDB} tmwy.

STEP 3: For each item i in TDB, do the following substeps.
(a) Calculate the transaction-weighted upper bound twubi of item i as:

twubi = (Σ_{i ∈ Transy ∧ Transy ∈ TDB} tmwy) / ttmw,

where tmwy is the transaction maximum weight of each Transy in TDB.
(b) Calculate the actual weighted support wsupi of item i as:

wsupi = (Σ_{i ∈ Transy ∧ Transy ∈ TDB} wi) / ttmw,

where wi is the weight value of item i in the weight table.

STEP 4: For each item i in TDB, do the following substeps.
(a) If the transaction-weighted upper bound twubi of i is larger than or equal to the minimum weighted support threshold λ, put it in the set of weighted frequent

upper-bound 1-itemsets, WFUB1.
(b) If the actual weighted support wsupi of i is larger than or equal to the minimum weighted support threshold λ, put it in the set of weighted frequent 1-itemsets, WF1.

STEP 5: Set r = 1, where r represents the number of items in the processed itemsets.

STEP 6: Gather the items appearing in the set of WFUB1, and put them in the set of possible items, PIr.

STEP 7: For each y-th transaction Transy in TDB, do the following substeps.
(a) Get each item i in Transy.
(b) Check whether the item i appears in PIr or not. If yes, keep the item i in the transaction Transy; otherwise, remove the item i from the transaction Transy.
(c) If the number of items kept in the modified transaction Transy is less than r + 1, remove the modified transaction Transy from TDB; otherwise, keep it in TDB.

STEP 8: Process each itemset X in the set of WFUB1 in alphabetical order, from the last one to the first one (as mentioned in Section 3.3.2), by the following substeps.
(a) Find the relevant transactions including X in TDB, and put the transactions in the set of projected transactions tdbX of the itemset X.

(b) Find the transaction maximum weight tmwy of each Transy in tdbX as:

tmwy = max{wy1, wy2, …, wyj},

where wyj is the weight value wi of the j-th item i in Transy.
(c) Find all the weighted frequent itemsets with X as their prefix itemset by the Finding-WF(X, tdbX, r) procedure. Let the set of returned weighted frequent itemsets be WFX.

STEP 9: Output all the weighted frequent itemsets in all the sets WFX.

After STEP 9, all the weighted frequent itemsets have been found by the Finding-WF(X, tdbX, r) procedure, which is stated below.

The Finding-WF(X, tdbX, r) procedure:
Input: A prefix r-itemset X and its corresponding projected transactions tdbX.
Output: The weighted frequent itemsets with the prefix itemset X.

PSTEP 1: Initialize the temporary itemset table TIX as an empty table, in which each tuple consists of three fields: the itemset, the transaction-weighted upper bound (twub) of the itemset, and the actual weighted support (wsup) of the itemset.

PSTEP 2: For each y-th transaction Transy in tdbX, do the following substeps.

(a) Get each item i located after X in Transy.
(b) Generate the (r+1)-itemset X' composed of the prefix r-itemset X and item i. If the itemset X' does not already appear in the temporary set of itemsets, put it in the temporary set; otherwise, discard the duplicate.
(c) For each itemset X' in the temporary set of itemsets, add the transaction maximum weight tmwy of the y-th transaction Transy and the weight wX' of the itemset X' to the corresponding fields in the TIX table.

PSTEP 3: For each (r+1)-itemset in the TIX table, do the following substeps.
(a) If the transaction-weighted upper bound twubX' of the (r+1)-itemset X' is larger than or equal to the minimum weighted support threshold λ, put it in the set of weighted frequent upper-bound (r+1)-itemsets with X as their prefix itemset, WFUB(r+1),X.
(b) If the actual weighted support wsupX' of the (r+1)-itemset X' is larger than or equal to the minimum weighted support threshold λ, put it in the set of weighted frequent itemsets, WFX.

PSTEP 4: Gather the items appearing in the set WFUB(r+1),X of X, and put them in the set of possible items, PI(r+1),X.

PSTEP 5: Set r = r + 1, where r represents the number of items in the processed itemsets.

PSTEP 6: For each y-th transaction Transy in tdbX, do the following substeps.

(a) Check whether each item i in Transy appears in PIr,X or not. If yes, keep the item i in Transy; otherwise, remove the item i from Transy.
(b) If the number of items kept in the modified transaction Transy is less than r + 1, remove the modified transaction Transy from tdbX; otherwise, keep it in tdbX.

PSTEP 7: Process each itemset X' in the set of WFUBr,X in alphabetical order by the following substeps.
(a) Find the relevant transactions including X' from tdbX, and put the transactions including X' in the projected transaction database tdbX' of the prefix X'.
(b) For each item i located after X' in each transaction of tdbX', if the combination of the last (r-th) item of X' and the item i is a weighted frequent upper-bound 2-itemset, keep the item i in the projected transaction; otherwise, remove the item i from the projected transactions with the itemset X' as their prefix.
(c) If the length of a projected transaction in tdbX' is less than r + 1, remove the projected transaction from tdbX'.
(d) Calculate the new transaction maximum weight tmwy of each Transy in tdbX' as:

tmwy = max{wy1, wy2, …, wyj},

where wyj is the weight value wi of the j-th item i in Transy.
(e) Find all weighted frequent itemsets with X' as their prefix itemset by the

Finding-WF(X, tdbX, r) procedure, where X = X' and tdbX = tdbX'.

PSTEP 8: Return the weighted frequent itemsets in the set of WFX.

3.3.4 An Example of PWAI

In this section, an example is given to describe how weighted frequent itemsets are found from a transaction database by the PWAI algorithm. The details of the dataset are shown in Table 3.1, and the individual weights of the items are given in Table 3.2. The minimum weighted support threshold λ is set at 30%. According to the given data, the detailed process of the proposed algorithm is described below.

STEP 1: The transaction maximum weight tmw of each transaction in the transaction database TDB is first found. Take the first transaction Trans1 in Table 3.1 as an example. The transaction includes the four items A, C, E, and F, whose weight values are 0.30, 0.45, 0.40, and 0.50, respectively. The maximum weight value among them is 0.50, and this value is taken as the transaction maximum weight of the transaction Trans1. All the other transactions in Table 3.1 can be processed in the same way. The resulting transaction maximum weights of the five transactions are shown in Table 3.3.

STEP 2: In the example, because the transaction maximum weights of the five transactions in Table 3.3 are 0.50, 0.95, 0.60, 0.95, and 0.50, the total transaction maximum weight ttmw can be found as 0.50 + 0.95 + 0.60 + 0.95 + 0.50 = 3.50.
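As a minimal illustration of STEPs 1 and 2, the computation of the transaction maximum weights and their total can be sketched in Python. Note that Tables 3.1 and 3.2 are not reproduced in this excerpt, so every item and weight other than those of Trans1 is a hypothetical value chosen only so that the five tmw values and ttmw match the worked example.

```python
# STEPs 1-2 of PWAI: transaction maximum weights (tmw) and their total (ttmw).
# Only Trans1 = {A, C, E, F} and its weights come from the thesis example;
# the remaining items/weights are assumptions reproducing the same tmw values.
weights = {'A': 0.30, 'B': 0.95, 'C': 0.45, 'D': 0.20,
           'E': 0.40, 'F': 0.50, 'G': 0.60, 'H': 0.55}
TDB = [
    ['A', 'C', 'E', 'F'],   # Trans1 -> tmw = 0.50 (from Table 3.1)
    ['B', 'D', 'E', 'H'],   # Trans2 -> tmw = 0.95 (assumed items)
    ['C', 'G', 'H'],        # Trans3 -> tmw = 0.60 (assumed items)
    ['A', 'B', 'D'],        # Trans4 -> tmw = 0.95 (assumed items)
    ['A', 'C', 'F'],        # Trans5 -> tmw = 0.50 (assumed items)
]

# STEP 1: the transaction maximum weight of each transaction
tmw = [max(weights[item] for item in trans) for trans in TDB]

# STEP 2: the total transaction maximum weight of the whole database
ttmw = sum(tmw)

print([round(v, 2) for v in tmw])  # [0.5, 0.95, 0.6, 0.95, 0.5]
print(round(ttmw, 2))              # 3.5
```

Any database whose per-transaction maxima are 0.50, 0.95, 0.60, 0.95, and 0.50 yields the same ttmw of 3.50.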

STEP 3: The transaction-weighted upper bound (twub) and the weighted support (wsup) of each item in TDB are found simultaneously. Take item A in Table 3.3 as an example. Item A appears in three transactions, Trans1, Trans4, and Trans5, and the transaction maximum weights of these three transactions are 0.50, 0.95, and 0.50, respectively. In addition, the weight of item A in Table 3.2 is 0.30, and the total transaction maximum weight ttmw in Table 3.3 is 3.50. Hence, the transaction-weighted upper bound twubA of item A can be calculated as (0.50 + 0.95 + 0.50) / 3.50, which is 55.71%, and its weighted support wsupA can be calculated as (0.30 + 0.30 + 0.30) / 3.50, which is 25.71%. The same process can be done for all the other items in TDB. The transaction-weighted upper bounds and the weighted supports of all 1-items in TDB are shown in Table 3.4.

STEP 4: The weighted frequent upper-bound 1-itemsets (WFUB1) and the weighted frequent 1-itemsets (WF1) can be found simultaneously from Table 3.4. Take the 1-itemset {D} in Table 3.4 as an example. The transaction-weighted upper bound and the weighted support of the 1-itemset {D} in Table 3.4 are 47.14% and 11.42%, respectively. Since the transaction-weighted upper bound twub{D} is larger than or equal to the minimum weighted support threshold 30%, the 1-itemset {D} is a weighted frequent upper-bound 1-itemset; however, since its weighted support is below the threshold, it is not a weighted frequent 1-itemset. All the other 1-itemsets in Table 3.4 can be processed in the same way. After this step is finished, the set of the weighted frequent upper-bound 1-itemsets (WFUB1) contains the following six itemsets, {A}, {B}, {C}, {D}, {E}, and {H}, and the set
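The computations of STEPs 3 and 4 can likewise be sketched in Python. Since Tables 3.1-3.4 are not reproduced in this excerpt, the database below is a hypothetical one in which item A appears in Trans1, Trans4, and Trans5 with weight 0.30, as stated in the example; only item A's figures are therefore guaranteed to match the thesis tables.

```python
# STEPs 3-4 of PWAI: transaction-weighted upper bounds (twub), actual
# weighted supports (wsup), and the sets WFUB1 / WF1 under lambda = 30%.
# The database is hypothetical except for Trans1 and the placement of item A.
weights = {'A': 0.30, 'B': 0.95, 'C': 0.45, 'D': 0.20,
           'E': 0.40, 'F': 0.50, 'G': 0.60, 'H': 0.55}
TDB = [['A', 'C', 'E', 'F'], ['B', 'D', 'E', 'H'], ['C', 'G', 'H'],
       ['A', 'B', 'D'], ['A', 'C', 'F']]
LAMBDA = 0.30

tmw = [max(weights[item] for item in trans) for trans in TDB]   # STEP 1
ttmw = sum(tmw)                                                 # STEP 2

twub, wsup = {}, {}
for item in weights:                                            # STEP 3
    rows = [y for y, trans in enumerate(TDB) if item in trans]
    twub[item] = sum(tmw[y] for y in rows) / ttmw  # upper bound on support
    wsup[item] = weights[item] * len(rows) / ttmw  # actual weighted support

WFUB1 = {i for i in weights if twub[i] >= LAMBDA}               # STEP 4(a)
WF1 = {i for i in weights if wsup[i] >= LAMBDA}                 # STEP 4(b)

print(round(twub['A'], 4), round(wsup['A'], 4))  # 0.5571 0.2571
print('A' in WFUB1, 'A' in WF1)                  # True False
```

With this assumed data, item A is a weighted frequent upper-bound 1-itemset but not a weighted frequent 1-itemset, and WFUB1 works out to {A, B, C, D, E, H}, in line with the six upper-bound 1-itemsets listed in STEP 4.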

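The filtering strategy of Section 3.3.2 can also be sketched in code. The sketch below is a simplified illustration rather than the full algorithm: the dictionary wfub2 records, for each item, which later items form a weighted frequent upper-bound 2-itemset with it, the data reproduce the {ABCDE} example from Section 3.3.2, and filter_projection is a hypothetical helper name.

```python
# Filtering strategy (Section 3.3.2): when projecting a transaction on a
# prefix X', keep an item i located after X' only if the pair formed by
# the last item of X' and i is a weighted frequent upper-bound 2-itemset.
weights = {'A': 0.20, 'B': 0.30, 'C': 0.40, 'D': 0.60, 'E': 0.50}
wfub2 = {
    'A': {'B', 'C', 'D', 'E'},  # WFUB2,{A} = {AB, AC, AD, AE}
    'B': {'C'},                 # WFUB2,{B} = {BC}
    'C': {'D'},                 # WFUB2,{C} = {CD}
    'D': {'E'},                 # WFUB2,{D} = {DE}
}

def filter_projection(prefix, trans):
    """Project trans on prefix, dropping items that cannot extend it."""
    last = prefix[-1]                       # last item of the prefix
    after = trans[trans.index(last) + 1:]   # items located after the prefix
    kept = [i for i in after if i in wfub2.get(last, set())]
    projected = list(prefix) + kept
    tmw = max(weights[i] for i in projected)  # new transaction maximum weight
    return projected, tmw

projected, tmw = filter_projection(['A', 'B'], ['A', 'B', 'C', 'D', 'E'])
print(projected, tmw)  # ['A', 'B', 'C'] 0.4
```

Items D and E are filtered out because {BD} and {BE} are not weighted frequent upper-bound 2-itemsets, matching the projected transaction {ABC} with transaction maximum weight 0.40 derived in the text.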