Mining High Average-Utility Itemsets


National University of Kaohsiung, Department of Electrical Engineering (Graduate Institute)
Master's Thesis

Mining High Average-Utility Itemsets

Student: Cho-Han Lee
Advisor: Dr. Tzung-Pei Hong

July 2009

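Chinese Abstract

Mining High Average-Utility Itemsets

Advisor: Dr. Tzung-Pei Hong, Institute of Computer Science and Information Engineering, National University of Kaohsiung
Student: Cho-Han Lee, Institute of Electrical Engineering, National University of Kaohsiung

Utility mining is an extension of frequent-itemset mining that takes into account price, profit, or other measures chosen by the user. Traditionally, the utility of an itemset is the sum of its utilities over all transactions, without regard to the length of the itemset. This thesis therefore proposes the average-utility measure, which reveals a better utility effect than the original utility measure. It also proposes an algorithm for efficiently finding high average-utility itemsets. The method uses upper-bound values to overestimate the actual average utilities of itemsets, so that the downward-closure property is satisfied and the number of candidate itemsets is reduced. Under the same threshold, fewer high average-utility itemsets than high utility itemsets are mined. Mining with the average-utility measure can therefore be executed under a higher threshold, and hence under a more relevant and significant criterion, than the original approach.

The second part of the thesis discusses the maintenance of high average-utility itemsets. Two incremental average-utility mining algorithms are proposed to handle the insertion and deletion of data. Both algorithms are based on the concept of the FUP algorithm and use the previously mined high average-utility itemsets for maintenance, thereby speeding up the whole mining process. Experimental results show that the proposed batch mining algorithm finds high average-utility itemsets efficiently and that the proposed incremental average-utility mining algorithms handle database updates effectively.

Keywords: average utility, utility mining, downward closure, two-phase mining, incremental mining.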

Mining High Average-Utility Itemsets

Advisor: Dr. Tzung-Pei Hong, Institute of Computer Science and Information Engineering, National University of Kaohsiung
Student: Cho-Han Lee, Institute of Electrical Engineering, National University of Kaohsiung

ABSTRACT

Utility mining is an extension of frequent-itemset mining that considers cost, profit or other measures of user preference. Traditionally, the utility of an itemset is the summation of the utilities of the itemset in all the transactions, regardless of its length. The average-utility measure is thus proposed in this thesis to reveal a better utility effect of combining several items than the original utility measure. A mining algorithm is proposed to efficiently find the high average-utility itemsets. It uses upper-bound values to overestimate the actual average utilities of the itemsets so that the “downward closure” property is satisfied and the number of candidate itemsets decreases. Under the same threshold, fewer high average-utility itemsets than high utility itemsets are mined. Mining with the average-utility measure can thus be executed under a larger threshold than mining with the original measure, and hence under a more significant and relevant criterion.

In the second part of the thesis, we discuss the maintenance of the high average-utility itemsets. Two incremental average-utility mining algorithms are proposed for record insertion and deletion, respectively. The proposed algorithms are based on the concept of the FUP algorithm and utilize previously discovered high average-utility itemsets in the maintenance process, thus speeding up the mining process. Experimental results show that the proposed batch mining algorithm finds high average-utility itemsets efficiently and that the incremental average-utility mining algorithms handle database updates effectively.

Keywords: average utility, utility mining, downward closure, two-phase mining, incremental mining.

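Acknowledgements

For the successful completion of this thesis, I would first like to thank my advisor, Dr. Tzung-Pei Hong. Professor Hong always offered many suggestions and much room for thought on the direction and conception of the thesis. Despite his busy teaching and research schedule, he regularly took time to guide and revise my work, for which I am deeply grateful. During my master's studies he also taught me a great deal about dealing with people and about organized learning, from which I benefited greatly. I offer him my deepest and most heartfelt thanks.

I would also like to thank the members of my thesis committee, Professor Jeng-Shyang Pan (潘正祥), Professor Chang-Shing Lee (李健興) and Professor Chih-Hung Wu (吳志宏), for taking the time to attend my oral defense and for their valuable suggestions and guidance on the content of the thesis, which made it more complete.

In addition, I thank the professors who guided my research over these two years, including Professor Wen-Yang Lin (林文揚) and Professor Shyue-Liang Wang (王學亮). I also thank my seniors in the laboratory, 浚瑋, 國誠 and 韋体, as well as 志偉, 博正, 正男, 欣怡, 升馨, 偉屏 and 廷一, for their mutual help in coursework and daily life, which added much color to my master's studies. Special thanks go to my senior 國誠 for his considerable help and advice on the thesis.

Finally, I thank my parents and my family for their support, encouragement and selfless devotion, which allowed me to concentrate on my studies and complete my master's degree.

Cho-Han Lee
Knowledge Engineering and Artificial Intelligence Laboratory
2009.07.29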

Index

Chinese Abstract
English Abstract
Acknowledgements
Index
List of Figures
List of Tables
CHAPTER 1 Introduction
    1.1 Motivation
    1.2 Contributions
    1.3 Reader’s Guide
CHAPTER 2 Review of Related Works
    2.1 Apriori Principle
    2.2 Utility Mining
    2.3 Some Utility Mining Methods
    2.4 The FUP Algorithm
CHAPTER 3 Mining High Average-Utility Itemsets
    3.1 The Proposed Algorithm for Mining High Average-Utility Itemsets
    3.2 An Example
CHAPTER 4 Incremental Utility Mining Algorithm for Record Insertion
    4.1 Notation
    4.2 The Proposed Incremental Utility Mining Algorithm for Record Insertion
    4.3 An Example
CHAPTER 5 Incremental Utility Mining Algorithm for Record Deletion
    5.1 Notation
    5.2 The Proposed Incremental Utility Mining Algorithm for Record Deletion
    5.3 An Example
CHAPTER 6 Experimental Results
    6.1 Experimental Results of the Proposed Algorithm for Mining High Average-Utility Itemsets
    6.2 Experimental Results of Incremental Utility Mining Algorithms
CHAPTER 7 Discussion and Conclusion
References

List of Figures

Figure 2-1: Four cases arising from adding new transactions to an existing database.
Figure 2-2: Four cases arising from deleting transactions from an existing database.
Figure 6-1: Numbers of candidate itemsets along with different minimum utility thresholds for the two approaches.
Figure 6-2: Execution time along with different minimum utility thresholds for the two approaches.
Figure 6-3: The execution time of ITPAU vs. TPAU on different numbers of transactions (threshold=0.01%).
Figure 6-4: The execution time of ITPAU vs. TPAU on different numbers of transactions (threshold=0.05%).
Figure 6-5: The execution time of ITPAU vs. TPAU on different numbers of transactions (threshold=0.09%).

List of Tables

Table 2-1: Four cases and their results in FUP for record insertion.
Table 2-2: Four cases and their results in FUP for record deletion.
Table 3-1: A transaction as the example.
Table 3-2: The predefined profit values of the items.
Table 3-3: Five transactions as an example.
Table 3-4: The set of ten transaction data for this example.
Table 3-5: The predefined profit values of the items.
Table 3-6: The utility values of all the items in each transaction.
Table 3-7: The maximal utility values in each transaction of all the given ten transactions.
Table 3-8: The average-utility upper bounds of 1-itemsets.
Table 3-9: The candidate average-utility 1-itemsets, $C_1$.
Table 3-10: The average-utility upper bounds of the 2-itemsets.
Table 3-11: The remaining candidate average-utility 2-itemsets, $C_2$.
Table 3-12: The average-utility upper bounds of the 3-itemsets.
Table 3-13: All the candidate average-utility itemsets in the example.
Table 3-14: The actual average-utility values of the candidate average-utility itemsets.
Table 3-15: High average-utility itemsets.
Table 4-1: The set of ten transaction data in the original database.
Table 4-2: The predefined profit values of the items.
Table 4-3: The average-utility upper bounds and the actual average-utility values of the high upper-bound average-utility itemsets from the original database.
Table 4-4: The three newly inserted transactions.
Table 4-5: The utility values of all the items in the newly inserted transactions.
Table 4-6: The maximal utility values in the newly inserted transactions.
Table 4-7: The average-utility upper bounds of the 1-itemsets in the new transactions.
Table 4-8: The set of high upper-bound average-utility 1-itemsets for the new transactions, $HU_1^N$.
Table 4-9: The set of all the updated high upper-bound average-utility 1-itemsets, $HU_1^U$.
Table 4-10: The average-utility upper bounds of 2-itemsets in the new transactions.
Table 4-11: The set of high upper-bound average-utility 2-itemsets for the new transactions, $HU_2^N$.
Table 4-12: The set of all the updated high upper-bound average-utility 2-itemsets, $HU_2^U$.
Table 4-13: The set of all the updated high upper-bound average-utility itemsets, $HU^U$.
Table 4-14: The actual average-utility value of each high upper-bound average-utility itemset s in $HU^U$ of the updated database appearing in the set of high upper-bound average-utility itemsets ($HU^D$) of the original database.
Table 4-15: All the high average-utility itemsets for the updated database, HU.
Table 5-1: The set of ten transaction data in the original database.
Table 5-2: The predefined profit values of the items.
Table 5-3: The average-utility upper bounds and the actual average-utility values of the high upper-bound average-utility itemsets from the original database.
Table 5-4: The three deleted transactions.
Table 5-5: The utility values of all the items in the deleted transactions.
Table 5-6: The maximal utility values in the deleted transactions.
Table 5-7: The average-utility upper bounds of the 1-itemsets in the deleted transactions.
Table 5-8: The set of high upper-bound average-utility 1-itemsets for the deleted transactions, $HU_1^R$.
Table 5-9: The set of all the updated high upper-bound average-utility 1-itemsets, $HU_1^U$.
Table 5-10: The average-utility upper bounds of 2-itemsets in the deleted transactions.
Table 5-11: The set of high upper-bound average-utility 2-itemsets for the deleted transactions, $HU_2^R$.
Table 5-12: The set of all the updated high upper-bound average-utility 2-itemsets, $HU_2^U$.
Table 5-13: The set of all the updated high upper-bound average-utility itemsets, $HU^U$.
Table 5-14: The actual average-utility value of each high upper-bound average-utility itemset s in $HU^U$ of the updated database appearing in the set of high upper-bound average-utility itemsets ($HU^D$) of the original database.
Table 5-15: All the high average-utility itemsets for the updated database, HU.
Table 6-1: Comparison of the numbers of candidate itemsets (CI), high average-utility itemsets (HAUI) and high utility itemsets (HUI) of the two approaches.

CHAPTER 1 Introduction

1.1 Motivation

Mining frequent itemsets from a transaction database is a fundamental task in knowledge discovery, underlying association rules [1], sequential patterns [2-4] and classification [5, 6]. Numerous methods have been proposed to discover frequent itemsets. The two best-known kinds are level-wise algorithms [1, 7-10] and pattern-growth methods [11-15]. These approaches, however, only consider whether an item was bought in a transaction or not. Frequent itemsets thus reveal only how often itemsets occur and do not reflect other factors such as price or profit. In some situations, frequent itemsets may contribute only a small portion of the overall profit, while non-frequent ones may contribute a large portion [16]. For example, diamonds may be sold less frequently than clothing in a department store, but the former yield a much higher profit per unit sold than the latter. Frequency alone is thus not sufficient to identify the items that are highly profitable or have other potential effects.

Utility mining [17] was proposed to partially solve the above problem. It may be thought of as an extension of frequent-itemset mining in which the sold quantities and the item profits are considered. Utility is used to estimate how “useful” an itemset is. An itemset is said to be useful to a user if it satisfies the utility constraint; that is, the utility of the itemset must be larger than a threshold defined by the user. In practice, the utility value of an itemset can be measured in terms of cost, profit or other measures of user preference. For example, one user may be interested in finding the itemsets with good profits, while another may focus on the itemsets that cause little pollution during manufacturing.

In utility mining, local transaction utility and external utility are used to measure the utility of an item. The local transaction utility of an item is the information stored in the transaction dataset, such as the quantity of the item sold in the transaction. The external utility of an item is defined by the user’s preference, such as the profit. External utility thus reflects the user’s preference and can be represented by a utility table or a utility function. By combining a transaction dataset with a utility table, the discovered itemsets match a user’s expectation better than those found by considering the transaction dataset alone [16].

In frequent-itemset mining, the Apriori-like strategy [18] is often adopted to search for frequent itemsets level by level. The basic principle of the Apriori-like strategy is the “downward closure” property (anti-monotone property). That is, any superset of a non-frequent itemset is also non-frequent. This anti-monotone property is used to reduce the search space by pruning the non-frequent itemsets early. The “downward closure” principle cannot, however, be directly applied to discover high utility itemsets. Without the “downward closure” property, the number of candidate itemsets generated at each level approaches the number of all combinations of the items, and the computation time becomes intolerable.

In this thesis, we propose a new way to evaluate the utilities of itemsets. Traditionally, the utility of an itemset is the summation of the utilities of the itemset in all the transactions, regardless of its length. The utility of an itemset in a transaction thus increases with its length; that is, longer itemsets in a transaction result in higher utility values. Using the same minimum threshold to judge itemsets of different lengths is therefore not fair. To alleviate the effect of itemset length and identify genuinely good itemsets, the average-utility measure is applied to reveal a better utility effect of combining several items than the original utility measure. It is defined as the total utility of an itemset divided by the number of items in it. The average utility of an itemset is then compared with a threshold to decide whether it is a high average-utility itemset.

An algorithm is also proposed to find all the high average-utility itemsets. Like two-phase mining for high utility itemsets, the proposed algorithm uses average-utility upper bounds to overestimate the actual average utilities of itemsets so that the downward closure property is satisfied. The average-utility upper bound of an itemset is defined here as the summation, over the transactions that include the itemset, of the maximal item utility in each such transaction. Only combinations of itemsets whose average-utility upper bounds exceed the user-defined threshold are added to the candidate set in a level-wise way. The downward closure property is maintained in this way: any subset of an itemset with a high average-utility upper bound must also have a high average-utility upper bound. The size of the candidate set is therefore substantially reduced during the level-wise search. Only one database scan is needed to filter out the promising candidates. After that, a second database scan is performed to find the actual average utility of each candidate and decide whether it is a desired itemset.

Most of the above approaches assume that the database is static and focus on batch mining. In real-world applications, however, the database changes as new records (or transactions) are added. When new records are added to the database, some of the originally frequent or high utility itemsets may become invalid, and some new frequent or high utility itemsets may appear in the whole updated database [19-22]. In this situation, conventional batch mining algorithms must re-process the whole updated database to find the frequent or high utility itemsets. Re-processing the whole updated database has the following two disadvantages:

(a) The whole updated database must be mined again. If the original database is large, much computation time is wasted.
(b) The information previously mined from the original database provides no help in the incremental mining process.
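As an illustration of the average-utility measure and its upper bound defined earlier in this section, the following minimal Python sketch computes both for a toy database. The function names, the dictionary-based transaction layout, and the data in `DB` and `PROFIT` are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
from typing import Dict, FrozenSet, List

Transaction = Dict[str, int]  # item -> quantity sold (assumed layout)

# Hypothetical toy data, not from the thesis.
PROFIT: Dict[str, int] = {"A": 3, "B": 10, "C": 1}
DB: List[Transaction] = [
    {"A": 1, "B": 1, "C": 4},
    {"A": 2, "C": 1},
    {"B": 3},
]


def contains(txn: Transaction, itemset: FrozenSet[str]) -> bool:
    """True if every item of the itemset was sold in the transaction."""
    return all(txn.get(item, 0) > 0 for item in itemset)


def average_utility(itemset: FrozenSet[str], db: List[Transaction],
                    profit: Dict[str, int]) -> float:
    """Total utility of the itemset over all transactions that contain it,
    divided by the number of items in the itemset."""
    total = sum(
        txn[item] * profit[item]
        for txn in db if contains(txn, itemset)
        for item in itemset
    )
    return total / len(itemset)


def average_utility_upper_bound(itemset: FrozenSet[str], db: List[Transaction],
                                profit: Dict[str, int]) -> int:
    """Sum, over the transactions containing the itemset, of the maximal
    item utility in each such transaction."""
    return sum(
        max(qty * profit[item] for item, qty in txn.items())
        for txn in db if contains(txn, itemset)
    )


AB = frozenset("AB")
print(average_utility(AB, DB, PROFIT))              # 6.5
print(average_utility_upper_bound(AB, DB, PROFIT))  # 10
```

In this toy data the bound of {A} is 16 and the bound of {AB} is 10, so the bound does not increase when an item is added, while it still overestimates the actual average utility of 6.5. That is the behaviour the level-wise pruning relies on.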

To avoid these shortcomings, the FUP (Fast UPdate) algorithm [19] was proposed to incrementally maintain the mined results for traditional association rules. The FUP algorithm is based on the Apriori mining algorithm [18] and adopts the pruning techniques of the DHP (Direct Hashing and Pruning) algorithm [23]. It first calculates the large itemsets in the newly inserted transactions and compares them with the previous large itemsets from the original database. According to the comparison results, FUP determines whether the original database needs to be re-scanned, thus saving time in maintaining association rules.

In this thesis, we also propose two algorithms to maintain the discovered high average-utility itemsets when records are inserted into or deleted from the database. These algorithms adopt the Apriori strategy to search for high utility itemsets level by level. The “downward closure” property of itemsets, which is the basic principle of the Apriori-like strategy, is used to reduce the search space by pruning low utility itemsets early. With the “downward closure” property, the number of candidate itemsets generated at each level is greatly reduced. In addition, to handle newly inserted and deleted records, the concept of the FUP algorithm is adopted to reduce the time needed to re-process the whole updated database. Finally, the performance of the proposed mining algorithms is verified on a real database.

1.2 Contributions

The main contributions of the thesis can be divided into the following three parts.

(1) We propose the average-utility measure to reveal a better utility effect of combining several items than the original utility measure.
(2) We propose a two-phase average-utility mining algorithm to efficiently discover high average-utility itemsets. The number of candidate itemsets generated by the proposed algorithm is greatly reduced compared with the traditional approach, and thus much computation time is saved.
(3) We propose two incremental utility mining algorithms, one for record insertion and one for record deletion. Both are effective in maintaining the discovered high average-utility itemsets when records are inserted or deleted.

1.3 Reader’s Guide

The remaining parts of this thesis are organized as follows. In Chapter 2, we review related research, including the Apriori principle, utility mining, some utility mining methods and the FUP algorithm. The definition and the meaning of high average-utility itemsets are given in Chapter 3, together with the algorithm for mining them and an example that illustrates it. Two incremental utility mining algorithms for record insertion and deletion are proposed in Chapters 4 and 5, respectively, each with an illustrative example. The experimental results are presented in Chapter 6. Finally, discussion and conclusions are given in Chapter 7.

CHAPTER 2 Review of Related Works

In this chapter, we review research related to this thesis. Section 2.1 describes the Apriori principle. Section 2.2 introduces the general concept of utility mining. Section 2.3 reviews some utility-mining methods. Section 2.4 describes the FUP algorithm, which is used for incrementally maintaining association rules.

2.1 Apriori Principle

Agrawal and Srikant proposed the Apriori algorithm [18] to mine association rules from a set of transactions in a level-wise way. In each pass, Apriori employs the downward-closure (anti-monotone) property to prune impossible candidates, thus improving the efficiency of identifying frequent itemsets. This property states that each subset of a frequent itemset must be frequent and each superset of an infrequent itemset must be infrequent. With this property, the number of itemsets to be checked decreases remarkably.
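The level-wise search with downward-closure pruning can be sketched as follows. This is a minimal illustration rather than the original Apriori implementation: the function name `apriori`, the absolute count threshold `min_count`, and the toy transactions are assumptions introduced here.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def apriori(transactions: List[Set[str]],
            min_count: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least `min_count` transactions."""
    items = sorted({item for txn in transactions for item in txn})
    level = [frozenset([item]) for item in items]  # candidate 1-itemsets
    frequent: Dict[FrozenSet[str], int] = {}
    k = 1
    while level:
        # Count the support of each candidate at the current level.
        counts = {c: sum(1 for txn in transactions if c <= txn) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (downward closure): keep a candidate only if all of its
        # (k-1)-subsets are frequent.
        level = [c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent


# Hypothetical toy transactions, not from the thesis.
TXNS = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
FREQUENT = apriori(TXNS, min_count=3)
print(len(FREQUENT))  # 6: the three single items and the three pairs
```

Here {A, B, C} is generated as a candidate because all of its 2-subsets are frequent, but it is rejected after counting since it occurs in only two transactions.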

2.2 Utility Mining

Utility mining [17], an extension of frequent-itemset mining, is based on the measurement of local transaction utility and external utility. Given a transaction database, a utility table and a minimum utility threshold, the goal of utility mining is to discover the itemsets whose utility values are larger than the minimum utility threshold. The utility of an item in a transaction is defined as the product of its quantity (local transaction utility) and its profit (external utility). The utility of an itemset in a transaction is thus the sum of the utilities of all the items of the itemset in the transaction. If the sum of the utilities of an itemset in all the transactions is larger than the predefined utility threshold, the itemset is called a high utility itemset. In utility mining, the downward-closure property no longer holds, since the utility of an itemset grows monotonically while its frequency decreases monotonically as the number of items in the itemset increases. These two opposing monotonic trends invalidate the downward-closure property in utility mining.

2.3 Some Utility Mining Methods

Several mining approaches have been proposed for finding high utility itemsets. For example, Barber and Hamilton proposed Zero pruning (ZP) and Zero subset pruning (ZSP) to exhaustively search for all high utility itemsets in a database [24, 25]. These approaches generate all itemsets as candidates except those whose local measure values (utilities) are exactly zero. Although ZP and ZSP can discover all high utility itemsets in a database, their computation cost is very high. Li et al. proposed the FSM, ShFSM and DCG methods [26-28] to discover all high utility itemsets by taking advantage of the level-closure property. These methods rely on the critical function of each candidate to remove useless candidates. Yao then proposed a framework for mining high utility itemsets based on mathematical properties of utility constraints. Two pruning strategies, based on utility upper bounds and expected utility upper bounds respectively, were adopted to reduce the search space. These pruning strategies were incorporated into the mining approach Umining and its heuristic successor, Umining_H [29].

Liu et al. then presented a two-phase algorithm for quickly discovering all high utility itemsets [16, 30]. In the first phase, the transaction utility is used as an effective upper bound for each candidate itemset in the transaction, so that the “transaction-weighted downward closure property” holds in the search space and the number of candidate itemsets decreases. In the second phase, an additional database scan is performed to find the real utility values of the remaining candidates and identify the high utility itemsets.
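The utility definition of Section 2.2 and the two phases described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm of Liu et al. as published: the names `twu` and `two_phase`, the dictionary-based transaction layout, the toy data, and the use of `>=` for the threshold comparison (the text says "larger than", so the treatment of ties is an assumption) are all introduced here.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Transaction = Dict[str, int]  # item -> quantity sold (assumed layout)


def contains(txn: Transaction, itemset: FrozenSet[str]) -> bool:
    return all(txn.get(item, 0) > 0 for item in itemset)


def utility(itemset, db, profit) -> int:
    """Sum of quantity * profit over the itemset's items, in every
    transaction that contains the whole itemset."""
    return sum(txn[i] * profit[i]
               for txn in db if contains(txn, itemset) for i in itemset)


def twu(itemset, db, profit) -> int:
    """Transaction-weighted utility: total utility of the transactions that
    contain the itemset. It overestimates the itemset's real utility."""
    return sum(sum(q * profit[i] for i, q in txn.items())
               for txn in db if contains(txn, itemset))


def two_phase(db: List[Transaction], profit: Dict[str, int],
              min_utility: int) -> Dict[FrozenSet[str], int]:
    # Phase 1: level-wise search pruned by the transaction-weighted bound.
    items = sorted({i for txn in db for i, q in txn.items() if q > 0})
    level = [frozenset([i]) for i in items]
    candidates: List[FrozenSet[str]] = []
    k = 1
    while level:
        kept = {c for c in level if twu(c, db, profit) >= min_utility}
        candidates.extend(kept)
        k += 1
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        level = [c for c in joined
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    # Phase 2: compute real utilities of the surviving candidates.
    high: Dict[FrozenSet[str], int] = {}
    for c in candidates:
        u = utility(c, db, profit)
        if u >= min_utility:
            high[c] = u
    return high


# Hypothetical toy data, not from the thesis.
PROFIT = {"A": 3, "B": 10, "C": 1}
DB = [{"A": 1, "B": 1, "C": 4}, {"A": 2, "C": 1}, {"B": 3}]
HIGH = two_phase(DB, PROFIT, min_utility=20)
print(HIGH)  # {frozenset({'B'}): 40}
```

With this toy data, phase 1 keeps {A}, {B}, {C} and {A, C} as candidates because their transaction-weighted utilities reach 20, and phase 2 shows that only {B} actually has a utility of at least 20.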

2.4 The FUP Algorithm

Generally, the following four cases (illustrated in Figure 2-1) may arise when considering an original database and newly inserted transactions:

Case 1: An itemset is large (frequent) in the original database and in the newly inserted transactions;
Case 2: An itemset is large in the original database, but is not large (small) in the newly inserted transactions;
Case 3: An itemset is not large in the original database, but is large in the newly inserted transactions;
Case 4: An itemset is not large in the original database and in the newly inserted transactions.

                                      Inserted records
                               Large itemset    Small itemset
Original     Large itemset     Case 1           Case 2
database     Small itemset     Case 3           Case 4

Figure 2-1: Four cases arising from adding new transactions to an existing database.

Cases 1 and 4 will not affect the existing rules (results). Case 2 may remove existing rules (results), and Case 3 may add new rules (results). Cheung et al. proposed the FUP algorithm [19] to incrementally maintain the mined results of association rules when new transactions are inserted. In FUP, the large itemsets and their counts from preceding runs are recorded for later use in maintenance. When new transactions are added, FUP first scans them to generate candidate 1-itemsets and then compares these itemsets with the previous ones to decide which case each itemset belongs to. The corresponding process is then executed for each of the four cases as follows.

Case 1: An itemset was large in the original database and in the newly inserted transactions. In this case, the itemset is always large in the updated database, and the only thing to do is to re-calculate the counts of the itemset in the updated database.
Case 2: An itemset was large in the original database, but was not large in the newly inserted transactions. In this case, the counts of the itemset are re-calculated and the itemset is then checked against the minsup to decide whether it is a large itemset.
Case 3: An itemset was not large in the original database, but was large in the newly inserted transactions. In this case, a database rescan is needed to determine the counts of the itemset in the original database. The counts of the itemset are then re-calculated and the itemset is checked against the minsup to decide whether it is a large itemset.
Case 4: An itemset was not large in the original database and in the newly inserted transactions. The itemset is always small in the updated database, and nothing has to be done for this case.

The four cases and their results in FUP are summarized in Table 2-1.
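The case analysis for insertion can be sketched for one level of candidates as follows. This is a simplified illustration, not the FUP implementation of Cheung et al.: the function name `fup_insert_level`, its parameters, the relative threshold `minsup`, and the toy data are assumptions, and the DHP-style pruning is omitted.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def fup_insert_level(old_large: Dict[FrozenSet[str], int],
                     original_db: List[Set[str]],
                     new_txns: List[Set[str]],
                     candidates: Iterable[FrozenSet[str]],
                     minsup: float) -> Dict[FrozenSet[str], int]:
    """Return the large itemsets (with counts) of the updated database for
    one level, re-scanning the original database only in Case 3."""
    d, n = len(original_db), len(new_txns)
    updated: Dict[FrozenSet[str], int] = {}
    for c in candidates:
        new_count = sum(1 for txn in new_txns if c <= txn)
        large_in_new = new_count >= minsup * n
        if c in old_large:
            # Cases 1 and 2: the old count is already known.
            total = old_large[c] + new_count
        elif large_in_new:
            # Case 3: the old count is unknown, so rescan the original data.
            total = sum(1 for txn in original_db if c <= txn) + new_count
        else:
            # Case 4: small in both parts, hence small in the update.
            continue
        if total >= minsup * (d + n):
            updated[c] = total
    return updated


# Hypothetical toy data, not from the thesis.
ORIGINAL = [{"A", "B"}, {"A"}, {"A", "C"}, {"B", "D"}]
OLD_LARGE = {frozenset("A"): 3, frozenset("B"): 2}  # minsup = 0.5
NEW = [{"C"}, {"B", "C"}, {"C", "D"}, {"B"}]
CANDS = [frozenset(x) for x in "ABCD"]
UPDATED = fup_insert_level(OLD_LARGE, ORIGINAL, NEW, CANDS, minsup=0.5)
print(UPDATED)  # {frozenset({'B'}): 4, frozenset({'C'}): 4}
```

In this toy run A falls under Case 2 and is dropped, B under Case 1, C under Case 3 (the only itemset that triggers a rescan), and D under Case 4.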

Table 2-1: Four cases and their results in FUP for record insertion.

Cases: Original - New       Results
Case 1: Large - Large       Always large
Case 2: Large - Small       Decided from the existing information
Case 3: Small - Large       Decided by rescanning the original database
Case 4: Small - Small       Always small

After all the large 1-itemsets for the entire updated database are found, candidate 2-itemsets are generated from the newly inserted transactions and the same procedure is used to find all large 2-itemsets. This procedure is repeated until all large itemsets have been found.

Similarly, the following four cases (illustrated in Figure 2-2) may arise when considering an original database and deleted transactions:

Case 1: An itemset is large (frequent) in the original database and in the deleted transactions;
Case 2: An itemset is large in the original database, but is not large (small) in the deleted transactions;
Case 3: An itemset is not large in the original database, but is large in the deleted transactions;
Case 4: An itemset is not large in the original database and in the deleted transactions.

                                      Deleted records
                               Large itemset    Small itemset
Original     Large itemset     Case 1           Case 2
database     Small itemset     Case 3           Case 4

Figure 2-2: Four cases arising from deleting transactions from an existing database.

Cases 2 and 3 will not affect the existing rules (results). Cases 1 and 4 may add new rules (results) or remove existing rules (results). FUP first scans the deleted transactions to generate candidate 1-itemsets and then compares these itemsets with the previous ones to decide which case each itemset belongs to. The corresponding process is then executed for each of the four cases as follows.

Case 1: An itemset was large in the original database and in the deleted transactions. In this case, the counts of the itemset are re-calculated and the itemset is then checked against the minsup to decide whether it is a large itemset.

Case 2: An itemset was large in the original database, but was not large in the deleted transactions. In this case, the itemset was always large in the updated database, and the only thing to do was to re-calculate the counts of the itemset in the updated database.

Case 3: An itemset was not large in the original database, but was large in the deleted transactions. The itemset was always small in the updated database. Nothing had to be done for this case.

Case 4: An itemset was not large in the original database and in the deleted transactions. In this case, a database rescan was needed to determine the counts of the itemset in the original database. The counts of the itemset were then re-calculated and the itemset was checked against the minsup to decide whether it was a large itemset.

The four cases and their results in FUP are summarized in Table 2-2.

Table 2-2: Four cases and their results in FUP for record deletion.

| Case | Original - Deleted | Result |
|---|---|---|
| Case 1 | Large - Large | Decided from the existing information |
| Case 2 | Large - Small | Always large |
| Case 3 | Small - Large | Always small |
| Case 4 | Small - Small | Decided by rescanning the original database |

After all the large 1-itemsets for the entire updated database were found, candidate 2-itemsets from the deleted transactions were generated and the same procedure was used to find all large 2-itemsets. This procedure was repeated until all large itemsets had been found.
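The case analysis for record deletion can be illustrated with a minimal Python sketch. It is not taken from the FUP papers: the function names `fup_deletion_case` and `still_large`, the handling strings, and the sample counts are all assumptions made for illustration. The sketch maps an itemset's status to the four cases of Table 2-2 and re-checks its support on the updated database of d - t transactions, where t is the number of deleted transactions.

```python
def fup_deletion_case(large_in_original: bool, large_in_deleted: bool):
    """Map an itemset's status to its case and handling in Table 2-2.

    The handling strings are illustrative labels, not FUP terminology.
    """
    if large_in_original and large_in_deleted:
        return 1, "recount and compare with minsup"
    if large_in_original:
        return 2, "always large"
    if large_in_deleted:
        return 3, "always small"
    return 4, "rescan original database, then compare with minsup"


def still_large(count_original: int, count_deleted: int,
                d: int, t: int, minsup: float) -> bool:
    """Support check on the updated database of d - t transactions."""
    return (count_original - count_deleted) >= minsup * (d - t)


# Hypothetical numbers: d = 10 original and t = 2 deleted transactions, minsup = 50%.
print(fup_deletion_case(True, False))    # Case 2
print(still_large(6, 0, 10, 2, 0.5))     # count 6 stays above 0.5 * 8
print(still_large(5, 2, 10, 2, 0.5))     # a Case 1 itemset that drops out
```

With these hypothetical counts, the Case 2 itemset stays large without any rescan, while the Case 1 itemset has to be recounted and ends up below the minimum support.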

CHAPTER 3 Mining High Average-Utility Itemsets

In this thesis, we aim to find high average-utility itemsets instead of traditional high utility itemsets. The measure is reasonable and can effectively reduce the number of candidates. The average utility of an itemset is first motivated and defined below.

Traditionally, the utility of an itemset is the summation of the utilities of the itemset in all the transactions regardless of its length. Thus, the utility of an itemset in a transaction increases along with its length. That is, longer itemsets in a transaction result in higher utility values. For example, assume a transaction is given as shown in Table 3-1. There are five items in the transaction, denoted A to E. The value attached to each item is the quantity sold in the transaction.

Table 3-1: A transaction as the example.

| TID | A | B | C | D | E |
|---|---|---|---|---|---|
| tx | 1 | 1 | 4 | 1 | 0 |

Assume the predefined profit of each item is given in Table 3-2. According to the two tables, the utility of the 1-itemset {A} in the transaction is 1*3 = 3. The utility of the 2-itemset {AB} in the transaction is 1*3 + 1*10 = 13. Similarly, the utility of the 3-itemset {ABC} is 1*3 + 1*10 + 4*1 = 17. Accordingly, the utility of the 3-itemset {ABC} is larger than that of the 2-itemset {AB}, which is in turn larger than that of the 1-itemset {A}. Longer itemsets result in higher utility values. This property is obvious, since a longer itemset includes more items than any of its proper subsets. The effect, however, blurs the judgment about whether an itemset is really better than its subsets.

Table 3-2: The predefined profit values of the items.

| Item | Profit |
|---|---|
| A | 3 |
| B | 10 |
| C | 1 |
| D | 6 |
| E | 5 |

Another example illustrates the idea. Assume there are five transactions and only two items, A and B, in the data set shown in Table 3-3. Assume that the two items are always sold in equal quantities when they are purchased and that their profits are the same as well. Thus, the utility values of the two items are the same in a transaction whenever they are purchased. Let the utility value of a purchased item in a transaction be X.

Table 3-3: Five transactions as an example.

| TID | A | B |
|---|---|---|
| t1 | X | 0 |
| t2 | 0 | X |
| t3 | 0 | X |
| t4 | X | X |
| t5 | X | X |

For the first transaction in Table 3-3, item A is purchased and its utility is thus X, while B is not purchased and its utility is thus 0. Over the five transactions, the support of A is 0.6 and its utility is 3X. The support of B is 0.8 and its utility is 4X. The support of the 2-itemset {AB} is only 0.4, yet its utility is 4X, which does not decrease along with its lower support value. Moreover, the utility (4X) of selling A and B together is not better than the total utility (7X) of selling A and B individually. The reason is that the length of the itemset {AB}, which is 2, is not considered when its utility is calculated.

The average-utility measure is thus adopted in this thesis to reveal a better utility effect of combining several items than the original utility measure. It is defined as the total utility of an itemset divided by the number of items in it. In this example, the utility of {AB} is divided by 2, which gives 2X. The average utility of an itemset is then compared with a threshold to decide whether it is a high average-utility itemset. As expected, the itemsets mined in the proposed way will be fewer than those mined in the original way under the same threshold. Our proposed approach can thus be executed under a larger threshold than the original, and thus with a more significant and relevant criterion. The approach for mining useful itemsets under the proposed criterion is stated below.

3.1 The Proposed Algorithm for Mining High Average-Utility Itemsets

In the proposed algorithm, the anti-monotone property is used to decrease, level by level, the number of itemsets to be scanned. There are two phases in the proposed algorithm. In phase 1, the average-utility upper bound is used to overestimate the average utility of each itemset. The average-utility upper bound is an overestimated utility value rather than the actual utility value, and it ensures the anti-monotone property. Thus, each subset of an itemset with a high average-utility upper bound must also have a high upper bound, and each superset of an itemset with a low average-utility upper bound must also have a low upper bound. Many itemsets with low average-utility upper bounds can thus be pruned level by level, which decreases the time spent scanning the database. In phase 2, the database only needs to be scanned once more to check whether the itemsets resulting from phase 1 are actually high.

The proposed algorithm first finds all the candidate average-utility 1-itemsets. The 1-itemsets whose average-utility upper bounds are larger than or equal to the minimum average-utility threshold are put in the set of candidate average-utility 1-itemsets, $C_1$. Candidate average-utility 2-itemsets, $C_2$, are formed from $C_1$. The proposed algorithm then checks all the candidate average-utility 2-itemsets in $C_2$ by comparing their average-utility upper bounds with the minimum average-utility threshold. The itemsets whose upper bounds are below the minimum average-utility threshold are removed from the candidate 2-itemsets. The same procedure is repeated until all the candidate itemsets have been found. The actual average-utility value of each candidate average-utility itemset is then calculated. If this value is larger than or equal to the minimum average-utility threshold, the itemset is put in the set of high average-utility itemsets, H. The details of the proposed mining algorithm are described below.

Two-phase algorithm for mining high average-utility itemsets:

INPUT:
1. A set of m items $I = \{i_1, i_2, \ldots, i_j, \ldots, i_m\}$, each $i_j$ with a profit value $p_j$, j = 1 to m;
2. A transaction database $D = \{T_1, T_2, \ldots, T_n\}$, in which each transaction includes a subset of items with quantities;
3. The minimum average-utility threshold $\lambda$.

OUTPUT: A set of high average-utility itemsets.

STEP 1: Calculate the utility value $u_{jk}$ of each item $i_j$ in each transaction $T_k$ as $u_{jk} = q_{jk} \times p_j$, where $q_{jk}$ is the quantity of $i_j$ in $T_k$, for j = 1 to m and k = 1 to n.

STEP 2: Find the maximal utility value $mu_k$ in each transaction $T_k$ as $mu_k = \max\{u_{1k}, u_{2k}, \ldots, u_{mk}\}$ for k = 1 to n.

STEP 3: Calculate the average-utility upper bound $ub_j$ of each item $i_j$ as the summation of the maximal utilities of the transactions which include $i_j$. That is:

$$ub_j = \sum_{i_j \in T_k} mu_k.$$

STEP 4: Check whether the average-utility upper bound of an item $i_j$ is larger than or equal to $\lambda$. If $i_j$ satisfies the above condition, put it in the set of candidate average-utility 1-itemsets, $C_1$. That is:

$$C_1 = \{ i_j \mid ub_j \ge \lambda,\ 1 \le j \le m \}.$$

STEP 5: Set r = 1, where r is used to represent the number of items in the current candidate average-utility itemsets to be processed.

STEP 6: Generate the candidate set $C_{r+1}$ from $C_r$ such that all the r-subitemsets of each candidate in $C_{r+1}$ are contained in $C_r$.

STEP 7: Calculate the average-utility upper bound $ub_s$ of each candidate average-utility (r+1)-itemset s as the summation of the maximal utilities of the transactions which include s. That is:

$$ub_s = \sum_{s \subseteq T_k} mu_k.$$

STEP 8: Check whether the average-utility upper bound of each candidate (r+1)-itemset s is larger than or equal to $\lambda$. If s does not satisfy the above condition, remove it from $C_{r+1}$. That is:

$$\text{New } C_{r+1} = \{ s \mid ub_s \ge \lambda,\ s \in \text{original } C_{r+1} \}.$$

STEP 9: If $C_{r+1}$ is null, do the next step; otherwise, set r = r + 1 and repeat STEPs 6 to 9.

STEP 10: For each candidate average-utility itemset s, calculate its actual average-utility value $au_s$ as follows:

$$au_s = \frac{\sum_{s \subseteq T_k} \sum_{i_j \in s} u_{jk}}{|s|},$$

where $u_{jk}$ is the utility value of each item $i_j$ in transaction $T_k$ and $|s|$ is the number of items in s.

STEP 11: Check whether the actual average-utility value $au_s$ of each candidate average-utility itemset s is larger than or equal to $\lambda$. If s satisfies the above condition, put it in the set of high average-utility itemsets, H. That is:

$$H = \{ s \mid au_s \ge \lambda,\ s \in C \},$$

where C is the set of all the candidate average-utility itemsets.

3.2 An Example

In this section, an example is given to demonstrate the proposed mining algorithm based on the average utility of items. It is a simple example showing how the proposed algorithm can be used to find the high average-utility itemsets from a set of transactions. Assume the ten transactions shown in Table 3-4 are used for mining. Each transaction consists of two features, the transaction identification (TID) and the items purchased.

Table 3-4: The set of ten transaction data for this example.

| TID | A | B | C | D | E |
|---|---|---|---|---|---|
| t1 | 1 | 1 | 4 | 1 | 0 |
| t2 | 0 | 1 | 0 | 3 | 0 |
| t3 | 2 | 0 | 0 | 1 | 0 |
| t4 | 0 | 0 | 1 | 0 | 0 |
| t5 | 1 | 2 | 0 | 1 | 3 |
| t6 | 1 | 1 | 1 | 1 | 1 |
| t7 | 0 | 2 | 3 | 0 | 1 |
| t8 | 0 | 0 | 0 | 1 | 2 |
| t9 | 7 | 0 | 1 | 1 | 0 |
| t10 | 0 | 1 | 1 | 1 | 1 |

Also assume that the predefined profit value for each single item is defined in Table 3-5.

Table 3-5: The predefined profit values of the items.

| Item | Profit |
|---|---|
| A | 3 |
| B | 10 |
| C | 1 |
| D | 6 |
| E | 5 |

Moreover, the minimum average-utility threshold $\lambda$ is set at 45.4, which is 20% of the total utility (227) of the ten transactions. In order to find the high average-utility itemsets from the data in Table 3-4, the proposed mining algorithm proceeds as follows.

STEP 1: The utility value of each item occurring in each transaction in Table 3-4 is calculated. Take item B in transaction 7 as an example. The quantity of item B in transaction 7 is 2, and its profit is 10. The utility value of B is thus calculated as 2*10, which is 20. The utility values of all the items in each transaction are shown in Table 3-6.

Table 3-6: The utility values of all the items in each transaction.

| TID | A | B | C | D | E |
|---|---|---|---|---|---|
| t1 | 3 | 10 | 4 | 6 | 0 |
| t2 | 0 | 10 | 0 | 18 | 0 |
| t3 | 6 | 0 | 0 | 6 | 0 |
| t4 | 0 | 0 | 1 | 0 | 0 |
| t5 | 3 | 20 | 0 | 6 | 15 |
| t6 | 3 | 10 | 1 | 6 | 5 |
| t7 | 0 | 20 | 3 | 0 | 5 |
| t8 | 0 | 0 | 0 | 6 | 10 |
| t9 | 21 | 0 | 1 | 6 | 0 |
| t10 | 0 | 10 | 1 | 6 | 5 |

STEP 2: The utility values of the items in each transaction are compared and the maximal utility value in the transaction is found. Take transaction 1 as an example. It can be observed from Table 3-6 that the utility value of B is 10, which is the maximum in transaction 1. The maximal utility value in each transaction is shown in Table 3-7.

Table 3-7: The maximal utility values in each transaction of all the given ten transactions.

| TID | A | B | C | D | E | Maximal Utility Value in Transaction |
|---|---|---|---|---|---|---|
| t1 | 3 | 10 | 4 | 6 | 0 | 10 |
| t2 | 0 | 10 | 0 | 18 | 0 | 18 |
| t3 | 6 | 0 | 0 | 6 | 0 | 6 |
| t4 | 0 | 0 | 1 | 0 | 0 | 1 |
| t5 | 3 | 20 | 0 | 6 | 15 | 20 |
| t6 | 3 | 10 | 1 | 6 | 5 | 10 |
| t7 | 0 | 20 | 3 | 0 | 5 | 20 |
| t8 | 0 | 0 | 0 | 6 | 10 | 10 |
| t9 | 21 | 0 | 1 | 6 | 0 | 21 |
| t10 | 0 | 10 | 1 | 6 | 5 | 10 |

STEP 3: The average-utility upper bound of each 1-itemset is calculated. Take item A as an example. It appears in transactions 1, 3, 5, 6 and 9. The average-utility upper bound of A is thus the sum of the maximal utility values of these transactions. It is calculated as 10 + 6 + 20 + 10 + 21, which is 67. The upper-bound values of all the items are shown in Table 3-8.

Table 3-8: The average-utility upper bounds of 1-itemsets.

| Candidate Itemset | Average-Utility Upper Bound |
|---|---|
| A | 67 |
| B | 88 |
| C | 72 |
| D | 105 |
| E | 70 |

STEP 4: The average-utility upper bound of each 1-itemset is checked against the user-defined minimum average-utility threshold $\lambda$, which is 45.4. In this example, the average-utility upper bounds of all five 1-itemsets exceed $\lambda$. All the items are therefore recorded as candidate average-utility 1-itemsets, $C_1$, shown in Table 3-9.

Table 3-9: The candidate average-utility 1-itemsets, $C_1$.

| Candidate 1-Itemset | Average-Utility Upper Bound |
|---|---|
| A | 67 |
| B | 88 |
| C | 72 |
| D | 105 |
| E | 70 |
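STEPs 1 to 4 of the example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the variable names (`QUANTITIES`, `utilities`, `max_utility`, `upper_bound`, `c1`) are chosen here and do not appear in the thesis, and the data are copied from Tables 3-4 and 3-5.

```python
ITEMS = "ABCDE"
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}    # Table 3-5
QUANTITIES = [                                        # Table 3-4, rows t1..t10
    (1, 1, 4, 1, 0), (0, 1, 0, 3, 0), (2, 0, 0, 1, 0), (0, 0, 1, 0, 0),
    (1, 2, 0, 1, 3), (1, 1, 1, 1, 1), (0, 2, 3, 0, 1), (0, 0, 0, 1, 2),
    (7, 0, 1, 1, 0), (0, 1, 1, 1, 1),
]
THRESHOLD = 45.4                                      # minimum average-utility threshold

# STEP 1: utility of each purchased item in each transaction (Table 3-6).
utilities = [
    {item: q * PROFIT[item] for item, q in zip(ITEMS, row) if q > 0}
    for row in QUANTITIES
]

# STEP 2: maximal utility value in each transaction (Table 3-7).
max_utility = [max(u.values()) for u in utilities]

# STEP 3: upper bound of an item = sum of the maximal utilities of the
# transactions that contain it (Table 3-8).
upper_bound = {
    item: sum(mu for u, mu in zip(utilities, max_utility) if item in u)
    for item in ITEMS
}

# STEP 4: keep the items whose upper bound reaches the threshold (Table 3-9).
c1 = [item for item in ITEMS if upper_bound[item] >= THRESHOLD]

print(upper_bound)
print(c1)
```

Storing only the purchased items of each transaction keeps the membership test `item in u` equivalent to "the transaction includes the item", which is what the summation in STEP 3 requires.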

STEP 5: The variable r is set at 1, where r is used to represent the number of items in the current candidate average-utility itemsets to be processed.

STEP 6: The candidate average-utility 2-itemsets ($C_2$) are then generated from $C_1$. They are {AB}, {AC}, {AD}, {AE}, {BC}, {BD}, {BE}, {CD}, {CE} and {DE}.

STEP 7: The average-utility upper bound of each 2-itemset is calculated. Take the itemset {AB} as an example. It appears in transactions 1, 5 and 6. The average-utility upper bound of {AB} is thus the sum of the maximal utility values of these transactions, 10 + 20 + 10, which is 40. The upper-bound values of all the 2-itemsets are shown in Table 3-10.

Table 3-10: The average-utility upper bounds of the 2-itemsets.

| Candidate 2-Itemset | Average-Utility Upper Bound |
|---|---|
| AB | 40 |
| AC | 41 |
| AD | 67 |
| AE | 30 |
| BC | 50 |
| BD | 68 |
| BE | 60 |
| CD | 51 |
| CE | 40 |
| DE | 50 |
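The upper bounds in Table 3-10 can be checked with the following Python sketch. The compact `TRANSACTIONS` list (the items present in each transaction paired with its maximal utility, read off Table 3-7) and the helper name `upper_bound` are assumptions introduced for this illustration. The last line applies the threshold of 45.4 to show which 2-itemsets fall below it.

```python
from itertools import combinations

# (items present, maximal utility) for t1..t10, read off Table 3-7.
TRANSACTIONS = [
    ("ABCD", 10), ("BD", 18), ("AD", 6), ("C", 1), ("ABDE", 20),
    ("ABCDE", 10), ("BCE", 20), ("DE", 10), ("ACD", 21), ("BCDE", 10),
]
THRESHOLD = 45.4


def upper_bound(itemset):
    """Sum the maximal utilities of the transactions that include the itemset."""
    return sum(mu for items, mu in TRANSACTIONS if set(itemset) <= set(items))


c1 = ["A", "B", "C", "D", "E"]

# STEPs 6 and 7: every pair of candidate 1-itemsets with its upper bound.
c2_all = {"".join(pair): upper_bound(pair) for pair in combinations(c1, 2)}

# Pairs whose upper bound is below the threshold.
below = sorted(s for s, ub in c2_all.items() if ub < THRESHOLD)

print(c2_all)
print(below)
```

Because every 1-itemset is a candidate in this example, all ten pairs are generated; in general only pairs whose members both belong to the candidate 1-itemsets would be formed.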

STEP 8: The average-utility upper bound of each 2-itemset is then checked against the user-defined minimum average-utility threshold $\lambda$. In this example, the upper bounds of the itemsets {AB}, {AC}, {AE} and {CE} are below $\lambda$. These itemsets are thus removed from $C_2$. The remaining candidate average-utility 2-itemsets are shown in Table 3-11.

Table 3-11: The remaining candidate average-utility 2-itemsets, $C_2$.

| Candidate 2-Itemset | Average-Utility Upper Bound |
|---|---|
| AD | 67 |
| BC | 50 |
| BD | 68 |
| BE | 60 |
| CD | 51 |
| DE | 50 |

STEP 9: Since $C_2$ is not null, r is incremented to 2 and STEPs 6 to 9 are repeated. $C_3$ is then generated from $C_2$, as shown in Table 3-12.

Table 3-12: The average-utility upper bounds of the 3-itemsets.

| Candidate 3-Itemset | Average-Utility Upper Bound |
|---|---|
| BCD | 30 |
| BDE | 40 |

Since the average-utility upper bounds of both candidate 3-itemsets are less than $\lambda$, they are removed from $C_3$ and $C_3$ becomes null. After this step, all the candidate average-utility itemsets are as shown in Table 3-13.

Table 3-13: All the candidate average-utility itemsets in the example.

| Candidate Itemset | Average-Utility Upper Bound |
|---|---|
| A | 67 |
| B | 88 |
| C | 72 |
| D | 105 |
| E | 70 |
| AD | 67 |
| BC | 50 |
| BD | 68 |
| BE | 60 |
| CD | 51 |
| DE | 50 |

STEP 10: The actual average-utility value $au_s$ of each candidate average-utility itemset is calculated. Take the itemset {AD} as an example. The actual utility values of items A and D in transaction 1 are 3 and 6, respectively. Since the itemset {AD} contains 2 items, its actual average-utility value in transaction 1 is calculated as (3 + 6) / 2, which is 4.5. The itemset {AD} appears in transactions 1, 3, 5, 6 and 9. The actual average-utility value of {AD} is thus the sum of its average-utility values in these transactions. The value is calculated as (9 + 12 + 9 + 9 + 27) / 2, which is 33. The actual average-utility value of each candidate average-utility itemset is shown in Table 3-14.

Table 3-14: The actual average-utility values of the candidate average-utility itemsets.

| Candidate Itemset | Average-Utility |
|---|---|
| A | 36 |
| B | 80 |
| C | 11 |
| D | 60 |
| E | 40 |
| AD | 33 |
| BC | 29.5 |
| BD | 51 |
| BE | 45 |
| CD | 15.5 |
| DE | 29.5 |

STEP 11: The actual average-utility value of each candidate average-utility itemset is then compared with the user-defined minimum average-utility threshold $\lambda$. In this example, the actual average-utility values of the itemsets {B}, {D} and {BD} are larger than or equal to $\lambda$. They are thus put into the set of high average-utility itemsets, H, as shown in Table 3-15.

Table 3-15: High average-utility itemsets.

| High Average-Utility Itemset | Average-Utility |
|---|---|
| B | 80 |
| D | 60 |
| BD | 51 |

In this example, three high average-utility itemsets are generated. Note that if the traditional utility criterion is applied to the same candidates, the itemsets {B}, {D}, {AD}, {BC}, {BD}, {BE} and {DE} all reach the threshold. The number of high average-utility itemsets is thus smaller than the number of high utility itemsets. From the perspective of average utility, the utility value of an itemset does not automatically increase with its length. The item combination in a high average-utility itemset can thus really show its excellence in obtaining profits.
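The whole two-phase procedure of Section 3.1 can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation: the function name `mine_high_average_utility` and its data layout are assumptions, item names are assumed to be single characters, and candidates are generated by brute-force enumeration of item combinations instead of an Apriori-style join, which yields the same candidate sets but is slower on large inputs.

```python
from itertools import combinations


def mine_high_average_utility(transactions, profit, threshold):
    """Two-phase mining sketch; transactions are dicts mapping item -> quantity."""
    # STEPs 1-2: item utilities and the maximal utility of each transaction.
    utilities = [{i: q * profit[i] for i, q in t.items() if q > 0}
                 for t in transactions]
    max_utils = [max(u.values()) for u in utilities]

    def upper_bound(itemset):
        return sum(mu for u, mu in zip(utilities, max_utils)
                   if itemset.issubset(u))

    # STEPs 3-4: candidate 1-itemsets.
    items = sorted({i for u in utilities for i in u})
    level = [frozenset([i]) for i in items
             if upper_bound(frozenset([i])) >= threshold]
    candidates = list(level)

    # STEPs 5-9: grow candidates level by level, pruning with the upper bound.
    while level:
        level_set = set(level)
        size = len(level[0]) + 1
        pool = sorted({i for s in level for i in s})
        next_level = []
        for combo in combinations(pool, size):
            subsets_ok = all(frozenset(sub) in level_set
                             for sub in combinations(combo, size - 1))
            if subsets_ok and upper_bound(frozenset(combo)) >= threshold:
                next_level.append(frozenset(combo))
        candidates.extend(next_level)
        level = next_level

    # STEPs 10-11: actual average utility of each candidate.
    high = {}
    for s in candidates:
        total = sum(sum(u[i] for i in s) for u in utilities if s.issubset(u))
        average = total / len(s)
        if average >= threshold:
            high["".join(sorted(s))] = average
    return high


PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}    # Table 3-5
ROWS = [                                              # Table 3-4
    (1, 1, 4, 1, 0), (0, 1, 0, 3, 0), (2, 0, 0, 1, 0), (0, 0, 1, 0, 0),
    (1, 2, 0, 1, 3), (1, 1, 1, 1, 1), (0, 2, 3, 0, 1), (0, 0, 0, 1, 2),
    (7, 0, 1, 1, 0), (0, 1, 1, 1, 1),
]
transactions = [dict(zip("ABCDE", row)) for row in ROWS]
result = mine_high_average_utility(transactions, PROFIT, 45.4)
print(result)
```

On the data of Tables 3-4 and 3-5 the sketch returns the three itemsets of Table 3-15 with average utilities 80, 60 and 51.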

CHAPTER 4 Incremental Utility Mining Algorithm for Record Insertion

The proposed incremental average-utility mining algorithm is based on the concept of the four cases in FUP, adapted to average-utility itemsets. There are two phases in the proposed incremental average-utility mining algorithm. In the first phase, the average-utility upper bound is used to overestimate the average utility of each itemset. The average-utility upper bound is an overestimated utility value rather than the actual utility value. It ensures the anti-monotone property, which is used to decrease, level by level, the number of itemsets to be scanned. The itemsets whose average-utility upper bounds are larger than or equal to the user-defined threshold are called "high upper-bound average-utility itemsets". Otherwise, they are regarded as "low upper-bound average-utility itemsets". Each subset of a "high upper-bound average-utility itemset" is certainly a "high upper-bound average-utility itemset", and each superset of a "low upper-bound average-utility itemset" is certainly a "low upper-bound average-utility itemset". Many "low upper-bound average-utility itemsets" can thus be pruned level by level, which decreases the time spent scanning the database.

The proposed algorithm first scans the new transactions to obtain all candidate 1-itemsets with their average-utility upper bounds. The candidate 1-itemsets are then checked against the minimum average-utility threshold to decide whether they are high upper-bound average-utility itemsets for the new transactions. For each 1-itemset in the set of high upper-bound average-utility itemsets from the original database, if it appears in the set of high upper-bound average-utility 1-itemsets in the new transactions, it belongs to Case 1 (Large - Large), which is similar to the case mentioned in Table 2-1. Thus, it is still a high upper-bound average-utility itemset for the whole updated database. The updated average-utility upper bound of the itemset can easily be obtained by addition.

For each 1-itemset in the set of high upper-bound average-utility itemsets from the original database, if it does not appear in the set of high upper-bound average-utility 1-itemsets in the new transactions, it belongs to Case 2 (Large - Small). The updated average-utility upper bound of the itemset is re-calculated and checked against the minimum average-utility threshold to determine whether it is a high upper-bound average-utility itemset in the updated database.

For each itemset in the set of high upper-bound average-utility itemsets in the new transactions, if it does not appear in the set of high upper-bound average-utility itemsets in the original database, it belongs to Case 3 (Small - Large). A database rescan is needed to determine the average-utility upper bound of the itemset for the original database. The upper-bound value is then re-calculated and checked against the minimum average-utility threshold to determine whether it is a high upper-bound average-utility itemset in the updated database.

All the high upper-bound average-utility 1-itemsets for the whole updated database are then formed. Next, candidate 2-itemsets based on the high upper-bound average-utility 1-itemsets from the new transactions are generated. The same procedure is repeated, each time with one more item added, until no high upper-bound average-utility itemsets are formed.

After the first phase, all the high upper-bound average-utility itemsets for the whole updated database have been formed. The second phase then begins. In this phase, the actual average-utility values of the high upper-bound average-utility itemsets are calculated. These itemsets are then checked against the minimum average-utility threshold to determine whether they are actually high. All the actual high average-utility itemsets can thus be found. Compared with conventional batch utility mining algorithms, our incremental utility mining algorithm can reduce the time needed to re-process the whole updated database. The details of the proposed incremental average-utility mining algorithm are described below.

4.1 Notation

The notation used in this algorithm is described as follows:

D: the original database;
N: the set of new transactions;
U: the entire updated database, i.e., D ∪ N;
d: the number of transactions in D;
n: the number of transactions in N;
$\lambda$: the minimum average-utility ratio;
$\alpha^D$: the minimum average-utility threshold defined in the original database;
$\alpha^N$: the minimum average-utility threshold for the new transactions;
$\alpha^U$: the minimum average-utility threshold for the updated database;
$HU_k^D$: the set of high upper-bound average-utility k-itemsets in the original database;
$HU^D$: the set of high upper-bound average-utility itemsets in the original database;
$HU_k^N$: the set of high upper-bound average-utility k-itemsets in the new transactions;
$HU^N$: the set of high upper-bound average-utility itemsets in the new transactions;
$HU_k^U$: the set of high upper-bound average-utility k-itemsets in the updated database;
$HU^U$: the set of high upper-bound average-utility itemsets in the updated database;
$H^D$: the set of high average-utility itemsets in the original database;
$H^U$: the set of high average-utility itemsets in the updated database;
$mu_k$: the maximal utility value in each transaction $T_k$;
s: an itemset;
$ub^D(s)$: the average-utility upper bound of itemset s in the original database;
$ub^N(s)$: the average-utility upper bound of itemset s in the new transactions;
$ub^U(s)$: the average-utility upper bound of itemset s in the updated database;
$au^D(s)$: the actual average-utility value of itemset s in the original database;
$au^N(s)$: the actual average-utility value of itemset s in the new transactions;
$au^U(s)$: the actual average-utility value of itemset s in the updated database.

4.2 The Proposed Incremental Utility Mining Algorithm for Record Insertion

INPUT: The profit values of the items, the minimum average-utility ratio $\lambda$, an original database D with its minimum average-utility threshold $\alpha^D$ (= total utility × $\lambda$), its high upper-bound average-utility itemsets ($HU^D$) and high average-utility itemsets ($H^D$), and a set of new transactions $N = \{T_1, T_2, \ldots, T_n\}$.

OUTPUT: A set of high average-utility itemsets ($H^U$) for the updated database U (= D ∪ N).

STEP 1: Calculate the minimum average-utility thresholds ($\alpha^N$ and $\alpha^U$) for the new transactions N and for the updated database U, respectively, as follows:

$$\alpha^N = \frac{\alpha^D}{d} \times n \quad \text{and} \quad \alpha^U = \frac{\alpha^D}{d} \times (d + n),$$

where $\alpha^D$ is the minimum average-utility threshold for the original database, d is the number of transactions in the original database, and n is the number of new transactions.

STEP 2: Calculate the utility value $u_{jk}$ of each item $i_j$ in each new transaction $T_k$ as $u_{jk} = q_{jk} \times p_j$, where $q_{jk}$ is the quantity of $i_j$ in $T_k$, $p_j$ is the profit value of $i_j$, j = 1 to m and k = 1 to n.

STEP 3: Find the maximal item-utility value $mu_k$ in each new transaction $T_k$ as $mu_k = \max\{u_{1k}, u_{2k}, \ldots, u_{mk}\}$, k = 1 to n.

STEP 4: Set k = 1, where k records the number of items in the itemsets currently being processed.

STEP 5: Generate the candidate k-itemsets and calculate their average-utility upper bounds from the new transactions. The average-utility upper bound $ub^N(s)$ of each candidate k-itemset s is set as the summation of the maximal item-utilities of the new transactions which include s. That is:

$$ub^N(s) = \sum_{s \subseteq T_k} mu_k.$$

STEP 6: Check whether the average-utility upper bound of each candidate k-itemset s from the new transactions is larger than or equal to $\alpha^N$. If s satisfies the above condition, put it in the set of high upper-bound average-utility k-itemsets for the new transactions, $HU_k^N$.

STEP 7: For each k-itemset s in the set of high upper-bound average-utility itemsets ($HU_k^D$) from the original database, if it appears in the set of high upper-bound average-utility k-itemsets ($HU_k^N$) in the new transactions, do the following substeps.

Substep 7-1: Set the newly updated average-utility upper bound of itemset s as: $ub^U(s) = ub^D(s) + ub^N(s)$.

Substep 7-2: Put s in the set of updated high upper-bound average-utility k-itemsets, $HU_k^U$.

STEP 8: For each k-itemset s in the set of high upper-bound average-utility itemsets ($HU_k^D$) from the original database, if it does not appear in the set of high upper-bound average-utility k-itemsets ($HU_k^N$) in the new transactions, do the following substeps.

Substep 8-1: Set the updated average-utility upper bound of itemset s as: $ub^U(s) = ub^D(s) + ub^N(s)$.

Substep 8-2: Check whether the updated average-utility upper bound of itemset s is larger than or equal to $\alpha^U$. If it satisfies the above condition, put it in the set of updated high upper-bound average-utility k-itemsets, $HU_k^U$.

STEP 9: For each k-itemset s in the set of high upper-bound average-utility itemsets ($HU_k^N$) in the new transactions, if it does not appear in the set of high upper-bound average-utility k-itemsets ($HU_k^D$) in the original database, do the following substeps.

Substep 9-1: Rescan the original database to determine the average-utility upper bound $ub^D(s)$ of itemset s.

Substep 9-2: Set the updated average-utility upper bound of itemset s as: $ub^U(s) = ub^D(s) + ub^N(s)$.

Substep 9-3: Check whether the updated average-utility upper bound of itemset s is larger than or equal to $\alpha^U$. If it satisfies the above condition, put it in the set of updated high upper-bound average-utility k-itemsets, $HU_k^U$.

STEP 10: Generate the candidate (k+1)-itemsets from the set of high upper-bound average-utility k-itemsets ($HU_k^N$) in the new transactions. If any k-sub-itemset of a candidate (k+1)-itemset is not contained in the set of updated high upper-bound average-utility k-itemsets ($HU_k^U$), remove the candidate from the candidate set.

STEP 11: Set k = k + 1.

STEP 12: Repeat STEPs 5 to 11 until no new candidate itemsets are generated.

STEP 13: For each high upper-bound average-utility itemset s in $HU^U$ of the updated database, if it appears in the set of high upper-bound average-utility itemsets ($HU^D$) of the original database, do the following substeps.

Substep 13-1: Calculate the actual average-utility value of itemset s for the new transactions as:

$$au^N(s) = \frac{\sum_{s \subseteq T_k} \sum_{i_j \in s} u_{jk}}{|s|},$$

where $u_{jk}$ is the utility value of each item $i_j$ in transaction $T_k$ and $|s|$ is the number of items in s.

Substep 13-2: Set the new actual average-utility value of s in the updated database as: $au^U(s) = au^D(s) + au^N(s)$.

Substep 13-3: Check whether the actual average-utility value of itemset s is larger than or equal to $\alpha^U$. If it satisfies the above condition, put it in the set of updated high average-utility itemsets, $H^U$.

STEP 14: For each high upper-bound average-utility itemset s in $HU^U$ of the updated database, if it does not appear in the set of high upper-bound average-utility itemsets ($HU^D$) of the original database, do the following substeps.

Substep 14-1: Calculate the actual average-utility value of itemset s for the new transactions as:

$$au^N(s) = \frac{\sum_{s \subseteq T_k} \sum_{i_j \in s} u_{jk}}{|s|},$$

where $u_{jk}$ is the utility value of each item $i_j$ in transaction $T_k$ and $|s|$ is the number of items in s.

Substep 14-2: Rescan the original database to determine the actual average-utility value $au^D(s)$ of itemset s in the original database.

Substep 14-3: Set the new actual average-utility value of s in the updated database as: $au^U(s) = au^D(s) + au^N(s)$.

Substep 14-4: Check whether the actual average-utility value of itemset s is larger than or equal to $\alpha^U$. If it satisfies the above condition, put it in the set of updated high average-utility itemsets, $H^U$.

After STEP 14, the final high average-utility itemsets for the updated database have been found.

4.3 An Example

In this section, an example is given to demonstrate the proposed incremental average-utility mining algorithm for record insertion. It is a simple example showing how the proposed algorithm can be used to efficiently find high average-utility itemsets from incrementally arriving transaction data with fewer rescans of the original database. Assume the original database includes 10 transactions, shown in Table 4-1. Each transaction consists of its transaction identification (TID) and the items purchased. The numbers represent the quantities purchased.

Table 4-1: The set of ten transaction data in the original database.

| TID | A | B | C | D | E |
|---|---|---|---|---|---|
| t1 | 1 | 1 | 4 | 1 | 0 |
| t2 | 0 | 1 | 0 | 3 | 0 |
| t3 | 2 | 0 | 0 | 1 | 0 |
| t4 | 0 | 0 | 1 | 0 | 0 |
| t5 | 1 | 2 | 0 | 1 | 3 |
| t6 | 1 | 1 | 1 | 1 | 1 |
| t7 | 0 | 2 | 3 | 0 | 1 |
| t8 | 0 | 0 | 0 | 1 | 2 |
| t9 | 7 | 0 | 1 | 1 | 0 |
| t10 | 0 | 1 | 1 | 1 | 1 |

Also assume that the profit value of each item is defined in Table 4-2.

Table 4-2: The predefined profit values of the items.

| Item | Profit |
|---|---|
| A | 3 |
| B | 10 |
| C | 1 |
| D | 6 |
| E | 5 |

Suppose the minimum average-utility ratio $\lambda$ is set at 20%. Thus, the minimum average-utility threshold is calculated as the total utility value multiplied by 20%, which is 45.4. Using the batch mining algorithm for the original database, the set of high upper-bound average-utility itemsets generated in Phase 1 is shown in Table 4-3. The average-utility upper bound and the actual average-utility value of each high upper-bound average-utility itemset are also recorded in Table 4-3.

Table 4-3: The average-utility upper bounds and the actual average-utility values of the high upper-bound average-utility itemsets from the original database.

| Itemset | Average-Utility Upper Bound | Actual Average-Utility |
|---|---|---|
| A | 67 | 36 |
| B | 88 | 80 |
| C | 72 | 11 |
| D | 105 | 60 |
| E | 70 | 40 |
| AD | 67 | 33 |
| BC | 50 | 29.5 |
| BD | 68 | 51 |
| BE | 60 | 45 |
| CD | 51 | 15.5 |
| DE | 50 | 29.5 |

Assume the three new transactions shown in Table 4-4 are inserted after the initial data set is processed. The proposed incremental average-utility mining algorithm proceeds as follows.

Table 4-4: The three newly inserted transactions.

| TID | A | B | C | D | E |
|---|---|---|---|---|---|
| t11 | 1 | 1 | 1 | 2 | 0 |
| t12 | 0 | 3 | 4 | 1 | 0 |
| t13 | 2 | 0 | 2 | 0 | 1 |

STEP 1: The minimum average-utility thresholds ($\alpha^N$ and $\alpha^U$) for the newly inserted transactions N and for the updated database U, respectively, are calculated. In this example, there are 3 newly inserted transactions and thus 13 (= 10 + 3) transactions in the updated database. According to the formulas, $\alpha^N$ and $\alpha^U$ are calculated as follows:

$$\alpha^N = \frac{\alpha^D}{d} \times n = \frac{45.4}{10} \times 3 = 13.62,$$

$$\alpha^U = \frac{\alpha^D}{d} \times (d + n) = \frac{45.4}{10} \times (10 + 3) = 59.02.$$

STEP 2: The utility value of each item occurring in each newly inserted transaction is calculated. Take item D in transaction 11 as an example. The quantity of item D in transaction 11 is 2, and its profit is 6. The utility value of D is thus calculated as 2*6, which is 12. The utility values of all the items in each newly inserted transaction are shown in Table 4-5.

(57) Table 4-5: The utility values of all the items in the newly inserted transactions.

TID    A    B     C    D     E
t11    3    10    1    12    0
t12    0    30    4    6     0
t13    6    0     2    0     5

STEP 3: The utility values of the items in a transaction are compared and the maximal utility value in the transaction is found. Take transaction 12 as an example. It can be observed from Table 4-5 that the utility value of {B} is 30, which is the maximum in transaction 12. The maximal utility value in each transaction is shown in Table 4-6.

Table 4-6: The maximal utility values in the newly inserted transactions.

TID    A    B     C    D     E    Maximal Utility Value in a Transaction
t11    3    10    1    12    0    12
t12    0    30    4    6     0    30
t13    6    0     2    0     5    6

STEP 4: k is set to 1, where k is used to record the number of items in the itemsets currently being processed.

STEP 5: The average-utility upper bounds of the 1-itemsets in the newly inserted transactions are first calculated. Take item {A} as an example. It appears in transactions 11 and 13. The average-utility upper bound of {A} is thus the sum of the maximal utility values of these transactions, which is 12+6 (=18) in the example. The upper-bound values of all the items in the new transactions are shown in Table 4-7.

(58) Table 4-7: The average-utility upper bounds of the 1-itemsets in the new transactions.

1-Itemset    Average-Utility Upper Bound
A            18
B            42
C            48
D            42
E            6

STEP 6: The average-utility upper bounds of the 1-itemsets are checked against the minimum average-utility threshold αN (which is 13.62) for the new transactions. In this example, the upper bounds of the four 1-itemsets {A}, {B}, {C} and {D} are larger than αN. The four items are then put in the set of high upper-bound average-utility 1-itemsets for the new transactions, HU_1^N, which is shown in Table 4-8.
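Steps 3, 5 and 6 can be sketched as follows. The code is illustrative only: `UTILITIES`, `ALPHA_N` and `upper_bound` are names assumed here, itemsets are written as strings of single-letter items, and the comparison uses `>=` on the assumption that an upper bound equal to the threshold also qualifies (the worked example contains no such tie).

```python
# Utility values of the inserted transactions (Table 4-5), zero entries omitted.
UTILITIES = {
    "t11": {"A": 3, "B": 10, "C": 1, "D": 12},
    "t12": {"B": 30, "C": 4, "D": 6},
    "t13": {"A": 6, "C": 2, "E": 5},
}
ALPHA_N = 13.62  # threshold for the new transactions (Step 1)


def upper_bound(utilities, itemset):
    """Steps 3 and 5: sum the maximal utility value of every transaction
    that contains all items of the itemset."""
    return sum(
        max(row.values())
        for row in utilities.values()
        if all(item in row for item in itemset)
    )


# Step 5: upper bounds of the 1-itemsets (Table 4-7).
bounds = {item: upper_bound(UTILITIES, item) for item in "ABCDE"}

# Step 6: keep the 1-itemsets whose bound reaches the threshold (Table 4-8).
hu1_n = {item: b for item, b in bounds.items() if b >= ALPHA_N}
print(bounds)
print(hu1_n)
```

Because `upper_bound` only checks that every item of the itemset occurs in a transaction, the same function applies unchanged to longer itemsets such as "AC".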

(59) Table 4-8: The set of high upper-bound average-utility 1-itemsets for the new transactions, HU_1^N.

1-Itemset    Average-Utility Upper Bound
A            18
B            42
C            48
D            42

STEP 7: For each 1-itemset s in the set of high upper-bound average-utility 1-itemsets (HU_1^D) from the original database, if it also appears in the set of high upper-bound average-utility 1-itemsets (HU_1^N) in the new transactions, the following substeps are done. In this case, the four 1-itemsets {A}, {B}, {C} and {D} are processed.

Substep 7-1: The updated average-utility upper bound of each itemset is calculated. Take {A} as an example. Its average-utility upper bounds in the original database and in the new transactions are 67 and 18, respectively. As a result, the average-utility upper bound of {A} in the updated database is 67+18, which is 85. The average-utility upper bounds of the other three items are calculated in the same way.

Substep 7-2: The four 1-itemsets {A}, {B}, {C} and {D} are then put into the set of updated high upper-bound average-utility 1-itemsets, HU_1^U.
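The reason Substep 7-2 needs no comparison against αU can be sketched with the Step 1 formulas. The shorthand ub_D(s), ub_N(s) and ub_U(s) for the average-utility upper bounds of an itemset s in the original database, the new transactions and the updated database is introduced here for illustration and is not the thesis notation.

```latex
\alpha_D + \alpha_N
  = \alpha_D + \frac{\alpha_D}{d} \times n
  = \frac{\alpha_D}{d} \times (d + n)
  = \alpha_U

ub_D(s) \ge \alpha_D \ \text{and} \ ub_N(s) \ge \alpha_N
  \;\Longrightarrow\;
  ub_U(s) = ub_D(s) + ub_N(s) \ge \alpha_D + \alpha_N = \alpha_U
```

For {A}, 67 ≥ 45.4 and 18 ≥ 13.62, so the updated bound 85 necessarily reaches 59.02.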

(60) STEP 8: For each 1-itemset s in the set of high upper-bound average-utility 1-itemsets (HU_1^D) from the original database, if it does not appear in the set of high upper-bound average-utility 1-itemsets (HU_1^N) in the new transactions, the following substeps are done. In this case, only {E} satisfies the condition.

Substep 8-1: The updated average-utility upper bound of {E} is calculated as 70+6, which is 76.

Substep 8-2: The updated average-utility upper bound of the itemset {E} is larger than αU, which is 59.02. Itemset {E} is thus put in the set of updated high upper-bound average-utility 1-itemsets, HU_1^U.

STEP 9: In this case, every 1-itemset in the set of high upper-bound average-utility 1-itemsets (HU_1^N) in the new transactions also appears in the set of high upper-bound average-utility 1-itemsets (HU_1^D) from the original database. This step is thus skipped.

After Step 9, all the updated high upper-bound average-utility 1-itemsets are shown in Table 4-9.
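The three cases handled in Steps 7 to 9 can be sketched as a single merge routine. This is an illustrative sketch, not the thesis code: `update_high_upper_bound` and its parameters are hypothetical names, the comparison uses `>=` by assumption, and an itemset missing from `ub_n` is assumed to have an upper bound of 0 in the new transactions.

```python
ALPHA_U = 59.02
HU1_D = {"A": 67, "B": 88, "C": 72, "D": 105, "E": 70}  # 1-itemsets of Table 4-3
UB_N = {"A": 18, "B": 42, "C": 48, "D": 42, "E": 6}     # Table 4-7
HU1_N = {"A", "B", "C", "D"}                            # Table 4-8


def update_high_upper_bound(hu_d, ub_n, hu_n, alpha_u):
    """Merge the upper bounds of the original database and the new transactions.

    hu_d: itemset -> upper bound, for itemsets high in the original database
    ub_n: itemset -> upper bound in the new transactions
    hu_n: itemsets that are high in the new transactions
    """
    updated = {}
    for itemset, old_bound in hu_d.items():
        total = old_bound + ub_n.get(itemset, 0)
        # Step 7: high in both D and N, so it stays high in U.
        # Step 8: high only in D, so the summed bound must reach alpha_u.
        if itemset in hu_n or total >= alpha_u:
            updated[itemset] = total
    # Step 9: itemsets high only in N need a rescan of the original database.
    needs_rescan = sorted(s for s in hu_n if s not in hu_d)
    return updated, needs_rescan


hu1_u, rescan = update_high_upper_bound(HU1_D, UB_N, HU1_N, ALPHA_U)
print(hu1_u)
print(rescan)
```

Returning the Step 9 itemsets separately keeps the costly rescan of the original database outside the merge, so it is only triggered when that list is non-empty.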

(61) Table 4-9: The set of all the updated high upper-bound average-utility 1-itemsets, HU_1^U.

1-Itemset    Average-Utility Upper Bound
A            85
B            130
C            120
D            147
E            76

STEP 10: The candidate 2-itemsets are generated from the set of high upper-bound average-utility 1-itemsets (HU_1^N) in the new transactions. If any 1-sub-itemset of a candidate 2-itemset is not contained in the set of updated high upper-bound average-utility 1-itemsets (HU_1^U), the candidate is removed from the candidate set. In this case, the candidate 2-itemsets are {AB}, {AC}, {AD}, {BC}, {BD} and {CD}.

STEP 11: k is set to 2, where k is used to record the number of items in the itemsets currently being processed.

STEP 12: The average-utility upper bounds of the 2-itemsets in the newly inserted transactions are calculated. The upper-bound values of all the 2-itemsets in the new transactions are shown in Table 4-10.

(62) Table 4-10: The average-utility upper bounds of the 2-itemsets in the new transactions.

2-Itemset    Average-Utility Upper Bound
AB           12
AC           18
AD           12
BC           42
BD           42
CD           42

STEP 13: The average-utility upper bounds of the 2-itemsets are checked against the minimum average-utility threshold αN (which is 13.62) for the new transactions. In this example, the upper bounds of the four 2-itemsets {AC}, {BC}, {BD} and {CD} are larger than αN. The four itemsets are then put in the set of high upper-bound average-utility 2-itemsets for the new transactions, HU_2^N, which is shown in Table 4-11.

Table 4-11: The set of high upper-bound average-utility 2-itemsets for the new transactions, HU_2^N.

2-Itemset    Average-Utility Upper Bound
AC           18
BC           42
BD           42
CD           42

(63) STEP 14: For each 2-itemset s in the set of high upper-bound average-utility 2-itemsets (HU_2^D) from the original database, if it also appears in the set of high upper-bound average-utility 2-itemsets (HU_2^N) in the new transactions, the following substeps are done. In this case, the three 2-itemsets {BC}, {BD} and {CD} are processed.

Substep 14-1: The updated average-utility upper bounds of itemsets {BC}, {BD} and {CD} are calculated as 50+42 (=92), 68+42 (=110) and 51+42 (=93), respectively.

Substep 14-2: The three 2-itemsets {BC}, {BD} and {CD} are then put into the set of updated high upper-bound average-utility 2-itemsets, HU_2^U.

STEP 15: For each 2-itemset s in the set of high upper-bound average-utility 2-itemsets (HU_2^D) from the original database, if it does not appear in the set of high upper-bound average-utility 2-itemsets (HU_2^N) in the new transactions, the following substeps are done. In this case, the three 2-itemsets {AD}, {BE} and {DE} are processed.

Substep 15-1: The updated average-utility upper bounds of itemsets {AD}, {BE} and {DE} are calculated as 67+12 (=79), 60+0 (=60) and 50+0 (=50), respectively. No new transaction contains {BE} or {DE}, so their upper bounds are unchanged.

Substep 15-2: The updated average-utility upper bounds of the itemsets {AD} and {BE} are larger than αU, which is 59.02. Itemsets {AD} and {BE} are thus put in the set of updated high upper-bound average-utility 2-itemsets, HU_2^U. The updated upper bound of {DE} (50) is smaller than αU, so {DE} is not included.

(64) STEP 16: For each 2-itemset s in the set of high upper-bound average-utility 2-itemsets (HU_2^N) in the new transactions, if it does not appear in the set of high upper-bound average-utility 2-itemsets (HU_2^D) from the original database, the following substeps are done. In this case, the 2-itemset {AC} is processed.

Substep 16-1: The original database is rescanned to determine the average-utility upper bound of itemset {AC}, which is 41.

Substep 16-2: The updated average-utility upper bound of itemset {AC} is calculated. The average-utility upper bounds of {AC} in the original database and in the new transactions are 41 and 18, respectively. Thus, the updated average-utility upper bound of {AC} is 41+18, which is 59.

Substep 16-3: The updated average-utility upper bound of the itemset {AC} is smaller than αU, which is 59.02. Thus, {AC} is not put into HU_2^U and nothing further has to be done.

After Step 16, all the updated high upper-bound average-utility 2-itemsets are shown in Table 4-12.
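The rescan in Step 16 can be checked directly against Tables 4-1 and 4-2. The sketch below is illustrative only; `ORIGINAL_DB`, `PROFIT` and `rescan_upper_bound` are assumed names, zero quantities are omitted, and the `>=` comparison is again an assumption.

```python
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}  # Table 4-2
ORIGINAL_DB = {  # quantities of Table 4-1, zero entries omitted
    "t1": {"A": 1, "B": 1, "C": 4, "D": 1},
    "t2": {"B": 1, "D": 3},
    "t3": {"A": 2, "D": 1},
    "t4": {"C": 1},
    "t5": {"A": 1, "B": 2, "D": 1, "E": 3},
    "t6": {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1},
    "t7": {"B": 2, "C": 3, "E": 1},
    "t8": {"D": 1, "E": 2},
    "t9": {"A": 7, "C": 1, "D": 1},
    "t10": {"B": 1, "C": 1, "D": 1, "E": 1},
}
ALPHA_U = 59.02


def rescan_upper_bound(db, profit, itemset):
    """Substep 16-1: scan the database and sum the maximal item utility of
    each transaction that contains every item of the itemset."""
    total = 0
    for quantities in db.values():
        if all(item in quantities for item in itemset):
            total += max(qty * profit[item] for item, qty in quantities.items())
    return total


bound_d = rescan_upper_bound(ORIGINAL_DB, PROFIT, "AC")  # transactions t1, t6, t9
bound_u = bound_d + 18            # Substep 16-2: add the bound from Table 4-11
is_high = bound_u >= ALPHA_U      # Substep 16-3
print(bound_d, bound_u, is_high)  # 41 59 False
```

This is the only case in which the original database has to be read again, which is why the algorithm restricts it to itemsets that are high in the new transactions but were not high before.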

(65) Table 4-12: The set of all the updated high upper-bound average-utility 2-itemsets, HU_2^U.

2-Itemset    Average-Utility Upper Bound
AD           79
BC           92
BD           110
BE           60
CD           93

STEP 17: The candidate 3-itemsets are generated from the set of high upper-bound average-utility 2-itemsets (HU_2^N) in the new transactions. If any 2-sub-itemset of a candidate 3-itemset is not contained in the set of updated high upper-bound average-utility 2-itemsets (HU_2^U), the candidate is removed from the candidate set. In this case, the only candidate 3-itemset is {BCD}.

STEP 18: k is set to 3, where k is used to record the number of items in the itemsets currently being processed.

STEP 19: Repeat STEPs 5 to 11 until no new candidate itemsets are generated.

The set of all the updated high upper-bound average-utility itemsets is shown in Table 4-13.
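The candidate generation of Steps 10 and 17 can be sketched with one routine. This is a brute-force illustration rather than the thesis procedure: `generate_candidates` is a hypothetical name, itemsets are strings of sorted single-letter items, and the sketch assumes that every (k-1)-sub-itemset of a candidate must be in both the set for the new transactions and the updated set, which is consistent with the downward-closure property of the upper bounds.

```python
from itertools import combinations


def generate_candidates(hu_prev_new, hu_prev_updated, k):
    """Build candidate k-itemsets from the high upper-bound (k-1)-itemsets of
    the new transactions, pruning any candidate that has a (k-1)-sub-itemset
    outside the updated high upper-bound set."""
    items = sorted({item for itemset in hu_prev_new for item in itemset})
    candidates = []
    for combo in combinations(items, k):
        subsets = ["".join(sub) for sub in combinations(combo, k - 1)]
        if all(s in hu_prev_new and s in hu_prev_updated for s in subsets):
            candidates.append("".join(combo))
    return candidates


# Step 10: candidate 2-itemsets from Tables 4-8 and 4-9.
c2 = generate_candidates({"A", "B", "C", "D"}, {"A", "B", "C", "D", "E"}, 2)
# Step 17: candidate 3-itemsets from Tables 4-11 and 4-12.
c3 = generate_candidates({"AC", "BC", "BD", "CD"}, {"AD", "BC", "BD", "BE", "CD"}, 3)
print(c2)
print(c3)
```

Enumerating all item combinations is acceptable for an example of this size; a join of (k-1)-itemsets that share a common prefix would avoid most of the wasted combinations on larger data.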