A Generic Framework and Algorithms for Mining Indirect Associations from Data Streams


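National University of Kaohsiung, Department of Computer Science and Information Engineering (Graduate Institute). Master's Thesis. A Generic Framework and Algorithms for Mining Indirect Associations from Data Streams. Student: You-En Wei. Advisor: Dr. Wen-Yang Lin.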

Thesis Approval Form

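Acknowledgements

First, I would like to thank my advisor, Professor Wen-Yang Lin, who helped me enormously throughout my master's studies, both by caring about my daily life and by resolving the doubts I met in my research. At every critical moment, the messages he sent me gave me great courage and confidence. Thank you, Professor, for all your hard work.

Next, I thank my senior labmates 長榮 and 和益. Discussing with them whenever I hit a bottleneck helped me break through, and their advice on coursework helped me a great deal. 吳錦昂 and 蘇家輝 also gave me much practical advice in our joint discussions, and my junior labmate 順發 took care of many chores for me while I was busy writing the thesis. All of my classmates from the same year also helped me a lot, both in coursework and in everyday matters, and I thank them here.

I also thank the members of my oral defense committee for examining me and for the many suggestions that made this thesis more complete. I am grateful as well to the senior and junior members of Professor 洪宗貝's laboratory for their help at our meetings, each of which gave me many new ideas.

Finally, I thank my family and friends, whose support kept me going and freed me from everyday worries so that I could concentrate on finishing this thesis. These two years passed quickly, and along the way I kept meeting people who helped me. Thank you all. Having had the chance to pursue a master's degree and experience this life has been a great happiness at this stage of my life.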

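A Generic Framework and Algorithms for Mining Indirect Associations from Data Streams

Advisor: Dr. Wen-Yang Lin (Professor), Institute of Computer Science and Information Engineering, National University of Kaohsiung.
Student: You-En Wei, Institute of Computer Science and Information Engineering, National University of Kaohsiung.

Abstract (Chinese)

The idea of an indirect association is to relate a pair of items that rarely occur together through a frequent pattern called a "mediator". In recent years, indirect associations have been regarded as a useful type of rule that can reveal hidden, interesting information in many applications, such as recommendation ranking, finding substitute or competing items, web navigation paths, and gene expression analysis. To our knowledge, however, research on indirect association mining still focuses on static data, and almost no work has examined how to discover this type of pattern in data streams. This thesis considers how to mine indirect associations from data streams. Unlike traditional research on stream mining, which usually discusses each stream model separately, we propose a generic mining framework. The framework covers the three most commonly used stream window models (the landmark window, time-fading window, and sliding window models) and allows users to define a window model that suits their needs. Based on this generic framework, we develop two algorithms that mine indirect associations efficiently. We prove that the proposed methods, GIAMS-IND and GIAMS-MED, are guaranteed not to produce false indirect associations and that the error in the quality of the results is bounded. We also verify the effectiveness and efficiency of the proposed algorithms through thorough experiments on synthetic and real datasets.

Keywords: indirect association, data stream mining, generic streaming model.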

A Generic Framework and Algorithms for Mining Indirect Associations from Data Streams

Advisor: Dr. (Professor) Wen-Yang Lin, Institute of Computer Science and Information Engineering, National University of Kaohsiung.
Student: You-En Wei, Institute of Computer Science and Information Engineering, National University of Kaohsiung.

ABSTRACT

An indirect association refers to an infrequent itempair, each item of which co-occurs frequently with a frequent itemset called a "mediator". Indirect associations have been recognized as powerful patterns for revealing interesting information hidden in many applications, such as recommendation ranking, substitute or competitive items, common web navigation paths, and gene expression. However, all work conducted up to now has focused on mining indirect associations from static data; almost no work, to our knowledge, has investigated how to discover this type of pattern from streaming data. In this thesis, the problem of mining indirect associations from data streams is considered. Unlike contemporary research on stream data mining, which investigates the problem separately for each type of streaming model, we treat the problem in a generic way. We propose a generic framework that encompasses all classical streaming models, including the landmark window, time-fading window, and sliding window models, and that lets users specify their own window model through parameter settings. Based on this generic framework, we develop two efficient algorithms for generating indirect associations in this context. We prove that the two proposed algorithms, GIAMS-IND and GIAMS-MED, guarantee no false positive patterns and a bounded error on the quality of the discovered indirect associations. Comprehensive experiments on both synthetic and real datasets under three widely used streaming models show the effectiveness and efficiency of the proposed algorithms.

Keywords: Indirect association, data stream mining, generic streaming model.

Contents

Thesis Approval Form
Acknowledgements
Abstract (Chinese)
ABSTRACT
Contents
Chapter 1 Introduction
Chapter 2 Background and Related Work
    2.1 Data Stream Mining
        2.1.1 Landmark window model
        2.1.2 Time-fading window model
        2.1.3 Sliding window model
    2.2 Indirect Association Mining
Chapter 3 The Proposed Generic Framework for Streaming Indirect Associations Mining
    3.1 Proposed Generic Streaming Window Model
    3.2 Proposed Generic Framework for Indirect Associations Mining
Chapter 4 The Proposed Algorithms
    4.1 Algorithm GIAMS-IND
        4.1.1 Procedure BlockDelete
        4.1.2 Procedure TransactionMerge
        4.1.3 Procedure DelayInsert
        4.1.4 Procedure Decay&Pruning
        4.1.5 Indirect Association Generation
    4.2 Algorithm GIAMS-MED
Chapter 5 Theoretical Analyses
    5.1 Support Error Bound Analysis
    5.2 Performance Comparison
Chapter 6 Experimental Results
    6.1 Evaluation on Synthetic Data
        6.1.1 Landmark window model
        6.1.2 Time-fading window model
        6.1.3 Sliding window model
    6.2 Evaluation on Real Data
        6.2.1 Landmark window model
        6.2.2 Time-fading window model
        6.2.3 Sliding window model
Chapter 7 Conclusions and Future Work
References

List of Figures

Figure 2-1. Landmark window model.
Figure 2-2. Time-fading window model.
Figure 2-3. Sliding window model.
Figure 2-4. The INDIRECT algorithm.
Figure 3-1. Generic window model.
Figure 3-2. An example for illustrating the generic window model.
Figure 3-3. Generic framework for indirect association mining.
Figure 4-1. Illustration of Card-Stree for maintaining FP.
Figure 4-2. The GIAMS-IND algorithm.
Figure 4-3. An example data stream whose first three blocks are displayed.
Figure 4-4. Description of procedure BlockDelete.
Figure 4-5. An example for block delete function.
Figure 4-6. Description of procedure TransactionMerge.
Figure 4-7. The resulting compact table CT after merging the first block.
Figure 4-8. An illustration of delay insertion when processing the first transaction in block 2.
Figure 4-9. Description of procedure DelayInsert.
Figure 4-10. Description of procedure Decay&Pruning.
Figure 4-11. Description of procedure IndirectAssociationGen.
Figure 4-12. An illustration of indirect association generation.
Figure 4-13. The visual for (1).
Figure 4-14. The relations between x, y and M.
Figure 4-15. Description of procedure IndirectAssociationGen-Med in algorithm GIAMS-MED.
Figure 4-16. An example of generating mediators and indirect item pairs.
Figure 4-17. Generating indirect association rules from mediators and IIS.
Figure 5-1. The description of maximal possible error and pruning threshold.
Figure 6-1. Execution time and memory usage for running process 1, with varying transaction sizes and σf values.
Figure 6-2. Execution time and memory usage for process 1, with varying transaction sizes and strides.
Figure 6-3. Average support error of generated frequent itemsets.
Figure 6-4. Recall of discovered indirect associations with different strides (block sizes) and σf values.
Figure 6-5. The execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.
Figure 6-6. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.
Figure 6-7. Execution time and memory usage for running process 1, with varying transaction sizes and decay rates.
Figure 6-8. Execution time and memory usage for process 1, with varying transaction sizes and σf values.
Figure 6-9. Average support error of generated frequent itemsets.
Figure 6-10. Recall of discovered indirect associations with different decay rates and σf values.
Figure 6-11. The execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.
Figure 6-12. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.
Figure 6-13. Execution time and memory usage for running process 1, with varying transaction sizes and strides.
Figure 6-14. Execution time and memory usage for running process 1, with varying transaction sizes and window sizes.
Figure 6-15. The execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.
Figure 6-16. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.
Figure 6-17. Execution time and memory usage for running process 1, with varying transaction sizes and σf values.
Figure 6-18. Execution time and memory usage for process 1, with varying transaction sizes and strides (block sizes).
Figure 6-19. Average support error of generated frequent itemsets.
Figure 6-20. Recall of discovered indirect associations with different strides and σf values.
Figure 6-21. The execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.
Figure 6-22. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.
Figure 6-23. Execution time and memory usage for running process 1, with varying transaction sizes and decay rates.
Figure 6-24. Execution time and memory usage for process 1, with varying transaction sizes and mediator support thresholds.
Figure 6-25. Average support error of generated frequent itemsets.
Figure 6-26. Recall of discovered indirect associations with different decay rates.
Figure 6-27. The execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying mediator support thresholds.
Figure 6-28. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying mediator support thresholds.
Figure 6-29. Execution time and memory usage for process 1, with varying transaction sizes and strides.
Figure 6-30. Execution time and memory usage for process 1, with varying window sizes.
Figure 6-31. The execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying mediator support thresholds.
Figure 6-32. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying mediator support thresholds.

Chapter 1 Introduction

Recently, the problem of mining interesting patterns or knowledge from large volumes of continuous, fast-growing data, called data streams, has emerged as one of the most challenging issues for the data mining research community [1], [3], [6], [9], [10], [11], [14], [15], [16], [24]. Although a large body of literature over the past few years addresses the mining of frequent patterns from streaming data, such as itemsets [6], [11], [14], [16], maximal itemsets [15], and closed itemsets [7], no work, to our knowledge, has endeavored to discover indirect associations, a recently coined type of infrequent pattern. The term indirect association, first proposed by Tan and Kumar in 2000 [19], refers to an infrequent itempair, each item of which co-occurs frequently with a frequent itemset called a "mediator". Indirect associations have been recognized as powerful patterns for revealing interesting information hidden in many applications, such as recommendation ranking, substitute (or competitive) items, common web navigation paths, and gene expression.

In this thesis, the problem of mining indirect associations from data streams is considered. Unlike contemporary research on stream data mining, which investigates the problem separately for each type of streaming model, we treat the problem in a generic way. We propose a generic framework that encompasses all classical streaming models and develop two efficient algorithms for generating indirect associations in this context. The contributions of this thesis are threefold:

(1) We propose a generic streaming data model that encompasses the contemporary streaming window models, including the landmark window, time-fading window, and sliding window models.
(2) We propose a generic approach for mining indirect associations over the generic streaming window model, which guarantees no false positive patterns and a bounded error on the quality of the discovered associations.
(3) Based on the generic framework, we develop two efficient algorithms for mining indirect associations.

The organization of this thesis is as follows. Chapter 2 summarizes background and related work. Chapter 3 describes the proposed generic framework for mining indirect association rules over streaming data, including the generic window model and the generic mining framework. In Chapter 4, we propose two efficient algorithms based on the generic framework of Chapter 3, called GIAMS-IND and GIAMS-MED. Chapter 5 analyzes some properties of the proposed algorithms, including an error bound for the supports of the discovered frequent itemsets and a theoretical performance comparison between GIAMS-MED and GIAMS-IND in indirect association generation. The experimental results on synthetic and real-world datasets are presented in Chapter 6. Finally, Chapter 7 presents conclusions and future work.

Chapter 2 Background and Related Work

2.1 Data Stream Mining

Suppose that we have a data stream S = (t0, t1, …, ti, …), where ti denotes the transaction that arrives at time i. Since a data stream is a continuous, unbounded sequence of data arriving over time, a window W is usually specified to represent the sequence of data that arrived from ti to tj, denoted as W[i, j] = (ti, ti+1, …, tj). In the literature, there are three main types of window models for data stream mining: the landmark window, time-fading window, and sliding window models.

2.1.1 Landmark window model

The landmark model monitors the entire history of stream data from a specific time point, called the landmark, to the present time. Consider Figure 2-1 for example. Window W1 collects stream data from ti to tj. In the same way, windows W2 and W3 span stream data from ti to tj+1 and from ti to tj+2, respectively.

[Figure 2-1: timeline t0, t1, t2, …, ti, …, tj, tj+1, tj+2; windows W1, W2, and W3 all start at ti and end at tj, tj+1, and tj+2, respectively.]

Figure 2-1. Landmark window model.

2.1.2 Time-fading window model

The time-fading model (also called the damped model) assigns more weight to recently arrived transactions, so that new transactions have higher weights than old ones. At every moment, based on a fixed decay rate d, where 0 < d < 1, a transaction processed n time steps ago is assigned a weight d^n, and the count of any pattern occurring in that transaction is discounted accordingly. Figure 2-2 illustrates the conceptual model of a time-fading window.

[Figure 2-2: timeline t0, t1, …, ti, ti+1, …, tj−1, tj, tj+1. In window Wk, transaction ti has weight d^(j−i), …, tj−1 has weight d, and tj has weight 1. In window Wk+1, ti has weight d^(j−i+1), …, tj−1 has weight d^2, tj has weight d, and tj+1 has weight 1.]

Figure 2-2. Time-fading window model.

2.1.3 Sliding window model

In the sliding window model, a window that slides with time is kept to capture the data within a fixed time span or a fixed number of transactions. When the mining process is requested, only the data kept in the window is considered. A sliding window model keeps a window of size ω, which contains the last ω transactions that have arrived. As shown in Figure 2-3, when a new transaction arrives, the oldest transaction in the window is considered obsolete and is deleted to make room for the new one.

[Figure 2-3: timeline t0, t1, t2, …, ti, ti+1, ti+2, …, tj, tj+1, tj+2; windows W1, W2, and W3 have the same size and each is shifted one transaction later than the previous one.]

Figure 2-3. Sliding window model.

2.2 Indirect Association Mining

The concept of indirect associations was first proposed by Tan et al. [19] for interpreting the value of infrequent patterns. To facilitate the presentation of our framework, we provide a formal definition of indirect associations.

Definition 1 An itempair {a, b} is indirectly associated via a mediator M, denoted as ⟨a, b | M⟩, if the following conditions hold:
1. sup({a, b}) < σs (Itempair support condition);
2. sup({a} ∪ M) ≥ σf and sup({b} ∪ M) ≥ σf (Mediator support condition);
3. dep({a}, M) ≥ σd and dep({b}, M) ≥ σd (Mediator dependence condition);
where dep(P, Q) is a measure of the dependence between itemsets P and Q. The well-known dependence function, the IS measure, will be used in this thesis:

\[ IS(P, Q) = \frac{\Pr(P, Q)}{\sqrt{\Pr(P)\,\Pr(Q)}} \tag{1} \]

where Pr(⋅) denotes the probability function.

In recent years, the problem of indirect association mining has become more and more important because of its applications in different fields [5], [8], [12], [13], [17], [18], [20], [21], [22], [23]. Current research on indirect association mining can be divided into two categories: one focuses on proposing more efficient mining algorithms [5], [22], and the other on extending the definition of indirect association to fit different applications [5], [12], [13].

The original indirect association mining approach proposed by Tan and Kumar [19], [20], called "INDIRECT", is shown in Figure 2-4. In general, it can be divided into two phases: the frequent itemset mining phase (Step 1) and the indirect association mining phase (Steps 2-9).

Algorithm Name: INDIRECT
Input: Transaction database D, σs, σf, and σd.
Output: Indirect associations IA.
Steps:
1: Extract frequent itemsets L1, L2, …, Ln using a frequent itemset generation algorithm, where Li is the set of all frequent i-itemsets;
2: IA = ∅;
3: for k = 2 to n do
4:     Ck+1 = join(Lk, Lk);
5:     for each ⟨x, y | M⟩ ∈ Ck+1 do
6:         if (sup({x, y}) < σs and dep({x}, M) ≥ σd and dep({y}, M) ≥ σd) then
7:             IA = IA ∪ {⟨x, y | M⟩};
8:     endfor
9: endfor

Figure 2-4. The INDIRECT algorithm.
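The three conditions of Definition 1 can be made concrete with a small brute-force check. The sketch below is not the level-wise join of Figure 2-4; it simply tests every item pair against a list of candidate mediators supplied by the caller. The function names, the `mediators` argument, and the toy data are all hypothetical, and probabilities are estimated by relative support.

```python
from itertools import combinations
from math import sqrt


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def is_measure(p, q, transactions):
    """IS(P, Q) = Pr(P, Q) / sqrt(Pr(P) * Pr(Q)), estimating Pr by support."""
    denom = sqrt(support(p, transactions) * support(q, transactions))
    if denom == 0:
        return 0.0
    return support(frozenset(p) | frozenset(q), transactions) / denom


def indirect_associations(transactions, mediators, sigma_s, sigma_f, sigma_d):
    """Return every (a, b, M) satisfying the three conditions of Definition 1."""
    items = sorted(set().union(*transactions))
    found = []
    for m in map(frozenset, mediators):
        outside = [i for i in items if i not in m]
        for a, b in combinations(outside, 2):
            if support({a, b}, transactions) >= sigma_s:
                continue  # itempair support condition fails
            if min(support(m | {a}, transactions),
                   support(m | {b}, transactions)) < sigma_f:
                continue  # mediator support condition fails
            if min(is_measure({a}, m, transactions),
                   is_measure({b}, m, transactions)) < sigma_d:
                continue  # mediator dependence condition fails
            found.append((a, b, m))
    return found


# Toy data (hypothetical): a and b never co-occur, but each co-occurs with m.
toy = [frozenset(t) for t in ("am", "am", "bm", "bm", "c")]
result = indirect_associations(toy, mediators=[{"m"}],
                               sigma_s=0.2, sigma_f=0.3, sigma_d=0.5)
```

On this toy data the only pair that passes all three conditions is {a, b} with mediator {m}; the pairs involving c fail the mediator support condition.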

Chapter 3 The Proposed Generic Framework for Streaming Indirect Associations Mining

Although a considerable number of stream mining algorithms have been developed, each of them is confined to a specific type of window model. We conjecture, however, that a generic algorithm can be devised if a more general window model that encompasses all of the contemporary models can be specified. In this chapter, we first describe the generic streaming window model and then devise a generic framework for mining indirect associations.

3.1 Proposed Generic Streaming Window Model

Definition 2 Given a data stream S = (t0, t1, …, ti, …) as defined before, a generic window model Ψ is specified as a four-tuple Ψ(l, ω, s, d), where l denotes the timestamp at which the window starts, ω the window size, s the stride by which the window moves forward, and d the decay rate.

The stride s is introduced to allow the window to move forward by a batch of s transactions at a time. That is, if the current window is (tj−ω+1, tj−ω+2, …, tj), then the next window is (tj−ω+s+1, tj−ω+s+2, …, tj+s); the weight of a transaction within (tj−s+1, tj−s+2, …, tj), say α, decays to αd, and the weight of each transaction within (tj+1, …, tj+s) is 1. The concept of the proposed generic window model is depicted in Figure 3-1.

[Figure 3-1: timeline t0, t1, …, tl, …, tj−ω+1, tj−ω+2, …, tj−ω+s+1, …, tj−s+1, …, tj, tj+1, …, tj+s. Window Wk spans tj−ω+1 to tj and window Wk+1 spans tj−ω+s+1 to tj+s; a weight α in Wk becomes αd in Wk+1, and the newly arrived transactions receive weight 1.]

Figure 3-1. Generic window model.

Example 1 Let ω = 4, s = 2, l = t1, and d = 0.9. An illustration of the generic streaming window model is depicted in Figure 3-2. The first window W1 = W[1, 4] = (t1, t2, t3, t4) consists of two blocks, B1 = {AH, AI} and B2 = {AH, AH}, where B1 receives weight 0.9 and B2 receives weight 1. Next, the window moves forward with stride 2. That means B1 becomes outdated and a new block B3 is added, resulting in a new window W2 = W[3, 6] = (t3, t4, t5, t6).

Stream (ω = 4, s = 2, l = t1, d = 0.9):
t1 = AH, t2 = AI, t3 = AH, t4 = AH, t5 = AI, t6 = ABD, t7 = BCD, t8 = CD, t9 = BDE, t10 = DEFG, …

Window   Transactions (weight)
W1       AH (0.9), AI (0.9), AH (1), AH (1)
W2       AH (0.9), AH (0.9), AI (1), ABD (1)
W3       AI (0.9), ABD (0.9), BCD (1), CD (1)

Figure 3-2. An example for illustrating the generic window model.
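The windows and weights of Example 1 can be reproduced with a few lines of code. The following is a minimal sketch, assuming a finite window size ω that is a multiple of the stride s and a weight of d raised to the number of strides since a transaction's block arrived; the names `GenericWindow` and `weighted_window` are illustrative and do not come from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericWindow:
    l: int      # timestamp at which the first window starts
    omega: int  # window size (assumed finite and a multiple of s)
    s: int      # stride, i.e. block size
    d: float    # decay rate


def weighted_window(stream, psi, k):
    """Return [(transaction, weight)] for the k-th window, k = 1, 2, ..."""
    end = psi.l + psi.omega - 1 + (k - 1) * psi.s
    start = end - psi.omega + 1
    window = []
    for ts in range(start, end + 1):
        age = (end - ts) // psi.s  # strides elapsed since this block arrived
        window.append((stream[ts], psi.d ** age))
    return window


# Stream of Example 1, keyed by timestamp.
stream = {1: "AH", 2: "AI", 3: "AH", 4: "AH",
          5: "AI", 6: "ABD", 7: "BCD", 8: "CD"}
psi = GenericWindow(l=1, omega=4, s=2, d=0.9)
w1 = weighted_window(stream, psi, 1)
w2 = weighted_window(stream, psi, 2)
w3 = weighted_window(stream, psi, 3)
```

With these settings `w1`, `w2`, and `w3` list the same transactions and weights as windows W1, W2, and W3 of the example.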

Below we show that this generic window model can degenerate into any one of the models described in Section 2.1.

• Landmark model: Ψ(l, ∞, 1, 1). Since ω = ∞, there is no limitation on the window size, so the corresponding window at timestamp j is (tl, tl+1, …, tj) and is (tl, tl+1, …, tj, tj+1) at timestamp j+1.

• Time-fading model: Ψ(l, ∞, 1, d). This model is similar to the landmark model except that a decay rate less than 1 is specified.

• Sliding window model: Ψ(l, ω, 1, 1). Since the window size is limited to ω, the corresponding window at timestamp j is (tj−ω+1, tj−ω+2, …, tj) and is (tj−ω+2, …, tj, tj+1) at timestamp j+1.

3.2 Proposed Generic Framework for Indirect Associations Mining

According to the paradigm proposed by Tan and Kumar [19], [20], the work of indirect association mining can be divided into two subtasks: first, discover the set of frequent itemsets with support higher than σf, and then generate the set of qualified indirect associations from the frequent itemsets. Our framework adopts this paradigm and works in the following scenario:
(1) The user first sets the streaming window model by specifying the parameters described previously;
(2) The framework then executes the process for discovering and maintaining the set of frequent itemsets FP as the data continuously stream in;

(3) At any moment, once the user issues a query about the current indirect associations, the second process is executed to generate the set of qualified indirect associations IA from FP.

The proposed generic streaming framework for indirect associations mining is illustrated in Figure 3-3.

[Figure 3-3: the user provides model settings and adjustments; the data stream feeds Process 1 (discover and maintain FP), which accesses and updates FP; a user query triggers Process 2 (generate IA), which accesses FP and returns the result.]

Figure 3-3. Generic framework for indirect association mining.
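Step (1) of the scenario asks the user to choose the parameters of Ψ(l, ω, s, d). As a rough illustration, the three classical models of Section 3.1 can be written as parameter presets. The names `WindowSpec`, `landmark`, `time_fading`, `sliding`, and `window_bounds` are hypothetical, and the behaviour of a sliding window before ω transactions have arrived is an assumption of this sketch rather than something the text specifies.

```python
import math
from typing import NamedTuple


class WindowSpec(NamedTuple):
    l: int        # timestamp at which the window starts
    omega: float  # window size; math.inf means unbounded
    s: int        # stride
    d: float      # decay rate


def landmark(l):
    return WindowSpec(l, math.inf, 1, 1.0)


def time_fading(l, d):
    if not 0 < d < 1:
        raise ValueError("decay rate must satisfy 0 < d < 1")
    return WindowSpec(l, math.inf, 1, d)


def sliding(l, omega):
    return WindowSpec(l, omega, 1, 1.0)


def window_bounds(spec, j):
    """Timestamps (first, last) covered by the window at timestamp j >= spec.l."""
    if math.isinf(spec.omega):
        return spec.l, j
    # Assumption: until omega transactions have arrived, the window starts at l.
    return max(spec.l, j - spec.omega + 1), j
```

For instance, `window_bounds(landmark(3), 10)` gives `(3, 10)`, matching the window (tl, …, tj), and `window_bounds(sliding(0, 4), 10)` gives `(7, 10)`, matching (tj−ω+1, …, tj).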

Chapter 4 The Proposed Algorithms

Based on the generic framework proposed in Chapter 3, we present in this chapter two algorithms for accomplishing the two main processes, namely GIAMS-IND (Generic Indirect Association Mining over Streams derived from INDirect algorithm) and GIAMS-MED (Generic Indirect Association Mining over Streams by exploiting MEDiators). The two algorithms differ only in the second process, which generates the qualified indirect associations. Algorithm GIAMS-IND adopts the approach used in the INDIRECT algorithm, generating the indirect associations directly from the frequent itemsets. Algorithm GIAMS-MED adopts a novel and more efficient way of generating the indirect associations by utilizing the mediator sets.

4.1 Algorithm GIAMS-IND

First, we describe the data structure used in the GIAMS-IND algorithm. Conceptually, the set of frequent itemsets FP can be maintained in a monitoring lattice structure as described in [22]. From an implementation viewpoint, however, the lattice structure is inefficient to maintain and requires many pointers to keep track of the subset and superset relations. For this reason, we introduce a data structure called Card-Stree, a forest of search trees ST1, ST2, …, STk that keep the itemsets of different cardinalities appearing in the current window, where STk maintains the set of frequent k-itemsets.

Each node in Card-Stree except the root keeps the information of the maintained itemset. More specifically, for an itemset X, the information includes X.id, X.countv, and X.bidv, where X.id denotes the identifier of X, X.bidv is the vector of identifiers of the blocks in which X appears, and X.countv is the vector that stores the number of occurrences of X within each of those blocks. Figure 4-1 illustrates Card-Stree; for example, the accumulated number of occurrences of A is 2.17 in block 1 and 3.5 in block 3.

[Figure 4-1: root R with one search tree per itemset length (length = 1, length = 2, length = 3, …, length = n). Example nodes: X.id = A, X.bidv = [1, 3], X.countv = [2.17, 3.5]; X.id = CDE, X.bidv = [14, 15], X.countv = [15, 2].]

Figure 4-1. Illustration of Card-Stree for maintaining FP.

The GIAMS-IND algorithm is described in Figure 4-2. It consists of two concurrent processes: FP-monitoring and IA-generation. The first process is activated when the user specifies the window parameters to set the type of window model; it is responsible for generating itemsets from each incoming block of transactions and inserting those that are potentially frequent into Card-Stree. The second process is activated when the user issues a query about the current indirect associations; it is responsible for generating the qualified patterns from the Card-Stree maintained by process FP-monitoring. The main procedures in GIAMS-IND will be described in the following subsections: deleting an outdated block (BlockDelete), merging identical transactions (TransactionMerge), delaying the insertion of itemsets (DelayInsert), decaying the accumulated counts and pruning infrequent itemsets (Decay&Pruning), and generating indirect associations (IndirectAssociationGen).

The example shown in Figure 4-3 will be used to illustrate the processes. Hereafter, the parameter settings used with Figure 4-3 are window size ω = 10 and stride s = 5 (the block size); the mediator support threshold σf and the mediator dependence threshold σd are set to 0.3 and 0.1, the itempair support threshold σs is set to 0.1, and the decay rate d is 0.9.
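To make the node layout of Card-Stree concrete, the following minimal sketch stores the forest as a dictionary keyed by itemset cardinality. Plain dictionaries stand in for the search trees ST1, ST2, …, which is a simplification of this sketch and not the thesis's implementation; the names `Node` and `record` are hypothetical, and blocks are assumed to arrive in increasing id order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: frozenset                               # the itemset X
    bidv: list = field(default_factory=list)    # ids of blocks in which X appears
    countv: list = field(default_factory=list)  # count of X within each such block


def record(forest, itemset, bid, count):
    """Add `count` occurrences of `itemset` in block `bid` to the forest."""
    key = frozenset(itemset)
    tree = forest.setdefault(len(key), {})  # one "tree" per cardinality
    node = tree.setdefault(key, Node(key))
    if node.bidv and node.bidv[-1] == bid:
        node.countv[-1] += count            # same block: accumulate
    else:
        node.bidv.append(bid)               # new block: extend both vectors
        node.countv.append(count)
    return node


# Rebuild the two example nodes of Figure 4-1.
forest = {}
record(forest, "A", 1, 2.17)
record(forest, "A", 3, 3.5)
record(forest, "CDE", 14, 15)
record(forest, "CDE", 15, 2)
node_a = forest[1][frozenset("A")]
node_cde = forest[3][frozenset("CDE")]
```

Keeping `bidv` and `countv` as parallel vectors means that the contribution of one block can later be removed without touching the counts of the other blocks.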

Algorithm Name: GIAMS-IND
Input: Itempair support threshold σs, association support threshold σf, dependence threshold σd, decay rate d, window size ω, stride s, support error threshold ε.
Output: Indirect associations IA.

Initialization:
1. Let N be the accumulated number of transactions, N = 0;
2. Let η be the decayed accumulated number of transactions, η = 0;
3. Let cbid be the current block id, cbid = 0, and sbid the starting block id of the window, sbid = 1;
4. repeat
5.     Process 1;
6.     Process 2;
7. until terminate;

Process 1: FP-monitoring
1. Read the newly arrived block Bi;
2. cbid = cbid + 1;
3. N = N + s; η = η × d + s;
4. if (N > ω) then
5.     BlockDelete(sbid, Card-Stree); // Delete outdated block Bsbid
6.     sbid = sbid + 1;
7.     N = N − s; // Decrease the transaction size in the current window
8.     η = η − s × d^(cbid − sbid + 1); // Decrease the decayed transaction size in the current window
9. endif
10. TransactionMerge(Bcbid, CT); // Merge identical transactions into a compact table CT
11. DelayInsert(CT, Card-Stree, σf, cbid, η); // Construct Card-Stree using the transactions in CT
12. Decay&Pruning(d, s, ε, cbid, Card-Stree); // Remove infrequent itemsets from Card-Stree

Process 2: IA-generation
1. if user query request = true then
2.     IndirectAssociationGen(Card-Stree, σf, σd, σs, N); // Generate all indirect associations

Figure 4-2. The GIAMS-IND algorithm.
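The bookkeeping in steps 2 to 9 of Process 1 can be traced numerically. The sketch below assumes that step 8 subtracts s × d^(cbid − sbid + 1), which is this edition's reading of the superscript in Figure 4-2; the function name `advance` and the tuple layout are hypothetical. It tracks only N, η, cbid, and sbid and leaves out the Card-Stree updates.

```python
def advance(state, s, omega, d):
    """Apply steps 2-9 of Process 1 once; return (new_state, evicted_block_id)."""
    n, eta, cbid, sbid = state
    cbid += 1                  # step 2: a new block has arrived
    n += s                     # step 3: plain transaction count
    eta = eta * d + s          # step 3: decayed transaction count
    evicted = None
    if n > omega:              # step 4: the window overflows
        evicted = sbid         # step 5: this block would be passed to BlockDelete
        sbid += 1              # step 6
        n -= s                 # step 7
        eta -= s * d ** (cbid - sbid + 1)  # step 8 (assumed exponent)
    return (n, eta, cbid, sbid), evicted


# Settings of the running example: omega = 10, s = 5, d = 0.9.
state = (0, 0.0, 0, 1)
history = []
for _ in range(3):
    state, evicted = advance(state, s=5, omega=10, d=0.9)
    history.append((state, evicted))
```

After block 2, N = 10 and η = 5 × 0.9 + 5 = 9.5. After block 3, block 1 is evicted and its decayed weight 5 × 0.9² = 4.05 is removed, so η returns to 9.5, which is the decayed size of the two blocks that remain in the window.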

Block     TID     Transaction
Block 1   TID1    AH
          TID2    AI
          TID3    AH
          TID4    AH
          TID5    AI
Block 2   TID6    ABD
          TID7    BCD
          TID8    CD
          TID9    BDE
          TID10   DEFG
Block 3   TID11   ACFG
          TID12   ADEGH
          TID13   BCFG
          TID14   CDEGH
          TID15   ADEFG
Block 4   …

Figure 4-3. An example data stream whose first three blocks are displayed.

4.1.1 Procedure BlockDelete

When a new block of transactions arrives, some transactions may become too old and fall outside the range of observation. In the generic model, when the addition of a new block makes the window size exceed ω, we delete the oldest block, whose id is kept in sbid. We then have to check all the itemsets maintained in Card-Stree; if an itemset appeared in block sbid, we delete the stored block id and count. Figure 4-5 illustrates procedure BlockDelete on an example.

Consider the example in Figure 4-3 at the point where block 2 has been processed and block 3 is about to be processed. Since the window accommodates only two blocks, block 1 becomes outdated, and the information of all itemsets that appeared in block 1 must be updated. Node A received count 4.05 from block 1, and this entry is now deleted. Node I appeared only in block 1, so its bidv and countv become empty, and node I itself is deleted. The procedure BlockDelete is described in Figure 4-4.

(26) Procedure Name: BlockDelete
Input: Card-Stree, and sbid, the block id at which the window starts.
Output: Updated Card-Stree.
Steps:
  1. foreach node p in Card-Stree do
  2.   if p.bidv[1] = sbid then
  3.     delete the first value in p.bidv and the corresponding count in p.countv;
  4.   if p.bidv is empty then
  5.     delete p;
  6. endfor
Figure 4-4. Description of procedure BlockDelete.

[Figure 4-5: Card-Stree with root R and levels length = 1, 2, 3, ...; node A with bidv = [1, 2] and countv = [4.05, 0.9]; node I with bidv = [1] and countv = [1.62].]
Figure 4-5. An example for the BlockDelete procedure.

Next, our algorithm counts the occurrences of all itemsets contained in block 3, inserting newly generated itemsets into Card-Stree and updating the counts of existing itemsets. We adopt two optimization techniques to further reduce the execution time of this process: transaction-merging and delay-insertion.
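The BlockDelete procedure of Figure 4-4 can be sketched in a few lines of Python. This is only an illustrative sketch: the `Node` class and the flat dictionary standing in for Card-Stree are assumptions made here for brevity, not the data structure used in the thesis implementation. The sample values are those of nodes A and I in Figure 4-5.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical Card-Stree entry: block ids and decayed counts of one itemset."""
    bidv: list = field(default_factory=list)    # block ids, oldest first
    countv: list = field(default_factory=list)  # decayed count per block id


def block_delete(tree: dict, sbid: int) -> None:
    """Drop the contribution of block `sbid`; delete nodes left with no blocks."""
    for key in list(tree):              # copy the keys: we delete while iterating
        node = tree[key]
        if node.bidv and node.bidv[0] == sbid:
            node.bidv.pop(0)            # forget the outdated block id
            node.countv.pop(0)          # and the count it contributed
        if not node.bidv:
            del tree[key]               # the itemset no longer occurs in the window


# Nodes A and I as drawn in Figure 4-5 (assumed flat layout, keyed by itemset).
tree = {
    frozenset("A"): Node([1, 2], [4.05, 0.9]),
    frozenset("I"): Node([1], [1.62]),
}
block_delete(tree, sbid=1)
```

After the call, node A keeps only its block-2 entry and node I has been removed, which mirrors the walk-through in Section 4.1.1.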

(27) 4.1.2 Procedure TransactionMerge
A dataset usually contains many transactions with the same content. It is therefore natural to compress the dataset by keeping only one tuple for each group of identical transactions. The procedure described in Figure 4-6 adopts this simple concept to compress the input block into a compact table CT composed of three attributes: the transaction id, tid; the stored transaction, trans; and the number of identical transactions, count. Figure 4-7 illustrates the resulting CT after performing transaction merging on the first block. In this example, transaction AH appears three times in block 1 and is merged into one tuple with count 3.

Procedure Name: TransactionMerge
Input: Block Bi and compact table CT(tid, trans, count).
Output: CT.
Steps:
  1. foreach transaction x in Bi do
  2.   if x is in CT then
  3.     Let tr be the transaction equal to x;
  4.     tr.count = tr.count + 1;
  5.   else
  6.     Insert x into CT with trans = x and count = 1;
  7.   endif
Figure 4-6. Description of procedure TransactionMerge.

Compact Table (trans, count)    Block 2          Block 3            Block 4
AH   3                          TID6   ABD       TID11  ACFG        ...
AI   2                          TID7   BCD       TID12  ADEGH
                                TID8   CD        TID13  BCFG
                                TID9   BDE       TID14  CDEGH
                                TID10  BDEFG     TID15  ADEFG
Time →
Figure 4-7. The resulting compact table CT after merging the first block.
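As a rough illustration of transaction merging, the following Python sketch compresses block 1 of Figure 4-3 into a table of distinct transactions and their multiplicities. Representing CT as a `collections.Counter` keyed by `frozenset` is an assumption of this sketch; the tid attribute of CT is omitted.

```python
from collections import Counter


def transaction_merge(block):
    """Merge identical transactions of one block into a compact table."""
    ct = Counter()
    for transaction in block:
        ct[frozenset(transaction)] += 1   # one tuple per distinct transaction
    return ct


# Block 1 of the example stream: TID1..TID5.
block1 = ["AH", "AI", "AH", "AH", "AI"]
ct = transaction_merge(block1)
```

Here `ct` holds two tuples, AH with count 3 and AI with count 2, as in Figure 4-7, so the subsets of each distinct transaction need to be enumerated only once.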

(28) 4.1.3 Procedure DelayInsert
The concept of delay insertion was first introduced in Hidber's work [10]. Briefly, a newly generated itemset can be inserted into the lattice only if all of its subsets are potentially frequent and maintained in the lattice. The intuition behind this strategy is the apriori principle: if any subset of an itemset is not significant, neither is the itemset itself. Since our Card-Stree maintains all 1-itemsets and 2-itemsets seen so far, this optimization applies to k-itemsets with cardinality k ≥ 3. Consider the example shown in Figure 4-3. After the first block has been processed, the second block arrives. The first new transaction is TID6 = {A, B, D}, from which we can generate the 1-itemsets {A}, {B}, {D} and the 2-itemsets {A, B}, {A, D}, and {B, D}. These itemsets are inserted immediately into Card-Stree, or their bidvs and countvs are updated if they already exist. Finally, itemset {A, B, D} can be inserted into Card-Stree only when all of its 2-subsets are frequent, as shown in Figure 4-8. The algorithmic process of this procedure is described in Figure 4-9.

[Figure 4-8: compact table for block 2 (ABD 1, BCD 1, CD 1, BDE 1, BDEFG 1); for the first transaction ABD, the 1- and 2-itemsets A, B, D, AB, AD, BD are generated; if all of ABD's 2-subsets are frequent, then ABD is generated and inserted into Card-Stree.]
Figure 4-8. An illustration of delay insertion when processing the first transaction in block 2.
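The delay-insertion rule can be sketched as follows. This is a simplified illustration under stated assumptions: Card-Stree is modelled as a dictionary from itemsets to per-block counts, and `is_frequent` is a hypothetical predicate supplied by the caller to decide whether a maintained subset is potentially frequent; neither is taken from the thesis code.

```python
from itertools import combinations


def delay_insert(ct, tree, cbid, is_frequent):
    """Insert the subsets of each compact-table transaction into `tree`.

    ct:   dict mapping frozenset transaction -> multiplicity
    tree: dict mapping frozenset itemset -> {block id: count}
    Itemsets of size >= 3 are inserted only if every immediate subset
    is already maintained and passes `is_frequent` (assumed predicate).
    """
    for trans, mult in ct.items():
        items = sorted(trans)
        for k in range(1, len(items) + 1):        # level by level, subsets first
            for combo in combinations(items, k):
                x = frozenset(combo)
                if k >= 3:
                    immediate = [x - {i} for i in x]
                    if not all(s in tree and is_frequent(s) for s in immediate):
                        continue                  # delay the insertion of x
                counts = tree.setdefault(x, {})
                counts[cbid] = counts.get(cbid, 0) + mult


# TID6 = ABD arriving in block 2, under two extreme predicates.
eager = {}
delay_insert({frozenset("ABD"): 1}, eager, 2, lambda s: True)
delayed = {}
delay_insert({frozenset("ABD"): 1}, delayed, 2, lambda s: False)
```

With the permissive predicate, ABD is inserted along with its six proper subsets; with the restrictive one, only the 1- and 2-itemsets are stored and ABD is held back.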

(29) Procedure Name: DelayInsert
Input: CT, Card-Stree, cbid.
Output: Updated Card-Stree.
Steps:
  1. foreach transaction tr in CT do
  2.   foreach subset X of tr.trans do
  3.     if |X| = 1 then
  4.       Insert X with X.bid = cbid into ST1, or update its count in ST1;
  5.     else if |X| = 2 then
  6.       Insert X with X.bid = cbid into ST2, or update its count in ST2;
  7.     else if all the immediate subsets of X are in ST|X|−1 then
           Insert X with X.bid = cbid into ST|X|;
  8.     endif
Figure 4-9. Description of procedure DelayInsert.

4.1.4 Procedure Decay&Pruning
One of the design issues for efficient stream data mining algorithms is limited memory. That is, an efficient stream data mining algorithm should consume as little memory as possible. An intuitive way to achieve this is to eliminate those itemsets that are least likely to become frequent afterwards. The key point is how to determine a reasonable threshold for deleting infrequent itemsets. In this procedure, after the newly arrived block has been inserted, we first decay the accumulated count of each maintained itemset, and then prune any itemset X whose count satisfies the following condition:

$X.count < \varepsilon \times s \times (1 + d + d^{2} + \dots + d^{\,cbid - sbid}) = \varepsilon \times s \times \dfrac{1 - d^{\,cbid - sbid + 1}}{1 - d}$   (2)

where cbid denotes the current block id and sbid denotes the id of the first block in which X appears and enters Card-Stree. Note that the term $s \times (1 + d + d^{2} + \dots + d^{\,cbid - sbid})$ equals the decayed amount of transactions between the first block in which X appears and the current block within the current window. Therefore, we delete X when its count is less than a fraction ε of this value. In Chapter 5, we will prove that this pruning condition guarantees the error of the

(30) itemset support generated by our algorithm is bounded by ε. Consider an itemset ABD. Let ε = 0.1 and d = 0.9. Assume ABD was first inserted in block 2, the current block id is 5, and ABD's count is 5. Taking the block size s = 5 of the example in Figure 4-3, ABD can be pruned when its count is less than 0.1 × 5 × (1 − 0.9^(5−2+1)) / (1 − 0.9) ≈ 1.72; since its count exceeds this value, ABD is kept. The algorithmic description of this procedure is shown in Figure 4-10.

Procedure Name: Decay&Pruning
Input: d, s, ε, cbid, Card-Stree.
Output: Updated Card-Stree.
Steps:
  1. foreach node p in Card-Stree do   // sum is reset to 0 for each node
  2.   foreach block bid stored in p do
  3.     p.countv[bid] = p.countv[bid] × d;   // Decay the count in block bid
  4.     sum = sum + p.countv[bid];   // Compute the accumulated count
  5.   endfor
  6.   sbid = p.bidv[1];   // The first block in which p appears
  7.   if sum < ε × s × (1 − d^(cbid − sbid + 1)) / (1 − d) then
  8.     delete p;
  9. endfor
Figure 4-10. Description of procedure Decay&Pruning.

4.1.5 Indirect Association Generation
This procedure for generating indirect associations adopts the concept used by the INDIRECT algorithm. A candidate indirect association is generated by checking two frequent k-itemsets, k ≥ 2, say A and B. If the cardinality of the intersection of A and B is k − 1, i.e., M = A ∩ B and |M| = k − 1, then M is a candidate mediator of A and B. Let {a} = A − M and {b} = B − M. Then we only have to check whether sup({a, b}) < σs, dep({a}, M) ≥ σd and dep({b}, M) ≥ σd to determine whether ⟨a, b | M⟩ is an indirect association. This procedure is described in Figure 4-11 and is illustrated on the itemsets stored in Card-Stree for the example in Figure

(31) 4-3. After generating all frequent 2-itemsets, for example, we can join {A, D} and {A, H} to obtain the candidate indirect association ⟨D, H | A⟩. Figure 4-12 shows the process of generating indirect associations.

Procedure Name: IndirectAssociationGen
Input: Card-Stree, σf, σs, and σd.
Output: Indirect association rule set IA.
Steps:
  1. Let L1, L2, …, Ln be the sets of all frequent i-itemsets in Card-Stree;
  2. IA = ∅;
  3. for k = 2 to n do
  4.   Ck+1 = Join(Lk, Lk);
  5.   for each (a, b, M) ∈ Ck+1 do
  6.     if (sup({a, b}) < σs and dep({a}, M) ≥ σd and dep({b}, M) ≥ σd)
  7.       IA = IA ∪ {⟨a, b | M⟩};
  8. endfor
Figure 4-11. Description of procedure IndirectAssociationGen.
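The join-and-check step of Figure 4-11 can be sketched as below. Several things here are assumptions made only for illustration: supports are passed in as a plain dictionary rather than read from Card-Stree, the dependence measure is taken to be the IS measure (joint support divided by the geometric mean of the two supports), and the numeric supports in the example are invented so that the candidate ⟨D, H | A⟩ from the text is produced.

```python
from itertools import combinations
from math import sqrt


def is_measure(sup, x, m):
    """Assumed dependence measure: sup(x ∪ m) / sqrt(sup(x) * sup(m))."""
    return sup[x | m] / sqrt(sup[x] * sup[m])


def indirect_association_gen(sup, sigma_f, sigma_s, sigma_d, dep=is_measure):
    """Join frequent k-itemsets (k >= 2) sharing k-1 items; test each candidate."""
    frequent = [x for x, s in sup.items() if len(x) >= 2 and s >= sigma_f]
    rules = set()
    for a_set, b_set in combinations(frequent, 2):
        if len(a_set) != len(b_set):
            continue
        m = a_set & b_set                       # candidate mediator
        if len(m) != len(a_set) - 1:
            continue
        a, b = a_set - m, b_set - m             # the two singleton items
        if (sup.get(a | b, 0.0) < sigma_s
                and dep(sup, a, m) >= sigma_d
                and dep(sup, b, m) >= sigma_d):
            rules.add((a | b, m))               # stands for <a, b | M>
    return rules


fs = frozenset
# Made-up supports, for illustration only.
sup = {fs("A"): 0.6, fs("D"): 0.5, fs("H"): 0.4,
       fs("AD"): 0.4, fs("AH"): 0.3, fs("DH"): 0.05}
rules = indirect_association_gen(sup, sigma_f=0.2, sigma_s=0.1, sigma_d=0.5)
```

With these numbers, {A, D} and {A, H} are the only frequent 2-itemsets, their join gives mediator {A}, and the itempair {D, H} is infrequent, so the single rule ⟨D, H | A⟩ is returned.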

(32) [Figure 4-12: frequent itemsets L1 = {A, B, C, D, E, F, G} and L2 = {AD, AH, BD, CG, DE, DG, EG, FG}; joining pairs of itemsets in L2 yields the candidates (Item A, Item B | Mediator): (D, H | A), (A, B | D), (A, E | D), (A, G | D), (B, E | D), (B, G | D), (C, D | G), (C, E | G), ...]
Figure 4-12. An illustration of indirect association generation by procedure IndirectAssociationGen.

4.2 Algorithm GIAMS-MED
As mentioned, the only difference between GIAMS-MED and GIAMS-IND lies in the second process, which generates indirect associations. Therefore, in this section we describe only the design of this process, named procedure IndirectAssociationGen-Med. The basic idea of our design is to exploit a property of mediators, namely the support threshold that a qualified mediator must meet, to reduce the number of

(33) candidate mediators. In the following, we first show that for an itemset X to be a qualified mediator, the support of X must be no less than a threshold σm = 2σf − σs.

Theorem 1. The support of a mediator M is no less than σm = 2σf − σs, i.e., sup(M) ≥ σm.

Proof: First, consider any three sets A, B, and C. According to set theory, we have the following relation (see Figure 4-13):

$C \supseteq (A \cap C) \cup (B \cap C)$   (3)

Figure 4-13. Visualization of the concept revealed in (3).

Then from (3), according to the inclusion-exclusion principle and the fact that $|A \cap B \cap C| \le |A \cap B|$, we can derive

$|C| \ge |(A \cap C) \cup (B \cap C)| = |A \cap C| + |B \cap C| - |A \cap B \cap C| \ge |A \cap C| + |B \cap C| - |A \cap B|$   (4)

Now, let us consider a qualified indirect association ⟨x, y | M⟩ (see Figure 4-14).

(34) [Figure 4-14: x and y are each connected to M with support ≥ σf; x and y are connected to each other with support < σs.]
Figure 4-14. The relations between x, y and M.

Let TXM, TYM, and TM be the sets of transactions containing {x} ∪ M, {y} ∪ M, and M, respectively. Then according to (4) we have

$|T_M| \ge |T_{XM}| + |T_{YM}| - |T_{XM} \cap T_{YM}|$   (5)

That is, sup(M) ≥ sup({x} ∪ M) + sup({y} ∪ M) − sup({x, y} ∪ M). From Definition 1, we know that the supports of itemsets {x} ∪ M and {y} ∪ M are both no less than σf and that the support of the indirect itempair {x, y} is less than σs. Note that sup({x, y} ∪ M) ≤ sup({x, y}). We can thus conclude that if M is a mediator, then the support of M is no less than 2σf − σs.

The main deficiency of IndirectAssociationGen is that it generates many candidate indirect associations. We observe that for most of these candidates the support of the mediator is smaller than σm; in other words, these candidates cannot become indirect association rules. To reduce the number of candidate indirect associations, GIAMS-MED uses σm as an additional threshold, so that every candidate indirect association it generates has a mediator whose support is no less than σm. This avoids unnecessary computation.
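The inequality behind Theorem 1 can be checked numerically on small data. The sketch below is only a sanity check, not part of the proposed algorithms: it computes σm for the thresholds used later in the experiments (σf = 0.012, σs = 0.01) and verifies by brute force, on block 3 of the example stream, that count(M) ≥ count({x} ∪ M) + count({y} ∪ M) − count({x, y}) for every single-item mediator M. The function names are hypothetical.

```python
from itertools import combinations


def mediator_threshold(sigma_f, sigma_s):
    """Lower bound on the support of any qualified mediator (Theorem 1)."""
    return 2 * sigma_f - sigma_s


def count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(wanted <= set(t) for t in transactions)


def bound_holds(transactions):
    """Check count(M) >= count(xM) + count(yM) - count(xy) for 1-item mediators."""
    items = sorted(set("".join(transactions)))
    for m in items:
        others = [i for i in items if i != m]
        for x, y in combinations(others, 2):
            lhs = count(transactions, m)
            rhs = (count(transactions, x + m) + count(transactions, y + m)
                   - count(transactions, x + y))
            if lhs < rhs:
                return False
    return True


sigma_m = mediator_threshold(0.012, 0.01)
block3 = ["ACFG", "ADEGH", "BCFG", "CDEGH", "ADEFG"]
ok = bound_holds(block3)
```

Integer counts are used instead of supports so that the comparison is not affected by floating-point rounding.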

(35) Procedure Name: IndirectAssociationGen-Med
Input: d, σs, σf, σd, η, Card-Stree.
Output: Set of indirect associations IA.
Steps:
  1. Let σm = 2σf − σs;
  2. Let L1 be the set of all 1-itemsets in Card-Stree with support no less than σf;
  3. Let M1 be the set of all 1-itemsets in Card-Stree with support no less than σm;   // generate mediators of length 1
  4. k = 1;
  5. C2 = join(L1, L1)
  6. foreach itemset X ∈ C2 and X.count < σs × η do   // generate IIS
  7.   insert X into IIS;   // X is a candidate itempair
  8. IA = ∅;
  9. while (Mk ≠ ∅) do   // generate indirect association rules
  10.   foreach {a, b} ∈ IIS do
  11.     foreach M ∈ Mk do
  12.       if dep({a}, M) ≥ σd and dep({b}, M) ≥ σd then
  13.         IA = IA ∪ {⟨a, b | M⟩};
  14.   endfor
  15.   Ck+1 = join(Mk, Mk)   // generate next-level candidate mediators
  16.   foreach itemset X ∈ Ck+1 and X.count ≥ σm × η do
  17.     insert X into Mk+1;
  18.   k = k + 1;
  19. endwhile
Figure 4-15. Description of procedure IndirectAssociationGen-Med in algorithm GIAMS-MED.

The description of procedure IndirectAssociationGen-Med is shown in Figure 4-15. Figure 4-16 and Figure 4-17 illustrate the process of the procedure. First, it generates all frequent 1-itemsets, and then forms candidate 2-itemsets from those frequent (σf) 1-itemsets. Any 2-itemset whose count is less than the indirect itempair threshold (σs) is inserted into the candidate set IIS. After generating all frequent 1-itemsets, the procedure also finds candidate mediators (those with support no less than σm) from the (k−1)-mediators, employing a level-wise generation of all possible

(36) mediators.

[Figure 4-16: Card-Stree root R; L1 = {A, B, C, D, E, F, G}; IIS = {AB}; mediator arrays {D, G} and {DG}.]
Figure 4-16. An example of generating mediators and indirect itempairs.

Finally, after generating IIS and the mediators, we form candidate indirect association rules by joining an itempair in IIS with a mediator, and we output a rule if it satisfies the dependence condition (σd) and the mediator support threshold (σf). For example, as Figure 4-17 shows, the itempair {A, B} and mediator D form the candidate indirect association rule ⟨A, B | D⟩. If both {A, D} and {B, D} have supports no less than the mediator support threshold (σf) and satisfy the dependence condition, then A and B are indirectly associated via D.

(37) [Figure 4-17: IIS = {AB}; M1 = {D, G}; M2 = {DG}; candidate rules (Item A, Item B | Mediator): (A, B | D), (A, B | G), (A, B | DG).]
Figure 4-17. An illustration of generating indirect association rules from mediators and IIS.
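The mediator-first strategy of Figure 4-15 can be sketched roughly as follows. The sketch works on supports instead of decayed counts (so η does not appear), assumes the IS measure as the dependence function, and adds a guard that skips mediators overlapping the itempair; these choices, the function names, and the numeric supports are all assumptions for illustration, chosen so that the three candidates of Figure 4-17 come out.

```python
from itertools import combinations
from math import sqrt


def is_measure(sup, x, m):
    """Assumed dependence measure: sup(x ∪ m) / sqrt(sup(x) * sup(m))."""
    return sup.get(x | m, 0.0) / sqrt(sup[x] * sup[m])


def indirect_association_gen_med(sup, sigma_f, sigma_s, sigma_d, dep=is_measure):
    sigma_m = 2 * sigma_f - sigma_s                       # Theorem 1
    l1 = [x for x in sup if len(x) == 1 and sup[x] >= sigma_f]
    mk = [x for x in sup if len(x) == 1 and sup[x] >= sigma_m]
    # Infrequent itempairs built from frequent items (the set IIS).
    iis = [a | b for a, b in combinations(l1, 2)
           if sup.get(a | b, 0.0) < sigma_s]
    rules = set()
    while mk:
        for pair in iis:
            a, b = (frozenset([i]) for i in sorted(pair))
            for m in mk:
                if pair & m:
                    continue          # added guard: mediator must exclude a and b
                if dep(sup, a, m) >= sigma_d and dep(sup, b, m) >= sigma_d:
                    rules.add((pair, m))
        k = len(mk[0])
        # Level-wise join: only itemsets meeting sigma_m become (k+1)-mediators.
        candidates = {p | q for p, q in combinations(mk, 2) if len(p | q) == k + 1}
        mk = [c for c in candidates if sup.get(c, 0.0) >= sigma_m]
    return rules


fs = frozenset
# Made-up supports, for illustration only.
sup = {fs("A"): 0.4, fs("B"): 0.4, fs("D"): 0.8, fs("G"): 0.7,
       fs("AB"): 0.05, fs("AD"): 0.35, fs("BD"): 0.35, fs("AG"): 0.3,
       fs("BG"): 0.3, fs("DG"): 0.6, fs("ADG"): 0.3, fs("BDG"): 0.3}
rules = indirect_association_gen_med(sup, sigma_f=0.3, sigma_s=0.1, sigma_d=0.5)
```

Because mediators are grown only from itemsets that already meet σm, itemsets such as {A, D} never enter the join, which is the source of the savings analysed in Section 5.2.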

(38) Chapter 5. Theoretical Analyses
In this chapter, we analyze some properties of the two proposed algorithms. First, we prove that the support error of any frequent itemset discovered by the two algorithms is no larger than the user-specified error threshold ε. Then we provide a theoretical performance comparison between GIAMS-MED and GIAMS-IND in indirect association generation.

5.1 Support Error Bound Analysis
In this section, we show that the pruning technique used in procedure Decay&Pruning always guarantees an error bounded by the user-specified threshold. To facilitate the discussion, we introduce some new notation. Let the true support of an itemset X, called Tsup(X), be the fraction of transactions so far containing X, and let the estimated support of X, called Esup(X), be the support obtained from the counts maintained in Card-Stree. We will show that the difference between Esup(X) and Tsup(X) is no larger than the support error threshold ε.
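For reference, the pruning rule whose error is analysed in this section (procedure Decay&Pruning, Figure 4-10) can be sketched as below. The dictionary layout standing in for Card-Stree and the helper `decayed_blocks` are assumptions of this sketch; the helper also covers d = 1, where the closed form of the geometric sum is undefined. The example reuses ε = 0.1, d = 0.9 and the itemset ABD from Section 4.1.4 with an assumed block size s = 5; the second itemset and its count are invented.

```python
def decayed_blocks(d, n):
    """1 + d + ... + d^(n-1): decayed weight of n consecutive blocks."""
    return float(n) if d == 1 else (1 - d ** n) / (1 - d)


def decay_and_prune(tree, d, s, eps, cbid):
    """Decay every stored count by d, then prune itemsets below the threshold.

    tree: dict mapping itemset -> {block id: decayed count} (assumed layout)
    """
    for key in list(tree):                     # copy keys: we may delete entries
        counts = tree[key]
        for bid in counts:
            counts[bid] *= d                   # decay the count of each block
        first_bid = min(counts)                # first block in which the itemset appears
        threshold = eps * s * decayed_blocks(d, cbid - first_bid + 1)
        if sum(counts.values()) < threshold:
            del tree[key]


tree = {
    frozenset("ABD"): {2: 5.0},   # count 5, first seen in block 2
    frozenset("XYZ"): {2: 1.0},   # hypothetical rare itemset
}
decay_and_prune(tree, d=0.9, s=5, eps=0.1, cbid=5)
```

With these values the threshold is about 1.72, so ABD (decayed count 4.5) is kept while the rare itemset (decayed count 0.9) is pruned.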

(39) [Figure 5-1: timeline of blocks from sbid through xbid−1 and xbid to cbid; the decayed amount η_{xbid−1} is decayed by d in each subsequent block, giving η_{xbid−1} × d, η_{xbid−1} × d², ..., η_{xbid−1} × d^{cbid−xbid+1} (maximal possible error); the pruning threshold is marked with (1 − d^{cbid−xbid+1}) / (1 − d).]
Figure 5-1. The description of maximal possible error and pruning threshold.

Theorem 2. $Tsup(X) - Esup(X) \le \varepsilon$.

Proof: Consider a generated frequent itemset X. Let sbid be the identifier of the starting block of the current window, xbid be the smallest identifier of a block in which itemset X appears and remains in the current Card-Stree, and cbid be the current block. Since the count information of X within blocks xbid to cbid is maintained, the maximum counting error equals the part dropped in blocks sbid to xbid−1. This concept is illustrated in Figure 5-1. Let η_{xbid−1} denote the decayed accumulated amount of transactions when processing block xbid−1. This value continues to decay while blocks xbid to cbid are processed. Then the difference between the estimated count of itemset X, Ecount(X), and the true count of itemset X, Tcount(X), is no larger than ε × η_{xbid−1} × d^{cbid−xbid+1}. Thus, we have

$Tcount(X) - Ecount(X) \le \varepsilon \times \eta_{xbid-1} \times d^{\,cbid - xbid + 1}$   (6)

Dividing (6) by the current decayed transaction size η_{cbid}, we obtain

(40) $Tsup(X) - Esup(X) \le \varepsilon \times \dfrac{\eta_{xbid-1} \times d^{\,cbid - xbid + 1}}{\eta_{cbid}}$   (7)

Since the minimum and maximum values of xbid are 1 and cbid, respectively, it follows that the difference between Esup(X) and Tsup(X) is no larger than the error threshold ε when we prune itemsets whose count is less than ε × η_{xbid}.

5.2 Performance Comparison
In this section we compare the performance of GIAMS-IND and GIAMS-MED from a theoretical viewpoint. Since the two algorithms differ only in the second process, indirect association generation, we focus on this process. It suffices to show how many candidate mediators can be pruned by IndirectAssociationGen-Med as compared with IndirectAssociationGen.

Assume that the number of frequent 2-itemsets is n. Then procedure IndirectAssociationGen requires $C^{n}_{2} = n(n-1)/2$ joins. Recall that in IndirectAssociationGen-Med we divide the frequent 2-itemsets into two parts according to σm. One part contains those with support no less than σf but less than σm; let the number of these itemsets be n1. The other consists of those whose support is no less than σm; let its size be n2. So n = n1 + n2. The cost of procedure IndirectAssociationGen-Med is

$C^{n_2}_{2} + (n_1 \times n_2) = \dfrac{n_2 (n_2 - 1)}{2} + n_1 n_2$   (8)

(41) Therefore, the difference between IndirectAssociationGen and IndirectAssociationGen-Med will be

$\dfrac{n(n-1)}{2} - \left( \dfrac{n_2 (n_2 - 1)}{2} + n_1 n_2 \right) = \dfrac{(n_1 + n_2)^2 - (n_1 + n_2) - n_2^{2} + n_2 - 2 n_1 n_2}{2} = \dfrac{n_1^{2} - n_1}{2}$   (9)

That is, the performance improvement of IndirectAssociationGen-Med over IndirectAssociationGen depends on the gap between σf and σm: the larger the gap, the greater the number n1.
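The closed form in (9) is easy to check numerically. The short sketch below counts the joins of both procedures under the cost model of this section and compares the difference with (n1² − n1)/2; the function names and the sample sizes n1 = 300, n2 = 100 are arbitrary illustrations, not measurements.

```python
from math import comb


def joins_ind(n):
    """Joins needed by IndirectAssociationGen for n frequent 2-itemsets."""
    return comb(n, 2)


def joins_med(n1, n2):
    """Cost of IndirectAssociationGen-Med as given in (8)."""
    return comb(n2, 2) + n1 * n2


def saving(n1, n2):
    """Number of joins avoided by IndirectAssociationGen-Med, as in (9)."""
    return joins_ind(n1 + n2) - joins_med(n1, n2)


example = saving(300, 100)   # arbitrary sizes chosen for illustration
```

The saving depends only on n1, the number of frequent 2-itemsets whose support falls below σm, which is why a wider gap between σf and σm favours GIAMS-MED.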

(42) Chapter 6 Experimental Results
To evaluate the performance and effectiveness of the proposed algorithms, GIAMS-IND and GIAMS-MED, we conducted comprehensive experiments on synthetic as well as real datasets, considering all three commonly used window models: the landmark, time-fading, and sliding window models. We evaluated three aspects: execution time, memory usage, and pattern accuracy. All experiments were done on an AMD X3-425 (2.7 GHz) PC with 3 GB of main memory, running the Windows XP operating system. All algorithms were implemented in Visual C++ 2008.

6.1 Evaluation on Synthetic Data
The synthetic dataset T5.I5.N0.1K.D1000K, generated using the program in [2], was used in our experiments. In each of the following sections, we specialize the generic data stream model to one of three common data stream models: the landmark window model, the time-fading window model, and the sliding window model.

6.1.1 Landmark window model
Mediator support threshold: We first examine the effect of varying the mediator support threshold. In this experiment, the mediator support threshold σf was varied from 0.01 to 0.018. The other parameter settings are as follows.

(43)
ω    s        d    σs      σd     ε
∞    10000    1    0.01    0.1    0.001

Figure 6-1 depicts the execution time and memory usage of running process 1 for generating frequent itemsets. Note that this process is the same for both GIAMS-IND and GIAMS-MED. It can be seen that the larger σf is, the less execution time the algorithm spends, since a larger σf results in fewer itemsets. The overall trend is linear in the transaction size. The memory usage exhibits a similar phenomenon: the lower σf is, the more memory is consumed.

[Figure 6-1: T5.I5.N0.1K.D1000K; x-axis: transaction size (80K to 960K); y-axes: time (sec) and memory (MB); series: memory and time for σf = 0.01, 0.012, 0.014, 0.016, 0.018.]
Figure 6-1. Execution time and memory usage for running process 1, with varying transaction sizes and σf values.

Stride: In this experiment, we examine the effect of varying the stride (block size). The stride value was varied from 10000 to 80000. The other parameter settings are as follows.

(44)
ω    d    σs      σf      σd     ε
∞    1    0.01    0.01    0.1    0.001

The results are shown in Figure 6-2. We observe two noticeable phenomena. First, the execution time decreases as the stride increases. This is because a larger stride increases the chance of identical transactions within a block; that is, more transactions can be merged, which reduces the cost of subset generation. Second, a longer stride also helps reduce the memory usage, because a larger stride raises the pruning threshold, so that more itemsets are pruned. Finally, as time goes by, the execution time and memory usage also increase.

[Figure 6-2: T5.I5.N0.1K.D1000K; x-axis: transaction size (80K to 960K); y-axes: time (sec) and memory (MB); series: memory and time for stride = 10000, 20000, 40000, 80000.]
Figure 6-2. Execution time and memory usage for process 1, with varying transaction sizes and strides.

Accuracy: Since our algorithms introduce a pruning technique to reduce the memory usage, errors may occur in the maintained frequent itemsets and the discovered indirect associations.

(45) First, we check the difference between the true support and the estimated support, measured by the following formula called ASE (Average Support Error):

$ASE = \dfrac{\sum_{x \in F} \left( Tsup(x) - Esup(x) \right)}{|F|}$   (10)

where F denotes the set of all frequent itemsets with respect to σf. In order to verify our proposed formula, this experiment does not apply the delay-insertion scheme. The stride value was varied from 10000 to 80000 and the mediator support threshold σf from 0.01 to 0.018. The other parameters are as follows.

ω    d    σs      σd     ε
∞    1    0.01    0.1    0.001

The results are depicted in Figure 6-3. All ASEs are zero, and thus less than the user-specified error ε = 0.001. This is consistent with our derivation in Theorem 2.

[Figure 6-3: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axis: ASE; series: stride = 10000, 20000, 40000, 80000.]
Figure 6-3. Average support error of generated frequent itemsets.
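As a small illustration of the ASE measure in (10), the following sketch averages the support error over the truly frequent itemsets. The dictionary-based interface, the treatment of itemsets missing from the estimate as having estimated support 0, and the sample supports are assumptions of this sketch, not the evaluation code used in the experiments.

```python
def average_support_error(true_sup, est_sup, sigma_f):
    """ASE over F, the itemsets whose true support is at least sigma_f."""
    frequent = [x for x, s in true_sup.items() if s >= sigma_f]
    if not frequent:
        return 0.0
    # Itemsets pruned from the estimate are treated as having support 0 (assumption).
    total = sum(true_sup[x] - est_sup.get(x, 0.0) for x in frequent)
    return total / len(frequent)


# Made-up supports: C is infrequent and therefore ignored.
true_sup = {"A": 0.5, "B": 0.2, "C": 0.005}
est_sup = {"A": 0.5, "B": 0.198}
ase = average_support_error(true_sup, est_sup, sigma_f=0.01)
```

In this toy case only B is underestimated, by 0.002, so the average over the two frequent itemsets is about 0.001.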

(46) The accuracy of the discovered indirect association rules was measured by inspecting how many rules are missed, i.e., recall, which is defined as follows:

$Recall = \dfrac{|IA_{EST} \cap IA_{true}|}{|IA_{true}|}$   (11)

where IA_EST denotes the set of indirect associations discovered by our algorithms, and IA_true denotes the set of true indirect associations. Figure 6-4 shows the results. All recalls are 100%.

[Figure 6-4: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axis: recall (%); series: stride = 10000, 20000, 40000, 80000.]
Figure 6-4. Recall of discovered indirect associations with different strides (block sizes) and σf values.

Performance of process 2 for rule generation: In this experiment, we compare the performance of the two algorithms in the process for rule generation. The parameters are as follows.

(47)
ω    s        d    σs      σd     ε
∞    10000    1    0.01    0.1    0.001

The results are presented in Figure 6-5. First, let us look at the memory usage. There is no significant difference; GIAMS-IND and GIAMS-MED consume approximately the same amount. Next we examine the execution time. Clearly, GIAMS-MED is much faster than GIAMS-IND. The reason is that, as shown in Figure 6-6, GIAMS-IND generates many more candidate rules than GIAMS-MED.

[Figure 6-5: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axes: time (sec) and memory (MB); series: memory and time for GIAMS-IND and GIAMS-MED.]
Figure 6-5. The execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.

(48) [Figure 6-6: T5.I5.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axis: number of candidate indirect association rules (0 to 200000); series: GIAMS-IND, GIAMS-MED.]
Figure 6-6. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.

(49) 6.1.2 Time-fading window model
Decay rate: We first compare the performance of our algorithms executed in this model with varying decay rates. The parameters were set as follows:

s    ω    d          σs      σf       σd     ε
1    ∞    0.7-0.9    0.01    0.012    0.1    0.001

As the experimental results in Figure 6-7 show, the memory usage increases as the decay rate gets larger. This is because the support count of an itemset then declines more slowly, which makes the itemset stay longer in the search tree. The situation is the opposite for execution time: when the decay rate is smaller, an itemset becomes outdated more quickly, so more itemsets are added into and deleted from the search tree within a shorter period.

[Figure 6-7: x-axis: transaction size (80K to 960K); y-axes: time (sec) and memory (MB); series: memory and time for d = 0.9, 0.8, 0.7.]
Figure 6-7. Execution time and memory usage for running process 1, with varying transaction sizes and decay rates.

(50) Mediator support threshold: In this experiment, we varied σf (the mediator support threshold) from 0.01 to 0.018. The other parameters were set as follows.

ω    s    d      σs      σd     ε
∞    1    0.9    0.01    0.1    0.001

As the results in Figure 6-8 show, the execution time and memory usage increase proportionally to the transaction size, but the effect of varying the minimum support is not significant. Compared with the other data stream models, the time-fading window model spends more time on transaction insertion. The reason is that the time-fading window model processes one transaction at a time.

[Figure 6-8: x-axis: transaction size (80K to 960K); y-axes: time (sec, logarithmic scale) and memory (MB); series: memory and time for σf = 0.01, 0.012, 0.014, 0.016, 0.018.]
Figure 6-8. Execution time and memory usage for process 1, with varying transaction sizes and σf values.

Accuracy: In this experiment we compared the average support error with respect to different decay rates. The parameters were set as follows.

(51)
s    ω    d          σs      σf            σd     ε
1    ∞    0.7-0.9    0.01    0.01-0.018    0.1    0.001

As shown in Figure 6-9, higher decay rates lead to larger errors. The mediator support threshold also affects the error, but its influence is far smaller than that of the decay rate. We also observed the recalls of the discovered indirect associations. All of the results shown in Figure 6-10 are almost 100%: because we keep all 2-itemsets, most of the indirect association rules can be discovered successfully.

[Figure 6-9: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axis: ASE (×10^-7); series: d = 0.9, 0.8, 0.7.]
Figure 6-9. Average support error of generated frequent itemsets.

(52) [Figure 6-10: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axis: recall (%); series: d = 0.9, 0.8, 0.7.]
Figure 6-10. Recall of discovered indirect associations with different decay rates and σf values.

Performance of rule generation: We varied σf from 0.01 to 0.018. The other parameters are as follows:

ω    s    d      σs      σd     ε
∞    1    0.9    0.01    0.1    0.001

As shown in Figure 6-11, GIAMS-MED is much faster than GIAMS-IND. The reason is that, as shown in Figure 6-12, GIAMS-MED generates fewer candidate rules than GIAMS-IND. Both methods consume a similar amount of memory because most of the memory is used for maintaining the Card-Stree.

(53) [Figure 6-11: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axes: time (sec, logarithmic scale) and memory (MB); series: memory and time for GIAMS-IND and GIAMS-MED.]
Figure 6-11. The execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.

[Figure 6-12: T5.I5.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axis: number of candidate rules (0 to 1200000); series: GIAMS-IND, GIAMS-MED.]
Figure 6-12. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.

(54) 6.1.3 Sliding window model
Stride: We first evaluated the effect of varying strides and observed the difference between both algorithms. The parameters are as follows.

ω        s              d    σs      σf       σd     ε
80000    10000-80000    1    0.01    0.012    0.1    0

From Figure 6-13 we can see that larger strides result in less execution time because fewer blocks have to be processed. A peculiar phenomenon is the unusual peak that occurs when the transaction size is 160K. We conjecture that this is because the transactions in that case are longer than in the other cases. The difference in memory usage with respect to varying s is not significant, since the memory usage depends more on the window size.

(55) [Figure 6-13: T5.I5.N0.1K.D1000K; x-axis: transaction size (80K to 960K); y-axes: time (sec) and memory (MB); series: memory and time for s = 10000, 20000, 40000, 80000.]
Figure 6-13. Execution time and memory usage for running process 1, with varying transaction sizes and strides.

Window size: The effect of the window size is evaluated in this experiment, in which its value was varied from 10000 to 80000. The other parameters are as follows.

s        d    σs      σf       σd     ε
10000    1    0.01    0.012    0.1    0

Intuitively, the more information we would like to observe, the more memory is required. So, as shown in Figure 6-14, a large window size leads to large memory usage. However, the effect of varying the window size on the execution time does not exhibit an obvious regularity. In general, the larger the window is, the more execution time is required. However, the case ω = 10000 does not conform to this trend because the pruning scheme cannot be applied when s = ω.

(56) [Figure 6-14: T5.I5.N0.1K.D1000K; x-axis: transaction size (80K to 960K); y-axes: time (sec) and memory (MB); series: memory and time for w = 10000, 20000, 40000, 80000.]
Figure 6-14. Execution time and memory usage for running process 1, with varying transaction sizes and window sizes.

Performance of rule generation: We varied σf from 0.01 to 0.018. The other parameters are as follows.

ω        s        d    σs      σd     ε
80000    10000    1    0.01    0.1    0.001

As shown in Figure 6-15, GIAMS-MED performs better than GIAMS-IND in all of the cases, and its curve is analogous to that in Figure 6-16. This once again shows that most of the execution time is spent on joining and inspecting candidate rules. In addition, the execution times of both algorithms decrease as σf increases. This is because when σf is larger, fewer frequent itemsets are generated, and so fewer candidate rules are produced.

(57) [Figure 6-15: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axes: time (sec) and memory (MB); series: memory and time for GIAMS-IND and GIAMS-MED.]
Figure 6-15. The execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.

(58) [Figure 6-16: T5.I5.D1000K; x-axis: mediator support threshold (0.01 to 0.018); y-axis: number of candidate MSS-pairs (0 to 200000); series: GIAMS-IND, GIAMS-MED.]
Figure 6-16. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.

6.2 Evaluation on Real Data
In this section, we present the experimental results on the real dataset constructed from the web log of news pages in msn.com for the entire day of September 28, 1999. A more detailed description of this dataset can be found in [4].

6.2.1 Landmark window model
Mediator support threshold: We increased σf from 0.01 to 0.018; the other parameters are as follows:

ω    s        d    σs      σd     ε
∞    10000    1    0.01    0.1    0.001

(59) The results are depicted in Figure 6-17. Since the average transaction length in this real dataset is shorter than that in the synthetic data, the insertion time is shorter than for the synthetic data. Similar to the experimental results for the synthetic data, larger mediator support thresholds favor faster execution and lower memory usage, though the memory usage is smaller overall.

[Figure 6-17: msnbc; x-axis: transaction size (80K to 960K); y-axes: time (sec) and memory (MB); series: memory and time for σf = 0.01, 0.012, 0.014, 0.016, 0.018.]
Figure 6-17. Execution time and memory usage for running process 1, with varying transaction sizes and σf values.

Stride: We varied the stride from 10000 to 80000. The other parameters are as follows:

ω        d    σs      σf       σd     ε
80000    1    0.01    0.012    0.1    0.001

The results are shown in Figure 6-18. The stride is a critical factor in the effectiveness of the pruning phase, as revealed in (7). A large stride makes itemsets easier to prune. So the execution time and memory usage are larger when the stride is smaller.

(60) [Figure 6-18 plot, dataset msnbc. X-axis: transaction size (80K to 960K); y-axes: time (sec) and memory (MB); series: memory and time for stride = 10000, 20000, 40000, 80000.]

Figure 6-18. Execution time and memory usage for process 1, with varying transaction sizes and strides (block size).

Accuracy: The parameters in this experiment were set as follows:

ω    s              d    σs      σf            σd     ε
∞    10000-80000    1    0.01    0.01-0.018    0.1    0.001

The results in Figure 6-19 show that the ASE is zero in all cases. This is because almost all transactions in this dataset contain fewer than five items, so our model keeps most of the itemsets and never prunes them. The recall ratios in Figure 6-20 are also 100% in all cases, showing that our algorithms achieve equally good results on real data in this experiment.
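The two accuracy measures reported above can be illustrated with a minimal sketch. It assumes that the ASE is the mean absolute difference between the true and the estimated supports over the itemsets reported by the miner, and that recall is the percentage of true indirect associations that are discovered. The function names and the representation of rules as tuples are hypothetical and are not taken from the thesis implementation.

```python
from typing import Dict, FrozenSet, Set, Tuple

Itemset = FrozenSet[str]


def average_support_error(true_sup: Dict[Itemset, float],
                          est_sup: Dict[Itemset, float]) -> float:
    """Mean absolute support error over the reported itemsets (assumed definition)."""
    if not est_sup:
        return 0.0
    total = sum(abs(true_sup.get(x, 0.0) - s) for x, s in est_sup.items())
    return total / len(est_sup)


def recall(true_rules: Set[Tuple], found_rules: Set[Tuple]) -> float:
    """Percentage of the true rules that also appear among the discovered rules."""
    if not true_rules:
        return 100.0
    return 100.0 * len(true_rules & found_rules) / len(true_rules)
```

Under these assumptions, an ASE of zero means that every reported itemset carries exactly its true support, which is what one would expect when no itemset is ever pruned.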

(61) [Figure 6-19 plot, dataset msnbc. X-axis: mediator support threshold (0.01 to 0.018); y-axis: ASE (0 to 1); series: stride = 10000, 20000, 40000, 80000.]

Figure 6-19. Average support error of generated frequent itemsets.

[Figure 6-20 plot. X-axis: mediator support threshold (0.01 to 0.018); y-axis: recall (%) (0 to 100); series: stride = 10000, 20000, 40000, 80000.]

Figure 6-20. Recalls of discovered indirect associations with different strides and σf.

Performance of rule generation: Next, we examine the performance of the rule generation procedure. The parameters in this experiment were set as follows:

ω    s        d    σs      σf            σd     ε
∞    10000    1    0.01    0.01-0.018    0.1    0.001

(62) As shown in Figure 6-21, GIAMS-MED is faster than GIAMS-IND, but the gap between the two is smaller than that observed for the synthetic data because the difference in the number of candidate associations, shown in Figure 6-22, is smaller.

[Figure 6-21 plot, dataset msnbc. X-axis: mediator support threshold (0.01 to 0.018); y-axes: time (sec) and memory (MB); series: memory and time for GIAMS-IND and GIAMS-MED.]

Figure 6-21. Execution time and memory usage of GIAMS-IND and GIAMS-MED, running process 2 with varying σf.

(63) [Figure 6-22 plot, dataset msnbc. X-axis: mediator support threshold (0.01 to 0.018); y-axis: number of candidate rules (0 to 1400); series: GIAMS-IND, GIAMS-MED.]

Figure 6-22. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf.

6.2.2 Time-fading window model

Decay rate: The parameters in this experiment were set as follows:

ω    s    d          σs      σf       σd     ε
∞    1    0.7-0.9    0.01    0.012    0.1    0.001

The experimental results in Figure 6-23 show that the execution time decreases as the decay rate increases. Execution is slower than under the other models because the cost of dividing all transactions into subsets is rather high. In general, the memory usage decreases as the decay rate decreases.
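As a minimal sketch of the time-fading setting above (ω = ∞, s = 1, d < 1), the following assumes that all accumulated counts are multiplied by the decay rate d each time a new transaction arrives. It tracks single items only and omits the pruning and itemset enumeration performed by the actual algorithms. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


class DecayedCounter:
    """Time-fading support counts for single items (illustrative assumption)."""

    def __init__(self, decay: float) -> None:
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.counts: Dict[str, float] = defaultdict(float)
        self.total = 0.0  # decayed number of transactions seen so far

    def add(self, transaction: Iterable[str]) -> None:
        # Fade everything seen so far, then add the new transaction at weight 1.
        for item in self.counts:
            self.counts[item] *= self.decay
        self.total = self.total * self.decay + 1.0
        for item in set(transaction):
            self.counts[item] += 1.0

    def support(self, item: str) -> float:
        # Decayed count of the item relative to the decayed transaction total.
        if self.total == 0.0:
            return 0.0
        return self.counts.get(item, 0.0) / self.total
```

With `decay=1.0` the counter reduces to plain counting, which corresponds to the setting d = 1 used for the landmark window model in Section 6.2.1.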

(64) [Figure 6-23 plot, dataset msnbc. X-axis: transaction size (80K to 960K); y-axes: time (sec) and memory (MB); series: memory and time for d = 0.9, 0.8, 0.7.]

Figure 6-23. Execution time and memory usage for running process 1, with varying transaction sizes and decay rates.

Mediator support threshold: The parameters in this experiment were set as follows:

ω    s    d      σs      σf            σd     ε
∞    1    0.9    0.01    0.01-0.018    0.1    0.001

The results in Figure 6-24 show a phenomenon similar to that observed in Figure 6-8 for synthetic data; i.e., the execution time and memory usage are not affected by the mediator support threshold. The reason is the same as that given for the synthetic data.

(65) [Figure 6-24 plot, dataset msnbc. X-axis: transaction size (80K to 960K); y-axes: time (sec) and memory (MB); series: memory and time for mediator support thresholds 0.01, 0.012, 0.014, 0.016, 0.018.]

Figure 6-24. Execution time and memory usage for process 1, with varying transaction sizes and mediator support thresholds.

Accuracy: The parameters in this experiment were set as follows:

ω    s    d          σs      σf            σd     ε
∞    1    0.7-0.9    0.01    0.01-0.018    0.1    0.001

As the results in Figure 6-25 show, the time-fading window model incurs almost no error, and the error increases as the decay rate increases. Furthermore, the error is smaller than that for the synthetic data. This is because most of the transactions in msnbc contain fewer than five items and our data structure Card-Stree keeps all 1-itemsets and 2-itemsets. Therefore, the recall rates are 100%, as shown in Figure 6-26.

(66) [Figure 6-25 plot. X-axis: mediator support threshold (0.01 to 0.018); y-axis: ASE (×10^-12, 0 to 3); series: d = 0.9, 0.8, 0.7.]

Figure 6-25. Average support error of generated frequent itemsets.

[Figure 6-26 plot, dataset msnbc. X-axis: mediator support threshold (0.01 to 0.018); y-axis: recall (%) (0 to 100); series: d = 0.9, 0.8, 0.7.]

Figure 6-26. Recalls of discovered indirect associations with different decay rates.

(67) Performance of rule generation: The parameters in this experiment were set as follows:

ω    s        d    σs      σf            σd     ε
∞    10000    1    0.01    0.01-0.018    0.1    0.001

The results are depicted in Figures 6-20 and 6-21. The number of candidate rules is much smaller than that for the synthetic data, so the processing time is less than 0.01 second.

[Figure 6-27 plot, dataset msnbc. X-axis: mediator support threshold (0.01 to 0.018); y-axes: time (sec) and memory (MB); series: memory and time for GIAMS-IND and GIAMS-MED.]

Figure 6-27. Execution time and memory usage of GIAMS-IND and GIAMS-MED, running process 2 with varying mediator support thresholds.