CHAPTER 3 GIAMS: A Review
3.2 The Generic Algorithm
Based on the generic framework in Figure 3-3, the generic algorithm employed by GIAMS consists of two concurrent processes: PF-monitoring and IA-generation. The first process is activated when the user specifies the window parameters that set the type of window model; it is responsible for generating itemsets from the incoming block of transactions and inserting those that are potentially frequent into a repository called the monitoring lattice. The second process is activated when the user issues a query about the current indirect associations; it is responsible for generating the qualified patterns from the frequent itemsets maintained by PF-monitoring. A sketch of the generic algorithm is given in Figure 3-4.
Algorithm Name: GIAMS
Input: Itempair support threshold σs, association support threshold σf, dependence threshold σd, decay rate d, window size w, support error threshold ε.
Output: Indirect Associations IA.
Initialization:
1. Let N be the accumulated number of transactions, N = 0;
2. Let η be the decayed accumulated number of transactions, η = 0;
3. Let cbid be the current block id, cbid = 0, sbid the starting block id of
10. TransactionMerge(Bcbid, CT); // Merge analogous transactions into a compact table CT
11. DelayInsert(CT, FP, σf, cbid, η); // Construct FP using transactions in CT
12. Decay&Pruning(d, s, ε, cbid, FP); // Remove infrequent itemsets from FP
Process 2: IA-generation
1. if user query request = true then
2. IndirectAssociationGen(FP, σf, σd, σs, N); // Generate all indirect associations
Figure 3-3. The GIAMS algorithm.
CHAPTER 4 The Proposed Resource-Aware GIAMS Framework
In this chapter, we describe our proposed resource-aware GIAMS framework, RA-GIAMS. We first give an overview of RA-GIAMS and then focus on the design of its two kernel functionalities: the adaptation schemes for variability in CPU computing power and in available memory space, respectively.
4.1 Framework Overview
Based on the GIAMS framework proposed in [14, 25], RA-GIAMS adds mechanisms that cope with variation in the available resources, considering both CPU power and memory space, so that as much of the available resources as possible is used to discover indirect association rules.
As depicted in Figure 4-1, the new components added in RA-GIAMS are the resource monitor, responsible for monitoring the current CPU computing power and available memory space; the load shedder, responsible for discarding part of the incoming data; a buffer, used as a temporary container for the incoming data; and the storage shedder, responsible for pruning maintained frequent itemsets to reduce the memory requirement.
These new mechanisms work in the following scenario to realize the functionality of resource-awareness.
1. The resource monitor periodically monitors the current CPU computing power and the available memory space. As we will show in later sections, the CPU computing power can be represented as the number of dominating operations accomplished within a time unit, and the memory usage can be represented as the number of Card-Stree* nodes (the tree structure described in Section 4.4), because each tree node consumes a similar amount of memory. The information collected is then forwarded to the load shedder and the storage shedder so that they can take the necessary action.
2. When the load shedder receives the CPU power information, it checks whether the upcoming workload will exceed the estimated CPU computing power; if so, it sheds part of the input data.
3. Likewise, when the storage shedder receives the memory usage information, it checks whether the available memory is enough for processing the incoming transactions. If not, it performs a node replacement scheme to replace some of the frequent itemsets maintained in FP.
Figure 4-1. The Proposed RA-GIAMS framework.
4.2 Notation Description
Before we proceed to the detailed design of adaptation schemes, we describe in this section the notation that will be used. The description is presented in Table 4-1.
Table 4-1 Notation used in the design of RA-GIAMS.
Notation    Description
r           The data arrival rate.
ŵi          The workload for processing block Bi.
ϖi          The predictive workload for processing block Bi.
ŵ*i         The average workload for processing each transaction in Bi.
ϖ*i         The predictive average workload for processing each transaction in Bi.
P           The sampling rate.
θi          The actual CPU efficiency for processing block Bi.
ei          The predictive CPU efficiency for processing block Bi.
ς           The available buffer space.
ti          The execution time to complete block Bi.
τi          The estimated execution time for processing block Bi.
4.3 Adaption Scheme for CPU Power Awareness
In this section, we will present the design of the adaption scheme for CPU power awareness.
4.3.1 Basic Concept
The basic concept of our design is depicted in Figure 4-2, where the buffer is regarded as a circular queue. Conforming to the generic window model used in GIAMS, we assume that the input stream is processed block by block. After the completion of the (i−1)-th block Bi−1, the resource monitor can calculate the actual CPU efficiency θi−1 and predict the CPU efficiency ei for processing the i-th block Bi. A simple estimation that takes both the actual efficiency and the previously predicted efficiency into account, given in Eq. (4.1), is used, where α denotes a weight, 0 ≤ α ≤ 1; a value of α above 0.5 means that the actual efficiency θi−1 counts more than the previous prediction ei−1. Although more sophisticated estimation methods could be used, in this study we prefer a simple one to avoid excessive computation overhead.
ei = α θi−1 + (1 − α) ei−1 (4.1)
Let ϖi denote the estimated workload for processing block Bi, and let ς be the currently available buffer space. Our intention is to ensure that, while block Bi is being processed, the amount of arriving data does not exceed the available buffer space. If it would, we activate the load shedder to sample the input data at a sampling rate P. The value of P that satisfies this requirement is easy to derive. That is,
P × r × τi ≤ ς (4.2)
P ≤ ς / (r × τi) (4.3)
where the estimated execution time for processing Bi is τi = ϖi / ei.
Figure 4-2. CPU awareness scheme.
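The relations in Eqs. (4.1) to (4.3) can be illustrated with a minimal Python sketch. The function and parameter names are hypothetical, and capping P at 1 (no shedding when the buffer can absorb all arrivals) is an assumption drawn from the statement that the load shedder is activated only when needed.

```python
def predict_efficiency(alpha: float, theta_prev: float, e_prev: float) -> float:
    """Eq. (4.1): blend the measured and the previously predicted efficiency."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * theta_prev + (1.0 - alpha) * e_prev


def sampling_rate(buffer_space: float, arrival_rate: float,
                  workload: float, efficiency: float) -> float:
    """Eqs. (4.2)-(4.3): largest P with P * r * tau <= buffer space.

    Assumption: P is capped at 1, meaning nothing is shed.
    """
    tau = workload / efficiency            # estimated execution time tau_i
    expected_arrivals = arrival_rate * tau
    if expected_arrivals <= buffer_space:
        return 1.0                         # everything fits in the buffer
    return buffer_space / expected_arrivals


# Hypothetical numbers, chosen only to exercise the formulas.
e_next = predict_efficiency(alpha=0.5, theta_prev=10.0, e_prev=30.0)
p = sampling_rate(buffer_space=4.0, arrival_rate=2.0,
                  workload=80.0, efficiency=e_next)
```

With these made-up inputs the predicted efficiency is 20 operations per second, the block is expected to take 4 seconds, 8 data items would arrive in that time, and so only half of them are kept.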
4.3.2 CPU Efficiency and Workload Estimation
In the description of the basic concept of our adaptation scheme for CPU awareness, two key points need further clarification: the monitoring of the CPU efficiency and the estimation of the workload ϖi for processing block Bi.
The CPU efficiency, though intuitively simple in meaning, i.e., the number of operations that can be accomplished within a time unit, is not easy to monitor and calculate in real time. Our idea is to approach the estimation from a computational complexity viewpoint. First, we observe that the most time-consuming procedure of algorithm GIAMS (see Figure 3-3) is the delay insert, which is responsible for decomposing transactions into itemsets and maintaining frequent patterns. The most important operations of delay insert are node creation, update, and replacement (the last is explained with the memory awareness scheme in Section 4.4). For this reason, we can represent the CPU efficiency as the number of these operations accomplished within one second. The actual CPU efficiency θi for completing the processing of block Bi can be defined as
θi = ŵi / ti (4.4)
where ŵi denotes the number of operations, including node insertion, update and replacement, for completing Bi and ti is the execution time.
The workload estimation ϖi for processing Bi, however, requires a different strategy because it has to be done before block Bi is processed.
Our idea is to estimate the average number of operations needed to process a transaction by using the statistics collected while processing the previous block Bi−1. Let ŵ*i−1 and ϖ*i−1 be the actual and estimated average number of operations to process a transaction for block Bi−1, respectively. That is, ŵ*i−1 = ŵi−1 / |Bi−1| and ϖ*i−1 =
Consider the example stream in Figure 4-3. There are three blocks of transactions: block B3 has just finished, block B4 is about to be processed, and block B5 is stored in the buffer; block B6 is newly generated from the stream. We assume that the predictive CPU efficiency e3 is equal to 21; the actual workload ŵ3 is equal to 36 and took 2 seconds; the predictive workload ϖ3 is equal to 42; the available buffer space ς = 2; the data arrival rate r = 3; and the estimation weight α = 0.5.
Figure 4-3. An example data stream.
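The quantities assumed for this example are enough to work through the estimation mechanically. The following Python sketch does so under stated assumptions: the names are hypothetical, the per-transaction workload is assumed to be smoothed with the same weight α as the efficiency, and both B3 and B4 are assumed to contain three transactions.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    """Statistics recorded for the block that has just finished."""
    size: int             # number of transactions in the block
    operations: float     # actual workload, in node operations
    seconds: float        # actual execution time
    predicted_ops: float  # workload that had been predicted for the block
    predicted_eff: float  # efficiency that had been predicted for the block


def plan_next_block(prev: BlockStats, next_size: int, alpha: float,
                    buffer_space: float, arrival_rate: float) -> dict:
    theta = prev.operations / prev.seconds                  # Eq. (4.4)
    eff = alpha * theta + (1 - alpha) * prev.predicted_eff  # Eq. (4.1)
    # Assumption: per-transaction workload is smoothed with the same alpha.
    per_tx = (alpha * prev.operations / prev.size
              + (1 - alpha) * prev.predicted_ops / prev.size)
    workload = per_tx * next_size
    tau = workload / eff                                    # estimated time
    rate = min(1.0, buffer_space / (arrival_rate * tau))    # Eq. (4.3)
    return {"theta": theta, "eff": eff, "per_tx": per_tx,
            "workload": workload, "tau": tau, "rate": rate}


b3 = BlockStats(size=3, operations=36, seconds=2,
                predicted_ops=42, predicted_eff=21)
plan = plan_next_block(b3, next_size=3, alpha=0.5,
                       buffer_space=2, arrival_rate=3)
print(plan)
```

Evaluated exactly, these inputs give θ3 = 18, e4 = 19.5, ϖ*4 = 13, ϖ4 = 39, τ4 = 2 and a sampling rate of one third.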
We first calculate the CPU efficiency θ3 = 36/2 = 18 and predict the CPU efficiency for processing block B4, e4 = 0.5*18 + 0.5*21 = 19.5. The average transactional workload for processing block B3 is ŵ*3 = 36/3 = 12, and its predictive counterpart is ϖ*3 = 42/3 = 14. Then we can predict the average transactional workload for processing block B4, calculated as ϖ*4 = 0.5*12 + 0.5*14 = 13. So we have ϖ4 = 13*3 = 39, and the estimated execution time will be τ4 = 39/19.5 = 2. Finally we obtain the sampling rate P = 2 / (3 × 2) = 2/6 ≈ 33.33%.
4.4 Adaption Scheme for Available Memory Awareness
In this section, we describe the adaption scheme for available memory awareness.
Recall that our framework relies on the maintenance of the promising frequent itemsets, PF. Our concern is thus how to deal with the situation in which the available memory is not enough to hold all the frequent itemsets maintained in PF, and how to develop an adaptive scheme that adjusts PF to make the most of the currently available memory space.
The structure used in RA-GIAMS for realizing PF is a modification of the tree structure used in GIAMS, called Card-Stree, which is a forest of search trees ST1, ST2, …, STk keeping the itemsets of different cardinalities that appear in the current window, with STk maintaining the set of frequent k-itemsets. We name the modified structure Card-Stree*. Each node in Card-Stree* except the root keeps the information of the maintained itemset. More specifically, for each itemset X, the node records X.id, the identifier of X; X.bidv, the vector of identifiers of the blocks in which X appears; X.countv, the vector that stores the number of occurrences of X within each block; and X.tlcount, the total number of times that X appears in the current window. An example of Card-Stree* is depicted in Figure 4-4. Because each node consumes approximately the same amount of memory space, in what follows we use a node as the memory unit.
Figure 4-4. An illustration of Card-Stree*.
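As a rough illustration of the per-node information, the following Python sketch models one node and a forest indexed by cardinality. It is a simplification under stated assumptions: a flat dictionary per cardinality stands in for each search tree, and the class name, the `record` method and the `forest` variable are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class ItemsetNode:
    """One non-root node; fields mirror X.id, X.bidv, X.countv, X.tlcount."""
    id: FrozenSet[str]
    bidv: List[int] = field(default_factory=list)    # ids of blocks with X
    countv: List[int] = field(default_factory=list)  # occurrences per block
    tlcount: int = 0                                  # total in the window

    def record(self, block_id: int, occurrences: int = 1) -> None:
        """Register occurrences of X observed in block `block_id`."""
        if self.bidv and self.bidv[-1] == block_id:
            self.countv[-1] += occurrences
        else:
            self.bidv.append(block_id)
            self.countv.append(occurrences)
        self.tlcount += occurrences


# forest[k] plays the role of ST_k (assumed flat mapping, not a real tree).
forest: Dict[int, Dict[FrozenSet[str], ItemsetNode]] = {}

x = frozenset("AB")
node = forest.setdefault(len(x), {}).setdefault(x, ItemsetNode(x))
node.record(block_id=1)
node.record(block_id=1)
node.record(block_id=2)
```

After the three calls, the node for AB holds bidv = [1, 2], countv = [2, 1] and tlcount = 3. Keeping per-block counts separately is what allows the contribution of an expired block to be subtracted from the total when the window moves.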
4.4.1 Basic Concept
A simple and intuitive approach is to blindly drop some itemsets when the memory space is not enough. However, it is very likely that too much information will be lost, making the mining results incorrect and leading to wrong analysis. Instead, we employ a strategy similar to cache replacement. That is, when the memory space is insufficient, we decide which itemsets in the current Card-Stree* are less important and can be deleted to release enough space for accommodating the incoming, more important itemsets.
In this regard, we propose a node releasing mechanism to cope with the situation in which the memory space is not enough to maintain all of the potentially frequent itemsets in Card-Stree*. Note that stream mining has to be performed in real time. As such, the main design concern of our approach is efficiency, i.e., how to efficiently search for and determine the victim nodes for deletion without sacrificing too much of the accuracy of the discovered rules.
First, we note that for mining indirect associations, the set of 2-itemsets is the most important set, because all length-2 mediators and the infrequent itempairs are generated from it. As such, our node replacement is executed only when the memory space is not enough and the newly generated itemsets from the incoming transaction are of length 1 or 2. In other words, newly generated k-itemsets with k > 2 are discarded immediately when the memory is not enough.
Second, considering that the frequency of long itemsets is usually lower than that of short itemsets, and that lengthy rules constructed from long itemsets are less understandable to users, our approach replaces nodes according to their cardinalities, choosing the longest itemsets first. More precisely, suppose that we need to release n nodes, and k denotes the largest length of itemsets in Card-Stree*. Our approach searches for the top n nodes with the smallest counts in the STk subtree. If fewer than n nodes are found in STk, the search continues in subtrees STk-1, STk-2, and so on. However, during the search process, we delete any node whose count is equal to 1 and decrement the number of nodes to be released, because these nodes represent the least frequently occurring itemsets.
Third, to facilitate the search for n victim nodes, we introduce a linked structure called victim-list to maintain the nodes in Card-Stree* chosen for deletion. Each node in victim-list contains three fields: the count of the itemsets, the number of itemsets having this count, and pointers to all the corresponding nodes in Card-Stree*. All nodes in victim-list are sorted in decreasing order of count. Figure 4-5 illustrates the structure of victim-list.
Below we summarize the main steps for searching and eliminating victim nodes using the victim-list structure.
1. If there are newly generated 1- or 2-itemsets to be inserted but the memory space is not enough, determine the number of nodes n that need to be deleted, and call the node releasing procedure in step 2.
2. For each subtree in Card-Stree*, starting from STk, inspect each node X and execute the following substeps to determine whether to choose X as a victim and insert it into victim-list.
2-1. If X.tlcount = 1, delete X and decrease n by 1. Then check the number of victims stored in the first node victim-list[0]. If the total number of victims maintained in victim-list, say #victim, minus victim-list[0].num is at least n, then eliminate victim-list[0] and update #victim.
2-2. Else if the number of victims in victim-list is less than n, then insert X into victim-list in decreasing order of count.
2-3. Else, perform one of the following cases.
case 1: The count of X is larger than that of the first node in victim-list. Skip X and continue to inspect the next node in STk.
case 2: The count of X is equal to that of the first node in victim-list. Insert X into the first node and update victim-list[0].num and victim-list[0].pointer.
case 3: The count of X is less than that of the first node in victim-list. In this case, we first insert X into victim-list. Then check the number of victims stored in the first node victim-list[0]. If the total number of victims maintained in victim-list, say #victim, minus victim-list[0].num is at least n, then eliminate victim-list[0] and update #victim.
3. If the number of victims in victim-list is less than n and k – 1 > 2, then set k = k – 1 and go to step 2 (searching the next subtree STk-1).
4. Eliminate all victims maintained in victim-list from Card-Stree* and insert at most #victim newly generated itemsets into Card-Stree*.
Figure 4-5. An illustration of victim-list
A few remarks on the above procedure are in order before we proceed to the detailed algorithm description. First, the victims are maintained and deleted in groups, distinguished by count. That is why we may end up with more than n victims in victim-list. Second, for efficiency, we only compare the count of the node X under inspection with that of the first group in victim-list (see step 2-3). This avoids the overhead of a linear search along the entire victim-list.
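The search procedure summarized above can be sketched in Python as follows. This is an illustrative simplification, not the thesis implementation: each subtree is a plain dictionary from itemset to total count, victim-list is a Python list of (count, members) groups in decreasing order of count, the search never descends below cardinality 3, and the group trimming uses a loop where the text describes a single check (an assumption that also covers n dropping to zero). All names and the counts in the usage example are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Forest = Dict[int, Dict[Itemset, int]]  # k -> {itemset: total count}


def release_nodes(forest: Forest, n: int) -> List[Itemset]:
    """Free about n nodes from subtrees with k > 2; return removed itemsets."""
    groups: List[Tuple[int, List[Itemset]]] = []  # victim-list
    n_victims = 0
    removed: List[Itemset] = []

    def trim() -> None:
        nonlocal n_victims
        # Drop the largest-count group while the rest still covers n.
        while groups and n_victims - len(groups[0][1]) >= n:
            n_victims -= len(groups.pop(0)[1])

    def add(x: Itemset, count: int) -> None:
        nonlocal n_victims
        for pos, (c, members) in enumerate(groups):
            if c == count:                    # join the group with this count
                members.append(x)
                break
            if c < count:                     # keep decreasing order
                groups.insert(pos, (count, [x]))
                break
        else:
            groups.append((count, [x]))
        n_victims += 1

    k = max(forest, default=0)
    while k > 2:
        for x, count in list(forest.get(k, {}).items()):
            if n <= 0:
                break                         # nothing left to release
            if count == 1:                    # step 2-1
                del forest[k][x]
                removed.append(x)
                n -= 1
                trim()
            elif n_victims < n:               # step 2-2
                add(x, count)
            elif count > groups[0][0]:        # case 1: skip X
                continue
            elif count == groups[0][0]:       # case 2
                add(x, count)
            else:                             # case 3
                add(x, count)
                trim()
        if n_victims >= n:                    # step 3
            break
        k -= 1
    for _, members in groups:                 # step 4
        for x in members:
            del forest[len(x)][x]
            removed.append(x)
    return removed


fs = frozenset
demo: Forest = {
    4: {fs("ABCD"): 3},
    3: {fs("ABC"): 5, fs("ABD"): 4, fs("ABE"): 2, fs("ACD"): 3, fs("ACE"): 1},
    2: {fs("AB"): 6},
}
released = release_nodes(demo, n=2)
```

With these made-up counts, ACE is dropped immediately because its count is 1, and ABE is the only victim left in victim-list at the end, so two nodes are released in total while ABCD, ABC, ABD and ACD survive.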
4.4.2 Algorithm Description
We first present the algorithm DelayInsert, which is responsible for constructing and maintaining Card-Stree* using the transactions in CT, and then detail the victim searching & releasing algorithm, which is a procedure called by DelayInsert when a memory shortage occurs.
The algorithm DelayInsert is described in Figure 4-6. In summary, for each k-itemset X generated from the input transaction, if X is already in Card-Stree*, we update its information. Otherwise, if the memory space is enough, we insert X if k < 3, or perform delay insert if k ≥ 3. On the other hand, if the memory is not enough and k ≥ 3, we simply discard itemset X, whereas for k < 3 we temporarily store X in a buffer. After all itemsets in Ck have been inspected, we call algorithm Victim_Searching&Releasing to release memory and insert at most #victim of the newly generated k-itemsets in the buffer into Card-Stree*.
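The control flow just summarized can be sketched in Python for a single transaction. Several points are assumptions made only for illustration: memory is modelled as a node budget `capacity`, the delay-insert test for k ≥ 3 is taken to mean that every immediate subset is already present in the subtree of cardinality k−1, only total counts are kept, and `drop_longest` is a trivial stand-in for the victim searching and releasing procedure. All names are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable, List

Itemset = FrozenSet[str]
Forest = Dict[int, Dict[Itemset, int]]  # k -> {itemset: total count}


def node_count(forest: Forest) -> int:
    return sum(len(tree) for tree in forest.values())


def delay_insert(forest: Forest, transaction: Iterable[str], capacity: int,
                 release: Callable[[Forest, int], int]) -> None:
    """Update the forest with every itemset of one transaction."""
    items = sorted(set(transaction))
    for k in range(1, len(items) + 1):
        tree = forest.setdefault(k, {})
        pending: List[Itemset] = []           # buffered new 1-/2-itemsets
        for combo in combinations(items, k):
            x = frozenset(combo)
            if x in tree:
                tree[x] += 1                  # known itemset: update count
            elif node_count(forest) < capacity:
                subsets = (x - {item} for item in x)
                if k < 3 or all(s in forest[k - 1] for s in subsets):
                    tree[x] = 1               # insert (delayed for k >= 3)
            elif k < 3:
                pending.append(x)             # retry after releasing nodes
            # otherwise memory is short and k >= 3: X is discarded
        if pending:
            freed = release(forest, len(pending))
            for x in pending[:freed]:
                tree[x] = 1


def drop_longest(forest: Forest, n: int) -> int:
    """Stand-in releaser: remove up to n nodes from subtrees with k > 2."""
    freed = 0
    for k in sorted(forest, reverse=True):
        if k <= 2:
            break
        while forest[k] and freed < n:
            forest[k].popitem()
            freed += 1
    return freed


cs: Forest = {}
delay_insert(cs, "ABC", capacity=7, release=drop_longest)  # fills 7 nodes
delay_insert(cs, "D", capacity=7, release=drop_longest)    # forces a release
```

After the second call the single 3-itemset ABC has been released to make room for the new 1-itemset D. Enumerating every subset of a transaction is exponential in its length, so this sketch is only suitable for short transactions.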
Procedure Name: DelayInsert
Input: CT, Card-Stree*, cbid.
Output: Updated Card-Stree*.
Steps:
1. foreach transaction tr in CT do
2. foreach itemset X in Ck, the set of k-itemset Ck
update X.bidv, X.countv, and X.tlcount;
6. else if memory is enough then
7. if k < 3 then insert X into STk;
8. else insert X into STk only if all immediate subset of X in STk-1
17. #victim = Victim_Searching&Releasing(Card-Stree*, n);
18. insert at most #victim k-itemsets in buffer into STk;
19. endif
20. endfor
Figure 4-6. Algorithm description of DelayInsert
Algorithm: Victim searching & releasing
Input: Number of nodes n and Card-Stree*
Output: Number of victim nodes released from Card-Stree*, #victim
1: #victim = 0;
2: k = the largest length of itemsets in Card-Stree*;
3: repeat
8: if #victim − victim-list[0].num >= n then delete victim-list[0];
9: endif
16: insert node X into victim-list; #victim++;
17: case 3:
18: insert node X into victim-list;
19: #victim++;
20: if #victim − victim-list[0].num >= n then
21: delete victim-list[0];
28: foreach victim X in victim-list do
29: delete the corresponding nodes pointed by X.addr from Card-Stree*;
30: return #victim;
Figure 4-7. Algorithm for victim searching&releasing
The algorithm for finding and releasing victims is described in Figure 4-7, which details the steps implementing the idea presented in subsection 4.4.1.
4.4.3 An Example
Suppose that Card-Stree* contains the set of frequent itemsets shown in Figure 4-8. For simplicity, we only show the total count of each itemset. Suppose that we want to insert F and G into Card-Stree* but find that the memory is not enough. Then the procedure Victim_Searching&Releasing is activated to perform victim searching and node releasing, with n = 2.
Figure 4-8. The itemsets maintained in an example Card-Stree*.
1. The victim search starts from the itemsets in the subtree of length 4, i.e., ST4. Since the victim-list is empty, itemset ABCD is inserted into victim-list with count = 3 and linked to its corresponding node in Card-Stree*. The result is shown in Figure 4-9.
Figure 4-9. The Card-Stree* and victim-list after inserting ABCD.
2. Since there are other subtrees with cardinality larger than 2 and #victim < n, the victim search continues to the subtree of cardinality 3, first inspecting the node ABC. Since #victim < n = 2 and ABC's count is larger than 3, we insert ABC at the front of victim-list. The result is shown in Figure 4-10.
Figure 4-10. The Card-Stree* and victim-list after inserting ABC.
3. The search process continues to examine the other nodes in subtree ST3. The next node inspected is ABD. Its count is 4, smaller than that of ABC, so we insert ABD into victim-list. However, we now find that #victim – victim-list[0].num = 2 ≥ n, so the first group, which holds ABC, is removed from victim-list. The result is shown in Figure 4-11.
Figure 4-11. The Card-Stree* and victim-list after inserting ABD.
4. The next node is ABE. Its count is 2, smaller than that of ABD, so ABE is inserted into victim-list. Again, since #victim – victim-list[0].num = 2 ≥ n, ABD is removed from victim-list. The result is shown in Figure 4-12.
Figure 4-12. The Card-Stree* and victim-list after inserting ABE.
5. The next node is ACD. Its count is 3, equal to that of the first node in victim-list, which holds ABCD. So we insert ACD into the same node in which ABCD resides, as shown in Figure 4-13.
Figure 4-13. The Card-Stree* and victim-list after inserting ACD.
6. The next itemset is ACE. Its count is equal to 1, so we immediately delete the node containing ACE from ST3 and decrease the number of nodes to be released by 1, obtaining n = 1. After this, victim-list holds more victims than required, i.e., #victim – victim-list[0].num ≥ n, so we delete the first group in victim-list. The result is depicted in Figure 4-14.