
CHAPTER 2 Background and Related Work

2.2 Resource-Aware Stream Mining

Research on resource-aware stream mining focuses mainly on how to use limited resources to accomplish the mining task efficiently while preserving the accuracy of the mining results as far as possible. Contemporary work on resource-aware mining over data streams can be viewed from three aspects: the type of resources under concern, the type of mining tasks employed, and the type of adaptation techniques used.

From the viewpoint of resource type, studies consider CPU power and memory space in the general setting, plus battery and network bandwidth for mobile devices. The types of mining tasks studied include clustering, frequent pattern mining, and kernel density estimation. Finally, the adaptation techniques can be input adaptation or algorithm adaptation. Input adaptation refers to schemes that adjust the amount of input data to keep up with the pace of the stream and to match the computing capacity. Three approaches are commonly used: sampling, i.e., statistically choosing some input data; load shedding, i.e., discarding part of the input data; and data synopsis creation, i.e., data summarization techniques that retain the characteristics, statistics, or profile of the data.

Teng et al. [21] proposed a wavelet-based data synopsis technique, called RAM-DS (Resource-Aware Mining for Data Streams), which transforms the input data stream into data of different granularities, from a temporal view or a frequency view, in order to reduce the amount of data. Their work focused on frequent pattern mining.

The group led by M.M. Gaber is one of the pioneers in resource-aware stream data mining. They have conducted a series of studies on resource-aware stream clustering [10, 11], with the ultimate goal of developing a general resource-aware framework that can adapt to the varying availability of different resources over time. Their recent work [11] proposed a generic framework called Algorithm Granularity Settings, which uses three levels of adaptation strategies, in the data input, the algorithm processing, and the data output, to cope with different issues of resource-awareness.

The work of Dang et al. [7, 8] considers stream mining under a CPU resource constraint, focusing on frequent pattern discovery and using the load shedding technique. Their approach estimates the system workload by approximately computing the number of maximal itemsets that can be generated from a transaction, and then employs load shedding to sample transactions if the system workload exceeds the current CPU computing power.

Heinz and Seeger [16] proposed an algorithm adaptation method that is tailored to the problem of kernel density estimation. The resource type considered in their work is memory space.

In a distributed computing environment, the network bandwidth available for data transmission is particularly important. Parthasarathy and Subramonian [17] developed a resource-aware scheduling scheme to cope with the bandwidth limitation.

A summary of the above work on resource-aware stream data mining is shown in Table 2-1.

Table 2-1 A summary of related work on resource-aware stream mining.

Authors | Resource Type (CPU, Memory, Battery, Bandwidth) | Mining Task | Adaptation Techniques* (Input Adaptation, Process Adaptation)

*SP: Sampling, DS: Data Synopsis, LS: Load Shedding

CHAPTER 3

GIAMS: A Review

GIAMS (Generic Indirect Association Mining over Streams) [14, 25] is an algorithmic framework that can accommodate three stream window models, namely the landmark model, the time-fading model, and the sliding window model, while retaining the flexibility for users to define new models. Users only have to set four variables (timestamp, window size, stride, and decay rate) in accordance with the generic window model, and GIAMS can then discover indirect association rules. Since our work in this thesis is based on this framework, this chapter gives a brief review of it.

3.1 The Generic Framework

Suppose that we have a data stream S = (t0, t1, t2, ..., ti, ...), where ti denotes the transaction that arrives at time i. Since a data stream is a continuous and unbounded sequence of data arriving over time, a window W is usually specified, representing the sequence of data that arrived from ti to tj, denoted as W[i, j] = (ti, ti+1, ..., tj).

GIAMS adopts a generic window model Ψ for data stream mining, which is specified as a four-tuple Ψ(l, w, s, d), where l denotes the timestamp at which the window starts, w is the window size, s is the stride by which the window moves forward, and d is the decay rate.

The stride s is introduced to allow the window to move forward by a batch of s transactions at a time.

That is, if the current window under concern is (tj−w+1, tj−w+2, …, tj), then the next window will be (tj−w+s+1, tj−w+s+2, …, tj+s), and the weight of a transaction within (tj−s+1, tj−s+2, …, tj), say α, is decayed to αd, and the weight of a transaction within (tj+1, …, tj+s) is 1. The concept of the proposed generic window model is depicted in Figure 3-1.

Figure 3-1. The generic window model used in GIAMS [14].
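The sliding step of the generic window model can be sketched in Python as follows. This is only an illustrative sketch, not the GIAMS implementation: the function name `slide_window` and the list-of-pairs representation are assumptions made here, and the sketch further assumes that every transaction retained in the window has its weight multiplied by d on each slide, while newly arrived transactions enter with weight 1.

```python
def slide_window(window, new_block, w, d):
    """Advance a weighted window by one stride (hypothetical helper).

    window    -- list of (transaction, weight) pairs, oldest first
    new_block -- the s newly arrived transactions
    w         -- window size, a positive number of transactions to keep
    d         -- decay rate applied to transactions already in the window
    """
    # Assumption: every retained transaction with weight a decays to a * d.
    decayed = [(t, weight * d) for t, weight in window]
    # Newly arrived transactions enter the window with weight 1.
    arrived = [(t, 1.0) for t in new_block]
    # Keep only the w most recent transactions.
    return (decayed + arrived)[-w:]


window = [("t0", 1.0), ("t1", 1.0), ("t2", 1.0), ("t3", 1.0)]
window = slide_window(window, ["t4", "t5"], w=4, d=0.5)
```

With w = 4 and s = 2, the two oldest transactions leave the window, the two retained ones are decayed, and the two new ones are appended.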

The GIAMS framework is developed according to the paradigm proposed by Tan et al. [20]: it first discovers the set of frequent itemsets with support higher than σf, and then generates the set of qualified indirect associations from the frequent itemsets.

Based on this paradigm, GIAMS works in the following scenario:

(1) The user first sets the streaming window model by specifying the parameters described previously;

(2) The framework then executes the process for discovering and maintaining the set of potential frequent itemsets PF as the data continuously stream in;

(3) At any moment, once the user issues a query about the current indirect associations, the second process is executed to generate from PF the set of qualified indirect associations IA.

Figure 3-2 depicts the generic streaming framework for indirect associations mining.

Figure 3-2. The GIAMS generic framework for indirect association mining [14].

[Figure 3-2 components: Process 1 (discover and maintain PF); Process 2 (generate IA); PF; model setting and adjusting; data, query, and result flows; access and update links to PF.]

3.2 The Generic Algorithm

Based on the generic framework in Figure 3-2, the generic algorithm employed by GIAMS consists of two concurrent processes: PF-monitoring and IA-generation. The first process is activated when the user specifies the window parameters to set the type of window model; it is responsible for generating itemsets from the incoming block of transactions and inserting those that are potentially frequent into a repository called the monitoring lattice. The second process is activated when the user issues a query about the current indirect associations; it is responsible for generating the qualified patterns from the frequent itemsets maintained by process PF-monitoring. A sketch of the generic algorithm is given in Figure 3-3.

Algorithm Name: GIAMS

Input: Itempair support threshold σs, association support threshold σf, dependence threshold σd, decay rate d, window size w, support error threshold ε.

Output: Indirect Associations IA.

Initialization:

1. Let N be the accumulated number of transactions, N = 0;

2. Let η be the decayed accumulated number of transactions, η = 0;

3. Let cbid be the current block id, cbid = 0, sbid the starting block id of …

…

10. TransactionMerge(Bcbid, CT); // Merge analogous transactions into a compact table CT

11. DelayInsert(CT, FP, σf, cbid, η); // Construct FP using transactions in CT

12. Decay&Pruning(d, s, ε, cbid, FP); // Remove infrequent itemsets from FP

Process 2: IA-generation

1. if user query request = true then

2. IndirectAssociationGen(FP, σf, σd, σs, N); // Generate all indirect associations

Figure 3-3. The GIAMS algorithm.

CHAPTER 4

The Proposed Resource-Aware GIAMS Framework

In this chapter, we describe our proposed resource-aware GIAMS framework, namely RA-GIAMS. We first give an overview of RA-GIAMS and then focus on the design of its two kernel functionalities: the adaptation schemes for variability in CPU computing power and in available memory space, respectively.

4.1 Framework Overview

Based on the GIAMS framework proposed in [14, 25], RA-GIAMS adds mechanisms to cope with the variation of available resources, considering both CPU power and memory space, so as to make the most of the available resources in discovering indirect association rules.

As depicted in Figure 4-1, the new components added in RA-GIAMS include the resource monitor, responsible for monitoring the current CPU computing power and available memory space; the load shedder, responsible for discarding part of the incoming data; a buffer, used as a temporary container for the incoming data; and the storage shedder, responsible for pruning the maintained frequent itemsets to reduce the memory requirement.

These new mechanisms work in the following scenario to realize the functionality of resource-awareness.

1. The resource monitor periodically monitors the current CPU computing power and the available memory space. As we will show in later sections, the CPU computing power can be represented as the number of dominating operations accomplished within a time unit, and the memory usage can be represented as the number of tree nodes in the Card-Stree* structure (see Section 4.4), because each tree node consumes a similar amount of memory. The information collected is then forwarded to the load shedder and the storage shedder so that they can take the necessary action.

2. When the load shedder receives the CPU power information, it compares the ongoing workload against the estimated CPU computing power; if the workload exceeds it, the load shedder sheds part of the input data.

3. Likewise, when the storage shedder receives the memory usage information, it checks whether the available memory is enough for processing the incoming transactions. If not, it performs a node replacement scheme to replace some of the frequent itemsets maintained in FP.

Figure 4-1. The Proposed RA-GIAMS framework.

4.2 Notation Description

Before we proceed to the detailed design of adaptation schemes, we describe in this section the notation that will be used. The description is presented in Table 4-1.

Table 4-1 Notation used in the design of RA-GIAMS.

Notation    Description

r       The data arrival rate.
ŵi      The workload for processing block Bi.
ϖi      The predictive workload for processing block Bi.
ŵ*i     The average workload for processing each transaction in Bi.
ϖ*i     The predictive average workload for processing each transaction in Bi.
P       The sampling rate.
θi      The actual CPU efficiency for processing block Bi.
ei      The predictive CPU efficiency for processing block Bi.
ς       The available buffer space.
ti      The execution time to complete block Bi.
τi      The estimated execution time for processing block Bi.

4.3 Adaption Scheme for CPU Power Awareness

In this section, we will present the design of the adaption scheme for CPU power awareness.

4.3.1 Basic Concept

The basic concept of our design is depicted in Figure 4-2, where the buffer is regarded as a circular queue. Conforming to the generic window model used in GIAMS, we assume that the input stream is processed block by block. After the completion of the (i−1)-th block Bi−1, the resource monitor calculates the actual CPU efficiency θi−1 and predicts the CPU efficiency ei for processing the i-th block Bi. A simple estimation that combines the actual efficiency and the previously predicted efficiency, described in Eq. (4.1), is used, where α denotes a weight, 0 ≤ α ≤ 1; a value of α higher than 0.5 means that the actual efficiency θi−1 is weighted more heavily than the previous prediction ei−1. Although more complicated estimation methods could be used, in this study we prefer a simple method to avoid excessive computation overhead.

$$e_i = \alpha\,\theta_{i-1} + (1-\alpha)\,e_{i-1} \tag{4.1}$$
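Eq. (4.1) is a weighted average of the last measured efficiency and the last prediction. A minimal Python sketch is given below; the function name `predict_efficiency` and the range check are illustrative choices made here rather than part of RA-GIAMS.

```python
def predict_efficiency(theta_prev, e_prev, alpha):
    """Predict the CPU efficiency e_i for the next block, following Eq. (4.1).

    theta_prev -- actual efficiency measured for block B(i-1)
    e_prev     -- efficiency that had been predicted for block B(i-1)
    alpha      -- weight in [0, 1] placed on the actual efficiency
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * theta_prev + (1.0 - alpha) * e_prev


# alpha = 1 trusts only the measurement; alpha = 0 trusts only the old prediction.
e_next = predict_efficiency(theta_prev=18.0, e_prev=21.0, alpha=0.5)
```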

Let ϖi denote the estimated workload for processing block Bi and ς the currently available buffer space. Our intention is to ensure that, while block Bi is being processed, the amount of arriving data does not exceed the available buffer space. If it would, we activate the load shedder to shed the input data with a sampling rate P. It is not hard to derive the value of P that satisfies this condition:

$$P \times r \times \tau_i \le \varsigma \tag{4.2}$$

That is,

$$P \le \varsigma / (r \times \tau_i) \tag{4.3}$$

where the estimated execution time for processing Bi is τi = ϖi / ei.

Figure 4-2. CPU awareness scheme.
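The bound on the sampling rate and the resulting shedding step can be sketched as follows. This is a hedged illustration only: the names `sampling_rate` and `shed` are hypothetical, the sketch assumes that the largest admissible P is chosen and capped at 1 (no shedding when the buffer suffices), and shedding is modelled as independent random sampling of transactions.

```python
import random


def sampling_rate(buffer_space, arrival_rate, est_workload, est_efficiency):
    """Largest sampling rate P that satisfies P * r * tau_i <= buffer space."""
    if est_efficiency <= 0 or arrival_rate <= 0:
        raise ValueError("efficiency and arrival rate must be positive")
    tau = est_workload / est_efficiency        # estimated execution time
    expected_arrivals = arrival_rate * tau     # data arriving while B_i runs
    if expected_arrivals <= buffer_space:
        return 1.0                             # buffer suffices, keep everything
    return buffer_space / expected_arrivals


def shed(transactions, p, rng=None):
    """Keep each transaction independently with probability p (assumption)."""
    rng = rng or random.Random()
    return [t for t in transactions if rng.random() < p]


p = sampling_rate(buffer_space=2, arrival_rate=3, est_workload=39, est_efficiency=19.5)
kept = shed(["T1", "T2", "T3"], p, rng=random.Random(0))
```

Capping P at 1 is a design choice of this sketch: Eq. (4.3) only gives an upper bound, and a rate above 1 has no meaning for sampling.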

4.3.2 CPU Efficiency and Workload Estimation

In the description of the basic concept of our adaptation scheme for CPU awareness, two key points need further clarification: the monitoring of the CPU efficiency and the estimation of the workload ϖi for processing block Bi.

The CPU efficiency, though intuitively simple in meaning, i.e., the number of operations that can be accomplished within a time unit, is not easy to monitor and calculate in real time. Our idea is to approach the estimation from a computational complexity viewpoint. First, we observe that the most time-consuming procedure of algorithm GIAMS (see Figure 3-3) is the delay insert, which is responsible for decomposing transactions into itemsets and maintaining frequent patterns. The most important operations of delay insert are node creation, update, and replacement (the last is explained in Section 4.4 on memory awareness). For this reason, we can represent the CPU efficiency as the number of these operations accomplished within one second. The actual CPU efficiency θi for completing the processing of block Bi can be defined as

$$\theta_i = \hat{w}_i / t_i \tag{4.4}$$

where ŵi denotes the number of operations, including node insertion, update, and replacement, for completing Bi, and ti is the execution time.

The workload estimation ϖi for processing Bi, however, requires a different strategy because it has to be done before block Bi is processed.

Our idea is to estimate the average number of operations needed to process a transaction by using the statistics collected in processing the previous block Bi−1. Let ŵ*i−1 and ϖ*i−1 be the actual and estimated average number of operations to process a transaction for block Bi−1, respectively. That is, ŵ*i−1 = ŵi−1 / |Bi−1| and ϖ*i−1 = ϖi−1 / |Bi−1|. As the example below illustrates, the two averages are then combined with the weight α, in the same way as in Eq. (4.1), to give the predicted average ϖ*i = α ŵ*i−1 + (1 − α) ϖ*i−1, from which ϖi = ϖ*i × |Bi|.

Consider the example stream in Figure 4-3. Block B3 has just been finished, block B4 is about to be processed, block B5 is stored in the buffer, and block B6 is newly generated from the stream. We assume that the predictive CPU efficiency e3 is equal to 21; the workload ŵ3 is equal to 36 and takes 2 seconds; the predictive workload ϖ3 is equal to 42; the available buffer space ς = 2; the data arrival rate r = 3; and the estimation weight α = 0.5.

Figure 4-3. An example data stream.
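The quantities of this example can be run through the estimation steps of Sections 4.3.1 and 4.3.2 with a short Python sketch. The function `estimate_block` and its parameter names are hypothetical. The sketch assumes that every block holds three transactions and that the per-transaction workload is smoothed with the same weight α as the efficiency in Eq. (4.1), with α placed on the actual value; with α = 0.5 the two placements coincide.

```python
def estimate_block(w_prev, t_prev, pred_w_prev, e_prev,
                   size_prev, size_next, buffer_space, arrival_rate, alpha):
    """Estimate efficiency, workload, execution time, and sampling rate
    for the next block from the statistics of the block just finished."""
    theta_prev = w_prev / t_prev                          # Eq. (4.4)
    e_next = alpha * theta_prev + (1 - alpha) * e_prev    # Eq. (4.1)
    avg_actual = w_prev / size_prev                       # actual ops per transaction
    avg_pred = pred_w_prev / size_prev                    # predicted ops per transaction
    # Assumption: same smoothing form as Eq. (4.1).
    avg_next = alpha * avg_actual + (1 - alpha) * avg_pred
    w_next = avg_next * size_next                         # predicted workload
    tau = w_next / e_next                                 # estimated execution time
    p = min(1.0, buffer_space / (arrival_rate * tau))     # Eq. (4.3), capped at 1
    return {"theta_prev": theta_prev, "e_next": e_next, "avg_next": avg_next,
            "w_next": w_next, "tau": tau, "p": p}


result = estimate_block(w_prev=36, t_prev=2, pred_w_prev=42, e_prev=21,
                        size_prev=3, size_next=3,
                        buffer_space=2, arrival_rate=3, alpha=0.5)
for name, value in result.items():
    print(f"{name} = {value:.4f}")
```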


We first calculate the actual CPU efficiency θ3 = 36/2 = 18 and predict the CPU efficiency for processing block B4 as e4 = 0.5 × 18 + 0.5 × 21 = 19.5. The average transactional workload for processing block B3 is ŵ*3 = 36/3 = 12, and its predictive counterpart is ϖ*3 = 42/3 = 14. Then we can predict the average transactional workload for processing block B4 as ϖ*4 = 0.5 × 12 + 0.5 × 14 = 13. So we have ϖ4 = 13 × 3 = 39, and the estimated execution time is τ4 = 39/19.5 = 2. Finally, we obtain the sampling rate P = 2 / (3 × 2) = 2/6 ≈ 33.33%.

4.4 Adaption Scheme for Available Memory Awareness

In this section, we describe the adaption scheme for available memory awareness.

Recall that our framework relies on the maintenance of promising frequent itemsets, PF. Our concern is thus how to deal with the situation in which the available memory is not enough to hold all the frequent itemsets maintained in PF, and how to develop an adaptive scheme that adjusts PF to make the most of the currently available memory space.

The structure used in RA-GIAMS for realizing PF is a modification of the tree structure used in GIAMS, called Card-Stree, which is a forest of search trees keeping the itemsets of different cardinalities that appear in the current window, say ST1, ST2, …, STk, where STk maintains the set of frequent k-itemsets. We name the modified structure Card-Stree*. Each node in Card-Stree* except the root keeps the information of the maintained itemset. More specifically, for each itemset X, the node records X.id, the identifier of X; X.bidv, the vector of identifiers of the blocks in which X appears; X.countv, the vector that stores the number of occurrences of X within each block; and X.tlcount, the total number of times that X appears in the current window under concern. An example of Card-Stree* is depicted in Figure 4-4. Because each node consumes approximately the same amount of memory space, in what follows we use a node as the memory unit.

Figure 4-4. An illustration of Card-Stree*.
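The per-node bookkeeping of Card-Stree* can be sketched in Python as follows. The field names follow the description above (id, bidv, countv, tlcount); everything else is an assumption of this sketch, including the class names `ItemsetNode` and `CardStree`, the `add` and `node_count` methods, and the use of one dictionary per cardinality in place of the actual search trees.

```python
from dataclasses import dataclass, field


@dataclass
class ItemsetNode:
    id: tuple                                     # identifier of itemset X
    bidv: list = field(default_factory=list)      # ids of blocks containing X
    countv: list = field(default_factory=list)    # occurrences of X per block
    tlcount: int = 0                              # total count in the window

    def record(self, block_id, occurrences=1):
        """Register occurrences of the itemset in the given block."""
        if self.bidv and self.bidv[-1] == block_id:
            self.countv[-1] += occurrences
        else:
            self.bidv.append(block_id)
            self.countv.append(occurrences)
        self.tlcount += occurrences


class CardStree:
    """Forest indexed by cardinality: subtrees[k] plays the role of STk."""

    def __init__(self):
        self.subtrees = {}

    def add(self, itemset, block_id):
        key = tuple(sorted(itemset))
        subtree = self.subtrees.setdefault(len(key), {})
        if key not in subtree:
            subtree[key] = ItemsetNode(key)
        subtree[key].record(block_id)
        return subtree[key]

    def node_count(self):
        """Memory usage measured in nodes, the memory unit used in the text."""
        return sum(len(subtree) for subtree in self.subtrees.values())


tree = CardStree()
tree.add({"A", "B"}, block_id=1)
tree.add({"B", "A"}, block_id=1)
node = tree.add({"A", "B"}, block_id=2)
```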

4.4.1 Basic Concept

A simple and intuitive approach is to blindly drop some itemsets when the memory space is not enough. However, it is very likely that too much information will be lost, making the mining results incorrect and leading to wrong analysis. Rather, we employ a strategy similar to the concept of cache replacement. That is, when the memory space is insufficient, we decide which itemsets in the current Card-Stree* are less important and can be deleted to release enough space for accommodating the incoming, more important itemsets.


In this regard, we propose a node releasing mechanism to cope with the situation in which the memory space is not enough to maintain all the potential frequent itemsets in Card-Stree*. Note that stream mining has to be carried out in real time. As such, the main design concern of our approach is efficiency, i.e., how to efficiently search for and determine the victim nodes for deletion without sacrificing too much of the accuracy of the discovered rules.

First, we note that for mining indirect associations, the set of 2-itemsets is the most important set, because all length-2 mediators and the infrequent itempairs are generated from it. As such, our node replacement is executed only when the memory space is not enough and the newly generated itemsets from the incoming transaction are of length 1 or 2. In other words, newly generated k-itemsets with k > 2 are discarded immediately when the memory is not enough.
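This admission rule can be stated compactly in code. The sketch below is illustrative only; the function name `on_new_itemset`, the `free_nodes` argument, and the returned action labels are hypothetical.

```python
def on_new_itemset(itemset, free_nodes):
    """Decide how to handle a newly generated itemset.

    Returns "insert" when memory is available, "replace" when memory is
    short and the itemset has length 1 or 2, and "discard" otherwise.
    """
    if free_nodes > 0:
        return "insert"
    if len(itemset) <= 2:
        return "replace"    # trigger node replacement to make room
    return "discard"        # k-itemsets with k > 2 are dropped immediately


actions = [on_new_itemset(("A", "B"), free_nodes=0),
           on_new_itemset(("A", "B", "C"), free_nodes=0),
           on_new_itemset(("A", "B", "C"), free_nodes=5)]
```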

Second, considering that long itemsets are usually less frequent than short ones, and that lengthy rules constructed from long itemsets are less understandable to users, our approach replaces nodes according to their cardinalities, choosing the longest itemsets first. More precisely, suppose that we need to release n nodes, and let k denote the largest length of itemsets in Card-Stree*. Our approach searches for the n nodes with the smallest counts in the STk subtree. If fewer than n nodes are found in STk, the search continues in subtrees STk−1, STk−2, and so on. During the search, however, we immediately delete any node whose count is equal to 1 and decrement the number of nodes still to be released, because these nodes represent the least frequently occurring itemsets.
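The cardinality-first selection can be approximated with the following Python sketch. It is a simplification rather than the procedure used in RA-GIAMS: it relies on `heapq.nsmallest` instead of a dedicated list structure, represents each subtree as a dictionary from itemset to count, and does not treat count-1 nodes separately, since they are the smallest in any subtree and are therefore picked first anyway. The name `choose_victims` is hypothetical.

```python
import heapq


def choose_victims(forest, n):
    """Pick up to n nodes to release, longest itemsets first.

    forest -- dict mapping cardinality k to {itemset: count} (stand-in for STk)
    Returns a list of (k, itemset) pairs in the order they were selected.
    """
    victims = []
    for k in sorted(forest, reverse=True):        # start from the largest k
        need = n - len(victims)
        if need <= 0:
            break
        # Nodes with the smallest counts in this subtree.
        smallest = heapq.nsmallest(need, forest[k].items(), key=lambda kv: kv[1])
        victims.extend((k, itemset) for itemset, _ in smallest)
    return victims


forest = {
    1: {("A",): 5, ("B",): 4},
    2: {("A", "B"): 3, ("A", "C"): 1},
    3: {("A", "B", "C"): 2, ("A", "B", "D"): 1},
}
victims = choose_victims(forest, n=3)
```

Here ST3 holds only two nodes, so both are taken and the third victim comes from ST2.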

Third, to facilitate the search for n victim nodes, we introduce a linked structure called victim-list to maintain the nodes in Card-Stree* chosen for deletion. Each node in victim-list contains three fields: the count of the itemsets, the number of itemsets having this count, and pointers to all the corresponding nodes in Card-Stree*. All nodes in victim-list are sorted in decreasing order of count. Figure 4-5 illustrates the structure of victim-list.
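A minimal Python sketch of such a list is shown below. It assumes a plain Python list in place of a linked list, and the names `VictimEntry` and `insert_victim` are hypothetical; only the three fields (count, number of itemsets, pointers) and the decreasing order of counts come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class VictimEntry:
    count: int                                       # count shared by the itemsets
    pointers: list = field(default_factory=list)     # references to tree nodes

    @property
    def num(self):
        """Number of itemsets having this count."""
        return len(self.pointers)


def insert_victim(victim_list, count, node):
    """Insert a candidate, keeping entries in decreasing order of count."""
    for pos, entry in enumerate(victim_list):
        if entry.count == count:          # same count: share the entry
            entry.pointers.append(node)
            return
        if entry.count < count:           # first smaller entry: insert before it
            victim_list.insert(pos, VictimEntry(count, [node]))
            return
    victim_list.append(VictimEntry(count, [node]))


victim_list = []
for count, node in [(3, "AB"), (5, "AC"), (3, "BC"), (2, "ABC")]:
    insert_victim(victim_list, count, node)
```

Keeping the largest count at the head means that the entry at position 0 is the first to give way when a candidate with a smaller count arrives.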

Figure 4-5. An illustration of victim-list.

Below we summarize the main steps for searching for and eliminating victim nodes using the victim-list structure.

1. If newly generated 1- or 2-itemsets need to be inserted but the memory space is not enough, determine the number of nodes n that need to be deleted, and call the node releasing procedure in step 2.

2. For each subtree in Card-Stree*, starting from STk, inspect each node X and execute the following substeps to determine whether to choose X as a victim and insert it into victim-list.

2-1. If X.tlcount = 1, delete X and decrease n by 1. Then check the number of victims stored in the first node, victim-list[0]. If the total number of victims maintained in victim-list, say #victim, minus victim-list[0].num is at least n, then eliminate victim-list[0] and update #victim.

2-2. Else if the number of nodes in victim-list is less than n, then insert X into victim-list in decreasing order of count.

2-3. Else, perform one of the following cases.

case 1: The count of X is larger than that of the first node in victim-list. Skip X and continue to inspect the next node in STk.

case 2: The count of X is equal to that of the first node in victim-list. Insert X into the first node and update victim-list[0].num and victim-list[0].pointer.

case 3: The count of X is less than that of the first node in victim-list. In this case,
