CHAPTER 1 Introduction

1.3 Thesis Organization

The remainder of this thesis is organized as follows. In Chapter 2, we provide background knowledge and related work on indirect association mining, data stream window models, resource-aware stream mining, and load shedding. Since our work is based on GIAMS, a generic stream window model and algorithmic framework for indirect association rule mining, we give an overview of GIAMS and its algorithm in Chapter 3. Chapter 4 describes our proposed RA-GIAMS framework, an extension of GIAMS with resource-aware capability, including the revised system framework, the algorithmic design of the adaptive functionalities with respect to variation in CPU power and memory space, respectively, and the detailed data structures and procedures for realizing the algorithmic framework. Chapter 5 then describes the series of experiments we conducted to evaluate the proposed framework. We considered both synthetic and real datasets and examined the effects of various factors, e.g., data arrival rate, CPU power, and available memory space. Finally, the conclusions and future work of this thesis are presented in Chapter 6.

CHAPTER 2

Background and Related Work

2.1 Indirect Association Mining

The concept of indirect association mining was first proposed in [19] for discovering useful infrequent patterns beyond association rules. To facilitate the presentation of our framework, we first give a formal definition of indirect associations below.

Definition 1 An indirect association rule is denoted as ⟨x, y | M⟩, meaning that an itempair {x, y} is indirectly associated by a mediator set M if the following conditions hold:

1. sup({x, y}) < σs (Itempair support condition);

2. sup({x} ∪ M) ≥ σf and sup({y} ∪ M) ≥ σf (Mediator support condition);

3. dep({x}, M) ≥ σd and dep({y}, M) ≥ σd (Mediator dependence condition);

where sup(A) denotes the support of an itemset A, and dep(P, Q) is a measure of the dependence between itemsets P and Q.

The factors σs, σf and σd are described as follows. The first, σs, is the itempair support threshold: an itemset whose support is lower than this threshold is regarded as infrequent. The second, σf, is the mediator support threshold: an itemset whose support is at least this threshold is regarded as frequent. Note that σf ≥ σs. The last, σd, denotes the dependence threshold. Many functions can be used for measuring the dependence of itemsets. In this thesis, we follow the suggestion in [19, 20] and adopt the well-known IS measure:

IS(P, Q) = sup(P ∪ Q) / √(sup(P) × sup(Q))
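The three conditions of Definition 1 can be checked mechanically. The following Python sketch is illustrative only: the function names, the toy transactions, and the threshold values are assumptions made for this example, supports are computed as relative frequencies, and the IS measure is taken in its usual form, sup(P ∪ Q) / √(sup(P) × sup(Q)).

```python
from math import sqrt


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def is_measure(p, q, transactions):
    """IS dependence measure between itemsets p and q."""
    denom = sqrt(support(p, transactions) * support(q, transactions))
    if denom == 0:
        return 0.0
    return support(frozenset(p) | frozenset(q), transactions) / denom


def is_indirect_association(x, y, mediator, transactions,
                            sigma_s, sigma_f, sigma_d):
    """Check the three conditions of Definition 1 for the itempair {x, y}."""
    m = frozenset(mediator)
    # 1. Itempair support condition: {x, y} must be infrequent.
    itempair_ok = support({x, y}, transactions) < sigma_s
    # 2. Mediator support condition: x and y each co-occur often with M.
    mediator_ok = (support(m | {x}, transactions) >= sigma_f
                   and support(m | {y}, transactions) >= sigma_f)
    # 3. Mediator dependence condition: x and y each depend on M.
    dependence_ok = (is_measure({x}, m, transactions) >= sigma_d
                     and is_measure({y}, m, transactions) >= sigma_d)
    return itempair_ok and mediator_ok and dependence_ok


# Hypothetical toy database: x and y rarely co-occur, but both occur with m.
transactions = [frozenset(t) for t in (
    {"x", "m"}, {"x", "m"}, {"y", "m"}, {"y", "m"}, {"x", "y", "m"}, {"z"},
)]
found = is_indirect_association("x", "y", {"m"}, transactions,
                                sigma_s=0.2, sigma_f=0.4, sigma_d=0.5)
```

In this toy database the itempair {x, y} appears in only one of six transactions, while each item appears with the mediator m in half of them, which is the situation that Definition 1 is meant to capture.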

Existing research on indirect association mining can be divided into two categories: work that proposes more efficient mining algorithms, and work that extends the definition of indirect association to different applications.

The original indirect association mining approach proposed in [20], called the “Indirect association mining algorithm”, is shown in Figure 2-1. In general, the algorithm can be divided into two phases: the frequent itemset extraction phase (step 1) and the indirect association mining phase (steps 2~7). However, generating all frequent itemsets before mining indirect associations is time-consuming.

Algorithm: Indirect association mining algorithm
Input: Transaction database D, σs, σf, and σd.
Output: Indirect associations IA.

1: Extract frequent itemsets; let L1, L2, …, Ln be the sets of frequent i-itemsets generated by any frequent itemset mining algorithm;
2: IA = ∅;

Figure 2-1. Indirect association mining algorithm

Wan and An [23] proposed an approach, called HI-mine, for improving the efficiency of the INDIRECT algorithm.

Rather than generating all frequent itemsets, HI-mine first finds all itempairs and then searches for the mediators of each itempair. The HI-mine algorithm adopts a data structure, HI-struct, based on the concept of dynamic transaction projection of frequent items, which removes the need for any join operation for candidate generation. Instead, HI-mine generates two new sets, the indirect itempair set and the mediator support set, by recursively building the HI-struct for the database. Indirect associations are then discovered directly from these two sets.

Later, Wan and An proposed an enhancement of the HI-mine algorithm, called HI-mine* [22]. HI-mine* adopts a more compact data structure called Super Compact Transaction Database (STDB), on which several optimization strategies are introduced, including single-pass database scanning, direct frequent item projection, and dynamic infrequent item pruning.

Chen et al. [6] also proposed an indirect association mining approach similar to HI-mine, namely MG-Growth. The difference is that MG-Growth uses a directed graph and a bitmap to construct the indirect itempair set (IIS). The corresponding mediator graphs are then generated for deriving indirect associations.

As to extending the definition of indirect association, Kazienko et al. [13] applied indirect association to web page recommendation systems. Chen et al. [6] proposed an approach for mining indirect associations of items by adding a time feature of goods. Since each item has its own lifespan, the relationships of newly arriving items can thus be discovered easily.

2.2 Resource-Aware Stream Mining

Research on resource-aware stream mining focuses mainly on how to use limited resources to accomplish the mining task efficiently while preserving, as far as possible, the accuracy of the mining results. Contemporary work on resource-aware mining over data streams can be viewed from three aspects: the type of resources under concern, the type of mining task employed, and the type of adaptation technique used.

From the viewpoint of resource type, studies generally consider CPU power and memory space, with battery and network bandwidth as additional issues for mobile devices. The types of mining tasks studied include clustering, frequent pattern mining, and kernel density estimation. Finally, the adaptation techniques can be input adaptation or algorithm adaptation. Input adaptation refers to schemes that adjust the amount of input data to keep up with the pace of the stream and match the computing capacity. Three approaches are commonly used: sampling, i.e., statistically choosing some of the input data; load shedding, i.e., discarding part of the input data; and data synopsis creation, i.e., data summarization techniques that retain the characteristics, statistics, or profile of the data.

Teng et al. [21] proposed a wavelet based data synopsis technique, called RAM-DS (Resource-Aware Mining for Data Streams), to transform the input data stream into different granularity of data from temporal view or frequency view in order to reduce the data amount. Their work focused on frequent pattern mining.

The group led by M.M. Gaber is one of the pioneers in resource-aware stream data mining. They have conducted a series of studies on resource-aware stream clustering [10, 11], with the ultimate goal of developing a general resource-aware framework that can adapt to the variability of different resource availability over time. Their recent work [11] proposed a generic framework called Algorithm Granularity Settings, which uses three levels of adaptation strategies, in the data input, algorithm processing, and data output, to cope with different issues of resource-awareness.

The work conducted by Dang et al. [7, 8] considers stream mining under a CPU resource constraint, focusing on frequent pattern discovery and using the load shedding technique. Their approach estimates the system workload by approximately computing the number of maximal itemsets that can be generated from a transaction, and then employs load shedding to sample transactions if the system workload exceeds the current CPU computing power.

Heinz and Seeger [16] proposed an algorithm adaptation method that is tailored to the problem of kernel density estimation. The resource type considered in their work is memory space.

In distributed computing environment, the network bandwidth available for data transmission is particularly important. Parthasarathy and Subramonian [17] developed a resource-aware scheduling scheme to cope with the bandwidth limitation.

A summary of the above work on resource-aware stream data mining is shown in Table 2-1.

Table 2-1 A summary of related work on resource-aware stream mining.

Authors | Resource Type (CPU, Memory, Battery, Bandwidth) | Mining Task | Adaptation Techniques (Input Adaptation, Process Adaptation)

*SP: Sampling, DS: Data Synopsis, LS: Load Shedding

CHAPTER 3

GIAMS: A Review

GIAMS (Generic Indirect Association Mining over Streams) [14, 25] is an algorithmic framework that accommodates three stream window models, namely the landmark model, the time-fading model, and the sliding window model, while retaining the flexibility for users to define new models. Users only have to set four variables (timestamp, window size, stride, and decay rate) in accordance with the generic window model; GIAMS can then discover indirect association rules. Since our work in this thesis is based on this framework, we give a brief review of it in this chapter.

3.1 The Generic Framework

Suppose that we have a data stream S = (t0, t1, t2, ..., ti, ...), where ti denotes the transaction that arrives at time i. Since a data stream is a continuous and unbounded sequence of incoming data over time, a window W is usually specified, representing the sequence of data that arrived from ti to tj, denoted as W[i, j] = (ti, ti+1, ..., tj).

GIAMS adopts a generic window model Ψ for data stream mining, which is specified as a four-tuple Ψ(l, w, s, d), where l denotes the timestamp at which the window starts, w the window size, s the stride by which the window moves forward, and d the decay rate.

The stride s is introduced to allow the window to move forward by a batch of transactions (of size s). That is, if the current window under concern is (tj−w+1, tj−w+2, …, tj), then the next window will be (tj−w+s+1, tj−w+s+2, …, tj+s), the weight of a transaction within (tj−s+1, tj−s+2, …, tj), say α, is decayed to αd, and the weight of a transaction within (tj+1, …, tj+s) is 1. The concept of the proposed generic window model is depicted in Figure 3-1.

Figure 3-1. The generic window model used in GIAMS [14].
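The effect of the stride and the decay rate can be illustrated with a minimal Python sketch. It is not the GIAMS implementation: the class and function names are hypothetical, the window is kept as a plain list of (transaction, weight) pairs, and we assume that every transaction retained from the previous window has its weight multiplied by d on each slide.

```python
from dataclasses import dataclass


@dataclass
class WindowModel:
    start: int    # l: timestamp at which the window starts
    size: int     # w: window size, in transactions
    stride: int   # s: number of transactions consumed per slide
    decay: float  # d: decay rate applied to older transactions


def slide(window, new_batch, model):
    """Advance the window by one stride and return the new window.

    `window` is a list of (transaction, weight) pairs, oldest first.
    """
    if len(new_batch) != model.stride:
        raise ValueError("a slide consumes exactly `stride` transactions")
    # Assumption: all retained transactions are decayed by the factor d.
    decayed = [(t, weight * model.decay) for t, weight in window]
    # Newly arrived transactions enter the window with weight 1.
    fresh = [(t, 1.0) for t in new_batch]
    # Keep only the most recent `size` transactions.
    return (decayed + fresh)[-model.size:]


model = WindowModel(start=0, size=4, stride=2, decay=0.5)
window = [("t1", 1.0), ("t2", 1.0), ("t3", 1.0), ("t4", 1.0)]
window = slide(window, ["t5", "t6"], model)
```

After the slide, the two oldest transactions have left the window, the two retained ones carry the decayed weight, and the two new ones carry weight 1.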

The GIAMS framework is developed according to the paradigm proposed by Tan et al. [20]: first, discover the set of frequent itemsets with support higher than σf, and then generate the set of qualified indirect associations from the frequent itemsets.

Based on this paradigm, GIAMS works in the following scenario:

(1) The user first sets the streaming window model by specifying the parameters described previously;

(2) The framework then executes the process for discovering and maintaining the set of potential frequent itemsets PF as the data continuously stream in;

(3) At any moment, once the user issues a query about the current indirect associations, the second process is executed to generate the set of indirect associations IA from PF.

Figure 3-2 depicts the generic streaming framework for indirect associations mining.

Figure 3-2. The GIAMS generic framework for indirect association mining [14].

[Figure 3-2 components: data; model setting & adjusting; Process 1: Discover & maintain PF (access & update); PF; Process 2: Generate IA (access); query; result.]

3.2 The Generic Algorithm

Based on the generic framework in Figure 3-2, the generic algorithm employed by GIAMS consists of two concurrent processes: PF-monitoring and IA-generation. The first process is activated when the user specifies the window parameters to set the type of window model; it is responsible for generating itemsets from the incoming block of transactions and inserting those that are potentially frequent into a repository called the monitoring lattice. The second process is activated when the user issues a query about the current indirect associations; it is responsible for generating the qualified patterns from the frequent itemsets maintained by PF-monitoring. A sketch of the generic algorithm is given in Figure 3-3.

Algorithm Name: GIAMS
Input: Itempair support threshold σs, association support threshold σf, dependence threshold σd, decay rate d, window size w, support error threshold ε.
Output: Indirect Associations IA.

Initialization:
1. Let N be the accumulated number of transactions, N = 0;
2. Let η be the decayed accumulated number of transactions, η = 0;
3. Let cbid be the current block id, cbid = 0, sbid the starting block id of …

…
10. TransactionMerge(Bcbid, CT); // Merge analogous transactions into a compact table CT
11. DelayInsert(CT, FP, σf, cbid, η); // Construct FP using transactions in CT
12. Decay&Pruning(d, s, ε, cbid, FP); // Remove infrequent itemsets from FP

Process 2: IA-generation
1. if user query request = true then
2.   IndirectAssociationGen(FP, σf, σd, σs, N); // Generate all indirect associations

Figure 3-3. The GIAMS algorithm.
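To give a feel for the TransactionMerge step, the following Python sketch collapses the transactions of a block into a compact table. It rests on an assumption: "analogous" transactions are taken to be transactions with exactly the same itemset, and the compact table CT is modeled as a mapping from itemset to occurrence count. The function name and the sample block are hypothetical, and the actual GIAMS procedure may differ.

```python
from collections import Counter


def transaction_merge(block):
    """Collapse identical transactions of a block into itemset -> count."""
    # frozenset makes the itemset hashable and ignores item order.
    return Counter(frozenset(transaction) for transaction in block)


# Hypothetical block of four transactions, two of which are identical.
block = [["A", "B"], ["B", "A"], ["A", "B", "C"], ["C"]]
compact_table = transaction_merge(block)
```

With such a table, each distinct itemset needs to be decomposed only once per block, and its count can be added in a single update.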

CHAPTER 4

The Proposed Resource-Aware GIAMS Framework

In this chapter, we describe our proposed resource-aware GIAMS framework, namely RA-GIAMS. We first give an overview of RA-GIAMS, and then focus on the design of its two kernel functionalities: the adaptation schemes for variability in CPU computing power and in available memory space, respectively.

4.1 Framework Overview

Based on the GIAMS framework proposed in [14, 25], our proposed RA-GIAMS adds mechanisms that cope with the variation of available resources, considering both CPU power and memory space, so as to make the most of the available resources in discovering indirect association rules.

As depicted in Figure 4-1, the new components added in RA-GIAMS include the resource monitor, responsible for monitoring the current CPU computing power and available memory space; the load shedder, responsible for discarding part of the incoming data; a buffer, used as a temporary container for the incoming data; and the storage shedder, responsible for pruning maintained frequent itemsets to reduce the memory requirement.

These new mechanisms work in the following scenario to realize the functionality of resource-awareness.

1. The resource monitor will periodically monitor the current CPU computing power and the available memory space. As we will show in later sections, the CPU computing power can be represented as the number of dominating operations accomplished within a time unit, and the memory usage can be represented as the amount of card-tree nodes, because each tree node consumes similar memory space. The information collected is then forwarded to the load shedder and storage shedder to take necessary action.

2. When the load shedder receives the CPU power information, it will compare this with the ongoing workload to see if it will exceed the estimated CPU computing power; if so, it will shed part of the input data.

3. Likewise, when the storage shedder receives the memory usage information, it checks whether the available memory is enough for processing the incoming transactions. If not, it performs a node replacement scheme to replace some of the frequent itemsets maintained in FP.

Figure 4-1. The Proposed RA-GIAMS framework.
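The interplay between the load shedder and the buffer in step 2 can be sketched as follows. This is a simplified illustration under our own assumptions, not the RA-GIAMS implementation: the class name and its methods are hypothetical, shedding is modeled as keeping each arriving transaction independently with probability equal to the sampling rate, and the buffer is a bounded queue that rejects data when full.

```python
import random
from collections import deque


class LoadShedder:
    """Keeps each arriving transaction with probability `rate`."""

    def __init__(self, buffer_size, rate=1.0, seed=None):
        self.buffer = deque()           # temporary container for kept data
        self.buffer_size = buffer_size  # capacity, in transactions
        self.rate = rate                # current sampling rate, in [0, 1]
        self._rng = random.Random(seed)

    def set_rate(self, rate):
        """Update the sampling rate, clamped to the interval [0, 1]."""
        self.rate = min(1.0, max(0.0, rate))

    def offer(self, transaction):
        """Return True if the transaction was buffered, False if shed."""
        if len(self.buffer) >= self.buffer_size:
            return False  # no room left: the transaction is lost
        if self._rng.random() >= self.rate:
            return False  # shed by sampling
        self.buffer.append(transaction)
        return True


shedder = LoadShedder(buffer_size=2, rate=1.0, seed=0)
accepted = [shedder.offer(t) for t in ("ABC", "AB", "BC")]
```

In this usage the sampling rate is 1, so nothing is shed by sampling, and only the third transaction is rejected because the buffer is already full. The resource monitor would call `set_rate` whenever a new estimate becomes available.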

4.2 Notation Description

Before we proceed to the detailed design of adaptation schemes, we describe in this section the notation that will be used. The description is presented in Table 4-1.

Table 4-1 Notation used in the design of RA-GIAMS.

Notation | Description
r | The data arrival rate.
ŵi | The workload for processing block Bi.
ϖi | The predictive workload for processing block Bi.
ŵ*i | The average workload for processing each transaction in Bi.
ϖ*i | The predictive average workload for processing each transaction in Bi.
P | The sampling rate.
θi | The actual CPU efficiency for processing block Bi.
ei | The predictive CPU efficiency for processing block Bi.
ς | The available buffer space.
ti | The execution time to complete block Bi.
τi | The estimated execution time for processing block Bi.

4.3 Adaptation Scheme for CPU Power Awareness

In this section, we present the design of the adaptation scheme for CPU power awareness.

4.3.1 Basic Concept

The basic concept of our design is depicted in Figure 4-2, where the buffer is regarded as a circular queue. Conforming to the generic window model used in GIAMS, we assume that the input stream is processed block by block. After the completion of the (i−1)-th block Bi−1, the resource monitor can calculate the actual CPU efficiency θi−1 and predict the CPU efficiency ei for processing the i-th block Bi. A simple estimation that takes both the actual efficiency and the previously predicted efficiency into account, described in Eq. (4.1), is used, where α denotes a weight, 0 ≤ α ≤ 1; a value of α higher than 0.5 means that the actual efficiency is more important than the predicted one. Although we could use other, more complicated estimation methods, in this study we prefer simpler methods to avoid too much computation overhead.

ei = α θi−1 + (1 − α) ei−1 (4.1)

Let ϖi denote the estimated workload for processing block Bi and ς the currently available buffer space. Our intention is to ensure that, during the course of processing block Bi, the amount of arriving data will not exceed the available buffer size. If it would, we activate the load shedder to shed the input data with a sampling rate P. It is not hard to derive the value of P that satisfies this condition:

P × r × τi ≤ ς (4.2)

That is,

P ≤ ς / (r × τi) (4.3)

where the estimated execution time for processing Bi is τi = ϖi / ei.

Figure 4-2. CPU awareness scheme.
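The bound in Eq. (4.3) translates directly into a small function. The Python sketch below is only an illustration: the function name and the sample values are hypothetical, and capping P at 1 (meaning that no shedding is needed) is our own addition for the case in which the buffer can absorb all arriving data.

```python
def sampling_rate(buffer_space, arrival_rate, workload, efficiency):
    """Largest sampling rate P allowed by Eq. (4.3), capped at 1."""
    if efficiency <= 0:
        raise ValueError("efficiency must be positive")
    tau = workload / efficiency  # estimated execution time of the block
    if arrival_rate <= 0 or tau <= 0:
        return 1.0  # nothing arrives while processing: keep everything
    return min(1.0, buffer_space / (arrival_rate * tau))


# Hypothetical values: 400 operations at 100 operations per second give
# an estimated execution time of 4 seconds; 5 transactions arrive per
# second and the buffer has room for 10 transactions.
p = sampling_rate(buffer_space=10, arrival_rate=5,
                  workload=400, efficiency=100)
```

Here 20 transactions are expected to arrive while the block is processed, but only 10 fit into the buffer, so half of the input has to be shed.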

4.3.2 CPU Efficiency and Workload Estimation

In the description of the basic concept of our adaptation scheme for CPU awareness, two key points need further clarification: the monitoring of the CPU efficiency and the estimation of the workload ϖi for processing block Bi.

The CPU efficiency, though intuitively simple in meaning, i.e., the number of operations that can be accomplished within a time unit, is not easy to monitor and calculate in real time. Our idea is to approach the estimation from a computational complexity viewpoint. First, we observe that the most time-consuming procedure of algorithm GIAMS (see Figure 3-3) is the delay insert, which is responsible for decomposing transactions into itemsets and maintaining frequent patterns. The most important operations of delay insert are node creation, update, and replacement (replacement is described with the memory awareness scheme). For this reason, we can represent the CPU efficiency as the number of these operations accomplished within one second. The actual CPU efficiency θi for completing the process of block Bi can be defined as

θi = ŵi / ti (4.4)

where ŵi denotes the number of operations, including node insertion, update and replacement, for completing Bi and ti is the execution time.

The workload estimation ϖi for processing Bi, however, needs a different strategy because it has to be done before block Bi is processed.

Our idea is to estimate the average number of operations needed to process a transaction by using the statistics collected in processing the previous block Bi−1. Let ŵ*i−1 and ϖ*i−1 be the actual and estimated average number of operations to process a transaction for block Bi−1, respectively. That is, ŵ*i−1 = ŵi−1 / |Bi−1| and ϖ*i−1 = ϖi−1 / |Bi−1|. The predictive average workload ϖ*i is then obtained as a weighted combination of these two values, in the same manner as Eq. (4.1), and the predictive workload of the block follows as ϖi = ϖ*i × |Bi|.
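The bookkeeping performed by the resource monitor after each block can be sketched in Python as follows. The names `BlockStats` and `predict_next` are hypothetical, and one step is an assumption on our part: the predictive average workload is blended with the same weight α as in Eq. (4.1). The inputs used below are those of the example in this section.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    operations: float            # actual number of node operations
    seconds: float               # measured execution time of the block
    size: int                    # number of transactions in the block
    predicted_workload: float    # workload predicted before processing
    predicted_efficiency: float  # efficiency predicted before processing


def predict_next(prev, next_size, alpha):
    """Predict efficiency, workload and execution time of the next block."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    theta = prev.operations / prev.seconds  # Eq. (4.4)
    efficiency = alpha * theta + (1 - alpha) * prev.predicted_efficiency
    actual_avg = prev.operations / prev.size             # per transaction
    predicted_avg = prev.predicted_workload / prev.size  # per transaction
    # Assumption: same weighting scheme as Eq. (4.1).
    next_avg = alpha * actual_avg + (1 - alpha) * predicted_avg
    workload = next_avg * next_size
    return efficiency, workload, workload / efficiency


# Statistics of block B3 as given in the example of this section.
b3 = BlockStats(operations=36, seconds=2, size=3,
                predicted_workload=42, predicted_efficiency=21)
e4, workload4, tau4 = predict_next(b3, next_size=3, alpha=0.5)
```

The returned execution time estimate can be fed, together with the arrival rate and the available buffer space, into Eq. (4.3) to obtain the sampling rate for the next block.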

Consider the example stream in Figure 4-3, which is divided into blocks of three transactions each. Block B3 has just been finished, block B4 is about to be processed, block B5 is stored in the buffer, and block B6 is newly arriving from the stream. We assume that the predictive CPU efficiency e3 is 21; the actual workload ŵ3 is 36 and took 2 seconds; the predictive workload ϖ3 is 42; the available buffer space ς = 2; the data arrival rate r = 3; and the estimation weight α = 0.5.

Figure 4-3. An example data stream.


We first calculate the actual CPU efficiency θ3 = 36/2 = 18 and predict the CPU efficiency for processing block B4, e4 = 0.5 × 18 + 0.5 × 21 = 19.5. The average transactional workload for processing block B3 is ŵ*3 = 36/3 = 12, and its predictive counterpart is ϖ*3 = 42/3 = 14. Then we can predict the average transactional workload for processing block B4 as ϖ*4 = 0.5 × 12 + 0.5 × 14 = 13. So we have ϖ4 = 13 × 3 = 39, and the estimated execution time will be τ4 = 39/19.5 = 2. Finally, we obtain the sampling rate P = 2 / (3 × 2) = 2/6 ≈ 33.33%.

4.4 Adaptation Scheme for Available Memory Awareness

In this section, we describe the adaptation scheme for available memory awareness.

Recall that our framework relies on the maintenance of promising frequent itemsets, PF. Our concern thus is how to deal with the situation in which the available memory is not enough to hold all the frequent itemsets maintained in PF, and how to develop an adaptive scheme that adjusts PF to make the most of the currently available memory space.

The structure used in RA-GIAMS for realizing PF is a modification of the tree structure used in GIAMS, called Card-Stree, which is a forest of search trees ST1, ST2, …, STk keeping the itemsets of different cardinalities that appear in the current window, where STk maintains the set of frequent k-itemsets. We name the modified structure Card-Stree*. Each node in Card-Stree* except the root keeps the information of the maintained itemset. More specifically, for each itemset X, the node records X.id, the identifier of X; X.bidv, the vector of identifiers of the blocks in which X appears; X.countv, the vector that stores the number of occurrences of X within each block; and X.tlcount, the total number of times that X appears in the current window under concern. An example of Card-Stree* is depicted in Figure 4-4. Because each node
