Quality-Aware Sampling and its Applications in Incremental Data Mining


Kun-Ta Chuang, Member, IEEE, Keng-Pei Lin, and Ming-Syan Chen, Fellow, IEEE

Abstract—We explore in this paper a novel sampling algorithm, referred to as algorithm PAS (standing for Proportion Approximation Sampling), to generate a high-quality online sample with the desired sample rate. The sampling quality refers to the consistency between the population proportion and the sample proportion of each categorical value in the database. Note that the state-of-the-art sampling algorithm to preserve the sampling quality has to examine the population proportion of each categorical value in a pilot sample a priori and is thus not applicable to incremental mining applications. To remedy this, algorithm PAS adaptively determines the inclusion probability of each incoming tuple in such a way that the sampling quality can be sequentially preserved while also guaranteeing the sample rate close to the user specified one. Importantly, PAS not only guarantees the proportion consistency of each categorical value but also excellently preserves the proportion consistency of multivariate statistics, which will be significantly beneficial to various data mining applications. For better execution efficiency, we further devise an algorithm, called algorithm EQAS (standing for Efficient Quality-Aware Sampling), which integrates PAS and random sampling to provide the flexibility of striking a compromise between the sampling quality and the sampling efficiency. As validated in experimental results on real and synthetic data, algorithm PAS can stably provide high-quality samples with corresponding computational overhead, whereas algorithm EQAS can flexibly generate samples with the desired balance between sampling quality and sampling efficiency. 
In addition, while applying the sample generated by algorithms PAS and EQAS to incremental mining applications, a significant efficiency improvement can be obtained without compromising the resulting precision, showing the prominent advantage of both proposed algorithms to be the quality-aware sampling means for incremental mining applications.

Index Terms—Sequential sampling, incremental data mining.


1 INTRODUCTION

1.1 Motivations

Recently, important applications have called for incremental mining to discover up-to-date patterns hidden in the continuous input data [1], [2], [3], [4], [5]. It is believed that the demand for online sampling techniques is increasing since they can prominently reduce the computational cost of incremental mining applications [6]. However, using sampling prior to the targeted applications inevitably leads to the result being inconsistent with that obtained without sampling. If using sampling leads to a very inconsistent mining result, its usefulness for scaling up is in question. In practice, the level of consistency between results obtained in the whole population and those in a sample solely depends on the quality of the sample. Thus, how to guarantee the quality of samples is deemed the key to the success of sampling techniques [7]. In the literature, a common and successful measure of the sampling quality is the consistency between the population proportion and the sample proportion of every measured pattern [8], [9], [10], [11].

Traditionally, random sampling is the most widely utilized sampling strategy for data mining applications. According to the Chernoff bounds, the consistency between the population proportion and the sample proportion of a measured pattern can be probabilistically guaranteed when the sample size is large [9], [12]. However, the overhead of the posterior mining applications will be increased when the sample size is large, thus inevitably degrading the benefit of sampling. Sampling mechanisms to guarantee a high sampling quality without increasing the sample size are still strongly demanded. To achieve this, the state-of-the-art sampling approach, named algorithm EASE, was proposed in [8] to guarantee the quality of generated samples with a desired sample size. Specifically, the goal of EASE is to precisely preserve the population proportion of each categorical value in the sample. According to the proposed epsilon-approximation method, algorithm EASE will obtain the final sample with the desired sample size by a process of repeatedly halving the intermediate samples. As such, the difference between the sample proportion and the population proportion of each categorical value in the final sample can be limited below $\varepsilon$, where the magnitude of $\varepsilon$ depends on the desired sample size (a large sample size leads to a small $\varepsilon$ and, in contrast, a small sample size leads to a large $\varepsilon$). Algorithm EASE is shown to be an effective sampling means to provide prominent proportion consistency in a sample of a specified size. As compared to random sampling, the result in [8] demonstrates that preserving the population proportion of each categorical value in the sample can significantly improve the resulting model accuracy of various posterior applications such as association-rule mining and the $\chi^2$ test for independence, to name a few.

K.-T. Chuang is with the Graduate Institute of Communication Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, ROC. E-mail: [email protected].

K.-P. Lin is with the Department of Electrical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, ROC. E-mail: [email protected].

M.-S. Chen is with the Department of Electrical Engineering and the Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan, ROC. E-mail: [email protected].

Manuscript received 27 Jan. 2006; revised 5 Oct. 2006; accepted 9 Oct. 2006; published online 19 Jan. 2007.

For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0036-0106. Digital Object Identifier no. 10.1109/TKDE.2007.1005.

Note, however, that algorithm EASE in essence cannot sequentially generate the sample, thus it is not applicable to incremental applications. Formally, EASE, which can be categorized as a two-phase sampling mechanism [10], requires a pilot sample of the whole population to be generated beforehand, indicating that the population data set will be treated as a static one as opposed to a dynamic one. Such a constraint is infeasible in incremental mining applications, where they usually deal with time-variant data and each tuple is unknown before we receive it. Moreover, in EASE, each tuple will be repeatedly examined until it has been decided whether it is to be selected or discarded, implying that the sampling quality is acquired at the cost of execution efficiency. The required computational overhead of generating samples compromises the spirit of sampling to speed up the execution. Consequently, it is essential to develop a new sampling algorithm to incrementally generate high-quality samples while not compromising the sampling efficiency and not increasing the sample size.

As a result, we present in this paper a novel sampling approach, called algorithm PAS (standing for Proportion Approximation Sampling), to achieve the goal of generating a high-quality online sample with the user-specified sample rate. As with EASE, the sampling quality in PAS is measured as the level of the proportion consistency, i.e., the consistency between the sample proportion and the population proportion of each measured pattern. Note that, in addition to applications of the frequent-pattern mining and the $\chi^2$ test studied in [8], it is also reported that guaranteeing the proportion consistency provides a great benefit to many different mining applications such as supervised learning and clustering [9], [11]. We thus believe that providing the high proportion consistency of each measured pattern can lead to general-purpose and high-quality samples for different application needs. Furthermore, the proportion consistency can be guaranteed in two ways, namely, absolute proportion consistency and relative proportion consistency. Specifically, assuming $d_i$ and $s_i$ denote the population proportion and the sample proportion of a measured pattern, respectively, the value of $1 - \frac{|d_i - s_i|}{d_i}$ can be viewed as the relative proportion consistency of the pattern. Conversely, its absolute proportion consistency is equal to the value of $(1 - |d_i - s_i|)$. As opposed to guaranteeing the absolute proportion consistency (the goal in algorithm EASE), algorithm PAS aims to guarantee the relative proportion consistency because recent probabilistic thresholding methods for wavelet synopses point out that minimizing the relative error is the more desirable measure for data reduction techniques [13], [14]. In addition, due to the time-variant nature of real data, PAS sequentially reads the incoming population and generates the sample with the given sample rate on the fly, where the sliding window model is imposed. Therefore, the up-to-date population characteristics can be precisely maintained in the generated sample, allowing PAS to be directly applicable to incremental mining applications.

Briefly, the basic idea behind PAS is to adaptively determine the inclusion probability of each incoming tuple as time advances, where the inclusion probability will be determined according to two criteria: 1) the relative proportion "inconsistency" of every attribute value can be driven toward a user-specified error bound $\varepsilon$ and 2) the sample rate is close to the user-desired sample rate $p$. The concept is illustrated in Fig. 1. Specifically, PAS strives to minimize the relative proportion inconsistency of each attribute value progressively until the difference is smaller than $\varepsilon$. When $\varepsilon$ is specified close to zero, PAS will quickly and stably keep the relative proportion inconsistency close to zero, as shown in Fig. 1, even though the data distribution is not stationary in a time-variant data source. In contrast, simple random sampling cannot guarantee that the relative error always approaches zero in a time-variant data source, especially when the data distribution changes suddenly.

In addition, for multidimensional data, it is required to have an atomic unit of measured patterns, which is a multivariate statistic. However, maintaining the relative proportion consistency of every multivariate pattern incurs a large computational overhead, which is prohibitive in many applications. For efficiency purposes, PAS maintains proportion consistency of every attribute value, i.e., every single variate statistic, rather than every multivariate statistic, the same as in algorithm EASE. Importantly, even though PAS only maintains the relative proportion consistency of each categorical value, as shown in our analytical and algorithmic results, the relative proportion consistency of multivariate statistics can also be excellently preserved. In contrast, EASE may preserve the proportion consistency of each categorical value, but lose the proportion consistency of multivariate statistics.

Formally, as in algorithm EASE, algorithm PAS also unavoidably incurs computational overhead to guarantee the sampling quality by continuously tracking the relative proportion consistency of each categorical value. To provide the flexibility of striking a compromise between the sampling efficiency and the sampling quality, we further devise in this paper another sampling algorithm, named algorithm EQAS (standing for Efficient Quality-Aware Sampling). The framework of algorithm EQAS is exhibited in Fig. 2. Specifically, from our empirical studies, algorithm PAS can quickly and stably guarantee the relative proportion consistency with the corresponding computational overhead. On the other hand, random sampling is very efficient, but its relative proportion inconsistency can only be slowly reduced. Random sampling is also highly sensitive to the burst sampling error, which is common when the data distribution is time-variant [15]. Due to their complementary properties, algorithm EQAS is devised by integrating random sampling and algorithm PAS. By appropriately switching between random sampling and PAS, algorithm EQAS is able to preserve the advantages of these two schemes while diminishing their side effects. In addition, while applying the sample generated by algorithms PAS and EQAS to incremental mining applications, a significant efficiency improvement can be obtained without compromising the resulting precision, showing the prominent advantage of both proposed algorithms as quality-aware sampling mechanisms for incremental mining applications.

Fig. 1. Illustration of the relative proportion inconsistency over time.

1.2 Related Works

The scalability problem in database applications has been fully explored with the help of data reduction techniques such as sampling, histogram [14], and wavelet decomposition [13]. The discussion here is limited to sampling techniques. For other data reduction techniques, which are out of scope for this paper, the reader is asked to follow the pointers in [16]. In practice, sampling techniques have been successfully used in various social and scientific applications. The general introduction of sampling can be found in many well-known works, such as [7], [17]. Here, we focus on discussing sampling techniques related to obtaining high-quality samples for data mining applications.

In essence, sampling has a rich history in statistics with many variants, including simple random sampling with/without replacement [7], adaptive sampling [18], and so on. Among them, simple random sampling is the most widely employed strategy due to its generality and its simplicity. Recent advances in streaming analysis and database query optimization specifically pay attention to utilizing the variants of simple random sampling, called sequential random sampling [19] and reservoir sampling [20], due to their high sampling efficiency. Explicitly, reservoir sampling maintains a random sample with a fixed size $M$ as time advances and Method D in [19] will progressively generate the sample with the specified sample rate $p$. Those two sampling methods can skip some data elements without processing them while guaranteeing the generated sample is a uniform random sample. However, it is also reported that random samples suffer from insufficiency of the sampling quality, thus resulting in generating a model with low accuracy [8], [21], [22].
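As a point of reference for the fixed-size scheme mentioned above, the following is a minimal sketch of the basic replace-at-random form of reservoir sampling. It is an illustration only: the function name `reservoir_sample` and its parameters are chosen here, and the sketch visits every element rather than implementing the skipping optimization that the cited methods use.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Keep a uniform random sample of at most m items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < m:
            # The first m items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) enters with probability m / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)
            if j < m:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(0))
print(len(sample))
```

The sample size stays fixed at `m` regardless of how many items arrive, which is the property contrasted in the text with a rate-based sampler such as Method D.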

To obtain a model with high accuracy, new sampling approaches to generate high-quality samples are required. Algorithm EASE [8] is devised to guarantee high absolute proportion consistency of each categorical value. As we have discussed in Section 1.1, the goal of EASE is to limit the absolute difference between the sample proportion and the population proportion of each categorical value in the final sample below $\varepsilon$, where the magnitude of $\varepsilon$ depends on the desired sample size and other parameters. Note that, for a measured pattern with the population proportion close to one, its absolute proportion consistency and relative proportion consistency are roughly the same. However, if the population proportion is small, they are different. For example, representing $d_i = 0.01$ by $s_i = 0.02$ and representing $d_i = 0.1$ by $s_i = 0.11$ have the same absolute proportion consistency. However, the former has a 100 percent error rate and the latter has a 10 percent error rate, which is what the relative proportion consistency reflects. In this case, $s_i = 0.02$ deviates quite far from $d_i = 0.01$. Equally minimizing the absolute proportion difference of every attribute value may result in poor performance for applications that are sensitive to cases such as $d_i = 0.01$ in the previous example. Clearly, most data mining applications will prefer the relative proportion consistency since mining algorithms are usually devoted to the discovery of uncommon/surprising patterns (usually with a small number of occurrences) as opposed to the discovery of common-sense knowledge (usually with a large number of occurrences) [23].
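The numerical example above can be reproduced with a few lines of Python. This is a small sketch using the definitions from Section 1.1 (absolute consistency $1 - |d_i - s_i|$, relative consistency $1 - |d_i - s_i|/d_i$); the function names are chosen here for illustration.

```python
def absolute_consistency(d, s):
    """1 - |d - s| for population proportion d and sample proportion s."""
    return 1.0 - abs(d - s)


def relative_consistency(d, s):
    """1 - |d - s| / d; requires d > 0."""
    return 1.0 - abs(d - s) / d


for d, s in [(0.01, 0.02), (0.10, 0.11)]:
    print(
        f"d={d}, s={s}: "
        f"absolute={absolute_consistency(d, s):.4f}, "
        f"relative={relative_consistency(d, s):.4f}"
    )
```

Both pairs have the same absolute consistency, while the relative consistency separates the rare value (100 percent error) from the more frequent one (10 percent error).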

In addition to algorithm EASE, density biased sampling (abbreviated as DBS) is another sampling strategy which recently received a great deal of attention in the data mining research community [21]. Specifically, DBS is devised based on the observation that the distribution of clusters’ sizes in real data is usually highly skewed. In such cases, random samples may miss points from small but dense regions, thus resulting in the loss of small clusters after sampling. DBS oversamples the regions with high spatial density and downsamples the regions with low spatial density. Note that the goal of DBS intrinsically differs from ours in this paper. First, we fairly reduce the difference between the population proportion and the sample proportion of each attribute value, whereas DBS emphasizes the density differences of each spatial region. Second, the targeted applications are quite different. DBS focuses on identifying small clusters rather than preserving the consistency between clustering results in the whole population and the sample. Using DBS is thus not appropriate for other clustering applications such as subspace clustering [24]. In contrast, we aim to make the resulting model obtained in the sample be consistent with that obtained in the population, which is applicable to various mining applications.

Progressive sampling/dynamic sampling [25], [22] is another way to improve the resulting model accuracy in the literature since the sample size estimated by Chernoff bounds is conservative and is usually too large for specified applications [12]. Progressive sampling algorithms are devised by iteratively executing the targeted application on random samples whose sizes are progressively increased and the process will be terminated when the mining accuracy is no longer significantly improved. Finally, a satisfactory model accuracy can be obtained without a prohibitively large sample size. However, the targeted application may be executed on many samples with varying sample sizes, which is also time-consuming.

Fig. 2. Framework of sequential sampling PAS and EQAS over sliding

Many recent applications, including credit card fraud protection and network intrusion detection, call for incremental mining algorithms. Since their data are usually time-variant and the data characteristics may drift as time advances, traditional algorithms which are devised for mining on static data will fail in such cases. Various incremental mining algorithms, e.g., incremental mining of frequent itemsets [4], [6], incremental conceptual clustering [3], and concept-drifting classification [26], are thus specifically devised. We omit the details of these algorithms and concentrate on the discussion of sampling strategies in the incremental mining scenario. Due to the time-variant nature, incremental mining algorithms are designed to analyze the most recent data in order to retrieve up-to-date patterns [15]. Two common approaches are usually utilized to deal with old data in such cases. The first one is aging [27], where each data item is assigned a weight and more recent data have higher weights. The other approach is to use a sliding window [4], where only the most recent data covered by a window are considered. Formally, sampling approaches such as algorithm EASE will fail in both the aging and the sliding window models since EASE assumes the population proportion in the whole population is static and can be known in advance. For incremental mining, only online and sequential sampling approaches can be utilized, such as Method D [19], reservoir sampling [20], and priority sampling [15], where, however, random samples are generated, rather than high-quality samples such as the one generated by EASE.

1.3 Our Contributions

Our contributions in this paper are threefold:

1. We propose algorithm PAS to sequentially generate a sample in which the relative proportion inconsistency of each categorical value can be minimized toward a user-specified bound $\varepsilon$ while also guaranteeing the sample rate close to the user-specified one. Importantly, although PAS targets the relative proportion consistency of each categorical value, as shown in our analytical and algorithmic results, the relative proportion consistency of multivariate statistics can also be excellently preserved, which will be significantly beneficial to data mining applications.

2. For better execution efficiency, we further devise another sampling algorithm, EQAS, to provide the flexibility of striking a compromise between the sampling efficiency and the sampling quality.

3. We complement our analytical and algorithmic results by a thorough empirical study on real data and synthetic data and show that algorithm PAS can provide high-quality samples with slight computational overhead and algorithm EQAS can flexibly generate samples with the desired balance between sampling quality and sampling efficiency. We also explore their benefits for incremental mining applications. The result demonstrates their prominent advantages as effective quality-aware sampling means for incremental mining applications.

The rest of the paper is organized as follows: Section 2 introduces algorithm PAS. In Section 3, we give the details of algorithm EQAS. The experimental results are shown in Section 4. Finally, this paper concludes with Section 5.

2 ONLINE SAMPLING FOR GUARANTEEING RELATIVE PROPORTION CONSISTENCY

2.1 Fundamental Mathematical Model

In this section, we derive our model to generate online samples of guaranteed relative proportion consistency. For simplicity and effectiveness, we intend to follow the idea of random sampling without replacement. The variant lies in the strategy of determining the inclusion probability. As opposed to the fixed inclusion probability in random sampling without replacement, our model will dynamically determine the inclusion probability of each incoming tuple so that we can guarantee the relative proportion consistency on the fly while also ensuring that the size of generated sample is under the user's control. We then discuss the analytical details step by step. For ease of reference, Table 1 shows a summary of major symbols used in this paper.

2.1.1 Problem Description

Suppose that $D$ is a relational table with schema $(A_1, A_2, \ldots, A_h)$, where $A_1, \ldots, A_h$ are attributes and $h$ is the number of attributes in $D$. Let $t_i = (x_{i1}, x_{i2}, \ldots, x_{ih})$ be the $i$th tuple in $D$, where $x_{ij} \in A_j$ for $1 \le j \le h$. Moreover, assuming that $a_j$ denotes an attribute value in the domain of $A_j$, $1 \le j \le h$, $a_j$ is said to be contained in $t_i$, i.e., $a_j \in t_i$, iff $a_j = x_{ij}$. Without loss of generality, we assume that 1) the order $i$ indicates the receiving order of $t_i$, 2) $D$ contains infinitely many tuples, and 3) $A_j$ has a finite domain, for $1 \le j \le h$ (continuous attributes can be discretized using methods such as that described in [28]). Note that those assumptions will be equally applicable to infinite streams and finite data sets.

To formalize the window-based sampling model,¹ we assume that $D$ is segmented into disjoint windows, $\{W_1, W_2, \ldots, W_n, \ldots\}$, in light of a predefined time granularity such as "day," "business-week," "month," "quarter," and "year," to name a few. As such, $W_k$ will consist of a set of tuples, $\{t_{k1}, t_{k2}, \ldots, t_{ki}\}$, where each one is received within the corresponding time period of $W_k$. Let $|W_k|$ and $N^k(a_j)$ be the number of tuples in $W_k$ and the number of tuples containing the value $a_j$ in $W_k$, respectively. Then, we have the population proportion of $a_j$ in $W_k$, denoted by $\mathrm{sup}(a_j, W_k)$, where $\mathrm{sup}(a_j, W_k) = N^k(a_j)/|W_k|$. In addition, let $S_k$ denote the sample window corresponding to $W_k$, where the set of tuples in $S_k$ is a subset of the tuples in $W_k$. Also, let $|S_k|$ and $s\_N^k(a_j)$ denote the number of tuples in $S_k$ and the number of tuples containing the value $a_j$ in $S_k$, respectively. We have the sample proportion of $a_j$ in $S_k$, denoted by $\mathrm{sup}(a_j, S_k)$, where $\mathrm{sup}(a_j, S_k) = s\_N^k(a_j)/|S_k|$.

1. We adopt the sliding window-based model in our work. Note that the sliding window-based sample is also applicable to the aging-based incremental mining applications.

TABLE 1
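To make the window notation concrete, the following sketch computes the population proportion and the sample proportion of an attribute value for a toy window. It assumes tuples are plain Python tuples and identifies an attribute value by the hypothetical pair `(j, value)`, where `j` is the attribute position; these representation choices are made here for illustration.

```python
from collections import Counter


def value_counts(tuples):
    """Count, for each (attribute index, value) pair, the tuples containing it."""
    counts = Counter()
    for t in tuples:
        for j, value in enumerate(t):
            counts[(j, value)] += 1
    return counts


def sup(attr_value, tuples):
    """Proportion of tuples that contain attr_value = (j, value)."""
    if not tuples:
        return 0.0
    return value_counts(tuples)[attr_value] / len(tuples)


# A toy population window W_k with h = 2 attributes and a sample window S_k.
window = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "x")]
sample = [window[0], window[2]]

print(sup((0, "a"), window))  # population proportion of value "a" of attribute 0
print(sup((0, "a"), sample))  # sample proportion of the same value
```

Here the value "a" appears in three of four population tuples but in one of two sample tuples, so the two proportions differ, which is the kind of gap the sampling quality measures.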

Our goal in this paper is to efficiently and sequentially generate a high-quality sample. Following the consideration in [8], the sampling quality considered in this study also refers to the consistency between the sample proportion $\mathrm{sup}(a_j, S_k)$ and the population proportion $\mathrm{sup}(a_j, W_k)$ of each attribute value $a_j$ in a window $W_k$. However, the solution proposed in [8] merely attempts to reduce the absolute proportion difference as much as possible, i.e., minimizing $|\mathrm{sup}(a_j, W_k) - \mathrm{sup}(a_j, S_k)|$ for every value $a_j$. As pointed out earlier, minimizing the relative error is the more desirable measure for applications to discover uncommon/surprising patterns. Therefore, to further ensure the generated sample can better characterize the time-variant data source, we attempt to generate a sample in which the relative proportion difference can be bounded below the specified error threshold $\varepsilon$.

In general, one way to achieve the bounded relative proportion difference is to increase the sample size. However, a large sample size is usually prohibitive. The sample rate/sample size should be under the user’s control to prevent generating a large sample. We then formally present our goal as follows:

Proposition 1. Given a desired sample rate $p$ and the relative error bound $\varepsilon$, we attempt to generate a sample $\{S_1, \ldots, S_n\}$ from the population $\{W_1, \ldots, W_n\}$, in which 1) $\frac{|S_k|}{|W_k|} \approx p$, where $1 \le k \le n$, and 2)

$$|\mathrm{sup}(a_j, W_k) - \mathrm{sup}(a_j, S_k)| \le \varepsilon \cdot \mathrm{sup}(a_j, W_k),$$

for every attribute value $a_j$.

2.1.2 Online Sampling Model with Equivalent Problem Transformation

Formally, it is difficult to devise an approach to simultaneously consider those two heterogeneous criteria in Proposition 1 since various variables need to be taken into consideration at the same time. To solve the problem, we derive Theorem 1 below:

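Theorem 1. Suppose that, in a sample $S = \{S_1, \ldots, S_n\}$, we have $\left|\frac{s\_N^k(a_j)}{N^k(a_j)} - p\right| \le \frac{\varepsilon}{2+\varepsilon}\, p$, for every value $a_j$ in each window. Then, the sample also satisfies: 1) $(1 - \varepsilon)p \le \frac{|S_k|}{|W_k|} \le (1 + \varepsilon)p$, where $1 \le k \le n$, and 2)

$$|\mathrm{sup}(a_j, W_k) - \mathrm{sup}(a_j, S_k)| \le \varepsilon \cdot \mathrm{sup}(a_j, W_k),$$

for every value $a_j$.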

In the interests of space, proofs are given in the Appendix for interested readers.
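As an informal sanity check of Theorem 1 (not a substitute for the proof), the sketch below takes a concrete window and sample, measures the largest relative deviation of the per-value ratio $s\_N^k(a_j)/N^k(a_j)$ from $p$, converts it to the smallest $\varepsilon$ for which the premise holds, and then tests both conclusions. The helper names, the toy data generator, and the floating-point tolerance are assumptions made here.

```python
import random
from collections import Counter


def counts(tuples):
    """Count tuples containing each (attribute index, value) pair."""
    c = Counter()
    for t in tuples:
        for j, v in enumerate(t):
            c[(j, v)] += 1
    return c


def check_theorem1(window, sample, p, tol=1e-12):
    n_w, n_s = counts(window), counts(sample)
    # Largest relative deviation of s_N(a_j) / N(a_j) from p over all values.
    delta = max(abs(n_s[a] / n_w[a] - p) / p for a in n_w)
    if delta >= 1:
        return None  # the premise cannot hold for any finite epsilon
    # Smallest epsilon satisfying delta <= epsilon / (2 + epsilon).
    eps = 2 * delta / (1 - delta)
    rate = len(sample) / len(window)
    rate_ok = (1 - eps) * p - tol <= rate <= (1 + eps) * p + tol
    prop_ok = all(
        abs(n_w[a] / len(window) - n_s[a] / len(sample))
        <= eps * n_w[a] / len(window) + tol
        for a in n_w
    )
    return eps, rate_ok, prop_ok


rng = random.Random(1)
window = [(rng.choice("abc"), rng.choice("xy")) for _ in range(2000)]
sample = [t for t in window if rng.random() < 0.2]
result = check_theorem1(window, sample, p=0.2)
print(result)
```

Whenever the premise is satisfiable, both flags are expected to come back true, since the conclusions follow from the bound on the per-value ratios alone.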

Theorem 1 points out that a sample in which $\left|\frac{s\_N^k(a_j)}{N^k(a_j)} - p\right| \le \frac{\varepsilon}{2+\varepsilon}\, p$ for every value $a_j$ will also satisfy the goal in Proposition 1. As such, our goal shown in Proposition 1 can be equivalently transformed to generate a sample in which $\left|\frac{s\_N^k(a_j)}{N^k(a_j)} - p\right| \le \frac{\varepsilon}{2+\varepsilon}\, p$ for every value $a_j$.

However, the most important problem in the considered model is that, in the scenario of incremental mining, the frequency of the attribute value $a_j$ in a window $W_k$, i.e., $N^k(a_j)$, will be dynamic and up-to-date. Moreover, each tuple should be processed on the fly, meaning that, once a tuple is selected or discarded, we cannot revoke the decision for this tuple. To meet such a constraint of sampling algorithms for incremental mining, we reasonably assume that the latest arriving tuple is the last tuple in the window and, thus, we shall determine the inclusion probability of this tuple so as to achieve our goal. In light of Proposition 1 and Theorem 1, we then formally present Proposition 2 as our new goal to generate online high-quality samples.

Proposition 2. At the arrival of the tuple $t_i$, where $t_i \in W_k$, we aim to determine the inclusion probability of $t_i$ in such a way that we can have $\left|\frac{s\_N_i^k(a_j)}{N_i^k(a_j)} - p\right| \le \frac{\varepsilon}{2+\varepsilon}\, p$ for every categorical value $a_j$ appearing in the window, where $s\_N_i^k(a_j)$ and $N_i^k(a_j)$ denote the frequency of $a_j$ in the sample window $S_k$ and the population window $W_k$ after the selection/discard of $t_i$, respectively.

For simplicity, let $s\_N_i^k(a_j)/N_i^k(a_j)$ be denoted by $F_k(a_j, i)$. Importantly, $|F_k(a_j, i) - p|$ is equal to $|F_k(a_j, i-1) - p|$ for every attribute value $a_j$ if $a_j \notin t_i$, implying that selecting $t_i$ or not will not affect the proportion consistency of $a_j$ when $a_j \notin t_i$ (for simplicity, hereafter we assume $t_i$ and $t_{i-1}$ belong to the same window $W_k$). Therefore, when the tuples sequentially arrive, we only need to ensure $|F_k(a_j, i) - p| \le \frac{\varepsilon}{2+\varepsilon}\, p$ for the value $a_j$ belonging to the arriving tuple $t_i$. This observation implies that the inclusion probability of $t_i$ can be determined by considering at most $h$ homogeneous variables.

Nevertheless, due to the inherent limit of sampling, it is difficult to ensure $|F_k(a_j, i) - p| \le \frac{\varepsilon}{2+\varepsilon}\, p$ for every attribute value $a_j$ in the presence of a small $\varepsilon$ and a small $p$. The problem will be apparent when a window is just initialized or $N_i^k(a_j)$ is very small. Note that a small $\varepsilon$ and a small $p$ are indeed two conflicting goals (in algorithm EASE, a small sample size will inevitably incur a large $\varepsilon$). Importantly, we can still pursue the goal in Proposition 2 by minimizing the difference between $F_k(a_j, i)$ and $p$ until the difference is smaller than $\frac{\varepsilon}{2+\varepsilon}\, p$. The feasibility of such a concept is shown in Theorem 2 below:

Theorem 2. Suppose that $\frac{s\_N^k(a_j)}{N^k(a_j)} = (1 + \delta_j)\, p$ for every value $a_j$ in the window $W_k$, which indicates that $\left|\frac{s\_N^k(a_j)}{N^k(a_j)} - p\right| = |\delta_j|\, p$. Let

$$\delta = \frac{\sum_{j=1}^{|A|} \delta_j\, N^k(a_j)}{\sum_{j=1}^{|A|} N^k(a_j)},$$

where $|A|$ is the number of distinct attribute values in $W_k$. We will have the sample rate $\frac{|S_k|}{|W_k|}$ equal to $p\,(1 + \delta)$. Furthermore, the relative proportion difference of the attribute value $a_j$ will be equal to

$$\frac{\mathrm{sup}(a_j, W_k) - \mathrm{sup}(a_j, S_k)}{\mathrm{sup}(a_j, W_k)} = 1 - \frac{1 + \delta_j}{1 + \delta}.$$

Formally, Theorem 2 provides the basis that, if we can minimize $|\delta_j|$ or $\delta_j^2$, i.e., minimize the difference between $\frac{s\_N^k(a_j)}{N^k(a_j)}$ and $p$, the resulting sample rate $\frac{|S_k|}{|W_k|}$ will be close to $p$ and the relative proportion difference of each attribute value $a_j$ will be close to zero. Based on the foregoing, we therefore aim to determine the inclusion probability of $t_i$ when $t_i$ arrives according to the criterion to minimize $[F_k(a_j, i) - p]^2$, where $a_j \in t_i$.
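The two identities in Theorem 2 can be checked numerically on a small example. The sketch below assumes a single attribute, so that the per-value counts sum to the window sizes, writes each per-value ratio as $(1 + \text{delta}_j)\,p$, and uses the name `delta_bar` (chosen here) for the count-weighted mean of the deviations; the counts themselves are made-up illustrative numbers.

```python
def theorem2_quantities(pop_counts, sample_counts, p):
    """Return per-value deviations, their weighted mean, the rate, and relative differences."""
    total_pop = sum(pop_counts.values())
    total_sample = sum(sample_counts.values())
    # delta_j is defined by s_N(a_j) / N(a_j) = (1 + delta_j) * p.
    delta = {a: sample_counts[a] / (p * pop_counts[a]) - 1 for a in pop_counts}
    # Count-weighted mean of the deviations.
    delta_bar = sum(delta[a] * pop_counts[a] for a in pop_counts) / total_pop
    rate = total_sample / total_pop
    # Signed relative proportion difference of each value.
    rel_diff = {
        a: (pop_counts[a] / total_pop - sample_counts[a] / total_sample)
        / (pop_counts[a] / total_pop)
        for a in pop_counts
    }
    return delta, delta_bar, rate, rel_diff


pop = {"a": 500, "b": 300, "c": 200}   # hypothetical N(a_j) for one attribute
smp = {"a": 55, "b": 27, "c": 20}      # hypothetical s_N(a_j)
delta, delta_bar, rate, rel_diff = theorem2_quantities(pop, smp, p=0.1)

for a in pop:
    predicted = 1 - (1 + delta[a]) / (1 + delta_bar)
    print(a, round(rel_diff[a], 6), round(predicted, 6))
print(round(rate, 6), round(0.1 * (1 + delta_bar), 6))
```

The measured relative differences and sample rate agree with the closed forms, and both shrink toward zero and $p$ respectively as the individual deviations shrink.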

2.1.3 Inclusion Probability with Minimized Relative Proportion Difference

Before presenting the details of the approach to determine the inclusion probability by minimizing $[F_k(a_j, i) - p]^2$ for $a_j \in t_i$, we first introduce the proportion-preserved values, defined as follows:

Definition 1 (Proportion-Preserved Values). A value $a_j \in t_i$ is called a proportion-preserved value of $t_i$ if we have $|\mathrm{sup}_i(a_j, W_k) - \mathrm{sup}_i(a_j, S_k)| \le \varepsilon \cdot \mathrm{sup}_i(a_j, W_k)$, whether $t_i$ will be sampled or not, where $\mathrm{sup}_i(a_j, W_k)$ and $\mathrm{sup}_i(a_j, S_k)$ denote the population proportion and the sample proportion of $a_j$ after the selection/discard of $t_i$, respectively.

Note that our essential goal is to have the relative proportion difference of every attribute value $a_j$ being bounded below $\varepsilon$ and proportion-preserved values indeed satisfy this requirement. Therefore, we will only need to concentrate on minimizing the proportion differences of the remaining attribute values of $t_i$ (for ease of presentation, we defer the advantage of excluding proportion-preserved values to Section 3.4). In view of this, proportion-preserved values of $t_i$ will be excluded when we determine the inclusion probability of $t_i$. Note that, before the determination of the inclusion probability of $t_i$, we can determine whether a value is a proportion-preserved value of $t_i$ or not in light of Lemma 1 below.

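Lemma 1. For the value $a_j \in t_i$, $a_j$ is a proportion-preserved value of $t_i$ if two criteria are satisfied:

1. $\left| \dfrac{s\_N^{k}_{i-1}(a_j) + 1}{|S_{k,i-1}| + 1} - \dfrac{N^{k}_{i-1}(a_j) + 1}{|W_{k,i-1}| + 1} \right| \le \varepsilon \, \dfrac{N^{k}_{i-1}(a_j) + 1}{|W_{k,i-1}| + 1}$, and

2. $\left| \dfrac{s\_N^{k}_{i-1}(a_j)}{|S_{k,i-1}|} - \dfrac{N^{k}_{i-1}(a_j) + 1}{|W_{k,i-1}| + 1} \right| \le \varepsilon \, \dfrac{N^{k}_{i-1}(a_j) + 1}{|W_{k,i-1}| + 1}$,

where $|S_{k,i-1}|$ and $|W_{k,i-1}|$ denote the number of tuples in the sample window $S_k$ and the population window $W_k$ after the selection/discard of $t_{i-1}$, respectively.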
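The two criteria of Lemma 1 can be checked directly from the four counters available before $t_i$ is processed. The following is a minimal sketch, not the authors' implementation: the function name and argument names are hypothetical, and treating the sample proportion of an empty sample as zero is an assumption, since the source does not say how the 0/0 case is handled.

```python
def is_proportion_preserved(s_n, n, s_size, w_size, eps):
    """Check both criteria of Lemma 1 for one attribute value a_j of t_i.

    s_n, n         : sample / population counts of a_j before t_i arrives
    s_size, w_size : |S_{k,i-1}| and |W_{k,i-1}|
    eps            : relative error bound
    """
    pop = (n + 1) / (w_size + 1)            # population proportion once t_i has arrived
    bound = eps * pop
    if_selected = (s_n + 1) / (s_size + 1)  # sample proportion if t_i is selected
    # Assumption: an empty sample has proportion 0 for every value.
    if_discarded = s_n / s_size if s_size > 0 else 0.0
    return abs(if_selected - pop) <= bound and abs(if_discarded - pop) <= bound
```

With the counts used in Example 2.2 ($\varepsilon = 0.3$, two sampled tuples out of eight), the check accepts H and rejects B.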

Suppose that $V_i$ denotes the set of attribute values of $t_i$, excluding proportion-preserved values of $t_i$. We then discuss the way to determine the inclusion probability of $t_i$ to minimize $[F_k(a_j, i) - p]^2$, where $a_j \in V_i$. Note that, for the value $a_j \in V_i$, $N^{k}_{i}(a_j)$ will be equal to $N^{k}_{i-1}(a_j) + 1$. However, due to randomness, $s\_N^{k}_{i}(a_j)$ will be uncertain before $t_i$ has been sampled or discarded. We will have $s\_N^{k}_{i}(a_j) = s\_N^{k}_{i-1}(a_j) + 1$ if $t_i$ is sampled, or $s\_N^{k}_{i}(a_j) = s\_N^{k}_{i-1}(a_j)$ if $t_i$ is discarded. Since $s\_N^{k}_{i}(a_j)$ cannot be exactly identified before selecting/discarding $t_i$, $E[F_k(a_j, i)]$ will be the best estimator of $F_k(a_j, i)$ in such situations, where $E[F_k(a_j, i)]$ is the expectation of $F_k(a_j, i)$. Specifically, to select or to discard $t_i$ is a Bernoulli trial [7] for every value $a_j \in t_i$, indicating that $E[F_k(a_j, i)]$ is equal to $\dfrac{s\_N^{k}_{i-1}(a_j) + pr(t_i)}{N^{k}_{i-1}(a_j) + 1}$, where $pr(t_i)$ is the inclusion probability of $t_i$. Let $|V_i|$ denote the number of attribute values in $V_i$. Following the conclusion of Theorem 2, we thus aim to minimize $\sum_{j=1}^{|V_i|} (E[F_k(a_j, i)] - p)^2$. Suppose that $\hat{pr}(t_i)$ denotes the inclusion probability of $t_i$ corresponding to the minimization of $\sum_{j=1}^{|V_i|} (E[F_k(a_j, i)] - p)^2$. We can derive the closed form of $\hat{pr}(t_i)$, as shown in Theorem 3 below:

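Theorem 3. Note that $\hat{pr}(t_i)$ can be formalized as

$$\hat{pr}(t_i) = \arg\min_{0 \le pr(t_i) \le 1} \sum_{j=1}^{|V_i|} \left[ \frac{s\_N^{k}_{i-1}(a_j) + pr(t_i)}{N^{k}_{i-1}(a_j) + 1} - p \right]^2 .$$

Suppose that

$$a = \sum_{j=1}^{|V_i|} \left( \frac{1}{N^{k}_{i-1}(a_j) + 1} \right)^2 \quad \text{and} \quad b = \sum_{j=1}^{|V_i|} \frac{1}{N^{k}_{i-1}(a_j) + 1} \left( \frac{s\_N^{k}_{i-1}(a_j)}{N^{k}_{i-1}(a_j) + 1} - p \right).$$

The closed form of $\hat{pr}(t_i)$ will be

$$\hat{pr}(t_i) = \begin{cases} -\dfrac{b}{a}, & \text{if } 0 \le -\dfrac{b}{a} \le 1, \\ 1, & \text{if } -\dfrac{b}{a} > 1, \\ 0, & \text{if } -\dfrac{b}{a} < 0. \end{cases}$$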

Consequently, we can devise algorithm PAS as a sequential sampling mechanism which determines the inclusion probability of $t_i$ as $\hat{pr}(t_i)$ when $t_i$ arrives. As such, the goal in Proposition 2 will be achieved, meaning that our essential goal in Proposition 1 is also achieved.
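The closed form in Theorem 3 is the unconstrained minimizer of a quadratic in $pr(t_i)$, clipped to $[0, 1]$. A minimal sketch follows. The function name is hypothetical, and falling back to $p$ when $V_i$ is empty is an assumption, since the source does not state what PAS does when every value of $t_i$ is proportion-preserved.

```python
def inclusion_probability(counts, p):
    """Closed-form inclusion probability of Theorem 3.

    counts : list of (s_n, n) pairs, one per value in V_i, where s_n and n
             are the sample and population counts before t_i arrives
    p      : desired sample rate
    """
    if not counts:
        return p  # assumption: V_i is empty, so behave like random sampling
    a = sum((1.0 / (n + 1)) ** 2 for _, n in counts)
    b = sum((1.0 / (n + 1)) * (s_n / (n + 1) - p) for s_n, n in counts)
    return min(1.0, max(0.0, -b / a))  # clip -b/a to the interval [0, 1]
```

For an initialized window (all counts zero) the result equals $p$, which is the behavior of random sampling for the first tuple.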

Furthermore, the previous discussion only shows how to guarantee the precision of the marginal distribution, i.e., the relative proportion consistency of each categorical value, rather than how to guarantee the precision of the joint distribution. Importantly, Theorem 4 below shows that our model will also minimize the relative proportion inconsistency of multivariate statistics, thus ensuring the preservation of the joint distribution.

Theorem 4. Let $M_i$ denote a multivariate statistic in the database. PAS will minimize the relative inconsistency between its sample proportion and the population proportion.


Since mining applications usually concentrate on finding interesting multidimensional knowledge, it is clear that PAS can generate a more desirable sample for different application needs.

2.2 Examples of PAS

We show some examples to illustrate operations of PAS. Suppose that the sample rate $p$ and the error bound $\varepsilon$ are specified as 33 percent and 0.3, respectively. Our goal is to sequentially generate a sample in which the relative proportion difference of each attribute value can be bounded below 0.3.

Example 2.1. For the first tuple, $t_1 = (A, X, F)$, we have $N^{1}_{0}(A) = 0$, $s\_N^{1}_{0}(A) = 0$, $N^{1}_{0}(X) = 0$, $s\_N^{1}_{0}(X) = 0$, $N^{1}_{0}(F) = 0$, and $s\_N^{1}_{0}(F) = 0$ since the window is initialized. According to Lemma 1, the attribute values A, X, and F are not proportion-preserved values of $t_1$. Therefore, the inclusion probability $\hat{pr}(t_1)$ of $t_1$ is determined by considering all three values:

$$\hat{pr}(t_1) = -\frac{\frac{1}{1}\left(\frac{0}{1} - 0.33\right) + \frac{1}{1}\left(\frac{0}{1} - 0.33\right) + \frac{1}{1}\left(\frac{0}{1} - 0.33\right)}{\left(\frac{1}{1}\right)^2 + \left(\frac{1}{1}\right)^2 + \left(\frac{1}{1}\right)^2} = 0.33 .$$

As in random sampling, the inclusion probability of the first tuple in PAS is equal to $p$ to ensure that $E[\mathrm{sup}(A, S_1)]$, $E[\mathrm{sup}(X, S_1)]$, and $E[\mathrm{sup}(F, S_1)]$ are equal to $\mathrm{sup}(A, W_1)$, $\mathrm{sup}(X, W_1)$, and $\mathrm{sup}(F, W_1)$, respectively. One important property of PAS is thus identified: The sample proportion of each attribute value is an unbiased estimator of the corresponding population proportion, which is an essential requirement for probabilistic sampling methods.

Example 2.2. Suppose that two tuples were selected in the sample before the tuple $t_9 = (B, Y, H) \in W_1$ arrives. In addition, we have $N^{1}_{8}(B) = 2$, $s\_N^{1}_{8}(B) = 0$, $N^{1}_{8}(Y) = 2$, $s\_N^{1}_{8}(Y) = 0$, $N^{1}_{8}(H) = 4$, and $s\_N^{1}_{8}(H) = 1$. Note that, for attribute value H,

$$\left| \frac{s\_N^{1}_{8}(H) + 1}{2 + 1} - \frac{N^{1}_{8}(H) + 1}{8 + 1} \right| \le 0.3 \cdot \frac{N^{1}_{8}(H) + 1}{8 + 1} \quad \text{and} \quad \left| \frac{s\_N^{1}_{8}(H)}{2} - \frac{N^{1}_{8}(H) + 1}{8 + 1} \right| \le 0.3 \cdot \frac{N^{1}_{8}(H) + 1}{8 + 1},$$

meaning that H is a proportion-preserved value of $t_9$. Therefore, we will only consider B and Y when we calculate the inclusion probability of $t_9$, which yields

$$\hat{pr}(t_9) = -\frac{\frac{1}{3}\left(\frac{0}{3} - 0.33\right) + \frac{1}{3}\left(\frac{0}{3} - 0.33\right)}{\left(\frac{1}{3}\right)^2 + \left(\frac{1}{3}\right)^2} = 0.99 .$$

In this case, PAS prefers to select $t_9$ with a high probability of 99 percent, and we can expect $|\mathrm{sup}(B, W_1) - \mathrm{sup}(B, S_1)| \le \varepsilon \cdot \mathrm{sup}(B, W_1)$ since $\mathrm{sup}(B, S_1)$ will be equal to $\frac{1}{3}$ with high probability. Note that, whether $t_9$ is selected or not, we always have $|\mathrm{sup}(H, W_1) - \mathrm{sup}(H, S_1)| \le \varepsilon \cdot \mathrm{sup}(H, W_1)$ since H is a proportion-preserved value of $t_9$.

On the other hand, if H is considered in the determination of the inclusion probability of $t_9$, the inclusion probability will be equal to

$$\hat{pr}(t_9) = -\frac{\frac{1}{3}\left(\frac{0}{3} - 0.33\right) + \frac{1}{3}\left(\frac{0}{3} - 0.33\right) + \frac{1}{5}\left(\frac{1}{5} - 0.33\right)}{\left(\frac{1}{3}\right)^2 + \left(\frac{1}{3}\right)^2 + \left(\frac{1}{5}\right)^2} = 0.93 .$$

Comparing the two values of $\hat{pr}(t_9)$ (0.99 without H and 0.93 with H), it can be seen that considering proportion-preserved values when we determine the inclusion probability of the incoming tuple leads to a lower probability of having bounded relative proportion differences for the other attribute values. In this case, we demonstrate the feasibility of excluding proportion-preserved values when we determine the inclusion probability of the incoming tuple.

Example 2.3. Suppose that, before tuple $t_{10000} = (A, Z, F) \in W_{10}$ arrives, we have

$$N^{10}_{9999}(A) = 99, \quad s\_N^{10}_{9999}(A) = 30, \quad N^{10}_{9999}(Z) = 9, \quad s\_N^{10}_{9999}(Z) = 2,$$
$$N^{10}_{9999}(F) = 2, \quad s\_N^{10}_{9999}(F) = 0, \quad |S_{10,9999}| = 31, \quad \text{and} \quad |W_{10,9999}| = 100 .$$

It can be seen that the data are highly skewed in this window since attribute value A frequently occurs but attribute value F rarely occurs. In this case, attribute value A will be a proportion-preserved value of $t_{10000}$ because A satisfies the two criteria stated in Lemma 1. Note that

$$-\frac{\frac{1}{10}\left(\frac{2}{10} - 0.33\right) + \frac{1}{3}\left(\frac{0}{3} - 0.33\right)}{\left(\frac{1}{10}\right)^2 + \left(\frac{1}{3}\right)^2} = 1.01 .$$

In such cases, PAS will set $\hat{pr}(t_{10000})$ to one so that $t_{10000}$ will definitely be selected in the sample. As a result, the relative proportion differences of F and Z are

$$\left| \frac{3}{101} - \frac{1}{32} \right| \Big/ \frac{3}{101} = 0.05 < \varepsilon \quad \text{and} \quad \left| \frac{10}{101} - \frac{3}{32} \right| \Big/ \frac{10}{101} = 0.05 < \varepsilon ,$$

respectively. In this example, we show the feasibility of PAS in skewed data.

3 EFFICIENT QUALITY-AWARE SAMPLING

In Section 2, we have introduced algorithm PAS to generate a high-quality online sample by adaptively determining the inclusion probability of each incoming tuple $t_i$. In this section, we present algorithm EQAS (standing for Efficient Quality-Aware Sampling) to provide the flexibility of striking a compromise between the sampling quality and the sampling efficiency.

3.1 Framework of EQAS

Before presenting the details of algorithm EQAS, we first show the implementation of algorithm PAS. Specifically, in PAS, the frequency of every attribute value in a window needs to be maintained in main memory. To efficiently achieve this, PAS is devised by employing a hash structure, called CF (standing for cumulative filter). Let $CF(a_j)$ denote the hash function to hash the attribute value $a_j$, where the hash entry contains the up-to-date frequencies of $a_j$ in the sample window and the population window. When a new window starts, all entries in CF will be released and initialized again, indicating that the memory usage is independent of the size of the input data and is bounded with respect to the count of distinct attribute values in the database. The function of PAS to determine whether the tuple $t_i$ is selected in the sample or not is outlined in the procedure IsSample_PAS.
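Since the listing of IsSample_PAS is not reproduced in this text, the following sketch shows one way the per-tuple decision could be organized around a CF-like hash table. It is an illustration under stated assumptions, not the authors' procedure: the class and method names are hypothetical, a Python dict stands in for CF, values are keyed by (attribute position, value) to keep attributes apart, an empty sample is treated as having proportion zero, and the rate $p$ is used when every value is proportion-preserved.

```python
import random


class PASWindow:
    """Hypothetical per-window state for PAS; `cf` plays the role of CF."""

    def __init__(self, p, eps, rng=None):
        self.p, self.eps = p, eps
        self.cf = {}      # (position, value) -> [population count, sample count]
        self.w_size = 0   # tuples seen in the population window so far
        self.s_size = 0   # tuples selected into the sample window so far
        self.rng = rng or random.Random()

    def _preserved(self, n, s_n):
        # Lemma 1: the bound must hold whether the tuple is kept or dropped.
        pop = (n + 1) / (self.w_size + 1)
        kept = (s_n + 1) / (self.s_size + 1)
        dropped = s_n / self.s_size if self.s_size else 0.0  # assumption for 0/0
        bound = self.eps * pop
        return abs(kept - pop) <= bound and abs(dropped - pop) <= bound

    def is_sample(self, tup):
        """Decide whether `tup` joins the sample, then update the counters."""
        keys = list(enumerate(tup))
        a = b = 0.0
        for key in keys:
            n, s_n = self.cf.get(key, (0, 0))
            if self._preserved(n, s_n):
                continue  # excluded from V_i
            c = 1.0 / (n + 1)
            a += c * c
            b += c * (s_n * c - self.p)
        # Theorem 3 closed form; fall back to p if V_i is empty (assumption).
        prob = self.p if a == 0.0 else min(1.0, max(0.0, -b / a))
        selected = self.rng.random() < prob
        for key in keys:
            entry = self.cf.setdefault(key, [0, 0])
            entry[0] += 1
            entry[1] += int(selected)
        self.w_size += 1
        self.s_size += int(selected)
        return selected
```

Starting a new window would correspond to creating a fresh `PASWindow`, which mirrors the release and re-initialization of all CF entries.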

Indeed, as compared to traditional sequential sampling algorithms such as random sampling, PAS incurs higher computational overhead since the sample proportion and the population proportion of at most $h$ attribute values are examined when a tuple arrives. As a result, we further devise algorithm EQAS to simultaneously achieve high sampling quality and high sampling efficiency by integrating PAS and random sampling. The basic concept behind algorithm EQAS is to switch between PAS and random sampling at the appropriate moment. While deferring the details, we first formally present an important measure of the sampling quality, called the expected square relative error.

Definition 2 (Expected Square Relative Error). The expected square relative error (abbreviated as ESRE) before $t_{i+1}$ arrives is defined as

$$Er(i) = \sum_{j=1}^{|A|} \frac{N^{k}_{i}(a_j)}{h \, |W_{k,i}|} \left[ \frac{\mathrm{sup}_i(a_j, W_k) - \mathrm{sup}_i(a_j, S_k)}{\mathrm{sup}_i(a_j, W_k)} \right]^2 ,$$

where $\mathrm{sup}_i(a_j, W_k)$ and $\mathrm{sup}_i(a_j, S_k)$ denote the population proportion and the sample proportion of the attribute value $a_j$ in the $k$th window before $t_{i+1}$ arrives, respectively.
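ESRE in Definition 2 is a weighted average of squared relative proportion errors, where the weight of each value is its share of all attribute-value occurrences in the window. The sketch below computes it from two count tables. The function and argument names are hypothetical, and an empty sample is treated as having proportion zero for every value, which is an assumption.

```python
def esre(pop_counts, sample_counts, w_size, s_size, h):
    """Expected square relative error of Definition 2.

    pop_counts    : dict value -> count in the population window
    sample_counts : dict value -> count in the sample window
    w_size, s_size: number of tuples in the population / sample window
    h             : number of attributes per tuple
    """
    total = 0.0
    for value, n in pop_counts.items():
        pop = n / w_size
        samp = sample_counts.get(value, 0) / s_size if s_size else 0.0
        weight = n / (h * w_size)  # weights sum to 1 when each tuple has h values
        total += weight * ((pop - samp) / pop) ** 2
    return total
```

A sample whose proportions match the population exactly yields an ESRE of zero, and values missing from the sample contribute their full weight.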

Specifically, ESRE is a fair measure for the sampling quality of the sequentially generated sample because ESRE represents the expected relative proportion inconsistency of an attribute value over time. It is worth mentioning that, as will be validated in our empirical studies, PAS can quickly reduce ESRE since pursuing the minimization of the relative proportion error is the inherent goal of PAS. However, as illustrated in Fig. 3, ESRE will no longer be significantly reduced by PAS once the passing data size in the window exceeds a size, denoted by $N_{min}$ in Fig. 3. Formally, due to the natural limit of sampling, the relative proportion errors of some values may still exceed the desired relative error bound $\varepsilon$ after $|W_{k,i}| > N_{min}$. For example, assuming an attribute value occurs only one time during a window (usually deemed as noise), we cannot ensure that its relative proportion difference will be bounded below $\varepsilon = 0.1$ when $p = 0.1$. Actually, such a problem is well reported in the literature, and a reasonable sanity bound is usually used to prevent the relative error metric from being unduly dominated by attribute values with very small occurrences, where the relative error is defined in the form $\dfrac{\mathrm{sup}_i(a_j, W_k) - \mathrm{sup}_i(a_j, S_k)}{\max\{\text{sanity bound},\ \mathrm{sup}_i(a_j, W_k)\}}$ [13].

In our cases, further attempting to reduce their errors by PAS will pay for the computational overhead without a prominent improvement of the sampling quality. As such, we can initially execute PAS (to pursue high sampling quality) and switch to random sampling when the window size exceeds $N_{min}$ (to pursue high sampling efficiency).

Note that random sampling cannot guarantee the sampling quality in the presence of the burst sampling error when the data distribution changes suddenly. Therefore, we periodically perform an offline probing process to examine the expected square relative error when random sampling is executed. If the result shows that ESRE drastically increases, the sampling approach will be switched back to PAS to ensure the sampling quality. Accordingly, the comparison of ESRE among random sampling, PAS, and EQAS is illustrated in Fig. 4, where ESRE in algorithm EQAS is expected to be close to ESRE in algorithm PAS. Correspondingly, the sampling time consumed by algorithm EQAS will be close to the time consumed by random sampling. As a result, algorithm EQAS can achieve high sampling quality and high sampling efficiency at the same time.

The overall implementation of EQAS is thus outlined below with three algorithm inputs: the data source $D$, the sample rate $p$, and the relative error bound $\varepsilon$. In addition, a global variable, called Status, indicates the up-to-date variation of the expected square relative error. When Status is identified as unstable, meaning that the variation of ESRE is obvious (it either drastically increases or drastically decreases), algorithm EQAS will execute PAS to sample the following tuples in pursuit of high sampling quality. Alternatively, when Status is identified as stable, meaning that the variation of ESRE is insignificant (it either slightly increases or slightly decreases), algorithm EQAS will execute random sampling to sample the following tuples in pursuit of high execution efficiency.
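The switching behavior described above can be sketched as a single loop over the data source. The listing of EQAS itself is not reproduced in this text, so everything below is an assumption made for illustration: the function name, the two injected callables (`pas_is_sample` for the PAS decision and `detect_status` for the status update), the choice to start in the unstable state, and the use of one checking interval for both directions of the switch.

```python
import random


def eqas(stream, p, pas_is_sample, detect_status, interval=100, rng=None):
    """Sketch of the EQAS main loop.

    stream        : iterable of tuples (the data source D)
    p             : sample rate used by random sampling
    pas_is_sample : hypothetical callable, tuple -> bool, the PAS decision
    detect_status : hypothetical callable, status -> status, run periodically
    interval      : number of tuples between two status checks
    """
    rng = rng or random.Random()
    status = "unstable"  # assumption: every run starts with PAS
    sample = []
    for i, tup in enumerate(stream, start=1):
        if status == "unstable":
            selected = pas_is_sample(tup)  # quality-oriented path
        else:
            selected = rng.random() < p    # efficiency-oriented path
        if selected:
            sample.append(tup)
        if i % interval == 0:
            status = detect_status(status)
    return sample
```

Injecting the two callables keeps the loop independent of how PAS and the status detection are implemented, which makes the switching logic easy to test in isolation.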

Fig. 3. ESRE in random sampling and PAS.

Fig. 4. The illustration of the expected square relative error among three sampling approaches.


The remaining issue of algorithm EQAS is to identify the timing to switch from PAS to random sampling and vice versa. We first discuss the timing to switch from PAS to random sampling, which can be considered as the process of identifying the convergence of the ESRE curve over time.

3.1.1 Convergence Detection

We refer to the technique of the convergence detection utilized in progressive sampling [22]. Formally, the power-law fit [25] and the linear regression with local sampling [22] are two common approaches in progressive sampling to detect the convergence point of the learning curve. We follow the idea of the linear regression with local sampling since, as demonstrated in [22], it is robust and is the state-of-the-art approach.

Specifically, we periodically examine ESRE for a short time duration when PAS is executed. Suppose we obtain $Er(i), Er(i+1), \ldots, Er(i+m)$, where $m$ is the length of a duration. Those values are then used to estimate a linear regression line whose slope is compared to zero. If the slope is smaller than a slope threshold, which is a value sufficiently close to zero, the tuple $t_i$ can be deemed as the convergence point. In such cases, the variable Status will be set to stable and the sampling strategy will be switched to random sampling.
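The convergence test reduces to fitting a least-squares line through a short run of ESRE values and inspecting its slope. A minimal sketch follows. The name `linear_reg` mirrors the function mentioned for Status_Detection, but its body here is an assumption, as is `has_converged`. Comparing the absolute value of the slope against the threshold is also an assumption, made because a steeply decreasing ESRE curve has a negative slope yet has not converged.

```python
def linear_reg(values):
    """Least-squares slope of `values` regressed on their indices 0..m."""
    m = len(values)
    if m < 2:
        raise ValueError("at least two ESRE values are needed")
    x_mean = (m - 1) / 2
    y_mean = sum(values) / m
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(m))
    return num / den


def has_converged(esre_values, slope_threshold=0.01):
    """True when the ESRE curve is flat enough to hand over to random sampling."""
    return abs(linear_reg(esre_values)) < slope_threshold
```

The default of 0.01 follows the setting suggested in Section 3.2.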

We then present how to determine the time of switching from random sampling to PAS. The procedure is called the probing procedure.

3.1.2 Probing

Note that, originally, frequencies of all attribute values in the sample and in the population will not be maintained during the execution of random sampling, which makes it difficult to examine the up-to-date ESRE over time. In practice, we can approximately estimate ESRE by only maintaining frequencies of several attribute values. More specifically, at the end of executing PAS, we randomly select $|F|$ attribute values from CF (others will be released) and continue to monitor the frequency of those $|F|$ attribute values during the execution of random sampling. Therefore, we can periodically execute the probing process to calculate the estimated ESRE, which is formulated as

$$\sum_{j=1}^{|F|} \frac{N^{k}_{i}(a_j)}{h \, |W_{k,i}|} \left[ \frac{\mathrm{sup}_i(a_j, W_k) - \mathrm{sup}_i(a_j, S_k)}{\mathrm{sup}_i(a_j, W_k)} \right]^2 .$$

If the estimated ESRE is larger than a given multiple (the ratio threshold) of the estimated ESRE obtained at the end of the former execution of PAS, the variable Status will be set to unstable and the sampling strategy will be switched to PAS. Note that investigating the estimated ESRE will increase the complexity by a constant factor since $|F|$ attribute values need to be continuously maintained. However, the overhead is slight as compared to the time consumed by PAS.
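The probing step can be sketched as computing the estimated ESRE over the monitored values only and comparing it with a multiple of the value recorded when PAS last stopped. The function names and the `monitored` layout are hypothetical. The name `pre_e` follows the variable mentioned for Status_Detection, and the default multiple of 1.1 follows the setting suggested in Section 3.2.

```python
def estimated_esre(monitored, w_size, s_size, h):
    """Estimated ESRE over the monitored values.

    monitored : dict value -> (population count, sample count)
    """
    total = 0.0
    for n, s_n in monitored.values():
        if n == 0:
            continue  # a value not yet seen has no defined relative error
        pop = n / w_size
        samp = s_n / s_size if s_size else 0.0  # assumption for an empty sample
        total += (n / (h * w_size)) * ((pop - samp) / pop) ** 2
    return total


def probe(monitored, w_size, s_size, h, pre_e, ratio_threshold=1.1):
    """Return the new Status and the current estimate."""
    est = estimated_esre(monitored, w_size, s_size, h)
    status = "unstable" if est > ratio_threshold * pre_e else "stable"
    return status, est
```

Because only the monitored values are touched, the cost of one probe is proportional to $|F|$ rather than to the number of distinct values in the window.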

The implementation of the convergence detection and the probing method is outlined in the procedure Status_Detection() below, in which the function linear_reg refers to a function executing the linear regression and returning the slope of the regression line. In addition, the variable pre_e denotes the value of the estimated ESRE obtained at the end of the former execution of PAS.

3.2 Parameters in Algorithm EQAS

Note that, as discussed in related works, sampling quality is usually obtained at the cost of extra computational overhead, which somewhat compromises the applicability of those sampling algorithms. Indeed, algorithm EQAS enables flexibility between the sampling quality and the sampling efficiency. Without loss of generality, the trade-off between the sampling quality and the sampling efficiency solely depends on the fraction of the execution of algorithm PAS. When the sampling quality is the primary concern, the fraction of the execution of algorithm PAS can be raised. On the other hand, the fraction of the execution of algorithm PAS will be reduced when we pursue high sampling efficiency. As shown in Fig. 5, three parameters in algorithm EQAS, i.e., the slope threshold of the convergence detection, the ratio threshold of the probing procedure, and the probing interval, control the execution fraction of PAS, indicating a position either close to high sampling quality or close to high sampling efficiency. Specifically, setting the two thresholds to small values implies that we need to strictly check whether random sampling can handle the following tuples or not. In such cases, algorithm PAS usually carries over to handle the following tuples. Moreover, a small probing interval will lead to frequently performing the probing process, which will raise the probability of switching from random sampling to PAS. In contrast, setting large thresholds and a large probing interval will tend to frequently switch from PAS to random sampling for the following tuples. From our empirical studies, we suggest setting the slope threshold to 0.01, the ratio threshold to 1.1, and the probing interval to 100 (the probing procedure is periodically executed every 100 tuples), which usually leads to a better balance between the sampling efficiency and the sampling quality.

Fig. 5. The trade-off between the sampling quality and the sampling efficiency.

Readers may be able to achieve analogous flexibility by giving the execution fraction of PAS and then periodically switching between random sampling and PAS without the need for the probing procedure and the convergence detection. However, such a straightforward solution suffers from the problem that tuples arriving during the burst sampling error cannot be precisely handled by PAS, which will drastically affect the sampling quality. Note that it may also lead to the meaningless situation of using PAS to handle data which are stable and uniformly distributed. Algorithm EQAS with the proposed convergence detection and the probing procedure will obviously outperform the naive approach.

The remaining issue in algorithm EQAS is to determine the appropriate value of $|F|$, i.e., the number of attribute values whose frequencies will be maintained during the execution of random sampling. Note that maintaining a large $|F|$ will pay for overhead similar to that of PAS, thus losing the efficiency gained from random sampling. Formally, it is meaningful to have $|F|$ larger than 30 in the sense of statistics [7]. We therefore set $|F| = 30$ by default.

3.3 Complexity Analysis

Formally, random sampling has linear time complexity and constant space complexity, which implies that the overhead required by EQAS is dominated by the complexity of algorithm PAS. As such, we show the complexity of EQAS by first analyzing that of PAS.

3.3.1 Time Complexity

The time complexity of PAS is $O(h \, |D|)$, where $|D|$ is the size of the data set and $h$ is the number of dimensions in the population. Since both PAS and random sampling are linear with respect to the database size, algorithm EQAS is also linear with a factor determined by how many tuples are handled by PAS. In addition, the execution of the function Status_Detection in algorithm EQAS requires constant time to look up all attribute values in CF, thus only increasing the complexity of algorithm EQAS by a negligible constant factor.

3.3.2 Space Complexity

The space complexity of PAS is $O(|A|)$, where $|A|$ is the number of distinct values in a window. Considering that the distribution of real data generally follows the Zipf distribution [29], an $O(N^{1/z})$ upper bound of the memory usage is derived, where $N$ is the number of tuples in a window and $z$ is the level of skewness in the distribution of attribute values. Specifically, assuming the distribution follows the Zipf distribution with parameter $z$, the frequency of the $i$th rank attribute value is equal to $\dfrac{N}{\zeta(z) \, i^{z}}$, where $\zeta(z) = \sum_{i=1}^{|A|} \dfrac{1}{i^{z}}$ [29]. Note that $\sum_{i=1}^{\infty} \dfrac{1}{i^{z}}$ converges to a small constant (see footnote 2) when $z > 1$ (this is a common case in real data), implying that $\zeta(z)$ is also a small constant. Since the absolute frequency of an attribute value is always at least one, we have $\dfrac{N}{|A|^{z}} \ge 1$. Therefore, the bound of $|A|$ is $O(N^{1/z})$, indicating that the space complexity of PAS is $O(N^{1/z})$. Moreover, since the memory consumed by random sampling is a constant, the space complexities of algorithms EQAS and PAS are the same.
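The bound on $|A|$ can be traced step by step. The derivation below is a sketch that only restates the argument above under the same Zipf assumption. It additionally uses the fact that $\zeta(z) \ge 1$, which holds because the first term of the sum is 1.

```latex
\begin{align*}
\frac{N}{\zeta(z)\,|A|^{z}} \ge 1
  &\;\Longrightarrow\; |A|^{z} \le \frac{N}{\zeta(z)} \le N
     && \text{(the least frequent value occurs at least once)} \\
  &\;\Longrightarrow\; |A| \le N^{1/z} = O\!\left(N^{1/z}\right).
\end{align*}
```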

3.4 Discussions on Quality-Aware Sampling

In this section, we provide more insights into the proposed algorithms. We first describe why PAS and EQAS utilize probabilistic sampling mechanisms rather than a deterministic sampling mechanism like the one used in algorithm EASE [8]. Specifically, readers may ask why we do not simply select or drop the tuple $t_i$ based on which action results in a smaller value of $\sum_{j=1}^{|V_i|} (F_k(a_j, i) - p)^2$, thus resulting in a deterministic process of selecting tuples. The major reason is that probabilistic models can provide randomness and unbiasedness. Note that, in EASE, the data distribution is observed in a pilot sample and it can utilize a deterministic procedure to select the sample based on what it has observed beforehand. In contrast, PAS and EQAS do not check the population in advance so as to fulfill the need for incremental mining. As can be simply seen, the first tuple is always dropped if $p < 0.5$ when a deterministic procedure is applied in PAS, leading to a biased sample (as compared with the discussion in Example 2.1). Since unbiasedness is an important property for sampling [17], the probabilistic model is more appropriate for our model. In addition, comparing our quality-aware sampling mechanisms and EASE [8], it is clear that the error threshold $\varepsilon$ in PAS and EQAS can be specified by users and, in contrast, the absolute error in EASE is guaranteed below a system-determined bound whose magnitude depends on the desired sample size and can be estimated after the first pilot sample is generated. In practice, the small error bound and the small sample size/rate are two conflicting goals due to the inherent limits of sampling (the relative error upper bound will be dominated by attribute values with very small occurrences [13]). Coupled with the mechanism of excluding proportion-preserved values described in Section 2.1.3, the tunable error threshold will enable the balance between the small error bound and the small sample size. The details will be observed and discussed in our experimental results.
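The remark about the first tuple can be checked with a few lines. The sketch below implements the deterministic rule discussed above (keep the tuple only when keeping yields the smaller sum of squared deviations from $p$). The function name and argument layout are hypothetical and serve only to illustrate the argument.

```python
def deterministic_choice(counts, p):
    """Deterministic alternative: select t_i only if that gives the smaller cost.

    counts : list of (s_n, n) pairs for the values of t_i, before t_i arrives
    p      : desired sample rate
    """
    cost_select = sum(((s_n + 1) / (n + 1) - p) ** 2 for s_n, n in counts)
    cost_drop = sum((s_n / (n + 1) - p) ** 2 for s_n, n in counts)
    return cost_select < cost_drop
```

For an initialized window the two costs are proportional to $(1 - p)^2$ and $p^2$, so the rule drops the first tuple whenever $p < 0.5$, whereas the probabilistic rule selects it with probability $p$.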

4 EXPERIMENTAL RESULTS

The simulation model of our experimental studies is described in Section 4.1. To assess the performance of PAS and EQAS, we present empirical studies based on both synthetic and real data sets. The feasibility and the scalability are examined in Section 4.2. In Section 4.3, the effectiveness of sampling for the various mining applications is demonstrated.

2. Refer to http://en.wikipedia.org/wiki/Riemann_zeta_function for details.

4.1 Simulation Model

In our experiments, synthetic data sets are generated for the sensitivity analysis. Those synthetic data sets are generated as follows: First, a Zipfian data generator was used to produce Zipfian frequencies for various levels of skew. By tuning a parameter $z$, i.e., the level of skewness in the distribution of attribute values, we generate data sets to simulate highly skewed ($z = 1.5$) and weakly skewed ($z = 0.5$) data, where 1.5 and 0.5 are commonly used parameters to investigate the algorithm performance in different levels of skewness [30]. Moreover, the synthetic data generation program takes other parameters, as shown in Table 2, and the values of parameters used to generate the data sets are summarized in Table 3. Data sets of high dimensions with the skewed distribution (named z1.5N30T50) and low dimensions with the approximately uniform distribution (named z0.5N5T50) are both considered.
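A Zipfian generator of the kind described above can be sketched in a few lines. The authors' generator (written in C++) is not shown in this text, so the following is only an illustration: the function names, the independent draw per attribute, and the parameter names are all assumptions, and they do not correspond to the parameters listed in Table 2.

```python
import random


def zipf_frequencies(n_values, z):
    """Normalized Zipfian probabilities: rank i gets weight 1 / i**z."""
    weights = [1.0 / (i ** z) for i in range(1, n_values + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def generate_tuples(n_tuples, n_attrs, n_values, z, seed=0):
    """Draw tuples whose attribute values are Zipf-distributed ranks.

    Assumption: every attribute is drawn independently from the same
    Zipfian distribution over `n_values` ranks.
    """
    rng = random.Random(seed)
    probs = zipf_frequencies(n_values, z)
    ranks = list(range(n_values))
    return [tuple(rng.choices(ranks, weights=probs, k=n_attrs))
            for _ in range(n_tuples)]
```

A larger `z` concentrates the probability mass on the first few ranks, which corresponds to the highly skewed setting, while a small `z` approaches the uniform case.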

In addition, Table 3 also shows two employed real data sets. The first one is a data set of network alarm logs, named AlarmLog, which is provided by a major telecommunication company in Taiwan. The AlarmLog data set will be utilized to verify the feasibility of various sampling algorithms in the time-variant database. Note that this data set records various alarms generated by a huge number of base station controllers, and some types of alarms indeed occur more frequently during the weekday while other types of alarms may only occur during the weekend. Another real data set is a well-known public domain data set, called the Mushroom data set, which is downloaded from the UCI machine learning repository [31].

The simulation is coded in C++ and performed on an IBM compatible PC with a 3.2 GHz CPU and 1.0 GB memory. All implementations employ the common set of functions for performing I/O. There are four sampling algorithms, i.e., PAS, EQAS, simple random sampling (abbreviated as SRS in the sequel), and EASE [8]. Note that EASE is the state-of-the-art sampling algorithm used to reduce the absolute proportion difference of each attribute value. The code of EASE was provided by the authors of EASE. In addition, the default error threshold $\varepsilon$ is set as 0.1 for algorithms PAS and EQAS and the sample rate is $p = 0.1$.

4.2 Sensitivity Analysis of Algorithms PAS and EQAS

4.2.1 On Sampling Quality as Time Advances

We first investigate ESRE, i.e., the expected square relative error in algorithms PAS, EQAS, and SRS as time advances (algorithm EASE cannot be compared in this experiment since it cannot sequentially generate the sample). Fig. 6a shows ESRE as time advances, where the time-variant data set AlarmLog is utilized and the time granularity of a window is specified as "day." Note that two windows are shown, where 3/1/2002 is Friday and 3/2/2002 is Saturday. In practice, data distributions in these two windows are quite different, and some types of alarms are more frequent on the weekday (3/1/2002). That is why the curves of ESRE are so different in these two windows, which can show the applicability of various sampling algorithms in time-variant data. It is clear that ESRE in algorithm PAS is orders of magnitude smaller than that in random sampling. Importantly, we see that the curve of ESRE in algorithm EQAS is close to that in algorithm PAS, where only 10 percent to 20 percent of the data in algorithm EQAS are handled by algorithm PAS. As also shown in Fig. 6a, we find the execution time of algorithm EQAS is nearly equal to the time consumed by random sampling. It demonstrates that algorithm EQAS can gain high execution efficiency without much compromising the sampling quality. It is worth mentioning that some types of alarms will emergently and repeatedly occur during the rush time, which incurs the challenge of the burst sampling error. In algorithm EQAS, when the burst sampling error is identified by the probing process, algorithm PAS can quickly take over and preserve the sampling quality. In this experiment, we demonstrate our claim that algorithms PAS and EQAS can quickly reduce the relative proportion difference and that they are robust to the burst sampling error as compared to random sampling.

TABLE 2 Notations in Data Sets

TABLE 3 Parameters of Data Sets

Fig. 6b shows ESRE in the Mushroom data set with various sample rates (for ease of presentation, only the observations of the first 1,000 tuples are shown). As can be seen, ESRE in algorithm PAS is stably and quickly reduced toward a convergent value. In contrast, random sampling suffers from the burst sampling error and ESRE cannot be effectively reduced, even though the sample rate is large ($p = 0.2$). It is interesting to point out that algorithms PAS and EQAS have the sampling quality with $p = 0.1$ better than that of random sampling with $p = 0.2$, showing the excellent proportion precision of the proposed algorithms. Note that the memory usages are also shown in Fig. 6. We can find that the memory usage is much smaller as compared to the memory required by the posterior mining applications such as the frequent-itemset mining [4].

4.2.2 On Execution Time and Relative Error Distribution

The sampling efficiency is further investigated in the two synthetic data sets, z1.5N30T50 and z0.5N5T50. We also investigate the sampling quality from a different perspective, called the relative error distribution. The relative error distribution refers to the distribution of the value $\left[ \dfrac{\mathrm{sup}(a_j, W_k) - \mathrm{sup}(a_j, S_k)}{\mathrm{sup}(a_j, W_k)} \right]^2$ of each value $a_j$ at the end of a window. In general, relative proportion errors of most values in a high-quality sample are close to zero so that the relative error distribution will be highly left-skewed. In Fig. 7, we show the execution time and the relative error distribution obtained in four sampling algorithms, including algorithm EASE [8].

We first investigate the scalability of different sampling algorithms on synthetic data sets with various sizes. Note that, for fair comparison of different algorithms, the window size in PAS, EQAS, and SRS will be set equal to the size of the population because EASE cannot be directly extended to the window-based scenario. As shown in Fig. 7a, whether data is highly skewed or not, the execution time of each sampling method grows linearly as the data set size increases. Note that the execution time of SRS is independent of the various parameters of the two synthetic data sets because random sampling does not maintain or analyze the distribution of the population. Thus, we only show one execution time of SRS in Fig. 7a. Furthermore, EASE has the longest execution time mainly because it is a two-phase sampling method that requires a corresponding time to obtain an initial large sample (the same as the size specified in [8], the size of the initial large sample is $0.3 \, |D|$), showing that algorithm EASE gains the sampling quality at the cost of sampling efficiency. Formally, the time consumed by PAS is also large as compared to that of SRS, particularly in the data set z1.5N30T50 since the number of attributes is large. However, EQAS has an execution time very close to that of SRS. This is because, in algorithm EQAS, the fraction of data handled by PAS is relatively small as compared to that handled by SRS, thus leading to insignificant computational overhead.

We then show the relative error distribution of generated samples in Fig. 7b and Fig. 7c. For ease of illustration, the relative proportion difference larger than 0.05 is truncated in these figures. In the high-dimensional and skewed data (Fig. 7b), each sampling algorithm inevitably leads to the larger relative proportion inconsistency. Importantly, the sampling quality of PAS is the best one since its relative error distribution is highly left-skewed. We also find that the result of EQAS is close to PAS. Similar results are also obtained in Fig. 7c, where the relative proportion difference in the low-dimensional and nonskewed data is relatively small as compared to that shown in Fig. 7b. Note that the result of EASE in the high-dimensional and skewed data is not good as compared to those of PAS and EQAS. Since EASE only minimizes the absolute proportion difference, the relative proportion difference of many attribute values, which rarely occur, will be apparently large in the skewed data set. In practice, the relative proportion error is very difficult to bound, especially for those attribute values whose population proportions are small. As a result, we show the effectiveness of algorithms PAS and EQAS to preserve the sampling quality. Clearly, considering both the execution time and the sampling quality, EQAS will be the winner.

Fig. 7. The relative error distribution and the execution time in various sampling algorithms. (a) The execution time. (b) The relative error distribution (z1.5N30T50). (c) The relative error distribution (z0.5N5T50).

4.2.3 On Parameter Sensitivity

We investigate the effect of two parameters of PAS, i.e., $\varepsilon$ and $p$. In the interests of space and ease of exposition, only PAS and SRS are compared in this analysis since EQAS and PAS have similar sampling quality and only showing the results of PAS can demonstrate the pure behavior of our model without the effect from random sampling. Fig. 8a shows the relative error distribution of PAS in the Mushroom data set, where the error threshold $\varepsilon$ varies from 0.01 to 0.5 ($p = 0.1$). Formally, the results of PAS with various $\varepsilon$ are not obviously different from each other in this experiment. This is because PAS tries to minimize the relative proportion error no matter what level of $\varepsilon$ is specified. However, on further investigation, we can see that the relative error distribution of PAS with $\varepsilon = 0.01$ is slightly left-sharper than that of PAS with $\varepsilon = 0.5$, but the relative error distribution of PAS with $\varepsilon = 0.01$ has a few small peaks in high relative proportion errors. The reason is that PAS simultaneously considers all attribute values of $t_i$, excluding proportion-preserved values, when $t_i$ arrives. Note that it is difficult for each attribute value of $t_i$ to become a proportion-preserved value of $t_i$ when $\varepsilon$ is small. As such, PAS with $\varepsilon = 0.01$ tries to simultaneously minimize the relative proportion differences of more attribute values, and this leads to a slow convergence of the relative proportion errors of a few attribute values. In contrast, although the relative error distribution of PAS with $\varepsilon = 0.5$ is not as sharp as that of PAS with $\varepsilon = 0.01$, the relative errors of all attribute values are equally reduced, leading to a relative error distribution with fewer peaks. Clearly, the results of PAS all outperform SRS, demonstrating the robustness of PAS.

The investigation of another parameter of PAS, i.e., the sample rate $p$, is shown in Fig. 8b. As can be seen, the result of PAS with $p = 0.1$ is similar to that with $p = 0.2$, showing that PAS can guarantee the sampling quality without the need for large sample rates/sizes.

It is worth mentioning that random sampling with a fixed sample rate can also bound the relative proportion difference of each attribute value, as long as the database is large enough. Therefore, an interesting question arises: What is the minimal population size needed to bound the relative proportion difference of each attribute value below the specified threshold $\varepsilon$? We show the result in Fig. 8c. As can be seen, the relative proportion differences of all attribute values in SRS can be bounded only when the database size is prohibitively large, both for $p = 0.1$ and for $p = 0.01$. In practice, having such a large database within a time window is not prevalent. In contrast, algorithms PAS and EQAS both require only a small database size, which is a reasonable size within a time window, to bound the relative proportion difference of each attribute value, thus showing the applicability of algorithms PAS and EQAS.
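A rough feel for why SRS needs such a large population can be obtained from a back-of-the-envelope estimate. The sketch below is not the procedure used to produce Fig. 8c. It assumes simple random sampling without replacement, a sample size of $pN$, and a normal approximation, under which the relative standard deviation of the sample proportion of a value with population proportion $q$ is about $\sqrt{(1-q)(1-p)/(pNq)}$. The function name and the confidence multiplier `z` are hypothetical.

```python
import math


def min_population_size(q, p, eps, z=1.96):
    """Estimate the population size N at which SRS bounds the relative
    proportion error of one value by eps.

    Assumptions (not from the paper): sampling without replacement with
    sample size p * N, normal approximation, and a z-sigma confidence level.
    Solves z * sqrt((1 - q) * (1 - p) / (p * N * q)) <= eps for N.
    """
    if not (0 < q < 1 and 0 < p < 1 and eps > 0):
        raise ValueError("q and p must lie in (0, 1) and eps must be positive")
    return math.ceil(z * z * (1 - q) * (1 - p) / (p * q * eps * eps))


# A hypothetical rare value (population proportion 1 percent), eps = 0.05.
for rate in (0.1, 0.01):
    print(rate, min_population_size(q=0.01, p=rate, eps=0.05))
```

The estimate covers a single value at one confidence level; bounding every attribute value at once is stricter, so applying it to the rarest value only indicates the order of magnitude involved. The required size grows as either the sample rate or the population proportion shrinks.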

4.2.4 On Relative Proportion Consistency of Multivariate Statistics

The relative proportion consistency of multivariate statistics is further investigated in the Mushroom data set. We show the relative error distributions of two-dimensional and three-dimensional variables in Figs. 9a and 9b, respectively. For ease of illustration, relative proportion errors larger than 0.05 are truncated in these figures. PAS still preserves the relative proportion consistency of multivariate statistics orders of magnitude better than random sampling, since the relative error distribution of PAS is highly left-skewed, thus confirming the statement in Theorem 4.

We also observe the relative proportion consistency of a randomly selected three-dimensional variable “stalk-shape:enlarging, stalk-root:equal, stalk-surface-above-ring:smooth,” whose population proportion is 4.3 percent. Its square relative error over time is shown in Fig. 9c. Compared to SRS, PAS quickly and stably ensures a close-to-zero square relative error. Meanwhile, Fig. 9d shows the sampling distribution of the square relative error of this three-dimensional variable, generated from 10,000 runs with a sample rate of 0.1. The sampling distribution generated by PAS has a sharper curve than that generated by SRS, indicating that the variance of the multivariate statistic’s sample proportion in PAS is much smaller than that in SRS. This experiment demonstrates that PAS preserves the relative proportion consistency of multivariate statistics and that PAS is an unbiased and robust sampling mechanism.
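The quantity tracked for such a three-dimensional variable can be sketched as follows. The sketch assumes that the proportion of a multivariate statistic is the fraction of tuples matching all of its attribute-value pairs, and that the square relative error is the squared difference between the sample and population proportions divided by the squared population proportion. The function names and the toy tuples are hypothetical and are not taken from the Mushroom data set.

```python
def joint_proportion(rows, pattern):
    """Fraction of rows that match every attribute-value pair in pattern."""
    if not rows:
        return 0.0
    hits = sum(
        1 for row in rows
        if all(row.get(attr) == value for attr, value in pattern.items())
    )
    return hits / len(rows)


def square_relative_error(population_prop, sample_prop):
    """Assumed definition: ((sample - population) / population) squared."""
    return ((sample_prop - population_prop) / population_prop) ** 2


pattern = {
    "stalk-shape": "enlarging",
    "stalk-root": "equal",
    "stalk-surface-above-ring": "smooth",
}
match = dict(pattern)
# Differs in one attribute only, so it must not count as a match.
other = {**pattern, "stalk-root": "bulbous"}

# Hypothetical population in which 5 percent of the tuples match the pattern.
population = [match] * 10 + [other] * 190
good_sample = [match] * 1 + [other] * 19   # sample proportion 5 percent
poor_sample = [match] * 2 + [other] * 18   # sample proportion 10 percent

pop_prop = joint_proportion(population, pattern)
for name, sample in (("good", good_sample), ("poor", poor_sample)):
    error = square_relative_error(pop_prop, joint_proportion(sample, pattern))
    print(name, error)
```

Repeating the computation over many independently drawn samples yields a sampling distribution of the kind plotted in Fig. 9d.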

Fig. 8. Studies of parameter sensitivity. (a) The relative error distribution with various error thresholds (Mushroom data set). (b) The relative error distribution with various sample rates (Mushroom data set). (c) The minimum population size to achieve the required relative error bound (synthetic data sets).


4.3 Application Studies

To investigate the advantage gained by preserving the relative proportion consistency, we first execute algorithm FP-growth, which is downloaded from Christian Borgelt’s Web site,³ on samples generated by PAS, EQAS, EASE, and SRS.

First, in Figs. 10a and 10b, we show the accuracy of retrieved frequent itemsets in the Mushroom data set, where the minimum supports are specified as 0.3 (2,735 frequent itemsets are discovered in the original Mushroom data set) and 0.15 (98,575 frequent itemsets are identified in the original Mushroom data set). Formally, we use the F-Score measurement [8], $F(S)$, to evaluate the accuracy of frequent itemsets obtained in the sample $S$, where $F(S) = \frac{2|L(D) \cap L(S)|}{|L(D) - L(S)| + |L(S) - L(D)|}$. $L(D)$ and $L(S)$ denote the sets of frequent itemsets obtained in the original data set $D$ and in the sample $S$, respectively. We show the accuracy of discovered frequent itemsets for each sample size as the average of 50 runs. As can be seen, algorithms PAS, EQAS, and EASE outperform SRS by orders of magnitude, especially when the sample size is small. In addition, note that PAS reduces the relative proportion difference, as opposed to the absolute proportion difference reduced by EASE. Reducing the relative proportion difference avoids the information loss of attribute values whose population proportions are close to the specified minimum support. Thus, the accuracy of frequent itemsets obtained by PAS and EQAS both exceed that of EASE by about 5 percent on average, demonstrating the effectiveness of PAS and EQAS for mining frequent itemsets. In addition, Fig. 10b shows the accuracy of frequent itemsets with the minimum support equal to 0.15. As can be seen, PAS outperforms EASE by orders of magnitude when the sample rate is small, since preserving the relative proportion consistency is more important than preserving the absolute proportion consistency in the presence of a small minimum support, thus demonstrating the feasibility of PAS and EQAS.

Fig. 9. The relative error distribution of multivariate statistics (Mushroom data set). (a) The relative error distribution of two-dimensional variables. (b) The relative error distribution of three-dimensional variables. (c) The Square Relative Error as time advances (three-dimensional variable: “stalk-shape:enlarging, stalk-root:equal, stalk-surface-above-ring:smooth”). (d) The sampling distribution of the Relative Error (three-dimensional variable: “stalk-shape:enlarging, stalk-root:equal, stalk-surface-above-ring:smooth”).

We also executed EM clustering, which is implemented in WEKA [32], on the generated samples. Similarly to the training-and-testing process for evaluating clustering results in [32], the effectiveness of sampling for clustering can

3. The URL is http://fuzzy.cs.uni-magdeburg.de/~borgelt/fpgrowth.html.

Fig. 10. Sampling effectiveness for frequent-itemset mining and clustering (Mushroom data set). (a) Mining frequent itemsets with min_sup = 0.3. (b) Mining frequent itemsets with min_sup = 0.15. (c) EM Clustering.