
Chapter 4 Online Mining of Changes of Items across Two Data Streams

4.4 Online Mining Changes of Items over Distributed ADSs

4.4.2 The MFC-append Algorithm

Algorithm MFC-append uses the notations and conventions illustrated in Figure 4-3. In the framework of mining changes of items over data streams, the streaming data is divided into fixed-size buckets B1, B2, …, Bi, …, BN, where BN is the latest bucket, with bucket identifier N, and B1 is the oldest one. Each bucket contains k items. The bucket length from Bi to Bj is denoted as B(i, j), where i ≥ j. Let t1, t2, …, tn be the timepoints (the smallest units of time) that group the buckets seen so far in the streams, where tn is the most recent timepoint and t1 is the oldest one. A bucket Bi has the form (StreamID, ti, items), where ti is the timepoint at which the items appeared in the stream with identifier StreamID.

The window identifier of ti is denoted as wi. The number of buckets that arrived from ti-1 to ti is |wi|, and the number of items (i.e., the size) in wi is also denoted as |wi|. The total size of the buckets that arrived in T equals |wk| + |wk+1| + … + |wn|, ∀k = 1, 2, …, n. As described above, the goal is to find the set of all FFCIs, VFCIs, and SFCIs in a time period T = tk ∪ tk+1 ∪ … ∪ tn, ∀k = 1, 2, …, n.

Hence, the pair of input data streams P and Q is divided into two sequences of basic windows, i.e., P = w1[BP1 + BP2 + … + BPi] + w2[BPi+1 + BPi+2 + … + BPj] + … + wm[BPk + BPk+1 + … + BPcurrentid-1], and Q = w1[BQ1 + BQ2 + … + BQi] + w2[BQi+1 + BQi+2 + … + BQj] + … + wm[BQk + BQk+1 + … + BQcurrentid-1]. The notation wi[BStreamIDj + BStreamIDj+1 + … + BStreamIDk] denotes the buckets of the data stream with identifier StreamID that arrived at timepoint ti, and the current bucket identifier is denoted as BStreamIDcurrent. Note that BStreamIDcurrent = n/m + 1. For example, there are five buckets in the first window w1 of Figure 4-1: two buckets (BP1 and BP2) in stream P and three buckets (BQ1, BQ2, and BQ3) in stream Q.
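As a rough illustration of the bucket form (StreamID, ti, items) and of how buckets that share a timepoint make up a window, the following Python sketch groups buckets by timepoint. The names `Bucket` and `group_into_windows` are hypothetical helpers introduced here, not part of the algorithm description, and the item values are made up.

```python
from collections import defaultdict
from typing import NamedTuple


class Bucket(NamedTuple):
    stream_id: str   # "P" or "Q"
    timepoint: int   # index i of the timepoint t_i at which the bucket arrived
    items: tuple     # the k items carried by the bucket


def group_into_windows(buckets):
    """Map each timepoint index i to the buckets of window w_i, split by stream."""
    windows = defaultdict(lambda: defaultdict(list))
    for bucket in buckets:
        windows[bucket.timepoint][bucket.stream_id].append(bucket)
    return windows


# Window w_1 holds two buckets of P and three buckets of Q (k = 2, made-up items).
buckets = [
    Bucket("P", 1, ("a", "b")), Bucket("P", 1, ("a", "c")),
    Bucket("Q", 1, ("a", "d")), Bucket("Q", 1, ("b", "d")), Bucket("Q", 1, ("d", "e")),
    Bucket("P", 2, ("a", "a")),
]
windows = group_into_windows(buckets)
w1_bucket_count = sum(len(bs) for bs in windows[1].values())
print(w1_bucket_count, len(windows[1]["P"]), len(windows[1]["Q"]))
```

Keeping the two streams separate inside each window matters later, because items from P and items from Q update the summary in opposite directions.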

[Figure: timeline of timepoints t0, t1, t2, …, tn-1, tn (increasing time, ending at the current timepoint) showing the batch buckets of stream P (BP1, BP2, BP3, …, BPi-2, BPi-1, BPi, BPcurrentid) and of stream Q (BQ1, BQ2, BQ3, BQ4, BQ5, …, BQi) grouped into windows w1, w2, …, wn, followed by the data elements that will be seen in the future.]

Figure 4-3. Notations and conventions used in the proposed algorithms

Algorithm MFC-append is described in Figure 4-4. It uses four parameters: mcs, ase, maxcr, and mincr, where mcs is the minimum changed support threshold, ase is the approximate support error threshold, maxcr is the maximum changed rate, and mincr is the minimum changed rate. At any moment, the algorithm can generate a list of FFCIs with their estimated changed supports and changed rates. These approximate answers carry the following guarantees. First, all items whose changed support exceeds mcs⋅n are output, i.e., there are no false negatives. Second, no item whose changed support is less than (mcs − ase)⋅n is output. Third, estimated changed supports are less than the true changed supports by at most ase⋅n. Finally, all items whose changed rate exceeds maxcr⋅n or is less than mincr⋅n are output.
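To make the scale of these bounds concrete, the short Python calculation below evaluates mcs⋅n and ase⋅n, together with the window size m = 1/ase used by the algorithm. The values of n, mcs, and ase, and the item with a true changed support of 120, are assumptions chosen only for illustration.

```python
from fractions import Fraction

n = 10_000                # assumed number of items seen so far
mcs = Fraction(1, 100)    # assumed minimum changed support threshold (1%)
ase = Fraction(1, 1000)   # assumed approximate support error threshold (0.1%)

m = 1 / ase                  # window size implied by ase
report_threshold = mcs * n   # items whose changed support exceeds this are always output
max_undercount = ase * n     # an estimate falls short of the true value by at most this

true_support = 120           # hypothetical item
lowest_possible_estimate = true_support - max_undercount
print(m, report_threshold, max_undercount, lowest_possible_estimate)
```

Exact fractions are used here only to keep the arithmetic free of floating-point rounding.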

Algorithm MFC-append

Input: (1) Two continuous append-only data streams, P = <p1, p2, …, pn, …> and Q = <q1, q2, …, qn, …>, with time-varying data rates, (2) A user-defined approximate support error threshold, ase, so that the window size m is 1/ase, (3) A user-defined minimum changed support threshold, mcs, (4) A user-specified maximum changed rate, maxcr, (5) A user-specified minimum changed rate, mincr.

Output: A list of FFCIs, VFCIs, and SFCIs.

Begin

  Change-Sketch ← { };

  Repeat:

    for each bucket Bi from the data streams (P and Q) do

      for each item q in wi(P, Bi) do /* i = 1, 2, …, n/m+1 */

        Change-Sketch(q, q.count++, q.wid, q.rate);

      for each item q in wi(Q, Bi) do

        Change-Sketch(q, q.count--, q.wid, q.rate);

      for each entry (q, q.count, q.wid, q.rate) in Change-Sketch do

        if |q.count| ≥ mcs⋅m⋅(wcurrent − q.wid) then

          item q is a frequent frequency change pattern in Change-Sketch;

        else if |q.count| ≥ ase⋅m⋅(wcurrent − q.wid) then

          preserve q in Change-Sketch;

        else

          remove q from Change-Sketch;

        if q.count changes its sign (either from positive to negative or from negative to positive) then

          q.rate++;

End

Figure 4-4. Algorithm MFC-append


The maintenance of Change-Sketch proceeds as follows. Let k be the identifier of the current window. Initially, Change-Sketch is empty. For each item q in the current window of stream P, MFC-append first checks whether an entry with id q already exists in Change-Sketch. If it does, the frequency of q (i.e., q.count) is increased by one. Otherwise, a new entry of the form (q, 1, k, 0) is created. After processing all items in wk of stream P, MFC-append processes all items in wk of the other stream, Q, to maintain the change information in Change-Sketch. Again, it first checks whether an entry with id q already exists. If the search succeeds, the algorithm decreases q.count by one. Otherwise, a new entry of the form (q, -1, k, 0) is created. Finally, if the updated entry q undergoes a frequency vibration, i.e., q.count changes sign, q.rate is increased by one (e.g., from zero to one).
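The maintenance step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: Change-Sketch is modelled as a plain dictionary, the names `Entry`, `update`, and `process_window` are hypothetical, and a sign change is detected by remembering the last nonzero sign of the count in an extra `sign` field that the description above does not mention.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    count: int      # signed frequency: +1 per occurrence in P, -1 per occurrence in Q
    wid: int        # identifier of the window in which the entry was created
    rate: int = 0   # number of sign changes observed so far
    sign: int = 0   # last nonzero sign of count (bookkeeping assumed by this sketch)


def update(sketch, item, delta, k):
    """Apply one occurrence of `item` (delta = +1 for P, -1 for Q) in window k."""
    entry = sketch.get(item)
    if entry is None:
        # New entry of the form (q, +1 or -1, k, 0)
        sketch[item] = Entry(count=delta, wid=k, sign=delta)
        return
    entry.count += delta
    new_sign = (entry.count > 0) - (entry.count < 0)
    if new_sign != 0 and new_sign != entry.sign:
        entry.rate += 1          # the count crossed from one sign to the other
        entry.sign = new_sign


def process_window(sketch, p_items, q_items, k):
    """Process all items of stream P first, then all items of stream Q."""
    for item in p_items:
        update(sketch, item, +1, k)
    for item in q_items:
        update(sketch, item, -1, k)


sketch = {}
process_window(sketch, ["a", "a", "b"], ["a", "b", "b", "c"], k=1)
for item, entry in sketch.items():
    print(item, entry.count, entry.wid, entry.rate)
```

In this made-up window, item b appears once in P and twice in Q, so its count moves from positive to negative and its rate becomes one, while item c is seen only in Q and starts with a count of -1.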

In order to bound the memory usage in mining changes of items over data streams, a pruning mechanism of Change-Sketch is proposed. The technique deletes some entries of Change-Sketch before MFC-append computes the next working window with window-id k+1.

Pruning trades the accuracy of the outputs against the memory requirement of Change-Sketch. It works as follows: an entry of the form (q, q.count, q.wid, q.rate) is deleted if |q.count| < ase⋅m⋅(wcurrent-id − q.wid). After the pruning, MFC-append processes the next working window, with window-id k+1, of data streams P and Q in the same way as described above.
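The pruning rule can be sketched in Python as below. The function name `prune`, the tuple layout (count, wid, rate), and all sample values are assumptions made for this illustration. The values ase = 0.25 and m = 4 are unrealistically coarse and were picked only so that m = 1/ase holds exactly.

```python
def prune(sketch, ase, m, w_current):
    """Delete every entry with |count| < ase * m * (w_current - wid)."""
    for item in list(sketch):            # copy the keys so deletion is safe
        count, wid, _rate = sketch[item]
        if abs(count) < ase * m * (w_current - wid):
            del sketch[item]


# item -> (count, wid, rate); made-up entries
sketch = {"a": (5, 1, 0), "b": (-1, 1, 1), "c": (-2, 3, 0)}
prune(sketch, ase=0.25, m=4, w_current=4)
print(sorted(sketch))
```

Because m = 1/ase, the product ase⋅m equals one, so under this choice an entry survives only if the magnitude of its count is at least the number of windows elapsed since it was created.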

When a user requests the set of all FFCIs, VFCIs, and SFCIs embedded in the data streams, MFC-append outputs the entries for which |q.count| ≥ mcs⋅m⋅(wcurrent-id − q.wid), |q.rate| ≥ mincr⋅m⋅(wcurrent-id − q.wid), and |q.rate| ≥ maxcr⋅m⋅(wcurrent-id − q.wid), respectively, in one scan of the current Change-Sketch.
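A single-scan query over the summary might look like the following Python sketch. It applies the three conditions exactly as written above and returns one list per threshold. The function name `query`, the tuple layout (count, wid, rate), and all numeric values are assumptions for illustration only.

```python
def query(sketch, mcs, mincr, maxcr, m, w_current):
    """Return the items meeting each of the three conditions, in one scan."""
    by_mcs, by_mincr, by_maxcr = [], [], []
    for item, (count, wid, rate) in sketch.items():
        span = m * (w_current - wid)     # items covered since the entry was created
        if abs(count) >= mcs * span:
            by_mcs.append(item)
        if abs(rate) >= mincr * span:
            by_mincr.append(item)
        if abs(rate) >= maxcr * span:
            by_maxcr.append(item)
    return by_mcs, by_mincr, by_maxcr


# item -> (count, wid, rate); made-up entries
sketch = {"a": (40, 1, 0), "b": (-3, 1, 6), "c": (-12, 2, 1)}
by_mcs, by_mincr, by_maxcr = query(
    sketch, mcs=0.5, mincr=0.125, maxcr=0.25, m=8, w_current=3
)
print(by_mcs, by_mincr, by_maxcr)
```

Collecting all three lists inside one loop mirrors the single scan of Change-Sketch mentioned in the text, so answering a request costs one pass over the summary regardless of how many item classes are asked for.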
