Estimating Moments - Jeffrey D. Ullman

would main memory constrain the number of hash functions we could associate with any one stream. In practice, the time it takes to compute hash values for each stream element would be the more significant limitation on the number of hash functions we use.

4.4.5 Exercises for Section 4.4

Exercise 4.4.1 : Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Our hash functions will all be of the form h(x) = ax + b mod 32 for some a and b. You should treat the result as a 5-bit binary integer. Determine the tail length for each stream element and the resulting estimate of the number of distinct elements if the hash function is:

(a) h(x) = 2x + 1 mod 32.

(b) h(x) = 3x + 7 mod 32.

! Exercise 4.4.2 : Do you see any problems with the choice of hash functions in Exercise 4.4.1? What advice could you give someone who was going to use a hash function of the form h(x) = ax + b mod 2^k?

4.5 Estimating Moments

In this section we consider a generalization of the problem of counting distinct elements in a stream. The problem, called computing “moments,” involves the distribution of frequencies of different elements in the stream. We shall define moments of all orders and concentrate on computing second moments, from which the general algorithm for all moments is a simple extension.

4.5.1 Definition of Moments

Suppose a stream consists of elements chosen from a universal set. Assume the universal set is ordered so we can speak of the ith element for any i. Let m_i be the number of occurrences of the ith element for any i. Then the kth-order moment (or just kth moment) of the stream is the sum over all i of (m_i)^k.

Example 4.6 : The 0th moment is the sum of 1 for each m_i that is greater than 0.³ That is, the 0th moment is a count of the number of distinct elements in the stream. We can use the method of Section 4.4 to estimate the 0th moment of a stream.

³ Technically, since m_i could be 0 for some elements in the universal set, we need to make explicit in the definition of “moment” that 0⁰ is taken to be 0. For moments 1 and above, the contribution of m_i’s that are 0 is surely 0.

The 1st moment is the sum of the m_i’s, which must be the length of the stream. Thus, first moments are especially easy to compute; just count the length of the stream seen so far.

The second moment is the sum of the squares of the m_i’s. It is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is. To see the distinction, suppose we have a stream of length 100, in which eleven different elements appear. The most even distribution of these eleven elements would have one appearing 10 times and the other ten appearing 9 times each. In this case, the surprise number is 10² + 10 × 9² = 910. At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each. Then, the surprise number would be 90² + 10 × 1² = 8110.
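When a count for every distinct element fits in main memory, the definition can be applied directly. The following Python sketch does so for the two streams of Example 4.6. The function name `moment` and the integer element labels are illustrative choices, not part of the text, and the sum runs only over elements that actually occur, which matches the convention of the footnote.

```python
from collections import Counter

def moment(stream, k):
    """Exact kth moment: sum of (m_i)^k over the elements that occur."""
    counts = Counter(stream)                 # m_i for every element that appears
    return sum(m ** k for m in counts.values())

# Two streams of length 100 over eleven elements, as in Example 4.6.
# Element 0 appears 10 times and elements 1..10 appear 9 times each.
even = [0] * 10 + [e for e in range(1, 11) for _ in range(9)]
# Element 0 appears 90 times and elements 1..10 appear once each.
skewed = [0] * 90 + list(range(1, 11))

print(moment(even, 0), moment(even, 1), moment(even, 2))   # 11 100 910
print(moment(skewed, 2))                                   # 8110
```

The memory used by `Counter` grows with the number of distinct elements, which is exactly the cost that estimation is meant to avoid.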

As in Section 4.4, there is no problem computing moments of any order if we can afford to keep in main memory a count for each element that appears in the stream. However, also as in that section, if we cannot afford to use that much memory, then we need to estimate the kth moment by keeping a limited number of values in main memory and computing an estimate from these values. For the case of distinct elements, each of these values was the length of the longest tail produced by a single hash function. We shall see another form of value that is useful for second and higher moments.

4.5.2 The Alon-Matias-Szegedy Algorithm for Second Moments

For now, let us assume that a stream has a particular length n. We shall show how to deal with growing streams in the next section. Suppose we do not have enough space to count all the m_i’s for all the elements of the stream. We can still estimate the second moment of the stream using a limited amount of space; the more space we use, the more accurate the estimate will be. We compute some number of variables. For each variable X, we store:

1. A particular element of the universal set, which we refer to as X.element, and

2. An integer X.value, which is the value of the variable.

To determine the value of a variable X, we choose a position in the stream between 1 and n, uniformly and at random. Set X.element to be the element found there, and initialize X.value to 1. As we read the stream, add 1 to X.value each time we encounter another occurrence of X.element.

Example 4.7 : Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b. The length of the stream is n = 15. Since a appears 5 times, b appears 4 times, and c and d appear three times each, the second moment for the stream is 5² + 4² + 3² + 3² = 59. Suppose we keep three variables, X1, X2, and X3. Also, assume that at “random” we pick the 3rd, 8th, and 13th positions to define these three variables.

When we reach position 3, we find element c, so we set X1.element = c and X1.value = 1. Position 4 holds b, so we do not change X1. Likewise, nothing happens at positions 5 or 6. At position 7, we see c again, so we set X1.value = 2.

At position 8 we find d, and so set X2.element = d and X2.value = 1.

Positions 9 and 10 hold a and b, so they do not affect X1 or X2. Position 11 holds d so we set X2.value = 2, and position 12 holds c so we set X1.value = 3.

At position 13, we find element a, and so set X3.element = a and X3.value = 1.

Then, at position 14 we see another a and so set X3.value = 2. Position 15, with element b, does not affect any of the variables, so we are done, with final values X1.value = 3 and X2.value = X3.value = 2.

We can derive an estimate of the second moment from any variable X. This estimate is n × (2 × X.value − 1).

Example 4.8 : Consider the three variables from Example 4.7. From X1 we derive the estimate n × (2 × X1.value − 1) = 15 × (2 × 3 − 1) = 75. The other two variables, X2 and X3, each have value 2 at the end, so their estimates are 15 × (2 × 2 − 1) = 45. Recall that the true value of the second moment for this stream is 59. On the other hand, the average of the three estimates is 55, a fairly close approximation.
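Examples 4.7 and 4.8 can be replayed in a single pass over the stream. The Python sketch below assumes the sampled positions are supplied by the caller (1-based, as in the text) rather than drawn at random. The helper names `ams_values` and `ams_estimate` are illustrative and do not come from the text.

```python
def ams_values(stream, positions):
    """One pass over the stream; returns X.value for each chosen position."""
    chosen = set(positions)
    variables = {}                          # position -> [X.element, X.value]
    for i, e in enumerate(stream, start=1):
        # Every existing variable counts a further occurrence of its element.
        for var in variables.values():
            if var[0] == e:
                var[1] += 1
        # A variable is created when its position is reached, with value 1.
        if i in chosen:
            variables[i] = [e, 1]
    return [variables[p][1] for p in positions]

def ams_estimate(n, value):
    """Second-moment estimate from one variable: n * (2 * value - 1)."""
    return n * (2 * value - 1)

stream = list("abcbdacdabdcaab")            # the stream of Example 4.7
values = ams_values(stream, [3, 8, 13])
estimates = [ams_estimate(len(stream), v) for v in values]
print(values)                               # [3, 2, 2]
print(estimates)                            # [75, 45, 45]
print(sum(estimates) / len(estimates))      # 55.0
```

Only the element and the counter of each variable are stored, so the memory used depends on the number of variables and not on the number of distinct elements.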

4.5.3 Why the Alon-Matias-Szegedy Algorithm Works

We can prove that the expected value of the estimate derived from any variable constructed as in Section 4.5.2 is the second moment of the stream from which it is constructed.

Some notation will make the argument easier to follow. Let e(i) be the stream element that appears at position i in the stream, and let c(i) be the number of times element e(i) appears in the stream among positions i, i + 1, . . . , n.

Example 4.9 : Consider the stream of Example 4.7. e(6) = a, since the 6th position holds a. Also, c(6) = 4, since a appears at positions 9, 13, and 14, as well as at position 6. Note that a also appears at position 1, but that fact does not contribute to c(6).

The expected value of the estimate n × (2 × X.value − 1) is the average over all positions i between 1 and n of n × (2 × c(i) − 1), that is

$$E\bigl(n(2X.\mathit{value} - 1)\bigr) = \frac{1}{n}\sum_{i=1}^{n} n\,\bigl(2c(i) - 1\bigr)$$

We can simplify the above by canceling factors 1/n and n, to get

$$E\bigl(n(2X.\mathit{value} - 1)\bigr) = \sum_{i=1}^{n} \bigl(2c(i) - 1\bigr)$$

However, to make sense of the formula, we need to change the order of summation by grouping all those positions that have the same element. For instance, concentrate on some element a that appears m_a times in the stream. The term for the last position in which a appears must be 2 × 1 − 1 = 1. The term for the next-to-last position in which a appears is 2 × 2 − 1 = 3. The positions with a before that yield terms 5, 7, and so on, up to 2m_a − 1, which is the term for the first position in which a appears. That is, the formula for the expected value of the estimate can be written:

$$E\bigl(n(2X.\mathit{value} - 1)\bigr) = \sum_{a} \bigl(1 + 3 + 5 + \cdots + (2m_a - 1)\bigr)$$

Note that 1 + 3 + 5 + ··· + (2m_a − 1) = (m_a)². The proof is an easy induction on the number of terms in the sum. Thus, the expected value of the estimate is $\sum_a (m_a)^2$, which is the definition of the second moment.
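The argument can be checked numerically on the stream of Example 4.7. The Python sketch below computes c(i) for every position, averages n × (2 × c(i) − 1) over all positions, and compares the result with the exact second moment. The helper name `suffix_counts` is an illustrative choice, and exact rational arithmetic is used so that the comparison involves no rounding.

```python
from collections import Counter
from fractions import Fraction

def suffix_counts(stream):
    """c(i) for i = 1..n: occurrences of e(i) among positions i..n."""
    seen = Counter()
    c = [0] * len(stream)
    for i in range(len(stream) - 1, -1, -1):    # scan from right to left
        seen[stream[i]] += 1
        c[i] = seen[stream[i]]
    return c

stream = list("abcbdacdabdcaab")                # the stream of Example 4.7
n = len(stream)
c = suffix_counts(stream)

# Average of the estimate over all n equally likely starting positions.
expected = Fraction(sum(n * (2 * ci - 1) for ci in c), n)
second_moment = sum(m * m for m in Counter(stream).values())

print(c[5])                      # 4, which is c(6) from Example 4.9
print(expected, second_moment)   # 59 59
```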

4.5.4 Higher-Order Moments

We estimate kth moments, for k > 2, in essentially the same way as we estimate second moments. The only thing that changes is the way we derive an estimate from a variable. In Section 4.5.2 we used the formula n×(2v −1) to turn a value v, the count of the number of occurrences of some particular stream element a, into an estimate of the second moment. Then, in Section 4.5.3 we saw why this formula works: the terms 2v− 1, for v = 1, 2, . . . , m sum to m², where m is the number of times a appears in the stream.

Notice that 2v − 1 is the difference between v² and (v − 1)². Suppose we wanted the third moment rather than the second. Then all we have to do is replace 2v − 1 by v³ − (v − 1)³ = 3v² − 3v + 1. Then $\sum_{v=1}^{m} (3v^2 - 3v + 1) = m^3$, so we can use as our estimate of the third moment the formula n × (3v² − 3v + 1), where v = X.value is the value associated with some variable X. More generally, we can estimate kth moments for any k ≥ 2 by turning value v = X.value into n × (v^k − (v − 1)^k).
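The general formula is short enough to sketch in Python. The block below checks the telescoping sum for one choice of m and k, and then confirms on the stream of Example 4.7 that averaging the third-moment estimate over every starting position gives the exact third moment. The function name `kth_moment_estimate` is illustrative and not from the text.

```python
from collections import Counter

def kth_moment_estimate(n, v, k):
    """Estimate of the kth moment from one variable with value v."""
    return n * (v ** k - (v - 1) ** k)

# Telescoping check: the terms for v = 1..m add up to m^k.
m, k = 5, 3
assert sum(v ** k - (v - 1) ** k for v in range(1, m + 1)) == m ** k

stream = list("abcbdacdabdcaab")                 # the stream of Example 4.7
n = len(stream)
# X.value for a variable started at each position: occurrences from there on.
values = [stream[i:].count(stream[i]) for i in range(n)]

average = sum(kth_moment_estimate(n, v, 3) for v in values) / n
exact = sum(c ** 3 for c in Counter(stream).values())
print(average, exact)                            # 243.0 243
```

With k = 2 the function reduces to n × (2v − 1), the estimate of Section 4.5.2.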

4.5.5 Dealing With Infinite Streams

Technically, the estimate we used for second and higher moments assumes that n, the stream length, is a constant. In practice, n grows with time. That fact, by itself, doesn’t cause problems, since we store only the values of variables and multiply some function of that value by n when it is time to estimate the moment. If we count the number of stream elements seen and store this value, which only requires log n bits, then we have n available whenever we need it.

A more serious problem is that we must be careful how we select the positions for the variables. If we do this selection once and for all, then as the stream gets longer, we are biased in favor of early positions, and the estimate of the moment will be too large. On the other hand, if we wait too long to pick positions, then early in the stream we do not have many variables and so will get an unreliable estimate.

The proper technique is to maintain as many variables as we can store at all times, and to throw some out as the stream grows. The discarded variables are replaced by new ones, in such a way that at all times, the probability of picking any one position for a variable is the same as that of picking any other position. Suppose we have space to store s variables. Then the first s positions of the stream are each picked as the position of one of the s variables.

Inductively, suppose we have seen n stream elements, and the probability of any particular position being the position of a variable is uniform, that is s/n. When the (n + 1)st element arrives, pick that position with probability s/(n + 1). If not picked, then the s variables keep their same positions. However, if the (n + 1)st position is picked, then throw out one of the current s variables, with equal probability. Replace the one discarded by a new variable whose element is the one at position n + 1 and whose value is 1.

Surely, the probability that position n + 1 is selected for a variable is what it should be: s/(n + 1). However, the probability of every other position also is s/(n + 1), as we can prove by induction on n. By the inductive hypothesis, before the arrival of the (n + 1)st stream element, this probability was s/n. With probability 1 − s/(n + 1) the (n + 1)st position will not be selected, and the probability of each of the first n positions remains s/n. However, with probability s/(n + 1), the (n + 1)st position is picked, and the probability for each of the first n positions is reduced by factor (s − 1)/s. Considering the two cases, the probability of selecting each of the first n positions is

$$\Bigl(1 - \frac{s}{n+1}\Bigr)\frac{s}{n} + \frac{s}{n+1}\cdot\frac{s-1}{s}\cdot\frac{s}{n}$$

This expression simplifies to

$$\Bigl(1 - \frac{s}{n+1} + \frac{s-1}{n+1}\Bigr)\frac{s}{n} = \frac{n}{n+1}\cdot\frac{s}{n} = \frac{s}{n+1}$$

Thus, we have shown by induction on the stream length n that all positions have equal probability s/n of being chosen as the position of a variable.
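The replacement scheme can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `maintain_variables`, the list layout `[position, element, value]`, and the use of Python's `random.Random` as the source of randomness are all choices made here, not part of the text. The position is stored only so that the result can be inspected; the estimate itself needs just the element and the value.

```python
import random

def maintain_variables(stream, s, rng):
    """Keep s variables whose positions stay uniformly distributed."""
    variables = []                         # each entry: [position, element, value]
    for n, e in enumerate(stream, start=1):
        # Existing variables count further occurrences of their element.
        for var in variables:
            if var[1] == e:
                var[2] += 1
        if n <= s:
            variables.append([n, e, 1])    # the first s positions are all picked
        elif rng.random() < s / n:         # pick position n with probability s/n
            victim = rng.randrange(s)      # discard one variable, chosen uniformly
            variables[victim] = [n, e, 1]
    return variables

stream = list("abcbdacdabdcaab")           # the stream of Example 4.7
rng = random.Random(0)                     # fixed seed only for repeatability
variables = maintain_variables(stream, 3, rng)
n = len(stream)
estimates = [n * (2 * value - 1) for _, _, value in variables]
print(variables)
print(sum(estimates) / len(estimates))     # average second-moment estimate
```

Because a replaced variable restarts with value 1 at the new position, every surviving variable still holds the count of its element from its own position onward, which is what the estimate of Section 4.5.2 requires.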

4.5.6 Exercises for Section 4.5

Exercise 4.5.1 : Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?

A General Stream-Sampling Problem

Notice that the technique described in Section 4.5.5 actually solves a more general problem. It gives us a way to maintain a sample of s stream elements so that at all times, all stream elements are equally likely to be selected for the sample.

As an example of where this technique can be useful, recall that in Section 4.2 we arranged to select all the tuples of a stream having key value in a randomly selected subset. Suppose that, as time goes on, there are too many tuples associated with any one key. We can arrange to limit the number of tuples for any key K to a fixed constant s by using the technique of Section 4.5.5 whenever a new tuple for key K arrives.

! Exercise 4.5.2 : If a stream has n elements, of which m are distinct, what are the minimum and maximum possible surprise number, as a function of m and n?

Exercise 4.5.3 : Suppose we are given the stream of Exercise 4.5.1, to which we apply the Alon-Matias-Szegedy Algorithm to estimate the surprise number. For each possible value of i, if Xi is a variable starting at position i, what is the value of Xi.value?

Exercise 4.5.4 : Repeat Exercise 4.5.3 if the intent of the variables is to compute third moments. What is the value of each variable at the end? What estimate of the third moment do you get from each variable? How does the average of these estimates compare with the true value of the third moment?

Exercise 4.5.5 : Prove by induction on m that 1 + 3 + 5 + ··· + (2m − 1) = m².

Exercise 4.5.6 : If we wanted to compute fourth moments, how would we convert X.value to an estimate of the fourth moment?
