
Project: Probabilistic Analysis of Random Digital Trees

by Michael Fuchs

1 General

This is the final report on the National Science Council project “Probabilistic Analysis of Random Digital Trees” with grant number NSC-99-2115-M-009-007-MY2.

We first summarize the main achievements of the project.

• We applied our approach from the project proposal to various cost measures in random symmetric digital search trees (see [7]).

• We applied our approach to partial match retrievals in k-dimensional bucket digital trees (see [4]).

• We applied our approach to approximate counting (see [6]).

• We applied our approach to the Wiener index in random digital trees (see [5]).

• We proposed a general framework for the variance of linear cost measures in tries and PATRICIA tries (this is the work [8], which is still in progress).

• We presented some of our results at the Workshop on Analytic Algorithmics and Combinatorics (ANALCO11), San Francisco, USA, and at the 23rd International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, Montreal, Canada.

• Four master students and one Ph.D. student were partially supported by the project (three master students have already graduated).

2 Results

This project was concerned with probabilistic analysis of random digital trees. Before explaining our results, we briefly recall what digital trees are.

Consider n keys which are infinite 0-1 strings. The three main types of digital trees are constructed as follows.

• Digital Search Trees: The first key is placed into the root; all other keys are directed either to the left or right subtree according to whether their first bit is 0 or 1; subtrees are built recursively with the first bit of the keys removed.

• Tries: same construction principle as for digital search trees, but only leaves hold keys.

• PATRICIA Tries: same as for tries, but with one-way branching suppressed.

The above trees are still deterministic. In order to construct random trees, we have to fix a probability model. The model we considered in this project was the Bernoulli model, where we assumed that bits in the keys are generated independently with probability of generating 1 equal to p (consequently, the probability of generating 0 is q := 1 − p). If p = 1/2, we called the tree symmetric, otherwise it was called asymmetric.
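The constructions and the Bernoulli model can be illustrated by a minimal Python sketch. The names `random_key`, `Node`, `dst_insert`, `build_trie` and `size` are hypothetical, and infinite keys are replaced by finite prefixes, which is an assumption that only works as long as the prefixes are long enough to separate all keys.

```python
import random

def random_key(bits, p, rng):
    """Finite prefix of a 0-1 key; each bit is 1 with probability p (assumed truncation)."""
    return [1 if rng.random() < p else 0 for _ in range(bits)]

class Node:
    def __init__(self, key=None):
        self.key = key              # stored key (None for internal trie nodes)
        self.child = [None, None]   # child[0]: bit 0, child[1]: bit 1

def dst_insert(root, key):
    """Digital search tree: the first key stays in the root, later keys
    follow their bits until they reach an empty position."""
    if root is None:
        return Node(key)
    node, depth = root, 0
    while True:
        b = key[depth]
        if node.child[b] is None:
            node.child[b] = Node(key)
            return root
        node, depth = node.child[b], depth + 1

def build_trie(keys, depth=0):
    """Trie: split on bit `depth` until a subtree holds at most one key,
    so that only leaves hold keys."""
    if not keys:
        return None
    if len(keys) == 1:
        return Node(keys[0])
    node = Node()
    for b in (0, 1):
        node.child[b] = build_trie([k for k in keys if k[depth] == b], depth + 1)
    return node

def size(node):
    """Number of nodes in the tree."""
    return 0 if node is None else 1 + size(node.child[0]) + size(node.child[1])
```

For p = 1/2 this samples a symmetric tree, for any other p in (0, 1) an asymmetric one. A PATRICIA trie would additionally contract every internal node with a single non-empty child; this step is left out of the sketch.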

Now, we will discuss the main findings of the project.

2.1 Symmetric Digital Search Trees

In [7], we proposed a new approach to the variance of cost measures in symmetric digital search trees. Previous approaches were cumbersome and yielded very involved expressions for constants. Our approach greatly simplifies both the analysis and the final expressions. Roughly speaking, it consists of two major parts, namely, de-poissonization and a combination of the Mellin and Laplace transforms. The former is a classical tool in the analysis of algorithms; see [9]. However, we simplified the verification of the de-poissonization assumptions by proposing a notion of admissibility and proving closure properties. As for the second part, the combination of the Mellin and Laplace transforms is entirely new in the analysis of algorithms. Previous methods used either Rice’s method or the Euler transform combined with the Mellin transform and singularity analysis. Our approach is computationally easier and also gives simpler results.
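The de-poissonization part can be sketched in symbols as follows. This is a hedged outline rather than a statement of the results in [7]: the second-moment transform $\tilde f_2$ and the symbol $\tilde V$ are notation introduced here for illustration, and the approximation in the last line is only expected to hold under the admissibility conditions mentioned above.

```latex
% Poisson generating functions of the first two moments of X_n
\tilde f_1(z) = e^{-z}\sum_{n\ge 0} E(X_n)\,\frac{z^n}{n!},
\qquad
\tilde f_2(z) = e^{-z}\sum_{n\ge 0} E(X_n^2)\,\frac{z^n}{n!}.

% Poissonized variance with first-order correction term
\tilde V(z) = \tilde f_2(z) - \tilde f_1(z)^2 - z\,\tilde f_1'(z)^2 .

% De-poissonization heuristic (assumed to hold for admissible functions)
E(X_n) \approx \tilde f_1(n), \qquad \mathrm{Var}(X_n) \approx \tilde V(n).
```

The correction term $z\,\tilde f_1'(z)^2$ accounts for the extra randomness of a Poisson number of keys; the asymptotics of $\tilde V(z)$ for large $z$ are then what the Mellin and Laplace transforms are applied to.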

The parameters to which our approach can be applied are so-called additive cost measures. We give some examples of parameters of this type which are analyzed in our paper.

• Internal path length: sum of distances of all nodes to the root.

• Peripheral path length: sum of sizes of subtrees rooted at the fathers of leaves.

• Number of leaves and more general patterns.

• Colless index: sum of absolute differences of sizes of left and right subtree over all nodes.

Moreover, our approach can also be applied to bucket digital search trees, where every node can hold up to b ≥ 1 keys. Here, there are two different notions of path length, namely, the sum of distances of all keys to the root (key-wise path length) and the sum of distances of all nodes to the root (node-wise path length). Interestingly, these two path lengths exhibit quite different behavior; see our paper for details.
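The word “additive” can be made concrete with a short Python sketch: each measure of a tree is the sum of the measures of its two subtrees plus a toll that depends only on the subtree sizes. The tuple encoding of trees and the function name `measures` are assumptions made for this illustration.

```python
def measures(tree):
    """tree is None (empty) or a pair (left, right).
    Returns (size, internal path length, number of leaves, Colless index)."""
    if tree is None:
        return 0, 0, 0, 0
    ls, lp, ll, lc = measures(tree[0])
    rs, rp, rl, rc = measures(tree[1])
    size = 1 + ls + rs
    # seen from this root, every node of both subtrees is one level deeper
    path = lp + rp + ls + rs
    # a node is a leaf exactly when both subtrees are empty
    leaves = 1 if size == 1 else ll + rl
    # toll: absolute difference of the two subtree sizes
    colless = lc + rc + abs(ls - rs)
    return size, path, leaves, colless
```

For example, the tree consisting of a root with a leaf as left child and, as right child, a node with a single left leaf has size 4, internal path length 4, 2 leaves and Colless index 2.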

We finish this section by exemplifying our results. To this end, we choose the number of leaves as a guiding example. We use the notation $X_n$ for the number of leaves in a random symmetric digital search tree of size $n$. The computation of the mean was a problem posed by Knuth which was solved by Flajolet and Sedgewick in [3]. They proved that, as $n \to \infty$,

\[
E(X_n) = n\bigl(C_{mean} + P_1(\log_2 n)\bigr) + O(n^{1/2})
\]

with $P_1(z)$ a one-periodic function with zero mean and computable Fourier coefficients and

\[
C_{mean} = 1 + \sum_{k\ge 1}\frac{k}{Q_k 2^k}\sum_{1\le j\le k}\frac{1}{2^j-1}
- \frac{1}{Q_\infty}\Biggl(\frac{1}{\log 2} + \Bigl(\sum_{k\ge 1}\frac{1}{2^k-1}\Bigr)^{2} - \sum_{k\ge 1}\frac{1}{2^k-1}\Biggr),
\]

where $Q_k = \prod_{1\le j\le k}(1-1/2^j)$ and $Q_\infty = \lim_{k\to\infty} Q_k$. Then, Kirschenhofer and Prodinger investigated the variance in [10] and obtained, as $n \to \infty$,

\[
\mathrm{Var}(X_n) \sim n\bigl(C_{var} + P_2(\log_2 n)\bigr)
\]

with $P_2(z)$ again a one-periodic function with zero mean. Moreover, their involved analysis yielded a very long and complicated expression for $C_{var}$. In our paper, we gave a much shorter proof of the above result. Another advantage of our approach is that the Fourier coefficients of $P_2(z)$ are computable and our expression for $C_{var}$ is much simpler. More precisely, our approach yields

\[
C_{var} = \frac{1}{\log 2}\int_0^\infty \frac{s}{Q(-2s)}\int_0^\infty e^{-zs}\,\tilde g_2(z)\,dz\,ds,
\]

where $Q(s) = \prod_{j\ge 1}(1 - s/2^j)$ and

\[
\tilde g_2(z) = z\,\tilde f_1''(z)^2 + e^{-z}\Bigl(1 - e^{-z}(1+z) + 2z\,\tilde f_1'(z/2) - 4\,\tilde f_1(z/2)\Bigr)
\]

with $\tilde f_1(z) = e^{-z}\sum_{n\ge 0} E(X_n)\,z^n/n!$. Our paper also proposes a method of numerical evaluation of $C_{var}$ and the first few digits are $C_{var} = 0.034203\cdots$.
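As a plausibility check, the constant $C_{mean}$ can be evaluated numerically by truncating the sums. The following Python sketch assumes the reading of the formula in which the square applies to the sum over $1/(2^k-1)$; the function names and the truncation point `kmax` are choices made for this illustration.

```python
import math

def q_products(kmax):
    """Return [Q_0, ..., Q_kmax] with Q_k = prod_{j=1..k} (1 - 2**-j)."""
    qs = [1.0]
    for j in range(1, kmax + 1):
        qs.append(qs[-1] * (1.0 - 2.0 ** -j))
    return qs

def c_mean_leaves(kmax=80):
    """Truncated evaluation of C_mean for the number of leaves."""
    qs = q_products(kmax)
    q_inf = qs[-1]                        # Q_kmax as a proxy for Q_infinity
    alpha = sum(1.0 / (2.0 ** k - 1.0) for k in range(1, kmax + 1))
    theta, inner = 0.0, 0.0
    for k in range(1, kmax + 1):
        inner += 1.0 / (2.0 ** k - 1.0)   # sum over 1 <= j <= k of 1/(2^j - 1)
        theta += k / (qs[k] * 2.0 ** k) * inner
    return 1.0 + theta - (1.0 / math.log(2) + alpha ** 2 - alpha) / q_inf
```

With this reading, `q_products(80)[-1]` is about 0.28879 and `c_mean_leaves()` evaluates to roughly 0.372, i.e., on average a bit more than a third of the nodes are leaves.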

2.2 k-dimensional Bucket Digital Trees

We start by briefly explaining what k-dimensional bucket digital trees are. First, k-dimensional digital trees are built from keys which consist of k 0-1 sequences. The k sequences are suitably shuffled and then the digital tree is built as before. In the bucket version, nodes containing data can hold up to b ≥ 1 keys. Finally, the random model is also as before.

In [4], we considered partial match retrieval in k-dimensional bucket digital trees. Here, a partial match query q is a k-tuple of specified and unspecified entries. This query is used to walk through the tree (starting from the root), where a specified entry means that we only proceed with one subtree, whereas an unspecified entry means that we proceed in both subtrees. The cost $X_n$ is the number of nodes visited by such a partial match query. In [4], we derived the variance of $X_n$ for partial match retrieval in bucket symmetric digital search trees as well as bucket symmetric tries and bucket symmetric PATRICIA tries. We briefly explain our results for symmetric digital search trees and refer the reader to [4] for the bucket version and other variants of digital trees.
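The walk through the tree can be sketched in Python for the simplest variant, a k-dimensional trie with bucket size b = 1. Two assumptions are made here: the shuffle is taken to be the regular one (bits of the k coordinates are interleaved cyclically), and keys are finite prefixes. The names `shuffle`, `build_trie` and `partial_match_cost` are hypothetical.

```python
def shuffle(point):
    """Interleave k bit lists: x1[0], x2[0], ..., xk[0], x1[1], x2[1], ..."""
    return [bits[i] for i in range(len(point[0])) for bits in point]

def build_trie(keys, depth=0):
    """Trie over shuffled keys; nodes are ("leaf", key) or ("node", zero, one)."""
    if not keys:
        return None
    if len(keys) == 1:
        return ("leaf", keys[0])
    return ("node",
            build_trie([s for s in keys if s[depth] == 0], depth + 1),
            build_trie([s for s in keys if s[depth] == 1], depth + 1))

def partial_match_cost(tree, query, depth=0):
    """Number of nodes visited by the query; query[i] is a bit list
    (specified entry) or None (unspecified entry)."""
    if tree is None:
        return 0
    if tree[0] == "leaf":
        return 1
    k = len(query)
    spec = query[depth % k]          # coordinate tested at this level
    if spec is None:                 # unspecified: proceed in both subtrees
        return (1 + partial_match_cost(tree[1], query, depth + 1)
                  + partial_match_cost(tree[2], query, depth + 1))
    bit = spec[depth // k]           # specified: proceed in one subtree only
    return 1 + partial_match_cost(tree[1 + bit], query, depth + 1)
```

For the four two-dimensional points (00, 00), (01, 10), (10, 01), (11, 11), the trie has 7 nodes; a fully unspecified query visits all of them, whereas the query with unspecified first entry and second entry 10 visits 5 nodes.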

First, Kirschenhofer and Prodinger derived the mean of $X_n$ in [11]. They proved that, as $n \to \infty$,

\[
E(X_n) \sim n^{u/k} P_1(\log_2 n^{1/k}),
\]

where $u$ is the number of unspecified entries in the partial match query and $P_1(z)$ is a one-periodic function with computable Fourier coefficients. In [4], we derived the variance which satisfies, as $n \to \infty$,

\[
\mathrm{Var}(X_n) \sim n^{u/k} P_2(\log_2 n^{1/k}),
\]

where $P_2(z)$ is again a one-periodic function with computable Fourier coefficients. As an example, we state our expression for the 0-th Fourier coefficient of $P_2(z)$, which is

\[
\frac{1}{k(\log 2)\,Q_\infty\,\Gamma(1+u/k)}\sum_{l=0}^{k-1}\delta_1\cdots\delta_l\,2^{-ul/k}
\sum_{j_1,j_2,j_3\ge 0}\frac{(-1)^{j_1}\,\bar\delta_{q^{(l)},j_2}\,\bar\delta_{q^{(l)},j_3}\,2^{-\binom{j_1}{2}+(1-u/k)j_1}}{2^{j_2+j_3}\,Q_{j_1}Q_{j_2}Q_{j_3}}\,
\varphi\bigl(u/k;\,2^{j_1-j_2}+2^{j_1-j_3}\bigr),
\]

where $q^{(l)}$ is the $l$-th cyclic shift of $q$, $\delta_i = 1$ if the $(i \bmod k)$-th entry of $q$ is specified and $\delta_i = 2$ otherwise,

\[
\bar\delta_{q,j} = \sum_{l\ge 0}\frac{(-1)^l\,2^{-\binom{l+1}{2}}}{Q_l}\prod_{h=1}^{l+j}\delta_h,
\]

and

\[
\varphi(\omega;x) =
\begin{cases}
\pi(x^{\omega}-1)/\bigl(\sin(\pi\omega)(x-1)\bigr), & \text{if } x \ne 1;\\
\pi\omega/\sin(\pi\omega), & \text{if } x = 1.
\end{cases}
\]

2.3 Approximate Counting

In [6], we gave a new analysis of approximate counting. Approximate counting, an algorithm proposed by Morris in [12], is used for counting a huge number of objects within a certain error tolerance using very limited space. The algorithm is very easy to describe: let $C_n$ be the content of the counter after “counting $n$ objects”. Then, a random decision determines whether the counter is increased or not when “counting the $(n+1)$-st object”. More precisely, we have

\[
C_{n+1} =
\begin{cases}
C_n + 1, & \text{with probability } 2^{-C_n};\\
C_n, & \text{with probability } 1 - 2^{-C_n}.
\end{cases}
\]
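The update rule translates directly into a few lines of Python. The sketch below is an illustration only: the initial value `start=1` is an assumption (the text above does not fix the initial content of the counter), and the names `approximate_count` and `estimate` are hypothetical.

```python
import random

def approximate_count(n, rng, start=1):
    """Run the counter on n objects and return its final content C_n.
    The initial content `start` is an assumption of this sketch."""
    c = start
    for _ in range(n):
        if rng.random() < 2.0 ** -c:   # increase with probability 2^{-C}
            c += 1
    return c

def estimate(c, start=1):
    """Estimate of n from the counter content; unbiased because
    E(2^{C_{n+1}} | C_n) = 2^{C_n} + 1 under the update rule."""
    return 2 ** c - 2 ** start
```

A call such as `approximate_count(10**6, random.Random(1))` returns a value close to $\log_2 n$, so the counter needs only about $\log_2\log_2 n$ bits, which is the point of the algorithm.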

Flajolet [2] was the first to give a detailed analysis of approximate counting. He proved that, as n → ∞,

\[
E(C_n) \sim \log_2 n + C_{mean} + P_1(\log_2 n),
\]

where $P_1(z)$ is a one-periodic function with zero mean and computable Fourier coefficients and

\[
C_{mean} = \frac{\gamma}{\log 2} + \frac{1}{2} - \sum_{l\ge 1}\frac{1}{2^l-1},
\]

where $\gamma$ is Euler’s constant. Moreover, for the variance, he proved that, as $n \to \infty$,

\[
\mathrm{Var}(C_n) \sim C_{var} + P_2(\log_2 n),
\]

where $P_2(z)$ is a one-periodic function with zero mean and computable Fourier coefficients and

\[
C_{var} = \frac{\pi^2}{6\log^2 2} - \sum_{l\ge 1}\frac{1}{2^l-1} - \sum_{l\ge 1}\frac{1}{(2^l-1)^2} + \frac{1}{12} - \frac{1}{\log 2}\sum_{l\ge 1}\frac{1}{l\,\sinh(2l\pi^2/\log 2)}.
\]

After Flajolet’s analysis, at least another half a dozen papers re-proved Flajolet’s results using all kinds of different methods (martingale theory, Euler transform, Rice’s method, probabilistic tools, etc.). In [6], we showed that $C_n$ is related to the length of the left path in a random symmetric digital search tree. Using this correspondence and our approach from Section 2.1, we again re-proved the above results of Flajolet. However, a new feature of our method is that it yields a very different expression for the Fourier coefficients of $P_2(z)$ and for $C_{var}$. We exemplify our result by stating the expression for $C_{var}$ (notation as in Section 2.1):

\[
C_{var} = \frac{Q_\infty}{\log 2}\sum_{h,l,j\ge 0}\frac{(-1)^j\,2^{-h-l-\binom{j+1}{2}}}{Q_h Q_l Q_j}\,\psi\bigl(2^{-h-j}+2^{-l-j}\bigr),
\]

where

\[
\psi(x) =
\begin{cases}
\log x/(x-1), & \text{if } x \ne 1;\\
1, & \text{if } x = 1.
\end{cases}
\]

Equating with the above expression of Flajolet gives an identity for which we provided a direct proof in [6] with tools from q-analysis.

Finally, we also considered variants of approximate counting, such as several versions of approximate counting with m counters, and the analysis of shape parameters in m-DSTs, for which our approach can be used as well. For detailed results see our paper.

2.4 Wiener Index in Random Digital Trees

In [5], we applied our approach from Section 2.1 to the analysis of the Wiener index in random digital trees. The Wiener index is one of the most important topological indices in combinatorial chemistry and was investigated in numerous papers (see [1] for many references).
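For readers unfamiliar with the parameter: the Wiener index of a tree is the sum of the distances between all unordered pairs of nodes. The following Python sketch computes it together with the total path length (the sum of the distances of all nodes to the root); the tuple encoding of trees and the function name are assumptions made for this illustration.

```python
def path_length_and_wiener(tree):
    """tree is None (empty) or a pair (left, right).
    Returns (n, T, W): number of nodes, total path length, Wiener index."""
    sizes = []                       # subtree size of every node

    def walk(t, depth):
        if t is None:
            return 0, 0
        ls, lp = walk(t[0], depth + 1)
        rs, rp = walk(t[1], depth + 1)
        s = 1 + ls + rs
        sizes.append(s)
        return s, depth + lp + rp    # own depth plus depths in both subtrees

    n, total_path = walk(tree, 0)
    # The edge above a subtree with s nodes separates s nodes from the
    # remaining n - s nodes, so it lies on s * (n - s) shortest paths.
    # The root contributes 0 since it has no edge above it.
    wiener = sum(s * (n - s) for s in sizes)
    return n, total_path, wiener
```

For a path on three nodes rooted at one end this gives T = 3 and W = 4; for a root with two leaves it gives T = 2 and W = 4. The example shows that both parameters are built from the same subtree sizes, which is one way to see why they are strongly correlated.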

As for random trees, the Wiener index was investigated for simply generated random trees, non-plane unlabeled random trees and a huge subclass of random grid trees containing random binary search trees, random median-of-(2k + 1) search trees, random m-ary search trees, random quadtrees, etc. An important class of random trees which was left open was the class of random digital trees. In [5], we closed this gap. In particular, we showed that random digital trees behave very differently from other random grid trees (to which random digital trees belong as a subclass).

In order to state our result, we consider the random symmetric digital search tree. Denote by $T_n$ the total path length (which was already investigated in [7]) and by $W_n$ the Wiener index. Then, we proved in [5] that, as $n \to \infty$,

\[
E(T_n) = n\log_2 n + nP_1(\log_2 n) + O(\log n),
\]
\[
E(W_n) = n^2\log_2 n + n^2P_1(\log_2 n) - n^2 + O(n\log n),
\]

where $P_1(z)$ is a one-periodic function with computable Fourier coefficients (see [7]). Moreover, we showed that for variances and covariances, as $n \to \infty$,

\[
\mathrm{Var}(T_n) = nP_2(\log_2 n) + O(1),
\]
\[
\mathrm{Cov}(T_n, W_n) = n^2P_2(\log_2 n) + O(n\log n),
\]
\[
\mathrm{Var}(W_n) = n^3P_2(\log_2 n) + O(n^2\log n),
\]

where $P_2(z)$ is again a one-periodic function with computable Fourier coefficients (see [7]).

Using the above result for the variances and covariances, we also proved the following bivariate central limit theorem.

Theorem 1. We have

\[
\left(\frac{T_n - E(T_n)}{\sqrt{\mathrm{Var}(T_n)}},\ \frac{W_n - E(W_n)}{\sqrt{\mathrm{Var}(W_n)}}\right) \xrightarrow{\ d\ } (X, X),
\]

where $X$ is a standard normally distributed random variable.

Moreover, we also obtained similar results for bucket digital search trees (where there are two notions of the Wiener index), tries (where there are again two notions of the Wiener index) and PATRICIA tries. See [5] for more details.

2.5 Tries and PATRICIA Tries

Our approach from Section 2.1 is also applicable to additive cost measures in tries and PATRICIA tries. Moreover, here the asymmetric version can be treated as well. In the work in progress [8], we have proposed a general framework for the variance of additive cost measures in random tries and PATRICIA tries. More precisely, for tries, for instance, an additive cost measure can be interpreted as a sequence of random variables $X_n$ satisfying the distributional recurrence

\[
X_n \stackrel{d}{=} X_{I_n} + X^*_{n-I_n} + T_n,
\]

where $I_n = \mathrm{Binomial}(n, p)$, $X_n \stackrel{d}{=} X_n^*$, and $X_n, X_n^*, I_n, T_n$ are independent. In [8], we have proved results which, under some conditions on $E(T_n)$ and $E(T_n^2)$, give asymptotic expansions of the variance of $X_n$. Again, these asymptotic expansions involve periodic functions whose Fourier coefficients are computable. One new feature of this study compared to [7] is a new closure property of the Hadamard product, which is needed for the depoissonization step.
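A distributional recurrence of this kind is easy to simulate. The following Python sketch samples the size of a trie (number of internal nodes) under two assumptions that are not spelled out above: the recurrence is taken in the standard form $X_n = X_{I_n} + X^*_{n-I_n} + T_n$ with toll $T_n = 1$ for $n \ge 2$, and the initial conditions are $X_0 = X_1 = 0$. The function name `trie_size` is hypothetical.

```python
import random

def trie_size(n, p, rng):
    """Sample X_n for the size of a random trie built from n keys,
    using X_m = X_{I_m} + X*_{m - I_m} + 1 for m >= 2 and X_0 = X_1 = 0."""
    total, stack = 0, [n]
    while stack:
        m = stack.pop()
        if m < 2:
            continue                  # empty subtree or single leaf
        total += 1                    # toll: one internal node
        # I_m ~ Binomial(m, p): number of keys whose next bit is 1
        i = sum(rng.random() < p for _ in range(m))
        stack.append(i)
        stack.append(m - i)
    return total
```

Repeating the call many times for fixed n and p gives empirical means and variances that can be compared with asymptotic expansions; the explicit stack avoids deep recursion for strongly asymmetric p.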

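We give an example to show the usefulness of our results. Let $X_n$ be the number of internal nodes in tries (size of tries). Then, applying our results yields, as $n \to \infty$,

\[
\mathrm{Var}(X_n) \sim
\begin{cases}
\dfrac{n}{-p\log p - q\log q}\displaystyle\sum_{k} G(-1+\chi_k)\,n^{-\chi_k}, & \text{if } \log p/\log q \in \mathbb{Q};\\[2ex]
\dfrac{G(-1)}{-p\log p - q\log q}\,n, & \text{if } \log p/\log q \notin \mathbb{Q},
\end{cases}
\]

where $\chi_k = 2\rho k\pi i/\log p$ with $\log p/\log q = \rho/\sigma$, $\gcd(\rho, \sigma) = 1$ and $G(\omega) = G_1(\omega) + H_1(\omega)$ with

\[
G_1(\omega) = (\omega+1)\Gamma(\omega)\left(1 - \frac{\omega^2+4\omega+8}{2^{\omega+3}}\right)
+ 2\sum_{l\ge 1}\frac{(-1)^l\,l\,\Gamma(\omega+l+1)}{(l+1)!\,(1-p^{l+1}-q^{l+1})}\,(p^{l+1}+q^{l+1})\bigl(l(\omega+l+1)-1\bigr)
\]

and

\[
H_1(-1+\chi_k) = pq\left(\sum_{l\ge 1}\frac{(-1)^{l-1}\,\Gamma(l+\chi_k+1)}{(l-1)!\,(1-p^{1+l}-q^{1+l})(1-p^{1-l}-q^{1-l})}\cdot\frac{p^{l}-q^{l}}{p^{-l}-q^{-l}}
- \sum_{\kappa_l}\frac{\Gamma(\kappa_l+1)\,\Gamma(-\kappa_l+\chi_k+1)}{(1-p^{1-\kappa_l}-q^{1-\kappa_l})(p^{1+\kappa_l}\log p+q^{1+\kappa_l}\log q)}\cdot\frac{p^{-\kappa_l}-q^{-\kappa_l}}{p^{\kappa_l}-q^{\kappa_l}}\right),
\]

where $\kappa_l$ are all zeros of $1 - p^{1+\tau} - q^{1+\tau} = 0$ with $\Re(\kappa_l) < 0$.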

3 Papers and Preprints

The following papers and preprints were (partially) written during the lifetime of this project (all of them are available from my webpage http://www.math.nctu.edu.tw/~mfuchs/).

1. H.-K. Hwang, M. Fuchs, V. Zacharovas (2010). Asymptotic variance of random symmetric digital search trees, Discrete Math. Theor. Comput. Sci., 12, 103–166.

2. M. Fuchs (2010). The variance for partial match retrievals in k-dimensional bucket digital trees, Discrete Math. Theor. Comput. Sci. Proc., Proceedings of the 21st International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, 261–276.

3. M. Fuchs (2011). The subtree size profile of plane-oriented recursive trees, Proceedings of the Eighth Workshop on Analytic Algorithmics and Combinatorics (ANALCO), 85–92.

4. M. Fuchs (2011). A note on simultaneous Diophantine approximation in positive

5. C.-H. Chen and M. Fuchs (2011). On the moment-transfer approach for random variables satisfying a one-sided distributional recurrence, Electron. J. Probab., 16, 903–928.

6. M. Fuchs (2012). Limit theorems for subtree size profiles of increasing trees, Combin. Probab. Comput., 21, 412–441.

7. M. Fuchs, C.-K. Lee, H. Prodinger (2012). Approximate counting via the Poisson-Laplace-Mellin method, Discrete Math. Theor. Comput. Sci. Proc., Proceedings of the 23rd International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, 13–28.

8. S.-Y. Chen and M. Fuchs. A higher-dimensional Kurzweil theorem for formal Laurent series over finite fields, 10 pages, Finite Fields Appl., in press.

9. M. Fuchs and C.-K. Lee. The Wiener index of random digital trees, 28 pages, submitted.

The papers [2,3,7] are conference papers which were submitted and presented at the following meetings.

• 21st International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, Vienna, Austria ([2]).

• Workshop on Analytic Algorithmics and Combinatorics (ANALCO11), San Francisco, USA ([3]).

• 23rd International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, Montreal, Canada ([7]).

Of these three papers, only [3] is an extended abstract (the journal version of this paper is [6]), whereas the other two are complete papers which will not appear elsewhere.

4 Students

Four master students and one Ph.D. student were partially supported by the project.

• Shu-Yi Chen, master student, graduated on June 7, 2012. (Thesis: “Kurzweil’s Theorem in the Field of Formal Laurent Series”.)

• Yen-Ting Kuo, master student, graduated on June 27, 2012. (Thesis: “Limit Theorems for the Cost of Splitting Random Cayley Trees”.)

• Yi-Wen Lee, master student, graduated on June 27, 2012. (Thesis: “Group Patterns of Social Animals under the Neutral Model”.)

• Chi-Wei Chiang, master student, expected to graduate in November 2012. (Thesis: “Random Polynomials over Finite Fields via Analytic Combinatorics”.)

5 Conclusion and Future Work

In this project, we proposed a new approach to the asymptotics of the variance of cost measures in digital trees. Our approach was already outlined in the project proposal, where it was claimed that it has many applications. We confirmed this claim by applying it to various problems such as cost measures in random digital trees, analysis of partial match retrievals in k-dimensional digital trees, approximate counting, analysis of the Wiener index in random digital trees, etc. Moreover, we presented our findings at international meetings. Finally, four master students and one Ph.D. student were partially supported by this project (three master students have already graduated).

As for future work, there are a lot of possible directions.

Extension of the Wiener Index and Correlation in Random Trees. In [5], we studied the Wiener index in random symmetric digital trees. Our approach also applies to asymmetric digital trees. Moreover, extensions of the Wiener index, such as the k-Steiner distance as well as other topological indices from combinatorial chemistry can be studied as well.

One main feature of our study of the Wiener index was the correlation between the Wiener index and the total path length, which led to a bivariate central limit theorem. Correlations of other parameters in random trees could be studied as well. For instance, an interesting example is the correlation between the space requirement of an m-ary search tree and its total path length. Since the limit law of the space requirement is known to be normal for m ≤ 26 and not to exist for m > 26, whereas the limit law of the total path length always exists, it is interesting to ask what happens with the bivariate limit law. We will investigate such and similar questions in one of our forthcoming projects.

Other Cost Measures in Random Digital Trees. Apart from the cost measures we considered above, our approach can also be applied to many other cost measures. Some examples include the number of k-protected nodes, node-sorts and the size of PATRICIA tries, etc.

A General Framework for Logarithmic Cost Measures. Most of the cost measures we considered so far are either linear or superlinear. The only logarithmic cost measure we considered is the left-path length in random digital search trees (which was needed for the analysis of approximate counting). Since there are many other logarithmic cost measures, it would be nice to have a more general framework like the one we proposed for tries and PATRICIA tries.

From Asymptotic Expansions to Identities. One of our main contributions in [7] was the use of “poissonized variances”. By suitably refining this notion, one does not only get asymptotic expansions but even identities (of an asymptotic nature). We have already worked this out for some easy examples, but the full usefulness of such identities still remains to be investigated.

Limit Laws, Local Limit Theorems, Rates of Convergence. In our original project proposal, we stated that, once the variance was well understood, we would propose a general framework for deriving more refined results such as limit laws, local limit theorems, rates of convergence, etc. However, due to the many interesting problems related to the variance, we have not yet had time to work on this second stage of the project. We will postpone such a study (which certainly is of great importance) to a future project.

References

[1] A. A. Dobrynin, R. Entringer, I. Gutman (2001). Wiener index of trees: Theory and applications, Acta Appl. Math., 66, 211–249.

[2] P. Flajolet (1985). Approximate counting: a detailed analysis, BIT, 25, 113–134.

[3] P. Flajolet and R. Sedgewick (1986). Digital search trees revisited, SIAM J. Comput., 15, 748–767.

[4] M. Fuchs (2010). The variance for partial match retrievals in k-dimensional bucket digital trees, Discrete Math. Theor. Comput. Sci. Proc., Proceedings of the 21st International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, 261–276.

[5] M. Fuchs and C.-K. Lee. The Wiener index of random digital trees, submitted.

[6] M. Fuchs, C.-K. Lee, H. Prodinger (2012). Approximate counting via the Poisson-Laplace-Mellin method, Discrete Math. Theor. Comput. Sci. Proc., Proceedings of the 23rd International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, 13–28.

[7] H.-K. Hwang, M. Fuchs, V. Zacharovas (2010). Asymptotic variance of random symmetric digital search trees, Discrete Math. Theor. Comput. Sci., 12, 103–166.

[8] H.-K. Hwang, M. Fuchs, V. Zacharovas. Asymptotic variance of random tries and related structures, in preparation.

[9] P. Jacquet and W. Szpankowski (1998). Analytic de-Poissonization and its applications, Theor. Comput. Sci., 201, 1–62.

[10] P. Kirschenhofer and H. Prodinger (1988). Eine Anwendung der Theorie der Modulfunktionen in der Informatik, Österreich. Akad. Wiss. Math.-Natur. Kl. Sitzungsber. II, 197, 339–366.

[11] P. Kirschenhofer and H. Prodinger (1994). Multidimensional digital searching - alternative data structures, Random Struc. Algorithms, 5, 123–134.

[12] R. Morris (1978). Counting large numbers of events in small registers, Comm. ACM, 21, 840–842.
