Figure 2.4: A bucket digital search tree built from the keys S1, . . . , S8 with bucket size b = 2.

2.3 Past Research on Random Digital Trees

Random digital trees have been studied from many different angles. In this section, we focus on shape parameters of random digital trees. For each parameter, we first give its definition, then explain its practical use, and finally list references to known results.

Size

For random digital trees, the parameter size is the total number of internal nodes. Consequently, this parameter is only of interest for tries and PATRICIA tries.

This parameter is studied because the number of internal nodes is proportional to the number of pointers needed to store the data structure: the fewer internal nodes, the less space required. Note that binary PATRICIA tries have a deterministic size, and hence this parameter matters only for m-ary PATRICIA tries with m ≥ 3. For studies of the size of tries, see [23, 39, 93, 95, 103, 107, 109, 112, 116, 139, 145, 155, 194]. For PATRICIA tries, see [14, 15, 38, 39, 146].

Depth

Depth is defined as the distance from the root to a randomly selected node that normally contains data. Some researchers use the name depth of insertion or successful search time. It is one of the most well-studied parameters since it provides a great deal of information for many applications. For example, the depth of a node storing a key represents the search time for that key in searching and sorting algorithms [127]. Depth also gives the length of a conflict resolution session for tree-based communication protocols. For compression algorithms, depth is the length of a substring that may be occupied or compressed [7]. For the study of the depth of tries, see [31, 36, 37, 57, 92, 97, 114, 136, 169, 192, 196, 198]. For the study of this parameter in PATRICIA tries, see [38, 39, 181]. Finally, for DSTs, see [30, 36, 37, 100, 102, 113, 127, 135, 139, 169, 176, 197, 200].

Height

The height of a tree is the length of the longest path from the root to a leaf. It can also be understood as the maximum value of the depth. The height of a digital tree reflects the longest common prefix of the words stored in it and is directly related to many important operations such as hashing in computer programming. See [23, 32, 36, 37, 38, 60, 71, 92, 168, 169, 201] for studies of the height of tries. See [36, 38, 39, 68, 111, 127, 168, 170, 171, 199, 201, 204] for PATRICIA tries and [35, 37, 47, 122, 139, 169] for digital search trees.
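
To make the definitions of size, depth and height concrete, the following is a minimal sketch (not taken from the thesis) that builds a binary trie over a handful of distinct bit strings and reports these three quantities. The key set, the tuple representation and the helper names are assumptions made purely for illustration.

```python
# Illustrative sketch: a binary trie and three of its shape parameters.

def build_trie(keys, depth=0):
    """Binary trie over distinct bit strings: ('leaf', key) or ('internal', left, right)."""
    if len(keys) == 0:
        return None
    if len(keys) == 1:
        return ('leaf', keys[0])
    left = [k for k in keys if k[depth] == '0']
    right = [k for k in keys if k[depth] == '1']
    return ('internal', build_trie(left, depth + 1), build_trie(right, depth + 1))

def size(t):
    """Number of internal nodes (the 'size' parameter)."""
    if t is None or t[0] == 'leaf':
        return 0
    return 1 + size(t[1]) + size(t[2])

def leaf_depths(t, d=0):
    """Depth of every key-holding (external) node."""
    if t is None:
        return []
    if t[0] == 'leaf':
        return [d]
    return leaf_depths(t[1], d + 1) + leaf_depths(t[2], d + 1)

keys = ['0000', '0010', '0100', '1000', '1100', '1110']
trie = build_trie(keys)
depths = leaf_depths(trie)
print('size   =', size(trie))      # -> 5 internal nodes
print('depths =', depths)          # -> [3, 3, 2, 2, 3, 3]
print('height =', max(depths))     # -> 3, the maximal depth
```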

Shortest path length

This parameter is the length of the shortest path from the root to the leaves. It was studied for tries by B. Pittel in [168, 169], who considered this parameter in order to derive the depth and height of random tries.

Saturation level

This parameter is sometimes also called the fill-up level. A level of a tree is the set of nodes at the same distance from the root, and the saturated levels are the levels that are full. Researchers have investigated the number of saturated levels, the maximum level which is full, and so on. This parameter was first studied in order to understand the behavior of the height [35]. However, it has also found applications in improving the efficiency of algorithms for the IP address lookup problem [123]. For more details, see [35, 123, 169].

Stack size

In the definition of the height, every edge contributes one when counting the distance. The stack size is a kind of "biased" height in which edges whose label is the last symbol of the alphabet (for the binary case, the edges labeled by 1) make no contribution to the distance. For general m-ary tries and PATRICIA tries, if an order of the symbols in the alphabet is given, a preorder traversal of the corresponding trie or PATRICIA trie lists the stored words in lexicographical order. When the traversal is implemented recursively, the height measures the recursion depth needed. However, in many cases the recursion is removed, or a technique called end-recursion removal is used to save recursion calls. In those cases, the amount of memory space is no longer measured by the height, and hence the parameter stack size is introduced. For the stack size of tries, see [16, 157, 158].

For PATRICIA tries, L. Devroye gave a method to compute this quantity in [39], albeit without an explicit result.

Horton-Strahler number

The Horton-Strahler number is defined similarly to the stack size and for similar purposes. It specifies the recursion depth needed for a traversal when end-recursion removal is applied and the subtrees are visited in an order chosen to minimize the recursion depth (this order is fixed for the stack size). Related studies for tries can be found in [16, 41, 156, 157, 158]. In [39], the author gave a bound for the Horton-Strahler number of PATRICIA tries.

One-sided path length

The one-sided path length is the length of the path whose edges are all labeled by the same symbol. This parameter is directly related to a widely used algorithm for leader (or loser) selection. For the algorithm and its applications, see [59]. Because of its recursive nature, the algorithm generates a tree with a structure similar to a trie during the selection process. As a result, the one-sided path length of tries directly reflects the efficiency of the algorithm. Related studies can be found in [59, 106, 174, 212, 214].

Occurrence of certain patterns

One may be interested in the nodes of a digital tree satisfying certain properties, or in the number of occurrences of certain specified subgraphs in a digital tree. For example, the so-called 2-protected nodes, i.e., the nodes which are neither leaves nor parents of a leaf, are one type of node that has been studied. For more research of this kind, see [60, 89, 90, 98, 114, 116, 121, 127, 162, 169, 192, 198, 200].

External path length

This parameter is the sum of the distances between the leaves and the root. For trie data structures or hashing schemes, this parameter represents the processing time in the central processing unit. However, this parameter has drawn a lot of interest not so much because of its practical use as because of an interesting phenomenon: it is easy to see that the mean of the external path length can be derived directly from the mean of the depth, so one would expect that the variance of the external path length can likewise be derived directly from the variance of the depth. However, this is not the case [118]. This parameter has been studied for tries [94, 99, 119, 167] and PATRICIA tries [14, 39, 118, 199].

Internal path length

The internal path length is defined as the sum of the distances between the internal nodes and the root. In contrast to tries and PATRICIA tries, which store data in the leaves, digital search trees store data in the internal nodes. Thus, the internal path length can be seen as the counterpart in digital search trees of the external path length. L. Devroye studied this parameter for PATRICIA tries in [38]. Later, Fuchs, Hwang and Zacharovas gave a general framework for parameters with properties similar to those of the internal path length in [75]. Research on the internal path length of digital search trees can be found in [74, 89, 90, 98, 117, 121, 160, 173, 192, 200].

Distances

This parameter refers to the distance between two randomly selected nodes in a digital tree. It can be seen as a measure of how "diverse" a tree is. R. Neininger studied this parameter to gain a better understanding of the Wiener index (which will be discussed later) of recursive trees [159]. After Neininger's work, this parameter was studied for many different classes of trees. For the study of distances in tries, see [1, 3, 22]. The study of distances in digital search trees can be found in [1, 2].

Number of unary nodes in tries

Unary nodes are nodes with exactly one child. PATRICIA tries are tries with the unary nodes "collapsed". Therefore, studying the number of unary nodes gives an estimate of how much space is "saved" in PATRICIA tries compared to tries. Besides comparing the two variants of random digital trees, S. Wagner studied this parameter in [210] in order to get a better understanding of the efficiency of contention tree algorithms. Related studies can be found in [18, 205, 210].

Node profile

The node profile of a random digital tree is the number of nodes at a given distance from the root. This parameter has drawn a lot of attention because many fundamental parameters, including the size, depth, height, shortest path length, internal path length and saturation level, can be uniformly analyzed and expressed in terms of node profiles. Although node profiles are of great importance, there was not much research on them until very recently. See [167] for the node profile of tries, [38, 39] for the node profile of PATRICIA tries and [48, 50, 51, 129, 135] for the node profile of digital search trees.

Partial match queries

Multidimensional data retrieval is an important issue in the design of database systems. Partial match retrieval is a widely used retrieval method which has found many applications, especially for geographical data and graphics algorithms. For a more detailed introduction to the partial match retrieval operation, see [65] and the references therein. Because of the importance of this method, the performance of partial match retrieval on digital trees has received a lot of interest. For the analysis of partial match retrieval on tries, PATRICIA tries and bucket digital search trees, see [39, 42, 65, 73, 120, 190, 191].

Peripheral path length

This parameter was proposed by W. Szpankowski and M. D. Ward in [133, 211, 213] under the name w-parameter. It was originally applied in the study of the Lempel-Ziv'77 data compression algorithm on uncompressed suffix trees. In [74], the authors renamed this parameter the peripheral path length. The fringe size of a leaf is defined to be the size of the subtree rooted at its parent node. The peripheral path length is then defined to be the sum of the fringe sizes over all leaves of the tree. The peripheral path length has been studied for tries, PATRICIA tries and digital search trees. For related research, see [49, 74, 133, 211, 213].

Weighted path length

Let $l_j$ be the distance of the $j$-th node from the root and $w_j$ the weight attached to the $j$-th node. The weighted path length of a tree with $n$ nodes is then defined to be
$$W_n = \sum_{j=1}^{n} w_j l_j.$$
In many real-life applications, one needs to assign a weight to each edge or node of a graph; the weighted path length arises in such applications. For more details, see [74].

Differential path length

Also called the Colless index, this parameter inspects the internal nodes of a tree. For each internal node, we partition the leaves descending from it into two groups, of sizes L and R, according to whether they lie in its left or right subtree. The differential path length is the sum of the absolute values |L − R| over all internal nodes. This parameter has been investigated in the systems biology literature [11]. M. Fuchs et al. studied it for symmetric random digital search trees [74].
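
As an illustration, the following sketch (not taken from the cited literature) computes the differential path length under the interpretation above, i.e. as the sum of |L − R| over all internal nodes. The nested-tuple tree representation and the helper name are assumptions made for illustration.

```python
# Illustrative sketch: differential path length (Colless index) of a binary tree.

def leaves_and_colless(t):
    """Return (#leaves below t, sum of |L - R| over the internal nodes of t)."""
    if t == 'leaf':
        return 1, 0
    left, right = t
    nl, cl = leaves_and_colless(left)
    nr, cr = leaves_and_colless(right)
    return nl + nr, cl + cr + abs(nl - nr)

# A caterpillar-shaped tree with four leaves: |L - R| is 2, 1, 0 from the root down.
tree = ((('leaf', 'leaf'), 'leaf'), 'leaf')
print(leaves_and_colless(tree)[1])   # -> 3
```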

Type of nodes in bucket digital search trees

Because of the way they are constructed, bucket digital search trees with bucket size b ≥ 2 may have nodes containing different numbers of keys. The authors of [90] proposed a multivariate framework to study the number of nodes of each type in bucket digital search trees.

Key-wise path length

Since a node of a bucket digital search tree with b ≥ 2 may contain more than one key, the distance between two keys stored in a b-DST can be 0 in some cases. Therefore, researchers have defined two different types of path length to study b-DSTs: the key-wise path length and the node-wise path length. The key-wise path length is defined as the sum of the distances between the keys and the root. Related research can be found in [74, 89].

Node-wise path length

The node-wise path length of a b-DST is defined to be the sum of the distances between the nodes and the root. See [74, 90] for more details.

Chapter 3

Related Mathematical Techniques

3.1 Rice Method

The Rice method, sometimes also called the Nörlund-Rice integral or the Rice integral, is named in honor of Niels Erik Nörlund and Stephen O. Rice. It is a fruitful method for finding the asymptotic expansion of sums of the form
$$\sum_{k=n_0}^{n} \binom{n}{k} (-1)^k f(k). \tag{3.1}$$
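
To see how strong the cancellation in such sums can be (this aside is not from the thesis), take $f(k) = 1/(k+1)$ and $n_0 = 0$; then
$$\sum_{k=0}^{n} \binom{n}{k} \frac{(-1)^k}{k+1} = \frac{1}{n+1},$$
even though the individual terms grow exponentially in $n$. Estimating such sums term by term is therefore hopeless, which is why a contour-integral representation is so useful.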

The formula of the Rice integral is given by
$$\sum_{k=n_0}^{n} \binom{n}{k} (-1)^k f(k) = \frac{(-1)^n}{2\pi i} \oint_{C} f(z)\, \frac{n!}{z(z-1)\cdots(z-n)}\, dz,$$
where $C$ is a positively oriented closed curve encircling the points $n_0, n_0+1, \ldots, n$ and $f(z)$ is understood to be analytic within $C$. The integral can also be written as

$$\sum_{k=n_0}^{n} \binom{n}{k} (-1)^k f(k) = \frac{-1}{2\pi i} \oint_{C} B(n+1, -z)\, f(z)\, dz,$$

where C is the contour mentioned before and B(a, b) is the beta function.

Now we give a formal statement of these two formulas.

Lemma 3.1.1. Let $f(z)$ be analytic in a domain that contains the half-line $[n_0, +\infty)$. Then, we have the representation
$$\sum_{k=n_0}^{n} \binom{n}{k} (-1)^k f(k) = \frac{(-1)^n}{2\pi i} \oint_{C} f(z)\, \frac{n!}{z(z-1)\cdots(z-n)}\, dz = \frac{-1}{2\pi i} \oint_{C} B(n+1, -z)\, f(z)\, dz,$$
where $C$ is a positively oriented closed curve that lies in the domain of analyticity of $f(z)$, encircles $[n_0, n]$, and does not include any of the integers $0, 1, \ldots, n_0 - 1$.

Proof. The proof of these two equalities is omitted here since it is a simple application of the residue theorem. For the complete proof, see [69].
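
As a quick numerical sanity check of the first representation in Lemma 3.1.1 (a sketch, not part of the thesis), the following evaluates both sides for a simple test function. The choice $f(z) = 1/(z+2)$, the contour (a circle enclosing exactly the points $0, 1, \ldots, n$), and the number of quadrature points are arbitrary assumptions made only for illustration.

```python
# Numerical check of the Rice representation for a simple test function.

import cmath
from math import comb, factorial

n0, n = 0, 8
f = lambda z: 1.0 / (z + 2.0)          # analytic on and inside the contour chosen below

# Left-hand side: the alternating binomial sum.
lhs = sum(comb(n, k) * (-1) ** k * f(k) for k in range(n0, n + 1))

# Right-hand side: (-1)^n/(2*pi*i) times the contour integral of f(z) n!/(z(z-1)...(z-n)).
# C is a circle centred at n/2 that encircles exactly the points 0, 1, ..., n; the
# trapezoidal rule on a periodic analytic integrand converges extremely fast.
center, radius, m = n / 2.0, n / 2.0 + 0.4, 20000
total = 0.0 + 0.0j
for j in range(m):
    theta = 2.0 * cmath.pi * j / m
    z = center + radius * cmath.exp(1j * theta)
    dz = 1j * radius * cmath.exp(1j * theta) * (2.0 * cmath.pi / m)
    kernel = factorial(n)
    for i in range(n + 1):
        kernel /= (z - i)
    total += f(z) * kernel * dz
rhs = ((-1) ** n / (2j * cmath.pi)) * total

print(lhs, rhs.real)                   # the two values agree to many decimal places
```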

Suppose we have a finite-difference sum of the form (3.1). The Rice method then allows us to compute an asymptotic expansion by the following steps:

Step 1. Extend $f(k)$, which is originally defined only on integers, to an appropriate meromorphic function $f(z)$.

Step 2. Choose a suitable contour $C$ which encircles the points $n_0, \ldots, n$ and consider the Rice integral
$$\Delta = \frac{(-1)^n}{2\pi i} \oint_{C} f(z)\, \frac{n!}{z(z-1)\cdots(z-n)}\, dz.$$

Step 3. The residue theorem yields that
$$\Delta = \sum_{k=n_0}^{n} \binom{n}{k} (-1)^k f(k) + \{\text{contribution from other poles inside } C\}.$$

Step 4. Estimate ∆.

Remark 1. The most difficult step of the Rice method is usually Step 1: finding the meromorphic extension of $f(k)$ is often quite tricky. Also, to carry out Step 4 one needs growth properties of $f(z)$, e.g., that $f(z)$ is of polynomial growth.

The Rice method has been used in much research on shape parameters of random digital trees. For example, P. Kirschenhofer, H. Prodinger and W. Szpankowski used it to derive the mean and variance of the internal path length of symmetric DSTs [114, 117, 121]. Here, we use the mean of the internal path length of DSTs as an example to illustrate how the Rice method works.

We let $S_n$ be the expectation of the internal path length of a symmetric DST built on $n$ strings. Then, under the Bernoulli model, from the definition of the internal path length, we have the recurrence
$$S_{n+1} = n + 2^{1-n} \sum_{k=0}^{n} \binom{n}{k} S_k,$$
where the initial condition is given by $S_0 = 0$. After some manipulations of generating functions, we get that
$$S_n = \sum_{k=2}^{n} \binom{n}{k} (-1)^k Q_{k-2}, \qquad \text{where } Q_m = \prod_{i=1}^{m} \left(1 - \frac{1}{2^i}\right).$$

Example 3.1.2. Consider the sum
$$S_n = \sum_{k=2}^{n} \binom{n}{k} (-1)^k Q_{k-2}.$$

Step 1. We introduce the function
$$Q(s) = \prod_{j \ge 1} \left(1 - \frac{s}{2^j}\right),$$
for which $Q_{k-2} = Q(1)/Q(2^{2-k})$ for integers $k \ge 2$. Moreover, $Q(1)/Q(2^{2-z})$ is analytic on $[2, \infty)$. Thus, we get the needed extension.

Step 2. We choose the contour $C$ to be a positively oriented curve consisting of a segment of the vertical line $\Re(z) = \tfrac{1}{2}$ on the left, closed by a right semicircle large enough to encircle the points $2, 3, \ldots, n$, and consider the integral
$$\Delta = \frac{(-1)^n}{2\pi i} \oint_{C} \frac{Q(1)}{Q(2^{2-z})}\, \frac{n!}{z(z-1)\cdots(z-n)}\, dz.$$

Step 3. To compute the residues, we need to find the poles of $Q(1)/Q(2^{2-z})$. Inside $C$, the poles occur at $z = 1 + 2k\pi i/\log 2$, and hence the integrand has a double pole at $z = 1$ and simple poles at $z = 1 + 2k\pi i/\log 2$ with $k \in \mathbb{Z} \setminus \{0\}$. To compute the contribution from the double pole, we expand both $Q(1)/Q(2^{2-z})$ and the kernel $n!/(z(z-1)\cdots(z-n))$ around $z = 1$. Combining the expansions, the double pole at $z = 1$ contributes the main terms
$$n \log_2 n + n\left(\frac{\gamma - 1}{\log 2} - \alpha + \frac{1}{2}\right), \qquad \alpha = \sum_{j \ge 1} \frac{1}{2^j - 1},$$
to $S_n$, while the simple poles at $z = 1 + 2k\pi i/\log 2$, $k \ne 0$, contribute the periodic fluctuation term $n\,\sigma(n)$.

Step 4. On the right semicircle, the contribution to $\Delta$ converges to $0$ as $n$ grows, since
$$\frac{1}{Q(2^{2-z})} = O(1) \qquad \text{as } |z| \to \infty.$$
On the left line, we have the bound $O(n^{1/2})$.

Combining all the above steps, we have
$$S_n = n \log_2 n + n\left(\frac{\gamma - 1}{\log 2} - \alpha + \frac{1}{2} + \sigma(n)\right) + O(n^{1/2}). \tag{3.3}$$
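
The following sketch (not part of the thesis) numerically cross-checks the recurrence for $S_n$ against the closed-form alternating sum stated above, and prints the leading term $n\log_2 n$ of (3.3) alongside for reference. The cutoff N and the printed values of n are arbitrary choices.

```python
# Numerical cross-check of the recurrence and the closed-form sum for S_n.

from math import comb, log2

N = 30

# Recurrence: S_{n+1} = n + 2^{1-n} * sum_k binom(n,k) * S_k, with S_0 = 0.
S = [0.0] * (N + 1)
for n in range(N):
    S[n + 1] = n + 2.0 ** (1 - n) * sum(comb(n, k) * S[k] for k in range(n + 1))

# Closed form: S_n = sum_{k=2}^n binom(n,k) (-1)^k Q_{k-2}, Q_m = prod_{i=1}^m (1 - 2^{-i}).
def Q(m):
    q = 1.0
    for i in range(1, m + 1):
        q *= 1.0 - 2.0 ** (-i)
    return q

for n in (5, 10, 20, 30):
    closed = sum(comb(n, k) * (-1) ** k * Q(k - 2) for k in range(2, n + 1))
    # The first two columns agree (up to floating-point cancellation in the alternating
    # sum); n*log2(n) is printed only as the leading term of the asymptotic expansion.
    print(n, S[n], closed, n * log2(n))
```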

The internal path length of DSTs is not the only shape parameter which has been analyzed by the Rice method. For example, the external path length of tries and PATRICIA tries under the Bernoulli model can also be analyzed by the Rice method [118, 119].

However, there are several other mathematical tools that can be used to analyze these shape parameters, and they are sometimes more useful than the Rice method. In the following sections, we introduce these tools. We begin with a standard tool which transforms a problem under the Bernoulli model into a problem under the so-called Poisson model, where many useful techniques are available for attacking the problem.