Data structures for storing and retrieving multidimensional data are of vital importance in several areas of computer science, such as the design of database systems and graphics algorithms. One possible class of such data structures was introduced in [10] and is based on digital data, i.e., data composed of infinite 0-1 strings. We assume throughout this work that the bits in these strings are generated independently, with 0 and 1 equally likely (symmetric Bernoulli model).
We first describe this data structure in more detail. Assume that we are given a set of multidimensional data. We apply a "regular shuffling" procedure to transform the multidimensional data into one-dimensional data. Finally, this data is stored in a digital tree. To be more precise, let $R_1, \ldots, R_n$ denote k-dimensional records, i.e.,
\[
R_{i,1} = \bigl(R^{[1]}_{i,1}, R^{[2]}_{i,1}, R^{[3]}_{i,1}, \ldots\bigr), \quad \ldots, \quad R_{i,k} = \bigl(R^{[1]}_{i,k}, R^{[2]}_{i,k}, R^{[3]}_{i,k}, \ldots\bigr).
\]
After shuffling, we obtain the one-dimensional string
\[
\tilde{R}_i = \bigl(R^{[1]}_{i,1}, \ldots, R^{[1]}_{i,k}, R^{[2]}_{i,1}, \ldots, R^{[2]}_{i,k}, R^{[3]}_{i,1}, \ldots, R^{[3]}_{i,k}, \ldots\bigr).
\]
Then, $\tilde{R}_1, \ldots, \tilde{R}_n$ are used to construct a digital tree.
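The shuffling step can be illustrated with a minimal Python sketch. It assumes finite bit prefixes stored as strings of '0' and '1' (the records in the text are infinite strings), and the function name `shuffle` is ours, not from the paper.

```python
from typing import Sequence


def shuffle(coords: Sequence[str]) -> str:
    """Interleave k equally long bit strings.

    The output takes the first bit of every coordinate, then the second
    bit of every coordinate, and so on (assumption: finite prefixes).
    """
    length = len(coords[0])
    if any(len(c) != length for c in coords):
        raise ValueError("all coordinates must have the same number of bits")
    return "".join(c[j] for j in range(length) for c in coords)


# Record D1 of Figure 1, truncated to four bits per coordinate.
print(shuffle(["0010", "1000"]))  # prints 01001000
```

With k = 2, the odd positions of the output come from the first coordinate and the even positions from the second.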
1365–8050 © 2005 Discrete Mathematics and Theoretical Computer Science (DMTCS), Nancy, France
2 Michael Fuchs
As digital trees, we use the three standard types (see [8]). First, for the k-dimensional trie the underlying digital tree is the trie data structure. For the reader's convenience, we recall how a trie is constructed: if we only have one record, then we place it into the root, which is considered an external node; if we have more than one record, then the root becomes an (empty) internal node and the records are directed to the left or to the right subtree according to whether their first bit is 0 or 1; finally, the subtrees are built recursively by the same procedure, where the first bits are removed; see Figure 1 for an example.
[Figure 1: three tree diagrams with nodes D1, D2, D3, D4 and edge labels 0/1; drawings not reproduced]
Fig. 1: A 2-dimensional trie, PATRICIA trie and digital search tree built from the data

data    D1          D2          D3          D4
Ri,1    0010 · · ·  1001 · · ·  0001 · · ·  0111 · · ·
Ri,2    1000 · · ·  1010 · · ·  1101 · · ·  1011 · · ·
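The trie construction recalled above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: it assumes finite shuffled prefixes (here the first eight shuffled bits of D1 to D4 from Figure 1), represents internal nodes as dictionaries, and uses the hypothetical names `build_trie` and `internal_nodes`.

```python
from typing import Any, Dict


def build_trie(records: Dict[str, str], depth: int = 0) -> Any:
    """Return None (empty subtree), a record name (external node),
    or a dict {"0": subtree, "1": subtree} (internal node)."""
    if not records:
        return None
    if len(records) == 1:
        return next(iter(records))
    if any(depth >= len(bits) for bits in records.values()):
        raise ValueError("prefixes too short to separate the records")
    # Split on the current bit; deeper levels look at the following bits.
    left = {name: bits for name, bits in records.items() if bits[depth] == "0"}
    right = {name: bits for name, bits in records.items() if bits[depth] == "1"}
    return {"0": build_trie(left, depth + 1), "1": build_trie(right, depth + 1)}


def internal_nodes(trie: Any) -> int:
    """Count the internal (dict) nodes of a trie built by build_trie."""
    if not isinstance(trie, dict):
        return 0
    return 1 + internal_nodes(trie["0"]) + internal_nodes(trie["1"])


# Shuffled eight-bit prefixes of the records in Figure 1.
SHUFFLED = {"D1": "01001000", "D2": "11000110", "D3": "01010011", "D4": "01101111"}
trie = build_trie(SHUFFLED)
print(internal_nodes(trie))  # prints 4
```

The node reached by the prefix 0 has an empty left subtree, which is exactly the kind of one-way branching that a PATRICIA trie suppresses.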
A variant of the k-dimensional trie is the k-dimensional PATRICIA trie, where the shuffled data is stored in a PATRICIA trie. Recall that a PATRICIA trie is constructed like a trie, except that all one-way branching is suppressed; again see Figure 1 for an example. This yields a more balanced tree, improving the overall performance of tries.
A final type is given by the k-dimensional digital search tree which is based on the digital search tree data structure. Recall that a digital search tree is constructed as follows: the first record is placed in the root; all other records are directed to the left or right subtree according to whether their first bit is 0 or 1;
finally, the subtrees are constructed by the same principle again with the first bits removed; see Figure 1 for an example. So, in contrast to tries and PATRICIA tries, no distinction between internal and external nodes is necessary for digital search trees.
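The digital search tree insertion just described can be sketched as follows. The sketch assumes bucket size one and finite shuffled prefixes; the class `Node` and the function `dst_insert` are illustrative names, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def dst_insert(root: Optional[Node], name: str, bits: str) -> Node:
    """Insert a record into a digital search tree and return the root.

    The record descends by its successive bits (0 = left, 1 = right)
    until it reaches an empty position, where it is stored.
    """
    if root is None:
        return Node(name)
    node, depth = root, 0
    while True:
        side = "left" if bits[depth] == "0" else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(name))
            return root
        node, depth = child, depth + 1


# Shuffled eight-bit prefixes of D1..D4 from Figure 1, inserted in this order.
root = None
for name, bits in [("D1", "01001000"), ("D2", "11000110"),
                   ("D3", "01010011"), ("D4", "01101111")]:
    root = dst_insert(root, name, bits)
print(root.name, root.left.name, root.right.name, root.left.right.name)
```

Unlike the trie, the shape of the resulting tree depends on the insertion order of the records.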
Note that all trees introduced above share the feature that each node holds at most one record.
If we allow nodes to hold up to b ≥ 1 records, then the resulting trees are called k-dimensional bucket digital trees.
In this paper, we will study the cost of partial match retrievals in k-dimensional bucket digital trees.
Here, a partial match query asks for the retrieval of all records matching certain criteria. Formally, a partial match query is a k-dimensional vector R = (R1, . . . , Rk) in which some coordinates are strings of 0-1 bits and the others are unspecified.
Partial match retrievals in digital trees 3
For instance, for k = 2, we might have
\[
R_1 = (?, ?, ?, ?, \ldots), \qquad R_2 = (0, 1, 1, 0, \ldots).
\]
Then, a shuffled record $\tilde{R}$ is produced as before. For the above example this yields
\[
\tilde{R} = (?, 0, ?, 1, ?, 1, ?, 0, \ldots).
\]
The partial match query now asks for the retrieval of all data in the tree, with $\tilde{R}$ being used as search query.
Here, 0 or 1 in $\tilde{R}$ means going to the left or right subtree of the current node, respectively, whereas ? means that we have to proceed with our search in both subtrees. The cost of such a partial match query will be measured by the number of nodes visited (where we only consider internal nodes for tries and PATRICIA tries); so for the query $\tilde{R}$ above, we have a cost of 2 for the trie and PATRICIA trie of Figure 1 and a cost of 3 for the digital search tree.
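The cost measure can be checked on the example with a short recursive sketch that works directly on the list of shuffled records instead of an explicit tree. It assumes finite prefixes that are long enough for the recursion, treats a subtree with at most b records as an external node of a bucket trie, and lets a digital search tree node keep the first b records that reach it; the function names are ours.

```python
from typing import List


def trie_cost(records: List[str], query: str, depth: int = 0, b: int = 1) -> int:
    """Number of internal trie nodes visited by a partial match query."""
    if len(records) <= b:  # empty subtree or external node: not counted
        return 0
    cost = 1  # the current internal node
    for bit in "01":
        if query[depth] in ("?", bit):  # '?' follows both subtrees
            sub = [r for r in records if r[depth] == bit]
            cost += trie_cost(sub, query, depth + 1, b)
    return cost


def dst_cost(records: List[str], query: str, depth: int = 0, b: int = 1) -> int:
    """Number of digital search tree nodes visited; records are in insertion order."""
    if not records:
        return 0
    rest = records[b:]  # the first b records stay in the current node
    cost = 1
    for bit in "01":
        if query[depth] in ("?", bit):
            sub = [r for r in rest if r[depth] == bit]
            cost += dst_cost(sub, query, depth + 1, b)
    return cost


# Shuffled prefixes of D1..D4 from Figure 1 and the shuffled query from the text.
DATA = ["01001000", "11000110", "01010011", "01101111"]
QUERY = "?0?1?1?0"
print(trie_cost(DATA, QUERY), dst_cost(DATA, QUERY))  # prints 2 3
```

The two printed values agree with the costs stated in the text for the trie and the digital search tree of Figure 1. The PATRICIA trie is not covered by this sketch.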
Before going on, we fix some notation which we are going to use throughout the work. First, it should be clear that under our random model, the cost of a partial match query only depends on the partial match pattern $q$, which is a k-tuple of symbols from $\{S, ?\}$, where the i-th coordinate is $S$ if $R_i$ is specified and $?$ otherwise. We will fix such a $q$ throughout this work and denote the number of $?$ entries in $q$ by $u$, where we assume that $0 < u < k$. Furthermore, we will consider the cyclic shift of the entries of $q$ by one position to the left, which will be denoted by $q'$; more generally, $q^{(l)}$ will denote the cyclic shift of the entries of $q$ by $l$ positions to the left. Also, we associate to a partial match pattern $q$ an infinite sequence $(\delta_1, \delta_2, \delta_3, \ldots)$ with $\delta_i = 1$ if $q_{i \bmod k} = S$ and $\delta_i = 2$ otherwise. Finally, we denote the random variable describing the cost of the partial match query by $X_{q,n}$ for all three types of bucket digital trees of size $n$ and bucket size $b \ge 1$ (for the sake of simplicity, we suppress the index $b$).
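The pattern notation can be made concrete with a small Python sketch. It assumes that the index i mod k is read periodically with the k-th entry used when the remainder is 0, and it stores q as a tuple of the characters 'S' and '?'; the helper names are hypothetical.

```python
from typing import List, Tuple

Pattern = Tuple[str, ...]


def cyclic_shift(q: Pattern, l: int = 1) -> Pattern:
    """Shift the entries of q by l positions to the left (q' is l = 1)."""
    l %= len(q)
    return q[l:] + q[:l]


def delta(q: Pattern, i: int) -> int:
    """delta_i for i >= 1: 1 if the matching entry of q is 'S', else 2.

    Assumption: entries are numbered 1..k and repeated with period k.
    """
    return 1 if q[(i - 1) % len(q)] == "S" else 2


def unspecified(q: Pattern) -> int:
    """The number u of '?' entries in q."""
    return sum(1 for entry in q if entry == "?")


# Pattern of the example query: first coordinate unspecified, second specified.
q = ("?", "S")
first_deltas: List[int] = [delta(q, i) for i in range(1, 5)]
print(unspecified(q), cyclic_shift(q), first_deltas)
```

Under this reading, the value of delta at position i equals the number of subtrees a query follows at level i of the tree: two for an unspecified bit and one for a specified bit.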
In this paper, we will be concerned with stochastic properties of $X_{q,n}$. Therefore, let us recall what is known about this random variable. First, the mean value of $X_{q,n}$ for k-dimensional tries was investigated in [2], where the authors proved that
\[
E(X_{q,n}) \sim n^{u/k} P_1(\log_2 n^{1/k})
\]
with $P_1(z)$ a one-periodic function whose Fourier expansion was given in [2] as well (for a comparison of this result with other data structures for multidimensional search see [8] and [10]). Similar results were subsequently proved in [6] for k-dimensional bucket digital tries, k-dimensional PATRICIA tries, and k-dimensional digital search trees, too. As for the variance, it was conjectured in [7] that for k-dimensional tries
\[
\mathrm{Var}(X_{q,n}) \sim n^{u/k} P_2(\log_2 n^{1/k}) \tag{1}
\]
with $P_2(z)$ a one-periodic function. The authors proved this conjecture for k = 2 in [7]. The general case was then settled in [11]. Note that this result implies that $X_{q,n}/E(X_{q,n})$ converges to 1 in probability.
Hence, the distribution of $X_{q,n}$ is concentrated around its mean.
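The step from the variance asymptotics to convergence in probability can be spelled out with Chebyshev's inequality. The following sketch assumes, in addition to the statements above, that $P_1$ is bounded away from zero and that $P_2$ is bounded; these assumptions are ours and are not stated in this section. For every fixed $\varepsilon > 0$,
\[
P\left(\left|\frac{X_{q,n}}{E(X_{q,n})} - 1\right| > \varepsilon\right)
\le \frac{\mathrm{Var}(X_{q,n})}{\varepsilon^2\, E(X_{q,n})^2}
\sim \frac{P_2(\log_2 n^{1/k})}{\varepsilon^2\, P_1(\log_2 n^{1/k})^2}\, n^{-u/k}
\longrightarrow 0 \qquad (n \to \infty),
\]
since $u > 0$. The variance is thus of smaller order than the square of the mean, which is what drives the concentration.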
As for the method of proof of (1), the authors of [7] applied the analytic approach from [2] to derive asymptotic expansions of mean and second moment. Then, they used these expansions to compute the variance, where they had to cope with highly non-trivial cancellations. Here, their proof crucially rested on an identity of Ramanujan which only works in the case k = 2 and does not seem to have an analogue for k > 2. In [11], a new and mainly elementary approach was devised to settle the general case.
In this paper, we will re-derive the above result with a simpler approach. Our approach will be analytic and use some standard tools from the analysis of algorithms (the same approach was already used in other contexts; see [4],[5],[12]). The crucial difference to the approach from [7] is that we incorporate the cancellations at a much earlier stage, making the resulting analysis easier. Moreover, our approach will work for k-dimensional bucket tries as well.
Before explaining our approach in more detail, we are going to state our result. Therefore, we need some notation. Set
\[
\tilde{h}_q(z) = 2\delta_1 e_b(z) e^{-z} \tilde{L}_{q'}(z/2) + e_b(z) e^{-z} - e_b(z)^2 e^{-2z},
\]
where $e_b(z) = 1 + z + z^2/2! + \cdots + z^b/b!$ and $\tilde{L}_q(z) = e^{-z} \sum_{n \ge 0} E(X_{q,n}) z^n/n!$.
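The truncated exponential series $e_b(z)$ is easy to evaluate numerically. The following Python sketch computes it and the product $e_b(z)e^{-z}$, which for real $z \ge 0$ equals the probability that a Poisson variable with mean $z$ is at most $b$. The probabilistic reading and the function names are our gloss, not part of the paper's statement.

```python
import math


def e_b(z: float, b: int) -> float:
    """Truncated exponential series 1 + z + z^2/2! + ... + z^b/b!."""
    return sum(z ** j / math.factorial(j) for j in range(b + 1))


def poisson_at_most(z: float, b: int) -> float:
    """e_b(z) * exp(-z): probability that Poisson(z) is at most b (z >= 0)."""
    return e_b(z, b) * math.exp(-z)


def indicator_variance(z: float, b: int) -> float:
    """The last two terms of h_q: e_b(z)e^{-z} - e_b(z)^2 e^{-2z} = p(1 - p)."""
    p = poisson_at_most(z, b)
    return p - p * p


print(e_b(1.0, 1), poisson_at_most(2.0, 1), indicator_variance(2.0, 1))
```

Written as p(1 - p), the last two terms of the displayed expression have the form of the variance of an indicator with success probability p; this is only an observation about the formula.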
Theorem 1 The cost of a partial match query with u non-specified coordinates in a k-dimensional trie of size n satisfies
With slightly more work, the Fourier coefficients can be further simplified.
Corollary 1 The Fourier coefficient in the above theorem can be expressed as
cr= Γ(−ωr)
where the last approximation was computed with Maple. This value coincides with the one given in [7].
Note that the expression given in the latter paper is slightly different; we leave it as an exercise to the reader to show that they are the same.
Next, our approach can also be straightforwardly applied to k-dimensional bucket PATRICIA tries.
Here, we have the same result as above, only with $\tilde{h}_q(z)$ replaced by
\[
\begin{aligned}
\tilde{h}_q(z) ={}& 2\delta_1 \bigl(e_b(z) e^{-z} - \delta_1 e_b(z/2) e^{-z} + (\delta_1 - 1) e^{-z/2}\bigr) \tilde{L}_{q'}(z/2) \\
&+ e_b(z) e^{-z} - \delta_1 e_b(z/2) e^{-z} + \delta_1 e^{-z/2} - \bigl(e_b(z) e^{-z} - \delta_1 e_b(z/2) e^{-z} + \delta_1 e^{-z/2}\bigr)^2.
\end{aligned}
\]
Also, a similar explicit expression for the Fourier coefficients as in Corollary 1 can be given. Since the resulting formula is messier, we do not give details.
Finally, k-dimensional bucket digital search trees are slightly more involved. Here, we will use a variant of the above approach which was introduced in [3]. In order to state our result, we again need some notation. Therefore, set
Q(s) =Y
Theorem 2 The cost of a partial match query with u non-specified coordinates in a k-dimensional digital search tree of size n satisfies
\[
\mathrm{Var}(X_{q,n}) = n^{u/k} P_2(\log_2 n^{1/k}) + O(n^{2u/k - 1})
\]
Moreover, the Fourier coefficients can be further simplified here, too. We will state the result for b = 1.
Therefore, set
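\[
\varphi(\omega; x) =
\begin{cases}
\dfrac{\pi (x^{\omega} - 1)}{\sin(\pi\omega)(x - 1)}, & \text{if } x \ne 1; \\[2ex]
\dfrac{\pi\omega}{\sin(\pi\omega)}, & \text{if } x = 1.
\end{cases}
\]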
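The two cases of this definition fit together continuously, since $(x^{\omega} - 1)/(x - 1)$ tends to $\omega$ as $x$ tends to 1. The following Python sketch evaluates the function and checks this numerically. It assumes $x > 0$ and that $\omega$ is not an integer (where the sine vanishes); the function name `phi` is ours.

```python
import cmath
import math


def phi(omega: complex, x: float) -> complex:
    """Evaluate phi(omega; x) for x > 0 and non-integer omega."""
    s = cmath.sin(math.pi * omega)
    if x == 1:
        return math.pi * omega / s
    return math.pi * (x ** omega - 1) / (s * (x - 1))


# Real test values where the result is known in closed form.
print(phi(0.5, 1))  # pi/2, since sin(pi/2) = 1
print(phi(0.5, 4))  # pi * (2 - 1) / (1 * 3) = pi/3

# Numerical continuity check near x = 1 for a complex argument.
w = 0.3 + 2j
print(abs(phi(w, 1 + 1e-7) - phi(w, 1)))
```

Very close to x = 1 the first branch suffers from cancellation in floating point, so a production implementation might switch to a series expansion there; that refinement is outside the scope of this sketch.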
Then, we have the following corollary.
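Corollary 2 If the bucket size equals one, then the Fourier coefficient in the above theorem can be expressed as
\[
c_r = \frac{1}{k L Q(1) \Gamma(1 + \omega_r)} \sum_{l=0}^{k-1} \delta_1 \cdots \delta_l\, 2^{-\omega_r l} \sum_{j_1, j_2, j_3 \ge 0} \frac{(-1)^{j_1}\, \bar{\delta}_{q^{(l)}, j_2}\, \bar{\delta}_{q^{(l)}, j_3}\, 2^{-\binom{j_1}{2} + (1 - \omega_r) j_1}}{2^{j_2 + j_3}\, Q_{j_1} Q_{j_2} Q_{j_3}}\, \varphi\bigl(\omega_r; 2^{j_1 - j_2} + 2^{j_1 - j_3}\bigr),
\]
where
\[
\bar{\delta}_{q,j} = \sum_{l \ge 0} \frac{(-1)^l\, 2^{-\binom{l+1}{2}}}{Q_l} \prod_{h=1}^{l+j} \delta_h.
\]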
We conclude the introduction by giving a short sketch of the paper. In the next section, we will treat k-dimensional bucket tries. Then, in Section 3, we will briefly discuss k-dimensional bucket PATRICIA tries. Finally, in Section 5, we will prove the results for k-dimensional bucket digital search trees.