Chapter 2 Inverted File Size Reduction
2.3 Quantitative Analysis
Consider a posting list PL = <id1, id2, ..., idf> of f document identifiers, where idk < idk+1 and all document identifiers are within the range 1...N. As stated in Section 2.2, the first step in unique-order interpolative coding is to determine the group size g. Once g is determined, the PL is divided into $m = \lceil f/g \rceil$ blocks, with the first (m−1) blocks containing g document identifiers each and the last block containing f − (m−1)g document identifiers. The boundary pointers and the residual pointers are coded as d-gaps by an efficient prefix-free coding method such as Golomb coding or γ coding, and the inner document identifiers are coded by interpolative coding.
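The block layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `partition_posting_list` is hypothetical, and it assumes that the first identifier of every block serves as its boundary pointer, that the remaining g − 1 identifiers of each full block are inner pointers, and that the remaining identifiers of the last block are residual pointers. Section 2.2 defines the actual roles; this assumption is used only because it reproduces the pointer counts quoted in this section.

```python
def partition_posting_list(ids, g):
    """Split a sorted posting list into boundary, inner and residual pointers.

    Assumption (not confirmed by the text): the first identifier of each
    block is its boundary pointer; the rest of a full block are inner
    pointers; the rest of the last block are residual pointers.
    """
    if g < 1 or not ids:
        raise ValueError("need g >= 1 and a non-empty posting list")
    f = len(ids)
    m = (f + g - 1) // g                                  # m = ceil(f / g)
    blocks = [ids[k * g:(k + 1) * g] for k in range(m)]
    boundary = [block[0] for block in blocks]             # one per block
    inner = [block[1:] for block in blocks[:-1]]          # (m-1) groups of g-1
    residual = blocks[-1][1:]                             # leftovers of last block
    return m, boundary, inner, residual


ids = [3, 8, 9, 11, 12, 13, 17, 20, 25, 30]
g = 4
f = len(ids)
m, boundary, inner, residual = partition_posting_list(ids, g)
print(m, boundary, inner, residual)
# Boundary plus residual pointers number f - (m-1)(g-1), as used in the text.
print(len(boundary) + len(residual), f - (m - 1) * (g - 1))
```

With f = 10 and g = 4 the list falls into m = 3 blocks, and the number of boundary and residual pointers agrees with f − (m−1)(g−1).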
Let the function F(N, f) denote the number of bits needed to compress f document identifiers ranging from 1 to N. The following approximate bounds can then be derived (Golomb, 1966; Gallager & Van Voorhis, 1975; McIlroy, 1982; Elias, 1975; Moffat & Stuiver, 2000).
If Golomb coding is used to encode the boundary pointers and residual pointers, then the maximum number of bits required to store these f−(m−1)(g−1) boundary and residual pointers is
$$\bigl(f-(m-1)(g-1)\bigr)\left(2+\log_2\frac{N}{f-(m-1)(g-1)}\right)$$
If we use γ coding to encode these pointers, then the maximum number of bits required is
$$\bigl(f-(m-1)(g-1)\bigr)\left(1+2\log_2\frac{N}{f-(m-1)(g-1)}\right)$$
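To make the two prefix-free codes concrete, the sketch below computes the code length of a single d-gap under γ coding and under Golomb coding. It assumes the standard textbook definitions (γ: 2⌊log2 x⌋ + 1 bits; Golomb with parameter b: a unary quotient followed by a truncated-binary remainder). The function names are hypothetical, and the choice of the Golomb parameter b is left to the caller.

```python
def d_gaps(ids):
    """Convert a sorted list of identifiers into d-gaps (first id kept as is)."""
    return [ids[0]] + [cur - prev for prev, cur in zip(ids, ids[1:])]


def gamma_bits(x):
    """Length in bits of the Elias gamma code of an integer x >= 1."""
    if x < 1:
        raise ValueError("gamma coding is defined for x >= 1")
    return 2 * (x.bit_length() - 1) + 1       # 2 * floor(log2 x) + 1


def golomb_bits(x, b):
    """Length in bits of the Golomb code of x >= 1 with parameter b >= 1."""
    if x < 1 or b < 1:
        raise ValueError("need x >= 1 and b >= 1")
    q, r = divmod(x - 1, b)
    k = (b - 1).bit_length()                  # ceil(log2 b)
    short_codes = (1 << k) - b                # remainders that need only k-1 bits
    remainder_bits = k - 1 if r < short_codes else k
    return (q + 1) + remainder_bits           # unary quotient + remainder


gaps = d_gaps([3, 12, 25, 30])
print(gaps)
print([gamma_bits(x) for x in gaps])
print([golomb_bits(x, 3) for x in gaps])
```

Summing either list of lengths gives the cost of coding the boundary and residual pointers of one posting list under that code.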
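Based on Eq. (2.3), the number of bits required to code the inner pointers ((m−1) groups, with (g−1) document identifiers in each group) is at most
$$\sum_{k=1}^{m-1}(g-1)\left(2.58+\log_2\frac{N_k}{g-1}\right)$$
where $N_k$ denotes the range covered by the k-th group. Because the ranges sum to at most N, the sum of the logarithms of the (m−1) individual ranges is maximized when all ranges are equal, that is, when $N_k = N/(m-1)$.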
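Therefore, if Golomb coding is used to encode the boundary and residual pointers, the number of bits required by unique-order interpolative coding is at most
$$\bigl(f-(m-1)(g-1)\bigr)\left(2+\log_2\frac{N}{f-(m-1)(g-1)}\right)+(m-1)(g-1)\left(2.58+\log_2\frac{N}{(m-1)(g-1)}\right) \qquad (2.9)$$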
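Eqs. (2.9) and (2.10) can be simplified under the condition that no residual pointers exist. For example, when f = (m−1)g+1, so that m − 1 ≈ m ≈ f/g for large f, Eq. (2.9) can be rewritten as
$$f\times\left[\frac{2+\log_2 g}{g}+\frac{g-1}{g}\left(2.58+\log_2\frac{g}{g-1}\right)+\log_2\frac{N}{f}\right] \qquad (2.11)$$
and some examples of the maximum number of bits required for unique-order interpolative coding are given in Table 2.2.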
Table 2.2 Some examples of the maximum number of bits required for unique-order interpolative coding if Golomb coding is used to encode boundary pointers under the condition that no residual pointers exist.

g     Maximum number of bits required
2     f × [3.29 + log2(N/f)]
4     f × [3.25 + log2(N/f)]
8     f × [3.05 + log2(N/f)]
16    f × [2.88 + log2(N/f)]
32    f × [2.76 + log2(N/f)]
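The constants in Table 2.2 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' derivation: it assumes the standard worst-case bounds of p × (2 + log2(N/p)) bits for p Golomb-coded pointers and f × (2.58 + log2(N/f)) bits for interpolative coding (taken here to be Eq. (2.3)), and it approximates m − 1 ≈ m ≈ f/g, which is reasonable for long posting lists without residual pointers. The function name `table_constant` is hypothetical.

```python
import math

GOLOMB_CONST = 2.0           # assumed constant in the Golomb worst-case bound
INTERPOLATIVE_CONST = 2.58   # assumed constant in the interpolative bound, Eq. (2.3)


def table_constant(g):
    """Return c(g) such that the total bound is f * (c(g) + log2(N / f)).

    A fraction 1/g of the identifiers are boundary pointers (Golomb coded,
    average gap g*N/f); the remaining (g-1)/g are inner pointers, coded in
    groups of g-1 identifiers over a range of about g*N/f each.
    """
    if g < 2:
        raise ValueError("group size g must be at least 2")
    boundary_term = (1.0 / g) * (GOLOMB_CONST + math.log2(g))
    inner_term = ((g - 1.0) / g) * (INTERPOLATIVE_CONST + math.log2(g / (g - 1)))
    return boundary_term + inner_term


for g in (2, 4, 8, 16, 32):
    print(g, round(table_constant(g), 2))
```

Under these assumptions the printed constants are 3.29, 3.25, 3.05, 2.88 and 2.76, which agree with the entries of Table 2.2.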
Table 2.2 shows that, when Golomb coding is used to encode the boundary pointers, the maximum number of bits required by unique-order interpolative coding decreases as the group size g increases. On the other hand, unique-order interpolative coding cannot be used if the number of document identifiers is less than (g+1), so g cannot be chosen arbitrarily large. We design an experiment in Section 2.4 to find a suitable group size g.
The results in Eqs. (2.9) and (2.10), and Table 2.2 can be improved if Eq.(2.3) can be improved.
For example, the maximum number of bits required for interpolative coding to encode a posting list with 3 document identifiers ranging from 1 to N is
$$\lceil\log_2(N-2)\rceil+\lceil\log_2 a\rceil+\lceil\log_2 b\rceil \qquad (2.12)$$
since the middle item requires $\lceil\log_2(N-2)\rceil$ bits, and the left and right items require $\lceil\log_2 a\rceil+\lceil\log_2 b\rceil$ bits, where a and b are two positive integers with a+b = N−1. Since
$$\lceil\log_2(N-2)\rceil<1+\log_2 N \qquad (2.13)$$
and
$$\lceil\log_2 a\rceil+\lceil\log_2 b\rceil<(1+\log_2 a)+(1+\log_2 b)<2+\log_2\frac{N}{2}+\log_2\frac{N}{2} \qquad (2.14)$$
hence
$$\lceil\log_2(N-2)\rceil+\lceil\log_2 a\rceil+\lceil\log_2 b\rceil<3\times\left(1.92+\log_2\frac{N}{3}\right) \qquad (2.15)$$
We replace Eq. (2.3) with Eq. (2.15) when the group size g=4, and the maximum number of bits required for unique-order interpolative coding under the condition that no residual pointers exist is therefore
$$f\times\left[2.76+\log_2\frac{N}{f}\right] \qquad (2.16)$$
Compared with the corresponding entry in Table 2.2 (3.25 for g=4), a much tighter upper bound is obtained.
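The three-identifier bound can be checked by brute force. The sketch below is a simplified interpolative coder that only counts bits. It assumes each identifier is coded with a plain ⌈log2 r⌉-bit binary code over its r admissible values (so r = 1 costs nothing), whereas a production coder would typically use a minimal binary code. All function names are hypothetical.

```python
import math
from itertools import combinations


def ceil_log2(r):
    """Exact ceil(log2 r) for an integer r >= 1."""
    return (r - 1).bit_length()


def interpolative_bits(ids, lo, hi):
    """Bits used to code the sorted list ids, all known to lie in lo..hi."""
    n = len(ids)
    if n == 0:
        return 0
    mid = n // 2
    x = ids[mid]
    # mid smaller identifiers must fit below x, n-1-mid larger ones above it.
    smallest = lo + mid
    largest = hi - (n - 1 - mid)
    bits = ceil_log2(largest - smallest + 1)
    return (bits
            + interpolative_bits(ids[:mid], lo, x - 1)
            + interpolative_bits(ids[mid + 1:], x + 1, hi))


def worst_case_three(N):
    """Largest bit count over all 3-identifier lists drawn from 1..N."""
    return max(interpolative_bits(list(t), 1, N)
               for t in combinations(range(1, N + 1), 3))


def three_id_bound(N):
    """Upper bound 3 * (1.92 + log2(N / 3)) for three identifiers in 1..N."""
    return 3 * (1.92 + math.log2(N / 3))


for N in (3, 10, 20, 40):
    print(N, worst_case_three(N), round(three_id_bound(N), 2))
```

For three identifiers in 1..N the middle one has N − 2 admissible values, the left one has a = (middle − 1) and the right one has b = (N − middle), so a + b = N − 1 as in the text, and the observed worst case stays below the bound for every N tried.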
To further understand the characteristics of unique-order interpolative coding, we conducted the following experiments. Eight encoding methods were compared: Golomb coding, skewed Golomb coding, batched LLRUN coding, interpolative coding, variable byte coding, the Carryover-12 mechanism, unique-order interpolative coding 1 (group size g=4; boundary pointers and residual pointers coded by Golomb coding), and unique-order interpolative coding 2 (group size g=4; boundary pointers and residual pointers coded by γ coding). In the first experiment (Table 2.3(a)), f = 1,000,000 gaps were drawn from a geometric distribution and compressed using the eight methods. Golomb coding performs best for N/f ≥ 4, since it is a minimum-redundancy code for a geometric gap distribution (Gallager & Van Voorhis, 1975). Compared with the other methods, unique-order interpolative coding is not suitable for a geometric distribution when 2 < N/f < 256. As N/f increases, however, the performance of unique-order interpolative coding 1 improves relative to the other methods, and when N/f ≤ 2 the results of unique-order interpolative coding 2 are satisfactory. For most cases in the first experiment, both variable byte coding and the Carryover-12 mechanism are inefficient in compression.
Table 2.3 Compression results for geometric and skewed geometric distributions of f = 1,000,000 gaps: average bits per gap

(a) Geometric distribution

                                    Average gap (N/f)
Coding method                             1      2      4      8     16     32     64    128    256    512   1024   2048
Golomb coding                          1.00   2.33   3.30   4.39   5.43   6.45   7.46   8.47   9.47  10.47  11.47  12.47
Skewed Golomb coding                   1.00   2.53   3.51   4.60   5.64   6.66   7.67   8.68   9.68  10.68  11.68  12.68
Batched LLRUN coding                   1.00   2.27   3.46   4.50   5.53   6.52   7.52   8.52   9.52  10.52  11.52  12.53
Interpolative coding                   0.00   2.15   3.45   4.59   5.66   6.69   7.70   8.71   9.71  10.71  11.71  12.72
Variable byte coding                   8.00   8.00   8.00   8.00   8.00   8.14   9.08  10.93  12.87  14.24  15.07  15.52
Carryover-12 mechanism                 1.07   2.88   4.11   5.17   6.18   7.38   8.75   9.90  10.58  12.30  14.41  15.56
Unique-order interpolative coding 1    3.00   4.19   5.13   5.97   6.76   7.53   8.29   9.06   9.89  10.77  11.68  12.77
Unique-order interpolative coding 2    0.25   2.33   3.91   5.31   6.64   7.92   9.19  10.45  11.70  12.96  14.21  15.46
Self-entropy                           0.00   2.00   3.24   4.35   5.40   6.42   7.43   8.44   9.44  10.44  11.43  12.43

(b) Skewed geometric distribution

                                    Average gap (N/f)
Coding method                             1      2      4      8     16     32     64    128    256    512   1024   2048
Golomb coding                          1.40   2.60   3.30   4.29   5.33   6.37   7.39   8.40   9.40  10.40  11.40  12.41
Skewed Golomb coding                   1.80   2.31   2.92   3.76   4.80   5.79   6.80   7.82   8.82   9.83  10.83  11.83
Batched LLRUN coding                   1.40   2.31   2.86   3.60   4.61   5.66   6.70   7.71   8.71   9.71  10.70  11.71
Interpolative coding                   0.84   1.53   2.07   2.90   3.97   5.07   6.15   7.19   8.21   9.23  10.23  11.24
Variable byte coding                   8.00   8.00   8.00   8.00   8.10   8.58   9.38  10.11  10.63  11.28  12.43  13.80
Carryover-12 mechanism                 1.07   2.36   2.90   3.72   4.84   6.02   6.98    7.9   9.35  10.90  12.08  12.57
Unique-order interpolative coding 1    3.60   3.96   4.30   4.80   5.51   6.30   7.11   7.94   8.76   9.60  10.51  11.62
Unique-order interpolative coding 2    1.25   1.90   2.47   3.33   4.53   5.88   7.21   8.53   9.81  11.07  12.33  13.60
Self-entropy                           0.97   1.77   2.30   3.05   4.06   5.10   6.15   7.18   8.19   9.19  10.19  11.20
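The self-entropy row of Table 2.3(a) can be compared with the theoretical entropy of a geometric gap distribution. The sketch below assumes gaps x ≥ 1 with P(x) = (1 − p)^(x−1) p and p = f/N, for which the entropy per gap is H(p)/p bits, where H is the binary entropy function. The text does not state whether the tabulated self-entropy was measured on the sampled gaps or computed in closed form, so small differences in the last digit are to be expected. The function name is hypothetical.

```python
import math


def geometric_entropy_bits(avg_gap):
    """Entropy in bits per gap of a geometric distribution with mean avg_gap.

    Assumes P(x) = (1 - p) ** (x - 1) * p for x >= 1, with p = 1 / avg_gap.
    """
    if avg_gap < 1:
        raise ValueError("the average gap N/f must be at least 1")
    p = 1.0 / avg_gap
    if p == 1.0:
        return 0.0                      # every gap equals 1: nothing to code
    binary_entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return binary_entropy / p


for avg_gap in (1, 2, 4, 8, 16):
    print(avg_gap, round(geometric_entropy_bits(avg_gap), 2))
```

The computed values track the self-entropy row closely, which also makes the distance between each coding method and the entropy easy to read off per column.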
In the second experiment, for each value of N/f the sequence of f = 1,000,000 geometrically distributed gaps was broken into chunks of 200 contiguous values. The chunks were then placed in groups of five. In the first three chunks of each group, all gaps were multiplied by a factor of 0.1, whereas in the other two chunks all gaps were multiplied by a factor of 2.35. This process created artificial clusters of gaps much smaller than the average, with about 60% of the values falling into these clusters, while the overall average gap remained the same (0.6 × 0.1 + 0.4 × 2.35 = 1). This better resembles the distribution of real document collections. The results are shown in Table 2.3(b). Golomb coding compresses less well than skewed Golomb coding, batched LLRUN coding, and interpolative coding, meaning that it is unable to exploit clustering well.
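The clustering transformation of the second experiment can be sketched as follows. The helper names are hypothetical, and the sketch leaves the scaled gaps as real numbers because the text does not say how they were rounded back to integers.

```python
def cluster_factor(index, chunk_size=200, chunks_per_group=5,
                   small_chunks=3, small=0.1, large=2.35):
    """Scaling factor applied to the gap at position index.

    Gaps are cut into chunks of chunk_size values and chunks are taken in
    groups of chunks_per_group; the first small_chunks chunks of every group
    are shrunk by the factor small, the others are stretched by large.
    """
    chunk_in_group = (index // chunk_size) % chunks_per_group
    return small if chunk_in_group < small_chunks else large


def cluster_gaps(gaps):
    """Apply the clustering factors to a sequence of gaps."""
    return [gap * cluster_factor(i) for i, gap in enumerate(gaps)]


# One full group of five chunks covers 1000 consecutive gaps.
factors = [cluster_factor(i) for i in range(1000)]
print(factors.count(0.1) / len(factors))    # share of gaps that are shrunk
print(sum(factors) / len(factors))          # mean scaling factor
```

Over one full group, 60% of the gaps are shrunk and the mean scaling factor is 1 up to floating-point rounding, which is why the overall average gap is preserved.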
The compression results of unique-order interpolative coding for a skewed geometric distribution are better than those for a geometric distribution. This means that unique-order interpolative coding takes good advantage of the clustering property. For N/f ≤ 32, we prefer unique-order interpolative coding 2, while for N/f > 32 we suggest unique-order interpolative coding 1. As with the geometric distribution, unique-order interpolative coding 1 performs better as N/f becomes larger. Again, both variable byte coding and the Carryover-12 mechanism are inefficient in compression for most cases in the second experiment. Table 2.3(b) also shows that interpolative coding can even outperform self-entropy (for N/f ≤ 32). This is because interpolative coding does not encode the gap value directly, but instead uses a minimal binary code to encode every gap after it is converted to a triple.