Chapter 2 Inverted File Size Reduction

2.3 Quantitative Analysis

Consider a posting list PL = <id₁, id₂, ..., id_f> of f document identifiers, where id_k < id_{k+1} and all document identifiers lie in the range 1...N. As stated in Section 2.2, the first step in unique-order interpolative coding is to determine the group size g. Once g is determined, the PL is divided into m = ⌈f/g⌉ blocks, with the first (m−1) blocks containing g document identifiers each and the last block containing f − (m−1)g document identifiers. The boundary pointers and the residual pointers are coded as d-gaps by an efficient prefix-free coding method such as Golomb coding or γ coding, and the inner document identifiers are coded by interpolative coding.
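The block structure described above can be sketched in a few lines of Python. This is an illustration only: the function names `split_blocks` and `pointer_counts` are hypothetical, and the pointer counts simply apply the formulas used in this section ((m−1)(g−1) inner pointers, f − (m−1)(g−1) boundary and residual pointers).

```python
def split_blocks(posting_list, g):
    """Split a sorted posting list into m = ceil(f / g) blocks.

    The first m - 1 blocks hold g identifiers each; the last block holds
    the remaining f - (m - 1) * g identifiers.
    """
    f = len(posting_list)
    if f == 0 or g < 1:
        raise ValueError("need a non-empty list and a group size g >= 1")
    m = -(-f // g)  # integer ceiling of f / g
    return [posting_list[k * g:(k + 1) * g] for k in range(m)]


def pointer_counts(f, g):
    """Return (m, inner, boundary_and_residual) for list length f and group size g."""
    m = -(-f // g)
    inner = (m - 1) * (g - 1)          # coded by interpolative coding
    boundary_and_residual = f - inner  # coded by Golomb or gamma coding
    return m, inner, boundary_and_residual


blocks = split_blocks([3, 8, 9, 11, 12, 13, 17, 21, 25], g=4)
counts = pointer_counts(9, 4)
```

With f = 9 and g = 4 we have f = (m−1)g + 1, the case with no residual pointers: the last block holds a single identifier and the number of boundary pointers equals m.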

Let the function F(N, f) denote the number of bits needed to compress f document identifiers in the range 1 to N. The following approximate formulas can then be derived from the literature (Golomb, 1966; Gallager & Van Voorhis, 1975; McIlroy, 1982; Elias, 1975; Moffat & Stuiver, 2000).

If Golomb coding is used to encode the boundary pointers and residual pointers, then the maximum number of bits required to store these f-(m-1)(g-1) boundary and residual pointers is

[equation not recoverable from the source]

If we use γ coding to encode these pointers, then the maximum number of bits required is

[equation not recoverable from the source]

Based on Eq.(2.3), the number of bits required to code the inner pointers ((m-1) groups, (g-1) document identifiers in each group) is

[equation not recoverable from the source]

and the sum of the logarithms of the (m−1) individual ranges is maximized when all the ranges are equal.

Therefore, if Golomb coding is used to encode the boundary and residual pointers, the number of bits required by unique-order interpolative coding is at most

[equation not recoverable from the source]

Eqs. (2.9) and (2.10) can be simplified under the condition that no residual pointers exist. For example, when f=(m-1)g+1, Eq. (2.9) can be rewritten as:

[equation not recoverable from the source]

and some examples of the maximum number of bits required for unique-order interpolative coding are derived in Table 2.2.

Table 2.2 Some examples of the maximum number of bits required for unique-order interpolative coding if Golomb coding is used to encode boundary pointers under the condition that no residual pointers exist.

| g | Maximum number of bits required |
|---|---|
| 2 | f × [3.29 + log₂(N/f)] |
| 4 | f × [3.25 + log₂(N/f)] |
| 8 | f × [3.05 + log₂(N/f)] |
| 16 | f × [2.88 + log₂(N/f)] |
| 32 | f × [2.76 + log₂(N/f)] |

$$\lceil \log_2 a \rceil + \lceil \log_2 b \rceil < (1 + \log_2 a) + (1 + \log_2 b) < 2 + \log_2 \frac{N}{2} + \log_2 \frac{N}{2} \qquad (2.14)$$

hence

$$\lceil \log_2 (N-2) \rceil + \lceil \log_2 a \rceil + \lceil \log_2 b \rceil < 3 \times \left(1.92 + \log_2 \frac{N}{3}\right) \qquad (2.15)$$

We replace Eq. (2.3) with Eq. (2.15) when the group size is g = 4, and the maximum number of bits required for unique-order interpolative coding under the condition that no residual pointers exist is therefore

$$f \times \left[2.76 + \log_2 \frac{N}{f}\right] \qquad (2.16)$$

Compared with the corresponding entry for g = 4 in Table 2.2 (a constant of 3.25), this is a much tighter upper bound (a constant of 2.76).
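The constant 1.92 in Eq. (2.15) can be checked numerically. The sketch below rests on an assumption that is a reading of the derivation rather than something stated in this section: a and b are positive integers with a + b ≤ N, and the left-hand side is ⌈log₂(N−2)⌉ + ⌈log₂ a⌉ + ⌈log₂ b⌉. Since each ceiling is less than one plus its argument and log₂ a + log₂ b ≤ 2 log₂(N/2), the left-hand side is below 1 + 3 log₂ N = 3 × (1/3 + log₂ 3 + log₂(N/3)), which suggests where a constant of about 1.92 comes from. The function names are hypothetical.

```python
from math import log2


def ceil_log2(n):
    """ceil(log2(n)) for a positive integer n, using exact integer arithmetic."""
    return (n - 1).bit_length()


def inner_bits(N, a, b):
    """Left-hand side: bits charged to the three inner pointers of a block (g = 4)."""
    return ceil_log2(N - 2) + ceil_log2(a) + ceil_log2(b)


def bound(N, c=1.92):
    """Right-hand side: 3 * (c + log2(N / 3))."""
    return 3 * (c + log2(N / 3))


# Exhaustive check over small ranges (assumed constraint: a, b >= 1 and a + b <= N).
holds = all(
    inner_bits(N, a, b) < bound(N)
    for N in range(3, 129)
    for a in range(1, N)
    for b in range(1, N - a + 1)
)

# Smallest constant for which the argument above goes through, about 1.918.
tight_constant = 1 / 3 + log2(3)
```

The brute-force check is limited to N < 129 to keep the run time short; the inequality argument in the lead-in does not depend on that limit.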

To further understand the characteristics of unique-order interpolative coding, we conducted the following experiments. We compared eight coding methods: Golomb coding, skewed Golomb coding, batched LLRUN coding, interpolative coding, variable byte coding, the Carryover-12 mechanism, unique-order interpolative coding 1 (group size g = 4; boundary and residual pointers coded by Golomb coding), and unique-order interpolative coding 2 (group size g = 4; boundary and residual pointers coded by γ coding).

In the first experiment (Table 2.3(a)), f = 1,000,000 gaps were drawn from a geometric distribution and compressed using the eight methods. Golomb coding performs best for every average gap N/f of 4 or more, since it is a minimum-redundancy code for a geometric gap distribution (Gallager & Van Voorhis, 1975). Compared with the other methods, unique-order interpolative coding is not well suited to a geometric distribution when 2 < N/f < 256. As N/f increases, however, the performance of unique-order interpolative coding 1 improves relative to the other methods. When N/f ≤ 2, the results of unique-order interpolative coding 2 are satisfactory. For most cases in the first experiment, both variable byte coding and the Carryover-12 mechanism compress inefficiently.
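To make the Golomb and self-entropy figures of the first experiment more concrete, the sketch below computes the expected Golomb code length and the self-entropy for geometrically distributed gaps with a given average gap. It is an illustration under stated assumptions, not the code used in the experiments: the parameter rule b = ⌈0.69 × N/f⌉ is the common approximation for Golomb coding and is assumed here, gaps are taken to start at 1, the function names are hypothetical, and exact expectations are computed instead of sampling 1,000,000 gaps.

```python
from math import ceil, log2


def golomb_length(x, b):
    """Bits used by a Golomb code with parameter b for a gap x >= 1."""
    q, r = divmod(x - 1, b)
    k = (b - 1).bit_length()   # ceil(log2(b))
    short = (1 << k) - b       # remainders below this value use k - 1 bits
    return (q + 1) + (k - 1 if r < short else k)


def geometric_stats(avg_gap, tol=1e-12):
    """Return (expected Golomb bits per gap, self-entropy in bits) for a
    geometric gap distribution P(x) = p * (1 - p)**(x - 1) with p = 1 / avg_gap."""
    p = 1.0 / avg_gap
    b = ceil(0.69 * avg_gap)   # assumed parameter choice, not taken from the text
    bits = entropy = 0.0
    x, tail = 1, 1.0           # tail = P(gap >= x)
    while tail > tol:
        prob = tail * p
        bits += prob * golomb_length(x, b)
        entropy -= prob * log2(prob)
        tail *= 1.0 - p
        x += 1
    return bits, entropy
```

Under these assumptions the sketch gives about 2.33 and 3.30 expected bits per gap for average gaps of 2 and 4, with self-entropies of about 2.00 and 3.25, which is in line with the Golomb and self-entropy rows reported for the geometric distribution.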

Table 2.3 Compression results for geometric and skewed geometric distributions of f = 1,000,000 gaps: average bits per gap

(a) Geometric distribution

| Coding method \ Average gap (N/f) | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Golomb coding | 1.00 | 2.33 | 3.30 | 4.39 | 5.43 | 6.45 | 7.46 | 8.47 | 9.47 | 10.47 | 11.47 | 12.47 |
| Skewed Golomb coding | 1.00 | 2.53 | 3.51 | 4.60 | 5.64 | 6.66 | 7.67 | 8.68 | 9.68 | 10.68 | 11.68 | 12.68 |
| Batched LLRUN coding | 1.00 | 2.27 | 3.46 | 4.50 | 5.53 | 6.52 | 7.52 | 8.52 | 9.52 | 10.52 | 11.52 | 12.53 |
| Interpolative coding | 0.00 | 2.15 | 3.45 | 4.59 | 5.66 | 6.69 | 7.70 | 8.71 | 9.71 | 10.71 | 11.71 | 12.72 |
| Variable byte coding | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 | 8.14 | 9.08 | 10.93 | 12.87 | 14.24 | 15.07 | 15.52 |
| Carryover-12 mechanism | 1.07 | 2.88 | 4.11 | 5.17 | 6.18 | 7.38 | 8.75 | 9.90 | 10.58 | 12.30 | 14.41 | 15.56 |
| Unique-order interpolative coding 1 | 3.00 | 4.19 | 5.13 | 5.97 | 6.76 | 7.53 | 8.29 | 9.06 | 9.89 | 10.77 | 11.68 | 12.77 |
| Unique-order interpolative coding 2 | 0.25 | 2.33 | 3.91 | 5.31 | 6.64 | 7.92 | 9.19 | 10.45 | 11.70 | 12.96 | 14.21 | 15.46 |
| Self-entropy | 0.00 | 2.00 | 3.24 | 4.35 | 5.40 | 6.42 | 7.43 | 8.44 | 9.44 | 10.44 | 11.43 | 12.43 |

(b) Skewed geometric distribution

| Coding method \ Average gap (N/f) | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Golomb coding | 1.40 | 2.60 | 3.30 | 4.29 | 5.33 | 6.37 | 7.39 | 8.40 | 9.40 | 10.40 | 11.40 | 12.41 |
| Skewed Golomb coding | 1.80 | 2.31 | 2.92 | 3.76 | 4.80 | 5.79 | 6.80 | 7.82 | 8.82 | 9.83 | 10.83 | 11.83 |
| Batched LLRUN coding | 1.40 | 2.31 | 2.86 | 3.60 | 4.61 | 5.66 | 6.70 | 7.71 | 8.71 | 9.71 | 10.70 | 11.71 |
| Interpolative coding | 0.84 | 1.53 | 2.07 | 2.90 | 3.97 | 5.07 | 6.15 | 7.19 | 8.21 | 9.23 | 10.23 | 11.24 |
| Variable byte coding | 8.00 | 8.00 | 8.00 | 8.00 | 8.10 | 8.58 | 9.38 | 10.11 | 10.63 | 11.28 | 12.43 | 13.80 |
| Carryover-12 mechanism | 1.07 | 2.36 | 2.90 | 3.72 | 4.84 | 6.02 | 6.98 | 7.9 | 9.35 | 10.90 | 12.08 | 12.57 |
| Unique-order interpolative coding 1 | 3.60 | 3.96 | 4.30 | 4.80 | 5.51 | 6.30 | 7.11 | 7.94 | 8.76 | 9.60 | 10.51 | 11.62 |
| Unique-order interpolative coding 2 | 1.25 | 1.90 | 2.47 | 3.33 | 4.53 | 5.88 | 7.21 | 8.53 | 9.81 | 11.07 | 12.33 | 13.60 |
| Self-entropy | 0.97 | 1.77 | 2.30 | 3.05 | 4.06 | 5.10 | 6.15 | 7.18 | 8.19 | 9.19 | 10.19 | 11.20 |

In the second experiment, for each value of N/f the sequence of f = 1,000,000 geometrically distributed gaps was broken into chunks of 200 contiguous values. The chunks were then placed in groups of five. In the first three chunks of each group, all gaps were multiplied by a factor of 0.1, whereas in the other two chunks all gaps were multiplied by a factor of 2.35. This process created artificial clusters of gaps much smaller than the average, and about 60% of the values fell into these clusters, while the overall average gap remained the same. This better resembles the distribution found in real document collections. The results are shown in Table 2.3(b). For almost all average gaps, Golomb coding compresses less efficiently than skewed Golomb coding, batched LLRUN coding, and interpolative coding, which indicates that it cannot exploit clustering well.
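The clustering transformation of the second experiment can be sketched as follows. The chunk size, group size, and the factors 0.1 and 2.35 come from the description above; the function name `cluster_gaps`, the rounding to the nearest integer, and the clamping to a minimum gap of 1 are assumptions of this sketch, since the text does not say how fractional gaps were handled. The average gap is preserved because (3 × 0.1 + 2 × 2.35) / 5 = 1.

```python
def cluster_gaps(gaps, chunk=200, group=5, small_chunks=3,
                 small=(1, 10), large=(47, 20)):
    """Rescale gaps chunk by chunk to create artificial clusters.

    Within every group of `group` chunks, gaps in the first `small_chunks`
    chunks are multiplied by 0.1 (= 1/10) and gaps in the remaining chunks
    by 2.35 (= 47/20). Factors are given as (numerator, denominator) so the
    arithmetic stays in exact integers.
    """
    out = []
    for i, gap in enumerate(gaps):
        chunk_index = (i // chunk) % group
        num, den = small if chunk_index < small_chunks else large
        scaled = (2 * gap * num + den) // (2 * den)  # round to nearest, halves up
        out.append(max(1, scaled))                   # assumed: gaps stay >= 1
    return out


clustered = cluster_gaps([100] * 1000)
```

For a constant input gap of 100, the first 600 values of each group of 1,000 become 10 and the remaining 400 become 235, so 60% of the values sit in the dense clusters and the total (hence the average gap) is unchanged.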

The compression results of unique-order interpolative coding for a skewed geometric distribution are generally better than those for a geometric distribution. This means that unique-order interpolative coding takes good advantage of the clustering property. For N/f ≤ 32, we prefer unique-order interpolative coding 2, while for N/f > 32 we suggest unique-order interpolative coding 1. As with the geometric distribution, unique-order interpolative coding 1 performs better as N/f becomes larger. Again, both variable byte coding and the Carryover-12 mechanism compress inefficiently in most cases in the second experiment. Table 2.3(b) also shows that interpolative coding can even outperform self-entropy. This is because interpolative coding does not use the gap value directly in encoding, but instead uses a minimal binary code to encode every gap after it is converted to a triple.
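The effect of clustering on interpolative coding is easier to see with a small bit-count sketch. The function below follows the standard binary interpolative coding recursion (Moffat & Stuiver, 2000): code the middle identifier within the range its neighbours allow, then recurse on both halves. As a simplifying assumption it charges ⌈log₂(range size)⌉ bits per identifier rather than a minimal binary code, so it can slightly overestimate; the function name is hypothetical.

```python
def interpolative_bits(ids, lo, hi):
    """Bits needed to code the sorted identifiers `ids`, all known to lie in lo..hi."""
    if not ids:
        return 0
    mid = len(ids) // 2
    left, right = ids[:mid], ids[mid + 1:]
    # The middle identifier must leave room for len(left) smaller identifiers
    # and len(right) larger ones inside lo..hi.
    low = lo + len(left)
    high = hi - len(right)
    span = high - low + 1
    bits = (span - 1).bit_length()  # ceil(log2(span)); 0 bits when span == 1
    return (bits
            + interpolative_bits(left, lo, ids[mid] - 1)
            + interpolative_bits(right, ids[mid] + 1, hi))


dense = interpolative_bits(list(range(1, 8)), 1, 7)
sparse = interpolative_bits([3, 8, 9, 11, 12, 13, 17], 1, 20)
```

When a run of identifiers fills its range completely, every range has size 1 and the whole run costs 0 bits, which no code that spends at least one bit per gap can match. This is consistent with the 0.00 entry for interpolative coding at an average gap of 1 in the geometric case.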
