Chapter 2 Inverted File Size Reduction
2.4 Performance Evaluation
2.4.2 Performance results
In this subsection, we first present the compression performance of unique-order interpolative coding for different group sizes g. We then compare the compression performance of the different coding methods. Finally, we present their search performance.
Compression performance of unique-order interpolative coding
In this subsection, Golomb coding was used to code both boundary pointers and residual pointers. Because the average gap sizes in Table 2.4 are relatively large, Golomb coding is the method recommended by Table 2.3(b). The compression results are shown in Table 2.5. The metric used is the average number of bits per document identifier (BPI), defined as follows:
\[
\mathrm{BPI} = \frac{\text{size of the compressed inverted file (in bits)}}{\text{number of document identifiers}}
\]
For each term t, the cost of using γ coding to encode the frequency f_t is calculated and included in the presented results.
Note that for group size g=4 and g=8, unique-order interpolative coding achieved good compression. For a simple implementation, we suggest using g=4. In the following experiments, Golomb coding was used to code both boundary pointers and residual pointers for unique-order interpolative coding, and group size g was set to 4 unless otherwise stated.
Table 2.5 Compression performance of unique-order interpolative coding versus different group size g
Group size g     Bible   DBbib    FBIS     LAT    TREC
1                 6.11    6.20    5.27    5.31    5.49
2                 5.64    5.47    4.84    4.91    4.99
3                 5.61    5.31    4.80    4.89    4.94
4                 5.46    5.11    4.66    4.74    4.78
5                 5.52    5.13    4.71    4.80    4.82
6                 5.52    5.10    4.71    4.79    4.81
7                 5.47    5.04    4.65    4.74    4.75
8                 5.42    4.98    4.59    4.68    4.69
9                 5.47    5.01    4.64    4.72    4.73
10                5.51    5.03    4.67    4.75    4.76
Compression performance of different coding methods
We now compare the effectiveness of eight coding methods: γ coding, Golomb coding, batched LLRUN coding, skewed Golomb coding, interpolative coding, variable byte coding, the Carryover-12 mechanism, and unique-order interpolative coding. For each term t, the cost of using γ coding to encode the frequency f_t is calculated and included in the presented results. Any necessary overheads, such as the complete set of models and model selectors for batched LLRUN coding, are also included. However, the cost of storing the parameter b for each posting list in Golomb coding (Witten et al., 1999) is not included, because b can be calculated from the stored frequency f_t using Witten's approximation. The results are shown in Table 2.6. Notice that:
1. Both variable byte coding and Carryover-12 mechanism are inefficient in compression of inverted files.
2. For the other coding methods, the compression efficiencies of both γ coding and Golomb coding are relatively low because of the simple models they use.
3. The compression efficiencies of the batched LLRUN, skewed Golomb, interpolative, and unique-order interpolative coding methods are relatively good. This shows that clustering is a good compression aid.
4. The compression efficiency of unique-order interpolative coding is inferior only to that of interpolative coding on Bible, LAT, and TREC, and also to that of batched LLRUN coding on DBbib and FBIS, meaning that it takes good advantage of the clustering property.
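The reason the Golomb parameter b need not be stored can be sketched as follows. The approximation is commonly stated as b ≈ 0.69 · N / f_t, where N is the number of documents in the collection. The function name `golomb_b`, the argument `n_docs`, and the choice to round up are assumptions made for this illustration and are not taken from the experiments above.

```c
/* Hypothetical helper: derive the Golomb parameter b for one posting list
 * from the stored term frequency f_t, so that b itself need not be stored.
 * Assumes the common approximation b ~ 0.69 * N / f_t, rounded up here
 * (the rounding direction is an assumption). Integer arithmetic only. */
unsigned long golomb_b(unsigned long n_docs, unsigned long f_t)
{
    unsigned long long num, den;

    if (f_t == 0)
        return 1;                      /* empty list: any valid b will do */

    num = 69ULL * n_docs;              /* 0.69 * N, scaled by 100 */
    den = 100ULL * f_t;                /* f_t, scaled by 100 */
    num = (num + den - 1) / den;       /* ceiling of num / den */

    return num < 1 ? 1 : (unsigned long)num;   /* b must be at least 1 */
}
```

For example, a term that occurs in 10 of 1000 documents would get b = 69 under these assumptions, and the decoder can recompute the same value from f_t alone.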
Table 2.6 Compression Performance of different coding methods.
Coding method                        Bible   DBbib    FBIS     LAT    TREC
γ coding                              6.58    5.96    5.38    5.63    5.63
Golomb coding                         6.11    6.20    5.27    5.31    5.49
Batched LLRUN coding                  5.52    4.88    4.63    4.78    4.84
Skewed Golomb coding                  5.92    5.75    5.04    5.07    5.10
Interpolative coding                  5.37    4.89    4.58    4.65    4.62
Variable byte coding                  9.10    9.54    8.88    8.89    8.84
Carryover-12 mechanism                7.14    7.99    6.23    6.13    5.95
Unique-order interpolative coding     5.46    5.11    4.66    4.74    4.78
Search performance of different coding methods
The query processing time includes (1) disk access time, (2) decompression time, and (3) document identifier comparison time. Experiments showed that disk access time and decompression time account for more than 90% of query processing time, and that document identifier comparison time is not a function of the coding method used. Therefore, the search performance metric is defined as
Search Time (ST) = Disk Access Time (AT) + Decompression Time (DT).
The speedups of all coding methods relative to Golomb coding were then calculated for all test collections.
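As a small worked illustration of the metric, the sketch below computes ST and the speedup SP. The formula SP = ST(Golomb) / ST(method) is inferred from the values reported in Table 2.7 rather than stated explicitly in the text, and the function names are hypothetical.

```c
/* Search time as defined above: ST = AT + DT (all times in microseconds). */
unsigned long search_time_us(unsigned long at_us, unsigned long dt_us)
{
    return at_us + dt_us;
}

/* Speedup relative to Golomb coding. Assumed form: SP = ST_golomb / ST_method,
 * so SP > 1 means the method answers queries faster than Golomb coding. */
double speedup_vs_golomb(unsigned long st_golomb_us, unsigned long st_method_us)
{
    return (double)st_golomb_us / (double)st_method_us;
}
```

With the Bible figures for γ coding (AT = 125, DT = 70) and Golomb coding (ST = 223), this gives ST = 195 and a speedup of about 1.14.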
All experiments described in this subsection were run on an Intel P4 2.4GHz PC with 256MB DDR memory running the Linux operating system 2.4.12. The hard disk was 40GB, and the data transfer rate was 25MB/sec. Intervening processes and disk activities were minimized during experimentation. All decoding mechanisms were written in C, compiled with gcc, and optimized as follows:
1. Replaced subroutines with macros.
2. Carefully chose the compiler optimization flags.
3. Used 32-bit integers, as that is the internal register size of the Intel P4 CPU.
4. Implemented the integer logarithm function ⌈log2(i)⌉ with a lookup table. Let z be a 256-entry array, and z[k] be ⌈log2(k+1)⌉, where 0 ≤ k ≤ 255. The function ⌈log2(i)⌉ can be implemented in C as follows (v is the returned value of ⌈log2(i)⌉):
    do {
        /* work on i - 1 so that exact powers of two are not rounded up */
        register int __i = (i) - 1;
        /* pick the highest non-zero byte of __i and look it up in z */
        (v) = __i>>16 ? (__i>>24 ? 24 + z[__i>>24] : 16 + z[__i>>16])
                      : (__i>>8 ? 8 + z[__i>>8] : z[__i]);
    } while (0);
5. Implemented the integer logarithm function ⌊log2(i)⌋ also with a lookup table. The array z is the same as that used in the function ⌈log2(i)⌉. The function ⌊log2(i)⌋ can be implemented in C as follows (v is the returned value of ⌊log2(i)⌋):
    do {
        register int __i = (i);
        /* same byte selection as above; offsets are one less because
           z[k] - 1 equals the floor of log2(k) for k >= 1 */
        (v) = __i>>16 ? (__i>>24 ? 23 + z[__i>>24] : 15 + z[__i>>16])
                      : (__i>>8 ? 7 + z[__i>>8] : z[__i] - 1);
    } while (0);
6. Used a 256-entry lookup table to locate the exact bit location of the first “1” bit in a byte. For example, in the byte 00101000 the first “1” bit is in location 3. This accelerates the decoding of unary codes because no bit-by-bit decoding is required.
7. Accessed binary codes with masking and shifting operations, so that no bit-by-bit decoding is required.
With these optimizations, decoding a document identifier required only tens of nanoseconds, with no bit-by-bit decoding.
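The lookup tables described in items 4 to 6 can be sketched as ordinary C functions, which makes them easy to test in isolation. This is a minimal illustration, not the macros used in the experiments: the names `init_tables`, `ceil_log2`, `floor_log2`, and `first_one_pos` are hypothetical, and the convention that a zero byte maps to position 0 is an assumption.

```c
#include <stdint.h>

/* z[k] = ceil(log2(k + 1)) for 0 <= k <= 255, as defined in the text. */
static unsigned char z[256];

/* first_one[b] = position of the first 1 bit in byte b, counting from 1 at
 * the most significant bit; 0 means no 1 bit (assumed convention). */
static unsigned char first_one[256];

/* Fill both tables once at start-up. */
void init_tables(void)
{
    int k, bits = 0;

    for (k = 0; k < 256; k++) {
        while ((1 << bits) < k + 1)    /* smallest bits with 2^bits >= k + 1 */
            bits++;
        z[k] = (unsigned char)bits;
    }

    first_one[0] = 0;
    for (k = 1; k < 256; k++) {
        int pos = 1;
        while (!(k & (0x80 >> (pos - 1))))   /* scan from the leftmost bit */
            pos++;
        first_one[k] = (unsigned char)pos;
    }
}

/* ceil(log2(i)) for 1 <= i < 2^32: look up the top non-zero byte of i - 1. */
int ceil_log2(uint32_t i)
{
    uint32_t x = i - 1;
    return x >> 16 ? (x >> 24 ? 24 + z[x >> 24] : 16 + z[x >> 16])
                   : (x >> 8 ? 8 + z[x >> 8] : z[x]);
}

/* floor(log2(i)) for 1 <= i < 2^32: look up the top non-zero byte of i. */
int floor_log2(uint32_t i)
{
    return i >> 16 ? (i >> 24 ? 23 + z[i >> 24] : 15 + z[i >> 16])
                   : (i >> 8 ? 7 + z[i >> 8] : z[i] - 1);
}

/* Length of a unary run inside one byte, found with a single table access. */
int first_one_pos(uint8_t b)
{
    return first_one[b];
}
```

After calling `init_tables()`, `first_one_pos(0x28)` returns 3, matching the 00101000 example in item 6, and each logarithm costs at most three shifts and one table access.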
Other optimizations included the following. The Huffman code of batched LLRUN coding was implemented with canonical prefix codes (Turpin, 1998), which can be decoded via fast table look-up. For the interpolative coding method, the recursive process was transformed into a non-recursive one, at the cost of an explicit stack (Tenenbaum et al., 1990).
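The recursion-to-stack transformation can be illustrated with the traversal order alone. The sketch below is not the decoder used in the experiments: it only lists the indices of a posting list in the order a middle-first recursive scheme would visit them, using an explicit stack of index ranges. The function name `interp_order`, the midpoint convention, and the stack bound `MAX_STACK` are assumptions.

```c
#include <stddef.h>

#define MAX_STACK 128   /* assumed bound; depth grows only with log2(n) */

/* Writes into order[] the indices 0..n-1 in the order "middle element,
 * then left half, then right half", without recursion.
 * order[] must have room for n entries. Returns the number written. */
size_t interp_order(size_t n, size_t *order)
{
    struct { size_t lo, hi; } stack[MAX_STACK];   /* half-open ranges [lo, hi) */
    int top = 0;
    size_t count = 0;

    if (n == 0)
        return 0;

    stack[top].lo = 0;
    stack[top].hi = n;
    top++;

    while (top > 0) {
        size_t lo, hi, mid;

        top--;                          /* pop the next pending range */
        lo = stack[top].lo;
        hi = stack[top].hi;
        mid = lo + (hi - lo) / 2;
        order[count++] = mid;           /* a real decoder would decode item mid here */

        /* Push the right half first so the left half is handled next. */
        if (mid + 1 < hi) {
            stack[top].lo = mid + 1;
            stack[top].hi = hi;
            top++;
        }
        if (lo < mid) {
            stack[top].lo = lo;
            stack[top].hi = mid;
            top++;
        }
    }
    return count;
}
```

For a list of seven pointers, the visiting order under these assumptions is 3, 1, 0, 2, 5, 4, 6. The stack holds at most one pending range per level of the implicit recursion, which replaces the function-call overhead with a few array accesses.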
The search performance measurements are shown in Table 2.7. Key findings are:
1. Although variable byte coding and the Carryover-12 mechanism gave fast decoding, γ coding and unique-order interpolative coding achieved higher query throughput rates. This is because the disk access time (AT) of variable byte coding and the Carryover-12 mechanism is much higher than that of γ coding and unique-order interpolative coding.
2. For collection DBbib, the decoding times (DT) of γ coding and unique-order interpolative coding are less than that of Carryover-12. This is because a large portion of the d-gaps of frequently used query terms for DBbib have the value 1. Both γ coding and unique-order interpolative coding can encode these d-gaps very economically, which also makes their decoding times for these d-gaps very low.
3. Batched LLRUN coding, skewed Golomb coding, and interpolative coding gave better compression rates than Golomb coding. However, their complex decoding mechanisms make them unsuitable for real-world IRSs.
4. Based on the experimental results, γ coding, the Carryover-12 mechanism, and unique-order interpolative coding are recommended for real-world IRSs. Their query throughput rates were between 10% and 67% higher than that of Golomb coding.
Table 2.7 Search performance of different coding methods (AT is the disk access time, DT is the decoding time, ST=AT+DT is the search time, and SP is the performance relative to the Golomb coding)
Coding method                      Metric     Bible   DBbib    FBIS     LAT    TREC
γ coding                           AT (µs)      125     202    1125    1168    2149
                                   DT (µs)       70     188     952     980    1696
                                   ST (µs)      195     390    2077    2148    3845
                                   SP          1.14    1.50    1.20    1.23    1.20
Golomb coding                      AT (µs)      131     306    1282    1321    2422
                                   DT (µs)       92     280    1200    1314    2179
                                   ST (µs)      223     586    2482    2635    4601
                                   SP          1.00    1.00    1.00    1.00    1.00
Batched LLRUN coding               AT (µs)      116     381    1101    1134    2086
                                   DT (µs)      130     192    1688    1771    3013
                                   ST (µs)      246     573    2789    2905    5099
                                   SP          0.91    1.02    0.89    0.91    0.90
Skewed Golomb coding               AT (µs)      117     331    1120    1150    2097
                                   DT (µs)      122     201    1492    1577    2696
                                   ST (µs)      239     532    2612    2727    4793
                                   SP          0.93    1.10    0.95    0.97    0.96
Interpolative coding               AT (µs)      111     137    1024     995    1916
                                   DT (µs)      243     688    3094    3266    5598
                                   ST (µs)      354     825    4118    4261    7514
                                   SP          0.63    0.71    0.60    0.62    0.61
Variable byte coding               AT (µs)      214     918    3134    3489    5506
                                   DT (µs)       22      90     336     388     633
                                   ST (µs)      236    1008    3470    3877    6139
                                   SP          0.95    0.58    0.72    0.68    0.75
Carryover-12 mechanism             AT (µs)      145     311    1498    1491    2566
                                   DT (µs)       52     190     765     825    1368
                                   ST (µs)      197     501    2263    2316    3934
                                   SP          1.13    1.17    1.10    1.14    1.17
Unique-order interpolative coding  AT (µs)      113     182    1066    1076    2011
                                   DT (µs)       82     169    1041    1041    1909
                                   ST (µs)      195     351    2107    2117    3920
                                   SP          1.14    1.67    1.18    1.24    1.17
5. To obtain better compression rates, Golomb coding and unique-order interpolative coding use a minimal binary code in their codewords. Decoding a minimal binary code requires “toggle point” calculations, which slow down query evaluation. Rice coding is a variant of Golomb coding in which the value b is restricted to be a power of 2. The advantage of this restriction is that no “toggle point” calculation is required; the disadvantage is slightly worse compression than that of Golomb coding. If we use Rice coding to encode the boundary and residual pointers in unique-order interpolative coding, and a simple binary code to encode the (x, lo, hi) triples for the inner pointers, unique-order interpolative coding requires no “toggle point” calculation at all. Table 2.8 shows that Rice coding gave query throughput rates approximately 8% higher than Golomb coding, and that unique-order interpolative coding without “toggle point” calculation gave query throughput rates approximately 30% higher than Golomb coding. The results further show that the decoding time of unique-order interpolative coding without “toggle point” calculation is even less than that of the Carryover-12 mechanism on every collection except Bible.
6. Experimental results showed that a good coding method must combine a high compression ratio with a high decompression rate. Unique-order interpolative coding is such a method.
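The cost of the “toggle point” in item 5 can be seen by comparing the two remainder decoders side by side. The sketch below assumes the usual definition of a minimal binary code for values 0..b-1 (the first 2^k - b values take k-1 bits, the rest take k bits, with k = ⌈log2(b)⌉). The bit reader, all names, and the MSB-first bit order are assumptions, and the reader works bit by bit purely for clarity, unlike the masking and shifting used in the experiments.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader over a byte buffer. */
typedef struct {
    const uint8_t *buf;
    size_t pos;            /* index of the next bit to read */
} bitreader;

/* Read n bits (0 <= n <= 31). Bit by bit for clarity only. */
uint32_t get_bits(bitreader *r, int n)
{
    uint32_t v = 0;
    while (n-- > 0) {
        v = (v << 1) | ((r->buf[r->pos >> 3] >> (7 - (r->pos & 7))) & 1u);
        r->pos++;
    }
    return v;
}

/* Golomb-style remainder: minimal binary code for a value in 0..b-1,
 * with 1 <= b < 2^31. The toggle point must be computed and compared
 * before the decoder knows how long the codeword is. */
uint32_t decode_minimal_binary(bitreader *r, uint32_t b)
{
    int k = 0;
    uint32_t toggle, x;

    if (b <= 1)
        return 0;                       /* only one value, zero bits */
    while ((1u << k) < b)               /* k = ceil(log2(b)) */
        k++;
    toggle = (1u << k) - b;             /* the "toggle point" */

    x = get_bits(r, k - 1);
    if (x < toggle)
        return x;                       /* short codeword: k - 1 bits */
    x = (x << 1) | get_bits(r, 1);      /* long codeword: one more bit */
    return x - toggle;
}

/* Rice-style remainder: b = 2^k, so the codeword is always exactly k bits
 * and no toggle point or comparison is needed. */
uint32_t decode_rice_remainder(bitreader *r, int k)
{
    return get_bits(r, k);
}
```

For b = 5 the toggle point is 3, so the bit string 110 decodes to 3 and 01 decodes to 1, whereas a Rice remainder with k = 3 simply reads three bits. The extra comparison and the data-dependent codeword length are what the Rice variant avoids, at the price of the slightly worse compression noted above.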
Table 2.8 Search performance of Rice coding and unique-order interpolative coding (AT is the disk access time, DT is the decoding time, ST=AT+DT is the search time, and SP is the performance relative to the Golomb coding).
Coding method                          Metric     Bible   DBbib    FBIS     LAT    TREC
Rice coding                            AT (µs)      133     286    1305    1345    2462
                                       DT (µs)       74     267    1004    1069    1808
                                       ST (µs)      207     553    2309    2414    4270
                                       SP          1.08    1.06    1.07    1.09    1.08
Unique-order interpolative coding (a)  AT (µs)      119     193    1128    1137    2127
                                       DT (µs)       55     141     747     772    1363
                                       ST (µs)      174     334    1875    1909    3490
                                       SP          1.28    1.75    1.32    1.38    1.32
(a) The boundary and residual pointers are encoded in Rice codes, the (x, lo, hi) triples for the inner pointers are encoded in simple binary codes, and group size g is 4.