
Chapter 3 Redundant Decoding Elimination

3.3 Proposed Two-level Skipped Inverted Files

3.3.2 Proposed skipping mechanism

In this section, we first describe the proposed skipping mechanism based on maximum required bits (MRB) calculation. Then we present the recommended coding method and its MRB function for the document identifiers and the within-document frequencies within a sub-block. Finally, we present the implementation optimization technique.

The design

In this sub-section, we propose a novel skipping mechanism based on maximum required bits (MRB) calculation (cf. Fig. 3.1) to efficiently create a second-level index on each block for the first level of skipping. Consider a given block containing n postings

$(id_1, fq_1), (id_2, fq_2), (id_3, fq_3), \ldots, (id_n, fq_n)$

where $id_i < id_{i+1}$. We first replace the within-document frequency $fq_i$ with $F_i$, where

$$F_i = \sum_{j=1}^{i} fq_j$$

is referred to as the cumulative within-document frequency. Next, a sub-block size $g$ is determined.
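The replacement of within-document frequencies by cumulative frequencies can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names `to_cumulative` and `to_frequencies` are hypothetical, and postings are assumed to be `(id, fq)` tuples with every `fq` at least 1, so that the cumulative values are strictly ascending like the document identifiers.

```python
from itertools import accumulate


def to_cumulative(postings):
    """Replace each fq_i with F_i = fq_1 + ... + fq_i."""
    ids = [doc_id for doc_id, _ in postings]
    cumulative = list(accumulate(fq for _, fq in postings))
    return list(zip(ids, cumulative))


def to_frequencies(cum_postings):
    """Invert the transform: fq_i = F_i - F_(i-1), with F_0 = 0."""
    result, previous = [], 0
    for doc_id, big_f in cum_postings:
        result.append((doc_id, big_f - previous))
        previous = big_f
    return result


if __name__ == "__main__":
    block = [(3, 2), (7, 1), (12, 4)]
    cum = to_cumulative(block)
    print(cum)
    print(to_frequencies(cum) == block)
```

Because the transform is invertible, the original frequencies are recovered by differencing once the cumulative values have been decoded.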

The block is then divided into $m = \lceil n/g \rceil$ sub-blocks, each having $g$ postings except possibly the last sub-block. We define the first posting in each sub-block to be a critical pair consisting of a document identifier and a cumulative within-document frequency, the postings between critical pairs to be inner postings, and those in the last sub-block except the critical pair to be the residual postings. The critical pairs and their subsequent residual postings together can be regarded as a sub-posting list, on which the document identifiers can be encoded in Golomb coding with the d-gap technique and the cumulative within-document frequencies can be encoded in γ coding, also with the d-gap technique. For the inner postings within a sub-block, the document identifiers and the cumulative within-document frequencies are stored separately (cf. Fig. 3.1).

Figure 3.1 Illustration of the proposed skipping mechanism. Assume that the document identifiers in the inner postings are to be compressed with compression method $C_1$, and the cumulative within-document frequencies with compression method $C_2$. The function $MRB_C(x_{j+g} - x_j - 1, g)$ calculates the maximum required bits that need to be allocated to store the strictly ascending integer sequence $x_{j+1}, x_{j+2}, \ldots, x_{j+g-1}$ compressed with method $C$, where $x$ can be either $id$ or $F$ and $C$ can be either $C_1$ or $C_2$.

Assume that the document identifiers in the inner postings are to be compressed with compression method $C_1$, and the cumulative within-document frequencies with compression method $C_2$. We want to find two functions $MRB_{C_1}(DI_i, g)$ and $MRB_{C_2}(DF_i, g)$ that precisely calculate the maximum required bits that need to be allocated to store the document identifiers compressed with method $C_1$ and the cumulative within-document frequencies compressed with method $C_2$, respectively, in the inner postings within the $i$-th sub-block, where $DI_i = IC_{i+1} - IC_i - 1$ and $IC_i$ is the document identifier of the $i$-th critical pair, and $DF_i = FC_{i+1} - FC_i - 1$ and $FC_i$ is the cumulative within-document frequency of the $i$-th critical pair. Since the maximum number of bits for the document identifiers and the cumulative within-document frequencies in the inner postings within a sub-block is known, identifiers and frequencies that are not needed in set operations during query processing can be skipped easily. In this mechanism, the critical pair for the $(i+1)$-th sub-block should be stored before the inner postings for the $i$-th sub-block. Compared with the skipping mechanism proposed by Moffat & Zobel (1996), this mechanism does not require extra bits to specify the location of critical document identifiers.

However, the space overhead of this mechanism may still be high if the estimation function is not accurate. The key to the success of this skipping mechanism is therefore to find efficient coding methods with accurate MRB functions for compressing the document identifiers and the cumulative within-document frequencies in the inner postings within a sub-block.
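The partition into critical pairs, inner postings and residual postings can be illustrated with a short sketch. The function names `split_sub_blocks` and `gap_values` are hypothetical, and the gap is taken as the next critical value minus the current one, minus one, which matches the argument $x_{j+g} - x_j - 1$ in the caption of Figure 3.1. The input is assumed to be a list of `(id, F)` tuples, both strictly ascending.

```python
def split_sub_blocks(cum_postings, g):
    """Split (id, F) postings into critical pairs, inner postings, residual postings."""
    n = len(cum_postings)
    m = -(-n // g)  # ceil(n / g) sub-blocks
    # The first posting of every sub-block is a critical pair.
    critical = [cum_postings[i * g] for i in range(m)]
    # Inner postings lie between two consecutive critical pairs.
    inner = [cum_postings[i * g + 1:(i + 1) * g] for i in range(m - 1)]
    # Postings after the last critical pair are the residual postings.
    residual = cum_postings[(m - 1) * g + 1:] if m > 0 else []
    return critical, inner, residual


def gap_values(critical):
    """Return (DI_i, DF_i) for each pair of consecutive critical pairs."""
    gaps = []
    for (id_a, f_a), (id_b, f_b) in zip(critical, critical[1:]):
        gaps.append((id_b - id_a - 1, f_b - f_a - 1))
    return gaps


if __name__ == "__main__":
    ids = [2, 5, 6, 9, 14, 15, 20, 21, 30, 31]
    cum_f = [1, 3, 4, 8, 9, 10, 12, 15, 16, 20]
    critical, inner, residual = split_sub_blocks(list(zip(ids, cum_f)), g=4)
    print(critical)
    print(inner)
    print(residual)
    print(gap_values(critical))
```

Each `(DI_i, DF_i)` pair is what a decoder would feed to the two MRB functions to learn how many bits the inner postings of sub-block $i$ occupy, and hence how far to jump when those postings are not needed.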

Recommended coding method and its MRB function for inner postings

For the proposed skipping mechanism, interpolative coding is recommended for compressing both the document identifiers and the cumulative within-document frequencies. The reasons are:

(1) Interpolative coding can yield superior compression performance for both document identifiers and cumulative within-document frequencies (Moffat & Stuiver, 2000).

(2) When the group size g is known, Chapter 2 showed that the decoding process for interpolative coding can be greatly accelerated using recursion elimination and loop unwinding, which provides a high query throughput rate.

(3) Consider a sequence of $(g-1)$ numbers $x_{j+1}$ to $x_{j+g-1}$ constrained by $x_j < x_{j+1} < x_{j+2} < \ldots < x_{j+g-1} < x_{j+g}$. When the group size $g=4$, the maximum required bits for interpolative coding can be derived as in Eq. (2.12), which can calculate the maximum required bits for the document identifiers and the cumulative within-document frequencies in the inner postings within a sub-block with very little space overhead.
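To make the notion of maximum required bits concrete, the sketch below finds it by exhaustive search for small gaps. It does not reproduce Eq. (2.12). It assumes a simplified binary interpolative code in which a value with r possible choices costs ⌈log2 r⌉ bits; the thesis may use a different inner code, in which case the numbers would differ. The names `interp_bits` and `mrb_bruteforce` are hypothetical, and D denotes the gap $x_{j+g} - x_j - 1$.

```python
from itertools import combinations


def ceil_log2(r):
    """ceil(log2(r)) for r >= 1, computed exactly with integers."""
    return (r - 1).bit_length()


def interp_bits(values, lo, hi):
    """Bits used to code strictly ascending `values` inside [lo, hi]
    under the assumed fixed-width binary interpolative code."""
    k = len(values)
    if k == 0:
        return 0
    mid = k // 2
    left, x, right = values[:mid], values[mid], values[mid + 1:]
    # x must leave room for len(left) smaller and len(right) larger values.
    choices = (hi - len(right)) - (lo + len(left)) + 1
    return (ceil_log2(choices)
            + interp_bits(left, lo, x - 1)
            + interp_bits(right, x + 1, hi))


def mrb_bruteforce(D, g):
    """Worst-case bits over all ways to place g-1 values in D free slots."""
    return max(interp_bits(c, 1, D)
               for c in combinations(range(1, D + 1), g - 1))


if __name__ == "__main__":
    for D in range(3, 9):
        print(D, mrb_bruteforce(D, 4))
```

Under these assumptions, D = 3, 4, 5 and 6 give 0, 2, 4 and 5 bits for g = 4. The zero at D = 3 reflects that three inner values in three free slots are fully determined by the two surrounding critical values.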

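With interpolative coding, to allow different values of g, one can easily show that

[Equation not recoverable from the source: an expression for the g=4 case involving a ceiling of a logarithm, $\lceil \log(\cdot) \rceil$]

and this can be converted to

[Equation not recoverable from the source]

Applying the same approach, we have

[Equation not recoverable from the source: the corresponding expression for the g=8 case]

and this can be converted to

[Equation (3.3) not recoverable from the source]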

The proposed skipping mechanism can be directly employed to create the first-level index by dividing the compressed posting list into blocks each containing g postings. Table 3.2 shows the size of the inverted files constructed using the proposed skipping mechanism with different g values.

The results show that this skipping mechanism can efficiently support smaller sub-blocks. For g=4 and g=8, the inverted files constructed using this mechanism are even smaller than a compressed inverted file in which the document identifiers are compressed in Golomb codes with the d-gap technique and the within-document frequencies are in γ codes. Note that the file size increases as the value of g increases, so this skipping mechanism works best for smaller blocks.

When this skipping mechanism is employed to create the second-level index, optimizing the performance of ranked queries requires a small sub-block size g. For a simple and space-efficient implementation, we suggest g=4. Note that when applying this skipping mechanism to a blocked inverted file to create the second-level index on each block, a unary code should be added in each block to indicate the number of sub-blocks in the block. Other coding methods are not ruled out. We are still looking for a faster and more effective coding method to encode the document identifiers or the cumulative within-document frequencies.

Table 3.2 Sizes of inverted files constructed using the proposed skipping mechanism with different g values.

Inverted file organization                    Size (MB)    Size (%)
Compressed inverted file                        93.28       100.0
Proposed skipping mechanism, g=4                89.33        95.8
Proposed skipping mechanism, g=8                93.06        99.8
Proposed skipping mechanism, g=16               96.21       103.1

Implementation optimization

To skip over unnecessary inner postings, this skipping mechanism requires calculating the maximum required bits for both document identifiers and cumulative within-document frequencies.

We observed that in most cases the gap value $D$ in Eq. (3.1) is less than 256. Therefore, a 256-entry array $z$ is used to facilitate the calculation of the maximum required bits, where $z[i] = MRB_{interp}(i, g=4)$ with $i = x_{j+g} - x_j - 1$, for $3 \le i \le 255$. Whenever the gap value in Eq. (3.1) is less than 256, we can obtain the corresponding maximum required bits with only one array access. This greatly reduces the CPU time and improves query performance.
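The table lookup can be sketched as below. The function `mrb_interp_g4` is a stand-in, not the thesis formula: it assumes a binary interpolative code in which a value with r possible choices costs ⌈log2 r⌉ bits, so the middle inner value costs ⌈log2(D−2)⌉ bits and the two outer values share the remaining D−1 positions. Any exact MRB function could be substituted without changing the lookup logic. The names `Z`, `TABLE_SIZE` and `max_required_bits` are hypothetical.

```python
def ceil_log2(r):
    """ceil(log2(r)) for r >= 1, computed exactly with integers."""
    return (r - 1).bit_length()


def mrb_interp_g4(D):
    """Assumed worst-case bits for three inner values in D free slots (D >= 3)."""
    middle = ceil_log2(D - 2)
    # The outer two values split the remaining D - 1 positions as a + b.
    sides = max(ceil_log2(a) + ceil_log2(D - 1 - a) for a in range(1, D - 1))
    return middle + sides


TABLE_SIZE = 256
# Entries 0..2 are unused because three inner values need a gap of at least 3.
Z = [None, None, None] + [mrb_interp_g4(D) for D in range(3, TABLE_SIZE)]


def max_required_bits(D):
    """One array access for small gaps, direct computation otherwise."""
    if D < TABLE_SIZE:
        return Z[D]
    return mrb_interp_g4(D)


if __name__ == "__main__":
    print([max_required_bits(D) for D in (3, 4, 5, 6, 300)])
```

The table is built once at start-up, so the per-sub-block cost during query processing is a bounds check and an index operation whenever the gap is below 256.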

3.4 Performance Evaluation

This section presents our experiments to evaluate the efficiency of various inverted file organizations. We used the standard (un-skipped) compressed inverted file as the baseline, in which d-gaps are encoded in Golomb codes with the parameter b chosen appropriately for each posting list (Witten et al., 1999), and within-document frequencies are encoded in γ codes (Bell et al., 1993; Moffat & Zobel, 1992). This baseline is then used to evaluate other fine-tuned skipped inverted file organizations.

Four skipped inverted file organizations are evaluated in our experiments: the skipped inverted file (described in Section 3.1.1), the blocked inverted file (described in Section 3.1.2), the skipped inverted file with the second-level index, and the blocked inverted file with the second-level index. The second-level index is created using the skipping mechanism (g=4) described in Section 3.3.2.

All experiments were run on an Intel P4 2.4GHz PC with 512MB DDR memory running Linux 2.4.12. The hard disk was 40GB, and the data transfer rate was 25MB/sec. Intervening processes and disk activities were kept to a minimum during experimentation.

In Section 3.4.1, we present the sizes for various inverted file organizations. In Section 3.4.2, we present the time taken to process the generated queries described in Section 3.2 to measure the query performance of various inverted file organizations.
