Chapter 5 Parallel IR
5.3 Framework of Proposed Approach
5.4.1 Test collection and query set
Three document collections were used in the experiments. Their statistics are listed in Table 2.
In this table, N denotes the number of documents; n is the number of distinct terms; F is the total number of terms in the collection; and f indicates the number of document identifiers that appear in an inverted file. The collections FBIS (Foreign Broadcast Information Service) and LAT (LA Times) are disk 5 of the TREC-6 collection that is used internationally as a test bed for research in IR techniques (Voorhees and Harman 1997). The collection TREC includes the FBIS and LAT.
Table 5.1 Statistics of document collections
Collection
FBIS LAT TREC
# of documents N 130,471 131,896 262,367
# of terms F 72,922,893 72,087,460 145,010,353
# of distinct terms n 214,310 168,251 317,393
# of document identifier count f 28,628,698 32,483,656 61,112,354
Total size (Mbytes) 470 475 945
We followed the method (Moffat & Zobel, 1996) to evaluate performance with random queries.
For each document collection, 300 documents were randomly selected to generate a query set. A query was generated by selecting words from the word list of a specific document. To form the word list of a document, words in the document were folded to lower case, and stop words such as
“the” and “this” were eliminated. The number of terms per query ranged from 1 to 65. For example, a query containing 5 terms may be “inverted file document collection built”. For each query, there existed at least one document in the document collection that is relevant to the query. We also made the generated query set for each document collection have the following characteristics: (1) Query repetition frequencies followed a Zipf distribution 10.6
~ ) (q ρ
Pr , where Pr(q) is the probability of
query q appearing in generated query set, and ρ is the popularity rank of query q; (2) The terms per query distribution followed the shifted negative binomial distribution
1
f , where f(x) is the probability of a query containing x words. This
made the distribution of generated queries closely resemble the distribution of real queries (Xie &
O’Hallaron, 2002; Wolfram, 1992).
5.4.2 Performance results
This subsection shows the experimental results. These results include: (1) speedup of parallel query processing, and (2) compression efficiency.
Speedup of parallel query processing
This subsection investigates the DIA problem in an IRS that runs on a cluster of workstations.
Assuming k workstations, the inverted file is generally partitioned into k disjoint sub-files, each for one workstation. Table 5.2 shows the performance of parallel query processing using interleaving partitioning scheme with either the Default algorithm or the PBDIA algorithm, where the Default
algorithm means that the documents in a collection are assigned document identifiers in chronological order. The Default algorithm is widely used in modern IRSs, and it already captures some clustering nature. Hence, the Default algorithm can serve a rigid baseline in comparison with the PBDIA algorithm. The metric is the speedup relative to sequential query processing with Default algorithm. Experiments were conducted on the TREC collection. The sub-file on each workstation was compressed using the unique-order interpolative coding method (g=4). The parallel query processing time was defined as max[T1,T2,…,Tk], where Ti (1≤i≤k) was the time needed to retrieve and decompress the (partial) posting lists for the query terms on the ith workstation. The experimental results show that the interleaving partitioning scheme can yield near-ideal speedups, as reported in Ma et al. (2002). In addition, using the PBDIA algorithm to enhance the clustering property of posting lists for frequently used query terms, the interleaving partitioning scheme yields super-linear speedups. Hence the DIA problem should deserve much attention in parallel IR.
Table 5.2 Speedup of parallel query processing
The number of workstations Collection Approach
1a 2 4 6 8 10
FBIS Default + Interleaving partitioning 1.00 1.89 3.73 5.58 7.41 9.30 PBDIA + Interleaving partitioning 1.14 2.16 4.26 6.37 8.45 10.60 LAT Default + Interleaving partitioning 1.00 1.90 3.76 5.63 7.46 9.37
PBDIA + Interleaving partitioning 1.18 2.25 4.44 6.65 8.80 11.04 TREC Default + Interleaving partitioning 1.00 1.90 3.75 5.61 7.44 9.35
PBDIA + Interleaving partitioning 1.17 2.23 4.41 6.57 8.70 10.93
a Without interleaving partitioning
Compression Efficiency
To reduce average query processing time of parallel query processing, the PBDIA algorithm improves the compression efficiency for the frequently used query terms. However, this is at the
cost of sacrificing the compression efficiency for the less frequently used query terms. We need to know how much space overhead is needed to trade for this speed advantage. Average bits per document identifier of the different partitioning approaches are shown in Table 5.3. The sub-file on each workstation was compressed using the unique-order interpolative coding method (g=4).
Results in Table 5.3 show that the PBDIA algorithms can speed up query processing with very little or no storage overhead.
Table 5.3 Compression performance of different partitioning approaches
The number of workstations Collection Approach
1a 2 4 6 8 10
FBIS Default + Interleaving partitioning 4.86 4.88 4.86 4.85 4.83 4.82 PBDIA + Interleaving partitioning 4.95 4.98 4.96 4.95 4.95 4.94 LAT Default + Interleaving partitioning 5.22 5.23 5.23 5.21 5.19 5.17
PBDIA + Interleaving partitioning 5.01 5.02 5.01 5.01 4.99 4.97 TREC Default + Interleaving partitioning 5.10 5.13 5.12 5.10 5.07 5.05
PBDIA + Interleaving partitioning 5.08 5.11 5.08 5.07 5.05 5.04
a Without interleaving partitioning
5.5 Summary
This chapter is to propose an inverted file partitioning algorithm for parallel information retrieval. The inverted file is generally partitioned into disjoint sub-files, each for one workstation, in an IRS that runs on a cluster of workstations. When processing a query, all workstations have to consult only their own sub-files in parallel. The objective of this chapter is to develop an inverted file partitioning algorithm that minimizes the average query processing time of parallel query processing. Our approach is as follows. The foundation is interleaving partitioning scheme, which generates a partitioned inverted file with interleaved mapping rule and produces near-ideal speedup.
The key idea of our proposed algorithm is to use the document identifier assignment algorithm to enhance the clustering property of posting lists for frequently used query terms. This can aid the interleaving partitioning scheme to produce superior query performance.