Chapter 5 Parallel IR

5.2 Fundamental: Interleaving Partitioning Scheme

In Section 5.2.1, we describe the well-known interleaving partitioning scheme, which applies the interleaved mapping rule to generate a partitioned inverted file and produces a near-ideal speedup. In Section 5.2.2, we describe how to improve the average processing time through document identifier assignment on the partitioned inverted file generated with the interleaving partitioning scheme.

5.2.1 Algorithm description

Figure 5.1 shows the idea of the interleaved mapping rule. Each workstation is assigned a set of interleaved document identifiers. Let M be the number of workstations and N be the number of documents. The rule for mapping document identifiers to workstations is as follows.

Rule 1 The interleaved mapping rule maps a document identifier i to a workstation WSk with a function A_intlv:

$$A_{intlv}(i) = k = i - \left\lfloor \frac{i-1}{M} \right\rfloor \times M \tag{5.1}$$

With the interleaved mapping rule, the postings in a posting list are expected to be evenly distributed across the workstations regardless of how the document identifiers are clustered.
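Rule 1 can be illustrated with a few lines of Python. This is only a minimal sketch: the names `workstation_of`, `assignment`, `clustered` and `load` are hypothetical and do not come from the original text, and the clustered range 100..111 is an arbitrary example chosen to show that a tightly clustered run of identifiers is still spread evenly.

```python
from collections import Counter


def workstation_of(doc_id: int, num_workstations: int) -> int:
    """Rule 1 (Eq. 5.1): map a 1-based document identifier to a workstation in 1..M."""
    if doc_id < 1 or num_workstations < 1:
        raise ValueError("doc_id and num_workstations must be positive")
    return doc_id - ((doc_id - 1) // num_workstations) * num_workstations


# Document identifiers 1..9 on M = 3 workstations, as in Figure 5.1(a).
assignment = {i: workstation_of(i, 3) for i in range(1, 10)}
print(assignment)

# A clustered run of identifiers (arbitrary example) is still spread evenly.
clustered = range(100, 112)
load = Counter(workstation_of(i, 3) for i in clustered)
print(dict(load))  # {1: 4, 2: 4, 3: 4}
```

Because consecutive identifiers always go to consecutive workstations, any run of M consecutive identifiers contributes exactly one posting to each workstation.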

The mapping rule A_intlv increases the gap between document identifiers after partitioning: the gap between consecutive document identifiers in a local posting list is at least M. Compression methods therefore cannot work well on the local inverted file if documents are represented with the original document identifiers. To preserve compression efficiency, each workstation represents documents using local document identifiers. For a workstation WSk, the local document identifier of a document identifier i mapped to WSk is given by the following rule.

Rule 2 In the partitioned inverted file generated by the interleaved mapping rule, a document i is represented by the local document identifier LID_intlv(i):

$$LID_{intlv}(i) = \left\lfloor \frac{i-1}{M} \right\rfloor + 1 \tag{5.2}$$

document identifiers:   1    2    3    4    5    6    7    8    9
workstation:            WS1  WS2  WS3  WS1  WS2  WS3  WS1  WS2  WS3

(a) Mapping document identifiers to workstation IDs

posting list: 2, 3, 5, 7, 8, 11, 12, 13, 15, 16

                                                  WS1          WS2            WS3
represented using original document identifier:   7, 13, 16    2, 5, 8, 11    3, 12, 15
represented using local document identifier:      3, 5, 6      1, 2, 3, 4     1, 4, 5

(b) Partitioning a posting list

Figure 5.1 Partitioning with interleaved mapping rule
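The effect of the two representations on the d-gaps can be checked on the WS2 sub-posting list of Figure 5.1(b). The following is a small sketch under the assumption that the first d-gap of a list is its first identifier; the helper names `d_gaps` and `local_id` are hypothetical.

```python
def d_gaps(sorted_ids):
    """Return the d-gaps of an ascending identifier list (first gap = first id)."""
    gaps, previous = [], 0
    for doc_id in sorted_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def local_id(doc_id: int, num_workstations: int) -> int:
    """Rule 2 (Eq. 5.2): local identifier of a 1-based global document identifier."""
    return (doc_id - 1) // num_workstations + 1


M = 3
ws2_original = [2, 5, 8, 11]  # sub-posting list of WS2 in Figure 5.1(b)
ws2_local = [local_id(i, M) for i in ws2_original]

print(d_gaps(ws2_original))  # [2, 3, 3, 3]
print(ws2_local)             # [1, 2, 3, 4]
print(d_gaps(ws2_local))     # [1, 1, 1, 1]
```

With the original identifiers, every gap after the first is a multiple of M; with local identifiers the same postings produce gaps that are M times smaller, which is what a gap-based code benefits from.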

Note that the original document identifier i mapped to WSk can then be recovered using the following equation:

$$i = M \times \left( LID_{intlv}(i) - 1 \right) + k \tag{5.3}$$

Figure 5.2 presents the algorithm to generate a partitioned inverted file with the interleaved mapping rule. The time complexity is O(f), where f is the number of postings in the input inverted file.

Algorithm Interleaving_partitioning_scheme

Input:

IF: the inverted file for sequential query processing. IF consists of a set of posting lists PLt for each term t.

Output:

LIF={LIF1,LIF2,…,LIFM}: the set of local inverted files LIFk for each workstation WSk. Each LIFk consists of a set of local posting lists PLt(WSk) for each term t.

Method:

1. for each term t do
   1.1 for each document identifier i ∈ PLt do
       1.1.1 k ← i − ⌊(i−1)/M⌋ × M
       1.1.2 i′ ← ⌊(i−1)/M⌋ + 1
       1.1.3 append i′ to PLt(WSk)

Figure 5.2 Interleaving partitioning scheme
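The algorithm of Figure 5.2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the original implementation: the inverted file is assumed to be an in-memory dictionary from terms to ascending lists of 1-based document identifiers, postings carry no payload, and the names `interleaving_partition` and `original_id` are hypothetical.

```python
from collections import defaultdict


def interleaving_partition(inverted_file, num_workstations):
    """Split each posting list over M workstations, storing local identifiers.

    inverted_file: dict mapping a term to an ascending list of 1-based doc ids.
    Returns a dict: workstation index k (1..M) -> local inverted file (term -> list).
    """
    local_files = {k: defaultdict(list) for k in range(1, num_workstations + 1)}
    for term, posting_list in inverted_file.items():
        for i in posting_list:
            block = (i - 1) // num_workstations
            k = i - block * num_workstations        # step 1.1.1, Eq. (5.1)
            local_files[k][term].append(block + 1)  # steps 1.1.2 and 1.1.3, Eq. (5.2)
    return {k: dict(lif) for k, lif in local_files.items()}


def original_id(local, k, num_workstations):
    """Eq. (5.3): recover the global identifier from a local one on workstation k."""
    return num_workstations * (local - 1) + k


inverted_file = {"t": [2, 3, 5, 7, 8, 11, 12, 13, 15, 16]}  # posting list of Figure 5.1(b)
partitioned = interleaving_partition(inverted_file, 3)
print(partitioned)
# {1: {'t': [3, 5, 6]}, 2: {'t': [1, 2, 3, 4]}, 3: {'t': [1, 4, 5]}}
```

Each posting is visited once and appended in constant time, which matches the O(f) bound; because the input lists are ascending, the local lists stay ascending without a separate sort.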

5.2.2 How to improve parallel query processing through document identifier assignment

In this subsection, we use an example to show how to improve parallel query processing through document identifier assignment. Consider a term t that appears in documents d1, d3, d4, d6, d8, d10, d18, d22, d23, d26, d34, d35, d45, d46, and d47, and a cluster of two workstations. We have two document identifier assignments, DIA I and DIA II (cf. Figure 5.3). The notation di→j in DIAs I and II denotes that the document identifier j is assigned to the document di. For each DIA, we obtain a posting list PLt for term t, and PLt is partitioned into two sub-posting lists PLt(WS1) and PLt(WS2) using the interleaving partitioning scheme. Assume that all sub-posting lists are encoded in γ codes with the d-gap technique, where the γ code represents an integer x in $1 + 2\lfloor \log_2 x \rfloor$ bits.

Term t appears in documents d1, d3, d4, d6, d8, d10, d18, d22, d23, d26, d34, d35, d45, d46, d47.

(a) DIA I: { d1→1, d3→3, d4→4, d6→6, d8→8, d10→10, d18→18, d22→22, d23→23, d26→26, d34→34, d35→35, d45→45, d46→46, d47→47 }

(1) The posting list PLt for DIA I

PLt: <1, 3, 4, 6, 8, 10, 18, 22, 23, 26, 34, 35, 45, 46, 47>

(2) The posting list PLt for DIA I is partitioned into two sub-posting lists PLt(WS1) and PLt(WS2) using the interleaving partitioning scheme (α is a constant)

(i) original document identifier representation

sub-posting lists                                bits after compression   QPT   PQPT
PLt(WS1): <1, 3, 23, 35, 45, 47>                 30 bits                  30α   45α
PLt(WS2): <4, 6, 8, 10, 18, 22, 26, 34, 46>      45 bits                  45α

(ii) local document identifier representation

sub-posting lists                                bits after compression   QPT   PQPT
PLt(WS1): <1, 2, 12, 18, 23, 24>                 20 bits                  20α   27α
PLt(WS2): <2, 3, 4, 5, 9, 11, 13, 17, 23>        27 bits                  27α

(b) DIA II: { d1→1, d3→2, d4→3, d6→4, d8→5, d10→6, d18→7, d22→8, d23→9, d26→10, d34→11, d35→12, d45→13, d46→14, d47→15 }

(1) The posting list PLt for DIA II

PLt: <1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15>

(2) The posting list PLt for DIA II is partitioned into two sub-posting lists PLt(WS1) and PLt(WS2) using the interleaving partitioning scheme (α is a constant)

(i) original document identifier representation

sub-posting lists                                bits after compression   QPT   PQPT
PLt(WS1): <1, 3, 5, 7, 9, 11, 13, 15>            22 bits                  22α   22α
PLt(WS2): <2, 4, 6, 8, 10, 12, 14>               21 bits                  21α

(ii) local document identifier representation

sub-posting lists                                bits after compression   QPT   PQPT
PLt(WS1): <1, 2, 3, 4, 5, 6, 7, 8>               8 bits                   8α    8α
PLt(WS2): <1, 2, 3, 4, 5, 6, 7>                  7 bits                   7α

Figure 5.3 An example to show how to improve parallel query processing through document identifier assignment. There are two workstations in the cluster. The interleaving partitioning scheme is employed to partition the posting list PLt. All sub-posting lists are encoded in γ codes with the d-gap technique. QPT is the query processing time and PQPT is the parallel query processing time.

Based on Eq. (4.4), we can derive the query processing time (QPT) of WS1 for term t and that of WS2 for term t. The parallel query processing time (PQPT) is then the time at which the last workstation finishes its job. This example confirms that the local document identifier representation improves the compression efficiency: under DIA I, for instance, the two sub-posting lists shrink from 30 and 45 bits to 20 and 27 bits. We also observe that the compression efficiency of DIA II is better than that of DIA I. This implies that the query processing time of DIA II is shorter than that of DIA I, since the query processing time is proportional to the total size of the encoded posting list. The parallel query processing time of DIA II is also shorter than that of DIA I. Hence, this example shows that the clustering property of the posting list plays an important role in the interleaving partitioning scheme.
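The bit counts and PQPT values of Figure 5.3 can be reproduced with a short script. This is a sketch under two assumptions taken from the figure rather than from Eq. (4.4), which is not restated in this section: the QPT of a workstation is α times the number of compressed bits it reads, and the first d-gap of a list is its first identifier. The function names are hypothetical.

```python
def gamma_bits(x: int) -> int:
    """Length in bits of the gamma code of a positive integer: 1 + 2*floor(log2 x)."""
    return 1 + 2 * (x.bit_length() - 1)


def compressed_bits(sorted_ids):
    """Total gamma-coded size of an ascending posting list stored as d-gaps."""
    total, previous = 0, 0
    for doc_id in sorted_ids:
        total += gamma_bits(doc_id - previous)
        previous = doc_id
    return total


def partition_sizes(posting_list, num_workstations, use_local_ids):
    """Compressed bits per workstation after interleaved partitioning of one list."""
    parts = {k: [] for k in range(1, num_workstations + 1)}
    for i in posting_list:
        block = (i - 1) // num_workstations
        k = i - block * num_workstations
        parts[k].append(block + 1 if use_local_ids else i)
    return {k: compressed_bits(ids) for k, ids in parts.items()}


dia_1 = [1, 3, 4, 6, 8, 10, 18, 22, 23, 26, 34, 35, 45, 46, 47]
dia_2 = list(range(1, 16))

for name, plist in (("DIA I", dia_1), ("DIA II", dia_2)):
    for use_local in (False, True):
        bits = partition_sizes(plist, 2, use_local)
        # Assumed cost model: QPT of WS_k = alpha * bits[k]; PQPT = slowest workstation.
        label = "local" if use_local else "original"
        print(name, label, bits, "PQPT =", max(bits.values()), "alpha")
```

Under these assumptions the script yields 30/45, 20/27, 22/21 and 8/7 bits for the four cases, so the PQPT drops from 45α to 8α when both local identifiers and the clustered assignment DIA II are used.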

In the document: Inverted File Design for Large-Scale Information Retrieval Systems (大型資訊檢索系統之轉置檔案設計), pages 114-118