Load and storage balanced posting file partitioning for parallel information retrieval

Yung-Cheng Ma a,∗, Chung-Ping Chung b, Tien-Fu Chen c

a Department of Computer Science and Information Engineering, Chang-Gung University, Kwei-Shan, Tao-Yuan, Taiwan
b Department of Computer Science and Information Engineering, National Chiao-Tung University, Hsinchu, Taiwan
c Department of Computer Science and Information Engineering, National Chung-Cheng University, Chiayi, Taiwan

Article info

Article history: Received 24 March 2010; Received in revised form 12 November 2010; Accepted 12 January 2011; Available online 1 February 2011

Keywords: Load balancing; Storage balancing; Parallel information retrieval; Inverted file

Abstract

Many recent major Internet search engines use a large-scale cluster to store a large database and cope with a high query arrival rate. To design a large-scale parallel information retrieval system, both performance and storage cost have to be taken into integrated consideration, and a quantitative method to design the cluster systematically is required. This paper proposes a posting file partitioning algorithm for these requirements. The partitioning follows the partition-by-document-ID principle to eliminate communication overhead. The kernel of the partitioning is a data allocation algorithm that allocates variable-sized data items for both load and storage balancing. The data allocation algorithm is proven to satisfy a load balancing constraint with asymptotically 1-optimal storage cost. A probability model is established such that query processing throughput can be calculated from keyword popularities and the data allocation result. With these results, we show a quantitative method to design a cluster systematically. This research provides a systematic approach to large-scale information retrieval system design with the following features: (1) the differences from ideal load balancing and storage balancing are negligible in real-world applications; (2) both load balancing and storage balancing can be taken into integrated consideration without conflict; (3) the data allocation algorithm can deal with data items of variable sizes and variable loads. An algorithm having all these features together has never been achieved before and is the key factor in achieving a load and storage balanced workstation cluster in a real-world environment.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

This paper studies parallel information retrieval on a cluster of workstations. The research objective is to minimize the hardware cost of the cluster to satisfy a given throughput requirement. The cluster consists of a set of identical workstations. The posting file, a data structure for information retrieval, is partitioned onto the workstations. A query is processed in parallel by the workstations. The hardware cost of the cluster depends on the cluster configuration: the number of workstations and the storage capacity per workstation. Achieving the research objective lies in posting file partitioning that efficiently uses the processing and storage capabilities of the workstations.

Information retrieval on parallel and distributed systems has been widely studied, but none of the studies has fully considered the requirements of contemporary major search engines. Previous studies (Jeong and Omiecinski, 1995; Tomasic and Molina, 1995; Riberio-Neto et al., 1998; MacFarlane et al., 2000; Moffat et al., 2006; Barroso et al., 2003; Cacheda et al., 2007; Badue et al., 2001; Lucchese et al., 2007; Moffat et al., 2007) investigated data allocation for high performance information retrieval. In these studies, storage efficiency is not considered, whereas complex simulation is required for performance evaluation. In recent years, many major search engines use a large-scale cluster to store a huge amount of data and face a high query arrival rate. Reducing storage cost is important, and a quantitative method to design a cluster is desired. This paper tackles these requirements.

∗ Corresponding author. Tel.: +886 3 2118800; fax: +886 3 2118700.
E-mail address: ycma@mail.cgu.edu.tw (Y.-C. Ma).

The primary work is load and storage balanced posting file partitioning. The objective of the partitioning is to minimize the storage requirement per workstation subject to a limited mean query processing time. Mean query processing time is estimated with popularities of keyword terms. The issues to be dealt with are (1) load and storage balanced data allocation, and (2) a popularity-based posting file partitioning model.

The first issue is to allocate a set of items, each item being associated with a load and a data size, onto a set of workstations. The


objective of the data allocation is storage balancing subject to a load balancing constraint. The second issue is to reduce posting file partitioning to load and storage balanced data allocation. Storage balancing reduces the storage requirement and load balancing reduces mean query processing time. In the partitioning model, item loads are defined in terms of popularities of keyword terms. With the posting file partitioning algorithm, we show a systematic approach to determine the cluster configuration for the research objective. Contributions of this work are

(1) An asymptotically 1-optimal algorithm for load and storage balanced data allocation. This algorithm allocates variable-sized data items onto a set of workstations with the solution quality on load and storage balancing formally proved.

(2) A probabilistic posting file partitioning model to avoid communication overhead in parallel information retrieval. With this model, a query is processed in parallel without having to transfer postings between workstations. Moreover, this model formulates the posting file partitioning problem as the load and storage balanced data allocation problem. With this model, load balancing refers to maximizing the average throughput of the cluster of workstations.

As a result of these contributions, a quantitative method is proposed to design a clustered search engine from statistical data. The usefulness of the posting file partitioning in real-world applications is evaluated with the TREC (Hardman, 1992) document collection.

This paper is organized as follows. Section 2 describes basic knowledge of information retrieval. Section 3 defines the concerned data allocation problem and describes related work on data allocation. Section 4 describes the proposed data allocation algorithm. Section 5 describes how a query is processed in parallel. Section 6 describes how the proposed data allocation algorithm is applied for posting file partitioning. Section 7 describes a quantitative method to design a cluster with the proposed posting file partitioning algorithm. Finally, conclusions are given in Section 9.

2. Background and related work

This section presents the background to devise our posting file partitioning algorithm. Section 2.1 describes the fundamental knowledge, such as the inverted file, on information retrieval. Section 2.2 presents our survey on parallel information retrieval systems.

2.1. Fundamentals on information retrieval

This section describes information retrieval concepts and analyzes its complexity to address the research issues. An information retrieval system receives user queries and responds with a set of matched documents for each query. A query is a Boolean expression in which each operand is a keyword term. A document either matches or mismatches a query in a binary fashion. For each query, set operations (∩, ∪, etc.) are performed to compute the answer list, which is the set of all document IDs of matched documents. The notation ANS_q denotes the answer list for query q, and the notation L_t denotes the set of all document IDs referring to documents containing term t. For the query q = (processor AND text), the answer list is

ANS_q = L_processor ∩ L_text,

which indicates the set of all documents containing both the term “processor” and the term “text”.

An answer list of a query is computed using the inverted file (Frakes and Baeza-Yates, 1992; Witten et al., 1999). An example

inverted file is depicted in Fig. 1. An inverted file consists of an index file and a posting file. The index file is a set of records, each containing a keyword term t and a pointer to the posting list of term t in the posting file. The posting list of term t is a sorted list of L_t. An entry in the posting list is called a posting. To process a query, the system first searches the index file and then performs set operations on the posting lists of the queried terms. The set operation results can be obtained by simply merging posting lists according to the Boolean operators if the posting lists are sorted (Salton, 1989).
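The merge-based set operation above can be illustrated with a short sketch (the function name and sample posting lists are ours, not from the paper):

```python
def intersect_postings(list_a, list_b):
    """Merge two sorted posting lists, keeping document IDs present in both.

    Runs in O(len(list_a) + len(list_b)) time, matching the merge-based
    set operations on sorted posting lists described above.
    """
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

# ANS_q = L_processor ∩ L_text for the query (processor AND text)
L_processor = [2, 5, 8, 13, 21]
L_text = [3, 5, 9, 13, 34]
print(intersect_postings(L_processor, L_text))  # [5, 13]
```

Union and difference follow the same linear merge pattern, with only the emit condition changed.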

The time complexity of query processing is as follows. The time to search the index file is no more than O(m × log n) (Frakes and Baeza-Yates, 1992), where m is the number of queried terms and n is the number of all indexed terms. Zipf's law (Salton, 1989; Zipf, 1949) states that 95% of the words in documents fall in a vocabulary with no more than 8000 distinct terms, and m is usually small. The complexity on the index file side is therefore not critical. Let f_{t_i} be the length of the posting list of a queried term t_i. The time to retrieve and merge the posting lists is O(f_{t_1} + f_{t_2} + · · · + f_{t_m}). The length of a posting list increases with the size of the document collection. Adding a document to the collection adds one posting to the posting list of each term appearing in the document. The challenge is to attack the complexity on the posting file side: we tackle this problem by proposing a posting file partitioning algorithm for parallel query processing.

2.2. Related work on parallel information retrieval

This section presents our survey on parallel information retrieval. The general framework of a parallel information retrieval system with a cluster of workstations is described. The key issue in designing such a parallel information retrieval system is inverted file partitioning. This section gives a brief description of previous work on inverted file partitioning.

Cacheda et al. (2007) describe a general framework of parallel information retrieval. In this framework, a set of brokers is responsible for receiving user queries and delivering query results to users through the Internet. Upon receiving a query, a broker forwards the query to a set of query servers. The inverted file is partitioned across the query servers, and the query servers work together to find the query results. Our work studies the inverted file partitioning scheme for parallel query processing.

The key issue in designing a parallel information retrieval system is inverted file partitioning. How the inverted file is partitioned determines how queries are processed in parallel. The system performance, both throughput and response time, depends on the inverted file partitioning. For high performance, inverted file partitioning has to balance workload and reduce communication overhead among workstations. To reduce storage cost, storage balancing is required for a homogeneous cluster. In this paper, we propose an inverted file partitioning algorithm that takes all these considerations into integrated account.

Previous researchers (Tomasic and Molina, 1995; Jeong and Omiecinski, 1995) state that there are two ways to partition an inverted file: the partition-by-term and the partition-by-document scheme. With the partition-by-term scheme, the partitioner finds a mapping from indexed terms to workstations; a workstation stores a subset of the inverted lists of the inverted file. With the partition-by-document scheme, the partitioner finds a mapping from documents to workstations; a workstation stores an inverted file covering a subset of the document collection. We briefly describe algorithms with the partition-by-term scheme (Moffat et al., 2007; Lucchese et al., 2007) and algorithms with the partition-by-document scheme (Cacheda et al., 2005, 2007; Barroso et al., 2003).

Several articles (Tomasic and Molina, 1995; Jeong and Omiecinski, 1995; MacFarlane et al., 2000; Badue et al., 2001) reported performance comparisons of the two schemes. We


Fig. 1. Inverted file.

summarize the comparisons as follows. The advantages of partition-by-document scheme are

(1) avoid the communication overhead of transferring posting lists between workstations,

(2) easy to achieve good load balancing (MacFarlane et al., 2000), and

(3) good scalability with the increase of document collection (MacFarlane et al., 2000).

The disadvantage of the partition-by-document scheme is the long disk seek time to retrieve an inverted list from disks (Moffat et al., 2007).

Moffat et al. (2007) proposed a load balanced partitioning algorithm with the partition-by-term scheme. Workloads of workstations are estimated from term popularities. The algorithm assigns inverted lists to workstations with a fill-smallest policy for load balancing. Moreover, inverted lists of hot keyword terms may be replicated to multiple workstations to resolve overloading.

Lucchese et al. (2007) proposed an inverted file partitioning algorithm to improve both query processing throughput and response time. The algorithm follows the partition-by-term scheme. Inverted lists are assigned to workstations according to term popularities for load balancing. To improve query response time, a clustering scheme is applied to group keyword terms frequently appearing in the same query. The algorithm then assigns the inverted lists of grouped terms to the same workstation. The clustering scheme reduces the communication overhead between query servers.

Cacheda et al. (2005, 2007) proposed analytical models to analyze the performance of a cluster with the partition-by-document scheme. The service rates of brokers, query servers, and network transfer are considered in the model. The effect of replicating brokers and query servers is also analyzed. However, the effect of the documents-to-workstations mapping is not considered in this model. Our work builds a quantitative model to analyze the effect of the document mapping scheme.

The Google search engine (Barroso et al., 2003) follows the partition-by-document scheme with replication. The inverted file is partitioned into several pieces named “index shards”. An index shard covers a randomly chosen subset of all documents and is replicated to a pool of workstations. To process a query, the query has to be broadcast to a pool of query servers covering the whole document collection. In recent years, the indexing system was rewritten with the MapReduce (Dean and Ghemawat, 2008) programming scheme for better scalability.

3. Fundamentals of data allocation

Development of the posting file partitioning algorithm starts from this section. This section defines the concerned data allocation problem and surveys related work. The remaining sections propose a data allocation algorithm and describe how posting file partitioning is reduced to the data allocation problem.

3.1. Load and storage balanced data allocation model

The concerned data allocation problem is as follows. The input to the data allocation algorithm is a set of data items I = {I_0, I_1, ..., I_{N−1}} and a set of workstations WS = {WS_0, WS_1, ..., WS_{M−1}}. Each item I_i is associated with a load Load(I_i) and a size s_i. We normalize the sizes such that

0 < s_i ≤ 1.00 and max_{I_i}{s_i} = 1.00 for 0 ≤ i < N.

The output is an allocation X that allocates the items in I onto the workstations in WS. Replicating an item to multiple workstations is not allowed. The objective is to minimize the storage requirement per workstation subject to a certain load balancing requirement. We formally specify an allocation to formulate the optimization problem. An allocation without replication is specified as follows. Fig. 2 depicts an example of such an allocation. An allocation is a matrix X in which

• each row corresponds to an item to be allocated and each column corresponds to a workstation,

• each entry is either 0 or 1, and

• there exists a unique 1 in each row of X.

The entry at row i and column k, denoted X_{ik}, is set to 1 if item I_i is allocated on workstation WS_k. Note that each item is allocated on a unique workstation. The load of WS_k is the total load of all items allocated on WS_k:

Load_X(WS_k) = Σ_{i=0}^{N−1} X_{ik} × Load(I_i) = Σ_{I_i : X_{ik}=1} Load(I_i).   (1)


Fig. 2. (a–d) Example of allocation without item replication.

The data size allocated on WS_k is the total size of all items allocated on WS_k:

DS_X(WS_k) = Σ_{i=0}^{N−1} X_{ik} × s_i = Σ_{I_i : X_{ik}=1} s_i.   (2)

The objective of the allocation is storage balancing subject to a load balancing constraint. Storage balancing is to minimize the maximum amount of data allocated on a single workstation. That is, to minimize

max_{WS_k} {DS_X(WS_k)}.   (3)

The constraint is that the load imbalance is within the load of some item. That is,

Load_X(WS_k) ≤ L/M + max_{I_i}{Load(I_i)} for any WS_k,   (4)

where M is the number of workstations and L is the total load of all items:

L = Σ_{I_i} Load(I_i).

The objective is to generate an allocation X that minimizes Eq. (3) subject to Eq. (4).
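To make the formulation concrete, the per-workstation load (Eq. (1)), data size (Eq. (2)), objective value (Eq. (3)), and constraint (Eq. (4)) can be evaluated for a candidate allocation as in this sketch (the item data and function name are hypothetical, not from the paper):

```python
def evaluate_allocation(loads, sizes, assign, M):
    """Compute per-workstation load and data size for an allocation.

    assign[i] = k means item I_i is allocated on workstation WS_k
    (i.e., row i of the 0/1 matrix X has its unique 1 in column k).
    """
    ws_load = [0.0] * M
    ws_size = [0.0] * M
    for i, k in enumerate(assign):
        ws_load[k] += loads[i]   # Eq. (1)
        ws_size[k] += sizes[i]   # Eq. (2)
    L = sum(loads)
    # Load balancing constraint, Eq. (4): L/M plus the largest item load
    bound = L / M + max(loads)
    ok = all(l <= bound + 1e-9 for l in ws_load)
    # max data size on a single workstation is the objective, Eq. (3)
    return ws_load, max(ws_size), ok

loads = [0.4, 0.3, 0.2, 0.1]
sizes = [1.0, 0.8, 0.5, 0.2]
ws_load, max_size, ok = evaluate_allocation(loads, sizes, [0, 1, 1, 0], M=2)
print(ws_load, max_size, ok)  # [0.5, 0.5] 1.3 True
```

Here both workstations carry load 0.5 = L/M, well within the bound of Eq. (4), and the objective value is the 1.3 units of data on WS_1.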

3.2. Related work on data allocation

Data allocation has been widely studied, but none of the previous work fully considered the requirements of our research objective on posting file partitioning. In the early 1970s, researchers investigated data allocation for minimizing storage cost (Johnson et al., 1974). Since the 1980s, needs in high performance database systems have turned the research focus to improving data retrieval performance (Dowdy and Foster, 1982; Wah, 1984; Wolf and Pattipati, 1990; Rotem et al., 1993; Lee and Park, 1995; Little and Venkatesh, 1995; Narendran et al., 1997; Lee et al., 2000). Starting in the late 1990s, the information explosion brought by the Internet has raised new challenges in designing storage systems: both performance and storage cost have to be taken into consideration. Serpanos et al. (1998) proposed the MMPacking algorithm, which distributes and replicates identical-size data items onto workstations for both load and storage balancing. MMPacking (Serpanos et al., 1998) is the most closely related work but is still not suitable for posting file partitioning. The requirements to achieve our research objective on posting file partitioning are

(1) optimization for both load balancing and reducing storage cost,
(2) capability of dealing with variable-size items, and
(3) no item replication.

All related work except MMPacking fails to satisfy the first requirement. MMPacking satisfies the first requirement but not the second and third requirements.

We propose a data allocation algorithm satisfying all these requirements. The key idea of the proposed algorithm is problem reduction to MMPacking via bin packing. We introduce MMPacking (Serpanos et al., 1998) and bin packing (Horowitz et al., 1996) in detail.

3.2.1. MMPacking

MMPacking (Serpanos et al., 1998) deals with the following optimization problem. The input is a set of n objects B = {B_0, B_1, ..., B_{n−1}} with identical data sizes, and a set of M workstations WS = {WS_0, WS_1, ..., WS_{M−1}}. Each object B_j is associated with a load to access B_j, denoted Load(B_j). The output is an allocation with replication that allocates the objects in B onto the workstations in WS. The objective is to minimize the maximum number of objects allocated on a single workstation subject to the ideal load balancing constraint.

An allocation with replication is formulated as follows. An allocation with replication is an n × M matrix Y similar to the allocation matrix X defined in Section 3.1, with these differences:

• each entry in Y is a real number valued between 0 and 1, and
• there may be multiple non-zero entries in a row of Y.


Fig. 3. (a and b) Example of MMPacking.

The entry at row j and column k, denoted Y_{jk}, represents the ratio of Load(B_j) shared by workstation WS_k. The following equation holds:

Σ_{k=0}^{M−1} Y_{jk} = 1 for any row j.   (5)

The load allocated on a workstation WS_k by allocation Y is as follows:

Load_Y(WS_k) = Σ_{j=0}^{n−1} Y_{jk} × Load(B_j).   (6)

A row j with multiple non-zero entries means that the load of accessing B_j is shared by multiple workstations. The load sharing is realized by replicating the object to multiple workstations. A copy of object B_j has to be stored on WS_k if Y_{jk} > 0. The number of objects allocated on WS_k by the allocation Y is

Σ_{j=0}^{n−1} ⌈Y_{jk}⌉.

Fig. 3 illustrates how MMPacking works. Objects are sorted in increasing order of load and then assigned to workstations in round-robin. Once the accumulated load of a workstation exceeds the ideal balanced load, part of the load of the object is split to the next workstation in round-robin. Splitting the load of an object means replicating the object to multiple workstations. In this example, the object B_7 (with load 0.2) is replicated to WS_3 and WS_0. Replication of the object B_8 starts at WS_0, which is the last workstation to share partial load of B_7. For this example, the resultant matrix Y is shown in Fig. 4.
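The round-robin filling with load splitting can be sketched as follows (our reconstruction from the description above, not the authors' code; objects are assumed pre-sorted by increasing load, and tie handling may differ from the original):

```python
def mmpacking(loads, M):
    """Sketch of MMPacking: allocate identical-size objects, sorted by
    increasing load, onto M workstations. An object's load is split to
    the next workstation once the ideal balanced load L/M is reached.

    Returns Y as a list of dicts: Y[j][k] = fraction of Load(B_j) on WS_k.
    """
    L = sum(loads)
    target = L / M          # ideal balanced load per workstation (Property 1)
    Y = [{} for _ in loads]
    used = [0.0] * M        # load accumulated on each workstation
    k = 0                   # round-robin pointer
    for j, load in enumerate(loads):
        remaining = load
        while remaining > 1e-12:
            room = target - used[k]
            if room <= 1e-12:        # workstation full: spill to the next one
                k = (k + 1) % M
                continue
            share = min(remaining, room)
            Y[j][k] = Y[j].get(k, 0.0) + share / load
            used[k] += share
            remaining -= share
        if len(Y[j]) == 1:
            k = (k + 1) % M          # no split: plain round-robin advance
        # after a split, the next object starts at the last workstation
        # sharing the split load, as in the B_7/B_8 example above
    return Y
```

On loads [0.1, 0.1, 0.2, 0.2, 0.4] with M = 2, the last object is split evenly and both workstations end up with exactly L/M = 0.5 load, matching Property 1.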

The following properties of MMPacking are used to analyze our proposed algorithm. Let L be the total load of all objects. Serpanos et al. (1998) have proved the following properties of MMPacking.

Property 1. The MMPacking algorithm generates an allocation Y in which

Load_Y(WS_k) = L/M

for any workstation WS_k (Serpanos et al., 1998).

Property 2. The MMPacking algorithm allocates at least ⌊n/M⌋ and at most ⌊n/M⌋ + 1 objects on a workstation (Serpanos et al., 1998).


Fig. 5. (a and b) Example of bin packing.

Property 3. In the result of MMPacking, each workstation contains at most two replicated bins. If a workstation contains two replicated bins, then for one of the bins the workstation is the last one to share the load of that bin (Serpanos et al., 1998).

3.2.2. Bin packing

The bin packing problem (Horowitz et al., 1996) is as follows. The input is a set of items I = {I_0, I_1, ..., I_{N−1}} and a bin capacity x. Each item I_i is associated with a size s_i. The objective is to pack the set of items I into a minimum number of bins B = {B_0, B_1, ..., B_{n−1}} with capacity x. Fig. 5 depicts an example of packing items with sizes not exceeding 1.00 into a set of bins with capacity x = 1.00. Our proposed algorithm uses the best-fit algorithm (Horowitz et al., 1996) to perform bin packing. This algorithm iteratively places an item into the bin with the smallest room left. Property 4 states the key property of the best-fit algorithm. In Section 4.2.2, we prove a guaranteed storage balancing property of our proposed algorithm based on Property 4. (See Lemma 3 for the effect of the best-fit scheme.)

Property 4. During best-fit bin packing, a new bin is initialized only when the current item to be packed cannot fit into any existing bin (Horowitz et al., 1996).
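A best-fit packer along these lines can be sketched as follows (our own illustration with hypothetical sizes; the original operates on the items of Section 3.1):

```python
def best_fit_pack(sizes, capacity=1.0):
    """Best-fit bin packing: each item goes into the bin with the least
    room left that can still hold it; a new bin is opened only when no
    existing bin fits (Property 4).

    Returns (bins, room): bins[b] is the list of item indices in bin b,
    room[b] is the remaining capacity of bin b.
    """
    bins = []
    room = []
    for i, s in enumerate(sizes):
        best = None
        for b in range(len(bins)):
            # candidate must fit, and we prefer the smallest room left
            if s <= room[b] + 1e-12 and (best is None or room[b] < room[best]):
                best = b
        if best is None:                 # no existing bin fits: open a new bin
            bins.append([i])
            room.append(capacity - s)
        else:
            bins[best].append(i)
            room[best] -= s
    return bins, room

bins, room = best_fit_pack([0.6, 0.7, 0.3, 0.4, 0.2])
print(bins)  # [[0, 3], [1, 2], [4]]
```

In this run two bins are filled exactly and only the last bin is less than half full, which is the behavior Lemma 1 below formalizes for capacity x = 1.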

4. Load and storage balanced data allocation

This section proposes an approximate algorithm for the data allocation problem defined in Section 3.1. The proposed algorithm is outlined as follows.

Step 1: Perform bin packing to pack the items in I into bins B = {B_0, B_1, ..., B_{n−1}} with capacity x.

Step 2: Perform MMPacking to obtain an allocation Y that allocates the load of the bins in B onto the workstations in WS. The load of a bin B_j is set as follows:

Load(B_j) = Σ_{I_i ∈ B_j} Load(I_i).

Step 3: For each B_j do /* allocate I_i ∈ B_j to workstations */

• if MMPacking allocates all load of B_j to WS_k: allocate all I_i ∈ B_j to WS_k in the final result X;
• if MMPacking replicates B_j: invoke the bin splitting procedure to generate a partition P_{B_j} = {B_j^(k) | Y_{jk} > 0}, where B_j^(k) is the subset of B_j allocated on WS_k in the final result X.

The idea behind the algorithm is as follows. Step 1 packs variable-size items into bins of approximately equal size. Step 2 performs MMPacking to determine the ideal load allocation and (approximately) balance the number of bins allocated on workstations. Step 3 generates the final result X in which each item is allocated to a unique workstation. Fig. 6 depicts the final result X generated from the MMPacking result Y shown in Fig. 3. For a bin B_j not replicated by MMPacking, the whole bin is allocated to the workstation to which MMPacking allocates B_j. For a bin B_j replicated by MMPacking, a bin splitting procedure is invoked to spread the items in B_j over multiple workstations. Workstation WS_k is allocated a subset of B_j, denoted B_j^(k), if MMPacking allocates partial load of B_j on WS_k (Y_{jk} > 0). This algorithm approximates storage balancing (when the number of bins is large) since

• most of the bins are of approximately equal size, and
• each workstation contains approximately equal numbers of bins.

We use the notation Load(B_j^(k)) to denote the total load of all items in B_j^(k):

Load(B_j^(k)) = Σ_{I_i ∈ B_j^(k)} Load(I_i).

Load balancing is approximated if the bin splitting procedure in Step 3 approximates the load sharing determined by MMPacking:

Load(B_j^(k)) ≈ Y_{jk} × Load(B_j) for WS_k : Y_{jk} > 0.   (7)

The remaining part of this section formalizes this idea. The issues in realizing the idea are

(1) design of the bin splitting procedure to approximate the load sharing determined by MMPacking, and
(2) selection of the bin capacity x to minimize the worst-case storage requirement of a workstation.

Section 4.1 deals with the first issue and Section 4.2 deals with the second issue. Section 4.3 summarizes the discussion to form the complete algorithm.

4.1. Bin splitting for load balancing

This sub-section presents the bin splitting procedure and proves that the load balancing constraint (Eq. (4)) is satisfied. MMPacking (Serpanos et al., 1998) achieves load balancing through data replication. However, to apply to posting file partitioning, each data item has to be mapped to a unique workstation; data replication introduces additional overhead on parallel query processing. To design a data allocation algorithm for posting file partitioning, we apply the proposed bin splitting method to approximate load balancing without data replication.

4.1.1. Design of bin splitting procedure

The bin splitting procedure, named SplitBin, is shown in Fig. 7. The objective of bin splitting is to approximate the load sharing determined by MMPacking (Eq. (7)). The procedure generates a partition P_{B_j} of bin B_j according to the MMPacking result Y. The procedure examines each item I_i ∈ B_j and generates the B_j^(k)'s in the order in which MMPacking replicates B_j. A partition is made whenever


Fig. 6. (a and b) Final result generated from MMPacking result.

Load(B_j^(k)) ≥ Y_{jk} × Load(B_j). An example is shown in Fig. 8, which depicts how the bin B_7 (with load 0.2) is split to approximate the MMPacking result shown in Fig. 3. The procedure SplitBin first generates B_7^(3) and then generates B_7^(0).
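The splitting rule can be sketched as follows (our reconstruction of SplitBin from the surrounding description, since Fig. 7 is not reproduced here; names and tolerances are ours): items of a replicated bin are streamed into the current part until its accumulated load reaches the share Y_{jk} × Load(B_j), then the next part begins.

```python
def split_bin(item_loads, shares):
    """Partition a replicated bin's items among the sharing workstations.

    item_loads: loads of the items I_i in bin B_j.
    shares: list of (workstation, Y_jk) pairs, in the order in which
            MMPacking replicates B_j (the Y_jk values sum to 1).
    Returns {workstation: list of item indices}.
    """
    total = sum(item_loads)
    parts = {ws: [] for ws, _ in shares}
    it = 0
    for ws, y in shares:
        acc = 0.0
        target = y * total          # load share determined by MMPacking
        # emit items until the accumulated load reaches the share
        while it < len(item_loads) and acc < target - 1e-12:
            parts[ws].append(it)
            acc += item_loads[it]
            it += 1
    # any residual items go to the last workstation in the order
    while it < len(item_loads):
        parts[shares[-1][0]].append(it)
        it += 1
    return parts
```

For a bin with items of loads [0.05, 0.05, 0.1] split evenly between WS_3 and WS_0 (as B_7 is in the example), the first two items form B^(3) and the last item forms B^(0); each part overshoots its share by at most one item load, matching Corollaries 2 and 3.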

4.1.2. Analysis on load balancing property

We prove that the load balancing constraint (Eq. (4)) is satisfied after executing the bin splitting procedure. The key idea is to compare the load allocations of the final result X and the MMPacking result Y. The load of a workstation is rewritten in Corollary 1 for the comparison. Let WS_k be a workstation sharing partial load of the bin B_j in the result of MMPacking. The value Load(B_j^(k)) is compared to Y_{jk} × Load(B_j), which is the load sharing determined by MMPacking (Corollaries 2 and 3). The load balancing property can then be derived (Theorem 1).

Fig. 7. Bin splitting procedure.

By observing Fig. 6, the load of a workstation is rewritten as follows.

Corollary 1. The proposed algorithm generates an allocation X in which the load of a workstation WS_k is as follows:

Load_X(WS_k) = Σ_{B_j : Y_{jk}=1} Y_{jk} × Load(B_j) + Σ_{B_j : 0<Y_{jk}<1} Load(B_j^(k)).   (8)

The load partitioning property is as follows. Let WS_k be a workstation sharing partial load of bin B_j in the result of MMPacking. The bin splitting procedure adds items to a partition until the accumulated load is greater than or equal to the load share determined by MMPacking (cf. Step (3.2) in Fig. 7). The difference between Load(B_j^(k)) and Y_{jk} × Load(B_j) is thus at most the load of the last item included in the partition (to be stated in Corollary 2). If WS_k is not the last workstation to share the load of B_j, Load(B_j^(k)) exceeds (if not equals) Y_{jk} × Load(B_j). The total load of all partitions is fixed:

Σ_{B_j^(k)} Load(B_j^(k)) = Load(B_j) = Σ_{WS_k} Y_{jk} × Load(B_j).

Hence, as stated in Corollary 3, for the last workstation sharing the load of B_j, the allocated load is less than (if not equal to) the load share determined by MMPacking.

Corollary 2. Let WS_k be a workstation sharing partial load of bin B_j in the result of MMPacking. The bin splitting procedure generates a B_j^(k) satisfying the following equation:

Load(B_j^(k)) ≤ Y_{jk} × Load(B_j) + max_{I_i}{Load(I_i)}.   (9)

Corollary 3. Let WS_k be the last workstation in MMPacking to share the load of bin B_j. The bin splitting procedure generates a B_j^(k) satisfying the following equation:

Load(B_j^(k)) ≤ Y_{jk} × Load(B_j).   (10)

The load balancing property of the proposed algorithm is as follows (where L is the total load of all items and M is the number of workstations).


Fig. 8. Example of bin splitting.

Theorem 1 (Load balancing property). The proposed algorithm generates an allocation X in which

Load_X(WS_k) ≤ L/M + max_{I_i}{Load(I_i)}   (11)

for any workstation WS_k.

Proof. The theorem is proved by rewriting Eq. (8) for the various cases.

In the result of MMPacking, a workstation may contain 0, 1, or 2 replicated bins (Property 3). We consider the case of a workstation containing two replicated bins.

We rewrite Eq. (8) for a workstation WS_k on which MMPacking allocates two replicated bins B_{j1} and B_{j2}:

Load_X(WS_k) = Σ_{B_j : Y_{jk}=1} Y_{jk} × Load(B_j) + Load(B_{j1}^(k)) + Load(B_{j2}^(k)).

Property 3 states that, for one of the replicated bins, say B_{j2}, WS_k is the last workstation to share its partial load. According to Corollaries 2 and 3, we have:

Load(B_{j1}^(k)) ≤ Y_{j1,k} × Load(B_{j1}) + max_{I_i}{Load(I_i)},

Load(B_{j2}^(k)) ≤ Y_{j2,k} × Load(B_{j2}).

The load allocated on WS_k satisfies the following equation:

Load_X(WS_k) ≤ Σ_{B_j : Y_{jk}>0} Y_{jk} × Load(B_j) + max_{I_i}{Load(I_i)}.

MMPacking achieves exact load balancing (Property 1):

Σ_{B_j : Y_{jk}>0} Y_{jk} × Load(B_j) = L/M.

Eq. (11) is thus obtained. The proofs for the other cases are similar and hence omitted. □

4.2. Bin capacity selection for storage balancing

We derive equations for selecting the bin capacity and an upper bound on the allocated data size for any workstation. In selecting the bin capacity, two cases are considered: (i) setting the bin capacity to the size of the largest item, and (ii) setting the bin capacity larger than the largest item.

4.2.1. Case of bin capacity being equal to largest item size

We first consider the case of setting the bin capacity x = 1, the size of the largest item, for bin packing. Bin packing will generate a set of bins with used capacity exceeding 1/2, except for the smallest bin (see Lemma 1 below). The number of bins required is then bounded (Lemma 2), and the upper bound on the allocated data size is obtained (Theorem 2).

Lemma 1. With bin capacity x = 1, there is at most one bin that is less than half full in the output of the bin packing.

Proof. This lemma is proved by induction on the number of items packed. The induction hypothesis is the lemma itself. Initially, I_0 is in B_0. Suppose the lemma has not failed after packing I_i. The lemma still holds after packing I_{i+1} if no new bin is initialized for I_{i+1}. Consider the case of a new bin being initialized to pack I_{i+1}. If the used capacity of each bin is at least 1/2 after packing I_i (see Fig. 9(a)), the new bin is the only one possibly with used capacity not exceeding 1/2 after packing I_{i+1}. In the case of there being a unique bin B_j which is less than half full after packing I_i (see Fig. 9(b)), by Property 4, using a new bin for I_{i+1} indicates that no existing bin has enough room, and hence s_{i+1} ≥ 1/2. B_j is thus still the only bin with used capacity not exceeding 1/2 after packing I_{i+1}. The lemma holds after packing I_{i+1} in all possible cases. This argument holds until all items are packed into bins. □

With Lemma 1, an upper bound on the number of bins generated can be derived. Let S be the total data size of all items,

S = Σ_{i=1}^{N} s_i,   (12)

where s_i is the size of item I_i. The number of bins generated is bounded as follows.

Lemma 2. With bin capacity x = 1, the number of bins n generated by the bin packing is bounded:

n ≤ 2 × S + 1.   (13)

Proof. According to Lemma 1, the total size of all items is at least the total data size packed in the (n − 1) bins with used capacity exceeding 1/2:

S ≥ (1/2) × (n − 1).

Eq. (13) follows immediately. □

Fig. 9. (a and b) Packing an item into bins with capacity one.

Let M be the number of workstations. We derive the storage requirement of a workstation as follows.

Theorem 2 (Workstation storage requirement for bin capacity = 1). With bin capacity x = 1, the proposed algorithm generates an allocation X in which

DS_X(WS_k) ≤ 2 × S/M + 3   (14)

for any workstation WS_k.

Proof. The theorem is obtained by calculating the number of bins allocated on a workstation WS_k. MMPacking allocates at most ⌈n/M⌉ + 1 bins on a workstation (Property 2), and the used capacity of a bin is at most 1.00. Hence we have the upper bound on the data size allocated on WS_k:

DS_X(WS_k) ≤ ⌈n/M⌉ + 1.

The total number of bins n is bounded by Eq. (13), and Eq. (14) is thus obtained. □

4.2.2. Case of bin capacity being larger than item sizes

We improve storage balancing by allowing a larger bin capacity. Fig. 10 shows an example in which no bin of capacity 1 contains more than one item. (Recall that item sizes are normalized to the largest item size.) In the worst case, the most storage-demanding workstation will contain twice the amount of data as the least storage-demanding workstation. However, the worst case can be improved by enlarging the bin capacity. The key issue is to select the bin capacity to minimize the (worst-case) storage requirement of a workstation. There is a tradeoff in selecting the bin capacity. Suppose the bin capacity is x > 1.00. Let n be the number of bins generated and M be the number of workstations. Fig. 11 shows the maximum difference in allocated data size between workstations. MMPacking (Serpanos et al., 1998) allocates from ⌊n/M⌋ to ⌊n/M⌋ + 2 bins on each workstation. Except for the least demanded bin, the used capacity of a bin lies between x − 1 and x (Lemma 3). The data size allocated on a workstation is hence bounded from above:

max_{WS_k}{DS_X(WS_k)} ≤ (⌊n/M⌋ + 2) × x.

Fig. 10. Worst case of setting bin capacity to be equal to the largest item.

In a workstation, there are at most three bins with used capacity not exceeding x − 1: one resulting from bin packing and two from bin splitting. The data size allocated on a workstation is hence bounded from below:

min_{WS_k}{DS_X(WS_k)} ≥ (⌊n/M⌋ − 3) × (x − 1).

The maximum difference in allocated data size between workstations is therefore

max_{WS_k}{DS_X(WS_k)} − min_{WS_k}{DS_X(WS_k)} = O(n/M) + O(x).   (15)

Selecting a large x reduces the number of bins n generated and hence the O(n/M) term in Eq. (15). However, a large x increases the O(x) term in Eq. (15). The tradeoff is resolved analytically.

How to select x is outlined as follows. Lemma 3 below states the packed bin sizes. Lemma 4 bounds the number of generated bins according to the packed bin sizes. With the bound on the number of generated bins, Lemma 5 relates the storage requirement to the selected bin capacity and defines the storage requirement function (Eq. (18)). The bin capacity x is selected to minimize the storage requirement function. The storage requirement for a workstation can then be derived (Theorem 3).

Lemma 3. With bin capacity x > 1, there exists at most one bin with used capacity less than x − 1 in the result of bin packing.

Proof. This lemma is proved by induction on the number of items packed. The induction hypothesis is the lemma itself. The initial condition, after the packing of item I_0, is trivial. Suppose the lemma holds after the packing of I_i. Two possibilities exist: (i) the used capacity of each bin exceeds x − 1 (Fig. 12(a)), and (ii) there is exactly one bin B_j with used capacity not exceeding x − 1 (Fig. 12(b)). For case (i), after the packing of I_{i+1}, the new bin (if initialized) is the only possible one with used capacity less than x − 1. For case (ii), no new bin will be initialized (Property 4), since size s_{i+1} ≤ 1 and B_j has enough room for I_{i+1}; afterwards B_j is the only possible bin with used capacity not exceeding x − 1. The lemma thus holds after the packing of I_{i+1}. □


Fig. 12. (a and b) Packing an item to bins with enlarged bin capacity.

Similar to Lemma 2, the following lemma shows the upper bound on the number of bins generated. (S is the total size of all items.)

Lemma 4. With bin capacity x > 1, the number of bins n generated by bin packing is bounded:

n ≤ S/(x − 1) + 1.   (16)

Proof. According to Lemma 3, there are at least n − 1 bins with used capacity exceeding x − 1, and the total data size S is at least the total size of these n − 1 bins:

S ≥ (x − 1) × (n − 1).

Eq. (16) is obtained immediately. □

The storage requirement of a workstation can then be written as a function of bin capacity x, stated as follows. (M is the number of workstations.)

Lemma 5. With bin capacity x > 1, the proposed algorithm generates an allocation X in which

DS_X(WS_k) ≤ S/M × (1 + 1/(x − 1)) + 3x   (17)

for any workstation WS_k.

Proof. In the output of the proposed algorithm, each workstation WS_k contains at most ⌈n/M⌉ + 1 bins, with the used capacity not exceeding x for all bins:

DS_X(WS_k) ≤ (⌈n/M⌉ + 1) × x.

Lemma 4 gives an upper bound on n, and Eq. (17) is obtained immediately. □

The bin capacity is selected to minimize the storage requirement function. The storage requirement function f(x) indicates the required storage of a workstation if the bin capacity is set to x:

f(x) ≡ S/M × (1 + 1/(x − 1)) + 3x.   (18)

The storage requirement function on the x–y plane is shown in Fig. 13, which reflects the tradeoff in selecting the bin capacity. Solving the equation f′(x) = 0, we obtain the optimal bin capacity x_0 minimizing f(x):

x_0 = 1 + √(S/(3 × M)).   (19)

And the minimum required storage capacity for a workstation is obtained:

f(x_0) = S/M + 2√3 × √(S/M) + 3.   (20)
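The tradeoff and its closed-form minimizer can be sanity-checked numerically. A minimal sketch, using arbitrary example values (S = 1000, M = 8) rather than any figure from the paper:

```python
import math

def f(x, S, M):
    # Storage requirement function, Eq. (18): f(x) = S/M * (1 + 1/(x-1)) + 3x.
    return S / M * (1 + 1 / (x - 1)) + 3 * x

S, M = 1000.0, 8                                # example values only
x0 = 1 + math.sqrt(S / (3 * M))                 # optimal capacity, Eq. (19)
fmin = S / M + 2 * math.sqrt(3 * S / M) + 3     # closed-form minimum, Eq. (20)
```

Evaluating f at x_0 reproduces the closed form of Eq. (20), and nearby capacities give a strictly larger requirement, confirming the minimizer.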

Theorem 3 (Workstation storage requirement for bin capacity > 1). By selecting the bin capacity x = 1 + √(S/(3 × M)), the proposed algorithm generates an allocation X in which

DS_X(WS_k) ≤ S/M + 2√3 × √(S/M) + 3   (21)

for any workstation WS_k.

Proof. This is the conclusion of the previous discussion. □

The theorem indicates the optimization quality of the proposed algorithm in storage balancing.

4.3. Summary of proposed algorithm

We summarize the previous discussion to derive a complete algorithm and analyze the complexity and asymptotic behavior of the algorithm.

The proposed data allocation algorithm, LSB Alloc (Load and Storage Balanced Allocation), is shown in Fig. 14. The algorithm determines the bin capacity as discussed in Section 4.2. By comparing Eq. (14) and Eq. (21), whether to enlarge the bin capacity or not is best determined by S/M, where S is the total data size and M is the number of workstations. (Note that the bounds of Eqs. (14) and (21) are equal at S/M = 12.) Properties of the output are proved in the previous sections.

Fig. 14. Proposed load and storage balanced data allocation algorithm, LSB Alloc.

The complexity of the data allocation algorithm is as follows. The time complexity of best-fit bin packing is O(N²) (Horowitz et al., 1996). The time complexity of MMPacking to allocate n (≤ N) bins is O(n + M) (Serpanos et al., 1998). Hence the time complexity of the proposed algorithm is O(N² + M + n). To implement the algorithm, an O(n × M) space is required to store the result Y of MMPacking. The final result X can be implemented as a mapping table with O(N) space. Hence the space complexity of the algorithm is O(n × M + N).
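The overall flow can be sketched as follows. This is an illustrative simplification, not the paper's exact pseudocode: the best-fit packing and the S/M = 12 capacity rule follow Section 4.2, but a greedy least-loaded assignment stands in for MMPacking (Serpanos et al., 1998), so Property 2's bin-count bounds do not literally apply to this stand-in.

```python
import math

def lsb_alloc(sizes, loads, M):
    """Sketch of the LSB_Alloc flow. Sizes must be normalized so that
    the largest item has size 1."""
    S = sum(sizes)
    # Capacity rule: Eqs. (14) and (21) break even at S/M = 12.
    x = 1.0 if S / M <= 12 else 1.0 + math.sqrt(S / (3 * M))
    # Best-fit packing: place each item into the fullest bin it fits in.
    bins = []  # each bin: [used_capacity, load, item_ids]
    for i, (s, ld) in enumerate(zip(sizes, loads)):
        best = None
        for b in bins:
            if b[0] + s <= x and (best is None or b[0] > best[0]):
                best = b
        if best is None:           # Property 4: open a new bin only if needed
            best = [0.0, 0.0, []]
            bins.append(best)
        best[0] += s
        best[1] += ld
        best[2].append(i)
    # Bin-to-workstation step: heaviest bin goes to the least-loaded
    # workstation (greedy LPT, standing in for MMPacking).
    ws_load = [0.0] * M
    ws_items = [[] for _ in range(M)]
    for used, ld, items in sorted(bins, key=lambda b: -b[1]):
        k = min(range(M), key=ws_load.__getitem__)
        ws_load[k] += ld
        ws_items[k].extend(items)
    return ws_items, ws_load
```

The function names and list-based bin representation are our own choices for illustration.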

The algorithm is asymptotically 1-optimal on storage balancing. Let J be an instance (for given data items I and workstations WS) of the optimization problem. Eq. (3) defines the cost of a solution for J. The notation OPT(J) denotes the cost of the optimal solution, and F(J) denotes the cost of the solution found by the proposed algorithm. It is clear that OPT(J) ≥ S/M. According to Theorem 3, the ratio F(J)/OPT(J) is bounded as follows:

F(J)/OPT(J) ≤ 1 + 2√3/√(S/M) + 3/(S/M).   (22)

Fig. 15 depicts the curve of Eq. (22). The ratio F(J)/OPT(J) approaches one when S/M exceeds a certain threshold. Section 7 shows that very near optimal storage balancing is achieved for real-world applications.

5. Parallel information retrieval

We come back to the information retrieval problem. The work here is to apply the proposed data allocation algorithm LSB Alloc to posting file partitioning. We first specify what a data item is. We follow the partition-by-document-ID principle to partition the posting file. The principle states that an item for data allocation is the set of all postings referring to the same document ID. This section describes how a query is processed in parallel following the principle. With this principle, a query is processed in parallel without having to transfer postings between workstations. The time complexity of parallel query processing will be analyzed. Based on the analysis, Section 6 formulates posting file partitioning as the data allocation problem and proposes the posting file partitioning algorithm.

Fig. 16. The partition-by-document-ID principle.

The partition-by-document-ID principle is illustrated in Fig. 16. The principle dictates that all postings referring to the same document ID be allocated on the same workstation. Each workstation covers an exclusive set of document IDs. In parallel query processing, workstation WS_k is responsible for providing answers only from the document IDs covered by it. For example, for the query (term 1 <AND> term 2), WS_1 provides the answers {4, 8}. Checking whether a document ID d matches a query requires only the postings referring to document ID d, which are all on the same workstation. For the above example, checking whether document 4 contains both term 1 and term 2 requires only the local data in WS_1. Following the principle, a query is processed independently and in parallel by all workstations.

This section describes how a query is processed in parallel. Section 5.1 deals with the set theory and the time complexity of parallel query processing. Section 5.2 deals with implementation issues on a cluster of workstations.

5.1. The theory

The partitioned posting file is formalized as follows. To partition a posting file is to map document IDs to workstations. Let L_t be the posting list of term t, and D_k be the set of document IDs mapped to WS_k. The notation L_t(WS_k) denotes the set of document IDs in L_t which are mapped to WS_k:

L_t(WS_k) = L_t ∩ D_k.   (23)

The local posting list of term t in WS_k is the set of document IDs in L_t(WS_k) stored in increasing order. The local posting file in WS_k is the set of local posting lists for all terms t. For the example in Fig. 16, the local posting files for all workstations are shown in Fig. 17.

Parallel query processing works as follows. For a given query q, parallel query processing computes the answer list ANS_q in parallel. Each workstation WS_k is responsible for computing its own partial answer list ANS_q(WS_k):

ANS_q(WS_k) = ANS_q ∩ D_k.   (24)

The set ANS_q(WS_k) is the set of all document IDs matching query q and mapped to WS_k. The union of the partial answer lists from all workstations is hence the complete answer list:

ANS_q = ∪_{WS_k} ANS_q(WS_k).   (25)

The following theorem states the set operation to compute a partial answer list.

Theorem 4 (Computation of partial answer list). The partial answer list ANS_q(WS_k) can be represented in set operations on the local posting lists of queried terms in WS_k.

Proof. We prove this theorem by induction on the number of Boolean operators in the given query q. The induction hypothesis is the theorem itself.

The basis, when q contains only one Boolean operator, is as follows. Query q is either "term i <AND> term j" or "term i <OR> term j". Consider the case that q is "term i <AND> term j". The partial answer list at WS_k is

ANS_q(WS_k) = (L_i ∩ L_j) ∩ D_k = (L_i ∩ D_k) ∩ (L_j ∩ D_k) = L_i(WS_k) ∩ L_j(WS_k).

This rewrites ANS_q(WS_k) with set operations on local posting lists in WS_k. The case that q is "term i <OR> term j" is similar and is omitted.

Suppose the theorem holds when the number of Boolean operators in q is less than n; we prove that the theorem also holds when q contains n Boolean operators. The query q is either "(q1) <AND> (q2)" or "(q1) <OR> (q2)", where q1 and q2 are queries containing no more than n − 1 Boolean operators. Consider the case that q is "(q1) <AND> (q2)". Then

ANS_q(WS_k) = (ANS_q1 ∩ ANS_q2) ∩ D_k = ANS_q1(WS_k) ∩ ANS_q2(WS_k),

and by the induction hypothesis both partial answer lists on the right-hand side can be represented in set operations on local posting lists; hence so can ANS_q(WS_k). The case that q is "(q1) <OR> (q2)" is similar. The theorem is hence proved by induction. □

Theorem 4 states an efficient way to compute a partial answer list. To compute ANS_q(WS_k), WS_k only has to perform basic set operations on its local posting lists of queried terms, without examining all document IDs mapped to it. An example is as follows. Consider the partitioned posting file in Fig. 16, and let query q be "(term0 <AND> term3) <OR> term4". The local posting lists of term0, term3, and term4 in WS_2 are {5, 9}, {9}, and {5}, respectively. The series of set operations to compute the partial answer list at WS_2 is as follows:

ANS_q(WS_2) = ({5, 9} ∩ {9}) ∪ {5} = {5, 9}.

Note that no posting referring to document ID 2 is examined. The time complexity of computing a partial answer list is as follows. Any set operation algorithm operating on sorted lists can be used; we use the list merging algorithm (Salton, 1989) to perform set operations. Let f_{t_i}^(k) be the length of the local posting list of the i-th queried term t_i in WS_k, and m be the number of queried terms. The following corollary states the time complexity.

Corollary 4. With list merging (Salton, 1989), the time complexity to compute ANS_q(WS_k) is O(f_{t_1}^(k) + f_{t_2}^(k) + · · · + f_{t_m}^(k)).
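The merge-based set operations behind Corollary 4 can be sketched directly; the function names below are our own, and the final check reuses the WS_2 example above.

```python
def union_sorted(a, b):
    # Merge-based OR on two sorted doc-ID lists, O(len(a) + len(b)).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def intersect_sorted(a, b):
    # Merge-based AND on two sorted doc-ID lists, O(len(a) + len(b)).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# WS_2 example from the text: ({5, 9} AND {9}) OR {5} = {5, 9}.
partial = union_sorted(intersect_sorted([5, 9], [9]), [5])  # → [5, 9]
```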

5.2. Implementation on cluster of workstations

This section describes the flow of processing a query on a cluster of workstations, starting from receiving a user query to the completion of the answer list. While Boolean query processing is the major concern here, the proposed scheme can also be extended to deal with ranking by adding parallel sorting.

The flow of computing the answer list for a query is as follows. Fig. 18 shows the system overview of a clustered information retrieval system. The parallel flow to compute the answer list for such a cluster is shown in Fig. 19. A specific workstation, called the gateway, is dedicated to receiving user queries and performing the index file search. The gateway searches the index file shown in Fig. 20(a) and substitutes a term ID for each term in the query. All the other workstations are called the back-end workstations. Records of frequently used terms are often stored in random access memory so that the average index search time will not scale with the size of the keyword collection. The query is then broadcast to all back-end workstations to compute the answer list in parallel. Each workstation stores an index array of pointers to local posting lists, as shown in Fig. 20(b). Upon receiving a broadcast query, the workstation retrieves local posting lists and computes its own partial answer list. The partial answer list is buffered locally, and the number of document IDs found is sent back to the gateway.

The remaining work is for the gateway to return answers to the user page by page. A page contains the number of answers to the query and a page full of titles of matched documents. The number of answers is useful for a user to determine whether a different query should be issued: for example, when the number of answers is large, a user may decide to discard the query results and give a more specific query. The gateway accumulates the number of answers found by each back-end workstation to obtain the total number of answers.

For ranking, each back-end workstation selects the top r answers within its partial answer list independently. The top r answers in the complete answer list are obtained by parallel sorting (Kumar, 1994) of all workstations' top r answers. With architectural support, Patterson's group shows that more than 1 G integers can be sorted in 2.41 s using 64 workstations (Arpaci-Dusseau et al., 1997, 1998). The time to rank answers for a page is small, since r is small and does not scale with the collection size.

6. Posting file partitioning algorithm

We are now ready to partition the posting file using the data allocation algorithm. The input includes

• a posting file PF_seq suitable only for sequential processing,
• the popularity p_t of each keyword term t, and
• a set of workstations WS = {WS_0, WS_1, . . ., WS_{M−1}}.

The output is a partitioned form of PF_seq to be distributed on WS, following the partition-by-document-ID principle. The mean query processing time in parallel processing of the partitioned posting file is estimated according to the popularities of keyword terms. The objective is to minimize the storage requirement per workstation subject to the constraint that the mean query processing time of a workstation is at most one document processing time more than the ideal value. We first formulate posting file partitioning as the data allocation problem defined in Section 3.1. The proposed algorithm LSB Alloc is then applied to generate a partitioned posting file.

6.1. Formulating as data allocation problem

Posting file partitioning can be formulated as the data allocation problem defined in Section 3.1. Three rules are given to specify (1) an item, (2) the size of an item, and (3) the load of an item. The key issue is to define item loads such that the mean query processing time of a workstation can be calculated by accumulating the loads of the allocated items. We establish a probability model to define item loads.

The following rule specifies what an item is. This rule is the partition-by-document-ID principle described in Section 5.

Rule 1. An item I_i to be allocated by Algorithm LSB Alloc is the set of all postings referring to document ID i.

With the rule, a partitioned posting file is generated as follows. The algorithm LSB Alloc generates an allocation matrix X, indicating the mapping of document IDs (items) to workstations. For the allocation X, the local posting list of term t at workstation WS_k, denoted L_t(WS_k), is as follows:

L_t(WS_k) = {doc. ID i | i ∈ L_t and X_ik = 1}.   (26)

The local posting file at WS_k is the set of local posting lists for all terms t.


Fig. 18. System overview of a clustered information retrieval system.

Rule 2. The data size s_i of item I_i is normalized and defined as follows:

s_i = (number of postings referring to doc. ID i) / max_{doc. ID j}{number of postings referring to doc. ID j}.   (27)

The storage requirement of a workstation is calculated as follows. The data size allocated on workstation WS_k, denoted DS_X(WS_k) and defined in Eq. (2), indicates the (normalized) amount of postings allocated on WS_k. The space, in bytes, occupied by an item I_i is [(bytes per posting) × (number of postings in I_i)]. For workstation WS_k, the required storage space in bytes is [DS_X(WS_k) × (space occupied by the largest item)].

The mean query processing time is estimated by the following probability model. Let TQ_t be a random Boolean variable representing whether term t appears in a query: TQ_t = 1 if term t appears in a query and TQ_t = 0 otherwise. The term popularity p_t of a term t is the probability that a query contains the term t; that is, p_t = Pr{TQ_t = 1}. The expected value of TQ_t is thus

E[TQ_t] = 1 × Pr{TQ_t = 1} + 0 × Pr{TQ_t = 0} = p_t.   (28)


Fig. 20. (a and b) Partitioned inverted file on cluster of workstations.

Let f_t^(k) be the length of the local posting list of term t in workstation WS_k. Corollary 4 states that the query processing time is proportional to the amount of postings to be processed. With allocation X, the query processing time at WS_k, denoted QPT_X(WS_k), is as follows:

QPT_X(WS_k) = Σ_{term t} TQ_t × f_t^(k),   (29)

where the time quantity is normalized such that one unit of time is the average time to process a posting. The mean query processing time of WS_k is the expected value of QPT_X(WS_k):

MQPT_X(WS_k) = E[QPT_X(WS_k)].   (30)

Similarly, the sequential query processing time, denoted QPT_seq, and the mean sequential query processing time, denoted MQPT_seq, are as follows:

QPT_seq = Σ_{term t} TQ_t × f_t,   (31)

MQPT_seq = E[QPT_seq].   (32)

(The notation f_t stands for the length of the posting list of term t.) Item loads are defined to indicate mean query processing time, as stated in the following rule and theorems. Let L_t be the posting list of term t. The rule to define item loads is as follows.

Rule 3. The load of the item I_i is as follows:

Load(I_i) = Σ_{term t : i ∈ L_t} E[TQ_t] = Σ_{term t : i ∈ L_t} p_t.   (33)

The load of an item I_i can be calculated by accumulating the corresponding term popularities for all postings in I_i. Consider the example in Fig. 16. There are three postings in the item corresponding to document ID 7: the postings corresponding to terms 0, 1, and 4. The load is thus Load(I_7) = p_0 + p_1 + p_4. The load of I_i is the aggregated mean query processing time imposed by I_i. This is stated in the following theorems.
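Rule 3's load computation amounts to one pass over the posting lists. A sketch with hypothetical toy data; only the fact that document 7 appears under terms 0, 1, and 4 is taken from the text, and the popularity values are made up for illustration.

```python
def item_loads(posting_lists, popularity):
    # Rule 3: Load(I_i) = sum of p_t over all terms t whose posting
    # list contains document ID i (one term popularity per posting).
    load = {}
    for t, docs in posting_lists.items():
        for d in docs:
            load[d] = load.get(d, 0.0) + popularity[t]
    return load

# Hypothetical posting lists: document 7 appears in the posting lists
# of terms 0, 1, and 4, as in the Fig. 16 example.
posting_lists = {0: [5, 7, 9], 1: [4, 7, 8], 3: [9], 4: [5, 7]}
popularity = {0: 0.30, 1: 0.20, 3: 0.10, 4: 0.05}
loads = item_loads(posting_lists, popularity)
# loads[7] equals p_0 + p_1 + p_4 = 0.55
```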

Theorem 5 (Mean query processing time in parallel processing). The mean query processing time of a workstation WS_k is the summed load of all items allocated on WS_k:

MQPT_X(WS_k) = Σ_{I_i : X_ik = 1} Load(I_i) = Load_X(WS_k).   (34)

Proof. Eq. (34) is derived by rewriting QPT_X(WS_k) as an accumulation of the TQ_t corresponding to all postings. Refer again to Fig. 16. QPT_X(WS_k) can be calculated by scanning the local posting file row by row. Each time a posting is found, the corresponding TQ_t is added to the current sum. This rewrites Eq. (29) to be

QPT_X(WS_k) = Σ_{term t} Σ_{I_i : X_ik = 1 and i ∈ L_t} TQ_t

(where L_t is the posting list of term t). Scanning the local posting file column by column yields the same result, and the above equation is equivalent to

QPT_X(WS_k) = Σ_{I_i : X_ik = 1} ( Σ_{term t : i ∈ L_t} TQ_t ).

The mean query processing time is thus

MQPT_X(WS_k) = Σ_{I_i : X_ik = 1} ( Σ_{term t : i ∈ L_t} E[TQ_t] ).

By observing Eq. (33) and the above equation, Eq. (34) is obtained. □

Theorem 6 (Mean query processing time in sequential processing). The mean query processing time in sequential processing is the total load of all items:

MQPT_seq = Σ_{I_i} Load(I_i) = L.   (35)

Proof. This is similar to the proof of Theorem 5. □

With these three rules, Algorithm LSB Alloc can be applied to generate a partitioned posting file with the following properties. The three rules specify the inputs to Algorithm LSB Alloc. Algorithm LSB Alloc then generates an allocation X, and the partitioned posting file is generated from X according to Eq. (26). The objective of posting file partitioning is to balance the amount of postings allocated on workstations subject to a limited difference to the ideal mean query processing time. The storage requirement is indicated by Theorems 2 and 3. The mean query processing time in parallel processing is stated in the following corollary, which is a direct consequence of Theorems 1, 5, and 6.

Corollary 5. Applying Algorithm LSB Alloc generates a partitioned posting file such that

MQPT_X(WS_k) ≤ MQPT_seq/M + max_{I_i}{Load(I_i)}   (36)

for any workstation WS_k.


Fig. 21. Algorithm for generating partitioned posting file, LSB PostingFilePartition.

Corollary 5 states that the mean query processing time of a workstation is at most the ideal value, MQPT_seq/M, plus the effect of a single document.

6.2. Generation of partitioned posting file

Fig. 21 shows the algorithm to generate the partitioned posting file. The first step scans the input posting file to assign parameters to items. Algorithm LSB Alloc is then invoked to obtain the allocation matrix X. Finally, the input posting file is scanned again to generate the partitioned posting file from X.

The complexity of generating a partitioned posting file is as follows. Let N be the number of documents, M the number of workstations, n the number of bins generated by bin packing, and f the number of postings in the input posting file. The time complexity is O(N² + M + n + f), in which O(N² + M + n) is spent on the algorithm LSB Alloc and O(f) on scanning the input posting file twice. The space complexity is O(n × M + N + f), in which O(f) space is used to store the input and generated posting files and O(n × M + N) space for algorithm LSB Alloc.
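The final step, generating the local posting lists from the allocation per Eq. (26), can be sketched as a single scan of the input posting file; here a doc-ID-to-workstation mapping plays the role of the allocation matrix X, and the function name is our own.

```python
def partition_posting_file(posting_lists, assignment, M):
    # Eq. (26): the local posting list of term t at WS_k keeps exactly
    # the doc IDs of L_t whose item was allocated to WS_k (X_ik = 1).
    local = [dict() for _ in range(M)]
    for t, docs in posting_lists.items():
        for d in docs:                 # docs are sorted; order is preserved
            local[assignment[d]].setdefault(t, []).append(d)
    return local

# Hypothetical example: term 0's posting list split over two workstations.
local = partition_posting_file({0: [2, 5, 9]}, {2: 0, 5: 1, 9: 1}, 2)
# local[0][0] == [2], local[1][0] == [5, 9]
```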

7. Application: quantitative method for workstation cluster design

With the proposed posting file partitioning algorithm, we are ready to show a quantitative method to design a parallel information retrieval system systematically. The quantitative method determines the number of workstations and the storage capacity per workstation of a cluster. The objective of the quantitative method is to minimize the hardware cost to satisfy a given throughput requirement. Load balancing reduces the number of workstations needed to satisfy a given throughput requirement; storage balancing reduces the storage requirement of all workstations. We show the usefulness of our work on real-world applications with the TREC document collection (Hardman, 1992).


The output, the workstation cluster configuration, should specify

• the number of workstations M in the cluster,
• the storage capacity CAP per workstation, and
• the data allocation X to generate the partitioned posting file for the M workstations.

The hardware cost of the cluster is [M × (cost per workstation)]. The cost per workstation is a function of the storage capacity CAP. The objective is to minimize the hardware cost subject to the following constraints:

• the throughput of the cluster is at least λ, and
• the amount of data allocated on a workstation does not exceed the storage capacity CAP: DS_X(WS_k) ≤ CAP for any WS_k.

The following throughput requirement is assumed.

Assumption 1. The throughput requirement λ satisfies

λ < 1 / max_{I_i}{Load(I_i)}.   (37)

Recall that the load of an item I_i is the mean query processing time of document i, and that the time unit is normalized such that one unit of time is the average processing time per posting. Assumption 1 ensures the existence of a solution to the cluster configuration problem (see the next subsection).

7.2. Cluster configuration procedure

The workstation cluster configuration is calculated according to the load and storage balancing properties of the proposed data allocation algorithm. The key issue is to relate the throughput requirement to the load balancing property.

The load balancing property determines the throughput capability of a cluster. Theorem 5 states that the load allocated on a workstation is the mean query processing time of the workstation. Load balancing is to minimize the maximum amount of load allocated on a single workstation, that is, to minimize

max_{WS_k}{Load_X(WS_k)} = max_{WS_k}{MQPT_X(WS_k)}.

With a query arrival rate of λ, the mean query processing time of each workstation should not exceed the average time interval between two arriving queries. That is,

MQPT_X(WS_k) < 1/λ for any WS_k,

or equivalently,

max_{WS_k}{MQPT_X(WS_k)} < 1/λ.

With a fixed number of workstations M, load balancing is to maximize the throughput capability λ. Whereas if λ ≥ 1/max_{I_i}{Load(I_i)}, consider the item I_j with Load(I_j) = max_{I_i}{Load(I_i)} ≥ 1/λ. Let WS_k be the workstation that I_j is assigned to. The mean query processing time of WS_k is at least the load of I_j, which is at least 1/λ:

MQPT_X(WS_k) = Load_X(WS_k) ≥ Load(I_j) ≥ 1/λ.

The workstation WS_k is overwhelmed by the query arrival rate λ. In view of this throughput limitation, Assumption 1 ensures the existence of a solution to the cluster configuration problem.

The cluster configuration is calculated as follows. According to Corollary 5, the required number of workstations M to achieve the throughput requirement λ is

M = ⌈ MQPT_seq / ((1/λ) − max_{I_i}{Load(I_i)}) ⌉.   (38)

Note that Assumption 1 ensures that (1/λ) − max_{I_i}{Load(I_i)} > 0. Let S be the total data size. The storage requirement CAP is determined according to Theorem 3:

CAP ≥ S/M + 2√3 × √(S/M) + 3.   (39)

Recall that the quantities in Eq. (39) are normalized such that one unit of space is the space occupied by the largest item. The next subsection justifies that S/M exceeds 12 in real-world applications.
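The configuration procedure of Eqs. (38) and (39) is directly computable. A sketch with hypothetical inputs; all quantities are in the paper's normalized units, and the function name is our own.

```python
import math

def cluster_config(mqpt_seq, max_item_load, total_size, lam):
    """Hypothetical helper implementing Eqs. (38)-(39): lam is the
    required query arrival rate and must satisfy Assumption 1."""
    slack = 1.0 / lam - max_item_load
    assert slack > 0, "Assumption 1 violated: lam too high"
    M = math.ceil(mqpt_seq / slack)                               # Eq. (38)
    cap = total_size / M + 2 * math.sqrt(3 * total_size / M) + 3  # Eq. (39)
    return M, cap

# Example with made-up inputs: MQPT_seq = 1000, max item load = 1.0,
# total size S = 2000, required rate lam = 0.1 queries per time unit.
M, cap = cluster_config(1000.0, 1.0, 2000.0, 0.1)  # M = ceil(1000/9) = 112
```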

8. Evaluation of proposed posting file partitioning algorithm

This section presents our evaluation of the proposed posting file partitioning algorithm. The evaluation shows the effectiveness of the proposed algorithm on a real-world large-scale document collection. We use TREC/Blogs08 (Macdonald et al., 2010) as the document collection and the AOL query log as sample queries for this evaluation. The evaluation result shows that, with the proposed algorithm, one may design a clustered information retrieval system with a simple quantitative method and still approximate optimal throughput and storage cost.

8.1. Evaluation method on load and storage balancing effect

The evaluation method is as follows. The objective of the evaluation is to show how throughput and storage cost scale with cluster size for a real-world document collection. An advantage of our proposed algorithm is that its load and storage balancing properties have been proved analytically. We thus profile the inverted file built upon the document collection to get statistics data. The throughput and storage cost of a clustered information retrieval system are then calculated from the analytical properties with the statistics data.

Fig. 22 shows the evaluation flow to collect the required data. The utility programs built for collecting data are (1) the inverted file builder, (2) the query-log profiler, (3) the item profiler, and (4) the query evaluator. The evaluation flow is as follows.

Fig. 22. Overview on the experiment system.

• Stage 1: We build an inverted file over the test document collection. The MG package (Witten et al., 1999) is applied to build the inverted file. The inverted file built by MG is then dumped and converted to our internal format for the convenience of the evaluation.

• Stage 2: The query-log profiler reads all queries in the AOL query log and annotates appearance counts for all indexed keywords in the inverted file. This generates the term popularities (the probability that a term appears in a query) for all keyword terms.

• Stage 3: The item profiler takes the term popularities and reads the whole posting file to get statistics data for all data items. An item is the set of postings referring to the same document ID. This stage generates the loads and data sizes of all data items.

• Stage 4: The query evaluator performs query processing for all queries in the test query log. An OR operation is performed on all keyword terms in a query. Besides the query results, the processing time on the test computer is also recorded. With this utility, we obtain the average time-per-posting for query processing.

The throughput of a clustered information retrieval system constructed by the proposed algorithm is calculated as follows. We rewrite the formulas in Sections 6 and 7 to get the throughput equation in units of queries per second. With M workstations, the throughput λ(M) is guaranteed to meet the following lower bound:

λ(M) ≥ 1 / (TPP × ((L/M) + max_{I_i}{Load(I_i)})),   (40)

where TPP is the average time-per-posting and L is the total load of all data items. The ratio-to-ideal on the throughput aspect is thus as follows:

λ(M)/λ_ideal(M) ≥ 1 / (1 + max_{I_i}{Load(I_i)}/(L/M)),   (41)

where the ideal throughput λ_ideal(M) occurs when each workstation is allocated the balanced load L/M.

With the proposed algorithm, storage requirement per worksta-tion is also calculated from re-writing formulas in Secworksta-tions6and 7.

For a cluster of M workstations, the required storage space per workstation in bytes is obtained from the following equation: SC(M) ≤ MIS ×



S M+ 2 √ 3

S M+ 3



, (42)

where MIS is the maximum item size in bytes and S is the total data size normalized such that the maximum item size is 1. The ratio-to-ideal on the aspect of storage cost is calculated from the following equation: maxWS k{DSX(WSk)} S/M ≤ 1 + 2√3

S/M+ 3 S/M. (43)

Note that all parameters required to calculate throughput and storage requirement are obtained from the evaluation system in Fig. 22.

8.2. Evaluation results and analysis

We present the evaluation results as a case study. Consider the case of setting up a clustered information retrieval system over the TREC/Blogs08 document collection (Macdonald et al., 2010) with profiling information provided by the AOL query log. Table 1 shows basic statistics of TREC/Blogs08. The system manager may set up the cluster with the following steps:

• Step 1: Find a range on the number of workstations that fits the storage capacity per workstation. The storage requirement for a workstation is given by Eq. (42).

Table 1
Basic data of the TREC/Blogs08 document collection.

Total documents: 624,470
Total postings: 4,545,314,247
Maximum amount of postings for a document (item): 15,979
Total load: 513,258.210938
Maximum load of an item: 1.033949


• Step 2: Observe the query processing throughput for the cluster sizes of interest found in Step 1. The query processing throughput is indicated by Eq. (40). The ratio-to-ideal on the throughput aspect is given by Eq. (41).

• Step 3: Make the final decision on cluster size from the observations in Steps 1 and 2.
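The three steps above can be combined into a simple search over M. The sketch below uses the balanced posting-file share as the Step 1 estimate and the Eq. (40) bound for Step 2; the RAM budget, posting size, TPP, and required throughput are all assumed inputs for illustration, not values from the paper:

```python
def pick_cluster_size(total_postings, L, max_item_load, TPP,
                      required_qps, ram_bytes, posting_bytes=4):
    """Smallest M whose balanced posting-file share fits in RAM and
    whose guaranteed throughput (Eq. 40) meets the demand."""
    for M in range(1, 10_000):
        storage = posting_bytes * total_postings / M       # Step 1
        qps = 1.0 / (TPP * (L / M + max_item_load))        # Step 2
        if storage <= ram_bytes and qps >= required_qps:
            return M                                       # Step 3
    return None

# Table 1 figures; TPP, demand, and RAM budget are hypothetical.
M = pick_cluster_size(4_545_314_247, 513_258.210938, 1.033949,
                      TPP=1e-8, required_qps=2000,
                      ram_bytes=2 * 1024**3)
```

With these assumed inputs the storage constraint is met from M = 9 upward but the throughput demand only from M = 11, so the search returns 11; in general whichever constraint is tighter dictates the minimum cluster size.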

Table 2 shows the parameters collected over TREC/Blogs08 for calculating throughput and storage cost with the above equations. We then examine the resulting cluster configurations using this quantitative method.

Fig. 23 shows the scaling curve of the storage requirement indicated by Eq. (42). Assume that a posting occupies 4 bytes. For high performance, we may decide to store the whole posting file in random access memory (RAM). In recent years, a contemporary desktop computer may contain 1–4 GB of RAM. We find that, for the TREC/Blogs08 document collection, the whole partitioned posting file can be stored in RAM with 5–30 workstations. The ratio-to-ideal on the aspect of storage cost, indicated by Eq. (43), is given in Fig. 24. With the proposed posting file partitioning algorithm, the required storage cost is at most 4% above the ideal storage cost.
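The 5–30 workstation range follows from simple arithmetic on the Table 1 posting count, assuming 4 bytes per posting:

```python
TOTAL_POSTINGS = 4_545_314_247          # Table 1
POSTING_BYTES = 4

total_gib = TOTAL_POSTINGS * POSTING_BYTES / 1024**3   # ~16.9 GiB overall
share_at_5 = total_gib / 5    # ~3.4 GiB per workstation -> needs ~4 GB RAM
share_at_30 = total_gib / 30  # ~0.6 GiB per workstation -> fits 1 GB RAM
```

The Eq. (42) bound adds a small overhead to these balanced shares (under 4% here), so the conclusion is unchanged.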

Fig. 25 shows the throughput scaling curve indicated by Eq. (40) for TREC/Blogs08. We draw the curve for the range of 5–30 workstations to fit the storage capacity per workstation. The ratio-to-ideal on the throughput aspect, indicated by Eq. (41), is shown in Fig. 26. Almost optimal throughput is achieved within the selected range of cluster sizes.

The evaluation result shows that, for the TREC/Blogs08 document collection, the proposed partitioning algorithm results in a cluster with almost optimal throughput and storage cost. The reason is as follows. A system manager tends to select the number of workstations M such that:

Fig. 23. Storage requirement scaling (storage requirement per workstation, MByte, vs. number of workstations).

Fig. 24. Ratio-to-ideal scaling on storage requirement (vs. number of workstations).

Fig. 25. Throughput scaling (queries/sec vs. number of workstations).

Fig. 26. Ratio-to-ideal scaling on throughput (vs. number of workstations).

Fig. 27. Throughput scaling to meet ten thousand requests per second.

(1) Storage capacity per workstation (CAP) approximates the balanced data size:

    CAP ≈ S/M.

(2) The required throughput Φ is achieved when the workload is uniformly distributed across all workstations:

    Φ ≈ 1 / (TPP × (L/M)).

Analysis in Section 4 shows that the proposed algorithm approximates optimal load and storage balancing when

(1) the balanced data size is far larger than the size of any data item: S/M ≫ maximum item size;

(2) the balanced load is far larger than the workload contributed by any data item: L/M ≫ max_{I_i}{Load(I_i)}.
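For TREC/Blogs08 these two conditions hold by a wide margin. The following check uses the Table 1 figures (item sizes in postings, loads in the paper's normalized units) at M = 30, the largest cluster size in the initial range:

```python
TOTAL_POSTINGS = 4_545_314_247   # Table 1: total data size, in postings
MAX_ITEM_POSTINGS = 15_979       # Table 1: maximum item size, in postings
TOTAL_LOAD = 513_258.210938      # Table 1: total load L
MAX_ITEM_LOAD = 1.033949         # Table 1: maximum item load
M = 30

# How many times larger the balanced share is than the largest single item:
size_margin = (TOTAL_POSTINGS / M) / MAX_ITEM_POSTINGS  # ~9,500x
load_margin = (TOTAL_LOAD / M) / MAX_ITEM_LOAD          # ~16,500x
```

Margins three to four orders of magnitude above 1 are what make the asymptotic 1-optimality bounds effectively tight in this case study.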

Fig. 28. Storage scaling for enlarged cluster size (storage requirement per workstation, MByte, vs. number of workstations).

Fig. 29. Ratio-to-ideal for large throughput requirements (ratio-to-ideal on throughput vs. number of workstations).

Statistics on TREC/Blogs08 show that the above two conditions are met, and we believe this will be the usual case for a large-scale document collection.

We also observe the performance of a cluster with a higher throughput demand. Fig. 27 shows the throughput scaling curve to provide more than ten thousand queries per second. The cluster size is scaled up to 60 workstations. The scaling curve of the storage requirement per workstation for the enlarged cluster is shown in Fig. 28. We find that, for such an enlarged cluster size, the approximation factor on storage balancing is no longer a key design concern. The required storage space per workstation is within the RAM size of a contemporary computer. Moreover, our algorithm still approximates optimal throughput for the enlarged cluster size. Fig. 29 shows the ratio-to-ideal on the throughput aspect. This result shows that our algorithm remains suitable for a clustered information retrieval system with a high throughput demand.

9. Conclusions

This paper establishes the asymptotic 1-optimal result for static posting file partitioning. The partitioning considers all aspects of load balancing, storage balancing, and communication overhead. Communication overhead is avoided by the partition-by-document-ID principle. The key result is that the algorithm is proven to be asymptotically 1-optimal on both load and storage balancing, which indicates that, for a large document collection, the algorithm achieves an almost optimal result. The usefulness of the partitioning algorithm in real-world applications is evaluated with a TREC document collection. The partitioning algorithm is static in the sense that it partitions the whole inverted file at once and has to work off-line. For a Web search engine, the document collection grows rapidly and term popularities may change day by day. Based on the results in this paper, future work is to adapt the theory to cope with dynamic changes in the document collection and query log behavior.

The major impact of the results is that the effort to design a large-scale information retrieval system is simplified. In recent years, major Web search engines have used large-scale clusters to store huge amounts of data and handle high query arrival rates. Reducing storage cost as well as improving query processing throughput is required. Moreover, instead of running complex simulations, an analytical method to design a large-scale cluster is desired. In previous work, storage efficiency was not considered and complex simulation was required for performance evaluation. Our algorithm has proven guarantees on load and storage balancing factors, and experiments show that the guaranteed factors approximate the optimum in real-world applications. These results enable the system
