• 沒有找到結果。

Distributed query processing in the Internet: exploring relation replication and network characteristics

N/A
N/A
Protected

Academic year: 2021

Share "Distributed query processing in the Internet: exploring relation replication and network characteristics"

Copied!
8
0
0

加載中.... (立即查看全文)

全文

(1)

Distributed Query Processing in the Internet: Exploring Relation Replication

and Network Characteristics

Chang-Hung Lee and Ming-Syan Chen

Department

of Electrical Engineering

National Taiwan University

Taipei, Taiwan, ROC

E-mail: { chlee@arbor. ee.ntu.edu.tw, mschen@cc. ee.ntu. edu. tw }

Abstract

We introduce the concept of network graph for dis- tributed query processing. Semijoins and joins are ternred contributive replicated semijoins and contributive repli- cated joins, respective!v, when they are interleaved into a join sequence to reduce the anioiint of data tvansmission cost required in a nehoork with replicatedrelations. Our so-

hi tion procedure consists of three come cu tive steps, n ani ely relation selection, join sequence scheduling and merge pro- cessing. .4 sirrrulator is developed to evaluate the per@-- niance of algorithnis devised. O w results show that the approach of interleaving a join sequence with contributive replicated seniijoins/joins is not onl-v ejicient in its execu- tion but also eflective in reducing the total ainount of data transmission cost required to process distributed queries.

1.

Introduction

Recently, as the nuniber of Internet applications in- creases rapidly, there has been a growing demand for dis- tributed database systems, where several databases for vari- ous applications are physically located in different sites. To save data communication cost, using mirror sites, among others. has become a favorable solution 12, 81. However, it is noted that, instead of replicating databases at sevetxl net- work sites, it would be more economic to allocate only some relation replication in distributed network sites 19, 10, 151. Wlule having its own advantages, how to perform a join operation with relation replication efficiently is a difficult problem since the data transmission over the network is ex- pensive and how to minimize the transmission cost for per- forming a join with relation replication is intrinsically hard to solve. Such a problem is even more important and diffi- cult to resolve when a multi-join query is being carried out. Moreover, conventional methodologies of query pro- cessing do not cope with some practical nehvork issues

in the distributed relational database system. Specifically, even though many prior studies have developed several ef- ficient algorithms forjoin or semijoin processing 14, 5, 131, little work has taken the network topology and the relation replication together into consideration [9]. For example, since most prior work does not consider relation replica- tion, each transmission path from one relation to another is assumed to be a fixed path in the network graph, which is indeed a limitation in practice. Clearly, without consid- ering the network characteristics, the solution quality for distributed query processing may degrade.

Consequently, to remedy the above deficiencies. we shall explore in this paper distributed query processing while taking relation replication and network characteristics into consideration. In general, optimizing large queries in dis- tributed relational database is a complicated problem [5] and many fomis of it are Np-hard [ 111. The discrete do- main of different schedules to answer a query is huge [l].

In view of tlus, our solution procedure for distributed query processing in thls paper is decomposed into three consecu- tive steps, namely relation selection, join sequence schedul- ing and merge processing. However, as the effect of semi- joins depends upon subsequent join and senujoin opera- tions, the use of semijoins cannot be determined in isola- tion. The problem becomes even more complicated when relation replication and network characteristics are consid- ered. In view of tlus fact. we develop in tlus paper the con- cepts of contributive replicated semijoins and contributive replicated joins, and utilize these operations to devise effi-

cient algorithms for distributed query processing. Semijoins and joins are termed contributive replicated semijoins and contributive replicated joins, respectively, when they are in- terleaved into a join sequence to reduce the amount of data transnussion cost required in a network with replicated rela- tions. A simulator is developed to evaluate the performance of algorithms devised. It is shown by our results that the approach of interleaving a join sequence with contributive replicated semijoins or contributive replicated joins is not only efficient in its execution but also effective in reducing

(2)

the total amount of data tmnsmission cost required when processing distributed queries.

The contributions of thus paper are twofold. We not only dcvise innovative algorithms for distributed query process- ing by esploring relation replication and network chracter- istics, but also conduct performmce studies on several im- portant parameters to provide many insights into thc prob- lem. Note that the effect of relation replication and that of network characteristics are in fact entangled, thus justify- ing the necessity of our approach to considering these two factors together.

The rest of ttus article is organized as follows. The no- tation, definitions and assumptions are given in Section 2. In Section 3, we describe network-based distributed que? processing. Simulation model and results are presented in Section 4. Tlus paper concludes with Section 5 .

2.

Preliminaries

As in most previous works in distributed databases [ 1, 5 ,

9, 141: we assume that the cost for executing a query can niainiy be expressed in terms of the total amount of inter- site data transmission cost required. Also, it is assumed that a query is in the fomi of conjunctions of eqtii-join predicates

and,all attributes are renamed in such a way that two join attributes have the same attribute name if and only if they have a join predicate between them.

For ease of discussion, GQ denotes the quev graph and IGQ

I

indicates the number of relations in a query, while G,v is the network graph of query and IGN

1

shows the nuniber of network sites in a network graph. We use transmission coefficient TC,,, to indicate the communication cost of transmitting one data unit from site S, to site S,. Ih'l is used to denote the cardinality of a set K. Let w . 4 be the

width of an attribute A and ,WR, be the width of a tuple in R;. The size of the total amount of data in R; can then be denoted by W R , lRtl. For notational simplicity, IAl is used to denote the cardinality of the domain of an attribute A.

Define the selectivity pi,, of attribute A in Ri as

w,

where &(A) is the set of distinct values for the attribute A

in

R;.

Ri - A -+ R j means a semijoin from Ri to Rj on

attribute A. To simplifi the notation, R; + Rj is used to

mean a semijoin from Ri to Rj in the case that the semijoin attribute does not have to be specified. Also, the notation R,+- Rj is used to mean that R; is sent to the site of Rj and a join operation is performed with Rj there. We use RiI to

denote the resulting relation after some reducers (joins or semijoins) are applied to an original relation R;. The trans- mission coefficient TC,,, is used to serve as the stati- cal avenge value of the relative bandwidth in each network edge and we define an effectual semijoin as follows.

Definition 1 (Effectual semijoin): A semijoin, R,(S,)

-

B -+ R2(S2), is called effectual, if its cost of

sending R ~ ( B ) , i.e., TC1+2(wBIR1(R)I = w B I B / p l , b ) >

is smaller than its benefit, i.e., TC2,1(,wR, lRzl --

located at sites SI and S;? respectively, and W H ,

/&I

and W R * I f ? 2 I P l , b represent, respectively, the sizes of R2 before

and after the semijoin. Thus, W B [ R I (B)ITCl,i is used to

denote the cost of a senujoin R I -- B ---f R P .

In this paper, we use 'WR, IR1

I

x TC1-2 to denote the

cost of a join RI Rz. Furthermore, we assume that. the

values of attributes arc uniformly hstributed over all tuples in a relation and that the values of one attribute arc inde- pendent from each other. The cardinalities of the resulting relations from join operations can thus be estimated accord- ing to the formula in [3]. Also, all tuples are assumed to have the same size. Results on the effect of data skew can be found in [7, 121.

We shall introduce the concepts of contributive repli- cated semijoin and contributive replicated join which are in essence new reducers to perform a join sequence. By interleaving these reducers into a join sequence, the data transmission cost required can be significantly reduced.

Definition 2 (Contributive replicated semijoin): A semijoin, Ri(Sreplicated) - A -+ Rj(Staryet), is said to

be contributive replicated, if its communication cost of

sending

Ri

(A), i.e., T C r e p l i c a t e d i t a r g a t x (.~,qlRi(A)l == w,qlA41pi~,). is smaller than the cost of original semi- join, i.e., Ri(Soriyinal) - A

-

Rj(Starget), where

Ri(srep/icated): & ( S o r i y i n a l ) ! and Rj ( s t a r g e t ) mean that

R;, R, and Rj are located at sites S r e p l i c a t e d , Soriyinal and

Staryet respectively.

Definition 3 (Contributive replicated join): A join, Ri(Srcplicated)

==+

R j ( S t a r g e t ) , is said to be contrihu- tive replicated, if its communication cost of sending Ri. i.e., T C T r a p ~ i c u t a d - - , t a r g e t x ( W R , IRzI), is smaller than the

cost of the original join operation, i.e.,

Ri(Soriginal)

Rj ( S t a r g e t ) , while the infomiation in Ri(Sreplzcated) should be the same as in the Ri(Sorigi.nal).

As will be shown later, contributive replicated semijoin and contributive replicated join are very useful in reduc- ing the data transmission cost required for distributed query processing. Note that the processing time in each comput- ing host may v a y , and its system dependent optimization is a challenging issue itself [ 11 and is beyond the scope of thus paper.

'WR21&IPl,b = , W R 2 1 R 2 1 ( l - P l , b ) ) , where RI a d R;t N e

3. Network-Based Dist. Query Processing

In [ I I], it is proved that five cIasses of distributed query optinuzation problems, i.e., local optimization of semi- joins, join sequence optimization, relation semijoins 011

broadcasting networks, single-reducer tree queries, and full- reducer tree queries, whch cover the majority of distributed query optimization problems in the query processing, arc NP-hard. The query processing we study in tlus paper with the consideration of network characteristics and rela- tion replication is therefore also NP-hard by reduction [ 6 ] ,

(3)

(a) An illustrative query

R , A

I

0.7

I

lo00

R*

R,

(b) Profile ofthe query in (a)

A 0.6 1200

B 0.6 lo00

Figure 1. An illustrative query and its profile

thereby justifying the need of applying heuristic approaches to solving this problem.

In this paper, the procedure for distributed query process- ing is decomposed into three consecutive steps, i.e., namely relation selection, join sequence scheduling and merge pro- cessing. For ease of exposition, an illustrative example is described below to explore each proposed scheme. Without loss of generality, the query we use is as follows.

select A, B from R1, R2, R3 where Rl.A = R2.A and

R1.B = R3.B.

Consider a company which has three fragments of databases, R1,

RP,

and R3, located in four sites, S1, SZ, S3, and S4. For example, R1 resides in sites S.2 and S4, and on the other hand S2 contains both RI and R2. The corre- sponding query graph and its profile are given in Figure l a and Figure Ib, respectively. Specifically, as shown in Fig- ure 2a, a network graph illustrates the distribution of rela- tions in network sites with relation replication considered. The transmission cost among network sites, i.e., TC,,,, is given in each network edge. For instance, the value of

"3" in the arrow from S1 to S2 means that the transmission cost of sending one data unit from S1 to S p is 3 units, i.e.,

TCI+p = 3. Note that, the asymmetry feature between up-

load path and download path is also considered in this net- work topology since TC,,, and TC,,, may differ from each other. With a shortest path algorithm [6], each min- imum cost among network sites is generated in Figure 2b.

3 1 1 3

(a) An illustrative network graph

I

s,

I

3 1 5 1 0 1 4 1 I S 4 1 6 1 1 1 3 1 0

(b) Shortest path among network sites

Figure 2. An illustrative network graph and its shortest path among network sites

The transmission cost from S, to S,,, TC,,,, is given as "0". In this example, S1 is assumed to be the final site

S f z n a l , where the final results are needed.

3.1. Relation Selection

In the first step, i.e., relation selection, our main purpose is to search appropriate relations from the existing network graph based on the relationship among relations and net- work sites.

3.1.1 Independent Relation Selection (IS)

The scheme of Independent relation Selection (abbreviat- edly as IS) explores how to select each relation required

from the most appropriate network site, from whch the cost of sending data to the final site is the minimal. For example, as shown in Figure 2a, R1 is allocated at sites S2 and S1. It can be seen that TC4,, = 6 is smaller than TCp,, = 8 from Figure 2b. RI (S4) is thus selected. Similarly, R 2 ( S 3 )

is then selected since TC3-1 = 3

<

TCp,] = 8. Finally R3(Sl) is selected since SI is the final site. Scheme IS can be formally described as follows. For better readability, a few abbreviations are utilized in IS. cost(R,(S,), S f z n a l )

(4)

the final site S f i n a l , wluch needs the final result of query.

SRSP

indicates the set of relation-site pairs selected. Scheme IS : Independent relation Selection 1. beein " 2. S R S P :=

4;

3. foreach R; in Go do Y 4. begin 5. foreach Ri(Sm) in GN do 6. begin

7 . select Ri(S,) such that cost(Ri(S,), S f i n a l ) m i n { c ~ ~ t ( R i ( S , ) , S f i n a l ) } ;

8. S R S P := SRSP U {Ri(Sm)};

9. end 10. end 11. end

Clearly, since the transmission cost of sending data to the final site is the only decisive parameter, the relation which has lower transmission coefficient TCm+,final will be se- lected under scheme IS.

3.1.2 Greedy Relation Selection (CS)

Note that IS described above does not take the intrinsic re-

lationship among relations into considerdtion. However, as shown below, this relationship, if properly exploited, can lead to pronunent performance improvement. In view of this, an altemative scheme, i.e., Greedy relation Selection (abbreviatedly as GS), is devised as follows. Scheme GS is greedy in nature. The first selected site R3(S1) is the one

which incurs the nuninial transmission cost of sending its data to the final site. Nest, we select RI (S.i) as the second relation because the transmission cost of sending data from

R1(S4) to R3(S1) is the smallest one among all legitimate considerations. The sequentially selected relation-site pairs are the ones which incur the minimal communication cost to any selected relation-site pair. With the minimum cost of a combination of semijoidjoin between R2(S2) and R1 (S4) (i.e.. Rl(S4) -+ Rz(S2) and Rh(S.2) 3 Rl(S4)), Rz(S2)

is thus the tlurd selected relation. In general, scheme GS can be described as follows. Let cost(R,(S,). R3 ( S n ) , SA4.J)

be the transmission cost of merging R,(S,) to R3 ( S n ) with the join operation SMJ. S ~ s p indicates the set of relation-

suchthat cost(R,(S,), R j ( S n ) , S h / l J ) = minRt (S,) $i S R ~ ~ , {cost( Rt ( s m ) 7 R j ( s n ) 9

SMJ)};

R 3 ( S n ) E S ~ s ~ 10. SRSP :=

SRSP

U

{&(sm)};

11. end 12. end

In GS, not only is the transmission path taken as a param- eter for relation selection, but also the intrinsic relationship among relations is utilized to make the decision for relation selection. In addition, the notion of effectual semijoin is used to evaluate the usefulness of semijoins. Consequently, with relations selected, the join sequence can next be deter- mined.

3.2. Join Sequence Scheduling (JSS)

After all the relation-site pairs, such as R1 (S4), R 2 ( S 3 )

and R3(S1) from IS scheme, are selected in the first step,

we shrill determine the join sequence for later merge pro- cessing. With the method of interleaving joins with semi- joins, the step of Join Sequence Scheduling, abbreviatedly as JSS, is devised. First, the relation whch needs the mini-

mal communication cost of sending its data to the final site is selected as the one to send merged results to the final site. R3(S1) is thus selected since RB is located at the final site $ 1 . Next, it can be seen that a combination o f semijoin

Rs(S1)

-

Rl(S4) and join Rlt(S4)*RB(SI) is the op-

eration with the minimal cost to merge relation R I ( S4) to

R3(S1). Then, the second relation selected is R1(S4). Fi- nally, with Rz(S3)

* R 1

(&),

Rz(S3)

is the final relation to be selected into the join sequence scheduling. As a result, the join sequence of the selected relations obtained by IS scheme is J o i n ( J o i n , ( R 2 ( S ~ ) , Rl (S4))R3(S1)), , where

J o i n ( Ri ( S m ) , Rj ( Sn )) indicates the join result of joining

Ri(S,) with Rj(S,). According to the operations above,

the algorithmic form of scheme JSS can be described as fol- lows, where S J S ( ~ ) means the i-th step of the join sequence, and SS J S is the set of relation-site pairs selected by scheme

JSS. SR S P is the set of relation-site pairs selected from the step of relation selection (i.e., by scheme IS and scheme GS).

site pairs selected. Note that the relationship among rela-

tions is taken as an important parameter in GS. Scheme 1. begin

J ~ s

: Join Sequence Scheduling Scheme

GS

: Greedy relation Selection

1. begin

3. select R;(S,) E G N such that cost(R;(S,), S f i n a l )

2 . S R s P := f+b;

= mi71R,(Sm) € G N { cost(Ri(Sm ) > Sfin a d

1;

4. S R S P := S R S P U {Ri(sm)}i

5 . foreach Ri $!

SRSP

do 6. begin

7 . if (Ri

+

Rj

5

R -+

Ri+

Rir

+

R j )

t h e n S M J := Ri R j ; 8. else (SD1.J := Rj -+ Ri

+

R;I

I* eflecttral seniijoin *I

9. select Ri(S,,,) $2 SRSP in GN

(5)

such that cost(R;(S,), R j ( S , ) , S M J ) - - minnR*(Sm)ESRSP, {cost(Rt(Sm

11

Rj ( S R I ; S M . J ) } ; R, ( S n ) ESs J S 11. S R S P := S R S P - { R i ( s m ) } ; 12. S S J S := S S J S U { R i ( S m ) ) ; 13. S J s ( i ) := ( R i ( S m ) , Rj(S,)) a n d i = i

+

1; 14. end 15. end

Thus, according to the S R S ~ (i.e., {Rl(S4), R2(S3)rR:c(S1))) generated in IS, we have S ~ s ( 0 ) = set of S J s ( o ) . These operations will be utilized in merge processing.

( R l ( S 4 ) ; & ( S I ) ) and S J S ( ~ ) = (R2(5'3), R 1 ( S 4 ) ) in the

3.3. Merge Processing

After the join sequence is decided, the merge processing is performed. The merge processing can be taken as the re- vised and polished step for the step of JSS. The contributive replicated semijoins and contributive replicated joins will be interleaved into a join sequence in the step of merge pro- cessing. Tlvo alternative schemes, normal merge processing (abbreviatedly as N h f ) and merge relation replication (ab- breviatedly as hnZ), are devised.

3.3.1 Normal Merge Processing (NM)

Normal Merge processing (Ain/c) interleaves the join se- quence obtained with semijoins in the merge processing. As

long as an effectual senujoin is found in each step of merge processing, that semijoin is employed as a reducer. There- fore. the sequence of the merge processing for this example in J S S would be: R2(S3)

*

R1 (S4) : 4, 800, RS(S1) -+ RII(S1) : 2,400, and

Rl(S2)

+

R3u(S5) : 7,200. The to- tal amount of transmission cost is 14,400. The algorithnuc form of iVh1 is described below, where JGQJ indicates the number of relations in the query graph GQ. Joins and effec- tual semijoins are two merging operations. The minimum- cost operation will be selected from these two operations in each step of merging processing until the query result is developed. After each merging processing, the profile of query graph is updated accordingly.

Scheme N M : Normal Merge processing

1. 2. 3 . 4. 5. 6. 7 . 8. begin fork = ~ G Q ]

-

1 to 0; begin ( R i ( s m ) > q s 7 J ) := S J S ( k ) , k = k - 1; if ( R ,

=+

Rj

<:

Rj + R; i- R i t a R j ) then Sk1.J := R;

*

R j ; /*effectual semijoins*/

and update the profile of query graph accordingly; /*joins*/

else(SMJ := Rj --+ R;

+

Ril

+

Rj);

merge R,(S,) and Rj(S71) to Rj(S,,) with SMJ

9. end 10. end

3.3.2 Merge relation Replication (MR)

Note that replicated relations are not utilized inNh1. To rem- edy this, instead of directly merging all relations selected, scheme

A4R

(standing for Merge relation Replication) takes the replicated relations on the unselected sites into consid- eration for merge processing. Therefore, before each merge processing is executed, all replicated relations are exam- ined. When contributive replicated joins or contributive replicated semijoins are identified, new R; (Sreplicated) re- places prior R, (Soriginal). Thus, the merge sequence of the previous example in Section 3.3.1 becomes the following:

R2(S2) 3 R 1 ( S 4 ) : 3,600, R3(S4) 3 R11(S4) : 0, and

R111(S4)

+

R3(S1) : 7,200. Thus, the total transmission

cost is 10,800. Instead of using R2(Ss)

+

R1 (S4) and

R3(S1) + R1/(S4) in scheme NM, a contributive repli-

cated join R2(S2) 3 R1 (S4) and a contributive replicated semijoin R3(S4) -+ R11(S4) are utilized in scheme h1R

to save 25%, i.e., 1 4 4 ~ ~ & ) 8 " " , of the transnussion cost for

merging processing, showing the vely advantage of scheme

MR. A formal description of h1R is described below, where S M R is the set of merged relations. In Scheme MR, costl indicates the transmission cost of merging two relations with the join method. The cost.;! indicates the one of using the combination of joins/effectual semijoins The costs de- notes the one of utilizing the contributive replicated join to merge. The cosh contains the contributive replicated semi- join in the merging processing. In each step of merging two relations, scheme MR selects the minimum-cost operation

from costl, costn, costs, and costl. After each merging pro- cessing, the profile of query graph is updated accordingly until the final result is developed.

Scheme M R : Merge relation Replication 1. begin 2. f o r k = lGQ - 1 too: 3. begin 4. SMR :=

4;

5 . (Ri(Sm), Rj(S74 := S J S ( k ) l 6 . k = k - 1; 7. costl = cost(R;(S,)

+

R j ( S , ) ) ; 8. cost2 = cost(Rj(S,) --+ R i ( S v t ) ) /*joins*/ +cost(Ri/(S,)

+

R j ( S n ) ) ;

/*combination of joins/effectual semijoins*/ 9. = m i n S r r . p i I a n t ~ . d - - # f S m , R 1 ~ S M R

{cost(Ri(Sre~/icated-m)

*

R j ( S n ) ) >I /*contributive replicated joins *I 10. cost4 = ~ l i ~ l S P e p [ ~ c n t e ~ - " # S n

{cost(Rj (I-(jreplicated-n) + &(Sm

1)

+cost(RiI(S,)

*

Rj(Sn))}j

/*contributive replicated semijoins */ 1 I . cost = min{ cost 1 , cost2

,

costs, c o s t 4 ) ; 12. switch (cost)

13. {

14. case costl do R i ( S m )

+

Rj(S,);

15. casecost2 do Rj(S,) -+ R;(S,)

16. case cost3 do Ri(Sreplicated-m)

+

Rj(S,);

(6)

17. casecost do RJ(STeplzcated-,) + &(ST,)

18. }

19. S M R := S M R U { R 3 } and update the profile 20. end

21. end

and R,/(S,)

*

Rj(S,);

of query graph accordingly;

3.4. Solution Algorithms Constructed

Since there are two alternative schemes in the first step and two alternative methods in the third step, four solution algorithms are possibly developed to deal with distributed query processing. For the space limitation, three selected algorithms, namely I S N M , ISMR, and GSMR, will be ex-

plored, where ZSNM can be viewed as an extension from the conventional query processing.

Algorithm ISNM is to independently select each relation from the network topology and query, i.e., using scheme IS in the step of relation selection. Also, replicated relations are not taken into consideration, i.e., using scheme N h f in the step of merge processing. JSS is used as the method to decide the join sequence. The resulting execution sequence is as follows: R2(Ss)

*

R l ( S 4 ) : 4,800.

&(SI)

-

R,/(S,) : 2,400, and RI//(S4)

+

R 3 ( S l ) : 7:200. Thus. the total transmission cost of employing ZSNM is 14,400. Algorithm ISNM: Independent Selection and Normal Merge

Step 1: Using IS scheme to select the relations required for the query processing.

Step 2: Deciding the join sequence with ./SS scheme. Step 3 : Merging the relations required with iVhf scheme. Algorithm I S N M is formally described below.

Algorithm ZSMR is similar to algorithm ZSNM except that scheme MR is used in the step of merge processing. The resulting execution sequence will become the follow- ing: R2(S2)

*

R l ( S 4 ) : 3,600, &(&)

-

RII(S4) : 0, and Rln(S,) R3(S1) : 7,200. Thus, the total trans- mission cost of employing Z S M R is 10,800. Comparing ISMR with ZSNM, we find that two merging steps in I S M R ,

Rz(S2)

*

R I ( & ) and R3(S4)

-

R1/(S4), are d a e r - ent from Rz(S3)

+

RI(SI) and &(SI)

-

Rlr(S4) in BNM. By utilizing the replicated relations on the uns- elected sites as contributive replicated semijoidjoin, about

, of the transmission cost is saved. Algorithm ZSMR is outlined as follows.

Algorithm G S M R is similar to algorithm I S M R except that scheme GS is used in the step of relation selection. Un- der algorithm GSMR, the resulting execution sequence be-

comes: RI(&)

-

&(&) : 0, Rzi(S2)

*

R I ( S ~ ) : 2,520, R3(S;1)

-

R I / ( & ) : 0, and R I / / ( & )

+

&(SI) : 7,200. Therefore, the total communication cost is 9,720. Com aring the execution results inZSNM, about 32.5% i.e.,

14.4:{;t;720,

of the cost is saved by these contributive repli- cated semijoins and the greedy selection.

25y0, 11.400- 10.800

14,400'

4.

Simulation Model and Results

In this section, the simulation niodel is introduced in Section 4.1. Evaluating average value of transmission co- efficients, the density of relation replication, the density of network graph, and the nehvork size are shown in Section 4.2.

4.1. Simulation

Model

To show the execution of those three algorithms, I S N M ,

ZSMR, and GSMR, unless mentioned otherwise, the network

graph with 25 network sites /C,Y

I

= 25 is utilized to evalu- ate several parameters of our simulation model. With a uni- form distribution, the transmission coefficient TC,,, of each edge of the network graph is randomly determined by a given mige ( 0 , l O ) and its average value ITCJ = 5. The simulation program was coded in C++, and input queries were generated as follows. The average number of rela- tions in a query was pre-determined as ) G Q ) = 5. The occurrence of an edge behveen two relations in the query graph was determined according to a given probability. de- noted by ~ Q C . Without loss of generality, only queries with

connected query graphs were deemed valid and used for our study. To make the simulation be feasibly conducted, the average number of tuples in a relation is selected froni 2.500 lo 3: 500 and the cardinalities of attributes are also se- lected from 2,500 to 3,500 accordingly. As such the join selectivities can reflect the reality.

In addition, relation-site distributions are randomly gen- erated by the simulation program with a unifomi distribu- tion according to a given density of relation replication in a nehvork graph, denoted by ~ R S D . For example, when P R S D is selected to be 0.5, it means that one relation, say R,, is located at one network site, say S,,, with the prob-

ability of 0.5. In our experiments, each execution cost is the result of the average from 5,000 query executions. For each processing, three scheduling algorithms. i.e., I S N M .

ZSMR and GSMR, are performed to execute the query. When

hvo reliltions not having join predicates are to be joined to- gether, a Cartesian product is performed. From our simula- tion, since relative performance of these schemes is not sen- sitive to the density of the query graph, we use PQG = 0.5 in our simulation. The occurrence of an edge between two sites in the network graph was detemuned according to a given probability, denoted by p ~ c . As before, only con- nected network graphs were deemed valid and used for our study.

To exhibit the benefit of relation replication, the re-

duction ratio T I S =

I

C u s ' ( z ~ s R l ' , r ~ ~ ~ ~ ~ ~ z s N ~ ~

I

is used to compare ISMR and ISNM. Similarly, TGS =

(7)

I

25 5 10 25 50

Averagedue oftranshssion coefficient

Figure 3. Performance evaluation of the aver- age value of transmission coefficient lTCl

4.2. Performance Evaluation

Figure 3 shows that the relative performances of our pro- posed algorithms are not sensitive to the cost range of the network edge. Therefore, in the following experimental study, the use of ITC) = 5 will not affect the relative merits of the schemes evaluated. Furthermore, Figure 4a depicts that total communication cost of each proposed algorithm decreases with the growth of the density of relation replica-

tionpRsD in network graphs. Clearly, more alternatives of

relation selection will lead to a lower communication cost required. In addition, a 1argerpRsD tends to favor the use of contributive replicated semijoins andor contributive repli- cated joins, which will in tum increase the values of 7-1s

and rGs. This fact is validated by the results in Figure 4b. Performance study of relation replication for various net- work graphs is conducted in Figure 5. It can been seen that 7-1s and TGS in Figure 5b grow with the value of PNG while total communication cost of each proposed algorithm de- creases in Figure Sa. It shows that more alternative trans- mitting paths lead to a lower communication cost required. T h ~ s means that the benefit of relation replication increases with the growth of the density of network graph. In ad- dition, with the growth of IGNI, the execution costs de-

crease as shown in Figure 6a, leading to more prominent performance improvement. The performance improvement, however, will be saturated as the number of network sites (GN

1

reaches g s a t , where the value of gsat can be approxi- mated to

p,dz$!

= 20 empirically in Figure 6b. Intu- itively, even thou& the relations required are more widely distributed in the network, those replicated relations located far away from the final site will be of little impact to reduce the transmission cost. That is to say, if we want to further decrease the transmission cost of &stributed query process- ing with replicated relations, P N G , P R S D , and ~ C N

I

should be increased to overcome such a saturation point.

1 0 0 0 0 0

,

I S,, 8 0 0 0 0

B

& 6 0 0 0 0 .- .I

i

4 0 0 0 0 i= 2 0 0 0 0 0 0 3 0 . 5 0.7 Density of relation replication (a) The transmission cost o f IS,,, IS,, and GS,,

40% I

0%

'

I

0 . 3 0 4 0 5 0.6 0 7

Density of relation replication ( b ) The reduction ratio o f ris and rGS

Figure 4. Performance evaluation with various PRSD 4 0 0 0 0 I 0 I S N M 0 IS,,

-

3 2 0 0 0 B 2 2 4 0 0 0 5 1 6 0 0 0 .- 5 Ez 8 0 0 0 0 0.3 0 4 0 5 0.6

The density of network graph ( a ) The transmission cost o f IS,,, IS,, and G S , ,

...

s

20% B e! 10% I 0.3 0 4 0 5 0 6

The density of network graph

(b) The reduction ratio of rIs and rOs

Figure 5. Performance evaluation with various density of network graph

(8)

3 2 0 0 0

1

0 c 5 2 4 0 0 0 2 1 6 0 0 0 5 8000 0 1 5 2 0 25 30 T h e n u m b e r o f n e t w o r k sites ( a ) T h e transmission c o s t o f IS,,, IS,, a n d GS,, 4 0 0% I o 3 0 0 %

2

2 0 0 %

j

I O 0% 0.0% I I 1 0 I5 20 2 5 30 T h e n u m b e r of n e t w o r k s i t e s (b) T h e reduction ratio o f rIs a n d ras

Figure 6. Performance evaluation with various network sites IGN

I

5. Conclusion

We studied in this paper the subject of exploring the effects of relation replication and network characteristics to optimize the execution of distributed query processing. Our solution procedure was composed of three consecutive steps, namely relation selection, join sequence scheduling and merge processing. Our results showed that the approach of interleaving a join sequence with contributive replicated semijoins/joins is not only efficient but also effective in re- ducing the total amount of data transmission cost required to process distributed queries.

6.

Acknowledgment

The authors are supported in part by the Ministry of Ed- ucation Project No. 89-E-FA06-2-47 and the National Sci- ence Council, Project No. NSC 89-2219-E-002-028 and NSC 89-2218-E-002-028, Taiwan, Republic of China.

References

[ l ] G. Bojan and M. M. Qutaibah. Combinatorial Op- timization of Distributed Queries. Transactions on

Knowledge and Data Engineering, pages 91 5-927,

December 1995.

[2] J. W. Byers, M. Luby, and M. Mitzenmacher. Access- ing Multiple Mirror Sites in Parallel: Using Tomado

Codes to Speed up Downloads. INFOCOM '99. Eigh- teenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE,

pages 275 -283, Jan. 1999.

[3] M.-S. Chen and P. S.Yu. Combining Join and Semijoin Operations for Distributed QueIy Processing. IEEE Transactions on Knowledge and Data Engineering,

5(3):534-512, June 1993.

[4] M . 4 . Chen and P. S. Yu. A Graph Theoretical Ap- proach to Determine a Join Reducer Sequence in Distributed Query Processing. IEEE Transactions on Knowledge and Data Engineering, 6(1): 152-165,

February 1994.

[5] M.-S. Chen, P. S. Yu, and K.-L. Wu. Optimization

of Parallel Execution for Multi-Join Queries. IEEE

Transactions on Knowledge and Data Engineering,

8(3):416-428, June 1996.

[6] 7: H. Cormen, C. E. Leisenon, and R. L. Rivest. In- trodution to Algorithms. The MIT Press, Cambridge, Massachusetts London, England, 1989.

[7] E;. A. Hua, Y L. Lo, and H. Young. Considering Data Skew Factor in Multi-way Join QueIy Optimization

for Parallel Execution. VLDB Journal, 2(3):303-330,

July 1993.

[8] G. S. Jung, Q. M. Malluhi, and W. G . Brown. A Scheme for High-performance Data Delivery in the Web Environment. Parallel and Distributed Systems,

1998. Proceedings. 1998 International Conference,

pages 210 -217, 1998.

[9] D. Kossmann. The State of the Art in Distributed QueIy Processing. In ACM Computing Survey,

September 2000.

[IO] A. B. Stephens, Y. Yesha, and K. E. Humenik. Optimal Allocation for Partially Replicated Database Systems

on Ring Networks. IEEE Transactions on Knowledge and Data Engineering, pages 975 -982, 1994.

[11] C . Wang and M.-S. Chen. On the Complexity of Dis- tributed Query Optimization. IEEE Transactions on

knowledge and Data Engineering, pages 650 -662,

1996.

A Parallel Soa Merge Join Algorithm for Managing Data Skew.

IEEE Transactions on Parallel and Distributed Sys-

[ 131 Z. Xie and J. Han. Join index hierarchies for support- ing efficient navigations in object-oriented databases.

In Proc. 1994 Int. Con$ Very Large Data Bases, pages

522-533, Santiago, Chile, September 1994.

[ 141 C . T. Yu and C. C. Chang. Distributed Query Process- ing. .4CM Computing Surveys, 16(4):399-433, De-

cember 1984.

[15] C. T. Yu and W. Meng. Principle of Database Que? Processing for Advanced Applications. Morgan KauJ

mann Pulisher, Inc. San Francisco, Calfornia, 1997.

[12] J. L. Wolf, D. M. Dias, and P. S. Yu. tems, 4(1):70-86, January 1993.

數據

Figure 1. An illustrative query and its profile
Figure  3.  Performance evaluation of the aver-  age value of transmission coefficient  lTCl
Figure  6.  Performance evaluation with various  network sites  IGN  I

參考文獻

相關文件

Now, nearly all of the current flows through wire S since it has a much lower resistance than the light bulb. The light bulb does not glow because the current flowing through it

In view of the large quantity of information that can be obtained on the Internet and from the social media, while teachers need to develop skills in selecting suitable

Using this formalism we derive an exact differential equation for the partition function of two-dimensional gravity as a function of the string coupling constant that governs the

„ A host connecting to the outside network is allocated an external IP address from the address pool managed by NAT... Flavors of

„ An adaptation layer is used to support specific primitives as required by a particular signaling application. „ The standard SS7 applications (e.g., ISUP) do not realize that

„ A socket is a file descriptor that lets an application read/write data from/to the network. „ Once configured the

The simulation environment we considered is a wireless network such as Fig.4. There are 37 BSSs in our simulation system, and there are 10 STAs in each BSS. In each connection,

In response to the twenty-first century’s global economy, “broadband network construction” is an important basis for the government in developing the national knowledge and