
3 Communication Sensitive Techniques for Grid Scheduling

3.6 Enhanced Smallest Communication Ratio Algorithm (ESCR)

In order to optimize the system utilization, the enhanced SCR is presented in the following. Figure 3.4 shows another example of master-slave tasking in a heterogeneous network. Processors (P1, P2, P3, P4) within different clusters (C1, C2, C3, C4) present different computational speeds, (T1, T2, T3, T4) = (3, 6, 11, 13). In a heterogeneous network, communications from the master to the different clusters (C1, C2, C3, C4) have different network bandwidths.

For example, β1 : β2 : β3 : β4 = 6α : 15α : 30α : 10α in Figure 3.4(a), resulting in T1_comm = 5, T2_comm = 2, T3_comm = 1 and T4_comm = 3. According to definition 3.4, we have BSC = 48. To demonstrate that SCR and ESCR maximize system throughput under a given execution deadline, we assume the system deadline is 200 in this example. In the SCR implementation, according to definition 3.5, we have task(P1) = 6, task(P2) = 6, task(P3) = 4 and task(P4) = 3. In total, 19 tasks are dispatched to the four slave processors in each BSC. The communication costs of the slave processors are comm(P1) = 30, comm(P2) = 12, comm(P3) = 4 and comm(P4) = 9, respectively. In Figure 3.4(b), the SCR method distributes tasks in the order P3, P4, P2 and P1 according to the processors' communication ratio defined in definition 3.8. Processor P3 first receives 4 tasks and finishes at time t = 48. Meanwhile, processor P1 is still receiving tasks during t = 48~55.

The second BSC starts to dispatch tasks at t = 55. Namely, t = 55 is the earliest time for P3 to receive tasks in the second scheduling cycle. Therefore, P3 is idle for 7 units of time.

Lemmas 3.4 and 3.5 state the above phenomenon. In this example, the completion time of the jth BSC depends on the finish time of processor P1; we have TfinishSCR(BSC3) = 183.

As the system deadline is 200 and TfinishSCR(BSC4) = 238, which exceeds the deadline, SCR completes 57 tasks before the deadline since the remaining time slots are not further utilized.
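As a quick arithmetic check of the numbers above, the following Python sketch recomputes BSC, task(Pi), comm(Pi) and the SCR cycle finish times for this example. It is illustrative only: the variable names are ours, and it assumes, as the example suggests, that BSC is the least common multiple of the per-task costs Ti + Ti_comm (definition 3.4 is not reproduced in this section).

from math import lcm

# Per-task computation time Ti and communication time Ti_comm for P1..P4 (Figure 3.4).
T      = [3, 6, 11, 13]
T_comm = [5, 2, 1, 3]

# Assumed reading of definition 3.4: BSC = lcm of (Ti + Ti_comm) over all processors.
BSC = lcm(*(t + c for t, c in zip(T, T_comm)))              # 48

task = [BSC // (t + c) for t, c in zip(T, T_comm)]          # [6, 6, 4, 3] -> 19 tasks per BSC
comm = [k * c for k, c in zip(task, T_comm)]                # [30, 12, 4, 9]
comp = [k * t for k, t in zip(task, T)]                     # [18, 36, 44, 39]

T_idle = sum(comm) - BSC                                    # 7 (see Lemma 3.5 below)

# Lemma 3.6 (below) with Pk = P1 (maximum communication ratio): finish time of the j-th BSC.
def finish(j):
    return sum(comm) + comp[0] + (j - 1) * (comm[0] + comp[0] + T_idle)

print(BSC, task, comm)                                      # 48 [6, 6, 4, 3] [30, 12, 4, 9]
print(finish(3), finish(4))                                 # 183 238, as stated above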

The FPF task allocation is depicted in Figure 3.4(d). According to Lemma 3.1, task(Pmax+1) = task(P4) = 0, which means that P4 is excluded from task allocation due to its slow computational speed. According to the speed-oriented scheduling policy, FPF has the dispatching order P1, P2, P3 with task(P1) = 6, task(P2) = 6 and task(P3) = 4.

Observing that there are 4 scheduling cycles within the system deadline for P1, while the remaining time (26) for P2 and the remaining time (14) for P3 are not enough to dispatch a fourth cycle, the FPF algorithm completes 54 tasks before the system deadline, while SCR and ESCR complete 57 and 66 tasks, respectively. The SCR-based task allocation schemes (SCR and ESCR) thus present better system throughput and average turnaround time.
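The deadline-bounded throughput comparison above reduces to a few lines of arithmetic; the cycle counts below (four for P1, three for P2 and P3 under FPF) are taken directly from the text, not derived here.

deadline = 200

# SCR: 19 tasks per BSC; only three BSCs finish before the deadline (183 <= 200 < 238).
scr_tasks = 3 * 19                                # 57

# FPF: P1 fits four cycles of 6 tasks, P2 three cycles of 6, P3 three cycles of 4, P4 none.
fpf_tasks = 4 * 6 + 3 * 6 + 3 * 4                 # 54

print(scr_tasks, fpf_tasks)                       # 57 54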

According to the above illustration, we have the following lemmas demonstrating important characteristics of SCR and ESCR.


Figure 3.4: Master-slave tasking in a heterogeneous network with a deadline. T1_comm=5, T2_comm=2, T3_comm=1, T4_comm=3, T1=3, T2=6, T3=11 and T4=13. (a) Heterogeneous network (b) SCR task allocation (c) ESCR task allocation (d) FPF task allocation.

Lemma 3.4: In the SCR task allocation scheme, the amount of tasks assigned to each slave processor Pi in one BSC can be calculated by the following:

task(Pi) = BSC / (Ti + Ti_comm).   (3.4)

Proof: This lemma can be easily established by replacing i with max+1 in definition 3.5. █

Lemma 3.4 clarifies that each slave processor receives the same amount of tasks in every BSC under the SCR scheme. On the contrary, in FPF, according to Lemma 3.1 and definition 3.10, processor Pi with i < max+1 receives BSC / (Ti + Ti_comm) tasks, processor Pmax+1 receives (BSC - Σ_{i=1}^{max} comm(Pi)) / Tmax+1_comm tasks, and processor Pi with i > max+1 receives none.
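The contrast between the two per-BSC allocation rules can be sketched as follows. This is a minimal illustration of Eq. (3.4) and of the FPF behaviour described above; the function names and the 1-based max_plus_1 argument are ours, and the floor division is an assumption about how fractional leftovers are treated.

def scr_per_bsc(T, T_comm, BSC):
    # Eq. (3.4): every slave processor receives its full share in each BSC.
    return [BSC // (t + c) for t, c in zip(T, T_comm)]

def fpf_per_bsc(T, T_comm, BSC, max_plus_1):
    # Our reading of the FPF rule above (1-based indices): Pi with i < max+1 gets its full
    # share, P_{max+1} gets whatever master communication time is left, later ones get none.
    alloc, used_comm = [], 0
    for i, (t, c) in enumerate(zip(T, T_comm), start=1):
        if i < max_plus_1:
            k = BSC // (t + c)
            alloc.append(k)
            used_comm += k * c
        elif i == max_plus_1:
            alloc.append(max((BSC - used_comm) // c, 0))
        else:
            alloc.append(0)
    return alloc

T, T_comm, BSC = [3, 6, 11, 13], [5, 2, 1, 3], 48
print(scr_per_bsc(T, T_comm, BSC))                 # [6, 6, 4, 3]
print(fpf_per_bsc(T, T_comm, BSC, 4))              # [6, 6, 4, 0] -- P4 is starved, as in Figure 3.4(d)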

Lemma 3.5: Given a master-slave tasking paradigm with n slave processors, in the SCR task allocation scheme, the idle time of a slave processor, denoted by TidleSCR, is equal to the following equation:

TidleSCR = Σ_{i=1}^{n} comm(Pi) - BSC.   (3.5)

Proof: According to definition 3.4, BSC is identical for all processors. We have BSC = comm(Pi) + comp(Pi) for all i = 1~n; in particular, BSC = comm(Pn) + comp(Pn). Since TidleSCR = Σ_{i=1}^{n} comm(Pi) - (comm(Pn) + comp(Pn)) = Σ_{i=1}^{n-1} comm(Pi) - comp(Pn), and comm(Pn) + comp(Pn) = BSC, we have TidleSCR = Σ_{i=1}^{n} comm(Pi) - BSC. █
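For the Figure 3.4 example, Eq. (3.5) yields exactly the 7 idle time units observed for P3; a one-line check using the comm(Pi) values from the text:

comm, BSC = [30, 12, 4, 9], 48
print(sum(comm) - BSC)        # 7 = TidleSCR, the idle time observed between t = 48 and t = 55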

Lemma 3.6: Given a master-slave tasking paradigm with n slave processors, in the SCR task allocation scheme, the task completion time of the jth BSC, denoted by TfinishSCR(BSCj), can be calculated by the following equation:

TfinishSCR(BSCj) = Σ_{i=1}^{n} comm(Pi) + comp(Pk) + (j - 1) × (comm(Pk) + comp(Pk) + TidleSCR),   (3.6)

where Pk is the slave processor with the maximum communication ratio.

Proof: We prove this lemma in an induction manner.

For j = 1, TfinishSCR(BSC1) = Σ_{i=1}^{n} comm(Pi) + comp(Pk).

For j = 2, TfinishSCR(BSC2) = Σ_{i=1}^{n} comm(Pi) + comp(Pk) + (comm(Pk) + comp(Pk) + TidleSCR).

For j = m, TfinishSCR(BSCm) = Σ_{i=1}^{n} comm(Pi) + comp(Pk) + (m - 1) × (comm(Pk) + comp(Pk) + TidleSCR).   (3.6.1)

For j = m + 1, TfinishSCR(BSCm+1) = Σ_{i=1}^{n} comm(Pi) + comp(Pk) + m × (comm(Pk) + comp(Pk) + TidleSCR).   (3.6.2)

By subtracting the two equations, (3.6.2) - (3.6.1), we have TfinishSCR(BSCm+1) - TfinishSCR(BSCm) = comm(Pk) + comp(Pk) + TidleSCR. Therefore,

TfinishSCR(BSCj) = Σ_{i=1}^{n} comm(Pi) + comp(Pk) + (j - 1) × (comm(Pk) + comp(Pk) + TidleSCR). █
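The closed form (3.6) can be cross-checked against a direct simulation of the SCR dispatching described earlier (single-port master, slaves served in increasing communication-ratio order, no overlap of receiving and computing on a slave). The simulation model below is our reading of the SCR description, not code from the thesis.

def eq_3_6(comm, comp, k, T_idle, j):
    # Eq. (3.6); k is the index of Pk, the processor with maximum communication ratio.
    return sum(comm) + comp[k] + (j - 1) * (comm[k] + comp[k] + T_idle)

def simulate_scr(T, T_comm, BSC, cycles):
    n = len(T)
    task = [BSC // (T[i] + T_comm[i]) for i in range(n)]
    comm = [task[i] * T_comm[i] for i in range(n)]
    comp = [task[i] * T[i] for i in range(n)]
    order = sorted(range(n), key=lambda i: comm[i])    # increasing communication ratio
    master, done = 0, [0] * n                          # master send pointer, per-slave finish times
    for _ in range(cycles):
        for i in order:
            recv = max(master, done[i])                # wait for the master and for the slave itself
            master = recv + comm[i]                    # single-port master sends serially
            done[i] = master + comp[i]                 # the slave computes after receiving
    return max(done)                                   # the last slave to finish closes the BSC

T, T_comm, BSC = [3, 6, 11, 13], [5, 2, 1, 3], 48
task = [BSC // (t + c) for t, c in zip(T, T_comm)]
comm = [k * c for k, c in zip(task, T_comm)]
comp = [k * t for k, t in zip(task, T)]
T_idle = sum(comm) - BSC
for j in (1, 2, 3, 4):
    print(j, eq_3_6(comm, comp, 0, T_idle, j), simulate_scr(T, T_comm, BSC, j))
    # both columns agree: 73, 128, 183, 238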

Lemma 3.7: Given a master-slave tasking paradigm with n slave processors, if tdue is the system deadline between the jth BSC and the (j+1)th BSC, the amount of tasks additionally dispatched by ESCR over SCR, denoted by TaskextraESCR, can be estimated as

TaskextraESCR = Σ_{i=1}^{n} (tdue - TfinishSCR(BSCj)) / (Ti + Ti_comm).   (3.7)

Proof: According to Lemma 3.6, the completion time of the jth BSC is TfinishSCR(BSCj). Because tdue is the deadline between the jth BSC and the (j+1)th BSC, the remaining time available for dispatching additional tasks is tdue - TfinishSCR(BSCj). Therefore,

TaskextraESCR = Σ_{i=1}^{n} (tdue - TfinishSCR(BSCj)) / (Ti + Ti_comm). █
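Eq. (3.7), taken literally, amounts to the following helper; whether fractional values are floored is not stated in the lemma (it says "estimated"), so the floor below is an assumption.

def tasks_extra_escr(t_due, t_finish_j, T, T_comm):
    # Leftover time after the j-th BSC, divided by each processor's per-task cost Ti + Ti_comm.
    slack = t_due - t_finish_j
    return sum(slack // (t + c) for t, c in zip(T, T_comm))

print(tasks_extra_escr(200, 183, [3, 6, 11, 13], [5, 2, 1, 3]))   # an estimate for the running example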

Lemma 3.8: Given a master-slave tasking paradigm with n slave processors, if tdue is the system deadline between the jth BSC and the (j+1)th BSC, the total amount of tasks dispatched by ESCR, denoted by TaskESCRfinish, can be estimated as

TaskESCRfinish(tdue) = j × Σ_{i=1}^{n} task(Pi) + TaskextraESCR.   (3.8)

Proof: Because the system deadline is between the jth BSC and the (j+1)th BSC, j × Σ_{i=1}^{n} task(Pi) tasks can be dispatched in j BSCs by the ordinary SCR algorithm. According to Lemma 3.7, the amount of tasks additionally dispatched by ESCR over SCR is TaskextraESCR. Therefore, the total amount of tasks dispatched by ESCR is j × Σ_{i=1}^{n} task(Pi) + TaskextraESCR. █
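Combining Eqs. (3.7) and (3.8) into one helper gives the deadline-bounded ESCR task count; as before, flooring the per-processor extras is our assumption, and the estimate is only as exact as Eq. (3.7).

def tasks_escr_finish(j, task, t_due, t_finish_j, T, T_comm):
    extra = sum((t_due - t_finish_j) // (t + c) for t, c in zip(T, T_comm))   # Eq. (3.7)
    return j * sum(task) + extra                                              # Eq. (3.8)

# Running example: j = 3 complete BSCs (57 tasks) plus the extra tasks fitted before t = 200.
print(tasks_escr_finish(3, [6, 6, 4, 3], 200, 183, [3, 6, 11, 13], [5, 2, 1, 3]))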

The ESCR scheduling algorithm is given as follows.

Algorithm_ESCR_task_Scheduling (Ti, Ti_comm, tdue)   // tdue is the system deadline
01. for (all slave processors Pi) {
02.   task(Pi) = BSC / (Ti + Ti_comm);
03.   m = (tdue - Wi) / (BSC + TidleESCR);
      // m is the number of BSCs before the system deadline
04.   TaskextraESCR(i) = (tdue - TfinishSCR(BSCm)) / (Ti + Ti_comm); }
      // calculate the additional number of tasks for each Pi
05. for i = 1 to n {
06.   total = task(Pi) * m + TaskextraESCR(i);
07.   Send total tasks to Pi;
08.   TaskESCRfinish(tdue) += total; }
      // calculate the total number of tasks before the deadline
End_of_ESCR_task_Scheduling

Figure 3.5: The ESCR algorithm.
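A minimal executable sketch of the loop in Figure 3.5 is given below. The names and the call interface are ours: W[i] stands for the startup waiting time of Pi (definition 3.11), t_finish(m) stands for TfinishSCR(BSCm) from Lemma 3.6, send_tasks is a placeholder for the real dispatch call, and computing m per processor inside the loop is one possible reading of lines 03-04.

def escr_task_scheduling(T, T_comm, W, BSC, T_idle_escr, t_finish, t_due, send_tasks):
    n = len(T)
    total_before_deadline = 0
    for i in range(n):                                                # lines 01-04
        task_i = BSC // (T[i] + T_comm[i])                            # line 02
        m = (t_due - W[i]) // (BSC + T_idle_escr)                     # line 03: BSCs before the deadline
        extra_i = max((t_due - t_finish(m)) // (T[i] + T_comm[i]), 0) # line 04: extra tasks for Pi
        total = task_i * m + extra_i                                  # line 06
        send_tasks(i, total)                                          # line 07
        total_before_deadline += total                                # line 08
    return total_before_deadline

It can be exercised, for instance, with send_tasks = lambda i, k: None and a t_finish built from Eq. (3.6).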

The other optimization investigated in this section is minimizing the overall execution time (i.e., makespan) for a given fixed amount of tasks. Let us use the same environment settings as in Figure 3.4 to explain the task allocation schemes of the different algorithms. Figure 3.6 shows the three different task allocation schemes, SCR, ESCR and FPF, with a total number of 66 tasks to be processed.

Figure 3.6(a) shows the scheduling of the SCR task allocation scheme. Because 19 tasks can be distributed in each BSC, SCR allocates 57 tasks in three BSCs, leaving 9 tasks un-dispatched before the fourth BSC starts. According to the communication ratio, P3 first receives 4 tasks at time t = 165, then P4 receives 3 tasks at time t = 169, and finally P2 receives 2 tasks at time t = 178. The SCR method completes the 66 tasks with makespan = 217, which is dominated by P4.

In the FPF implementation, 16 tasks can be distributed in each BSC, so FPF allocates 64 tasks in four BSCs, leaving 2 tasks un-dispatched before the fifth BSC starts. According to the computational speed of the processors, P1 receives the 2 remaining tasks for computation at time t = 200, as shown in Figure 3.6(b). The FPF method completes the 66 tasks with makespan = 234, which is dominated by P3. The ESCR task allocation scheme is depicted in Figure 3.6(c). Similar to the SCR scheme, 57 tasks are completed in the first three BSCs and 9 tasks remain un-dispatched before the fourth BSC starts. The ESCR uses a binary approximation method (shown in Figure 3.7) to optimize the overall completion time for a given fixed amount of tasks.

According to Lemma 3.8 and the approximation algorithm in Figure 3.7, the ESCR method completes the 66 tasks with makespan = 199, which is dominated by P1.

Figure 3.6: Different task allocation schemes, SCR, ESCR and FPF, with a total number of 66 tasks to be processed. (a) SCR (b) FPF (c) ESCR.

Algorithm_ESCR_Binary_Approximation (Ti, Ti_comm, Qtask)   // Qtask is the amount of tasks to be processed
01. Left_t = TfinishSCR(BSCj-1);
02. Right_t = TfinishSCR(BSCj+1);
03. x = 1/2 × (Left_t + Right_t);
04. while ( !(TaskESCRfinish(x) == Qtask) ) {
05.   if (TaskESCRfinish(x) > Qtask)
06.     Right_t = x;                      // too many tasks fit; search the earlier half
07.   else if (TaskESCRfinish(x) < Qtask)
08.     Left_t = x;                       // too few tasks fit; search the later half
09.   x = 1/2 × (Left_t + Right_t); }
10. makespan = x;
End_of_ESCR_Binary_Approximation

Figure 3.7: The binary approximation method of the ESCR algorithm.
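The same idea can be written as an ordinary bisection in Python. TaskESCRfinish(x) is passed in as a function tasks_done and is assumed to be non-decreasing in x; the eps-based termination is an editorial choice replacing the equality test of Figure 3.7, which may never hold exactly when the task count jumps over Qtask.

def escr_binary_approximation(tasks_done, q_task, left_t, right_t, eps=1e-3):
    # left_t / right_t bracket the makespan, e.g. TfinishSCR(BSC_{j-1}) and TfinishSCR(BSC_{j+1}).
    while right_t - left_t > eps:
        x = (left_t + right_t) / 2.0
        if tasks_done(x) >= q_task:
            right_t = x                   # enough tasks finish by x; try an earlier time
        else:
            left_t = x                    # not enough tasks finish by x; allow more time
    return right_t                        # approximately the smallest x completing q_task tasks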

The following theorems summarize important features of ESCR obtained from the above demonstration.

Theorem 3.1: Given a master-slave tasking paradigm with n slave processors, the startup waiting time of the ESCR task allocation scheme is less than that of the FPF scheme.

Proof: Because the ESCR distributes tasks according to the processors' communication ratio (in increasing order), we assume that comm(P1) < comm(P2) < … < comm(Pn), as shown in Figure 3.8(a). On the contrary, the FPF scheme distributes tasks according to the processors' computational speed; namely, faster processors receive more tasks. This results in comm(P1') > comm(P2') > … > comm(Pn'), as shown in Figure 3.8(b). According to definition 3.11, the startup waiting time of the ESCR scheme can be formulated as Σ_{i=1}^{n} Wi, where W1 = 0 and Wi = Σ_{j=1}^{i-1} comm(Pj) for i > 1; the same formulation for the FPF scheme is Σ_{i=1}^{n} Wi', where W1' = 0 and Wi' = Σ_{j=1}^{i-1} comm(Pj') for i > 1. Observing the startup waiting time of both schemes, we have W1 = W1' = 0, W2 = comm(P1) < W2' = comm(P1'), W3 = comm(P1) + comm(P2) < W3' = comm(P1') + comm(P2'), and so on. Therefore, we have Σ_{i=1}^{n} Wi < Σ_{i=1}^{n} Wi', showing that the startup waiting time of the ESCR task allocation scheme is less than that of the FPF scheme. █
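The ordering argument in the proof of Theorem 3.1 is easy to see numerically with the comm(Pi) values of the running example; the decreasing-comm order below stands in for the FPF-like order assumed in the proof.

def total_startup_wait(comm_in_dispatch_order):
    # Wi = sum of the communication times of the processors served before Pi (definition 3.11).
    total, elapsed = 0, 0
    for c in comm_in_dispatch_order:
        total += elapsed
        elapsed += c
    return total

comm = [30, 12, 4, 9]                                     # comm(Pi) of the Figure 3.4 example
print(total_startup_wait(sorted(comm)))                   # 0 + 4 + 13 + 25 = 42  (ESCR/SCR order)
print(total_startup_wait(sorted(comm, reverse=True)))     # 0 + 30 + 42 + 51 = 123 (FPF-like order)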

Theorem 3.2: Given a master-slave tasking paradigm with n slave processors, the ESCR scheme has less total processor idle time than the FPF scheme if max+1 < n.

Proof: This theorem can be easily established according to Lemmas 3.1, 3.2 and 3.5, under the assumption that Pi is regarded as an idle processor if i > max+1 in the FPF scheme. █

Theorem 3.3: Given a master-slave tasking paradigm with n slave processors, the ESCR algorithm has less execution time than the FPF algorithm for a given amount of tasks.

Proof: According to Lemma 3.6, we have

TfinishESCR(BSCj) = Σ_{i=1}^{n} comm(Pi) + comp(Pk) + (j - 1) × (comm(Pk) + comp(Pk) + TidleESCR),

where Pk is the slave processor with the maximum communication ratio; and, according to Lemma 3.3,

TfinishFPF(BSCj) = Σ_{i=1}^{max} comm(Pi) + j × (comm(Pn') + comp(Pn') + TidleFPF) - TidleFPF.

The makespan of SCR is Wn + j × BSC + (j - 1) × TidleSCR and the makespan of FPF is Wn' + j × BSC + (j - 1) × TidleFPF. According to Theorems 3.1 and 3.2, we know that Wn < Wn' and TidleSCR < TidleFPF. Therefore, we conclude that the ESCR algorithm has less execution time than the FPF algorithm for a given amount of tasks. █

Figure 3.8: Task allocation paradigm showing different startup waiting times. (a) ESCR (b) FPF.

F

To evaluate the performance of the proposed techniques, we have implemented the proposed scheduling algorithms. Comparative metrics, such as turnaround time, system throughput and processor idle time, will be discussed in the following evaluation.
