
Lemma 8: Given a master-slave system with heterogeneous communication and δ > 1, under MJF scheduling we have

task(P_{max+1}) = \left\lfloor \frac{BSC - \sum_{i=1}^{max} comm(P_i)}{T_{(max+1)\_comm}} \right\rfloor  (8)

Lemma 9: Given an SCR scheduling with heterogeneous communication and δ > 1, let T_idle^SCR be the idle time of one slave processor. We have the following equation:

T_{idle}^{SCR} = \sum_{i=1}^{max+1} comm(P_i) - BSC.  (9)

Lemma 10: Given an SCR scheduling with heterogeneous communication and δ > 1, let T_start^SCR(BSC_j) be the start time of dispatching tasks in the j-th BSC. We have the following equation:

T_{start}^{SCR}(BSC_j) = (j - 1) \times (BSC + T_{idle}^{SCR})  (10)

Lemma 11: Given an SCR scheduling with heterogeneous communication and δ > 1, let T_finish^SCR(BSC_j) denote the task completion time of the j-th BSC. We have

T_{finish}^{SCR}(BSC_j) = \sum_{i=1}^{max+1} comm(P_i) + comp(P_k) + (j - 1) \times (comm(P_k) + comp(P_k) + T_{idle}^{SCR})  (11)

where Pk is the slave processor with maximum communication cost.
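As a numeric check of Lemmas 8–11, the following sketch plugs in the values of the Fig. 4 example discussed below (BSC = 48, the comm(Pi) costs, and P_k = P1); the variable and helper names are ours, not from the paper.

```python
# Values taken from the Fig. 4 example; helper names are illustrative only.
BSC = 48
comm = {"P1": 30, "P2": 12, "P3": 4, "P4": 9}    # comm(Pi) under the SCR distribution
T4_comm = 3                                       # per-task communication time of P4

# Lemma 8 (MJF): tasks assignable to P_{max+1} within one BSC.
mjf_comm = comm["P1"] + comm["P2"] + comm["P3"]   # P1..P_max in MJF dispatch order
task_p4 = (BSC - mjf_comm) // T4_comm             # floor((48 - 46) / 3) = 0

# Lemma 9 (SCR): idle time of one slave processor per cycle.
T_idle = sum(comm.values()) - BSC                 # 55 - 48 = 7

# Lemma 10 (SCR): start time of the j-th BSC.
def t_start(j):
    return (j - 1) * (BSC + T_idle)

# Lemma 11 (SCR): completion time of the j-th BSC; P_k = P1 here.
comp_pk = 6 * 3                                   # 6 tasks on P1 at speed 3
def t_finish(j):
    return sum(comm.values()) + comp_pk + (j - 1) * (comm["P1"] + comp_pk + T_idle)

print(task_p4, T_idle, t_start(2), t_finish(1))   # 0 7 55 73
```

The printed values match the example: MJF assigns no task to P4, P3 is idle for 7 time units, the second BSC starts at t = 55, and the first BSC finishes at t = 73.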

Another example of master-slave tasking with heterogeneous communication and δ > 1 is shown in Fig. 4(a).

The communication overheads vary from 1 to 5. The computational speeds vary from 3 to 13. In this example, we have BSC = 48.

In the SCR implementation, according to Corollary 3, the task distribution is task(P1) = 6, task(P2) = 6, task(P3) = 4 and task(Pmax+1) = task(P4) = 3. The communication costs of the slave processors are comm(P1) = 30, comm(P2) = 12, comm(P3) = 4 and comm(P4) = 9, respectively. Therefore, the SCR method distributes tasks in the order P3, P4, P2, P1. The 19 tasks of the first BSC are dispatched to P1–P4 during the time period 1~55. Processor P3 is the first processor to receive tasks; it finishes at time t = 48 and becomes available. Meanwhile, processor P1 receives tasks during t = 48~55. The second BSC starts to dispatch tasks at t = 55; that is, P3 starts to receive tasks at t = 55 in the second scheduling cycle. Therefore, P3 is idle for 7 units of time. Lemmas 4 and 5 describe this phenomenon. The completion time of tasks in the first BSC depends on the finish time of processor P1: we have T_finish^SCR(BSC_1) = 73.
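The SCR and LCR dispatch orders stated above fall directly out of sorting the slaves by their total per-cycle communication cost. A minimal sketch, using the example's comm(Pi) values (names are ours):

```python
# Per-cycle communication costs of the slaves in the Fig. 4 example.
comm = {"P1": 30, "P2": 12, "P3": 4, "P4": 9}

scr_order = sorted(comm, key=comm.get)                 # smallest communication first
lcr_order = sorted(comm, key=comm.get, reverse=True)   # largest communication first

print(scr_order)  # ['P3', 'P4', 'P2', 'P1']
print(lcr_order)  # ['P1', 'P2', 'P4', 'P3']
```

These match the dispatch orders P3, P4, P2, P1 (SCR) and P1, P2, P4, P3 (LCR) given in the text.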

Fig. 4. Task scheduling in a heterogeneous communication environment with δ > 1. (a) Smallest Communication Ratio (SCR) (b) MJF (c) Longest Communication Ratio (LCR).

The MJF scheduling is depicted in Fig. 4(b). According to Corollary 5, task(Pmax+1) = task(P4) = 0; therefore, P4 is not included in the scheduling. MJF has the task distribution order P1, P2, P3. Another scheduling policy, called Longest Communication Ratio (LCR), is the opposite approach to the SCR method. Fig. 4(c) shows the LCR scheduling result, which has the dispatch order P1, P2, P4, P3.

To investigate the performance of the SCR scheduling technique, we observe that the MJF algorithm completes 16 tasks in 90 units of time in the first BSC. On the other hand, in SCR scheduling, 19 tasks are completed in 73 units of time in the first BSC. In LCR, 19 tasks are completed in 99 units of time. We can see that the system throughput of SCR (19/73 ≈ 0.260) > LCR (19/99 ≈ 0.192) > MJF (16/90 ≈ 0.178). Moreover, the average turnaround time of the SCR algorithm in the first three BSCs is 183/57 (≈3.2105), which is less than LCR's average turnaround time of 209/57 (≈3.6666) and MJF's average turnaround time of 186/48 (≈3.875).
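The throughput comparison above is straightforward arithmetic; the following snippet reproduces the ranking from the first-BSC figures (tasks completed over elapsed time):

```python
# Tasks completed / elapsed time in the first BSC, from the example above.
throughput = {"SCR": 19 / 73, "LCR": 19 / 99, "MJF": 16 / 90}

# Rank the algorithms from highest to lowest throughput.
ranking = sorted(throughput, key=throughput.get, reverse=True)

print(ranking)                       # ['SCR', 'LCR', 'MJF']
print(round(throughput["SCR"], 3))   # 0.26
```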

6 Performance Evaluation

To evaluate the performance of the proposed method, we have implemented the SCR and the MJF algorithms. We compare different criteria, such as average turnaround time, system throughput and processor idle time, in Heterogeneous Processors with Heterogeneous Communications (HPHC).

Simulation experiments evaluating average turnaround time were made for different numbers of processors and are shown in Fig. 5. The computational speeds of the slave processors are set as T1=3, T2=3, T3=5, T4=7, T5=11, and T6=13. For the cases with 2, 3, …, 6 processors, we have δ ≤ 1. When the processor number increases to 7, we have δ > 1. In either case, the SCR algorithm yields better average turnaround time. From these results, we conclude that the SCR algorithm outperforms MJF for most test samples.


Fig. 5. Average task turn-around time on different numbers of processors.

Simulation results compare the three task scheduling algorithms, SCR, MJF and LCR, on heterogeneous processors with heterogeneous communication. Fig. 6 shows the results for an experimental setting with ±10 processor speed variation and ±4 communication speed variation. The computation speeds of the slave processors are T1=3, T2=6, T3=11, and T4=13. The times for a slave processor to receive one task from the master processor are T1_comm = 5, T2_comm = 2, T3_comm = 1 and T4_comm = 3. The average task turnaround time, system throughput and processor idle time are measured.


Fig. 6. Simulation results for 5 processors with ±10 computation speed variation and ±4 communication variation when δ > 1: (a) average turnaround time, (b) system throughput, (c) processor idle time.

Fig. 6(a) shows the average turnaround time for different numbers of BSCs. The SCR algorithm performs better than the LCR and MJF methods. Similarly, the SCR method has higher throughput than the other two algorithms, as shown in Fig. 6(b). The processor idle time is shown in Fig. 6(c). The SCR and LCR algorithms have the same processor idle time, which is less than that of the MJF scheduling method. These phenomena match the theoretical analysis in Section 5.

The comparison in Fig. 7 presents the performance of SCR and MJF in further cases. The experimental settings use ±5~±30 processor speed variation and ±5~±30 communication speed variation; that is, the computation speed variation of T1~Tn and the communication speed variation of T1_comm~Tn_comm both range over ±5~±30. The system throughput is measured.


Fig. 7. Simulation results of throughput for 5~25 processors with ±30 computation speed variation and ±30 communication variation over 100 cases and 100 BSCs: (a) system throughput for the cases 0 < Ti ≤ 30 and 0 < Ti_comm ≤ 5; (b) system throughput for the cases 0 < Ti ≤ 5 and 0 < Ti_comm ≤ 30.

Fig. 7(a) shows the case of 0 < Ti ≤ 30 and 0 < Ti_comm ≤ 5, where the computation and communication speeds are random and uniformly distributed, for different numbers of nodes and 100 BSCs over 100 cases.

Fig. 7(b) shows the case of 0 < Ti ≤ 5 and 0 < Ti_comm ≤ 30. The SCR algorithm performs better than the MJF method and has higher throughput, as shown in Figs. 7(a) and 7(b).

From the above experimental tests, we have the following remarks.

• The proposed SCR scheduling technique has higher system throughput than the MJF algorithm.

• The proposed SCR scheduling technique has better task turnaround time than the MJF algorithm.

• The SCR scheduling technique has less processor idle time than the MJF algorithm.

7 Conclusions

The problem of resource management and scheduling has been one of the main challenges in grid computing. In this paper, we have presented an efficient algorithm, SCR, for the heterogeneous processor tasking problem. One significant improvement of our approach is that average turnaround time can be minimized by first selecting the processor with the smallest communication ratio. The other advantage of the proposed method is that system throughput can be increased by dispersing processor idle time. Our preliminary analysis and simulation results indicate that the SCR algorithm outperforms Beaumont's method in terms of lower average turnaround time, higher average throughput, less processor idle time and higher processor utilization.

A number of research issues remain. Our proposed model can be applied to map tasks onto heterogeneous cluster systems in grid environments, in which communication costs vary across clusters.

In the future, we intend to develop generalized tasking mechanisms for computational grids. We will study realistic applications and analyze their performance on grid systems. In addition, rescheduling of processors/tasks to minimize processor idle time on heterogeneous systems is also of interest and will be investigated.

References

1. O. Beaumont, A. Legrand and Y. Robert, “The Master-Slave Paradigm with Heterogeneous Processors,” IEEE Trans. on parallel and distributed systems, Vol. 14, No.9, pp. 897-908, September 2003.

2. C. Banino, O. Beaumont, L. Carter, J. Ferrante, A. Legrand and Y. Robert, ”Scheduling Strategies for Master-Slave Tasking on Heterogeneous Processor Platforms,” IEEE Trans. on parallel and distributed systems, Vol. 15, No.4, pp.319-330, April 2004.

3. O. Beaumont, A. Legrand and Y. Robert, “Pipelining Broadcasts on Heterogeneous Platforms,” IEEE Trans. on parallel and distributed systems, Vol. 16, No.4, pp. 300-313 April 2005.

5. O. Beaumont, V. Boudet, F. Rastello and Y. Robert, "Matrix-Matrix Multiplication on Heterogeneous Platforms," IEEE Trans. on parallel and distributed systems, Vol. 12, No. 10, pp. 1033-1051, Oct. 2001.

6. F. Berman, R. Wolski, H. Casanova, W. Cirne, H. Dail, M. Faerman, S. Figueira, J. Hayes, G. Obertelli, J. Schopf, G. Shao, S. Smallen, N. Spring, A. Su, and D. Zagorodnov, "Adaptive Computing on the Grid Using AppLeS," IEEE Trans. on parallel and distributed systems, Vol. 14, No. 4, pp. 369-379, April 2003.

7. S. Bataineh, T.Y. Hsiung and T.G. Robertazzi, “Closed Form Solutions for Bus and Tree Networks of Processors Load Sharing a Divisible Job,” IEEE Trans. Computers, Vol. 43, No. 10, pp. 1184-1196, Oct. 1994.

8. T. D. Braun, H. J. Siegel, N. Beck, L. Boloni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys and B. Yao, “A taxonomy for describing matching and scheduling heuristics for mixed-machine heterogeneous computing systems,” Proceedings of the IEEE Workshop on Advances in Parallel and Distributed Systems, pp. 330-335, Oct. 1998.

9. A.T. Chronopoulos and S. Jagannathan, “A Distributed Discrete-Time Neural Network Architecture for Pattern Allocation and Control,” Proc. IPDPS Workshop Bioinspired Solutions to Parallel Processing Problems, 2002.

10. S. Charcranoon, T.G. Robertazzi and S. Luryi, "Optimizing Computing Costs Using Divisible Load Analysis," IEEE Trans. Computers, Vol. 49, No. 9, pp. 987-991, Sept. 2000.

11. K. Cooper, A. Dasgupta, K. Kennedy, C. Koelbel, A. Mandal, G. Marin, M. Mazina, J. Mellor-Crummey, F. Berman, H. Casanova, A. Chien, H. Dail, X. Liu, A. Olugbile, O. Sievert, H. Xia, L. Johnsson, B. Liu, M. Patel, D. Reed, W. Deng, C. Mendes, Z. Shi, A. YarKhan, and J. Dongarra, "New Grid Scheduling and Rescheduling Methods in the GrADS Project," Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04), pp. 209-229, April 2004.

12. H. Casanova, A. Legrand, D. Zagorodnov and F. Berman, “Heuristics for Scheduling Parameter Sweep applications in Grid environments,” Proceedings of the 9th Heterogeneous Computing workshop (HCW'2000), pp. 349-363, 2000.

13. T. Thanalapati and S. Dandamudi, ”An Efficient Adaptive Scheduling Scheme for Distributed Memory Multicomputers,” IEEE Trans. on parallel and distributed systems, Vol. 12, No. 7, pp.758-767, July 2001.

Executive Yuan Agencies Personnel Overseas Trip Report Summary

Date written: June 20, 2007

Name: Ching-Hsien Hsu

Affiliation: Department of Computer Science and Information Engineering, Chung Hua University

Contact phone / e-mail: 03-5186410, chh@chu.edu.tw

Date of birth: February 23, 1973

Title: Associate Professor

International conference attended: 2007 International Conference on Algorithms and Architectures for Parallel Processing, June 11-14, 2007

Destination country and city: Hangzhou, China

Period abroad: June 11, 2007 to June 19, 2007

Summary: This international conference held in Hangzhou lasted four days. I arrived at the venue and registered on the afternoon of the first day. On the second day I chaired an invited session, and in the morning session I presented the paper accepted by the conference. On the first day I also attended Dr. Byeongho Kang's insightful talk on Web Information Management. On the second day, many important research results were presented in six parallel sessions; I chose to attend the sessions on Architecture and Infrastructure, Grid Computing, and P2P Computing. In the evening I attended the reception, exchanged views with several foreign scholars and with professors from China and Hong Kong, and took photos with them. On the third morning I attended talks on Data and Information Management, learned about many emerging research topics and the main research directions of scholars abroad, and used the final day as an opportunity to become acquainted with foreign professors, hoping to deepen their impression of research in Taiwan. Over the three days I heard many excellent paper presentations covering topics such as grid system technology, task scheduling, grid computing, grid databases and wireless networks. With many well-known scholars participating, every attendee could obtain the latest international techniques and information; it was a very successful conference. Attending it was very rewarding: I met many internationally renowned researchers and professionals, exchanged ideas with them, and discussed problems in my field face to face with other professors. Having seen many research results and heard several keynote speeches, I conclude that the venue and the invited speakers were all excellent; the conference was very successfully organized and worth learning from.

Review comments of the attendee's agency:

Comments of the forwarding agency:

Handling comments of the Research, Development and Evaluation Commission:

(Paper presented at the ICA3PP-07 conference)

A Generalized Critical Task Anticipation Technique for DAG Scheduling

Ching-Hsien Hsu¹, Chih-Wei Hsieh¹ and Chao-Tung Yang²

1 Department of Computer Science and Information Engineering Chung Hua University, Hsinchu, Taiwan 300, R.O.C.

chh@chu.edu.tw

2 High-Performance Computing Laboratory

Department of Computer Science and Information Engineering Tunghai University, Taichung City, 40704, Taiwan R.O.C.

ctyang@thu.edu.tw

Abstract. The problem of scheduling a weighted directed acyclic graph (DAG) representing an application onto a set of heterogeneous processors to minimize the completion time has recently been studied. The NP-completeness of the problem has instigated researchers to propose different heuristic algorithms. In this paper, we present a Generalized Critical-task Anticipation (GCA) algorithm for DAG scheduling in heterogeneous computing environments. The GCA scheduling algorithm employs a task prioritizing technique based on the CA algorithm and introduces a new processor selection scheme that considers heterogeneous communication costs among processors, adapting it to grid and scalable computing. To evaluate the performance of the proposed technique, we have developed a simulator that contains a parametric graph generator producing weighted directed acyclic graphs with various characteristics. We have implemented the GCA algorithm along with the CA and HEFT scheduling algorithms on the simulator. The GCA algorithm is shown to be effective in terms of speedup and low scheduling costs.

1. Introduction

The purpose of a heterogeneous computing system is to make processors cooperate so that an application completes quickly. Because of the diverse capabilities of processors and special requirements, such as exclusive functions, memory access speed, or customized I/O devices, tasks may have distinct execution times on different resources. Therefore, efficient task scheduling is important for achieving good performance in heterogeneous systems.

The primary scheduling methods can be classified into three categories, dynamic, static and hybrid scheduling, according to the time at which the scheduling decision is made. In the dynamic approach, the system redistributes tasks between processors at run-time, with the aim of balancing the computational load and reducing processor idle time. On the contrary, in the static approach, information about the application, such as task execution times, message sizes of communications among tasks, and task dependences, is known a priori at compile-time; tasks are assigned to processors accordingly in order to minimize the entire application completion time while satisfying task precedence. Hybrid scheduling techniques are a mix of dynamic and static methods, where some preprocessing is done statically to guide the dynamic scheduler [8].

A Directed Acyclic Graph (DAG) [2] is usually used to model a parallel application that consists of a number of tasks. The nodes of the DAG correspond to tasks and its edges indicate the precedence constraints between tasks. In addition, the weight of an edge represents the communication cost between tasks. Each node is given a computation cost on each processor, represented by a computation cost matrix. Figure 1 shows an example of the DAG scheduling model. In Figure 1(a), task nj is a successor (predecessor) of task ni if there exists an edge from ni to nj (from nj to ni) in the graph. Under the task precedence constraint, the successor nj can start its execution only after the predecessor ni completes its execution and nj receives the messages from ni.

Figure 1(b) demonstrates the different computation costs of tasks performed on heterogeneous processors. It is also assumed that each task executes on a single processor in a non-preemptable style. A simple fully connected processor network with asymmetrical data transfer rates is shown in Figures 1(c) and 1(d).

ni    P1   P2   P3   w̄i
n1    14   19    9   14
n2    13   19   18   16.7
n3    11   17   15   14.3
n4    13    8   18   13
n5    12   13   10   11.7
n6    12   19   13   14.7
n7     7   16   11   11
n8     5   11   14   10
n9    18   12   20   16.7
n10   17   20   11   16

Figure 1: An example of the DAG scheduling problem: (a) directed acyclic graph (DAG-1), (b) computation cost matrix (W), shown above, (c) processor topology, (d) communication weights.

The scheduling problem has been widely studied in heterogeneous systems, where the computational abilities of processors differ and the processors communicate over an underlying network. Many approaches have been proposed in the literature. The scheduling problem has been shown to be NP-complete [3] in general as well as in several restricted cases, so the pursuit of optimal scheduling leads to high scheduling overhead. This negative result motivates heuristic approaches to the scheduling problem. A comprehensive survey of static scheduling algorithms is given in [9]. The authors show that heuristic-based algorithms can be classified into a variety of categories, such as clustering algorithms, duplication-based algorithms, and list-scheduling algorithms.

Due to page limitations, we omit the description of related works.

In this paper, we present the Generalized Critical task Anticipation (GCA) algorithm, a list scheduling approach to the DAG task scheduling problem. The main contribution of this paper is a novel heuristic for DAG scheduling on heterogeneous machines and networks. A significant improvement is that inter-processor communication costs are considered in the processor selection phase, so that tasks can be mapped to more suitable processors. The GCA heuristic compares favorably with the previous CA [5] and HEFT heuristics in terms of schedule length and speedup under different parameters.

The rest of this paper is organized as follows: Section 2 provides background, describes preliminaries regarding the heterogeneous scheduling system in the DAG model, and formalizes the research problem. Section 3 defines the notations and terminologies used in this paper. Section 4 forms the main body of the paper, presenting the Generalized Critical task Anticipation (GCA) scheduling algorithm and illustrating it with an example. Section 5 discusses the performance of the proposed heuristic and its simulation results. Finally, Section 6 briefly concludes this paper.

2. DAG Scheduling on Heterogeneous Systems

The DAG scheduling problem studied in this paper is formalized as follows. Given a parallel application represented by a DAG, nodes represent tasks and edges represent the dependences between these tasks. The target computing architecture is a set of heterogeneous processors, M = {Pk: k = 1, ..., P} with P = |M|, communicating over an underlying network that is assumed fully connected. We make the following assumptions:

• Inter-processor communications are performed without network contention between arbitrary processors.

• Computation of tasks is non-preemptive; once a task is assigned to a processor and starts its execution, it will not be interrupted until its completion.

• Computation and communication can proceed simultaneously because of separated I/O.

• If two tasks are assigned to the same processor, the communication cost between them can be discarded.

• A processor is assumed to send the computational results of a task to its immediate successors as soon as it completes the computation.

Given a DAG scheduling system, W is an n × P matrix in which w_{i,j} indicates the estimated computation time of processor P_j executing task n_i. The mean execution time of task n_i can be calculated by the following equation:

\bar{w}_i = \frac{1}{P} \sum_{j=1}^{P} w_{i,j}  (1)

An example of the mean execution time is shown in Figure 1(b).
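Equation (1) can be checked against the first rows of the cost matrix W in Figure 1(b). A minimal sketch (the dictionary layout is ours):

```python
# First three rows of the computation cost matrix W from Figure 1(b):
# per-task costs on processors P1, P2, P3.
W = {"n1": [14, 19, 9], "n2": [13, 19, 18], "n3": [11, 17, 15]}

# Eq. (1): mean execution time of each task over the P processors.
mean_cost = {task: sum(costs) / len(costs) for task, costs in W.items()}

print(round(mean_cost["n1"], 1))  # 14.0
print(round(mean_cost["n2"], 1))  # 16.7
```

The results agree with the w̄i column of the matrix (14, 16.7, 14.3).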

For the communication part, a P × P matrix T is structured to represent the different data transfer rates among processors (Figure 1(d) demonstrates an example). The communication cost of transferring data from task n_i (executing on processor p_x) to task n_j (executing on processor p_y) is denoted by c_{i,j} and can be calculated by the following equation:

c_{i,j} = V_m + Msg_{i,j} \times t_{x,y}  (2)

where V_m is the communication latency of processor P_m, Msg_{i,j} is the size of the message from task n_i to task n_j, and t_{x,y} is the data transfer rate from processor p_x to processor p_y, 1 ≤ x, y ≤ P.

In the static DAG scheduling problem, the processor latency is usually folded into the data transfer rate. Therefore, equation (2) can be simplified as follows:

c_{i,j} = Msg_{i,j} \times t_{x,y}  (3)

Given an application represented by a Directed Acyclic Graph (DAG), G = (V, E), V = {n_j: j = 1, ..., v} is the set of nodes with v = |V|, and E = {e_{i,j} = <n_i, n_j>} is the set of communication edges with e = |E|. In this model, each node is a smallest indivisible task; that is, each node must be executed on one processor from start to completion. Edge <n_i, n_j> denotes the precedence of tasks n_i and n_j: task n_i is the immediate predecessor of task n_j, and task n_j is the immediate successor of task n_i. This precedence means that task n_j can start execution only upon the completion of task n_i; meanwhile, task n_j must receive the essential messages from n_i before its execution. The weight of edge <n_i, n_j> indicates the average communication cost between n_i and n_j.
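Combining equation (3) with the assumption that same-processor communication is free gives a simple cost rule. A sketch with illustrative values (the function name and rate table are ours, not from the paper):

```python
# Eq. (3) plus the zero-cost rule for tasks co-located on one processor.
def comm_cost(msg_size, px, py, transfer_rate):
    if px == py:
        return 0                          # intra-processor messages are free
    return msg_size * transfer_rate[(px, py)]

# Asymmetric transfer rates, in the spirit of Figure 1(d) (values assumed).
t = {("P1", "P2"): 1.5, ("P2", "P1"): 2.0}

print(comm_cost(10, "P1", "P2", t))  # 15.0
print(comm_cost(10, "P1", "P1", t))  # 0
```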

A node without any inward edge is called an entry node, denoted n_entry, while a node without any outward edge is called an exit node, denoted n_exit. In general, the application is supposed to have only one entry node and one exit node. If the actual application has more than one entry (exit) node, we can insert a dummy entry (exit) node with zero-cost edges.
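The dummy entry/exit normalization described above can be sketched as follows; the function and node names are ours, and edges are plain (source, target) pairs with the dummy edges understood to have zero cost.

```python
# Normalize a DAG so it has a single entry and a single exit node:
# if several entries (exits) exist, attach a zero-cost dummy node to all of them.
def normalize(edges, nodes):
    preds = {dst for _, dst in edges}     # nodes with at least one inward edge
    succs = {src for src, _ in edges}     # nodes with at least one outward edge
    entries = [n for n in nodes if n not in preds]
    exits = [n for n in nodes if n not in succs]
    if len(entries) > 1:
        edges += [("n_entry", n) for n in entries]   # zero-cost dummy edges
    if len(exits) > 1:
        edges += [(n, "n_exit") for n in exits]
    return edges

# Two entry nodes 'a' and 'b' feed 'c'; a dummy entry is inserted, but
# 'c' is already the unique exit, so no dummy exit is added.
edges = normalize([("a", "c"), ("b", "c")], ["a", "b", "c"])
print(edges)
```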

3. Preliminaries

This study concentrates on list scheduling approaches in the DAG model. List scheduling is usually divided into a list phase and a processor selection phase. Therefore, prior to discussing the main content, we first define some notations and terminologies used in both phases.

3.1 Parameters for List Phase
