National Science Council Research Project Final Report

Resource Allocation, Job Management and Data Reorganization Techniques for Supporting SPMD Programs on Heterogeneous Parallel Computing Networks (II)

Project type: individual project

Project number: NSC93-2213-E-216-028-

Project period: August 1, 2004 to July 31, 2005
Host institution: Department of Computer Science and Information Engineering, Chung Hua University

Principal investigator: Ching-Hsien Hsu

Project members: Shih-Chang Chen, 翁銘遠, Chao-Yang Lan

Report type: condensed report

Report attachments: report on attending an international conference and the published paper
Availability: this project report is publicly accessible

October 28, 2005


National Science Council Research Project █ Final Report

□ Interim Progress Report

Resource Allocation, Job Management and Data Reorganization Techniques for Supporting SPMD Programs on Heterogeneous Parallel Computing Networks (II)

Project type: █ individual project □ integrated project
Project number: NSC 93-2213-E-216-028-

Project period: August 1, 2004 to July 31, 2005
Host institution: Department of Computer Science and Information Engineering, Chung Hua University

Principal investigator: Ching-Hsien Hsu, Assistant Professor
Co-principal investigators: (none)

Project members: Shih-Chang Chen, 翁銘遠, Chao-Yang Lan, graduate students, Department of Computer Science and Information Engineering, Chung Hua University

Report type (per the approved funding list): □ condensed report █ complete report

This report includes the following required attachments:

□ report on overseas travel or study

□ report on travel or study in mainland China

█ report on attending an international academic conference, with the published paper

□ foreign partner's report for an international cooperative research project

Availability: except for industry-university cooperation projects, industrial technology promotion and personnel training projects, restricted projects, and the cases listed below, this report may be made publicly accessible immediately

□ involves patents or other intellectual property rights; publicly accessible after □ one year □ two years

October 14, 2005


National Science Council Research Project Final Report

Resource Allocation, Job Management and Data Reorganization Techniques for Supporting SPMD Programs on Heterogeneous Parallel Computing Networks

Chinese Abstract

The SPMD computing model developed on parallel computing systems has been widely accepted in many data-intensive and high-performance scientific applications. With advances in network technology, the rapid growth of bandwidth, and considerations of cost and benefit, heterogeneous grid computing has become an alternative platform, besides parallel and distributed systems, for data-intensive and scientific computation. How to port SPMD programs transparently onto heterogeneous computing platforms while preserving their optimal algorithmic performance has therefore become a question well worth studying.

This report describes the development of resource allocation, job management and data reorganization techniques for supporting SPMD programs on heterogeneous parallel computing networks. In this project we proposed efficient resource allocation, job scheduling and communication localization techniques for SPMD data-parallel programs on heterogeneous multi-cluster systems. For resource allocation and job management, addressing heterogeneity in both processor computing power and network bandwidth, we proposed the SCTF (Shortest Communication Time First) job scheduling algorithm, developed a web-based management tool, and implemented a resource allocation, monitoring and job scheduling system on two PC cluster systems. For communication localization, a logical-processor-to-data mapping technique reduces the communication cost between processors in different clusters. In addition, for the load balancing problem on heterogeneous parallel computing networks, we proposed a genetic-algorithm-based fuzzy approach that improves scheduling efficiency on heterogeneous distributed computing platforms. The results of this project help increase the throughput of heterogeneous cluster systems and the execution efficiency of SPMD parallel programs on such systems. For job scheduling, co-allocation and resource management, we tested several well-known job scheduling systems on different platforms; the experimental results show that SCTF improves average system throughput and shortens the average turnaround time of jobs.

Keywords: heterogeneous computing, SPMD, distributed computing, parallel algorithms, resource allocation, job management, data reorganization, grid computing.


Design and Implementation of Resource Allocation and Job Scheduling for Supporting SPMD Programs on Heterogeneous Parallel Computing Networks

Abstract

The SPMD programming model, which evolved from parallel computing systems, has become a widely accepted paradigm for massive computing and high-performance scientific applications. With the progress of network technology, the rapid growth of communication bandwidth, and cost-effectiveness considerations, grid computing environments have become an alternative to parallel and distributed computation for many scientific applications. Thus, how to execute an SPMD program efficiently on heterogeneous computing platforms is a common challenge.

This report presents the development and implementation of resource allocation and job scheduling for supporting SPMD programs on heterogeneous parallel computing networks.

In this project, we have proposed an efficient communication technique for SPMD parallel programs in heterogeneous multi-cluster systems. Using the logical processor to data mapping technique, inter-cluster communications between physical processors can be reduced. In addition, we have proposed a genetic-fuzzy logic based approach for dynamic load balancing on heterogeneous parallel computing systems. The results of our work help increase the throughput of heterogeneous distributed-memory environments and the performance of SPMD parallel programs executing on such systems. For job scheduling, co-allocation and resource management, we have proposed the SCTF (Shortest Communication Time First) task scheduling algorithm. We also developed a web-based resource monitoring tool on two PC cluster systems. We have tested several major tools developed by other research teams on different platforms. The experimental results show that SCTF outperforms Beaumont's method in terms of lower average turnaround time, higher average throughput, less processor idle time and higher processor utilization.

Keywords: Heterogeneous Computing, SPMD, Distributed Computing, Parallel Algorithm, Resource Allocation, Task Management, Data Reconfiguration, Grid Computing.


1. Background and Objectives

With advances in network technology and the rapid growth of bandwidth, together with considerations of cost and benefit, the idea of combining computing resources over the network into a computing system with work coordination capability has emerged, and grid computing has thus become an alternative, besides parallel and distributed platforms, for data-intensive and scientific computation. From a technical point of view, a grid computing environment can combine parallel computers, workstation clusters, and any available computing resources on the network to enlarge its computing power. Accordingly, developing software tools that support computation on heterogeneous computing environments has become a widely discussed topic in recent years. To combine existing parallel programming techniques with heterogeneous grid computing platforms, the problems arising from large-scale parallel computation on distributed network computing environments are well worth studying. Since the SPMD programming model was developed on parallel computing systems, how to port SPMD programs transparently onto heterogeneous computing platforms while preserving their optimal algorithmic performance is the most direct challenge. These problems can be studied from two angles: the management of systems and applications, and the architecture of the computing platform.

Regarding the management of systems and applications, the quality of job allocation directly affects program completion time and the use of system resources. Workload balance avoids lengthening the overall completion time because one system is overloaded, achieving the goal of high-performance computing. On the other hand, to execute a data-parallel program efficiently in a distributed-memory computing environment, appropriate data distribution is important. Since data locality can reduce data transmission between processors, to support the execution of SPMD programs on grid computing systems we will study how to distribute computation effectively to the processors of a heterogeneous distributed-memory cluster computing system so that the processors' workloads are balanced; we will also study efficient methods for data distribution and resource reorganization to improve data locality, reducing data exchange between processors during program execution and lowering communication costs.

In this project, we mainly develop integrated techniques of job scheduling and management, resource allocation, and dynamic data reorganization adapted to heterogeneous distributed network computing environments, in order to improve the performance of SPMD parallel programs. The main work items include: studying the necessity and performance impact of job scheduling, workload balancing and job reallocation on overall program execution; studying the impact and optimization of communication across different network domains or network topologies on the dynamic data exchange of SPMD programs; and developing a web-based job scheduling and monitoring system for heterogeneous network computing environments.

2. Research Methods and Results

For job allocation and resource management in heterogeneous distributed network computing environments, we first addressed software heterogeneity (including operating systems, message-passing interfaces, and local scheduling policies) and solved the problems of authentication and authorization between systems. To run the same job in a distributed heterogeneous environment, we allocate the required resources on different nodes simultaneously and build a virtual common execution environment (in an MPI implementation, MPI_COMM_WORLD can serve as the communication domain during execution) to achieve job co-allocation. In developing the co-allocation method, tools wrapped in the communication module and the job allocation module are used to reach this goal; the main benefit is reducing the complexity of developing the resource management module. We also adopted GRAM (Globus Resource Allocation Management) as a tool to manage local resources (including computer and network services) through a single interface, solving the resource management problem and improving the extensibility of future systems. Besides the security and authentication mechanisms mentioned above, GRAM also supports sophisticated resource co-allocation and failure detection.

In a heterogeneous grid computing environment, job allocation, scheduling, and the resource reallocation needed during computation all belong to the resource co-allocation problem, and the purpose of these dynamic management mechanisms is to maintain system reliability and improve program performance. We therefore considered the parallel execution of SPMD programs on heterogeneous computing environments together with the communication interface technology.

For data communication between different computing platforms, we adopted an MPI-based communication interface standard, the MPICH-G implementation (using the MPI communication interface helps the portability of the programs developed in this project across different distributed-memory platforms), which also helps us run programs across different distributed platforms. Compared with similar tools provided by other grid computing environments, MPICH-G provides a simple, single interface to start program execution; moreover, it uses the same syntax whether jobs are allocated on SMP or MPP systems. Across different network domains, we also tried the various communication mechanisms provided by the Nexus communication library. As for replicating an SPMD program efficiently to the selected computers of a heterogeneous distributed system, the MPICH-G implementation also overcomes difficulties with hardware and storage: we used the GASS (Global Access Secondary Storage) tool to copy the SPMD program to be executed to every remote machine. It is worth noting that all SPMD programs must first be compiled by the programmer before the executables can be distributed to the participating nodes. In addition, based on the MPICH-G implementation, we can adopt an online allocation method with dynamic demand updates: send out requests, confirm the correct start time, and then provide functions for the different processes to build a virtual common execution environment (using MPI_COMM_WORLD as the communication domain during execution).

Allocating computation across multiple MPP systems is the more complicated part. The job management issues and the methods we considered are discussed as follows. First, resources are allocated to the participating computers; then the processes are started; finally, all processes are linked into one large computation. Since resource allocation and process creation policies differ among computers, we negotiate a suitable approach among the computing nodes. Moreover, starting a process may take considerable time and fail unpredictably, so we employ the GRAM interface and its libraries for failure detection (which can be decided by a timeout), and synchronize immediately after process startup completes. Using the single GRAM interface to perform local scheduling and support job co-allocation makes it possible to collect resource information from each system effectively. At the same time, we use LDAP (Lightweight Directory Access Protocol) to provide the resource allocation module with up-to-date system information and status.

For job sharing, load balancing and reallocation on heterogeneous parallel computing networks, we proposed a genetic-algorithm-based fuzzy approach that improves scheduling efficiency on heterogeneous distributed computing platforms. In addition, to improve the performance of SPMD programs executing across different network domains or network topologies, we proposed an efficient communication technique for SPMD data-parallel programs on heterogeneous multi-cluster systems; a logical-processor-to-data mapping technique reduces the communication cost between processors in different clusters. For resource allocation and job management, addressing heterogeneity in both processor computing power and network bandwidth, we proposed the SCTF (Shortest Communication Time First) job scheduling algorithm and developed a web-based management tool, implementing a resource allocation, monitoring and job scheduling system on two PC cluster systems. For job scheduling, co-allocation and resource management, we tested several well-known job scheduling systems on different platforms; the experimental results show that SCTF improves average system throughput and shortens the average turnaround time of jobs. The results of this project help increase the throughput of heterogeneous cluster systems and the execution efficiency of SPMD parallel programs on such systems.

3. Results and Discussion

The main results of this project are summarized below:

• On top of two existing PC clusters, we built a heterogeneous distributed computing platform. Based on the differences (heterogeneity) in processor computing power and network transmission speed, we proposed a performance evaluation module to predict program execution performance.

• Another result of this project is the study of data partitioning techniques for the communication optimization of parallel programs on computational grids. Using a logical-processor-to-data mapping method, the data communication time between processors can be reduced effectively. The proposed method applies to both homogeneous and heterogeneous computing systems. This technique was presented at the 2005 European Grid Conference, where it drew the interest and discussion of several scholars.

• In addition, for workload balancing in heterogeneous network computing environments, we proposed a genetic-algorithm-based fuzzy approach for heterogeneous distributed systems. For job scheduling, co-allocation and resource management, we proposed the SCTF (Shortest Communication Time First) job scheduling algorithm; this algorithm is easy to implement on both processor-heterogeneous and network-heterogeneous systems, and the experimental results show that SCTF achieves better system throughput and average job turnaround time. Related work also includes porting the data reorganization technique to job rescheduling (reallocation) under the SPMD programming model.

• Papers published under this project are listed below:

1. Ching-Hsien Hsu and Min-Hao Chen, "Communication Free Dynamic Data Redistribution of Symmetrical Matrices on Distributed Memory Machines," accepted, IEEE Transactions on Parallel and Distributed Systems. (SCI, EI, NSC93-2213-E-216-028) // dynamic data redistribution for symmetric matrices

2. Ching-Hsien Hsu, Shih-Chang Chen and Chao-Yang Lan, "Scheduling Contention-Free Irregular Redistribution in Parallelizing Compilers," accepted, The Journal of Supercomputing, Kluwer Academic Publishers. (SCI, EI, NSC93-2213-E-216-028, NCHC-KING-010200) // communication scheduling for heterogeneous systems

3. Ching-Hsien Hsu, Shih-Chang Chen and Tzu-Tai Lo, "Locality Preserving Data Partitioning for SPMD Programs on Computational Grid," Chung Hua Journal of Science and Engineering, Vol. 3, No. 1, pp. 121-128, January 2005. (NSC92-2213-E-216-028) // data localization for SPMD programs

4. Ching-Hsien Hsu and Tai-Long Chen, "Grid Enabled Master Slave Task Scheduling for Heterogeneous Processor Paradigm," Grid and Cooperative Computing - Lecture Notes in Computer Science, Vol. 3795, pp. 449-454, Springer-Verlag, Dec. 2005. (GCC'05) (SCI Expanded, NSC92-2213-E-216-028) // task scheduling for heterogeneous systems

5. Ching-Hsien Hsu, Shih-Chang Chen, Chao-Yang Lan, Chao-Tung Yang and Kuan-Ching Li, "Scheduling Convex Bipartite Communications Towards Efficient GEN_BLOCK Transformations," Parallel and Distributed Processing and Applications - Lecture Notes in Computer Science, Vol. 3758, pp. 419-424, Springer-Verlag, Nov. 2005. (ISPA'05) (SCI Expanded, NSC92-2213-E-216-028) // communication scheduling for heterogeneous systems

6. Ching-Hsien Hsu, Guan-Hao Lin, Kuan-Ching Li and Chao-Tung Yang, "Localization Techniques for Cluster-Based Data Grid," Algorithm and Architecture for Parallel Processing - Lecture Notes in Computer Science, Vol. 3719, pp. 83-92, Springer-Verlag, Oct. 2005. (ICA3PP'05) (SCI Expanded, NSC93-2213-E-216-028) // data localization for cluster-based data grids

7. Kun-Ming Yu, Ching-Hsien Hsu and Chwani-Lii Sune, "A Genetic-Fuzzy Logic Based Load Balancing Algorithm in Heterogeneous Distributed Systems," Proceedings of the IASTED International Conference on Neural Network and Computational Intelligence (NCI 2004), Feb. 2004, Grindelwald, Switzerland. // workload balancing for heterogeneous systems


4. Self-Evaluation of Project Results

The research results of this project met the expected goals. The first result, a genetic-algorithm-based fuzzy approach for workload balancing and scheduling in heterogeneous distributed systems, was presented at the 2004 International Conference on Neural Network and Computational Intelligence. Another result, communication optimization of data-parallel programs on multi-cluster grid systems, was presented at the 2005 European Grid Conference. Finally, we also made a breakthrough on the irregular communication scheduling problem; the paper has recently been accepted by The Journal of Supercomputing and is expected to appear in 2006-2007. Extensions of the project's results will continue to be submitted to international conferences and journals. This project has produced respectable research results, and we thank the National Science Council for the opportunity. In the next project year we will work harder and seek funding to build a more complete research environment. Finally, I would also like to acknowledge and thank the students who worked diligently on this project.

5. References

1. D. Angulo, I. Foster, C. Liu and L. Yang, "Design and Evaluation of a Resource Selection Framework for Grid Applications," Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland, July 2002.

2. K. Czajkowski, I. Foster and C. Kesselman, "Resource Co-Allocation in Computational Grids," Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing (HPDC-8), pp. 219-228, 1999.

3. C. Lee, R. Wolski, I. Foster, C. Kesselman and J. Stepanek, "A Network Performance Tool for Grid Computations," Supercomputing '99, 1999.

4. I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt and A. Roy, "A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation," Intl. Workshop on Quality of Service, 1999.

5. K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith and S. Tuecke, "A Resource Management Architecture for Metacomputing Systems," Proc. IPPS/SPDP '98 Workshop on Job Scheduling Strategies for Parallel Processing, pp. 62-82, 1998.

6. B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnal and S. Tuecke, "Data Management and Transfer in High Performance Computational Grid Environments," Parallel Computing Journal, Vol. 28 (5), May 2002, pp. 749-771.

7. H. Stockinger, A. Samar, B. Allcock, I. Foster, K. Holtman and B. Tierney, "File and Object Replication in Data Grids," Journal of Cluster Computing, 5(3):305-314, 2002.

8. S. Vazhkudai and J. Schopf, "Using Disk Throughput Data in Predictions of End-to-End Grid Transfers," Proceedings of the 3rd International Workshop on Grid Computing (GRID 2002), Baltimore, MD, November 2002.

9. J. M. Schopf and S. Vazhkudai, "Predicting Sporadic Grid Data Transfers," 11th IEEE International Symposium on High-Performance Distributed Computing (HPDC-11), IEEE Press, Edinburgh, Scotland, July 2002.

10. M. Colajanni and P. S. Yu, "A performance study of robust load sharing strategies for distributed heterogeneous Web servers," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 2, pp. 398-414, 2002.

11. Kun-Ming Yu, Ching-Hsien Hsu and Chwani-Lii Sune, "A Genetic-Fuzzy Logic Based Load Balancing Algorithm in Heterogeneous Distributed Systems," Proceedings of the IASTED International Conference on Neural Network and Computational Intelligence (NCI 2004), Feb. 2004, Grindelwald, Switzerland.

12. Ching-Hsien Hsu, Tzu-Tai Lo and Shih-Chang Chen, "Optimizing Communications of Data Parallel Programs on Cluster Grid," Proceedings of the 1st Workshop on Grid Technology and Applications, Dec. 2004, NCHC, Hsinchu, Taiwan.


Report Summary for Overseas Travel by Personnel of Agencies under the Executive Yuan

Date written: March 1, 2005

Name: Ching-Hsien Hsu
Affiliation: Department of Computer Science and Information Engineering, Chung Hua University
Contact: 03-5186410, chh@chu.edu.tw
Date of birth: February 23, 1973
Title: Assistant Professor
International conference attended: European Grid Conference, February 14-16, 2005
Country and venue: Science Park Amsterdam, The Netherlands
Travel period: February 12, 2005 to February 18, 2005

Summary:

This international academic conference held in the Netherlands lasted three days. It opened on the first morning with an insightful keynote by Dr. Domenico Laforenza on "Towards a Next Generation Grid: Learning from the past, Looking into the future." On the same day, many important research results were presented in two parallel sessions; I chose to attend the Architecture and Infrastructure session. On the first evening I attended the reception and exchanged opinions with several foreign scholars and with professors from China and Hong Kong. On the morning of the second day I attended the Data and Information Management sessions, learned about many emerging research topics, and came to understand the main research directions of most scholars abroad. That afternoon we presented our paper, and I attended the conference banquet, where I met and talked with several foreign scholars and took photos with them. On the last day I attended the sessions closest to our paper, Scheduling, Fault Tolerance and Mapping, as well as the distributed computing sessions, and took the last chance to get acquainted with professors from abroad, hoping to deepen their impression of research in Taiwan. Over the three days I heard many excellent presentations covering topics such as grid system technologies, job scheduling, grid computing, grid databases, and wireless networks. Many well-known scholars participated, so every attendee could obtain the newest international techniques and information; it was a very successful conference. I benefited greatly from attending: I met many internationally known researchers and professionals, exchanged ideas with them, and discussed various problems in my field face to face with other professors. Having seen so many research results and heard several keynote speeches, I found the venue and the invited speakers to be excellent; the conference was very well organized and worth learning from.

Reviewing opinions of the attendee's agency:
Reviewing opinions of the forwarding agency:
Handling opinions of the Research, Development and Evaluation Commission:


Localized Communications of Data Parallel Programs on Multi-Cluster Grid Systems1

Ching-Hsien Hsu, Tzu-Tai Lo and Kun-Ming Yu
Department of Computer Science and Information Engineering
Chung Hua University, Hsinchu, Taiwan 300, R.O.C.
chh@chu.edu.tw

Abstract. The advent of widely interconnected computing resources introduces the technologies of grid computing. A typical grid system, the cluster grid, consists of several clusters located on multiple campuses distributed globally over the Internet. Because of the Internet infrastructure of a cluster grid, communication overhead becomes a key factor in the performance of applications on a cluster grid. In this paper, we present a processor reordering technique for the communication optimization of data parallel programs on cluster grids. The alignment of data in parallel programs is taken as an example to examine the proposed technique. The effect of the processor reordering technique is to reduce inter-cluster communication overheads and to speed up the execution of parallel applications on the underlying distributed clusters. Our preliminary analysis and experimental results of the proposed method for mapping data to logical grid nodes show an improvement in communication costs, leading to better performance of parallel programs on different hierarchical grids of cluster systems.

1. Introduction

One of the virtues of high performance computing is to integrate massive computing resources for accomplishing large-scale computation problems. What these problems have in common is an enormous amount of data to be processed. Due to their cost-effectiveness, clusters have been employed as a platform for high-performance and high-availability computing. In recent years, with the growth of Internet technologies, grid computing has emerged as a widely accepted paradigm for next-generation applications, such as data parallel problems in supercomputing, web serving, commercial applications and grand challenge problems.

Differing from traditional parallel computers, a grid system [7] integrates distributed computing resources to establish a virtual and highly expandable parallel platform. Figure 1 shows the typical architecture of a cluster grid. Each cluster is geographically located on a different campus and connected by computational grid software through the Internet. In a cluster grid, communications occur when grid nodes exchange data with one another via the network to complete a job. These communications are usually classified into two types, local and remote. If two grid nodes belong to different clusters, the messaging must be accomplished through the Internet; we refer to this kind of data transmission as external communication. If two grid nodes are in the same space domain, the communication takes place within a cluster; we refer to this kind of data transmission as interior communication. Intuitively, external communication usually has higher latency than interior communication, since the data must be routed through a number of layer-3 routers or higher-level network devices over the Internet. Therefore, to execute parallel applications efficiently on a cluster grid, it is extremely important to avoid large amounts of external communication.

[Figure content: four PC clusters, A to D, interconnected through the Internet, form the cluster grid.]

Figure 1: The paradigm of cluster grid.

In this paper, we consider the issue of minimizing the external communications of data parallel programs on a cluster grid. We first employ the example of data alignments and realignments, provided in many data parallel programming languages, to examine the effectiveness of the proposed data to logical processor mapping scheme. As research has discovered, many parallel applications require different access patterns to meet parallelism and data locality during program execution. This involves a series of data transfers such as array redistribution. For example, a 2D-FFT pipeline involves communicating images with the same distribution repeatedly from one task to another. Consequently, the computing nodes might decompose the local data set into sub-blocks uniformly and remap these data blocks to a designated processor group. Based on this phenomenon, we propose a processor reordering scheme to reduce the volume of external communications of data parallel programs in a cluster grid. The key idea is to distribute data to grid/cluster nodes according to a mapping function at the initial data distribution phase, instead of in numerically ascending order.

____________________________________

1 The work of this paper was supported by NSC, the National Science Council of Taiwan, under grant number NSC-93-2213-E-216-028.

We also evaluate the impact of the proposed techniques. The theoretical analysis and experimental results show an improvement in the volume of interior communications, leading to better performance of data alignment in different hierarchical cluster grids.

The rest of this paper is organized as follows. Section 2 briefly surveys related work. In Section 3, we formulate the communication model of parallel data partitioning and realignment on cluster grids. Section 4 describes the processor reordering scheme for communication localization. Section 5 reports the performance analysis and experimental results. Finally, we conclude the paper in Section 6.

2. Related Work

Clusters have been widely used for solving grand challenge applications due to their good price-performance nature. With the growth of Internet technologies, computational grids [4] have become a newly accepted paradigm for solving these applications. As the number of clusters increases within an enterprise and globally, there is a need for a software architecture that can integrate these resources into a larger grid of clusters. Therefore, the goal of effectively utilizing the power of geographically distributed computing resources has been the subject of many research projects like Globus [6, 8] and Condor [9]. Frey et al. [9] also presented an agent-based resource management system, called Condor-G, that allows users to control global resources; the system combines Condor with Globus and gives powerful job management capabilities.

Recent work on computational grids has broadly discussed different aspects, such as security, fault tolerance, resource management [9, 2], job scheduling [17, 18, 19], and communication optimization [20, 5, 16, 3]. For communication optimization, Dawson et al. [5] and Zhu et al. [20] addressed the optimization of user-level communication patterns in the local space domain for cluster-based parallel computing. Plaat et al. analyzed the behavior of different applications on wide-area multi-clusters [16, 3]. Similar research was carried out in past years on traditional supercomputing architectures [12, 13]. For example, Guo et al. [11] eliminated node contention in communication phases and reduced communication steps with a schedule table. Y. W. Lim et al. [15] presented an efficient algorithm for block-cyclic data realignment. Kalns and Ni [14] proposed a processor mapping technique to minimize the volume of communication data for runtime data realignment; namely, the mapping technique minimizes the size of the data that needs to be transmitted between two algorithm phases. Lee et al. [10] proposed a similar algorithm, processor reordering, to reduce data communication cost, and compared its effects under various conditions of communication patterns.

The above research gives significant improvements for parallel applications on distributed-memory multicomputers. However, most of these techniques are only applicable to parallel programs running in a local space domain, such as a single cluster or parallel machine. For a global grid of clusters, they become inapplicable due to various factors of the Internet hierarchy and its communication latency. In this paper, our emphasis is on the optimization of communications for data parallel programs on cluster grids.

3. Preliminaries

3.1 Problem Formulation

The data parallel programming model has become a widely accepted paradigm for parallel programming on distributed-memory multicomputers. To execute a parallel program efficiently, appropriate data distribution is critical for balancing the computational load. A typical function that decomposes the data equally can be accomplished via the BLOCK distribution directive.

It has been shown that the data reference patterns of some parallel applications might change dynamically. As they evolve, a good mapping of data to logical processors must change adaptively in order to ensure good data locality and reduce inter-processor communication. For example, a global array could be equally allocated to a set of processors initially in the BLOCK distribution manner. As the algorithm enters another phase that requires access to fine-grained data patterns, each processor might divide its local data into sub-blocks locally and then distribute these sub-blocks to the corresponding destination processors. Figure 2 shows an example of this scenario. In the initial distribution, the global array is evenly decomposed into nine data sets and distributed over processors selected from three clusters. In the target distribution, each node divides its local data into three sub-blocks evenly and distributes them to the same processor set in the grid as in the initial distribution. Since these data blocks might be needed by and located in different processors, efficient inter-processor communication becomes a major factor in the performance of these applications.

[Figure content: in the initial distribution, blocks A-I are assigned to P0-P8, where {P0, P1, P2} are from Cluster-1, {P3, P4, P5} from Cluster-2 and {P6, P7, P8} from Cluster-3; in the target distribution, each block X is split into x1, x2, x3 and redistributed over the same nine processors.]

Figure 2: Data distributions over cluster grid.

To facilitate the presentation of the proposed approach, we assume that a global array is distributed over the processors in BLOCK manner at initiation. Each node is requested to partition its local block into K equal sub-blocks and distribute them over the processors in the same way. The second assumption is that each cluster provides the same number of computers involved in the computation.

Definition 1: The above term K is defined as the partition factor.

For instance, the partition factor of the example in Figure 2 is K = 3 (block A is divided into a1, a2, a3; B is divided into b1, b2, b3; etc.).

Definition 2: Given a cluster grid, C denotes the number of clusters in the grid; ni is the number of processors selected from cluster i, where 1 ≤ i ≤ C; P is the total number of processors in the cluster grid.

According to Definition 2, we have P = Σ_{i=1}^{C} n_i. Figure 2 has three clusters, thus C = 3, where {P0, P1, P2} ∈ Cluster 1, {P3, P4, P5} ∈ Cluster 2 and {P6, P7, P8} ∈ Cluster 3; we also have n1 = n2 = n3 = 3 and P = 9.

3.2 Communication Cost Model

Because the interconnection switching networks of the cluster systems might differ, the interior communication costs of the clusters should be identified individually to obtain an accurate evaluation. Let Ti represent the time for two processors that both reside in Cluster-i to transmit a unit of data, and mi the total volume of all interior messages in Cluster-i. For an external communication between cluster i and cluster j, Tij represents the time for a processor p in cluster i and a processor q in cluster j to transmit a unit of data; similarly, mij is the total volume of all external messages between cluster i and cluster j. With these definitions, we have the following cost function,

T_comm = Σ_{i=1}^{C} (m_i × T_i) + Σ_{i,j=1, i≠j}^{C} (m_ij × T_ij)    (1)

Because various factors over the Internet can cause communication delays, it is difficult to obtain accurate costs from the above function. Since a criterion for performance modeling is needed, counting the interior and external communications among all clusters is an alternative mechanism for a legitimate evaluation. Thus, in the following discussions we tot up the numbers of these two kinds of communications to represent the communication costs over the whole running phase. The volume of interior communications, denoted |I|, and of external communications, denoted |E|, are defined as follows,

|I| = Σ_{i=1}^{C} I_i    (2)

|E| = Σ_{i,j=1, i≠j}^{C} E_ij    (3)

where I_i is the total number of interior communications within cluster i, and E_ij is the total number of external communications between cluster i and cluster j.

4. Communication Localization


4.1 Motivating Example

Let us consider the example in Figure 2. In the target distribution, processor P0 divides data block A into a1, a2, and a3. Then it distributes these three sub-blocks to processors P0, P1 and P2, respectively. Since processors P0, P1 and P2 belong to the same cluster as P0, these are three interior communications.

A similar situation on processor P1 generates three external communications: P1 divides its local data block B into b1, b2, and b3 and distributes these three sub-blocks to P3, P4 and P5, respectively. However, processor P1 belongs to Cluster 1 while processors P3, P4 and P5 belong to Cluster 2; thus, this results in three external communications. Figure 3 summarizes all messaging patterns of the example in a communication table. The messages {a1, a2, a3}, {e1, e2, e3} and {i1, i2, i3} are interior communications (the shaded blocks); all the others are external communications. Therefore, we have |I| = 9 and |E| = 18.

SP \ DP   P0 P1 P2 | P3 P4 P5 | P6 P7 P8
         (Cluster-1)|(Cluster-2)|(Cluster-3)
P0        a1 a2 a3 |          |
P1                 | b1 b2 b3 |
P2                 |          | c1 c2 c3
P3        d1 d2 d3 |          |
P4                 | e1 e2 e3 |
P5                 |          | f1 f2 f3
P6        g1 g2 g3 |          |
P7                 | h1 h2 h3 |
P8                 |          | i1 i2 i3

Figure 3: Communication table of data distribution over cluster grid.

Figure 4 illustrates a bipartite representation of the communications given in the above table. In this graph, the dashed arrows and solid arrows indicate interior and external communications, respectively. Each arrow contains three communication links.

Figure 4: Interior and external communications using bipartite representation.

4.2 Processor Reordering Data Partitioning

Processor mapping techniques were used in several previous works to minimize the data transmission time of runtime array redistribution. In a cluster grid system, a similar concept can be applied. Following the assumptions in Section 3.1, we propose a processor reordering technique and its mapping function, applicable to data realignment on a cluster grid. To localize the communications, the mapping function produces a reordered sequence of processors that groups communications into local clusters. A reordering agent is used to accomplish this process. Figure 5 shows the concept of the processor reordering technique for parallel data to logical processor mapping. The source data is partitioned and distributed to processors in the initial distribution (ID(PX)) according to the processor sequence derived by the reordering agent, where X is the processor id and 0 ≤ X ≤ P-1. To accomplish the target distribution (TD(PX')), the initial data is divided into K sub-blocks and realigned with processors according to the new processor id X', which is also derived by the reordering agent. Given the partition factor K and the processor grid (with variables C and ni), for the case K = ni, the mapping function used in the reordering agent is formulated as follows,

F(X) = X' = ⌊X/C⌋ + (X mod C) × K    (4)

We use the same example to demonstrate the above reordering scheme. Figure 6 shows the communication table of messages using the new logical processor sequence. The initial distribution of the source data is allocated by the processor id sequence <P0, P3, P6, P1, P4, P7, P2, P5, P8>, which is derived from Equation 4. To accomplish the target distribution, P0 divides data block A into a1, a2, a3 and distributes them to P0, P1 and P2, respectively. These communications are interior. For P3, the division of the initial data also generates three interior communications, because P3 divides its local data B into b1, b2, b3 and distributes these three sub-blocks to P3, P4 and P5, respectively, which are in the same cluster as P3. Similarly, P6 sends c1, c2 and c3 to processors P6, P7 and P8, causing three interior communications. Eventually, no external communication is incurred in the example in Figure 6.

[Figure content: the master node partitions the source data; the reordering agent generates new processor ids for the initial distribution ID(Px); the alignment/dispatch step determines the target cluster and designates the target node for the realignment to TD(PX').]

Figure 5: The flow of data to logical processor mapping.

SP \ DP   P0 P1 P2 | P3 P4 P5 | P6 P7 P8
         (Cluster-1)|(Cluster-2)|(Cluster-3)
P0        a1 a2 a3 |          |
P3                 | b1 b2 b3 |
P6                 |          | c1 c2 c3
P1        d1 d2 d3 |          |
P4                 | e1 e2 e3 |
P7                 |          | f1 f2 f3
P2        g1 g2 g3 |          |
P5                 | h1 h2 h3 |
P8                 |          | i1 i2 i3

Figure 6: Communication table with processor reordering.

The bipartite representation of Figure 6's communication table is shown in Figure 7. All the communication arrows are dashed lines. Totting up the communications, we have |I| = 27 and |E| = 0; the external communications are completely eliminated.

5. Performance Analysis and Experimental Results

5.1 Performance Analysis

The effectiveness of the processor reordering technique in different hierarchies of cluster grids can be evaluated theoretically. This section presents the improvement in the volume of interior communications for different numbers of clusters (C) and partition factors (K).

For the case consisting of three clusters (C=3), Figure 8(a) shows that the processor reordering technique provides more interior communications than the method without processor reordering. For the case consisting of four clusters (C=4), with K varying from 4 to 10, the processor reordering technique also provides more interior communications, as shown in Figure 8(b). Note that Figure 8 reports theoretical results, which are not affected by Internet traffic; in other words, Figure 8 gives our theoretical predictions.

Source: P0 P3 P6 P1 P4 P7 P2 P5 P8
Target: P0 P1 P2 P3 P4 P5 P6 P7 P8

Figure 7: Bipartite representation with processor reordering.


Figure 8: The number of interior communications |I|, with and without reordering: (a) C=3, K = 3 to 10; (b) C=4, K = 4 to 10.

5.2 Simulation settings and Experimental Results

To evaluate the performance of the proposed technique, we implemented the processor reordering method and tested it on Taiwan UniGrid, in which 8 campus clusters are interconnected via the Internet. Each cluster has a different number of computing nodes. The programs were written in the single program multiple data (SPMD) programming paradigm with C+MPI code.

Figure 9 shows the execution time of the methods with and without processor reordering when performing data realignment for C=3 and K=3. Figure 9(a) gives the result for 1 MB of test data without file-system access (I/O); the result for 10 MB of test data accessed via the file system is given in Figure 9(b). Different combinations of clusters, denoted NTI, NTC, NTH, etc., were tested. The composition of these labels is summarized in Table 1.

Table 1: Labels of different cluster grid

Label  Cluster-1  Cluster-2  Cluster-3
NTI    NCHC       NTHU       IIS
NTC    NCHC       NTHU       CHU
NTH    NCHC       NTHU       THU
NCI    NCHC       CHU        IIS
NCD    NCHC       CHU        NDHU
NHD    NCHC       THU        NDHU

In Figure 9(a), we observe that the processor reordering technique outperforms the traditional method. In this experiment, our attention is on the efficiency of the processor reordering technique rather than on the execution times of the different clusters. Compared with the results given in Figure 8, this experiment matches the theoretical predictions and satisfactorily reflects the efficiency of the processor reordering technique.

Figure 9(b) presents the results with larger test data (10 MB) under the same cluster grid. Each node is requested to perform the data realignments through file-system access (I/O). The improvement rates are lower than those in Figure 9(a) because both methods spend part of the time performing I/O, so the proportion of the communication cost becomes lower. Nonetheless, the reordering technique still provides considerable improvement.

Figure 9: Execution time of data realignments on cluster grid when C = K = 3: (a) 1 MB, without I/O; (b) 10 MB, with I/O.

6. Conclusions and Future Works

In this paper, we have presented a processor reordering technique for localizing the communications of data parallel programs on a cluster grid. Our preliminary analysis and experimental results of re-mapping data to logical grid nodes show an improvement in the volume of interior communications. The proposed techniques contribute to better performance of data parallel programs on different hierarchical grid-of-clusters systems.

A number of research issues remain. The current work restricts the conditions under which the realignment problem is solved. In the future, we intend to develop generalized mapping mechanisms for parallel data partitioning. We will also study realistic applications and analyze their performance on UniGrid. In addition, the issues of larger grid systems and the analysis of network communication latency are also of interest and will be investigated.

References

[1] Taiwan UniGrid, http://unigrid.nchc.org.tw

[2] O. Beaumont, A. Legrand and Y. Robert, "Optimal algorithms for scheduling divisible workloads on heterogeneous systems," Proceedings of the 12th IEEE Heterogeneous Computing Workshop, 2003.

[3] Henri E. Bal, Aske Plaat, Mirjam G. Bakker, Peter Dozy, and Rutger F.H. Hofman, “Optimizing Parallel Applications for Wide-Area Clusters,” Proceedings of the 12th International Parallel Processing Symposium IPPS'98, pp 784-790, 1998.

[4] J. Blythe, E. Deelman, Y. Gil, C. Kesselman, A. Agarwal, G. Mehta and K. Vahi, “The role of planning in grid computing,” Proceedings of ICAPS’03, 2003.

[5] J. Dawson and P. Strazdins, “Optimizing User-Level Communication Patterns on the Fujitsu AP3000,” Proceedings of the 1st IEEE International Workshop on Cluster Computing, pp. 105-111, 1999.

[6] I. Foster, “Building an open Grid,” Proceedings of the second IEEE international symposium on Network Computing and Applications, 2003.

[7] I. Foster and C. Kesselman, "The Grid: Blueprint for a New Computing Infrastructure," Morgan Kaufmann, ISBN 1-55860-475-8, 1999.

[8] I. Foster and C. Kesselman, "Globus: A metacomputing infrastructure toolkit," Intl. J. Supercomputer Applications, vol. 11, no. 2, pp. 115-128, 1997.

[9] James Frey, Todd Tannenbaum, M. Livny, I. Foster and S. Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids," Journal of Cluster Computing, vol. 5, pp. 237-246, 2002.

[10] Saeri Lee, Hyun-Gyoo Yook, Mi-Soon Koo and Myong-Soon Park, “Processor reordering algorithms toward efficient GEN_BLOCK redistribution,” Proceedings of the 2001 ACM symposium on Applied computing, 2001.

[11] M. Guo and I. Nakata, "A Framework for Efficient Data Redistribution on Distributed Memory Multicomputers," The Journal of Supercomputing, vol. 20, no. 3, pp. 243-265, 2001.

[12] Florin Isaila and Walter F. Tichy, “Mapping Functions and Data Redistribution for Parallel Files,” Proceedings of IPDPS 2002 Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications, Fort Lauderdale, April 2002.

[13] Jens Knoop and Eduard Mehofer, "Distribution assignment placement: Effective optimization of redistribution costs," IEEE TPDS, vol. 13, no. 6, June 2002.

[14] E. T. Kalns and L. M. Ni, “Processor mapping techniques toward efficient data redistribution,” IEEE TPDS, vol. 6, no. 12, pp. 1234-1247, 1995.

[15] Y. W. Lim, P. B. Bhat and V. K. Prasanna, "Efficient algorithms for block-cyclic redistribution of arrays," Algorithmica, vol. 24, no. 3-4, pp. 298-330, 1999.

[16] Aske Plaat, Henri E. Bal, and Rutger F.H. Hofman, “Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects,” Proceedings of the 5th IEEE High Performance Computer Architecture HPCA'99, pp. 244-253, 1999.

[17] Xiao Qin and Hong Jiang, “Dynamic, Reliability-driven Scheduling of Parallel Real-time Jobs in Heterogeneous Systems,” Proceedings of the 30th ICPP, Valencia, Spain, 2001.

[18] S. Ranaweera and Dharma P. Agrawal, “Scheduling of Periodic Time Critical Applications for Pipelined Execution on Heterogeneous Systems,” Proceedings of the 30th ICPP, Valencia, Spain, 2001.

[19] D.P. Spooner, S.A. Jarvis, J. Caoy, S. Saini and G.R. Nudd, “Local Grid Scheduling Techniques using Performance Prediction,” IEE Proc. Computers and Digital Techniques, 150(2): 87-96, 2003.

[20] Ming Zhu, Wentong Cai and Bu-Sung Lee, “Key Message Algorithm: A Communication Optimization Algorithm in Cluster-Based Parallel Computing,” Proceedings of the 1st IEEE International Workshop on Cluster Computing, 1999.


Grid Enabled Master Slave Task Scheduling for Heterogeneous Processor Paradigm

Ching-Hsien Hsu, Tai-Lung Chen and Guan-Hao Lin

Department of Computer Science and Information Engineering Chung Hua University, Hsinchu, Taiwan 300, R.O.C.

Email: chh@chu.edu.tw

Abstract:

Efficient task scheduling is an important issue for the system performance of a computational grid. To investigate this problem, the master-slave paradigm is a good vehicle for developing tasking technologies for centralized grid systems. In this paper, we present an efficient method for dispatching tasks to heterogeneous processors in a master-slave environment. The main idea of the proposed technique is to first allocate tasks to the processors with lower communication overheads. A significant advantage of this approach is that the average turnaround time can be minimized. A second advantage is that system throughput can be increased by dispersing processor idle time. The proposed model can also be applied to map tasks to heterogeneous cluster systems in grid environments in which the communication costs vary among clusters. To evaluate the performance of the proposed techniques, we implemented the proposed algorithms along with Beaumont's method. The experimental results show that our techniques outperform Beaumont's method in terms of lower average turnaround time, higher average throughput, less processor idle time and higher processor utilization.

Keywords: master-slave paradigm, heterogeneous processors, task scheduling, computational grid, Least Job First

1. Introduction

One of the virtues of high performance computing is integrating massive computing resources to accomplish large computation problems. Cluster computing is one of the well-known high performance paradigms. The use of master-slave clusters of computers as platforms for high-performance and high-availability computing is mainly due to their cost-effectiveness. With the growth of Internet technologies, computational grids have become a widely accepted paradigm for solving numerous applications and grand challenge problems.

A computational grid system integrates geographically distributed computing resources to establish a virtual and highly expandable parallel machine. In recent years, more and more research has addressed the scheduling problem in heterogeneous grid systems. A centralized computational grid system can be viewed as the collection of one resource broker (the master processor) and several heterogeneous clusters (slave processors). Therefore, to investigate the task scheduling problem, the master-slave paradigm is a good vehicle for developing tasking technologies for centralized grid systems.

The master-slave tasking model is simple and widely used. Figure 1 shows an example of the master-slave paradigm: one master node connects to n slave nodes.

A pool of independent tasks is dispatched by the master processor and processed by the n slave processors. In a heterogeneous implementation, slave processors may have different computation speeds. Each slave processor executes its tasks after it receives its own part. Communication between the master and slave nodes is handled through a shared medium (e.g., a bus) that can be accessed only in exclusive mode; that is, the communications between the master and different slave processors cannot overlap.

In general, the optimization of the master-slave tasking problem is twofold. One goal is to minimize the total execution time for a given fixed amount of tasks, i.e., to minimize the average turnaround time. The other is to maximize the total amount of tasks finished in a given time period, i.e., to maximize throughput.

Figure 1: The Master-Slave paradigm.

In this paper, an efficient method for scheduling homogeneous tasks to heterogeneous processors in a master-slave environment is presented. The main idea of the proposed technique is to first allocate tasks to the processors with lower communication overheads. A significant improvement of this approach is that average turnaround
