National Science Council (Executive Yuan) Research Project Final Report

Research on Communication and I/O Localization of Data Parallel Programs on Computational Grids and Development of Application Tools (3/3)

Final Report (Complete Version)

Project type: Integrated project
Project number: NSC 96-2221-E-216-001-
Duration: August 1, 2007 to July 31, 2008
Host institution: Department of Computer Science and Information Engineering, Chung Hua University
Principal investigator: Ching-Hsien Hsu (許慶賢)
Project members: master's students / part-time assistants 張智鈞, 郁家豪, 蔡秉儒 (Bing-Ru Tsai); doctoral student / part-time assistant 陳泰龍 (Tai-Lung Chen)
Report attachments: international conference trip report and the published paper
Handling: this project involves patents or other intellectual property rights; it may be made publicly searchable after two years

October 30, 2008 (ROC year 97)


National Science Council Subsidized Research Project — [x] Final Report / [ ] Interim Progress Report

Research on Communication and I/O Localization of Data Parallel Programs on Computational Grids and Development of Application Tools (3/3)

Project type: [x] Individual project / [ ] Integrated project
Project number: NSC 95-2221-E-216-006
Duration: August 1, 2007 to July 31, 2008
Principal investigator: Ching-Hsien Hsu (許慶賢), Associate Professor, Department of Computer Science and Information Engineering, Chung Hua University
Co-investigators: (none)
Project members: 陳泰龍 (Tai-Lung Chen, doctoral student, Institute of Engineering Science, Chung Hua University); 張智鈞, 郁家豪, 蔡秉儒 (graduate students, Department of Computer Science and Information Engineering, Chung Hua University)

Report type (per the approved funding list): [ ] Brief report / [x] Complete report

Attachments included with this report:

[ ] Report on overseas travel or study
[ ] Report on travel or study in mainland China
[x] International conference trip report and the published paper
[ ] Foreign research report for an international cooperative project

Handling: except for industry-university cooperation projects, industrial technology promotion and personnel training projects, controlled projects, and the cases below, this report may be made publicly searchable immediately

[x] Involves patents or other intellectual property rights; publicly searchable after [ ] one year / [x] two years


National Science Council Research Project Final Report

Research on Communication and I/O Localization of Data Parallel Programs on Computational Grids and Development of Application Tools (3/3)

Design and Implementation of Communication and I/O Localization Tools for Parallel Applications on Computational Grids (3/3)

Project number: NSC 95-2221-E-216-006
Duration: August 1, 2007 to July 31, 2008
Principal investigator: Ching-Hsien Hsu (許慶賢), Associate Professor, Department of Computer Science and Information Engineering, Chung Hua University
Project members (graduate students, Department of Computer Science and Information Engineering, Chung Hua University): 陳泰龍 (second-year doctoral student), 張智鈞, 郁家豪, 蔡秉儒 (second-year master's students)

1. Abstract (translated from Chinese)

This report describes adaptive evaluation modules and communication-localization techniques developed for heterogeneous computational grid systems and network topologies. Over the three years of this project we completed an automatic data partitioning tool, a performance prediction tool for data parallel programs, a data locality selector, and a data-locality learning system for specific parallel applications. The communication localization techniques and analysis tools developed in this research help improve the execution performance of data parallel programs on computational grids. The theory, tools, and practical experience gained from this project can also serve as material for academic research and teaching in related fields.

Keywords: Localized Communication, Data Parallel Programs, Computational Grids, Parallel I/O, Data Distribution, Communication Scheduling, Performance Prediction, Parallelizing Compilers, Parallel Applications, SPMD.

Abstract

This report presents adaptive performance models for optimizing communications of real-world parallel applications on heterogeneous grid systems and topologies. This project developed tools for automatic data partitioning, performance prediction of data parallel programs, and data locality selection, together with a data-locality learning system for scientific applications. The integrated locality-preserving techniques and analysis tools developed in this project facilitate the development of efficient data parallel applications on computational grids. The theory, tools, and experience produced by this project can be applied in both academic teaching and research, which is the main objective of this project.

Keywords: Localized Communication, Data Parallel Program, Computational Grid, Parallel I/O, Data Distribution, Communication Scheduling, Performance Prediction, Parallelizing Compiler, Parallel Applications, SPMD.

2. Background and Objectives

The concept of integrating computing resources has made grid computing a widely accepted virtual high-performance computing platform. A grid computing system differs from a traditional parallel computer: it connects machines spread across different network domains into a highly scalable computing platform, of which the cluster grid is a typical example. When a data parallel program executes, communication among computing nodes inevitably occurs. That communication may take place among machines within the same cluster (interior communication) or among machines in different cluster systems (external communication). To reduce the cost of communication, localized communication places data on appropriate machines so that most of the communication required among nodes during execution stays within the same cluster or the same network domain. The localization problem is not limited to the communication level; it also covers data localization, grid I/O localization, and processor group localization. Many similar studies exist for traditional parallel computers and distributed-memory environments, including communication localization, communication scheduling, data partitioning, data redistribution, and logical processor mapping techniques. In this project we integrate these prior techniques, develop methods suited to grid environments, and build the accompanying analysis and tuning tools, so as to establish an effective and simple set of methods and interfaces that enable wider use of data parallel applications on future grid computing systems.

3. Research Methods and Results

To address the interior and external communication problem in heterogeneous (non-identical) cluster grids, we studied communication optimization on real applications. Figure 1 illustrates interior and external communication in a grid environment with three clusters and twelve processors in total: P0~P2 belong to the first cluster, P3~P5 to the second, and P6~P11 to the third. Six data sets (a1~3, f1~3, g1~3, h1~3, k1~3, l1~3) are sent to interior processors, while six data sets (b1~3, c1~3, d1~3, e1~3, i1~3, j1~3) are sent to external processors. The external communication volume equals the interior volume, and before communication optimization this large amount of external communication incurs a much higher communication cost.

Source  Data       I  E
P0      a1 a2 a3   3  0
P1      b1 b2 b3   0  3
P2      c1 c2 c3   0  3
P3      d1 d2 d3   0  3
P4      e1 e2 e3   0  3
P5      f1 f2 f3   3  0
P6      g1 g2 g3   3  0
P7      h1 h2 h3   3  0
P8      i1 i2 i3   0  3
P9      j1 j2 j3   0  3
P10     k1 k2 k3   3  0
P11     l1 l2 l3   3  0
Totals (Cluster-1, Cluster-2, Cluster-3): I = 18, E = 18

Figure 1. Data communication in a heterogeneous grid environment.

Figure 2 illustrates processor reordering: by reordering the processors' logical IDs, external communication is converted into interior communication, effectively reducing communication cost. After the source data is partitioned, the master node distributes it to each source node, and the reordering agent supplies each source node with a new destination node. Because more destination nodes now belong to interior processors, the interior communication volume increases, lowering the communication cost and making execution more efficient.

Figure 2. Flow of the algorithm that reorders processors' logical IDs.

After the reordering agent renumbers the processors' logical IDs, the six data sets b1~3, c1~3, d1~3, e1~3, i1~3, j1~3 that previously required external communication are converted to interior communication, as shown in Figure 3. All communication then becomes interior communication, effectively reducing the communication cost.


Source  Data       I  E
P0      a1 a2 a3   3  0
P3      b1 b2 b3   3  0
P6      c1 c2 c3   3  0
P9      d1 d2 d3   3  0
P1      e1 e2 e3   3  0
P4      f1 f2 f3   3  0
P7      g1 g2 g3   3  0
P10     h1 h2 h3   3  0
P2      i1 i2 i3   3  0
P5      j1 j2 j3   3  0
P8      k1 k2 k3   3  0
P11     l1 l2 l3   3  0
Totals (Cluster-1, Cluster-2, Cluster-3): I = 36, E = 0

Figure 3. Data communication after reordering the processor IDs.
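The interior/external bookkeeping behind Figures 1 and 3 can be sketched in a few lines. The send pattern and the ID permutation below are hypothetical stand-ins (Figure 1's exact destination columns are not recoverable from the text), but they reproduce the same 18/18 to 36/0 shift:

```python
# Cluster membership as in Figure 1: P0-P2, P3-P5, P6-P11
cluster = {p: 0 for p in range(3)}
cluster.update({p: 1 for p in range(3, 6)})
cluster.update({p: 2 for p in range(6, 12)})

# Hypothetical pattern: each source processor sends one 3-element data set
# (standing in for a1~3 .. l1~3) to a single destination processor.
sends = {0: 2, 1: 4, 2: 7, 3: 1, 4: 8, 5: 3,
         6: 11, 7: 9, 8: 0, 9: 5, 10: 6, 11: 10}

def counts(pattern):
    """Return (interior, external) message counts, 3 messages per data set."""
    interior = sum(3 for s, d in pattern.items() if cluster[s] == cluster[d])
    external = sum(3 for s, d in pattern.items() if cluster[s] != cluster[d])
    return interior, external

# A hypothetical permutation of logical IDs, playing the reordering agent's
# role: every destination is relabeled into its source's cluster.
new_id = {2: 1, 4: 2, 7: 0, 1: 4, 8: 5, 3: 3,
          0: 6, 5: 7, 11: 8, 9: 10, 6: 11, 10: 9}
reordered = {s: new_id[d] for s, d in sends.items()}

print(counts(sends), counts(reordered))  # (18, 18) (36, 0)
```

With the remapping applied, every transfer stays inside its source's cluster, mirroring the Figure 1 to Figure 3 transition.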

When each cluster provides a different number of processors, this approach can still be applied to raise data-transfer performance.

To support multi-dimensional processor arrangements, we also proposed a multi-dimensional array data mapping module, which aims to adjust communication bottlenecks dynamically and improve program execution efficiency. Figure 4 shows the relation between processors and data communication: (a) a 2-D processor arrangement, representative of the multi-dimensional case, where each P denotes one processor and its area corresponds to the data range assigned to it in the two-dimensional array; (b) the data (m00~m22) that must be generated during redistribution, where dashed lines indicate the new distribution of the two-dimensional array.

Figure 4. Relation between processors and data communication. (a) multi-dimensional processor layout; (b) data communication.

When processors must move data frequently to satisfy a data-allocation requirement, the communication cost can become high enough to hurt execution performance. For this we proposed a Local Message Reduction Optimization, which improves data [...] scheduling all of the communications; it not only effectively reduces communication cost but also avoids conflicts arising during data transfers.

4. Conclusions and Discussion

The main results of this project are summarized below:

- Completed the development of the automatic data partitioning module.
- Completed the implementation of the performance prediction system for parallel data partitioning.
- Proposed rescheduling and data redistribution techniques.
- Implemented program-level performance prediction, using the information provided by the performance monitoring tool for program classification.
- Published three international journal papers and five international conference papers.

Journal Papers:

- Ching-Hsien Hsu, Min-Hao Chen, Chao-Tung Yang and Kuan-Ching Li, "Optimizing Communications of Dynamic Data Redistribution on Symmetrical Matrices in Parallelizing Compilers," IEEE Transactions on Parallel and Distributed Systems, Vol. 17, No. 11, pp. 1226-1241, Nov. 2006. (SCI, EI)
- Ching-Hsien Hsu, Tai-Lung Chen and Kuan-Ching Li, "Performance Effective Pre-scheduling Strategy for Heterogeneous Communication Grid Systems," Future Generation Computer Systems, Vol. 23, Issue 4, pp. 569-579, May 2007. Elsevier (SCI, EI)
- Ching-Hsien Hsu, Shih-Chang Chen and Chao-Yang Lan, "Scheduling Contention-Free Irregular Redistribution in Parallelizing Compilers," The Journal of Supercomputing, Kluwer Academic Publishers, Vol. 40, No. 3, pp. 229-247, June 2007. (SCI, EI)
- Ching-Hsien Hsu, Tai-Lung Chen and Jong-Hyuk Park, "On Improving Resource Utilization and System Throughput of Master Slave Job Scheduling in Heterogeneous Systems," Journal of Supercomputing, Springer, Vol. 45, No. 1, pp. 129-150, July 2008. (SCI, EI)

Conference Papers:

- ... "... Grids," IEEE Proceedings of the Third ChinaGrid Annual Conference (ChinaGrid 2008), Dunhuang, Gansu, China.
- Ching-Hsien Hsu, Tai-Lung Chen, Bing-Ru Tsai and Kuan-Ching Li, "Scheduling for Atomic Broadcast Operation in Heterogeneous Networks with One Port Model," Proceedings of the 3rd International Conference on Grid and Pervasive Computing (GPC-08), LNCS 5036, pp. 166-177, May 2008.
- Ching-Hsien Hsu, Yi-Min Chen and Chao-Tung Yang, "A Layered Optimization Approach for Redundant Reader Elimination in Wireless RFID Networks," Proceedings of the 2007 IEEE Asia-Pacific Services Computing Conference (IEEE APSCC 2007), pp. 138-145, Tsukuba, Japan, December 11-14, 2007.
- Ching-Hsien Hsu, Chih-Wei Hsieh and Chao-Tung Yang, "A Generalized Critical Task Anticipation Technique for DAG Scheduling," Algorithms and Architectures for Parallel Processing, Lecture Notes in Computer Science, Vol. 4494, pp. 493-505, Springer-Verlag, June 2007. (ICA3PP'07)
- Ching-Hsien Hsu, Ming-Yuan Own and Kuan-Ching Li, "Critical-Task Anticipation Scheduling Algorithm for Heterogeneous and Grid Computing," Computer Systems Architecture, Lecture Notes in Computer Science, Vol. 4186, pp. 95-108, Springer-Verlag, Sept. 2006. (ACSAC'06) (SCI Expanded, NSC92-2213-E-216-029)

5. Self-Evaluation

The research results of this project have met its expected goals. In the third year, three international journal papers and five conference papers were published on this research topic. We thank the National Science Council for the opportunity that made these results possible. In the future we will continue working to secure funding and build a more complete research environment. The author also expresses appreciation and gratitude for the dedication of the students who took part in the project.

6. References

[1] Taiwan UniGrid, http://unigrid.nchc.org.tw
[2] B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, I. Foster, C. Kesselman, S. ..., "... Environments," Parallel Computing Journal, Vol. 28(5), May 2002, pp. 749-771.
[3] D. Angulo, I. Foster, C. Liu and L. Yang, "Design and Evaluation of a Resource Selection Framework for Grid Applications," Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland, July 2002.
[4] Shih-Chang Chen and Ching-Hsien Hsu, "ISO: Comprehensive Techniques Towards Efficient GEN_BLOCK Redistribution with Multidimensional Arrays," Parallel Computing Technologies (PaCT'07), Lecture Notes in Computer Science, Vol. 4671, pp. 507-515, Springer-Verlag, Sep. 2007.
[5] M. Colajanni and P. S. Yu, "A Performance Study of Robust Load Sharing Strategies for Distributed Heterogeneous Web Servers," IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 398-414, 2002.
[6] K. Czajkowski, I. Foster and C. Kesselman, "Resource Co-Allocation in Computational Grids," Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing (HPDC-8), pp. 219-228, 1999.
[7] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith and S. Tuecke, "A Resource Management Architecture for Metacomputing Systems," Proc. IPPS/SPDP '98 Workshop on Job Scheduling Strategies for Parallel Processing, pp. 62-82, 1998.
[8] I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt and A. Roy, "A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation," International Workshop on Quality of Service, 1999.
[9] Ching-Hsien Hsu, Min-Hao Chen, Chao-Tung Yang and Kuan-Ching Li, "Optimizing Communications of Dynamic Data Redistribution on Symmetrical Matrices in Parallelizing Compilers," IEEE Transactions on Parallel and Distributed Systems, Vol. 17, No. 11, pp. 1226-1241, Nov. 2006. (SCI, EI, NSC93-2213-E-216-029, NCHC-KING-010200)
[10] Ching-Hsien Hsu, Shih-Chang Chen and Chao-Yang Lan, "Scheduling Contention-Free Irregular Redistribution in Parallelizing Compilers," Accepted, The Journal of Supercomputing, Kluwer Academic Publishers, 2007. (SCI, EI, NSC93-2213-E-216-028, ...)
[11] ..., "... Pre-scheduling Strategy for Heterogeneous Communication Grid Systems," Accepted, Future Generation Computer Systems, Elsevier, 2007. (SCI, EI, NSC93-2213-E-216-029)
[12] Ching-Hsien Hsu, Chih-Wei Hsieh and Chao-Tung Yang, "A Generalized Critical Task Anticipation Technique for DAG Scheduling," Algorithms and Architectures for Parallel Processing, Lecture Notes in Computer Science, Springer-Verlag, June 2007. (ICA3PP'07)
[13] Ching-Hsien Hsu, Chao-Yang Lan and Shih-Chang Chen, "Optimizing Scheduling Stability for Runtime Data Alignment," Embedded System Optimization, Lecture Notes in Computer Science, Vol. 4097, pp. 825-835, Springer-Verlag, Aug. 2006. (ESO'06) (SCI Expanded, NSC92-2213-E-216-029)
[14] Ching-Hsien Hsu, Guan-Hao Lin, Kuan-Ching Li and Chao-Tung Yang, "Localization Techniques for Cluster-Based Data Grid," Algorithms and Architectures for Parallel Processing, Lecture Notes in Computer Science, Vol. 3719, pp. 83-92, Springer-Verlag, Oct. 2005. (ICA3PP'05) (SCI Expanded, NSC 93-2213-E-216-029)
[15] Ching-Hsien Hsu, Ming-Yuan Own and Kuan-Ching Li, "Critical-Task Anticipation Scheduling Algorithm for Heterogeneous and Grid Computing," Computer Systems Architecture, Lecture Notes in Computer Science, Vol. 4186, pp. 97-110, Springer-Verlag, Sept. 2006. (ACSAC'06) (SCI Expanded, NSC92-2213-E-216-029)
[16] D. H. Kim and K. W. Kang, "Design and Implementation of Integrated Information System for Monitoring Resources in Grid Computing," Computer Supported Cooperative Work in Design, 10th Conf., pp. 1-6, 2006.
[17] K. C. Li, Ching-Hsien Hsu, H. H. Wang and C. T. Yang, "Towards the Development of Visuel: a Novel Application and System Performance Monitoring Toolkit for Cluster and Grid Environments," Accepted, International Journal of High Performance Computing and Networking (IJHPCN), Inderscience Publishers, 2008.
[18] Emmanuel Jeannot and Frédéric Wagner, "Two Fast and Efficient Message Scheduling Algorithms for Data Redistribution through a Backbone," Proceedings of the 18th International Parallel and Distributed Processing Symposium, April 2004.
[19] C. Lee, R. Wolski, I. Foster, C. Kesselman and J. Stepanek, "A Network Performance Tool for Grid Computations," Supercomputing '99, 1999.
[20] K. C. Li, H. H. Wang, C. T. Yang and Ching-Hsien Hsu, "Towards the Development of Visuel: a Novel Application and System Performance Monitoring Toolkit for Cluster and Grid Environments," Accepted, International Journal of High Performance Computing and Networking (IJHPCN), Inderscience Publishers, 2007.
[21] J. M. Schopf and S. Vazhkudai, "Predicting Sporadic Grid Data Transfers," 11th IEEE International Symposium on High-Performance Distributed Computing (HPDC-11), IEEE Press, Edinburgh, Scotland, July 2002.
[22] A. Smyk, M. Tudruj and L. Masko, "OpenMP Extension for Multithreaded Computing with Dynamic SMP Processor Clusters with Communication on the Fly," PARELEC, pp. 83-88, 2006.
[23] H. Stockinger, A. Samar, B. Allcock, I. Foster, K. Holtman and B. Tierney, "File and Object Replication in Data Grids," Journal of Cluster Computing, 5(3), pp. 305-314, 2002.
[24] M. Tudruj and L. Masko, "Fast Matrix Multiplication in Dynamic SMP Clusters with Communication on the Fly in Systems on Chip Technology," PARELEC, pp. 77-82, 2006.
[25] S. Vazhkudai and J. Schopf, "Using Disk Throughput Data in Predictions of End-to-End Grid Transfers," Proceedings of the 3rd International Workshop on Grid Computing (GRID 2002), Baltimore, MD, November 2002.
[26] Chun-Ching Wang, Shih-Chang Chen, Ching-Hsien Hsu and Chao-Tung Yang, "Optimizing Communications of Data Parallel Programs in Scalable Cluster Systems," Proceedings of the 3rd International Conference on Grid and Pervasive Computing (GPC-08), LNCS 5036, pp. 29-37, May 2008.
[27] C. T. Yang, I-Hsien Yang, Shih-Yu Wang, Ching-Hsien Hsu and Kuan-Ching Li, "A Recursive-Adjustment Co-Allocation Scheme with Cyber-Transformer in Data Grids," Accepted, Future Generation Computer Systems, Elsevier, 2008.
[28] Kun-Ming Yu, Ching-Hsien Hsu and Chwani-Lii Sune, "A Genetic-Fuzzy Logic Based Load Balancing Algorithm in Heterogeneous Distributed Systems," Proceedings of the IASTED International Conference on Neural Network and Computational Intelligence (NCI 2004), Feb. 2004, Grindelwald, Switzerland.


Overseas Travel Report Summary (for personnel of agencies under the Executive Yuan)

Date written: September 11, 2006
Name: Ching-Hsien Hsu (許慶賢)
Institution: Department of Computer Science and Information Engineering, Chung Hua University
Contact: 03-5186410, chh@chu.edu.tw
Date of birth: February 23, 1973
Title: Associate Professor
Conference attended: Eleventh Asia-Pacific Computer Systems Architecture Conference (ACSAC-06), Shanghai, China
Destination: Shanghai, China
Travel period: September 6 to September 8, 2006

The report covers the following items:

1. Conference participation

This international academic conference in Shanghai lasted three days. The first morning opened with an insightful keynote by Dr. Guang R. Gao on "The Era of Multi-Core Chips: A Fresh Look on Software Challenges." The same day, many important research results were presented in two parallel tracks; I attended the Languages and Compilers session, and that afternoon I presented the paper accepted by the conference.

On the first evening I attended the reception and exchanged views with several foreign scholars and Chinese professors. On the second day I attended the Multi-core, Architecture, and Networks session in the morning and chaired the Power Management session in the afternoon, learning about many emerging research topics and the main research directions of scholars abroad. That evening I attended the conference banquet, where I met several foreign scholars, exchanged ideas, and took photos together. On the final day I attended the sessions closest to my own paper, Scheduling, Fault Tolerance and Mapping, along with the distributed computing presentations, and took the last opportunity to meet professors from abroad, hoping to deepen their impression of research in Taiwan. Over the three days I heard many excellent papers covering popular research topics such as ILP, TLP, processor architecture, memory systems, operating systems, and high-performance I/O architecture.

2. Impressions

Many well-known scholars took part in this conference, giving every attendee access to the latest international techniques and information; it was a very successful academic meeting. I gained a great deal from attending. [...]

3. Visits and tours (omitted; none)

4. Suggestions

Having seen so many research results and heard several keynote speeches, I found the venue and the invited speakers to be excellent; the conference was very well organized and worth learning from.

5. Materials brought back

1. Conference Program
2. Proceedings


An Efficient Processor Selection Scheme for Master Slave Paradigm on Heterogeneous Networks

Tai-Lung Chen Ching-Hsien Hsu

Department of Computer Science and Information Engineering Chung Hua University, Hsinchu, Taiwan

chh@chu.edu.tw

Abstract. It is well known that grid technology has the ability to achieve coordinated resource sharing and task scheduling. In this paper, we present a performance-effective pre-scheduling strategy for dispatching tasks onto heterogeneous processors. The main contribution of this study is the consideration of heterogeneous communication overheads in grid systems. One significant improvement of our approach is that average turnaround time can be minimized by first selecting the processor with the smallest communication ratio. The other advantage of the proposed method is that system throughput can be increased by dispersing processor idle time. Our technique can be applied to heterogeneous cluster systems as well as computational grid environments, in which the communication costs vary across clusters. Experimental results show that our techniques outperform previous algorithms in terms of lower average turnaround time, higher average throughput, less processor idle time and higher processor utilization.

1 Introduction

A computational grid system integrates geographically distributed computing resources to establish a virtual and highly expandable parallel computing infrastructure. In recent years, several research investigations have been done on the scheduling problem for heterogeneous grid systems. A centralized computational grid system can be viewed as the collection of one resource broker (the master processor) and several heterogeneous clusters (slave processors). Therefore, to investigate the task scheduling problem, the master slave paradigm is a good vehicle for developing tasking technologies in centralized grid systems.

Master slave tasking is a simple and widely used technique [1, 2]. In a master slave tasking paradigm, the master node connects to n slave nodes. A set of independent tasks is dispatched by the master processor and processed on the n heterogeneous slave processors. Slave processors execute their tasks after they receive them, which restricts computation and communication from overlapping. Moreover, communication between the master and slave nodes is handled through a shared medium (e.g., a bus) that can be accessed only in exclusive mode; that is, communications between the master and different slave processors cannot overlap.

In general, the optimization of the master slave tasking problem is twofold. One goal is to minimize the total execution time for a given fixed number of tasks, i.e., to minimize average turnaround time. The other is to maximize the total number of tasks finished in a given time period, i.e., to maximize throughput.
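The exclusive-communication model and the two objectives can be made concrete with a tiny simulator. This is a sketch under hypothetical numbers; the `simulate` helper, the two-slave system, and the one-task-per-dispatch policy are illustrative assumptions, not the paper's algorithm:

```python
def simulate(dispatch_order, T, T_comm):
    """Dispatch one task per entry of dispatch_order over an exclusive bus.

    T[i] is slave i's compute time per task; T_comm[i] its receive time.
    Transfers serialize on the shared medium; a slave starts computing a
    task once it has both received it and finished its previous task.
    Returns the completion time of each dispatched task.
    """
    bus_free = 0
    slave_free = [0] * len(T)
    finish = []
    for i in dispatch_order:
        recv_end = bus_free + T_comm[i]   # exclusive medium: no overlap
        bus_free = recv_end
        start = max(recv_end, slave_free[i])
        slave_free[i] = start + T[i]
        finish.append(slave_free[i])
    return finish

# Hypothetical two-slave system, round-robin dispatch of four tasks
done = simulate([0, 1, 0, 1], T=[2, 4], T_comm=[1, 1])
avg_turnaround = sum(done) / len(done)   # first objective: minimize this
throughput = len(done) / max(done)       # second objective: maximize this
print(done, avg_turnaround, throughput)  # [3, 6, 5, 10] 6.0 0.4
```

Different dispatch orders change both metrics, which is exactly the degree of freedom the scheduling strategies below exploit.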

In this paper, an efficient strategy for scheduling independent tasks to heterogeneous processors in a master slave environment is presented. The main idea of the proposed technique is to first allocate tasks to processors that present a lower communication ratio, which will be defined in section 3.2. Improvements of our approach towards both average [...] where we also present a motivating example to demonstrate the characteristics of the master-slave pre-scheduling model. Section 4 assesses the new scheduling algorithm, the Smallest Communication Ratio (SCR), while the illustration of SCR on heterogeneous communication is examined in section 5. The performance comparisons and simulation results are discussed in section 6, and finally section 7 gives some conclusions of this paper.

2 Related Work

The task scheduling research on heterogeneous processors can be classified into DAG models, the master-slave paradigm and computational grids. The main purpose of task scheduling is to achieve high performance computing and high throughput computing. The former aims at increasing execution efficiency and minimizing the execution time of tasks, whereas the latter aims at decreasing processor idle time and scheduling a set of independent tasks to increase the processing capacity of the system over a long period of time.

Thanalapati et al. [13] brought up the idea of an adaptive scheduling scheme on a homogeneous processor platform, which applies space-sharing and time-sharing to schedule tasks. With the emergence of Grid and ubiquitous computing, new algorithms are in demand to address new concerns arising in grid environments, such as security, quality of service and high system throughput. Berman et al. [6] and Cooper et al. [11] addressed the problem of scheduling incoming applications to available computation resources; a dynamic rescheduling mechanism was introduced for adaptive computing on the Grid. In [8], some simple heuristics for dynamic matching and scheduling of a class of independent tasks onto a heterogeneous computing system were presented. Moreover, an extended suffrage heuristic was presented in [12] for scheduling the parameter-sweep applications that have been implemented in AppLeS. They also presented a method to predict the computation time for a task/host pair by using previous host performance.

Chronopoulos et al. [9], Charcranoon et al. [10] and Beaumont et al. [4, 5] introduced research on the master-slave paradigm with heterogeneous processors. Based on this architecture, Beaumont et al. [1, 2] presented a method in the master-slave paradigm to forecast the number of tasks each processor needs to receive in a given period of time. Beaumont et al. [3] presented a pipelined broadcast method on master-slave platforms, focusing on message passing and disregarding computation time. Intuitively, in their implementation, fast processors receive more tasks under the proportional distribution policy. Tasks are also allocated first to the faster slave processors, with the expectation that a higher system throughput can be obtained.

3 Preliminaries

In this section, we first introduce basic concepts and models of this investigation, where we also define notations and terminologies that will be used in subsequent subsections.

3.1 Research Architecture

We have revised several characteristics that were introduced by Beaumont et al. [1, 2]. Based on the master slave paradigm introduced in section 1, this paper makes the following assumptions:

- Heterogeneous processors: all processors have different computation speeds.
- Identical tasks: all tasks are of equal size.
- Non-preemption: tasks are considered to be atomic.
- Exclusive communication: communications from the master node to different slave processors cannot overlap.
- Heterogeneous communication: communication costs between the master and slave processors have different overheads.


First, we list the definitions, notations and terminologies used in this paper.

Definition 1: In a master slave system, the master processor is denoted by M and the n slave processors are represented by P1, P2, ..., Pn, where n is the number of slave processors.

Definition 2: Under the assumption of identical tasks and heterogeneous processors, each slave processor takes a different time to compute one task. We use Ti to represent the execution time of slave processor Pi to complete one task. In this paper, we assume the computation speeds of the n slave processors are sorted so that T1 ≤ T2 ≤ ... ≤ Tn.

Definition 3: Given a master slave system, the time for slave processor Pi to receive one task from the master processor is denoted by Ti_comm.

Definition 4: A Basic Scheduling Cycle (BSC) is defined as BSC = lcm(T1+T1_comm, T2+T2_comm, ..., Tm+Tm_comm), where m is the number of processors that join the computation.

Definition 5: Given a master slave system, the number of tasks processor Pi needs to receive in a basic scheduling cycle is defined as task(Pi) = BSC / (Ti + Ti_comm).

Definition 6: Given a master slave system, the communication cost of processor Pi in a BSC is defined as comm(Pi) = Ti_comm × task(Pi).

Definition 7: Given a master slave system, the computation cost of processor Pi in a BSC is defined as comp(Pi) = Ti × task(Pi).

Definition 8: Given a master slave system, the Communication Ratio of processor Pi is defined as CRi = Ti_comm / (Ti + Ti_comm).

Definition 9: The computational capacity (δ) of a master slave system is defined as the sum of the communication ratios of all processors that join the computation, i.e., δ = Σ_{i=1..m} CRi, where m is the number of processors involved in the computation.

Definition 10: Given a master slave system with n heterogeneous slave processors, Pmax is the processor Pk with the largest k, 1 ≤ k ≤ n, such that Σ_{i=1..k} Ti_comm / (Ti + Ti_comm) ≤ 1; i.e., Σ_{i=1..k+1} Ti_comm / (Ti + Ti_comm) > 1. We use Pmax+1 to represent processor Pk+1.

3.3 Master Slave Task Scheduling

The problem of task scheduling in the master slave paradigm will be addressed in two cases, depending on the value of the system computational capacity (δ).

As mentioned in section 2, letting faster processors receive more tasks is an intuitive approach in which tasks are allocated to the faster processors first; this method is called the Most Jobs First (MJF) scheduling algorithm [1, 2]. Fig. 1 shows the pre-scheduling of the MJF algorithm. As defined in definition 8, the communication ratios of P1 to P4 are 1/3, 1/4, 1/4 and 1/6, respectively. Because BSC = 12, we have task(P1) = 4, task(P2) = 3, task(P3) = 3 and task(P4) = 2. When the number of tasks is large, such scheduling achieves higher system utilization and less processor idle time than the greedy method.


Fig. 1. Most Jobs First (MJF) task scheduling when δ ≤ 1.

Lemma 1: Given a master slave system with δ > 1, in MJF scheduling, the number of tasks assigned to Pmax+1 can be calculated by the following equation:

task(Pmax+1) = (BSC − Σ_{i=1..max} comm(Pi)) / Tmax+1_comm   (1)

Lemma 2: Given a master slave system with δ > 1, in MJF scheduling, the period during which processor Pmax+1 stays idle, denoted Tidle_MJF, can be calculated by the following equation:

Tidle_MJF = BSC − comm(Pmax+1) − comp(Pmax+1)   (2)

Another example of master slave task scheduling with identical communication (i.e., Ti_comm = 1) and δ > 1 is given in Fig. 2. Because δ > 1, according to equation (1), we have task(Pmax+1 = P4) = 10. We note that P4 completes its tasks and becomes available at time 100. However, the master processor dispatches tasks to P3 during time 100~110 and starts to send tasks to P4 at time 110. This kind of idle situation also recurs at times 100~110, 160~170, 220~230, and so on.

Fig. 2. Most Jobs First (MJF) tasking when δ > 1.

Lemma 3: In the MJF scheduling algorithm with identical communication Ti_comm, when δ > 1, the completion time of tasks in the j-th BSC can be calculated by the following equation:

T(BSCj) = Σ_{i=1..max} comm(Pi) + j × (comm(Pmax+1) + comp(Pmax+1) + Tidle_MJF) − Tidle_MJF   (3)


4 Smallest Communication Ratio (SCR) Scheduling with Identical Communication

The MJF scheduling algorithm distributes tasks to slave processors according to processor speed; that is, faster processors receive tasks first. In this section, we demonstrate an efficient task scheduling algorithm, Smallest Communication Ratio (SCR), which focuses on master slave task scheduling with identical communication.

Lemma 4: In the SCR scheduling algorithm, if δ ≤ 1 and the Ti_comm are identical, the task completion time of the j-th BSC, denoted Tfinish_SCR(BSCj), can be calculated by the following equation:

Tfinish_SCR(BSCj) = BSC + j × (comm(P1) + comp(P1)) − comm(P1)   (4)

Lemma 5: Given a master slave system with δ > 1, in SCR scheduling, the number of tasks assigned to Pmax+1 can be calculated as follows:

task(Pmax+1) = BSC / (Tmax+1 + Tmax+1_comm)   (5)

Lemma 6: In the SCR scheduling algorithm, when δ > 1, the idle time of a slave processor, denoted Tidle_SCR, can be calculated by the following equation:

Tidle_SCR = Σ_{i=1..max+1} comm(Pi) − BSC   (6)

The other case, shown in Fig. 3, demonstrates the SCR scheduling method with dispersed idle time when δ > 1. We use the same example as in Fig. 2 for the following illustration. Because δ > 1, according to definition 10 and Lemma 5, we have task(Pmax+1 = P4) = 12. Compared to the example in Fig. 2, P4 stays idle for 10 time units under the MJF algorithm, while under the SCR algorithm the idle time is reduced and dispersed: every processor is idle for 2 units of time, 8 units in total. Moreover, we observe that the MJF algorithm finishes 60 tasks in 100 units of time, a throughput of 0.6, while in SCR 62 tasks are completed in 102 time units. The throughput of SCR is 62/102 (≈0.61) > 0.6. Consequently, the SCR algorithm delivers higher system throughput.
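Lemmas 1, 2, 5 and 6 can be exercised side by side. The sketch below uses hypothetical speeds (T = 1..5 with unit communication; these are not the values behind Fig. 2, which the text does not give) and shows SCR assigning more tasks to Pmax+1 while leaving less idle time than MJF:

```python
from math import lcm
from fractions import Fraction

# Hypothetical speeds: compute time per task for P1..P5, unit communication
T = [1, 2, 3, 4, 5]
c = 1

# Definition 10: Pmax is the last Pk with CR_1 + ... + CR_k <= 1
CR = [Fraction(c, t + c) for t in T]
k = 0
while k < len(T) and sum(CR[: k + 1]) <= 1:
    k += 1
# P1..P_{k} = P1..Pmax satisfy the bound; index k is Pmax+1
m = k + 1                                   # processors joining the computation
BSC = lcm(*(t + c for t in T[:m]))          # Definition 4

task = [BSC // (t + c) for t in T[:k]]      # Definition 5 for P1..Pmax
comm = [c * n for n in task]                # Definition 6

# Lemma 1 (MJF): tasks assigned to Pmax+1
task_mjf = (BSC - sum(comm)) // c
# Lemma 2 (MJF): idle period of Pmax+1
idle_mjf = BSC - c * task_mjf - T[k] * task_mjf

# Lemma 5 (SCR): tasks assigned to Pmax+1
task_scr = BSC // (T[k] + c)
# Lemma 6 (SCR): total idle time
idle_scr = (sum(comm) + c * task_scr) - BSC

print(task_mjf, idle_mjf, task_scr, idle_scr)  # 2 4 3 1
```

Under these assumed numbers (BSC = 12, Pmax = P2), SCR schedules 3 tasks on Pmax+1 with only 1 unit of idle time, versus MJF's 2 tasks and 4 idle units, mirroring the Fig. 2 vs. Fig. 3 comparison above.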

Lemma 7: In the SCR scheduling algorithm, if the Ti_comm are identical for all slave processors and δ > 1, the task completion time of the j-th BSC, denoted Tfinish_SCR(BSCj), can be calculated by the following equation:

Tfinish_SCR(BSCj) = Σ_{i=1..max+1} comm(Pi) + comp(P1) + (j−1) × (comm(P1) + comp(P1) + Tidle_SCR)   (7)


Fig. 3. Smallest Communication Ratio (SCR) Tasking when δ>1.

5 Generalized Smallest Communication Ratio (SCR)

As a computational grid integrates geographically distributed computing resources, the communication overheads from the resource broker / master computer to different computing sites differ. Therefore, an efficient scheduling algorithm should take these heterogeneous communication overheads into account. In this section, we present SCR task scheduling techniques that work on the master slave computing paradigm with heterogeneous communication.

Lemma 8: Given a master slave system with heterogeneous communication and δ > 1, in MJF scheduling, we have

task(Pmax+1) = ⌊(BSC − Σ_{i=1..max} comm(Pi)) / Tmax+1_comm⌋   (8)

Lemma 9: Given SCR scheduling with heterogeneous communication and δ > 1, where Tidle_SCR is the idle time of one slave processor, we have the following equation:

Tidle_SCR = Σ_{i=1..max+1} comm(Pi) − BSC   (9)

Lemma 10: Given SCR scheduling with heterogeneous communication and δ > 1, where Tstart_SCR(BSCj) is the start time of task dispatch in the j-th BSC, we have the following equation:

Tstart_SCR(BSCj) = (j−1) × (BSC + Tidle_SCR)   (10)

Lemma 11: Given SCR scheduling with heterogeneous communication and δ > 1, the task completion time of the j-th BSC, denoted Tfinish_SCR(BSCj), satisfies

Tfinish_SCR(BSCj) = Σ_{i=1..max+1} comm(Pi) + comp(Pk) + (j−1) × (comm(Pk) + comp(Pk) + Tidle_SCR)   (11)


where Pk is the slave processor with the maximum communication cost.

Another example of master slave tasking with heterogeneous communication and δ > 1 is shown in Fig. 4(a). The communication overheads vary from 1 to 5 and the computation speeds vary from 3 to 13. In this example, we have BSC = 48.

In the SCR implementation, according to corollary 3, the task distribution is task(P1) = 6, task(P2) = 6, task(P3) = 4 and task(Pmax+1) = task(P4) = 3. The communication costs of the slave processors are comm(P1) = 30, comm(P2) = 12, comm(P3) = 4 and comm(P4) = 9, respectively. Therefore, the SCR method distributes tasks in the order P3, P4, P2, P1. There are 19 tasks in the first BSC, dispatched to P1~P4 during the time period 1~55. Processor P3 is the first processor to receive tasks; it finishes at time t = 48 and becomes available. Meanwhile, processor P1 receives tasks during t = 48~55. The second BSC starts to dispatch tasks at t = 55; that is, P3 starts to receive tasks at t = 55 in the second scheduling cycle. Therefore, P3 is idle for 7 units of time. Lemmas 4 and 5 state the above phenomenon. The completion time of tasks in the first BSC depends on the finish time of processor P1: Tfinish_SCR(BSC1) = 73.

Fig. 4. Task scheduling in a heterogeneous communication environment with δ > 1. (a) Smallest Communication Ratio; (b) Most Jobs First; (c) Longest Communication Ratio.


The MJF scheduling is depicted in Fig. 4(b). According to corollary 5, task(Pmax+1) = task(P4) = 0, therefore, P4 will not be included in the scheduling. MJF has the task distribution order P1, P2, P3. Another scheduling policy is called Longest Communication Ratio (LCR) which is an opposite approach to the SCR method. Fig. 4(c) shows the LCR scheduling result which has the dispatch order P1, P2, P4, P3.

To investigate the performance of SCR scheduling technique, we observe that MJF algorithm completes 16 tasks in 90 units of time in the first BSC. On the other hand, in SCR scheduling, there are 19 tasks completed in 73 units of time in the first BSC. In LCR, there are 19 tasks completed in 99 units of time. We can see that the system throughput of SCR (19/73≈0.260) > LCR (19/99≈0.192) > MJF (16/90≈0.178). Moreover, the average turnaround time of the SCR algorithm in the first three BSCs is 183/57 (≈3.2105) which is less than the LCR‘s average turnaround time 209/57 (≈3.6666) and the MJF‘s average turnaround time 186/48 (≈3.875).
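The quoted figures can be verified directly; this is a hedged check of the text's arithmetic, not part of the original paper.

```python
# System throughput in the first BSC: tasks completed / completion time.
throughput = {"SCR": 19 / 73, "LCR": 19 / 99, "MJF": 16 / 90}
ranked = sorted(throughput, key=throughput.get, reverse=True)
print(ranked)  # ['SCR', 'LCR', 'MJF']

# Average turnaround time over the first three BSCs (total time / tasks).
avg_turnaround = {"SCR": 183 / 57, "LCR": 209 / 57, "MJF": 186 / 48}
best = min(avg_turnaround, key=avg_turnaround.get)
print(best)    # SCR
```

Both criteria rank SCR first, consistent with the comparison above.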

6 Performance Evaluation

To evaluate the performance of the proposed method, we have implemented the SCR and the MJF algorithms. We compare different criteria, such as average turnaround time, system throughput and processor idle time, in Heterogeneous Processors with Heterogeneous Communications (HPHC).

Simulation experiments evaluating average turnaround time were made with different numbers of processors; the results are shown in Fig. 5. The computational speed of the slave processors is set as T1=3, T2=3, T3=5, T4=7, T5=11, and T6=13. For the cases where the processor number is 2, 3, …, 6, we have δ ≤ 1. When the processor number increases to 7, we have δ > 1. In either case, the SCR algorithm yields better average turnaround time. From these results, we conclude that the SCR algorithm outperforms MJF for most test samples.


Fig. 5. Average task turn-around time on different numbers of processors.

Simulation results present the performance comparison of the three task scheduling algorithms, SCR, MJF and LCR, on heterogeneous processors and heterogeneous communication paradigms. Fig. 6 shows the simulation results for the experiment setting with ±10 processor speed variation and ±4 communication speed variation. The computation speeds of the slave processors are T1=3, T2=6, T3=11, and T4=13. The times for a slave processor to receive one task from the master processor are T1_comm = 5, T2_comm = 2, T3_comm = 1 and T4_comm = 3. The average task turnaround time, system throughput and processor idle time are measured.



Fig. 6. Simulation results for 5 processors with ±10 computation speed variation and ±4 communication variation when δ > 1. (a) Average turnaround time. (b) System throughput. (c) Processor idle time.

Fig. 6(a) shows the average turnaround time over different numbers of BSCs. The SCR algorithm performs better than the LCR and MJF methods. Similarly, the SCR method has higher throughput than the other two algorithms, as shown in Fig. 6(b). The processor idle time is shown in Fig. 6(c). The SCR and LCR algorithms have the same amount of processor idle time, which is less than that of the MJF scheduling method. These phenomena match the theoretical analysis in section 5.

Fig. 7 presents a broader performance comparison of SCR and MJF over more cases. The simulation uses experiment settings with ±5~±30 processor speed variation and ±5~±30 communication speed variation; that is, the computation speed variation of T1~Tn is ±5~±30 and the communication speed variation of T1_comm~Tn_comm is ±5~±30. The system throughput is measured.



Fig. 7. Simulation results of throughput for 5~25 processors with ±30 computation speed variation and ±30 communication variation over 100 cases and 100 BSCs. (a) System throughput for the cases 0 < Ti ≤ 30 and 0 < Ti_comm ≤ 5. (b) System throughput for the cases 0 < Ti ≤ 5 and 0 < Ti_comm ≤ 30.

Fig. 7(a) is the case of 0 < Ti ≤ 30 and 0 < Ti_comm ≤ 5, where the computation and communication speeds are random and uniformly distributed, over different numbers of nodes, 100 BSCs and 100 cases. Fig. 7(b) is the case of 0 < Ti ≤ 5 and 0 < Ti_comm ≤ 30. The SCR algorithm performs better than the MJF method and has higher throughput, as shown in Fig. 7(a) and Fig. 7(b).
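A minimal sketch of how one such random HPHC case could be drawn for the Fig. 7(a) setting. This is an assumed setup: the function name and seeding are illustrative, not from the paper.

```python
import random

def random_case(n_nodes, max_comp=30, max_comm=5, seed=0):
    """Draw computation speeds in (0, max_comp] and per-task
    communication times in (0, max_comm], uniformly at random."""
    rng = random.Random(seed)
    comp = [rng.uniform(1e-3, max_comp) for _ in range(n_nodes)]
    comm = [rng.uniform(1e-3, max_comm) for _ in range(n_nodes)]
    return comp, comm

# One 5-node case matching the Fig. 7(a) ranges.
comp, comm = random_case(5)
```

Repeating such draws over 5~25 nodes, 100 cases and 100 BSCs reproduces the experimental grid described above.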

From the above experimental tests, we have the following remarks.

• The proposed SCR scheduling technique has higher system throughput than the MJF algorithm.

• The proposed SCR scheduling technique has better task turnaround time than the MJF algorithm.

• The SCR scheduling technique has less processor idle time than the MJF algorithm.

7 Conclusions

The problem of resource management and scheduling has been one of the main challenges in grid computing. In this paper, we have presented an efficient algorithm, SCR, for the heterogeneous processor tasking problem. One significant improvement of our approach is that average turnaround time can be minimized by selecting the processor with the smallest communication ratio first. The other advantage of the proposed method is that system throughput can be increased by dispersing processor idle time. Our preliminary analysis and simulation results indicate that the SCR algorithm outperforms Beaumont’s method in terms of lower average turnaround time, higher average throughput, less processor idle time and higher processor utilization.

A number of research issues remain. Our proposed model can be applied to map tasks onto heterogeneous cluster systems in grid environments, in which the communication costs vary across clusters.

In the future, we intend to develop generalized tasking mechanisms for computational grids. We will study realistic applications and analyze their performance on grid systems. Besides, rescheduling of processors/tasks to minimize processor idle time on heterogeneous systems is also interesting and will be investigated.

References

1. O. Beaumont, A. Legrand and Y. Robert, “The Master-Slave Paradigm with Heterogeneous Processors,” IEEE Trans. on Parallel and Distributed Systems, Vol. 14, No. 9, pp. 897-908, September 2003.

2. C. Banino, O. Beaumont, L. Carter, J. Ferrante, A. Legrand and Y. Robert, “Scheduling Strategies for Master-Slave Tasking on Heterogeneous Processor Platforms,” IEEE Trans. on Parallel and Distributed Systems, Vol. 15, No. 4, pp. 319-330, April 2004.

3. O. Beaumont, A. Legrand and Y. Robert, “Pipelining Broadcasts on Heterogeneous Platforms,” IEEE Trans. on Parallel and Distributed Systems, Vol. 16, No. 4, pp. 300-313, April 2005.


5. O. Beaumont, V. Boudet, F. Rastello and Y. Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms,” IEEE Trans. on Parallel and Distributed Systems, Vol. 12, No. 10, pp. 1033-1051, Oct. 2001.

6. F. Berman, R. Wolski, H. Casanova, W. Cirne, H. Dail, M. Faerman, S. Figueira, J. Hayes, G. Obertelli, J. Schopf, G. Shao, S. Smallen, N. Spring, A. Su, and D. Zagorodnov, “Adaptive Computing on the Grid Using AppLeS,” IEEE Trans. on Parallel and Distributed Systems, Vol. 14, No. 4, pp. 369-379, April 2003.

7. S. Bataineh, T.Y. Hsiung and T.G. Robertazzi, “Closed Form Solutions for Bus and Tree Networks of Processors Load Sharing a Divisible Job,” IEEE Trans. Computers, Vol. 43, No. 10, pp. 1184-1196, Oct. 1994.

8. T. D. Braun, H. J. Siegel, N. Beck, L. Boloni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys and B. Yao, “A taxonomy for describing matching and scheduling heuristics for mixed-machine heterogeneous computing systems,” Proceedings of the IEEE Workshop on Advances in Parallel and Distributed Systems, pp. 330-335, Oct. 1998.

9. A.T. Chronopoulos and S. Jagannathan, “A Distributed Discrete-Time Neural Network Architecture for Pattern Allocation and Control,” Proc. IPDPS Workshop Bioinspired Solutions to Parallel Processing Problems, 2002.

10. S. Charcranoon, T.G. Robertazzi and S. Luryi, “Optimizing Computing Costs Using Divisible Load Analysis,” IEEE Trans. Computers, Vol. 49, No. 9, pp. 987-991, Sept. 2000.

11. K. Cooper, A. Dasgupta, K. Kennedy, C. Koelbel, A. Mandal, G. Marin, M. Mazina, J. Mellor-Crummey, F. Berman, H. Casanova, A. Chien, H. Dail, X. Liu, A. Olugbile, O. Sievert, H. Xia, L. Johnsson, B. Liu, M. Patel, D. Reed, W. Deng, C. Mendes, Z. Shi, A. YarKhan, J. Dongarra, “New Grid Scheduling and Rescheduling Methods in the GrADS Project,” Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04), pp. 209-229, April 2004.

12. H. Casanova, A. Legrand, D. Zagorodnov and F. Berman, “Heuristics for Scheduling Parameter Sweep applications in Grid environments,” Proceedings of the 9th Heterogeneous Computing workshop (HCW'2000), pp. 349-363, 2000.

13. T. Thanalapati and S. Dandamudi, “An Efficient Adaptive Scheduling Scheme for Distributed Memory Multicomputers,” IEEE Trans. on Parallel and Distributed Systems, Vol. 12, No. 7, pp. 758-767, July 2001.


Executive Yuan Agency Personnel Overseas Travel Report Summary

Date written: June 20, 2007 (ROC year 96)

Name: Ching-Hsien Hsu (許慶賢)
Affiliation: Department of Computer Science and Information Engineering, Chung Hua University
Contact phone / e-mail: 03-5186410, chh@chu.edu.tw
Date of birth: February 23, 1973 (ROC year 62)
Title: Associate Professor

International conference attended: 2007 International Conference on Algorithms and Architecture for Parallel Processing, June 11-14, 2007
Country and location: Hangzhou, China
Period abroad: June 11 to June 19, 2007 (ROC year 96)

Summary: The international academic conference held in Hangzhou lasted four days in total. I arrived at the venue and registered on the afternoon of the first day. On the second day I chaired an invited session, and in the morning session I presented the paper accepted by the conference. On the first day I also attended Dr. Byeongho Kang's insightful talk on Web Information Management. On the second day, many important research results were presented in six parallel sessions; I chose to attend the sessions on Architecture and Infrastructure, Grid Computing, and P2P Computing. In the evening I attended the reception, exchanged views with several foreign scholars and with professors from China and Hong Kong, and took photos with them. On the morning of the third day I attended the sessions on Data and Information Management, where I learned of many emerging research topics and the main research directions of most researchers abroad, and I took the chance on the last day to get acquainted with foreign professors, hoping to deepen their impression of research in Taiwan. Over these days I heard many excellent paper presentations covering topics such as grid system technology, task scheduling, grid computing, grid databases, and wireless networks. With many renowned scholars participating, every attendee could obtain the latest international techniques and information; it was a very successful academic conference. Attending it was very rewarding: it allowed me to meet and interact with many internationally known researchers and professionals, and to discuss various problems in my field face to face with other professors. Having seen numerous research results and listened to several keynote speeches, I conclude that the venue and the invited speakers were all excellent; the conference was very well organized and worth learning from.

Review comments of the attendee's agency:
Review comments of the forwarding agency:
Handling comments of the RDEC:


(Paper presented at the ICA3PP-07 conference)

A Generalized Critical Task Anticipation Technique for DAG Scheduling

Ching-Hsien Hsu1, Chih-Wei Hsieh1 and Chao-Tung Yang2

1 Department of Computer Science and Information Engineering, Chung Hua University, Hsinchu, Taiwan 300, R.O.C.
chh@chu.edu.tw

2 High-Performance Computing Laboratory, Department of Computer Science and Information Engineering, Tunghai University, Taichung City, 40704, Taiwan, R.O.C.
ctyang@thu.edu.tw

Abstract. The problem of scheduling a weighted directed acyclic graph (DAG) representing an application onto a set of heterogeneous processors to minimize the completion time has been studied recently. The NP-completeness of the problem has instigated researchers to propose different heuristic algorithms. In this paper, we present a Generalized Critical-task Anticipation (GCA) algorithm for DAG scheduling in heterogeneous computing environments. The GCA scheduling algorithm employs a task prioritizing technique based on the CA algorithm and introduces a new processor selection scheme that considers heterogeneous communication costs among processors, adapting it to grid and scalable computing. To evaluate the performance of the proposed technique, we have developed a simulator that contains a parametric graph generator for producing weighted directed acyclic graphs with various characteristics. We have implemented the GCA algorithm along with the CA and HEFT scheduling algorithms on the simulator. The GCA algorithm is shown to be effective in terms of speedup and low scheduling costs.

1. Introduction

The purpose of a heterogeneous computing system is to make processors cooperate so that an application is completed quickly. Because of the diverse capabilities of processors and special requirements such as exclusive functions, memory access speed, or customized I/O devices, tasks might have distinct execution times on different resources. Therefore, efficient task scheduling is important for achieving good performance in heterogeneous systems.

The primary scheduling methods can be classified into three categories, dynamic scheduling, static scheduling and hybrid scheduling, according to the time at which the scheduling decision is made. In the dynamic approach, the system redistributes tasks between processors at run time, aiming to balance the computational load and reduce processor idle time. On the contrary, in static
