
Definition 5: If SPi sends messages to DPj-1 and DPj+1, the transmission between SPi and DPj

4. Performance Evaluation

To evaluate the performance of the proposed methods, we implemented DRC along with the Divide-and-Conquer algorithm [23]. The simulation is discussed in two categories: even and uneven GEN_BLOCK distributions. In an even GEN_BLOCK distribution, each processor owns a similar amount of data. In an uneven distribution, by contrast, a few processors may be allocated large volumes of data.

Since data elements can be concentrated on a few specific processors, those processors may also have the maximum degree of communication.

The simulation program generates a set of random integers as the message sizes A[SPi] and A[DPi]. Moreover, the total message size sent from the source processors SPi equals the total size received by the destination processors DPi, which keeps the source and destination processors balanced.
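The relation between the generated sizes and the individual messages can be sketched as follows. This is a minimal illustration, not the simulator used in the paper: it assumes the usual GEN_BLOCK layout, in which each processor holds one contiguous block of the array, so that the message from SPi to DPj is the overlap of their blocks. The function names `derive_messages` and `max_degree` are hypothetical.

```python
from collections import Counter


def derive_messages(src_sizes, dst_sizes):
    """Overlap consecutive source and destination blocks of one array.

    Returns (i, j, size) triples meaning SP_i sends `size` elements to DP_j.
    Assumes both lists are non-empty and describe the same total size.
    """
    if sum(src_sizes) != sum(dst_sizes):
        raise ValueError("source and destination totals must match")
    messages = []
    i = j = 0
    src_left = src_sizes[0]
    dst_left = dst_sizes[0]
    while i < len(src_sizes) and j < len(dst_sizes):
        size = min(src_left, dst_left)
        if size > 0:
            messages.append((i, j, size))
        src_left -= size
        dst_left -= size
        if src_left == 0:          # source block exhausted, move to next SP
            i += 1
            if i < len(src_sizes):
                src_left = src_sizes[i]
        if dst_left == 0:          # destination block filled, move to next DP
            j += 1
            if j < len(dst_sizes):
                dst_left = dst_sizes[j]
    return messages


def max_degree(messages):
    """Largest number of messages that any single processor sends or receives."""
    degree = Counter()
    for i, j, _ in messages:
        degree[("SP", i)] += 1
        degree[("DP", j)] += 1
    return max(degree.values())


example = derive_messages([3, 5, 2], [4, 4, 2])
print(example)              # [(0, 0, 3), (1, 0, 1), (1, 1, 4), (2, 2, 2)]
print(max_degree(example))  # 2
```

If each processor can send or receive only one message per step, the maximum degree is a lower bound on the number of steps any contention-free schedule needs.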

We assume that the data computation (communication) time in the simulation is represented by the transmission size |E_ij|. In the following figures, the percentage of events is plotted as a function of the message size and the number of processors. In the figures, “DRC Better” represents the percentage of events in which the DRC algorithm has a lower total computation (communication) time than the Divide-and-Conquer algorithm, while “DC Better” gives the reverse situation. If both algorithms have the same total computation (communication) time, the event is counted under “The Same Results”.
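The three categories can be tallied as in the sketch below. The cost pairs are made-up values for illustration only, and `tally_events` is a hypothetical helper, not part of the original simulator.

```python
def tally_events(events):
    """Return the percentage of events in each category.

    `events` is a list of (drc_cost, dc_cost) pairs, one pair per simulated
    redistribution, where each cost is the total transmission size of the
    schedule produced by that algorithm.
    """
    counts = {"DRC Better": 0, "DC Better": 0, "The Same Results": 0}
    for drc_cost, dc_cost in events:
        if drc_cost < dc_cost:
            counts["DRC Better"] += 1
        elif dc_cost < drc_cost:
            counts["DC Better"] += 1
        else:
            counts["The Same Results"] += 1
    return {label: 100.0 * n / len(events) for label, n in counts.items()}


# Illustrative input only; these are not measurements from the paper.
print(tally_events([(10, 12), (9, 9), (15, 11), (7, 8)]))
# {'DRC Better': 50.0, 'DC Better': 25.0, 'The Same Results': 25.0}
```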

In the uneven distribution, the upper bound of the message size is set to B*1.7 and the lower bound to B*0.3, where B is the total transmission message size divided by the total number of processors. In the even distribution, the upper bound is set to B*1.3 and the lower bound to B*0.7. The total message size is 10M.
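One way to generate block sizes that respect these bounds and still sum to the required total is sketched below. The paper does not say how its generator enforces the total, so the repair pass here is an assumption, as is reading "10M" as 10,000,000 units; `gen_block_sizes` is a hypothetical name.

```python
import random


def gen_block_sizes(total, procs, low_factor, high_factor, rng):
    """Draw one block size per processor in [B*low_factor, B*high_factor].

    B = total / procs. After the random draw, a single repair pass shifts
    sizes toward a bound until they sum to `total` (an assumed strategy).
    """
    base = total / procs
    low = int(base * low_factor)
    high = int(base * high_factor)
    sizes = [rng.randint(low, high) for _ in range(procs)]
    diff = total - sum(sizes)
    for k in range(procs):
        if diff == 0:
            break
        if diff > 0:
            step = min(diff, high - sizes[k])   # grow, but not past the upper bound
        else:
            step = max(diff, low - sizes[k])    # shrink, but not past the lower bound
        sizes[k] += step
        diff -= step
    if diff != 0:
        raise ValueError("bounds cannot reach the requested total")
    return sizes


TOTAL = 10_000_000  # assumed meaning of "10M"
rng = random.Random(0)
uneven = gen_block_sizes(TOTAL, 24, 0.3, 1.7, rng)
even = gen_block_sizes(TOTAL, 24, 0.7, 1.3, rng)
print(sum(uneven), sum(even))
```

Calling the function twice per event, once for the source and once for the destination processors, yields two size lists with the same total, which matches the balance requirement stated earlier.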

Fig. 6(a) and 6(b) show the simulation results of both the DRC and the Divide-and-Conquer algorithms with different numbers of processors and total message sizes. The number of processors ranges from 8 to 24. We can observe that the DRC algorithm performs better than the Divide-and-Conquer algorithm in the uneven data redistribution. Since the data is concentrated in the even case, we can observe from Fig. 7(a) and 7(b) that DRC performs better there than in the uneven case. In both the even and uneven cases, DRC performs better than the Divide-and-Conquer algorithm.

Figure 6. The percentage of events by computing time, plotted (a) for different numbers of processors and (b) for different total message sizes on 24 processors, on the uneven data set.

Figure 7. The percentage of events by computing time, plotted (a) for different numbers of processors and (b) for different total message sizes on 24 processors, on the even data set.

5. Conclusion

In this paper, we have presented a Degree-Reduction-Coloring (DRC) scheduling algorithm to efficiently perform HPF2 irregular array redistribution on a distributed memory multi-computer.

The DRC algorithm is a simple method with low algorithmic complexity for performing GEN_BLOCK array redistribution. It is optimal in terms of the minimal number of steps. At the same time, it is near-optimal with respect to the minimal total message size over all steps. The proposed method not only avoids node contention but also shortens the overall communication length.

To verify the performance of the proposed algorithm, we implemented DRC as well as the Divide-and-Conquer redistribution algorithm. The experimental results show improved communication costs and high practicability on different processor hierarchies. The experimental results also indicate that both algorithms perform well on GEN_BLOCK redistribution. In many situations, DRC is better than the Divide-and-Conquer redistribution algorithm.

References

[1] G. Bandera and E.L. Zapata, “Sparse Matrix Block-Cyclic Redistribution,” Proceedings of the IEEE Int'l Parallel Processing Symposium (IPPS'99), San Juan, Puerto Rico, pp. 355-359, April 1999.

[2] J.A. Bondy and U.S.R. Murty, Graph Theory with Applications, Macmillan, London, 1976.

[3] Frederic Desprez, Jack Dongarra and Antoine Petitet, “Scheduling Block-Cyclic Data Redistribution,” IEEE Trans. on PDS, vol. 9, no. 2, pp. 192-205, Feb. 1998.

[4] Minyi Guo, “Communication Generation for Irregular Codes,” The Journal of Supercomputing, vol. 25, no. 3, pp. 199-214, 2003.

[5] Minyi Guo and I. Nakata, “A Framework for Efficient Array Redistribution on Distributed Memory Multicomputers,” The Journal of Supercomputing, vol. 20, no. 3, pp. 243-265, 2001.

[6] Minyi Guo, I. Nakata and Y. Yamashita, “Contention-Free Communication Scheduling for Array Redistribution,” Parallel Computing, vol. 26, no. 8, pp. 1325-1343, 2000.

[7] Minyi Guo, I. Nakata and Y. Yamashita, “An Efficient Data Distribution Technique for Distributed Memory Parallel Computers,” Joint Symp. on Parallel Processing (JSPP'97), pp. 189-196, 1997.

[8] Minyi Guo, Yi Pan and Zhen Liu, “Symbolic Communication Set Generation for Irregular Parallel Applications,” The Journal of Supercomputing, vol. 25, pp. 199-214, 2003.

[9] Edgar T. Kalns and Lionel M. Ni, “Processor Mapping Technique Toward Efficient Data Redistribution,” IEEE Trans. on PDS, vol. 6, no. 12, pp. 1234-1247, December 1995.

[10] S. D. Kaushik, C. H. Huang, J. Ramanujam and P. Sadayappan, “Multiphase Data Redistribution: Modeling and Evaluation,” International Parallel Processing Symposium (IPPS’95), pp. 441-445, 1995.

[11] Peizong Lee and Zvi Meir Kedem, “Automatic Data and Computation Decomposition on Distributed Memory Parallel Computers,” ACM Transactions on Programming Languages and Systems, vol. 24, no. 1, pp. 1-50, January 2002.

[12] S. Lee, H. Yook, M. Koo and M. Park, “Processor Reordering Algorithms Toward Efficient GEN_BLOCK Redistribution,” Proceedings of the ACM Symposium on Applied Computing, pp. 539-543, 2001.

[13] Y. W. Lim, Prashanth B. Bhat and Viktor K. Prasanna, “Efficient Algorithms for Block-Cyclic Redistribution of Arrays,” Algorithmica, vol. 24, no. 3-4, pp. 298-330, 1999.

[14] C.-H Hsu, S.-W Bai, Y.-C Chung and C.-S Yang, “A Generalized Basic-Cycle Calculation Method for Efficient Array Redistribution,” IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 12, pp. 1201-1216, Dec. 2000.

[15] Ching-Hsien Hsu and Kun-Ming Yu, “An Optimal Processor Replacement Scheme for Efficient Communication of Runtime Data Realignment,” Parallel and Distributed Processing and Applications, Lecture Notes in Computer Science, vol. 3358, pp. 268-273, 2004.

[16] C.-H Hsu, Dong-Lin Yang, Yeh-Ching Chung and Chyi-Ren Dow, “A Generalized Processor Mapping Technique for Array Redistribution,” IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 7, pp. 743-757, July 2001.

[17] Antoine P. Petitet and Jack J. Dongarra, “Algorithmic Redistribution Methods for Block-Cyclic Decompositions,” IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 12, pp. 1201-1216, Dec. 1999.

[18] Neungsoo Park, Viktor K. Prasanna and Cauligi S. Raghavendra, “Efficient Algorithms for Block-Cyclic Data Redistribution Between Processor Sets,” IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 12, pp. 1217-1240, Dec. 1999.

[19] L. Prylli and B. Tourancheau, “Fast Runtime Block Cyclic Data Redistribution on Multiprocessors,” Journal of Parallel and Distributed Computing, vol. 45, pp. 63-72, Aug. 1997.

[20] S. Ramaswamy, B. Simons and P. Banerjee, “Optimization for Efficient Data Redistribution on Distributed Memory Multicomputers,” Journal of Parallel and Distributed Computing, vol. 38, pp. 217-228, 1996.

[21] Akiyoshi Wakatani and Michael Wolfe, “Optimization of Data Redistribution for Distributed Memory Multicomputers,” short communication, Parallel Computing, vol. 21, no. 9, pp. 1485-1490, September 1995.

[22] Hui Wang, Minyi Guo and Wenxi Chen, “An Efficient Algorithm for Irregular Redistribution in Parallelizing Compilers,” Proceedings of 2003 International Symposium on Parallel and Distributed Processing with Applications, LNCS 2745, 2003.

[23] Hui Wang, Minyi Guo and Daming Wei, “Divide-and-Conquer Algorithm for Irregular Redistributions in Parallelizing Compilers,” The Journal of Supercomputing, vol. 29, no. 2, pp. 157-170, 2004.

[24] H.-G. Yook and Myung-Soon Park, “Scheduling GEN_BLOCK Array Redistribution,” Proceedings of the IASTED International Conference Parallel and Distributed Computing and Systems, November, 1999.

Building a High-Performance Parallel Environment for Evolutionary Tree Construction Using a Grid *

游坤明¹, 徐蓓芳¹, 賴威廷¹, 謝一功¹, 周嘉奕¹, 林俊淵², 唐傳義³

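1 Department of Computer Science and Information Engineering, Chung Hua University

2 Institute of Molecular and Cellular Biology, National Tsing Hua University

3 Department of Computer Science, National Tsing Hua University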

1 yu@chu.edu.tw, {b9102042, b9004060, b9102004}@cc.chu.edu.tw, jyzhou@pdlab.csie.chu.edu.tw

2 cyulin@mx.nthu.edu.tw

3 cytang@cs.nthu.edu.tw

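Abstract

Using parallel processing to handle very large computations has become an important application concept in recent years, and different applications have brought with them many different environment architectures. A grid is an architecture built on the Internet. Grids can share resources with other grids over the Internet, so computing on a grid can be regarded as using a large pool of resources that is easy to grow or shrink. By comparison, increasing the computing power of a traditional cluster system costs more than it does on a grid, so a cluster's computing power is limited. In a typical grid, common protocols, mutually recognized authentication, security considerations, and reasonable resource access are all required before grids can communicate with one another over the network. We use grid computing to process our data and programs and to obtain correct results within a reasonable time. In this paper, we use a parallel algorithm, with human mitochondria as the example, to construct evolutionary trees in single-machine, grid, and cluster environments, and we compare the resulting performance.

Keywords: ultrametric tree, cluster computing, grid computing, Globus Toolkit

1. Introduction

In bioinformatics research, scientists often need to learn how closely species are related from the results of an evolutionary tree. Constructing evolutionary trees from a distance matrix is an important problem in biology and taxonomy, and it has given rise to many different models and corresponding algorithms. Most of the associated optimization problems have been proven to be NP-hard.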

* This work was supported in part by the NSC of ROC, under grants NSC-93-2213-E-216-037 and NSC-94-2213-E-216-028.

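Among these models, an important one assumes that the rate of evolution is constant [5, 17]. Under this assumption, the evolutionary tree computed from the distance matrix is an ultrametric tree.

This paper uses a high-performance parallel branch-and-bound algorithm to build a minimum-distance evolutionary tree. The parallel algorithm is built on a centralized master-slave architecture and adds effective load-balancing and inter-node communication strategies to solve the minimum-weight ultrametric tree construction problem within a tolerable amount of time.

In recent years, more and more problems have come to be solved with the aid of computers, and the computing power of a personal computer is no longer enough to produce results within a reasonable time. Distributed computing is therefore the next stage of development. This paper constructs an evolutionary tree using human mitochondria as the example. Constructing an evolutionary tree is a very complex and time-consuming computation. On an ordinary personal computer it takes a great deal of time to obtain a result, and sometimes, after a long wait, the run is interrupted because of insufficient resources. Obtaining a satisfactory result in reasonable time therefore requires a high-performance computer such as a supercomputer, but for economic reasons we can use a cluster or a grid to achieve comparable performance.

Clusters can be built at different scales. Their greatest advantage is scalability: adding new personal computers raises the cluster's performance. In some situations the data are distributed across different locations and must be accessed from one another. A grid connects several clusters in different locations over network links and can exploit this arrangement to keep information up to date, so in terms of resource-usage efficiency it far surpasses a cluster [19].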

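The technology developed on top of the grid is middleware, which integrates the grid's distributed computing resources. Its main role is to coordinate between machines. It coordinates resource allocation between grid users and resource providers, helps users find machines suited to their needs, and completes data-access transactions [19]. One of its important components is metadata.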

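One advantage of a grid is that it makes efficient use of idle computers, and for long-running computations a grid can use resources even more efficiently. Using a parallel processing environment such as cluster computing or grid computing requires a parallel algorithm and a parallel communication tool, for example MPI, so that the program runs smoothly on the platform.

We have successfully run the parallel algorithm in a grid environment and constructed evolutionary trees. The experimental data from the grid and the cluster show that the grid delivers performance similar to the cluster's. In this paper we compare performance in three environments: a single machine, a cluster, and a grid. The experimental results show that a single machine's computing power is far below that of the cluster and the grid. Between the cluster and the grid, performance is about the same when the same number of nodes is used.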

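2. Background

2.1 Ultrametric Trees

There are many models for building evolutionary trees, one of which is the ultrametric tree. An ultrametric tree assumes that all species evolve at the same rate [5, 13]. It is a rooted binary tree with the species at the leaf nodes and weights on the edges, in which every internal node has the same path length to every leaf in its subtree [4]. For an n * n distance matrix M, the minimum ultrametric tree is defined as the one in which the total weight on the edges connecting the leaf nodes is minimum. Because an ultrametric tree can easily be converted into a binary tree without changing the distances between leaf nodes [13], it is a model well suited to computation.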
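The defining property, that every internal node has the same path length to all leaves below it, can be checked with a short recursive function. The sketch below is only illustrative: the nested-tuple tree representation is an assumption, and `total_weight` takes "weight" to mean the sum of all edge weights.

```python
def height(tree):
    """Return the common node-to-leaf path length, or None if not ultrametric.

    A leaf is a string. An internal node is a pair of (subtree, edge_weight)
    tuples, one for each child (a hypothetical representation).
    """
    if isinstance(tree, str):
        return 0.0
    (left, w_left), (right, w_right) = tree
    h_left, h_right = height(left), height(right)
    if h_left is None or h_right is None:
        return None
    if abs((h_left + w_left) - (h_right + w_right)) > 1e-9:
        return None  # the two sides reach their leaves at different depths
    return h_left + w_left


def total_weight(tree):
    """Sum of all edge weights in the tree."""
    if isinstance(tree, str):
        return 0.0
    (left, w_left), (right, w_right) = tree
    return w_left + w_right + total_weight(left) + total_weight(right)


ab = (("a", 1.0), ("b", 1.0))
good = ((ab, 2.0), ("c", 3.0))   # every leaf is at distance 3.0 from the root
bad = ((ab, 2.0), ("c", 2.0))    # leaf c is closer to the root than a and b

print(height(good), total_weight(good))  # 3.0 7.0
print(height(bad))                       # None
```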

Figure 1. Building the branch-and-bound tree (BBT) [3]

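As Figure 1 shows, the number of ultrametric trees, A(n), grows rapidly as n increases. Several studies on ultrametric trees have been published [6, 7, 15]. Because these problems are usually intractable, most of that work is based on heuristic algorithms. For example, UPGMA (Unweighted Pair Group Method with Arithmetic mean) [17] is a very commonly used algorithm for building ultrametric trees.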
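For readers unfamiliar with UPGMA, a minimal sketch follows. It is the textbook procedure (repeatedly merge the two closest clusters and average their distances to the rest, weighted by cluster size), not the implementation used in this work. The input format, the nested-tuple output, and the function name `upgma` are assumptions.

```python
from itertools import combinations


def upgma(labels, dist):
    """Build an ultrametric tree from a symmetric distance matrix.

    Returns (tree, root_height). A leaf is its label. An internal node is a
    pair of (subtree, edge_weight) tuples.
    """
    n = len(labels)
    # cluster id -> (tree, number of leaves, height of the cluster's root)
    clusters = {i: (labels[i], 1, 0.0) for i in range(n)}
    d = {frozenset((i, j)): dist[i][j] for i, j in combinations(range(n), 2)}
    next_id = n
    while len(clusters) > 1:
        # pick the closest pair of current clusters
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merge_height = d[frozenset((a, b))] / 2
        tree_a, size_a, height_a = clusters.pop(a)
        tree_b, size_b, height_b = clusters.pop(b)
        # distance from the merged cluster to each remaining cluster
        for c in clusters:
            d[frozenset((next_id, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        merged = ((tree_a, merge_height - height_a),
                  (tree_b, merge_height - height_b))
        clusters[next_id] = (merged, size_a + size_b, merge_height)
        next_id += 1
    (tree, _, root_height), = clusters.values()
    return tree, root_height


tree, root_height = upgma(["a", "b", "c"],
                          [[0, 2, 6],
                           [2, 0, 6],
                           [6, 6, 0]])
print(tree)         # (('c', 3.0), ((('a', 1.0), ('b', 1.0)), 2.0))
print(root_height)  # 3.0
```

UPGMA runs quickly but does not guarantee the minimum tree, which is why an exact search is still needed when optimality matters.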

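In this paper we take the algorithm of Exact Algorithms for Constructing Minimum branch-and-bound’s from Distance Matrices [4] as our basis and parallelize it. That method uses a branch-and-bound strategy to search for the minimum-distance evolutionary tree. Finding that tree would in principle require generating every possible tree topology and evaluating each one, but the number of ultrametric trees A(n) grows very quickly with the number of species, for example A(20) > 10^21, A(25) > 10^29, and A(30) > 10^37, so the method uses branch-and-bound to avoid an exhaustive search. In this paper we build the minimum-distance evolutionary tree with an efficient parallel branch-and-bound algorithm. Our method uses a centralized master-slave parallel architecture, to which we add load-balancing, bounding, and communication strategies to improve the program's efficiency.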
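The quoted growth of A(n) can be reproduced with a few lines of code. The sketch assumes that A(n) counts rooted binary tree topologies with n labeled leaves, for which the standard closed form is the double factorial (2n-3)!! = 1 * 3 * 5 * ... * (2n-3). The text above does not state this formula, so treat it as an assumption that happens to agree with the three bounds given.

```python
def count_rooted_binary_trees(n):
    """Number of rooted binary tree topologies on n labeled leaves, (2n-3)!!."""
    if n < 2:
        raise ValueError("need at least two leaves")
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= k
    return count


for n in (4, 10, 20, 25, 30):
    print(n, count_rooted_binary_trees(n))
```

Even at n = 20 the count exceeds 10^21, which is why enumeration is impractical and a bounding strategy is needed to prune the search.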

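2.2 Cluster Computing

With today's technology and the wide availability of processors and peripherals, a high-performance cluster can be assembled at low cost. A cluster consists of personal computers or workstations connected by a high-speed network. It provides high-performance computing while lowering the cost that such performance used to require. Because a cluster is made up of many connected computers, ordinary applications cannot take advantage of it. Algorithms must be designed for parallel and distributed environments, and applications must be written with software dedicated to parallel communication, such as MPI.

Now that computers and networks are ubiquitous, almost every computer can be regarded as connected to the Internet. Viewing clusters in a broader sense, each computer looks like part of one large local network joined by the Internet, and the whole world looks like one large cluster. This is not actually the case, however, because resources cannot be shared and computation cannot be divided among the machines. The idea of grid computing was proposed to achieve this broader kind of resource-sharing computation.

2.3 Grid Computing

Grid computing lets geographically dispersed virtual organizations coordinate the sharing of their resources while meeting the demand for large-scale computation. Beyond aggregating distributed computing resources, grid computing can manage any available computing resource within an organization over the network, thereby reducing server idle time.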

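Grid computing addresses situations in which many resources on the network are used at the same time to solve one problem, or in which a problem needs a large number of processors or access to large amounts of data distributed in different places. A familiar example is SETI (Search for Extraterrestrial Intelligence)@home, which lets thousands of people's computers help process data on their processors while idle. Moreover, these computers all work independently, meaning that regardless of whether some jobs need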


In the document: National Science Council, Executive Yuan, Research Project Report (pp. 45-56)