National Sun Yat-sen University Institutional Repository:Item 987654321/30048


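National Science Council, Executive Yuan: Funded Research Project Report

Run-time scheduling for doacross loops

Project type: Individual project

Project number: NSC 89-2213-E-110-009

Project period: August 1, 1999 to July 31, 2000 (ROC years 88 to 89)

Principal investigator: Dr. 黃宗傳, Department of Electrical Engineering, National Sun Yat-sen University

Co-principal investigator:

This report includes the following required attachments:

□ Report on an overseas business trip or study visit (one copy)

□ Report on a business trip or study visit to mainland China (one copy)

□ Report on attendance at an international conference and the paper presented (one copy each)

□ Overseas research report for an international cooperative research project (one copy)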


Project participants: graduate students 陳宜昌, 巫濟帆, 劉咸銘

1. Abstract

No state-of-the-art compiler can determine at compile time the parallelism of loops whose arrays have irregular (i.e., indirectly indexed) or nonlinear access patterns. In this project, we propose an efficient run-time scheme that computes a highly parallel execution schedule for such loops.

The new scheme first constructs a predecessor iteration table in the inspector phase, and then schedules all loop iterations into wavefronts for parallel execution. For non-uniform access patterns, the performance of inspector/executor methods usually degrades dramatically, but our scheme does not show this degradation. Furthermore, its high scalability and low overhead make the scheme especially suitable for multiprocessor systems.

Keywords: parallelizing compiler, run-time parallelization, inspector/executor


2. Background and Objectives

Current parallelizing compilers are effective for regular loops, but they are limited in parallelizing loops whose access patterns are complex or cannot be fully determined statically. The key is to resolve such cases at run time.
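As a concrete illustration, a loop of the following shape has an access pattern that is known only at run time. The index arrays `w` and `r` follow the names used in Fig. 1; the loop body, the function name `run_sequential`, and the sample data are assumptions made up for this sketch.

```python
def run_sequential(a, w, r):
    """Execute a[w[i]] = a[r[i]] + 1 for every iteration, in program order."""
    a = list(a)  # work on a copy so the caller's data is left unchanged
    for i in range(len(w)):
        # Which elements are read and written depends on the contents of
        # w and r, so cross-iteration dependences are invisible to a compiler.
        a[w[i]] = a[r[i]] + 1
    return a


# Hypothetical index arrays: each iteration reads what the previous one wrote.
a0 = [0, 0, 0, 0]
w = [2, 3, 0]
r = [1, 2, 3]
result = run_sequential(a0, w, r)
```

With other contents of `w` and `r` the same loop could be fully parallel, which is why the decision has to be deferred to run time.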

Reviewing the development of run-time doacross parallelization, we identify the following sources of inefficiency:

• sequential inspector,

• synchronization overhead on updating shared variables,

• overhead in constructing dependence chains for all array elements,

• large memory space required during operation,

• inefficient scheduler,

• possible load migration from inspector to executor.

In this project, we propose an efficient run-time scheme to overcome these problems. Our new scheme first constructs a predecessor iteration table, and then uses the information in that table to schedule all loop iterations efficiently into wavefronts. Owing to its high scalability and low overhead, our scheme is especially suitable for multiprocessor systems.

3. Results and Discussion

The goal of our inspector is to construct a predecessor iteration table in parallel. We use an auxiliary array la(1:numproc,1:arysize) to keep the last iteration; namely, la(p,j) records the last iteration, among those assigned to processor p, that accesses array element j. The algorithm of our parallel inspector is shown in Fig. 1, and the algorithm of our parallel scheduler is presented in Fig. 2. Scheduling the loop iterations into wavefronts becomes very easy once the predecessor iteration table is available.
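Once every iteration carries a wavefront number, the executor side can be sketched as below. This is only an illustration: the loop body `a[w[i]] = a[r[i]] + 1`, the function name `execute_by_wavefront`, and the sample data are assumptions, while `wf` follows the array name used in Fig. 2. Arrays are 1-based with index 0 unused, to stay close to the figures.

```python
from collections import defaultdict


def execute_by_wavefront(a, w, r, wf):
    """Run the assumed loop body one wavefront at a time."""
    a = list(a)
    groups = defaultdict(list)
    for i in range(1, len(wf)):
        groups[wf[i]].append(i)
    for k in sorted(groups):
        # Iterations inside one wavefront are mutually independent, so they
        # could be dispatched to different processors. Reverse order is used
        # here only to show that their relative order does not matter.
        for i in reversed(groups[k]):
            a[w[i]] = a[r[i]] + 1
    return a


# Hypothetical data: iterations 1 and 2 are independent; 3 depends on 1
# and 4 depends on 2, so the wavefronts are {1, 2} and {3, 4}.
w = [0, 1, 2, 3, 4]
r = [0, 5, 6, 1, 2]
wf = [0, 1, 1, 2, 2]
out = execute_by_wavefront([0] * 7, w, r, wf)
```

The result matches what the original sequential loop would produce, which is the correctness condition any wavefront schedule has to satisfy.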

We evaluated the impact of non-uniform array access patterns on the performance of two run-time schemes: PRS (Predecessor Reference Scheme) by Rauchwerger et al. [8], and our PIS (Predecessor Iteration Scheme). Fig. 3 compares the speedup of the two schemes. PIS always obtains the higher speedup and still achieves a satisfactory speedup of 2.8 in the worst case. The overhead comparison in Fig. 4 shows that the processing overhead, defined as (inspector time + scheduler time) / sequential loop time, is dramatically larger for PRS than for PIS.
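Written as a formula, the processing overhead compared in Fig. 4 is the ratio below. The symbols are shorthand introduced here for the three times named in the text, and the factor of 100% is an assumption based on the percentage axis of Fig. 4.

```latex
\[
  \text{overhead}
  = \frac{T_{\text{inspector}} + T_{\text{scheduler}}}{T_{\text{sequential loop}}}
  \times 100\%
\]
```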

4. Self-Evaluation of Project Results

In the parallelization of partially parallel loops, the biggest challenge is to find a parallel execution schedule that fully extracts the potential parallelism while incurring as little run-time overhead as possible. In this project, we present a new run-time parallelization scheme, PIS, which uses the information of predecessor iterations instead of predecessor references to eliminate the processing overhead of repeatedly examining all the references in PRS. The predecessor iteration table can be constructed in parallel with no synchronization, and our parallel scheduler assigns iterations to wavefronts easily with its help. Both the theoretical time analysis and the experimental results show that our run-time scheme achieves better speedup and less processing overhead than PRS.

A paper based on this project will be published in the Proceedings of the International Conference on High Performance Computing (HiPC 2000), December 2000.

5. References

[1] Chen, D. K., Yew, P. C., Torrellas, J.: An Efficient Algorithm for the Run-Time Parallelization of Doacross Loops. Proc. 1994 Supercomputing (1994) 518-527

[2] Huang, T. C., Hsu, P. H.: The SPNT Test: A New Technology for Run-Time Speculative Parallelization of Loops. Lecture Notes in Computer Science Vol. 1366. Springer-Verlag, Berlin Heidelberg New York (1998) 177-191

[3] Huang, T. C., Hsu, P. H., Sheng, T. N.: Efficient Run-Time Scheduling for Parallelizing Partially Parallel Loops. J. Information Science and Engineering 14(1) (1998) 255-264

[4] Leung, S. T., Zahorjan, J.: Improving the Performance of Run-Time Parallelization. Proc. 4th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (1993) 83-91

[5] Leung, S. T., Zahorjan, J.: Extending the Applicability and Improving the Performance of Run-Time Parallelization. Tech. Rep. 95-01-08, Dept. CSE, Univ. of Washington (1995)

[6] Midkiff, S., Padua, D.: Compiler Algorithms for Synchronization. IEEE Trans. Comput. C-36(12) (1987) 1485-1495

[7] Polychronopoulos, C.: Compiler Optimizations for Enhancing Parallelism and Their Impact on Architecture Design. IEEE Trans. Comput. C-37(8) (1988) 991-1004

[8] Rauchwerger, L., Amato, N., Padua, D.: A Scalable Method for Run-Time Loop Parallelization. Int. Journal Parallel Processing 26(6) (1995) 537-576

[9] Rauchwerger, L., Padua, D.: The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. IEEE Trans. Parallel and Distributed Systems 10(2) (1999) 160-180

[10] Saltz, J., Mirchandaney, R., Crowley, K.: Run-time Parallelization and Scheduling of Loops. IEEE Trans. Comput. 40(5) (1991) 603-612

[11] Xu, C., Chaudhary, V.: Time-Stamping Algorithms for Parallelization of Loops at Run-Time. Proc. 11th Int. Parallel Processing Symp. (1997)

[12] Zhu, C. Q., Yew, P. C.: A Scheme to Enforce Data Dependence on Large Multiprocessor Systems. IEEE Trans.


/* The construction of predecessor iteration table */
    pr(1:numiter)=0
    pw(1:numiter)=0
/* Parallel recording phase */
    doall p=1,numproc
      do i=(p-1)*(numiter/numproc)+1,p*(numiter/numproc)
        if (la(p,w(i)).ne.0) then pw(i)=la(p,w(i))
        if (la(p,r(i)).ne.0) then pr(i)=la(p,r(i))
        la(p,w(i))=i
        la(p,r(i))=i
      enddo
    enddoall
/* Parallel patching phase */
    doall p=2,numproc
      do i=(p-1)*(numiter/numproc)+1,p*(numiter/numproc)
        if(pw(i).eq.0) then
          do j=p-1,1,-1
            if(la(j,w(i)).ne.0) then
              pw(i)=la(j,w(i))
              goto S1
            endif
          enddo
S1:     endif
        if(pr(i).eq.0) then
          do j=p-1,1,-1
            if(la(j,r(i)).ne.0) then
              pr(i)=la(j,r(i))
              goto S2
            endif
          enddo
S2:     endif
      enddo
    enddoall

Fig. 1. The algorithm of inspector
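The two phases of Fig. 1 can be traced in a short Python sketch. It is a sequential simulation, not a parallel implementation: the `doall` loops become ordinary `for` loops, the function and helper names are hypothetical, and `numproc` is assumed to divide the number of iterations evenly, as the block bounds in Fig. 1 imply. Arrays are 1-based with index 0 unused, and 0 means "no predecessor".

```python
def build_predecessor_table(w, r, arysize, numproc):
    """Return (pw, pr): predecessor iterations for each write and read."""
    numiter = len(w) - 1
    block = numiter // numproc  # assumes numproc divides numiter
    pw = [0] * (numiter + 1)
    pr = [0] * (numiter + 1)
    # la[p][j]: last iteration in processor p's block that touched element j.
    la = [[0] * (arysize + 1) for _ in range(numproc + 1)]

    def block_range(p):
        return range((p - 1) * block + 1, p * block + 1)

    # Recording phase: each p only reads and writes its own row of la,
    # which is why Fig. 1 can run it as a doall without synchronization.
    for p in range(1, numproc + 1):
        for i in block_range(p):
            pw[i] = la[p][w[i]]
            pr[i] = la[p][r[i]]
            la[p][w[i]] = i
            la[p][r[i]] = i

    def last_in_earlier_blocks(p, elem):
        # Search the blocks before p, nearest first.
        for j in range(p - 1, 0, -1):
            if la[j][elem] != 0:
                return la[j][elem]
        return 0

    # Patching phase: fill in predecessors that live in earlier blocks.
    for p in range(2, numproc + 1):
        for i in block_range(p):
            if pw[i] == 0:
                pw[i] = last_in_earlier_blocks(p, w[i])
            if pr[i] == 0:
                pr[i] = last_in_earlier_blocks(p, r[i])
    return pw, pr


# Hypothetical access pattern for four iterations over a four-element array.
w = [0, 1, 2, 1, 3]
r = [0, 4, 1, 2, 1]
pw, pr = build_predecessor_table(w, r, arysize=4, numproc=2)
```

For this sample input the table is the same whether one or two simulated processors are used, which is what the patching phase is meant to guarantee.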

/* Schedule iterations into wavefronts in parallel */
    wf(1:numiter)=0
    wf(0)=1
    done=.false.
    wfnum=0
/* Repeated until all iterations are scheduled */
    do while (done.eq..false.)
      done=.true.
      wfnum=wfnum+1
      doall p=1,numproc
        do i=(p-1)*(numiter/numproc)+1,p*(numiter/numproc)
          if (wf(i).eq.0) then
            if (wf(pw(i)).ne.0.and.wf(pr(i)).ne.0) then
              wf(i)=wfnum
            else
              done=.false.
            endif
          endif
        enddo
      enddoall
    enddo

Fig. 2. The algorithm of scheduler
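A sequential Python sketch of the wavefront scheduler follows, using the same 1-based convention as the pseudocode (index 0 unused, predecessor 0 meaning "none"). One detail is an assumption of this sketch rather than something taken from the pseudocode: a predecessor counts as ready only if it was placed in a strictly earlier wavefront, so that two dependent iterations visited in the same pass never share a wavefront. The function name is hypothetical.

```python
def schedule_wavefronts(pw, pr):
    """Return wf, where wf[i] is the wavefront number of iteration i."""
    numiter = len(pw) - 1
    wf = [0] * (numiter + 1)
    wfnum = 0
    done = False
    while not done:
        done = True
        wfnum += 1
        for i in range(1, numiter + 1):
            if wf[i] != 0:
                continue  # already scheduled in an earlier pass
            # Predecessor 0 means "none"; otherwise the predecessor must
            # already sit in an earlier wavefront (assumption, see above).
            ready = all(p == 0 or 0 < wf[p] < wfnum for p in (pw[i], pr[i]))
            if ready:
                wf[i] = wfnum
            else:
                done = False
    return wf


# Hypothetical tables: iteration 3 depends on 1, and iteration 4 on 2.
wf_parallel = schedule_wavefronts([0, 0, 0, 1, 2], [0, 0, 0, 0, 0])
# A pure chain 1 -> 2 -> 3 -> 4 needs one wavefront per iteration.
wf_chain = schedule_wavefronts([0, 0, 0, 2, 0], [0, 0, 1, 2, 3])
```

The loop terminates as long as every predecessor has a smaller iteration number than its successor, which holds for tables built by an inspector like the one in Fig. 1. The number of passes equals the length of the longest dependence chain.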


Fig. 3. Speedup comparison (speedup of PRS and PIS versus critical path length, for critical path lengths 9, 16, 30, 52, 105, 199)

Fig. 4. Overhead comparison (overhead in % of PRS and PIS versus critical path length, for critical path lengths 9, 16, 30, 52, 105, 199)
