National Sun Yat-sen University Institutional Repository:Item 987654321/28780

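Final Report of a Research Project Funded by the National Science Council, Executive Yuan

Establishing an Integrated Parallelizing-Compiler Environment for Intelligent Memory Architectures (3/3)

Project type: Individual project
Project number: NSC93-2213-E-110-002-
Project period: August 1, 2004 to July 31, 2005
Host institution: Department of Electrical Engineering, National Sun Yat-sen University
Principal investigator: 黃宗傳 (Tsung-Chuan Huang)
Project participants: 吳晟褘, 林頌為, 黃繼玄, 姜紹賢
Report type: Full report
Attachments: Report on attending an international conference, and published papers
Availability: This project report is open to public inquiry

October 14, 2005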

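I. Abstracts (Chinese and English)

[Chinese abstract (translated)]

With recent advances in computer technology, a new computer architecture, Intelligent Memory, has been proposed to relieve the performance bottleneck caused by the speed gap between memory and processor. In short, this architecture integrates several, or several dozen, computing logic units or simplified RISC CPU cores into the memory chip, giving it modest computing capability so that simple data operations can be completed inside the memory itself; when the host processor requests data, it receives data that has already been partially processed. To exploit such architectures fully and to transform existing programs into code suited to the Intelligent Memory platform, we have developed a parallelizing transformation system for this architecture: SAGE (Statement-Analysis-Grouping-Evaluation). The goal of this project is to implement the SAGE system. In the first year, we developed the algorithms of the main SAGE modules and tested their analysis capability; we also set up the FlexRAM simulator environment on an SGI Origin 200 machine, installed the Polaris parallelizing compiler, and tuned the system. In the second year, we implemented these algorithms and tested the performance of each one individually. In the third year, we integrated the modules of the first two years into the SAGE system and used the FlexRAM simulation environment on the SGI Origin 200 to analyze benchmark programs and validate performance.

Keywords: Intelligent Memory, SAGE, statement splitting, weight evaluation, scheduling, loop splitting, intelligent memory tiling, intelligent memory operation recognition.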

[English abstract]

With the rapid growth of computer technology in recent years, a new class of computer architecture, intelligent memory, has been proposed to relieve the bottleneck arising from the performance gap between memory and processor. This class of architecture integrates several computing logic units or simplified RISC cores into a memory chip, so that the chip can perform simple operations on data. When host processors need data, the already-computed data can be retrieved from the memory. To fully utilize this new architecture and transform existing source code for it, we have developed a new parallelizing system in this project, called SAGE (Statement-Analysis-Grouping-Evaluation). In this project, we enhance the integrity of the SAGE prototype and develop several optimization techniques to improve the applicability of intelligent memory architectures. To make the SAGE prototype a complete system, we focus on three major topics. In the first year, we developed the basic analysis modules for statement splitting, load balancing and schedule generation. In the second year, we designed optimization mechanisms for intelligent memory architectures, including loop splitting, intelligent memory tiling and intelligent memory operation recognition. In the third year, these schemes and a more precise weight evaluation mechanism were integrated into the SAGE system so that the processors at each level can achieve their best performance. Finally, we evaluate the new SAGE system using standard benchmarks such as SPEC95, NAS and the Perfect Benchmarks to obtain optimal parallelization efficiency on the intelligent memory platform.

Keywords: Intelligent Memory, SAGE, Statement Splitting, Weight Evaluation, Scheduling, Loop Splitting, Intelligent Memory Tiling, Intelligent Memory Operation Recognition

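II. Table of Contents

I. Abstracts (Chinese and English)
    [Chinese abstract]
    [English abstract]
II. Table of Contents
III. Report
    (1) Introduction and Objectives
    (2) Literature Review
    (3) Research Method
    (4) Results and Discussion
    (5) Experimental Results
    (6) Conclusion
IV. References
V. Self-Evaluation of Project Results
VI. Appendix: Published Paper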

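III. Report

(1) Introduction and Objectives

Computer architecture design keeps moving toward higher performance and higher parallelism. From the user's point of view, however, the actual execution speed of applications has not increased as quickly. There are two main reasons. First, memory speed has not kept pace with CPU improvements. Second, applications have shifted from compute-intensive to data-intensive. Together, these factors make the speed gap between memory access and processor computation the main performance bottleneck, so the design strengths of high-performance processors and parallel architectures cannot be fully realized.

In response, with funding from the "DARPA/ITO Data Intensive Systems" program of the U.S. Defense Advanced Research Projects Agency (DARPA) and active work at many research institutions, researchers have in recent years proposed a new computer architecture: Intelligent RAM (also called Processor-in-Memory, or PIM). In short, this architecture integrates several, or several dozen, computing logic units or simplified RISC CPU cores into the memory chip, giving it modest computing capability so that simple data operations are completed inside the memory; when the host processor requests data, it receives results that have already been partially computed. On the one hand, the memory's simple computing capability provides parallel processing; on the other, the time otherwise spent fetching large volumes of unprocessed raw data is saved, so the CPU can concentrate on complex mathematical operations. The main advantages are:

1. The amount of data transferred over the system bus drops substantially, which raises the effective memory throughput per unit time.
2. The relative speed gap between processor and memory narrows, which reduces the time the CPU spends idle waiting for data.
3. The design is compatible with existing memory buses, so the in-memory processors can be used within existing architectures to raise the parallel processing capability of ordinary desktop workstations and even personal computers.

Current research on intelligent memory mostly targets hardware architecture; little work addresses the parallelizing compilers needed to exploit it. Without a way to transform existing programs into code suited to the Intelligent Memory platform, the architecture cannot reach its potential. This project therefore develops and implements a program analysis and transformation system for Intelligent Memory systems, so that the architecture's performance can be fully exploited.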

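(2) Literature Review

Regarding Intelligent Memory architectures, Patterson et al. proposed the Vector-IRAM architecture [14], which combines a scalar processor, a vector processor and memory on a single chip. The vector processor was added with vectorized multimedia applications in mind, taking energy efficiency and chip area into account, and it effectively increases memory bandwidth. Vector-IRAM has many applications, for example video games, personal digital assistants (PDAs), intelligent disk systems, low-cost data servers and low-cost supercomputers.

Oskin, Chong and Sherwood proposed another Intelligent Memory architecture [5]: Active Page / RADram (Re-configurable Architecture DRAM). In this architecture, each memory chip contains 256 programmable logic elements (LEs), each of which is a standard FPGA logic block. At compile time, the simple operations to be handed to the LEs are identified and the LEs are programmed in advance through a special programming system. When Active Pages are used, the raw data is first processed by the LEs, and the host processor then collects the processed data. The advantage of logic elements is that they are faster than a general-purpose processor; the drawback is that an LE can perform only very simple computations.

Granacki et al. proposed their Intelligent RAM model, DIVA (Data Intensive Architecture) [6]. This architecture is quite complex: it provides floating-point operations, hardware virtual address translation, support for TLB (Translation Look-aside Buffer) management, and conditional SIMD operations. The authors also stress that, because of the high transfer efficiency of the memory chips, DIVA is especially well suited to message-passing environments.

Huang and Torrellas proposed FlexRAM [4][8], an Intelligent RAM architecture that combines the strengths of SIMD and MIMD parallel computer organizations. In this architecture, a memory chip contains one superscalar general-purpose processor core and 64 simplified general-purpose processor cores. Because both are commodity microprocessors, the compiler does not need to be redesigned, the operating system needs no major changes, and no special hardware system architecture is required. FlexRAM can be used in workstations, personal computers and servers to improve the execution efficiency of most applications, so we chose it as our development platform.

Regarding compiler systems for Intelligent Memory, Judd and Yelick [10] proposed a compiler system for the Vector-IRAM architecture: the VIRAM Compiler. It is based on the vectorizing compiler of Cray supercomputers; it reuses that compiler's vectorizing transformations and redesigns the code generator to emit machine code for Vector-IRAM's MIPS-based scalar processor and vector machine code for its specially designed vector processor.

Moritz, Frank and Amarasinghe [11] proposed another idea: the FlexCache Compiler. The compiler gathers information at compile time and uses the hot-page technique to analyze data locality and generate appropriate directives. A dedicated cache controller executes these directives to reduce cache-tag accesses and cache line conflicts, with the aim of using the cache fully.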

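(3) Research Method

1. Intelligent Memory system architecture:

The research platform of this project is derived from the FlexRAM architecture [4][8] developed by the IA-COMA laboratory at UIUC; it simplifies the original small memory processors. The experimental platform has two levels of processors: the host processor P.Host on the original host machine, and one memory processor P.Mem on each Intelligent Memory chip. P.Host can invoke one or more P.Mem processors to carry out work, and the P.Mem processors communicate with one another through the inter-chip network. The organization and the detailed system parameters are given in Figure 1 and Table 1, respectively.

[Figure 1. Organization of the Intelligent Memory architecture. P.Host (host processor core, L1 cache, L2 cache) is connected through the Rambus memory bus to the Intelligent Memory chips; each chip contains a P.Mem (memory processor core with L1 cache) and DRAM cells, and the chips are linked by an inter-chip network.]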

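From the figure and table above, we see that the Intelligent Memory architecture mixes different kinds of processors in one system. P.Host is a high-performance processor with more cache levels, but its memory access time is longer. P.Mem, by contrast, has less computing power than the host processor but, being integrated with the memory, has a much lower memory access time. Because of these differences, conventional parallelization techniques cannot be applied directly to this hybrid architecture; optimization and parallelization techniques designed for it are needed.

2. Organization of the SAGE analysis system:

Current parallelizing compilers focus their transformations on loops, trying to let all or some of the iterations of a loop execute concurrently; this is the iteration-based approach. It works well for homogeneous multiprocessor systems. For heterogeneous, hybrid, tightly coupled multiprocessor platforms, however, it has an obvious shortcoming: the iterations behave almost identically while the processors differ in capability, so distributing work evenly in round-robin fashion, one iteration at a time, cannot exploit the different processors fully. We therefore take a completely different approach: we use the statements in a loop as the basic unit of analysis, determine the relative execution time of each unit on the different processors, that is, its weight, and assign each unit to the most suitable processor according to its weight. This is the basis of the SAGE (Statement-Analysis-Grouping-Evaluation) system.

SAGE is a new analysis model with a set of statement splitting, load balancing and scheduling techniques. It analyzes the source program and, using dependence analysis, partitions the program without changing its results. It then produces a Weighted Partition Dependence Graph (WPG), determines the weight of each program block, and assigns suitable work to the host processor and the memory processor. Part (4) describes each analysis and transformation module in detail.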

Table 1. System parameters of the Intelligent Memory architecture

| P.Host | P.Mem | Bus & Memory |
|---|---|---|
| Freq: 800 MHz | Freq: 400 MHz | Bus Freq: 100 MHz |
| Dyn. issue Width: 6 | Static issue Width: 2 | PHost Mem RT: 262.5 ns |
| Integer unit num: 6 | Integer unit num: 2 | PMem Mem RT: 50.5 ns |
| Floating unit num: 4 | Floating unit num: 2 | Bus Width: 16 B |
| FLC_Type: WT | FLC_Type: WT | Mem_Data_Transfer: 16 |
| FLC_Size: 32 KB | L1 Size: 16 KB | Mem_Row_Width: 4K |
| FLC_Line: 64 B | L1 Line: 32 B | |
| Replace policy: LRU | Replace policy: LRU | |
| SLC_Type: WB | L2 Cache: N/A | |
| SLC_Size: 256 KB | Branch penalty: 2 | |
| SLC_Line: 64 B | PMem_Mem_Delay: 17 | |
| Replace policy: LRU | | |
| Branch penalty: 4 | | |
| PHost_Mem_Delay: 88 | | |

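(4) Results and Discussion

Over the three years of this project, we completed the basic analysis modules of the SAGE system and the optimization modules suited to the Intelligent Memory architecture, and worked to integrate all modules into SAGE. The theory and the experimental results are described below. As noted earlier, SAGE is suited to heterogeneous, tightly coupled multiprocessor architectures and is designed specifically for the intelligent memory environment. Its basic analysis and transformation modules are shown in Figure 2. The results of this project have also been published in international journals and conferences [1][2][3]. The main analysis modules are described in turn below.

[Module 1] Statement Splitting

In this module, we use loop distribution [1] to convert the original dependence graph into a node partition Π [1] and construct the Weighted Partition Dependence Graph (WPG), which is then used by the block weight evaluation and block scheduling (schedule determination) mechanisms. We first introduce some basic definitions related to loop distribution.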

Definition 1 (Loop Denotation) [7]

A loop is denoted $L = (I_1, I_2, \ldots, I_n)(S_1, S_2, \ldots, S_k)$, where $I_i$, $1 \le i \le n$, is a loop index and $S_j$, $1 \le j \le k$, is a body statement, which may be an assignment statement or another loop.

[Figure 2. Analysis and transformation modules of the SAGE system. Recoverable labels: Source Program; Statement Splitting; WPG Construction; Weight Table Look-up; decision "All weight values are determined" (Yes / No); Patch; Weight Table; Self-Patch Weight Evaluation; IMOP Recognition; Loop Splitting; Intelligent Memory Tiling; Block Execution Order Analysis and Scheduling; Subroutine for P.Host; Subroutine for P.Mem.]

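Definition 2 (Node Partition Π) [7]

On the dependence graph $G$ of a given loop $L$, let $S = \{S_1, S_2, \ldots, S_d\}$. A node partition $\Pi$ of $S$ places $S_k$ and $S_l$, $k \ne l$, in the same subset when $S_k \,\Delta\, S_l$ and $S_l \,\Delta\, S_k$, where $\Delta$ is an indirect data dependence relation. For a partition $\Pi = \{\pi_1, \pi_2, \ldots, \pi_n\}$, we define the partial ordering relations $\alpha$, $\bar{\alpha}$ and $\alpha^{o}$ as follows. For $i \ne j$:

1) $\pi_i \,\alpha\, \pi_j$ if and only if there exist $S_k \in \pi_i$ and $S_l \in \pi_j$ such that $S_k \,\delta\, S_l$, where $\delta$ denotes the true dependence relation.
2) $\pi_i \,\bar{\alpha}\, \pi_j$ if and only if there exist $S_k \in \pi_i$ and $S_l \in \pi_j$ such that $S_k \,\bar{\delta}\, S_l$, where $\bar{\delta}$ denotes the anti-dependence relation.
3) $\pi_i \,\alpha^{o}\, \pi_j$ if and only if there exist $S_k \in \pi_i$ and $S_l \in \pi_j$ such that $S_k \,\delta^{o}\, S_l$, where $\delta^{o}$ denotes the output dependence relation.

Node partitioning is usually applied in loop distribution so that the resulting loops are simple loops containing as few statements as possible, which makes further vectorizing transformations easier. In SAGE, we use it as the preliminary loop partitioning mechanism for producing the WPG.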
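One way to read Definition 2 is that each subset of Π collects the statements that can reach one another through dependences, that is, a strongly connected component of the dependence graph. The following is a minimal sketch of that reading, not SAGE's implementation: the function name `node_partition` and the edge-list input format are hypothetical, and a simple reachability search is used instead of a dedicated SCC algorithm.

```python
from collections import defaultdict


def node_partition(statements, deps):
    """Group statements that depend on each other, directly or indirectly.

    statements: list of statement names in program order (hypothetical format).
    deps: list of (source, sink) dependence edges between statements.
    """
    succ = defaultdict(set)
    for src, dst in deps:
        succ[src].add(dst)

    def reachable(start):
        # Every statement reachable from `start` through one or more edges.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {s: reachable(s) for s in statements}
    partition, assigned = [], set()
    for s in statements:
        if s in assigned:
            continue
        # t joins s's block when each can reach the other.
        block = [t for t in statements
                 if t == s or (t in reach[s] and s in reach[t])]
        partition.append(block)
        assigned.update(block)
    return partition


# S1 and S2 form a dependence cycle; S3 only consumes a value from S2.
deps = [("S1", "S2"), ("S2", "S1"), ("S2", "S3")]
print(node_partition(["S1", "S2", "S3"], deps))
```

In this example S1 and S2 end up in one block and S3 in its own, so loop distribution could place S3 in a separate loop.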

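[Module 2] WPG (Weighted Partition Dependence Graph) Construction

Definition 3 (Weighted Partition Dependence Graph) [1][2]

For a given node partition $\Pi$ as in Definition 2, we define a weighted partition dependence graph WPG(P, E) as follows. For each $\pi_i \in \Pi$ there is a node $b_i = \langle I_i, S_i, W_i, O_i \rangle \in P$, where $I_i$ and $S_i$ are as in Definition 1, $W_i$ is the weight of node $i$, written $W_i(PH, PM)$ with $PH$ and $PM$ the weights on P.Host and P.Mem respectively, and $O_i$ is the execution order of the node. If $b_i$ and $b_j$ are related by $\alpha$, $\bar{\alpha}$ or $\alpha^{o}$ as in Definition 2, there is an edge $e_{ij} \in E$ from $b_i$ to $b_j$, drawn as $\rightarrow$, $\xrightarrow{\text{anti}}$ or $\xrightarrow{O}$ respectively. Based on this definition, the WPG Construction module builds the WPG corresponding to the input source program for use by the later modules.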
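To make Definition 3 concrete, the sketch below shows one possible in-memory representation of a WPG. It is an illustration under assumptions: the class names `Block` and `WPG`, the field names, and the string labels for edge kinds are all hypothetical and are not taken from the SAGE source.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One WPG node b_i = <I_i, S_i, W_i, O_i> (field names are hypothetical)."""
    name: str
    indices: tuple           # loop indices I_i
    statements: tuple        # statements S_i
    weight: tuple = (0, 0)   # W_i(PH, PM): weights on P.Host and P.Mem
    order: int = 0           # O_i; 0 means "not yet assigned"


@dataclass
class WPG:
    blocks: dict = field(default_factory=dict)   # name -> Block
    edges: list = field(default_factory=list)    # (src, dst, kind)

    def add_block(self, block):
        self.blocks[block.name] = block

    def add_edge(self, src, dst, kind):
        # kind mirrors the three relations of Definition 2.
        if kind not in ("true", "anti", "output"):
            raise ValueError(f"unknown dependence kind: {kind}")
        self.edges.append((src, dst, kind))

    def predecessors(self, name):
        return sorted({src for src, dst, _ in self.edges if dst == name})


g = WPG()
g.add_block(Block("b1", ("I",), ("S1",), weight=(3, 6)))
g.add_block(Block("b2", ("I",), ("S2",), weight=(5, 1)))
g.add_edge("b1", "b2", "true")
print(g.predecessors("b2"))
```

Keeping the edge kind alongside each edge lets later passes distinguish true, anti and output dependences without recomputing them.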

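[Module 3] IMOP (Intelligent Memory Operation) Recognition

Because the memory processors in the Intelligent Memory architecture sit close to the memory and have short memory access times, they are well suited to operations over large amounts of data. If such data-intensive operations can be identified and handed to the memory processors, program execution can be sped up effectively. We give a preliminary definition of IMOP below.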

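Definition 4 (Intelligent Memory Operation, IMOP)

For a given block $b_i \in$ WPG(P, E), if $b_i$ satisfies the following inequality, it can be classified as an IMOP:

$$
\begin{aligned}
&\text{PH1\_AC} + \text{PH1\_MT} \times (\text{PH2\_AC} + \text{PH2\_MT} \times \text{PH2\_MC}) \\
&\quad + \text{PH2\_MT} \times (\text{PH\_MEMAC} + \text{PH\_MEMMT} \times \text{PH\_MEMMC}) + \text{OPW}(b_i) \\
&> \text{PM1\_AC} + \text{PM1\_MT} \times \text{PM1\_MC} \\
&\quad + \text{PM1\_MT} \times (\text{PM\_MEMAC} + \text{PM\_MEMMT} \times \text{PM\_MEMMC}) + \text{OPW}(b_i)
\end{aligned}
$$

where

PX1_AC is the P.Host/P.Mem L1 cache access cost of bi.
PH2_AC is the P.Host L2 cache access cost of bi.
PM1_MC is the P.Mem L1 cache miss cost of bi.
PX1_MT is the P.Host/P.Mem L1 cache miss times of bi.
PH2_MC is the P.Host L2 cache miss cost of bi.
PH2_MT is the P.Host L2 cache miss times of bi.
OPW(bi) is the operation weight of block bi.
PX_MEMAC and PX_MEMMT are the P.Host/P.Mem memory access cost and miss times.

When a program satisfies the inequality above, we say that it exhibits IMOP behavior. We designed a mechanism to recognize intelligent memory operations and find the blocks suited to execution on the memory processor, in order to speed up the program. IMOP recognition brings at least two benefits. First, it makes effective use of the potential performance of each kind of processor. Second, it saves the transformation time that would otherwise be spent seeking the best work distribution. Once a block is recognized as exhibiting IMOP behavior, we do not apply loop splitting to it, because that could destroy the IMOP behavior; the work-partitioning step can therefore be skipped for that block. The following example illustrates this: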

    program
      .
      do I = 1 to N
        .
        A(I) = B(I) + C(I) + …
        .
      end do
      .
    end program

Figure 3. Example program exhibiting IMOP behavior

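Figure 3 shows a simple program containing one loop whose body is a single statement made up of memory references, and Figure 4 shows its execution results. The more arrays (memory references) the statement contains, the larger the gap between the execution times on P.Host and P.Mem becomes. This tells us something important: the more data a program touches, the greater the benefit of handing the computation to P.Mem.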
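The IMOP test of Definition 4 is a direct comparison of two cost expressions, so it can be sketched as a small predicate. The code below transcribes the inequality term by term; the dictionary-based interface and all numeric values in the example are hypothetical and chosen only for illustration (the 88 and 17 cycle memory costs echo the memory delay entries of Table 1).

```python
def host_cost(c):
    """Left-hand side of the Definition 4 inequality (cost on P.Host)."""
    return (c["PH1_AC"]
            + c["PH1_MT"] * (c["PH2_AC"] + c["PH2_MT"] * c["PH2_MC"])
            + c["PH2_MT"] * (c["PH_MEMAC"] + c["PH_MEMMT"] * c["PH_MEMMC"])
            + c["OPW"])


def mem_cost(c):
    """Right-hand side of the Definition 4 inequality (cost on P.Mem)."""
    return (c["PM1_AC"]
            + c["PM1_MT"] * c["PM1_MC"]
            + c["PM1_MT"] * (c["PM_MEMAC"] + c["PM_MEMMT"] * c["PM_MEMMC"])
            + c["OPW"])


def is_imop(c):
    """A block is classified as an IMOP when it is costlier on P.Host."""
    return host_cost(c) > mem_cost(c)


# Hypothetical per-block measurements, for illustration only.
block_costs = {
    "PH1_AC": 1, "PH1_MT": 10, "PH2_AC": 4, "PH2_MT": 5, "PH2_MC": 10,
    "PH_MEMAC": 88, "PH_MEMMT": 0, "PH_MEMMC": 0,
    "PM1_AC": 1, "PM1_MT": 10, "PM1_MC": 2,
    "PM_MEMAC": 17, "PM_MEMMT": 0, "PM_MEMMC": 0,
    "OPW": 20,
}
print(host_cost(block_costs), mem_cost(block_costs), is_imop(block_costs))
```

With these made-up numbers the host-side cost is dominated by cache misses that fall through to memory, which is the situation the paragraph above describes for statements with many array references.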

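[Module 4] Self-Patch for Weight Evaluation

In a heterogeneous multiprocessor environment, each kind of instruction, such as branches, arithmetic operations and memory accesses, takes a different amount of time on different processors. The difference is even more pronounced in the Intelligent Memory architecture: the memory processor inside the memory chip has less computing power but faster memory access, and the host processor is the opposite. In the basic module design stage, we compute the weight of each block in the WPG with a static table look-up method that sums table entries.

Good work distribution requires an evaluation mechanism on which to base the partitioning. In the past, researchers have used a prediction method that executes a few key fragments of the program and uses the information obtained as the basis for analysis [18]. Its advantage is that it can evaluate the dynamic behavior of the program. However, it does not keep the information it has worked to obtain, so every time the same program is processed the same work must be repeated, which wastes a great deal of compile time. SAGE, in contrast, used a static table look-up method, which cannot adapt dynamically to different environments. In the third year we set out to correct this and proposed Self-Patch Weight Evaluation. This method combines the prediction method with static table look-up: it uses profiling to establish the initial parameter values of the weight table and retains this information. The benefit of stored weights is that later work distributions do not need to run the prediction method again, which avoids additional compile time. Before discussing Self-Patch Weight Evaluation, one point deserves attention: the same operator or memory reference may have different weights depending on the type or capability of the processor. We also need to decide what information the weight table should contain. First, it must record the weight of each operator. Second, it must be able to record memory access times. Third, it must be able to record dynamic program behavior (such as cache misses and branch penalties). A first version of the evaluation mechanism is shown in Algorithm 1; it consists of two main steps: weight evaluation for known operations and weight patching for unknown operations.

[Figure 4. Execution time of the IMOP program. x-axis: number of operators (2 to 7); y-axis: execution cycles (0 to 600000); series: P.Host execution time and P.Mem execution time.]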

Algorithm 1, Part 1: Self-Patch for Weight Evaluation

[Input]
    Weight_Table and WPG(P, E)

[Algorithm]
    For each bi ∈ WPG
        For each Si ∈ bi                    /* Si is a statement in loop i. */
            For each Opi ∈ Si               /* Opi is an operator in Si. */
                OP_find_out = Find_weight_value(pid, Weight_Table, 0, Opi);
                If (!OP_find_out)           /* Can't find the operator's weight. */
                    Call Patch(Weight_Table, Si, Opi);
                End if
                PH_OSum = PH_OSum + Get_weight_value(P.Host, Weight_Table, 0, Opi);
                PM_OSum = PM_OSum + Get_weight_value(P.Mem, Weight_Table, 0, Opi);
            End for
            For each Refi ∈ Si              /* Refi is the memory reference type. */
                PH_MSum = PH_MSum + Get_weight_value(P.Host, Weight_Table, 1, Refi);
                PM_MSum = PM_MSum + Get_weight_value(P.Mem, Weight_Table, 1, Refi);
            End for
        End for
        Wi(PH, PM) = { PH_OSum + PH_MSum + dynamic behavior cost,
                       PM_OSum + PM_MSum + dynamic behavior cost };
        /* Dynamic behavior cost is the cost of cache miss, condition or branch penalty. */
    End for
End
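The table look-up part of Algorithm 1 can be sketched as follows. This is an illustration under assumptions, not the SAGE code: the nested-dictionary layout of the weight table, the statement format, the function name `block_weight` and every weight value are hypothetical. Where the algorithm calls Patch for an unknown operator, this sketch simply raises an error.

```python
# Hypothetical weight table: per processor, operator weights and
# memory-reference weights. All numbers are made up for illustration.
WEIGHT_TABLE = {
    "P.Host": {"op": {"+": 2, "*": 3}, "ref": {"array": 5, "scalar": 1}},
    "P.Mem":  {"op": {"+": 4, "*": 6}, "ref": {"array": 2, "scalar": 1}},
}


def block_weight(block, table, dynamic_cost=(0, 0)):
    """Return W_i = (PH, PM) for a block.

    block: list of statements, each a dict with "ops" and "refs" lists.
    dynamic_cost: (host, mem) cost of cache misses and branch penalties.
    """
    totals = []
    for pid, extra in zip(("P.Host", "P.Mem"), dynamic_cost):
        total = extra
        for stmt in block:
            for op in stmt["ops"]:
                if op not in table[pid]["op"]:
                    # Algorithm 1 would call Patch here.
                    raise KeyError(f"{op!r} has no weight for {pid}")
                total += table[pid]["op"][op]
            for ref in stmt["refs"]:
                total += table[pid]["ref"][ref]
        totals.append(total)
    return tuple(totals)


# A(I) = B(I) + C(I) + D(I): two "+" operators and three array references.
stmt = {"ops": ["+", "+"], "refs": ["array", "array", "array"]}
print(block_weight([stmt], WEIGHT_TABLE))
```

With this made-up table the statement is cheaper on P.Mem because the array references dominate, even though each "+" costs more there.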

Algorithm 1, Part 2: Weight patching for unknown operations (Patch)

[Input]
    Weight_Table, Statement, and Operator

[Intermediate]
    existed_op, existed_op_weight;        /* A known operator and its weight */
    Exec_in_ph(Si);                       /* Execute statement Si in P.Host. */
    Exec_in_pm(Si);                       /* Execute statement Si in P.Mem. */
    Replace_optr(Si, Opi, existed_op);    /* Replace Opi by existed_op. */

[Subroutine]
    /* Execute Si in P.Host. */
    original = Exec_in_ph(Si);
    temp = Exec_in_ph(Replace_optr(Si, Opi, existed_op));
    ph_opw = original - temp + existed_op_weight;
    call Add_weight_value("P.Host", Weight_Table, 0, ph_opw);

    /* Execute Si in P.Mem. */
    original = Exec_in_pm(Si);
    temp = Exec_in_pm(Replace_optr(Si, Opi, existed_op));
    pm_opw = original - temp + existed_op_weight;
    call Add_weight_value("P.Mem", Weight_Table, 0, pm_opw);
End
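The patch step reduces to one subtraction: time the statement, time it again with the unknown operator swapped for a known one, and add the known weight back. The sketch below applies this to the report's own worked example ("+" has weight 2, the original statement takes 6, the substituted statement takes 4). The function `patch_operator`, the list-of-operators statement format and the `fake_measure` timing stub are hypothetical stand-ins for running on the simulator, and the sketch assumes the unknown operator occurs once in the statement.

```python
def patch_operator(weights, stmt_ops, unknown_op, known_op, measure):
    """Estimate and store the weight of `unknown_op` for one processor.

    weights: dict of operator -> known weight on that processor.
    stmt_ops: operators of the statement, e.g. ["+", "%"] for A = B + C % D.
    measure: callable returning the execution time of a list of operators
             (a hypothetical hook standing in for a simulator run).
    """
    original = measure(stmt_ops)
    replaced_ops = [known_op if op == unknown_op else op for op in stmt_ops]
    temp = measure(replaced_ops)
    weights[unknown_op] = original - temp + weights[known_op]
    return weights[unknown_op]


# Stand-in timing function: pretends "+" costs 2 and "%" costs 4.
TRUE_COST = {"+": 2, "%": 4}


def fake_measure(ops):
    return sum(TRUE_COST[op] for op in ops)


host_weights = {"+": 2}
print(patch_operator(host_weights, ["+", "%"], "%", "+", fake_measure))
```

Because the result is written back into the table, a later block that uses "%" can be weighted by look-up alone.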

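Although the algorithm above already describes how the weights are measured, we give a concrete example in Figure 5 to keep the discussion from being too abstract.

    Suppose "+" is an operator with a known weight of 2, and consider the statement
        A = B + C % D        ("%" is an unknown operator)
    Executing it gives an execution time of 6.
    Now replace the "%" operator in the original statement with "+":
        A = B + C + D
    and execute again, which gives an execution time of 4.
    The weight of "%" then follows from
        weight of "%" = 6 - 4 + 2 (the weight of "+")
        ∴ weight of "%" = 4

Figure 5. Example of patching an unknown operator

The patching mechanism gives the prediction method the ability to correct itself. With repeated training, the compile time spent on prediction decreases; ideally, the previously built weight table can be used directly, and each processor is given work that matches its capability so that it operates more efficiently.

[Module 5] Loop Splitting

In this part, we develop a different way of partitioning work: loop splitting. Statement splitting relies on a special scheduling mechanism to achieve good performance, but for some key blocks that could be processed in parallel, scheduling alone cannot keep P.Host and P.Mem busy at the same time. Loop splitting has the following advantages. First, statement splitting may produce large blocks when the data dependences among statements are complex; loop splitting can cut blocks with long execution times so that workload balancing becomes easier to achieve. Second, splitting a loop may narrow the range of data it accesses; combined with Intelligent Memory Tiling, this can raise locality across the loop and let the program benefit from data locality. Third, because the program has already been through statement splitting and is cut into blocks, the dependences among statements within each block are much simpler, which makes loop splitting easier.

The following simple loop splitting algorithm uses the block weight evaluation of the basic analysis modules to obtain the ratio of block weights and, according to this ratio, divides the work into two parts to be executed by P.Host and P.Mem respectively.

Algorithm 2: Loop Splitting

[Input]
    WPG(P, E)

[Algorithm]
{
    Step 1. Identify a block from the WPG to be split.
    Step 2. Compute the workload ratio from the weights of P.Host over P.Mem. (The weights can be obtained from the WPG.)
    Step 3. Split the iteration space of the block by this ratio into two blocks, and modify the WPG according to their dependence relations.
    Step 4. If there are more blocks to be split, go to Step 1.
}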
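Steps 2 and 3 of the loop splitting algorithm can be illustrated with a short sketch. It assumes one particular reading of "split by this ratio": each processor receives a share of the iterations inversely proportional to its per-iteration weight, so that both finish at about the same time. The function name `split_iterations` and its interface are hypothetical.

```python
def split_iterations(lower, upper, w_host, w_mem):
    """Split the inclusive iteration range [lower, upper] between P.Host and P.Mem.

    w_host, w_mem: per-iteration weights of the block on each processor.
    The processor with the smaller weight (the faster one) gets more iterations.
    Returns ((host_lo, host_hi), (mem_lo, mem_hi)); a range whose upper bound
    is below its lower bound is empty.
    """
    total = upper - lower + 1
    host_count = round(total * w_mem / (w_host + w_mem))
    host_range = (lower, lower + host_count - 1)
    mem_range = (lower + host_count, upper)
    return host_range, mem_range


# A block that costs 14 per iteration on P.Host and 9 on P.Mem, 23 iterations.
host_part, mem_part = split_iterations(1, 23, 14, 9)
print(host_part, mem_part)
```

In this example P.Host runs 9 iterations at weight 14 and P.Mem runs 14 iterations at weight 9, so both sides carry a load of 126.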

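[Module 6] Intelligent Memory Tiling

On a heterogeneous machine such as Intelligent Memory, various loop transformation techniques can be used to improve program performance by minimizing cache misses. P.Host has a 32 KB L1 cache and a 256 KB L2 cache (see Table 1), whereas P.Mem has only a 16 KB L1 cache and no L2 cache. For this unusual memory organization, we can apply tiles of different sizes to the different memory levels, with increased data locality as the key consideration. To make tiling more effective, we follow the Simultaneous Multi-level Tiling (SMT) method of Marta Jiménez [15]. Besides producing exact loop bounds, this tiling algorithm can tile several loop levels at once, each according to the memory level it suits. In addition to ordinary tiling, we introduce unroll-and-jam, an optimization that, together with scalar replacement, effectively reduces the number of memory accesses the program makes. The Intelligent Memory Tiling algorithm is given below.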

Algorithm 3: Intelligent Memory Tiling

[Input]
    WPG(P, E) and pid          /* pid is either P.Host or P.Mem. */

[Output]
    WPG(P, E) after SMT and unroll_and_jam have been applied.

[Intermediate]
    HL1_cache;                 /* P.Host level-one cache. */
    HL2_cache;                 /* P.Host level-two cache. */
    ML1_cache;                 /* P.Mem level-one cache. */
    unroll_and_jam(block, register file);

[Algorithm]
    For each bi ∈ WPG
        If (pid = P.Host) then
            bi = SMT(bi, HL1_cache, HL2_cache);
            Unroll_and_jam(Ii, PH_reg);
        Else
            If (the outermost through the innermost loops are fully permutable)
                bi = SMT(bi, ML1_cache);
                Unroll_and_jam(Ii, PM_reg);
            End if
        End if
    End for
End
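As a rough illustration of the idea behind tiling for caches of different sizes, the sketch below tiles a matrix multiplication and picks a tile edge from the cache capacity. It is not the SMT algorithm of [15] and does not perform unroll-and-jam: the square-tile heuristic in `tile_size_for` (three tiles of 8-byte elements must fit in the cache) is an assumption made for this example, and both function names are hypothetical. The 32 KB and 16 KB capacities are the L1 sizes of P.Host and P.Mem from Table 1.

```python
def tile_size_for(cache_bytes, elem_bytes=8, arrays=3):
    """Largest square tile edge such that `arrays` tiles fit in the cache.

    This is a simple capacity heuristic assumed for illustration only.
    """
    edge = 1
    while arrays * (edge + 1) ** 2 * elem_bytes <= cache_bytes:
        edge += 1
    return edge


def matmul_tiled(a, b, tile):
    """Multiply matrices a (n x m) and b (m x p) one tile at a time."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]   # scalar replacement of a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c


print(tile_size_for(32 * 1024), tile_size_for(16 * 1024))
print(matmul_tiled([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1))
```

The smaller L1 cache of P.Mem leads to a smaller tile edge than on P.Host, which is the kind of per-processor tile choice the algorithm above makes through its cache parameters.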

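[Module 7] Block Execution Order Analysis and Scheduling

In this module, we propose a scheduling algorithm for a system with one P.Host and one P.Mem. The weights of the partitioned blocks are determined first, and the execution order of each block is then decided from the dependence relations. Next comes wavefront scheduling: blocks that can execute concurrently are placed in the same wavefront. Finally, the blocks in each wavefront are assigned to P.Host or P.Mem according to their weights. Algorithm 4 gives the complete procedure.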

Algorithm 4: Block Execution Order Analysis and Scheduling

[Input]
    WPG = (P, E): the original weighted partition dependence graph, before the weights and orders of the blocks are determined.

[Output]
    An execution wavefront schedule WF = {Wf1, Wf2, ...}, where Wfi = {PH(ba...bb), PM(bc...bd)},
    in which PH(ba...bb) means that blocks ba...bb will be assigned to P.Host in wavefront i, and
    PM(bc...bd) means that blocks bc...bd will be assigned to P.Mem in wavefront i.

[Intermediate]
    W: a working set of blocks to be visited.
    wf_tmp, ph_sch, pm_sch: working sets of blocks for wavefront scheduling.
    ph_tmp(h), pm_tmp(m): working arrays to store the blocks for wavefront scheduling.
    max_wf: the maximum number of wavefronts.
    max_pred_O(bi): the maximum execution order over all of bi's predecessor blocks.
    min_pred_O(bi): the minimum execution order over all of bi's predecessor blocks.
    PHW(bi): the weight of bi for P.Host.
    PMW(bi): the weight of bi for P.Mem.

[Algorithm]
    /* Initialization and weight determination for each block */
    for each bi ∈ P do
        Oi = 0
    end for

    /* Execution order assignment */
    W = P
    for each bi with no predecessors do
        Oi = 1
        W = W - {bi}
    end for
    done = False
    max_wf = 1                          /* blocks with no predecessors form wavefront 1 */
    while done = False AND W ≠ φ do
        done = True
        for each bi ∈ W do
            if min_pred_O(bi) = 0 then
                done = False
            else
                Oi = max_pred_O(bi) + 1
                W = W - {bi}
                max_wf = max(max_wf, Oi)
            end if
        end for
    end while

    /* Scheduling */
    for j = 1 to max_wf
        store all bi whose Oi = j in wf_tmp
        h = m = 0
        for each bi ∈ wf_tmp
            if PHW(bi) - PMW(bi) ≤ 0
                h = h + 1
                store bi in ph_tmp(h)
            else
                m = m + 1
                store bi in pm_tmp(m)
            end if
        end for
        sort ph_tmp(1..h) in decreasing order by PHW(bi)
        sort pm_tmp(1..m) in decreasing order by PMW(bi)
        Token = PH
        ph_sch = pm_sch = φ
        p = q = ph_wei = pm_wei = 0
        while p < h OR q < m
            if Token = PH
                if p < h                /* ph_tmp is not empty */
                    p = p + 1
                    ph_sch = ph_sch + {ph_tmp(p)}
                    ph_wei = ph_wei + PHW(ph_tmp(p))
                else                    /* ph_tmp is empty, then use pm_tmp to achieve load-balance */
                    q = q + 1
                    ph_sch = ph_sch + {pm_tmp(q)}
                    ph_wei = ph_wei + PHW(pm_tmp(q))
                end if
            else                        /* Token = PM */
                if q < m                /* pm_tmp is not empty */
                    q = q + 1
                    pm_sch = pm_sch + {pm_tmp(q)}
                    pm_wei = pm_wei + PMW(pm_tmp(q))
                else                    /* pm_tmp is empty, then use ph_tmp to achieve load-balance */
                    p = p + 1
                    pm_sch = pm_sch + {ph_tmp(p)}
                    pm_wei = pm_wei + PMW(ph_tmp(p))
                end if
            end if
            if ph_wei ≥ pm_wei
                Token = PM
            else
                Token = PH
            end if
        end while
        Wfj = {ph_sch, pm_sch}
    end for
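The two phases of Algorithm 4 (execution order assignment, then token-based assignment inside each wavefront) can be sketched in Python as follows. This is an illustrative sketch rather than the SAGE implementation: the function names, the list and dictionary inputs, and the tie-breaking by input order are assumptions. The example weights are the statement weights of Table 2 in the appended paper, with one hypothetical extra block b4 that depends on b1 and b2.

```python
def execution_orders(blocks, edges):
    """O_i = 1 for blocks without predecessors, else 1 + max order of predecessors."""
    preds = {b: set() for b in blocks}
    for src, dst in edges:
        preds[dst].add(src)
    order = {b: 0 for b in blocks}
    pending = set(blocks)
    while pending:
        ready = [b for b in pending if all(order[p] > 0 for p in preds[b])]
        if not ready:
            raise ValueError("dependence cycle among blocks")
        new = {b: 1 + max((order[p] for p in preds[b]), default=0) for b in ready}
        order.update(new)
        pending -= set(ready)
    return order


def schedule_wavefront(wf_blocks, phw, pmw):
    """Assign the blocks of one wavefront to P.Host and P.Mem."""
    ph_tmp = sorted((b for b in wf_blocks if phw[b] - pmw[b] <= 0),
                    key=lambda b: -phw[b])
    pm_tmp = sorted((b for b in wf_blocks if phw[b] - pmw[b] > 0),
                    key=lambda b: -pmw[b])
    ph_sch, pm_sch = [], []
    ph_wei = pm_wei = 0
    token = "PH"
    while ph_tmp or pm_tmp:
        if token == "PH":
            # Prefer a host-friendly block; otherwise borrow one for load balance.
            b = ph_tmp.pop(0) if ph_tmp else pm_tmp.pop(0)
            ph_sch.append(b)
            ph_wei += phw[b]
        else:
            b = pm_tmp.pop(0) if pm_tmp else ph_tmp.pop(0)
            pm_sch.append(b)
            pm_wei += pmw[b]
        # The processor with the lighter accumulated load goes next.
        token = "PM" if ph_wei >= pm_wei else "PH"
    return ph_sch, pm_sch


def wavefront_schedule(blocks, edges, phw, pmw):
    order = execution_orders(blocks, edges)
    schedule = []
    for j in range(1, max(order.values(), default=0) + 1):
        wf = [b for b in blocks if order[b] == j]
        schedule.append(schedule_wavefront(wf, phw, pmw))
    return schedule


phw = {"b1": 3, "b2": 5, "b3": 6, "b4": 2}
pmw = {"b1": 6, "b2": 1, "b3": 2, "b4": 5}
edges = [("b1", "b4"), ("b2", "b4")]
for ph, pm in wavefront_schedule(["b1", "b2", "b3", "b4"], edges, phw, pmw):
    print("P.Host:", ph, " P.Mem:", pm)
```

In the first wavefront the block that is cheaper on P.Host goes to P.Host and the two memory-friendly blocks go to P.Mem, which matches the statement-splitting case discussed in the appended paper.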

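(5) Experimental Results

In this project we use UIUC's FlexRAM simulator as the experimental platform; the simulator reports execution time in cycles. We use the completed SAGE system to transform source programs into optimized programs for the intelligent memory architecture.

The experiments validate SAGE on a hardware configuration with one P.Host and one P.Mem. We use several benchmarks: strmm from BLAS3, swim and tomcatv from SPEC 95, and ep and FFT from NAS. The validation is run on an SGI Origin 200 using the SGI F77 compiler with the -O2 option for instruction-level optimization. In the data and charts, "Standard" is the time taken when P.Host alone executes all the work, and "Original" is the time taken when P.Mem alone executes all the work. "Standard SAGE Mode" overlaps execution on the two processors with work distributed by SAGE, but with the Loop Splitting and Intelligent Memory Tiling optimization modules turned off; it serves as the control for these two optimizations. "Enhanced SAGE Mode" also overlaps execution on the two processors, with all analysis and transformation options turned on. The data in Table 2 show that SAGE parallelizes and optimizes these programs well: Enhanced SAGE Mode achieves speedups of 1.21 to 2.07 over P.Host alone.

Table 2. Experimental results (execution cycles, and speedup relative to Standard)

| Benchmark | Standard | Original | Standard SAGE Mode | Enhanced SAGE Mode | Speedup, Standard SAGE Mode | Speedup, Enhanced SAGE Mode |
|---|---|---|---|---|---|---|
| strmm | 234331341 | 337237336 | n/a | 140942002 | n/a | 1.66 |
| swim | 188295086 | 258533896 | 142900464 | 116670133 | 1.32 | 1.61 |
| tomcatv | 380279967 | 455768117 | 391225016 | 297241276 | 0.97 | 1.27 |
| ep | 103006081 | 194402248 | n/a | 84813319 | n/a | 1.21 |
| fft | 4720700 | 16158182 | 5352920 | 2278611 | 0.88 | 2.07 |

Table 2 also shows that strmm and ep gain nothing from the Standard SAGE Mode transformation. The statements inside their loops are mostly in scalar-variable form; for SAGE to split them, these variables would have to undergo scalar expansion, which wastes memory space and in turn increases execution time. In addition, the speedup for tomcatv and ep is limited because these programs have little inherent parallelism, which constrains the achievable performance.

Figures 5 through 9 are execution-time bar charts for the five benchmarks strmm, swim, tomcatv, ep and fft. In them, "useful" is the time spent executing instructions, "sync" is the time spent waiting for the other processor, "memory" is the time spent on memory accesses, and "miscs" is miscellaneous execution time.

Figure 7 shows that, in a PIM environment, when a processor's workload does not match its capability, the other processor has to wait for its results, so the time spent on synchronization grows. Figures 7 and 9 also show that programs transformed under Enhanced SAGE Mode spend less time on memory accesses. This shows two things: first, the two optimizing transformations of Enhanced SAGE Mode make programs run more efficiently on P.Mem; second, they improve the locality of data accesses in the PIM system, which confirms that Intelligent Memory Tiling does improve program performance.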
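The speedup columns of Table 2 can be rechecked from the cycle counts. The short script below assumes that speedup is defined as the Standard (P.Host only) cycle count divided by the cycle count of the transformed program, which is the definition that reproduces the reported figures; the variable and function names are hypothetical.

```python
# Execution cycles copied from Table 2:
# (Standard, Original, Standard SAGE Mode, Enhanced SAGE Mode); None means n/a.
CYCLES = {
    "strmm":   (234331341, 337237336, None, 140942002),
    "swim":    (188295086, 258533896, 142900464, 116670133),
    "tomcatv": (380279967, 455768117, 391225016, 297241276),
    "ep":      (103006081, 194402248, None, 84813319),
    "fft":     (4720700, 16158182, 5352920, 2278611),
}


def speedups(row):
    """Return (speedup in Standard SAGE Mode, speedup in Enhanced SAGE Mode)."""
    standard, _original, std_sage, enh_sage = row

    def ratio(cycles):
        return None if cycles is None else standard / cycles

    return ratio(std_sage), ratio(enh_sage)


for name, row in CYCLES.items():
    std, enh = speedups(row)
    std_text = "n/a" if std is None else f"{std:.2f}"
    print(f"{name:8s} standard SAGE: {std_text:>4s}  enhanced SAGE: {enh:.2f}")
```

The recomputed ratios agree with the reported speedups to within 0.01; the small differences come from rounding versus truncation of the second decimal place.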

[Figure 5. Execution-time bar chart for strmm. Bars: Standard, Original, Standard SAGE, Enhanced SAGE; y-axis: execution cycles; stacked components: miscs, memory, sync, useful.]

[Figure 6. Execution-time bar chart for swim. Bars: Standard, Original, Standard SAGE, Enhanced SAGE; y-axis: execution cycles; stacked components: miscs, memory, sync, useful.]

[Figure 7. Execution-time bar chart for tomcatv. Bars: Standard, Original, Standard SAGE, Enhanced SAGE; y-axis: execution cycles; stacked components: miscs, memory, sync, useful.]

[Figure 8. Execution-time bar chart for ep. Bars: Standard, Original, Standard SAGE, Enhanced SAGE; y-axis: execution cycles; stacked components: miscs, memory, sync, useful.]

[Figure 9. Execution-time bar chart for fft. Bars: Standard, Original, Standard SAGE, Enhanced SAGE; y-axis: execution cycles; stacked components: miscs, memory, sync, useful.]

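(6) Conclusion

Intelligent Memory architectures have been around for some time. Their main purpose is to narrow the gap between memory access capability and processor computing capability. Earlier research concentrated on proposing new intelligent memory architectures; from the compiler's point of view, however, program transformation and optimization techniques are needed to raise program performance and exploit the hardware. In this project we developed several such program analysis and transformation techniques: statement splitting, WPG construction, intelligent memory operation recognition, self-patch weight evaluation, loop splitting, intelligent memory tiling, and block execution order analysis and scheduling. The experimental data confirm that these transformations speed up program execution considerably and exploit the characteristics and strengths of the intelligent memory architecture.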

IV. References

[1] Huang, T. C., and Chu, S. L.: A Statement Based Parallelizing Framework for Processor-in-Memory Architectures; Information Processing Letters; Vol. 85-3, Elsevier Science, Feb. (2003), pp. 159-163.

[2] Huang, T. C., Chu, S. L., and Shu, Y. W.: A List-based Low Power Scheduling Technique for Intelligent Memory System. In Proc. Second International Conference on Information and Management Sciences, Chengdu, China, Aug. (2003), pp. 24-30.

[3] Chu, S. L., Huang, T. C., and Lee, L. C.: Improving Workload Balance and Code Optimization in Processor-in-Memory Systems. In Proc. 8th International Conference on Parallel and Distributed Systems, KyongJu City, Korea, Jun. (2001), pp. 273-278.

[4] Kang, Y., Huang, W., Yoo, S., Keen, D., Ge, Z., Lam, V., Pattnaik, P., and Torrellas, J.: FlexRAM: Toward an Advanced Intelligent Memory System. International Conference on Computer Design (ICCD), Austin, Texas, Oct. (1999).

[5] Oskin, M., Chong, F. T., and Sherwood, T.: Active Page: A Computation Model for Intelligent Memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, (1998), pp. 192-203.

[6] Granacki, J. et al.: Data Intensive Architecture: DIVA. http://www.isi.edu/asd/diva/, (1998).

[7] Kuck, D. J.: A survey of parallel machine organization and programming. ACM Comput. Surv. 9, 1, Mar. (1977), pp. 29-59.

[8] Yoo, S. M., Renau, J., Huang, M., and Torrellas, J.: FlexRAM Architecture Design Parameters. Technical Report 1584, Oct. (2000).

[9] Veenstra, J., and Fowler, R.: MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors. In MASCOTS'94, Jan. (1994), pp. 201-207.

[10] Judd, D., and Yelick, K.: Exploiting On-Chip Memory Bandwidth in the VIRAM Compiler. In Proceedings of the 2nd Workshop on Intelligent Memory Systems, Cambridge, MA, Nov. 12, (2000).

[11] Moritz, C. A., Frank, M., and Amarasinghe, S.: FlexCache: A Framework for Flexible Compiler Generated Data Caching. In Proceedings of the 2nd Workshop on Intelligent Memory Systems, Cambridge, MA, Nov. 12, (2000).

[12] Veidenbaum, A. V., Tang, W., Gupta, R., Nicolau, A., and Ji, X.: Adapting cache line size to application behavior. In Proceedings ICS'99, Jun. (1999).

[13] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P.: Numerical Recipes in Fortran 77. Cambridge University Press, (1992).

[14] Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C., Thomas, R., and Yelick, K.: A Case for Intelligent DRAM. IEEE Micro, Mar./Apr., (1997), pp. 33-44.

[15] Jiménez, M.: Multilevel Tiling for Non-Rectangular Iteration Spaces. Ph.D. Thesis, Departamento de Arquitectura de Computadores, Universitat Politècnica de Catalunya, May (1999).

[16] National Science Council project: Design of a New Parallelization Technique for Intelligent Memory Architectures, NSC89-2213-E-110-063.

[17] National Science Council project: A Statement-Based Parallelizing System for Intelligent Memory Architectures, NSC90-2213-E-110-034.

[18] Wang, K. Y.: Precise compile-time performance prediction for superscalar-based computers. Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, (1994), pp. 73-84.

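V. Self-Evaluation of Project Results

With the support of this project, we were able to develop the optimization and parallelization system needed for the Intelligent Memory architecture, and we have presented our preliminary results at international venues [1][2][3]. Over three years, we used a variety of test programs to improve the practicality and completeness of each analysis stage, and implemented an automated parallelizing compiler system. Building on earlier development experience, we also developed several program optimization techniques suited to the Intelligent Memory architecture to improve SAGE's analysis and transformation. The three main development stages were:

1. Developing the basic analysis modules of SAGE: continuing the statement-based viewpoint proposed earlier, in place of the iteration-based analysis used by previous parallelizing systems, we designed the SAGE modules, including statement dependence analysis, statement splitting, the WPG (Weighted Partition Dependence Graph) generator, block weight evaluation, block execution order analysis, block scheduling, and finally the source code generator.

2. Developing optimization modules suited to the Intelligent Memory architecture: after the basic analysis modules were complete, and continuing our earlier work on the Flex-Tiling optimization module, we developed several optimization techniques for the Intelligent Memory architecture to speed up program execution, including loop splitting, intelligent memory tiling, and intelligent memory operation recognition.

3. Integrating and validating the analysis modules and optimization techniques above, and extending the analysis and transformation modules with a more accurate block weight evaluation mechanism, Self-Patch Weight Evaluation, to improve SAGE's scheduling transformations.

Finally, using the FlexRAM simulator as the experimental platform, we applied these techniques to optimize and parallelize representative benchmarks from SPEC95, Perfect and NAS, to validate the feasibility of the proposed methods and strengthen the practicality of the Intelligent Memory architecture. The concrete outcomes of the project are:

1. Optimization and parallelization techniques designed for the Intelligent Memory architecture.
2. Experience in developing parallelization and optimization techniques for the Intelligent Memory architecture.
3. Increased practicality of the Intelligent Memory architecture, effectively reducing the problem of the speed gap between processor and memory.
4. Command of the key techniques for developing parallelizing compilers on the Intelligent Memory architecture, helping researchers in Taiwan gain a competitive edge as this architecture develops.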


Information Processing Letters 85 (2003) 159–163

www.elsevier.com/locate/ipl

A statement based parallelizing framework for processor-in-memory architectures

Tsung-Chuan Huang∗, Slo-Li Chu

Department of Electrical Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan, R.O.C.

Received 12 June 2001; received in revised form 6 March 2002 Communicated by M. Yamashita

Keywords: Processor-in-memory; Statement analysis; SAGE; Parallelizing compiler; FlexRAM; Scheduling

1. Introduction

It is widely known that current memory architecture is one of the bottlenecks for high-performance computers because of the increasing gap between processor speed and memory latency. For this reason, several architectures, called intelligent memory (IRAM) or processor-in-memory (PIM), have been studied in recent years with the aim of integrating the processor and memory. A merit of the PIM architecture is that PIM chips can replace the main memory chips in a workstation and act as co-processors when the main processor spawns them. This approach has been adopted by Active Page, DIVA, and FlexRAM, among others.

This class of architectures provides a hierarchical hybrid multiprocessor environment consisting of host (main) processors and memory processors. The host processor is more powerful, with a deep cache hierarchy and higher memory access latency. By contrast, memory processors are usually less powerful but have lower memory access latency. The major problems we address in this paper are how to dispatch suitable tasks to the different processors in a PIM system according to their computing power and characteristics so as to reduce their idle time, and how to partition the original program so that it executes simultaneously on this mixture of heterogeneous processors. Based on our earlier work [3,5], we propose the SAGE (Statement-Analysis-Grouping-Evaluation) system to analyze the source program, generate a Weighted Partition Dependence Graph (WPG), determine the weight of each block, and then dispatch the most suitable jobs to the host and memory processors, respectively. In our experiments we obtain quite good speedup, which even exceeds the computation capability ratio in an environment with one host processor and one memory processor.

* Corresponding author.

2. Intelligent memory architecture

A general view of the FlexRAM [7] architecture, which is the platform adopted in this paper, is shown in Fig. 1. The host processor of the target machine is called P.Host. Some representative parameters of this architecture are listed in Table 1. To reduce the complexity of exploiting the benefits of PIM architectures, in this paper we consider only a system with a single host processor (P.Host) and a single memory processor (P.Mem).

Fig. 1. The organization of simplified FlexRAM architecture.

Table 1
Major parameters of the FlexRAM

| Parameters | P.Host | P.Mem |
|---|---|---|
| Working frequency | 800 MHz | 400 MHz |
| Superscalar type | Out-of-order | In-order |
| Superscalar issue width | 6 | 2 |
| Integer unit number | 6 | 2 |
| Floating unit number | 4 | 2 |
| First level cache size | 32 KB | 16 KB |
| Second level cache size | 256 KB | N/A |
| Branch penalty cycles | 4 | 2 |
| Memory delay cycles | 88 | 17 |
| Memory access latency | 262.5 ns | 50.5 ns |

3. System organization

Current research on parallelizing compilers follows two major directions. One targets tightly-coupled homogeneous environments, as in SUIF and Polaris. These compiler systems address how to transform loops so that all or some of the iterations can be executed concurrently, i.e., they treat the problem from the iteration viewpoint. This approach suits homogeneous multiprocessor systems, but it has an obvious flaw on heterogeneous multiprocessor platforms: the behaviors of the iterations are similar, while the capabilities of the heterogeneous processors are different. The other kind of parallelizing compiler targets heterogeneous multicomputers, as in TINPAR [4] and PARADIGM [2]. These compilers transform programs into a coarse-grained parallel programming model for loosely-coupled heterogeneous multicomputers, but cannot deal with a medium-grained parallel programming model on a tightly-coupled mixture of heterogeneous processors such as PIM. Hence we choose a different approach, using statements as our basic unit of analysis. In this approach, statements are assigned different weights for different processors and are scheduled to the most suitable processor according to those weights. Our SAGE system has two major advantages. First, the partitioning technique provides a new medium-grained parallelizing approach. Unlike the conventional iteration based systems, the statement based approach can exploit more of the capability difference in heterogeneous multiprocessors. Second, the heuristic scheduling mechanism in SAGE can generate appropriate schedules and dispatch tasks to P.Host and P.Mem according to their capabilities and the overall load balance.

Table 2
A simple fully parallelizable program

| Program | Weight* for PH | Weight* for PM |
|---|---|---|
| DO I = 1 to N | | |
| S1: A = A mod B | 3 | 6 |
| S2: C = D[I] + E | 5 | 1 |
| S3: F = G[I] + H[I] | 6 | 2 |
| ENDDO | | |

* Weight: normalized execution time. PH: host processor. PM: memory processor.

Table 2 shows a simple example that demonstrates the benefits of the statement-based parallelizing technique. The program is fully parallelizable and can be partitioned by statements or by iterations. The statements' weights for P.Host and P.Mem are also provided. Table 3 lists five parallelization cases for the program in Table 2 and their execution times. The first two cases execute the program solely on the host processor and solely on the memory processor, respectively. Case 3 parallelizes the program with a conventional parallelizing compiler, such as SUIF or Polaris. Such compilers determine the parallel loops and dispatch their iterations evenly to the processors. This approach achieves a good speedup only on homogeneous processors. Case 4 assumes that the parallelizing compiler can dispatch iterations to the processors according to the processors' capabilities, but does not consider the discrepancy in executing each statement on different processors. Case 5 is the statement-based approach, which has the best performance in this example because the processors possess different capabilities. That is why we developed the SAGE system for PIM environments.

Table 3
Five cases and execution times for the example in Table 2

No  Description                            Execution time
1   P.Host only                            Latency = [PH(S1) + PH(S2) + PH(S3)] * iteration # = (3 + 5 + 6)N = 14N
2   P.Mem only                             Latency = [PM(S1) + PM(S2) + PM(S3)] * iteration # = (6 + 1 + 2)N = 9N
3   Conventional parallelizing compilers   Latency = max((3 + 5 + 6) * 0.5N, (6 + 2 + 1) * 0.5N) = 7N
4   Parallelizing by iteration splitting   Dispatch workload in proportion to the capability ratio of PH and PM. PH : PM = 9 : 14. Latency = 14 * (9/23)N = 5.48N
5   Parallelizing by statement splitting   Latency = max(PH(S1) * N, PM(S2, S3) * N) = 3N (here S1 is more suitable for PH, whereas S2 and S3 are more suitable for PM)
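The five latencies in Table 3 can be reproduced from the weights in Table 2 with a few lines of arithmetic. The following Python sketch is only an illustration: the function name `table3_latencies` and the dictionary layout are assumptions made for this example, and all latencies are expressed per iteration (i.e., as multiples of N).

```python
# Weights from Table 2 (normalized execution time per statement).
PH = {"S1": 3, "S2": 5, "S3": 6}   # host processor
PM = {"S1": 6, "S2": 1, "S3": 2}   # memory processor


def table3_latencies(ph, pm, host_statements):
    """Return the latency of each case in Table 3 as a multiple of N.

    host_statements: statements assigned to P.Host in case 5;
    the remaining statements run on P.Mem.
    """
    host_total = sum(ph.values())          # all statements on P.Host
    mem_total = sum(pm.values())           # all statements on P.Mem
    case1 = host_total
    case2 = mem_total
    # Case 3: iterations split evenly, the slower side dominates.
    case3 = max(0.5 * host_total, 0.5 * mem_total)
    # Case 4: iterations split so that both sides finish together,
    # i.e. P.Host receives mem_total / (host_total + mem_total) of them.
    case4 = host_total * mem_total / (host_total + mem_total)
    # Case 5: statements split between the two processors.
    host_part = sum(ph[s] for s in host_statements)
    mem_part = sum(pm[s] for s in pm if s not in host_statements)
    case5 = max(host_part, mem_part)
    return [case1, case2, case3, case4, case5]


latencies = table3_latencies(PH, PM, host_statements={"S1"})
print([round(x, 2) for x in latencies])
```

With S1 on P.Host and S2, S3 on P.Mem, both processors need 3 time units per iteration, which is why case 5 is the fastest in this example.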

3.1. Statement splitting and WPG construction

In SAGE, we use Loop Distribution to split the original dependence graph into a Node Partition Π [1], then construct the Weighted Partition Dependence Graph (WPG), which is used in the subsequent weight evaluation and schedule determination stages. First, we introduce some definitions for partitioning loops.

Definition 1 (Node partition Π) [1]. On the dependence graph G, for a given loop L, we define a node partition $\Pi$ of $\{S_1, S_2, \ldots, S_d\}$ in such a way that $S_k$ and $S_l$, $k \neq l$, are in the same subset if and only if $S_k \,\Delta\, S_l$ and $S_l \,\Delta\, S_k$, where $\Delta$ is an indirect data dependence relation.

Definition 2 (Weighted partition dependence graph) [3,5]. For a given node partition $\Pi$ as in Definition 1, we define a weighted partition dependence graph WPG(P, E). For each $\pi_i \in \Pi$, there is a node $b_i\langle I_i, S_i, W_i, O_i\rangle \in P$, where $I_i$ denotes the loop index and $S_i$ represents the body statements; $W_i$ is the weight of node $i$ in the form $W_i(PH, PM)$, with PH and PM indicating the weights for P.Host and P.Mem, respectively; and $O_i$ is the execution order of this node. There is an edge $e_{ij} \in E$ from $b_i$ to $b_j$ if $b_i$ and $b_j$ have one of the dependence relations $\alpha$, $\alpha^{\wedge}$ and $\alpha^{\circ}$, as in Definition 1, which are respectively denoted by $\rightarrow$, $\xrightarrow{\text{anti}}$, and $\xrightarrow{o}$.

According to these two definitions, we developed a formal method in [5] to partition the loops.
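Definition 1 groups two statements exactly when each depends, directly or indirectly, on the other, which corresponds to the strongly connected components of the dependence graph. The Python sketch below illustrates this reading on a small hand-made graph. It is not the formal method of [5]; the function names `node_partition` and `partition_edges` and the example dependences are assumptions made for illustration.

```python
def node_partition(statements, deps):
    """Group statements that are mutually (indirectly) dependent.

    statements: statement names in lexical order.
    deps: iterable of (source, sink) direct dependences.
    """
    reach = {s: set() for s in statements}
    for src, dst in deps:
        reach[src].add(dst)
    # Warshall-style transitive closure: reach[i] becomes every
    # statement that depends on i directly or indirectly.
    for k in statements:
        for i in statements:
            if k in reach[i]:
                reach[i] |= reach[k]
    blocks, assigned = [], set()
    for s in statements:
        if s in assigned:
            continue
        block = [s] + [t for t in statements
                       if t != s and t in reach[s] and s in reach[t]]
        assigned.update(block)
        blocks.append(block)
    return blocks


def partition_edges(blocks, deps):
    """Edges between blocks, as (source block index, sink block index)."""
    owner = {s: i for i, blk in enumerate(blocks) for s in blk}
    return sorted({(owner[a], owner[b]) for a, b in deps
                   if owner[a] != owner[b]})


# Hypothetical loop body: S2 and S3 form a dependence cycle.
stmts = ["S1", "S2", "S3", "S4"]
deps = [("S1", "S2"), ("S2", "S3"), ("S3", "S2"), ("S3", "S4")]
blocks = node_partition(stmts, deps)
edges = partition_edges(blocks, deps)
print(blocks, edges)
```

Each resulting block would become one node of the WPG, and the edges between blocks form an acyclic graph, which is what the later scheduling stage relies on.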

3.2. Weight evaluation

In PIM environments, each operation (e.g., branch, arithmetic, and memory operation) takes a different amount of execution time on different processors, because the memory processor has less computation power but a shorter memory access latency than the host processor. The typical representative parameters of FlexRAM are shown in Table 1. In order to predict the normalized execution time (i.e., weight) of each block b_i in the WPG, we proposed a static weight evaluation mechanism that obtains the weights of BRANCH, INTEGER, FLOATING POINT, and MEMORY operations for P.Host and P.Mem, respectively. For the detailed mechanism of weight evaluation, please refer to [3].
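As a rough illustration of static weight evaluation, a block's weight for each processor can be estimated by counting its operations per category and multiplying by a per-category cost. The sketch below is an assumption-laden simplification and not the mechanism of [3]: the cost values in `HYPOTHETICAL_COST` are invented for this example and are not the FlexRAM parameters of Table 1, and the function name `weight_determine` is likewise illustrative.

```python
# Invented per-operation costs, for illustration only.
# They merely reflect the qualitative statement in the text:
# P.Mem computes more slowly but accesses memory faster.
HYPOTHETICAL_COST = {
    "P.Host": {"BRANCH": 1, "INTEGER": 1, "FLOATING_POINT": 2, "MEMORY": 10},
    "P.Mem":  {"BRANCH": 2, "INTEGER": 2, "FLOATING_POINT": 6, "MEMORY": 2},
}


def weight_determine(op_counts, cost=HYPOTHETICAL_COST):
    """Return the pair (PH weight, PM weight) for one block.

    op_counts maps an operation category to the number of such
    operations in the block body.
    """
    ph = sum(n * cost["P.Host"][op] for op, n in op_counts.items())
    pm = sum(n * cost["P.Mem"][op] for op, n in op_counts.items())
    return ph, pm


# A memory-bound block: two integer operations, three memory accesses.
memory_bound = weight_determine({"INTEGER": 2, "MEMORY": 3})
# A compute-bound block: four floating-point operations, one memory access.
compute_bound = weight_determine({"FLOATING_POINT": 4, "MEMORY": 1})
print(memory_bound, compute_bound)
```

Under these invented costs the memory-bound block is lighter on P.Mem and the compute-bound block is lighter on P.Host, which is the kind of asymmetry the scheduler exploits.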

3.3. Wavefront generation and schedule determination

In this section, we propose an algorithm to schedule the blocks for P.Host and P.Mem. In our method, the weights of the blocks in partition Π are computed first; then the execution order of each block is determined according to the dependence relations and the lexicographic order. The algorithm partitions the set of blocks into subsets called wavefronts. The blocks in a wavefront can be executed in parallel, i.e., no two blocks within a wavefront depend on each other, but the wavefronts must be carried out in sequence. Finally, the blocks in the same wavefront are scheduled to P.Host and P.Mem according to their weights and the overall load balance.
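The wavefront scheduling just described, which Algorithm 1 states formally, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are illustrative, the weights are passed in directly rather than computed by Weight_Determine, and the best split of each wavefront is found by exhaustive search, which is only practical for small wavefronts.

```python
from itertools import combinations


def assign_orders(blocks, edges):
    """Order 1 for blocks without predecessors, else 1 + max predecessor order."""
    preds = {b: set() for b in blocks}
    for src, dst in edges:
        preds[dst].add(src)
    order = {b: 0 for b in blocks}          # 0 means "not yet assigned"
    work = list(blocks)
    while work:
        ready = [b for b in work if all(order[p] != 0 for p in preds[b])]
        if not ready:
            raise ValueError("dependence graph contains a cycle")
        for b in ready:
            order[b] = 1 + max((order[p] for p in preds[b]), default=0)
        work = [b for b in work if order[b] == 0]
    return order


def split_wavefront(wavefront, weights):
    """Pick the split minimizing |PHW(a) - PMW(b)| + max(PHW(a), PMW(b))."""
    best = None
    for r in range(len(wavefront) + 1):
        for a in combinations(wavefront, r):
            b = [x for x in wavefront if x not in a]
            phw = sum(weights[x][0] for x in a)
            pmw = sum(weights[x][1] for x in b)
            cost = abs(phw - pmw) + max(phw, pmw)
            if best is None or cost < best[0]:
                best = (cost, list(a), b)
    return {"PH": best[1], "PM": best[2]}


def sage_schedule(blocks, edges, weights):
    """Return one {"PH": [...], "PM": [...]} assignment per wavefront."""
    order = assign_orders(blocks, edges)
    max_wf = max(order.values(), default=0)
    return [split_wavefront([b for b in blocks if order[b] == j], weights)
            for j in range(1, max_wf + 1)]


# Weights (PH, PM) taken from Table 2; the statements are independent.
W = {"S1": (3, 6), "S2": (5, 1), "S3": (6, 2)}
flat = sage_schedule(["S1", "S2", "S3"], [], W)
# Same weights with a hypothetical dependence S1 -> S2.
chained = sage_schedule(["S1", "S2", "S3"], [("S1", "S2")], W)
print(flat, chained)
```

For the independent statements of Table 2 the sketch places S1 on P.Host and S2, S3 on P.Mem, matching case 5 of Table 3.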

Algorithm 1 (Weighting and Scheduling Algorithm).

[Input]

WPG = (P, E): the original weighted partition dependence graph before the weights and orders of blocks are determined.

[Output]

An execution wavefront schedule Wf = {wf_1, wf_2, ...}, where wf_i = {PH(b_a, ..., b_b), PM(b_c, ..., b_d)}, in which PH(b_a, ..., b_b) means that blocks b_a, ..., b_b will be assigned to P.Host and PM(b_c, ..., b_d) means that blocks b_c, ..., b_d will be assigned to P.Mem, both in wavefront i.

[Intermediate]

work: a working set of nodes ready to be visited.
wf_tmp: a working set for wavefront scheduling.
max_wf: the maximum wavefront number.
max_pred_O(b_i): the maximum execution order for all b_i's predecessor blocks.
min_pred_O(b_i): the minimum execution order for all b_i's predecessor blocks.
PHW(b_i), PMW(b_i): the weights of b_i for P.Host and P.Mem, respectively.
Weight_Determine(b_i): the subroutine performing the weight evaluation scheme presented in Section 3.2.

[Algorithm]

/* Initialization and weight determination for each block */
for each b_i ∈ P do
    W_i(PH, PM) = Weight_Determine(b_i)
    O_i = 0
end for

/* Execution order assignment */
work = P
for each b_i with no predecessors do
    O_i = 1
    work = work − {b_i}
end for
/* blocks without predecessors already form wavefront 1 */
done = False, max_wf = 1
while work ≠ ∅ do
    for each b_i ∈ work do
        /* all predecessors of b_i have been assigned an order */
        if min_pred_O(b_i) ≠ 0 then
            O_i = max_pred_O(b_i) + 1
            work = work − {b_i}
            max_wf = max(max_wf, O_i)
        end if
    end for
end while

/* Scheduling */
for j = 1 to max_wf
    store all b_i with O_i = j in wf_tmp
    done = False
    while done = False do
        divide wf_tmp into two subsets a, b such that a ∪ b = wf_tmp and a ∩ b = ∅
        if |PHW(a) − PMW(b)| + max(PHW(a), PMW(b)) is minimum for all possible a and b then
            wf(j) = {PH(a), PM(b)}
            done = True
        end if
    end while
end for

4. Experimental results

The code generated by SAGE targets the FlexRAM simulator [7] developed by the IA-COMA Lab at UIUC. This simulation environment models a dynamic superscalar multiprocessor and detailed memory behavior cycle by cycle. The major configuration parameters are shown in Table 1. We evaluated three applications: swim from SPEC95, strmm from BLAS3, and fft from Numerical Recipes [6]. Table 4 shows the execution cycles for these three applications. "P.Host exec. cycles" denotes running on P.Host only; "P.Mem exec. cycles" denotes running on P.Mem only; and "Optimized cycles" denotes running on the PIM platform after the applications are transformed by our SAGE system.

Table 4
Performance results of three programs

Benchmarks   P.Host exec. cycles   P.Mem exec. cycles   Optimized cycles   Speedup
swim                    86144754            662909725           64190746     1.342
strmm                  107235775            875029491           76868804     1.395


T.-C. Huang, S.-L. Chu / Information Processing Letters 85 (2003) 159–163

According to Table 4, the approximate performance ratio between P.Host and P.Mem is 8:1. This means that if P.Mem works concurrently with P.Host, the theoretical speedup is at most 1 + 1/8 = 1.125. However, the measured speedups exceed 1.125. The explanation is that P.Mem has a shorter memory access latency than P.Host. This confirms the major objective of intelligent memory systems: to reduce the performance gap between the processor and memory. The results also demonstrate that SAGE can extend the advantages of PIM architectures.
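The 1.125 bound and its comparison with the measured speedup can be checked with a short calculation. The Python sketch below uses the swim row of Table 4; the function name `ideal_speedup` is an assumption for this example, and the bound assumes perfectly divisible work and no difference in memory access latency between the two processors.

```python
# Cycle counts copied from the swim row of Table 4.
P_HOST_CYCLES = 86144754
P_MEM_CYCLES = 662909725
OPTIMIZED_CYCLES = 64190746


def ideal_speedup(ratio):
    """Speedup bound over P.Host alone when a helper `ratio` times slower joins.

    The two processors together deliver 1 + 1/ratio times the
    throughput of P.Host alone.
    """
    return 1.0 + 1.0 / ratio


measured_ratio = P_MEM_CYCLES / P_HOST_CYCLES        # P.Host : P.Mem performance
measured_speedup = P_HOST_CYCLES / OPTIMIZED_CYCLES  # speedup column of Table 4
print(round(measured_ratio, 1), ideal_speedup(8), round(measured_speedup, 3))
```

The measured speedup exceeds the throughput-only bound, consistent with the explanation that P.Mem gains additional time from its shorter memory access latency.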

References

[1] D.J. Kuck, A survey of parallel machine organization and programming, ACM Comput. Surv. 9 (1) (1977) 29–59.

[2] E. Su, A. Lain, S. Ramaswamy, D.J. Palermo, E.W. Hodges, P. Banerjee, Advanced compilation techniques in the PARADIGM compiler for distributed-memory multicomputers, in: Proc. 1995 9th ACM International Conference on Supercomputing, 1995.

[3] S.L. Chu, T.C. Huang, L.C. Lee, Improving workload balance and code optimization in processor-in-memory systems, in: Proc. 8th International Conference on Parallel and Distributed Systems, KyongJu City, Korea, June 2001, pp. 273–278.

[4] S. Goto, A. Kubota, T. Tanaka, M. Goshima, S. Mori, H. Nakashima, S. Tomita, Optimized code generation for heterogeneous computing environment using parallelizing compiler TINPAR, in: Proc. 1998 International Conference on Parallel Architectures and Compilation Techniques, 1998, pp. 426–433.

[5] T.C. Huang, S.L. Chu, A new analysis approach for intelligent memory systems, in: Proc. ISCA 16th International Conference on Computers and Their Applications, Seattle, WA, March 2001, pp. 452–457.

[6] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in Fortran 77, Cambridge University Press, 1992.

[7] Y. Kang, W. Huang, S. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, J. Torrellas, FlexRAM: Toward an advanced intelligent memory system, in: Proc. International Conference on Computer Design, Austin, TX, October 1999.
