Department of Electrical Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan, R.O.C.
Received 12 June 2001; received in revised form 6 March 2002 Communicated by M. Yamashita
Keywords: Processor-in-memory; Statement analysis; SAGE; Parallelizing compiler; FlexRAM; Scheduling
1. Introduction
It is widely known that the current memory architecture is one of the bottlenecks for high-performance computers, owing to the increasing gap between processor speed and memory latency. For this reason, several architectures, called intelligent memory (IRAM) or processor-in-memory (PIM), have been studied in recent years with the aim of integrating the processor and memory. A merit of the PIM architecture is that PIM chips can replace the main memory chips in a workstation and act as co-processors when the main processor spawns them. This approach has been adopted by Active Page, DIVA, and FlexRAM, among others.
This class of architectures provides a hierarchical hybrid multiprocessor environment consisting of host (main) processors and memory processors. The host processor is more powerful, with a deep cache hierarchy, but has a higher memory access latency. By contrast, memory processors are usually less powerful but have a lower memory access latency. The major problems we address in this paper are how to dispatch suitable tasks to these different processors in PIM according to their computing power and characteristics so as to reduce their idle time, and how to partition the original program so that it executes simultaneously on this mixture of heterogeneous processors. Based on our earlier work [3,5], we propose the SAGE (Statement-Analysis-Grouping-Evaluation) system, which analyzes the source program, generates a Weighted Partition Dependence Graph (WPG), determines the weight of each block, and then dispatches the most suitable jobs to the host and memory processors, respectively. In our experiments, the speedup obtained even exceeds the computation capability ratio in an environment with one host processor and one memory processor.
* Corresponding author.
E-mail addresses: [email protected] (T.-C. Huang), [email protected] (S.-L. Chu).
2. Intelligent memory architecture
A general view of the FlexRAM [7] architecture, which is the platform adopted in this paper, is shown in Fig. 1. The host processor of the target machine is called P.Host. Some representative parameters of this architecture are listed in Table 1. To reduce the complexity of exploiting the benefits of PIM architectures, in this paper we consider only a system with a single host processor (P.Host) and a single memory processor (P.Mem).
0020-0190/02/$ – see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0020-0190(02)00353-8
T.-C. Huang, S.-L. Chu / Information Processing Letters 85 (2003) 159–163
Fig. 1. The organization of the simplified FlexRAM architecture.
Table 1
Major parameters of the FlexRAM
Parameters                 P.Host         P.Mem
Working frequency          800 MHz        400 MHz
Superscalar type           Out-of-order   In-order
Superscalar issue width    6              2
Integer unit number        6              2
Floating unit number       4              2
First level cache size     32 KB          16 KB
Second level cache size    256 KB         N/A
Branch penalty cycles      4              2
Memory delay cycles        88             17
Memory access latency      262.5 ns       50.5 ns
3. System organization
In current research on parallelizing compilers, there are two major orientations. One is designed for tightly-coupled homogeneous environments, such as SUIF and Polaris. These compiler systems address how to transform loops so that all or part of the iterations can be executed concurrently, i.e., they deal with the problem from the iteration viewpoint. This approach is suitable for homogeneous multiprocessor systems, but it has an obvious flaw on heterogeneous multiprocessor platforms: the behaviors of the iterations are similar, whereas the capabilities of the heterogeneous processors are different. The other kind of parallelizing compiler targets heterogeneous multicomputers, such as TINPAR [4] and PARADIGM [2]. These compilers transform programs into a coarse-grained parallel programming model on loosely-coupled heterogeneous multicomputers, but cannot deal with a medium-grained parallel programming model on a tightly-coupled mixture of heterogeneous processors, such as PIM. Hence we choose a different approach that uses statements as the basic unit of analysis. In this approach, each statement is assigned a different weight for each processor and is scheduled to the most suitable processor according to these weights. Our SAGE system has two major advantages. Firstly, the partitioning technique provides a new medium-grained parallelizing approach. Unlike conventional iteration-based systems, the statement-based approach can exploit more of the capability differences among heterogeneous multiprocessors. Secondly, the heuristic scheduling mechanism in SAGE generates appropriate schedules and dispatches tasks to P.Host and P.Mem according to their capabilities and the overall load balance.
Table 2
A simple fully parallelizable program
Program                    *Weight for PH   Weight for PM
DO I = 1 to N
  S1: A = A mod B          3                6
  S2: C = D[I] + E         5                1
  S3: F = G[I] + H[I]      6                2
ENDDO
* Weight: normalized execution time.
PH: host processor. PM: memory processor.
Table 2 shows a simple example that demonstrates the benefits of the statement-based parallelizing technique. The program is fully parallelizable and can be partitioned by statements or by iterations. The statements' weights for P.Host and P.Mem are also provided. Table 3 lists five parallelization cases of the program in Table 2 and their execution times. The first two cases execute the program solely on the host processor and solely on the memory processor, respectively. Case 3 parallelizes the program with a conventional parallelizing compiler, such as SUIF or Polaris. Such compilers determine the parallel loops and dispatch their iterations to the processors evenly. This approach achieves a good speedup only for homogeneous processors. Case 4 assumes that the parallelizing compiler can dispatch the iterations to the processors according to the processors' capabilities, but does not consider the discrepancy in executing individual statements on different processors. Case 5 is the statement-based approach, which has the best performance in this example because the processors possess different capabilities. That is why we developed the SAGE system for PIM environments.
Table 3
Five cases and execution times for the example in Table 2
No  Description                            Execution time
1   P.Host only                            Latency = [PH(S1) + PH(S2) + PH(S3)] * iteration # = (3 + 5 + 6)N = 14N
2   P.Mem only                             Latency = [PM(S1) + PM(S2) + PM(S3)] * iteration # = (6 + 1 + 2)N = 9N
3   Conventional parallelizing compilers   Latency = max((3 + 5 + 6) * 0.5N, (6 + 2 + 1) * 0.5N) = 7N
4   Parallelizing by iteration splitting   Dispatch workload in proportion to the capability ratio of PH and PM.
                                           PH : PM = 9 : 14
                                           Latency = 14 * (9/23)N = 5.48N
5   Parallelizing by statement splitting   Latency = max(PH(S1) * N, PM(S2, S3) * N) = 3N
                                           (Here S1 is more suitable for PH, but S2 and S3 are more suitable for PM)
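The arithmetic behind the five cases of Table 3 can be checked with a short script. This is only an illustrative sketch: the function name `case_latencies` and the dictionary layout are our own assumptions, and the results are expressed as multiples of the iteration count N.

```python
# Weights from Table 2: statement -> (weight on P.Host, weight on P.Mem).
WEIGHTS = {"S1": (3, 6), "S2": (5, 1), "S3": (6, 2)}


def case_latencies(weights, host_stmts):
    """Return the latency coefficient (a multiple of N) for cases 1 to 5.

    host_stmts names the statements sent to P.Host in case 5; all other
    statements go to P.Mem. The loop is assumed fully parallel, as in Table 2.
    """
    ph_total = sum(ph for ph, _ in weights.values())  # one iteration on P.Host
    pm_total = sum(pm for _, pm in weights.values())  # one iteration on P.Mem

    # Case 3: half of the iterations on each processor.
    even_split = max(ph_total * 0.5, pm_total * 0.5)

    # Case 4: P.Host receives a share of iterations proportional to its speed,
    # i.e. pm_total / (ph_total + pm_total), so both processors finish together.
    host_share = pm_total / (ph_total + pm_total)
    proportional_split = ph_total * host_share

    # Case 5: every iteration is split by statement between the processors.
    host_time = sum(ph for s, (ph, _) in weights.items() if s in host_stmts)
    mem_time = sum(pm for s, (_, pm) in weights.items() if s not in host_stmts)
    statement_split = max(host_time, mem_time)

    return {1: ph_total, 2: pm_total, 3: even_split,
            4: proportional_split, 5: statement_split}


if __name__ == "__main__":
    for case, coeff in case_latencies(WEIGHTS, {"S1"}).items():
        print(f"case {case}: {coeff:.2f} N")
```

Passing a different `host_stmts` set shows how sensitive case 5 is to the statement assignment, which is the reason a weight-driven scheduler is needed.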
3.1. Statement splitting and WPG construction
In SAGE, we use Loop Distribution to split the original dependence graph into Node Partition Π [1], then construct the Weighted Partition Dependence Graph (WPG), which will be used in the following weight evaluation and schedule determination stages.
Firstly, we introduce some definitions for partitioning loops.
Definition 1 (Node partition Π) [1]. On the dependence graph G, for a given loop L, we define a node partition Π of {S1, S2, . . . , Sd} in such a way that Sk and Sl, k ≠ l, are in the same subset if and only if Sk Δ Sl and Sl Δ Sk, where Δ is an indirect data dependence relation.
Definition 2 (Weighted partition dependence graph) [3,5]. For a given node partition Π as in Definition 1, we define a weighted partition dependence graph WPG(P, E). For each πi ∈ Π, there is a node bi⟨Ii, Si, Wi, Oi⟩ ∈ P, where Ii denotes the loop index and Si represents the body statements; Wi is the weight of node i in the form Wi(PH, PM), with PH and PM indicating the weights for P.Host and P.Mem, respectively; and Oi is the execution order of this node. There is an edge eij ∈ E from bi to bj if bi and bj have the dependence relations α, α∧, and α◦, as in Definition 1, which are denoted respectively by a plain arrow →, an arrow labeled "anti", and an arrow labeled "o".
According to these two definitions, we developed a formal method in [5] to partition the loops.
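Definition 1 groups two statements exactly when each depends indirectly on the other, so the subsets of Π are the strongly connected components of the dependence graph. The following is a minimal sketch of that idea, not the formal method of [5]; the function names and the example graph are hypothetical.

```python
def reachable(graph, start):
    """Return all statements reachable from start via one or more edges."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def node_partition(statements, graph):
    """Group statements that depend on each other indirectly (Definition 1).

    statements: list of statement names in lexicographic (program) order.
    graph: dict mapping a statement to the statements that depend on it.
    Returns a list of blocks; each block is a list of statement names.
    """
    reach = {s: reachable(graph, s) for s in statements}
    blocks, assigned = [], set()
    for s in statements:
        if s in assigned:
            continue
        # t joins the block of s when s reaches t and t reaches s.
        block = [s] + [t for t in statements
                       if t != s and t not in assigned
                       and t in reach[s] and s in reach[t]]
        assigned.update(block)
        blocks.append(block)
    return blocks


if __name__ == "__main__":
    # Hypothetical dependence graph: S2 and S3 form a dependence cycle.
    example = {"S1": ["S2"], "S2": ["S3"], "S3": ["S2", "S4"]}
    print(node_partition(["S1", "S2", "S3", "S4"], example))
```

The pairwise reachability test is quadratic in the number of statements, which is acceptable for loop bodies; a linear-time strongly connected components algorithm would be the natural replacement for large graphs.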
3.2. Weight evaluation
In PIM environments, each operation (e.g., branch, arithmetic, and memory operations) takes a different execution time on different processors, because the memory processor has less computation power but a shorter memory access latency than the host processor. Representative parameters of FlexRAM are shown in Table 1. To predict the normalized execution time (i.e., the weight) of each block bi in the WPG, we proposed a static weight evaluation mechanism that obtains the weights of BRANCH, INTEGER, FLOATING POINT, and MEMORY operations for P.Host and P.Mem, respectively. For the detailed mechanism of weight evaluation, please refer to [3].
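To make the idea of per-processor block weights concrete, the sketch below assumes a simple linear cost model: the weight of a block is the sum, over the four operation classes, of the operation count times a per-operation weight. Both the linear model and every number in `OP_WEIGHTS` are hypothetical placeholders for illustration; the actual mechanism is the one described in [3].

```python
# HYPOTHETICAL per-operation weights in normalized time units.
# These values are illustrative only and are not taken from the paper.
OP_WEIGHTS = {
    "BRANCH":   {"PH": 1.0, "PM": 1.0},
    "INTEGER":  {"PH": 1.0, "PM": 3.0},
    "FLOATING": {"PH": 1.5, "PM": 3.0},
    "MEMORY":   {"PH": 5.0, "PM": 1.0},
}


def weight_determine(op_counts, op_weights=OP_WEIGHTS):
    """Return (PH weight, PM weight) of a block under the assumed linear model.

    op_counts maps an operation class to how often it occurs in the block.
    """
    ph = sum(n * op_weights[op]["PH"] for op, n in op_counts.items())
    pm = sum(n * op_weights[op]["PM"] for op, n in op_counts.items())
    return ph, pm


if __name__ == "__main__":
    # A memory-bound block: two memory operations and one floating-point add.
    print(weight_determine({"MEMORY": 2, "FLOATING": 1}))
```

Under these placeholder weights a memory-bound block is cheaper on P.Mem and a compute-bound block is cheaper on P.Host, which mirrors the qualitative behavior described above.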
3.3. Wavefront generation and schedule determination
In this section, we propose an algorithm to schedule the blocks on P.Host and P.Mem. In our method, the weights of the blocks in partition Π are computed first; then the execution order of each block is determined according to the dependence relations and the lexicographic order. The algorithm partitions the set of blocks into subsets called wavefronts. The blocks in a wavefront can be executed in parallel, i.e., no two blocks within a wavefront depend on each other, but the wavefronts must be carried out in sequence. Finally, the blocks in the same wavefront are scheduled to P.Host and P.Mem according to their weights and the overall load balance.
Algorithm 1 (Weighting and Scheduling Algorithm).
[Input]
WPG= (P, E): the original weighted partition dependence graph before the weights and orders of blocks are determined.
[Output]
work: a working set of nodes ready to be visited.
wf_tmp: a working set for wavefront scheduling.
max_wf: the maximum wavefront number.
max_pred_O(bi): the maximum execution order for all bi’s predecessor blocks.
min_pred_O(bi): the minimum execution order for all bi’s predecessor blocks.
PHW(bi), PMW(bi): the weights of bi for P.Host and P.Mem, respectively.
Weight_Determine (bi): the subroutine performing the weight evaluation scheme presented in Section 3.2.
[Algorithm]
/* Initialization and weight determination for each block */
for each bi ∈ P do
    Wi(PH, PM) = Weight_Determine(bi)
    Oi = 0
end for
/* Execution order assignment */
work = P
for each bi with no predecessors do Oi = 1
while done = False do
divide wf_tmp into two subsets a, b such that a ∪ b = wf_tmp and a ∩ b = ∅
if (|PHW(a) − PMW(b)| + max(PHW(a), PMW(b))) is minimum for all possible a and b then
    wf(j) = {PH(a), PM(b)}
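The wavefront idea and the splitting criterion above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reconstruction of Algorithm 1: we assume that a block's execution order is one more than the largest order among its predecessors, and we search all splits of a wavefront exhaustively. All function names are our own.

```python
from itertools import combinations


def assign_orders(blocks, edges):
    """Order 1 for blocks without predecessors, else 1 + max predecessor order."""
    preds = {b: set() for b in blocks}
    for src, dst in edges:
        preds[dst].add(src)
    order, remaining = {}, list(blocks)
    while remaining:
        ready = [b for b in remaining if all(p in order for p in preds[b])]
        if not ready:
            raise ValueError("dependences between blocks must be acyclic")
        for b in ready:
            order[b] = 1 + max((order[p] for p in preds[b]), default=0)
        remaining = [b for b in remaining if b not in order]
    return order


def split_wavefront(wavefront, weights):
    """Pick the split (a, b) minimizing |PHW(a) - PMW(b)| + max(PHW(a), PMW(b))."""
    best = None
    for size in range(len(wavefront) + 1):
        for a in combinations(wavefront, size):
            b = [x for x in wavefront if x not in a]
            phw = sum(weights[x][0] for x in a)  # load placed on P.Host
            pmw = sum(weights[x][1] for x in b)  # load placed on P.Mem
            cost = abs(phw - pmw) + max(phw, pmw)
            if best is None or cost < best[0]:
                best = (cost, list(a), b)
    return best


def schedule(blocks, edges, weights):
    """Return one {"PH": [...], "PM": [...]} assignment per wavefront, in order."""
    order = assign_orders(blocks, edges)
    plan = []
    for level in range(1, max(order.values(), default=0) + 1):
        wavefront = [b for b in blocks if order[b] == level]
        _, host, mem = split_wavefront(wavefront, weights)
        plan.append({"PH": host, "PM": mem})
    return plan


if __name__ == "__main__":
    table2 = {"S1": (3, 6), "S2": (5, 1), "S3": (6, 2)}
    print(schedule(["S1", "S2", "S3"], [], table2))
```

For the independent statements of Table 2 the sketch places S1 on P.Host and S2, S3 on P.Mem, matching case 5 of Table 3. The exhaustive search grows exponentially with wavefront size, so a heuristic would be needed for large wavefronts.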
The code generated by SAGE is targeted at the FlexRAM simulator [7] developed by the I-ACOMA Lab at UIUC. This simulation environment models a dynamic superscalar multiprocessor and detailed memory behavior cycle by cycle. The major configuration parameters are shown in Table 1. We evaluated three applications: swim from SPEC95, strmm from BLAS3, and fft from Numerical Recipes [6]. Table 4 shows the execution cycles for these three applications. P.Host exec. cycles denotes running on P.Host only; P.Mem exec. cycles denotes running on P.Mem only; and Optimized cycles denotes running on the PIM platform after the applications are transformed by our SAGE system.
Table 4
Performance results of three programs
Benchmarks   P.Host exec. cycles   P.Mem exec. cycles   Optimized cycles   Speedup
swim         86144754              662909725            64190746           1.342
strmm        107235775             875029491            76868804           1.395
fft          105874670             884793293            86499990           1.221
According to Table 4, the approximate performance ratio between P.Host and P.Mem is 8:1. This means that if P.Mem works simultaneously with P.Host, the theoretical speedup is at most 1 + 1/8 = 1.125. However, we obtain speedups greater than 1.125. The explanation is that P.Mem has a shorter memory access latency than P.Host. This confirms the major objective of intelligent memory systems: to reduce the performance gap between the processor and memory. The results also demonstrate that SAGE can extend the advantages of PIM architectures.
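The 1.125 bound follows from a simple argument: if the work is divided so that both processors finish together and each keeps its stand-alone execution rate, the speedup over P.Host alone is 1 + (P.Host cycles / P.Mem cycles). The sketch below applies this to the figures of Table 4; the function name `ideal_speedup` is our own, and the bound is an idealization that ignores the memory latency effects discussed above.

```python
# (P.Host cycles, P.Mem cycles, reported speedup), copied from Table 4.
TABLE4 = {
    "swim":  (86144754, 662909725, 1.342),
    "strmm": (107235775, 875029491, 1.395),
    "fft":   (105874670, 884793293, 1.221),
}


def ideal_speedup(host_cycles, mem_cycles):
    """Speedup over P.Host alone when work is split in proportion to speed.

    With stand-alone times Th and Tm, a balanced split finishes in
    Th * Tm / (Th + Tm), so the speedup is (Th + Tm) / Tm = 1 + Th / Tm.
    """
    return 1 + host_cycles / mem_cycles


if __name__ == "__main__":
    print("8:1 ratio bound:", ideal_speedup(1, 8))
    for name, (host, mem, reported) in TABLE4.items():
        bound = ideal_speedup(host, mem)
        print(f"{name}: ratio {mem / host:.2f}:1, "
              f"bound {bound:.3f}, reported {reported}")
```

For all three benchmarks the reported speedup is above the per-benchmark bound, which is consistent with the explanation that statements run on P.Mem benefit from its shorter memory access latency.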
References
[1] D.J. Kuck, A survey of parallel machine organization and programming, ACM Comput. Surv. 9 (1) (1977) 29–59.
[2] E. Su, A. Lain, S. Ramaswamy, D.J. Palermo, E.W. Hodges, P. Banerjee, Advanced compilation techniques in the PARADIGM compiler for distributed-memory multicomputers, in: Proc. 1995 9th ACM International Conference on Supercomputing, 1995.
[3] S.L. Chu, T.C. Huang, L.C. Lee, Improving workload balance and code optimization in processor-in-memory systems, in: Proc. 8th International Conference on Parallel and Distributed Systems, KyongJu City, Korea, June 2001, pp. 273–278.
[4] S. Goto, A. Kubota, T. Tanaka, M. Goshima, S. Mori, H. Nakashima, S. Tomita, Optimized code generation for heterogeneous computing environment using parallelizing compiler TINPAR, in: Proc. 1998 International Conference on Parallel Architectures and Compilation Techniques, 1998, pp. 426–433.
[5] T.C. Huang, S.L. Chu, A new analysis approach for intelligent memory systems, in: Proc. ISCA 16th International Conference on Computers and Their Applications, Seattle, WA, March 2001, pp. 452–457.
[6] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in Fortran 77, Cambridge University Press, 1992.
[7] Y. Kang, W. Huang, S. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, J. Torrellas, FlexRAM: Toward an advanced intelligent memory system, in: Proc. International Conference on Computer Design, Austin, TX, October 1999.