OPTIMIZING PARALLEL APPLICATIONS

by

Shih-Hao Hung

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in The University of Michigan, 1998.

Doctoral Committee:
    Professor Edward S. Davidson, Chair
    Professor William R. Martin
    Professor Trevor N. Mudge
    Professor Quentin F. Stout


© Shih-Hao Hung 1998. All Rights Reserved.

For my family.

ACKNOWLEDGMENT

I owe a special thanks to my advisor, Professor Edward S. Davidson. Without his support, guidance, wisdom, and kind revisions of my writing, my study toward the Ph.D. degree and the writing of this dissertation would have been much more painful. I would like to thank Prof. Gregory Hulbert, Prof. William Martin, Prof. Trevor Mudge, and Prof. Quentin Stout for their advice, which helped improve this dissertation.

I would like to thank Ford Motor Company, the Automotive Research Center (U.S. Army TACOM), and the National Partnership for Advanced Computational Infrastructure (NPACI, NSF Grant No. ACI-961920) for their generous financial support of this research. Parallel computer time was provided by the University of Michigan Center for Parallel Computing, which is sponsored in part by NSF grant CDA-92-14296. Thanks to the CPC staff, particularly Paul McClay, Andrew Caird, and Hal Marshall, for providing excellent technical support on the KSR, HP/Convex Exemplar, IBM SP2, and SGI Power Challenge.

I have been fortunate to meet numerous marvelous fellow students in the Parallel Performance Project (formerly the Ford Project). Much of the work in this dissertation has been inspired by or based on their research. These are the things I learned from these friends, my officemates in 2214 EECS: hierarchical performance bounds models (Tien-Pao Shih, Eric Boyd), synthetic workload (Eric Boyd), domain decomposition (Karen Tomko), relaxed synchronization (Alex Eichenberger), and machine performance evaluation (Gheith Abandah). Thanks to Ed Tam and Viji Srinivasan for letting me use their cache simulation tools. Jude Rivers has been a bright, delightful fellow, whose constant skeptical attitude toward parallel computing has been one motivation for this research.

Thanks to many people, Ann Arbor has been a wonderful place for me over the last five years. To the people with whom I have played volleyball, basketball, softball, and mahjong. To the people who lent me tons of anime to kill countless hours. To my family back in Taiwan for their full support of my academic career over the years. To our cute little black cat, Ji-Ji, for bringing lots of laughter to our home. Finally, to my lovely wife, Weng-Ling, for always finding ways to make our lives here more interesting and meaningful.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGMENT
LIST OF TABLES
LIST OF FIGURES

CHAPTER

1  INTRODUCTION
   1.1  Parallel Machines
        1.1.1  Shared-Memory Architecture - HP/Convex Exemplar
        1.1.2  Message-Passing Architecture - IBM SP2
        1.1.3  Usage of the Exemplar and SP2
   1.2  Parallelization of Applications
   1.3  Problems in Developing Irregular Applications
        1.3.1  Irregular Applications
        1.3.2  Example - CRASH
        1.3.3  Parallelization of CRASH
        1.3.4  Performance Problems
   1.4  Developing High Performance Parallel Applications
        1.4.1  Conventional Scheme
        1.4.2  Recent Trends
        1.4.3  Problems in Using Parallel Tools
        1.4.4  The Parallel Performance Project
        1.4.5  Goals of this Dissertation
   1.5  Summary and Organization of this Dissertation

2  ASSESSING THE DELIVERED PERFORMANCE
   2.1  Machine Performance Characterization
   2.2  Machine-Application Interactions
        2.2.1  Processor Performance
        2.2.2  Interprocessor Communications
        2.2.3  Load Balancing, Scheduling, and Synchronization
        2.2.4  Performance Problems in CRASH-SP
        2.2.5  Overall Performance
   2.3  Performance Assessment Techniques
        2.3.1  Source Code Analysis
        2.3.2  Profile-Driven Analysis
        2.3.3  Trace-driven Analysis
        2.3.4  Other Approaches
   2.4  Efficient Trace-Driven Techniques for Assessing the Communication Performance of Shared-Memory Applications
        2.4.1  Overview of the KSR1/2 Cache-Only Memory Architecture
        2.4.2  Categorizing Cache Misses in Distributed Shared-Memory Systems
        2.4.3  Trace Generation: K-Trace
        2.4.4  Local Cache Simulation: K-LCache
        2.4.5  Analyzing Communication Performance with the Tools
   2.5  Summary

3  A UNIFIED PERFORMANCE TUNING METHODOLOGY
   3.1  Step-by-step Approach
   3.2  Partitioning the Problem (Step 1)
        3.2.1  Implementing a Domain Decomposition Scheme
        3.2.2  Overdecomposition
   3.3  Tuning the Communication Performance (Step 2)
        3.3.1  Communication Overhead
        3.3.2  Reducing the Communication Traffic
        3.3.3  Reducing the Average Communication Latency
        3.3.4  Avoiding Network Contention
        3.3.5  Summary
   3.4  Optimizing Processor Performance (Step 3)
   3.5  Balancing the Load for Single Phases (Step 4)
   3.6  Reducing the Synchronization/Scheduling Overhead (Step 5)
   3.7  Balancing the Combined Load for Multiple Phases (Step 6)
   3.8  Balancing Dynamic Load (Step 7)
   3.9  Conclusion

4  HIERARCHICAL PERFORMANCE BOUNDS AND GOAL-DIRECTED PERFORMANCE TUNING
   4.1  Introduction
        4.1.1  The MACS Bounds Hierarchy
        4.1.2  The MACS12*B Bounds Hierarchy
        4.1.3  Performance Gaps and Goal-Directed Tuning
   4.2  Goal-Directed Tuning for Parallel Applications
        4.2.1  A New Performance Bounds Hierarchy
        4.2.2  Goal-Directed Performance Tuning
        4.2.3  Practical Concerns in Bounds Generation
   4.3  Generating the Parallel Bounds Hierarchy: CXbound
        4.3.1  Acquiring the I (MACS$) Bound
        4.3.2  Acquiring the IP Bound
        4.3.3  Acquiring the IPC Bound
        4.3.4  Acquiring the IPCO Bound
        4.3.5  Acquiring the IPCOL Bound
        4.3.6  Acquiring the IPCOLM Bound
        4.3.7  Actual Run Time and Dynamic Behavior
   4.4  Characterizing Applications Using the Parallel Bounds
        4.4.1  Case Study 1: Matrix Multiplication
        4.4.2  Case Study 2: A Finite-Element Application
   4.5  Summary

5  MODEL-DRIVEN PERFORMANCE TUNING
   5.1  Introduction
   5.2  Application Modeling
        5.2.1  The Data Layout Module
        5.2.2  The Control Flow Module
        5.2.3  The Data Dependence Module
        5.2.4  The Domain Decomposition Module
        5.2.5  The Weight Distribution Module
        5.2.6  Summary of Application Modeling
   5.3  Model-Driven Performance Analysis
        5.3.1  Machine Models
        5.3.2  The Model-Driven Simulator
        5.3.3  Current Limitations of MDS
   5.4  A Preliminary Case Study
        5.4.1  Modeling CRASH-Serial, CRASH-SP, and CRASH-SD
        5.4.2  Analyzing the Performance of CRASH-Serial, CRASH-SP, and CRASH-SD
        5.4.3  Model-Driven Performance Tuning
        5.4.4  Summary of the Case Study
   5.5  Summary

6  CONCLUSION

REFERENCES

LIST OF TABLES

Table
1-1   Vehicle Models for CRASH
2-1   Some Performance Specifications for HP/Convex SPP-1000
2-2   The Memory Configuration for HP/Convex SPP-1000 at UM-CPC
2-3   Microbenchmarking Results for the Local Memory Performance on HP/Convex SPP-1000
2-4   Microbenchmarking Results for the Shared-Memory Point-to-Point Communication Performance on HP/Convex SPP-1000
2-5   Collectable Performance Metrics with CXpa, for SPP-1600
2-6   D-OPT Model for Characterizing a Distributed Cache System
2-7   Tracing Directives
3-1   Comparison of Dynamic Load Balancing Techniques
3-2   Performance Tuning Steps, Issues, Actions and the Effects of Actions
4-1   Performance Tuning Actions and Their Related Performance Gaps
5-1   A Run Time Profile for CRASH
5-2   Computation Time Reported by MDS
5-3   Communication Time Reported by MDS
5-4   Barrier Synchronization Time Reported by MDS
5-5   Wall Clock Time Reported by MDS
5-6   Hierarchical Parallel Performance Bounds (as Reported by MDS) and Actual Runtime (Measured)
5-7   Performance Gaps (as Reported by MDS)
5-8   Working Set Analysis Reported by MDS
5-9   Performance Metrics Reported by CXpa (Only the Main Program Is Instrumented)
5-10  Wall Clock Time Reported by CXpa
5-11  CPU Time Reported by CXpa
5-12  Cache Miss Latency Reported by CXpa
5-13  Cache Miss Latency and Wall Clock Time for a Zero-Workload CRASH-SP, Reported by CXpa
5-14  Working Set Analysis Reported by MDS for CRASH-SP, CRASH-SD, and CRASH-SD2
5-15  PSAT of Arrays Position, Velocity, and Force in CRASH

LIST OF FIGURES

Figure
1-1   HP/Convex Exemplar System Overview
1-2   Example Irregular Application, CRASH
1-3   Collision between the Finite-Element Mesh and an Invisible Barrier
1-4   CRASH with Simple Parallelization (CRASH-SP)
1-5   Typical Performance Tuning for Parallel Applications
2-1   Performance Assessment Tools
2-2   Dependencies between Contact and Update
2-3   CPU Time of MG, Visualized with CXpa
2-4   Workload Included in CPU Time and Wall Clock Time
2-5   Wall Clock Time Reported by CXpa
2-6   Examples of Trace-Driven Simulation Schemes
2-7   KSR1/2 ALLCache
2-8   An Example Inline Tracing Code
2-9   A Parallel Trace Consisting of Three Local Traces
2-10  Trace Generation with K-Trace
2-11  Communications in a Sample Trace
2-12  Coherence Misses and Communication Patterns in an Ocean Simulation Code on the KSR2
3-1   Performance Tuning
3-2   An Ordered Performance-tuning Methodology
3-3   Domain Decomposition, Connectivity Graph, and Communication Dependency Graph
3-4   A Shared-memory Parallel CRASH (CRASH-SD)
3-5   A Message-passing Parallel CRASH (CRASH-MD); Pseudo Code for the First Phase Is Shown
3-6   Using Private Copies to Enhance Locality and Reduce False-Sharing
3-7   Using Gathering Buffers to Improve the Efficiency of Communications
3-8   Communication Patterns of the Ocean Code with Write-Update Protocol
3-9   Communication Patterns of the Ocean Code with Noncoherent Loads
3-10  Example Pairwise Point-to-point Communication
3-11  Communication Patterns in a Privatized Shared-Memory Code
3-12  Barrier, CDG-directed Synchronization, and the Use of Overdecomposition
3-13  Overdecomposition and Dependency Table
3-14  Approximating the Dynamic Load in Stages
4-1   Performance Constraints and the Performance Bounds Hierarchy
4-2   Performance Tuning Steps and Performance Gaps
4-3   Calculation of the IPCO, IPCOL, IPCOLM, IPCOLMD Bounds
4-4   An Example with Dynamic Load Imbalance
4-5   The Source Code for MM1
4-6   Parallel Performance Bounds for MM1
4-7   Performance Gaps for MM1
4-8   The Source Code for MM2
4-9   Parallel Performance Bounds for MM2
4-10  Comparison of MM1 and MM2 for 8-processor Configuration
4-11  Source Code for MM_LU
4-12  Performance Bounds for MM_LU
4-13  Performance Comparison of MM2 on Dedicated and Multitasking Systems
4-14  Performance Bounds for the Ported Code
4-15  Performance Bounds for the Tuned Code
4-16  Performance Comparison between Ported and Tuned on 16-processor Configuration
5-1   Model-Driven Performance Tuning
5-2   Model-driven Performance Tuning
5-3   Building an Application Model
5-4   An Example Data Layout Module for CRASH
5-5   A Control Flow Graph for CRASH
5-6   The Tasks Defined for CRASH
5-7   A Hierarchical Control Flow Graph for CRASH
5-8   An IF-THEN-ELSE Statement Represented in a CFG
5-9   A Control Flow Module for CRASH
5-10  Program Constructs and Tasks Modeled for CRASH
5-11  A Control Flow Graph for a 2-Processor Parallel Execution in CRASH-SP
5-12  Associating Tasks with Other Modules for Modeling CRASH-SP on 4 Processors
5-13  Modeling the Synchronization for CRASH-SP
5-14  Modeling the Point-to-Point Synchronization, an Example
5-15  A Data Dependence Module for CRASH
5-16  An Example Domain Decomposition Module for CRASH-SP
5-17  Domain Decomposition Schemes Supported in MDS
5-18  An Example Domain Decomposition for CRASH-SD
5-19  An Example Workload Module for CRASH/CRASH-SP/CRASH-SD
5-20  Model-driven Analyses Performed in MDS
5-21  An Example Machine Description of HP/Convex SPP-1000 for MDS
5-22  A Sample Input for CRASH
5-23  Decomposition Scheme Used in CRASH-Serial, SP, and SD
5-24  Performance Bounds and Gaps Calculated for CRASH-Serial, CRASH-SP, and CRASH-SD
5-25  The Layout of Array Position in the Processor Cache in CRASH-SD
5-26  Comparing the Wall Clock Time Reported by MDS and CXpa
5-27  Performance Bounds Analysis for CRASH-SP
5-28  Performance Bounds Analysis for CRASH-SP and SD
5-29  Layout of the Position Array in the Processor Cache in CRASH-SD2
5-30  Comparing the Performance Gaps of CRASH-SD2 to Its Predecessors
5-31  Comparing the Performance of CRASH-SD2 to Its Predecessors
5-32  A Pseudo Code for CRASH-SD3
5-33  Comparing the Performance Gaps of CRASH-SD3 to Its Predecessors
5-34  Comparing the Performance of CRASH-SD3 to Its Predecessors
5-35  A Pseudo Code for CRASH-SD4
5-36  Comparing the Performance of CRASH-SD4 to Its Predecessors
5-37  The Layout in CRASH-SD2, SD3, SD4, SD5, and SD6
5-38  Comparing the Performance of CRASH-SD5 to CRASH-SD4
5-39  Comparing the Performance of CRASH-SD6 to Previous Versions
5-40  Delaying Write-After-Read Data Dependencies By Using Buffers
5-41  The Results of Delaying Write-After-Read Data Dependencies By Using Buffers
5-42  A Pseudo Code for CRASH-SD7
5-43  Performance Bounds Analysis for CRASH-SD5, SD6, and SD7
5-44  A Pseudo Code for CRASH-SD8
5-45  Data Accesses and Interprocessor Data Dependencies in CRASH-SD8
5-46  Performance Bounds Analysis for CRASH-SD6, SD7, and SD8
5-47  Summary of the Performance of Various CRASH Versions
5-48  Performance Gaps of Various CRASH Versions


CHAPTER 1

INTRODUCTION

"A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem" [1]. Today, various types of parallel computers serve different usages ranging from embedded digital signal processing to supercomputing. In this dissertation, we focus on the application of parallel computing to solving large computational problems fast. Highly parallel supercomputers, with up to thousands of microprocessors, have been developed to solve problems that are beyond the reach of any traditional single-processor supercomputer. By connecting multiple processors to a shared memory bus, parallel servers/workstations have emerged as a cost-effective alternative to mainframe computers.

While parallel computing offers an attractive perspective for the future of computers, the parallelization of applications and the performance of parallel applications have limited the success of parallel computing. First, applications need to be parallel or parallelized to take advantage of parallel machines. Writing parallel applications or parallelizing existing serial applications can be a difficult task. Second, parallel applications are expected to deliver high performance. However, more often than not, the parallel execution overhead results in unexpectedly poor performance. Compared to uniprocessor systems, there are many more factors that can greatly impact the performance of a parallel machine. It is often a difficult and time-consuming process for users to exploit the performance capacity of a parallel computer, which generally requires them to deal with limited inherent parallelism in their applications, inefficient parallel algorithms, overhead of parallel execution, and/or poor utilization of machine resources. The latter two problems are what we intend to address in this dissertation.

Parallelization is a state-of-the-art process that has not yet been automated in general. Most programmers have been trained in and have experience with serial codes, and there exist many important serial application codes that could well benefit from the performance increases offered by parallel computers. Automatic parallelization is possible for loops whose data dependences can be analyzed by the compiler. Unfortunately, the complexity of interprocedural

data flow analysis often limits automatic parallelization to basic loops without procedure calls. Problems can occur even in these basic loops if, for example, there exist indirect data references such as pointers or indirectly-indexed arrays. Fortunately, many of those problems are solvable with some human effort, especially from the programmers themselves, to assist the compiler.

Regardless of whether parallelization is automatic or manual, high performance parallel applications are needed to better serve the user community. So far, while some parallel applications do successfully achieve high delivered performance, many others achieve only a small fraction of peak machine performance. It is often beyond the compiler's or the application developer's ability to accurately identify and consider the many machine-application interactions that can potentially affect the performance of the parallelized code. Over the last several decades, during which parallel computer architectures have constantly been modified and improved, tuned application codes and software environments for these architectures have had to be discontinued and rebuilt. Different application development tools have not been well integrated or automated. The methodology for improving software performance, i.e. performance tuning, like parallelization, has never been mature enough to reduce the tuning effort to routine work that can be performed by compilers or average programmers. Poor performance and painful experiences in performance tuning have greatly reduced the interest of many programmers in parallel computing. These problems must be solved to permit routine development of high performance parallel applications.

Irregular applications [2] (Section 1.3), including sparse problems and those with unstructured meshes, often require more human parallel programming effort than regular applications, due to their indirectly indexed data items and irregular load distribution among the problem subdomains. Most full-scale scientific and engineering applications exhibit such irregularity, and as a result they are more difficult to parallelize, load balance, and optimize. Due to the lack of systematic and effective performance-tuning schemes, many irregular applications exhibit deficient performance on parallel computers. In this dissertation, we aim to provide a unified approach to addressing this problem by integrating performance models, performance-tuning methodologies, and performance analysis tools to guide the parallelization and optimization of irregular applications. This approach also applies to the simpler problem of parallelizing and tuning regular applications.

In this chapter, we introduce several key issues in developing high performance applications on parallel computers. Section 1.1 classifies parallel architectures and parallel programming models and describes the parallel computers that we use in our experiments. Section 1.2 describes the process and some of the difficulties encountered in the parallelization of applications. In Section 1.3, we discuss some aspects of irregular applications and the performance problems they pose. In Section 1.4, we give an overview of current application development environments and define the goals of our research. Section 1.5 summarizes this chapter and overviews the organization of this dissertation.

1.1 Parallel Machines

There are various kinds of parallel computers, as categorized by Flynn [3]. We focus on multiple instruction stream, multiple data stream (MIMD) machines. MIMD machines can be further divided into two classes, shared-memory and message-passing, which differ in the type and amount of hardware support they provide for interprocessor communication.

A shared-memory architecture provides a globally shared physical address space, and processors can communicate with one another by sharing data with commonly known addresses, i.e. global variables. There may also be local or private memory spaces that belong to one processor or cluster of processors and are protected from being accessed by others. In distributed shared-memory (DSM) machines, logically shared memories are physically distributed across the system, and nonuniform memory access (NUMA) time results as a function of the distance between the requesting processor and the physical memory location that is accessed. In a message-passing architecture, processors have their own (disjoint) memory spaces and can therefore communicate only by passing messages among the processors. A message-passing architecture is also referred to as a distributed-memory or shared-nothing architecture.

In this section, we briefly discuss two cases that represent current design trends in shared-memory architectures (the HP/Convex Exemplar) and message-passing architectures (the IBM SP2). The Center for Parallel Computing of the University of Michigan (CPC) provides us with access to these machines.

1.1.1 Shared-Memory Architecture - HP/Convex Exemplar

The HP/Convex Exemplar SPP-1000 shared-memory parallel computer was the first model in the Exemplar series. (Convex Computer Company was acquired by Hewlett-Packard (HP) in 1996; as of 1998, Convex is a division of HP responsible for the service and future development of the Exemplar series.) The SPP-1000 has 1 to 16 hypernodes, with 4 or 8 processors per hypernode, for a total of 4 to 128 processors. Processors in different hypernodes communicate via four CTI (Coherent Toroidal Interconnect) rings. Each CTI ring supports global memory accesses with the IEEE SCI (Scalable Coherent Interface) standard [4]. Each hypernode on the ring is connected to the next by a pair of unidirectional links, and each link has a peak transfer rate of 600 MB/sec.

Figure 1-1: HP/Convex Exemplar System Overview (each hypernode contains four functional blocks, each with two processors, memory, and controllers, plus an I/O interface, all connected by a 5-port crossbar; hypernodes are connected by CTI rings).

Within each hypernode, four functional blocks and an I/O interface communicate via a 5-port crossbar interconnect. Each functional block contains two Hewlett-Packard PA-RISC 7100 processors [5] running at 100 MHz, two banks of memory, and controllers. Each processor has a 1 MB instruction cache and a 1 MB data cache on chip. Each processor cache is a

direct-mapped cache with a 32-byte line size. Each hypernode contains 256 MB to 2 GB of physical memory, which is partitioned into three sections: hypernode-local, global, and CTIcache. The hypernode-local memory can be accessed only by the processors within the same hypernode as the memory. The global memory can be accessed by processors in any hypernode. The CTIcaches reduce traffic on the CTI rings by caching global data that was recently obtained by this hypernode from remote memory (i.e. memory in some other hypernode). The eight physical memories in each hypernode are 8-way interleaved by 64-byte blocks. The basic transfer unit on the CTI ring is also a 64-byte block.

The CPC at the University of Michigan was among the first sites with a 32-processor SPP-1000 when the Exemplar series was introduced in 1994. In August 1996, the CPC upgraded the machine to an SPP-1600. The primary upgrade in the SPP-1600 is the use of the more advanced HP PA-RISC 7200 processor [6], which offers several major advantages over its predecessor, the PA-RISC 7100: (1) a faster clock rate (the PA-7200 runs at 120 MHz), (2) the Runway bus, a split-transaction bus capable of 768 MB/s bandwidth, (3) an additional on-chip 2-KByte fully-associative assist data cache, and (4) a four-state cache coherence protocol for the data cache. The newest model in the Exemplar series is the SPP-2000, which incorporates HP PA-RISC 8000 processors [7]. The SPP-2000 makes some dramatic changes in the architectural design of the processor and interconnection network, which provide a substantial performance improvement over the SPP-1600.

Using HP PA-RISC family processors, the Exemplar series runs a version of UNIX, called SPP-UX, which is based on the HP-UX that runs on HP PA-RISC-based workstations. Thus, the Exemplar series not only maintains software compatibility within its family, but can also run sequential applications that are developed for HP workstations. In addition, using mass-produced processors reduces the cost of machine development and enables more rapid upgrades (with minor machine re-design or modification) as new processor models become available.

1.1.2 Message-Passing Architecture - IBM SP2

The IBM Scalable POWERparallel SP2 connects 2 to 512 RISC System/6000 POWER2 processors via a communication subsystem called the High Performance Switch (HPS). Each processor has its private memory space that cannot be accessed by other processors. The HPS is a bidirectional multistage interconnect with wormhole routing. The IBM SP2 at the CPC has 64

nodes in four towers. Each tower has 16 POWER2 processor nodes. Each of the 16 nodes in the first tower has a 66 MHz processor and 256 MB of RAM. Each node in the second and third towers has a 66 MHz processor and 128 MB of RAM. The last 16 nodes each have a 160 MHz processor and 1 GB of RAM. Each node has a 64 KB data cache and a 32 KB instruction cache. The line size is 64 bytes for the data caches and 128 bytes for the instruction caches. The SP2 runs the AIX operating system, a version of UNIX, and has C, C++, Fortran 77, Fortran 90, and High Performance Fortran compilers.

Each POWER2 processor is capable of performing 4 floating-point operations per clock cycle. This system thus offers an aggregate peak performance of 22.9 GFLOPS. However, the fact that the nodes in this system differ in processor speed and memory capacity makes it a heterogeneous system, which poses additional difficulties in developing high performance applications. Heterogeneous systems, which often exist in the form of a network of workstations, commonly result from incremental machine purchases. As opposed to a heterogeneous system, a homogeneous system, such as the HP/Convex SPP-1600 at the CPC, uses identical processors and nodes, and is therefore easier to program and load-balance for scalability. In this dissertation, we focus our discussion on homogeneous systems, but some of our techniques can be applied to heterogeneous systems as well.

1.1.3 Usage of the Exemplar and SP2

The HP/Convex Exemplar and IBM SP2 at the CPC have been used extensively for developing and running production applications, as well as in performance evaluation research. Generally, shared-memory machines provide simpler programming models than message-passing machines, as discussed in the next section. The interconnect of the Exemplar is focused more on reducing communication latency, so as to provide faster short shared-memory communications. The SP2 interconnect is focused more on high communication bandwidth, in order to reduce the communication time for long messages, as well as to reduce network contention.

Further details of the Convex SPP series machines can be found in [8][9]. The performance of shared memory and communication on the SPP-1000 is described in detail in [10][11][12]. A comprehensive comparison between the SPP-1000 and the SPP-2000 can be found in [13]. Detailed performance characterizations of the IBM SP2 can be found in [14][15][16].

1.2 Parallelization of Applications

Parallelism is the essence of parallel computing, and parallelization exposes the parallelism in the code to the machine. While some algorithms (i.e. parallel algorithms) are specially designed for parallel execution, many existing applications still use conventional (mostly sequential) algorithms, and parallelization of such applications can be laborious. For certain applications, the lack of parallelism may be due to the nature of the algorithm used in the code; rewriting the code with a parallel algorithm could be the only solution. For other applications, limited parallelism is often due to (1) insufficient parallelization by the compiler and the programmer, and/or (2) poor runtime load balance. In any case, the way a code is parallelized is highly related to its performance.

To solve a problem by exploiting parallel execution, the problem must be decomposed: both the computation and the data associated with the problem need to be divided among the processors. As the alternative to functional decomposition, which first decomposes the computation, domain decomposition first partitions the data domain into disjoint subdomains, and then works out what computation is associated with each subdomain of data (usually by employing the "owner updates" rule). Domain decomposition is the method more commonly used by programmers to partition a problem, because it results in a simpler programming style with a parallelization scheme that provides straightforward scaling to different numbers of processors and data set sizes.

In conjunction with the use of domain decomposition, many programs are parallelized in the Single-Program-Multiple-Data (SPMD) programming style. In an SPMD program, one copy of the code is replicated and executed by every processor, and each processor operates on its own data subdomain, which is often accessed in globally shared memory by using index expressions that are functions of its Processor IDentification (PID). An SPMD program is symmetrical if every processor performs the same function on an equal-sized data subdomain. A near-symmetrical SPMD program performs computation symmetrically, except that one processor (often called the master processor) may be responsible for extra work such as executing serial regions or coordinating parallel execution. An asymmetrical SPMD program is considered to be a Multiple-Programs-Multiple-Data (MPMD) program whose programs are packed into one code.
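To make the SPMD style concrete, the following is a minimal sketch (written for this discussion, not taken from any code in this dissertation; the routine spmd_scale, the array a, and the arguments my_pid and n_procs are hypothetical names) of a block-partitioned SPMD computation in which each processor derives its index range from its PID:

      subroutine spmd_scale(a, n, my_pid, n_procs)
c     Hypothetical SPMD kernel: every processor runs this same routine,
c     but each one updates only the block of a() that it owns.
c     my_pid is a 0-based processor ID; n_procs is the processor count.
      integer n, my_pid, n_procs
      real a(n)
      integer i, lo, hi, chunk
c     Block decomposition of indices 1..n into n_procs nearly equal pieces.
      chunk = (n + n_procs - 1) / n_procs
      lo = my_pid * chunk + 1
      hi = min((my_pid + 1) * chunk, n)
      do i = lo, hi
         a(i) = 2.0 * a(i)
      end do
      return
      end

Every processor executes the same code; only the value of my_pid differs, which is what makes such a program scale straightforwardly to different numbers of processors and data set sizes.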

In this dissertation, we consider symmetrical or near-symmetrical SPMD programs that employ domain decomposition, because they are sufficient to cover a wide range of applications. The symmetrical or near-symmetrical SPMD programming style is favored not only because it is simpler for programmers to use, but also because of its scalability for running on different numbers of processors. Usually, an SPMD program takes the machine configuration as an input and then decomposes the data set (or chooses a pre-decomposed data set if the domain decomposition algorithm is not integrated into the program) based on the number of processors, and possibly also the network topology. For scientific applications that operate iteratively on the same data domain, the data set can often be partitioned once at the beginning and those subdomains simply reused in later iterations. For such applications, the runtime overhead of performing domain decomposition is generally negligible.

Parallel programming models refer to the type of support available for interprocessor communication. In a shared-memory programming model, the programmers declare variables as private or global, and processors share access to the global variables. In a message-passing programming model, the programmers explicitly specify communication using calls to message-passing routines. Shared-memory machines support shared-memory programming models as their native mode, while block moves between shared-memory buffer areas can be used to emulate communication channels for supporting message-passing programming models [8]. Message-passing machines can support shared-memory programming models via software emulation of shared virtual memory [17][18]. Judiciously mixing shared-memory and message-passing programming models in a program can often result in better performance.

The HP/Convex Exemplar supports shared-memory programming with automatic parallelization and parallel directives in its enhanced versions of C and Fortran. The message-passing libraries PVM and MPI are also supported on this machine. The SP2 supports the Fortran 90 and High-Performance Fortran (HPF) [19] parallel programming languages, as well as the MPI [20], PVM [21], and MPL message-passing libraries. Generally, parallelization with shared-memory models is less difficult than with message-passing models, because the programmers (or the compilers) are not required to embed explicit communication commands in the codes.

Direct compilation of serial programs for parallel execution does exist today [22][23][24], but the state-of-the-art solutions are totally inadequate. Current parallelizing compilers have some success parallelizing loops whose data dependences can be analyzed. Unfortunately, problems

often occur when there exist indirect data references or function calls within the loops, which causes the compiler to make safe, conservative assumptions, which in turn can severely degrade the attainable parallelism and hence performance and scalability. This problem limits the use of automatic parallelization in practice. Therefore, most production parallelizing compilers, e.g. KSR Fortran [25] and Convex Fortran [26][27], are of very limited use for parallelizing application codes. Interestingly, many of those problems can be solved by trained human experts. Conventional languages enhanced with parallel extensions, such as the Message-Passing Interface (MPI), are commonly used by programmers to parallelize codes manually, and in fact many manually-parallelized codes perform better than their automatically-parallelized versions.

So far, parallel programmers have been directly responsible for most parallelization work, and hence the quality of parallelization today usually depends on the programmer's skill. Parallelizing large application codes can be very time-consuming, taking months or even years of trial-and-error development, and frequently, parallelized applications need further fine tuning to exploit each new machine effectively by maximizing the application performance in light of the particular strengths and weaknesses of the new machine. Unfortunately, fine tuning a parallel application, even when code development and maintenance budgets would allow it, is usually beyond the capability of today's compilers and most programmers. It often requires an intimate knowledge of the machine, the application, and, most importantly, the machine-application interactions. Irregular applications are especially difficult for the programmer or the compiler to parallelize and optimize. Irregular applications and their parallelization and performance problems are discussed in the next section.

1.3 Problems in Developing Irregular Applications

1.3.1 Irregular Applications

Irregular applications are characterized by indirect array indices, sparse matrix operations, nonuniform computation requirements across the data domain, and/or unstructured problem domains [2]. Compared to regular applications, irregular applications are more difficult to parallelize, load balance, and optimize. Optimal partitioning of irregular applications is an NP-complete problem. Compiler optimizations, such as cache blocking, loop transformations, and parallel loop detection, cannot be applied to irregular applications since the indirect array references are not known until runtime and the compilers therefore assume worst-case dependence.
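To illustrate why indirect references defeat compile-time dependence analysis, consider the following generic loop (a sketch for this discussion, not taken from CRASH; x, y, idx, and n are assumed to be declared and initialized elsewhere):

c     The compiler cannot prove that idx(i) never repeats, so it must
c     conservatively assume that two iterations may update the same
c     element of x() and therefore leave the loop serial, even if idx
c     happens to be a permutation at run time.
      do i = 1, n
         x(idx(i)) = x(idx(i)) + y(i)
      end do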

Interprocessor communications and the load balance are difficult to analyze without performance measurement and analysis tools.

For many regular applications, domain decomposition is straightforward for programmers or compilers to apply. For irregular applications, decomposition of unstructured domains is frequently posed as a graph partitioning problem in which the data domain of the application is used to generate a graph where computation is required for each data item (a vertex of the graph) and communication dependences between data items are represented by the edges. The vertices and edges can be weighted to represent the amount of computation and communication, respectively, for cases where the load is nonuniform. Weighted graph partitioning is an NP-complete problem, but several efficient heuristic algorithms are available. In our research, we have used the Chaco [28] and Metis [29] domain decomposition tools, which implement several such algorithms. Some of our work on domain decomposition is motivated by and/or based on the profile-driven and multi-weight weighted domain decomposition algorithms developed previously by our research group [2][30].

1.3.2 Example - CRASH

In this dissertation, an example application, CRASH, is a highly simplified code that realistically represents several problems that arise in an actual vehicle crash simulation. It is used here to demonstrate these problems and their solutions. A simplified high-level sketch of the serial version of this code is given in Figure 1-2. CRASH exhibits irregularity in several aspects: indirect array indexing, unstructured meshes, and nonuniform load distribution. Because of its large data set size, communication overhead, multiple phases, and dynamic load balance problems, this application requires extensive performance tuning to perform efficiently on a parallel computer.

CRASH simulates the collision of objects and carries out the simulation cycle by cycle in discrete time. The vehicle is represented by a finite-element mesh which is provided as input to the code, as illustrated in Figure 1-3. Instead of a mesh, the barrier is implicitly modeled as a boundary condition. Elements in the finite-element mesh are numbered from 1 to Num_Elements.

      program CRASH
      integer Type(Max_Elements), Num_Neighbors(Max_Elements)
      integer Neighbor(Max_Neighbors_per_Element, Max_Elements)
      real_vector Force(Max_Elements), Position(Max_Elements),
     &            Velocity(Max_Elements)
      real t, t_step
      integer i, j, type_element
      call Initialization
c Main Simulation Loop
      t = 0
c First phase: generate contact forces
 100  do i = 1, Num_Elements
        Force(i) = Contact_force(Position(i), Velocity(i))
        do j = 1, Num_Neighbors(i)
          Force(i) = Force(i) + Propagate_force(Position(i), Velocity(i),
     &      Position(Neighbor(j,i)), Velocity(Neighbor(j,i)))
        end do
      end do
c Second phase: update position and velocity
 200  do i = 1, Num_Elements
        type_element = Type(i)
        if (type_element .eq. plastic) then
          call Update_plastic(i, Position(i), Velocity(i), Force(i))
        else if (type_element .eq. glass) then
          call Update_elastic(i, Position(i), Velocity(i), Force(i))
        end if
      end do
      if (end_condition) stop
      t = t + t_step
      goto 100
      end

Figure 1-2: Example Irregular Application, CRASH.

Depending on the detail level of the vehicle model, the number of elements varies. Variable Num_Neighbors(i) stores the number of elements that element i interacts with (which in practice would vary from iteration to iteration). Array Neighbor(*,i) points to the elements that are connected to element i in the irregular mesh, as well as other elements with which element i has come into contact during the crash. Type(i) specifies the type of material of element i. Force(i) stores the force calculated during contact that will be applied to element i. Position(i) and Velocity(i) store the position and velocity of element i. The force, position, and velocity of an element are declared as type real_vector variables, each of which is actually formed by three double precision (8-byte) floating-point numbers representing a three-dimensional vector. Assuming the integers are 32 bits (4 bytes) each, the storage requirement, or total data working set, for CRASH is about (84 + 4*Max_Neighbors_per_Element)*Num_Elements bytes.
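As a quick check of this formula against the two vehicle models introduced below in Table 1-1 (taking the 84-byte fixed per-element cost as given): the small model requires roughly (84 + 4*5) * 10000 bytes, i.e. about 1 MB, and the large model roughly (84 + 4*10) * 100000 bytes, i.e. about 12.4 MB, which matches the working-set column of the table.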

Figure 1-3: Collision between the Finite-Element Mesh and an Invisible Barrier (snapshots at Time = Before Collision, During Collision, and After Collision).

Table 1-1: Vehicle Models for CRASH.

    Vehicle Model    Num_Elements    Max_Neighbors_per_Element    Working Set (MB)
    Small            10000           5                            1
    Large            100000          10                           12

In this dissertation, we assume two different vehicle models; their properties are shown in Table 1-1.

The program calculates the forces between elements and updates the status of each element in each cycle. In the first phase, the Contact phase, the force applied to each element is calculated by calling Contact_force() to obtain and sum the forces between this element and the other elements with which it has come into contact. In the second phase, the Update phase, the position and velocity of each element are updated using the force generated in the Contact phase. Depending on the type of material, the Update phase calls Update_plastic() or Update_elastic() to update the position and velocity as a function of Force(i). Each cycle thus outputs a new finite-element mesh which is used as input to the next cycle.

This example program shows irregularity in several aspects. First, objects are represented by unstructured meshes. Second, in the Contact phase, properties of neighbor elements are referenced with indirect array references, e.g. Velocity(Neighbor(j,i)), referring to the velocity of the j-th neighbor of element i. Third, the load is nonuniform, because the load of calculating the force for an element during the Contact phase depends on how many neighbors that element has, and the load of updating the status of an element during the Update phase depends on the type of element being updated.

1.3.3 Parallelization of CRASH

Even for codes as simple as CRASH, most parallelizing compilers fail to exploit the parallelism in CRASH because of the complexity of analyzing indirect array references, such as Velocity(Neighbor(j,i)), and procedure calls. The communication pattern needs to be explicitly specified for a message-passing version, yet determining the pattern is not a trivial

task. Fortunately, a shared-memory parallelization does not require the specification of an explicit communication pattern, and hence is initially much easier to develop. Performance tuning does, however, require some understanding of the communication pattern in both cases.

The parallelism in CRASH can quite easily be recognized by a parallel programmer: the calculations for different elements within each phase can be performed in parallel, because they have no data dependence. Manual parallelization of CRASH can be implemented by parallelizing the major loop in each phase (indexed by i). A straightforward, simple parallel version of CRASH for the HP/Convex Exemplar, CRASH-SP, is illustrated in Figure 1-4. Note that the parallel directive, c$dir loop_parallel, is inserted ahead of each parallel loop. By default, loop_parallel instructs the compiler to parallelize the loop by partitioning the index domain into p subdomains: elements {1, 2, ..., N/p} are assigned to processor 1, elements {N/p+1, ..., 2N/p} are assigned to processor 2, etc., where N is Num_Elements and p is the number of processors used in the execution. Since this parallelization partitions the domain into subdomains of nearly equal size, the workload will be evenly shared among the processors if the load is evenly distributed over the index domain. However, for irregular applications like CRASH, this simple decomposition can lead to enormous communication traffic and poor load balance due to the unstructured meshes and nonuniform load distribution (a small numeric illustration is given below). More sophisticated domain decomposition algorithms are commonly used for partitioning the unstructured meshes so that the communication traffic is reduced [30].

1.3.4 Performance Problems

In this dissertation, we focus on solving major performance problems resulting from irregular applications, because they pose more difficult optimization problems and lack an effective general method to guide the programmers toward achieving high performance. Once such methods exist, regular applications may be handled as a degenerate case. These problems are described in the next chapter and are addressed throughout this dissertation.
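To make the load-balance concern in Section 1.3.3 concrete, consider a small hypothetical example (the numbers are chosen for illustration and are not measurements): with the small vehicle model (Num_Elements = 10000) on p = 4 processors, the default partitioning assigns elements 1-2500 to processor 1, elements 2501-5000 to processor 2, and so on. If the elements in the crumpling front end of the vehicle fall mostly in processor 1's block and average twice as many neighbors as the rest, processor 1 performs nearly twice as much Contact-phase work as the other processors, which then sit idle at the end of the phase.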

      program CRASH-SP
      integer Type(Num_Elements), Num_Neighbors(Num_Elements)
      integer Neighbor(Max_Neighbors_per_Element, Num_Elements)
      real_vector Force(Num_Elements), Position(Num_Elements),
     &            Velocity(Num_Elements)
      real t, t_step
      integer i, j, type_element
      call Initialization
c Main Simulation Loop
      t = 0
c First phase: generate contact forces
c$dir loop_parallel
 100  do i = 1, Num_Elements
        Force(i) = Contact_force(Position(i), Velocity(i))
        do j = 1, Num_Neighbors(i)
          Force(i) = Force(i) + Propagate_force(Position(i), Velocity(i),
     &      Position(Neighbor(j,i)), Velocity(Neighbor(j,i)))
        end do
      end do
c Second phase: update position and velocity
c$dir loop_parallel
 200  do i = 1, Num_Elements
        type_element = Type(i)
        if (type_element .eq. plastic) then
          call Update_plastic(i, Position(i), Velocity(i), Force(i))
        else if (type_element .eq. glass) then
          call Update_glass(i, Position(i), Velocity(i), Force(i))
        end if
      end do
      if (end_condition) stop
      t = t + t_step
      goto 100
      end

Figure 1-4: CRASH with Simple Parallelization (CRASH-SP).

1.4 Developing High Performance Parallel Applications

Often, performance tuning is called hand tuning, which emphasizes the programmer's central role in most performance tuning work. While many people hope that someday compiler technology will automate most performance tuning work, most parallel programmers today are directly responsible for a large portion of the performance tuning work, which often requires extensive knowledge of the underlying hardware characteristics. Parallel programmers seem to spend more time poring over performance information and repeatedly modifying the source code in a cut-and-try approach, rather than simply deciding how best to tune the performance and then doing it in one pass. In this section, we discuss this conventional scheme, recent trends, current problems, and our approach to developing high performance applications.

Figure 1-5: Typical Performance Tuning for Parallel Applications (programmer, compiler, application, machine, and performance tools).

1.4.1 Conventional Scheme

Figure 1-5 shows how performance tuning is typically done today. The compiler transforms the source code into an executable code with optimizations derived from source-code analysis and programmer-written directives. Although, in principle, compilers could accept feedback from the execution and use it when deciding whether and how to apply certain optimizations, such as during instruction scheduling [31][32], no compiler today provides an effective feedback mechanism with sufficient information to guide its optimizations. Performance enhancement is therefore the programmer's responsibility to carry out, primarily by revising the source code and modifying directives, perhaps with some help from performance assessment tools.

The hardware details of a computer are usually hidden from the user by commonly used programming models. Programmers may ignore many hardware issues that can greatly degrade a code's performance, even when substantial effort is spent varying the directives and flag settings for an optimizing compiler. As a result, hand tuning is subsequently needed to improve code performance, inefficient performance is accepted, or the parallel execution is abandoned altogether.

Under this conventional scheme, the effort to tune a parallel application, especially an irregular application, can be very time-consuming and/or ineffective. However, without proper tuning, peak performance on parallel computers, or even a respectable fraction of peak, can be very difficult to achieve. It should be no surprise that few tools are available to facilitate the tuning: for a long time, tool development was neglected or discontinued because of the typically short lifetimes of parallel machines. Poor performance and painful experiences in performance tuning have greatly reduced the interest of many programmers in parallel computing. These are the major problems in a conventional parallel application development environment.

1.4.2 Recent Trends

Until the early 1990s, some optimistic researchers believed that parallelization could and would soon be fully automated and fairly well optimized by future parallelizing compilers, and some believed that parallel machines would then be able to carry out parallel execution fairly efficiently with a minimum of hand tuning. Unfortunately, parallelization and performance tuning turned out to be more difficult than expected, and none of those expectations has been satisfied. More and more researchers realize that (1) manual parallelization is necessary and should be properly supported, and (2) the key to developing high performance applications lies in a strong understanding of the interactions between the machine and the application, i.e., machine-application interactions. Given this awareness, the parallel computing community started several projects (e.g. the Parallel Tools Consortium, est. 11/93) to promote the development of standardized application environments and performance tools that ease the programmer's burden and support the study of machine-application interactions. Since then, more research has focused on improving parallel tools, such as the following, to form a better application development environment:

• Standard Parallel Languages or Message-Passing Libraries, such as HPF [19] and MPI [20], which enable programmers to develop machine-independent codes and encourage vendors to optimize their machines for these languages. Among these standards, MPI is widely supported on almost all message-passing machines and even on most shared-memory machines (e.g. HP/Convex Exemplar); a minimal MPI sketch appears after this list.

• Parallelized and Tuned Libraries, such as Convex LAPACK [33], which ease the programmers' burden by parallelizing and fine-tuning commonly used routines. Many scientific/engineering applications can benefit from the availability of such libraries.

• Domain Decomposition Packages, such as Chaco [28] and Metis [29], which provide various algorithms for decomposing irregular problem domains. With such packages, programmers can save a considerable amount of time that would otherwise be spent writing and maintaining their own domain decomposition routines for irregular applications.

• Source-code Analyzers or Visualizers, such as APR Forge [34], which assist the users in parallelizing existing applications by revealing the data dependencies or data flow in the codes. Such tools directly assist the users, instead of attempting to accomplish automatic parallelization with a parallelizing compiler.

• Performance Assessment Tools, such as CXpa [35] and IBM PV [36], which assist the users in exploring the machine-application interactions in their applications. Performance assessment tools can be further categorized into three groups: performance extraction/measurement, performance evaluation/characterization, and performance visualization. Performance extraction/measurement tools focus on finding specific events in the application and measuring their impact on the performance. Performance evaluation/characterization tools analyze the application performance, characterize the performance, and roughly identify the performance problems. Performance visualization tools provide a friendly graphical interface to help the users access and understand performance information.

Note that some of these parallel tools address the problems in parallelizing applications, while others, also known as performance tools, address the problems that limit application performance. Often, both goals are considered by one tool, since the way that the code is parallelized is highly correlated with the application performance.
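As an illustration of the first item above, the following is a minimal, self-contained message-passing sketch written against the standard MPI Fortran bindings; it is not taken from CRASH, and the quantity being summed is only a placeholder. Each process computes a local contribution, and a single collective call combines them, so the same source should compile and run unchanged on any machine with an MPI library, such as the IBM SP2 or the HP/Convex Exemplar mentioned above.

      program mpi_sketch
      include 'mpif.h'
      integer ierr, rank, nprocs
      real local, global
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     each process computes some local contribution (placeholder value)
      local = real(rank + 1)
c     combine the local contributions into a global sum on process 0
      call MPI_REDUCE(local, global, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierr)
      if (rank .eq. 0) print *, 'global sum = ', global
      call MPI_FINALIZE(ierr)
      end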

1.4.3 Problems in Using Parallel Tools

Recent development of parallel tools has considerably improved the application development environment. However, current application development environments are still far from satisfactory. Typical problems that we have experienced are described below:

1. Lack of Tools and Difficulty of Using the Tools: While many tools have resulted from academic research, machine vendors have not been aggressively developing performance tools. According to the Ptools Consortium (http://www.ptools.org/): "Current parallel tools do not respond to the specific needs of scientific users who must become quasi-computer scientists to understand and use the tools."

2. Lack of Tool Integration: Combining the use of different tools can be painful and tricky. For example, feeding performance information from CXpa to Metis requires a mechanism to convert the performance data output by CXpa into a form that Metis accepts as input; a sketch of such a conversion appears after this list. Because of this lack of integration, programmers still spend a substantial amount of time interfacing the tools. Such routine work should be automated as much as possible, allowing the programmers simply to monitor the process and make high-level decisions.

3. Unpredictable and Unreliable Tuning Methodologies: Programmers have been relying, for better or worse, on their personal experience and on trial-and-error in their practice of performance tuning. The learning curve is tedious and time-consuming, with no guarantee of how much, if any, performance improvement to expect for a particular application. Furthermore, because the side effects of an optimization applied to solve one problem can expose or aggravate other problems, there are many cases where performance problems cannot be solved individually.

4. Unsuitable for Dynamic Applications: For one application, performance tuning may need to be repeated for different input data sets and different machines. Although integrating the performance tuning mechanism into the runtime environment would ameliorate this problem, runtime performance tuning is difficult to implement because performance tuning is far from automatic. Most performance tuning tools either cannot be, or are not, integrated into the runtime environment.
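The conversion mentioned in item 2 can be as simple as turning measured per-element costs into integer vertex weights on the mesh connectivity graph. The sketch below is only an illustration of that idea, under several assumptions: the array Time(i) is assumed to hold a per-element cost already extracted from a CXpa profile (CXpa itself does not emit this array directly), the connectivity arrays are those of CRASH, and the output is written in the plain-text graph format read by the stand-alone Metis 4.x partitioners (a header line giving the numbers of vertices and edges plus a format flag indicating vertex weights, followed by one line per vertex); the exact format should be checked against the Metis manual for the version in use.

c Sketch: convert per-element profile costs into a Metis graph file.
      subroutine write_metis_graph(n, nedges, Time, Num_Neighbors,
     &                             Neighbor, maxnb)
      integer n, nedges, maxnb
      real Time(n)
      integer Num_Neighbors(n), Neighbor(maxnb, n)
      integer i, j, w
c     nedges is the number of undirected edges, each counted once
      open(unit=10, file='crash.graph')
      write(10,*) n, nedges, 10
      do i = 1, n
c       scale the measured cost to a positive integer vertex weight
        w = max(1, int(Time(i) * 1000.0))
        write(10,*) w, (Neighbor(j,i), j = 1, Num_Neighbors(i))
      end do
      close(10)
      return
      end

A decomposition computed from such profiled weights assigns each processor roughly equal measured work rather than an equal number of elements, which is the essence of the profile-driven decomposition discussed in Section 1.4.4.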

1.4.4 The Parallel Performance Project

The Parallel Performance Project (PPP) research group was established at the University of Michigan (http://www.eecs.umich.edu/PPP/PPP.html) in 1992. The objective of our work is to develop and implement automated means of assessing the potential performance of applications on targeted supercomputers, high performance workstations, and parallel computer systems, identifying the causes of degradations in delivered performance, and restructuring the application codes and their data structures so that they can achieve their full performance potential. Many of the techniques presented in this dissertation are based on or motivated by previous or concurrent work in the PPP group, including:

• Experiences in Hand-tuning Applications: We have worked jointly with scientists and engineers in developing high performance applications in various fields, including (1) vehicle crash simulation with the Ford Motor Company, (2) ocean simulation for the Office of Naval Research, and (3) modeling and simulation of ground vehicles with the Automotive Research Center (http://arc.engin.umich.edu/) for the U.S. Army (TACOM). We are continuing work of this kind on (4) antenna design and scattering analysis with the Radiation Laboratory at U. of Michigan (http://www.eecs.umich.edu/RADLAB/), (5) modeling soil vapor extraction and bioventing of organic chemicals in unsaturated geological material with the Environmental and Water Resources Engineering program at U. of Michigan (http://www-personal.engin.umich.edu/~abriola/simulation.html) for the U.S. Environmental Protection Agency, and (6) computer simulation of manufacturing processes of materials and structures with the Computational Mechanics Laboratory (http://www-personal.engin.umich.edu/~kikuchi/) at U. of Michigan. This ongoing work is supported in part by the National Partnership for Advanced Computational Infrastructure (NPACI) (http://www.npaci.edu/) of the National Science Foundation.

• Machine Performance Evaluation: The PPP group has developed performance evaluation methodologies (e.g. [37]) and tools (e.g. [38]) and has evaluated targeted high performance workstations (IBM RS/6000 [39][40], HP PA-RISC, DEC Alpha) and parallel computer systems (KSR1/2 [41][42], IBM SP2 [14][15], HP/Convex Exemplar [11][12][13], Cray T3D). The knowledge gained regarding target machines helps develop future machines and machine-specific performance tuning techniques [43].

• Goal-Directed Performance Tuning: A machine-application hierarchical performance bounds methodology has been developed to characterize the runtime overhead and guide optimizations (e.g. [44][45][46][47][48][49]). Major optimization techniques developed in the PPP group include data layout (e.g. [45][50]), restructuring data access patterns (e.g. [51]), communication latency hiding (e.g. [52]), instruction scheduling (e.g. [53][54]), and register allocation (e.g. [55][56]).

• Domain Decomposition Techniques and Advanced Synchronization Algorithms: Our experience shows that irregular problem domains are common in many scientific/engineering applications. The PPP group has evaluated the performance of several domain decomposition packages, including Chaco [28] and Metis [29], for balancing the load in target applications; a simple check of the resulting load balance is sketched at the end of this subsection.

A profile-driven domain decomposition technique [30] and a multiple-weight (multiphase) domain decomposition algorithm [2] have been developed for balancing non-uniformly distributed computations in single-phase as well as multiple-phase applications. Advanced synchronization algorithms, such as fuzzy barriers [57] and point-to-point synchronization [58], have been studied and implemented to improve the load balance in target applications.

These techniques, together with other available tools, such as CXpa, CXtrace, and IBM PV, constitute the current application development environment for our targeted machines. While these techniques and tools have helped our performance tuning work in the PPP group, the resulting application development environment still suffered from the problems that we mentioned in Section 1.4.3; those problems led to the goals of this research.
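The following sketch is a hypothetical helper (not part of any of the packages named above) that quantifies how well a given decomposition balances the load: part(i) is the processor assigned to element i by a partitioner such as Chaco or Metis, and w(i) is a per-element cost estimate, either uniform or taken from a profile as in the conversion sketch of Section 1.4.3.

      subroutine check_balance(n, p, part, w, load, imbalance)
c     Sketch: compute the load-imbalance factor of a decomposition.
c     part(i) in 1..p is the processor owning element i, w(i) is its
c     estimated cost, and load(1..p) is caller-supplied scratch space.
      integer n, p, part(n), i
      real w(n), load(p), imbalance, total, maxload
      do i = 1, p
        load(i) = 0.0
      end do
      do i = 1, n
        load(part(i)) = load(part(i)) + w(i)
      end do
      total = 0.0
      maxload = 0.0
      do i = 1, p
        total = total + load(i)
        if (load(i) .gt. maxload) maxload = load(i)
      end do
c     1.0 means perfect balance; the most heavily loaded processor
c     determines the execution time of each parallel phase
      imbalance = maxload / (total / real(p))
      return
      end

An imbalance factor well above 1.0 indicates that, regardless of how little communication the decomposition induces, the most heavily loaded processor will dominate the execution time of each parallel phase.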

1.4.5 Goals of this Dissertation

In this dissertation, we discuss several new techniques that have been developed to address the weaknesses within this application development environment. More importantly, we present a unified approach to developing high performance applications by linking most of these known techniques within a unified environment. We believe that this unified approach will eventually lead to a well-integrated and fairly automated application development environment. The following statements describe the general goals of our research and the scope of this dissertation:

1. Develop New Tools and Improve Existing Tools: More techniques and tools are needed to characterize the machine and the machine-application interactions. The user interface of each tool should be designed so that it can be used by non-computer scientists with minimal knowledge of the underlying computer architecture.

2. Develop a Unified Performance Tuning Methodology: To speed up the performance tuning process, and to make it more automated in the future, it is best to develop a systematic method for selecting among optimization techniques and applying them in a proper order.

3. Provide an Environment for Programmers and Computer Architects to Work Jointly in Developing Applications: Programmers often possess an intimate knowledge of the inherent behavior of the underlying algorithms of their codes, while computer architects are generally more familiar with machine characteristics. For tuning a large application, direct communication between these two groups of people is generally both time-consuming and inefficient. We thus need an environment that helps programmers expose their application behavior to computer architects.

4. Incorporate Runtime Performance Tuning within Applications: For optimizing applications whose behavior may not be discoverable at compile time, some performance tuning should be carried out at runtime. To achieve this goal, proper runtime performance monitoring support, well-integrated performance tools, and automatic performance tuning algorithms are necessary. Although we cannot satisfy this goal without the cooperation of computer manufacturers and software providers, this dissertation lays the groundwork for satisfying it.

1.5 Summary and Organization of this Dissertation

In this chapter, we have overviewed parallel machines, parallelization of applications, common performance problems, and approaches to developing high performance parallel applications, and we have discussed the weaknesses in current application development environments. While most researchers focus on developing more parallel tools to help solve individual parallelization/performance problems, we aim at an integrated suite of existing and new tools and a unified approach to considering and solving the performance problems that they expose. In the following chapters, we discuss the techniques that we have developed and integrated to pursue these goals:

• Chapter 2, Assessing the Delivered Performance, discusses the machine-application interactions in a parallel system and how machine performance characterization and performance assessment techniques can expose these relatively subtle interactions to help programmers visualize the performance problems in their applications. We also present some innovative techniques that we developed for analyzing communication patterns in cache-based distributed shared-memory applications.

• Chapter 3, A Unified Performance Tuning Methodology, summarizes our performance tuning methodology: a unified, step-by-step performance tuning approach that logically integrates various useful techniques that may be used to solve specific performance problems in an irregular application. For each step, we identify the key issues that concern performance tuning at that step, as well as the specific tuning actions that are most appropriate for resolving those issues. The interactions among these issues and actions, which are vital to tuning the application effectively, are clarified in this chapter.

• Chapter 4, Hierarchical Performance Bounds and Goal-Directed Performance Tuning, extends previously developed goal-directed performance tuning work to characterize a parallel application more completely. This new performance characterization scheme is partly automated on the HP/Convex Exemplar with the CXbound tool that we developed. We explain how the performance bounds, and the gaps between successive bounds, can be used in conjunction with the step-by-step performance tuning scheme discussed in Chapter 3 to tune an application more efficiently.

• Chapter 5, Model-driven Performance Tuning, describes our approach to facilitating the communication between programmers and performance tuning specialists and to speeding up the performance tuning process. We discuss how the application behavior exposed by the programmers and by performance assessment tools can be integrated to form application models. These application models can be analyzed by the model-driven simulation tools that we developed to model the application performance. Model-driven analysis guides performance tuning specialists in tuning the application models and greatly reduces the number of tuning iterations that must be carried out explicitly on the actual application code. This model-driven performance tuning methodology thus serves as a means to shorten the application development time and, hopefully in the future, to tune the code dynamically at runtime.

• Chapter 6, Conclusion, concludes this dissertation by summarizing the key contributions of this research and their significance. Some topics for further research are also presented.

CHAPTER 2

ASSESSING THE DELIVERED PERFORMANCE

In the world of scientific computing, peak performance is an upper bound on the performance of a computer system. However, users often find that there is a large gap between the peak performance and the performance that their applications actually see, the delivered performance. Without proper utilization of the hardware resources, applications can perform very inefficiently. As the peak performance of parallel computers approaches a TeraFLOPS (one trillion floating-point operations per second), computer architectures are increasingly complex, and the performance gap is growing. Current compiler technology makes an effort to close this gap, but still leaves much to be desired, especially for parallel computing.

The gap between peak performance and delivered performance is a complex function of the machine-application interactions. The ability to observe detailed machine-application interactions is necessary in order to determine the primary causes of the performance gap, and subsequently to resolve them by boosting delivered performance, reducing cost, and finally accepting the remaining gap. Various public domain and commercial performance assessment tools have been developed for gathering, analyzing, and visualizing application performance. However, we have not seen any single tool or tool suite that provides complete performance assessment at the various needed levels of detail. In most cases that we have encountered in our practice of performance tuning, we have had to find ways to integrate the use of existing tools, and often to develop new tools, in order to gain a sufficient assessment of the delivered performance relative to the aspects shown in Figure 2-1.

This chapter focuses on analyzing machine-application interactions for parallel applications. While the performance of even a single modern processor is complex, the performance of a parallel system, with the addition of communication and synchronization, is even more mysterious to most users. For a user to gain a sufficient understanding of the performance of an application, the target machine must be well-characterized. For tuning the applications, the
