A Framework for Performance Evaluation and Optimization of Parallel Applications

(1)

A Framework for Performance Evaluation and Optimization of Parallel Applications

Shih-Hao Hung and Edward S. Davidson

Advanced Computer Architecture Laboratory The University of Michigan Ann Arbor, MI 48109-2122 {hungsh,davidson}@eecs.umich.edu

Abstract. Today’s high performance and parallel computer systems provide substantial opportunities for concurrency of execution and scalability that is largely untapped by the applications that run on them. Under traditional frame- works, developing efficient applications can be a labor-intensive process that requires an intimate knowledge of the machines, the applications, and many subtle machine-application interactions. Optimizing applications so that they can achieve their full potential on target machines is often beyond the programmer’s or the compiler’s ability or endurance. This paper argues for addressing the performance optimization problem by providing a framework for tuning application codes with substantially reduced human intervention by dynamically exploiting information gathered at compile time as well as run time so that the optimization is responsive to actual run time behavior as data sets change and installed systems evolve. Reducing performance tuning to a sufficing strategy, guided by goal-directed and model-driven methodologies, should greatly reduce the development cost of producing highly tuned parallel codes. By supporting these methodologies with a range of appropriate tools, we hope to pave the way for the incorporation of this method within future generation compilers so that it can be fully automated. Such compilers could then afford to incorporate very elaborate tuning techniques since they would be called upon only when and where needed, and applied there only to the necessary degree.

1 Introduction

Today’s high performance and parallel computer systems provide substantial opportunities for concurrency of execution and scalability that is largely untapped by the applications that run on them. Substantial performance and scalability gains can be achieved by further optimization of these application codes to better exploit the features of existing architectures [1][2][3][4][5][6][7][8][9][10]. Without such code optimization, new features deemed to be of theoretical interest may well prove to be very difficult to justify in practice. In contrast, well-optimized codes can serve much better in pointing the way to practically useful features that can simply and effectively serve their inherent needs.

Developing efficient applications within a traditional framework can be a labor- intensive process that requires an intimate knowledge of the machines, the applica-

(2)

tions, and many subtle machine-application interactions. Optimizing applications so that they can achieve their full potential on target machines is often beyond the programmer’s or the compiler’s ability or endurance. First, the performance behavior of the target application needs to be well-characterized, so that existing performance problems in the application can be exposed and solved. Second, the performance characterization and optimization process needs to be fast, particularly for applications whose performance behavior changes dynamically over time. Existing performance tools or compilers are inadequate in speed, level of detail and/or sophistication for easily and effectively solving most performance problems in applications.

We believe that the optimization problem can be better solved in future systems by using a variety of software and hardware approaches that are well-designed to com- plement one another. These will provide improved compiler analysis and code optimization techniques, driven by ample performance monitoring support in the hardware. We propose an integrated application development environment that systematically and dynamically coordinates the use of individual techniques and tools during an interactive process of application tuning and execution. By exploiting dynamic information gathered at run time so that the optimizations are responsive to actual run time behavior as data sets change and installed systems evolve, such an environment would be capable of achieving well-tuned codes with substantially reduced human intervention.

Section 2 contrasts various forms of optimization. Section 3 describes a framework that supports dynamic optimization and systematically orchestrates the optimization process. Section 4 presents two application cases that illustrate the use of the framework. Sections 5 and 6, respectively, identify areas for further development and present conclusions.

2 From Static to Dynamic Optimization

In static optimization, applications are optimized during compile time. Conventional compilers [10][11] rely heavily on source-code analysis to extract and predict the performance behavior of applications prior to execution. However, even today’s best source-code analysis techniques have difficulty obtaining accurate performance information due to the need for interprocedural analysis, disambiguating indirect data references, modeling detailed machine operation, and predicting dynamic runtime behavior. Without sufficient information, compilers cannot generally optimize applications very well.

In incremental (or iterative) optimization, additional program information is acquired by profiling or tracing application execution, and application performance can be incrementally improved as more profiled runs are executed with interspersed optimization passes. Profiling or tracing helps characterize the performance problems that occur in actual application runs, and thus focuses the optimization effort on the most significant problems that concern either the application code or the machine.

Performance characterization provides a feedback mechanism that allows very aggressive optimization techniques to be applied selectively, evaluated, and adjusted

(3)

for maximal gain with minimal risk and overhead. However, an executable code generated by an incremental compiler is tuned statically for particular profiled runs, which may or may not accurately project future application runs. If either the application input set or the target machine differs considerably from these profiled runs, part of the incremental optimization previously done may become obsolete, necessitating further profiling and re-optimization.

While Just-in-time (JIT) or on-the-fly compilation has been used to accelerate the execution of programs written in interpretive languages, such as Smalltalk [12] and Java [13], JIT compilation also allows part of the optimization be pursued at the last moment prior to each execution of the application (or application phase). As the target machine configuration and application input are available, JIT compilers can then perform machine-specific and input-specific optimization for particular application runs. JIT compilation is thus useful for optimizing applications that would run on many different machines. However, as JIT compilation itself incurs runtime overhead, the extent of optimization can be relatively limited. Thus, complicated, time- consuming performance analyses may not be compatible with JIT compilation.

For dynamic application behavior, involving dynamic load imbalance, interactive input, dynamic multitasking, etc., static optimizations generated by the above three mechanisms may need to be supplemented with dynamic schemes that carry out optimization adaptively during the runtime of the applications. Adaptive optimization of application performance requires appropriate performance monitoring support as well as efficient algorithms to recognize and resolve performance problems rapidly.

Each of the above optimization schemes has its advantages and weaknesses for particular applications. A versatile optimization process should incorporate a variety of optimization schemes in order to address a wide range of performance problems.

Moreover, a single performance problem may be better solved by a combination of multiple schemes, judiciously conducted to minimize conflicts and redundancies caused by different schemes and individual optimizations with different objectives.

3 A Framework for Integrated Performance Evaluation and Dynamic Optimization

In this section, we describe a framework that systematically incorporates and coordinates a wide range of performance characterization and optimization methods. In the following subsections, we discuss the key techniques that help unify the optimization process and effectively support dynamic optimization. Section 3.1 describes the use of hierarchical performance bounds to characterize application performance. Guided by the performance bounds, Section 3.2 suggests a goal-directed strategy that addresses individual performance problems in a logical order. Section 3.3 discusses an application modeling technique that allows program and performance information to be integrated for analysis and optimization via model-driven simulation.

Reducing performance tuning to a sufficing strategy, guided and accelerated by the goal-directed and model-driven methodologies, should greatly reduce the development cost of producing highly tuned parallel codes. Supporting these methodologies

(4)

with a range of appropriate tools, we hope to pave the way for the incorporation of this method within future generation compilers so that it can be fully automated. Such compilers could then afford to incorporate very elaborate tuning techniques since they would be called upon only when and where needed, and applied there only to the necessary degree.

3.1 Hierarchical Optimization

Figure 1 illustrates the incremental/dynamic optimization process. First, performance profiling reports raw performance metrics or traces runtime events, which are then analyzed to characterize performance of the target application. Based on this characterization, performance tuning is applied to improve performance by modifying either the application code or the machine. Such an optimization process can be carried out between application runs as needed, i.e. by incremental optimization. The costs of carrying out performance analysis and tuning do not incur runtime overhead. Addi- tionally, dynamic optimization can be employed, but to be effective, the runtime overhead it incurs must be within an acceptable range, which requires both hardware performance monitoring support and the use of only a limited range of optimization techniques.

Numerous performance problems can be solved with static compiler techniques, which incur no runtime overhead. Compilers can thus employ a broad range of analyses and optimizations for static optimization. Unfortunately, some elaborate compiler optimizations can be extremely time-consuming, and they are often not considered for commercial compilers. Furthermore, tasks such as performance modeling, interprocedural analysis, and disambiguation of indirect data references, are very time- consuming or even impossible for a compiler to perform. In our opinion, they can be better addressed by profiling and using a profile-driven incremental/dynamic

application

machine

performance profiling

performance modifying the application code

adjusting the machine

goal-directed tuning characterization performance

Fig. 1: Profile-based Goal-Directed Performance Tuning

(5)

approach. More elaborate optimization techniques then become attractive as they can be employed selectively only when they will be most effective as determined by adequate profiling and analysis.

The above discussion is illustrated in Figure 2, which compares the general cost (the overhead and the software/hardware support) and coverage (the range of targeted problems) of the four different optimization schemes (levels) and shows the sources of application information that are available to each of them. Generally, the closer to the actual execution, the more information can be made available, but the cost of optimization rises as it demands more hardware/software support and the time required to carry out optimization becomes more critical. Thus, fewer problems can be solved in the affordable time as the execution time approaches. Therefore, while it seems logical to delay solving performance problems until further information is gathered, it is best to solve particular types of problems as early as possible.

The recommended hierarchical approach toward optimizing an application is to:

(1) apply static optimization first, (2) address unsolved, semi-dynamic performance problems (problems that become tractable with profiling or tracing) with incremental optimization, (3) perform input- or machine-specific optimization with JIT compilation, and (4) use dynamic optimization to detect and solve problems that occur dynamically at runtime. The techniques discussed in Section 3.2 and Section 3.3 are designed to address these issues in this manner.

static incremental

just-in-time dynamic optimization

optimization

Fig. 2: A Hierarchical View of the Information Used by the Four Optimization Schemes machine configuration

input

code

Application

Information Optimization

increased cost

actual execution

profile

(6)

3.2 Goal-Directed Compilation

For optimizing the performance of an application, the goal is to minimize the overall application runtime. Reducing the overhead caused by multiple problems does not necessarily amount to eliminating individual problems. Furthermore, optimizing overall application performance is more difficult than optimizing the performance of individual routines locally. For the purpose of successfully directing the compilation process toward the goal, we need a method for gauging the distance to the goal and distinguishing the significance of each individual problem. In this subsection, we discuss the use of hierarchical performance bounds to realize such a goal-directed compilation.

3.2.1 Hierarchical Performance Bounds

A performance bound is an upper bound on the best achievable performance. For assessing the performance of an entire application, performance is best measured by the total runtime, rather than by bandwidth or rate-based metrics such as MFLOPS.

Hierarchical machine-application bounds models, collectively called the MACS bounds hierarchy, have been used to characterize application performance by expos- ing performance gaps between the successive levels of the hierarchy [1][3][4].

The MACS machine-application performance bound methodology provides a series of upper bounds on the best achievable performance (equivalently, lower bounds on the runtime) and has been used for a variety of loop-dominated applica- tions on vector, superscalar and other architectures. The hierarchy of bounds equa- tions is based on the peak performance of a Machine of interest (M), considering also a high level Application code of interest (MA), the Compiler-generated workload (MAC), and the actual compiler-generated Schedule for this workload (MACS), respectively.

The MACS bounds hierarchy is extended here to characterize application perfor- mance on parallel computers. The extended hierarchy addresses cache misses in the shared-memory system and the runtime overhead due to degree of parallelization, load imbalance, multiple program regions with different workload distributions, dynamic load imbalance, and I/O and operating system interference. This perfor- mance bounds hierarchy [5], as shown in Figure 3, successively includes major constraints that often limit the delivered performance of parallel applications. Beyond the MACS bounds, additional constraints are included in the order of: finite cache effect (MACS$ or I bound), partial application parallelization (IP bound), communication overhead (IPC bound), I/O and operating system interference (IPCO bound), overall¹ load imbalance (IPCOL bound), multiple phase load imbalance (IPCOLM bound), and dynamic load imbalance (IPCOLMD bound). We have found this ordering to be

1. Overall load imbalance refers to the imbalance in the distribution of the total load assigned to each processor over the entire application.

(7)

intuitive and useful in aiding the performance tuning effort; however, other variations or refinements could be considered based on application characteristics. Our bounds generation tool, CXbound, calculates the above performance bounds, except IPCOLMD bound, for applications run on HP/Convex SPP-1600, based on profiles generated by CXpa [14].

The gap between two successive bounds is named after the performance constraint that differentiates the two bounds. However, while we tried to assign a different letter to each new gap, the letters C and M are each repeated twice in the bounds hierarchy.

To avoid confusion, we shall refer to the Communication gap and Multiphase gap as C′ gap and M′ gap, respectively, to distinguish them from the Compiler inserted instructions gap and the Machine peak performance.

The definition and calculation of the bounds hierarchy is presented below. The use of the bounds hierarchy for goal-directed optimization is discussed in Section 3.2.2.

Fig. 3: Performance Constraints and the Performance Bounds Hierarchy.

I-Partial parallelization (IP) bound

Machine (M) bound M-Application (MA) bound MA-Compiler (MAC) bound MAC-Schedule (MACS) bound MACS-Cache (MACS$) bound IPC-Operating system (IPCO) bound IPCO-Load imbalance (IPCOL) bound IPCOL-Multiphase (IPCOLM) bound Actual runtime

Dynamic load imbalance Unmodeled effects

Multiphase load imbalance

Overall load imbalance

Interprocessor communication

Partial parallelization

Finite cache effect Data dependency, branches,

Compiler-inserted operations pipeline bubbles

Mismatched application workload

System Constraints Gaps

Machine peak performance

Execution Time

IPCOLM-Dynamic (IPCOLMD) bound

Ideal-parallelization (I) bound IP-Communication (IPC) bound I/O, Operating System Events

M A C S

$ P C′

O L M′

D X

Uniprocessor BoundsParallel Bounds

(8)

Machine Peak Performance: M Bound

The Machine (M) bound is defined as the minimum run time if the application workload were executed at the peak rate. The minimum workload required by the application is indicated by the total number of operations¹ observed in the high-level source code of the application. The machine peak performance is specified by the maximum number of operations that can be executed by the machine per second. The M bound (in seconds) can be computed by

M Bound = (Total Number of Operations in Source Code)/

(Machine Peak Performance in Operations per Second).

Application Workload: MA Bound

The MA bound considers the fact that an application has various types of operations that have different execution times and use different processor resources (functional units). Functional units are selected for evaluation if they are deemed likely to be a performance bottleneck in some common situations. The MA bound of an application counts the operations for each selected function unit from the high level code of the application, the utilization of each functional unit is calculated, and the MA bound is determined by the execution time of the most heavily utilized functional unit. The MA bound thus assumes that no data or control dependencies exist in the code and that any operation can be scheduled at any time during the execution, so that the function unit(s) with heaviest workload is fully utilized.

Compilation: MAC Bound

The MAC bound is similar to MA, except that it is computed using the actual operations produced by the compiler, rather than only the operations counted from the high level code. Thus MAC still assumes an ideal schedule, but does account for redundant and unnecessary operations inserted by the compiler as well as those that are necessary in order to orchestrate the code for the machine being evaluated. MAC thus adds one more constraint to the model by using an actual rather than an idealized workload.

Instruction Scheduling: MACS Bound

The MACS bound, in addition to using the actual workload, adds another constraint by using the actual schedule rather than an ideal schedule. The data and control dependencies limit the number of valid instruction schedules and may result in pipeline stalls (bubbles) in the functional units. A valid instruction schedule can require more time to execute than the idealized schedules we assumed in the M, MA, and MAC bounds.

1. In our work on scientific applications, “operations” is taken to mean floating-point operations.

(9)

Acquiring the I (MACS$) Bound

The I (MACS$) bound measures the minimum run time required to execute the application under an ideal (zero communication overhead and perfectly load balanced) parallel environment with no I/O or OS interference. The I bound for an SPMD application is the average MACS$ bound among the processors. Thus, given the number of processors involved in the execution, N, and the MACS$ bounds on the runtime for individual processors, Ω₁, Ω₂,..., Ω_N, the averaged I bound is calculated as:

I Bound =

(Σ

i=1..N Ω_i

)

^/N.

Acquiring the IP Bound

The degree of parallelization in the application is a factor that can limit the parallelism in a parallel execution. The application may contain sequential regions that are executed sequentially by one processor. Let the total computation time in the sequential regions be Ωs and total computation time in the parallel regions be Ωp, the IP bound for an N-processor execution is defined as the minimum time required to exe- cute the application under the assumption that the parallel regions are executed under an ideal parallel environment, i.e.

IP Bound = Ωs+Ωp/N, which is also known as Amdahl’s Law.

Acquiring the IPC Bound

The IPC bound is defined by the minimum time required to execute the application workload with actual communications on actual target processors, under the assumption that the workload in the parallel portions is always perfectly balanced. Note that communications may add extra workload to both the sequential portions and the parallel portions of the application. Amdahl’s Law is reapplied to the increased sequen- tial and parallel workload to acquire the IPC bound for a N-processor execution, i.e.

IPC Bound = Ω_s′ + Ω_p′ /N,

where Ω_s′ and Ω_p′ denote the sequential and parallel workload assumed in the IPC bound.

CXbound uses the CPU time in N-processor profiles generated by CXpa to mea- sure the run time that processors spend on computation and communication. The parallel CPU time (Ω_p′) that the processors spend in the parallel regions, is calculated by

Ω_p′ =

Σ

q

Σ

r c_r,q ,

where c_r,q is the total CPU time that processor q spends in parallel region r. The sequential CPU time (Ω_s′) is summed over the serial regions.

(10)

Acquiring the IPCO Bound

In many high-performance applications, input and output for a program occur mostly in the form of accessing mass storage and other peripheral devices (e.g. terminal, net- work, printer,...etc.). I/O events are mostly handled by the operating system (OS) on modern machines. The OS also handles many other operations, such as virtual memory management and multitasking, in the background. These background OS activi- ties may or may not be originated by the target application, but can greatly affect the performance of the target application.

To acquire the IPCO bound for an N-processor execution, CXbound first calculates the sequential execution time (Ω_s′′) and parallel execution time (Ω_p′′) under the environment that the IPCO bound models:

Ω_s′′ =

Σ

q=1..N

Σ

_r∈Sw_r,q , Ωp′′ =

Σ

q=1..N

Σ

r∈P w_r,q ,

where w_r,q is the wall clock time that processor q spent in region r, S is the set of sequential regions, and P is the set of parallel regions. As the wall clock time reported by CXpa, additionally includes the time spent in OS routines, which is not included in the reported CPU time. Then, Amdahl’s Law is reapplied to the increased sequential and parallel execution times under the environment that the IPCO bound models, i.e.

IPCO Bound = Ω_s′′ + Ω_p′′ /N.

Acquiring the IPCOL Bound

Load imbalance affects the degree of parallelism in the parallel execution. The execution time of an application with load imbalance is bounded by the time required to execute on the most heavily loaded processor. The IPCOL bound is defined as the minimum time required to execute the largest load assigned to one processor, under the assumption that the load from different parallel regions and iterations that is assigned to a particular processor can simply be combined.

The total wall clock time that processor q spent in parallel regions is calculated by summing processor q’s wall clock time over the parallel regions, i.e.

Ω_p,q′′ =

Σ

_r∈P w_r,q .

The IPCOL bound for the parallel regions is determined by the heaviest parallel workload among the processors; the IPCOL bound for the sequential region is carried over from the IPCO bound (Ω_s′′). Τhe IPCOL bound is thus

IPCOL Bound = Ω_s′′+ Max_q=1..N{Ω_p,q′′}.

In Figure 4, we illustrate how the IPCO and IPCOL bounds are calculated from a performance profile. The example run consists of a two-iteration loop, in which two parallel regions are each executed on two processors. Figure 4(a) shows the workload

(11)

distribution for this example. Since this example contains no sequential region, the IPCO bound (41) is essentially the average workload over the two processors, and the IPCOL bound (42) is the maximum overall workload between the two processors, as calculated in Figure 4(b). As indicated by the L gap, the load imbalance of overall workload causes an overhead of 1, which amounts to a 2.43% increase in execution time over a perfectly balanced execution.

Acquiring the IPCOLM Bound

The IPCOLM bound characterizes the multiphase load imbalance in the application.

Multiphase load imbalance usually results from different workload distributions in different program phases of the application that are separated by barrier synchroniza- tions. The execution time for each parallel region is determined by the most heavily loaded processor (the longest running thread) in that region. The IPCOLM bound is calculated by summing the execution time of the longest thread over the individual program regions, namely

IPCOLM Bound = Ω_s′′ +

Σ

_r∈P Max_q=1..N{w_r,q}.

where w_r,q, Ωs′′, and N are as above.

The Multiphase (M′) gap (IPCOLM - IPCOL) characterizes the performance impact of multiphase load imbalance. Note that an application can pose serious multiphase load imbalance and still be well balanced in terms of total workload. As we illustrate in Figure 4(c), the calculation of the IPCOLM bound finds the local maxima for individual parallel regions and hence is never smaller than the IPCOL bound. The multiphase load imbalance in this example causes an M’ gap of 4, which equals 4/42

= 9.5% runtime increase over the IPCOL bound.

Actual Run Time and Dynamic Behavior

The actual run time is measured by the wall clock time of the entire application. The gap between the actual run time and the IPCOLM bound (unmodeled gap) should characterize both dynamic behavior and other factors that have not been modeled in the IPCOLM bound, e.g. the cost of spawn/join and synchronization operations.

Dynamic workload behavior can occur if the problem domain or the workload distribution over the domain changes over time. This happens often in programs that model dynamic systems. Dynamic behavior can result in an unpredictable load distribution and renders static load balancing techniques ineffective. An IPCOLMD bound could be generated, as in Figure 4(d), to model the dynamic workload behavior if the execution time for each individual iteration is separately reported, i.e.

IPCOLMD Bound = Ω_s′′ +

Σ

_r∈P

Σ

i=1,Num_Iter Max_q=1,N{w_r,q,i} .

where w_r,q,i is the wall clock time that processor q spent in region r for iteration i, and Num_Iter is the number of iterations.

(12)

Fig. 4: Calculation of the IPCO, IPCOL, IPCOLM, IPCOLMD Bounds.

Iter./Region Load on Proc 0 Load on Proc 1

1/1 10 5

1/2 10 15

2/1 5 6

2/2 15 16

(a) A Profile Example.

Iter./Regions Load on Proc 0 Load on Proc 1

All/All 40 42

IPCO Bound = (40 + 42)/2 = 41 IPCOL Bound = Max{40, 42} = 42 Load Imbalance Gap = IPCOL - IPCO = 42 - 41 = 1

(b) Calculation of the IPCO and IPCOL Bounds.

Iter./Region Load on Proc 0 Load on Proc 1 Max. Load

All/1 15 11 15

All/2 25 31 31

IPCOLM Bound = (Max. Load of Phase 1) + (Max. Load of Phase 2) = 15+31

= 46

Multiphase Gap = IPCOLM - IPCOL = 46 - 42 = 4 (c) Calculation of the IPCOLM Bound.

Iter./Region Load on Proc 0 Load on Proc 1 Max. Load

1/1 10 5 10

1/2 10 15 15

2/1 5 6 6

2/2 15 16 16

IPCOLMD Bound =

Σ

(Max. Load in each region for each iteration) = 10+15+6+16 = 47

Dynamic Gap = IPCOLMD - IPCOLM = 47 - 46 = 1 (d) Calculation of the IPCOLMD Bound.

(13)

The Dynamic (D) gap characterizes the performance impact of the dynamic load imbalance in the application. The D gap in the above example is primarily due to the change of load distribution in region 1 from iteration 1 to iteration 2. A more dynamic example is given in Figure 5(a), and the performance problem, i.e. the dynamic behavior, is revealed via the bounds analysis shown in Figure 5(b). A severe D gap, for example, may be reduced by relaxing the synchronization between iterations or finding a better static domain decomposition, or may require dynamic decomposition.

Unfortunately, CXpa is not suitable for measuring the execution time for each individual iteration, and hence CXbound cannot generate the IPCOLMD bound. So far, we have not found a proper tool to solve this problem on the HP/Convex Exemplar.

Thus, in the case studies of Section 4, the dynamic behavior effects are lumped together with the other “unmodeled effects” as the unmodeled (X) gap which is then calculated as (Actual Execution Time) - (IPCOLM Bound).

3.2.2 Goal-Directed Optimization

In ascending through the bounds hierarchy from the M bound, the model becomes increasingly constrained as it moves in several steps from potentially deliverable toward actually delivered performance. Each gap between successive performance bounds exposes and quantifies the performance impact of specific runtime constraints, and collectively these gaps identify the bottlenecks in application performance. Performance tuning actions with the greatest potential performance gains can be selected according to which gaps are the largest, and their underlying causes. This approach is referred to as goal-directed performance tuning or goal-directed compila-

Fig. 5: An Example with Dynamic Load Imbalance.

Iteration/Region Load on Proc 0 Load on Proc 1

1/1 15 5

1/2 5 15

2/1 5 15

2/2 15 5

(a) A Profile Example

Bound Value Gap from Previous Bound

IPCO 40 N/A

IPCOL 40 0

IPCOLM 40 0

IPCOLMD 60 20

(b) Calculation of the IPCO and IPCOL Bounds

(14)

tion [5][4], which can be used to assist hand-tuning, or implemented within a goal- directed compiler for general use.

We utilize the hierarchical bounds model in implementing a goal-directed optimization strategy. As illustrated in Figure 6, we associate specific performance gaps with several key performance tuning steps in our application tuning work. Before each step, we consider specific gap(s). For example, the actions in Step 1 (partition- ing) are associated with gaps C’, L, M’, and D. Significant gaps help guide what specific performance tuning actions should be considered for each step. A step may be skipped if there is no significant gap associated with that step. After one or more performance tuning actions are applied, the bounds hierarchy can be re-calculated to evaluate the effectiveness and the side-effects of these actions.

Optimize Processor Performance

Balance Load for Single Phases

Reduce Synchroni- zation Overhead

Balance Com- bined Load for Multiple Phases

Balance Dynamic Load Serial Program

Performance-tuned Parallel Program

Fig. 6: Performance Tuning Steps and Performance Gaps.

Step 4

Step 5 Step 3

Step 6 Step 7

Tune Communica- tion Performance

Partition the Problem Domain

Step 1

Step 2

A, C, S, $-gap

C’-gap L-gap M’-gap D-gap

L, M’ ,D-gap C’, L, M’, D-gap

(15)

The numbers show the order of the steps, and the arrows show the dependence between the steps. When the program is modified in a certain step, the earlier steps found by following the backward arrows may need to be performed again as they may conflict with or be able to take advantage of the modification. For example, load balancing techniques in Steps 4, 6 and 7 may suggest different partitionings of the domain, which would cause different communication patterns that may need to be re- optimized in Step 2. Changing the memory layout of arrays to eliminate false sharing in Step 2 might conflict with certain data layout techniques that improve processor performance in Step 3. Changing communication and processor performance may affect the load distribution which then needs to be re-balanced. In general, this graph detects various types of performance problems in an ordered sequence, and a step needs to be repeated only if particular problems are detected that need to be dealt with. Less aggressive optimization techniques that are more compatible with one another are better choices in the earlier phases of code development.

For each step, we identify the relevant performance issues and possible actions to address them. Table 1 [5] shows such a grouping. For example, tuning action (AC- 29), Self-Scheduling, may be selected to solve issue (I-15), Balancing a Nonuniformly Distributed Load, for targeting the L gap gauged by the hierarchical performance analysis, but it can affect other issues either positively (for issues 16, 18, 19, and 20) or negatively (for issues 2 and 17). The “Other Affected Issues” column lists the issues are deemed likely to be affected, either positively or negatively. The gaps associated with the affected issues are also likely to change, and thus should be examined for evaluating the applied action and choosing the next action. This table categorizes the performance issues and gaps that would be affected by each tuning action, and thus provides a mechanism for evaluating tuning actions with hierarchical performance bounds in an incremental optimization process.

The tuning actions listed above should be viewed as the beginning of an open extensible set. Systems should provide performance monitoring metrics that are sufficient for evaluating the bounds, so as to identify the performance issues that are most critical to address. The list of possible tuning actions should include all optimizations that the available compilers, precompilation tools, and run-time environments might perform (and others that may involve programmer assistance), except for those rou- tine or local optimizations that conventional optimizers already perform well on their own without the goal-directed assistance of this framework.

With this framework, a large set of aggressive optimization techniques can be included. Such techniques are often not implemented in commercial optimizers today due to the high overhead of performing the optimization and the weak, overgeneral- ized, and heuristic state of the art that chooses when, where and how to apply them.

However, this framework carefully selects these optimizations only when necessary and employs them to the extent necessary, precisely where they are deemed to be most beneficial. It thus enhances the benefits, and reduces the overhead and risks of incorporating aggressive optimizations.

(16)

Tuning Step

Performance

Issue Tuning Actions

Positive for Issues

Negative for Issues

Other Affected

Issues

Targeted Perform- ance Gap(s) in

this Step

Primarily Affected Perform- ance Gap(s) Partition

the Problem (Step 1)

(I-1) Partition- ing an Irregu- lar Domain

(A-1) Applying a Proper Domain Decomposition

Algorithm for (I-1)

(1) - (3)(6)(15)

(18)(19) P $, P, C’, L, M’, D

Tuning the Communi- cation Per- formance

(Step 2)

(I-2) Exploiting Processor

Locality

(A-2) Proper Utilization of

Distributed Caches for (I-2) (2) - (4)(5)(12)

(14) C’ $, C’

(A-3) Minimizing Subdo-

main Migration for (I-2) (2) -

(1)(4)(12) (15)(18)

(19)

C’ C’, L, M’, D (I-3) Minimizing

Interprocessor Data Depen-

dence

(A-4) Minimizing the Weight of Cut Edges in Domain Decomposition for (I-3)

(3) - (1)(6)(15)

(18)(19) C’ C’, L

(I-4) Reducing Superfluity

(A-5) Array Grouping for (I-

4) (4) - (5)(12) C’ C’, $

(I-5) Reducing Unnecessary

Coherence Operations

(A-6) Privatizing Local Data

Accesses for (I-5) (4)(5) (12) (2)(14) C’ C’, $, M (A-7) Optimizing the Cache

Coherence Protocol for (I-5) (5) - (2)(4)(12) C’ C’, $ (A-8) Cache-Control Direc-

tives for (I-5) (5) - (2)(4)(12) C’ C’, $

(A-9) Relaxed Consistency

Memory Models for (I-5) (5) - (2)(4)(12) C’ C’, $ (A-10) Message-Passing

Directives for (I-5) (5) - (4)(7)(9) C’ C’, $

(I-6) Reducing the Communica- tion Distance

(A-11) Hierarchical Parti-

tioning for (I-6) (6) -

(1)(2)(3) (15)(18) (19)

C’ C’, $, L

(A-12) Optimizing the Sub- domain-Processor Mapping

for (I-6)

(6) -

(15)(16) (18)(19) (20)

C’ C’

(I-7) Hiding the Communication

Latency

(A-13) Prefetch, Update, and Out-of-order Execution

for (I-7)

(7) - (12)(14) C’ C’, $, M

(A-14) Asynchronous Com- munication via Messages for

(I-7)

(7) - (4)(5)(9)

(14) C’ C’, $

(A-15) Multithreading for (I-

7) (7)(13) - (9)(12)(1

4) C’ C’, $, S

(I-8) Reducing the Number of Communication

Transactions

(A-16) Grouping Messages

for (I-8) (8) (9) (14) C’ C’, S

(A-17) Using Long-Block

Memory Access for (I-8) (8) (4)(5)(9) (14) C’ C’, S, $ (I-9) Distribut-

ing the Commu- nications in

Space

(A-18) Selective Communi-

cation for (I-9) (8) - (14) C’ C’

(I-10) Distribut- ing the Commu- nications in

Time

(A-19) Overdecomposition to Scramble the Execution

for (I-10)

(9) - (14) C’ C’, S, $

Table 1: Performance Tuning Steps, Issues, Actions and the Effects of Actions (1 of 3).

(17)

3.3 Model-Driven Performance Tuning

Figure 7 shows a framework, called Model-Driven Performance Tuning (MDPT) that can be used to accelerate the performance analysis and tuning process by exploiting the use of an application model (AM), a parsed form (intermediate representation) of the application code generated and used within compilation. The AM is annotated with profile information, an abstraction of the application behavior derived from performance assessment, as shown in Figure 8. Driven by this application model and a machine model (a machine performance characterization created from specifications and microbenchmarking), Model-Driven Simulation (MDS), analyzes and projects

Tuning Step

Performance

Issues

Targeted Perform-

ance Gap(s) in

this Step

Primarily Affected Perform- ance Gap(s)

Optimizing Processor

Perfor- mance (Step 3)

(I-11) Choosing a Compiler or Compiler Direc-

tive

(A-20) Properly Using Com- pilers or Compiler Direc-

tives for (I-11)

(11) - - C, S, $ C, S, $, C’

(A-21) Goal-Directed Tun- ing for Processor Perfor-

mance for (I-11)

(11) - - - -

(I-12) Reducing the Cache Capacity Misses

(A-22) Cache Control Direc-

tives for (I-12) (12) (2) (4)(14) $ $, C

(A-23) Enhancing Spatial Locality by Array Grouping

for (I-12)

(12) (2) (4)(14) $ $, C

(A-24) Blocking Loops Using Overdecomposition

for (I-12)

(12) - (14) $ $, C

(I-13) Reducing the Impact of Cache Misses

(A-25) Hiding Cache Miss Latency with Prefetch and Out-of-Order Execution for

(I-13)

(13) - (7) $ $, C, M, S

(A-26) Hiding Memory Access Latency with Multi-

threading for (I-13)

(13) - (7) $ $, C, S

(I-14) Reducing Conflicts of Interest between

Improving Pro- cessor Perfor-

mance and Communication

Performance

(A-27) Repeating Steps 2

and 3 for (I-14) (14) - (2)-(13) C, S, $, C’ C, S, $, C’

Balancing the Load per Phase (Step 4)

(I-15) Balanc- ing a Nonuni-

formly Distributed

Load

(A-28) Profile-Driven Domain Decomposition for

(I-15)

(15) (1)(3)

(18)(19) L L, C’

(A-29) Self-Scheduling for (I-15)

(15)(16) (18)(19) (20)

(2)(17) L L, C’, M’,

D

(18)

the application performance by simulating the machine-application interactions on the model and issuing reports, as shown in Figure 9.

In MDPT, the application model, instead of the application, becomes the object of performance tuning. Proposed performance tuning actions are first installed in the application model and evaluated via MDS to assist the user in making tuning deci- sions. This concept of the MDPT approach, the capabilities it provides, and its potential are discussed below (each numbered paragraph is keyed to the corresponding number in Figure 7):

1. Various sources of performance assessment and program analysis contribute to the application modeling phase for providing a more complete, accurate model. Per- formance assessment tools and application developers both contribute to creating the application model.

Tuning Step

Performance

Issues

Targeted Perform- ance Gap(s) in this Step

Primarily Affected Perform-

ance Gap(s)

Reducing the Syn- chroniza-

tion/

Scheduling Overhead (Step 5)

(I-16) Reducing the Impact of Load Imbalance

(A-30) Fuzzy Barriers for (I- 16)

(16)(18)

(19)(20) - (17) L, M’ L, M’, D

(A-31) Point-to-Point Syn- chronizations for (I-16)

(16)(18)

(19)(20) - (17) L, M’ L, M’, D

(A-32) Self-scheduling of Overdecomposed Subdo-

mains for (I-16)

(2)(16) - (16)(17)

(18)(19) L, M’ L, M’, D (I-17) Reducing

the Overall Scheduling/Syn-

chronization Overhead

- - - - L, M’, D L, C’, M’,

D

Balancing the Com- bined Load

for Multi- ple Phases

(Step 6)

(I-18) Balanc- ing the Load for

a Multiphase Program

(A-33) Balancing the Most

Critical Phase for (I-18) (18) (3)(15) M’ M’, L, D (A-34) Multiple Domain

Decompositions for (I-18) (18) - (3) M’ M’, C’, L, D (A-35) Multiple-Weight

Domain Decomposition Algorithms for (I-18)

(2)(18) - (3)(15)(1

9) M’ M’, L, D

(A-36) Fusing the Phases and Balancing the Total

Load for (I-18)

(16)(18) - (15)(20) M’ M’, L, D

Balancing Dynamic

Load (Step 7)

(I-19) Reducing the Dynamic Load Imbalance

(A-37) Dynamically Rede- composing the Domain for

(I-19)

(19)(20) (2) (3)(14) D D, C’, L, M’

(A-38) Dynamic/Self-Sched- uling for (I-19)

(16)(19)

(20) (2) (3)(15)

(17)(18) D D, C’, L, M’

(A-39) Multiple-Weight Domain Decomposition for

(I-19)

(2)(19) -

(3)(15)(1

9) D D, C’, L,

M’

(I-20) Tolerat- ing the Impact of Dynamic Load Imbalance

(A-40) Relaxed Synchroniza-

tions for (I-20) (16)(19) - (14)(17)(

18) D D, C’, L,

M’

(19)

2. In the performance modeling phase, MDS is carried out to derive information by analyzing the machine-application interactions between the application model and the machine model. The machine model is based on the machine specification and/or the results of machine characterization.

3. The application model serves as a medium for experimenting with the application of performance tuning techniques as well as resolving the conflicts among them.

In MDPT, performance tuning techniques are first iteratively applied and evaluated on the application model using MDS (see cycle (3)) and only ported to the code (via cycle (6)) after reaching a desirable plateau. Such use of this short loop for what-if evaluations should significantly shorten the overall application devel- opment time.

4. The application model can be tuned by either the programmer or the compiler. A properly abstracted application model helps the user or the compiler assess the application performance at an adequate level, without the overkill burden of tuning by carrying out transformations and performance analysis directly on the application and repeatedly handling the high volume of raw performance data that is produced. Performance tuning uses the output reports of MDS (Figure 9) to select tuning actions from Table .

5. In addition to tuning the application model, the machine model can be tuned to improve the application performance. Using MDS, the users are given the oppor- tunity to evaluate various machine configurations or different machines for specific applications without actually reconfiguring or building the machine.

6. After tuning actions are evaluated with MDS and accepted, they are applied to the application code and/or the target machine to assess the actual improvement, validate, and possibly recalibrate the models.

application

machine

performance profiling

Fig. 7: Model-Driven Performance Tuning.

modifying the application code

adjusting the machine

performance tuning application

model (AM)

modifying the

machine model machine

characterization

model-driven simulation (MDS)

application model (1)

adjusting the machine model application

developer

(2)

(3)

(4) (6)

(5)

(6)

(20)

3.3.1 Application Modeling

The Application Modeling (AM) phase generates specifications of the application behavior, including the application’s control flow, data dependence, domain decom- position, and the weight distribution over the domain. This phase can be carried out by the application developer with minimal knowledge about machine-application interactions. We have designed a language, called the Application Modeling Lan- guage (AML) for the user to specify the application model and incorporate results from performance assessment tools, such as profiling.

The performance of an application is fundamentally governed by (1) the program (algorithm), (2) the input (data domain/structures), and (3) the machine that are used to execute the application. It is relatively difficult to observe machine-application interactions at this level, since detailed machine operations are often hidden from the programming model that is available to the programmer.

We would like to model the application at a level that provides us with more pre- cise information on how the application behaves, especially the behavior that directly affects performance. The control flow and the data dependence in the application are modeled because they limit the instruction schedule and determine the data access pattern for the application. The decomposition of the input data determines the decomposition of the workload (for an SPMD application). The layout of the data structure determines the data allocation and affects the actual data flow in the machine, especially for a cache-based, distributed shared-memory application. The

Fig. 8: Building an Application Model.

program

layout source-code

trace-driven

profile-driven domain

analysis analysis

analysis decomposition

machine input data

Application

data dependence

Application Model control

flow weight

distribution domain

decomposition algorithm

Analysis

(21)

workload in the application certainly requires resources from the processors and hence needs to be modeled for addressing load balance problems. An application model is acquired by abstracting (1) control flow, (2) data dependence, (3) domain decomposition, (4) data layout, and (5) the weight distribution (workload) from the application. These five components are hereafter referred to as modules of the appli- cation model.

Figure 8 illustrates how we model applications on the HP/Convex SPP-1600 via the use of source code analysis (mostly done by the programmer), profiling (CXpa), and trace-driven simulation tools (Smait and CIAT/CCAT [15]). In this flow chart, a solid line indicates a path that we currently employ to create a particular module, and a dashed line indicates an additional path that might be useful for creating the module.

We briefly describe the process used to create these modules as follows:

• The control structure, the data dependence, and the data layout that are encoded in the program are abstracted via source code analysis. While analyzing irregular applications can be difficult for compilers, this task can be eased with profiling and tracing, and programmer assistance where needed. However, since compilers are useful for analyzing most regular applications without assistance, we assume that the generation of these modules can be done mostly by converting the results of compiler analysis (from the internal representation used by the compiler).

• We use weights to represent the application workload in different code sections.

Although the instruction sequence in a code section can be extracted to model the workload, accurately predicting the execution time of the code section based on the instruction sequence can be rather complicated and difficult. Profile-driven analysis can be used straightforwardly by the user for extracting the weights where the load is uniformly-distributed. For non-uniformly-distributed cases, techniques such as the weight classification and predication method [18] may be needed.

• In an SPMD application, the computation is decomposed by decomposing the domain. The domain decomposition can be implicitly specified in the application by DOALL statements, or explicitly programmed into the code according to the output of a domain decomposition package such as Metis [17]. The data dependence and the weight distribution of the application are given as inputs to the domain decomposition package.

We believe that building such an application model is highly feasible for the application developers with the programming tools available today. Most of the above pro- cedures involved in modeling an application require very little knowledge about the target machine, and tools such as profiling provide additional help in measuring the workload and in helping the programmer to extract the application behavior.

3.3.2 Model-Driven Performance Evaluation

The Model-Driven Simulation (MDS) derives performance information based on the application model by analyzing the data flow, working set, cache utilization, workload, degree of parallelism, communication pattern, and the hierarchical performance bounds. MDS performs a broad range of analyses that use combinations of conven-

(22)

tional performance assessment tools. Results from MDS are used to validate the application model by comparing results with those of previous performance assess- ments in known cases (both cases that were previously used to generate the model, as well as new cases with new profiles).

MDS is a performance analyzer that derives performance information for an application by simulating the application’s model with a machine model. MDS is a model- driven simulator that executes the tasks in the application model as if executing a For- tran or C program. Figure 9 shows the performance analyses that are carried out in MDS.

MDS follows the flow defined in the control flow module. When a task is executed, MDS performs the operations required by the task according to the modules associated with the task. MDS handles a task according to the following steps:

1. The domain decomposition module is used to group the iteration space into sub- domains. The workload for one sub-domain forms a sub-task.

Fig. 9: Model-driven Analyses Performed in MDS.

layout Machine & Application Models

data dependence control

flow weight

distribution machine

model

domain

communication analysis analysis workload

analysis

working set &

Load Imbalance Timing

Synchronization Cost Scheduling Cost

Data Flow

Communication Pattern

Cache Misses MDS

CXbound analysis

Performance Bounds

decomposition

data flow

cache analysis