Quantitative Principles of Computer Design

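Amdahl’s Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a machine that will improve performance when it is used. Speedup is the ratio

$$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}$$

Alternatively,

$$\text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}$$

Speedup tells us how much faster a task will run using the machine with the enhancement as opposed to the original machine.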

Amdahl’s Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:

1. The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement. For example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call Fraction_enhanced, is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program. This value is the time of the original mode over the time of the enhanced mode: if the enhanced mode takes 2 seconds for some portion of the program that can completely use the mode, while the original mode took 5 seconds for the same portion, the improvement is 5/2. We will call this value, which is always greater than 1, Speedup_enhanced.

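The execution time using the original machine with the enhanced mode will be the time spent using the unenhanced portion of the machine plus the time spent using the enhancement:

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$

The overall speedup is the ratio of the execution times:

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$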

EXAMPLE Suppose that we are considering an enhancement to the processor of a server system used for web serving. The new CPU is 10 times faster on computation in the web serving application than the original processor. Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?

ANSWER Fraction_enhanced = 0.4 and Speedup_enhanced = 10, so

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$
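As a quick numeric check of the web-serving example, the following Python sketch evaluates the overall speedup from the two factors defined above. The function name `amdahl_speedup` and its parameter names are illustrative choices, not something the text prescribes.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the original execution time benefits.

    fraction_enhanced: share of the ORIGINAL execution time that can use the
                       enhancement (between 0 and 1).
    speedup_enhanced:  how much faster that share runs with the enhancement.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    unenhanced_part = 1.0 - fraction_enhanced
    enhanced_part = fraction_enhanced / speedup_enhanced
    return 1.0 / (unenhanced_part + enhanced_part)


# Web-serving example: 40% computation (10x faster), 60% waiting for I/O.
web_server = amdahl_speedup(0.4, 10.0)
print(f"{web_server:.2f}")  # prints 1.56
```

Although the computation itself becomes 10 times faster, the untouched 60% of I/O time keeps the overall gain near 1.56.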

Amdahl’s Law expresses the law of diminishing returns: The incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added. An important corollary of Amdahl’s Law is that if an enhancement is only usable for a fraction of a task, we can’t speed up the task by more than the reciprocal of 1 minus that fraction.
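One way to see the corollary, assuming the overall speedup is written in terms of Fraction_enhanced and Speedup_enhanced as defined earlier, is to let the enhanced mode become arbitrarily fast:

$$\lim_{\text{Speedup}_{\text{enhanced}} \to \infty} \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$

For the web-serving example, with Fraction_enhanced = 0.4, the bound is 1/0.6 ≈ 1.67: even an infinitely fast processor could not improve on that, because the I/O time is untouched.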

A common mistake in applying Amdahl’s Law is to confuse “fraction of time converted to use an enhancement” and “fraction of time after enhancement is in use.” If, instead of measuring the time that we could use the enhancement in a computation, we measure the time after the enhancement is in use, the results will be incorrect! (Try Exercise 1.2 to see how wrong.)
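To illustrate the mistake, here is a small Python sketch that reuses the web-serving numbers (40% computation, 60% I/O, computation 10 times faster). It computes the fraction both ways; the variable names and the normalization of the original execution time to 1.0 are assumptions made for illustration.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


speedup_enhanced = 10.0
old_compute, old_io = 0.4, 0.6                 # original time, normalized to 1.0

new_compute = old_compute / speedup_enhanced   # time in enhanced mode afterwards
new_total = new_compute + old_io

# Correct: fraction of the ORIGINAL time that can use the enhancement.
correct_fraction = old_compute / (old_compute + old_io)
# Mistake: fraction of the NEW time spent with the enhancement in use.
wrong_fraction = new_compute / new_total

correct = amdahl_speedup(correct_fraction, speedup_enhanced)
wrong = amdahl_speedup(wrong_fraction, speedup_enhanced)

print(f"correct fraction {correct_fraction:.4f} -> speedup {correct:.3f}")
print(f"wrong fraction   {wrong_fraction:.4f} -> speedup {wrong:.3f}")
```

With the fraction measured after the enhancement is in use, the formula badly understates the benefit (about 1.06 instead of about 1.56).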

Amdahl’s Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance. The goal, clearly, is to spend resources proportional to where time is spent. Amdahl’s Law is particularly useful for comparing the overall system performance of two alternatives, but it can also be applied to compare two CPU design alternatives, as the following Example shows.

EXAMPLE A common transformation required in graphics engines is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.

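ANSWER We can compare these two alternatives by comparing the speedups:

$$\text{Speedup}_{\text{FPSQR}} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = \frac{1}{0.82} \approx 1.22$$

$$\text{Speedup}_{\text{FP}} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{1.6}} = \frac{1}{0.8125} \approx 1.23$$

Improving the performance of the FP operations overall is slightly better because of the higher frequency.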
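The comparison can also be tabulated programmatically. The sketch below is illustrative only: the dictionary keys and the helper name `amdahl_speedup` are made up for this example, while the fractions and speedup factors are the ones stated in the Example.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# (fraction of original execution time affected, speedup of that fraction)
alternatives = {
    "FPSQR 10x faster": amdahl_speedup(0.20, 10.0),
    "all FP 1.6x faster": amdahl_speedup(0.50, 1.6),
}

for name, speedup in alternatives.items():
    print(f"{name}: {speedup:.2f}")

best = max(alternatives, key=alternatives.get)
print("better alternative:", best)
```

The smaller per-operation gain wins here only because it applies to a much larger share of the execution time.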

In the above example, we needed to know the time consumed by the new and improved FP operations; often it is difﬁcult to measure these times directly. In the next section, we will see another way of doing such comparisons based on the use of an equation that decomposes the CPU execution time into three separate components. If we know how an alternative affects these three components, we can determine its overall performance effect. Furthermore, it is often possible to build simulators that measure these components before the hardware is actually designed.

The CPU Performance Equation

Essentially all computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed two ways:

$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

or

$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$

In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed: the instruction path length or instruction count (IC). If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI). Because it is easier to work with and because we will deal with simple processors in this chapter, we use CPI. Designers sometimes also use Instructions per Clock or IPC, which is the inverse of CPI.

CPI is computed as:

$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$$

This CPU ﬁgure of merit provides insight into different styles of instruction sets and implementations, and we will use it extensively in the next four chapters.

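By transposing instruction count in the above formula, clock cycles can be defined as IC × CPI. This allows us to use CPI in the execution time formula:

$$\text{CPU time} = \text{Instruction count} \times \text{Clock cycle time} \times \text{Cycles per instruction}$$

or

$$\text{CPU time} = \frac{\text{Instruction count} \times \text{Cycles per instruction}}{\text{Clock rate}}$$

Expanding the first formula into the units of measurement and inverting the clock rate shows how the pieces fit together:

$$\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time}$$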

As this formula demonstrates, CPU performance is dependent upon three characteristics: clock cycle (or rate), clock cycles per instruction, and instruction count. Furthermore, CPU time is equally dependent on these three characteristics: A 10% improvement in any one of them leads to a 10% improvement in CPU time.
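The equal dependence on the three factors is easy to check numerically. In this sketch the baseline values (10^9 instructions, a CPI of 2.0, a 1 ns clock cycle) are hypothetical, chosen only to make the arithmetic visible; "10% improvement" is read as a 10% reduction in the factor.

```python
def cpu_time(instruction_count: float, cpi: float, clock_cycle_time: float) -> float:
    """CPU time in seconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time


# Hypothetical baseline: 1e9 instructions, CPI of 2.0, 1 ns clock cycle.
base = cpu_time(1e9, 2.0, 1e-9)

# Reduce each factor by 10% in turn, leaving the other two unchanged.
fewer_instructions = cpu_time(0.9e9, 2.0, 1e-9)
lower_cpi = cpu_time(1e9, 1.8, 1e-9)
faster_clock = cpu_time(1e9, 2.0, 0.9e-9)

for label, t in [("fewer instructions", fewer_instructions),
                 ("lower CPI", lower_cpi),
                 ("shorter cycle", faster_clock)]:
    print(f"{label}: {t / base:.2f} of baseline CPU time")
```

Each variant takes 0.90 of the baseline time, because the three factors enter the product symmetrically.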

Unfortunately, it is difficult to change one parameter in complete isolation from others because the basic technologies involved in changing each characteristic are interdependent:

- Clock cycle time: hardware technology and organization

- CPI: organization and instruction set architecture

- Instruction count: instruction set architecture and compiler technology

Luckily, many potential performance improvement techniques primarily improve one component of CPU performance with small or predictable impacts on the other two.

Sometimes it is useful in designing the CPU to calculate the number of total CPU clock cycles as

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i$$

where IC_i represents the number of times instruction i is executed in a program and CPI_i represents the average number of clock cycles per instruction for instruction i. This form can be used to express CPU time as

$$\text{CPU time} = \left( \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i \right) \times \text{Clock cycle time}$$

and overall CPI as:

$$\text{CPI} = \frac{\sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i}{\text{Instruction count}} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{Instruction count}} \times \text{CPI}_i$$

The latter form of the CPI calculation uses each individual CPI_i and the fraction of occurrences of that instruction in a program (i.e., IC_i / Instruction count). CPI_i should be measured and not just calculated from a table in the back of a reference manual since it must include pipeline effects, cache misses, and any other memory system inefficiencies.

Consider our earlier example, here modified to use measurements of the frequency of the instructions and of the instruction CPI values, which, in practice, are obtained by simulation or by hardware instrumentation.

EXAMPLE Suppose we have made the following measurements:

Frequency of FP operations (other than FPSQR) = 25%

Average CPI of FP operations = 4.0

Average CPI of other instructions = 1.33

Frequency of FPSQR = 2%

CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation.

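ANSWER First, observe that only the CPI changes; the clock rate and instruction count remain identical. We start by finding the original CPI with neither enhancement:

$$\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}} = (4 \times 25\%) + (1.33 \times 75\%) \approx 2.0$$

We can compute the CPI for the enhanced FPSQR by subtracting the cycles saved from the original CPI:

$$\text{CPI}_{\text{with new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (\text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{of new FPSQR only}}) = 2.0 - 2\% \times (20 - 2) = 1.64$$

We can compute the CPI for the enhancement of all FP instructions the same way or by summing the FP and non-FP CPIs. Using the latter gives us

$$\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) \approx 1.62$$

Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better. Specifically, the speedup for the overall FP enhancement is

$$\text{Speedup}_{\text{new FP}} = \frac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \frac{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{new FP}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} \approx \frac{2.0}{1.62} \approx 1.23$$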

Happily, this is the same speedup we obtained using Amdahl’s Law on page 42. It is often possible to measure the constituent parts of the CPU performance equation. This is a key advantage of using the CPU performance equation versus Amdahl’s Law in the above example. In particular, it may be difficult to measure things such as the fraction of execution time for which a set of instructions is responsible. In practice this would probably be computed by summing the product of the instruction count and the CPI for each of the instructions in the set. Since the starting point is often individual instruction count and CPI measurements, the CPU performance equation is incredibly useful.
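The arithmetic of this example can be reproduced with a short Python sketch of the frequency-weighted CPI. It follows the Example's own treatment of the measurements (FP operations weighted at 25% and all other instructions at 75%, with the FPSQR saving subtracted separately); the function name `weighted_cpi` and the list-of-pairs layout are illustrative assumptions.

```python
def weighted_cpi(mix):
    """Overall CPI from (frequency, CPI) pairs whose frequencies sum to 1."""
    return sum(frequency * cpi for frequency, cpi in mix)


# Original machine: FP operations at CPI 4.0, all other instructions at CPI 1.33.
cpi_original = weighted_cpi([(0.25, 4.0), (0.75, 1.33)])

# Alternative 1: FPSQR (2% of instructions) drops from CPI 20 to CPI 2.
cpi_new_fpsqr = cpi_original - 0.02 * (20.0 - 2.0)

# Alternative 2: the average CPI of all FP operations drops to 2.5.
cpi_new_fp = weighted_cpi([(0.25, 2.5), (0.75, 1.33)])

# Instruction count and clock cycle are unchanged, so speedup is a CPI ratio.
print(f"CPI original:      {cpi_original:.2f}")
print(f"CPI new FPSQR:     {cpi_new_fpsqr:.2f}")
print(f"CPI new FP:        {cpi_new_fp:.2f}")
print(f"Speedup new FPSQR: {cpi_original / cpi_new_fpsqr:.2f}")
print(f"Speedup new FP:    {cpi_original / cpi_new_fp:.2f}")
```

The two speedups come out at roughly 1.22 and 1.23, in line with the Amdahl’s Law comparison of the same alternatives.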

Measuring and Modeling the Components of the CPU Performance Equation

To use the CPU performance equation as a design tool, we need to be able to measure the various factors. For an existing processor, it is easy to obtain the execution time by measurement, and the clock speed is known. The challenge lies in discovering the instruction count or the CPI. Most newer processors include counters for both instructions executed and for clock cycles. By periodically monitoring these counters, it is also possible to attach execution time and instruction count to segments of the code, which can be helpful to programmers trying to understand and tune the performance of an application. Often, a designer or programmer will want to understand performance at a more fine-grained level than what is available from the hardware counters. For example, they may want to know why the CPI is what it is. In such cases, simulation techniques like those used for processors that are being designed are used.

There are three general classes of simulation techniques that are used. In general, the more sophisticated techniques yield more accuracy, particularly for more recent architectures, at the cost of longer execution time. The first and simplest technique, and hence the least costly, is profile-based, static modeling. In this technique a dynamic execution profile of the program, which indicates how often each instruction is executed, is obtained by one of three methods:

1. By using hardware counters on the processor, which are periodically saved. This technique often gives an approximate profile, but one that is within a few percent of exact.

2. By using instrumented execution, in which instrumentation code is compiled into the program. This code is used to increment counters, yielding an exact profile. (This technique can also be used to create a trace of the memory addresses that are accessed, which is useful for other simulation techniques.)

3. By interpreting the program at the instruction set level, compiling instruction counts in the process.

Once the profile is obtained, it is used to analyze the program in a static fashion by looking at the code. With the profile, the total instruction count is easy to obtain. It is also easy to get a detailed dynamic instruction mix telling what types of instructions were executed with what frequency. Finally, for simple processors, it is possible to compute an approximation to the CPI. This approximation is computed by modeling and analyzing the execution of each basic block (or straightline code segment) and then computing an overall estimate of CPI or total compute cycles by multiplying the estimate for each basic block by the number of times it is executed. Although this simple model ignores memory behavior and has severe limits for modeling complex pipelines, it is a reasonable and very fast technique for modeling the performance of short integer pipelines.
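A minimal sketch of the per-basic-block estimate might look as follows. Everything concrete here is hypothetical: the `BasicBlock` fields, the block names, and the cycle estimates stand in for what a static pipeline model and a real profile would supply.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    name: str
    instructions: int   # static number of instructions in the block
    est_cycles: int     # estimated cycles for one pass (static pipeline model)
    executions: int     # dynamic execution count taken from the profile


def estimate(blocks):
    """Return (total instructions, total cycles, estimated CPI)."""
    total_instructions = sum(b.instructions * b.executions for b in blocks)
    total_cycles = sum(b.est_cycles * b.executions for b in blocks)
    return total_instructions, total_cycles, total_cycles / total_instructions


# Hypothetical profile of a small program dominated by one loop.
profile = [
    BasicBlock("setup", instructions=12, est_cycles=14, executions=1),
    BasicBlock("loop_body", instructions=8, est_cycles=10, executions=1000),
    BasicBlock("loop_exit", instructions=3, est_cycles=5, executions=1),
]

instructions, cycles, cpi = estimate(profile)
print(instructions, cycles, f"{cpi:.2f}")
```

Because the estimate is a weighted sum over blocks, the frequently executed loop body dominates the result, and memory stalls are not modeled at all.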

Trace-driven simulation is a more sophisticated technique for modeling performance and is particularly useful for modeling memory system performance. In trace-driven simulation, a trace of the memory references executed is created, usually either by simulation or by instrumented execution. The trace includes what instructions were executed (given by the instruction address), as well as the data addresses accessed.

Trace-driven simulation can be used in several different ways. The most common use is to model memory system performance, which can be done by simulating the memory system, including the caches and any memory management hardware, using the address trace. A trace-driven simulation of the memory system can be combined with a static analysis of pipeline performance to obtain a reasonably accurate performance model for simple pipelined processors. For more complex pipelines, the trace data can be used to perform a more detailed analysis of the pipeline performance by simulation of the processor pipeline. Since the trace data allows a simulation of the exact ordering of instructions, higher accuracy can be achieved than with a static approach. Trace-driven simulation typically isolates the simulation of any pipeline behavior from the memory system. In particular, it assumes that the trace is completely independent of the memory system behavior. As we will see in Chapters 3 and 5, this is not the case for the most advanced processors; a third technique is needed.
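As a rough illustration of using an address trace to model a memory system, the sketch below replays a trace through a direct-mapped cache and counts hits and misses. The cache organization, its parameters, and the synthetic trace are all assumptions chosen for brevity; a real study would use a recorded trace and a far more detailed memory model.

```python
def simulate_direct_mapped(trace, num_lines=64, block_bytes=16):
    """Replay byte addresses through a direct-mapped cache; return (hits, misses)."""
    tags = [None] * num_lines            # one stored tag per cache line
    hits = misses = 0
    for address in trace:
        block = address // block_bytes   # which memory block is referenced
        index = block % num_lines        # the single line that block may occupy
        tag = block // num_lines         # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag            # replace whatever was in the line
    return hits, misses


# Synthetic trace: two sequential sweeps over 1 KB of 4-byte words.
trace = [4 * i for i in range(256)] * 2

print(simulate_direct_mapped(trace, num_lines=64))  # 1 KB cache: array fits
print(simulate_direct_mapped(trace, num_lines=32))  # 512 B cache: array does not fit
```

With 64 lines the second sweep hits entirely; with 32 lines each block is evicted before it is reused, so the second sweep misses as often as the first. Note that the trace is fixed in advance, which is exactly the independence assumption described above.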

The third technique, which is the most accurate and most costly, is execution-driven simulation. In execution-driven simulation, detailed simulations of the memory system and the processor pipeline are performed simultaneously. This allows the exact modeling of the interaction between the two, which is critical, as we will see in Chapters 3 and 5.

There are many variations on these three basic techniques. We will see examples of these tools in later chapters and use various versions of them in the exercises.

Locality of Reference

Although Amdahl’s Law is a theorem that applies to any system, other important fundamental observations come from properties of programs. The most important program property that we regularly exploit is locality of reference: Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code. An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past.

Locality of reference also applies to data accesses, though not as strongly as to code accesses. Two different types of locality have been observed. Temporal locality states that recently accessed items are likely to be accessed in the near future. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time. We will see these principles applied in Chapter 5.

Take Advantage of Parallelism

Taking advantage of parallelism is one of the most important methods for improving performance. We give three brief examples, which are expounded on in later chapters. Our first example is the use of parallelism at the system level. To improve the throughput performance on a typical server benchmark, such as SPECWeb or TPC, multiple processors and multiple disks can be used. The workload of handling requests can then be spread among the CPUs or disks, resulting in improved throughput. This is the reason that scalability is viewed as a valuable asset for server applications.

At the level of an individual processor, taking advantage of parallelism among instructions is critical to achieving high performance. One of the simplest ways to do this is through pipelining. The basic idea behind pipelining, which is explained in more detail in Appendix A and is a major focus of Chapter 3, is to overlap the execution of instructions, so as to reduce the total time to complete a sequence of instructions. Viewed from the perspective of the CPU performance equation, we can think of pipelining as reducing the CPI by allowing instructions that take multiple cycles to overlap. A key insight that allows pipelining to work is that not every instruction depends on its immediate predecessor, and thus, executing the instructions completely or partially in parallel may be possible.
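The effect on CPI can be sketched with an idealized model. The assumptions are strong and are not from the text: every instruction passes through the same number of one-cycle stages, and the pipeline never stalls. Real pipelines fall short of this ideal.

```python
def unpipelined_cycles(num_instructions: int, stages: int) -> int:
    """Each instruction runs all of its stages before the next one starts."""
    return num_instructions * stages


def ideal_pipelined_cycles(num_instructions: int, stages: int) -> int:
    """Idealized pipeline: fill once, then complete one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)


n, k = 1000, 5                          # hypothetical: 1000 instructions, 5 stages
before = unpipelined_cycles(n, k)
after = ideal_pipelined_cycles(n, k)

print(f"unpipelined: {before} cycles, CPI {before / n}")
print(f"pipelined:   {after} cycles, CPI {after / n}")
```

Under these assumptions the effective CPI falls from 5 to just over 1, even though each individual instruction still takes five cycles from start to finish.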

Parallelism can also be exploited at the level of detailed digital design. For example, set associative caches use multiple banks of memory that are typically searched in parallel to find a desired item. Modern ALUs use carry-lookahead, which uses parallelism to reduce the time to compute sums from linear in the number of bits in the operands to logarithmic.
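To make the logarithmic claim concrete, here is a software sketch of one lookahead scheme, a parallel-prefix (Kogge-Stone style) carry computation. It is only a model: the text does not say which lookahead organization it has in mind, and the function name and default width are illustrative. Each pass of the `while` loop corresponds to one level of logic in which all bit positions would be updated at the same time in hardware.

```python
def carry_lookahead_add(a: int, b: int, width: int = 8):
    """Add two width-bit values; return (sum including carry-out, prefix levels)."""
    # Per-bit generate (both inputs 1) and propagate (exactly one input 1).
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    G, P = g[:], p[:]
    levels = 0
    distance = 1
    while distance < width:
        # All positions combine with the group `distance` bits below them.
        new_G, new_P = G[:], P[:]
        for i in range(distance, width):
            new_G[i] = G[i] | (P[i] & G[i - distance])
            new_P[i] = P[i] & P[i - distance]
        G, P = new_G, new_P
        distance *= 2
        levels += 1

    # G[i] is now the carry out of bit i, so the carry into bit i is G[i - 1].
    carries = [0] + G[:-1]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total | (G[width - 1] << width), levels


print(carry_lookahead_add(200, 100, width=8))   # (300, 3)
```

For 8-bit operands the carries are resolved in 3 levels rather than by rippling through 8 positions, and doubling the width adds only one more level.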

There are many different ways designers take advantage of parallelism. One common class of techniques is parallel computation of two or more possible outcomes, followed by late selection. This technique is used in carry select adders, in set associative caches, and in handling branches in pipelines. Virtually every chapter in this book will have an example of how performance is enhanced through the exploitation of parallelism.
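A carry select adder gives a compact picture of late selection. In the sketch below, each block computes its sum for both possible carry-in values and the real carry-in then picks one. The block size, the function name, and the requirement that the width be a multiple of the block size are simplifying assumptions; in hardware all blocks would compute their two candidates at the same time, whereas this Python loop only models the selection logic.

```python
def carry_select_add(a: int, b: int, width: int = 8, block: int = 4) -> int:
    """Add two width-bit values block by block using late selection."""
    if width % block != 0:
        raise ValueError("width must be a multiple of block in this sketch")
    mask = (1 << block) - 1
    result = 0
    carry = 0
    for low in range(0, width, block):
        x = (a >> low) & mask
        y = (b >> low) & mask
        # Both outcomes are prepared before the incoming carry is known...
        sum_if_carry0 = x + y
        sum_if_carry1 = x + y + 1
        # ...and the actual carry-in selects between them afterwards.
        chosen = sum_if_carry1 if carry else sum_if_carry0
        result |= (chosen & mask) << low
        carry = chosen >> block
    return result | (carry << width)


print(carry_select_add(200, 100))   # 300
```

The cost of the technique is duplicated hardware for the speculative sums; the benefit is that each block's carry delay is hidden behind the selection.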

In the Putting It All Together sections that appear near the end of every chapter, we show real examples that use the principles in that chapter. In this section we look at measures of performance and price-performance ﬁrst in desktop systems using the SPEC CPU benchmarks, then at servers using TPC-C as the
