Fallacies and Pitfalls - Fundamentals of Computer Design 1

[Figure: relative performance per watt (axis 0.0 to 4.0) on the Automotive, Office, and Telecomm benchmark sets for the AMD ElanSC520, AMD K6-2E+, IBM PowerPC 750CX, NEC VR 5432, and NEC VR4122.]

not a good metric, even if the instruction sets are identical. Figure 1.28 shows the performance of a 1.7 GHz Pentium 4 relative to a 1 GHz Pentium III. The figure also shows the performance of a hypothetical 1.7 GHz Pentium III assuming linear scaling of performance based on the clock rate. In all cases except the SPEC floating point suite, the Pentium 4 delivers less performance per MHz than the Pentium III. As mentioned earlier, instruction set enhancements (the SSE2 extensions), which significantly boost floating point execution rates, are probably responsible for the better performance of the Pentium 4 for these floating point benchmarks.

FIGURE 1.28 A comparison of the performance of the Pentium 4 (P4) relative to the Pentium III (P3) on five different sets of benchmark suites. The bars show the relative performance of a 1.7 GHz P4 versus a 1 GHz P3. The triple vertical line at 1.7 shows how much faster a Pentium 4 at 1.7 GHz would be than a 1 GHz Pentium III assuming performance scaled linearly with clock rate. Of course, this line represents an idealized approximation to how fast a P3 would run. The first two sets of bars are the SPEC integer and floating point suites. The third set of bars represents three multimedia benchmarks. The fourth set represents a pair of benchmarks based on the game Quake, and the final benchmark is the composite Webmark score, a PC-based web benchmark.

[Figure: bar chart. Axis: relative performance, 0.00 to 1.80. Bar groups: SPECbase CINT2000, SPECbase CFP2000, Multimedia, Game benchmark, Web benchmark.]

Performance within a single processor implementation family (such as the Pentium III) usually scales more slowly than clock speed because of the increased relative cost of stalls in the memory system. Across generations (such as the Pentium 4 and Pentium III), enhancements to the basic implementation usually yield performance that is somewhat better than what would be derived from clock rate scaling alone. As Figure 1.28 shows, the Pentium 4 is usually slower than the Pentium III when performance is adjusted by linearly scaling the clock rate. This may partly derive from the focus on high clock rate as a primary design goal. In Chapter 3 we discuss further both the differences between the Pentium III and Pentium 4 and why performance does not scale as fast as the clock rate does.
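The clock-rate adjustment described above can be sketched in a few lines. This is only an illustration: the two relative-performance values are hypothetical placeholders, not readings from Figure 1.28, and the function name is our own. A result below 1.0 means the faster-clocked machine does less work per clock than the baseline.

```python
def perf_per_clock(relative_performance, clock_ratio):
    """Performance per unit of clock rate, relative to the baseline machine."""
    return relative_performance / clock_ratio


# 1.7 GHz Pentium 4 versus 1 GHz Pentium III, as in the text.
CLOCK_RATIO = 1.7 / 1.0

# Hypothetical relative-performance values, NOT taken from Figure 1.28.
hypothetical_results = {
    "integer-like suite": 1.25,
    "floating-point-like suite": 1.80,
}

for suite, relative in hypothetical_results.items():
    per_clock = perf_per_clock(relative, CLOCK_RATIO)
    verdict = "more" if per_clock > 1.0 else "less"
    print(f"{suite}: {per_clock:.2f}x per MHz ({verdict} work per clock than the baseline)")
```

Any bar that falls short of the 1.7 line in the figure corresponds to a value below 1.0 here.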

Fallacy: Benchmarks remain valid indefinitely.

Several factors influence the usefulness of a benchmark as a predictor of real performance, and some of these may change over time. A big factor influencing the usefulness of a benchmark is the ability of the benchmark to resist “cracking,” also known as benchmark engineering or “benchmarksmanship.” Once a benchmark becomes standardized and popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark. Small kernels or programs that spend their time in a very small number of lines of code are particularly vulnerable.

For example, despite the best intentions, the initial SPEC89 benchmark suite included a small kernel, called matrix300, which consisted of eight different 300 × 300 matrix multiplications. In this kernel, 99% of the execution time was in a single line (see SPEC [1989]). Optimization of this inner loop by the compiler (using an idea called blocking, discussed in Chapter 5) for the IBM Powerstation 550 resulted in a performance improvement by a factor of more than 9 over an earlier version of the compiler! This benchmark tested compiler performance and was not, of course, a good indication of overall performance, nor of the typical value of this particular optimization.
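To make the idea of blocking concrete, here is a minimal sketch of a blocked matrix multiplication next to the straightforward triple loop. It is not the compiler transformation applied to matrix300; the function names and the block size are our own assumptions. In Python the restructuring will not show the cache benefit, so the point is only the loop structure: the blocked version works on small tiles so that, in a compiled language, the data touched by the inner loops can stay in cache.

```python
def matmul_naive(a, b):
    """Straightforward triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_blocked(a, b, block=4):
    """Same product, computed tile by tile (block is a hypothetical tile size)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Work on one tile of a, one tile of b, and one tile of c.
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c


# Small integer-valued inputs so both versions agree exactly.
a = [[float(i * 7 + k) for k in range(7)] for i in range(5)]
b = [[float(k * 3 + j) for j in range(3)] for k in range(7)]
print(matmul_naive(a, b) == matmul_blocked(a, b))
```

Both functions perform the same multiplications and additions; only the order in which the elements are visited differs.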

Even after the elimination of this benchmark, vendors found methods to tune the performance of individual benchmarks by the use of different compilers or preprocessors, as well as benchmark-specific flags. Although the baseline performance measurements require the use of one set of flags for all benchmarks, the tuned or optimized performance does not. In fact, benchmark-specific flags are allowed, even if they are illegal in general and could lead to incorrect compilation!

Allowing benchmark-specific and even input-specific flags has led to long lists of options, as Figure 1.29 shows. This list of options, which is not significantly different from the option lists used by other vendors, is used to obtain the peak performance for the Compaq AlphaServer DS20E Model 6/667. The list makes it clear why the baseline measurements were needed. The performance difference between the baseline and tuned numbers can be substantial. For the SPEC CFP2000 benchmarks on the AlphaServer DS20E Model 6/667, the overall performance (which by SPEC CPU2000 rules is summarized by geometric mean) is 1.12 times higher for the peak numbers. As compiler technology improves, a system tends to achieve closer to peak performance using the base flags. Similarly, as the benchmarks improve in quality, they become less susceptible to highly application-specific optimizations. Thus, the gap between peak and base, which in early times was often 20%, has narrowed.
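The peak-to-base comparison above is a ratio of two geometric means. The sketch below shows how such a summary might be computed. The per-benchmark scores are hypothetical and are not taken from the AlphaServer report; the helper name is our own.

```python
import math


def geometric_mean(values):
    """Geometric mean computed through logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark scores (NOT from the SPEC CFP2000 report).
base_scores = [400.0, 500.0, 640.0]
peak_scores = [440.0, 500.0, 800.0]

# Ratio of the two summary numbers.
overall_gain = geometric_mean(peak_scores) / geometric_mean(base_scores)

# Geometric mean of the per-benchmark gains: the same number.
per_benchmark_gain = geometric_mean(
    [peak / base for peak, base in zip(peak_scores, base_scores)]
)

print(f"peak/base (ratio of geometric means): {overall_gain:.3f}")
print(f"geometric mean of per-benchmark gains: {per_benchmark_gain:.3f}")
```

One convenient property of the geometric mean is visible here: the ratio of the summaries equals the summary of the ratios, so the overall gain does not depend on which machine is used for normalization.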

Ongoing improvements in technology can also change what a benchmark measures. Consider the benchmark gcc, considered one of the most realistic and challenging of the SPEC92 benchmarks. Its performance is a combination of CPU time and real system time. Since the input remains fixed and real system time is limited by factors, including disk access time, that improve slowly, an increasing amount of the runtime is system time rather than CPU time. This may be appropriate. On the other hand, it may be appropriate to change the input over time, reflecting the desire to compile larger programs. In fact, the SPEC92 input was changed to include four copies of each input file used in SPEC89; although this increases runtime, it may or may not reflect the way compilers are actually being used.

Over a long period of time, these changes may make even a well-chosen benchmark obsolete. For example, more than half the benchmarks added to the 1992 and 1995 SPEC CPU benchmark releases were dropped from the next generation of the suite! To show how dramatically benchmarks must adapt over time, we summarize the status of the integer and FP benchmarks from SPEC 89, 92, and 95 in Figure 1.30.

Peak: -v -g3 -arch ev6 -non_shared ONESTEP plus:
168.wupwise: f77 -fast -O4 -pipeline -unroll 2
171.swim: f90 -fast -O5 -transform_loops
172.mgrid: kf77 -O5 -transform_loops -tune ev6 -unroll 8
173.applu: f77 -fast -O5 -transform_loops -unroll 14
177.mesa: cc -fast -O4
178.galgel: kf90 -O4 -unroll 2 -ldxml RM_SOURCES = lapak.f90
179.art: kcc -fast -O4 -ckapargs='-arl=4 -ur=4' -unroll 10
183.equake: kcc -fast -ckapargs='-arl=4' -xtaso_short
187.facerec: f90 -fast -O4
188.ammp: cc -fast -O4 -xtaso_short
189.lucas: kf90 -fast -O5 -fkapargs='-ur=1' -unroll 1
191.fma3d: kf90 -O4
200.sixtrack: f90 -fast -O5 -transform_loops
301.apsi: kf90 -O5 -transform_loops -unroll 8 -fkapargs='-ur=1'

FIGURE 1.29 The tuning parameters for the SPEC CFP2000 report on an AlphaServer DS20E Model 6/667. This is the portion of the SPEC report for the tuned performance corresponding to that in Figure 1.14 on page 34. These parameters describe the compiler options (four different compilers are used). Each line shows the option used for one of the SPEC CFP2000 benchmarks. Data from: http://www.spec.org/osg/cpu2000/results/res1999q4/cpu2000-19991130-00012.html.

Pitfall: Comparing hand-coded assembly and compiler-generated high-level language performance.

In most applications of computers, hand-coding is simply not tenable. A combination of the high cost of software development and maintenance together with time-to-market pressures has made it impossible for many applications to consider assembly language. In parts of the embedded market, however, several factors have continued to encourage limited use of hand coding, at least of key loops. The most important factors favoring this tendency are the importance of a few small loops to overall performance (particularly real-time performance) in some embedded applications, and the inclusion of instructions that can significantly boost performance of certain types of computations, but that compilers cannot effectively use.

When performance is measured either by kernels or by applications that spend most of their time in a small number of loops, hand coding of the critical parts of the benchmark can lead to large performance gains. In such instances, the performance difference between the hand-coded and machine-generated versions of a benchmark can be very large, as Figure 1.31 shows for three embedded processors. Both designers and users must be aware of this potentially large difference and not extrapolate performance for compiler-generated code from hand-coded benchmarks.

Release columns: SPEC 89, SPEC 92, SPEC 95, SPEC 2000

Benchmark name       Integer or FP   Status changes, in chronological order
gcc                  integer         adopted, modified, modified, modified
espresso             integer         adopted, modified, dropped
li                   integer         adopted, modified, modified, dropped
eqntott              integer         adopted, dropped
spice                FP              adopted, modified, dropped
doduc                FP              adopted, dropped
nasa7                FP              adopted, dropped
fpppp                FP              adopted, modified, dropped
matrix300            FP              adopted, dropped
tomcatv              FP              adopted, modified, dropped
compress             integer         adopted, modified, dropped
sc                   integer         adopted, dropped
mdljdp2              FP              adopted, dropped
wave5                FP              adopted, modified, dropped
ora                  FP              adopted, dropped
mdljsp2              FP              adopted, dropped
alvinn               FP              adopted, dropped
ear                  FP              adopted, dropped
swm256 (aka swim)    FP              adopted, modified, modified
su2cor               FP              adopted, modified, dropped
hydro2d              FP              adopted, modified, dropped
go                   integer         adopted, dropped
m88ksim              integer         adopted, dropped
ijpeg                integer         adopted, dropped
perl                 integer         adopted, modified
vortex               integer         adopted, modified
mgrid                FP              adopted, modified
applu                FP              adopted, dropped
apsi                 FP              adopted, modified
turb3d                               adopted, dropped

FIGURE 1.30 The evolution of the SPEC benchmarks over time showing when benchmarks were adopted, modified, and dropped. All the programs in the 89, 92, and 95 releases are shown. Modified indicates that either the input or the size of the benchmark was changed, usually to increase its running time and avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.

Fallacy: Peak performance tracks observed performance.

The only universally true definition of peak performance is “the performance level a machine is guaranteed not to exceed.” The gap between peak performance and observed performance is typically a factor of 10 or more in supercomputers. (See Appendix B on vectors for an explanation.) Since the gap is so large and can vary significantly by benchmark, peak performance is not useful in predicting observed performance unless the workload consists of small programs that normally operate close to the peak.

As an example of this fallacy, a small code segment using long vectors ran on the Hitachi S810/20 in 1.3 seconds and on the Cray X-MP in 2.6 seconds. Although this suggests the S810 is two times faster than the X-MP, the X-MP runs a program with more typical vector lengths two times faster than the S810. These data are shown in Figure 1.32.
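The reversal in this example is easy to check from the measured times reported in Figure 1.32. The short sketch below only recomputes the two ratios; the dictionary layout and variable names are our own.

```python
# Execution times in seconds, as reported in Figure 1.32.
long_vector_kernel = {"Cray X-MP": 2.6, "Hitachi S810/20": 1.3}
vectorized_fft = {"Cray X-MP": 3.9, "Hitachi S810/20": 7.7}

# "Times faster" is the ratio of the slower time to the faster time.
hitachi_advantage = long_vector_kernel["Cray X-MP"] / long_vector_kernel["Hitachi S810/20"]
cray_advantage = vectorized_fft["Hitachi S810/20"] / vectorized_fft["Cray X-MP"]

print(f"Long-vector kernel: Hitachi is {hitachi_advantage:.2f}x faster")
print(f"Vectorized FFT:     Cray is {cray_advantage:.2f}x faster")
```

The same pair of machines swaps places depending on vector length, which is why a single near-peak measurement predicts little about either one.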

Fallacy: The best design for a computer is the one that optimizes the primary objective without considering implementation.

Machine                     EEMBC benchmark set   Performance,         Performance,   Ratio
                                                  compiler generated   hand coded     hand/compiler
Trimedia 1300 @ 166 MHz     Consumer              23.3                 110.0          4.7
BOPS Manta @ 136 MHz        Telecomm              2.6                  225.8          44.6
TI TMS320C6203 @ 300 MHz    Telecomm              6.8                  68.5           10.1

FIGURE 1.31 The performance of three embedded processors on C and hand-coded versions of portions of the EEMBC benchmark suite. In the case of the BOPS and TI processors, they also provide versions that are compiled but where the C is altered initially to improve performance and code generation; such versions can achieve most of the benefit from hand optimization, at least for these machines and these benchmarks.

Measurement                                 Cray X-MP   Hitachi S810/20   Performance
A(i)=B(i)*C(i)+D(i)*E(i)                    2.6 secs    1.3 secs          Hitachi 2 times faster
(vector length 1000 done 100,000 times)
Vectorized FFT                              3.9 secs    7.7 secs          Cray 2 times faster
(vector lengths 64,32,…,2)

FIGURE 1.32 Measurements of peak performance and actual performance for the Hitachi S810/20 and the Cray X-MP. Note that the gap between peak and observed performance is large and can vary across benchmarks. Data from pages 18–20 of Lubeck, Moore, and Mendez [1985]. Also see Fallacies and Pitfalls in Appendix B.

Although this might be true in a perfect world where implementation complexity and implementation time could be ignored, design complexity is an important factor. Complex designs take longer to complete, prolonging time to market. Given the rapidly improving performance of computers, longer design time means that a design will be less competitive. The architect must be constantly aware of the impact of his design choices on the design time for both hardware and software. The many postponements of the availability of the Itanium processor (roughly a two-year delay from the initial target date) should serve as a topical reminder of the risks of introducing both a new architecture and a complex design. With processor performance increasing by just over 50% per year, each week of delay translates to a 1% loss in relative performance!
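The weekly figure follows from compounding the annual improvement. The sketch below assumes exactly 50% growth per year and 52 weeks per year (both simplifications of "just over 50%"); under those assumptions the loss comes to a little under 1% per week, which the text rounds to 1%.

```python
WEEKS_PER_YEAR = 52     # simplifying assumption
ANNUAL_GROWTH = 1.5     # "just over 50% per year", taken here as exactly 50%

# Growth factor per week if the annual factor compounds weekly.
weekly_growth = ANNUAL_GROWTH ** (1 / WEEKS_PER_YEAR)


def relative_performance_after_delay(weeks, annual_growth=ANNUAL_GROWTH):
    """Performance of a delayed design relative to the moving state of the art."""
    return annual_growth ** (-weeks / WEEKS_PER_YEAR)


print(f"Weekly growth factor: {weekly_growth:.4f}")
print(f"Relative performance after a 1-week delay:  {relative_performance_after_delay(1):.4f}")
print(f"Relative performance after a 2-year delay:  {relative_performance_after_delay(104):.4f}")
```

Under the same assumptions, a two-year slip leaves a design at well under half the relative performance it would have had on schedule.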

Pitfall: Neglecting the cost of software in either evaluating a system or examining cost-performance.

For many years, hardware was so expensive that it clearly dominated the cost of software, but this is no longer true. Software costs in 2001 can be a large fraction of both the purchase and operational costs of a system. For example, for a medium-size database OLTP server, Microsoft OS software might run about $2,000, while the Oracle software would run between $6,000 and $9,000 for a four-year, one-processor license. Assuming a four-year software lifetime means a total software cost for these two major components of between $8,000 and $11,000. A midrange Dell server with 512 MB of memory, a Pentium III at 1 GHz, and between 20 and 100 GB of disk would cost roughly the same amount as these two major software components, meaning that software costs are roughly 50% of the total system cost!

Alternatively, consider a professional desktop system, which can be purchased with a 1 GHz Pentium III, 128 MB DRAM, 20 GB disk, and a 19-inch monitor for just under $1000. The software costs of a Windows OS and Office 2000 are about $300 if bundled with the system and about double that if purchased separately, so the software costs are somewhere between 23% and 38% of the total cost!
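The desktop percentages can be reproduced with a one-line calculation. The sketch below assumes the hardware price is rounded to $1,000 ("just under $1000" in the text) and that total cost is simply hardware plus software; the function name is our own.

```python
def software_share(software_cost, hardware_cost):
    """Fraction of total purchase cost (hardware + software) that is software."""
    return software_cost / (software_cost + hardware_cost)


DESKTOP_HARDWARE = 1000.0   # assumption: "just under $1000" rounded up

bundled = software_share(300.0, DESKTOP_HARDWARE)     # software bundled with the system
separate = software_share(600.0, DESKTOP_HARDWARE)    # about double if bought separately

print(f"Bundled software share:  {bundled:.1%}")
print(f"Separate software share: {separate:.1%}")
```

The same function gives 50% whenever software and hardware cost the same amount, which is the server case described earlier.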

Pitfall: Falling prey to Amdahl’s Law.

Virtually every practicing computer architect knows Amdahl’s Law. Despite this, we almost all occasionally fall into the trap of expending tremendous effort optimizing some aspect of a system before we measure its usage. Only when the overall speedup is unrewarding do we recall that we should have measured the usage of that feature before we spent so much effort enhancing it!
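As a reminder of why measuring usage first matters, here is a minimal sketch of Amdahl's Law in its usual form, overall speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that uses the enhancement and s is the speedup of that fraction. The usage fractions in the example are hypothetical.

```python
def amdahl_speedup(fraction_enhanced, enhancement_speedup):
    """Overall speedup when a fraction of execution time is sped up by a factor."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if enhancement_speedup <= 0.0:
        raise ValueError("enhancement_speedup must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / enhancement_speedup)


# Hypothetical case: a feature made 10x faster, at several usage fractions.
for fraction in (0.1, 0.5, 0.9):
    print(f"used {fraction:.0%} of the time -> overall speedup "
          f"{amdahl_speedup(fraction, 10.0):.2f}x")
```

A tenfold improvement to something used 10% of the time yields less than a 1.1x overall gain, which is the disappointment the pitfall describes.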

Fallacy: Synthetic benchmarks predict performance for real programs.

This fallacy appeared in the first edition of this book, published in 1990. With the arrival and dominance of organizations such as SPEC and TPC, we thought perhaps the computer industry had learned a lesson and reformed its faulty practices, but the emerging embedded market has embraced Dhrystone as its most quoted benchmark! Hence, this fallacy survives.

The best known examples of synthetic benchmarks are Whetstone and Dhrystone. These are not real programs and, as such, may not reflect program behavior for factors not measured. Compiler and hardware optimizations can artificially inflate performance of these benchmarks but not of real programs. The other side of the coin is that because these benchmarks are not natural programs, they don’t reward optimizations of behaviors that occur in real programs. Here are some examples:

- Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary. To address these problems the authors of the benchmark “require” both optimized and unoptimized code to be reported. In addition, they “forbid” the practice of inline-procedure expansion optimization, since Dhrystone’s simple procedure structure allows elimination of all procedure calls at almost no increase in code size.

- Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs. As a result Whetstone underrewards many loop optimizations and gains little from techniques such as multiple issue (Chapter 3) and vectorization (Appendix B).

- Compilers can optimize a key piece of the Whetstone loop by noting the relationship between square root and exponential, even though this is very unlikely to occur in real programs. For example, one key loop contains the following FORTRAN code:

    X = SQRT(EXP(ALOG(X)/T1))

It could be compiled as if it were

    X = EXP(ALOG(X)/(2*T1))

since

$$\mathrm{SQRT}(\mathrm{EXP}(X)) = \sqrt{e^{X}} = e^{X/2} = \mathrm{EXP}(X/2)$$

It would be surprising if such optimizations were ever invoked except in this synthetic benchmark. (Yet one reviewer of this book found several compilers that performed this optimization!) This single change converts all calls to the square root function in Whetstone into multiplies by 2, surely improving performance, if Whetstone is your measure.
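The algebraic identity behind this rewrite can be checked numerically. The sketch below transcribes the Whetstone statement and its rewritten form into Python (ALOG is the natural logarithm); the value chosen for T1 and the sample inputs are assumptions made only for the check.

```python
import math


def whetstone_original(x, t1):
    """X = SQRT(EXP(ALOG(X)/T1)), as written in the benchmark loop."""
    return math.sqrt(math.exp(math.log(x) / t1))


def whetstone_rewritten(x, t1):
    """X = EXP(ALOG(X)/(2*T1)), the form a compiler could substitute."""
    return math.exp(math.log(x) / (2.0 * t1))


T1 = 0.5  # hypothetical value, chosen only for illustration

for x in (0.5, 1.0, 2.0, 10.0):
    original = whetstone_original(x, T1)
    rewritten = whetstone_rewritten(x, T1)
    print(f"x={x}: original={original:.12f} rewritten={rewritten:.12f} "
          f"match={math.isclose(original, rewritten, rel_tol=1e-12)}")
```

The two forms agree up to floating-point rounding, but the rewritten one avoids the square root call entirely, which is exactly the saving the benchmark ends up rewarding.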

Fallacy: MIPS is an accurate measure for comparing performance among computers.

This fallacy also appeared in the first edition of this book, published in 1990. Your authors initially thought it could be retired, but, alas, the embedded market not only uses Dhrystone as the benchmark of choice, but reports performance as “Dhrystone MIPS,” a measure that this fallacy will show is problematic.

One alternative to time as the metric is MIPS, or million instructions per second. For a given program, MIPS is simply

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}}$$

Some find this rightmost form convenient since clock rate is fixed for a machine and CPI is usually a small number, unlike instruction count or execution time. Relating MIPS to time,

$$\text{Execution time} = \frac{\text{Instruction count}}{\text{MIPS} \times 10^{6}}$$

Since MIPS is a rate of operations per unit time, performance can be specified as the inverse of execution time, with faster machines having a higher MIPS rating.

The good news about MIPS is that it is easy to understand, especially by a customer, and faster machines mean a bigger MIPS rating, which matches intuition. The problem with using MIPS as a measure for comparison is threefold:

- MIPS is dependent on the instruction set, making it difficult to compare MIPS of computers with different instruction sets.

- MIPS varies between programs on the same computer.

- Most importantly, MIPS can vary inversely to performance!

The classic example of the last case is the MIPS rating of a machine with optional floating-point hardware. Since it generally takes more clock cycles per floating-point instruction than per integer instruction, floating-point programs using the optional hardware instead of software floating-point routines take less time but have a lower MIPS rating. Software floating point executes simpler instructions, resulting in a higher MIPS rating, but it executes so many more that overall execution time is longer.

MIPS is sometimes used by a single vendor (e.g., IBM) within a single set of applications, where this measure is less harmful since relative differences among MIPS ratings of machines with the same architecture and the same benchmarks are reasonably likely to track relative performance differences.

To try to avoid the worst difficulties of using MIPS as a performance measure, computer designers began using relative MIPS, which we discuss in detail on page 75, and this is what the embedded market reports for Dhrystone. Although less harmful than an actual MIPS measurement, relative MIPS have their shortcomings (e.g., they are not really MIPS!), especially when measured using Dhrystone!
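The floating-point case, where MIPS moves in the opposite direction from performance, can be illustrated with a small calculation. All the numbers below (clock rate, instruction counts, and CPI values) are hypothetical and were chosen only to show the effect; the function names are our own.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """Execution time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def mips(instruction_count, execution_time_s):
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


CLOCK_RATE_HZ = 100e6  # hypothetical 100 MHz machine

# Hypothetical program run two ways on the same machine.
# Hardware floating point: few instructions, but each takes many cycles.
hw_count, hw_cpi = 10e6, 4.0
# Software floating point: many simple instructions with a low CPI.
sw_count, sw_cpi = 100e6, 1.5

hw_time = execution_time(hw_count, hw_cpi, CLOCK_RATE_HZ)
sw_time = execution_time(sw_count, sw_cpi, CLOCK_RATE_HZ)
hw_mips = mips(hw_count, hw_time)
sw_mips = mips(sw_count, sw_time)

print(f"Hardware FP: {hw_time:.2f} s, {hw_mips:.1f} MIPS")
print(f"Software FP: {sw_time:.2f} s, {sw_mips:.1f} MIPS")
```

With these numbers the software floating-point run reports the higher MIPS rating while taking several times longer, so ranking by MIPS would pick the slower configuration.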

This chapter has introduced a number of concepts that we will expand upon as we go through this book. The major ideas in instruction set architecture and the
