2.13 Another View: The Trimedia TM32 CPU

Application area      Benchmarks
Data Communication    Viterbi decoding
Audio coding          AC3 Decode
Video coding          MPEG2 encode, DVD decode
Video processing      Layered natural motion, Dynamic noise reduction, Peaking
Graphics              3D renderer library

FIGURE 2.35 Media processor application areas and example benchmarks. From Riemens [1999]. This list shares only Viterbi decoding with the EEMBC benchmarks (see Figure 1.12 in Chapter 1), with the rest being generally larger programs than EEMBC.

Because the Trimedia TM32 CPU has long instruction words that often contain NOPs, Trimedia compacts its instructions in memory and decodes them to the full size when they are loaded into the cache.
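One way to picture this kind of compaction is a presence mask: one bit per operation slot records whether the slot holds a real operation, and only the non-NOP operations are stored after it. The sketch below is illustrative only. The five-slot width, the mask layout, and the names `compact` and `expand` are assumptions made for the example, not the actual TM32 encoding.

```python
NOP = "nop"
SLOTS = 5  # assumed number of operation slots per instruction word


def compact(instruction):
    """Return (mask, ops); bit i of mask is set when slot i holds a real operation."""
    assert len(instruction) == SLOTS
    mask = 0
    ops = []
    for i, op in enumerate(instruction):
        if op != NOP:
            mask |= 1 << i
            ops.append(op)
    return mask, ops


def expand(mask, ops):
    """Rebuild the full-width instruction, reinserting a NOP for every cleared mask bit."""
    remaining = iter(ops)
    return [next(remaining) if mask & (1 << i) else NOP for i in range(SLOTS)]


# Hypothetical instruction word with two useful operations and three NOPs.
word = ["add", NOP, NOP, "ld32", NOP]
mask, ops = compact(word)      # stored form in memory
restored = expand(mask, ops)   # full-size form, as it would sit in the cache
```

In this model the memory image holds a mask plus two operations instead of five, while the cache still sees a fixed-width instruction.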

Figure 2.37 shows the TM32 CPU instruction mix for the EEMBC benchmarks. Using the unmodified source code, the instruction mix is similar to others, although there are more byte data transfers. If the C code is hand-tuned, it can use SIMD instructions extensively. Note the large number of pack and merge instructions needed to align the data for the SIMD instructions. Even after compaction, the code size of these VLIW instructions is still a factor of two to three larger than MIPS.
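Figure 2.36 describes the TM32 SIMD operations as partitioned ALU operations on several narrow data items inside one 32-bit ALU. As a rough sketch of what that means, the code below packs four bytes into a 32-bit word and adds two such words lane by lane, so that a carry out of one byte never spills into its neighbor. The function names are hypothetical and do not correspond to TM32 opcodes; the masking trick stands in for what the hardware does by cutting the carry chain.

```python
LOW7 = 0x7F7F7F7F  # low seven bits of each byte lane
HIGH = 0x80808080  # top bit of each byte lane


def pack_bytes(b3, b2, b1, b0):
    """Pack four 8-bit values into one 32-bit word, with b0 in the low byte."""
    return ((b3 & 0xFF) << 24) | ((b2 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b0 & 0xFF)


def unpack_bytes(word):
    """Split a 32-bit word into its four bytes, high byte first."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def add_bytes(x, y):
    """Add four byte lanes at once; each lane wraps modulo 256 independently."""
    low = (x & LOW7) + (y & LOW7)          # carries cannot leave a lane here
    return (low ^ ((x ^ y) & HIGH)) & 0xFFFFFFFF  # fold in the top bit of each lane


x = pack_bytes(250, 16, 127, 1)
y = pack_bytes(10, 16, 1, 255)
lanes = unpack_bytes(add_bytes(x, y))
```

The packing step is the kind of work the pack and merge instructions do before a SIMD operation can be applied, which is why they appear so often in the hand-tuned mix.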

Operation category         Examples                                                  Variations  Comment
Data transfer                                                                        33          signed, unsigned, register indirect, indexed, scaled addressing
Byte shuffles              shift right 1-, 2-, 3-bytes, select byte, merge, pack     11          SIMD type convert
Bit shifts                 asl, asr, lsl, lsr, rol                                   10          shifts, SIMD
Multiplies and multimedia  mul, sum of products, sum-of-SIMD-elements,               23          round, saturate, 2’s comp, SIMD
                           multimedia, e.g. sum of products (FIR)
Integer arithmetic         add, sub, min, max, abs, average, bitand, bitor, bitxor,  62          saturate, 2’s comp, unsigned, immediate, SIMD
                           bitinv, bitandinv, eql, neq, gtr, geq, les, leq,
                           sign extend, zero extend, sum of absolute differences
Floating point             add, sub, neg, mul, div, sqrt, eql, neq, gtr, geq, les,   42          scalar
                           leq, IEEE flags
Special ops                alloc, prefetch, copy back, read tag, read cache status,  20          cache, special regs
                           read counter
Branch                     jmpt, jmpf                                                6           (un)interruptible
Total                                                                                207

FIGURE 2.36 List of operations and number of variations in Trimedia TM32 CPU. The data transfer opcodes include addressing modes in the count of operations, so the number is high compared to other architectures. SIMD means partitioned ALU operations in which multiple narrow data items are operated on simultaneously in a 32-bit ALU; these include special operations for multimedia. The branches are delayed 3 slots.

Operation                         Out-of-the-box   Modified C Source Code
add word                          26.5%            20.5%
load byte                         10.4%            1.0%
subtract word                     10.1%            1.1%
shift left arithmetic             7.8%             0.2%
store byte                        7.4%             1.5%
multiply word                     5.5%             0.4%
shift right arithmetic            3.6%             0.7%
and word                          3.6%             6.8%
load word                         3.5%             7.2%
load immediate                    3.1%             1.6%
set greater than, equal           2.9%             1.3%
store word                        2.0%             5.3%
jump                              1.8%             0.8%
conditional branch                1.3%             1.0%
pack/merge bytes                  2.6%             16.8%
SIMD sum of half word products    0.0%             10.1%
SIMD sum of byte products         0.0%             7.7%
pack/merge half words             0.0%             6.5%
SIMD subtract half word           0.0%             2.9%
SIMD maximum byte                 0.0%             1.9%
Total                             92.2%            95.5%
TM32 CPU Code Size (bytes)        243,968          387,328
MIPS Code Size (bytes)            120,729

FIGURE 2.37 TM32 CPU instruction mix running EEMBC consumer benchmark. The instruction mix for “out-of-the-box” C code is similar to general-purpose computers, with a higher emphasis on byte data transfers. The hand-optimized C code uses the SIMD instructions and the pack and merge instructions to align the data. The middle column shows the relative instruction mix for unmodified kernels, while the right column allows modification at the C level. These columns list all operations that were responsible for at least 1% of the total in one of the mixes. MIPS code size is for the Apogee compiler for the NEC VR5432.

2.14 Fallacies and Pitfalls

Architects have repeatedly tripped on common, but erroneous, beliefs. In this section we look at a few of them.

Pitfall: Designing a “high-level” instruction set feature specifically oriented to supporting a high-level language structure.

Attempts to incorporate high-level language features in the instruction set have led architects to provide powerful instructions with a wide range of flexibility. However, these instructions often do more work than is required in the frequent case, or they don’t exactly match the requirements of some languages. Many such efforts have been aimed at eliminating what in the 1970s was called the semantic gap. Although the idea is to supplement the instruction set with additions that bring the hardware up to the level of the language, the additions can generate what Wulf [1981] has called a semantic clash:

... by giving too much semantic content to the instruction, the computer designer made it possible to use the instruction only in limited contexts. [p. 43]

More often the instructions are simply overkill: they are too general for the most frequent case, resulting in unneeded work and a slower instruction. Again, the VAX CALLS is a good example. CALLS uses a callee-save strategy (the registers to be saved are specified by the callee), but the saving is done by the call instruction in the caller. The CALLS instruction begins with the arguments pushed on the stack, and then takes the following steps:

1. Align the stack if needed.

2. Push the argument count on the stack.

3. Save the registers indicated by the procedure call mask on the stack (as mentioned in section 2.11). The mask is kept in the called procedure’s code; this permits the callee to specify the registers to be saved by the caller even with separate compilation.

4. Push the return address on the stack, and then push the top and base of stack pointers (for the activation record).

5. Clear the condition codes, which sets the trap enables to a known state.

6. Push a word for status information and a zero word on the stack.

7. Update the two stack pointers.

8. Branch to the first instruction of the procedure.

The vast majority of calls in real programs do not require this amount of overhead. Most procedures know their argument counts, and a much faster linkage convention can be established using registers to pass arguments rather than the stack in memory. Furthermore, the CALLS instruction forces two registers to be used for linkage, while many languages require only one linkage register. Many attempts to support procedure call and activation stack management have failed to be useful, either because they do not match the language needs or because they are too general and hence too expensive to use.

The VAX designers provided a simpler instruction, JSB, that is much faster since it only pushes the return PC on the stack and jumps to the procedure. However, most VAX compilers use the more costly CALLS instructions. The call instructions were included in the architecture to standardize the procedure linkage convention. Other computers have standardized their calling convention by agreement among compiler writers and without requiring the overhead of a complex, very general procedure call instruction.
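To get a feel for the difference in memory traffic, the sketch below simply counts the stack entries written by the CALLS steps listed above and compares the total with JSB, which pushes only the return PC. It is a rough tally based only on that list, not a cycle-accurate VAX model; the function names are made up for the example, and the arguments themselves are left out because the caller pushes them before CALLS begins.

```python
def calls_stack_entries(saved_registers):
    """Stack entries written by the CALLS steps, excluding the arguments."""
    pushes = [
        ("argument count", 1),
        ("registers named in the procedure call mask", saved_registers),
        ("return address", 1),
        ("top and base of stack pointers", 2),
        ("status word and zero word", 2),
    ]
    return sum(count for _, count in pushes)


def jsb_stack_entries():
    """JSB pushes only the return PC."""
    return 1


# A leaf procedure that saves no registers still pays for the fixed part of the frame.
leaf_calls = calls_stack_entries(0)
leaf_jsb = jsb_stack_entries()
```

Under these assumptions, a procedure that saves no registers writes six entries with CALLS and one with JSB, before counting the stack alignment, condition code clearing, and pointer updates that CALLS also performs.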

Fallacy: There is such a thing as a typical program.

Many people would like to believe that there is a single “typical” program that could be used to design an optimal instruction set. For example, see the synthetic benchmarks discussed in Chapter 1. The data in this chapter clearly show that programs can vary significantly in how they use an instruction set. For example, Figure 2.38 shows the mix of data transfer sizes for four of the SPEC2000 programs: It would be hard to say what is typical from these four programs. The variations are even larger on an instruction set that supports a class of applications, such as decimal instructions, that are unused by other applications.

FIGURE 2.38 Data reference size of four programs from SPEC2000. Although you can calculate an average size, it would be hard to claim the average is typical of programs. [Bar chart: share of data references, 0% to 100%, by size (byte, 8 bits; half word, 16 bits; word, 32 bits; double word, 64 bits) for applu, equake, gzip, and perl.]

Pitfall: Innovating at the instruction set architecture to reduce code size without accounting for the compiler.

Figure 2.39 shows the relative code sizes for four compilers for the MIPS instruction set. Whereas architects struggle to reduce code size by 30% to 40%, different compiler strategies can change code size by much larger factors. As with performance optimization techniques, the architect should start with the tightest code the compilers can produce before proposing hardware innovations to save space.
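The summary row of Figure 2.39 is a geometric mean, the usual way to average ratios. The sketch below recomputes it from the rounded per-kernel entries printed in the figure; the dictionary layout and the helper name `geometric_mean` are choices made for this example.

```python
import math

# Code size relative to the Apogee compiler, copied from Figure 2.39
# (five EEMBC Telecom kernels, in the order the figure lists them).
relative_size = {
    "Green Hills Multi2000 2.0": [2.1, 1.9, 2.0, 1.1, 1.7],
    "Algorithmics SDE4.0B": [1.1, 1.2, 1.2, 2.7, 0.8],
    "IDT/c 7.2.1": [2.7, 2.4, 2.3, 1.8, 1.1],
}


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


means = {name: geometric_mean(sizes) for name, sizes in relative_size.items()}
```

Recomputed this way, the Green Hills and IDT/c means come out near 1.7 and 2.0, in line with the figure. The Algorithmics value comes out somewhat below the printed 1.4, presumably because the published means were computed from unrounded code sizes. Either way, the spread between compilers is far larger than the 30% to 40% that architects fight for.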

Pitfall: Expecting to get good performance from a compiler for DSPs.

Figure 2.40 shows the performance improvement to be gained by using assembly language, versus compiling from C for two Texas Instruments DSPs. Assembly language programming gains factors of 3 to 10 in performance and factors of 1 to 8 in code size. This gain is large enough to lure DSP programmers away from high-level languages, despite their well-documented advantages in programmer productivity and software maintenance.

Fallacy: An architecture with flaws cannot be successful.

The 80x86 provides a dramatic example: The instruction set architecture is one only its creators could love (see Appendix C). Succeeding generations of Intel engineers have tried to correct unpopular architectural decisions made in designing the 80x86. For example, the 80x86 supports segmentation, whereas all others picked paging; it uses extended accumulators for integer data, but other processors use general-purpose registers; and it uses a stack for floating-point data, when everyone else abandoned execution stacks long before.

Compiler                            Apogee Software:  Green Hills:            Algorithmics   IDT/c 7.2.1
                                    Version 4.1       Multi2000 Version 2.0   SDE4.0B
Architecture                        MIPS IV           MIPS IV                 MIPS 32        MIPS 32
Processor                           NEC VR5432        NEC VR5000              IDT 32334      IDT 79RC32364
Auto Correlation kernel             1.0               2.1                     1.1            2.7
Convolutional Encoder kernel        1.0               1.9                     1.2            2.4
Fixed-Point Bit Allocation kernel   1.0               2.0                     1.2            2.3
Fixed-Point Complex FFT kernel      1.0               1.1                     2.7            1.8
Viterbi GSM Decoder kernel          1.0               1.7                     0.8            1.1
Geometric Mean of 5 kernels         1.0               1.7                     1.4            2.0

FIGURE 2.39 Code size relative to Apogee Software Version 4.1 C compiler for Telecom application of EEMBC benchmarks. The instruction set architectures are virtually identical, yet the code sizes vary by factors of two. These results were reported February to June 2000.

Despite these major difficulties, the 80x86 architecture has been enormously successful. The reasons are threefold: First, its selection as the microprocessor in the initial IBM PC makes 80x86 binary compatibility extremely valuable. Second, Moore’s Law provided sufficient resources for 80x86 microprocessors to translate to an internal RISC instruction set and then execute RISC-like instructions (see section 3.8 in the next chapter). This mix enables binary compatibility with the valuable PC software base and performance on par with RISC processors. Third, the very high volumes of PC microprocessors mean Intel can easily pay for the increased design cost of hardware translation. In addition, the high volumes allow the manufacturer to go up the learning curve, which lowers the cost of the product.

The larger die size and increased power for translation may be a liability for embedded applications, but it makes tremendous economic sense for the desktop. Its cost-performance in the desktop also makes it attractive for servers, with its main weakness for servers being 32-bit addresses: companies already offer high-end servers with more than one terabyte (2⁴⁰ bytes) of memory.
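The size of that gap is easy to work out. The short calculation below assumes plain byte addressing with a single flat 32-bit address, which is a simplification made for the example.

```python
ADDRESS_BITS = 32

addressable_bytes = 2 ** ADDRESS_BITS   # reach of a 32-bit byte address
terabyte = 2 ** 40                      # server memory size cited in the text

# Number of complete 32-bit address spaces needed to cover one terabyte.
address_spaces_needed = terabyte // addressable_bytes
extra_address_bits = (terabyte - 1).bit_length() - ADDRESS_BITS
```

A single 32-bit address reaches 4 gigabytes (2³² bytes), so a one-terabyte memory is 256 times larger and needs 8 more address bits.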

TMS320C54 D     Execution    Code size   TMS 320C6203                 Execution    Code size
                time ratio   ratio                                    time ratio   ratio
Convolution     11.8         16.5        Convolutional Encoder        44.0         0.5
FIR             11.5         8.7         Fixed-Point Complex FFT      13.5         1.0
Matrix 1x3      7.7          8.1         Viterbi GSM Decoder          13.0         0.7
FIR2dim         5.3          6.5         Fixed-point Bit Allocation   7.0          1.4
Dot product     5.2          14.1        Auto Correlation             1.8          0.7

FIGURE 2.40 Ratio of execution time and code size for compiled code vs. hand-written code for TMS320C54 DSPs on left (using the 14 DSPstone kernels) and Texas Instruments TMS 320C6203 on right (using the 6 EEMBC Telecom kernels). The geometric mean of performance improvements is 3.2:1 for C54 running DSPstone and 10.0:1 for the C62 running EEMBC. The compiler does a better job on code space for the C62, which is a VLIW processor, but the geometric mean of code size for the C54 is almost a factor of 8 larger when compiled. Modifying the C code gives much better results. The EEMBC results were reported May 2000. For DSPstone, see Ropers [1999].

Fallacy: You can design a flawless architecture.

All architecture design involves trade-offs made in the context of a set of hardware and software technologies. Over time those technologies are likely to change, and decisions that may have been correct at the time they were made look like mistakes. For example, in 1975 the VAX designers overemphasized the importance of code size efficiency, underestimating how important ease of decoding and pipelining would be five years later. An example in the RISC camp is delayed branch (see Appendix B). It was a simple way to control pipeline hazards with five-stage pipelines, but a challenge for processors with longer pipelines that issue multiple instructions per clock cycle. In addition, almost all architectures eventually succumb to the lack of sufficient address space.

In general, avoiding such flaws in the long run would probably mean compromising the efficiency of the architecture in the short run, which is dangerous, since a new instruction set architecture must struggle to survive its first few years.

2.15 Concluding Remarks

The earliest architectures were limited in their instruction sets by the hardware technology of that time. As soon as the hardware technology permitted, computer architects began looking for ways to support high-level languages. This search led to three distinct periods of thought about how to support programs efficiently.

In the 1960s, stack architectures became popular. They were viewed as being a good match for high-level languages, and they probably were, given the compiler technology of the day. In the 1970s, the main concern of architects was how to reduce software costs. This concern was met primarily by replacing software with hardware, or by providing high-level architectures that could simplify the task of software designers. The result was both the high-level-language computer architecture movement and powerful architectures like the VAX, which has a large number of addressing modes, multiple data types, and a highly orthogonal architecture. In the 1980s, more sophisticated compiler technology and a renewed emphasis on processor performance saw a return to simpler architectures, based mainly on the load-store style of computer.

The following instruction set architecture changes occurred in the 1990s:

• Address size doubles: The 32-bit address instruction sets for most desktop and server processors were extended to 64-bit addresses, expanding the width of the registers (among other things) to 64 bits. Appendix B gives three examples of architectures that have gone from 32 bits to 64 bits.

• Optimization of conditional branches via conditional execution: In the next two chapters we see that conditional branches can limit the performance of aggressive computer designs. Hence, there was interest in replacing conditional branches with conditional completion of operations, such as conditional move (see Chapter 4), which was added to most instruction sets.

• Optimization of cache performance via prefetch: Chapter 5 explains the increasing role of memory hierarchy in performance of computers, with a cache miss on some computers taking as many instruction times as page faults took on earlier computers. Hence, prefetch instructions were added to try to hide the cost of cache misses by prefetching (see Chapter 5).

• Support for multimedia: Most desktop and embedded instruction sets were extended with support for multimedia and DSP applications, as discussed in this chapter.

• Faster floating-point operations: Appendix G describes operations added to enhance floating-point performance, such as operations that perform a multiply and an add and paired single execution. (We include them in MIPS.)

Looking to the next decade, we see the following trends in instruction set design:

• Long instruction words: The desire to achieve more instruction level parallelism by changing the architecture to support wider instructions (see Chapter 4).

• Increased conditional execution: More support for conditional execution of operations to support greater speculation.

• Blending of general purpose and DSP architectures: Parallel efforts between desktop and embedded processors to add DSP support vs. extending DSP processors to make them better targets for compilers, suggesting a culture clash in the marketplace between general purpose and DSPs.

• 80x86 emulation: Given the popularity of software for the 80x86 architecture, many companies are looking to see if changes to the instruction sets can significantly improve performance, cost, or power when emulating the 80x86 architecture.

Between 1970 and 1985 many thought the primary job of the computer architect was the design of instruction sets. As a result, textbooks of that era emphasize instruction set design, much as computer architecture textbooks of the 1950s and 1960s emphasized computer arithmetic. The educated architect was expected to have strong opinions about the strengths and especially the weaknesses of the popular computers. The importance of binary compatibility in quashing innovations in instruction set design was unappreciated by many researchers and textbook writers, giving the impression that many architects would get a chance to design an instruction set.

The definition of computer architecture today has been expanded to include design and evaluation of the full computer system (not just the definition of the instruction set and not just the processor), and hence there are plenty of topics for the architect to study. (You may have guessed this the first time you lifted this book.) Hence, the bulk of this book is on design of computers versus instruction sets.

The many appendices may satisfy readers interested in instruction set architecture: Appendix B compares seven popular load-store computers with MIPS. Appendix C describes the most widely used instruction set, the Intel 80x86, and compares instruction counts for it with that of MIPS for several programs. For those interested in the historical computers, Appendix D summarizes the VAX architecture and Appendix E summarizes the IBM 360/370.

One’s eyebrows should rise whenever a future architecture is developed with a stack- or register-oriented instruction set. [p. 20]

Meyers [1978]

The earliest computers, including the UNIVAC I, the EDSAC, and the IAS computers, were accumulator-based computers. The simplicity of this type of computer made it the natural choice when hardware resources were very constrained. The first general-purpose register computer was the Pegasus, built by Ferranti, Ltd. in 1956. The Pegasus had eight general-purpose registers, with R0 always being zero. Block transfers loaded the eight registers from the drum memory.

Stack Architectures

In 1963, Burroughs delivered the B5000. The B5000 was perhaps the first computer to seriously consider software and hardware-software trade-offs. Barton and the designers at Burroughs made the B5000 a stack architecture (as described in Barton [1961]). Designed to support high-level languages such as ALGOL, this stack architecture used an operating system (MCP) written in a high-level language. The B5000 was also the first computer from a U.S. manufacturer to support virtual memory. The B6500, introduced in 1968 (and discussed in Hauck and Dent [1968]), added hardware-managed activation records. In both the B5000 and B6500, the top two elements of the stack were kept in the processor and the rest of the stack was kept in memory. The stack architecture yielded good code density, but only provided two high-speed storage locations. The authors of both the original IBM 360 paper [Amdahl, Blaauw, and Brooks 1964] and the original PDP-11 paper [Bell et al. 1970] argue against the stack organization. They cite three major points in their arguments against stacks:

1. Performance is derived from fast registers, not the way they are used.

2. The stack organization is too limiting and requires many swap and copy operations.

3. The stack has a bottom, and when placed in slower memory there is a
