
J.12 Historical Perspective and References


There is, of course, no denying the fact that human time is consumed in arranging for the introduction of suitable scale factors. We only argue that the time so consumed is a very small percentage of the total time we will spend in preparing an interesting problem for our machine. The first advantage of the floating point is, we feel, somewhat illusory. In order to have such a floating point, one must waste memory capacity that could otherwise be used for carrying more digits per word. It would therefore seem to us not at all clear whether the modest advantages of a floating binary point offset the loss of memory capacity and the increased complexity of the arithmetic and control circuits.

This enables us to see things from the perspective of early computer designers, who believed that saving computer time and memory were more important than saving programmer time.

The original papers introducing the Wallace tree, Booth recoding, SRT division, overlapped triplets, and so on are reprinted in Swartzlander [1990]. A good explanation of an early machine (the IBM 360/91) that used a pipelined Wallace tree, Booth recoding, and iterative division is in Anderson et al. [1967]. A discussion of the average time for single-bit SRT division is in Freiman [1961]; this is one of the few interesting historical papers that does not appear in Swartzlander.

The standard book of Mead and Conway [1980] discouraged the use of CLAs as not being cost effective in VLSI. The important paper by Brent and Kung [1982] helped combat that view. An example of a detailed layout for CLAs can be found in Ngai and Irwin [1985] or in Weste and Eshraghian [1993], and a more theoretical treatment is given by Leighton [1992]. Takagi, Yasuura, and Yajima [1985] provide a detailed description of a signed-digit tree multiplier.

Before the ascendancy of IEEE arithmetic, many different floating-point formats were in use. Three important ones were used by the IBM 370, the DEC VAX, and the Cray. Here is a brief summary of these older formats.

The VAX format is closest to the IEEE standard. Its single-precision format (F format) is like IEEE single precision in that it has a hidden bit, 8 bits of exponent, and 23 bits of fraction. However, it does not have a sticky bit, which causes it to round halfway cases up instead of to even. The VAX has a slightly different exponent range from IEEE single: E_min is −128 rather than −126 as in IEEE, and E_max is 126 instead of 127. The main differences between VAX and IEEE are the lack of special values and gradual underflow. The VAX has a reserved operand, but it works like a signaling NaN: It traps whenever it is referenced. Originally, the VAX’s double precision (D format) also had 8 bits of exponent. However, as this is too small for many applications, a G format was added; like the IEEE standard, this format has 11 bits of exponent. The VAX also has an H format, which is 128 bits long.
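The difference between rounding halfway cases up and rounding them to even can be sketched in a few lines. The sketch below is only an illustration: it assumes the significand magnitude is held as an integer, with a guard bit (the first discarded bit) and a sticky bit (the OR of all later discarded bits). The function names are hypothetical, and carry-out from the significand is ignored.

```python
def round_half_even(sig, guard, sticky):
    """Round to nearest, ties to even (IEEE default). Needs the sticky bit
    to tell an exact halfway case (guard=1, sticky=0) from 'above half'."""
    if guard and (sticky or (sig & 1)):
        return sig + 1
    return sig


def round_half_up(sig, guard):
    """Round to nearest, halfway cases up. With no sticky bit available,
    every case with guard=1 is rounded up."""
    return sig + guard


# Exact halfway case on an even significand: the two rules disagree.
print(bin(round_half_even(0b1010, 1, 0)))  # stays at 0b1010
print(bin(round_half_up(0b1010, 1)))       # goes to 0b1011
```

The two rules agree whenever the discarded bits are not exactly one half of the last retained place, so the difference shows up only in tie cases.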

The IBM 370 floating-point format uses base 16 rather than base 2. This means it cannot use a hidden bit. In single precision, it has 7 bits of exponent and 24 bits (6 hex digits) of fraction. Thus, the largest representable number is $16^{2^7} = 2^{4 \times 2^7} = 2^{2^9}$, compared with $2^{2^8}$ for IEEE. However, a number that is normalized in the hexadecimal sense only needs to have a nonzero leading digit. When interpreted in binary, the three most-significant bits could be zero. Thus, there are potentially fewer than 24 bits of significance. The reason for using the higher base was to minimize the amount of shifting required when adding floating-point numbers. However, this is less significant in current machines, where the floating-point add time is usually fixed independently of the operands.
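The loss of significance from hexadecimal normalization can be seen with a small sketch. It assumes a 6-hex-digit fraction in the range [1/16, 1) with truncation, and it does not model the 370's actual exponent encoding; `hex_normalize` and `significant_bits` are illustrative names, not part of any real library.

```python
import math


def hex_normalize(x, digits=6):
    """Return (fraction_int, exp16) such that
    x ~= fraction_int / 16**digits * 16**exp16 with 1/16 <= fraction < 1.
    Bits beyond `digits` hex digits are truncated."""
    if x <= 0:
        raise ValueError("sketch handles positive values only")
    m, e2 = math.frexp(x)            # x = m * 2**e2, 0.5 <= m < 1
    exp16 = -(-e2 // 4)              # ceil(e2 / 4)
    frac = m * 2.0 ** (e2 - 4 * exp16)   # shift right by 0..3 bits
    return int(frac * 16 ** digits), exp16


def significant_bits(x):
    """Number of bits from the leading 1 to the end of the 24-bit fraction."""
    frac_int, _ = hex_normalize(x)
    return frac_int.bit_length()


for value in (0.5, 1.0, 15.0, 16.0):
    frac_int, exp16 = hex_normalize(value)
    print(value, hex(frac_int), exp16, significant_bits(value))
```

In this sketch 0.5 and 15.0 use all 24 bits, while 1.0 and 16.0 have a leading hex digit of 1 (binary 0001) and so carry only 21 significant bits.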

Another difference between 370 arithmetic and IEEE arithmetic is that the 370 has neither a round digit nor a sticky digit, which effectively means that it truncates rather than rounds. Thus, in many computations, the result will systematically be too small. Unlike the VAX and IEEE arithmetic, every bit pattern is a valid number. Thus, library routines must establish conventions for what to return in case of errors. In the IBM FORTRAN library, for example, $\sqrt{-4}$ returns 2!
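The systematic bias of truncation can be illustrated with a short sketch. It assumes a binary significand of 24 bits rather than the 370's hexadecimal digits, and the helper names `chop` and `chopped_sum` are hypothetical. With positive operands, each truncated addition can only lose value, so the running sum can never exceed the exact sum.

```python
import math
from fractions import Fraction


def chop(x, p=24):
    """Truncate x to p significant bits (round toward zero)."""
    if x == 0:
        return 0.0
    m, e = math.frexp(x)                  # x = m * 2**e, 0.5 <= |m| < 1
    scaled = math.trunc(m * 2 ** p)       # drop bits below the p-th
    return math.ldexp(scaled, e - p)


def chopped_sum(term, n, p=24):
    """Add `term` to an accumulator n times, truncating after every add."""
    acc = 0.0
    for _ in range(n):
        acc = chop(acc + term, p)
    return acc


term = chop(0.1)                          # 0.1 truncated to 24 bits
approx = chopped_sum(term, 1000)
exact = 1000 * Fraction(term)             # exact rational reference
print(approx, float(exact), Fraction(approx) < exact)
```

The exact reference uses `fractions.Fraction` so that the comparison itself introduces no rounding.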

Arithmetic on Cray computers is interesting because it is driven by a motivation for the highest possible floating-point performance. It has a 15-bit exponent field and a 48-bit fraction field. Addition on Cray computers does not have a guard digit, and multiplication is even less accurate than addition. Thinking of multiplication as a sum of p numbers, each 2p bits long, Cray computers drop the low-order bits of each summand. Thus, analyzing the exact error characteristics of the multiply operation is not easy. Reciprocals are computed using iteration, and division of a by b is done by multiplying a times 1/b. The errors in multiplication and reciprocation combine to make the last three bits of a divide operation unreliable. At least Cray computers serve to keep numerical analysts on their toes!
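Even with correctly rounded operations, dividing by multiplying with a reciprocal rounds twice and can differ from a true division. The sketch below shows this in IEEE double precision as provided by Python floats; it does not model the Cray's less accurate multiply or its reciprocal iteration, and `divide_via_reciprocal` is an illustrative name.

```python
def divide_via_reciprocal(a, b):
    """Compute a / b as a * (1 / b): two roundings instead of one."""
    reciprocal = 1.0 / b
    return a * reciprocal


def mismatches(limit):
    """Integer pairs (a, b) in 1..limit where the two methods disagree."""
    return [
        (a, b)
        for a in range(1, limit + 1)
        for b in range(1, limit + 1)
        if divide_via_reciprocal(float(a), float(b)) != float(a) / float(b)
    ]


print(49.0 / 49.0)                         # 1.0
print(divide_via_reciprocal(49.0, 49.0))   # 0.9999999999999999
print(len(mismatches(50)), "mismatching pairs with operands up to 50")
```

On hardware whose multiply and reciprocal are themselves inexact, as described above, the discrepancy would be larger than the single-unit-in-the-last-place difference seen here.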

The IEEE standardization process began in 1977, inspired mainly by W. Kahan and based partly on Kahan’s work with the IBM 7094 at the University of Toronto [Kahan 1968]. The standardization process was a lengthy affair, with gradual underflow causing the most controversy. (According to Cleve Moler, visitors to the United States were advised that the sights not to be missed were Las Vegas, the Grand Canyon, and the IEEE standards committee meeting.) The standard was finally approved in 1985. The Intel 8087 was the first major commercial IEEE implementation and appeared in 1981, before the standard was finalized. It contains features that were eliminated in the final standard, such as projective bits. According to Kahan, the length of double-extended precision was based on what could be implemented in the 8087. Although the IEEE standard was not based on any existing floating-point system, most of its features were present in some other system. For example, the CDC 6600 reserved special bit patterns for INDEFINITE and INFINITY, while the idea of denormal numbers appears in Goldberg [1967] as well as in Kahan [1968]. Kahan was awarded the 1989 Turing prize in recognition of his work on floating point.

Although floating point rarely attracts the interest of the general press, newspapers were filled with stories about floating-point division in November 1994. A bug in the division algorithm used on all of Intel’s Pentium chips had just come to light. It was discovered by Thomas Nicely, a math professor at Lynchburg College in Virginia. Nicely found the bug when doing calculations involving reciprocals of prime numbers. News of Nicely’s discovery first appeared in the press on the front page of the November 7 issue of Electronic Engineering Times. Intel’s immediate response was to stonewall, asserting that the bug would only affect theoretical mathematicians. Intel told the press, “This doesn’t even qualify as an errata . . . even if you’re an engineer, you’re not going to see this.”


Under more pressure, Intel issued a white paper, dated November 30, explaining why they didn’t think the bug was significant. One of their arguments was based on the fact that if you pick two floating-point numbers at random and divide one into the other, the chance that the resulting quotient will be in error is about 1 in 9 billion. However, Intel neglected to explain why they thought that the typical customer accessed floating-point numbers randomly.

Pressure continued to mount on Intel. One sore point was that Intel had known about the bug before Nicely discovered it, but had decided not to make it public. Finally, on December 20, Intel announced that they would unconditionally replace any Pentium chip that used the faulty algorithm and that they would take an unspecified charge against earnings, which turned out to be $300 million.

The Pentium uses a simple version of SRT division as discussed in Section J.9. The bug was introduced when they converted the quotient lookup table to a PLA. Evidently there were a few elements of the table containing the quotient digit 2 that Intel thought would never be accessed, and they optimized the PLA design using this assumption. The resulting PLA returned 0 rather than 2 in these situations. However, those entries were really accessed, and this caused the division bug. Even though the effect of the faulty PLA was to cause 5 out of 2048 table entries to be wrong, the Pentium only computes an incorrect quotient 1 out of 9 billion times on random inputs. This is explored in Exercise J.34.
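A division check of the following kind circulated widely at the time. The operand pair 4195835 and 3145727 is not taken from this text; it is the commonly cited example, and the sketch assumes correctly rounded IEEE double-precision arithmetic, as Python floats provide on current hardware. The function name is illustrative.

```python
def division_residual(x, y):
    """Return x - (x / y) * y. For the operand pair used below this is
    exactly 0.0 when division and multiplication are correctly rounded."""
    return x - (x / y) * y


# Commonly cited operands that exposed the flawed quotient table.
residual = division_residual(4195835.0, 3145727.0)
print(residual)  # 0.0 with correct division; flawed chips reportedly gave 256
```

The residual is zero here because the rounding error in the quotient, scaled by the divisor, is smaller than half a unit in the last place of the dividend, so the product rounds back to the dividend exactly.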

References

Anderson, S. F., J. G. Earle, R. E. Goldschmidt, and D. M. Powers [1967]. “The IBM System/360 Model 91: Floating-point execution unit,” IBM J. Research and Development 11, 34–53. Reprinted in Swartzlander [1990].

Good description of an early high-performance floating-point unit that used a pipelined Wallace tree multiplier and iterative division.

Bell, C. G., and A. Newell [1971]. Computer Structures: Readings and Examples, McGraw-Hill, New York.

Birman, M., A. Samuels, G. Chu, T. Chuk, L. Hu, J. McLeod, and J. Barnes [1990]. “Developing the WRL3170/3171 SPARC floating-point coprocessors,” IEEE Micro 10:1, 55–64.

These chips have the same floating-point core as the Weitek 3364, and this paper has a fairly detailed description of that floating-point design.

Brent, R. P., and H. T. Kung [1982]. “A regular layout for parallel adders,” IEEE Trans. on Computers C-31, 260–264.

This is the paper that popularized CLAs in VLSI.

Burgess, N., and T. Williams [1995]. “Choices of operand truncation in the SRT division algorithm,” IEEE Trans. on Computers 44:7.

Analyzes how many bits of divisor and remainder need to be examined in SRT division.

Burks, A. W., H. H. Goldstine, and J. von Neumann [1946]. “Preliminary discussion of the logical design of an electronic computing instrument,” Report to the U.S. Army Ordnance Department, p. 1; also appears in Papers of John von Neumann, W. Aspray and A. Burks, eds., MIT Press, Cambridge, Mass., and Tomash Publishers, Los Angeles, 1987, 97–146.

Cody, W. J., J. T. Coonen, D. M. Gay, K. Hanson, D. Hough, W. Kahan, R. Karpinski, J. Palmer, F. N. Ris, and D. Stevenson [1984]. “A proposed radix- and word-length-independent standard for floating-point arithmetic,” IEEE Micro 4:4, 86–100.

Contains a draft of the 854 standard, which is more general than 754. The significance of this article is that it contains commentary on the standard, most of which is equally relevant to 754. However, be aware that there are some differences between this draft and the final standard.

Coonen, J. [1984]. “Contributions to a proposed standard for binary floating point arithmetic,” Ph.D. thesis, University of California–Berkeley.

The only detailed discussion of how rounding modes can be used to implement efficient binary-decimal conversion.

Darley, H. M. et al. [1989]. “Floating point/integer processor with divide and square root functions,” U.S. Patent 4,878,190, October 31, 1989.

Pretty readable as patents go. Gives a high-level view of the TI 8847 chip, but doesn’t have all the details of the division algorithm.

Demmel, J. W., and X. Li [1994]. “Faster numerical algorithms via exception handling,” IEEE Trans. on Computers 43:8, 983–992.

A good discussion of how the features unique to IEEE floating point can improve the performance of an important software library.

Freiman, C. V. [1961]. “Statistical analysis of certain binary division algorithms,” Proc. IRE 49:1, 91–103.

Contains an analysis of the performance of the shifting-over-zeros SRT division algorithm.

Goldberg, D. [1991]. “What every computer scientist should know about floating-point arithmetic,” Computing Surveys 23:1, 5–48.

Contains an in-depth tutorial on the IEEE standard from the software point of view.

Goldberg, I. B. [1967]. “27 bits are not enough for 8-digit accuracy,” Comm. ACM 10:2, 105–106.

This paper proposes using hidden bits and gradual underflow.

Gosling, J. B. [1980]. Design of Arithmetic Units for Digital Computers, Springer-Verlag, New York.

A concise, well-written book, although it focuses on MSI designs.

Hamacher, V. C., Z. G. Vranesic, and S. G. Zaky [1984]. Computer Organization, 2nd ed., McGraw-Hill, New York.

Introductory computer architecture book with a good chapter on computer arithmetic.

Hwang, K. [1979]. Computer Arithmetic: Principles, Architecture, and Design, Wiley, New York.

This book contains the widest range of topics of the computer arithmetic books.

IEEE [1985]. “IEEE standard for binary floating-point arithmetic,” SIGPLAN Notices 22:2, 9–25.

IEEE 754 is reprinted here.

Kahan, W. [1968]. “7094-II system support for numerical analysis,” SHARE Secretarial Distribution SSD-159.

This system had many features that were incorporated into the IEEE floating-point standard.

Kahaner, D. K. [1988]. “Benchmarks for ‘real’ programs,” SIAM News (November).

The benchmark presented in this article turns out to cause many underflows.


Knuth, D. [1981]. The Art of Computer Programming, Vol. II, 2nd ed., Addison-Wesley, Reading, Mass.

Has a section on the distribution of floating-point numbers.

Kogge, P. [1981]. The Architecture of Pipelined Computers, McGraw-Hill, New York.

Has a brief discussion of pipelined multipliers.

Kohn, L., and S.-W. Fu [1989]. “A 1,000,000 transistor microprocessor,” IEEE Int’l. Solid-State Circuits Conf. Digest of Technical Papers, 54–55.

There are several articles about the i860, but this one contains the most details about its floating-point algorithms.

Koren, I. [1989]. Computer Arithmetic Algorithms, Prentice Hall, Englewood Cliffs, N.J.

Leighton, F. T. [1992]. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, San Francisco.

This is an excellent book, with emphasis on the complexity analysis of algorithms. Section 1.2.1 has a nice discussion of carry-lookahead addition on a tree.

Magenheimer, D. J., L. Peters, K. W. Pettis, and D. Zuras [1988]. “Integer multiplication and division on the HP Precision architecture,” IEEE Trans. on Computers 37:8, 980–990.

Gives rationale for the integer- and divide-step instructions in the Precision architecture.

Markstein, P. W. [1990]. “Computation of elementary functions on the IBM RISC System/6000 processor,” IBM J. of Research and Development 34:1, 111–119.

Explains how to use fused multiply-add to compute correctly rounded division and square root.

Mead, C., and L. Conway [1980]. Introduction to VLSI Systems, Addison-Wesley, Reading, Mass.

Montoye, R. K., E. Hokenek, and S. L. Runyon [1990]. “Design of the IBM RISC System/6000 floating-point execution,” IBM J. of Research and Development 34:1, 59–70.

Describes one implementation of fused multiply-add.

Ngai, T.-F., and M. J. Irwin [1985]. “Regular, area-time efficient carry-lookahead adders,” Proc. Seventh IEEE Symposium on Computer Arithmetic, 9–15.

Describes a CLA like that of Figure J.17, where the bits flow up and then come back down.

Patterson, D. A., and J. L. Hennessy [2009]. Computer Organization and Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann, San Francisco.

Chapter 3 is a gentler introduction to the first third of this appendix.

Peng, V., S. Samudrala, and M. Gavrielov [1987]. “On the implementation of shifters, multipliers, and dividers in VLSI floating point units,” Proc. Eighth IEEE Symposium on Computer Arithmetic, 95–102.

Highly recommended survey of different techniques actually used in VLSI designs.

Rowen, C., M. Johnson, and P. Ries [1988]. “The MIPS R3010 floating-point coprocessor,” IEEE Micro, 53–62 (June).

Santoro, M. R., G. Bewick, and M. A. Horowitz [1989]. “Rounding algorithms for IEEE multipliers,” Proc. Ninth IEEE Symposium on Computer Arithmetic, 176–183.

A very readable discussion of how to efficiently implement rounding for floating-point multiplication.

Scott, N. R. [1985]. Computer Number Systems and Arithmetic, Prentice Hall, Englewood Cliffs, N.J.

Swartzlander, E., ed. [1990]. Computer Arithmetic, IEEE Computer Society Press, Los Alamitos, Calif.

A collection of historical papers in two volumes.

Takagi, N., H. Yasuura, and S. Yajima [1985]. “High-speed VLSI multiplication algorithm with a redundant binary addition tree,” IEEE Trans. on Computers C-34:9, 789–796.

A discussion of the binary tree signed multiplier that was the basis for the design used in the TI 8847.

Taylor, G. S. [1981]. “Compatible hardware for division and square root,” Proc. Fifth IEEE Symposium on Computer Arithmetic, May 18–19, 1981, Ann Arbor, Mich., 127–134.

Good discussion of a radix-4 SRT division algorithm.

Taylor, G. S. [1985]. “Radix 16 SRT dividers with overlapped quotient selection stages,” Proc. Seventh IEEE Symposium on Computer Arithmetic, June 4–6, 1985, Urbana, Ill., 64–71.

Describes a very sophisticated high-radix division algorithm.

Weste, N., and K. Eshraghian [1993]. Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed., Addison-Wesley, Reading, Mass.

This textbook has a section on the layouts of various kinds of adders.

Williams, T. E., M. Horowitz, R. L. Alverson, and T. S. Yang [1987]. “A self-timed chip for division,” Advanced Research in VLSI, Proc. 1987 Stanford Conf., MIT Press, Cambridge, Mass.

Describes a divider that tries to get the speed of a combinational design without using the area that would be required by one.

Exercises

J.1 [12] <J.2> Using n bits, what is the largest and smallest integer that can be represented in the two’s complement system?

J.2 [20/25] <J.2> In the subsection “Signed Numbers” (page J-7), it was stated that two’s complement overflows when the carry into the high-order bit position is different from the carry-out from that position.

a. [20] <J.2> Give examples of pairs of integers for all four combinations of carry-in and carry-out. Verify the rule stated above.

b. [25] <J.2> Explain why the rule is always true.

J.3 [12] <J.2> Using 4-bit binary numbers, multiply −8 × −8 using Booth recoding.

J.4 [15] <J.2> Equations J.2.1 and J.2.2 are for adding two n-bit numbers. Derive similar equations for subtraction, where there will be a borrow instead of a carry.

J.5 [25] <J.2> On a machine that doesn’t detect integer overflow in hardware, show how you would detect overflow on a signed addition operation in software.

J.6 [15/15/20] <J.3> Represent the following numbers as single-precision and double-precision IEEE floating-point numbers:

a. [15] <J.3> 10.

b. [15] <J.3> 10.5.

c. [20] <J.3> 0.1.

J.7 [12/12/12/12/12] <J.3> Below is a list of floating-point numbers. In single precision, write down each number in binary, in decimal, and give its representation in IEEE arithmetic.

a. [12] <J.3> The largest number less than 1.

b. [12] <J.3> The largest number.

c. [12] <J.3> The smallest positive normalized number.

d. [12] <J.3> The largest denormal number.

e. [12] <J.3> The smallest positive number.

J.8 [15] <J.3> Is the ordering of nonnegative floating-point numbers the same as integers when denormalized numbers are also considered?

J.9 [20] <J.3> Write a program that prints out the bit patterns used to represent floating-point numbers on your favorite computer. What bit pattern is used for NaN?

J.10 [15] <J.4> Using p = 4, show how the binary floating-point multiply algorithm computes the product of 1.875 × 1.875.

J.11 [12/10] <J.4> Concerning the addition of exponents in floating-point multiply:

a. [12] <J.4> What would the hardware that implements the addition of exponents look like?

b. [10] <J.4> If the bias in single precision were 129 instead of 127, would addition be harder or easier to implement?

J.12 [15/12] <J.4> In the discussion of overflow detection for floating-point multiplication, it was stated that (for single precision) you can detect an overflowed exponent by performing exponent addition in a 9-bit adder.

a. [15] <J.4> Give the exact rule for detecting overflow.

b. [12] <J.4> Would overflow detection be any easier if you used a 10-bit adder instead?

J.13 [15/10] <J.4> Floating-point multiplication:

a. [15] <J.4> Construct two single-precision floating-point numbers whose product doesn’t overflow until the final rounding step.

b. [10] <J.4> Is there any rounding mode where this phenomenon cannot occur?

J.14 [15] <J.4> Give an example of a product with a denormal operand but a normalized output. How large was the final shifting step? What is the maximum possible shift that can occur when the inputs are double-precision numbers?

J.15 [15] <J.5> Use the floating-point addition algorithm on page J-23 to compute 1.010₂ − .1001₂ (in 4-bit precision).

J.16 [10/15/20/20/20] <J.5> In certain situations, you can be sure that a + b is exactly representable as a floating-point number, that is, no rounding is necessary.

a. [10] <J.5> If a, b have the same exponent and different signs, explain why a + b is exact. This was used in the subsection “Speeding Up Addition” on page J-25.

b. [15] <J.5> Give an example where the exponents differ by 1, a and b have different signs, and a + b is not exact.

c. [20] <J.5> If a ≥ b ≥ 0, and the top two bits of a cancel when computing a − b, explain why the result is exact (this fact is mentioned on page J-22).

d. [20] <J.5> If a ≥ b ≥ 0, and the exponents differ by 1, show that a − b is exact
