Although integers provide an exact representation for numeric values, they suffer from two major drawbacks: the inability to represent fractional values and a limited dynamic range. Floating point arithmetic solves these two problems at the expense of accuracy and, on some processors, speed. Most programmers are aware of the speed loss associated with floating point arithmetic; however, they are blithely unaware of the problems with accuracy.
For many applications, the benefits of floating point outweigh the disadvantages.
However, to properly use floating point arithmetic in any program, you must learn how floating point arithmetic operates. Intel, understanding the importance of floating point arithmetic in modern programs, provided support for floating point arithmetic in the earliest designs of the 8086: the 80x87 FPU (floating point unit or math coprocessor). However, on processors earlier than the 80486 (or on the 80486sx), the floating point processor is an optional device; if this device is not present you must simulate it in software.
This chapter contains four main sections. The first section discusses floating point arithmetic from a mathematical point of view. The second section discusses the binary floating point formats commonly used on Intel processors. The third discusses software floating point and the math routines from the UCR Standard Library. The fourth section discusses the 80x87 FPU chips.
14.1 The Mathematics of Floating Point Arithmetic
A big problem with floating point arithmetic is that it does not follow the standard rules of algebra. Nevertheless, many programmers apply normal algebraic rules when using floating point arithmetic. This is a source of bugs in many programs. One of the primary goals of this section is to describe the limitations of floating point arithmetic so you will understand how to use it properly.
Normal algebraic rules apply only to infinite precision arithmetic. Consider the simple statement x:=x+1, where x is an integer. On any modern computer this statement follows the normal rules of algebra as long as overflow does not occur. That is, this statement is valid only for certain values of x (minint <= x < maxint). Most programmers do not have a problem with this because they are well aware of the fact that integers in a program do not follow the standard algebraic rules (e.g., 5/2 ≠ 2.5).
Integers do not follow the standard rules of algebra because the computer represents them with a finite number of bits. You cannot represent any of the (integer) values above the maximum integer or below the minimum integer. Floating point values suffer from this same problem, only worse. After all, the integers are a subset of the real numbers.
Therefore, the floating point values must represent the same infinite set of integers. However, there are an infinite number of values between any two real values, so this problem is infinitely worse. Therefore, as well as having to limit your values to the range between a maximum and a minimum, you cannot represent all the values within that range, either.
To represent real numbers, most floating point formats employ scientific notation and use some number of bits to represent a mantissa and a smaller number of bits to represent an exponent. The end result is that floating point numbers can only represent numbers with a specific number of significant digits. This has a big impact on how floating point arithmetic operates. To easily see the impact of limited precision arithmetic, we will adopt a simplified decimal floating point format for our examples. Our floating point format will provide a mantissa with three significant digits and a decimal exponent with two digits. The mantissa and exponents are both signed values (see Figure 14.1).
When adding and subtracting two numbers in scientific notation, you must adjust the two values so that their exponents are the same. For example, when adding 1.23e1 and 4.56e0, you must adjust the values so they have the same exponent. One way to do this is to convert 4.56e0 to 0.456e1 and then add. This produces 1.686e1. Unfortunately, the result does not fit into three significant digits, so we must either round or truncate the result to three significant digits. Rounding generally produces the most accurate result, so let’s round the result to obtain 1.69e1. As you can see, the lack of precision (the number of digits or bits we maintain in a computation) affects the accuracy (the correctness of the computation).
In the previous example, we were able to round the result because we maintained four significant digits during the calculation. Had our floating point calculation been limited to three significant digits during computation, we would have had to truncate the last digit of the smaller number (0.456e1 becomes 0.45e1), obtaining 1.68e1, which is even less correct. Extra digits available during a computation are known as guard digits (or guard bits in the case of a binary format). They greatly enhance accuracy during a long chain of computations.
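The effect of keeping or discarding the extra digit can be sketched with Python's decimal module, which lets a context fix the number of significant digits. This is only an illustration of the three-digit decimal format, not part of any library discussed in this chapter; the variable names are hypothetical, and rounding the exact sum downward is used as a stand-in for truncating without a guard digit (for these operands it gives the same result).

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

a = Decimal("1.23e1")
b = Decimal("4.56e0")

# Three significant digits, rounding the exact sum (as if a guard digit were kept).
rounded = Context(prec=3, rounding=ROUND_HALF_UP).add(a, b)

# Three significant digits, discarding the extra digit instead of rounding.
truncated = Context(prec=3, rounding=ROUND_DOWN).add(a, b)

print(rounded)    # 16.9
print(truncated)  # 16.8
```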
The accuracy loss during a single computation usually isn’t enough to worry about unless you are greatly concerned about the accuracy of your computations. However, if you compute a value which is the result of a sequence of floating point operations, the error can accumulate and greatly affect the computation itself. For example, suppose we were to add 1.00e0 to 1.23e3. Adjusting the numbers so their exponents are the same before the addition produces 1.23e3 + 0.001e3. The sum of these two values, even after rounding, is 1.23e3. This might seem perfectly reasonable to you; after all, we can only maintain three significant digits, so adding in a small value shouldn’t affect the result at all.
However, suppose we were to add 1.00e0 to 1.23e3 ten times. The first time we add 1.00e0 to 1.23e3 we get 1.23e3. Likewise, we get this same result the second, third, fourth, ..., and tenth time we add 1.00e0 to 1.23e3. On the other hand, had we added 1.00e0 to itself ten times, then added the result (1.00e1) to 1.23e3, we would have gotten a different result, 1.24e3. This is the most important thing to know about limited precision arithmetic:
Figure 14.1 Simple Floating Point Format (a signed three-digit mantissa, the letter e, and a signed two-digit decimal exponent)
The order of evaluation can affect the accuracy of the result.
You will get more accurate results if the relative magnitudes (that is, the exponents) are close to one another. If you are performing a chain calculation involving addition and subtraction, you should attempt to group the values appropriately.
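The ten-additions example can be reproduced in a few lines. The sketch below assumes Python's decimal module with a three-digit context as a stand-in for the simplified format; the variable names are hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)  # three significant digits
big = Decimal("1.23e3")
one = Decimal("1.00e0")

# Add 1.00e0 to the large value ten times; each small addend is rounded away.
one_at_a_time = big
for _ in range(10):
    one_at_a_time = ctx.add(one_at_a_time, one)

# Sum the ten small values first, then add the total to the large value.
small_sum = Decimal(0)
for _ in range(10):
    small_sum = ctx.add(small_sum, one)
grouped = ctx.add(big, small_sum)

print(one_at_a_time)  # 1.23E+3
print(grouped)        # 1.24E+3
```

Grouping the small values together keeps the exponents of each pair of operands close, which is exactly the arrangement recommended above.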
Another problem with addition and subtraction is that you can wind up with false precision. Consider the computation 1.23e0 - 1.22e0. This produces 0.01e0. Although this is mathematically equivalent to 1.00e-2, this latter form suggests that the last two digits are exactly zero. Unfortunately, we’ve only got a single significant digit at this time. Indeed, some FPUs or floating point software packages might actually insert random digits (or bits) into the L.O. positions. This brings up a second important rule concerning limited precision arithmetic:
Whenever subtracting two numbers with the same signs or adding two numbers with different signs, the accuracy of the result may be less than the precision available in the floating point format.
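To see how much can be lost, the following sketch again uses a three-digit decimal context. The two "true" inputs are hypothetical values chosen only so that they round to 1.23 and 1.22; they do not come from the text.

```python
from decimal import Context, Decimal

ctx = Context(prec=3)  # three significant digits

# Hypothetical "true" values, chosen only for illustration.
true_a = Decimal("1.234")
true_b = Decimal("1.216")

# Storing them in the three-digit format rounds each one.
a = ctx.create_decimal(true_a)   # 1.23
b = ctx.create_decimal(true_b)   # 1.22

computed = ctx.subtract(a, b)    # only one meaningful digit survives
exact = true_a - true_b          # the default context keeps all digits here

print(computed)  # 0.01
print(exact)     # 0.018
```

Each input was accurate to three digits, yet the computed difference is off by almost half of the exact difference.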
Multiplication and division do not suffer from the same problems as addition and subtraction since you do not have to adjust the exponents before the operation; all you need to do is add the exponents and multiply the mantissas (or subtract the exponents and divide the mantissas). By themselves, multiplication and division do not produce particularly poor results. However, they tend to multiply any error which already exists in a value. For example, if you multiply 1.23e0 by two, when you should be multiplying 1.24e0 by two, the result is even less accurate. This brings up a third important rule when working with limited precision arithmetic:
When performing a chain of calculations involving addition, subtraction, multiplication, and division, try to perform the multiplication and division operations first.
Often, by applying normal algebraic transformations, you can arrange a calculation so the multiply and divide operations occur first. For example, suppose you want to compute x*(y+z). Normally you would add y and z together and multiply their sum by x.
However, you will get a little more accuracy if you transform x*(y+z) to get x*y+x*z and compute the result by performing the multiplications first.
Multiplication and division are not without their own problems. When multiplying two very large or very small numbers, it is quite possible for overflow or underflow to occur. The same situation occurs when dividing a small number by a large number or dividing a large number by a small number. This brings up a fourth rule you should attempt to follow when multiplying or dividing values:
When multiplying and dividing sets of numbers, try to arrange the multiplications so that they multiply large and small numbers together; likewise, try to divide numbers that have the same relative magnitudes.
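A minimal sketch of this rule, assuming Python floats (IEEE double precision on common platforms) and three hypothetical operands picked so that one grouping overflows:

```python
import math

# Hypothetical operands: two very large values and one very small value.
large_a = 1e200
large_b = 1e200
small = 1e-200

# Multiplying the two large values first overflows to infinity,
# and the later multiplication cannot recover the lost result.
overflowed = (large_a * large_b) * small

# Pairing a large value with the small one keeps every
# intermediate product within range.
in_range = (large_a * small) * large_b

print(math.isinf(overflowed))          # True
print(math.isclose(in_range, 1e200))   # True
```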
Comparing floating point numbers is very dangerous. Given the inaccuracies present in any computation (including converting an input string to a floating point value), you should never compare two floating point values to see if they are equal. In a binary floating point format, different computations which produce the same (mathematical) result may differ in their least significant bits. For example, adding 1.31e0+1.69e0 should produce 3.00e0. Likewise, adding 2.50e0+1.50e0 should produce 3.00e0. However, were you to compare (1.31e0+1.69e0) against (2.50e0+1.50e0) you might find out that these sums are not equal to one another. The test for equality succeeds if and only if all bits (or digits) in the two operands are exactly the same. Since this is not necessarily true after two different floating point computations which should produce the same result, a straight test for equality may not work.
The standard way to test for equality between floating point numbers is to determine how much error (or tolerance) you will allow in a comparison and check to see if one value is within this error range of the other. The straightforward way to do this is to use a test like the following:
if Value1 >= (Value2-error) and Value1 <= (Value2+error) then …
Another common way to handle this same comparison is to use a statement of the form:
if abs(Value1-Value2) <= error then …
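The difference between a straight equality test and a tolerance test is easy to demonstrate with binary floating point. The following sketch assumes Python floats; the tolerance value is hypothetical and chosen only for this example.

```python
# 0.1, 0.2, and 0.3 have no exact binary representation, so the two
# sides below differ in their least significant bits.
total = 0.1 + 0.2
print(total == 0.3)                  # False

# Hypothetical tolerance, picked only for this example.
error = 1e-9
print(abs(total - 0.3) <= error)     # True
```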
Most texts, when discussing floating point comparisons, stop immediately after discussing the problem with floating point equality, assuming that other forms of comparison are perfectly okay with floating point numbers. This isn’t true! If we are assuming that x=y if x is within y±error, then a simple bitwise comparison of x and y will claim that x<y if y is greater than x but less than x+error. However, in such a case x should really be treated as equal to y, not less than y. Therefore, we must always compare two floating point numbers using ranges, regardless of the actual comparison we want to perform. Trying to compare two floating point numbers directly can lead to an error. To compare two floating point numbers, x and y, against one another, you should use one of the following forms:
=    if abs(x-y) <= error then …
≠    if abs(x-y) > error then …
<    if (x-y) < -error then …
≤    if (x-y) <= error then …
>    if (x-y) > error then …
≥    if (x-y) >= -error then …
You must exercise care when choosing the value for error. This should be a value slightly greater than the largest amount of error which will creep into your computations. The exact value will depend upon the particular floating point format you use, but more on that a little later. The final rule we will state in this section is:
When comparing two floating point numbers, always compare one value to see if it is in the range given by the second value plus or minus some small error value.
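One way to package range-based comparisons is a small set of helper functions. The sketch below is an assumption about how such helpers might look, not code from any library in this chapter: the function names are hypothetical, and it treats x as less than y only when x is below y by more than the tolerance, so that values within the tolerance count as equal.

```python
def fp_eq(x, y, error):
    return abs(x - y) <= error

def fp_ne(x, y, error):
    return abs(x - y) > error

def fp_lt(x, y, error):
    # x is below y by more than the tolerance
    return (x - y) < -error

def fp_le(x, y, error):
    return (x - y) <= error

def fp_gt(x, y, error):
    # x is above y by more than the tolerance
    return (x - y) > error

def fp_ge(x, y, error):
    return (x - y) >= -error

error = 1e-9            # hypothetical tolerance
x = 0.1 + 0.2           # slightly above 0.3 in binary floating point
y = 0.3

print(x > y)                  # True: the raw comparison sees a difference
print(fp_gt(x, y, error))     # False: within tolerance, so not greater
print(fp_eq(x, y, error))     # True
```

With these definitions each pair of opposite tests (equal and not equal, less than and greater than or equal, greater than and less than or equal) always gives opposite answers.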
There are many other little problems that can occur when using floating point values. This text can only point out some of the major problems and make you aware of the fact that you cannot treat floating point arithmetic like real arithmetic; the inaccuracies present in limited precision arithmetic can get you into trouble if you are not careful. A good text on numerical analysis or even scientific computing can help fill in the details which are beyond the scope of this text. If you are going to be working with floating point arithmetic, in any language, you should take the time to study the effects of limited precision arithmetic on your computations.
14.2 IEEE Floating Point Formats
When Intel planned to introduce a floating point coprocessor for their new 8086 microprocessor, they were smart enough to realize that the electrical engineers and solid-state physicists who design chips were, perhaps, not the best people to do the necessary numerical analysis to pick the best possible binary representation for a floating point format. So Intel went out and hired the best numerical analyst they could find to design a floating point format for their 8087 FPU. That person then hired two other experts in the field and the three of them (Kahan, Coonen, and Stone) designed Intel’s floating point format. They did such a good job designing the KCS Floating Point Standard that the IEEE organization adopted this format for the IEEE floating point format[1].
To handle a wide range of performance and accuracy requirements, Intel actually introduced three floating point formats: single precision, double precision, and extended precision. The single and double precision formats correspond to C’s float and double types or FORTRAN’s real and double precision types. Intel intended to use extended precision for long chains of computations. Extended precision contains 16 extra bits that the calculations could use for guard bits before rounding down to a double precision value when storing the result.
1. There were some minor changes to the way certain degenerate operations were handled, but the bit representation remained essentially unchanged.
The single precision format uses a 24 bit mantissa in sign-magnitude form (loosely called one’s complement in this text) and an eight bit excess-127 exponent. The mantissa usually represents a value between 1.0 and just under 2.0. The H.O. bit of the mantissa is always assumed to be one and represents a value just to the left of the binary point[2]. The remaining 23 mantissa bits appear to the right of the binary point. Therefore, the mantissa represents the value:
1.mmmmmmm mmmmmmmm mmmmmmmm
The “mmmm…” characters represent the 23 bits of the mantissa. Keep in mind that we are working with binary numbers here. Therefore, each position to the right of the binary point represents a value (zero or one) times a successive negative power of two. The implied one bit is always multiplied by 2^0, which is one. This is why the mantissa is always greater than or equal to one. Even if the other mantissa bits are all zero, the implied one bit always gives us the value one[3]. Of course, even if we had an almost infinite number of one bits after the binary point, they still would not add up to two. This is why the mantissa can represent values in the range one to just under two.
Although there are an infinite number of values between one and two, we can only represent about eight million of them because we use a 23 bit mantissa (the 24th bit is always one). This is the reason for inaccuracy in floating point arithmetic: we are limited to 23 bits of precision in computations involving single precision floating point values.
The mantissa uses a sign-magnitude format (loosely called one’s complement in this text) rather than two’s complement. This means that the 24 bit value of the mantissa is simply an unsigned binary number and the sign bit determines whether that value is positive or negative. Sign-magnitude numbers have the unusual property that there are two representations for zero (with the sign bit set or clear). Generally, this is important only to the person designing the floating point software or hardware system. We will assume that the value zero always has the sign bit clear.
To represent values outside the range 1.0 to just under 2.0, the exponent portion of the floating point format comes into play. The floating point format raises two to the power specified by the exponent and then multiplies the mantissa by this value. The exponent is eight bits and is stored in an excess-127 format. In excess-127 format, the exponent 2^0 is represented by the value 127 (7fh). Therefore, to convert an exponent to excess-127 format simply add 127 to the exponent value. The use of excess-127 format makes it easier to compare floating point values. The single precision floating point format takes the form shown in Figure 14.2.
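The sign, exponent, and mantissa fields can be pulled apart and recombined in a few lines. The sketch below uses Python's standard struct module to obtain the 32 bit pattern; the function names decode_single and rebuild are hypothetical, and rebuild handles only normalized values (it ignores zero, denormalized values, and other special encodings).

```python
import struct

def decode_single(value):
    """Split a value, stored as a 32-bit single, into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                  # bit 31
    biased_exp = (bits >> 23) & 0xFF   # bits 30..23, excess-127
    fraction = bits & 0x7FFFFF         # bits 22..0, the 23 stored mantissa bits
    return sign, biased_exp, fraction

def rebuild(sign, biased_exp, fraction):
    """Recompute the value of a normalized single from its fields."""
    mantissa = 1 + fraction / 2**23    # restore the implied H.O. one bit
    magnitude = mantissa * 2.0 ** (biased_exp - 127)
    return -magnitude if sign else magnitude

print(decode_single(1.0))              # (0, 127, 0)
print(decode_single(-6.5))             # (1, 129, 5242880)
print(rebuild(1, 129, 5242880))        # -6.5
```

The value 1.0 has an exponent of zero, so its biased exponent is 127, and -6.5 is 1.625 times two squared, so its biased exponent is 129.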
With a 24 bit mantissa, you will get approximately 6-1/2 digits of precision (one half digit of precision means that the first six digits can all be in the range 0..9 but the seventh digit can only be in the range 0..x where x<9 and is generally close to five). With an eight bit excess-127 exponent, the dynamic range of single precision floating point numbers is approximately 2^±128 or about 10^±38.
Figure 14.2 32 Bit Single Precision Floating Point Format (bit 31: sign bit; bits 30..23: exponent bits; bits 22..0: mantissa bits. The 24th mantissa bit is implied and is always one.)
2. The binary point is the same thing as the decimal point except it appears in binary numbers rather than decimal numbers.
3. Actually, this isn’t necessarily true. The IEEE floating point format supports denormalized values where the H.O. bit is not one. However, we will ignore denormalized values in our discussion.
Although single precision floating point numbers are perfectly suitable for many applications, the dynamic range is somewhat small for many scientific applications and the very limited precision is unsuitable for many financial, scientific, and other applications. Furthermore, in long chains of computations, the limited precision of the single precision format may introduce serious error.
The double precision format helps overcome the problems of single precision floating point. Using twice the space, the double precision format has an 11-bit excess-1023 exponent and a 53 bit mantissa (with an implied H.O. bit of one) plus a sign bit. This provides a dynamic range of about 10^±308 and 14-1/2 digits of precision, sufficient for most applications. Double precision floating point values take the form shown in Figure 14.3.
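On most current platforms a Python float is stored in this double precision format, so the parameters above can be checked directly. This is a quick sketch that assumes such a platform; sys.float_info is part of the Python standard library.

```python
import sys

info = sys.float_info
print(info.mant_dig)   # 53 mantissa bits, counting the implied bit
print(info.max_exp)    # 1024: the largest value is just under 2**1024
print(info.max)        # about 1.8e308
```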
In order to help ensure accuracy during long chains of computations involving double precision floating point numbers, Intel designed the extended precision format. The extended precision format uses 80 bits. Twelve of the additional 16 bits are appended to the mantissa; four of the additional bits are appended to the end of the exponent. Unlike the single and double precision values, the extended precision format does not have an implied H.O. bit which is always one. Therefore, the extended precision format provides a 64 bit mantissa, a 15 bit excess-16383 exponent, and a one bit sign. The format for the extended precision floating point value is shown in Figure 14.4.
On the 80x87 FPUs and the 80486 CPU, all computations are done using the extended precision form. Whenever you load a single or double precision value, the FPU automatically converts it to an extended precision value. Likewise, when you store a single or double precision value to memory, the FPU automatically rounds the value down to the appropriate size before storing it. By always working with the extended precision format, Intel guarantees a large number of guard bits are present to ensure the accuracy of your computations. Some texts erroneously claim that you should never use the extended precision format in your own programs, because Intel only guarantees accurate computations when using the single or double precision formats. This is foolish. By performing all computations using 80 bits, Intel helps ensure (but does not guarantee) that you will get full 32 or 64 bit accuracy in your computations. Since the 80x87 FPUs and 80486 CPU do not provide a large number of guard bits in 80 bit computations, some error will inevitably creep into the L.O. bits of an extended precision computation. However, if your computation is correct to 64 bits, the 80 bit computation will always provide at least 64 accurate bits. Most of the time you will get even more. While you cannot assume that you get an accurate 80 bit computation, you can usually do better than 64 when using the extended precision format.
Figure 14.3 64 Bit Double Precision Floating Point Format (bit 63: sign bit; bits 62..52: exponent bits; bits 51..0: mantissa bits. The 53rd mantissa bit is implied and is always one.)
Figure 14.4 80 Bit Extended Precision Floating Point Format (bit 79: sign bit; bits 78..64: exponent bits; bits 63..0: mantissa bits.)
To maintain maximum precision during computation, most computations use normalized values. A normalized floating point value is one that has a H.O. mantissa bit equal to one. Almost any non-normalized value can be normalized by shifting the mantissa bits to the left and decrementing the exponent by one until a one appears in the H.O. bit of the mantissa. Remember, the exponent is a binary exponent. Each time you increment the exponent, you multiply the floating point value by two. Likewise, whenever you decrement the exponent, you divide the floating point value by two. By the same token, shifting the mantissa to the left one bit position multiplies the floating point value by two; likewise, shifting the mantissa to the right divides the floating point value by two. Therefore, shifting the mantissa to the left one position and decrementing the exponent does not change the value of the floating point number at all.
Keeping floating point numbers normalized is beneficial because it maintains the maximum number of bits of precision for a computation. If the H.O. bits of the mantissa are all zero, the mantissa has that many fewer bits of precision available for computation.
Therefore, a floating point computation will be more accurate if it involves only normalized values.
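The shift-and-decrement procedure can be sketched as follows. This is a simplified model, not the FPU's actual algorithm: the function name normalize is hypothetical, the mantissa is held as a plain unsigned integer of a given width, and no lower limit is placed on the exponent.

```python
def normalize(mantissa, exponent, width=24):
    """Shift a nonzero mantissa left until its H.O. bit is one.

    The value represented, mantissa * 2**exponent, is unchanged.
    Assumes 0 <= mantissa < 2**width.
    """
    if mantissa == 0:
        return mantissa, exponent      # zero has no one bits to shift up
    high_bit = 1 << (width - 1)
    while (mantissa & high_bit) == 0:
        mantissa <<= 1                 # multiplies the value by two
        exponent -= 1                  # divides the value by two
    return mantissa, exponent

m, e = normalize(0b00000101, 0, width=8)
print(bin(m), e)                       # 0b10100000 -5
print(m * 2.0**e)                      # 5.0
```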
There are two important cases where a floating point number cannot be normalized.
The value 0.0 is a special case. Obviously it cannot be normalized because the floating point representation for zero has no one bits in the mantissa. This, however, is not a problem since we can exactly represent the value zero with only a single bit.
The second case is when we have some H.O. bits in the mantissa which are zero but the biased exponent is also zero (and we cannot decrement it to normalize the mantissa).
Rather than disallow certain small values, whose H.O. mantissa bits and biased exponent are zero (the most negative exponent possible), the IEEE standard allows special denormalized values to represent these smaller values[4]. Although the use of denormalized values allows IEEE floating point computations to produce better results than if underflow occurred, keep in mind that denormalized values offer fewer bits of precision and are inherently less accurate.
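Denormalized values are easy to observe with double precision numbers. The sketch below assumes Python floats are IEEE doubles and that denormalized results are not flushed to zero, which is the usual behavior on current desktop platforms.

```python
import sys

smallest_normal = sys.float_info.min     # smallest normalized double
below_normal = smallest_normal / 2       # still nonzero: a denormalized value
smallest_denormal = 5e-324               # only one significant bit remains

print(below_normal > 0.0)                # True
print(smallest_denormal / 2 == 0.0)      # True: nothing smaller is representable
```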
Since the 80x87 FPUs and 80486 CPU always convert single and double precision values to extended precision, extended precision arithmetic is actually faster than single or double precision. Therefore, the expected performance benefit of using the smaller formats is not present on these chips. However, when designing the Pentium/586 CPU, Intel redesigned the built-in floating point unit to better compete with RISC chips. Most RISC chips support a native 64 bit double precision format which is faster than Intel’s extended precision format, so Intel provided native 64 bit operations on the Pentium. As a result, the double precision format is the fastest on the Pentium and later chips.
14.3 The UCR Standard Library Floating Point Routines
In most assembly language texts that bother to cover floating point arithmetic, this section would normally describe how to design your own floating point routines for addition, subtraction, multiplication, and division. This text will not do that for several reasons. First, to design a good floating point library requires a solid background in numerical analysis, a prerequisite this text does not assume of its readers. Second, the UCR Standard Library already provides a reasonable set of floating point routines in source code form; why waste space in this text when the sources are readily available elsewhere? Third, floating point units are quickly becoming standard equipment on all modern CPUs or motherboards; it makes no more sense to describe how to manually perform a floating point computation than it does to describe how to manually perform an integer computation. Therefore, this section will describe how to use the UCR Standard Library routines if you do not have an FPU available; a later section will describe the use of the floating point unit.
4. The alternative would be to underflow the values to zero.
The UCR Standard Library provides a large number of routines to support floating point computation and I/O. This library uses the same memory format for 32, 64, and 80 bit floating point numbers as the 80x87 FPUs. The UCR Standard Library’s floating point routines do not exactly follow the IEEE requirements with respect to error conditions and other degenerate cases, and they may produce slightly different results than an 80x87 FPU, but the results will be very close[5]. Since the UCR Standard Library uses the same memory format for 32, 64, and 80 bit numbers as the 80x87 FPUs, you can freely mix computations involving floating point between the FPU and the Standard Library routines.
The UCR Standard Library provides numerous routines to manipulate floating point numbers. The following sections describe each of these routines, by category.
14.3.1 Load and Store Routines
Since 80x86 CPUs without an FPU do not provide any 80-bit registers, the UCR Standard Library must use memory-based variables to hold floating point values during computation. The UCR Standard Library routines use two pseudo registers, an accumulator register and an operand register, when performing floating point operations. For example, the floating point addition routine adds the value in the floating point operand register to the floating point accumulator register, leaving the result in the accumulator. The load and store routines allow you to load floating point values into the floating point accumulator and operand registers as well as store the value of the floating point accumulator back to memory. The routines in this category include accop, xaccop, lsfpa, ssfpa, ldfpa, sdfpa, lefpa, sefpa, lefpal, lsfpo, ldfpo, lefpo, and lefpol.
The accop routine copies the value in the floating point accumulator to the floating point operand register. This routine is useful when you want to use the result of one computation as the second operand of a later computation.
The xaccop routine exchanges the values in the floating point accumulator and operand registers. Note that many floating point computations destroy the value in the floating point operand register, so you cannot blindly assume that the routines preserve the operand register. Therefore, calling this routine only makes sense after performing some computation which you know does not affect the floating point operand register.
Lsfpa, ldfpa, and lefpa load the floating point accumulator with a single, double, or extended precision floating point value, respectively. The UCR Standard Library uses its own internal format for computations. These routines convert the specified values to the internal format during the load. On entry to each of these routines, es:di must contain the address of the variable you want to load into the floating point accumulator. The following code demonstrates how to call these routines:
rVar            real4   1.0
drVar           real8   2.0
xrVar           real10  3.0
                 . . .
                lesi    rVar            ;Point es:di at the single precision variable.
                lsfpa                   ;Load it into the accumulator.
                 . . .
                lesi    drVar           ;Point es:di at the double precision variable.
                ldfpa
                 . . .
                lesi    xrVar           ;Point es:di at the extended precision variable.
                lefpa
5. Note, by the way, that different floating point chips, especially across different CPU lines, but even within the Intel family, produce slightly different results. So the fact that the UCR Standard Library does not produce the exact same results as a particular FPU is not that important.
The lsfpo, ldfpo, and lefpo routines are similar to the lsfpa, ldfpa, and lefpa routines except, of course, they load the floating point operand register rather than the floating point accumulator with the value at address es:di.
Lefpal and lefpol load the floating point accumulator or operand register with a literal 80 bit floating point constant appearing in the code stream. To use these two routines, simply follow the call with a real10 directive and the appropriate constant, e.g.,
                lefpal
                real10  1.0
                lefpol
                real10  2.0e5
The ssfpa, sdfpa, and sefpa routines store the value in the floating point accumulator into the memory based floating point variable whose address appears in es:di. There are no corresponding ssfpo, sdfpo, or sefpo routines because a result you would want to store should never appear in the floating point operand register. If you happen to get a value in the floating point operand that you want to store into memory, simply use the xaccop routine to swap the accumulator and operand registers, then use the store accumulator routines to save the result. The following code demonstrates the use of these routines:
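rVar            real4   1.0
drVar           real8   2.0
xrVar           real10  3.0
                 . . .
                lesi    rVar            ;Point es:di at the destination variable.
                ssfpa                   ;Store the accumulator as single precision.
                 . . .
                lesi    drVar
                sdfpa                   ;Store the accumulator as double precision.
                 . . .
                lesi    xrVar
                sefpa                   ;Store the accumulator as extended precision.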
14.3.2 Integer/Floating Point Conversion
The UCR Standard Library includes several routines to convert between binary integers and floating point values. These routines are itof, utof, ltof, ultof, ftoi, ftou, ftol, and ftoul. The first four routines convert signed and unsigned integers to floating point format; the last four routines truncate floating point values and convert them to an integer value.
Itof converts the signed 16-bit value in ax to a floating point value and leaves the result in the floating point accumulator. This routine does not affect the floating point operand register. Utof converts the unsigned integer in ax in a similar fashion. Ltof and ultof convert the 32 bit signed (ltof) or unsigned (ultof) integer in dx:ax to a floating point value, leaving the value in the floating point accumulator. These routines always succeed.
Ftoi converts the value in the floating point accumulator to a signed integer value, leaving the result in ax. Conversion is by truncation; this routine keeps the integer portion and throws away the fractional part. If an overflow occurs because the resulting integer portion does not fit into 16 bits, ftoi returns the carry flag set. If the conversion occurs without error, ftoi returns the carry flag clear. Ftou works in a similar fashion, except it converts the floating point value to an unsigned integer in ax; it returns the carry set if the floating point value was negative.
Ftol and ftoul convert the value in the floating point accumulator to a 32 bit integer, leaving the result in dx:ax. Ftol works on signed values; ftoul works with unsigned values. As with ftoi and ftou, these routines return the carry flag set if a conversion error occurs.
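The truncate-and-flag behavior described for ftoi and ftou can be modeled in a few lines. The following is only a sketch of the description above, not the library's code: the names ftoi_model and ftou_model are hypothetical, the carry flag is modeled as a boolean, and the result is reported as None on error because the text does not say what the real routines leave in ax in that case.

```python
def ftoi_model(value):
    """Model of a truncating conversion to a signed 16-bit integer.

    Returns (result, carry). carry is True when the integer portion
    does not fit into 16 bits.
    """
    truncated = int(value)             # int() discards the fractional part
    if -32768 <= truncated <= 32767:
        return truncated, False
    return None, True

def ftou_model(value):
    """Same idea for an unsigned 16-bit result; negative input sets carry."""
    truncated = int(value)
    if value >= 0 and truncated <= 65535:
        return truncated, False
    return None, True

print(ftoi_model(3.99))       # (3, False)
print(ftoi_model(-3.99))      # (-3, False)
print(ftoi_model(40000.5))    # (None, True)
print(ftou_model(40000.5))    # (40000, False)
print(ftou_model(-1.0))       # (None, True)
```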
14.3.3 Floating Point Arithmetic
Floating point arithmetic is handled by the fpadd, fpsub, fpcmp, fpmul, and fpdiv routines. Fpadd adds the value in the floating point operand to the floating point accumulator. Fpsub subtracts the value in the floating point operand from the floating point accumulator. Fpmul multiplies the value in the floating point accumulator by the floating point operand. Fpdiv divides the value in the floating point accumulator by the value in the floating point operand register. Fpcmp compares the value in the floating point accumulator against the floating point operand.
The UCR Standard Library arithmetic routines do very little error checking. For example, if arithmetic overflow occurs during addition, subtraction, multiplication, or division, the Standard Library simply sets the result to the largest legal value and returns. This is one of the major deviations from the IEEE floating point standard. Likewise, when underflow occurs the routines simply set the result to zero and return. If you divide any value by zero, the Standard Library routines simply set the result to the largest possible value and return. You may need to modify the standard library routines if you need to check for overflow, underflow, or division by zero in your programs.
The floating point comparison routine (fpcmp) compares the floating point accumulator against the floating point operand and returns -1, 0, or 1 in the ax register if the accumulator is less than, equal to, or greater than the floating point operand, respectively. It also compares ax with zero immediately before returning, so the flags are set and you can use the jg, jge, jl, jle, je, and jne instructions immediately after calling fpcmp. Unlike fpadd, fpsub, fpmul, and fpdiv, fpcmp does not destroy the value in the floating point accumulator or the floating point operand register. Keep in mind the problems associated with comparing floating point numbers!
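The warning about comparisons is easy to demonstrate. The following Python sketch models fpcmp's three-way result (the function name is borrowed from the routine; the body is not the library's code) and shows an exact comparison failing where a comparison within a tolerance succeeds. The tolerance value is an arbitrary choice for the example.

```python
import math

def fpcmp(acc, operand):
    """Model of fpcmp's result: -1, 0, or 1 for less, equal, greater."""
    return (acc > operand) - (acc < operand)

total = 0.1 + 0.2                  # not exactly 0.3 in binary floating point
exact = fpcmp(total, 0.3)          # 1: the sum is slightly greater than 0.3
close = math.isclose(total, 0.3, rel_tol=1e-9)   # True: equal within tolerance

print(exact, close)
```

When equality matters, comparing the difference of two values against a small error bound is usually the safer test.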
14.3.4 Float/Text Conversion and Printff
The UCR Standard Library provides three routines, ftoa, etoa, and atof, that let you convert floating point numbers to ASCII strings and vice versa; it also provides a special version of printf, printff, that includes the ability to print floating point values as well as other data types.
Ftoa converts a floating point number to an ASCII string which is a decimal representation of that floating point number. On entry, the floating point accumulator contains the number you want to convert to a string. The es:di register pair points at a buffer in memory where ftoa will store the string. The al register contains the field width (number of print positions). The ah register contains the number of positions to display to the right of the decimal point. If ftoa cannot display the number using the print format specified by al and ah, it will create a string of “#” characters, ah characters long. Es:di must point at a byte array containing at least al+1 characters and al should contain at least five. The field width and decimal length values in the al and ah registers are similar to the field width and decimal places values that follow a real value in the Pascal write statement (the value:width:decimals form).
Etoa outputs the floating point number in exponential form. As with ftoa, es:di points at the buffer where etoa will store the result. The al register must contain at least eight and is the field width for the number. If al contains less than eight, etoa will output a string of “#” characters. The string that es:di points at must contain at least al+1 characters. This conversion routine is similar to Pascal’s write procedure when writing real values with a single field width specification (the value:width form).
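If you have not used Pascal, the roles of the field width and the decimal position count may be easier to see with Python's format specifications, which take the same two numbers. This is an analogy only: the function names are hypothetical, the number of mantissa digits in the exponential form is an assumed parameter, and the “#” fill that ftoa and etoa produce on error is not modeled.

```python
def ftoa_like(value, width, decimals):
    """Decimal form: 'width' plays the role of al, 'decimals' the role of ah."""
    return f"{value:{width}.{decimals}f}"

def etoa_like(value, width, digits=4):
    """Exponential form: 'width' plays the role of al; 'digits' is an assumption."""
    return f"{value:{width}.{digits}e}"

print(repr(ftoa_like(3.14159, 8, 3)))    # right justified in eight positions
print(repr(etoa_like(31415.9, 12)))      # right justified in twelve positions
```

In both cases the value is right justified and padded with leading spaces to fill the field.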
The Standard Library printff routine provides all the facilities of the standard printf routine plus the ability to handle floating point output. The printff routine includes several format specifications for printing floating point numbers in decimal form or in scientific notation. The specifications are
• %x.yF Prints a 32 bit floating point number in decimal form.
• %x.yGF Prints a 64 bit floating point number in decimal form.
• %x.yLF Prints an 80 bit floating point number in decimal form.
• %zE Prints a 32 bit floating point number using scientific notation.
• %zGE Prints a 64 bit floating point number using scientific notation.
• %zLE Prints an 80 bit floating point value using scientific notation.
In the format strings above, x and z are integer constants that denote the field width of the number to print. The y item is also an integer constant that specifies the number of positions to print after the decimal point. The x.y values are comparable to the values passed to ftoa in al and ah. The z value is comparable to the value etoa expects in the al register.
Other than the addition of these six new formats, the printff routine is identical to the printf routine. If you use the printff routine in your assembly language programs, you should not use the printf routine as well. Printff duplicates all the facilities of printf and using both would only waste memory.
14.4 The 80x87 Floating Point Coprocessors
When the 8086 CPU first appeared in the late 1970’s, semiconductor technology was not to the point where Intel could put floating point instructions directly on the 8086 CPU. Therefore, they devised a scheme whereby they could use a second chip to perform the floating point calculations – the floating point unit (or FPU) [6]. They released their original floating point chip, the 8087, in 1980. This particular FPU worked with the 8086, 8088, 80186, and 80188 CPUs. When Intel introduced the 80286 CPU, they released a redesigned 80287 FPU chip to accompany it. Although the 80287 was compatible with the 80386 CPU, Intel designed a better FPU, the 80387, for use in 80386 systems. The 80486 CPU was the first Intel CPU to include an on-chip floating point unit. Shortly after the release of the 80486, Intel introduced the 80486sx CPU that was an 80486 without the built-in FPU. To get floating point capabilities on this chip, you had to add an 80487 chip, although the 80487 was really nothing more than a full-blown 80486 which took over for the “sx” chip in the system. Intel’s Pentium/586 chips provide a high-performance floating point unit directly on the CPU. There is no floating point coprocessor available for the Pentium chip.
[6] Intel has also referred to this device as the Numeric Data Processor (NDP), Numeric Processor Extension (NPX), and math coprocessor.
Collectively, we will refer to all these chips as the 80x87 FPU. Given the obsolescence of the 8086, 80286, 8087, and 80287 chips, this text will concentrate on the 80387 and later chips. There are some differences between the 80387/80486/Pentium floating point units and the earlier FPUs. If you need to write code that will execute on those earlier machines, you should consult the appropriate Intel documentation for those devices.
14.4.1 FPU Registers
The 80x87 FPUs add 13 registers to the 80386 and later processors: eight floating point data registers, a control register, a status register, a tag register, an instruction pointer, and a data pointer. The data registers are similar to the 80x86’s general purpose register set insofar as all floating point calculations take place in these registers. The control register contains bits that let you decide how the 80x87 handles certain degenerate cases like rounding of inaccurate computations, control precision, and so on. The status register is similar to the 80x86’s flags register; it contains the condition code bits and several other floating point flags that describe the state of the 80x87 chip. The tag register contains several groups of bits that determine the state of the value in each of the eight data registers. The instruction and data pointer registers contain certain state information about the last floating point instruction executed. We will not consider the last three registers in this text; see the Intel documentation for more details.
The FPU Data Registers
The 80x87 FPUs provide eight 80 bit data registers organized as a stack. This is a significant departure from the organization of the general purpose registers on the 80x86 CPU that comprise a standard general-purpose register set. Intel refers to these registers as ST(0), ST(1), …, ST(7). Most assemblers will accept ST as an abbreviation for ST(0).
The biggest difference between the FPU register set and the 80x86 register set is the stack organization. On the 80x86 CPU, the ax register is always the ax register, no matter what happens. On the 80x87, however, the register set is an eight element stack of 80 bit floating point values (see Figure 14.5). ST(0) refers to the item on the top of the stack, ST(1) refers to the next item on the stack, and so on. Many floating point instructions push and pop items on the stack; therefore, ST(1) will refer to the previous contents of ST(0) after you push something onto the stack. It will take some thought and practice to get used to the fact that the registers are changing under you, but this is an easy problem to overcome.
The FPU Control Register
When Intel designed the 80x87 (and, essentially, the IEEE floating point standard), there were no standards in floating point hardware. Different (mainframe and mini) computer manufacturers all had different and incompatible floating point formats. Unfortunately, much application software had been written taking into account the idiosyncrasies of these different floating point formats. Intel wanted to design an FPU that could work with the majority of the software out there (keep in mind, the IBM PC was three to four years away when Intel began designing the 8087; they couldn’t rely on that “mountain” of software available for the PC to make their chip popular). Unfortunately, many of the features found in these older floating point formats were mutually exclusive. For example, in some floating point systems rounding would occur when there was insufficient precision; in others, truncation would occur. Some applications would work with one floating point system but not with the other. Intel wanted as many applications as possible to work with as few changes as possible on their 80x87 FPUs, so they added a special register, the FPU control register, that lets the user choose one of several possible operating modes for the 80x87.
The 80x87 control register contains 16 bits organized as shown in Figure 14.6.
Bit 12 of the control register is only present on the 8087 and 80287 chips. It controls how the 80x87 responds to infinity. The 80387 and later chips always use a form of infinity known as affine closure because this is the only form supported by the IEEE 754/854 standards. As such, we will ignore any further use of this bit and assume that it is always programmed with a one.
Figure 14.5 80x87 Floating Point Register Stack (eight 80 bit registers, st(0) through st(7), with bit positions 79, 64, and 0 marked)
Bits 10 and 11 provide rounding control according to the values in Table 58.
The “00” setting is the default. The 80x87 rounds values above one-half of the least significant bit up. It rounds values below one-half of the least significant bit down. If the value below the least significant bit is exactly one-half the least significant bit, the 80x87 rounds the value towards the value whose least significant bit is zero. For long strings of computations, this provides a reasonable, automatic way to maintain maximum precision.
The round up and round down options are present for those computations where it is important to keep track of the accuracy during a computation. By setting the rounding control to round down and performing the operation, then repeating the operation with the rounding control set to round up, you can determine the minimum and maximum ranges between which the true result will fall.
The truncate option forces all computations to truncate any excess bits during the computation. You will rarely use this option if accuracy is important to you. However, it can be useful when porting older software to the 80x87.
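The four rounding settings are easiest to compare side by side. The sketch below uses Python's decimal module, whose rounding modes correspond to the four behaviors described above. It is an analogy under stated assumptions: the FPU rounds binary mantissas, whereas this model rounds decimal values, and the names MODES and round_to_int are hypothetical. The last two statements illustrate the bracketing technique: computing the same quotient once rounding down and once rounding up.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

# Rounding control field value -> equivalent decimal rounding mode.
MODES = {
    0b00: ROUND_HALF_EVEN,   # to nearest or even
    0b01: ROUND_FLOOR,       # round down (toward minus infinity)
    0b10: ROUND_CEILING,     # round up (toward plus infinity)
    0b11: ROUND_DOWN,        # truncate (toward zero)
}

def round_to_int(value, rc):
    """Round 'value' to an integer under rounding control setting 'rc'."""
    return int(Decimal(value).quantize(Decimal(1), rounding=MODES[rc]))

for v in (2.5, 3.5, -2.5):
    print(v, [round_to_int(v, rc) for rc in (0b00, 0b01, 0b10, 0b11)])

# Bracketing 1/3 to six significant digits: the true result lies in [lo, hi].
lo = Context(prec=6, rounding=ROUND_FLOOR).divide(Decimal(1), Decimal(3))
hi = Context(prec=6, rounding=ROUND_CEILING).divide(Decimal(1), Decimal(3))
print(lo, hi)
```

Notice that 2.5 and 3.5 round to 2 and 4 under the default setting: ties go to the even neighbor, so they do not all drift in one direction.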
Bits eight and nine of the control register control the precision during computation. This capability is provided mainly to allow compatibility with older software as required by the IEEE 754 standard. The precision control bits use the values in Table 59.
Table 58: Rounding Control
Bits 10 & 11 Function
00 To nearest or even
01 Round down
10 Round up
11 Truncate
Figure 14.6 80x87 Control Register
    Bits 0-5    Exception masks: Invalid Operation (bit 0), Denormalized (bit 1), Zero Divide (bit 2), Overflow (bit 3), Underflow (bit 4), Precision (bit 5)
    Bits 8-9    Precision control: 00 - 24 bits, 01 - reserved, 10 - 53 bits, 11 - 64 bits
    Bits 10-11  Rounding control: 00 - To nearest or even, 01 - Round down, 10 - Round up, 11 - Truncate result
    Bit 12      Reserved on 80387 and later FPUs.
For modern applications, the precision control bits should always be set to “11” to obtain 64 bits of precision. This will produce the most accurate results during numerical computation.
Bits zero through five are the exception masks. These are similar to the interrupt enable bit in the 80x86’s flags register. If these bits contain a one, the corresponding condition is ignored by the 80x87 FPU. However, if any bit contains zero, and the corresponding condition occurs, then the FPU immediately generates an interrupt so the program can handle the degenerate condition.
Bit zero corresponds to an invalid operation error. This generally occurs as the result of a programming error. Problems that raise the invalid operation exception include pushing more than eight items onto the stack, attempting to pop an item off an empty stack, taking the square root of a negative number, or loading a non-empty register.
Bit one masks the denormalized interrupt which occurs whenever you try to manipulate denormalized values. Denormalized values generally occur when you load arbitrary extended precision values into the FPU or work with very small numbers just beyond the range of the FPU’s capabilities. Normally, you would probably not enable this exception.
Bit two masks the zero divide exception. If this bit contains zero, the FPU will generate an interrupt if you attempt to divide a nonzero value by zero. If you do not enable the zero division exception, the FPU will produce an appropriately signed infinity whenever you divide a nonzero value by zero.
Bit three masks the overflow exception. The FPU will raise the overflow exception if a calculation overflows or if you attempt to store a value which is too large to fit into a destination operand (e.g., storing a large extended precision value into a single precision variable).
Bit four, if set, masks the underflow exception. Underflow occurs when the result is too small to fit in the destination operand. Like overflow, this exception can occur whenever you store a small extended precision value into a smaller variable (single or double precision) or when the result of a computation is too small for extended precision.
Bit five controls whether the precision exception can occur. A precision exception occurs whenever the FPU produces an imprecise result, generally the result of an internal rounding operation. Although many operations will produce an exact result, many more will not. For example, dividing one by ten will produce an inexact result. Therefore, this bit is usually one since inexact results are very common.
Bits six and thirteen through fifteen in the control register are currently undefined and reserved for future use. Bit seven is the interrupt enable mask, but it is only active on the 8087 FPU; a zero in this bit enables 8087 interrupts and a one disables FPU interrupts.
The 80x87 provides two instructions, FLDCW (load control word) and FSTCW (store control word), that let you load and store the contents of the control register. The single operand to these instructions must be a 16 bit memory location. The FLDCW instruction loads the control register from the specified memory location; FSTCW stores the control register into the specified memory location.
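A program typically changes one field of the control register by storing it with FSTCW, modifying the 16 bit value in memory, and reloading it with FLDCW. The bit manipulation in the middle step can be sketched as follows. This is a Python model of the field layout described above, not FPU code; the function names are hypothetical, and the sample value 037Fh is assumed here as a typical control word with all exceptions masked, 64 bit precision, and round to nearest.

```python
ROUNDING = {0b00: "to nearest or even", 0b01: "round down",
            0b10: "round up", 0b11: "truncate"}
PRECISION = {0b00: "24 bits", 0b01: "reserved",
             0b10: "53 bits", 0b11: "64 bits"}
MASK_NAMES = ["invalid operation", "denormalized", "zero divide",
              "overflow", "underflow", "precision"]      # bits 0 through 5

def decode_control_word(cw):
    """Split a 16 bit control word into the fields described in the text."""
    return {
        "masked": [name for bit, name in enumerate(MASK_NAMES)
                   if (cw >> bit) & 1],
        "precision": PRECISION[(cw >> 8) & 0b11],
        "rounding": ROUNDING[(cw >> 10) & 0b11],
    }

def set_rounding(cw, rc):
    """Return 'cw' with bits 10 and 11 replaced by the two bit value 'rc'."""
    return (cw & ~(0b11 << 10) & 0xFFFF) | (rc << 10)

cw = 0x037F                          # assumed sample value
print(decode_control_word(cw))
cw = set_rounding(cw, 0b11)          # select truncation
print(hex(cw), decode_control_word(cw)["rounding"])
```

Clearing the field before ORing in the new value matters; ORing alone could never change a "11" setting back to "00".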
Table 59: Mantissa Precision Control Bits
Bits 8 & 9 Precision Control
00 24 bits
01 Reserved
10 53 bits
11 64 bits
The FPU Status Register
The FPU status register provides the status of the coprocessor at the instant you read it. The FSTSW instruction stores the 16 bit floating point status register into the mod/reg/rm operand. The status register is a 16 bit register; its layout appears in Figure 14.7.
Bits zero through five are the exception flags. These bits appear in the same order as the exception masks in the control register. If the corresponding condition exists, then the bit is set. These bits are independent of the exception masks in the control register. The 80x87 sets and clears these bits regardless of the corresponding mask setting.
Bit six (active only on 80386 and later processors) indicates a stack fault. A stack fault occurs whenever there is a stack overflow or underflow. When this bit is set, the C1 condition code bit determines whether there was a stack overflow (C1=1) or stack underflow (C1=0) condition.
Bit seven of the status register is set if any error condition bit is set. It is the logical OR of bits zero through five. A program can test this bit to quickly determine if an error condition exists.
Bits eight, nine, ten, and fourteen are the coprocessor condition code bits. Various instructions set the condition code bits as shown in the following table:
Table 60: FPU Condition Code Bits
Instruction                            C3  C2  C1  C0   Condition
fcom, fcomp, fcompp, ficom, ficomp     0   0   X   0    ST > source
                                       0   0   X   1    ST < source
                                       1   0   X   0    ST = source
                                       1   1   X   1    ST or source undefined
ftst                                   0   0   X   0    ST is positive
                                       0   0   X   1    ST is negative
                                       1   0   X   0    ST is zero (+ or -)
                                       1   1   X   1    ST is uncomparable
fxam                                   0   0   0   0    +Unnormalized
                                       0   0   1   0    -Unnormalized
                                       0   1   0   0    +Normalized
                                       0   1   1   0    -Normalized
                                       1   0   0   0    +0
                                       1   0   1   0    -0
                                       1   1   0   0    +Denormalized
                                       1   1   1   0    -Denormalized
                                       0   0   0   1    +NaN
                                       0   0   1   1    -NaN
                                       0   1   0   1    +Infinity
                                       0   1   1   1    -Infinity
                                       1   X   X   1    Empty register
fucom, fucomp, fucompp                 0   0   X   0    ST > source
                                       0   0   X   1    ST < source
                                       1   0   X   0    ST = source
                                       1   1   X   1    Unordered
X = Don’t care
Figure 14.7 FPU Status Register
    Bits 0-5    Exception flags: Invalid Operation (bit 0), Denormalized (bit 1), Zero Divide (bit 2), Overflow (bit 3), Underflow (bit 4), Precision (bit 5)
    Bit 6       Stack Fault
    Bit 7       Exception Flag
    Bits 8-10   Condition codes C0, C1, C2
    Bits 11-13  Top of stack pointer
    Bit 14      Condition code C3
    Bit 15      Busy
Table 61: Condition Code Interpretation
Instruction(s): fcom, fcomp, fcompp, ftst, fucom, fucomp, fucompp, ficom, ficomp
    C0: Result of comparison. See table above.
    C3: Result of comparison. See table above.
    C2: Operand is not comparable.
    C1: Result of comparison (see table above) or stack overflow/underflow (if stack exception bit is set).
Instruction(s): fxam
    C0: See previous table.
    C3: See previous table.
    C2: See previous table.
    C1: Sign of result, or stack overflow/underflow (if stack exception bit is set).
Instruction(s): fprem, fprem1
    C0: Bit 2 of quotient.
    C3: Bit 1 of quotient.
    C2: 0 - reduction done. 1 - reduction incomplete.
    C1: Bit 0 of quotient or stack overflow/underflow (if stack exception bit is set).
Instruction(s): fist, fbstp, frndint, fst, fstp, fadd, fmul, fdiv, fdivr, fsub, fsubr, fscale, fsqrt, fpatan, f2xm1, fyl2x, fyl2xp1
    C0: Undefined.
    C3: Undefined.
    C2: Undefined.
    C1: Round up occurred or stack overflow/underflow (if stack exception bit is set).
Instruction(s): fptan, fsin, fcos, fsincos
    C0: Undefined.
    C3: Undefined.
    C2: 0 - reduction done. 1 - reduction incomplete.
    C1: Round up occurred or stack overflow/underflow (if stack exception bit is set).
Instruction(s): fchs, fabs, fxch, fincstp, fdecstp, constant loads, fxtract, fld, fild, fbld, fstp (80 bit)
    C0: Undefined.
    C3: Undefined.
    C2: Undefined.
    C1: Zero result or stack overflow/underflow (if stack exception bit is set).
Instruction(s): fldenv, frstor
    C0: Restored from memory operand.
    C3: Restored from memory operand.
    C2: Restored from memory operand.
    C1: Restored from memory operand.
Instruction(s): fldcw, fstenv, fstcw, fstsw, fclex
    C0: Undefined.
    C3: Undefined.
    C2: Undefined.
    C1: Undefined.
Instruction(s): finit, fsave
    C0: Cleared to zero.
    C3: Cleared to zero.
    C2: Cleared to zero.
    C1: Cleared to zero.
Bits 11-13 of the FPU status register provide the register number of the top of stack. During computations, the 80x87 adds (modulo eight) the logical register numbers supplied by the programmer to these three bits to determine the physical register number at run time.
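The modulo eight addition is simple enough to write out. In this Python sketch (the function names are hypothetical), top stands for the three bit top of stack field and i for the register number in ST(i). The push function reflects the fact that a push decrements the top of stack field.

```python
def physical_register(top, i):
    """Map logical register ST(i) to a physical register number."""
    return (top + i) % 8

def push(top):
    """A push decrements the top of stack field, wrapping modulo eight."""
    return (top - 1) % 8

top = 0
old_st0 = physical_register(top, 0)    # physical register holding ST(0)
top = push(top)                        # the field wraps around from 0 to 7

# After the push, ST(1) names the physical register that used to be ST(0).
print(top, physical_register(top, 1), old_st0)
```

This is why the data itself never moves during a push or pop; only the three bit field changes.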
Bit 15 of the status register is the busy bit. It is set whenever the FPU is busy. Most programs will have little reason to access this bit.
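Once a program has the status word in memory, picking it apart is a matter of shifts and masks. The following Python sketch decodes a 16 bit value according to the bit assignments given in this section and interprets the condition codes the way Table 60 lists them for the fcom family. The function and table names are hypothetical, and the sample values are made up for the example.

```python
FLAG_NAMES = ["invalid operation", "denormalized", "zero divide",
              "overflow", "underflow", "precision"]       # bits 0 through 5

# (C3, C2, C0) -> meaning after fcom/fcomp/fcompp/ficom/ficomp (C1 is ignored).
COMPARE = {(0, 0, 0): "ST > source", (0, 0, 1): "ST < source",
           (1, 0, 0): "ST = source", (1, 1, 1): "ST or source undefined"}

def decode_status_word(sw):
    """Split a 16 bit status word into the fields described in the text."""
    c0, c1, c2 = (sw >> 8) & 1, (sw >> 9) & 1, (sw >> 10) & 1
    c3 = (sw >> 14) & 1
    return {
        "flags": [n for bit, n in enumerate(FLAG_NAMES) if (sw >> bit) & 1],
        "stack_fault": bool((sw >> 6) & 1),
        "error": bool((sw >> 7) & 1),
        "c3c2c0": (c3, c2, c0),
        "c1": c1,
        "top": (sw >> 11) & 0b111,
        "busy": bool((sw >> 15) & 1),
    }

print(COMPARE[decode_status_word(0x0100)["c3c2c0"]])   # only C0 set
print(COMPARE[decode_status_word(0x4000)["c3c2c0"]])   # only C3 set
fault = decode_status_word(0x0241)     # invalid operation, stack fault, C1 = 1
print(fault["flags"], fault["stack_fault"], fault["c1"])
```

In the last example the stack fault bit is set and C1 is one, which the text above identifies as a stack overflow.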
14.4.2 FPU Data Types
The 80x87 FPU supports seven different data types: three integer types, a packed decimal type, and three floating point types. Since the 80x86 CPUs already support integer data types, there are few reasons why you would want to use the 80x87 integer types. The packed decimal type provides a 17 digit signed decimal (BCD) integer. However, we are avoiding BCD arithmetic in this text, so we will ignore this data type in the 80x87 FPU.
The remaining three data types are the 32 bit, 64 bit, and 80 bit floating point data types we’ve looked at so far. The 80x87 data types appear in Figure 14.8, Figure 14.9, and Figure 14.10.
Figure 14.8 80x87 Floating Point Formats
    32 bit Single Precision Floating Point Format (bit positions 31, 23, 15, 7, and 0 marked)
    64 bit Double Precision Floating Point Format (bit positions 63, 52, 7, and 0 marked)
    80 bit Extended Precision Floating Point Format (bit positions 79, 64, 7, and 0 marked)
Figure 14.9 80x87 Integer Formats
    16 bit Two's Complement Integer (bit positions 15, 7, and 0 marked)
    32 bit Two's Complement Integer (bit positions 31, 23, 15, 7, and 0 marked)
    64 bit Two's Complement Integer (bit positions 63, 52, 7, and 0 marked)
The 80x87 FPU generally stores values in a normalized format. When a floating point number is normalized, the H.O. bit of the mantissa is always one. In the 32 and 64 bit floating point formats, the 80x87 does not actually store this bit; it always assumes that the bit is one. Therefore, 32 and 64 bit floating point numbers are always normalized. In the extended precision 80 bit floating point format, the 80x87 does not assume that the H.O. bit of the mantissa is one; the H.O. bit of the number appears as part of the string of bits.
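The implied H.O. bit is easy to see by taking a 32 bit value apart. The Python sketch below unpacks a single precision number into its sign, exponent, and stored fraction bits, then restores the implied bit to form the full 24 bit mantissa. The function names are hypothetical; the field widths (one sign bit, eight exponent bits with a bias of 127, 23 stored mantissa bits) are those of the standard 32 bit format. Infinities and NaNs are not handled.

```python
import struct

def decode_single(x):
    """Return (sign, exponent, fraction, mantissa) for a 32 bit float."""
    bits, = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF          # biased (excess 127) exponent
    fraction = bits & 0x7FFFFF              # the 23 bits actually stored
    # A nonzero exponent means normalized: put back the implied H.O. bit.
    mantissa = (1 << 23) | fraction if exponent != 0 else fraction
    return sign, exponent, fraction, mantissa

def rebuild(sign, exponent, mantissa):
    """Recompute the value from the decoded fields (finite values only)."""
    e = exponent if exponent != 0 else 1    # denormals use the minimum exponent
    return (-1) ** sign * mantissa * 2.0 ** (e - 127 - 23)

sign, exponent, fraction, mantissa = decode_single(-6.5)
print(sign, exponent, hex(fraction), hex(mantissa))
print(rebuild(sign, exponent, mantissa))
```

The stored fraction for -6.5 is 500000h, but the mantissa the FPU actually works with is D00000h; the difference is the implied bit.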
Normalized values provide the greatest precision for a given number of bits. However, there are a large number of non-normalized values which we can represent with the 80 bit format. These values are very close to zero and represent the set of values whose mantissa H.O. bit is zero. The 80x87 FPUs support a special form of 80 bit value known as denormalized values. Denormalized values allow the 80x87 to encode very small values it cannot encode using normalized values, but at a price. Denormalized values offer fewer bits of precision than normalized values. Therefore, using denormalized values in a computation may introduce some slight inaccuracy into a computation. Of course, this is always better than underflowing the denormalized value to zero (which could make the computation even less accurate), but you must keep in mind that if you work with very small values you may lose some accuracy in your computations. Note that the 80x87 status register contains a bit you can use to detect when the FPU uses a denormalized value in a computation.
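The loss of precision in denormalized values can be observed directly. The sketch below uses Python's 64 bit floats, which follow the same rules as the 80x87's 64 bit format (the 80 bit format is not available in Python, so this is a stand-in). It scales a value that needs every mantissa bit down into the denormalized range and back up again; the L.O. bits do not survive the trip. The name is_denormal is hypothetical.

```python
import sys

SMALLEST_NORMAL = sys.float_info.min       # 2**-1022 for 64 bit floats

def is_denormal(v):
    """True if 'v' is nonzero but below the smallest normalized magnitude."""
    return v != 0.0 and abs(v) < SMALLEST_NORMAL

x = 1.0 + 2.0 ** -52                  # uses every bit of the 53 bit mantissa
tiny = x * 2.0 ** -1022               # still normalized, still exact
denorm = tiny * 2.0 ** -30            # denormalized: 30 L.O. bits are dropped
round_trip = denorm * 2.0 ** 30 * 2.0 ** 1022

print(is_denormal(tiny), is_denormal(denorm))
print(x == round_trip, round_trip)
```

Had the small value underflowed to zero instead, scaling it back up would have produced zero, so the denormalized result is still the better of the two outcomes.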
14.4.3 The FPU Instruction Set
The 80387 (and later) FPU adds over 80 new instructions to the 80x86 instruction set.
We can classify these instructions as data movement instructions, conversions, arithmetic instructions, comparisons, constant instructions, transcendental instructions, and miscellaneous instructions. The following sections describe each of the instructions in these categories.
14.4.4 FPU Data Movement Instructions
The data movement instructions transfer data between the internal FPU registers and memory. The instructions in this category are fld, fst, fstp, and fxch. The fld instruction always pushes its operand onto the floating point stack. The fstp instruction always pops the top of stack after storing the top of stack (tos) into its operand. The remaining instructions do not affect the number of items on the stack.
The FLD Instruction
The fld instruction loads a 32 bit, 64 bit, or 80 bit floating point value onto the stack. This instruction converts 32 and 64 bit operands to an 80 bit extended precision value before pushing the value onto the floating point stack.
The fld instruction first decrements the tos pointer (bits 11-13 of the status register) and then stores the 80 bit value in the physical register specified by the new tos pointer. If the source operand of the fld instruction is a floating point data register, ST(i), then the actual register the 80x87 uses for the load operation is the register number before decrementing the tos pointer. Therefore, fld st or fld st(0) duplicates the value on the top of the stack.
Figure 14.10 80x87 Packed Decimal Formats
    80 Bit Packed Decimal Integer (BCD): Sign, Unused, and digits D16 through D0 (bit positions 79, 72, 68, 64, 60, 7, 4, and 0 marked)
The fld instruction sets the stack fault bit if stack overflow occurs. It sets the denormalized exception bit if you load an 80 bit denormalized value. It sets the invalid operation bit if you attempt to load an empty floating point register onto the top of stack (or perform some other invalid operation).
        fld st(1)
        fld mem_32
        fld MyRealVar
        fld mem_64[bx]
The FST and FSTP Instructions
The fst and fstp instructions copy the value on the top of the floating point register stack to another floating point register or to a 32, 64, or 80 bit memory variable. When copying data to a 32 or 64 bit memory variable, the 80 bit extended precision value on the top of stack is rounded to the smaller format as specified by the rounding control bits in the FPU control register.
The fstp instruction pops the value off the top of stack when moving it to the destination location. It does this by incrementing the top of stack pointer in the status register after accessing the data in st(0). If the destination operand is a floating point register, the FPU stores the value at the specified register number before popping the data off the top of the stack.
Executing an fstp st(0) instruction effectively pops the data off the top of stack with no data transfer. Examples:
        fst mem_32
        fstp mem_64
        fstp mem_64[ebx*8]
        fst mem_80
        fst st(2)
        fstp st(1)
The last example above effectively pops st(1) while leaving st(0) on the top of the stack.
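If the ordering of "store, then pop" seems subtle, a small model may help. The Python class below imitates the register stack with an eight element list and a top of stack pointer; fld decrements the pointer and fstp increments it, as described above. It is a sketch for tracing the examples by hand, not a description of the hardware: the class and method names are hypothetical, None marks an empty register, and the exceptions it raises stand in for the FPU's stack fault and invalid operation flags.

```python
class FPUStack:
    """Toy model of the 80x87 register stack (values only, no flags)."""

    def __init__(self):
        self.regs = [None] * 8        # None marks an empty register
        self.top = 0                  # top of stack pointer

    def _phys(self, i):
        return (self.top + i) % 8     # logical ST(i) -> physical register

    def st(self, i=0):
        return self.regs[self._phys(i)]

    def fld(self, value=None, i=None):
        """fld mem (pass value) or fld st(i) (pass i)."""
        # Read the source using the register number before moving tos.
        src = self.st(i) if i is not None else value
        if src is None:
            raise ValueError("invalid operation: empty source")
        new_top = (self.top - 1) % 8
        if self.regs[new_top] is not None:
            raise OverflowError("stack overflow")
        self.top = new_top
        self.regs[self.top] = src

    def fst(self, i):
        """fst st(i): copy ST(0) to ST(i) without popping."""
        self.regs[self._phys(i)] = self.st(0)

    def fstp(self, i=None):
        """fstp st(i), or fstp mem when i is None (returns the stored value)."""
        value = self.st(0)
        if i is not None:
            self.regs[self._phys(i)] = value    # store before popping
        self.regs[self.top] = None              # the old top becomes empty
        self.top = (self.top + 1) % 8
        return value

s = FPUStack()
for v in (1.0, 2.0, 3.0):
    s.fld(v)                          # st(0)=3.0, st(1)=2.0, st(2)=1.0
s.fstp(1)                             # discards 2.0, keeps 3.0 on top
print(s.st(0), s.st(1))
s.fld(i=0)                            # duplicate the top of stack
s.fstp(0)                             # pop with no data transfer
print(s.st(0), s.st(1))
```

Tracing fstp(1) shows why it "pops st(1)": the top value is copied over st(1) first, and only then is the pointer incremented, so the copied value becomes the new st(0).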
The fst and fstp instructions will set the stack exception bit if a stack underflow occurs (attempting to store a value from an empty register stack). They will set the precision bit if there is a loss of precision during the store operation (this will occur, for example, when storing an 80 bit extended precision value into a 32 or 64 bit memory variable and there are some bits lost during conversion). They will set the underflow exception bit when storing an 80 bit value into a 32 or 64 bit memory variable, but the value is too small to fit into the destination operand. Likewise, these instructions will set the overflow exception bit if the value on the top of stack is too big to fit into a 32 or 64 bit memory variable. The fst and fstp instructions set the denormalized flag when you try to store a denormalized value into an 80 bit register or variable [7]. They set the invalid operation flag if an invalid operation (such as storing into an empty register) occurs. Finally, these instructions set the C1 condition bit if rounding occurs during the store operation (this only occurs when storing into a 32 or 64 bit memory variable and you have to round the mantissa to fit into the destination).
[7] Storing a denormalized value into a 32 or 64 bit memory variable will always set the underflow exception bit.
The FXCH Instruction
The fxch instruction exchanges the value on the top of stack with one of the other FPU registers. This instruction takes two forms: one with a single FPU register as an operand, the other without any operands. The first form exchanges the top of stack with the specified register. The second form of fxch swaps the top of stack with st(1).
Many FPU instructions, e.g., fsqrt, operate only on the top of the register stack. If you want to perform such an operation on a value that is not on the top of stack, you can use the fxch instruction to swap that register with tos, perform the desired operation, and then use the fxch to swap the tos with the original register. The following example takes the square root of st(2):
        fxch st(2)
        fsqrt
        fxch st(2)
The fxch instruction sets the stack exception bit if the stack is empty. It sets the invalid operation bit if you specify an empty register as the operand. This instruction always clears the C1 condition code bit.
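The swap, operate, swap back idiom can be traced with a plain Python list standing in for the register stack, where index 0 plays the role of st(0). The helper names mirror the instructions but are only a model; they do no error checking and set no flags.

```python
import math

def fxch(stack, i=1):
    """Exchange st(0) with st(i); with no operand, exchange with st(1)."""
    stack[0], stack[i] = stack[i], stack[0]

def fsqrt(stack):
    """Replace st(0) with its square root."""
    stack[0] = math.sqrt(stack[0])

stack = [1.0, 2.0, 9.0]     # st(0)=1.0, st(1)=2.0, st(2)=9.0
fxch(stack, 2)              # bring st(2) to the top
fsqrt(stack)                # operate on the top of stack
fxch(stack, 2)              # put everything back where it was
print(stack)
```

After the sequence only st(2) has changed; st(0) and st(1) hold their original values.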
14.4.5 Conversions
The 80x87 chip performs all arithmetic operations on 80 bit real quantities. In a sense, the fld and fst/fstp instructions are conversion instructions as well as data movement instructions because they automatically convert between the internal 80 bit real format and the 32 and 64 bit memory formats. Nonetheless, we’ll simply classify them as data movement operations, rather than conversions, because they are moving real values to and from memory. The 80x87 FPU provides five instructions that convert to or from integer or binary coded decimal (BCD) format when moving data. These instructions are fild, fist, fistp, fbld, and fbstp.
The FILD Instruction
The fild (integer load) instruction converts a 16, 32, or 64 bit two’s complement integer to the 80 bit extended precision format and pushes the result onto the stack. This instruction always expects a single operand. This operand must be the address of a word, double word, or quad word integer variable. Although the instruction format for fild uses the familiar mod/rm fields, the operand must be a memory variable, even for 16 and 32 bit integers. You cannot specify one of the 80386’s 16 or 32 bit general purpose registers. If you want to push an 80x86 general purpose register onto the FPU stack, you must first store it into a memory variable and then use fild to push the value of that memory variable.
The fild instruction sets the stack exception bit and C1 (accordingly) if stack overflow occurs while pushing the converted value. Examples:
        fild mem_16
        fild mem_32[ecx*4]
        fild mem_64[ebx+ecx*8]
The FIST and FISTP Instructions
The fist and fistp instructions convert the 80 bit extended precision variable on the top of stack to a 16, 32, or 64 bit integer and store the result away into the memory variable specified by the single operand. These instructions convert the value on tos to an integer according to the rounding setting in the FPU control register (bits 10 and 11). As for the fild instruction, the fist and fistp instructions will not let you specify one of the 80x86’s general purpose 16 or 32 bit registers as the destination operand.
The fist instruction converts the value on the top of stack to an integer and then stores the result; it does not otherwise affect the floating point register stack. The fistp instruction pops the value off the floating point register stack after storing the converted value.