師大
analysis
Tsung-Ming Huang
Department of Mathematics National Taiwan Normal University, Taiwan
September 21, 2014
Outline
1 Round-off errors and computer arithmetic
    IEEE standard floating-point format
    Absolute and Relative Errors
    Machine Epsilon
    Loss of Significance
2 Algorithms and Convergence
    Algorithm
    Stability
    Rate of convergence
What is the difference between arithmetic in algebra and arithmetic on a computer?
1 For arithmetic in algebra, $256 + 1 = 257$ and $\left(\sqrt{256 + 1}\right)^2 = 257$.
2 For arithmetic on a computer (MATLAB):
int8(256) + int8(1) = 127 (int8 values saturate at the largest representable value, 127)
int16(256) + int16(1) = 257
sqrt(256+1)^2 = ? Is the result equal to 257 or not?
(single(sqrt(5))+single(sqrt(3)))^2 - (sqrt(3)+sqrt(5))^2
Example 1
Consider the following recurrence algorithm
$x_0 = 1, \quad x_1 = \frac{1}{3}, \quad x_{n+1} = \frac{13}{3} x_n - \frac{4}{3} x_{n-1}$
for computing the sequence $\{x_n = (\frac{1}{3})^n\}$.
Matlab program
n = 30; x = zeros(n,1); x(1) = 1; x(2) = 1/3;    % x(k) stores x_{k-1}
for ii = 3:n
    x(ii) = 13 / 3 * x(ii-1) - 4 / 3 * x(ii-2);  % recurrence step
    xn = (1/3)^(ii-1);                           % exact value of x_{ii-1}
    RelErr = abs(xn-x(ii)) / xn;                 % relative error of the computed value
    fprintf(['x(%2.0f) = %15.8e, x_ast(%2.0f) = %14.8e, ', ...
             'RelErr(%2.0f) = %11.4e \n'], ii, x(ii), ii, xn, ii, RelErr);
end
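The same experiment can be sketched in Python. The general solution of the recurrence is $x_n = A (\frac{1}{3})^n + B \cdot 4^n$, so any rounding error that makes $B \neq 0$ grows by a factor of 4 per step while the wanted solution shrinks. The function names below are illustrative and not part of the lecture; the exact reference values are computed with rational arithmetic.

```python
from fractions import Fraction

def recurrence_float(n):
    """x_0..x_n from x_{k+1} = 13/3 x_k - 4/3 x_{k-1} in double precision."""
    x = [1.0, 1.0 / 3.0]
    for k in range(1, n):
        x.append(13 / 3 * x[k] - 4 / 3 * x[k - 1])
    return x

def recurrence_exact(n):
    """The same recurrence in exact rational arithmetic."""
    x = [Fraction(1), Fraction(1, 3)]
    for k in range(1, n):
        x.append(Fraction(13, 3) * x[k] - Fraction(4, 3) * x[k - 1])
    return x

n = 29
approx = recurrence_float(n)
exact = recurrence_exact(n)
# Relative error of each double-precision iterate, measured exactly
rel_err = [abs(Fraction(a) - e) / e for a, e in zip(approx, exact)]

for k in (2, 10, 20, 29):
    print(f"n = {k:2d}  x_n = {approx[k]: .8e}  "
          f"exact = {float(exact[k]):.8e}  RelErr = {float(rel_err[k]):.4e}")
```

In exact arithmetic the recurrence reproduces $(\frac{1}{3})^n$; in double precision the relative error grows until the computed values are meaningless.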
Example 2
What is the binary representation of $\frac{2}{3}$?
Solution: To determine the binary representation of $\frac{2}{3}$, we write
$\frac{2}{3} = (0.a_1 a_2 a_3 \ldots)_2$.
Multiply by 2 to obtain
$\frac{4}{3} = (a_1.a_2 a_3 \ldots)_2$.
Therefore, we get $a_1 = 1$ by taking the integer part of both sides.
Subtracting 1, we have
$\frac{1}{3} = (0.a_2 a_3 a_4 \ldots)_2$.
Repeating the previous step, we arrive at
$\frac{2}{3} = (0.101010 \ldots)_2$.
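The multiply-by-2 procedure can be written as a short loop. The following Python sketch (the function name is hypothetical) uses exact fractions so that no rounding interferes with the digits.

```python
from fractions import Fraction

def binary_fraction_digits(x, n):
    """Return the first n binary digits a_1..a_n of a fraction 0 <= x < 1."""
    x = Fraction(x)
    digits = []
    for _ in range(n):
        x *= 2           # shift the binary point one place to the right
        a = int(x)       # the integer part is the next digit (0 or 1)
        digits.append(a)
        x -= a           # keep only the fractional part
    return digits

print(binary_fraction_digits(Fraction(2, 3), 8))
```

For 2/3 the digits alternate 1, 0, 1, 0, ..., in agreement with the hand computation.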
In the computational world, each representable number has only a fixed and finite number of digits.
For any nonzero real number x, let
$x = \pm 1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots \times 2^m$
denote the normalized scientific binary representation of x.
In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985. In this report, formats were specified for single, double, and extended precisions, and these standards are generally followed by microcomputer manufacturers using floating-point hardware.
Single precision
The single precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number $\pm q \times 2^m$, as shown in the following figure.
[Figure: 32-bit word layout. Bit 0: sign of mantissa; bits 1-8: 8-bit exponent; bits 9-31: 23-bit normalized mantissa.]
The first bit is a sign indicator, denoted s. This is followed by an 8-bit exponent c and a 23-bit mantissa f.
The base for the exponent and mantissa is 2, and the actual exponent is $c - 127$. The value of c is restricted by the inequality $0 \le c \le 255$.
The actual exponent of the number is restricted by the inequality $-127 \le c - 127 \le 128$.
A normalization is imposed that requires the leading digit of the fraction to be 1, and this digit is not stored as part of the 23-bit mantissa.
Using this system gives a floating-point number of the form $(-1)^s \, 2^{c-127} (1 + f)$.
Example 3
What decimal number does the machine number 01000000101000000000000000000000 represent?
1 The leftmost bit is zero, which indicates that the number is positive.
2 The next 8 bits, 10000001, are equivalent to
$c = 1 \cdot 2^7 + 0 \cdot 2^6 + \cdots + 0 \cdot 2^1 + 1 \cdot 2^0 = 129$.
The exponential part of the number is $2^{129-127} = 2^2$.
3 The final 23 bits specify that the mantissa is
$f = 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 0 \cdot 2^{-3} + \cdots + 0 \cdot 2^{-23} = 0.25$.
4 Consequently, this machine number precisely represents the decimal number
$(-1)^s \, 2^{c-127} (1 + f) = 2^2 \cdot (1 + 0.25) = 5$.
Example 4
What decimal number does the machine number 01000000100111111111111111111111 represent?
The sign bit and the exponent bits are the same as in Example 3, so s = 0 and c = 129.
1 The final 23 bits specify that the mantissa is
$f = 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + 1 \cdot 2^{-3} + \cdots + 1 \cdot 2^{-23} = 2^{-2} - 2^{-23} \approx 0.2499998807907104$.
2 Consequently, this machine number represents the decimal number
$(-1)^s \, 2^{c-127} (1 + f) = 2^2 \cdot (1 + 2^{-2} - 2^{-23}) = 5 - 2^{-21} \approx 4.999999523162842$.
Example 5
What decimal number does the machine number 01000000101000000000000000000001 represent?
The sign bit and the exponent bits are again s = 0 and c = 129.
1 The final 23 bits specify that the mantissa is
$f = 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 0 \cdot 2^{-3} + \cdots + 0 \cdot 2^{-22} + 1 \cdot 2^{-23} = 2^{-2} + 2^{-23} \approx 0.2500001192092896$.
2 Consequently, this machine number represents the decimal number
$(-1)^s \, 2^{c-127} (1 + f) = 2^2 \cdot (1 + 2^{-2} + 2^{-23}) = 5 + 2^{-21} \approx 5.000000476837158$.
Summary
The above three examples:
01000000100111111111111111111111 ⇒ 4.999999523162842
01000000101000000000000000000000 ⇒ 5
01000000101000000000000000000001 ⇒ 5.000000476837158
Only a relatively small subset of the real number system is used for the representation of all the real numbers.
This subset, called the floating-point numbers, contains only rational numbers, both positive and negative.
When a number cannot be represented exactly with the fixed finite number of digits in a computer, a nearby floating-point number is chosen for approximate representation.
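The three bit patterns can also be decoded mechanically from the formula $(-1)^s \, 2^{c-127} (1 + f)$. The Python sketch below is only an illustration: the function names are made up here, and the exponent values c = 0 and c = 255 are rejected because the standard reserves them for special values. The second function lets the standard `struct` module reinterpret the same 32 bits as a cross-check.

```python
import struct

def decode_single(bits):
    """Evaluate (-1)^s * 2^(c-127) * (1+f) for a 32-character bit string."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of 32 binary digits")
    s = int(bits[0])                 # sign bit
    c = int(bits[1:9], 2)            # 8-bit exponent field
    f = int(bits[9:], 2) / 2**23     # 23-bit mantissa as a fraction in [0, 1)
    if c == 0 or c == 255:
        raise ValueError("reserved exponent, not a normalized number")
    return (-1) ** s * 2.0 ** (c - 127) * (1 + f)

def decode_with_struct(bits):
    """Reinterpret the same 32 bits as an IEEE single via struct."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]

patterns = [
    "01000000100111111111111111111111",
    "01000000101000000000000000000000",
    "01000000101000000000000000000001",
]
for p in patterns:
    print(p, "=>", repr(decode_single(p)))
```

The neighbours of 5 differ from it by $2^{-21}$, which is the spacing of single precision numbers in the interval [4, 8).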
The smallest positive normalized number
Let s = 0, c = 1 and f = 0, which is equivalent to $2^{-126} \cdot (1 + 0) \approx 1.175 \times 10^{-38}$.
The largest number
Let s = 0, c = 254 and $f = 1 - 2^{-23}$, which is equivalent to $2^{127} \cdot (2 - 2^{-23}) \approx 3.403 \times 10^{38}$.
Definition 6
If a number x satisfies $0 < |x| < 2^{-126} \cdot (1 + 0)$, then we say that an underflow has occurred, and x is generally set to zero.
If $|x| > 2^{127} \cdot (2 - 2^{-23})$, then we say that an overflow has occurred.
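As a quick check of the two limits, the corresponding bit patterns can be built by hand and reinterpreted as single precision numbers. This is a sketch using Python's `struct` module; the helper name is hypothetical.

```python
import struct

def single_from_hex(h):
    """Interpret 8 hex digits (32 bits, big-endian) as an IEEE single."""
    return struct.unpack(">f", bytes.fromhex(h))[0]

# s = 0, c = 1, f = 0: exponent field 00000001, mantissa all zeros
smallest_normal = single_from_hex("00800000")
# s = 0, c = 254, f = 1 - 2^-23: exponent field 11111110, mantissa all ones
largest = single_from_hex("7f7fffff")

print(f"{smallest_normal:.3e}  {largest:.3e}")
```

Both values are exactly representable as Python floats (doubles), so they can be compared with the closed-form expressions without further rounding.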
Double precision
A floating-point number in double precision IEEE standard format uses two words (64 bits) to store the number, as shown in the following figure.
[Figure: 64-bit word layout. Bit 0: sign of mantissa; bits 1-11: 11-bit exponent; bits 12-63: 52-bit normalized mantissa.]
The first bit is a sign indicator, denoted s. This is followed by an 11-bit exponent c and a 52-bit mantissa f.
The actual exponent is $c - 1023$.
Format of floating-point number
$(-1)^s \times (1 + f) \times 2^{c-1023}$
The smallest positive normalized number
Let s = 0, c = 1 and f = 0, which is equivalent to $2^{-1022} \cdot (1 + 0) \approx 2.225 \times 10^{-308}$.
The largest number
Let s = 0, c = 2046 and $f = 1 - 2^{-52}$, which is equivalent to $2^{1023} \cdot (2 - 2^{-52}) \approx 1.798 \times 10^{308}$.
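Python's built-in `float` is an IEEE double on common platforms, so the two limits can be compared directly with the values reported by `sys.float_info`. This is a small sketch assuming such a platform.

```python
import math
import sys

smallest_normal = 2.0 ** -1022                   # s = 0, c = 1, f = 0
largest = (2.0 - 2.0 ** -52) * 2.0 ** 1023       # s = 0, c = 2046, f = 1 - 2^-52

print(smallest_normal == sys.float_info.min)
print(largest == sys.float_info.max)
print(math.isinf(largest * 2))                   # overflow produces infinity
```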
Chopping and rounding
For any real number x, let
$x = \pm 1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots \times 2^m$
denote the normalized scientific binary representation of x.
1 chopping: simply discard the excess bits $a_{t+1}, a_{t+2}, \ldots$ to obtain
$fl(x) = \pm 1.a_1 a_2 \cdots a_t \times 2^m$.
2 rounding: add $2^{-(t+1)} \times 2^m$ to x and then chop the excess bits to obtain a number of the form
$fl(x) = \pm 1.\delta_1 \delta_2 \cdots \delta_t \times 2^m$.
In this method, if $a_{t+1} = 1$, we add 1 to $a_t$ to obtain $fl(x)$, and if $a_{t+1} = 0$, we merely chop off all but the first t digits.
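Both procedures can be tried on exact fractions. The following Python sketch is a toy model under stated assumptions: the function `fl` and its `mode` argument are invented for illustration, the exponent range is unlimited, and the rounding increment is applied to the magnitude of x.

```python
from fractions import Fraction
import math

def fl(x, t, mode="chop"):
    """Keep t binary digits after the leading 1 of a nonzero rational x."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("zero has no normalized representation")
    sign = -1 if x < 0 else 1
    x = abs(x)
    # Find m such that 1 <= x / 2^m < 2
    m = 0
    while x >= Fraction(2) ** (m + 1):
        m += 1
    while x < Fraction(2) ** m:
        m -= 1
    scaled = x / Fraction(2) ** m * 2**t     # 1 a_1 ... a_t . a_{t+1} a_{t+2} ...
    if mode == "round":
        scaled += Fraction(1, 2)             # add 2^-(t+1) * 2^m before chopping
    elif mode != "chop":
        raise ValueError("mode must be 'chop' or 'round'")
    return sign * Fraction(math.floor(scaled), 2**t) * Fraction(2) ** m

x = Fraction(2, 3)                           # (1.010101...)_2 x 2^-1
for mode in ("chop", "round"):
    y = fl(x, 3, mode)
    print(mode, y, float(abs(x - y) / x))
```

With t = 3, chopping gives $(1.010)_2 \times 2^{-1} = 5/8$ and rounding gives $(1.011)_2 \times 2^{-1} = 11/16$, because $a_4 = 1$.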
Definition 7 (Roundoff error)
The error that results from replacing a number with its floating-point form is called roundoff error or rounding error.
Definition 8 (Absolute Error and Relative Error)
If x is an approximation to the exact value $x^*$, the absolute error is $|x^* - x|$ and the relative error is $\frac{|x^* - x|}{|x^*|}$, provided that $x^* \neq 0$.
Example 9
(a) If $x^* = 0.3000 \times 10^{-3}$ and $x = 0.3100 \times 10^{-3}$, then the absolute error is $0.1 \times 10^{-4}$ and the relative error is $0.3333 \times 10^{-1}$.
(b) If $x^* = 0.3000 \times 10^{4}$ and $x = 0.3100 \times 10^{4}$, then the absolute error is $0.1 \times 10^{3}$ and the relative error is $0.3333 \times 10^{-1}$.
Remark 1
As a measure of accuracy, the absolute error may be misleading and the relative error more meaningful.
Definition 10
The number x is said to approximate $x^*$ to t significant digits if t is the largest nonnegative integer for which
$\frac{|x - x^*|}{|x^*|} \le 5 \times 10^{-t}$.
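Definition 10 translates directly into a small search for the largest admissible t. The Python sketch below is illustrative only: the function name and the cap `t_max` are assumptions, and the inputs are passed as decimal strings so that the relative error is computed exactly.

```python
from fractions import Fraction

def significant_digits(approx, exact, t_max=50):
    """Largest nonnegative t with |approx - exact| / |exact| <= 5 * 10^-t."""
    approx, exact = Fraction(approx), Fraction(exact)
    rel = abs(approx - exact) / abs(exact)
    if rel > 5:
        return None                  # not even t = 0 satisfies the inequality
    t = 0
    # t_max only guards against an endless loop when approx == exact
    while t < t_max and rel <= Fraction(5, 10 ** (t + 1)):
        t += 1
    return t

# Both cases share the relative error 1/30, hence the same t
print(significant_digits("0.00031", "0.0003"))
print(significant_digits("3100", "3000"))
```

Since $1/30 \approx 0.0333$ satisfies $0.0333 \le 5 \times 10^{-2}$ but not $0.0333 \le 5 \times 10^{-3}$, both approximations have t = 2, even though their absolute errors differ by seven orders of magnitude.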
If the floating-point representation $fl(x)$ for the number x is obtained by using t digits and the chopping procedure, then the relative error is
$\frac{|x - fl(x)|}{|x|} = \frac{|0.00 \cdots 0 a_{t+1} a_{t+2} \cdots \times 2^m|}{|1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots \times 2^m|} = \frac{|0.a_{t+1} a_{t+2} \cdots|}{|1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots|} \times 2^{-t}$.
The minimal value of the denominator is 1. The numerator is bounded above by 1. As a consequence
$\left| \frac{x - fl(x)}{x} \right| \le 2^{-t}$.
If t-digit rounding arithmetic is used, there are two cases.
If $a_{t+1} = 0$, then $fl(x) = \pm 1.a_1 a_2 \cdots a_t \times 2^m$. A bound for the relative error is
$\frac{|x - fl(x)|}{|x|} = \frac{|0.a_{t+1} a_{t+2} \cdots|}{|1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots|} \times 2^{-t} \le 2^{-(t+1)}$,
since the numerator is bounded above by $\frac{1}{2}$ due to $a_{t+1} = 0$.
If $a_{t+1} = 1$, then $fl(x) = \pm (1.a_1 a_2 \cdots a_t + 2^{-t}) \times 2^m$. The upper bound for the relative error becomes
$\frac{|x - fl(x)|}{|x|} = \frac{|1 - 0.a_{t+1} a_{t+2} \cdots|}{|1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots|} \times 2^{-t} \le 2^{-(t+1)}$,
since the numerator is bounded by $\frac{1}{2}$ due to $a_{t+1} = 1$.
Therefore the relative error for rounding arithmetic satisfies
$\left| \frac{x - fl(x)}{x} \right| \le 2^{-(t+1)} = \frac{1}{2} \times 2^{-t}$.
Definition 11 (Machine epsilon)
The floating-point representation, $fl(x)$, of x can be expressed as
$fl(x) = x(1 + \delta), \quad |\delta| \le \varepsilon_M, \qquad (1)$
where $\varepsilon_M \equiv 2^{-t}$ is referred to as the unit roundoff error or machine epsilon.
Single precision IEEE standard floating-point format
The mantissa f corresponds to 23 binary digits (i.e., t = 23), so the machine epsilon is
$\varepsilon_M = 2^{-23} \approx 1.192 \times 10^{-7}$.
This approximately corresponds to 7 accurate decimal digits.
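Relation (1) can be observed numerically for single precision. In the Python sketch below, `struct` rounds a double to the nearest single; the helper name and the sample values are arbitrary choices, and the doubles merely stand in for real numbers x. Because this conversion rounds to nearest, the observed $|\delta|$ stays below $2^{-24} = \frac{1}{2} \varepsilon_M$, in line with the rounding bound $2^{-(t+1)}$.

```python
import math
import struct

def to_single(x):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

samples = [0.1, 2.0 / 3.0, math.pi, 1.0e10 / 7.0, 5.0]
for x in samples:
    fx = to_single(x)
    delta = (fx - x) / x             # fl(x) = x (1 + delta)
    print(f"x = {x:<22.17g} delta = {delta: .3e}")
```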
Double precision IEEE standard floating-point format
The mantissa f corresponds to 52 binary digits (i.e., t = 52), so the machine epsilon is
$\varepsilon_M = 2^{-52} \approx 2.220 \times 10^{-16}$,
which provides between 15 and 16 decimal digits of accuracy.
Summary of IEEE standard floating-point format
                            single precision        double precision
$\varepsilon_M$             $1.192 \times 10^{-7}$   $2.220 \times 10^{-16}$
smallest positive number    $1.175 \times 10^{-38}$  $2.225 \times 10^{-308}$
largest number              $3.403 \times 10^{38}$   $1.798 \times 10^{308}$
decimal precision           7                       16
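The double precision value of $\varepsilon_M$ can be found experimentally by halving a candidate until adding half of it to 1 no longer changes the result. This is a sketch that assumes Python floats are IEEE doubles evaluated without extended precision.

```python
import sys

def machine_epsilon():
    """Smallest power of two eps for which 1 + eps/2 rounds back to 1."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps = machine_epsilon()
print(eps, eps == 2.0 ** -52, eps == sys.float_info.epsilon)
```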
Let $\odot$ stand for any one of the four basic arithmetic operators $+, -, \times, \div$.
Whenever two machine numbers x and y are to be combined arithmetically, the computer will produce $fl(x \odot y)$ instead of $x \odot y$.
Under (1), the relative error of $fl(x \odot y)$ satisfies
$fl(x \odot y) = (x \odot y)(1 + \delta), \quad |\delta| \le \varepsilon_M, \qquad (2)$
where $\varepsilon_M$ is the unit roundoff.
But if x, y are not machine numbers, then they must first be rounded to floating-point format before the arithmetic operation, and the computed result becomes
$fl(fl(x) \odot fl(y)) = (x(1 + \delta_1) \odot y(1 + \delta_2))(1 + \delta_3)$,
where $|\delta_i| \le \varepsilon_M$, i = 1, 2, 3.
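A familiar instance is 0.1 + 0.2 in double precision: neither operand is a machine number, so three roundings occur. The Python sketch below measures the total relative error exactly with rational arithmetic; the variable names are illustrative. For the sum of two positive numbers the model gives a total relative error of at most $(1 + \varepsilon_M)^2 - 1$, and this bound does not carry over to subtraction of nearly equal numbers.

```python
from fractions import Fraction

x, y = Fraction(1, 10), Fraction(2, 10)      # exact operands
computed = float(x) + float(y)               # fl(fl(x) + fl(y))
exact = x + y

# Total relative error of the computed sum, evaluated without rounding
delta_total = abs(Fraction(computed) - exact) / exact
eps_m = Fraction(1, 2**52)
bound = (1 + eps_m) ** 2 - 1

print(computed == 0.3)                       # the computed sum is not fl(0.3)
print(float(delta_total), float(bound))
```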
Example
Let x = 0.54617 and y = 0.54601. Using rounding and four-digit arithmetic:
$x^* = fl(x) = 0.5462$ is accurate to four significant digits since
$\frac{|x - x^*|}{|x|} = \frac{0.00003}{0.54617} \approx 5.5 \times 10^{-5} \le 5 \times 10^{-4}$.
$y^* = fl(y) = 0.5460$ is accurate to five significant digits since
$\frac{|y - y^*|}{|y|} = \frac{0.00001}{0.54601} \approx 1.8 \times 10^{-5} \le 5 \times 10^{-5}$.
The exact value of the subtraction is r = x − y = 0.00016.
But
$r^* \equiv x \ominus y = fl(fl(x) - fl(y)) = 0.0002$.
Since
$\frac{|r - r^*|}{|r|} = 0.25 \le 5 \times 10^{-1}$,
the result has only one significant digit.
Loss of accuracy
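This subtraction can be replayed with Python's `decimal` module, which supports arithmetic with a chosen number of significant decimal digits. The sketch assumes that "rounding" means round-half-up; the context name `ctx` is arbitrary.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)    # four-digit rounding arithmetic

x, y = Decimal("0.54617"), Decimal("0.54601")
fx, fy = ctx.plus(x), ctx.plus(y)                # fl(x), fl(y)
r_star = ctx.subtract(fx, fy)                    # fl(fl(x) - fl(y))

r = x - y                                        # exact under the default 28-digit context
rel_err = abs(r - r_star) / abs(r)

print(fx, fy, r_star, r, rel_err)
```

Each operand is accurate to at least four digits, yet the leading digits cancel in the subtraction and the rounding errors of the operands dominate the small difference.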
Loss of Significance
One of the most common error-producing calculations involves the cancellation of significant digits due to the subtraction of nearly equal numbers or the addition of one very large number and one very small number.
Sometimes, loss of significance can be avoided by rewriting the mathematical formula.
Example 12
The quadratic formulas for computing the roots of $ax^2 + bx + c = 0$, when $a \neq 0$, are
$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}$ and $x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$.
Consider the quadratic equation $x^2 + 62.10x + 1 = 0$ and discuss the numerical results.
Solution
Using the quadratic formula and 8-digit rounding arithmetic, one can obtain
$x_1 = -0.01610723$ and $x_2 = -62.08390$.
Now we perform the calculations with 4-digit rounding arithmetic. First we have
$\sqrt{b^2 - 4ac} = \sqrt{62.10^2 - 4.000} = \sqrt{3856 - 4.000} = 62.06$,
and
$fl(x_1) = \frac{-62.10 + 62.06}{2.000} = \frac{-0.04000}{2.000} = -0.02000$.
The relative error in computing $x_1$ is
$\frac{|fl(x_1) - x_1|}{|x_1|} = \frac{|-0.02000 + 0.01610723|}{|-0.01610723|} \approx 0.2417 \le 5 \times 10^{-1}$.
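The 4-digit computation can be reproduced with Python's `decimal` module, rounding after every operation. The sketch also evaluates one common rewriting of the formula, $x_1 = \frac{-2c}{b + \sqrt{b^2 - 4ac}}$, obtained by multiplying numerator and denominator by $-b - \sqrt{b^2 - 4ac}$. This rewriting is offered as an illustration of avoiding the cancellation and is not taken from the solution above; the variable names are arbitrary and round-half-up is assumed.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)    # 4-digit rounding arithmetic

a, b, c = Decimal("1.000"), Decimal("62.10"), Decimal("1.000")

# sqrt(b^2 - 4ac), rounded to 4 digits after every operation
b2 = ctx.multiply(b, b)
four_ac = ctx.multiply(Decimal(4), ctx.multiply(a, c))
root = ctx.sqrt(ctx.subtract(b2, four_ac))

# Textbook formula: -b + root subtracts two nearly equal numbers
x1_naive = ctx.divide(ctx.add(ctx.minus(b), root),
                      ctx.multiply(Decimal(2), a))

# Rewritten formula: -2c / (b + root) has no cancellation when b > 0
x1_stable = ctx.divide(ctx.multiply(Decimal(-2), c), ctx.add(b, root))

x1_ref = Decimal("-0.01610723")                  # 8-digit value quoted above
for name, val in (("naive", x1_naive), ("stable", x1_stable)):
    print(name, val, abs(val - x1_ref) / abs(x1_ref))
```

The naive formula reproduces the relative error of about 0.24, while the rewritten form agrees with the 8-digit value to roughly four significant digits.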