Chopping and rounding

師大

For any real number $x$, let

$$x = \pm 1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots \times 2^m$$

denote the normalized scientific binary representation of $x$.

1. Chopping: simply discard the excess bits $a_{t+1}, a_{t+2}, \ldots$ to obtain

$$fl(x) = \pm 1.a_1 a_2 \cdots a_t \times 2^m.$$

2. Rounding: add $2^{-(t+1)} \times 2^m$ to $x$ and then chop the excess bits to obtain a number of the form

$$fl(x) = \pm 1.\delta_1 \delta_2 \cdots \delta_t \times 2^m.$$

In this method, if $a_{t+1} = 1$, we add 1 to $a_t$ to obtain $fl(x)$, and if $a_{t+1} = 0$, we merely chop off all but the first $t$ digits.
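
The two procedures can be sketched in Python as follows. This is only an illustration, not part of the original slides: the function name `fl` and its `rounding` flag are hypothetical, the input is assumed to be an ordinary Python float, and exact rational arithmetic (`fractions.Fraction`) is used so that the sketch itself adds no further roundoff.

```python
import math
from fractions import Fraction


def fl(x: float, t: int, rounding: bool = False) -> Fraction:
    """Keep t bits after the leading 1 of x; return the result exactly."""
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    _, e = math.frexp(abs(x))            # abs(x) = f * 2**e with 0.5 <= f < 1
    m = e - 1                            # so abs(x) = 1.a1a2... * 2**m
    ulp = Fraction(2) ** (m - t)         # weight of the last kept bit a_t
    scaled = Fraction(abs(x)) / ulp      # 1a1...a_t . a_{t+1}a_{t+2}...
    if rounding:
        scaled += Fraction(1, 2)         # add 2**-(t+1) * 2**m before chopping
    return sign * math.floor(scaled) * ulp


print(fl(0.1, 4))                  # chopping
print(fl(0.1, 4, rounding=True))   # rounding
```

Since $0.1 = 1.1001\,1001\cdots \times 2^{-4}$ in binary, chopping to $t = 4$ bits keeps $1.1001 \times 2^{-4} = 25/256$, while rounding sees $a_5 = 1$ and returns $1.1010 \times 2^{-4} = 13/128$.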

Definition 7 (Roundoff error)

The error that results from replacing a number with its floating-point form is called roundoff error or rounding error.

Definition 8 (Absolute Error and Relative Error)

If $x$ is an approximation to the exact value $x^*$, the absolute error is $|x^* - x|$ and the relative error is $\dfrac{|x^* - x|}{|x^*|}$, provided that $x^* \neq 0$.

Example 9

(a) If $x^* = 0.3000 \times 10^{-3}$ and $x = 0.3100 \times 10^{-3}$, then the absolute error is $0.1 \times 10^{-4}$ and the relative error is $0.3333 \times 10^{-1}$.

(b) If $x^* = 0.3000 \times 10^{4}$ and $x = 0.3100 \times 10^{4}$, then the absolute error is $0.1 \times 10^{3}$ and the relative error is $0.3333 \times 10^{-1}$.
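
As a quick check of Example 9, the two definitions can be evaluated exactly. The helper names `absolute_error` and `relative_error` are illustrative choices, and `fractions.Fraction` is used only so that the check is free of binary roundoff.

```python
from fractions import Fraction


def absolute_error(x_star: Fraction, x: Fraction) -> Fraction:
    """|x* - x| for exact value x* and approximation x."""
    return abs(x_star - x)


def relative_error(x_star: Fraction, x: Fraction) -> Fraction:
    """|x* - x| / |x*|, defined only for x* != 0."""
    if x_star == 0:
        raise ValueError("relative error is undefined for x* = 0")
    return abs(x_star - x) / abs(x_star)


cases = {
    "a": (Fraction("0.3000e-3"), Fraction("0.3100e-3")),
    "b": (Fraction("0.3000e4"), Fraction("0.3100e4")),
}
for label, (x_star, x) in cases.items():
    print(label, float(absolute_error(x_star, x)), float(relative_error(x_star, x)))
```

Both cases give the relative error $1/30 \approx 0.3333 \times 10^{-1}$, although the absolute errors are $10^{-5}$ and $10^{2}$.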

Remark 1

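As a measure of accuracy, the absolute error may be misleading and the relative error is more meaningful: in Example 9 the two absolute errors differ by a factor of $10^{7}$, yet the relative errors are identical.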

Definition 10

The number $x$ is said to approximate $x^*$ to $t$ significant digits if $t$ is the largest nonnegative integer for which

$$\frac{|x - x^*|}{|x^*|} \le 5 \times 10^{-t}.$$
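
Definition 10 translates directly into a small search over $t$. The sketch below is one possible reading: the function name `significant_digits`, the string inputs (parsed exactly by `fractions.Fraction`), and the cap `t_max` are assumptions made for the illustration.

```python
from fractions import Fraction


def significant_digits(x_star: str, x: str, t_max: int = 30):
    """Largest nonnegative integer t <= t_max with |x - x*| / |x*| <= 5 * 10**-t.

    Returns None if no nonnegative t qualifies.
    """
    exact, approx = Fraction(x_star), Fraction(x)
    rel = abs(approx - exact) / abs(exact)
    if rel > 5:                      # even t = 0 fails
        return None
    t = 0
    while t < t_max and rel <= Fraction(5, 10 ** (t + 1)):
        t += 1                       # the bound also holds for t + 1
    return t


print(significant_digits("0.3000e-3", "0.3100e-3"))
```

For the values of Example 9(a) the relative error is $1/30$, which satisfies the bound for $t = 2$ but not for $t = 3$, so the approximation has two significant digits.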

If the floating-point representation $fl(x)$ of the number $x$ is obtained by using $t$ digits and the chopping procedure, then the relative error is

$$\frac{|x - fl(x)|}{|x|} = \frac{|0.00 \cdots 0\,a_{t+1} a_{t+2} \cdots \times 2^m|}{|1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots \times 2^m|} = \frac{|0.a_{t+1} a_{t+2} \cdots|}{|1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots|} \times 2^{-t}.$$

The minimal value of the denominator is 1. The numerator is bounded above by 1. As a consequence,

$$\left| \frac{x - fl(x)}{x} \right| \le 2^{-t}.$$

If $t$-digit rounding arithmetic is used, there are two cases.

If $a_{t+1} = 0$, then $fl(x) = \pm 1.a_1 a_2 \cdots a_t \times 2^m$. A bound for the relative error is

$$\frac{|x - fl(x)|}{|x|} = \frac{|0.a_{t+1} a_{t+2} \cdots|}{|1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots|} \times 2^{-t} \le 2^{-(t+1)},$$

since the numerator is bounded above by $\frac{1}{2}$ due to $a_{t+1} = 0$.

If $a_{t+1} = 1$, then $fl(x) = \pm (1.a_1 a_2 \cdots a_t + 2^{-t}) \times 2^m$. The upper bound for the relative error becomes

$$\frac{|x - fl(x)|}{|x|} = \frac{|1 - 0.a_{t+1} a_{t+2} \cdots|}{|1.a_1 a_2 \cdots a_t a_{t+1} a_{t+2} \cdots|} \times 2^{-t} \le 2^{-(t+1)},$$

since the numerator is bounded by $\frac{1}{2}$ due to $a_{t+1} = 1$.

Therefore the relative error for rounding arithmetic satisfies

$$\left| \frac{x - fl(x)}{x} \right| \le 2^{-(t+1)} = \frac{1}{2} \times 2^{-t}.$$
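
The bounds $2^{-t}$ for chopping and $2^{-(t+1)}$ for rounding can be checked empirically. The following is a minimal sketch under stated assumptions: the helper names, the choice $t = 10$, the sample range, and the random seed are all arbitrary, and the inputs are positive Python floats whose $t$-bit representations are computed exactly with `fractions.Fraction`.

```python
import math
import random
from fractions import Fraction


def fl(x: float, t: int, rounding: bool) -> Fraction:
    """t-bit chopping or rounding of a positive double, computed exactly."""
    _, e = math.frexp(x)                       # x = f * 2**e with 0.5 <= f < 1
    ulp = Fraction(2) ** (e - 1 - t)           # weight of the last kept bit
    scaled = Fraction(x) / ulp + (Fraction(1, 2) if rounding else 0)
    return math.floor(scaled) * ulp


def worst_relative_error(samples, t, rounding):
    """Largest observed |x - fl(x)| / |x| over the samples."""
    return max(abs(Fraction(x) - fl(x, t, rounding)) / Fraction(x) for x in samples)


random.seed(0)
t = 10
samples = [random.uniform(1e-3, 1e3) for _ in range(2000)]
worst_chop = worst_relative_error(samples, t, rounding=False)
worst_round = worst_relative_error(samples, t, rounding=True)
print(float(worst_chop), "<=", 2.0 ** -t)
print(float(worst_round), "<=", 2.0 ** -(t + 1))
```

A sampled maximum can only approach the bounds from below, so this checks consistency with the derivation rather than proving it.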

Definition 11 (Machine epsilon)

The floating-point representation, $fl(x)$, of $x$ can be expressed as

$$fl(x) = x(1 + \delta), \quad |\delta| \le \varepsilon_M, \tag{1}$$

where $\varepsilon_M \equiv 2^{-t}$ is referred to as the unit roundoff error or machine epsilon.

Single precision IEEE standard floating-point format

The mantissa $f$ corresponds to 23 binary digits (i.e., $t = 23$), so the machine epsilon is

$$\varepsilon_M = 2^{-23} \approx 1.192 \times 10^{-7}.$$

This approximately corresponds to 7 accurate decimal digits.

Double precision IEEE standard floating-point format

The mantissa $f$ corresponds to 52 binary digits (i.e., $t = 52$), so the machine epsilon is

$$\varepsilon_M = 2^{-52} \approx 2.220 \times 10^{-16},$$

which provides between 15 and 16 decimal digits of accuracy.

Summary of IEEE standard floating-point format

| | single precision | double precision |
|---|---|---|
| $\varepsilon_M$ | $1.192 \times 10^{-7}$ | $2.220 \times 10^{-16}$ |
| smallest positive number | $1.175 \times 10^{-38}$ | $2.225 \times 10^{-308}$ |
| largest number | $3.403 \times 10^{38}$ | $1.798 \times 10^{308}$ |
| decimal precision | 7 | 16 |

Let $\odot$ stand for any one of the four basic arithmetic operators $+, -, \times, \div$.

Whenever two machine numbers $x$ and $y$ are to be combined arithmetically, the computer will produce $fl(x \odot y)$ instead of $x \odot y$.

Under (1), the relative error of $fl(x \odot y)$ satisfies

$$fl(x \odot y) = (x \odot y)(1 + \delta), \quad |\delta| \le \varepsilon_M, \tag{2}$$

where $\varepsilon_M$ is the unit roundoff.

But if $x$, $y$ are not machine numbers, then they must first be rounded to floating-point format before the arithmetic operation, and the computed result becomes

$$fl(fl(x) \odot fl(y)) = \bigl(x(1 + \delta_1) \odot y(1 + \delta_2)\bigr)(1 + \delta_3),$$

where $|\delta_i| \le \varepsilon_M$, $i = 1, 2, 3$.
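
The model (2) can be observed on real hardware by comparing each floating-point result with the exact rational result. This is a sketch under the assumption that Python's `float` is IEEE double precision with correctly rounded basic operations; the names `OPS`, `EPS`, and `delta` are introduced only for this illustration.

```python
import operator
import sys
from fractions import Fraction

EPS = Fraction(sys.float_info.epsilon)   # 2**-52 for IEEE double precision
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def delta(x: float, y: float, op: str) -> Fraction:
    """Return delta in fl(x op y) = (x op y)(1 + delta) for machine numbers x, y."""
    exact = OPS[op](Fraction(x), Fraction(y))   # exact rational arithmetic
    computed = Fraction(OPS[op](x, y))          # what floating point returns
    return computed / exact - 1


x, y = 0.1, 0.2   # the stored doubles, i.e. fl(0.1) and fl(0.2)
for op in OPS:
    d = delta(x, y, op)
    print(op, float(d), abs(d) <= EPS)

# Rounding the decimal inputs themselves introduces delta_1 and delta_2.
delta_1 = Fraction(x) / Fraction("0.1") - 1
delta_2 = Fraction(y) / Fraction("0.2") - 1
print(float(delta_1), float(delta_2))
```

The decimal values 0.1 and 0.2 are not machine numbers, so `delta_1` and `delta_2` are nonzero, and each operation then contributes its own $\delta_3$.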

Example

Let $x = 0.54617$ and $y = 0.54601$. Using rounding and four-digit arithmetic:

$x^* = fl(x) = 0.5462$ is accurate to four significant digits since

$$\frac{|x - x^*|}{|x|} = \frac{0.00003}{0.54617} = 5.5 \times 10^{-5} \le 5 \times 10^{-4}.$$

$y^* = fl(y) = 0.5460$ is accurate to five significant digits since

$$\frac{|y - y^*|}{|y|} = \frac{0.00001}{0.54601} = 1.8 \times 10^{-5} \le 5 \times 10^{-5}.$$

The exact value of the subtraction is $r = x - y = 0.00016$.

But

$$r^* \equiv x \ominus y = fl(fl(x) - fl(y)) = 0.0002.$$

Since

$$\frac{|r - r^*|}{|r|} = 0.25 \le 5 \times 10^{-1},$$

the result has only one significant digit.

Loss of accuracy
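
The subtraction of $x = 0.54617$ and $y = 0.54601$ can be replayed with Python's `decimal` module, which supports a user-chosen number of significant digits. The context settings (`prec=4`, `ROUND_HALF_UP`) are an assumption standing in for "rounding and four-digit arithmetic".

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # four-digit rounding arithmetic

x, y = Decimal("0.54617"), Decimal("0.54601")
fx, fy = ctx.plus(x), ctx.plus(y)     # round the operands to four digits
r_star = ctx.subtract(fx, fy)         # computed difference fl(fl(x) - fl(y))
r = x - y                             # exact difference (default 28-digit context)
rel = abs(r - r_star) / abs(r)        # relative error of the computed difference

print(fx, fy, r_star, r, rel)
```

The operands are each accurate to four or five significant digits, yet their computed difference has a relative error of 0.25.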

Loss of Significance

One of the most common error-producing calculations involves the cancellation of significant digits due to the subtraction of nearly equal numbers or the addition of one very large number and one very small number.

Sometimes, loss of significance can be avoided by rewriting the mathematical formula.

Example 12

The quadratic formulas for computing the roots of $ax^2 + bx + c = 0$, when $a \neq 0$, are

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \quad \text{and} \quad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}.$$

Consider the quadratic equation $x^2 + 62.10x + 1 = 0$ and discuss the numerical results.
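
One way to explore this equation numerically is sketched below. It rests on assumptions not stated above: four-digit rounding arithmetic (simulated with a `decimal` context), and the rewritten formula $x_1 = -2c / (b + \sqrt{b^2 - 4ac})$, obtained by rationalizing the numerator, as one possible rearrangement that avoids the subtraction of nearly equal numbers.

```python
import math
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # assumed four-digit rounding
a, b, c = Decimal("1"), Decimal("62.10"), Decimal("1")

b2 = ctx.multiply(b, b)                                  # b^2 rounded to 4 digits
four_ac = ctx.multiply(ctx.multiply(Decimal(4), a), c)   # 4ac
root = ctx.sqrt(ctx.subtract(b2, four_ac))               # sqrt(b^2 - 4ac)
two_a = ctx.multiply(Decimal(2), a)

# Standard formula: -b and root are nearly equal in magnitude, so digits cancel.
x1_naive = ctx.divide(ctx.add(ctx.minus(b), root), two_a)

# Rationalized formula: the denominator adds two numbers of the same sign.
x1_stable = ctx.divide(ctx.multiply(Decimal(-2), c), ctx.add(b, root))

# Double precision reference value for comparison.
x1_ref = (-62.10 + math.sqrt(62.10 ** 2 - 4.0)) / 2.0

for name, value in [("naive", x1_naive), ("stable", x1_stable)]:
    rel = abs(float(value) - x1_ref) / abs(x1_ref)
    print(name, value, rel)
```

In this setting the standard formula returns $-0.02$ for the root near $-0.0161$, a relative error of roughly 24%, while the rationalized form returns $-0.01610$, which agrees with the reference to about three digits.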

From the document: Mathematical preliminaries and error analysis (pages 45-81)