師大

**Numerical analysis**

Tsung-Ming Huang

Department of Mathematics, National Taiwan Normal University, Taiwan

August 28, 2011


**Outline**

**1** **Round-off errors and computer arithmetic**
IEEE standard floating-point format
Absolute and Relative Errors
Machine Epsilon
Loss of Significance

**2** **Algorithms and Convergence**
Algorithm

Stability

Rate of convergence


**Example 1**

What is the binary representation of 2/3?

Solution: To determine the binary representation for 2/3, we write

2/3 = (0.a_{1}a_{2}a_{3} . . .)_{2}.

Multiplying by 2, we obtain

4/3 = (a_{1}.a_{2}a_{3} . . .)_{2}.

Therefore, we get a_{1} = 1 by taking the integer part of both sides.


Subtracting 1, we have

1/3 = (0.a_{2}a_{3}a_{4} . . .)_{2}.

Repeating the previous steps, we arrive at

2/3 = (0.101010 . . .)_{2}.
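The multiply-by-2 digit extraction of Example 1 can be sketched in a few lines of Python (an illustrative aside, not part of the original slides; the helper name is ours). `Fraction` keeps the arithmetic exact, so the digits are the true binary expansion:

```python
from fractions import Fraction

def binary_fraction_digits(x: Fraction, n: int) -> str:
    """Extract the first n binary fraction digits of 0 < x < 1 by repeated doubling."""
    digits = []
    for _ in range(n):
        x *= 2                 # shift the binary point one place right
        a = int(x)             # the integer part is the next digit a_k
        digits.append(str(a))
        x -= a                 # subtract the digit and repeat
    return "0." + "".join(digits)

print(binary_fraction_digits(Fraction(2, 3), 8))  # 0.10101010
```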


In the computational world, each representable number has only a fixed and finite number of digits.

For any real number x, let

x = ±1.a_{1}a_{2} · · · a_{t}a_{t+1}a_{t+2} · · · × 2^{m}

denote the normalized scientific binary representation of x.

In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985. In this report, formats were specified for single, double, and extended precisions, and these standards are generally followed by microcomputer manufacturers using floating-point hardware.


**Single precision**

The single precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number ±q × 2^{m}, as shown in the following figure.

[Figure: 32-bit layout — sign bit (bit 0), 8-bit exponent (bits 1–8), 23-bit normalized mantissa (bits 9–31)]

The first bit is a sign indicator, denoted s. This is followed by an 8-bit exponent c and a 23-bit mantissa f.

The base for the exponent and mantissa is 2, and the actual exponent is c − 127. The value of c is restricted by the inequality 0 ≤ c ≤ 255.


The actual exponent of the number is therefore restricted by the inequality −127 ≤ c − 127 ≤ 128.

A normalization is imposed that requires the leading digit in the fraction to be 1, and this digit is not stored as part of the 23-bit mantissa.

Using this system gives a floating-point number of the form
(−1)^{s}2^{c−127}(1 + f ).


**Example 2**

What is the decimal number of the machine number 01000000101000000000000000000000?

**1** The leftmost bit is zero, which indicates that the number is
positive.

**2** The next 8 bits, 10000001, are equivalent to
c = 1 · 2^{7}+ 0 · 2^{6}+ · · · + 0 · 2^{1}+ 1 · 2^{0}= 129.

The exponential part of the number is 2^{129−127}= 2^{2}.

**3** The final 23 bits specify that the mantissa is

f = 0 · (2)^{−1}+ 1 · (2)^{−2}+ 0 · (2)^{−3}+ · · · + 0 · (2)^{−23}= 0.25.

**4** Consequently, this machine number precisely represents
the decimal number

(−1)^{s}2^{c−127}(1 + f ) = 2^{2}· (1 + 0.25) = 5.
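The decoding in Example 2 can be reproduced with Python's standard `struct` module (a sketch, not the lecture's code; the helper name is ours). The 32-bit string is converted to bytes and reinterpreted as an IEEE 754 single-precision number:

```python
import struct

def decode_single(bits: str) -> float:
    """Interpret a 32-character bit string as an IEEE 754 single-precision number."""
    as_int = int(bits, 2)                                  # bit pattern as an integer
    return struct.unpack('>f', as_int.to_bytes(4, 'big'))[0]

print(decode_single('01000000101000000000000000000000'))   # 5.0
```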


**Example 3**

What is the decimal number of the machine number 01000000100111111111111111111111?

**1** The final 23 bits specify that the mantissa is

f = 0 · (2)^{−1}+ 0 · (2)^{−2}+ 1 · (2)^{−3}+ · · · + 1 · (2)^{−23}

= 0.2499998807907105.

**2** Consequently, this machine number precisely represents
the decimal number

(−1)^{s}2^{c−127}(1 + f ) = 2^{2}· (1 + 0.2499998807907105)

= 4.999999523162842.


**Example 4**

What is the decimal number of the machine number 01000000101000000000000000000001?

**1** The final 23 bits specify that the mantissa is

f = 0 · 2^{−1}+ 1 · 2^{−2}+ 0 · 2^{−3}+ · · · + 0 · 2^{−22}+ 1 · 2^{−23}

= 0.2500001192092896.

**2** Consequently, this machine number precisely represents
the decimal number

(−1)^{s}2^{c−127}(1 + f ) = 2^{2}· (1 + 0.2500001192092896)

= 5.000000476837158.


**Summary**

**Above three examples**

01000000100111111111111111111111 ⇒ 4.999999523162842
01000000101000000000000000000000 ⇒ 5
01000000101000000000000000000001 ⇒ 5.000000476837158

Only a relatively small subset of the real number system is used for the representation of all the real numbers.

This subset, whose elements are called the floating-point numbers, contains only rational numbers, both positive and negative.

When a number cannot be represented exactly with the fixed finite number of digits in a computer, a nearby floating-point number is chosen for approximate representation.


**The smallest positive number**

Let s = 0, c = 1, and f = 0, which is equivalent to

2^{−126} · (1 + 0) ≈ 1.175 × 10^{−38}.

**The largest number**

Let s = 0, c = 254, and f = 1 − 2^{−23}, which is equivalent to

2^{127} · (2 − 2^{−23}) ≈ 3.403 × 10^{38}.

**Definition 5**

If a number x satisfies |x| < 2^{−126} · (1 + 0), then we say that an underflow has occurred; such a number is generally set to zero.

If |x| > 2^{127} · (2 − 2^{−23}), then we say that an overflow has occurred.


**Double precision**

A floating point number in double precision IEEE standard format uses two words (64 bits) to store the number as shown in the following figure.

[Figure: 64-bit layout — sign bit (bit 0), 11-bit exponent (bits 1–11), 52-bit mantissa (bits 12–63)]

The first bit is a sign indicator, denoted s. This is followed by an 11-bit exponent c and a 52-bit mantissa f.

The actual exponent is c − 1023.


**Format of floating-point number**

(−1)^{s} × (1 + f) × 2^{c−1023}

**The smallest positive number**

Let s = 0, c = 1, and f = 0, which is equivalent to

2^{−1022} · (1 + 0) ≈ 2.225 × 10^{−308}.

**The largest number**

Let s = 0, c = 2046, and f = 1 − 2^{−52}, which is equivalent to

2^{1023} · (2 − 2^{−52}) ≈ 1.798 × 10^{308}.
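These double-precision limits can be confirmed at the interpreter prompt in Python, whose `float` is IEEE 754 binary64 on virtually all platforms (an illustrative check, not part of the slides):

```python
import sys

# Double-precision limits as reported by the running interpreter
print(sys.float_info.min)   # smallest positive normalized number, 2^-1022
print(sys.float_info.max)   # largest finite number, (2 - 2^-52) * 2^1023
```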


**Chopping and rounding**

For any real number x, let

x = ±1.a_{1}a_{2} · · · a_{t}a_{t+1}a_{t+2} · · · × 2^{m}

denote the normalized scientific binary representation of x.

**1** **chopping**: simply discard the excess bits a_{t+1}, a_{t+2}, . . . to obtain

f l(x) = ±1.a_{1}a_{2} · · · a_{t} × 2^{m}.

**2** **rounding**: add 2^{−(t+1)} × 2^{m} to x and then chop the excess bits to obtain a number of the form

f l(x) = ±1.δ_{1}δ_{2} · · · δ_{t} × 2^{m}.

In this method, if a_{t+1} = 1, we add 1 to a_{t} to obtain f l(x), and if a_{t+1} = 0, we merely chop off all but the first t digits.
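The two procedures can be sketched in Python (a minimal sketch under the definitions above; the function names are ours, and carries across a power of 2 are handled by recomputing the exponent inside `chop`):

```python
import math

def chop(x: float, t: int) -> float:
    """Keep t fractional bits of the normalized binary mantissa; discard the rest."""
    if x == 0.0:
        return 0.0
    m = math.floor(math.log2(abs(x)))       # exponent so that |x| = 1.a1a2... * 2^m
    mant = abs(x) / 2.0 ** m                # mantissa in [1, 2)
    mant = math.floor(mant * 2 ** t) / 2 ** t   # drop bits a_{t+1}, a_{t+2}, ...
    return math.copysign(mant * 2.0 ** m, x)

def round_fl(x: float, t: int) -> float:
    """Add 2^{-(t+1)} * 2^m to |x|, then chop (the rounding rule from the text)."""
    if x == 0.0:
        return 0.0
    m = math.floor(math.log2(abs(x)))
    return chop(math.copysign(abs(x) + 2.0 ** (m - t - 1), x), t)

# 2/3 = (1.0101...)_2 * 2^-1; with t = 3 bits, a_4 = 1, so rounding bumps a_3
print(chop(2/3, 3), round_fl(2/3, 3))  # 0.625 0.6875
```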


**Definition 6 (Roundoff error)**

The error that results from replacing a number with its floating-point form is called roundoff error or rounding error.

**Definition 7 (Absolute Error and Relative Error)**

If x^{∗} is an approximation to the exact value x, the absolute error is |x − x^{∗}| and the relative error is |x − x^{∗}|/|x|, provided that x ≠ 0.

**Example 8**

(a) If x = 0.3000 × 10^{−3} and x^{∗} = 0.3100 × 10^{−3}, then the absolute error is 0.1 × 10^{−4} and the relative error is 0.3333 × 10^{−1}.

(b) If x = 0.3000 × 10^{4} and x^{∗} = 0.3100 × 10^{4}, then the absolute error is 0.1 × 10^{3} and the relative error is 0.3333 × 10^{−1}.


**Remark 1**

As a measure of accuracy, the absolute error can be misleading; the relative error is usually more meaningful.

**Definition 9**

The number x^{∗} is said to approximate x to t significant digits if t is the largest nonnegative integer for which

|x − x^{∗}|/|x| ≤ 5 × 10^{−t}.


If the floating-point representation f l(x) for the number x is obtained by using t digits and the chopping procedure, then the relative error is

|x − f l(x)| / |x| = |0.00 · · · 0a_{t+1}a_{t+2} · · · × 2^{m}| / |1.a_{1}a_{2} · · · a_{t}a_{t+1}a_{t+2} · · · × 2^{m}|
                   = |0.a_{t+1}a_{t+2} · · · | / |1.a_{1}a_{2} · · · a_{t}a_{t+1}a_{t+2} · · · | × 2^{−t}.

The minimal value of the denominator is 1, and the numerator is bounded above by 1. As a consequence,

|x − f l(x)| / |x| ≤ 2^{−t}.


If t-digit rounding arithmetic is used and

a_{t+1} = 0, then f l(x) = ±1.a_{1}a_{2} · · · a_{t} × 2^{m}. A bound for the relative error is

|x − f l(x)| / |x| = |0.a_{t+1}a_{t+2} · · · | / |1.a_{1}a_{2} · · · a_{t}a_{t+1}a_{t+2} · · · | × 2^{−t} ≤ 2^{−(t+1)},

since the numerator is bounded above by 1/2 due to a_{t+1} = 0.

a_{t+1} = 1, then f l(x) = ±(1.a_{1}a_{2} · · · a_{t} + 2^{−t}) × 2^{m}. The upper bound for the relative error becomes

|x − f l(x)| / |x| = |1 − 0.a_{t+1}a_{t+2} · · · | / |1.a_{1}a_{2} · · · a_{t}a_{t+1}a_{t+2} · · · | × 2^{−t} ≤ 2^{−(t+1)},

since the numerator is bounded above by 1/2 due to a_{t+1} = 1.

Therefore the relative error for rounding arithmetic is

|x − f l(x)| / |x| ≤ 2^{−(t+1)} = (1/2) × 2^{−t}.


**Definition 10 (Machine epsilon)**

The floating-point representation, f l(x), of x can be expressed as

f l(x) = x(1 + δ), |δ| ≤ ε_{M}, (1)

where ε_{M} ≡ 2^{−t} is referred to as the unit roundoff error or machine epsilon.

**Single precision IEEE standard floating-point format**

The mantissa f corresponds to 23 binary digits (i.e., t = 23), so the machine epsilon is

ε_{M} = 2^{−23} ≈ 1.192 × 10^{−7}.

This corresponds to approximately 6 accurate decimal digits.


**Double precision IEEE standard floating-point format**

The mantissa f corresponds to 52 binary digits (i.e., t = 52), so the machine epsilon is

ε_{M} = 2^{−52} ≈ 2.220 × 10^{−16},

which provides between 15 and 16 decimal digits of accuracy.

**Summary of IEEE standard floating-point format**

|                          | single precision | double precision  |
|--------------------------|------------------|-------------------|
| ε_{M}                    | 1.192 × 10^{−7}  | 2.220 × 10^{−16}  |
| smallest positive number | 1.175 × 10^{−38} | 2.225 × 10^{−308} |
| largest number           | 3.403 × 10^{38}  | 1.798 × 10^{308}  |
| decimal precision        | 6                | 15                |


Let ⊙ stand for any one of the four basic arithmetic operators +, −, ×, ÷.

Whenever two machine numbers x and y are to be combined arithmetically, the computer will produce f l(x ⊙ y) instead of x ⊙ y. Under (1), the relative error of f l(x ⊙ y) satisfies

f l(x ⊙ y) = (x ⊙ y)(1 + δ), |δ| ≤ ε_{M}, (2)

where ε_{M} is the unit roundoff.

But if x, y are not machine numbers, then they must first be rounded to floating-point format before the arithmetic operation, and the resulting relative error becomes

f l(f l(x) ⊙ f l(y)) = (x(1 + δ_{1}) ⊙ y(1 + δ_{2}))(1 + δ_{3}),

where |δ_{i}| ≤ ε_{M}, i = 1, 2, 3.


**Example**

Let x = 0.54617 and y = 0.54601. Using rounding and four-digit arithmetic, then

x^{∗} = f l(x) = 0.5462 is accurate to four significant digits since

|x − x^{∗}| / |x| = 0.00003 / 0.54617 = 5.5 × 10^{−5} ≤ 5 × 10^{−4};

y^{∗} = f l(y) = 0.5460 is accurate to five significant digits since

|y − y^{∗}| / |y| = 0.00001 / 0.54601 = 1.8 × 10^{−5} ≤ 5 × 10^{−5}.


The exact value of the subtraction is r = x − y = 0.00016, but

r^{∗} ≡ f l(f l(x) − f l(y)) = 0.0002.

Since

|r − r^{∗}| / |r| = 0.25 ≤ 5 × 10^{−1},

the result has only one significant digit: a loss of accuracy.
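This four-digit experiment can be replayed with Python's `decimal` module by setting the context precision to 4 (an illustration, not the lecture's code):

```python
from decimal import Decimal, getcontext

getcontext().prec = 4                       # four significant digits
x, y = Decimal('0.54617'), Decimal('0.54601')
fx, fy = +x, +y                             # unary + rounds to the current precision
print(fx, fy)                               # fl(x) = 0.5462, fl(y) = 0.5460
print(fx - fy)                              # 0.0002, versus the exact 0.00016
```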


**Loss of Significance**

One of the most common error-producing calculations involves the cancellation of significant digits due to the subtraction of nearly equal numbers or the addition of one very large number and one very small number.

Sometimes, loss of significance can be avoided by rewriting the mathematical formula.

**Example 11**

The quadratic formula for computing the roots of ax^{2} + bx + c = 0, when a ≠ 0, is

x_{1} = (−b + √(b^{2} − 4ac)) / (2a) and x_{2} = (−b − √(b^{2} − 4ac)) / (2a).

Consider the quadratic equation x^{2} + 62.10x + 1 = 0 and discuss the numerical results.


**Solution**

Using the quadratic formula and 8-digit rounding arithmetic, one can obtain

x_{1} = −0.01610723 and x_{2} = −62.08390.

Now we perform the calculations with 4-digit rounding arithmetic. First we have

√(b^{2} − 4ac) = √(62.10^{2} − 4.000) = √(3856 − 4.000) = 62.06,

and

f l(x_{1}) = (−62.10 + 62.06) / 2.000 = −0.04000 / 2.000 = −0.02000,

so

|f l(x_{1}) − x_{1}| / |x_{1}| = |−0.02000 + 0.01610723| / |−0.01610723| ≈ 0.2417 ≤ 5 × 10^{−1}.


In calculating x_{2},

f l(x_{2}) = (−62.10 − 62.06) / 2.000 = −124.2 / 2.000 = −62.10,

|f l(x_{2}) − x_{2}| / |x_{2}| = |−62.10 + 62.08390| / |−62.08390| ≈ 0.259 × 10^{−3} ≤ 5 × 10^{−4}.

In this equation, b^{2} = 62.10^{2} is much larger than 4ac = 4. Hence b and √(b^{2} − 4ac) become two nearly equal numbers, and the calculation of x_{1} involves the subtraction of two nearly equal numbers.

To obtain a more accurate 4-digit rounding approximation for x_{1}, we change the formulation by rationalizing the numerator, that is,

x_{1} = −2c / (b + √(b^{2} − 4ac)).


Then

f l(x_{1}) = −2.000 / (62.10 + 62.06) = −2.000 / 124.2 = −0.01610.

The relative error in computing x_{1} is now reduced to 6.2 × 10^{−4}.
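Rationalizing the numerator generalizes to a standard cancellation-free recipe: compute the larger-magnitude root first (no cancellation there), then recover the other from the product x_{1}x_{2} = c/a. A sketch in double precision (the function name is ours, not the slides'):

```python
import math

def stable_roots(a: float, b: float, c: float):
    """Quadratic roots avoiding cancellation: the large-magnitude root is
    computed directly; the small one comes from x_big * x_small = c/a."""
    disc = math.sqrt(b * b - 4.0 * a * c)
    q = -(b + math.copysign(disc, b)) / 2.0   # sign chosen so b and disc add, never cancel
    return q / a, c / q                       # (x_big, x_small)

print(stable_roots(1.0, 62.10, 1.0))
```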

**Example 12**
Let

p(x) = x^{3}− 3x^{2}+ 3x − 1,
q(x) = ((x − 3)x + 3)x − 1.

Compare the function values at x = 2.19 using three-digit arithmetic.


**Solution**

Use 3-digit rounding arithmetic for p(2.19) and q(2.19):

p̂(2.19) = ((2.19^{3} − 3 × 2.19^{2}) + 3 × 2.19) − 1
         = ((10.5 − 14.4) + 3 × 2.19) − 1
         = (−3.9 + 6.57) − 1
         = 2.67 − 1 = 1.67,

and

q̂(2.19) = ((2.19 − 3) × 2.19 + 3) × 2.19 − 1
         = (−0.81 × 2.19 + 3) × 2.19 − 1
         = (−1.77 + 3) × 2.19 − 1
         = 1.23 × 2.19 − 1
         = 2.69 − 1 = 1.69.


With more digits, one can have p(2.19) = q(2.19) = 1.685159, so

|p(2.19) − p̂(2.19)| = 0.015159 and |q(2.19) − q̂(2.19)| = 0.004841,

respectively. The nested form q(x) is better than p(x).
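In full double precision the two forms agree to many digits; the point of Example 12 is that the nested (Horner) form uses fewer operations, so low-precision arithmetic accumulates less round-off. A quick check (our code, not the slides'):

```python
def p(x: float) -> float:
    return x**3 - 3*x**2 + 3*x - 1

def q(x: float) -> float:
    # Horner (nested) form of the same polynomial
    return ((x - 3)*x + 3)*x - 1

print(p(2.19), q(2.19))  # both close to 1.685159
```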


**Definition 13 (Algorithm)**

Analgorithmis a procedure that describes a finite sequence of steps to be performed in a specified order.

**Example 14**

Give an algorithm to compute ∑_{i=1}^{n} x_{i}, where n and x_{1}, x_{2}, . . . , x_{n} are given.

**Algorithm**

INPUT n, x_{1}, x_{2}, . . . , x_{n}.
OUTPUT SUM = ∑_{i=1}^{n} x_{i}.
Step 1. Set SUM = 0. (Initialize accumulator.)
Step 2. For i = 1, 2, . . . , n do
        Set SUM = SUM + x_{i}. (Add the next term.)
Step 3. OUTPUT SUM;
        STOP.
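A direct transcription of the algorithm into Python (illustrative; the function name is ours):

```python
def summation(xs) -> float:
    """Sum the terms exactly as in the algorithm above."""
    total = 0.0          # Step 1: initialize the accumulator
    for x in xs:         # Step 2: loop over the terms
        total += x       #         add the next term
    return total         # Step 3: output SUM

print(summation([1.0, 2.0, 3.0]))  # 6.0
```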


**Definition 15 (Stable)**

An algorithm is called stable if small changes in the initial data of the algorithm produce correspondingly small changes in the final results.

**Definition 16 (Unstable)**

An algorithm is unstable if small errors made at one stage of the algorithm are magnified and propagated in subsequent stages and seriously degrade the accuracy of the overall calculation.

**Remark**

Whether an algorithm is stable or unstable should be decided on the basis of relative error.


**Example 17**

Consider the following recurrence algorithm

x_{0} = 1, x_{1} = 1/3,
x_{n+1} = (13/3) x_{n} − (4/3) x_{n−1},

for computing the sequence {x_{n} = (1/3)^{n}}. This algorithm is unstable.

A Matlab implementation of the recurrence algorithm gives the following result.


| n  | x_{n}           | x^{∗}_{n}      | RelErr     |
|----|-----------------|----------------|------------|
| 8  | 4.57247371e-04  | 4.57247371e-04 | 4.4359e-10 |
| 10 | 5.08052602e-05  | 5.08052634e-05 | 6.3878e-08 |
| 12 | 5.64497734e-06  | 5.64502927e-06 | 9.1984e-06 |
| 14 | 6.26394672e-07  | 6.27225474e-07 | 1.3246e-03 |
| 15 | 2.05751947e-07  | 2.09075158e-07 | 1.5895e-02 |
| 16 | 5.63988754e-08  | 6.96917194e-08 | 1.9074e-01 |
| 17 | -2.99408028e-08 | 2.32305731e-08 | 2.2889e+00 |

The error present in x_{n} is multiplied by 13/3 in computing x_{n+1}. For example, the error will be propagated with a factor of (13/3)^{14} in computing x_{15}. Additional roundoff errors in computing x_{2}, x_{3}, . . . may also be propagated and added to that of x_{15}.


**Matlab program**
```matlab
n = 30;
x = zeros(n,1);
x(1) = 1;
x(2) = 1/3;
for ii = 3:n
    x(ii) = 13/3 * x(ii-1) - 4/3 * x(ii-2);   % recurrence step
    xn = (1/3)^(ii-1);                        % exact value x*_{ii-1}
    RelErr = abs(xn - x(ii)) / xn;            % relative error
    fprintf('x(%2.0f) = %15.8e, x_ast(%2.0f) = %15.8e, RelErr(%2.0f) = %10.4e\n', ...
            ii, x(ii), ii, xn, ii, RelErr);
end
```
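The same experiment can be run in Python, whose `float` is the same binary64 format as Matlab's default double (an illustrative aside; the function name is ours). The parasitic solution 4^{n} of the recurrence amplifies the rounding error in f l(1/3):

```python
def unstable_seq(n: int):
    """x_{k+1} = (13/3) x_k - (4/3) x_{k-1}; in exact arithmetic x_k = (1/3)^k."""
    x = [1.0, 1.0 / 3.0]
    for _ in range(2, n + 1):
        x.append(13.0 / 3.0 * x[-1] - 4.0 / 3.0 * x[-2])
    return x

xs = unstable_seq(17)
# By n = 17 the computed term has even lost its sign (compare the table above).
print(xs[17], (1.0 / 3.0) ** 17)
```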


**Definition 18**

Suppose {β_{n}} → 0 and {x_{n}} → x^{∗}. If ∃ c > 0 and an integer N > 0 such that

|x_{n} − x^{∗}| ≤ c|β_{n}|, ∀ n ≥ N,

then we say {x_{n}} converges to x^{∗} with rate of convergence O(β_{n}), and write x_{n} = x^{∗} + O(β_{n}).

**Example 19**

Compare the convergence behavior of {x_{n}} and {y_{n}}, where

x_{n} = (n + 1)/n^{2} and y_{n} = (n + 3)/n^{3}.


**Solution:**

Note that both

lim_{n→∞} x_{n} = 0 and lim_{n→∞} y_{n} = 0.

Let α_{n} = 1/n and β_{n} = 1/n^{2}. Then

|x_{n} − 0| = (n + 1)/n^{2} ≤ (n + n)/n^{2} = 2/n = 2α_{n},
|y_{n} − 0| = (n + 3)/n^{3} ≤ (n + 3n)/n^{3} = 4/n^{2} = 4β_{n}.

Hence

x_{n} = 0 + O(1/n) and y_{n} = 0 + O(1/n^{2}).

This shows that {y_{n}} converges to 0 much faster than {x_{n}}.
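The bounds 2α_{n} and 4β_{n} can be observed numerically (a small check, not part of the slides):

```python
# x_n = (n+1)/n^2 behaves like 1/n, while y_n = (n+3)/n^3 behaves like 1/n^2
for n in (10, 100, 1000):
    x_n = (n + 1) / n**2
    y_n = (n + 3) / n**3
    print(f"n={n:5d}  x_n={x_n:.3e}  y_n={y_n:.3e}")
```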
