
# Mathematical preliminaries and error analysis


Tsung-Ming Huang

Department of Mathematics National Taiwan Normal University, Taiwan

September 12, 2015


### Outline

1. Round-off errors and computer arithmetic
   - IEEE standard floating-point format
   - Absolute and relative errors
   - Machine epsilon
   - Loss of significance
2. Algorithms and convergence
   - Algorithm
   - Stability
   - Rate of convergence

What is the difference between arithmetic in algebra and arithmetic on a computer?

1. In algebra, $256 + 1 = 257$ and $\left(\sqrt{256 + 1}\right)^2 = 257$.

2. On a computer (MATLAB):

   int8(256) + int8(1) = 127 ?

   int16(256) + int16(1) = 257

   sqrt(256+1)^2 = ? Is the result equal to 257 or not?

   (single(sqrt(5))+single(sqrt(3)))^2 - (sqrt(3)+sqrt(5))^2 = ?
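The integer results above can be checked directly. The following is a minimal sketch, assuming standard MATLAB integer semantics, in which values outside the range of an integer class saturate at the class limits instead of wrapping around. The variable names are illustrative only.

```matlab
% int8 holds integers in [-128, 127]; out-of-range values saturate.
a = int8(256);              % saturates to 127
b = a + int8(1);            % 127 + 1 saturates again, so b is 127
c = int16(256) + int16(1);  % int16 holds values up to 32767, so c is 257

% Mixing single and double operands gives a single result.
d = (single(sqrt(5)) + single(sqrt(3)))^2 - (sqrt(3) + sqrt(5))^2;

fprintf('b = %d, c = %d, class(d) = %s, d = %g\n', b, c, class(d), d);
```

Whether `d` prints as exactly zero depends on how the single-precision roundings fall, which is the point of the question on the slide.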

Example 1

Consider the following recurrence algorithm

$$x_0 = 1, \qquad x_1 = \frac{1}{3}, \qquad x_{n+1} = \frac{13}{3}x_n - \frac{4}{3}x_{n-1}$$

for computing the sequence $\left\{x_n = \left(\frac{1}{3}\right)^n\right\}$.

MATLAB program

    n = 30;
    x = zeros(n,1);        % x(k) stores x_{k-1}
    x(1) = 1;              % x_0
    x(2) = 1/3;            % x_1
    for ii = 3:n
        x(ii) = 13/3 * x(ii-1) - 4/3 * x(ii-2);   % recurrence
        xn = (1/3)^(ii-1);                        % exact value
        RelErr = abs(xn - x(ii)) / xn;
        fprintf(['x(%2.0f) = %15.8e, x_ast(%2.0f) = %14.8e, ', ...
                 'RelErr(%2.0f) = %11.4e\n'], ii, x(ii), ii, xn, ii, RelErr);
    end

Example 2

What is the binary representation of $\frac{2}{3}$?

Solution: To determine the binary representation of $\frac{2}{3}$, we write

$$\frac{2}{3} = (0.a_1a_2a_3\ldots)_2.$$

Multiply by 2 to obtain

$$\frac{4}{3} = (a_1.a_2a_3\ldots)_2.$$

Therefore, we get $a_1 = 1$ by taking the integer part of both sides. Subtracting 1, we have

$$\frac{1}{3} = (0.a_2a_3a_4\ldots)_2.$$

Repeating the previous step, we arrive at

$$\frac{2}{3} = (0.101010\ldots)_2.$$
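The multiply-by-2 procedure of Example 2 can be sketched as a short loop. This is only an illustration: it keeps the fraction as an integer numerator `p` over a denominator `q` (names chosen here, not taken from the slides) so that no rounding enters the digit extraction.

```matlab
p = 2; q = 3;          % the fraction p/q = 2/3, with 0 < p/q < 1
nbits = 8;             % number of binary digits to extract
bits = zeros(1, nbits);

for k = 1:nbits
    p = 2 * p;                 % multiply the fraction by 2
    bits(k) = floor(p / q);    % the integer part is the next bit a_k
    p = p - bits(k) * q;       % subtract the integer part
end

disp(bits)   % 1 0 1 0 1 0 1 0
```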

In the computational world, each representable number has only a fixed and finite number of digits.

For any real number $x$, let

$$x = \pm 1.a_1a_2\cdots a_t a_{t+1}a_{t+2}\cdots \times 2^m$$

denote the normalized scientific binary representation of $x$.

In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985. In this report, formats were specified for single, double, and extended precisions, and these standards are generally followed by microcomputer manufacturers using floating-point hardware.

### Single precision

The single precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number $\pm q \times 2^m$, as shown in the following figure.

[Figure: 32-bit single precision layout. Bit 0: sign of mantissa. Bits 1 to 8: exponent (8 bits). Bits 9 to 31: normalized mantissa (23 bits).]

The first bit is a sign indicator, denoted $s$. This is followed by an 8-bit exponent $c$ and a 23-bit mantissa $f$.

The base for the exponent and mantissa is 2, and the actual exponent is $c - 127$. The value of $c$ is restricted by the inequality $0 \le c \le 255$.

The actual exponent of the number is restricted by the inequality $-127 \le c - 127 \le 128$.

A normalization is imposed that requires the leading digit of the fraction to be 1, and this digit is not stored as part of the 23-bit mantissa.

Using this system gives a floating-point number of the form

$$(-1)^s\, 2^{c-127}\,(1 + f).$$

Example 3

What is the decimal number of the machine number 01000000101000000000000000000000?

1. The leftmost bit is zero, which indicates that the number is positive.

2. The next 8 bits, 10000001, are equivalent to

   $$c = 1\cdot 2^7 + 0\cdot 2^6 + \cdots + 0\cdot 2^1 + 1\cdot 2^0 = 129.$$

   The exponential part of the number is $2^{129-127} = 2^2$.

3. The final 23 bits specify that the mantissa is

   $$f = 0\cdot 2^{-1} + 1\cdot 2^{-2} + 0\cdot 2^{-3} + \cdots + 0\cdot 2^{-23} = 0.25.$$

4. Consequently, this machine number precisely represents the decimal number

   $$(-1)^s\, 2^{c-127}\,(1 + f) = 2^2\cdot(1 + 0.25) = 5.$$
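The decoding steps of Example 3 can be reproduced in a few lines. The sketch below assumes the bit string is given as a character array; the names `bits`, `fbits` and `check` are illustrative. The last step cross-checks the hand computation by reinterpreting the same 32 bits as a single-precision value with `typecast`.

```matlab
bits = '01000000101000000000000000000000';   % machine number from Example 3

s = bits(1) - '0';                 % sign bit
c = bin2dec(bits(2:9));            % 8-bit biased exponent
fbits = bits(10:32) - '0';         % 23 mantissa bits as 0/1 values
f = sum(fbits .* 2.^(-(1:23)));    % f = sum over k of a_k * 2^(-k)

value = (-1)^s * 2^(c - 127) * (1 + f);

% Cross-check: reinterpret the same 32 bits as a single.
check = typecast(uint32(bin2dec(bits)), 'single');

fprintf('s = %d, c = %d, f = %g, value = %g, typecast = %g\n', ...
        s, c, f, value, check);
```

Replacing `bits` with the strings from Examples 4 and 5 gives the two neighbouring machine numbers.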

Example 4

What is the decimal number of the machine number 01000000100111111111111111111111?

1. The final 23 bits specify that the mantissa is

   $$f = 0\cdot 2^{-1} + 0\cdot 2^{-2} + 1\cdot 2^{-3} + \cdots + 1\cdot 2^{-23} = 0.2499998807907105.$$

2. Consequently, this machine number precisely represents the decimal number

   $$(-1)^s\, 2^{c-127}\,(1 + f) = 2^2\cdot(1 + 0.2499998807907105) = 4.999999523162842.$$

Example 5

What is the decimal number of the machine number 01000000101000000000000000000001?

1. The final 23 bits specify that the mantissa is

   $$f = 0\cdot 2^{-1} + 1\cdot 2^{-2} + 0\cdot 2^{-3} + \cdots + 0\cdot 2^{-22} + 1\cdot 2^{-23} = 0.2500001192092896.$$

2. Consequently, this machine number precisely represents the decimal number

   $$(-1)^s\, 2^{c-127}\,(1 + f) = 2^2\cdot(1 + 0.2500001192092896) = 5.000000476837158.$$

### Summary

The three examples above give

| Machine number | Decimal value |
| --- | --- |
| 01000000100111111111111111111111 | 4.999999523162842 |
| 01000000101000000000000000000000 | 5 |
| 01000000101000000000000000000001 | 5.000000476837158 |

Only a relatively small subset of the real number system is used for the representation of all the real numbers.

This subset, whose elements are called the floating-point numbers, contains only rational numbers, both positive and negative.

When a number cannot be represented exactly with the fixed finite number of digits in a computer, a nearby floating-point number is chosen for approximate representation.

The smallest positive normalized number

Let $s = 0$, $c = 1$ and $f = 0$, which is equivalent to

$$2^{-126}\cdot(1 + 0) \approx 1.175 \times 10^{-38}.$$

The largest number

Let $s = 0$, $c = 254$ and $f = 1 - 2^{-23}$, which is equivalent to

$$2^{127}\cdot(2 - 2^{-23}) \approx 3.403 \times 10^{38}.$$

Definition 6

If a number $x$ satisfies $|x| < 2^{-126}\cdot(1 + 0)$, then we say that an underflow has occurred, and $x$ is generally set to zero.

If $|x| > 2^{127}\cdot(2 - 2^{-23})$, then we say that an overflow has occurred.
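These limits can be inspected with the built-in functions `realmin` and `realmax`. The sketch below is an illustration with made-up variable names. Note one assumption worth stating: IEEE hardware underflows gradually through subnormal numbers, so the product used here is deliberately chosen small enough to round all the way to zero.

```matlab
xmin = realmin('single');   % smallest positive normalized single, 2^-126
xmax = realmax('single');   % largest finite single, (2 - 2^-23) * 2^127

fprintf('realmin = %.4e, realmax = %.4e\n', xmin, xmax);

over  = xmax * single(2);       % exceeds the largest number: overflow to Inf
under = xmin * single(2^-30);   % 2^-156 is below every single, rounds to 0
```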

### Double precision

A floating-point number in double precision IEEE standard format uses two words (64 bits) to store the number, as shown in the following figure.

[Figure: 64-bit double precision layout. Bit 0: sign of mantissa. Bits 1 to 11: exponent (11 bits). Bits 12 to 63: normalized mantissa (52 bits).]

The first bit is a sign indicator, denoted $s$. This is followed by an 11-bit exponent $c$ and a 52-bit mantissa $f$.

The actual exponent is $c - 1023$.

Format of floating-point number

$$(-1)^s \times (1 + f) \times 2^{c-1023}$$

The smallest positive normalized number

Let $s = 0$, $c = 1$ and $f = 0$, which is equivalent to

$$2^{-1022}\cdot(1 + 0) \approx 2.225 \times 10^{-308}.$$

The largest number

Let $s = 0$, $c = 2046$ and $f = 1 - 2^{-52}$, which is equivalent to

$$2^{1023}\cdot(2 - 2^{-52}) \approx 1.798 \times 10^{308}.$$

### Chopping and rounding

For any real number $x$, let

$$x = \pm 1.a_1a_2\cdots a_t a_{t+1}a_{t+2}\cdots \times 2^m$$

denote the normalized scientific binary representation of $x$.

1. Chopping: simply discard the excess bits $a_{t+1}, a_{t+2}, \ldots$ to obtain

   $$fl(x) = \pm 1.a_1a_2\cdots a_t \times 2^m.$$

2. Rounding: add $2^{-(t+1)} \times 2^m$ to $|x|$ and then chop the excess bits to obtain a number of the form

   $$fl(x) = \pm 1.\delta_1\delta_2\cdots\delta_t \times 2^m.$$

   In this method, if $a_{t+1} = 1$, we add 1 to $a_t$ to obtain $fl(x)$, and if $a_{t+1} = 0$, we merely chop off all but the first $t$ digits.
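Chopping and rounding to $t$ mantissa bits can be simulated in double precision. The following is a rough sketch for a positive input only; the test value $2/3$ and the names `fr`, `q`, `fl_chop`, `fl_round` are assumptions made for the illustration. It uses the two-output form of `log2` to split the number into mantissa and exponent.

```matlab
x = 2/3;    % positive test value, 2/3 = (1.010101...)_2 x 2^(-1)
t = 3;      % number of mantissa bits kept after the binary point

[fr, e] = log2(x);   % x = fr * 2^e with 0.5 <= fr < 1
q = 2 * fr;          % normalized mantissa, 1 <= q < 2
m = e - 1;           % so that x = q * 2^m

fl_chop  = floor(q * 2^t) / 2^t * 2^m;         % discard bits a_(t+1), ...
fl_round = floor(q * 2^t + 0.5) / 2^t * 2^m;   % add 2^-(t+1), then chop

rel_chop  = abs(x - fl_chop)  / abs(x);
rel_round = abs(x - fl_round) / abs(x);
fprintf('chop: %g (rel. err. %g), round: %g (rel. err. %g)\n', ...
        fl_chop, rel_chop, fl_round, rel_round);
```

For this input the chopped value is $(1.010)_2 \times 2^{-1} = 0.625$ and the rounded value is $(1.011)_2 \times 2^{-1} = 0.6875$.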

Definition 7 (Roundoff error)

The error that results from replacing a number with its floating-point form is called roundoff error or rounding error.

Definition 8 (Absolute Error and Relative Error)

If $x^*$ is an approximation to the exact value $x$, the absolute error is $|x - x^*|$ and the relative error is $\dfrac{|x - x^*|}{|x|}$, provided that $x \neq 0$.

Example 9

(a) If $x = 0.3000 \times 10^{-3}$ and $x^* = 0.3100 \times 10^{-3}$, then the absolute error is $0.1 \times 10^{-4}$ and the relative error is $0.3333 \times 10^{-1}$.

(b) If $x = 0.3000 \times 10^{4}$ and $x^* = 0.3100 \times 10^{4}$, then the absolute error is $0.1 \times 10^{3}$ and the relative error is $0.3333 \times 10^{-1}$.

Remark 1

As a measure of accuracy, the absolute error may be misleading and the relative error more meaningful.

Definition 10

The number $x^*$ is said to approximate $x$ to $t$ significant digits if $t$ is the largest nonnegative integer for which

$$\frac{|x - x^*|}{|x|} \le 5 \times 10^{-t}.$$
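The two cases of Example 9 and the significant-digit count of Definition 10 can be evaluated together. This is a small sketch with illustrative variable names; the count uses the fact that the largest integer $t$ with relative error at most $5 \times 10^{-t}$ is $\lfloor \log_{10}(5/\text{relative error}) \rfloor$.

```matlab
x     = [0.3000e-3, 0.3000e4];   % exact values from Example 9
xstar = [0.3100e-3, 0.3100e4];   % their approximations

abs_err = abs(x - xstar);
rel_err = abs_err ./ abs(x);

% Largest t with rel_err <= 5 * 10^(-t).
t = floor(log10(5 ./ rel_err));

for k = 1:numel(x)
    fprintf('abs = %.4e, rel = %.4e, significant digits = %d\n', ...
            abs_err(k), rel_err(k), t(k));
end
```

The absolute errors differ by seven orders of magnitude while the relative errors agree, which is the content of Remark 1.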

If the floating-point representation $fl(x)$ for the number $x$ is obtained by using $t$ digits and the chopping procedure, then the relative error is

$$\frac{|x - fl(x)|}{|x|} = \frac{|0.00\cdots 0a_{t+1}a_{t+2}\cdots \times 2^m|}{|1.a_1a_2\cdots a_t a_{t+1}a_{t+2}\cdots \times 2^m|} = \frac{|0.a_{t+1}a_{t+2}\cdots|}{|1.a_1a_2\cdots a_t a_{t+1}a_{t+2}\cdots|} \times 2^{-t}.$$

The minimal value of the denominator is 1. The numerator is bounded above by 1. As a consequence,

$$\left|\frac{x - fl(x)}{x}\right| \le 2^{-t}.$$

If $t$-digit rounding arithmetic is used and

$a_{t+1} = 0$, then $fl(x) = \pm 1.a_1a_2\cdots a_t \times 2^m$. A bound for the relative error is

$$\frac{|x - fl(x)|}{|x|} = \frac{|0.a_{t+1}a_{t+2}\cdots|}{|1.a_1a_2\cdots a_t a_{t+1}a_{t+2}\cdots|} \times 2^{-t} \le 2^{-(t+1)},$$

since the numerator is bounded above by $\frac{1}{2}$ due to $a_{t+1} = 0$.

$a_{t+1} = 1$, then $fl(x) = \pm(1.a_1a_2\cdots a_t + 2^{-t}) \times 2^m$. The upper bound for the relative error becomes

$$\frac{|x - fl(x)|}{|x|} = \frac{|1 - 0.a_{t+1}a_{t+2}\cdots|}{|1.a_1a_2\cdots a_t a_{t+1}a_{t+2}\cdots|} \times 2^{-t} \le 2^{-(t+1)},$$

since the numerator is bounded by $\frac{1}{2}$ due to $a_{t+1} = 1$.

Therefore the relative error for rounding arithmetic is

$$\left|\frac{x - fl(x)}{x}\right| \le 2^{-(t+1)} = \frac{1}{2} \times 2^{-t}.$$

Definition 11 (Machine epsilon)

The floating-point representation, $fl(x)$, of $x$ can be expressed as

$$fl(x) = x(1 + \delta), \qquad |\delta| \le \varepsilon_M, \tag{1}$$

where $\varepsilon_M \equiv 2^{-t}$ is referred to as the unit roundoff error or machine epsilon.

Single precision IEEE standard floating-point format

The mantissa $f$ corresponds to 23 binary digits (i.e., $t = 23$), so the machine epsilon is

$$\varepsilon_M = 2^{-23} \approx 1.192 \times 10^{-7}.$$

This approximately corresponds to 7 accurate decimal digits.

Double precision IEEE standard floating-point format

The mantissa $f$ corresponds to 52 binary digits (i.e., $t = 52$), so the machine epsilon is

$$\varepsilon_M = 2^{-52} \approx 2.220 \times 10^{-16},$$

which provides between 15 and 16 decimal digits of accuracy.

Summary of IEEE standard floating-point format

| | single precision | double precision |
| --- | --- | --- |
| $\varepsilon_M$ | $1.192 \times 10^{-7}$ | $2.220 \times 10^{-16}$ |
| smallest positive normalized number | $1.175 \times 10^{-38}$ | $2.225 \times 10^{-308}$ |
| largest number | $3.403 \times 10^{38}$ | $1.798 \times 10^{308}$ |
| decimal precision | 7 | 16 |
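MATLAB exposes both machine epsilons through the built-in `eps`. The sketch below checks them against $2^{-23}$ and $2^{-52}$; the variable names and the two comparisons at the end are illustrative additions.

```matlab
eps_single = eps('single');   % spacing between 1 and the next single
eps_double = eps;             % spacing between 1 and the next double

fprintf('single: %.3e, double: %.3e\n', eps_single, eps_double);

% Adding half of epsilon to 1 is lost to rounding; adding epsilon is not.
lost = (1 + eps/2) == 1;
kept = (1 + eps) > 1;
```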

Let $\odot$ stand for any one of the four basic arithmetic operators $+$, $-$, $\times$, $\div$.

Whenever two machine numbers $x$ and $y$ are to be combined arithmetically, the computer will produce $fl(x \odot y)$ instead of $x \odot y$.

Under (1), the relative error of $fl(x \odot y)$ satisfies

$$fl(x \odot y) = (x \odot y)(1 + \delta), \qquad |\delta| \le \varepsilon_M, \tag{2}$$

where $\varepsilon_M$ is the unit roundoff.

But if $x$, $y$ are not machine numbers, then they must first be rounded to floating-point format before the arithmetic operation, and the computed result becomes

$$fl(fl(x) \odot fl(y)) = \bigl(x(1 + \delta_1) \odot y(1 + \delta_2)\bigr)(1 + \delta_3),$$

where $|\delta_i| \le \varepsilon_M$, $i = 1, 2, 3$.

### Example

Let $x = 0.54617$ and $y = 0.54601$. Using rounding and four-digit arithmetic:

$x^* = fl(x) = 0.5462$ is accurate to four significant digits since

$$\frac{|x - x^*|}{|x|} = \frac{0.00003}{0.54617} = 5.5 \times 10^{-5} \le 5 \times 10^{-4}.$$

$y^* = fl(y) = 0.5460$ is accurate to five significant digits since

$$\frac{|y - y^*|}{|y|} = \frac{0.00001}{0.54601} = 1.8 \times 10^{-5} \le 5 \times 10^{-5}.$$

The exact value of the subtraction is $r = x - y = 0.00016$.

But

$$r^* \equiv x \ominus y = fl(fl(x) - fl(y)) = 0.0002.$$

Since

$$\frac{|r - r^*|}{|r|} = 0.25 \le 5 \times 10^{-1},$$

the result has only one significant digit. This is a loss of accuracy.
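The four-digit subtraction can be imitated in MATLAB by rounding every intermediate result to four significant decimal digits. The helper `fl` below is a stand-in defined for this sketch, not something from the slides, and it assumes a release where `round` accepts the `'significant'` option (R2014b or later).

```matlab
fl = @(v) round(v, 4, 'significant');   % simulated 4-digit rounding

x = 0.54617;
y = 0.54601;

r_exact = x - y;              % close to 0.00016 in double precision
r_fl    = fl(fl(x) - fl(y));  % result of the 4-digit computation

rel_err = abs(r_exact - r_fl) / abs(r_exact);
fprintf('fl(x) = %.4f, fl(y) = %.4f, r_fl = %.4e, rel. err. = %.2f\n', ...
        fl(x), fl(y), r_fl, rel_err);
```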

Loss of Significance

One of the most common error-producing calculations involves the cancellation of significant digits due to the subtraction of nearly equal numbers or the addition of one very large number and one very small number.

Sometimes, loss of significance can be avoided by rewriting the mathematical formula.

Example 12

The quadratic formulas for computing the roots of $ax^2 + bx + c = 0$, when $a \neq 0$, are

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \quad \text{and} \quad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}.$$

Consider the quadratic equation $x^2 + 62.10x + 1 = 0$ and discuss the numerical results.

### Solution

Using the quadratic formula and 8-digit rounding arithmetic, one can obtain

$$x_1 = -0.01610723 \quad \text{and} \quad x_2 = -62.08390.$$

Now we perform the calculations with 4-digit rounding arithmetic. First we have

$$\sqrt{b^2 - 4ac} = \sqrt{62.10^2 - 4.000} = \sqrt{3856 - 4.000} = 62.06,$$

and

$$fl(x_1) = \frac{-62.10 + 62.06}{2.000} = \frac{-0.04000}{2.000} = -0.02000.$$

The relative error in computing $x_1$ is

$$\frac{|fl(x_1) - x_1|}{|x_1|} = \frac{|-0.02000 + 0.01610723|}{|-0.01610723|} \approx 0.2417 \le 5 \times 10^{-1}.$$

In calculating $x_2$,

$$fl(x_2) = \frac{-62.10 - 62.06}{2.000} = \frac{-124.2}{2.000} = -62.10,$$

and the relative error in computing $x_2$ is

$$\frac{|fl(x_2) - x_2|}{|x_2|} = \frac{|-62.10 + 62.08390|}{|-62.08390|} \approx 0.259 \times 10^{-3} \le 5 \times 10^{-4}.$$

In this equation, $b^2 = 62.10^2$ is much larger than $4ac = 4$. Hence $b$ and $\sqrt{b^2 - 4ac}$ become two nearly equal numbers. The calculation of $x_1$ involves the subtraction of two nearly equal numbers.

To obtain a more accurate 4-digit rounding approximation for $x_1$, we change the formulation by rationalizing the numerator, that is,

$$x_1 = \frac{-2c}{b + \sqrt{b^2 - 4ac}}.$$

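Then

$$fl(x_1) = \frac{-2.000}{62.10 + 62.06} = \frac{-2.000}{124.2} = -0.01610.$$

The relative error in computing $x_1$ is now reduced to about $4.5 \times 10^{-4}$, since $|-0.01610 + 0.01610723| / 0.01610723 \approx 4.5 \times 10^{-4}$.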
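Both formulas for $x_1$ can be compared under simulated 4-digit rounding. In the sketch below the helper `fl` (rounding to four significant digits after every operation) and the variable names are assumptions made for illustration, and `round(..., 'significant')` requires MATLAB R2014b or later. The double-precision value `x1_ref` serves only as a reference.

```matlab
fl = @(v) round(v, 4, 'significant');   % simulated 4-digit rounding

a = 1; b = 62.10; c = 1;
d = fl(sqrt(fl(fl(b^2) - fl(4*a*c))));  % 4-digit square root of discriminant

x1_naive  = fl(fl(-b + d) / fl(2*a));   % subtracts nearly equal numbers
x1_stable = fl(fl(-2*c) / fl(b + d));   % rationalized numerator

x1_ref = (-b + sqrt(b^2 - 4*a*c)) / (2*a);   % double precision reference

fprintf('naive:  %.5f, rel. err. %.2e\n', x1_naive,  abs(x1_naive  - x1_ref) / abs(x1_ref));
fprintf('stable: %.5f, rel. err. %.2e\n', x1_stable, abs(x1_stable - x1_ref) / abs(x1_ref));
```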

Example 13

Let

$$p(x) = x^3 - 3x^2 + 3x - 1, \qquad q(x) = ((x - 3)x + 3)x - 1.$$

Compare the function values at $x = 2.19$ using three-digit arithmetic.

### Solution

Use 3-digit rounding arithmetic for $p(2.19)$ and $q(2.19)$:

$$\begin{aligned}
\hat{p}(2.19) &= ((2.19^3 - 3 \times 2.19^2) + 3 \times 2.19) - 1 \\
&= ((10.5 - 14.4) + 3 \times 2.19) - 1 \\
&= (-3.9 + 6.57) - 1 \\
&= 2.67 - 1 = 1.67
\end{aligned}$$

and

$$\begin{aligned}
\hat{q}(2.19) &= ((2.19 - 3) \times 2.19 + 3) \times 2.19 - 1 \\
&= (-0.81 \times 2.19 + 3) \times 2.19 - 1 \\
&= (-1.77 + 3) \times 2.19 - 1 \\
&= 1.23 \times 2.19 - 1 \\
&= 2.69 - 1 = 1.69.
\end{aligned}$$

With more digits, one can have

$$p(2.19) = q(2.19) = 1.685159.$$

Hence the absolute errors are

$$|p(2.19) - \hat{p}(2.19)| = 0.015159 \quad \text{and} \quad |q(2.19) - \hat{q}(2.19)| = 0.004841,$$

respectively. One can observe that the evaluation formula $q(x)$ is better than $p(x)$.
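The comparison between the expanded and nested forms can be repeated with simulated 3-digit rounding. As before, `fl` is a helper introduced for this sketch (it rounds to three significant digits after each operation and needs `round(..., 'significant')`, available from R2014b). The reference value uses the identity $p(x) = q(x) = (x - 1)^3$.

```matlab
fl = @(v) round(v, 3, 'significant');   % simulated 3-digit rounding

x  = 2.19;
x2 = fl(x * x);       % 4.80
x3 = fl(x2 * x);      % 10.5

% Expanded form p(x) = x^3 - 3x^2 + 3x - 1, rounded after every operation.
p_hat = fl(fl(fl(x3 - fl(3 * x2)) + fl(3 * x)) - 1);

% Nested form q(x) = ((x - 3)x + 3)x - 1, rounded after every operation.
q_hat = fl(fl(fl(fl(fl(x - 3) * x) + 3) * x) - 1);

exact = (x - 1)^3;    % p(x) = q(x) = (x - 1)^3
fprintf('p_hat = %.2f, q_hat = %.2f, exact = %.6f\n', p_hat, q_hat, exact);
```

The nested form needs fewer multiplications, so fewer intermediate results are rounded.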

Exercise

Page 28: 4, 11, 12, 15, 18


Definition 14 (Algorithm)

An algorithm is a procedure that describes a finite sequence of steps to be performed in a specified order.

Example 15

Give an algorithm to compute $\sum_{i=1}^{n} x_i$, where $n$ and $x_1, x_2, \ldots, x_n$ are given.

Algorithm

INPUT $n, x_1, x_2, \ldots, x_n$.

OUTPUT $SUM = \sum_{i=1}^{n} x_i$.

Step 1. Set $SUM = 0$. (Initialize accumulator.)

Step 2. For $i = 1, 2, \ldots, n$ do: Set $SUM = SUM + x_i$. (Add the next term.)

Step 3. OUTPUT $SUM$; STOP
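The three steps translate almost line for line into MATLAB. The input vector below is sample data chosen for this sketch.

```matlab
x = [0.1, 0.2, 0.3, 0.4];   % sample input data (illustrative)
n = numel(x);

SUM = 0;                    % Step 1: initialize accumulator
for i = 1:n                 % Step 2: add the terms one at a time
    SUM = SUM + x(i);
end
fprintf('SUM = %g\n', SUM); % Step 3: output the result
```

In practice the built-in `sum(x)` does the same job; the explicit loop only mirrors the algorithm as stated.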

Definition 16 (Stable)

An algorithm is called stable if small changes in the initial data of the algorithm produce correspondingly small changes in the final results.

Definition 17 (Unstable)

An algorithm is unstable if small errors made at one stage of the algorithm are magnified and propagated in subsequent stages and seriously degrade the accuracy of the overall calculation.

Remark

Whether an algorithm is stable or unstable should be decided on the basis of relative error.

Example 18

Consider the following recurrence algorithm

$$x_0 = 1, \qquad x_1 = \frac{1}{3}, \qquad x_{n+1} = \frac{13}{3}x_n - \frac{4}{3}x_{n-1}$$

for computing the sequence $\left\{x_n = \left(\frac{1}{3}\right)^n\right\}$. This algorithm is unstable.

A MATLAB implementation of the recurrence algorithm gives the following result.

| n | computed x(n) | exact $(1/3)^{n-1}$ | RelErr |
| --- | --- | --- | --- |
| 8 | 4.57247371e-04 | 4.57247371e-04 | 4.4359e-10 |
| 10 | 5.08052602e-05 | 5.08052634e-05 | 6.3878e-08 |
| 12 | 5.64497734e-06 | 5.64502927e-06 | 9.1984e-06 |
| 14 | 6.26394672e-07 | 6.27225474e-07 | 1.3246e-03 |
| 15 | 2.05751947e-07 | 2.09075158e-07 | 1.5895e-02 |
| 16 | 5.63988754e-08 | 6.96917194e-08 | 1.9074e-01 |
| 17 | -2.99408028e-08 | 2.32305731e-08 | 2.289e+00 |
| 20 | -3.40210767e-06 | 8.60391597e-10 | 3.955e+03 |
| 23 | -2.17789924e-04 | 3.18663555e-11 | 6.835e+06 |
| 27 | -5.57542287e-02 | 3.93411796e-13 | 1.417e+11 |
| 30 | -3.56827064e+00 | 1.45708072e-14 | 2.449e+14 |

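For any constants $c_1$ and $c_2$,

$$x_n = c_1\left(\frac{1}{3}\right)^n + c_2\, 4^n$$

is a solution to the recursive equation

$$x_n = \frac{13}{3}x_{n-1} - \frac{4}{3}x_{n-2}$$

since

$$\begin{aligned}
\frac{13}{3}x_{n-1} - \frac{4}{3}x_{n-2}
&= \frac{13}{3}\left[c_1\left(\frac{1}{3}\right)^{n-1} + c_2\, 4^{n-1}\right] - \frac{4}{3}\left[c_1\left(\frac{1}{3}\right)^{n-2} + c_2\, 4^{n-2}\right] \\
&= c_1\left(\frac{1}{3}\right)^{n-2}\left(\frac{13}{3}\cdot\frac{1}{3} - \frac{4}{3}\right) + c_2\, 4^{n-2}\left(\frac{13}{3}\cdot 4 - \frac{4}{3}\right) \\
&= c_1\left(\frac{1}{3}\right)^{n} + c_2\, 4^{n} = x_n.
\end{aligned}$$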
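The two bases $\frac{1}{3}$ and $4$ in this general solution can be recovered by substituting the trial solution $x_n = r^n$ into the recurrence, a standard technique for linear recurrences with constant coefficients. The step is sketched here for completeness:

```latex
x_n = r^n
\;\Longrightarrow\;
r^2 - \frac{13}{3}\,r + \frac{4}{3} = 0
\;\Longleftrightarrow\;
3r^2 - 13r + 4 = 0
\;\Longleftrightarrow\;
(3r - 1)(r - 4) = 0,
\qquad r \in \left\{ \tfrac{1}{3},\; 4 \right\}.
```

The root $r = 4$ is the one responsible for the instability: any nonzero coefficient on $4^n$, however small, eventually dominates $(1/3)^n$.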

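Take $x_0 = 1$ and $x_1 = \frac{1}{3}$. This determines the unique values $c_1 = 1$ and $c_2 = 0$. Therefore,

$$x_n = \left(\frac{1}{3}\right)^n \quad \text{for all } n.$$

In computer arithmetic, $\hat{x}_0 = 1$ and $\hat{x}_1 = 0.33\cdots3$. The generated sequence $\{\hat{x}_n\}$ is then given by

$$\hat{x}_n = \hat{c}_1\left(\frac{1}{3}\right)^n + \hat{c}_2\, 4^n,$$

where $\hat{c}_1 \approx 1$ and $|\hat{c}_2| \approx \varepsilon$. Therefore, the round-off error is

$$x_n - \hat{x}_n = (1 - \hat{c}_1)\left(\frac{1}{3}\right)^n - \hat{c}_2\, 4^n,$$

which grows exponentially with $n$.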

MATLAB program

    n = 30;
    x = zeros(n,1);        % x(k) stores x_{k-1}
    x(1) = 1;              % x_0
    x(2) = 1/3;            % x_1
    for ii = 3:n
        x(ii) = 13/3 * x(ii-1) - 4/3 * x(ii-2);   % unstable recurrence
        xn = (1/3)^(ii-1);                        % exact value
        RelErr = abs(xn - x(ii)) / xn;
        fprintf(['x(%2.0f) = %20.8e, x_ast(%2.0f) = %20.8e, ', ...
                 'RelErr(%2.0f) = %14.4e\n'], ii, x(ii), ii, xn, ii, RelErr);
    end

Example 19

Consider the following recurrence algorithm

$$x_0 = 1, \qquad x_1 = \frac{1}{3}, \qquad x_{n+1} = 2x_n - x_{n-1}$$

for computing the sequence $\left\{x_n = 1 - \frac{2}{3}n\right\}$. This algorithm is stable.

For any constants $c_1$ and $c_2$,

$$x_n = c_1 + c_2\, n$$

is a solution to the recursive equation

$$x_n = 2x_{n-1} - x_{n-2}.$$

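Take $x_0 = 1$ and $x_1 = \frac{1}{3}$. This determines the unique values $c_1 = 1$ and $c_2 = -\frac{2}{3}$. Therefore,

$$x_n = 1 - \frac{2}{3}n \quad \text{for all } n.$$

In computer arithmetic, $\hat{x}_0 = 1$ and $\hat{x}_1 = 0.33\cdots3$. The generated sequence $\{\hat{x}_n\}$ is then given by

$$\hat{x}_n = \hat{c}_1 - \hat{c}_2\, n,$$

where $\hat{c}_1 \approx 1$ and $|\hat{c}_2| \approx \frac{2}{3}$. Therefore, the round-off error is

$$x_n - \hat{x}_n = (1 - \hat{c}_1) - \left(\frac{2}{3} - \hat{c}_2\right) n,$$

which grows linearly with $n$.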
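The stable recurrence can be run in the same way as the unstable one. The sketch below follows the structure of the earlier MATLAB program, with the relative errors collected in a vector `RelErr` (a name chosen here) so that only the largest one is printed.

```matlab
n = 30;
x = zeros(n, 1);      % x(k) stores x_{k-1}
x(1) = 1;             % x_0
x(2) = 1/3;           % x_1 (rounded in binary floating point)

RelErr = zeros(n, 1);
for ii = 3:n
    x(ii) = 2 * x(ii-1) - x(ii-2);          % stable recurrence
    xn = 1 - 2 * (ii - 1) / 3;              % exact x_{ii-1} = 1 - (2/3)(ii-1)
    RelErr(ii) = abs(xn - x(ii)) / abs(xn);
end
fprintf('largest relative error for n up to %d: %.3e\n', n, max(RelErr));
```

In contrast to the recurrence with the $4^n$ term, the relative error here should stay within a few orders of magnitude of the double-precision machine epsilon.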

Definition 20

Suppose $\{\beta_n\} \to 0$ and $\{x_n\} \to x$. If there exist $c > 0$ and an integer $N > 0$ such that

$$|x_n - x| \le c\,|\beta_n|, \qquad \forall\, n \ge N,$$

then we say $\{x_n\}$ converges to $x$ with rate of convergence $O(\beta_n)$, and write $x_n = x + O(\beta_n)$.

Example 21

Compare the convergence behavior of $\{x_n\}$ and $\{y_n\}$, where

$$x_n = \frac{n + 1}{n^2} \quad \text{and} \quad y_n = \frac{n + 3}{n^3}.$$

### Solution:

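Note that both

$$\lim_{n\to\infty} x_n = 0 \quad \text{and} \quad \lim_{n\to\infty} y_n = 0.$$

Let $\alpha_n = \frac{1}{n}$ and $\beta_n = \frac{1}{n^2}$. Then

$$|x_n - 0| = \frac{n + 1}{n^2} \le \frac{n + n}{n^2} = \frac{2}{n} = 2\alpha_n,$$

$$|y_n - 0| = \frac{n + 3}{n^3} \le \frac{n + 3n}{n^3} = \frac{4}{n^2} = 4\beta_n.$$

Hence

$$x_n = 0 + O\!\left(\frac{1}{n}\right) \quad \text{and} \quad y_n = 0 + O\!\left(\frac{1}{n^2}\right).$$

This shows that $\{y_n\}$ converges to 0 much faster than $\{x_n\}$.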
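The bounds from the solution can be checked numerically. This sketch, with an arbitrarily chosen range of $n$, computes the ratios $x_n/\alpha_n$ and $y_n/\beta_n$ and confirms that they stay below the constants 2 and 4.

```matlab
n = (1:1000)';
xn = (n + 1) ./ n.^2;
yn = (n + 3) ./ n.^3;
alpha = 1 ./ n;
beta  = 1 ./ n.^2;

% The ratios stay bounded by the constants c = 2 and c = 4 from the solution.
ratio_x = xn ./ alpha;
ratio_y = yn ./ beta;
fprintf('max x_n/alpha_n = %g, max y_n/beta_n = %g\n', max(ratio_x), max(ratio_y));
```

Both maxima are attained at $n = 1$, and both ratios tend to 1 as $n$ grows.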

Exercise

Page 39: 3.a, 6, 7, 11
