Computer Arithmetic

Numerical Analysis
Department of Mathematics, NTNU
Tsung-Min Hwang
September 14, 2003

1 Floating-Point Number and Roundoff Error

Normalized scientific notation for the decimal number system of x:

    x = ±r × 10^n,

where 1/10 ≤ r < 1, and n is an integer (positive, negative, or zero). r is called the mantissa and n is the exponent.

– The leading digit in the fraction is not zero.
– For example, 42.965 = 0.42965 × 10^2, −0.00234 = −0.234 × 10^{−2}.


Scientific notation for the binary number system of x:

    x = ±q × 2^m,

with 1/2 ≤ q < 1 and some integer m. For example,

    (1001.1101)_2 = 1 × 2^3 + 1 × 2^0 + 1 × 2^{−1} + 1 × 2^{−2} + 1 × 2^{−4}
                  = 0.10011101 × 2^4 = (9.8125)_{10}.

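As a quick numerical check of this conversion (ours, not part of the original notes), the same value can be reproduced in Python by summing the indicated powers of 2; the variable name q below is illustrative only.

```python
# Evaluate (1001.1101)_2 directly from its bits.
value = 1*2**3 + 0*2**2 + 0*2**1 + 1*2**0 + 1*2**-1 + 1*2**-2 + 0*2**-3 + 1*2**-4
print(value)                      # 9.8125

# The normalized form 0.10011101 x 2^4: read the mantissa bits as an integer,
# divide by 2^8 to place them after the binary point, then scale by 2^4.
q = int("10011101", 2) / 2**8     # q = 0.61328125, and 1/2 <= q < 1
print(q * 2**4)                   # 9.8125
```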

Example 1.1 What is the binary representation of 2/3?

Solution: To determine the binary representation for 2/3, we write

    2/3 = (0.a_1 a_2 a_3 . . .)_2.

Multiply by 2 to obtain

    4/3 = (a_1.a_2 a_3 . . .)_2.

Therefore, we get a_1 = 1 by taking the integer part of both sides. Subtracting 1, we have

    1/3 = (0.a_2 a_3 a_4 . . .)_2.

Repeating the previous step, we arrive at

    2/3 = (0.101010 . . .)_2.

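The repeated multiply-by-2 procedure of Example 1.1 is easy to mechanize. The sketch below is ours (the function name to_binary_fraction is illustrative, not from the notes); it uses exact rational arithmetic so the periodic pattern is visible.

```python
from fractions import Fraction

def to_binary_fraction(x, nbits=12):
    """Return the first nbits binary digits a_1 a_2 ... of 0 < x < 1
    using the multiply-by-2 procedure of Example 1.1."""
    x = Fraction(x)
    bits = []
    for _ in range(nbits):
        x *= 2              # shift the binary point one place to the right
        a_k = int(x)        # the integer part is the next digit a_k
        bits.append(str(a_k))
        x -= a_k            # drop the integer part and repeat
    return "0." + "".join(bits)

print(to_binary_fraction(Fraction(2, 3)))   # 0.101010101010
```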

Only a relatively small subset of the real number system is used for the representation of all the real numbers.

This subset, which is called the set of floating-point numbers, contains only rational numbers, both positive and negative.

When a number cannot be represented exactly with the fixed, finite number of digits in a computer, a nearby floating-point number is chosen for approximate representation.


For any real number x, let

    x = ±0.a_1 a_2 · · · a_t a_{t+1} a_{t+2} · · · × 2^m,   a_1 ≠ 0,

denote the normalized scientific binary representation of x.

a_1 ≠ 0, hence a_1 = 1.

– If x is within the numerical range of the machine, the floating-point form of x, denoted fl(x), is obtained by terminating the mantissa of x at t digits for some integer t.
– There are two ways of performing this termination.

1. chopping: simply discard the excess bits a_{t+1}, a_{t+2}, . . . to obtain

    fl(x) = ±0.a_1 a_2 · · · a_t × 2^m.

2. rounding up: add 2^{−(t+1)} × 2^m to x and then chop the excess bits to obtain a number of the form

    fl(x) = ±0.δ_1 δ_2 · · · δ_t × 2^m.

   In this method, if a_{t+1} = 1, we add 1 to a_t to obtain fl(x), and if a_{t+1} = 0, we merely chop off all but the first t digits.

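A minimal sketch of the two termination rules (ours, not from the notes), assuming x > 0 and t binary digits; math.frexp returns x = q · 2^m with 1/2 ≤ q < 1, exactly the normalization used above, and the helper names fl_chop and fl_round are illustrative.

```python
import math

def fl_chop(x, t):
    """Chop the normalized binary mantissa of x > 0 to t digits."""
    q, m = math.frexp(x)                 # x = q * 2**m with 0.5 <= q < 1
    q_t = math.floor(q * 2**t) / 2**t    # discard the bits a_{t+1}, a_{t+2}, ...
    return q_t * 2**m

def fl_round(x, t):
    """Round up: add 2**-(t+1) * 2**m to x, then chop the excess bits."""
    _, m = math.frexp(x)
    return fl_chop(x + 2**(-(t + 1)) * 2**m, t)

x = 2 / 3                  # binary 0.101010...
t = 4
print(fl_chop(x, t))       # 0.625  = (0.1010)_2, since a_5 is simply dropped
print(fl_round(x, t))      # 0.6875 = (0.1011)_2, since a_5 = 1 adds 1 to a_4
```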

Definition 1.1 (Roundoff error) The error that results from replacing a number with its floating-point form is called roundoff error or rounding error.

Definition 1.2 (Absolute Error and Relative Error) If x is an approximation to the exact value x⋆, the absolute error is |x⋆ − x| and the relative error is |x⋆ − x| / |x⋆|, provided that x⋆ ≠ 0.

Remark 1.1 As a measure of accuracy, the absolute error may be misleading and the relative error more meaningful.

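A tiny numeric illustration of Remark 1.1 (ours): the same absolute error can be harmless or severe depending on the magnitude of the exact value.

```python
# The same absolute error 0.1 against two exact values of very different size.
for x_star, x in [(10000.0, 10000.1), (0.2, 0.3)]:
    abs_err = abs(x_star - x)
    rel_err = abs_err / abs(x_star)
    print(f"x* = {x_star:>7}: absolute error {abs_err:.1f}, relative error {rel_err:.0e}")
```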

If the floating-point representation fl(x) for the number x is obtained by using t digits and chopping procedure, then the relative error is

    |x − fl(x)| / |x| = |0.00 · · · 0 a_{t+1} a_{t+2} · · · × 2^m| / |0.a_1 a_2 · · · a_t a_{t+1} a_{t+2} · · · × 2^m|
                      = |0.a_{t+1} a_{t+2} · · ·| / |0.a_1 a_2 · · · a_t a_{t+1} a_{t+2} · · ·| × 2^{−t}.

Since a_1 ≠ 0, the minimal value of the denominator is 1/2. The numerator is bounded above by 1. As a consequence,

    |x − fl(x)| / |x| ≤ 2^{−t+1}.


If t-digit rounding arithmetic is used:

– If a_{t+1} = 0, then fl(x) = ±0.a_1 a_2 · · · a_t × 2^m. A bound for the relative error is

    |x − fl(x)| / |x| = |0.a_{t+1} a_{t+2} · · ·| / |0.a_1 a_2 · · · a_t a_{t+1} a_{t+2} · · ·| × 2^{−t} ≤ 2^{−t},

  since the numerator is bounded above by 1/2.

– If a_{t+1} = 1, then fl(x) = ±(0.a_1 a_2 · · · a_t + 2^{−t}) × 2^m. The upper bound for the relative error becomes

    |x − fl(x)| / |x| = |1 − 0.a_{t+1} a_{t+2} · · ·| / |0.a_1 a_2 · · · a_t a_{t+1} a_{t+2} · · ·| × 2^{−t} ≤ 2^{−t},

  since the numerator is bounded by 1/2 due to a_{t+1} = 1.

Therefore the relative error for rounding arithmetic is

    |x − fl(x)| / |x| ≤ 2^{−t} = (1/2) × 2^{−t+1}.

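Both bounds can be checked numerically. The sketch below is ours and repeats the illustrative fl_chop / fl_round helpers so that it runs on its own; it samples random positive x and confirms that the observed relative errors stay below 2^{−t+1} for chopping and 2^{−t} for rounding.

```python
import math
import random

def fl_chop(x, t):
    q, m = math.frexp(x)                          # x = q * 2**m, 0.5 <= q < 1
    return math.floor(q * 2**t) / 2**t * 2**m

def fl_round(x, t):
    _, m = math.frexp(x)
    return fl_chop(x + 2**(-(t + 1)) * 2**m, t)

t = 8
random.seed(0)
xs = [random.uniform(0.01, 100.0) for _ in range(10_000)]

worst_chop = max(abs(x - fl_chop(x, t)) / abs(x) for x in xs)
worst_round = max(abs(x - fl_round(x, t)) / abs(x) for x in xs)

print(worst_chop, "<=", 2**(-t + 1))   # chopping bound 2^(-t+1)
print(worst_round, "<=", 2**(-t))      # rounding bound 2^(-t)
```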

The number ε_M ≡ 2^{−t+1} is referred to as the unit roundoff error or machine epsilon. The floating-point representation, fl(x), of x can be expressed as

    fl(x) = x(1 + δ),   |δ| ≤ ε_M.                                        (1)

In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985. In this report, formats were specified for single, double, and extended precisions, and these standards are generally followed by microcomputer manufacturers using floating-point hardware.

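Since Python floats follow the IEEE double precision format, ε_M can be observed directly with a short loop (an illustrative sketch, not from the notes): keep halving a candidate until adding it to 1 no longer changes the result.

```python
# Find the gap between 1.0 and the next representable floating-point number.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)              # 2.220446049250313e-16 = 2**-52, i.e. 2**(-t+1) with t = 53
print(1.0 + eps > 1.0)  # True: 1 + eps is the first float greater than 1
```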

The single precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number ±q × 2^m, as shown in Figure 1.

[Figure 1: 32-bit single precision — bit 0: sign of mantissa; bits 1–8: 8-bit exponent; bits 9–31: 23-bit normalized mantissa.]

The first bit is a sign indicator, denoted s. This is followed by an 8-bit exponent c and a 23-bit mantissa f.

The base for the exponent and mantissa is 2, and the actual exponent is c − 127. The value of c is restricted by the inequality 0 ≤ c ≤ 255.

The actual exponent of the number is restricted by the inequality −126 ≤ c − 127 ≤ 128.

A normalization is imposed that requires that the leading digit in the fraction be 1, and this digit is not stored as part of the 23-bit mantissa.

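The three fields s, c, f can be inspected directly from the bit pattern. This is an illustrative sketch (ours) using the standard struct module; it follows the 1.f convention implied by the actual exponent c − 127.

```python
import struct

x = 9.8125                                           # the example value from earlier
bits = int.from_bytes(struct.pack(">f", x), "big")   # 32-bit single precision pattern

s = bits >> 31                                # bit 0: sign
c = (bits >> 23) & 0xFF                       # bits 1-8: stored exponent
f = bits & 0x7FFFFF                           # bits 9-31: stored mantissa

# Reassemble the value as (-1)^s * (1.f)_2 * 2^(c - 127); the leading 1 is implicit.
value = (-1) ** s * (1 + f / 2**23) * 2.0 ** (c - 127)
print(s, c, c - 127, value)                   # 0 130 3 9.8125
```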

Since the mantissa f actually corresponds to 24 binary digits (i.e., precision t = 24), the machine epsilon is

    ε_M = 2^{−24+1} = 2^{−23} ≈ 1.192 × 10^{−7}.                          (2)

This approximately corresponds to 6 accurate decimal digits, and the first single precision floating-point number greater than 1 is 1 + 2^{−23}.

The largest number that can be represented by the single precision format is approximately 2^{128} ≈ 3.403 × 10^{38}, and the smallest positive number is 2^{−126} ≈ 1.175 × 10^{−38}.

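If NumPy is available, these single precision constants can be confirmed from its type metadata (a quick check, ours):

```python
import numpy as np

info = np.finfo(np.float32)
print(info.eps)    # 1.1920929e-07  = 2**-23, the machine epsilon in (2)
print(info.max)    # 3.4028235e+38  (approximately 2**128)
print(info.tiny)   # 1.1754944e-38  = 2**-126, smallest positive normalized number
```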

A floating-point number in double precision IEEE standard format uses two words (64 bits) to store the number, as shown in Figure 2.

[Figure 2: 64-bit double precision — bit 0: sign of mantissa; bits 1–11: 11-bit exponent; bits 12–63: 52-bit normalized mantissa.]

The first bit is a sign indicator, denoted s. This is followed by an 11-bit exponent c and a 52-bit mantissa f.

The actual exponent is c − 1023.

The machine epsilon is ε_M = 2^{−52} ≈ 2.220 × 10^{−16}, which provides between 15 and 16 decimal digits of accuracy.

The range is approximately 2^{−1022} ≈ 2.225 × 10^{−308} to 2^{1024} ≈ 1.798 × 10^{308}.

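The corresponding double precision constants are available in Python's standard library, since its float type is the IEEE 64-bit format (an illustrative check, ours):

```python
import sys

print(sys.float_info.mant_dig)  # 53 binary digits of precision (t = 53)
print(sys.float_info.epsilon)   # 2.220446049250313e-16   = 2**-52
print(sys.float_info.max)       # 1.7976931348623157e+308 (approximately 2**1024)
print(sys.float_info.min)       # 2.2250738585072014e-308 = 2**-1022, smallest normal
```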
