
Computer Arithmetic

Numerical Analysis
Department of Mathematics – NTNU
Tsung-Min Hwang, September 14, 2003

1 Floating-Point Number and Roundoff Error

Normalized scientific notation for the decimal number system of x:

    x = ±r × 10^n,  where 1/10 ≤ r < 1

and n is an integer (positive, negative, or zero). r is called the mantissa and n is the exponent.

– The leading digit in the fraction is not zero.
– For example, 42.965 = 0.42965 × 10^2 and −0.00234 = −0.234 × 10^(−2).


Scientific notation for the binary number system of x:

    x = ±q × 2^m,  with 1/2 ≤ q < 1 and some integer m.

For example,

    (1001.1101)_2 = 1 × 2^3 + 1 × 2^0 + 1 × 2^(−1) + 1 × 2^(−2) + 1 × 2^(−4)
                  = 0.10011101 × 2^4 = (9.8125)_10.
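This expansion can be checked directly by summing the indicated powers of 2. A minimal Python sketch (the bit string below is just (1001.1101)_2 written out; the helper code is ours, not part of the original notes):

    # Verify (1001.1101)_2 = 9.8125 by summing powers of 2.
    # Integer-part bits sit at exponents 3..0, fractional bits at -1, -2, -3, -4.
    bits = "1001.1101"
    int_part, frac_part = bits.split(".")

    value = sum(int(b) * 2**e
                for e, b in zip(range(len(int_part) - 1, -1, -1), int_part))
    value += sum(int(b) * 2**(-k) for k, b in enumerate(frac_part, start=1))

    print(value)  # 9.8125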


Example 1.1 What is the binary representation of 2/3?

Solution: To determine the binary representation of 2/3, we write

    2/3 = (0.a_1 a_2 a_3 ...)_2.

Multiply by 2 to obtain

    4/3 = (a_1.a_2 a_3 ...)_2.

Therefore, we get a_1 = 1 by taking the integer part of both sides. Subtracting 1, we have

    1/3 = (0.a_2 a_3 a_4 ...)_2.

Repeating the previous step, we arrive at

    2/3 = (0.101010...)_2.
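The digit-extraction step used in the example (multiply by 2, take the integer part as the next bit, subtract it, repeat) is easy to mechanize. A small Python sketch, using the exact Fraction type so no rounding interferes (an illustrative helper, not part of the original notes):

    from fractions import Fraction

    def binary_fraction_digits(x, n):
        """Return the first n binary digits a_1, ..., a_n of x, where 0 < x < 1."""
        digits = []
        for _ in range(n):
            x *= 2               # shift the next bit into the integer part
            bit = int(x)         # a_k is the integer part
            digits.append(bit)
            x -= bit             # drop it and continue with the remainder
        return digits

    print(binary_fraction_digits(Fraction(2, 3), 8))  # [1, 0, 1, 0, 1, 0, 1, 0]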


Only a relatively small subset of the real number system is used for the representation of all the real numbers.

This subset, which is called the set of floating-point numbers, contains only rational numbers, both positive and negative.

When a number cannot be represented exactly with the fixed finite number of digits in a computer, a nearby floating-point number is chosen as an approximate representation.


For any real number x, let

    x = ±0.a_1 a_2 ... a_t a_{t+1} a_{t+2} ... × 2^m,  a_1 ≠ 0,

denote the normalized scientific binary representation of x. Since a_1 ≠ 0, we have a_1 = 1.

– If x is within the numerical range of the machine, the floating-point form of x, denoted fl(x), is obtained by terminating the mantissa of x at t digits for some integer t.
– There are two ways of performing this termination.

1. Chopping: simply discard the excess bits a_{t+1}, a_{t+2}, ... to obtain

    fl(x) = ±0.a_1 a_2 ... a_t × 2^m.

2. Rounding: add 2^(−(t+1)) × 2^m to x and then chop the excess bits to obtain a number of the form

    fl(x) = ±0.δ_1 δ_2 ... δ_t × 2^m.

   In this method, if a_{t+1} = 1, we add 1 to a_t to obtain fl(x), and if a_{t+1} = 0, we merely chop off all but the first t digits.
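Both terminations can be simulated for a binary mantissa q in [1/2, 1) with exact rational arithmetic. The sketch below is a minimal illustration under that assumption; the helper names chop and round_fl are ours:

    from fractions import Fraction

    def chop(q, t):
        """Keep the first t bits of q in [1/2, 1): fl(q) = 0.a_1 ... a_t."""
        return Fraction(int(q * 2**t), 2**t)        # floor of q * 2^t, rescaled

    def round_fl(q, t):
        """Add 2^-(t+1) and then chop, as described above."""
        return chop(q + Fraction(1, 2**(t + 1)), t)

    q = Fraction(2, 3)           # 0.101010..._2
    print(chop(q, 4))            # 5/8   = (0.1010)_2
    print(round_fl(q, 4))        # 11/16 = (0.1011)_2

Chopping keeps (0.1010)_2 because a_5 is simply discarded, while rounding moves up to (0.1011)_2 because a_5 = 1.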


Definition 1.1 (Roundoff error) The error that results from replacing a number with its floating-point form is called roundoff error or rounding error.

Definition 1.2 (Absolute Error and Relative Error) If x is an approximation to the exact value x*, the absolute error is |x* − x| and the relative error is |x* − x| / |x*|, provided that x* ≠ 0.

Remark 1.1 As a measure of accuracy, the absolute error may be misleading and the relative error is more meaningful.
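As a quick numerical illustration of Definition 1.2 and Remark 1.1 (the sample numbers are ours): the same absolute error of 0.1 is harmless against 10^4 but catastrophic against 10^(−3).

    def abs_rel_error(x_star, x):
        """Absolute and relative error of the approximation x to the exact value x_star."""
        abs_err = abs(x_star - x)
        rel_err = abs_err / abs(x_star)   # requires x_star != 0
        return abs_err, rel_err

    # Same absolute error, very different relative errors.
    print(abs_rel_error(10000.0, 10000.1))   # (~0.1, ~1e-05)
    print(abs_rel_error(0.001, 0.101))       # (0.1, 100.0)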


If the floating-point representation fl(x) of the number x is obtained by using t digits and the chopping procedure, then the relative error is

    |x − fl(x)| / |x| = |0.00...0 a_{t+1} a_{t+2} ... × 2^m| / |0.a_1 a_2 ... a_t a_{t+1} a_{t+2} ... × 2^m|
                      = |0.a_{t+1} a_{t+2} ...| / |0.a_1 a_2 ... a_t a_{t+1} a_{t+2} ...| × 2^(−t).

Since a_1 ≠ 0, the minimal value of the denominator is 1/2. The numerator is bounded above by 1. As a consequence,

    |x − fl(x)| / |x| ≤ 2^(−t+1).


If t-digit rounding arithmetic is used and a_{t+1} = 0, then fl(x) = ±0.a_1 a_2 ... a_t × 2^m. A bound for the relative error is

    |x − fl(x)| / |x| = |0.a_{t+1} a_{t+2} ...| / |0.a_1 a_2 ... a_t a_{t+1} a_{t+2} ...| × 2^(−t) ≤ 2^(−t),

since the numerator is bounded above by 1/2.

If a_{t+1} = 1, then fl(x) = ±(0.a_1 a_2 ... a_t + 2^(−t)) × 2^m. The upper bound for the relative error becomes

    |x − fl(x)| / |x| = |1 − 0.a_{t+1} a_{t+2} ...| / |0.a_1 a_2 ... a_t a_{t+1} a_{t+2} ...| × 2^(−t) ≤ 2^(−t),

since the numerator is bounded by 1/2 due to a_{t+1} = 1.

Therefore the relative error for rounding arithmetic is

    |x − fl(x)| / |x| ≤ 2^(−t) = (1/2) × 2^(−t+1).
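Both bounds can be spot-checked numerically. The self-contained sketch below reuses the chop/round construction from the earlier example on a grid of exact rationals in [1/2, 1); the sample values and helper names are ours:

    from fractions import Fraction

    def chop(q, t):
        return Fraction(int(q * 2**t), 2**t)

    def round_fl(q, t):
        return chop(q + Fraction(1, 2**(t + 1)), t)

    t = 8
    samples = [Fraction(n, 997) for n in range(499, 997)]      # values in [1/2, 1)
    worst_chop = max(abs(q - chop(q, t)) / q for q in samples)
    worst_round = max(abs(q - round_fl(q, t)) / q for q in samples)

    print(float(worst_chop), float(Fraction(2, 2**t)))    # observed <= 2^(-t+1)
    print(float(worst_round), float(Fraction(1, 2**t)))   # observed <= 2^(-t)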


The number ε_M ≡ 2^(−t+1) is referred to as the unit roundoff error or machine epsilon. The floating-point representation, fl(x), of x can be expressed as

    fl(x) = x(1 + δ),  |δ| ≤ ε_M.    (1)

In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985. In this report, formats were specified for single, double, and extended precisions, and these standards are generally followed by microcomputer manufacturers using floating-point hardware.
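Machine epsilon can also be probed at run time with the classic halving loop: keep shrinking eps until 1 + eps/2 is no longer distinguishable from 1. In Python (whose float is the IEEE double precision format discussed below), the loop lands on 2^(−52):

    eps = 1.0
    while 1.0 + eps / 2 > 1.0:   # stop once 1 + eps/2 rounds back to 1
        eps /= 2

    print(eps)                   # 2.220446049250313e-16
    print(eps == 2.0 ** -52)     # True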


The single precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number ±q × 2^m, as shown in Figure 1.

Figure 1: 32-bit single precision (bit 0: sign of mantissa; bits 1–8: 8-bit exponent; bits 9–31: 23-bit normalized mantissa).

The first bit is a sign indicator, denoted s. This is followed by an 8-bit exponent c and a 23-bit mantissa f.

The base for the exponent and mantissa is 2, and the actual exponent is c − 127. The value of c is restricted by the inequality 0 ≤ c ≤ 255.

The actual exponent of the number is therefore restricted by the inequality −127 ≤ c − 127 ≤ 128.

A normalization is imposed that requires that the leading digit in the fraction be 1, and this digit is not stored as part of the 23-bit mantissa.
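The three fields can be inspected directly by reinterpreting a float's bytes as a 32-bit integer. A Python sketch using the standard struct module (the test value 9.8125 is the number converted earlier; the function name is ours):

    import struct

    def float32_fields(x):
        """Split the IEEE single precision encoding of x into (s, c, f)."""
        bits, = struct.unpack(">I", struct.pack(">f", x))   # raw 32-bit pattern
        s = bits >> 31                # sign bit
        c = (bits >> 23) & 0xFF       # 8-bit biased exponent
        f = bits & 0x7FFFFF           # 23-bit stored fraction (leading 1 is implicit)
        return s, c, f

    s, c, f = float32_fields(9.8125)
    print(s, c, c - 127)              # 0 130 3   (so the value is +1.f x 2^3)
    print(format(f, "023b"))          # 00111010000000000000000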


Since the mantissa f actually corresponds to 24 binary digits (i.e., precision t = 24), the machine epsilon is

    ε_M = 2^(−24+1) = 2^(−23) ≈ 1.192 × 10^(−7).    (2)

This corresponds to approximately 6 accurate decimal digits, and the first single precision floating-point number greater than 1 is 1 + 2^(−23).

The largest number that can be represented by the single precision format is approximately 2^128 ≈ 3.403 × 10^38, and the smallest positive number is 2^(−126) ≈ 1.175 × 10^(−38).
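NumPy exposes exactly these single precision constants, so (2) and the quoted range are easy to cross-check (assuming NumPy is available; this is a verification aid, not part of the notes):

    import numpy as np

    info = np.finfo(np.float32)
    print(info.eps)     # 1.1920929e-07   (= 2**-23)
    print(info.max)     # 3.4028235e+38   (just below 2**128)
    print(info.tiny)    # 1.1754944e-38   (= 2**-126, smallest positive normal number)

    # The first single precision number greater than 1 is 1 + 2**-23:
    print(np.float32(1) + info.eps > np.float32(1))   # True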


A floating-point number in double precision IEEE standard format uses two words (64 bits) to store the number, as shown in Figure 2.

Figure 2: 64-bit double precision (bit 0: sign of mantissa; bits 1–11: 11-bit exponent; bits 12–63: 52-bit normalized mantissa).

The first bit is a sign indicator, denoted s. This is followed by an 11-bit exponent c and a 52-bit mantissa f.

The actual exponent is c − 1023.

The machine epsilon is

    ε_M = 2^(−52) ≈ 2.220 × 10^(−16),

which provides between 15 and 16 decimal digits of accuracy.

The range is approximately 2^(−1022) ≈ 2.225 × 10^(−308) to 2^1024 ≈ 1.798 × 10^308.
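Python's built-in float is this 64-bit format, so the corresponding constants are available in the standard library (a quick check, no third-party packages needed):

    import sys

    print(sys.float_info.mant_dig)   # 53  (52 stored bits plus the implicit leading 1)
    print(sys.float_info.epsilon)    # 2.220446049250313e-16   (= 2**-52)
    print(sys.float_info.max)        # 1.7976931348623157e+308
    print(sys.float_info.min)        # 2.2250738585072014e-308 (smallest positive normal number)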

