Computer Arithmetic
Numerical Analysis
Department of Mathematics, NTNU
Tsung-Min Hwang
September 14, 2003

1 Floating-Point Number and Roundoff Error

• Normalized scientific notation for the decimal number system of x:
    x = ±r × 10^n,
  where 1/10 ≤ r < 1 and n is an integer (positive, negative, or zero).
  – r is called the mantissa and n is the exponent.
  – The leading digit in the fraction is not zero.
  – For example, 42.965 = 0.42965 × 10^2 and −0.00234 = −0.234 × 10^{−2}.
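
As a quick illustration of this normalization, here is a small Python sketch (not part of the original notes; the function name and the use of logarithms are my own choices):

    import math

    def normalize_decimal(x):
        """Write a nonzero x as sign * r * 10**n with 1/10 <= r < 1."""
        sign = -1 if x < 0 else 1
        x = abs(x)
        n = math.floor(math.log10(x)) + 1   # exponent that pulls the leading digit just after the point
        r = x / 10**n
        return sign, r, n

    print(normalize_decimal(42.965))    # roughly (1, 0.42965, 2)
    print(normalize_decimal(-0.00234))  # roughly (-1, 0.234, -2)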
• Scientific notation for the binary number system of x: x = ±q × 2^m, with 1/2 ≤ q < 1 and some integer m. For example,
    (1001.1101)_2 = 1 × 2^3 + 1 × 2^0 + 1 × 2^{−1} + 1 × 2^{−2} + 1 × 2^{−4}
                  = 0.10011101 × 2^4 = (9.8125)_{10}.
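
The worked conversion above is easy to check by machine; the snippet below is purely illustrative (the function name is mine) and evaluates the binary digits directly:

    def binary_to_decimal(int_bits, frac_bits):
        """Evaluate a binary number given as strings of integer-part and fraction-part digits."""
        value = sum(int(b) * 2**e for e, b in zip(range(len(int_bits) - 1, -1, -1), int_bits))
        value += sum(int(b) * 2**-(i + 1) for i, b in enumerate(frac_bits))
        return value

    print(binary_to_decimal("1001", "1101"))  # 9.8125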
Example 1.1 What is the binary representation of 2/3?

Solution: To determine the binary representation of 2/3, we write
    2/3 = (0.a_1 a_2 a_3 ...)_2.
Multiply by 2 to obtain
    4/3 = (a_1.a_2 a_3 ...)_2.
Therefore, we get a_1 = 1 by taking the integer part of both sides. Subtracting 1, we have
    1/3 = (0.a_2 a_3 a_4 ...)_2.
Repeating the previous step, we arrive at
    2/3 = (0.101010 ...)_2.
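
The multiply-by-2 procedure in this example translates directly into a loop. Below is a small sketch (my own illustration) that uses Python's fractions module so the arithmetic stays exact:

    from fractions import Fraction

    def binary_fraction_digits(x, ndigits):
        """Return the first ndigits binary digits a_1, a_2, ... of a number x with 0 < x < 1."""
        digits = []
        for _ in range(ndigits):
            x *= 2              # shift the next binary digit in front of the point
            a = int(x)          # the integer part is the next digit
            digits.append(a)
            x -= a              # remove it and continue with the remaining fraction
        return digits

    print(binary_fraction_digits(Fraction(2, 3), 8))  # [1, 0, 1, 0, 1, 0, 1, 0]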
• Only a relatively small subset of the real number system is used for the representation of all the real numbers.
• This subset, whose members are called the floating-point numbers, contains only rational numbers, both positive and negative.
• When a number cannot be represented exactly with the fixed finite number of digits in a computer, a nearby floating-point number is chosen for approximate representation.
• For any real number x, let
    x = ±0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ··· × 2^m,   a_1 ≠ 0,
  denote the normalized scientific binary representation of x.
  – a_1 ≠ 0, hence a_1 = 1.
  – If x is within the numerical range of the machine, the floating-point form of x, denoted fl(x), is obtained by terminating the mantissa of x at t digits for some integer t.
  – There are two ways of performing this termination (a short sketch in code follows the list).
    1. Chopping: simply discard the excess bits a_{t+1}, a_{t+2}, ... to obtain
         fl(x) = ±0.a_1 a_2 ··· a_t × 2^m.
    2. Rounding up: add 2^{−(t+1)} × 2^m to x and then chop the excess bits to obtain a number of the form
         fl(x) = ±0.δ_1 δ_2 ··· δ_t × 2^m.
       In this method, if a_{t+1} = 1, we add 1 to a_t to obtain fl(x), and if a_{t+1} = 0, we merely chop off all but the first t digits.
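
As a concrete illustration of the two terminations, the sketch below (my own code; the helper name fl and its interface are assumptions, and it works in exact Fraction arithmetic) chops or rounds the normalized binary mantissa at t bits:

    import math
    from fractions import Fraction

    def fl(x, t, mode="chop"):
        """Terminate the normalized binary mantissa of a positive x at t bits by chopping or rounding up."""
        m = math.floor(math.log2(x)) + 1          # exponent m with mantissa q = x / 2**m in [1/2, 1)
        q = Fraction(x) / Fraction(2) ** m
        scaled = q * 2**t                         # a_1 ... a_t . a_{t+1} a_{t+2} ...
        if mode == "chop":
            kept = math.floor(scaled)             # discard the excess bits
        else:
            kept = math.floor(scaled + Fraction(1, 2))   # add 2^{-(t+1)} * 2^m to x, then chop
        return Fraction(kept, 2**t) * Fraction(2) ** m

    x = Fraction(2, 3)                            # (0.101010...)_2
    print(fl(x, 4, "chop"))                       # 5/8   = (0.1010)_2
    print(fl(x, 4, "round"))                      # 11/16 = (0.1011)_2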
Definition 1.1 (Roundoff error) The error that results from replacing a number with its floating-point form is called roundoff error or rounding error.

Definition 1.2 (Absolute Error and Relative Error) If x is an approximation to the exact value x*, the absolute error is |x* − x| and the relative error is |x* − x| / |x*|, provided that x* ≠ 0.

Remark 1.1 As a measure of accuracy, the absolute error may be misleading and the relative error more meaningful.
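
In code the two error measures are one-liners; the helpers below (illustrative, names my own) also show why Remark 1.1 holds: the same absolute error means very different accuracy at different scales.

    def absolute_error(x_star, x):
        return abs(x_star - x)

    def relative_error(x_star, x):
        if x_star == 0:
            raise ValueError("relative error is undefined for x* = 0")
        return abs(x_star - x) / abs(x_star)

    print(absolute_error(1000.0, 999.9), relative_error(1000.0, 999.9))  # about 0.1 and 1e-4
    print(absolute_error(0.1, 0.0), relative_error(0.1, 0.0))            # 0.1 and 1.0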
• If the floating-point representation fl(x) for the number x is obtained by using t digits and the chopping procedure, then the relative error is
    |x − fl(x)| / |x| = |0.00···0 a_{t+1} a_{t+2} ··· × 2^m| / |0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ··· × 2^m|
                      = (|0.a_{t+1} a_{t+2} ···| / |0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ···|) × 2^{−t}.
  Since a_1 ≠ 0, the minimal value of the denominator is 1/2, and the numerator is bounded above by 1. As a consequence,
    |(x − fl(x)) / x| ≤ 2^{−t+1}.
• If t-digit rounding arithmetic is used and
  – a_{t+1} = 0, then fl(x) = ±0.a_1 a_2 ··· a_t × 2^m. A bound for the relative error is
      |x − fl(x)| / |x| = (|0.a_{t+1} a_{t+2} ···| / |0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ···|) × 2^{−t} ≤ 2^{−t},
    since the numerator is bounded above by 1/2.
  – a_{t+1} = 1, then fl(x) = ±(0.a_1 a_2 ··· a_t + 2^{−t}) × 2^m. The upper bound for the relative error becomes
      |x − fl(x)| / |x| = (|1 − 0.a_{t+1} a_{t+2} ···| / |0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ···|) × 2^{−t} ≤ 2^{−t},
    since the numerator is bounded by 1/2 due to a_{t+1} = 1.
  Therefore the relative error for rounding arithmetic is
    |(x − fl(x)) / x| ≤ 2^{−t} = (1/2) × 2^{−t+1}.
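
Both bounds (2^{−t+1} for chopping and 2^{−t} for rounding) can be spot-checked numerically. The self-contained sketch below is my own illustration; the termination helper mirrors the earlier sketch and the exact relative error is measured on random rationals:

    import math
    import random
    from fractions import Fraction

    def terminate(x, t, mode):
        """t-bit chopping or rounding of a positive number x, in exact arithmetic."""
        m = math.floor(math.log2(x)) + 1
        q = Fraction(x) / Fraction(2) ** m
        scaled = q * 2**t
        kept = math.floor(scaled) if mode == "chop" else math.floor(scaled + Fraction(1, 2))
        return Fraction(kept, 2**t) * Fraction(2) ** m

    t = 8
    random.seed(0)
    for _ in range(1000):
        x = Fraction(random.randint(1, 10**6), random.randint(1, 10**6))
        assert abs(x - terminate(x, t, "chop")) / x <= Fraction(1, 2**(t - 1))   # 2^{-t+1}
        assert abs(x - terminate(x, t, "round")) / x <= Fraction(1, 2**t)        # 2^{-t}
    print("both relative-error bounds hold on 1000 random samples with t =", t)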
• The number ε_M ≡ 2^{−t+1} is referred to as the unit roundoff error or machine epsilon. The floating-point representation, fl(x), of x can be expressed as
    fl(x) = x(1 + δ),   |δ| ≤ ε_M.   (1)
• In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985. In this report, formats were specified for single, double, and extended precisions, and these standards are generally followed by microcomputer manufacturers using floating-point hardware.
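
The machine epsilon of whatever arithmetic a program is actually running on can be found by repeated halving; the sketch below uses ordinary Python floats, which are IEEE double precision on virtually all platforms (an assumption here):

    eps = 1.0
    while 1.0 + eps / 2 > 1.0:   # keep halving while 1 + eps/2 is still distinguishable from 1
        eps /= 2
    print(eps)                   # 2.220446049250313e-16 for IEEE double precision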
☞ The single precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number ±q × 2^m, as shown in Figure 1.

Figure 1: 32-bit single precision (bit 0: sign of mantissa; bits 1-8: 8-bit exponent; bits 9-31: 23-bit normalized mantissa).

• The first bit is a sign indicator, denoted s. This is followed by an 8-bit exponent c and a 23-bit mantissa f.
• The base for the exponent and mantissa is 2, and the actual exponent is c − 127. The value of c is restricted by the inequality 0 ≤ c ≤ 255.
• The actual exponent of the number is restricted by the inequality −126 ≤ c − 127 ≤ 128.
• A normalization is imposed that requires that the leading digit in the fraction be 1, and this digit is not stored as part of the 23-bit mantissa.
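
The s, c, f fields can be pulled out of an actual 32-bit pattern with Python's struct module; the sketch below (illustrative, function name my own) also reconstructs the value as (−1)^s (1 + f/2^23) 2^{c−127}:

    import struct

    def float32_fields(x):
        """Return (s, c, f) from the IEEE single precision bit pattern of x."""
        (bits,) = struct.unpack(">I", struct.pack(">f", x))
        s = bits >> 31                  # 1 sign bit
        c = (bits >> 23) & 0xFF         # 8-bit biased exponent
        f = bits & 0x7FFFFF             # 23-bit stored mantissa (the leading 1 is implicit)
        return s, c, f

    s, c, f = float32_fields(9.8125)
    print(s, c - 127, hex(f))                             # 0 3 0x1d0000
    print((-1) ** s * (1 + f / 2**23) * 2 ** (c - 127))   # 9.8125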
Computer Arithmetic
12•
The mantissaf
actually corresponds to 24 binary digits (i.e., precisiont = 24
), themachine epsilon is
ε
M= 2
−24+1= 2
−23≈ 1.192 × 10
−7.
(2)•
This approximately corresponds to 6 accurate decimal digits. And the first single precision floating-point number greater than 1 is1 + 2
−23.•
The largest number that can be represented by the single precision format is approximately2
128≈ 3.403 × 10
38, and the smallest positive number is2
−126≈ 1.175 × 10
−38.Department of Mathematics – NTNU Tsung-Min Hwang September 14, 2003
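
If NumPy is available (an assumption), these single precision constants can be read off directly:

    import numpy as np

    info = np.finfo(np.float32)
    print(info.eps)    # 1.1920929e-07, i.e. 2**-23
    print(info.max)    # 3.4028235e+38, just below 2**128
    print(info.tiny)   # 1.1754944e-38, i.e. 2**-126, the smallest normalized positive number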
☞ A floating point number in double precision IEEE standard format uses two words (64 bits) to store the number, as shown in Figure 2.

Figure 2: 64-bit double precision (bit 0: sign of mantissa; bits 1-11: 11-bit exponent; bits 12-63: 52-bit normalized mantissa).

• The first bit is a sign indicator, denoted s. This is followed by an 11-bit exponent c and a 52-bit mantissa f.
• The actual exponent is c − 1023.
• The machine epsilon is
    ε_M = 2^{−52} ≈ 2.220 × 10^{−16},
  which provides between 15 and 16 decimal digits of accuracy.
• The range is approximately 2^{−1022} ≈ 2.225 × 10^{−308} to 2^{1024} ≈ 1.798 × 10^{308}.
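
The corresponding double precision constants are exposed by Python's sys.float_info, since CPython floats are IEEE doubles on essentially all platforms (again an assumption about the platform):

    import sys

    print(sys.float_info.epsilon)  # 2.220446049250313e-16, i.e. 2**-52
    print(sys.float_info.max)      # 1.7976931348623157e+308, just below 2**1024
    print(sys.float_info.min)      # 2.2250738585072014e-308, i.e. 2**-1022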