**Computer Arithmetic**


**Numerical Analysis**

**NTNU**

**Tsung-Min Hwang**
**September 14, 2003**

**Department of Mathematics – NTNU** **Tsung-Min Hwang September 14, 2003**

**1 Floating-Point Number and Roundoff Error**

- Normalized scientific notation for the decimal number system writes a number x as

  x = ±r × 10^n,

  where 1/10 ≤ r < 1 and n is an integer (positive, negative, or zero).
  - r is called the mantissa and n the exponent.
  - The leading digit of the fraction is not zero.
  - For example, 42.965 = 0.42965 × 10^2 and −0.00234 = −0.234 × 10^{−2}.


- Scientific notation for the binary number system writes x as

  x = ±q × 2^m,

  with 1/2 ≤ q < 1 and some integer m. For example,

  (1001.1101)_2 = 1 × 2^3 + 1 × 2^0 + 1 × 2^{−1} + 1 × 2^{−2} + 1 × 2^{−4} = 0.10011101 × 2^4 = (9.8125)_{10}.
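The digit-by-digit expansion above can be checked with a short script. This is an illustration added alongside the notes, not part of them; `binary_to_decimal` is a hypothetical helper name.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate a binary numeral such as '1001.1101' as a decimal value."""
    integer_part, _, fraction_part = bits.partition(".")
    value = 0.0
    # Integer digits contribute 2^k, counting k upward from the binary point.
    for k, digit in enumerate(reversed(integer_part)):
        value += int(digit) * 2**k
    # Fractional digits contribute 2^{-1}, 2^{-2}, ...
    for k, digit in enumerate(fraction_part, start=1):
        value += int(digit) * 2**-k
    return value

print(binary_to_decimal("1001.1101"))  # 9.8125
```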


**Example 1.1** *What is the binary representation of 2/3?*

*Solution:* To determine the binary representation of 2/3, we write

2/3 = (0.a_1 a_2 a_3 …)_2.

Multiplying by 2 gives

4/3 = (a_1.a_2 a_3 …)_2,

so taking the integer part of both sides yields a_1 = 1. Subtracting 1, we have

1/3 = (0.a_2 a_3 a_4 …)_2.

Repeating the previous step, we arrive at

2/3 = (0.101010 …)_2.
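The multiply-by-2, take-the-integer-part procedure of Example 1.1 can be sketched directly. Exact rationals are used so the demonstration itself is free of roundoff; the function name is an assumption for illustration.

```python
from fractions import Fraction

def binary_fraction_digits(x: Fraction, n: int) -> str:
    """Return the first n binary digits a1 a2 ... of x in (0, 1)."""
    digits = []
    for _ in range(n):
        x *= 2                 # shift: 2x = (a1.a2 a3 ...)_2
        a = int(x)             # the integer part is the next digit
        digits.append(str(a))
        x -= a                 # subtract the digit and repeat
    return "".join(digits)

print(binary_fraction_digits(Fraction(2, 3), 8))  # 10101010
```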


- Only a relatively small subset of the real number system is used to represent all the real numbers.
- This subset, called the *floating-point numbers*, contains only rational numbers, both positive and negative.
- When a number cannot be represented exactly with the fixed, finite number of digits available on a computer, a nearby floating-point number is chosen as an approximate representation.


- For any real number x, let

  x = ±0.a_1 a_2 ⋯ a_t a_{t+1} a_{t+2} ⋯ × 2^m,  a_1 ≠ 0,

  denote the normalized scientific binary representation of x.
  - Since a_1 ≠ 0 in base 2, we must have a_1 = 1.
  - If x is within the numerical range of the machine, the floating-point form of x, denoted fl(x), is obtained by terminating the mantissa of x at t digits for some integer t.
  - There are two ways of performing this termination.
    1. **Chopping:** simply discard the excess bits a_{t+1}, a_{t+2}, … to obtain

       fl(x) = ±0.a_1 a_2 ⋯ a_t × 2^m.

    2. **Rounding up:** add 2^{−(t+1)} × 2^m to x and then chop the excess bits to obtain a number of the form

       fl(x) = ±0.δ_1 δ_2 ⋯ δ_t × 2^m.

       In this method, if a_{t+1} = 1 we add 1 to a_t to obtain fl(x), and if a_{t+1} = 0 we merely chop off all but the first t digits.
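The two termination rules can be sketched for a t-bit mantissa by scaling the normalized mantissa q by 2^t and taking a floor. This is a minimal sketch for positive x under the slide's conventions (`fl_chop` and `fl_round` are hypothetical names, not a library API):

```python
import math

def fl_chop(x: float, t: int) -> float:
    """Terminate the normalized binary mantissa of x > 0 after t bits by chopping."""
    q, m = math.frexp(x)                      # x = q * 2**m with 1/2 <= q < 1
    return math.floor(q * 2**t) / 2**t * 2**m

def fl_round(x: float, t: int) -> float:
    """Terminate after t bits by adding 2**-(t+1) * 2**m and then chopping."""
    q, m = math.frexp(x)
    return math.floor(q * 2**t + 0.5) / 2**t * 2**m

x = 2 / 3  # mantissa 0.101010... in binary
print(fl_chop(x, 4), fl_round(x, 4))  # 0.625 0.6875
```

With t = 4 the mantissa 0.10101010… is cut to 0.1010 by chopping (0.625) and rounded up to 0.1011 (0.6875), since the discarded bit a_5 is 1.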


**Definition 1.1 (Roundoff error)** *The error that results from replacing a number with its floating-point form is called* roundoff error *or* rounding error.

**Definition 1.2 (Absolute Error and Relative Error)** *If x is an approximation to the exact value x^\*, the* absolute error *is |x^\* − x| and the* relative error *is |x^\* − x| / |x^\*|, provided that x^\* ≠ 0.*

**Remark 1.1** *As a measure of accuracy, the absolute error may be misleading and the relative error more meaningful.*
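Remark 1.1 can be seen numerically: the same absolute error is severe for a small value and negligible for a large one. The numbers below are made up for illustration.

```python
def errors(x_star: float, x: float) -> tuple[float, float]:
    """Absolute and relative error of the approximation x to the exact value x_star."""
    absolute = abs(x_star - x)
    relative = absolute / abs(x_star)  # requires x_star != 0
    return absolute, relative

# Identical absolute error, very different relative error:
for x_star, x in [(1.0, 1.01), (1000.0, 1000.01)]:
    a, r = errors(x_star, x)
    print(f"abs={a:.2e}  rel={r:.2e}")
```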


- If the floating-point representation fl(x) of the number x is obtained by using t digits and the chopping procedure, then the relative error is

  |x − fl(x)| / |x| = |0.00⋯0 a_{t+1} a_{t+2} ⋯ × 2^m| / |0.a_1 a_2 ⋯ a_t a_{t+1} a_{t+2} ⋯ × 2^m|
  = (|0.a_{t+1} a_{t+2} ⋯| / |0.a_1 a_2 ⋯ a_t a_{t+1} a_{t+2} ⋯|) × 2^{−t}.

  Since a_1 ≠ 0, the minimal value of the denominator is 1/2, and the numerator is bounded above by 1. As a consequence,

  |(x − fl(x)) / x| ≤ 2^{−t+1}.


- If t-digit rounding arithmetic is used and
  - a_{t+1} = 0, then fl(x) = ±0.a_1 a_2 ⋯ a_t × 2^m. A bound for the relative error is

    |x − fl(x)| / |x| = (|0.a_{t+1} a_{t+2} ⋯| / |0.a_1 a_2 ⋯ a_t a_{t+1} a_{t+2} ⋯|) × 2^{−t} ≤ 2^{−t},

    since the numerator is bounded above by 1/2.
  - a_{t+1} = 1, then fl(x) = ±(0.a_1 a_2 ⋯ a_t + 2^{−t}) × 2^m. The upper bound for the relative error becomes

    |x − fl(x)| / |x| = (|1 − 0.a_{t+1} a_{t+2} ⋯| / |0.a_1 a_2 ⋯ a_t a_{t+1} a_{t+2} ⋯|) × 2^{−t} ≤ 2^{−t},

    since the numerator is bounded by 1/2 because a_{t+1} = 1.

  Therefore the relative error for rounding arithmetic is

  |(x − fl(x)) / x| ≤ 2^{−t} = (1/2) × 2^{−t+1}.
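The two bounds, 2^{−t+1} for chopping and 2^{−t} for rounding, can be checked empirically on random mantissas q in [1/2, 1). This is a quick sanity check with a hypothetical helper name, not part of the original notes.

```python
import math
import random

def rel_error(q: float, t: int, mode: str) -> float:
    """Relative error of terminating the mantissa q in [1/2, 1) after t bits."""
    scaled = q * 2**t
    kept = math.floor(scaled) if mode == "chop" else math.floor(scaled + 0.5)
    return abs(q - kept / 2**t) / q

random.seed(0)
t = 10
qs = [random.uniform(0.5, 1.0) for _ in range(10_000)]
assert all(rel_error(q, t, "chop") <= 2**(-t + 1) for q in qs)
assert all(rel_error(q, t, "round") <= 2**(-t) for q in qs)
print("bounds hold for t =", t)
```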


- The number ε_M ≡ 2^{−t+1} is referred to as the *unit roundoff error* or *machine epsilon*. The floating-point representation fl(x) of x can then be expressed as

  fl(x) = x(1 + δ),  |δ| ≤ ε_M.  (1)

- In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called *Binary Floating Point Arithmetic Standard 754-1985*. In this report, formats were specified for single, double, and extended precisions, and these standards are generally followed by microcomputer manufacturers using floating-point hardware.
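In the slide's convention an IEEE double has t = 53 mantissa bits, so ε_M = 2^{−t+1} = 2^{−52}, which is exactly the value Python reports (note that some texts instead call 2^{−53} the unit roundoff):

```python
import sys

t = 53
eps_M = 2.0 ** (-t + 1)
print(eps_M == sys.float_info.epsilon)  # True

# fl(x) = x(1 + delta) with |delta| <= eps_M: 1 + eps_M is the smallest
# double greater than 1, while a perturbation much smaller than eps_M is lost.
print(1.0 + eps_M > 1.0)        # True
print(1.0 + eps_M / 4 == 1.0)   # True
```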


**Computer Arithmetic**

^{11}

☞ The single precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number ±q × 2^m, as shown in Figure 1.

Figure 1: 32-bit single precision (bit 0: sign of mantissa; bits 1–8: 8-bit exponent; bits 9–31: 23-bit normalized mantissa).

• The first bit is a sign indicator, denoted s. This is followed by an 8-bit exponent c and a 23-bit mantissa f.

• The base for the exponent and mantissa is 2, and the actual exponent is c − 127. The value of c is restricted by the inequality 0 ≤ c ≤ 255.

• Since c = 0 is reserved for special values, the actual exponent of the number is restricted by the inequality −126 ≤ c − 127 ≤ 128.

• A normalization is imposed that requires the leading digit of the fraction to be 1, and this digit is not stored as part of the 23-bit mantissa.
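The s, c, f fields described above can be read off directly from the bit pattern of a single precision number. The sketch below (not from the original notes; the helper name `decode_float32` is invented for illustration) uses Python's standard `struct` module to pack a value into the 32-bit format and unpack the three fields:

```python
import struct

def decode_float32(x):
    """Split a single precision value into its s, c, f fields."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    s = bits >> 31            # 1 sign bit
    c = (bits >> 23) & 0xFF   # 8-bit stored exponent, 0 <= c <= 255
    f = bits & 0x7FFFFF       # 23-bit stored fraction (leading 1 not stored)
    # For a normalized number the value is (-1)^s * (1 + f/2^23) * 2^(c-127).
    value = (-1) ** s * (1 + f / 2 ** 23) * 2.0 ** (c - 127)
    return s, c, f, value

# -0.75 = -1.5 * 2^(-1): sign 1, stored exponent c = -1 + 127 = 126,
# stored fraction f = 0.5 * 2^23 = 2^22.
print(decode_float32(-0.75))  # (1, 126, 4194304, -0.75)
```

Note how the implicit leading 1 of the normalized mantissa is added back when reconstructing the value.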


• The mantissa f actually corresponds to 24 binary digits (i.e., precision t = 24), so the machine epsilon is

ε_M = 2^{−24+1} = 2^{−23} ≈ 1.192 × 10^{−7}.  (2)

• This corresponds to approximately 6 accurate decimal digits, and the first single precision floating-point number greater than 1 is 1 + 2^{−23}.

• The largest number that can be represented by the single precision format is approximately 2^{128} ≈ 3.403 × 10^{38}, and the smallest positive normalized number is 2^{−126} ≈ 1.175 × 10^{−38}.
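The claim that 1 + 2^{−23} is the first single precision number greater than 1 can be checked by rounding through the 32-bit format. A minimal sketch (the helper name `fl32` is invented for illustration), again using the standard `struct` module:

```python
import struct

def fl32(x):
    """Round a double precision value to the nearest single precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# 1 + 2^-24 is halfway between 1 and 1 + 2^-23; round-to-nearest-even
# sends it back to 1, while 1 + 2^-23 is exactly representable.
print(fl32(1 + 2 ** -24) == 1.0)   # True
print(fl32(1 + 2 ** -23) > 1.0)    # True
```

This is exactly the behavior equation (1) predicts with ε_M = 2^{−23}: rounding 1 + 2^{−24} to 1 introduces a relative error of 2^{−24} ≤ ε_M.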


☞ A floating-point number in double precision IEEE standard format uses two words (64 bits) to store the number, as shown in Figure 2.

Figure 2: 64-bit double precision (bit 0: sign of mantissa; bits 1–11: 11-bit exponent; bits 12–63: 52-bit normalized mantissa).

• The first bit is a sign indicator, denoted s. This is followed by an 11-bit exponent c and a 52-bit mantissa f.

• The actual exponent is c − 1023.

• The mantissa corresponds to 53 binary digits (t = 53), so the machine epsilon is

ε_M = 2^{−53+1} = 2^{−52} ≈ 2.220 × 10^{−16},

which provides between 15 and 16 decimal digits of accuracy.

• The representable range is approximately 2^{−1022} ≈ 2.225 × 10^{−308} to 2^{1024} ≈ 1.798 × 10^{308}.
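Python's `float` is double precision on essentially all platforms, so these constants can be inspected directly through the standard `sys.float_info` structure — a sketch assuming CPython on IEEE 754 hardware:

```python
import sys

# Double precision characteristics reported by the runtime:
print(sys.float_info.epsilon)  # 2.220446049250313e-16, i.e. 2^-52
print(sys.float_info.max)      # 1.7976931348623157e+308
print(sys.float_info.min)      # 2.2250738585072014e-308, i.e. 2^-1022
print(sys.float_info.dig)      # 15 (decimal digits of accuracy)
```

The reported `max` is slightly below 2^{1024} because the largest representable double is (2 − 2^{−52}) × 2^{1023}.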
