Computer Arithmetic
Numerical Analysis
Department of Mathematics, NTNU
Tsung-Min Hwang
September 14, 2003

1 Floating-Point Number and Roundoff Error

• Normalized scientific notation for the decimal number system of x:
    x = ±r × 10^n,
  where 1/10 ≤ r < 1 and n is an integer (positive, negative, or zero).
  – r is called the mantissa and n is the exponent.
  – The leading digit in the fraction is not zero.
  – For example, 42.965 = 0.42965 × 10^2 and −0.00234 = −0.234 × 10^{−2}.
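
As a quick illustration of this normalization, here is a small Python sketch (not part of the original notes; the function name and the use of logarithms are my own choices):

    import math

    def normalize_decimal(x):
        """Write a nonzero x as sign * r * 10**n with 1/10 <= r < 1."""
        sign = -1 if x < 0 else 1
        x = abs(x)
        n = math.floor(math.log10(x)) + 1   # exponent that pulls the leading digit just after the point
        r = x / 10**n
        return sign, r, n

    print(normalize_decimal(42.965))    # roughly (1, 0.42965, 2)
    print(normalize_decimal(-0.00234))  # roughly (-1, 0.234, -2)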
• Scientific notation for the binary number system of x: x = ±q × 2^m, with 1/2 ≤ q < 1 and some integer m. For example,
    (1001.1101)_2 = 1 × 2^3 + 1 × 2^0 + 1 × 2^{−1} + 1 × 2^{−2} + 1 × 2^{−4}
                  = 0.10011101 × 2^4 = (9.8125)_{10}.
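
The worked conversion above is easy to check by machine; the snippet below is purely illustrative (the function name is mine) and evaluates the binary digits directly:

    def binary_to_decimal(int_bits, frac_bits):
        """Evaluate a binary number given as strings of integer-part and fraction-part digits."""
        value = sum(int(b) * 2**e for e, b in zip(range(len(int_bits) - 1, -1, -1), int_bits))
        value += sum(int(b) * 2**-(i + 1) for i, b in enumerate(frac_bits))
        return value

    print(binary_to_decimal("1001", "1101"))  # 9.8125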
Example 1.1 What is the binary representation of 2/3?

Solution: To determine the binary representation of 2/3, we write
    2/3 = (0.a_1 a_2 a_3 ...)_2.
Multiply by 2 to obtain
    4/3 = (a_1.a_2 a_3 ...)_2.
Therefore, we get a_1 = 1 by taking the integer part of both sides. Subtracting 1, we have
    1/3 = (0.a_2 a_3 a_4 ...)_2.
Repeating the previous step, we arrive at
    2/3 = (0.101010 ...)_2.
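
The multiply-by-2 procedure in this example translates directly into a loop. Below is a small sketch (my own illustration) that uses Python's fractions module so the arithmetic stays exact:

    from fractions import Fraction

    def binary_fraction_digits(x, ndigits):
        """Return the first ndigits binary digits a_1, a_2, ... of a number x with 0 < x < 1."""
        digits = []
        for _ in range(ndigits):
            x *= 2              # shift the next binary digit in front of the point
            a = int(x)          # the integer part is the next digit
            digits.append(a)
            x -= a              # remove it and continue with the remaining fraction
        return digits

    print(binary_fraction_digits(Fraction(2, 3), 8))  # [1, 0, 1, 0, 1, 0, 1, 0]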
• Only a relatively small subset of the real number system is used for the representation of all the real numbers.
• This subset, whose members are called the floating-point numbers, contains only rational numbers, both positive and negative.
• When a number cannot be represented exactly with the fixed finite number of digits in a computer, a nearby floating-point number is chosen for approximate representation.
• For any real number x, let
    x = ±0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ··· × 2^m,   a_1 ≠ 0,
  denote the normalized scientific binary representation of x.
  – a_1 ≠ 0, hence a_1 = 1.
  – If x is within the numerical range of the machine, the floating-point form of x, denoted fl(x), is obtained by terminating the mantissa of x at t digits for some integer t.
  – There are two ways of performing this termination (a short sketch in code follows the list).
    1. Chopping: simply discard the excess bits a_{t+1}, a_{t+2}, ... to obtain
         fl(x) = ±0.a_1 a_2 ··· a_t × 2^m.
    2. Rounding up: add 2^{−(t+1)} × 2^m to x and then chop the excess bits to obtain a number of the form
         fl(x) = ±0.δ_1 δ_2 ··· δ_t × 2^m.
       In this method, if a_{t+1} = 1, we add 1 to a_t to obtain fl(x), and if a_{t+1} = 0, we merely chop off all but the first t digits.
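
As a concrete illustration of the two terminations, the sketch below (my own code; the helper name fl and its interface are assumptions, and it works in exact Fraction arithmetic) chops or rounds the normalized binary mantissa at t bits:

    import math
    from fractions import Fraction

    def fl(x, t, mode="chop"):
        """Terminate the normalized binary mantissa of a positive x at t bits by chopping or rounding up."""
        m = math.floor(math.log2(x)) + 1          # exponent m with mantissa q = x / 2**m in [1/2, 1)
        q = Fraction(x) / Fraction(2) ** m
        scaled = q * 2**t                         # a_1 ... a_t . a_{t+1} a_{t+2} ...
        if mode == "chop":
            kept = math.floor(scaled)             # discard the excess bits
        else:
            kept = math.floor(scaled + Fraction(1, 2))   # add 2^{-(t+1)} * 2^m to x, then chop
        return Fraction(kept, 2**t) * Fraction(2) ** m

    x = Fraction(2, 3)                            # (0.101010...)_2
    print(fl(x, 4, "chop"))                       # 5/8   = (0.1010)_2
    print(fl(x, 4, "round"))                      # 11/16 = (0.1011)_2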
Definition 1.1 (Roundoff error) The error that results from replacing a number with its floating-point form is called roundoff error or rounding error.

Definition 1.2 (Absolute Error and Relative Error) If x is an approximation to the exact value x*, the absolute error is |x* − x| and the relative error is |x* − x| / |x*|, provided that x* ≠ 0.

Remark 1.1 As a measure of accuracy, the absolute error may be misleading and the relative error more meaningful.
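
In code the two error measures are one-liners; the helpers below (illustrative, names my own) also show why Remark 1.1 holds: the same absolute error means very different accuracy at different scales.

    def absolute_error(x_star, x):
        return abs(x_star - x)

    def relative_error(x_star, x):
        if x_star == 0:
            raise ValueError("relative error is undefined for x* = 0")
        return abs(x_star - x) / abs(x_star)

    print(absolute_error(1000.0, 999.9), relative_error(1000.0, 999.9))  # about 0.1 and 1e-4
    print(absolute_error(0.1, 0.0), relative_error(0.1, 0.0))            # 0.1 and 1.0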
• If the floating-point representation fl(x) for the number x is obtained by using t digits and the chopping procedure, then the relative error is
    |x − fl(x)| / |x| = |0.00···0 a_{t+1} a_{t+2} ··· × 2^m| / |0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ··· × 2^m|
                      = (|0.a_{t+1} a_{t+2} ···| / |0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ···|) × 2^{−t}.
  Since a_1 ≠ 0, the minimal value of the denominator is 1/2, and the numerator is bounded above by 1. As a consequence,
    |(x − fl(x)) / x| ≤ 2^{−t+1}.
• If t-digit rounding arithmetic is used and
  – a_{t+1} = 0, then fl(x) = ±0.a_1 a_2 ··· a_t × 2^m. A bound for the relative error is
      |x − fl(x)| / |x| = (|0.a_{t+1} a_{t+2} ···| / |0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ···|) × 2^{−t} ≤ 2^{−t},
    since the numerator is bounded above by 1/2.
  – a_{t+1} = 1, then fl(x) = ±(0.a_1 a_2 ··· a_t + 2^{−t}) × 2^m. The upper bound for the relative error becomes
      |x − fl(x)| / |x| = (|1 − 0.a_{t+1} a_{t+2} ···| / |0.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ···|) × 2^{−t} ≤ 2^{−t},
    since the numerator is bounded by 1/2 due to a_{t+1} = 1.
  Therefore the relative error for rounding arithmetic is
    |(x − fl(x)) / x| ≤ 2^{−t} = (1/2) × 2^{−t+1}.
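
Both bounds (2^{−t+1} for chopping and 2^{−t} for rounding) can be spot-checked numerically. The self-contained sketch below is my own illustration; the termination helper mirrors the earlier sketch and the exact relative error is measured on random rationals:

    import math
    import random
    from fractions import Fraction

    def terminate(x, t, mode):
        """t-bit chopping or rounding of a positive number x, in exact arithmetic."""
        m = math.floor(math.log2(x)) + 1
        q = Fraction(x) / Fraction(2) ** m
        scaled = q * 2**t
        kept = math.floor(scaled) if mode == "chop" else math.floor(scaled + Fraction(1, 2))
        return Fraction(kept, 2**t) * Fraction(2) ** m

    t = 8
    random.seed(0)
    for _ in range(1000):
        x = Fraction(random.randint(1, 10**6), random.randint(1, 10**6))
        assert abs(x - terminate(x, t, "chop")) / x <= Fraction(1, 2**(t - 1))   # 2^{-t+1}
        assert abs(x - terminate(x, t, "round")) / x <= Fraction(1, 2**t)        # 2^{-t}
    print("both relative-error bounds hold on 1000 random samples with t =", t)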
• The number ε_M ≡ 2^{−t+1} is referred to as the unit roundoff error or machine epsilon. The floating-point representation, fl(x), of x can be expressed as
    fl(x) = x(1 + δ),   |δ| ≤ ε_M.   (1)
• In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985. In this report, formats were specified for single, double, and extended precisions, and these standards are generally followed by microcomputer manufacturers using floating-point hardware.
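
The machine epsilon of whatever arithmetic a program is actually running on can be found by repeated halving; the sketch below uses ordinary Python floats, which are IEEE double precision on virtually all platforms (an assumption here):

    eps = 1.0
    while 1.0 + eps / 2 > 1.0:   # keep halving while 1 + eps/2 is still distinguishable from 1
        eps /= 2
    print(eps)                   # 2.220446049250313e-16 for IEEE double precision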
☞ The single precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number ±q × 2^m, as shown in Figure 1.

Figure 1: 32-bit single precision (bit 0: sign of mantissa; bits 1-8: 8-bit exponent; bits 9-31: 23-bit normalized mantissa).

• The first bit is a sign indicator, denoted s. This is followed by an 8-bit exponent c and a 23-bit mantissa f.
• The base for the exponent and mantissa is 2, and the actual exponent is c − 127. The value of c is restricted by the inequality 0 ≤ c ≤ 255.
• The actual exponent of the number is restricted by the inequality −126 ≤ c − 127 ≤ 128.
• A normalization is imposed that requires that the leading digit in the fraction be 1, and this digit is not stored as part of the 23-bit mantissa.
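
The s, c, f fields can be pulled out of an actual 32-bit pattern with Python's struct module; the sketch below (illustrative, function name my own) also reconstructs the value as (−1)^s (1 + f/2^23) 2^{c−127}:

    import struct

    def float32_fields(x):
        """Return (s, c, f) from the IEEE single precision bit pattern of x."""
        (bits,) = struct.unpack(">I", struct.pack(">f", x))
        s = bits >> 31                  # 1 sign bit
        c = (bits >> 23) & 0xFF         # 8-bit biased exponent
        f = bits & 0x7FFFFF             # 23-bit stored mantissa (the leading 1 is implicit)
        return s, c, f

    s, c, f = float32_fields(9.8125)
    print(s, c - 127, hex(f))                             # 0 3 0x1d0000
    print((-1) ** s * (1 + f / 2**23) * 2 ** (c - 127))   # 9.8125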
Computer Arithmetic
12•
The mantissaf
actually corresponds to 24 binary digits (i.e., precisiont = 24
), themachine epsilon is
ε
M= 2
−24+1= 2
−23≈ 1.192 × 10
−7.
(2)•
This approximately corresponds to 6 accurate decimal digits. And the first single precision floating-point number greater than 1 is1 + 2
−23.•
The largest number that can be represented by the single precision format is approximately2
128≈ 3.403 × 10
38, and the smallest positive number is2
−126≈ 1.175 × 10
−38.Department of Mathematics – NTNU Tsung-Min Hwang September 14, 2003
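
If NumPy is available (an assumption), these single precision constants can be read off directly:

    import numpy as np

    info = np.finfo(np.float32)
    print(info.eps)    # 1.1920929e-07, i.e. 2**-23
    print(info.max)    # 3.4028235e+38, just below 2**128
    print(info.tiny)   # 1.1754944e-38, i.e. 2**-126, the smallest normalized positive number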
☞ A floating point number in double precision IEEE standard format uses two words (64 bits) to store the number, as shown in Figure 2.

Figure 2: 64-bit double precision (bit 0: sign of mantissa; bits 1-11: 11-bit exponent; bits 12-63: 52-bit normalized mantissa).

• The first bit is a sign indicator, denoted s. This is followed by an 11-bit exponent c and a 52-bit mantissa f.
• The actual exponent is c − 1023.
• The machine epsilon is
    ε_M = 2^{−52} ≈ 2.220 × 10^{−16},
  which provides between 15 and 16 decimal digits of accuracy.
• The range is approximately 2^{−1022} ≈ 2.225 × 10^{−308} to 2^{1024} ≈ 1.798 × 10^{308}.
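
The corresponding double precision constants are exposed by Python's sys.float_info, since CPython floats are IEEE doubles on essentially all platforms (again an assumption about the platform):

    import sys

    print(sys.float_info.epsilon)  # 2.220446049250313e-16, i.e. 2**-52
    print(sys.float_info.max)      # 1.7976931348623157e+308, just below 2**1024
    print(sys.float_info.min)      # 2.2250738585072014e-308, i.e. 2**-1022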