Chapter 1
Mathematical Preliminaries and Error Analysis
Hung-Yuan Fan (范洪源)
Department of Mathematics, National Taiwan Normal University, Taiwan
Spring 2016
Section 1.1
Review of Calculus
Limits
Def 1.1
A fuction f : X→ R has the limit L at x0, denoted by
xlim→x0
f(x) = L,
if∀ ε > 0, ∃ δ > 0s.t. x ∈ X, 0 < |x − x0| < δ ⇒ |f(x) − L| < ε.
Continuity (連續性)
Def 1.2
1 A fuction f : X→ R is continuous (簡寫: conti.) at x0∈ X if
xlim→x0
f(x) = f(x0).
2 f is conti. on X if it is conti. at each point of X.
3 C(X) ={f | f is conti. on X} denotes the set of all conti.
functions defined on X.
Note: if X = [a, b], (a, b), [a, b) or (a, b] with a < b, write C[a, b],
C(a, b), C[a, b) or C(a, b], respectively.Limits of Sequences
Def 1.3
A sequence (簡寫: seq.) of real numbers{xn}∞n=1 converges (簡寫:
conv.) to the limit x, written
nlim→∞xn= x, or xn→ x as n → ∞, if∀ ε > 0, ∃ N(ε) ∈ N s.t. n > N(ε) ⇒ |xn− x| < ε.
Thm 1.4 (序列與連續性的關係)
Let f be a real-valued function defined on∅ ̸= X ⊆ R and x0∈ X.
The followings are equivalent:
a. f is conti. at x0.
b. ∀ seq. {xn}∞n=1⊆ X with lim
→∞xn= x0, lim
→∞f(xn) = f(x0).
Differentiability (可微分性)
Def 1.5
1 A fuction f : X→ R is differentiable (簡寫: diffi.) at x0 ∈ X if f′(x0) = lim
x→x0
f(x)− f(x0) x− x0
= lim
h→0
f(x0+ h)− f(x0)
h .
2 f is cdiff. on X if it is cdiffy6t. at each point of X.
3 Cn(X) denotes the set of all functions having n conti.
derivatives on X.
4 C∞(X) denotes the set of functions having derivatives of all orders on X.
Continuity v.s. Differentiability
Thm 1.6
Let f be a real-valued function defined on X and x0 ∈ X. Then f is diff. at x0 =⇒ f is conti. at x0.
Rolle’s Theorem
Thm 1.7 (Rolle’s Thm)
f∈ C[a, b] and f is diff. on (a, b). If f(a) = f(b), then ∃ c ∈ (a, b) s.t. f′(c) = 0.
Generalized Rolle’s Theorem
Thm 1.10 (Generalized Rolle’s Thm)
f∈ C[a, b] is n times diff. on (a, b). If f(xi) = 0 for some n + 1 distinct numbers a≤ x0< x1 <· · · < xn≤ b, then
∃ c ∈ (x0, xn)⊆ [a, b] s.t. f(n)(c) = 0.
Mean Value Theorem (簡寫: MVT, 均值定理)
Thm 1.8 (MVT)
f∈ C[a, b] and f is diff. on (a, b). Then ∃ c ∈ (a, b) s.t.
f′(c) = f(b)− f(a)
b− a or f(b)− f(a) = f′(c)(b− a).
Extreme Value Theorem (簡寫: EVT, 極值定理)
Thm 1.9 (EVT)
If f∈ C[a, b], then ∃ c1, c2∈ [a, b] s.t.
f(c1)≤ f(x) ≤ f(c2) ∀ x ∈ [a, b].
Intermediate Value Theorem (簡寫: IVT, 中間值定理)
Thm 1.11 (IVT)
f∈ C[a, b], K is any number between f(a) and f(b)
=⇒ ∃ c ∈ (a, b) s.t. f(c) = K.
Taylor Polynomials and Series
Thm 1.14 (Taylor’s Thm, 泰勒定理)
f∈ Cn[a, b], f(n+1) ∃ on [a, b] and x0∈ [a, b].
⇒ ∀ x ∈ [a, b], ∃ ξ(x) between x0 and x s.t. f(x) = Pn(x) + Rn(x), where
Pn(x) =
∑n k=0
f(k)(x0)
k! (x− x0)k, (the nth Taylor poly. for f) Rn(x) = f(n+1)(ξ(x))
(n + 1)! (x− x0)n+1.
(remainder or truncation error associated with Pn(x))
Taylor Series (泰勒級數)
Remarks
1 If lim
n→∞Rn(x) = 0 ∀ x ∈ I (I: interval with x0∈ I), then f(x) = lim
n→∞Pn(x) =
∑∞ k=0
f(k)(x0)
k! (x− x0)k ∀ x ∈ I.
We say that the Taylor series for f about x0 conv. to f on I.
2 If x0= 0, the Taylor series is often called the Maclaurin
series.
Example 3, p. 11
The second (or third) Taylor poly. for f(x) = cos x about x0= 0 is P2(x) = P3(x) = 1−12x2, but their truncation errors satisfy
|R2(x)| ≤| sin ξ(x)| |x|3 6 ≤ |x|4
6 = 0.1¯6· |x|4 (∵ | sin ξ(x)| ≤ |ξ(x)| ≤ |x| ∀ x ∈ R)
|R3| ≤| cos ˜ξ(x)| |x|4 24 ≤ |x|4
24 = 0.041¯6· |x|4. (Sharper Bound for|x| ≈ 0!)
What are the goals of numerical analysis?
Remark
Two objectives of numerical analysis:
1 Find an approximation to the solution of a given problem.
2 Determine a bound for the accuracy of the approximation.
Is this error bound tight and sharp?
Integration (1/2)
Def 1.12 (定積分的定義)
1 The (Riemann) definite integral of f on [a, b] is defined by
∫ b
a
f(x) dx = lim
max
1≤i≤n∆xi→ 0
∑n i=1
f(zi)∆xi,
where P ={a = x0 < x1<· · · < xn= b} is any partition of [a, b], zi∈ [xi−1, xi] and ∆xi= xi− xi−1 for i = 1, 2, . . . , n.
2 f is called (Riemann) integrable over [a, b] if the limit exists.
Note: f is conti. on [a, b]
⇒ f is integrable over [a, b].Integration (2/2)
Remark
f is integrable over [a, b] =⇒
∫ b
a
f(x) dx = lim
n→∞
∑n i=1
f(zi)�x, (�x =
b
− an
)≈
∑n i=0
wi· f(xi)�x, (wi
: weighting coeff.)
withzi= xi or xi−1 for i = 1, 2, . . . , n.
Riemann Sums (黎曼和) with z
i= x
i∀ i
Weighted MVT for Definite Integrals
Thm 1.13 (定積分的權重均值定理)
f∈ C[a, b] and g is an integrable function that does not change sign on [a, b]. Then∃ c ∈ (a, b) s.t.
∫ b a
f(x)g(x) dx = f(c)
∫ b a
g(x) dx.
Note: When
g(x)≡ 1, we havef(c) = 1 b− a
∫ b
a
f(x) dx≡ favg, where favg is the average value of f on [a, b].
The Average Value of a Function
Section 1.2
Round-off Errors and Computer Arithmetic
(捨入誤差與電腦算術)
Binary Machine Numbers (二進位機器數字)
IEEE 754-1985 Standard (updated version: IEEE 754-2008)
1 Single Precision Format (32 bits; 單精度)
2 Double Precision Format (64 bits; 雙精度)
3 Extended Precision Format (80 bits; 擴充精度) sign: 1 bit, exponent: 16 bits, fraction: 63 bits
64-bit Floating-Point Representation
64-bit representation is used for a real number.
Each binary floating-point number (浮點數) has at least 16
decimal digits of precision.
1-bit sign (符號) s is followed by 11-bit exponent (指數) c (characteristic, 0≤ c ≤ 211− 1 = 2047) and 52-bit binary
fraction f (mantissa: 尾數).
The Normalized Forms (正規化形式或標準化形式)
Normalized binary floating-pint form of x∈ R is
fl(x) = (−1)s2c−1023(1 + f)2 = (−1)s( 1 +
∑k i=1
bi2−i )
102c−1023, where f = (0.b1b2· · · bk)2.
F ={fl(y) | y ∈ R} is a finite (and proper) subset of R.
The difference between two adjacent (相鄰的) 64-bit floating-point numbers is εM = 2−52≈ 2.22 × 10−16.
Note: the machine precision (or epsilon) is
εM = 2−23≈ 1.19 × 10−7 for the single precision format.
Some Examples
1 Since
27.5664062510= 11011.100100012
= 1.1011100100012× 24, (Normalized Form) we have s = 0, c = 4 + 1023 = 102710=100000000112 and mantissa f = 0.1011100100012. Using IEEE 754 format⇒
0
10000000011 10111001000100· · · 0 (補 40 個零!)2 Note that
0.110= 0.0 00112= 1.1 00112× 2−4. How to store 0.110 by using IEEE 754 format?
Remarks on IEEE 754 Format
1 The smallest positive floating-point number (with s = 0, c = 1 and f = 0) is
flmin= 2−1022(1 + 0)≈ 2.2 × 10−308.
2 The largest one (with s = 0, c = 2046 and f = 1− 2−52) is flmax= 21023(2− 2−52)≈ 1.8 × 10308.
3 |fl(x)| > flmax ⇒ overflow (上溢位) and |fl(x)| < flmin⇒
underflow (下溢位) and reset x = 0.
4 Two zeros +0 (with s = 0, c = 0, f = 0) and−0 (with s = 1, c = 0, f = 0) exist!
Decimal Machine Numbers (十進位機器數字)
Normalized decimal floating-point form of y∈ R is fl(y) =±0.d1d2· · · dk× 10n,
where 1≤ d1 ≤ 9, 0 ≤ di≤ 9 (i = 2, . . . , k) and n ∈ Z. In this case, fl(y): k-digit decimal machine number.
The k-digit fl(y) of a normalized real number y =±0.d1d2· · · dkdk+1· · · × 10n
can be obtained by terminating the mantissa of y at k decimal digits.
Two Methods of Termination
1
Chopping: (直接捨去法)
fl(y) =±0.d1
d
2· · · dk× 10n, i.e. simply chop off the digits dk+1dk+2· · · .2
Rounding: (四捨五入法)
fl(y) ={ ±(0.d1d2· · · dk+ 10−k)× 10n, dk+1 ≥ 5 (Round Up)
±0.d1
d
2· · · dk× 10n, dk+1 < 5 (Round Down)≡ ±0.δ1δ2· · · δk× 10nafter chopping.
Example 1, p. 20
Determine the 5-digit (a) chopping and (b) rounding values of π = 0.31415926· · · × 101.
Sol:
(a) fl(π) = 0.31415× 101 by chopping.
(b) fl(π) = (0.31415 + 10−5)× 101 = 0.31416× 101 by rounding.
Absolute and Relative Errors (絕對誤差與相對誤差)
Def 1.15
If p∗ is an approximation to p, then
1 the absolute error is AE(p∗) =|p∗− p|.
2 the relative error is
RE(p∗) = |p∗− p|
|p| , providred that p̸= 0.
Note: the relative error is independent of the magnitude of p, but
the absolute error might vary widely!Examples of Abs. and Rel. Errors
Example 2, p. 21
Find the abs. and rel. errors when approximating p by p∗. (a) p = 0.3000× 101 and p∗ = 0.3100× 101.
(b) p = 0.3000× 10−3 and p∗ = 0.3100× 10−3. (c) p = 0.3000× 104 and p∗ = 0.3100× 104.
Sol:
(a) AE(p∗) = 0.1 and RE(p∗) = 0.333 �3× 10−1.
(b) AE(p∗) = 0.1× 10−4 and RE(p∗) = 0.333 �3× 10−1. (c) AE(p∗) = 0.1× 103 and RE(p∗) = 0.333 �3× 10−1.
( 相對誤差都一樣, 但是絕對誤差變化很大!)
Significant Digits (有效位數)
Def 1.16
p∗ approximate p̸= 0 to t significant digits (or figures) if
∃ largest t ∈ N ∪ {0} satisfying RE(p∗) = |p∗− p|
|p| ≤ 5 × 10−t.
Note: for any normalized y = 0.d
1d2· · · × 10n∈ R, its k-digit decimal representation satisfiesRE(fl(y))≤ 10−k+1= 10−(k−1) by using chopping (see the textbook), and
RE(fl(y))≤ 0.5 × 10−k+1 = 5× 10−k
Finite-Digit Arithmetic (有限位數的算術)
Elementary Floating-Pont Arithmetic
For floating-point representations fl(x) and fl(y) of real numbers x and y, assume that
x⊕ y = fl(fl(x) + fl(y)), x ⊗ y = fl(fl(x) × fl(y)), x⊖ y = fl(fl(x) − fl(y)), x ⊘ y = fl(fl(x) ÷ fl(y)).
Note: in practical computation, we usually have
fl(x op y) = (x op y)(1 + δ) with|δ| ≤ εM, where op = +,−, ×, ÷, and εM is the machine precision.Subtraction of Nearly Equal Numbers (相近數的減法)
Cancellation of Significant Digits
If x, y∈ R (x > y) have the k-digit decimal representations fl(x) = 0.d1
d
2· · · dpαp+1αp+2· · · αk× 10n, fl(y) = 0.d1d
2· · · dpβp+1βp+2· · · βk× 10n, thenfl(x)− fl(y) = (0.αp+1αp+2· · · αk− 0.βp+1βp+2· · · βk)× 10n−p
≡ 0.σp+1σp+2· · · σk× 10n−p,
i.e. x⊖ y = fl(fl(x) − fl(y)) has at most k− psignificant digits, with the last p digits being either 0 or randomly assigned.
Magnification of Absolute Errors (絕對誤差的擴大)
Remark
Suppose that fl(z) = z + δ with|δ| being the absolute error. If ε = 10−n with n∈ N is a number of small magnitude, then
fl(z)
fl(ε) ≈ (z + δ) × 10n= z
ε+ 10nδ.
So, the absolute error in computing z/ε is fl(z)
fl(ε) −z ε
≈ 10n· |δ| = |δ|/ε.
Example 4, pp. 23–24 (1/2) Given four real numbers
x = 5
7 = 0.714285, u = 0.714251 v = 98765.9, w = 0.111111× 10−4.
Find 5-digit chopping values of x⊖ u, (x ⊖ u) ⊘ w, (x ⊖ u) ⊗ v and u⊕ v.
Sol: The absolute error for x
⊖ u is|(x − u) − (x ⊖ u)| = |(x − u) − fl(fl(x) − fl(u))|
=|(5
7 − 0.714251) − fl(0.71428 × 100− 0.71425 × 100)|
=|0.347143 × 10−4− 0.30000 × 10−4|
× 10−5
Example 4, pp. 23–24 (2/2)
The relative error for x⊖ u is given by RE(x⊖ u) = 0.47143× 10−5
0.347143× 10−4 = 0.1358 ≤ 0.136.
How to avoid the loss of accuracy?
Some Tricks
1 Reformulation of the calculations to avoid the subtraction of two nearly equal numbers.
(改變計算公式以避免相近數字相減)
2 Rearrangement of the calculations by the nested arithmetic.
(利用巢狀算術技巧以減少四則運算數量)
The lesson:
Think before you compute!Illustration of Trick 1
Distinct real roots of ax2+ bx + c = 0 with a̸= 0 and b2− 4ac > 0 are
x1 = −b +√
b2− 4ac
2a , x2= −b −√
b2− 4ac
2a .
Ifb > 0 and4ac≪ b2, then
−b +√
b2− 4ac ≈ 0 ⇒Loss of accuracy for computing x1! Rewrite the formula for x1by rationalization (有理化)
x1= −2c b +√
b2− 4ac. (分母不是相近數相減!) Use x1x2= ca ⇒ x2=axc
1 = −b−√2ab2−4ac.
An Example for Trick 1 (1/2)
Example, pp. 25–26
Use 4-digit rounding arithmetic to determine the first root x1 of f(x) = x2+ 62.10x + 1 = 0.
Sol: Two real roots of f(x) = 0 are approximately
x1 =−0.01610723, x2 =−62.08390.Use 4-digit rounding⇒ fl(√
b2− 4ac) = fl(√
(62, 10)2− (4.000)(1.000)(1.000)) = 62.06, f(x1) = −62.10 + 62.06
2.000 =−0.02000, with the relative error being
| − 0.01611 + 0.02000| −1
An Example for Trick 1 (2/2)
In addition, if we use the reformulation for x1, then fl(x1) = fl
( fl(−2c) fl(b +√
b2− 4ac) )
= fl
( −2.000 62.10 + 62.06
)
=−0.01610, which has the small relative error 6.2× 10−4.
Note:
近似零根 x1 的精度提升至 3 個有效位數!An Example of Polynomial Evaluation (1/2)
Example 6, pp. 26–27
Evaluate the 3-digit chopping and rounding values of a poly.
f(x) = x3− 6.1x2+ 3.2x + 1.5 at x = 4.71.
Sol: The actual value is y = f(4.71) =
−14.263899. Using 3-digit chopping/rounding arithmetic, we have The 3-digit approx. values of y arefl(y) = fl(
((104.− 134.) + 15.0) + 1.5)
=−13.5, (Chopping) fl(y) = fl(
((105.− 135.) + 15.1) + 1.5)
=−13.4. (Rounding)
An Example of Polynomial Evaluation (2/2)
Hence, the relative errors in computing fl(y) are RE(fl(y)) =−14.263899 + 13.5
−14.263899 ≈ 5.36 × 10−2, (Chopping) RE(fl(y)) =−14.263899 + 13.4
−14.263899 ≈ 6.06 × 10−2. (Rounding)
=⇒ Onlyone significant digitfor both chopping and rounding values of y = f(4.71)!
Nested Arithmetic (巢狀算術)
Rearrangement of Poly. Evaluation
Direct Computation: (4 multiplications and 3 additions) f(x) = x· (x · x) − 6.1 · (x · x) + 3.2 · x + 1.5 Nested Computation: (2 multiplications and 3 additions)
f(x) =(
(x− 6.1) · x + 3.2)
· x + 1.5
Again, using 3-digit arithmetic with the nested form =⇒ RE(fl(y)) =−14.263899 + 14.2
−14.263899 ≈ 4.5 × 10−3, (Chopping) RE(fl(y)) =−14.263899 + 14.3 ≈ 2.5 × 10−3. (Rounding)
Useful Suggestion
The accuracy of an approximation can be improved ifwe reduce the number of arithmetic operations.
(減少四則運算的數量可以改進計算解的精度!)
HW of Sec 1.2:
√
24,
Section 1.3
Algorithms and Convergence
(演算法與收斂性)
Algorithms and Pseudocodes (虛擬碼)
An algorithm is a procedure that describes a finite sequence
of steps to be performed in a specified order.
The objective of an algorithm is to implement a procedure for
solving a problem or approximating a solution to the problem.
(演算法目標是求解問題或是得到該問題的數值近似解) Pseudocode is an informal environment-independent description of the key principles of an algorithm.
It uses structural conventions of a programming language, but is intended for human reading rather than machine reading.
An Example of Pseudocode To solve the root-finding problem
f(x) = ax2+ bx + c = 0 with a̸= 0.
INPUT coefficients a, b, c.
OUTPUT approximate root x.
Step 1 Compute the discriminant D = b2− 4ac.
Step 2 Compute approximate root x to f(x) = 0 using D.
Step 3 OUTPUT(x); STOP.
An Illustration of Algorithm
Example 1, p. 33
The Nth Taylor poly. of f(x) = ln x about x0 = 1 is
PN(x) =
∑N i=1
(−1)i+1
i (x− 1)i.
Construct an algorithm to determine the minimal value of N s.t.
| ln(1.5) − PN(1.5)| < 10−5.
Note: From the Alternating Series Thm =
⇒| ln x − PN(x)| ≤(−1)N+1
N + 1 (x− 1)N+1. So, the stopping criterion (停止準則) should be
|aN+1| =(−1)N+1
(x− 1)N+1 < TOL,
Algorithm for Example 1
Stability of Algorithms (演算法的穩定性)
Definition
An algorithm is called stable if it satisfies the property that
small changes in the initial data produce correspondingly small changes in the final results.
(初始資料的微小變動 =
⇒ 計算結果也是微小變化)Otherwise, the algorithm is called unstable, i.e. small changes in the initial data produce large changes in the final results.
(初始資料的微小變動 =
⇒ 計算結果產生大幅變化)Growth of Errors
Def 1.17 (誤差的線性與指數成長)
E0 > 0: the magnitude of error at some stage in the calculations, En: the magnitude of error after n subsequent operations.
1 The growth of error is called linear if En≈ CnE0, where the constant C > 0 is independent of n.
2 The growth of error is called exponential if En≈ CnE0 for some C > 1.
Example of an Unstable Algorithm (1/2)
An Unstable Procedure
The sequence{pn}∞n=0 defined by pn= c1(1
3)n+ c23n
is the general solution to the recursive equation (遞迴方程式) pn= 103pn−1− pn−2, n = 2, 3, . . . .
p
0= 1, p1= 13 ⇒ c1 = 1, c2 = 0. The solution is pn= (13)n.
Use 5-digit rounding ⇒ ^p0 = 1.0000, ^
p
1= 0.33333 and hence ^c
1= 1.0000, ^c
2 =−0.12500 × 10−5. The solution isExample of an Unstable Algorithm (2/2)
The absolute error in computing ˆpn is
AE(ˆpn) = pn− ˆpn= 0.12500× 10−5(3n).
=⇒ An unstable procedure with exponential growth of errors!
Rates of Convergence (收斂比率)
Def 1.18
Suppose that{αn}∞n=1 and {βn}∞n=1 are two sequences with
nlim→∞αn= α and lim
n→∞βn= 0. If∃ K > 0 and n0 ∈ N s.t.
|αn− α| ≤ K|βn| ∀ n ≥ n0,
then we say that{αn}∞n=1 conv. to α with rate (or order) of
convergence O(β
n), and writeαn= α + O(βn). (as n→ ∞)
Note: seq.
{αn}∞n=1 is often generated by some iterative method (迭代法), and it is often compared with βn= 1np for p > 0.Example 2, p. 37
For n≥ 1, consider two sequences of real numbers αn= n + 1
n2 and αˆn= n + 3 n3 . Determine their rates of convergence.
Sol: Since
|αn− 0| = n + 1
n2 ≤ n + n
n2 = 2·1
n ≡ 2βn,
|ˆαn− 0| = n + 3
n3 ≤ n + 3n
n3 = 4· 1
n2 ≡ 4 ˆβn
for all n≥ 1, it follows that αn= 0 + O(
1
n
), αˆn= 0 + O(1
n
2).Big-Oh Notation (大 O 符號)
Def 1.19
Suppose that lim
h→0F(h) = L and lim
h→0G(h) = 0. If ∃ K > 0 and δ > 0 s.t.
|F(h) − L| ≤ K|G(h)| for 0 < |h| < δ, then we write
F(h) = L + O(G(h)). (as h→ 0)
Note: In practice, we often choose G(h) = h
p for p > 0, and the largest value of p is expected.Example 3, p. 38
Show that cos h +12h2 = 1 + O(h4).
pf: From Taylor’s Thm,
∃ ξ(h) between 0 and h s.t.cos h = 1−1
2h2+cos ξ(h)
24 h4 for h̸= 0.
Hence, we see that (cos h +1
2h2)− 1 = | cos ξ(h)|
24 |h4| ≤