An Introduction to Computational Mathematics

(1)

AN INTRODUCTION

I-Liang Chern

Department of Mathematics National Taiwan University

2015

(2)

(3)

Solving Equations of One Variable

1.1 Motivation

This part is taken from QSG’s book.

Motivation 1: Investment fund At the beginning of every year a bank customer deposits v euros in an investment fund and withdraws, at the end of the nth year, a capital of M euros. We want to compute the average yearly rate of interest r of this investment. Since M is related to r by the relation M = v n X k=1 (1 + r)k= v1 + r r [(1 + r) n_{− 1],}

we deduce that r is the root of

f (r) = 0, f (r) = v1 + r

r [(1 + r)

n_{− 1] − M.}

Motivation 2: State equation of a gas The van der Waals equation of state (i.e. the equation that relates pressure p, volume V and temperature T ) is

p + aN V

(V − N b) = kN T,

where N is the number of molecules and k is the Boltzmann constant. The parameters a measures the attraction between molecules and b measures the volume occupied by the molecules (Van der Waals equation, Wiki).

We want to determine the volume V occupied by a gas at temperature T and pressure p.

Motivation 3: Rods system Let us consider the mechanical system represented by the four rigid rods ai of Figure 2.1. For any admissible value of the angle β, let us determine the value of the corresponding angle α between the rods a1and a2. Starting from the vector identity

a1− a2− a3− a4= 0,

(7)

and noting that the rod a1 is always aligned with the x-axis, we can deduce the following

relation-ship between β and α:

a2₁+ a2₂− a2₃+ a2₄+ 2a1a4cos β − 2a1a2cos α − 2a2a4cos(β − α) = 0.

We would also like to mention that a solution does not exist for all values of β, and may not even be unique. To solve the equation for any given β lying between 0 and π we should invoke numerical methods.

Motivation 4: Population dynamics The population of species or bacteria is modeled by x+= xR(x)

where x and x+ are respectively the populations of the present and next generations. R(x) is the growth rate.

• Beverton-Holt or discrete Verhulst model R(x) = r 1+Kx

• Redator/prey model R(x) = _1+(x/K)rx 2.

We look for saturated population where x+= x.

1.2 Newton’s method

Goal: solve f (x) = 0

Strategy

• it is an iterative method

• it approximate the equation by a linear equation at each step, use the solution of the linear equation to approximate the root.

f (x) ∼ f (xn) + f0(xn)(x − xn) = 0.

Algorithm

xn+1= xn− f0(xn)−1f (xn).

Error analysis

(8)

1.2. NEWTON’S METHOD 3 Proof. Let g(x) = x−f0(x)−1f (x). We have g(x∗) = x∗. Subtract xn+1 = g(xn) and x∗ = g(x∗),

we get en+1= g(xn) − g(x∗) = g(x∗+ en) − g(x∗) = 1 2g 00_(ξ)e2 n.

Here, we have used g0(x∗) = 0 and the lemma below. In computing g0(x∗), we use

g0(x∗) = 1 −f

0_(x∗₎2_{− f (x}∗_)f00_(x∗₎

f0(x∗)2 = 0

where f0(x∗) 6= 0 is used.

Lemma 1.1. If g ∈ C2andg(0) = g0(0) = 0, then there exists ξ between 0 and x such that g(x) = 1

2g

00_(ξ)x2_.

Proof. We use integral form of mean value theorem.

g(x) = Z x

0

(x − t)g00(t) dt. Then use mean value theorem:

Z x 0 (x − t)g00(t) dt = g00(ξ) Z x 0 (x − t) dt.

Remark. The proof above requires too many derivatives of f (it uses g00, which is equivalent to f000). But we only need f00for quadratic convergence. Here is a shorter proof.

en+1= en− f (xn) f0_(x n) = enf 0_(x n) − f (xn) f0_(x n) From 0 = f (x∗) = f (xn− en) = f (xn) − enf0(xn) + 1 2f 00_(ξ)e2 n, we get en+1= 1 2 f00(ξ) f0_(x n) e2_n, or en+1≈ 1 2 f00(x∗) f0(x∗)e 2 n.

(9)

Important example: Solve x2 = a, a > 0. The corresponding Newton’s method is xn+1 = xn−

x2_n− a 2xn

:= g(xn)

This was known as the Babylonian method. Heron of Alexandria (AD 10 - AD 70) was a Greek mathematician who described an iterative method of computing the square root. Heron’s method can also be derived as a special case of the (much) later Newton’s method (16th century). In fact, an implementation of this algorithm is found on a Babylonian clay tablet (YBC7289, 1800-1600 BC), hence the Heron’s method is also known as the Babylonian method. After thousands of years, today it has been one of the most commonly taught examples in numerical computation and analysis, the basis of many numerical algorithms of nonlinear equations and optimization problems, and in fact the most common algorithm for computing square roots. For this method, you can show that

• g : [√a, ∞) → [√a, ∞) is a strictly decreasing function. • g : (0,√a] → [√a, ∞).

These two properties assure the convergence of {xn} if x0 > 0.

A third order method If we use a parabola to approximate f , we can get a third order method. f (x) ∼ f (xn) + f0(xn)(x − xn) +

1 2f

00_(x

n)(x − xn)2= 0.

This leads to a root

xn+1= xn+

−f0(xn) +pf0(xn)2− 2f (xn)f00(xn)

f00_(x n)

You can prove it is third order convergence.

Multiple roots You can check that the method falls to a first order method if the root x∗ is a double root. If the multiplicity is p, then the Newton method convergence rate becomes (1 − 1/p)n. This can serve to detect the multiplicity of x∗. There are two ways to gain high order convergence. The first one is to modify the scheme to

xn+1= xn− p

f (xn)

f0_(x n)

:= g(xn),

provided we have known that the multiplicity of x∗is p. You can check that g0(x∗) = 0. The second method is to consider the new function

µ(x) := f (x) f0_(x),

which has a simple root at x∗. You can apply Newton’s method to this one. xn= g(xn−1), where g(x) = x − µ(x) µ0_(x) = x − f (x)f0(x) f0_(x)2_{− f (x)f}00_(x)

(10)

1.3. SECANT METHOD 5

1.3 Secant method

Goal: solve f (x) = 0

Strategy

• it is an iterative method

• it approximate the equation by a linear equation at each step, use the solution of the linear equation to approximate the root.

f (x) ∼ f (xn) + f (xn) − f (xn−1) xn− xn−1 (x − xn) = 0. Algorithm xn+1= xn− xn− xn−1 f (xn) − f (xn−1) f (xn) Start from x0, x1.

Avoiding loss of significance The divided difference

an:=

f (xn) − f (xn−1)

xn− xn−1

loses significant digits. To fix it, we replace it by an←

f (xn+ h) − f (xn)

h when |xn− xn−1| < h. Here, h is chosen to be a fixed number. For example, we can choose

h =√δxn Then f (xn+ xn √ δ) − f (xn) ∼ f0(xn)xn √ δ + O(1)δ. The function difference loses only one digit provided O(1) = 1.

Efficiency In Newton’s method, we need to evaluate both f (xn) and f0(xn). In secant method,

we only evaluate f (xn). So, if it is time consuming to evaluate f0(xn), the secant method is more

(11)

Error Analysis and Convergence Rate We use Newton’s divided difference to approximate a function by polynomials. Define the Newton divided difference by

f [x0, x1] = f (x1) − f (x0) x1− x0 , f [x0, x1, x2] = f [x1, x2] − f [x0, x1] x2− x0 . From this definition, we have the interpolation formulae

f (x) = f (x0) + f [x0, x](x − x0)

f [x0, x] = f [x0, x1] + f [x0, x1, x](x − x1)

The latter leads to

f (x) − f (x0) = f [x0, x1](x − x0) + f [x0, x1, x](x − x0)(x − x1).

In general,

f (x) = f (x0) + f [x0, x1](x − x0) + f [x0, x1, x2](x − x0)(x − x1) + · · ·

+ f [x0, x1, ..., xn, x](x − x0) · · · (x − xn).

Now, consider secant method. At xn, it approximates f (x) by the linear function

`(x) = f (xn) + f [xn, xn−1](x − xn) Therefore, 0 = `(xn+1) = f (xn) + f [xn, xn−1](xn+1− xn) = 0. (1.1) By definition, f (x) = f (xn) + f [xn, xn−1](x − xn) + f [xn, xn−1, x](x − xn)(x − xn−1). At x = x∗, 0 = f (x∗) = f (xn) + f [xn, xn−1](x∗− xn) + f [xn, xn−1, x∗](x∗− xn)(x∗− xn−1). (1.2)

Use en= xn− x∗, subtract the above two equations (1.1) (1.2), we get

f [xn, xn−1]en+1− f [xn, xn−1, x∗]enen−1 Hence en+1 = f [xn, xn−1, x∗] f [xn, xn−1] enen−1

By mean value theorem, we have

(12)

1.4. A DYNAMICAL SYSTEM POINT OF VIEW OF ITERATIVE MAP 7

f [xn, xn−1, x∗] =

1 2f

00_{(ζ) for some ζ in the interval containing x}

n, xn−1, x∗. Let M = 1 2 maxIf00(x) minIf0(x) Then |en+1| ≤ M |enen−1|.

Now, define ρn= M |en|, then we have

ρn+1 ≤ ρnρn−1.

We choose x0and x1 so that |ρ0| < 1 and |ρ1| < 1. Now, we choose

ρ = min{ρ0, ρ1} < 1.

It is easy to see by induction that ρn< 1 for all n. Furthermore

ρ2≤ ρ1ρ0≤ ρ2, ρ3≤ ρ2ρ1≤ ρ2ρ = ρ3 ρ4≤ ρ3ρ2≤ ρ3ρ2 = ρ5 .. . ρn+1≤ ρqn+1= ρqn+qn−1

where qn+1= qn+ qn−1, q0 = q1 = 1. The solution

qn= 1 √ 5 λ n+1 1 − λn+12 , λ1 = 1 +√5 2 , λ2 = 1 −√5 2

are the two roots of λ2− λ − 1 = 0. The solution qn∼ √1₅λn+11 . Thus,

ρn≤ ρ1/ √ 5λ n+1 1 , or |en| ≤ M−1 ρ1/ √ 5λ n+1 1 .

1.4 A dynamical system point of view of iterative map

The iterative xn+1 = g(xn) can be viewed as a discrete dynamical system, which serves as an

important model for a class of physical world. For instance, the discrete logistic map xn+1 = axn(1 − xn).

(13)

Discrete logistic map For iterative map g(x) starting from x0, those x0 which leads xn to

con-verge to a fixed point x∗is called the basin of convergence of x∗. So, each fixed point has its own basin of attraction. They are disjoint. However, not all points belong to the basins of the fixed points. For instance, for certain range of a, the intersection of the discrete logic map g(x) = ax(1 − x) has period 2 stable solution. Then period 4 stable solutions, etc. More precisely, when 0 < a < 1, then 0 is the only fixed point, and it is stable. For 1 < a < 3, the fixed points are 0 and (a − 1)/a. The latter is a stable one and the whole region (0, 1) is its basin of attraction. When 3 < a < 1 +√6, there are fixed points of g ◦ g. These two solutions are period 2 solutions. They are stable. This means that all points in (0, 1) are the basin of these period 2 solution. The bifurcation phenomenon from fixed point to a period 2 solution as a across 3 is called period doubling. As a keeps increasing, it exhibits period doubling from period 2 to period 4, to period 8 and so forth. As a ∼ 3.56995, It exhibits so called chaos, where solutions of all periods appear. (see logistic map, wiki)

Newton’s method on complex plane Given a polynomial p(z) on the complex plane, the Newton method gives the iteration

zn+1= zn−

p(zn)

p0(zn)

= g(zn).

For every root ξ of p(z), those z0which generates a convergent sequence to ξ by Newton’s method

is called the basin of convergence of ξ. The basins of attraction of roots of p(z) = 0 are disjoint. However, not all points on C belongs to one of the basin of the roots. There are some points which do not belong to any of these basin sets. The set of these exceptional points are called Julia set of p. For example, you can study the Julia set of the polynomial p(z) = z3 − 1. You can partition the domain [−2, 2] × [−2, 2] into 1000 × 1000 small cells uniformly. For each cell, run Newton’s iteration for 20 step starting from the cell center to classify the class of the cell center. Do the experiment and analyze what you obtain.

1.5 Fixed Point Method

Goal: Solve f (x) = 0.

Strategy : We change this to a fixed point problem:

x = x − λf (x) := g(x) λ 6= 0. λ is chosen so that

|λf0(x)| < 1.

Algorithm

(14)

1.5. FIXED POINT METHOD 9

Remarks

1. λ can vary in each step, i.e.

xn+1= xn− λnf (xn)

In numerical ODE, this λncan be viewed as ∆t. In this sense, such fixed point method can

be thought as a forward Euler method for the ODE: ˙x = −f (x). Practically, λnis chosen so

that |λnf0(xn)| < 1.

2. We can also choose λ to be a function of x. That is,

xn+1= xn− λ(xn)f (xn).

In particular, λ(x) = 1/f0(x) gives the Newton method.

Error Analysis

Definition 1.1. A function g is called a contraction map if there exists a constant 0 ≤ ρ < 1 such that

|g(x) − g(y)| ≤ ρ|x − y| for anyx, y under consideration.

A contraction map is certainly Lipschitz continuous.

Theorem 1.2. If g is a contraction map, then g has a unique fixed point x∗. Moreover, the iterative map

xn+1= g(xn)

converges tox∗linearly in the sense

|x_n− x∗| ≤ ρn|x₀− x∗| Proof. 1. {xn} is Cauchy. We can write

xn= x0+ (x1− x0) + (x2− x1) + · · · + (xn− xn−1) = x0+ n X i=1 (xi− xi−1) The seriesPn

i=1(xi− xi−1) converges absolutely: n X i=1 |x_i− x_i−1| ≤ n X i=1 ρi−1|x₁− x₀| = ρ n_{− 1} ρ − 1|x1− x0| < ∞. Hence xnis Cauchy and thus converges.

2. Subtracting xn+1= g(xn) and x∗ = g(x∗). This leads to

|x_n+1− x∗| = |g(x_n) − g(x∗)| ≤ ρ|xn− x∗| ≤ ρ2|xn−1− x∗|

(15)

3. If both x∗and y∗are fixed points, then

|x∗− y∗| = |g(x∗) − g(y∗)| ≤ ρ|x∗− y∗|, we get (1 − ρ)|x∗− y∗| ≤ 0. Since ρ < 1, we obtain x∗ = y∗.

Example We consider linear equation ax − b = 0. The corresponding fixed point method is xn+1= xn− λ(axn− b) = (1 − λa)xn+ λb

We see that

xn+1− xn= (1 − λa)(xn− xn−1).

Thus, the map xn→ xn+1is a contraction if and only if

|1 − λa| < 1.

Applications

Theorem 1.3 (Implicit Function Theorem). If F : Rm × Rn _{→ R}n_. _{F (x}

0, y0) = 0 and F is

continuously differentiable in a neighborhood of(x0, y0). Further, suppose the Jacobian Fy(x0, y0)

is invertible. Then there exist neighborhoods of x0 and y0, called U and V respectively, and a

continuously differentiable functionf : U → V such that

F (x, f (x)) = 0 for allx ∈ U.

We demonstrate the proof for n = m = 1. It is easy extended to general case. To solve an equation

F (x, y) = 0 for y with given x, we linearize it about (x0, y0):

F (x, y) = F (x0, y0) + aξ + bη + h(ξ, η).

Here, ξ := x − x0, η := y − y0, a = Fx(x0, y0), b = Fy(x0, y0),

h(ξ, η) := F (x0+ ξ, y0+ η) − F (x0, y0) − aξ − bη = o(ξ, η).

Since F (x0, y0) = 0, we solve the perturbation equation:

aξ + bη + h(ξ, η) = 0. for η with small ξ. This can be rewritten as

η = −a bξ −

1

bh(ξ, ηn). We use fixed point method to solve this equation:

ηn+1= −

a bξ −

1

bh(ξ, ηn) := g(ξ, ηn).

We find that as long as ξ is small, then this iterative map g(ξ, ·) is a contraction map. Thus, it has a fixed point η. I shall not go into detail of the proof.

(16)

1.5. FIXED POINT METHOD 11

Acceleration technique In the fixed point method, near the fixed point, we want to find find a better approximation ˆxnto the limit x∗. Since the sequence converges linearly, we expect the three

points(xn−2, xn−1), (xn−1, xn) and (ˆxn, ˆxn) are co-linear.This leads to

xn−1− ˆxn

xn−2− ˆxn

= xn− ˆxn xn−1− ˆxn

We then get a good candidate ˆ

xn= xn−2−

(xn−1− xn−2)2

xn− 2xn−1+ xn−2

This is called Aitken’s extrapolation formula, or the Steffensen algorithm. xn+1 = G(xn)

where

G(x) = x − (g(x) − x)

2

g(g(x)) − 2g(x) + x. Theorem 1.4. The Steffensen algorithm {xk} converges quadratically.

Proof. Without loss of generality, we may assume x∗ = 0. WE can write g(x) = `x + O(x2). We claim G(x) = O(x2). By direct computation, we have

G(x) = x − (g(x) − x)

2

g(g(x)) − 2g(x) + x

= x − ((` − 1)x + O(x

2₎₎2

`(`x + O(x2_{)) + O((`x + O(x}2₎₎2_{− 2(`x + O(x}2_{)) + x}

= x − (` − 1) 2_x2_{+ O(x}3₎ (` − 1)2_{x + O(x}2₎ = O(x 3₎ (` − 1)2_{x + O(x}2₎ = O(x 2_).

This shows G(x) = O(x2) provided g0(x∗) 6= 1/2. For program in matlab, see QSG, pp. 65, program 2.4. Homeworks 1.1. 1. QSG: pp. 74, Ex 2.9

2. Write a program to solve the Wilkinson problem:

f (x) :=

20

Y

i=1

(x − i) + x20

(17)

(18)

Chapter 2

Basic Numerical Linear Algebra

2.1 Motivations

Hydraulic Network The hydraulic network can be modeled as a graph (V, E), where V =

{1, 2, ..., n} is the nodes, E = {(i, j)} ⊂ V × V are the edges. Each edge e = (i, j) is a pipe connecting nodes i and j, on which, we associate a flow velocity ue, area of cross section Aeand

length Le. We assume they are uniform along pipe e. The flow direction gives a natural direction

(orientation) of the pipe. So, the graph is a directed graph: each e = (i, j) ∈ E has a direction from i to j. For (i, j) ∈ E, we define sign(i, j) = 1 and sign(j, i) = −1. On each edge e, we compute the flow rate

Qe:= ρueAe,

which is the amount of water passing through the pipe per unit time. At each node i, we associate with a pressure pi. On each pipe (edge), there is a momentum equation which balances the flux in

the pipe and the pressure difference at the two ends of the pipe. Physically, this is Darcy’s law. It can be expressed as

px= −αeρue

where αeis the friction coefficient, and positive x is the same direction of e. We integrate the Darcy

law over the pipe e and get

pj− pi=

αeLe

Ae

Qesign(j, i).

Here, we have assume ρ ≡ 1. This is the momentum balance equation on each pipe e. At each node i, we conservation of water. To describe this equation, let Ei = {e ∈ E, i is one end of e},

N_i = {j|(i, j) ∈ E} be the neighboring nodes of i. At each interior node i, X e∈Ei sign(j, i)Qe= X j∈Ni Ae αeLe (pj − pi) = 0.

At each boundary node i, a pressure pb_i is prescribed. There are two kinds of boundary nodes, one is node of the end user. The pressure pb_i = 0 there. The other is the water pump node, where

(19)

pb_i > 0 is also prescribed. This equation is a discrete Laplacian for (pi)i∈N with Dirichlet boundary

condition.

A mass-spring system Consider a spring-mass system which consists of n masses placed verti-cally between two walls. The n masses and the two end walls are connected by n + 1 springs. If all masses are zeros, the springs are “at rest” states. When the masses are greater than zeros, the springs are elongated due to the gravitation force. The mass mimoves down uidistance, called the

displacement. The goal is to find the displacements uiof the masses mi, i = 1, ..., n.

In this model, the nodes are the masses mi. We may treat the end walls are the fixed masses,

and call them m0and mn+1, respectively. The edges (or the bonds) are the springs. Let us call the

spring connecting mi and mi+1by edge (or spring) i, i = 1, ..., n + 1. Suppose the spring i has

spring constant ci. Let us call the downward direction the positive direction.

Let me start from the simplest case: n = 1 and no bottom wall. The mass m1 elongates the

spring 1 by a displacement u1. The elongated spring has a restoration force −c1u1acting on m1.1

This force must be balanced with the gravitational force on m1.2Thus, we have

−c₁u1+ f1 = 0,

where f1= m1g, the gravitation force on m1, and g is the gravitation constant. From this, we get

u1 =

f1

c1

.

Next, let us consider the case where there is a bottom wall. In this case, both springs 1 and 2 exert forces upward to m1. The balance law becomes

−c1u1− c2u1+ f1= 0.

This results u1 = f1/(c1+ c2).

Let us jump to a slightly more complicated case, say n = 3. The displacements u0= 0, u4 = 0,

due to the walls are fixed. The displacements u1, u2, u3cause elongations of the springs:

ei= ui− ui−1, i = 1, 2, 3, 4.

The restoration force of spring i is

wi = ciei.

The force exerted to miby spring i is −wi = −ciei. In fact, when ei < 0, the spring is shortened

and it pushes downward to mass mi (the sign is positive), hence the force is −ciei > 0. On the

other hand, when ei > 0, the spring is elongated and it pull mi upward. We still get the force

1_{The minus sign is due to the direction of force is upward.} 2

(20)

2.1. MOTIVATIONS 15

u

₁

m

₁

m

1

-c

₁

u

₁

m

₁

g

-c

₁

u

₁

m

₁

g

-c

₂

u

₁

Figure 2.1: The left one is a spring without any mass. The middle one is a spring hanging a mass m1freely. The right one is a mass m1 with two springs fixed on the ceiling and floor.

−w_i = −ciei < 0. Similarly, the force exerted to mi by spring i + 1 is wi+1 = ci+1ei+1. When

ei+1> 0, the spring i + 1 is elongated and it pulls midownward, the force is wi+1= ci+1ei+1> 0.

When ei+1< 0, it pushes miupward, and the force wi+1= ci+1ei+1< 0. In both cases, the force

exterted to mi by spring i + 1 is wi+1.

Thus, the force balance law on miis

wi+1− wi+ fi = 0, i = 1, 2, 3.

(21)

Let us express the above equations in matrix form. First, the elongation: e = Au, or     e1 e2 e3 e4     =     1 −1 1 −1 1 −1       u1 u2 u3  

the restoration force:

w = Ce, or     w1 w2 w3 w4     =     c1 c2 c3 c4         e1 e2 e3 e4    

the force balance laws:

Atw = f, or   1 −1 1 −1 1 −1       w1 w2 w3 w4     =   f1 f2 f3  

where Atis the transpose of A.

We can write the above equations in block matrix form as C−1 A At 0 −w u = 0 −f .

This kind of block matrix appears commonly in many other physical systems, for instance, network flows, fluid flows. In fact, any optimization system with constraint can be written in this form. Here, the constraint part is the second equation. We shall come back to this point in the next section.

One way to solve the above block matrix system is to eliminate the variable w and get Ku := AtCAu = f.

The matrix K := AtCA is a symmetric positive definite matrix. It is called the stiffness matrix. For n = 4, we get K := AtCA =   c1+ c2 −c2 0 −c₂ c2+ c3 −c3 0 −c3 c3+ c4  

Mimimum principle Consider the functional P (u) := 1

2(Ku, u) − (f, u),

where K is a symmetric positive definite matrix in Rn. The directional derivative of P at u in the direction v is defined as P0(u)v = d dt t=0 P (u + tv)

(22)

2.1. MOTIVATIONS 17 P0(u) is called the gradient (or the first variation) of P at u. We can compute this gradient:3

P0(u)v = d dt _t=0 1 2(K(u + tv), u + tv) − (f, u + tv) = 1 2 (Kv, u) + (Ku, v) − (f, v) = (Ku − f, v).

Here, we have used K being symmetric. Thus,

P0(u) = Ku − f. The second derivative is the Hessian. It is

P00(u) = K.

If u∗is a minimum of P (v), then P0(u∗) = 0. This is called the Euler-Lagrange equation of P . Conversely, If u∗ satisfies the Euler-Lagrange equation Ku∗ = f , then u∗ is the minimum of P (v). In fact, for any v, we compute P (v) − P (u∗). We claim

P (v) − P (u∗) = 1

2(K(v − u

∗

), v − u∗).

To see this, since P (v) is a quadratic function of v, we can complete the squares: P (v) − P (u∗) = 1 2(Kv, v) − (f, v) − 1 2(Ku ∗_{, u}∗_{) + (f, u}∗₎ = 1 2(Kv, v) − 1 2(Ku ∗ , u∗) − (f, v − u∗) = 1 2(Kv, v) − 1 2(Ku ∗ , u∗) − (Ku∗, v − u∗) = 1 2(Kv, v) + 1 2(Ku ∗_{, u}∗_{) − (Ku}∗_{, v)} = 1 2(K(v − u ∗_{), v − u}∗_{) ≥ 0.}

Hence we get that u∗is a minimum. In fact, u∗is the only minimum because P (v) = P (u∗) if and only if (K(v − u∗), v − u∗) = 0. Since K is positive definite, we get v − u∗ = 0.

We conclude the above discussion as the follows.

Theorem 2.1. Let P (u) := 1₂(Ku, u) − (f, u) and K is symmetric positive definite. The vector u∗ which minimizesP (v) must satisfy the Euler-Lagrange equation P0(u∗) = Ku∗− f = 0. The converse is also true.

3_{Here, I use the following properties: (f, g)}0

= (f0, g) + (f, g0). This is because (f, g) =P ifigiand (f, g)0 = P i fi0gi+ fi, gi0 = (f0, g) + (f, g0).

(23)

The physical meaning of P is the total potential energy of the spring-mass system. Indeed, 1 2(CAu, Au) = n X i=1 1 2ci(ui− ui−1) 2

is the sum of the potential energy stored in the spring, whereas the term (f, u) =

n

X

i=1

fiui

is the sum of the works done by the mass miwith displacement uifor i = 1, ..., n. The term −(f, u)

is the gravitational potential due to the masses miwith displacements ui.

Principal Component Analysis In statistics, the data set is usually represented as a matrix A. The data are collected by n experiments. Each experiment has p items, represented by a p row vector. For instance, a row vector is a biological records of a person, containing blood pressure, glucose, etc. Suppose there are p items. The data set has n peoples’ data. Another example is the pattern recognition of a word. The image of a word is represented by a 20 pixel image. We transform it into a row vector. Suppose there are n experiments (say 100 images with the same words written repeatedly.)

To analysis the data set, we first normalize it: ¯ aj := ( n X i=1 aij, A0 := ¯a1T. A1 := (aij − ¯aj)n×p

The matrix A1has zero mean in each column (item). The matrix

C := A∗₁A1 = (haki− ¯ai, akj− ¯aji)p×p.

is called the covariance matrix. It measures the covariance between item i and item j. The principal component analysis is to decompose A1into

A1= p

X

i=1

σiuivTi .

Here ui is an n × 1 vector, via p × 1 vector. The vectors viare indeed the eigenvectors of A∗1A1.

They are orthogonal. This means that viand vjare uncorrelated. Thus, we can decompose the data

set A1 in terms of p uncorrelated items (vi)pi=1. These viare recombination of the items. We may

call them the features. In principal analysis, we want to approximate the data set A1 by only few

such features.

In matrix completion problem, a typical application is to complete an incomplete table of data set. The video company Netflix proposed her incomplete table: each row is the rating record of a

(24)

2.2. INTRODUCTION AND OVERVIEW 19 person on a list of videos. This is an incomplete matrix A. The row m are member list, which is about 500,000. The column n is the video list, which is about 20, 000. The rating is from 1 star to 5 stars. The matrix is certainly incomplete. We want to fill in the vacancy based on an assumption that the completed matrix is low rank. The singular value decomposition of A is

A =

p

X

i=1

σiuivTi .

Each class uivTi represents certain types of videos. For instance, the drama, the action, etc. In a

class uiviT, the row vTi lists those videos with this type (say drama), and uilists those members who

rate these videos higher. The completed matrix can be used for recommendation to the members.

2.2 Introduction and overview

There are three kinds of linear problems we encounter in applications: • solving large linear system: Ax = b

• solving least squares problem: minxkAx − bk2.

• solving eigenvalue problems: Ax = λx • solving singular value decomposition problem. In solving linear systems, there are two classes of methods:

• Direct methods: which solves the equation directly. This is usually for small system. Basi-cally, the solving process is a factorization of the matrix A such as LU-factorization.

• Iterative methods: the basic idea is to decompose A = M − N , where M is a major part and easy to invert, while N is a minor part. Then perform an iteration M xn+1− N xn= b to get

an approximate solution. Usually, a preconditioning is needed, which means that we replace Ax = b by P Ax = P b so that it is easy to have above major-minor decomposition.

In solving eigenvalue problems, I shall discuss the power method and QR algorithm. For least square problem, I shall discussed weighted iterative method.

2.3 *Matrix Algebra

Spectral Decomposition We assume A is an n × n matrix in Cn.

Theorem 2.2 (Caley-Hamilton). Let pA(λ) := det(λI − A) be the characteristic polynomial of A.

ThenpA(A) = 0.

(25)

Theorem 2.4 (Fundamental Theorem of Algebra). Any polynomial p(λ) over C of degree m can be factorized as p(λ) = a m Y i=1 (λ − λi)

for some constanta 6= 0 and λ1, ..., λm∈ C. This factorization is unique.

Definition 2.2. Let A : Cn → Cn_{. A subspace} _{V ⊂ C}n _{is called an invariant subspace of the}

linear mapA if AV ⊂ V.

Definition 2.3. Let A : Cn→ Cn_{. A vector}_{v is called an eigenvector of A if there exists a λ such}

that

Av = λv.

Definition 2.4. For a matrix A, the set of all its eigenvalues σ(A) := {λ1, ..., λn} is called the

spectra ofA.

Definition 2.5. A vector space V is said to be the direct sum of its two subspaces V1 andV2 if for

anyv ∈ V there exist two unique vectors vi ∈ Vi,i = 1, 2 such that v = v1+ v2. We denote it by

V = V1⊕ V2.

Remark 2.1. We also use the notation V = V1+ V2for the property: anyv ∈ V can be written as

v = v1+ v2for somevi ∈ Vi,i = 1, 2. Notice that V = V1⊕ V2if and only ifV = V1+ V2and

V₁∩ V₂ = {0}.

Lemma 2.1. Suppose p and q are two polynomials over C and are relatively prime (i.e. no common roots). Then there exist two other polynomialsa and b such that

ap + bq = 1.

Lemma 2.2. Suppose p and q are two polynomials over C and are relatively prime (i.e. no common roots). LetNp:= Ker(p(A)), Nq := Ker(q(A)) and Npq:= Ker(p(A)q(A)). Then

N_pq = Np⊕ Nq.

Proof. From ap + bq = 1 we get

a(A)p(A) + b(A)q(A) = I. For any v ∈ Npq, acting the above operator formula to v, we get

v = a(A)p(A)v + b(A)q(A)v := v2+ v1.

We claim that v1 ∈ Np, whereas v2∈ Nq. This is because

p(A)v1 = p(A)b(A)q(A)v = b(A)p(A)q(A)v = 0.

Similar argument for proving v2 ∈ Nq. To see this is a direct sum, suppose v ∈ Np∩ Nq. Then

v = a(A)p(A)v + b(A)q(A)v = 0. Hence Np∩ Nq= {0}.

(26)

2.3. *MATRIX ALGEBRA 21 Corollary 2.1. Suppose a polynomial p is factorized as p = p1· · · ps withp1, ..., ps are relatively

prime (no common roots). LetNpi := Kerpi(A). Then

N_p = Np1 ⊕ · · · ⊕ Nps.

Theorem 2.5 (Spectral Decomposition). Let pmbe the minimal polynomial ofA. Suppose pmcan

be factorized as pm(λ) = s Y i=1 pi(λ) = s Y i=1 (λ − λki) mi

withλki 6= λkjfori 6= j. Let Nki = Ker(A − λkiI)

mi_{. Then}

• Nkiis invariant underA,

• Cn_{= N}

k1⊕ · · · ⊕ Nks.

Jordan matrix A matrix J is called a Jordan normal form of a matrix A if we can find matrix V such that AV = VJ, where J = Jk1 ⊗ · · · ⊗ Jkp:=      Jk1 Jk2 . .. Jkp      , V = [Vk1, Vk2, · · · , Vkp]. Jk(λk) =        λk 1 λk 1 . .. ... λk 1 λk        k×k , Vk= [v1k, · · · , vkk], k = k1, ..., ks, s X i=1 ki= n.

Here, λki are the eigenvalues of A, v

j

k ∈ Cn are called the generalized eigenvectors of A, the

matrices Jk are called Jordan blocks of size k of A. The matrix Vk = [v1k, · · · , vkk] is an n × k

matrix. We can restrict A to Vk, k = k1, ..., ksas

(27)

For each generalized vector, (A − λkI)v1k= 0 (A − λkI)v2k= v1k .. . (A − λkI)vkk= vk−1k , k = k1, ..., ks. This implies (A − λkI) vk1 = 0 (A − λkI)2vk2 = 0 .. . (A − λkI)kvkk = 0, k = k1, ..., ks. The set {vj_k i} form a basis in C

n_{. Therefore, V is invertible, and}

A = VJV−1. We call A is similar to J, and is denoted by A ∼ J.

Notice that the matrix Nk:= Jk− λkI is called a Nilpotent matrix, which has the form

Nk=      0 1 . .. ... 0 1 0      k×k .

It is easy to check that

N2_k =        0 0 1 . .. ... ... 0 0 1 0 0 0        k×k , · · · , Nk_k= 0.

Theorem 2.6. Any matrix A over C is similar to a Jordan normal form. The structure of this Jordan normal form is unique.

Example Suppose A is a 2 × 2 matrix with double eigenvalue λ. Let N1 = Ker(A − λI) and

N₂ = Ker(A − λI)2. We assume dimN1 = 1. Then N1 ⊂ N2 = C2. Let us choose any

v2∈ N2\ N1. We define v1 = (A − λI)v2. Then (A − λI)v1 = (A − λI)2v2= 0. Thus, under

(28)

2.3. *MATRIX ALGEBRA 23

Orthogonality, Self-adjoint operators There are some other decomposition, mainly when the under space Rnor Cnendowed with inner product structure.

Below V and W are vector spaces.

1. Orthogonal Projection: Given W ⊂ V , there is an orthogonal projection P : V → W such that (i) P w = w for all w ∈ W , (ii) (I − P )v ⊥ W for all v ∈ V .

2. For any W ⊂ V , there is a subspace W⊥such that (i) V = W ⊕ W⊥, (ii) W ∩ W⊥ = {0}, (iii) W ⊥ W⊥.

3. Self adjoint operator: we define A∗ = (¯aji). A matrix A is called self-adjoint if A∗ = A.

4. Alternatively, A∗is defined by

hv, A∗wi = hAv, wi, and A is self-adjoint if hAv, wi = hv, Awi.

5. A matrix U is unitary if U∗U = U U∗ = I. This is equivalent to that U = [u1, ..., un] and

{ui}ni=1are orthonormal.

Theorem 2.7. If A is self adjoint, then A is diagonalizable by a unitary matrix U and all eigenvalues are real.

Proof. 1. Suppose µ is an eigenvalue of A. By the spectral decomposition theorem, we can find the maximal invariant subspace W corresponding to µI − A. Let J = µI − A. We claim that J = 0 on W .

2. Since A is self-adjoint, so is J .

3. If the minimal polynomial of J in W is pm(λ) = λm. If m > 1, this means that there exists

v1and v2 which are independent such that

J v1= 0, J v2= v1.

Then we have

hv1, v1i = hJ v2, v1i = hv2, J v1i = 0.

This is a contradiction. Hence, m = 1. This also means J = 0.

4. The eigenvalues are real. Suppose λ, v are a pair of eigenvalue/eigenvector. λhv, vi = hλv, vi = hAv, vi

= hλv, Avi = hλv, λvi = ¯λhv, vi

5. The eigenspace corresponding to two distinct eigenvalues λ 6= µ are orthogonal to each other. Suppose

(29)

Then

λhv, wi = hAv, wi = hv, Awi = µhv, wi Hence, we get hv, wi = 0.

The Rayleigh quotient method is a constructive method to find eigenvalues of self-adjoint operator. λ1= max

v

hAv, vi hv, vi . Suppose V1be the corresponding eigenspace.

λ2= max v⊥V1

hAv, vi hv, vi .

This process can be proceeded inductively and find all eigenvalues and eigenvectors.

Singular Value Decomposition

Theorem 2.8. Let A : Rn → Rm_{(or C}n _{→ C}m_{). Then there exist orthonormal bases} _{V =}

[v1, ..., vn] in RnandU = [u1, ..., um] and non-negative numbers

σ1 ≥ ... ≥ σp > 0, p ≥ min(m, n) such that Avi= σiui, i = 1, ..., p, Avi= 0 forp < i ≤ n Or in matrix form AV = U Σ,

whereV is n × n unitary matrix, U is m × m unitary matrix, Σ is m × n diagonal matrix: Σ =

(diag(σ1, ..., σp), 0) ifm ≤ n

(diag(σ1, ..., σp), 0)T ifm > n.

Proof. 1. The matrix A∗A is self-adjoint. All its eigenvalues are real. They are also non-negative because if λ and v is a pair of eigenvalue/eigenvector, then from Rayleigh quotient

λhv, vi = hA∗Av, vi = hAv, Avi ≥ 0.

2. From the spectral decomposition for the self-adjoint matrix A∗A, we can find unitary matrix [v1, ..., vn] and Λ = diag(λ1, ..., λp, 0, ..., 0) such that AV = V Λ. Here, λ1 ≥ · · · ≥ λp > 0,

the rest eigenvalues are 0. The corresponding eigenspace spanned by < vp+1, ..., vn> is the

(30)

2.3. *MATRIX ALGEBRA 25 3. We note that A∗Av = 0 if and only if Av = 0. Therefor, N (A) is also spanned by

[vp+1, ..., vn]. 4. Let σi = √ λi. For σi > 0, define uiby Avi = σiui. Then, hui, uji = 1 σiσj hAvi, Avji = 1 σiσj hA∗Avi, vji = σi σj hvi, vji = 0. hu_i, uii = 1 σ_i2hAvi, Avii = 1 σ_i2hA ∗_Av i, vii = hvi, vji = 1.

The space spanned by [u1, ..., up] is R(A). Let us choose [up+1, ..., um] to be an orthonormal

basis in R(A)⊥⊂ Rm_{. Then [u}

1, ..., um] is unitary matrix.

Remarks.

1. A∗has the following representation: A∗ui=

1 σi

A∗(Avi) = σivi.

2. The domain and range of A can be decomposed into

Rn= [v1, ..., vp] ⊕ [vp+1, ..., vn] = [v1, ..., vp] ⊕ N (A),

Rm = [u1, ..., up] ⊕ [up+1, ..., um] = R(A) ⊕ [up+1, ..., um].

3. An m × n matrix is of rank 1 if and only if it has the form u ⊗ v, uvT

where u is m × 1 and vT is 1 × n matrices. The SVD is to decompose A into a sum of rank 1 matrices A = [u1, ..., um]           σ1 . .. σp 0 . .. 0              v₁T .. . v_nT   = p X i=1 σiuiviT,

(31)

4. The least-squares solution for Ax = b is the minimizer of 1

2kAx − bk

2 2

With the singular value decomposition, we can represent b = p X i=1 hb, u_iiu_i+ m X i=p+1 hb, u_iiu_i = p X i=1 hb, u_iiu_i+ b⊥

The least squares solution is

x∗= p X i=1 1 σi hb, uiivi+ N (A),

which minimize kAx − bk2with minimal value

kAx∗− bk2= kb⊥k2. The solution x†:= Pp

i=1σ1ihb, uiivi, denoted by A

†_{b, is called the Moore-Penrose solution,}

where

A†:= V Σ†U∗

Σ†has the same structure as Σ and replacing σiby σ_i−1. The matrix A†is called the pseudo

inverse of A.

2.4 Matrix Analysis

2.4.1 Matrix Norm

Norm in vector space In analysis, we need to measure how close of two vectors, the concept of convergence. A natural way is to define the concept of norm for vectors.

Definition 2.6. Let V be a vector space. A mapping k · k : V → R is called a norm if (i) kxk ≥ 0 and kxk = 0 if and only if x = 0;

(ii) kλxk = |λ|kxk for any λ ∈ R and any x ∈ Rn_;

(iii) kx + yk ≤ kxk + kyk.

A vector space endowed with a normk · k is called a normed vector space. In Rn, we define the norms

|x|p :=   n X j=1 |xj|p   1/p , 1 ≤ p < ∞, |x|∞= max i |xi|

(32)

2.4. MATRIX ANALYSIS 27

Matrix Norm A m × n matrix A is viewed as a linear map from Rn→ Rm_{(or C}n_{→ C}m_{). The}

set of all m × n matrices is denoted by Mm×n, which is a linear space.

The norms in domain and range may be different. Let us call the norm in the domain by k · ka

and range by k · kb. The linear map A : (Rn, k · ka) → (Rm, k · kb) induces an operator norm on the

matrix A defined by

kAk_a→b:= max

x6=0

kAxk_b

kxk_a ≡ maxkxka=1

kAxk_b.

Most of the time, we drop the subindex (a → b) when it is clear from the context. It is easy to see that k · k is a norm in the vector space Mm×n. Let us check the triangle inequality:

kA + Bk = max

kxk=1k(A + B)xk

≤ max

kxk=1kAxk + maxkxk=1kBxk

= kAk + kBk. The operator has the following two important properties

• kAxk ≤ kAk kxk

• kABk ≤ kAk kBk when both A, B ∈ Mn×n.

Examples

1. A : (Rn, | · |∞) → (Rm, | · |∞) Then kAk∞= maxiPj|aij|

kAk∞= sup |x|∞=1 max i | X j aijxj| = max i _|x|sup_∞₌₁| X j aijxj| ≤ max i _|x|sup_∞₌₁( X j |a_ij|)(max j |xj|) = maxi X j |a_ij|

Conversely, if maxiPj|aij| =Pj|ai0j|, we choose

xj = sign ai0j = ±1.

Then |x|∞= 1 and

kAk∞≥ |Ax|∞= max i | X j aijxj| = X j |a_i₀_j|.

(33)

2. A : (Rn, | · |1) → (Rm, | · |1), then kAk1 = maxjPi|aij|. |Ax|₁=X i |X j aijxj| ≤ X i X j |a_ij| |x_j| =X j X i |a_ij| ! |x_j| ≤X j max k X i |a_ik| ! |x_j| = max k X i |a_ik| ! |x|₁

Thus, we obtain kAk1 ≤ maxjP_i|aij|. Conversely, if maxjP_i|aij| =P_i|aij0|, then we

choose x = (δjj0) n j=1 = (0, · · · , 1, · · · , 0)T. We have |x|1 = 1 and Ax = (a1j0, a2j0, · · · , anj0) T_. Thus, kAk1≥ |Ax|1 = n X i=1 |aij0| = max j X i |aij|.

3. kAk2 =pρ(A∗A), where ρ(B) is the spectral radius of a matrix B, which is defined for a

general square matrix B, by

ρ(B) = max

i {|λ1(B)|, ..., |λn(B)|},

the largest eigenvalues of B in magnitude.

Proof. Since A∗A is hermitian, its eigenvalues are real. Let us order them by λ1 ≥ λ2 ≥ · · · ≥ λn.

Then the spectral radius ρ(A∗A) = λ1. Suppose x is the unit eigenvector corresponding to

λ1, then

kAk2₂≥ |Ax|2₂ = hAx, Axi = hA∗Ax, xi = λ1hx, xi = λ1

We get kAk2≥

√

λ1. On the other hand, for any |x|2 = 1, we have

|Ax|2 = hAx, Axi = hA∗Ax, xi ≤ λ1hx, xi = λ1.

We get kAk2₂≤ λ1.

4. Frobenius norm: it is define to be kAk2

F :=

X

i,j

|a_ij|2_{= tr(A}∗_A).

(34)

2.4. MATRIX ANALYSIS 29 • |Ax|₂ ≤ kAk_F|x|₂, kAk2≤ kAkF;

• kABkF ≤ kAkFkBkF;

• kARk_F = kRAkF = kAkF for any rotation matrix R.

The proofs of the first and second follow from Cauchy-Schwarz inequality. For the proof of the third statement,

kARk2_F = tr(RTATAR) = tr(RRTATA) = tr(ATA) = kAk2_F. Here, we have used the cyclic formula for trace:

tr(ABC) = tr(BCA) = tr(CAB). 5. Nuclear norm: kAk∗ = min(m,n) X i=1 |σ_i(A)|,

where σ1 ≥ σ2 ≥ · · · are the singular values of A, equivalently, σ(A) = pλ(A∗A). The

nuclear norm is also called Ky Fan ’n’-norm. It is used, for instance, in compressive sensing, in principal component analysis in statistics, to find a low rank matrix approximation to a given matrix.

6. *Schatten norm: the above nuclear norm, Frobenius norm, L2operator norm can all be unified as special cases of Schatten norm, which is defined as

kAk_p :=   min(m,n) X i=1 σi(A)p   1/p .

From the functional calculus theory,

kAkp_p = tr(|A|p), |A| :=√A∗A.

Thus, the L2operator norm is the Schatten maximum norm. The nuclear norm is the Schatten 1-norm. The Frobenius norm is the Schatten 2-norm.

Remarks.

• The Nilponent matrix is defined to be N =

0 1

0 0

has ρ(N ) = 0, but kN k1= kN k2 = kN k∞= 1. The matrix

N∗N = 0 0 0 1 .

(35)

2.4.2 Condition number

Consider

Ax = b,

where A is an n × n matrix and assume it is invertible. We want to measure the sensitivity of solving x from b. Suppose ˜b is a perturbation of b and ˜x the corresponding solution of A˜x = ˜b. Then

kx − ˜xk = kA−1(b − ˜b)k ≤ kA−1kkb − ˜bk kx − ˜xk kxk ≤ kA −1_{kkb − ˜bk} 1 kxk kAxk kbk ≤ kA −1_kkAkkb − ˜bk kbk Condition number κ(A) := kAkkA−1k measures the sensitivity of x w.r.t. b.

1. Find the condition number of

A = 1 1 + 1 − 1 Ans: κ(A) ≥ 4/2.

2. The Hilbert matrix is given by

H = 1 i + j + 1 0≤i,j≤n

Its condition number has estimate: κ(H) ≈ O((1 +√2)4n/√n).

3. The discrete Laplacian in one dimension with Dirichlet boundary condition is A = diag(−1, 2, −1).

Since A is symmetric positive definite, we have kAk = max i λi(A) = λ1(A), kA−1k = max i |λi(A −1 )| = max i |λi(A) −1_{| = λ} n(A)−1. Thus, κ(A) = λ1(A) λn(A) . Homework

1. What is the explicit expression of the operator norm of A : (Rn, | · |1) → (Rm, | · |∞)?

2. Show that κ(A) = supkxk=kyk kAxk kAyk.

(36)

2.4. MATRIX ANALYSIS 31

2.4.3 *Functional Calculus

Given an n × n matrix A, we can define p(A) for any polynomial p. Let σ(A) denote the spectra of A, we have the following spectral mapping theorem.

Theorem 2.9 (Spectral Mapping Theorem for polynomial functions). Let A be an n × n matrix over C and σ(A) be its spectra. Then for any polynomial p, we have

σ(p(A)) = p(σ(A)).

Proof. 1. By spectral decomposition theorem, there exists V and J such that A = V J V−1.

From this expression, we get that

Ak= (V J V−1) · (V J V−1) · · · (V J V−1) = V JkV−1. The Jordan matrix has the form

J = J1⊗ · · · ⊗ Jm.

This implies

Jk= J₁k⊗ · · · ⊗ J_mk.

2. For each Jordan block above, say Jp = µpI + N , N is Nilpotent, we can get that Jpkis always

upper triangular with diagonal µk_pI. Thus, the eigenvalue of J_pkis µk_p. This shows σ(J_pk) = (σ(Jp))k. 3. Since σ(J ) = σ(J₁k⊗ · · · ⊗ J_mk) = m [ p=1 σ(J_pk) we then get σ(Ak) = σ(Jk) = m [ p=1 σ(J_pk) = σ(A)k, and thus σ(p(A)) = p(σ(A)) for any polynomial function p(·).

(37)

Remark. In applications, we will need f (A) for more general functions. For example, A−1, exp(A), sin(A), etc. These operator-valued functions can be defined through the helps of resolvent (λI − A)−1and the Cauchy integral formula.

The resolvent (λI − A)−1can be defined in |λ| > ρ(A), where ρ(A) is called the spectral radius of A. The spectral radius ρ(A) is useful in the power series expansion of a matrix. We have the following theorem.

Theorem 2.10. Ak → 0 if and only if ρ(A) < 1. kAk_{k is unbounded as k → ∞ if and only if}

ρ(A) > 1.

Proof. If Ak→ 0, and suppose λ/x be a pair of eigenvalue/eigenvector of A, then Akx = λkx → 0.

This implies λk→. Hence |λ| < 1. This implies ρ(A) < 1.

Conversely, let us suppose ρ(A) < 1, which means that all eigenvalues |λ(A)| < 1. Let us decompose A into direct product of Jordan blocks: AV = V J with V invertible and J = J1⊗ · · · ⊗ J`. The power

Ak= V JkV−1, Jk= J₁k⊗ · · · ⊗ J_`k.

We can see that λ(A) < 1 ⇔ J (λ(A))k→ 0, which is equivalent to |λ(A)| < 1.

Suppose |λ(A)| > 1 for some eigenvalue λ(A), then the corresponding Jordan block

Jk= (λI + N )k= k X m=0 k m λk−mINm → ∞ if and only if |λ| > 1.

Theorem 2.11 (Gelfand formula). For any matrix norm k · k, we have ρ(A) = lim

k→∞kA k_k1/k_.

Proof. For any > 0, we have ρ(A/(ρ(A) + )) < 1. Hence

A ρ(A) +

k

→ 0 as k → ∞. Thus, there exists N1such that for all k > N1, we have

A ρ(A) + k < 1 This means kAkk ≤ (ρ(A) + )k

(38)

2.4. MATRIX ANALYSIS 33 or

kAkk1/k ≤ ρ(A) + . Similarly, ρ(A/(ρ(A) − )) > 1. From

A ρ(A) − k → ∞ as k → ∞, there exists N2such that for all k > N2,

kAk_k1/k _{≥ ρ(A) − .}

This completes the proof. Theorem 2.12. The seriesP∞

n=0Anconverges if and only ifρ(A) < 1. In the convergence case,

the series equals(I − A)−1.

Proof. Suppose ρ(A) < 1, we want to show (I − A) is invertible. The key is to expand

(I − A)−1 =

∞

X

n=0

An.

This is called Neumann series. From ρ(A) < 1, we chooser > 0 such that ρ(A) + = η < 1. From kAnk1/n _{→ ρ(A), there exists N such that for all n > N , we have}

kAnk1/n ≤ ρ(A) + = η, or kAnk ≤ ηn_{. Thus,}P∞

n=0Anconverges absolutely and uniformly in any operator norm.

It is also easy to see that this series commutes with A because the finite part of the Neumann series commutes with A. Thus, we get

(I − A) ( ∞ X n=0 An) = ( ∞ X n=0 An) (I − A) = I.

Corollary 2.2. The operator Rλ(A) := (λI − A)−1 is well-defined and analytic in|λ| > ρ(A).

Proof. This is because the series

∞ X n=0 A λ n

converges absolutely and uniformly in any operator norm k · k and uniformly in λ for λ in the compact region in |λ| > ρ(A). Since the finite sum is analytic in λ, so is their uniform limit.

(39)

Example Suppose J = µIn+ Nnbe a Jordan matrix, find the exact formula of (λIn− J )−1.

Definition 2.7. Let f be a holomorphic function on C and A be an n × n matrix over C. We define f (A) :=

Z

C

f (λ)(λI − A)−1dλ whereC is any closed contour that winds once around σ(A).

Theorem 2.13 (Spectral Mapping Theorem). Let f be a holomorphic function on C and A be an n × n matrix over C. We have

σ(f (A)) = f (σ(A)).

2.5 Direct Methods for Solving Linear Equations

2.5.1 LU Decomposition

Goal : Solv small size linear system

Ax = b. Small size means that n is at most few hundreds.

Strategy

• Decompose A = LU by Gaussian elimination method, where L is lower triangular matrix and U is a upper triangular matrix.

• Solve LU x = b by solving

Ly = b U x = y.

These two equations can be solved by forward and backward substitution, respectively.

Procedure If the matrix is upper triangular, i.e. aij = 0 if j < i, then we can solve this equation

by backward substitution:

Algorithm 1 Backward substitution

1: procedure BKSBSTITUT(n, A = (aij), b) 2: for i = n : 1 do 3: xi ← bi−Pnj=i+1 /aii 4: end for 5: end procedure

If the matrix is lower triangular, i.e. aij = 0 if j > i, then we can solve this equation by forward

(40)

2.5. DIRECT METHODS FOR SOLVING LINEAR EQUATIONS 35 Algorithm 2 Forward substitution

1: procedure FWDSBSTITUT(n, A = (aij), b) 2: for i = 1 : n do 3: xi ← bi− Pi−1 j=1 /aii 4: end for 5: end procedure

For general matrix A, we factorize it into the product of a lower triangular matrix L and an upper triangular matrix U :

A = LU, called LU factorization. By direct calculation, we get

aij = min(i,j)

X

s=1

`isusj.

The procedure to obtain L and U is by the Gaussian elimination method. It is an inductive procedure. At step k,

• we assume that we have computed rows 1, ..., k − 1 of U and columns 1, ..., k − 1 of L • we want to update ukj, j ≥ k and `ik, i ≥ k.

• From akk = k−1 X s=1 `ksusk+ `kkukk,

we can determine `kkor ukkif one of them is chosen. So there are three approaches:

– choose `kk = 1 for all k. Such L is a unit lower triangular matrix, the factorization is

called Doolittle’s factorization;

– choose ukk = 1 for all k. Such U is a unit upper triangular matrix, the factorization is

called Crout’s factorization;

– For symmetric matrix, we can choose `kk = ukkfor all k. Such factorization for

sym-metric matrices is called the Cholesky factorization. Let us choose `k= 1 here. With this, we determine ukk.

• We proceed to compute ukjfor j > k and `ikfor i > k as the follows.

akj = k−1 X s=1 `ksusj+ `kkukj (k + 1 ≤ j ≤ n) aik = k−1 X s=1 `isusk+ `ikukk (k + 1 ≤ i ≤ n)

(41)

The corresponding pseudocode is

Algorithm 3 LU Decomposition

1: procedure GAUSSIANELIMINATION(n, A = (aij), b)

2: for k = 1 : n do 3: `kk = 1 4: ukk= akk−Pk−1s=1`ksusk 5: for j = k + 1 : n do 6: ukj ← akj− Pk−1 s=1usj /`kk 7: end for 8: for i = k + 1 : n do 9: `ik← aik−Pk−1s=1`isusk /ukk 10: end for 11: end for 12: end procedure

With a LU factorization, the system Ax = b can be solved by

Ly = b U x = y.

In practice, we can store L and U in matrix A. At step k, the matrix A(k)has the form

A(k)=            a(1)₁₁ a(1)₁₂ · · · a(1)_1n `21 a(2)22 a (2) 2n .. . . .. . .. ... `k1 · · · `k,k−1 a(k)_kk · · · a(k)_kn .. . ... ... ... `n1 · · · `n,k−1 a (k) nk · · · a (k) nn           

(42)

2.5. DIRECT METHODS FOR SOLVING LINEAR EQUATIONS 37 Algorithm 4 Gaussian Elimination

1: procedure GAUSSIANELIMINATION(n, A = (aij), b)

2: for k = 1 : n − 1 do 3: for i = k + 1 : n do 4: `ik = a(k)_ik /a(k)_kk 5: for j = k + 1 : n do 6: a(k+1)_ij = a(k)_ij − `_ika(k)_kj 7: end for 8: b(k+1)_i = b(k)_i − `ikb(k)_k 9: end for 10: end for 11: end procedure Variants of LU Decomposition

• A = LDU , where L and D are unit lower/upper triangular matrices, D is a diagonal matrix. • If A is symmetric, then we can factor A into A = LLT_{. This is called Cholesky factorization.}

Stability and Pivoting LU Decomposition It is possible that the LU factorization fails at some iteration k. In performing Gaussian elimination for the matrix A(k):

A(k)=    a(k)_kk · · · a(k)_km .. . ... a(k)_nk · · · a(k)nn   .

it is possible that a(k)_kk is zero, or very small. In this case, the Gaussian elimination is either fails or unstable. To avoid this, we can perform row permutation to move the largest |a(k)_ik |, i = k, ..., n to the kth row of the matrix A(k). Let us denote such row permutation by Pk. The factorization now

becomes

P A = LU,

where P = Pn−1· · · P1is the product of these row permutations. Such process involving only row

permutation is called partial pivoting.

In the above pivoting process, it is also possible to find the largest |a(k)_ij | for k ≤ i, j ≤ n then perform a row permutation Pkand a column permutation Qk. Then the factorization becomes

P AQ = LU,

(43)

Matlab Commands

• [LU ] = lu(A) gives the LU decomposition with partial pivoting; • [L, U, P ] = lu(A) gives the LU decomposition with partial pivoting; • [L, U, P, Q] = lu(A) gives the LU decomposition with total pivoting. If A is symmetric, then Cholesky method is adopted:

• R = chol(A) gives A = R0R;

• L = chol(A,0lower0) gives A = LL0.

Computational Complexity :

• It is in general O(n3_{) for full matrices.}

• For banded matrices with band size b, the computational complexity is O(b2_{n), provided}

there is no pivoting.

• Theoretically, if the matrix-matrix multiplication is M (n), then the LU factorization is also M (n). There are some fast algorithms for matrix-matrix multiplication:

– Strassen algorithm: O(n2.807355),

– Coppersmith?Winograd algorithm: O(n2.375477). The latter may be impractical because large constant.

2.5.2 *Other direct methods

• Cyclic Reduction Method: This is for tridiagonal matrix A = diag(aj, 1, cj)

The jth equation is

ajxj−1+ xj+ cjxj+1= bj.

We will reduce to half size by eliminating the odd index terms. Let us write three consecutive equations    a2j−1x2j−2 +x2j−1 +c2j−1x2j = b2j−1 a2jx2j−1 +x2j +c2jx2j+1 = b2j a2j+1x2j +x2j+1 +c2j+1x2j+2 = b2j+1

We can eliminate x2j−1and x2j+1and then obtain an equation only involves x2j−2, x2j and

x2j+2:

(44)

2.6. CLASSICAL ITERATIVE METHODS 39 The problem now can be reduced to half size:

a(1)_j x(1)_j−1+ x(1)_j + c(1)_j = b(1)_j . Here, all original xj, bj, aj, cj will be denoted by x

(0)

j ,etc. as an initialization. The new

variables and coefficients are

a(1)_j = −a (0) 2j−1a (0) 2j 1 − a(0)_2j c(0)_2j−1− a(0)_2j+1c(0)_2j , c (1) j = −c(0)_2j c(0)_2j+1 1 − a(0)_2j c(0)_2j−1− a(0)_2j+1c(0)_2j x(1)_j = x(0)_2j , b(1)_j = b (0) 2j − a (0) 2jb (0) 2j−1− c (0) 2j b (0) 2j+1 1 − a(0)_2j c(0)_2j−1− a(0)_2j+1c(0)_2j .

We perform this process recursively. Suppose the number of unknowns is 2LM at level 0. The solution (x(k)_j ) is called at level k. We perform the above reduction procedure for k = 0 to L. This is a M × M system, a small system, which can the solve it exactly. Once we have the solution at the coarsest level L, we can go backward to obtain solutions at finer grid. Indeed, suppose we have solutions at level k + 1, that is x(k+1)_j . These are also the solutions x(k)_2j . Then the solution at odd grids at level k can be obtained from the odd equation:

a(k)_2j+1x(k)_2j + x(k)_2j+1+ c(k)_2j+1x(k)_2j+2 = b(k)_2j+1

which can be solved for x(k)_2j+1 once x(k)_2j and x(k)_2j+2are obtained. But these two are obtained from the previous iteration steps.

The cyclic reduction method is very similar to multi grid method. If the matrix is diagonally dominant, then off-diagonal coefficients a(k)_j , c(k)_j , which changes during the level reduction, converges to zeros quadratically.

• Block Cyclic Reduction Method: this is particular useful for two dimension problems.

2.6 Classical Iterative Methods

The target problem we can have in mind is the discrete Poisson equation −u_j−1+ 2uj− uj+1= fj, j = 1, ..., N − 1.

The boundary conditions are

u0 = uN = 0.

This can be written in matrix form

Ax = b, where A = diag(−1, 2, −1).

(45)

2.6.1 Splitting iterative methods

Problem Solve Ax = b. A is large size and usually sparse.

Ideas This class of iterative methods split A into A = M − N, where M and N satisfy

• M is the major part, M is easy to invert • N is the minor part.

Then perform iteration:

M xn+1− N xn= b.

This is supposed to be solved easily because we assume M is easy to invert.

Splitting examples For instance, we can express A = D + L + U , where D is diagonal, L lower triangular, and U upper triangular. Then we perform the following splitting

• Jacobi method: choose M = D, N = −L − U ; • Gauss-Seidel: choose M = D + L, N = −U .

Theory The iteration can be rewritten as

xn+1= M−1N xn+ M−1b := Gxn+ M−1b.

The matrix G is called amplification matrix.

Theorem 2.14. The sequence xn+1 = Gxn+ c converges if and only if ρ(G) < 1.

Proof. 1. Subtracting xn+1= Gxn+ c and xn= Gxn−1+ c, we get

xn+1− xn= G(xn− xn−1) = Gn(x1− x0).

The convergence of the sequence {xn} is equivalent to the convergence of the seriesPn(xn+1−

xn), which is also equivalent to that ofP_nGn(x1− x0).

2. If ρ(G) < 1, then, from Gelfand formula, we can choose an ρ(G) < η < 1 and there exists an N , such that for all n ≥ N , we have

kGnk ≤ ηn Thus, the series

X n (xn+1− xn) = X n Gn(x1− x0) converges absolutely.

(46)

2.6. CLASSICAL ITERATIVE METHODS 41 3. Conversely, suppose ρ(G) > 1. Let λ1 be the largest eigenvalue in magnitude and v1 be

the corresponding eigenvector. From ρ(G) > 1, we have |λ1| > 1.We choose x0 such that

x1− x0= v1. This means that (I − G)x0+ c = v1. As long as 1 is not an eigenvalue of G,

this is possible. With this x0, we see that

X n (xn+1− xn) = X n Gn(x1− x0) = X n Gnv1 = X n λn₁ is unbounded.

4. If 1 happens to be an eigenvalue with v being the corresponding eigenvector. We choose x0 = v, we see that xn= v + nc is still unbounded.

Remark A sufficient condition forP

nGn(x1− x0) converges is kGk < 1 for some norm. But

this is not a necessary condition. Definition 2.8. A matrix A is called

• strictly diagonally dominant if

|a_ii| >X

j6=i

|a_ij| for alli = 1, ..., n.

• irreducible diagonally dominant if A is diagonally dominant: |aii| ≥

X

j6=i

|aij| for alli = 1, ..., n,

andA is irreducible, i.e. A cannot be similar via permutation to a block upper triangular matrix.

Theorem 2.15. The Jacobi method or the Gauss-Seidel method converge if one of the following cases holds

• A is symmetric positive definite; • A is strictly diagonally dominant;

• A is irreducible and diagonally dominant. The convergence rate is linear.

Proof. I shall only give the convergence proof for Jacobi method for strictly diagonally dominant matrices. For Jacobi method, A = D + (L + U ) = M − N . The iteration algorithm is M xn+1 =

(47)

N xn+ b. We get xn+1 = Gxn+ M−1b with G = M−1N . When A is strictly row diagonally

dominant, we use operator sup norm kGk∞=, which is

kGk∞= max i X j |gij| = max i X j6=i aij aii = η < 1

Thus, the series

X

n

(xn+1− xn)

converges absolutely in | · |∞. This leads to xn converges. Suppose x∗ is its limit. Then x∗ =

Gx∗+ M−1b. Subtracting this from xn+1 = Gxn+ M−1b, we obtain

en+1 = Gen,

where en:= xn− x∗. Since kGk∞= η < 1, we get

|en|∞≤ kGnk∞|e0| ≤ ηn|e0|∞→ 0.

This shows that the convergence is linear.

Acceleration techniques: Richardson Extrapolation In the late 50’s, people think that one can accelerate the convergent rate by performing Richardson extrapolation techniques. Let us take Ja-cobi method as an example.

1. Damped Jacobi method. Suppose we use Jacobi method to produce xn+1 from xn. We can

extrapolate it to ˆxn+1by

ˆ

xn+1= (1 − ω)xn+ ωxn+1,

where ω will be properly chosen. Usually it will be larger than 1 for extrapolation. Now, suppose xn+1is produced by Jacobi method. Then

ˆ

xn+1= (1 − ω)xn+ ωD−1(−(L + U )xn+ b).

We drop hat in ˆxn+1. The resulting scheme is:

xn+1 = (1 − ω)xn+ ωD−1(−(L + U )xn+ b) = xn+ ωD−1(b − Axn).

This is called damped Jacobi method.

2. Successive over relaxation method (SOR). Suppose xn+1is produced by Gauss-Seidel method.

We extrapolate it from xn, xn+1to a new ˆxn+1by

ˆ

xn+1= (1 − ω)xn+ ωxn+1.

In this case, ˆ

(48)

2.6. CLASSICAL ITERATIVE METHODS 43 3. Symmetric Successive Over Relaxation (SSOR). For symmetric matrix, the above amplifi-cation matrix G is not symmetric. But, we can perform Gauss-Seidel method twice in one iteration, one use lower triangular matrix L, the other uses the upper triangular matrix. In this procedure, we can maintain symmetric of the amplification matrix. So, the scheme reads

xn+1/2= xn+ ω(D + L)−1(b − Axn)

xn+1= xn+1/2+ ω(D + U )−1(b − Axn+1/2).

xn+1= xn+ ω(D + L)−1+ (D + U )−1− ω(D + U )−1A(D + L)−1 (b − Axn).

Note that U = LT and the amplification matrix

G = ω(D + L)−1_{+ (D + U )}−1_{− ω(D + U )}−1_{A(D + L)}−1_A

is symmetric.

The goal is to find proper ω which minimize ρ(G(ω)). It depends on specific A. For discrete Laplacian on box, one can compute the spectrum explicitly, then obtain an optimal ω. For symmetric positive definite matrix, we can also do similar things.

Remarks.

1. SOR in the standard textbook is not expressed in the form above. It is derived and expressed as below. Originally, the Gauss-Seidel can be written as

xn+1= D−1(b − Lxn+1− U xn).

Hence, one can design SOR as

xn+1= (1 − ω)xn+ ωD−1(b − Lxn+1− U xn).

From this, we obtain

Dxn+1= (1 − ω)Dxn+ ω(b − Lxn+1− U xn) (D + ωL)xn+1= ((1 − ω)D − ωU )xn+ ωb. Thus, we split A = M − N , M = D + ωL, N = (1 − ω)D − ωU. Or we can express it as xn+1 = xn+ ω(D + ωL)−1(b − Axn).

(49)

2. SSOR: We perform two SORs:

x_n+1/2= xn+ ω(D + ωL)−1(b − Axn)

xn+1= xn+1/2+ ω(D + ωU )−1(b − Axn+1/2).

This gives

xn+1= xn+ ω(D + ωL)−1+ (D + ωU )−1− ω(D + ωU )−1A(D + ωL)−1 (b − Axn)

= xn+ ωP−1(b − Axn)

where

P−1 = (D + ωL)−1+ (D + ωU )−1− ω(D + ωU )−1A(D + ωL)−1 = (D + ωU )−1[D + ωL − ωA + D + ωU ] (D + ωL)−1 = (2 − ω)(D + ωU )−1D(D + ωL)−1.

2.6.2 Preconditioned iterative methods

To solve

Ax − b = 0, we shall solve the equation

P−1(Ax − b) = 0

instead, where P is called a preconditioner, which is designed to satisfy • P−1is easy to compute,

• P−1_{is an approximation of A}−1_{in the sense that P}−1_{A has smaller condition number of that}

A has.

With a preconditioner P , we can design a fixed point method as xn+1= xn+ ωnP−1(b − Axn).

One can see all classical iterative methods can be expressed as this preconditioned iterative method. • Jacobi method: P = D, ω = 1

• Damped Jacobi method: P = D, ω ∈ (0, 2), • Gauss-Seidel: P = D + L, ω = 1,

• SOR: P = D + ωL

(50)

2.6. CLASSICAL ITERATIVE METHODS 45

Homework

1. Discretize the one-dimension Poisson equation −u00= f by central finite difference method. Solve the resulting linear system

diag(−1, 2, −1)u = f

by above classical iterative methods. Choose proper ω, Compare them.

2.6.3 Conjugate Gradient Method

Goal: Solve Ax = b, A is symmetric positive definite and b 6= 0.

Ideas and derivation: Solve the problem successively in the space spanned by {b, Ab, ..., Ak−b}. The reason is the follows. Suppose p is the minimal polynomial of A, that is p(A) = 0 and deg(A) ≤ n. If A is invertible, then p(0) 6= 0. Otherwise 0 would be an eigenvalue. We may normalize p such that p(0) = −1. Thus

0 = p(A) = −I + A q(A). This shows that A−1 = q(A) with deg(q) ≤ n − 1. Thus,

x = A−1b = q(A)b.

To derive an iterative method, the above observation suggests us to solve this equation iteratively in the spaces

V0 ⊂ V1⊂ · · · ⊂ Vn= Rn.

where

Vk=< b, Ab, ..., Ak−1b >,

the space spanned by {b, Ab, ..., Ak−1b}, called the Krylov spaces. The details of the derivation are the follows.

1. The problem can be written in variation form: min φ(x) := 1

2hAx, xi − hb, xi. The minimum occurs at

∇φ(x) = Ax − b = 0. 2. Let us look for optimal solution in Vk. Let us call it

An Introduction to Computational Mathematics

AN INTRODUCTION

I-Liang Chern

Contents

Chapter 1

Solving Equations of One Variable

1.1

Motivation

1.2

Newton’s method

1.3

Secant method

1.4

A dynamical system point of view of iterative map

1.5

Fixed Point Method

Chapter 2

Basic Numerical Linear Algebra

2.1

Motivations

u

m

m

-c

u

m

g

-c

u

m

g

-c

u

2.2

Introduction and overview

2.3

*Matrix Algebra

2.4

Matrix Analysis

2.5

Direct Methods for Solving Linear Equations

2.6

Classical Iterative Methods