3 Best-Fit Subspaces and Singular Value Decomposition (SVD)

3.3 Singular Vectors

We now define the singular vectors of an n × d matrix A. Consider the rows of A as n points in a d-dimensional space. Consider the best-fit line through the origin. Let v be a unit vector along this line. The length of the projection of a_i, the i^th row of A, onto v is |a_i · v|. From this we see that the sum of the squared lengths of the projections is |Av|². The best-fit line is the one maximizing |Av|² and hence minimizing the sum of the squared distances of the points to the line.

7 But there is a difference: here we take the perpendicular distance to the line or subspace, whereas, in the calculus notion, given n pairs (x₁, y₁), (x₂, y₂), . . . , (x_n, y_n), we find a line l = {(x, y) | y = mx + b} minimizing the vertical squared distances of the points to it, namely $\sum_{i=1}^{n} (y_i - m x_i - b)^2$.

With this in mind, define the first singular vector v₁ of A as

$$v_1 = \arg\max_{|v|=1} |Av|.$$

Technically, there may be a tie for the vector attaining the maximum and so we should not use the article “the”; in fact, −v₁ is always as good as v₁. In this case, we arbitrarily pick one of the vectors achieving the maximum and refer to it as “the first singular vector”, avoiding the more cumbersome “one of the vectors achieving the maximum”. We adopt this terminology for all uses of arg max.

The value σ₁(A) = |Av₁| is called the first singular value of A. Note that

$$\sigma_1^2 = \sum_{i=1}^{n} (a_i \cdot v_1)^2$$

is the sum of the squared lengths of the projections of the points onto the line determined by v₁.
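The definitions of v₁ and σ₁ can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using `np.linalg.svd` as a stand-in for the algorithm the text defers until later; the data set, the variable names, and the random-trial check are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points in R^3 lying close to the line spanned by `direction`.
direction = np.array([3.0, 4.0, 0.0]) / 5.0
A = np.outer(rng.normal(size=200), direction) + 0.05 * rng.normal(size=(200, 3))

# np.linalg.svd returns the right-singular vectors as the rows of vt,
# ordered by decreasing singular value.
_, s, vt = np.linalg.svd(A, full_matrices=False)
v1, sigma1 = vt[0], s[0]

# sigma_1^2 is the sum of squared projection lengths of the rows onto v1.
proj_sq = np.sum((A @ v1) ** 2)

# Squared perpendicular distances of the rows to the line through v1.
residual = A - np.outer(A @ v1, v1)
dist_sq = np.sum(residual ** 2)

# Compare against |Av| for many random unit vectors v.
trials = rng.normal(size=(1000, 3))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = np.max(np.linalg.norm(A @ trials.T, axis=0))

print("sigma_1:", sigma1, " best random |Av|:", best_random)
print("projections^2 + distances^2:", proj_sq + dist_sq, " total:", np.sum(A ** 2))
```

Because the squared projections and squared distances add up to the fixed total Σ|a_i|², maximizing the first quantity is the same as minimizing the second.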

If the data points were all either on a line or close to a line, intuitively, v₁ should give us the direction of that line. It is possible that data points are not close to one line, but lie close to a 2-dimensional subspace or more generally a low dimensional space.

Suppose we have an algorithm for finding v₁ (we will describe one such algorithm later). How do we use this to find the best-fit 2-dimensional plane or, more generally, the best-fit k-dimensional space?

The greedy approach begins by finding v₁ and then finds the best 2-dimensional subspace containing v₁. Working with sums of squared distances helps here. For every 2-dimensional subspace containing v₁, the sum of squared lengths of the projections onto the subspace equals the sum of squared projections onto v₁ plus the sum of squared projections along a vector perpendicular to v₁ in the subspace. Thus, instead of looking for the best 2-dimensional subspace containing v₁, look for a unit vector v₂ perpendicular to v₁ that maximizes |Av|² among all such unit vectors. Using the same greedy strategy to find the best three- and higher-dimensional subspaces defines v₃, v₄, . . . in a similar manner. This is captured in the following definitions. There is no a priori guarantee that the greedy algorithm gives the best fit. But, in fact, the greedy algorithm does work and yields the best-fit subspaces of every dimension, as we will show.
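The decomposition behind this step can be written out explicitly. As a sketch, let S be any 2-dimensional subspace containing v₁ and let v be a unit vector in S perpendicular to v₁; the symbols S and proj_S are notation introduced here for illustration and do not appear in the text.

```latex
\begin{align*}
\operatorname{proj}_S(a_i) &= (a_i \cdot v_1)\, v_1 + (a_i \cdot v)\, v, \\
\sum_{i=1}^{n} \bigl|\operatorname{proj}_S(a_i)\bigr|^2
  &= \sum_{i=1}^{n} (a_i \cdot v_1)^2 + \sum_{i=1}^{n} (a_i \cdot v)^2
   = |Av_1|^2 + |Av|^2 .
\end{align*}
```

The term |Av₁|² is the same for every such S, so choosing the best S containing v₁ amounts to maximizing |Av|² over unit vectors v perpendicular to v₁.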

The second singular vector, v₂, is defined by the best-fit line perpendicular to v₁:

$$v_2 = \arg\max_{\substack{v \perp v_1 \\ |v|=1}} |Av|.$$

The value σ₂(A) = |Av₂| is called the second singular value of A. The third singular vector v₃ and the third singular value are defined similarly by

$$v_3 = \arg\max_{\substack{v \perp v_1, v_2 \\ |v|=1}} |Av|$$

and

$$\sigma_3(A) = |Av_3|,$$

and so on. The process stops when we have found singular vectors v₁, v₂, . . . , v_r, singular values σ₁, σ₂, . . . , σ_r, and

$$\max_{\substack{v \perp v_1, v_2, \ldots, v_r \\ |v|=1}} |Av| = 0.$$
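The greedy procedure, including its stopping rule, can be sketched in a few lines. This is an illustration under stated assumptions rather than the text's algorithm: the helper names `top_right_singular_vector` and `greedy_singular_vectors` are hypothetical, NumPy's SVD stands in for the subroutine that finds a maximizer of |Mv|, the constraint "v perpendicular to the earlier v_j" is enforced by multiplying A by a projector P, and the exact stopping test "max |Av| = 0" is replaced by a small tolerance because of floating-point error.

```python
import numpy as np

def top_right_singular_vector(M):
    """Return (v, |Mv|) for a unit vector v maximizing |Mv| (stand-in subroutine)."""
    _, s, vt = np.linalg.svd(M, full_matrices=False)
    return vt[0], s[0]

def greedy_singular_vectors(A, tol=1e-10):
    """Greedily build v_1, v_2, ... and sigma_1, sigma_2, ... for the matrix A."""
    d = A.shape[1]
    vs, sigmas = [], []
    P = np.eye(d)  # projector onto the complement of span(v_1, ..., v_i)
    for _ in range(d):
        # For v perpendicular to the earlier v_j we have (A P) v = A v, and the
        # rows of A P lie in that complement, so the maximizer of |A P v| over
        # all unit v is also the maximizer of |A v| under the constraint.
        v, sigma = top_right_singular_vector(A @ P)
        if sigma <= tol * np.linalg.norm(A):
            break  # stopping rule: max |Av| over the complement is (numerically) 0
        vs.append(v)
        sigmas.append(sigma)
        P = P - np.outer(v, v)
    return np.array(vs), np.array(sigmas)

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))  # rank 2 by construction
V, sigmas = greedy_singular_vectors(A)
print("number of singular vectors found:", len(sigmas))
print("greedy singular values:", sigmas)
print("library singular values:", np.linalg.svd(A, compute_uv=False))
```

On this rank-2 example the loop stops after two vectors, matching the r in the stopping condition.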

The greedy algorithm found the v₁ that maximized |Av| and then the best-fit 2-dimensional subspace containing v₁. Is this necessarily the best-fit 2-dimensional subspace overall? The following theorem establishes that the greedy algorithm finds the best subspaces of every dimension.

Theorem 3.1 (The Greedy Algorithm Works) Let A be an n × d matrix with singular vectors v₁, v₂, . . . , v_r. For 1 ≤ k ≤ r, let V_k be the subspace spanned by v₁, v₂, . . . , v_k. For each k, V_k is the best-fit k-dimensional subspace for A.

Proof: The statement is obviously true for k = 1. For k = 2, let W be a best-fit 2-dimensional subspace for A. For any orthonormal basis (w₁, w₂) of W, |Aw₁|² + |Aw₂|² is the sum of squared lengths of the projections of the rows of A onto W. Choose an orthonormal basis (w₁, w₂) of W so that w₂ is perpendicular to v₁. If v₁ is perpendicular to W, any unit vector in W will do as w₂. If not, choose w₂ to be the unit vector in W perpendicular to the projection of v₁ onto W. This makes w₂ perpendicular to v₁.⁸ Since v₁ maximizes |Av|² over all unit vectors, it follows that |Aw₁|² ≤ |Av₁|². Since v₂ maximizes |Av|² over all unit vectors v perpendicular to v₁, |Aw₂|² ≤ |Av₂|². Thus

|Aw₁|²+ |Aw₂|² ≤ |Av₁|²+ |Av₂|².

Hence, V₂ is at least as good as W and so is a best-fit 2-dimensional subspace.

For general k, proceed by induction. By the induction hypothesis, V_{k−1} is a best-fit (k−1)-dimensional subspace. Suppose W is a best-fit k-dimensional subspace. Choose an orthonormal basis w₁, w₂, . . . , w_k of W so that w_k is perpendicular to v₁, v₂, . . . , v_{k−1}. Such a w_k exists because the vectors of W perpendicular to v₁, . . . , v_{k−1} satisfy only k − 1 linear constraints in a k-dimensional space, so they include a nonzero vector. Then

|Aw₁|² + |Aw₂|² + · · · + |Aw_{k−1}|² ≤ |Av₁|² + |Av₂|² + · · · + |Av_{k−1}|²

since w₁, . . . , w_{k−1} span a (k−1)-dimensional subspace and V_{k−1} is an optimal (k−1)-dimensional subspace. Since w_k is perpendicular to v₁, v₂, . . . , v_{k−1}, by the definition of v_k, |Aw_k|² ≤ |Av_k|². Thus

|Aw₁|² + |Aw₂|² + · · · + |Aw_{k−1}|² + |Aw_k|² ≤ |Av₁|² + |Av₂|² + · · · + |Av_{k−1}|² + |Av_k|²,

proving that V_k is at least as good as W and hence is optimal.

8 This can be seen by noting that v₁ is the sum of two vectors that each are individually perpendicular to w₂, namely the projection of v₁ to W and the portion of v₁ orthogonal to W.
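Theorem 3.1 can be sanity-checked numerically, though a finite experiment is of course not a proof. The sketch below assumes NumPy and compares the sum of squared projections onto V_k with the same quantity for randomly drawn k-dimensional subspaces; the helper `projection_energy`, the matrix sizes, and the number of trials are illustrative choices.

```python
import numpy as np

def projection_energy(A, basis):
    """Sum of squared projection lengths of the rows of A onto span(basis).

    `basis` must have orthonormal rows.
    """
    return np.sum((A @ basis.T) ** 2)

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 6))
_, _, vt = np.linalg.svd(A, full_matrices=False)

k = 3
greedy_energy = projection_energy(A, vt[:k])  # V_k = span(v_1, ..., v_k)

# Random k-dimensional subspaces: orthonormalize Gaussian vectors with QR.
random_energies = []
for _ in range(500):
    q, _ = np.linalg.qr(rng.normal(size=(6, k)))  # columns of q are orthonormal
    random_energies.append(projection_energy(A, q.T))
max_random = max(random_energies)

print("V_k:", greedy_energy, " best of 500 random subspaces:", max_random)
```

No random subspace should capture more squared projection length than V_k, which is what the theorem asserts for every k-dimensional subspace.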

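Note that the n-dimensional vector Av_i is a list of lengths (with signs) of the projections of the rows of A onto v_i. Think of |Av_i| = σ_i(A) as the component of the matrix A along v_i. For this interpretation to make sense, it should be true that adding up the squares of the components of A along each of the v_i gives the square of the “whole content of A”. This is indeed the case and is the matrix analogy of decomposing a vector into its components along orthogonal directions.

Consider one row a_j of A. By the stopping condition, a_j · v = 0 for every v perpendicular to v₁, v₂, . . . , v_r, so a_j lies in the span of v₁, . . . , v_r and hence $\sum_{i=1}^{r} (a_j \cdot v_i)^2 = |a_j|^2$. Summing over all rows,

$$\sum_{j=1}^{n} |a_j|^2 = \sum_{j=1}^{n} \sum_{i=1}^{r} (a_j \cdot v_i)^2 = \sum_{i=1}^{r} |Av_i|^2 = \sum_{i=1}^{r} \sigma_i^2(A).$$

Since $\sum_{j} |a_j|^2 = \sum_{j,k} a_{jk}^2$, the sum of squares of the singular values of A is indeed the square of the “whole content of A”, i.e., the sum of squares of all the entries. There is an important norm associated with this quantity, the Frobenius norm of A, denoted ||A||_F, defined as

$$\|A\|_F = \sqrt{\sum_{j,k} a_{jk}^2}.$$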

Lemma 3.2 For any matrix A, the sum of squares of the singular values equals the square of the Frobenius norm. That is, $\sum_i \sigma_i^2(A) = \|A\|_F^2$.

Proof: By the preceding discussion.
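Lemma 3.2 is easy to confirm on a small example. The sketch below assumes NumPy; the 2 × 2 matrix is an arbitrary illustrative choice whose entries have squares summing to 1 + 4 + 9 + 16 = 30.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Singular values only (no singular vectors needed for this check).
sigma = np.linalg.svd(A, compute_uv=False)

sum_sq_singular = np.sum(sigma ** 2)   # sum of squared singular values
frob_sq = np.sum(A ** 2)               # sum of squares of all entries
frob_norm = np.linalg.norm(A, "fro")   # Frobenius norm as computed by NumPy

print(sum_sq_singular, frob_sq, frob_norm ** 2)
```

All three printed quantities agree up to floating-point rounding.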

The vectors v₁, v₂, . . . , v_r are called the right-singular vectors. The vectors Av_i form a fundamental set of vectors and we normalize them to length one by

$$u_i = \frac{1}{\sigma_i(A)} A v_i.$$

Later we will show that u_i similarly maximizes |u^T A| over all u perpendicular to u₁, . . . , u_{i−1}. These u_i are called the left-singular vectors. Clearly, the right-singular vectors are orthogonal by definition. We will show later that the left-singular vectors are also orthogonal.
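The normalization u_i = Av_i / σ_i(A) and the orthogonality claims can be illustrated numerically. This is a sketch assuming NumPy, with the right-singular vectors taken from `np.linalg.svd` rather than from the greedy construction; the random 7 × 4 matrix is an illustrative choice and has four nonzero singular values with probability one.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(7, 4))

# Rows of vt are the right-singular vectors v_i; s holds the sigma_i.
u_ref, s, vt = np.linalg.svd(A, full_matrices=False)

# Column i of U is u_i = A v_i / sigma_i (division broadcasts over columns).
U = (A @ vt.T) / s

gram_V = vt @ vt.T  # should be the identity: the v_i are orthonormal
gram_U = U.T @ U    # should be the identity: the u_i are orthonormal

print(np.round(gram_U, 6))
```

The columns built this way coincide with the left-singular vectors returned by the library, which is consistent with the orthogonality result the text proves later.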

From the document: Foundations of Data Science (pages 41-44)