3 Best-Fit Subspaces and Singular Value Decomposition (SVD)
3.3 Singular Vectors
We now define the singular vectors of an $n \times d$ matrix $A$. Consider the rows of $A$ as $n$ points in a $d$-dimensional space. Consider the best-fit line through the origin. Let $v$ be a unit vector along this line. The length of the projection of $a_i$, the $i$th row of $A$, onto $v$ is $|a_i \cdot v|$. From this we see that the sum of the squared lengths of the projections is $|Av|^2$. The best-fit line is the one maximizing $|Av|^2$ and hence minimizing the sum of the squared distances of the points to the line (each row's squared length is the sum of its squared projection and its squared distance to the line).

7 But there is a difference: here we take the perpendicular distance to the line or subspace, whereas, in the calculus notion, given $n$ pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we find a line $l = \{(x, y) \mid y = mx + b\}$ minimizing the vertical squared distances of the points to it, namely $\sum_{i=1}^{n} (y_i - m x_i - b)^2$.
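As a quick numerical check of the claim that maximizing the squared projections minimizes the squared distances, here is a minimal sketch in Python with NumPy. The matrix `A` and the direction `v` are arbitrary illustrative values, not taken from the text. It confirms that the squared projections sum to $|Av|^2$ and that squared projections plus squared perpendicular distances always add up to the fixed total of squared row lengths.

```python
import numpy as np

# Illustrative data (an assumption): 4 points (rows) in 2-dimensional space.
A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [-1.0, 0.5],
              [0.0, -2.0]])

# An arbitrary unit vector v defining a line through the origin.
v = np.array([3.0, 4.0]) / 5.0

# The entries of Av are the signed projection lengths of the rows onto v.
proj_lengths = A @ v
sum_sq_proj = np.sum(proj_lengths ** 2)

# Perpendicular residual of each row after removing its component along v.
residuals = A - np.outer(proj_lengths, v)
sum_sq_dist = np.sum(residuals ** 2)

# Total squared length of the rows; this does not depend on v.
total = np.sum(A ** 2)

print(sum_sq_proj, sum_sq_dist, total)
```

Because `total` is the same for every choice of `v`, any direction that increases `sum_sq_proj` must decrease `sum_sq_dist` by the same amount.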
With this in mind, define the first singular vector $v_1$ of $A$ as
$$v_1 = \arg\max_{|v|=1} |Av|.$$
Technically, there may be a tie for the vector attaining the maximum and so we should not use the article “the”; in fact, $-v_1$ is always as good as $v_1$. In this case, we arbitrarily pick one of the vectors achieving the maximum and refer to it as “the first singular vector”, avoiding the more cumbersome “one of the vectors achieving the maximum”. We adopt this terminology for all uses of $\arg\max$.
The value $\sigma_1(A) = |Av_1|$ is called the first singular value of $A$. Note that $\sigma_1^2 = \sum_{i=1}^{n} (a_i \cdot v_1)^2$ is the sum of the squared lengths of the projections of the points onto the line determined by $v_1$.
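The definitions of $v_1$ and $\sigma_1$ can be illustrated with a short sketch, assuming Python with NumPy and random illustrative data. It does not implement an algorithm from the text; it relies on `np.linalg.svd`, whose returned `Vt` has the right-singular vectors as rows in order of decreasing singular value, and then checks the defining properties directly.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))   # illustrative data: 50 points in 3 dimensions

# Library SVD, used here only as a reference way to obtain v1.
_, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]

# First singular value as defined in the text: sigma_1 = |A v1|.
sigma1 = np.linalg.norm(A @ v1)

# Sum of squared projection lengths of the rows onto v1.
sum_sq_proj = sum((a @ v1) ** 2 for a in A)

# The sign of v1 is arbitrary: -v1 attains the same value.
sigma1_neg = np.linalg.norm(A @ (-v1))

# No unit vector should beat v1; sample many random unit vectors to see this.
candidates = rng.standard_normal((1000, 3))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best_random = np.max(np.linalg.norm(A @ candidates.T, axis=0))

print(sigma1, best_random)
```

The random search is only a sanity check, not a proof: it shows that none of the sampled directions gives a larger $|Av|$ than $v_1$.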
If the data points were all either on a line or close to a line, intuitively, $v_1$ should give us the direction of that line. It is possible that data points are not close to one line, but lie close to a 2-dimensional subspace or more generally a low-dimensional space.
Suppose we have an algorithm for finding $v_1$ (we will describe one such algorithm later).
How do we use this to find the best-fit 2-dimensional plane or more generally the best-fit $k$-dimensional space?
The greedy approach begins by finding $v_1$ and then finds the best 2-dimensional subspace containing $v_1$. The fact that we measure fit by the sum of squared distances helps here. For every 2-dimensional subspace containing $v_1$, the sum of squared lengths of the projections onto the subspace equals the sum of squared projections onto $v_1$ plus the sum of squared projections along a vector perpendicular to $v_1$ in the subspace; that is, if $v$ is a unit vector in the subspace perpendicular to $v_1$, the sum is $|Av_1|^2 + |Av|^2$. Thus, instead of looking for the best 2-dimensional subspace containing $v_1$, look for a unit vector $v_2$ perpendicular to $v_1$ that maximizes $|Av|^2$ among all such unit vectors. Using the same greedy strategy to find the best three- and higher-dimensional subspaces defines $v_3, v_4, \ldots$ in a similar manner. This is captured in the following definitions. There is no a priori guarantee that the greedy algorithm gives the best fit. But, in fact, the greedy algorithm does work and yields the best-fit subspaces of every dimension, as we will show.
The second singular vector, $v_2$, is defined by the best-fit line perpendicular to $v_1$:
$$v_2 = \arg\max_{\substack{v \perp v_1 \\ |v|=1}} |Av|.$$
The value $\sigma_2(A) = |Av_2|$ is called the second singular value of $A$. The third singular vector $v_3$ and the third singular value are defined similarly by
$$v_3 = \arg\max_{\substack{v \perp v_1, v_2 \\ |v|=1}} |Av|$$
and
$$\sigma_3(A) = |Av_3|,$$
and so on. The process stops when we have found singular vectors $v_1, v_2, \ldots, v_r$, singular values $\sigma_1, \sigma_2, \ldots, \sigma_r$, and
$$\max_{\substack{v \perp v_1, v_2, \ldots, v_r \\ |v|=1}} |Av| = 0.$$
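The greedy definitions above can be sketched in code. The text defers its own algorithm for finding $v_1$, so this sketch substitutes a plain power iteration on $A^T A$, restricted to the orthogonal complement of the vectors already found. That is an assumed stand-in, not the method described later in the source. The function name `greedy_singular_vectors`, its parameters, and the test matrix are all hypothetical.

```python
import numpy as np

def greedy_singular_vectors(A, iters=500, tol=1e-9, seed=0):
    """Greedily build v_1, v_2, ... using power iteration as a stand-in maximizer."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    vs, sigmas = [], []
    for _ in range(d):
        V = np.array(vs).reshape(len(vs), d)  # rows: vectors found so far

        def project(x):
            # Remove the components of x along the vectors found so far.
            return x - V.T @ (V @ x)

        # Random start, constrained to be perpendicular to earlier vectors.
        v = project(rng.standard_normal(d))
        v /= np.linalg.norm(v)
        for _ in range(iters):
            w = project(A.T @ (A @ v))
            norm = np.linalg.norm(w)
            if norm < tol:        # A is numerically zero on the complement
                break
            v = w / norm
        sigma = np.linalg.norm(A @ v)
        if sigma < tol:           # max |Av| = 0 here, so the process stops
            break
        vs.append(v)
        sigmas.append(sigma)
    return np.array(vs), np.array(sigmas)

# Illustrative rank-2 matrix built to have singular values 5 and 3.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((6, 3)))
W, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = U @ np.diag([5.0, 3.0, 0.0]) @ W.T

vs, sigmas = greedy_singular_vectors(A)
print(len(sigmas), sigmas)
```

On this example the loop stops after two vectors, matching the stopping rule: every unit vector perpendicular to $v_1$ and $v_2$ gives $|Av| = 0$ up to rounding. Power iteration converges slowly when consecutive singular values are close, so the fixed iteration count is adequate only for well-separated values like these.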
The greedy algorithm found the $v_1$ that maximized $|Av|$ and then the best-fit 2-dimensional subspace containing $v_1$. Is this necessarily the best-fit 2-dimensional subspace overall? The following theorem establishes that the greedy algorithm finds the best subspaces of every dimension.
Theorem 3.1 (The Greedy Algorithm Works) Let $A$ be an $n \times d$ matrix with singular vectors $v_1, v_2, \ldots, v_r$. For $1 \le k \le r$, let $V_k$ be the subspace spanned by $v_1, v_2, \ldots, v_k$. For each $k$, $V_k$ is the best-fit $k$-dimensional subspace for $A$.
Proof: The statement is obviously true for $k = 1$. For $k = 2$, let $W$ be a best-fit 2-dimensional subspace for $A$. For any orthonormal basis $(w_1, w_2)$ of $W$, $|Aw_1|^2 + |Aw_2|^2$ is the sum of squared lengths of the projections of the rows of $A$ onto $W$. Choose an orthonormal basis $(w_1, w_2)$ of $W$ so that $w_2$ is perpendicular to $v_1$. If $v_1$ is perpendicular to $W$, any unit vector in $W$ will do as $w_2$. If not, choose $w_2$ to be a unit vector in $W$ perpendicular to the projection of $v_1$ onto $W$. This makes $w_2$ perpendicular to $v_1$.$^8$ Since $v_1$ maximizes $|Av|^2$ over all unit vectors, it follows that $|Aw_1|^2 \le |Av_1|^2$. Since $v_2$ maximizes $|Av|^2$ over all unit vectors $v$ perpendicular to $v_1$, $|Aw_2|^2 \le |Av_2|^2$. Thus
$$|Aw_1|^2 + |Aw_2|^2 \le |Av_1|^2 + |Av_2|^2.$$
Hence, $V_2$ is at least as good as $W$ and so is a best-fit 2-dimensional subspace.
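For general $k$, proceed by induction. By the induction hypothesis, $V_{k-1}$ is a best-fit $(k-1)$-dimensional subspace. Suppose $W$ is a best-fit $k$-dimensional subspace. Choose an orthonormal basis $w_1, w_2, \ldots, w_k$ of $W$ so that $w_k$ is perpendicular to $v_1, v_2, \ldots, v_{k-1}$. Such a $w_k$ exists because the projections of $v_1, \ldots, v_{k-1}$ onto $W$ span a subspace of dimension at most $k-1$, so some unit vector in the $k$-dimensional subspace $W$ is perpendicular to all of these projections and hence to $v_1, \ldots, v_{k-1}$. Then
$$|Aw_1|^2 + |Aw_2|^2 + \cdots + |Aw_{k-1}|^2 \le |Av_1|^2 + |Av_2|^2 + \cdots + |Av_{k-1}|^2$$
since $w_1, \ldots, w_{k-1}$ span a $(k-1)$-dimensional subspace and $V_{k-1}$ is an optimal $(k-1)$-dimensional subspace. Since $w_k$ is perpendicular to $v_1, v_2, \ldots, v_{k-1}$, by the definition of $v_k$, $|Aw_k|^2 \le |Av_k|^2$. Thus
$$|Aw_1|^2 + |Aw_2|^2 + \cdots + |Aw_{k-1}|^2 + |Aw_k|^2 \le |Av_1|^2 + |Av_2|^2 + \cdots + |Av_{k-1}|^2 + |Av_k|^2,$$
proving that $V_k$ is at least as good as $W$ and hence is optimal.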
8 This can be seen by noting that $v_1$ is the sum of two vectors that each are individually perpendicular to $w_2$, namely the projection of $v_1$ onto $W$ and the portion of $v_1$ orthogonal to $W$.
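Theorem 3.1 can be sanity-checked numerically. The sketch below, assuming Python with NumPy and random illustrative data, compares the sum of squared projections onto $V_k$ with the same quantity for many randomly drawn $k$-dimensional subspaces. The helper `random_subspace_score` is a hypothetical name, and the singular vectors come from `np.linalg.svd` rather than from the greedy procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 5))   # illustrative data: 40 points in 5 dimensions
k = 2

_, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k].T                      # d x k matrix with columns v_1, ..., v_k

# Sum of squared projection lengths of the rows of A onto V_k.
best = np.sum((A @ Vk) ** 2)

def random_subspace_score(A, k, rng):
    # Orthonormal basis of a random k-dimensional subspace via QR.
    Q, _ = np.linalg.qr(rng.standard_normal((A.shape[1], k)))
    return np.sum((A @ Q) ** 2)

scores = [random_subspace_score(A, k, rng) for _ in range(2000)]
print(best, max(scores))
```

Random sampling cannot prove optimality, but no sampled subspace should exceed `best`, which equals $\sigma_1^2 + \cdots + \sigma_k^2$.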
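Note that the $n$-dimensional vector $Av_i$ is a list of lengths (with signs) of the projections of the rows of $A$ onto $v_i$. Think of $|Av_i| = \sigma_i(A)$ as the component of the matrix $A$ along $v_i$. For this interpretation to make sense, it should be true that adding up the squares of the components of $A$ along each of the $v_i$ gives the square of the “whole content of $A$”. This is indeed the case and is the matrix analogy of decomposing a vector into its components along orthogonal directions.

To see this, consider one row $a_j$ of $A$. Since the process stopped with $\max |Av| = 0$ over unit vectors perpendicular to $v_1, \ldots, v_r$, every such $v$ satisfies $a_j \cdot v = 0$, so $a_j$ lies in the span of $v_1, \ldots, v_r$ and $\sum_{i=1}^{r} (a_j \cdot v_i)^2 = |a_j|^2$. Summing over all rows,
$$\sum_{j=1}^{n} |a_j|^2 = \sum_{j=1}^{n} \sum_{i=1}^{r} (a_j \cdot v_i)^2 = \sum_{i=1}^{r} |Av_i|^2 = \sum_{i=1}^{r} \sigma_i^2(A).$$
But $\sum_{j} |a_j|^2 = \sum_{j,k} a_{jk}^2$. Thus the sum of squares of the singular values of $A$ is indeed the square of the “whole content of $A$”, i.e., the sum of squares of all the entries. There is an important norm associated with this quantity, the Frobenius norm of $A$, denoted $\|A\|_F$, defined as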
$$\|A\|_F = \sqrt{\sum_{j,k} a_{jk}^2}.$$
Lemma 3.2 For any matrix $A$, the sum of squares of the singular values equals the square of the Frobenius norm. That is, $\sum_i \sigma_i^2(A) = \|A\|_F^2$.
Proof: By the preceding discussion.
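Lemma 3.2 is easy to confirm on a concrete matrix. This is a minimal sketch assuming Python with NumPy and a random illustrative matrix; it compares the sum of squared singular values from `np.linalg.svd` with the sum of squared entries.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 4))    # illustrative matrix

# Singular values only (no singular vectors needed).
s = np.linalg.svd(A, compute_uv=False)

sum_sq_singular = np.sum(s ** 2)   # sum of sigma_i^2
frob_sq = np.sum(A ** 2)           # sum of squares of all entries of A

print(sum_sq_singular, frob_sq)
```

The two printed numbers should agree up to floating-point rounding.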
The vectors $v_1, v_2, \ldots, v_r$ are called the right-singular vectors. The vectors $Av_i$ form a fundamental set of vectors and we normalize them to length one by
$$u_i = \frac{1}{\sigma_i(A)} A v_i.$$
Later we will show that $u_i$ similarly maximizes $|u^T A|$ over all $u$ perpendicular to $u_1, \ldots, u_{i-1}$. These $u_i$ are called the left-singular vectors. Clearly, the right-singular vectors are orthogonal by definition. We will show later that the left-singular vectors are also orthogonal.
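The normalization $u_i = A v_i / \sigma_i(A)$ and the orthogonality claim can be previewed numerically. This is a sketch assuming Python with NumPy and a random illustrative matrix of full column rank, so that every $\sigma_i$ is nonzero. The right-singular vectors are taken from `np.linalg.svd`, and the $u_i$ are then built exactly as in the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))    # illustrative matrix, full column rank

_, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column i of U is u_i = A v_i / sigma_i (broadcasting divides column i by s[i]).
U = (A @ Vt.T) / s

# Gram matrix of the u_i: the identity if they are orthonormal.
gram = U.T @ U
print(np.round(gram, 6))
```

The off-diagonal entries of `gram` being zero up to rounding is what the later orthogonality result predicts; the unit diagonal follows directly from the normalization.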