Properties of Integrated Subspace
Lemma 3.3.2. Let Q be a feasible points of the problem (+). Then Q satisfies the first
4.1 Line Search Type Method
Q[1]|Q[2]| · · · |Q[N ]
-.. Therefore, canonical SVD (for example, the SVD routine in MATLAB or LAPACK [2]) can be applied to solve this eigenvectors problem. However, the computational complexity of canonical SVD is O(N2mℓ2), which is much slower when N increasing. Hence in this chapter, we intro-duce some other methods to compute the integrated subspace with different computational complexity than O(N2mℓ2).
4.1 Line Search Type Method
Since the integrated subspace is defined by an optimization problem, it is reasonable to use the algorithm for solving the optimization problem for computing the integrated subspace.
A classical method for solving the optimization problem is line search methods. Gradient descent method (in our case, gradient ascent method) is widely used among the line search methods. The update scheme for unconstraint gradient ascent method is
xk+1 = xk+ τk∇f(xk)
where f is the objective function and τk is the step size needed to search in the n-th it-eration. In the viewpoint of line search, in each iteration, we define a curve γk(τ ) =
xk+ τk∇f(xk). This curve is a straight line stating at the point xkand point to the direc-tion ∇f(xk), which is the steeps ascent direction of objective function at x. The point for next iteration is given by finding a suitable step size τ and define xk+1 = γk(τ ). These explanations shows the main concept of gradient ascent method. It tries to find the optimal solution by walking along the curve with steepest direction in each step with suitable step size.
The main issue for this algorithm is to find a suitable step size τ. If the step size is too large, it may be difficult to reach the optimal solution because it may walk too far.
However, if the step size is too small, it may need more iteration to reach the optimal solution. One of the popular methods for selecting step size is using the back tracking method with Armijo-Wolfe condition (for maximization problem)
f (xk+ τkpk)≥ f(xk) + ρτkp⊤k∇f(xk) p⊤k∇f(xk+ τkpk)≤ ρ2p⊤k∇f(xk)
where ρ, ρ2 is parameters that can be chosen and pk = ∇f(xk) in gradient ascend. The second condition always holds due to the design of algorithm. Hence we just need to focus on the first condition. We use Armijo rule for simplicity in the following content.
The back tracking method tries to find the step size τk satisfied the Armijo rule by the following steps. First, assign an initial guess of step size, such as 1. Then test whether this step size satisfies the Armijo rule. Suppose not, multiply this step size by a positive constant β < 1 and test it again. Repeat these processes until the step size satisfies the Armijo rule and set it into τk.
For the constraint optimization problem, the gradient ascent method needs to be mod-ified. Here we choose to use the viewpoint of the gradient ascent method on Stiefel man-ifold. The Stiefel manifold is the manifold consisted of all m × ℓ orthogonal matrices
Sm,ℓ =0
Q∈ Rm×ℓ: Q⊤Q = I1
which is exactly our constraint in the optimization problem.
22
doi:10.6342/NTU201702973
Define
F (Q) = 1
2tr(Q⊤P Q) as our objective function. We denote the gradient of F at Q as
GF(Q) = P Q
where the equality can be computed easily. On the manifold, the gradient in the Euclidean space may not represent the direction (tangent) at a point. Hence we need to project the gradient to the tangent space on the manifold for further derivation.
Lemma 4.1.1. The projected gradient of F onto the tangent space TQSm,ℓof Stiefel man-ifold Sm,ℓis
DF(Q) = (I− QQ⊤)P Q.
Proof. First we find a necessary and sufficient condition for X being in TQSm,ℓ. For all X ∈ TQSm,ℓ, find a path Γ(t) in Sm,ℓwith Γ(0) = Q and Γ′(0) = X. From Γ(t)⊤Γ(t) = I, differentiate each side by t and take t = 0, we have
X⊤Q + Q⊤X = 0, (4.1)
which gives a necessary condition for X ∈ TQSm,ℓ. There are ℓ(ℓ + 1)/2 conditions for X in (4.1) and the dimension ofTQSm,ℓ is mℓ − ℓ(ℓ + 1)/2, which means (4.1) is also a sufficient condition for X ∈ TQSm,ℓ. By taking vec to each sides of (4.1), we get the equality
[(Q⊤⊗ Iℓ)Km,ℓ+ (Iℓ⊗ Q⊤)] vec(X) = 0.
Define T = Kℓ,m(Q⊗ Iℓ) + (Iℓ⊗ Q) and get T⊤vec(X) = 0. This shows that the tangent space (after vectorizing each elements) is contained in the null space of T⊤. One can compute the rank of T and shows that the null space of T⊤ is actually the tangent space. Hence the projection matrix onto the tangent space is given by (I − PT), where PT = T (T⊤T )+T⊤and (T⊤T )+denoted the Moore-Penrose pseudo-inverse. With PT,
DF can be given via vec(DF) = (I − PT) vec(GF). With some calculation, we have T = (Iℓ⊗ Q)(Iℓ2 + Kℓ,ℓ) and thus
T⊤T = (Iℓ2 + Kℓ,ℓ)⊤(Iℓ⊗ Q)⊤(Iℓ⊗ Q)(Iℓ2 + Kℓ,ℓ)
= (Iℓ2 + Kℓ,ℓ)(Iℓ2 + Kℓ,ℓ) = 2(Iℓ2 + Kℓ,ℓ).
Then the projection matrix PT can be calculated as:
PT = T (T⊤T )+T⊤
Algorithm 3 is rewritten from [1] by the notation in this thesis and the projected gra-dient as above. Note that the function F (M) denote the function that orthogonalize M first (hence this point is in the Stiefel manifold) and then plug into F .
The basic concept of Algorithm 3 is same as the gradient ascent method described previously. For each step, we find the projected gradient, which is the steepest direction on the manifold. Then we walk along this direction with a suitable step size decide by backtracking and Armijo-Wolfe condition. The difference is that we need to retract back
24
doi:10.6342/NTU201702973
Algorithm 3 Integration of {Q[i]}Ni=1based on Armijo line search.
Require: Q[1], Q[2], . . . , Q[N ](subspace matrices), Qini(initial guess), τ0 > 0 (initial step size), β ∈ (0, 1) (scaling parameter for step size searching), ρ ∈ (0, 1) (parameters for step size searching)
Ensure: Integrated subspace matrix Q based on Armijo line search
1: Initialize the current iterate Qc ← Qini
2: while (not convergent) do
3: Compute the gradient on manifold X = (Im− QcQ⊤c)P Qc
4: Find the smallest integer j ≥ 0 such that the following inequality holds:
F (Qc+ τ0βjX)≥ F (Qc) + τ0βjρ∥X∥2F
5: Orthogonalize (Qc+ τ0βjX) (for example, by QR-decomposition or polar decom-position) as Q+
6: Assign Qc ← Q+
7: end while
8: Output Q = Qc
to the manifold before we calculate the objective function and before the next iteration. In this algorithm, it uses orthogonalization as the retraction.
In [1], the convergent theorem is also provided. Here we translate the theorem and write down the most important part as Theorem 4.1.2. Note that in the theorem, a critical point x∗ (the points that make projected gradient is 0) is stable, if for any neighborhood U around x∗, there exists another neighborhood V around x∗, such that if the initial value start in V, then after any finite steps, the result will be in U. A critical point is asymptotically stable if it is stable, and the condition also holds as the number of steps tends to infinity.
A critical point is unstable if it is not stable.
Theorem 4.1.2 (Convergence theory for Algorithm 3). Let λ1 ≥ λ2 ≥ . . . ≥ λℓ >
λℓ+1 ≥ . . . ≥ λm be the eigenvalues of P and let Q∗ be the corresponding leading ℓ eigenvectors. Algorithm 3 converge to the orthogonal matrix Q with each column are eigenvectors of P . Among all the critical points, Q∗(up to a right orthogonal transform) is the only asymptotically stable point with the linear convergent rate. Other critical points are unstable.
This theorem gives the promise of the convergence. Also, our result for the property of optimization problem in Chapter 3 can support this result. The critical points are actually the points satisfying first order condition 3.2. Among these points, only Q∗ satisfies the
second order condition, which is related to the condition of stable points.
However, same as the traditional gradient ascent method, the convergent rate of the Algorithm 3 is linear. This could be a problem since it might take many steps to converge.
Here we provide another algorithm. Algorithm 4 is rewritten by the notation and objective function in this thesis from the algorithm proposed by Wen and Yin [11].
This algorithm is based on gradient ascent method which is similar to Algorithm 3.
However, some modification is introduced for this algorithm. The method to retract the point back to the manifold is changed in this modified algorithm. The searching path at current point Qc is defined as ΓQc(τ ) = (I − τ2M )−1(I + τ2M )Qc, where M = GQ⊤c − QcG⊤ and G = P Qc is the gradient of the objective function in Euclidean space. By using Woodbury matrix identity, the searching path is same as
ΓQc(τ ) = Qc− τL(I2ℓ+1
2τ R⊤L)−1R⊤Qc (4.3) where L = [−G Qc] and R = [Qc G]. This modification decreases the matrix size to compute inverse. As mentioned in [11], this path also satisfies some properties, hence Theorem 4.1.2 also holds for this algorithm.
Algorithm 4 also uses the Barzilai-Borwein step size [3] (BB step size) to accelerate this gradient method, and the nonmonotone strategy in [12] to prevent stuck in the local optimal points. The Armijo rule in the Algorithm 4 is
F (ΓQc(τ ))≥ F (ΓQc(0)) + τ σ dF (ΓQc(t))
The nonmonotone strategy in [12] modified the Armijo rule by
F (ΓQc(τ ))≥ c + τσ dF (ΓQc(τ )) soften the Armijo rule and increase the chance to jump out the local minimum.
26
doi:10.6342/NTU201702973
The BB step size is the simulation for Newton’s method, in a more efficient way.
They consider the update scheme xk+1 = xk + Skgk. In Newton’s method, the matrix Sk is the Hessian matrix of objective function. They set the matrix Sk = τkI, where τk is the step size needed to computed. Also, they wish Sk can approximately satisfies the secant condition ∆x = Sk∆g or Sk∆x = ∆g in quasi-Newton’s method, where
∆x = xk− xk−1, ∆g = gk− gk−1. Hence the step size τkis determined by
τk = argmin
τ ∥∆x − τ∆g∥ or τk= argmin
τ ∥τ∆x − ∆g∥ . The solutions of these minimization problems are
αk= ⟨∆x, ∆g⟩
⟨∆g, ∆g⟩ or αk = ⟨∆x, ∆x⟩
⟨∆x, ∆g⟩
These tow step size are called BB step size. In our case, the BB step size are
τguess = tr(D1⊤D1)
| tr(D⊤1D2)| or |tr(D⊤1D2)| tr(D2⊤D2)
where D1 = Q+− Qc and D2 = X+− Xc. However, BB step size does not imply the convergence. Therefore, Algorithm 4 still need to use back tracking method with Armijo-Wolfe conditions to ensure the convergence of the algorithm.
In each iteration, the most intensive calculation in Algorithm 4 is the multiplication G = P Qc. This multiplication can be computed by )N
i=1Q⊤[i]Qc first and then G = )N
i=1Q[i]Q⊤[i]Qc. The first step contains N matrix multiplication with size ℓ × m and m× ℓ. The second step contains N matrix multiplication with size m × ℓ and ℓ × ℓ. Hence the complexity for computing G is O(Nmℓ2).
The remaining steps in each iteration can be computed by using G. First, we consider the part for line search. The value of objective function can be computed by F (Q) = tr(Q⊤G), which is the summation of dot product of two m× ℓ matrices. The complexity is O(mℓ). The point on the curve ΓQc(τ ) can be solved by a linear system with dimension ℓ matrix multiplication of size m×2ℓ and 2ℓ×m, and m×2ℓ and 2ℓ×2ℓ. The complexity
Algorithm 4 Integration of {Q[i]}Ni=1 based on nonmonotone line search with BB step size.
Require: Q[1], Q[2], . . . , Q[N ](subspace matrices), Qini(initial guess), τ0 > 0 (initial step size), β ∈ (0, 1) (scaling parameter for step size searching), ρ ∈ (0, 1) (parameter for step size searching), η ∈ (0, 1) (parameter for next step searching), τM, τm(maximum and minimum for predicting step size)
Ensure: Integrated subspace matrix Q based on Armijo line search with BB step size
1: Initialize Qc ← Qini, ¯τ ← τ0, ζ = 1, c = F (Qc)
2: while (not convergent) do
3: Compute the gradient in Euclidean space G = P Qc
4: Set L = [−G Qc] and R = [Qc G].
5: Find the smallest integer j ≥ 0 such that the following inequality holds:
F (ΓQc(¯τ βj))≥ c + ¯τβjρ#
is O(mℓ2+ ℓ3). Hence the total complexity for line search is O(Iinner(mℓ2+ ℓ3)), where Iinner denotes the number of iteration of line search (inner loop). Second, we consider the part for updating c. The main calculation of this part is the computation of the objective function. Hence the complexity of this part is O(mℓ). Finally, we consider the part for computing BB step size. The main computation in this part is X+, which need to calculate G+ = P Q+. However, this computation can be directly used in next iteration. Hence we compute the complexity of this part in the next iteration. For other components, they need matrix multiplication with complexity O(mℓ2). Also, it needs O(mℓ) to compute each trace for computing τguess.
To sum up, suppose we use IW Y iteration to converge, then the computational com-plexity of Algorithm 4 is IW Y(O(N mℓ2)+O(Iinner(mℓ2+ℓ3))+O(mℓ)+O(mℓ2+mℓ)), which is dominate by the term O(IW YN mℓ2), suppose Iinneris controlled. (This assump-tion is reasonable since Iinner is often restricted within some number.)