https://doi.org/10.1007/s12532-022-00217-1

FULL LENGTH PAPER

A penalized method of alternating projections for weighted low-rank Hankel matrix optimization

Jian Shen¹ · Jein-Shan Chen² · Hou-Duo Qi¹ · Naihua Xiu³

Received: 1 August 2020 / Accepted: 18 December 2021 / Published online: 3 February 2022

© The Author(s) 2022

Abstract

Weighted low-rank Hankel matrix optimization has long been used to reconstruct contaminated signals or forecast missing values for time series of a wide class. The Method of Alternating Projections (MAP) (i.e., alternately projecting onto a low-rank matrix manifold and the Hankel matrix subspace) is a leading method. Despite its wide use, MAP has long been criticized for lacking convergence guarantees and for ignoring the weights used to reflect the importance of the observed data. Most known results hold only in a local sense. In particular, the latest research shows that MAP may converge at a linear rate provided that the initial point is close enough to a true solution and a transversality condition is satisfied. In this paper, we propose a globalized variant of MAP through a penalty approach. The proposed method inherits the favourable local properties of MAP and has the same computational complexity. Moreover, it is capable of handling a general weight matrix, is globally convergent, and enjoys a local linear convergence rate provided that the cut-off singular values are significantly smaller than the kept ones. Furthermore, the new method also applies to complex data.

Extensive numerical experiments demonstrate the efficiency of the proposed method against several popular variants of MAP.

Jian Shen
j.shen@soton.ac.uk

Jein-Shan Chen
jschen@math.ntnu.edu.tw

Hou-Duo Qi
hdqi@soton.ac.uk

Naihua Xiu
nhxiu@bjtu.edu.cn

1 School of Mathematical Sciences, University of Southampton, Highfield, Southampton SO17 1BJ, UK

2 Department of Mathematics, National Taiwan Normal University, Taipei 11677, Taiwan

3 Department of Applied Mathematics, Beijing Jiaotong University, Beijing, China


Keywords Hankel matrix · Alternating projections · Global convergence · Linear convergence · Time series

Mathematics Subject Classification 47B35 · 62M10 · 65F55 · 90C26 · 90C30

1 Introduction

In this paper, we are mainly interested in numerical methods for the weighted low-rank Hankel matrix optimization problem

  min f(X) := (1/2)‖W ◦ (X − A)‖²,   s.t.   X ∈ Mr ∩ Hk×ℓ,   (1)

where Hk×ℓ is the space of all k×ℓ Hankel matrices in the real/complex field with the standard trace inner product, ‖·‖ is the Frobenius norm, Mr is the set of matrices whose rank is not greater than a given rank r, A is given, and W is a given weight matrix with Wij ≥ 0. Here A ◦ B denotes the elementwise (Hadamard) product of A and B. We note that all matrices involved are of size k×ℓ.

The difficulties in solving (1) lie in the low-rank constraint and in how to effectively handle a general weight matrix W, the latter of which is often overlooked in the existing literature. Our main purpose is to develop a novel, globally convergent algorithm for (1), and its efficiency will be benchmarked against several state-of-the-art algorithms.

In what follows, we first explain an important application of (1) to time series data, which will be tested in our numerical experiments. We then review the latest algorithmic advances relating to the method of alternating projections. We finish this section by explaining our approach and main contributions.

1.1 Applications in time series

Problem (1) arises from a large number of applications, including signal processing, system identification and finding the greatest common divisor between polynomials [23]. To motivate our investigation of (1), let us consider a complex-valued time series a = (a1, a2, . . . , an) of finite rank [17, Chp. 5]:

  at = Σ_{s=1}^{m} Ps(t) λs^t,   t = 1, 2, . . . , n,   (2)

where Ps(t) is a complex polynomial of degree (νs − 1) (νs are positive integers) and the λs ∈ C \ {0} are distinct. Define r := ν1 + · · · + νm (":=" means "define"). Then it is known [29, Prop. 2.1] that the rank of the Hankel matrix A generated by a:


  A = H(a) :=  [ a1    a2      · · ·   aℓ
                 a2    a3      · · ·   aℓ+1
                 ⋮     ⋮               ⋮
                 ak    ak+1    · · ·   an ]

must be r, where the choice of (k, ℓ) satisfies n = k + ℓ − 1 and r ≤ k ≤ n − r + 1.
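As a quick illustration of this rank property, the following minimal numpy sketch (our own example, with parameters chosen arbitrarily) samples a signal of the form (2) built from three simple exponentials (all νs = 1, so r = 3), forms H(a), and checks its numerical rank. The helper name hankel_from_series is ours, not the paper's.

```python
import numpy as np

def hankel_from_series(a, k):
    """Form the k x l Hankel matrix H(a) with l = n - k + 1 (constant along anti-diagonals)."""
    n = len(a)
    l = n - k + 1
    return np.array([[a[i + j] for j in range(l)] for i in range(k)])

# Signal of type (2): m = 3 simple (possibly damped) exponentials, hence rank r = 3.
n, k = 60, 25
lam = np.array([0.90 * np.exp(2j * np.pi * 0.11),
                1.00 * np.exp(2j * np.pi * 0.27),
                0.95 * np.exp(2j * np.pi * 0.40)])
coeffs = [1.0, 0.5, 2.0]
t = np.arange(1, n + 1)
a = sum(d * l_s ** t for d, l_s in zip(coeffs, lam))

A = hankel_from_series(a, k)
print(np.linalg.matrix_rank(A))   # expected: 3 (= r)
```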

Suppose now that the time series a is contaminated and/or has missing values. To reconstruct a, a natural approach is to compute its nearest time series x by least squares:

  min Σ_{i=1}^{n} wi |ai − xi|²,   s.t.   rank(X) ≤ r,  X = H(x),   (3)

where w = (w1, . . . , wn) ≥ 0 is the corresponding weight vector emphasizing the importance of each element of a. The equivalent reformulation of (3) as (1) is obtained by setting

  W := H(√v ◦ √w),   where

    vi = 1/i            for i = 1, . . . , k − 1,
    vi = 1/k            for i = k, . . . , n − k + 1,
    vi = 1/(n − i + 1)  for i = n − k + 2, . . . , n,

where v is known as the averaging vector of a Hankel matrix of size k×ℓ (k ≤ ℓ) and √v, √w are the elementwise square roots of v and w. We note that the widely studied case of (1) with Wij ≡ 1 corresponds to wi = 1/vi, which is known as the trapezoid weighting. Another popular choice for financial time series is the exponential weights wi = exp(αi) for some α > 0. We refer to [15, Sect. 2.4] for more comments on the choice of weights.
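For concreteness, here is a small numpy sketch of our own (function names and sizes are illustrative, not from the paper) that builds the averaging vector v and the weight matrix W for the trapezoid case wi = 1/vi, and checks the statement above that this choice reproduces Wij ≡ 1.

```python
import numpy as np

def averaging_vector(n, k):
    """v_i = 1 / (number of entries on the i-th anti-diagonal of a k x l Hankel matrix), l = n - k + 1."""
    l = n - k + 1
    counts = np.array([min(i + 1, k, n - i) for i in range(n)])  # i is 0-based here
    return 1.0 / counts

def hankel_from_series(u, k):
    l = len(u) - k + 1
    return np.array([[u[i + j] for j in range(l)] for i in range(k)])

n, k = 12, 5
v = averaging_vector(n, k)
w = 1.0 / v                                      # trapezoid weights w_i = 1/v_i
W = hankel_from_series(np.sqrt(v) * np.sqrt(w), k)
print(np.allclose(W, 1.0))                       # True: trapezoid weights give W_ij = 1
```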

A special type of the time series (2) arises from spectral compressed sensing, which has attracted considerable attention lately [4]. In its one-dimensional case, at is often a superposition of a few complex sinusoids:

  at = Σ_{s=1}^{r} ds exp{(2π j ωs − τs) t},   (4)

where j = √−1, r is the model order, ωs is the frequency of each sinusoid, ds ≠ 0 is the weight of each sinusoid, and τs ≥ 0 is a damping factor. We note that (4) is a special case of (2) with Ps(t) = ds (hence νs = 1) and λs = exp(2π j ωs − τs).

If at is sampled at all integer values from 1 to n, we get a sample vector a ∈ Cn. Consequently, the rank of H(a) must be r. However, in practice, only a subset Ω of the sampling points {1, . . . , n} may be observed (possibly contaminated), leading to the question of how to best reconstruct a(t) based on its partial observations ai on Ω. This has led to the Hankel matrix completion/approximation problem (1), see [4, Sect. II.A] and [2, Sect. 2.1]. A popular choice of W in spectral compressed sensing is Wij = 1 for all (i, j), resulting in the distance between X and A in (1) being measured by the standard Frobenius norm. In this paper, we assume

Assumption 1 W is Hankel and non-negative (i.e., Wij ≥ 0 for all (i, j)).


1.2 On alternating projection methods

Low-rank matrix optimization is an active research area. Our short review is only able to focus on a small group of those papers that motivated our research. We note that there are four basic features about the problem (1): (i) X has to be low rank; (ii) X has Hankel structure; (iii) the objective is weighted; and (iv) X may be complex valued.

The first feature is the most difficult one to handle because it causes the nonconvexity of the problem. Many algorithms have been developed proposing different ways to handle this low-rank constraint. One of the most popular choices is to use the truncated singular value decomposition to project X to its nearest rank-r matrix, and we denote this projection by ΠMr(X). This has given rise to the basic Method of Alternating Projections (MAP) (also known as the Cadzow method [1]): Given X0 ∈ H, update

  Xν+1 = ΠH(ΠMr(Xν)),   ν = 0, 1, 2, . . . ,   (5)

where ΠH(·) is the orthogonal projection operator onto the Hankel subspace Hk×ℓ. Despite its popularity in the engineering sciences, Cadzow's method cannot guarantee convergence to an optimal solution. Even when convergence occurs, not much is known about where it converges to. It has also been criticized for completely ignoring the objective function, see [5,6,8,14]. In particular, the weight matrix W does not enter Cadzow's method at all because the truncated SVD does not admit a closed-form solution under a weighted norm. Gillard and Zhigljavsky [16] proposed to replace ΠMr

by its diagonally weighted variants and studied how to best approximate the weight matrix W by a diagonal weight matrix. Qi et al. [25] proposed to use a sequence of diagonal weight matrices aiming to obtain a better approximation to the original weight matrix. Despite the improved numerical convergence, the methods in [16,25] still inherit the essential convergence problem of Cadzow's method. Recently, Lai and Varghese [20] considered a similar method for a matrix completion problem and established the linear convergence of their method under a kind of "transversality"

condition provided that the initial point is close enough to a true rank-r completion.

We refer to [7] for a more general transversality condition that ensures a local linear convergence rate of MAP onto nonconvex sets.

Alternating projections of ΠMr(·) and ΠH(·) also play an important role in the class of iterative hard thresholding (IHT) algorithms for spectral compressed sensing. For example, Cai et al. [3] established the convergence of IHT in the statistical sense (i.e., with high probability) under a coherence assumption on the initial observation matrix A. Although local convergence results (be it in the sense of monotone decrease [20] or in the statistical sense [3]) may be established for MAP under some conditions, we are not aware of any existing global convergence results, mainly due to the nonconvexity of the rank constraint. For the general weighted problem (1), it appears to be a difficult task to develop a variant of MAP that enjoys both global and local linear convergence properties. We will achieve this through a penalty approach.

Penalty approaches have long been used to develop globally convergent algorithms for problems with rank constraints, see [10,12,13,21,22,27,32,33]. For example, Gao [12] proposed the penalty function p(X) based on the following observation:


  rank(X) ≤ r   ⟺   p(X) := ‖X‖∗ − Σ_{i=1}^{r} σi(X) = 0,

where ‖X‖∗ is the nuclear norm of X and σ1(X) ≥ · · · ≥ σn(X) are the singular values of X in nonincreasing order. However, the resulting method, as well as those in [21,22,27], no longer has anything to do with MAP, and its implementation is not trivial.
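As a quick sanity check of this equivalence, the following small numpy sketch (our own illustration, not code from [12]) evaluates p(X): it is essentially zero for a matrix of rank at most r and strictly positive otherwise.

```python
import numpy as np

def gao_penalty(X, r):
    """p(X) = ||X||_* - (sum of the r largest singular values) = sum of the tail singular values."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values in nonincreasing order
    return s.sum() - s[:r].sum()

rng = np.random.default_rng(0)
r = 3
low_rank = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 10))   # rank 3
full_rank = rng.standard_normal((8, 10))

print(gao_penalty(low_rank, r))    # ~0 (up to rounding)
print(gao_penalty(full_rank, r))   # > 0
```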

1.3 Our approach and main contributions

In this paper, we propose a new penalty function and develop a penalized method whose main step is the alternating projections. We call it the penalized MAP (pMAP).

The new penalty function is the Euclidean distance function dMr(X) from X to Mr:

  dMr(X) := min {‖X − Z‖ | Z ∈ Mr}   and define   gr(X) := (1/2) d²Mr(X).   (6)

Obviously, the original problem (1) is equivalent to

  min f(X),   s.t.   dMr(X) = 0,  X ∈ H.

We propose to solve the quadratic penalty problem with ρ > 0 being a penalty parameter:

  min Fρ(X) := f(X) + ρ gr(X),   s.t.   X ∈ H.   (7)

By following the standard argument [24, Thm. 17.1] for the classical quadratic penalty method, one can establish that the global solution of (7) converges to that of (1) as ρ approaches infinity, provided convergence occurs. However, in practice, it is probably as difficult to find a global solution of (7) as of the original problem. It is hence important to establish the correspondence between the first-order stationary points of (7) and those of (1). This is done in Theorem 1 under a generalized linear independence condition.

The remaining task is to efficiently compute a stationary point of (7) for a given ρ > 0. The key observation is that gr(X) can be represented as the difference of two convex functions, which can be easily majorized (the precise meaning is given later) to obtain a majorization function gr(m)(X, Xν) of gr(X) at the current iterate Xν. We then solve the majorized subproblem:

  Xν+1 = arg min f(X) + ρ gr(m)(X, Xν),   s.t.   X ∈ H.   (8)

We will show that the update takes the following form:

  Xν+1 = W(2)/(ρ + W(2)) ◦ A + ρ/(ρ + W(2)) ◦ ΠH(ΠMr(Xν)),   (9)


where W(2) := W ◦ W and the division W(2)/(ρ + W(2)) is taken componentwise.

Compared with (5), this update is just a convex combination of the observation matrix A and the MAP iterate in (5). In the special case that W ≡ 0 (which completely ignores the objective in (1)) or ρ = ∞, (9) reduces to MAP. We will analyze the convergence behaviour of pMAP (9). In particular, we will establish the following among others.

(i) The objective function sequence {Fρ(Xν)} converges and ‖Xν+1 − Xν‖ converges to 0. Moreover, any limiting point of {Xν} is an approximate KKT point of the original problem (1) provided that the penalty parameter is above a certain threshold (see Theorem 2).

(ii) If X̄ is an accumulation point of the iterate sequence {Xν}, then the whole sequence converges to X̄ at a linear rate provided that σr(X̄) ≫ σr+1(X̄) (see Theorem 3).

Our results in (i) and (ii) provide satisfactory justification of pMAP. It is not only globally convergent, but also enjoys a linear convergence rate under reasonable conditions. Furthermore, we can assess the quality of the solution as an approximate KKT point of (1) if we are willing to increase the penalty parameter. Of course, balancing the fidelity term f(X) and the penalty term is an important issue that is beyond the current paper. The result in (ii) is practically important too. Existing empirical results show that MAP often terminates at a point whose cut-off singular values (σi, i ≥ r + 1) are significantly smaller than the kept singular values (σi, i ≤ r). Such points are often said to have a numerical rank r, although the theoretical rank is higher than r. This is exactly the situation addressed in (ii). Those results are stated and proved for real-valued matrices. We will extend them to the complex case, thanks to a technical result (Proposition 2) that the subdifferential of gr(X) in the complex domain can be computed in a similar fashion as in the real domain. To the best of our knowledge, this is the first variant of MAP that can handle general weights and enjoys both global convergence and a locally linear convergence rate under a reasonable condition (i.e., σr ≫ σr+1).

The paper is organized as follows. In the next section, we will first set up our standard notation and establish the convergence result for the quadratic penalty approach (7) when ρ → ∞. Sect. 3 includes our method of pMAP and its convergence results when ρ is fixed. In Sect. 4, we will address the issue of extension to complex-valued matrices, which arise from (2) and (4). The key concept used in this section is the Wirtinger calculus, which allows us to extend our analysis from the real case to the complex case. We report extensive numerical experiments in Sect. 5 and conclude the paper in Sect. 6.

2 Quadratic penalty approach

The main purpose of this section is to establish the convergence of the stationary points of the penalized problems (7) to those of the original problem (1) as the penalty parameter ρ goes to ∞. For simplicity of our analysis, we focus on the real case. We will extend our results to the complex case in Sect. 4. We first introduce the notation used in this paper.


2.1 Notation

For a nonnegative matrix such as the weight matrix W, √W is its componentwise square root matrix (√Wij). For a given matrix X ∈ Ck×ℓ, we often use its singular value decomposition (assume k ≤ ℓ)

  X = U diag(σ1(X), . . . , σk(X)) V^T,   (10)

where σ1(X) ≥ · · · ≥ σk(X) are the singular values of X and U ∈ Ck×k, V ∈ Cℓ×ℓ are the left and right singular vectors of X. For a given closed subset C ⊂ Ck×ℓ, we define the set of all projections from X to C by

  P_C(X) := arg min {‖X − Z‖ : Z ∈ C}.

If C is also convex, then P_C(X) is unique. When C = Mr, P_Mr(X) may have multiple elements. We define a particular element in P_Mr(X) that is based on the SVD (10):

  ΠMr(X) = Ur diag(σ1(X), . . . , σr(X)) Vr^T,

where Ur and Vr consist of the first r columns of U and V respectively.
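For readers who wish to experiment, ΠMr is simply the truncated SVD. The following numpy sketch (our own illustration) computes it and checks that the distance from X to Mr equals the norm of the discarded singular values, a fact recalled below.

```python
import numpy as np

def proj_rank(X, r):
    """Pi_{Mr}(X): keep the r largest singular values (a particular nearest rank-r matrix)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def dist_to_Mr(X, r):
    """d_{Mr}(X): Frobenius norm of the discarded tail singular values."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt(np.sum(s[r:] ** 2))

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 9))
r = 2
assert np.isclose(np.linalg.norm(X - proj_rank(X, r)), dist_to_Mr(X, r))
```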

Related to the function gr(X) defined in (6), the function

  hr(X) := (1/2)‖X‖²F − gr(X),   (11)

has the following properties by the classical result of Eckart and Young [9]:

  dist²(X, Mr) = ‖X − ΠMr(X)‖² = σ²r+1(X) + · · · + σ²k(X),
  hr(X) = (1/2)(σ²1(X) + · · · + σ²r(X)) = (1/2)‖ΠMr(X)‖².

It follows from [12, Prop. 2.16] that hr(X) is convex and the subdifferentials of hr(X) and gr(X) in the sense of [26, Def. 8.3] are respectively given by

  ∂hr(X) = conv(P_Mr(X))   and   ∂gr(X) = X − ∂hr(X),   (12)

where conv(Ω) denotes the convex hull of the set Ω. Finally, we let Bε(X) denote the ε-neighbourhood centred at X.

2.2 Convergence of quadratic penalty approach

The classical quadratic penalty methods try to solve a sequence of penalty problems:

Xν = arg min Fρν(X), s.t. X ∈ H, (13)


where the sequence ρν > 0 is increasing and goes to ∞. By following the standard argument (e.g., [24, Thm. 17.1]), one can establish that every limit point of {Xν} is also a global solution of (1). However, in practice, it is probably as difficult to find a global solution of (13) as of the original problem (1). Therefore, only an approximate solution of (13) is possible. To quantify the approximation, we recall the optimality conditions relating to both the original and penalized problems.

Following the optimality theorem [26, Thm. 8.15], we define the first-order optimality conditions of problems (1) and (7):

Definition 1 (First-order optimality condition) X ∈ H satisfies the first-order optimality condition of (1) if

  0 ∈ ∇f(X) + λ ∂dMr(X) + H⊥,   (14)

where λ is the Lagrangian multiplier. Similarly, we say Xν ∈ H satisfies the first-order optimality condition of the penalty problem (7) if

  0 ∈ ∇f(Xν) + ρν ∂gr(Xν) + H⊥.   (15)

We generate Xν ∈ H such that condition (15) is approximately satisfied:

  ‖PH(∇f(Xν) + ρν(Xν − ΠMr(Xν)))‖ ≤ εν,   (16)

where εν ↓ 0. We can establish the following convergence result.

Theorem 1 We assume the sequence {ρν} goes to ∞ and {εν} decreases to 0. Suppose each approximate solution Xν is generated to satisfy (16). Let X̄ be an accumulation point of {Xν} and assume

  ∂dMr(X̄) ∩ H⊥ = {0}.   (17)

Then X̄ satisfies the first-order optimality condition (14).

Proof Suppose X̄ is the limit of the subsequence {Xν}K. We consider the following two cases.

Case 1 There exists an infinite subsequence K1 of K such that rank(Xν) ≤ r for ν ∈ K1. This would imply ∂gr(Xν) = {0}, which with (16) implies ‖PH(∇f(Xν))‖ → 0. Hence (14) holds at X̄ with the choice λ = 0.

Case 2 There exists an index ν0 such that Xν ∉ Mr for all ν0 ≤ ν ∈ K. In this case, we assume that there exists an infinite subsequence K2 of K such that

  (Xν − ΠMr(Xν))/dMr(Xν)

has the limit v. We note that (Xν − ΠMr(Xν))/dMr(Xν) ∈ ∂dMr(Xν) for ν ≥ ν0 by [26, (8.53)]. Therefore, its limit v ∈ ∂dMr(X̄) by the upper semicontinuity. By the assumption (17), we have v ∉ H⊥ because v has unit length. Since H is a subspace, PH(·) is a linear operator. It follows from (16) that

  ρν ‖PH(Xν − ΠMr(Xν))‖ − ‖PH(∇f(Xν))‖ ≤ ‖PH(∇f(Xν) + ρν(Xν − ΠMr(Xν)))‖ ≤ εν.

Hence

  ‖PH(Xν − ΠMr(Xν))‖ ≤ (1/ρν)(εν + ‖PH(∇f(Xν))‖),

which, for ν ≥ ν0, is equivalent to

  dMr(Xν) ‖PH((Xν − ΠMr(Xν))/dMr(Xν))‖ ≤ (1/ρν)(εν + ‖PH(∇f(Xν))‖).

Taking limits on {Xν}ν∈K2 and using the fact ρν → ∞ leads to dMr(X̄)‖PH(v)‖ = 0. Since v ∉ H⊥, we have ‖PH(v)‖ > 0, which implies dMr(X̄) = 0. That is, X̄ is a feasible point of (1). Now let λν := ρν dMr(Xν); we then have

  λν (Xν − ΠMr(Xν))/dMr(Xν) = −∇f(Xν) + ξν,   ξν := ∇f(Xν) + ρν(Xν − ΠMr(Xν)).

Projecting both sides onto H yields

  λν PH((Xν − ΠMr(Xν))/dMr(Xν)) = PH(−∇f(Xν)) + PH(ξν).   (18)

Computing the inner product on both sides with PH((Xν − ΠMr(Xν))/dMr(Xν)), taking limits on the sequence indexed by K2, and using the fact PH(ξν) → 0 due to (16), we obtain

  lim_{ν∈K2} λν ‖PH(v)‖² = −⟨v, PH(∇f(X̄))⟩.

We then have

  λ̄ := lim_{ν∈K2} λν = −⟨v, PH(∇f(X̄))⟩ / ‖PH(v)‖².

Taking limits on both sides of (18) yields

  PH(∇f(X̄) + λ̄ v) = 0,

which is sufficient for

  0 ∈ ∇f(X̄) + λ̄ ∂dMr(X̄) + H⊥.

This completes our result. □


Remark 1 Condition (17) can be interpreted as saying that any 0 ≠ v ∈ ∂dMr(X̄) is linearly independent of any basis of H⊥. Therefore, (17) can be seen as a generalization of the linear independence assumption required in the classical quadratic penalty method for a similar convergence result, where all the functions involved are assumed to be continuously differentiable, see [24, Thm. 17.2]. In fact, what we really need in our proof is that there exists a subsequence {(Xν − ΠMr(Xν))/dMr(Xν)} in Case 2 whose limit v does not belong to H⊥. That could be much weaker than the sufficient condition (17).

Theorem 1 establishes the global convergence of the quadratic penalty method when the penalty parameter approaches infinity, which drives gr(Xν) smaller and smaller. In practice, however, we often fix ρ and solve for Xν. We are interested in how far Xν is from being a first-order optimal point of the original problem. For this purpose, we introduce the approximate KKT point, which keeps the first-order optimality condition (15) with the additional requirement that gr(X) is small enough.

Definition 2 (ε-approximate KKT point) Consider the penalty problem (7) and let ε > 0 be given. We say a point X ∈ H is an ε-approximate KKT point of (1) if

  0 ∈ ∇f(X) + ρ ∂gr(X) + H⊥   and   gr(X) ≤ ε.
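Both quantities can be monitored numerically. Below is a small sketch of our own (not the authors' code): it evaluates the residual ‖PH(∇f(X) + ρ(X − ΠMr(X)))‖ appearing in (16) and in Theorem 2(ii), together with gr(X), using ∇f(X) = W(2) ◦ (X − A); ΠH is implemented by anti-diagonal averaging and ΠMr by truncated SVD.

```python
import numpy as np

def proj_rank(X, r):
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def proj_hankel(X):
    """Pi_H(X): average X along its anti-diagonals (orthogonal projection onto the Hankel subspace)."""
    k, l = X.shape
    out = np.empty_like(X)
    for d in range(k + l - 1):
        i = np.arange(max(0, d - l + 1), min(k, d + 1))
        out[i, d - i] = X[i, d - i].mean()
    return out

def kkt_quantities(X, A, W, rho, r):
    """Return ||P_H(grad f(X) + rho*(X - Pi_Mr(X)))|| and g_r(X) = 0.5*d_Mr(X)^2."""
    grad_f = (W * W) * (X - A)
    residual = np.linalg.norm(proj_hankel(grad_f + rho * (X - proj_rank(X, r))))
    g_r = 0.5 * np.linalg.norm(X - proj_rank(X, r)) ** 2
    return residual, g_r
```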

3 The method of pMAP

This section develops a new algorithm that solves the penalty problem (7), in particular for finding an approximate KKT point of (1). The first part is devoted to the construction of a majorization function for the distance function dist(X, Mr). We then describe pMAP based on the majorization introduced and establish its global and local convergence.

3.1 Majorization and DC interpretation

We first recall the essential properties that a majorization function should have. Let θ(·) be a real-valued function defined on a finite-dimensional space X. For a given y ∈ X, we say a function θ(m)(·, y) : X → IR is a majorization of θ(·) at y if

  θ(m)(x, y) ≥ θ(x), ∀ x ∈ X   and   θ(m)(y, y) = θ(y).   (19)

The motivation for employing the majorization is that the squared distance function gr(X) is hard to minimize when coupled with f(X) under the Hankel matrix structure. It is noted that

  gr(X) = (1/2)‖X − ΠMr(X)‖² ≤ (1/2)‖X − ΠMr(Z)‖² =: gr(m)(X, Z),   ∀ X, Z ∈ Ck×ℓ,

where the inequality used the fact that ΠMr(X) is a nearest point in Mr to X. It is easy to verify that gr(m)(X, Z) is a majorization function of gr(X) at Z.


The following way of deriving the majorization is crucial to our convergence analysis. We recall

  hr(X) = (1/2)‖X‖² − gr(X) = (1/2)‖X‖² − (1/2) min {‖X − Z‖² : Z ∈ Mr}
        = max {⟨X, Z⟩ − (1/2)‖Z‖² : Z ∈ Mr}.

Being the pointwise maximum of linear functions (one for each given Z ∈ Mr), hr(X) is convex. The convexity of hr(X) and (12) yield

  hr(X) ≥ hr(Z) + ⟨M, X − Z⟩,   ∀ X, Z ∈ Rk×ℓ, M ∈ P_Mr(Z),   (20)

which further implies

  gr(X) = (1/2)‖X‖² − hr(X)
        ≤ (1/2)‖X‖² − hr(Z) − ⟨ΠMr(Z), X − Z⟩
        = (1/2)‖X − ΠMr(Z)‖² − (1/2)[ ‖ΠMr(Z)‖² + ‖Z‖² − ‖Z − ΠMr(Z)‖² − 2⟨ΠMr(Z), Z⟩ ]
        = (1/2)‖X − ΠMr(Z)‖² = gr(m)(X, Z),

where the term in square brackets equals 0. In other words, gr(X) can be seen as a difference of convex functions, a so-called DC function. Using a subgradient is a common way to majorize DC functions, see [12].
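The majorization property is also easy to verify numerically. The sketch below (our own, on randomly chosen matrices) checks both conditions in (19) for gr(m)(·, Z).

```python
import numpy as np

def proj_rank(X, r):
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def g_r(X, r):
    return 0.5 * np.linalg.norm(X - proj_rank(X, r)) ** 2

def g_r_major(X, Z, r):
    """g_r^(m)(X, Z) = 0.5 * ||X - Pi_Mr(Z)||^2, a majorization of g_r at Z."""
    return 0.5 * np.linalg.norm(X - proj_rank(Z, r)) ** 2

rng = np.random.default_rng(2)
r = 3
for _ in range(100):
    X, Z = rng.standard_normal((7, 11)), rng.standard_normal((7, 11))
    assert g_r(X, r) <= g_r_major(X, Z, r) + 1e-12        # theta^(m)(x, y) >= theta(x)
    assert abs(g_r(Z, r) - g_r_major(Z, Z, r)) < 1e-12    # theta^(m)(y, y) = theta(y)
```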

3.2 The pMAP algorithm

We recall that our main problem is (7). Our first step is to construct a majorized function of Fρ(X) at the current iterate Xν:

  Fρ(m)(X, Xν) = (1/2)‖W ◦ (X − A)‖² + ρ gr(m)(X, Xν)
              = (1/2)‖W ◦ (X − A)‖² + (ρ/2)‖X − ΠMr(Xν)‖²
              = (1/2)‖W ◦ X‖² + (ρ/2)‖X‖² − ⟨W(2) ◦ A + ρ ΠMr(Xν), X⟩ + (1/2)‖W ◦ A‖² + (ρ/2)‖ΠMr(Xν)‖²
              = (1/2)‖√(ρ + W(2)) ◦ (X − Xνρ)‖² + (1/2)‖W ◦ A‖² + (ρ/2)‖ΠMr(Xν)‖² − (1/2)‖√(ρ + W(2)) ◦ Xνρ‖²,

where

  Xνρ := (ρ ΠMr(Xν) + W(2) ◦ A) / (ρ + W(2)).   (21)

Note that the division is componentwise. The subproblem to be solved at iteration ν is

  Xν+1 = arg min Fρ(m)(X, Xν)   s.t.   X ∈ H
       = arg min (1/2)‖√(ρ + W(2)) ◦ (X − Xνρ)‖²   s.t.   X ∈ H
       = ΠH(Xνρ),   (22)

where Xνρ is defined in (21). The last equation in (22) holds because Wρ := √(ρ + W(2)) is Hankel, so computing Xν+1 in (22) is equivalent to averaging Xνρ along all of its anti-diagonals. Since A, ρ/(ρ + W(2)) and W(2)/(ρ + W(2)) are all Hankel matrices (due to Assumption 1), Xν+1 can be calculated through (9).

Algorithm 1 (pMAP)

1: Input data: Matrix A, weight matrix W, penalty parameter ρ, rank r, and the initial X0. Set ν := 0.

2: Update X: Compute Xν+1 by (9).

3: Convergence check: Terminate if some stopping criterion is satisfied.
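To make the update concrete, here is a minimal numpy sketch of Algorithm 1 (our own illustration under the stated setting, not the authors' reference code); the names pmap, proj_rank, proj_hankel and the tolerance/iteration parameters are ours. It implements ΠMr by truncated SVD, ΠH by anti-diagonal averaging, and the update (9), stopping when ‖Xν+1 − Xν‖ falls below a tolerance.

```python
import numpy as np

def proj_rank(X, r):
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def proj_hankel(X):
    """Average X along anti-diagonals (orthogonal projection onto the Hankel subspace)."""
    k, l = X.shape
    out = np.empty_like(X)
    for d in range(k + l - 1):
        i = np.arange(max(0, d - l + 1), min(k, d + 1))
        out[i, d - i] = X[i, d - i].mean()
    return out

def pmap(A, W, rho, r, X0, tol=1e-8, max_iter=5000):
    """pMAP update (9): X <- W^2/(rho+W^2) o A + rho/(rho+W^2) o Pi_H(Pi_Mr(X))."""
    W2 = W * W
    c_data, c_map = W2 / (rho + W2), rho / (rho + W2)   # componentwise convex weights
    X = X0
    for _ in range(max_iter):
        X_new = c_data * A + c_map * proj_hankel(proj_rank(X, r))
        if np.linalg.norm(X_new - X) <= tol:
            return X_new
        X = X_new
    return X
```

Consistent with the discussion after (9), setting W ≡ 0 (or letting ρ → ∞) makes the data term vanish and the loop reduces to plain MAP (5).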

Remark 2 Being a direct consequence of employing the majorization technique, the following decreasing property holds:

  Fρ(Xν+1) ≤ Fρ(m)(Xν+1, Xν)   (property of majorization (19))
           ≤ Fρ(m)(Xν, Xν)     (because of (22))
           = Fρ(Xν)            (property of majorization (19)).

If Fρ is coercive (i.e., Fρ(X) → ∞ as ‖X‖ → ∞, which would be the case if we require W > 0), the sequence {Xν} is bounded.

A widely used stopping criterion is ‖Xν+1 − Xν‖ ≤ ε for some small tolerance ε > 0. We will see below that ‖Xν+1 − Xν‖ approaches zero and hence such a convergence check will eventually be satisfied. For our theoretical analysis, we assume that pMAP generates an infinite sequence (e.g., let ε = 0).

3.3 Convergence of pMAP

We now present several results on the convergence of pMAP.


Theorem 2 Let {Xν} be the sequence generated by pMAP. The following hold.

(i) We have

  Fρ(Xν+1) − Fρ(Xν) ≤ −(ρ/2)‖Xν+1 − Xν‖²,   ν = 0, 1, 2, . . . .

Furthermore, ‖Xν+1 − Xν‖ → 0.

(ii) Let X̄ be an accumulation point of {Xν}. We then have

  ∇f(X̄) + ρ(X̄ − ΠMr(X̄)) ∈ H⊥.

Moreover, for a given ε > 0, if X0 ∈ Mr ∩ H and

  ρ ≥ ρε := f(X0)/ε,

then X̄ is an ε-approximate KKT point of (1).

Proof We will use a number of facts to establish (i). The first fact is due to the convexity of f(X):

  f(Xν) − f(Xν+1) ≥ ⟨∇f(Xν+1), Xν − Xν+1⟩.   (23)

The second fact is the identity

  ‖Xν+1‖² − ‖Xν‖² = 2⟨Xν+1 − Xν, Xν+1⟩ − ‖Xν+1 − Xν‖².   (24)

The third fact is due to the convexity of hr(X) defined in (11) and ΠMr(X) ∈ ∂hr(X):

  hr(Xν+1) − hr(Xν) ≥ ⟨ΠMr(Xν), Xν+1 − Xν⟩.   (25)

The last fact is the optimality condition of problem (22):

  ∇f(Xν+1) + ρ(Xν+1 − ΠMr(Xν)) ∈ H⊥.   (26)


Combining all the facts above leads to a sufficient decrease in Fρ(Xν):

  Fρ(Xν+1) − Fρ(Xν)
    = f(Xν+1) − f(Xν) + ρ gr(Xν+1) − ρ gr(Xν)
    ≤ ⟨∇f(Xν+1), Xν+1 − Xν⟩ + ρ gr(Xν+1) − ρ gr(Xν)                                 (by (23))
    = ⟨∇f(Xν+1), Xν+1 − Xν⟩ + (ρ/2)(‖Xν+1‖² − ‖Xν‖²) − ρ(hr(Xν+1) − hr(Xν))
    = ⟨∇f(Xν+1) + ρ Xν+1, Xν+1 − Xν⟩ − (ρ/2)‖Xν+1 − Xν‖² − ρ(hr(Xν+1) − hr(Xν))     (by (24))
    ≤ ⟨∇f(Xν+1) + ρ Xν+1 − ρ ΠMr(Xν), Xν+1 − Xν⟩ − (ρ/2)‖Xν+1 − Xν‖²                (by (25))
    ≤ −(ρ/2)‖Xν+1 − Xν‖².                                                           (by (26))   (27)

In the above we also used the fact that Xν+1 − Xν ∈ H. Since the sequence {Fρ(Xν)} is non-increasing and is bounded from below by 0, we have ‖Xν+1 − Xν‖² → 0.

(ii) Suppose X̄ is the limit of the subsequence {Xν}ν∈K. It follows from ‖Xν+1 − Xν‖ → 0 that X̄ is also the limit of {Xν+1}ν∈K. Taking limits on both sides of (26) and using the upper semi-continuity of the projections P_Mr(·) yields

  ∇f(X̄) + ρ(X̄ − ΠMr(X̄)) ∈ H⊥.

For the second claim, we use the fact that {Fρ(Xν)} is non-increasing to get

  f(X0) = f(X0) + ρ gr(X0) = Fρ(X0) ≥ lim Fρ(Xν) = Fρ(X̄) = f(X̄) + ρ gr(X̄) ≥ ρ gr(X̄).

The first equality holds because gr(X0) = 0 when X0 ∈ Mr. As a result,

  gr(X̄) ≤ f(X0)/ρ ≤ f(X0)/ρε = ε.   (28)

Therefore, X̄ is an ε-approximate KKT point of (1). □

We note that the first result (i) in Theorem 2 is standard in a majorization-minimization scheme and can be proved in different ways, see, e.g., [32, Thm. 3.7].

3.4 Final rank and linear convergence

This part reports two results. One is on the final rank of the output of pMAP: the rank is always bigger than the desired rank r unless A is already an optimal solution


of (1). The other is on the conditions that ensure a linear convergence rate of pMAP.

For this purpose, we need the following result.

Proposition 1 [11, Thm. 25] Given an integer r > 0, consider X ∈ IRk×ℓ of rank (r + p) with p ≥ 0. Suppose the SVD of X is represented as X = Σ_{i=1}^{r+p} σi ui viT, where σ1(X) ≥ σ2(X) ≥ · · · ≥ σr+p(X) are the singular values of X and ui, vi, i = 1, . . . , r + p, are the left and right (normalized) singular vectors. We assume σr(X) > σr+1(X) so that the projection operator ΠMr(X) is uniquely defined in a neighbourhood of X. Then ΠMr(X) is differentiable at X and the directional derivative along the direction Y is given by

  ∇ΠMr(X)(Y) = ΠTMr(X)(Y) + Σ_{1≤i≤r, 1≤j≤p} [ (σr+j/(σi − σr+j)) ⟨Y, Φ+i,r+j⟩ Φ+i,r+j − (σr+j/(σi + σr+j)) ⟨Y, Φ−i,r+j⟩ Φ−i,r+j ],

where TMr(X) is the tangent subspace of Mr at X and

  Φ±i,r+j = (1/√2)(ur+j viT ± ui vr+jT).

Theorem 3 Assume that W > 0 and let X̄ be an accumulation point of {Xν}. The following hold.

(i) rank(X̄) > r unless A is already the optimal solution of (1).

(ii) Suppose X̄ has rank (r + p) with p > 0. Let σ1 ≥ σ2 ≥ · · · ≥ σk be the singular values of X̄. Define

  w0 := min{Wij} > 0,   ε0 := w0²/ρ,   ε1 := ε0/(4 + 3ε0),   c := 1/(1 + ε1) < 1.

Under the condition

  σr/σr+1 ≥ 8pr/ε0 + 1,

it holds that

  ‖Xν+1 − X̄‖ ≤ c ‖Xν − X̄‖   for ν sufficiently large.

Consequently, the whole sequence {Xν} converges linearly to X̄.

Proof (i) Suppose X̄ is the limit of the subsequence {Xν}ν∈K. We assume rank(X̄) ≤ r. It follows from Theorem 2 that

  {Xν+1}ν∈K → X̄   and   lim_{ν∈K} ΠMr(Xν) = ΠMr(X̄) = X̄.

Taking limits on both sides of (9) and using the fact that X̄ is Hankel, we get

  X̄ = W(2)/(ρ + W(2)) ◦ A + ρ/(ρ + W(2)) ◦ ΠH(ΠMr(X̄)) = W(2)/(ρ + W(2)) ◦ A + ρ/(ρ + W(2)) ◦ X̄.

Under the assumption W > 0, we have X̄ = A. Consequently, rank(A) ≤ r, implying that A is the optimal solution of (1). Therefore, we must have rank(X̄) > r if the given matrix A is not already optimal.

(ii) Let φ(X) := ΠH(ΠMr(X)). Since ΠMr(·) is differentiable at X̄, so is φ(·). Moreover, the directional derivative of φ(·) at X̄ along the direction Y is given by

  ∇φ(X̄)Y = ΠH(∇ΠMr(X̄)Y)   and   ‖∇φ(X̄)Y‖ ≤ ‖∇ΠMr(X̄)Y‖.   (29)

The inequality above holds because ΠH(·) is an orthogonal projection onto a subspace and its operator norm is 1. The matrices in Proposition 1 have the following bounds:

  ‖Φ±i,r+j‖ ≤ (1/√2)(‖ur+j viT‖ + ‖ui vr+jT‖) ≤ (1/√2)(1 + 1) = √2,
  ‖⟨Y, Φ±i,r+j⟩ Φ±i,r+j‖ ≤ ‖Φ±i,r+j‖² ‖Y‖ ≤ 2‖Y‖.

Therefore,

  ‖ Σ_{1≤i≤r, 1≤j≤p} [ (σr+j/(σi − σr+j)) ⟨Y, Φ+i,r+j⟩ Φ+i,r+j − (σr+j/(σi + σr+j)) ⟨Y, Φ−i,r+j⟩ Φ−i,r+j ] ‖
    ≤ 4 Σ_{1≤i≤r, 1≤j≤p} (σr+j/(σi − σr+j)) ‖Y‖ ≤ 4pr (σr+1/(σr − σr+1)) ‖Y‖ ≤ (w0²/(2ρ)) ‖Y‖ = (1/2)ε0 ‖Y‖.   (30)

In the above, we used the fact that ψ(t) := t/(σr − t) is an increasing function of t for t < σr. Proposition 1, (29) and (30) imply

  ‖∇φ(X̄)Y‖ ≤ ‖ΠTMr(X̄)(Y)‖ + (ε0/2)‖Y‖ ≤ ‖Y‖ + (ε0/2)‖Y‖ = (1 + ε0/2)‖Y‖.

The second inequality above used the fact that the operator norm of ΠTMr(X̄) is not greater than 1, due to TMr(X̄) being a subspace. Since φ(·) is differentiable at X̄, there exists ε > 0 such that

  ‖φ(X) − φ(X̄) − ∇φ(X̄)(X − X̄)‖ ≤ (1/4)ε0 ‖X − X̄‖,   ∀ X ∈ Bε(X̄).

Therefore,

  ‖φ(X) − φ(X̄)‖ ≤ ‖φ(X) − φ(X̄) − ∇φ(X̄)(X − X̄)‖ + ‖∇φ(X̄)(X − X̄)‖
               ≤ (1/4)ε0 ‖X − X̄‖ + (1 + ε0/2)‖X − X̄‖ = (1 + 3ε0/4)‖X − X̄‖.

Now we are ready to quantify the error between Xν+1 and X̄ whenever Xν ∈ Bε(X̄):

  ‖Xν+1 − X̄‖ = ‖ρ/(ρ + W(2)) ◦ (φ(Xν) − φ(X̄))‖ ≤ (ρ/(ρ + w0²)) ‖φ(Xν) − φ(X̄)‖
             ≤ ((1 + 3ε0/4)/(1 + ε0)) ‖Xν − X̄‖ = c ‖Xν − X̄‖.

Consequently, Xν+1 ∈ Bε(X̄). Since {Xν}ν∈K converges to X̄, Xν will eventually fall in Bε(X̄), which implies that the whole sequence {Xν} converges to X̄ and eventually converges at a linear rate. □

Remark 3 (Implication for MAP) When the weight matrix W = 0, pMAP reduces to MAP according to (9). Theorem 2(i) implies

  ‖Xν+1 − ΠMr(Xν+1)‖² − ‖Xν − ΠMr(Xν)‖² ≤ −‖Xν+1 − Xν‖²,   (31)

which obviously implies

  ‖Xν+1 − ΠMr(Xν+1)‖ ≤ ‖Xν − ΠMr(Xν)‖.   (32)

The decrease property (32) was known in [5, Eq. (4.1)] and was used there to ascertain that MAP is a descent algorithm. Our improved bound (31) says slightly more: the decrease in the function ‖X − ΠMr(X)‖ at each step is strict unless the update becomes unchanged. In this case (W = 0), the penalty parameter is just a scaling factor in the objective, hence the KKT result in Theorem 2(ii) does not apply to MAP. This probably explains why it is difficult to establish similar results for MAP.

Remark 4 (On linear convergence) In the general context of matrix completion, Lai and Varghese [20] established local linear convergence of MAP under the following two assumptions, which we describe in terms of Hankel matrix completion. (i) The partially observed data a can be completed to a rank-r Hankel matrix M. (ii) A transversality condition (see [20, Thm. 2]) holds at M. We emphasize that the result of [20] is a local result: it requires that the initial point of MAP be close enough to M, and the rank-r assumption on M is also crucial to their analysis, which also motivated our proof. In contrast, our result is a global one and enjoys a linear convergence rate near the limit under the more realistic assumption σr ≫ σr+1. One may have noticed that the convergence rate c, though strictly less than 1, may be close to 1. It is indeed often observed numerically that MAP converges slowly. But the more important point here is that in such a situation our result ensures that the whole sequence converges. This global convergence justifies the widely used stopping criterion ‖Xν+1 − Xν‖ ≤ ε.


4 Extension to complex-valued matrices

The results obtained in the previous sections are for real-valued matrices and they can be extended to complex-valued matrices by employing what is known as the Wirtinger calculus [30]. We note that not all algorithms for Hankel matrix optimization have a straightforward extension from the real case to the complex case, see [6] for comments on some algorithms. We explain our extension below.

Suppose f : Cn → IR is a real-valued function in the complex domain. We write z ∈ Cn as z = x + jy with x, y ∈ IRn. The conjugate is z̄ := x − jy. Then we can write the function f(z) in terms of its real variables x and y. With a slight abuse of notation, we still denote it as f(x, y). In the case where the optimization of f(z) can be equivalently represented as optimization of f in terms of its real variables, the partial derivatives ∂f(x, y)/∂x and ∂f(x, y)/∂y would be sufficient. In other cases, where algorithms are preferred to be executed in the complex domain, the Wirtinger calculus [30] is more convenient to use; it is well explained (and derived) in [19]. The R-derivative and the conjugate R-derivative of f in the complex domain are defined respectively by

  ∂f/∂z = (1/2)(∂f/∂x − j ∂f/∂y),   ∂f/∂z̄ = (1/2)(∂f/∂x + j ∂f/∂y).

The R-derivatives in the complex domain play the same role as the derivatives in the real domain because the following two first-order expansions are equivalent:

  f(x + Δx, y + Δy) = f(x, y) + ⟨∂f/∂x, Δx⟩ + ⟨∂f/∂y, Δy⟩ + o(‖Δx‖ + ‖Δy‖),
  f(z + Δz) = f(z) + 2 Re(⟨∂f/∂z̄, Δz⟩) + o(‖Δz‖).   (33)

Here, we treat the partial derivatives as column vectors and Re(x) is the real part of x. Note that the first-order expansion of f(z + Δz) uses the conjugate R-derivative. Hence, we define the complex gradient to be ∇f(z) := 2∂f/∂z̄, when it exists. When f is not differentiable, we can extend the subdifferential of f from the real case to the complex case by generalizing (33).
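As a quick numerical illustration (our own sketch, not from the paper), take f(z) = (1/2)‖z − a‖² for a fixed complex vector a; its conjugate R-derivative is ∂f/∂z̄ = (1/2)(z − a), and the first-order expansion in (33) can be checked against a small perturbation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
dz = 1e-6 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

f = lambda v: 0.5 * np.linalg.norm(v - a) ** 2
df_dzbar = 0.5 * (z - a)                       # conjugate R-derivative of f at z

lhs = f(z + dz) - f(z)
rhs = 2 * np.real(np.vdot(df_dzbar, dz))       # 2 Re <df/dzbar, dz>, the first-order term in (33)
print(abs(lhs - rhs))                           # O(||dz||^2), i.e. negligibly small
```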

In order to extend Theorem 1, we need to characterize ∂dMr(X) in the complex domain. We may follow the route of [26] to conduct the extension. For example, we may extend the regular subgradient of dMr(X) [26, Def. 8.3] to its complex counterpart by replacing the conjugate gradient in the first-order expansion in (33) by a regular subgradient. We then define the subdifferential through regular subgradients. With this definition in the complex domain, we may extend [26, (8.53)] to derive formulae for ∂dMr(X). What we need in the proof of Theorem 1 is that (X − ΠMr(X))/dMr(X) ∈ ∂dMr(X) when X ∉ Mr. The proof of this result follows a straightforward extension of the corresponding part in [26, (8.53)] and, if reproduced here, would take up much space. Hence we omit it.

In order to extend the results in Sect. 3, we need the subdifferential of hr(X) in order to majorize gr(X). Since hr(X) is convex, its subdifferential is easy to define. We
