https://doi.org/10.1007/s12532-022-00217-1

FULL LENGTH PAPER

A penalized method of alternating projections for weighted low-rank Hankel matrix optimization

Jian Shen¹ · Jein-Shan Chen² · Hou-Duo Qi¹ · Naihua Xiu³

Received: 1 August 2020 / Accepted: 18 December 2021 / Published online: 3 February 2022

© The Author(s) 2022

Abstract

Weighted low-rank Hankel matrix optimization has long been used to reconstruct contaminated signals or forecast missing values for time series of a wide class. The Method of Alternating Projections (MAP) (i.e., alternately projecting onto a low-rank matrix manifold and the Hankel matrix subspace) is a leading method. Despite its wide use, MAP has long been criticized for lacking convergence guarantees and for ignoring the weights used to reflect the importance of the observed data. Most known results hold only in a local sense. In particular, the latest research shows that MAP may converge at a linear rate provided that the initial point is close enough to a true solution and a transversality condition is satisfied. In this paper, we propose a globalized variant of MAP through a penalty approach. The proposed method inherits the favourable local properties of MAP and has the same computational complexity. Moreover, it is capable of handling a general weight matrix, is globally convergent, and enjoys a local linear convergence rate provided that the cut-off singular values are significantly smaller than the kept ones. Furthermore, the new method also applies to complex data.

Extensive numerical experiments demonstrate the efficiency of the proposed method against several popular variants of MAP.

Jian Shen
j.shen@soton.ac.uk

Jein-Shan Chen
jschen@math.ntnu.edu.tw

Hou-Duo Qi
hdqi@soton.ac.uk

Naihua Xiu
nhxiu@bjtu.edu.cn

1 School of Mathematical Sciences, University of Southampton, Highfield, Southampton SO17 1BJ, UK

2 Department of Mathematics, National Taiwan Normal University, Taipei 11677, Taiwan

3 Department of Applied Mathematics, Beijing Jiaotong University, Beijing, China


Keywords Hankel matrix · Alternating projections · Global convergence · Linear convergence · Time series

Mathematics Subject Classification 47B35 · 62M10 · 65F55 · 90C26 · 90C30

1 Introduction

In this paper, we are mainly interested in numerical methods for the weighted low-rank Hankel matrix optimization problem

  min f(X) := (1/2)‖W ◦ (X − A)‖²,   s.t.   X ∈ Mr ∩ Hk×ℓ,   (1)

where Hk×ℓ is the space of all k×ℓ Hankel matrices in the real/complex field with the standard trace inner product, ‖·‖ is the Frobenius norm, Mr is the set of matrices whose rank is not greater than a given rank r, A is given, and W is a given weight matrix with Wij ≥ 0. Here A ◦ B denotes the elementwise (Hadamard) product of A and B. We note that all matrices involved are of size k×ℓ.

The difficulties in solving (1) lie in the low-rank constraint and in how to effectively handle a general weight matrix W, the latter of which is often overlooked in the existing literature. Our main purpose is to develop a novel, globally convergent algorithm for (1), and its efficiency will be benchmarked against several state-of-the-art algorithms.

In what follows, we first explain an important application of (1) to time series data, which will be tested in our numerical experiments. We then review the latest algorithmic advances relating to the method of alternating projections. We finish this section by explaining our approach and main contributions.

1.1 Applications in time series

Problem (1) arises from a large number of applications, including signal processing, system identification and finding the greatest common divisor between polynomials [23]. To motivate our investigation of (1), let us consider a complex-valued time series a = (a1, a2, . . . , an) of finite rank [17, Chp. 5]:

  at = Σ_{s=1}^{m} Ps(t) λs^t,   t = 1, 2, . . . , n,   (2)

where Ps(t) is a complex polynomial of degree (νs − 1) (νs are positive integers) and the λs ∈ C \ {0} are distinct. Define r := ν1 + · · · + νm (":=" means "define"). Then it is known [29, Prop. 2.1] that the rank of the Hankel matrix A generated by a:


  A = H(a) :=  [ a1    a2      · · ·   aℓ
                 a2    a3      · · ·   aℓ+1
                 ⋮     ⋮               ⋮
                 ak    ak+1    · · ·   an ]

must be r, where the choice of (k, ℓ) satisfies n = k + ℓ − 1 and r ≤ k ≤ n − r + 1.
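As a quick illustration of this rank property, the following minimal numpy sketch (our own example, with parameters chosen arbitrarily) samples a signal of the form (2) built from three simple exponentials (all νs = 1, so r = 3), forms H(a), and checks its numerical rank. The helper name hankel_from_series is ours, not the paper's.

```python
import numpy as np

def hankel_from_series(a, k):
    """Form the k x l Hankel matrix H(a) with l = n - k + 1 (constant along anti-diagonals)."""
    n = len(a)
    l = n - k + 1
    return np.array([[a[i + j] for j in range(l)] for i in range(k)])

# Signal of type (2): m = 3 simple (possibly damped) exponentials, hence rank r = 3.
n, k = 60, 25
lam = np.array([0.90 * np.exp(2j * np.pi * 0.11),
                1.00 * np.exp(2j * np.pi * 0.27),
                0.95 * np.exp(2j * np.pi * 0.40)])
coeffs = [1.0, 0.5, 2.0]
t = np.arange(1, n + 1)
a = sum(d * l_s ** t for d, l_s in zip(coeffs, lam))

A = hankel_from_series(a, k)
print(np.linalg.matrix_rank(A))   # expected: 3 (= r)
```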

Suppose now that the time series a is contaminated and/or has missing values. To reconstruct a, a natural approach is to compute its nearest time series x by least squares:

  min Σ_{i=1}^{n} wi |ai − xi|²,   s.t.   rank(X) ≤ r,  X = H(x),   (3)

where w = (w1, . . . , wn) ≥ 0 is the corresponding weight vector emphasizing the importance of each element of a. The equivalent reformulation of (3) as (1) is obtained by setting

  W := H(√v ◦ √w),   where

    vi = 1/i            for i = 1, . . . , k − 1,
    vi = 1/k            for i = k, . . . , n − k + 1,
    vi = 1/(n − i + 1)  for i = n − k + 2, . . . , n,

where v is known as the averaging vector of a Hankel matrix of size k×ℓ (k ≤ ℓ) and √v, √w are the elementwise square roots of v and w. We note that the widely studied case of (1) with Wij ≡ 1 corresponds to wi = 1/vi, which is known as the trapezoid weighting. Another popular choice for financial time series is the exponential weights wi = exp(αi) for some α > 0. We refer to [15, Sect. 2.4] for more comments on the choice of weights.
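For concreteness, here is a small numpy sketch of our own (function names and sizes are illustrative, not from the paper) that builds the averaging vector v and the weight matrix W for the trapezoid case wi = 1/vi, and checks the statement above that this choice reproduces Wij ≡ 1.

```python
import numpy as np

def averaging_vector(n, k):
    """v_i = 1 / (number of entries on the i-th anti-diagonal of a k x l Hankel matrix), l = n - k + 1."""
    l = n - k + 1
    counts = np.array([min(i + 1, k, n - i) for i in range(n)])  # i is 0-based here
    return 1.0 / counts

def hankel_from_series(u, k):
    l = len(u) - k + 1
    return np.array([[u[i + j] for j in range(l)] for i in range(k)])

n, k = 12, 5
v = averaging_vector(n, k)
w = 1.0 / v                                      # trapezoid weights w_i = 1/v_i
W = hankel_from_series(np.sqrt(v) * np.sqrt(w), k)
print(np.allclose(W, 1.0))                       # True: trapezoid weights give W_ij = 1
```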

A special type of the time series (2) arises from spectral compressed sensing, which has attracted considerable attention lately [4]. In its one-dimensional case, at is often a superposition of a few complex sinusoids:

  at = Σ_{s=1}^{r} ds exp{(2π j ωs − τs) t},   (4)

where j = √−1, r is the model order, ωs is the frequency of each sinusoid, ds ≠ 0 is the weight of each sinusoid, and τs ≥ 0 is a damping factor. We note that (4) is a special case of (2) with Ps(t) = ds (hence νs = 1) and λs = exp(2π j ωs − τs).

If at is sampled at all integer values from 1 to n, we get a sample vector a ∈ Cn. Consequently, the rank of H(a) must be r. However, in practice, only a subset Ω of the sampling points {1, . . . , n} may be observed (possibly contaminated), leading to the question of how to best reconstruct a(t) based on its partial observations ai on Ω. This has led to the Hankel matrix completion/approximation problem (1), see [4, Sect. II.A] and [2, Sect. 2.1]. A popular choice of W in spectral compressed sensing is Wij = 1 for all (i, j), resulting in the distance between X and A in (1) being measured by the standard Frobenius norm. In this paper, we assume

Assumption 1 W is Hankel and non-negative (i.e., Wij ≥ 0 for all (i, j)).


1.2 On alternating projection methods

Low-rank matrix optimization is an active research area. Our short review is only able to focus on a small group of those papers that motivated our research. We note that there are four basic features about the problem (1): (i) X has to be low rank; (ii) X has Hankel structure; (iii) the objective is weighted; and (iv) X may be complex valued.

The first feature is the most difficult one to handle because it causes the nonconvexity of the problem. Many algorithms have been developed proposing different ways to handle this low-rank constraint. One of the most popular choices is to use the truncated singular value decomposition to project X to its nearest rank-r matrix, and we denote this projection by ΠMr(X). This has given rise to the basic Method of Alternating Projections (MAP) (also known as the Cadzow method [1]): Given X0 ∈ H, update

  Xν+1 = ΠH(ΠMr(Xν)),   ν = 0, 1, 2, . . . ,   (5)

where ΠH(·) is the orthogonal projection operator onto the Hankel subspace Hk×ℓ. Despite its popularity in the engineering sciences, Cadzow's method cannot guarantee convergence to an optimal solution. Even when convergence occurs, not much is known about where it converges to. It has also been criticized for completely ignoring the objective function, see [5,6,8,14]. In particular, the weight matrix W does not enter Cadzow's method at all because the truncated SVD does not admit a closed-form solution under a weighted norm. Gillard and Zhigljavsky [16] proposed to replace ΠMr

by its diagonally weighted variants and studied how to best approximate the weight matrix W by a diagonal weight matrix. Qi et al. [25] proposed to use a sequence of diagonal weight matrices aiming to obtain a better approximation to the original weight matrix. Despite the improved numerical convergence, the methods in [16,25] still inherit the essential convergence problem of Cadzow's method. Recently, Lai and Varghese [20] considered a similar method for a matrix completion problem and established the linear convergence of their method under a kind of "transversality"

condition provided that the initial point is close enough to a true rank-r completion.

We refer to [7] for a more general transversality condition that ensures a local linear convergence rate of MAP onto nonconvex sets.

Alternating projections of ΠMr(·) and ΠH(·) also play an important role in the class of iterative hard thresholding (IHT) algorithms for spectral compressed sensing. For example, Cai et al. [3] established the convergence of IHT in the statistical sense (i.e., with high probability) under a coherence assumption on the initial observation matrix A. Although local convergence results (be it in the sense of monotone decrease [20] or in the statistical sense [3]) may be established for MAP under some conditions, we are not aware of any existing global convergence results, mainly due to the nonconvexity of the rank constraint. For the general weighted problem (1), it appears to be a difficult task to develop a variant of MAP that enjoys both global and local linear convergence properties. We will achieve this through a penalty approach.

Penalty approaches have long been used to develop globally convergent algorithms for problems with rank constraints, see [10,12,13,21,22,27,32,33]. For example, Gao [12] proposed the penalty function p(X) based on the following observation:


  rank(X) ≤ r   ⟺   p(X) := ‖X‖∗ − Σ_{i=1}^{r} σi(X) = 0,

where ‖X‖∗ is the nuclear norm of X and σ1(X) ≥ · · · ≥ σn(X) are the singular values of X in nonincreasing order. However, the resulting method, as well as those in [21,22,27], no longer has anything to do with MAP, and its implementation is not trivial.
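As a quick sanity check of this equivalence, the following small numpy sketch (our own illustration, not code from [12]) evaluates p(X): it is essentially zero for a matrix of rank at most r and strictly positive otherwise.

```python
import numpy as np

def gao_penalty(X, r):
    """p(X) = ||X||_* - (sum of the r largest singular values) = sum of the tail singular values."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values in nonincreasing order
    return s.sum() - s[:r].sum()

rng = np.random.default_rng(0)
r = 3
low_rank = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 10))   # rank 3
full_rank = rng.standard_normal((8, 10))

print(gao_penalty(low_rank, r))    # ~0 (up to rounding)
print(gao_penalty(full_rank, r))   # > 0
```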

1.3 Our approach and main contributions

In this paper, we propose a new penalty function and develop a penalized method whose main step is the alternating projections. We call it the penalized MAP (pMAP).

The new penalty function is the Euclidean distance function dMr(X) from X to Mr:

  dMr(X) := min {‖X − Z‖ | Z ∈ Mr}   and define   gr(X) := (1/2) d²Mr(X).   (6)

Obviously, the original problem (1) is equivalent to

  min f(X),   s.t.   dMr(X) = 0,  X ∈ H.

We propose to solve the quadratic penalty problem with ρ > 0 being a penalty parameter:

  min Fρ(X) := f(X) + ρ gr(X),   s.t.   X ∈ H.   (7)

By following the standard argument [24, Thm. 17.1] for the classical quadratic penalty method, one can establish that the global solution of (7) converges to that of (1) as ρ approaches infinity, provided convergence occurs. However, in practice, it is probably as difficult to find a global solution of (7) as of the original problem. It is hence important to establish the correspondence between the first-order stationary points of (7) and those of (1). This is done in Theorem 1 under a generalized linear independence condition.

The remaining task is to efficiently compute a stationary point of (7) for a given ρ > 0. The key observation is that gr(X) can be represented as the difference of two convex functions, which can be easily majorized (the precise meaning is given later) to obtain a majorization function gr(m)(X, Xν) of gr(X) at the current iterate Xν. We then solve the majorized subproblem:

  Xν+1 = arg min f(X) + ρ gr(m)(X, Xν),   s.t.   X ∈ H.   (8)

We will show that the update takes the following form:

  Xν+1 = W(2)/(ρ + W(2)) ◦ A + ρ/(ρ + W(2)) ◦ ΠH(ΠMr(Xν)),   (9)


where W(2) := W ◦ W and the division W(2)/(ρ + W(2)) is taken componentwise.

Compared with (5), this update is just a convex combination of the observation matrix A and the MAP iterate in (5). In the special case that W ≡ 0 (which completely ignores the objective in (1)) or ρ = ∞, (9) reduces to MAP. We will analyze the convergence behaviour of pMAP (9). In particular, we will establish the following among others.

(i) The objective function sequence {Fρ(Xν)} converges and ‖Xν+1 − Xν‖ converges to 0. Moreover, any limiting point of {Xν} is an approximate KKT point of the original problem (1) provided that the penalty parameter is above a certain threshold (see Theorem 2).

(ii) If X̄ is an accumulation point of the iterate sequence {Xν}, then the whole sequence converges to X̄ at a linear rate provided that σr(X̄) ≫ σr+1(X̄) (see Theorem 3).

Our results in (i) and (ii) provide satisfactory justification of pMAP. It is not only globally convergent, but also enjoys a linear convergence rate under reasonable conditions. Furthermore, we can assess the quality of the solution as an approximate KKT point of (1) if we are willing to increase the penalty parameter. Of course, balancing the fidelity term f(X) and the penalty term is an important issue that is beyond the current paper. The result in (ii) is practically important too. Existing empirical results show that MAP often terminates at a point whose cut-off singular values (σi, i ≥ r + 1) are significantly smaller than the kept singular values (σi, i ≤ r). Such points are often said to have a numerical rank r, although the theoretical rank is higher than r. This is exactly the situation addressed in (ii). Those results are stated and proved for real-valued matrices. We will extend them to the complex case, thanks to a technical result (Proposition 2) that the subdifferential of gr(X) in the complex domain can be computed in a similar fashion as in the real domain. To the best of our knowledge, this is the first variant of MAP that can handle general weights and enjoys both global convergence and a locally linear convergence rate under a reasonable condition (i.e., σr ≫ σr+1).

The paper is organized as follows. In the next section, we will first set up our standard notation and establish the convergence result for the quadratic penalty approach (7) when ρ → ∞. Sect. 3 includes our method of pMAP and its convergence results when ρ is fixed. In Sect. 4, we will address the issue of extension to complex-valued matrices, which arise from (2) and (4). The key concept used in this section is the Wirtinger calculus, which allows us to extend our analysis from the real case to the complex case. We report extensive numerical experiments in Sect. 5 and conclude the paper in Sect. 6.

2 Quadratic penalty approach

The main purpose of this section is to establish the convergence of the stationary points of the penalized problems (7) to those of the original problem (1) as the penalty parameter ρ goes to ∞. For simplicity of our analysis, we focus on the real case. We will extend our results to the complex case in Sect. 4. We first introduce the notation used in this paper.


2.1 Notation

For a nonnegative matrix such as the weight matrix W, √W is its componentwise square root matrix (√Wij). For a given matrix X ∈ Ck×ℓ, we often use its singular value decomposition (assume k ≤ ℓ)

  X = U diag(σ1(X), . . . , σk(X)) V^T,   (10)

where σ1(X) ≥ · · · ≥ σk(X) are the singular values of X and U ∈ Ck×k, V ∈ Cℓ×ℓ are the left and right singular vectors of X. For a given closed subset C ⊂ Ck×ℓ, we define the set of all projections from X to C by

  P_C(X) := arg min {‖X − Z‖ : Z ∈ C}.

If C is also convex, then P_C(X) is unique. When C = Mr, P_Mr(X) may have multiple elements. We define a particular element in P_Mr(X) that is based on the SVD (10):

  ΠMr(X) = Ur diag(σ1(X), . . . , σr(X)) Vr^T,

where Ur and Vr consist of the first r columns of U and V respectively.
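For readers who wish to experiment, ΠMr is simply the truncated SVD. The following numpy sketch (our own illustration) computes it and checks that the distance from X to Mr equals the norm of the discarded singular values, a fact recalled below.

```python
import numpy as np

def proj_rank(X, r):
    """Pi_{Mr}(X): keep the r largest singular values (a particular nearest rank-r matrix)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def dist_to_Mr(X, r):
    """d_{Mr}(X): Frobenius norm of the discarded tail singular values."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt(np.sum(s[r:] ** 2))

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 9))
r = 2
assert np.isclose(np.linalg.norm(X - proj_rank(X, r)), dist_to_Mr(X, r))
```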

Related to the function gr(X) defined in (6), the function

  hr(X) := (1/2)‖X‖²F − gr(X),   (11)

has the following properties by the classical result of Eckart and Young [9]:

  dist²(X, Mr) = ‖X − ΠMr(X)‖² = σ²r+1(X) + · · · + σ²k(X),
  hr(X) = (1/2)(σ²1(X) + · · · + σ²r(X)) = (1/2)‖ΠMr(X)‖².

It follows from [12, Prop. 2.16] that hr(X) is convex and the subdifferentials of hr(X) and gr(X) in the sense of [26, Def. 8.3] are respectively given by

  ∂hr(X) = conv(P_Mr(X))   and   ∂gr(X) = X − ∂hr(X),   (12)

where conv(Ω) denotes the convex hull of the set Ω. Finally, we let Bε(X) denote the ε-neighbourhood centred at X.

2.2 Convergence of quadratic penalty approach

The classical quadratic penalty methods try to solve a sequence of penalty problems:

Xν = arg min Fρν(X), s.t. X ∈ H, (13)


where the sequence ρν > 0 is increasing and goes to ∞. By following the standard argument (e.g., [24, Thm. 17.1]), one can establish that every limit point of {Xν} is also a global solution of (1). However, in practice, it is probably as difficult to find a global solution of (13) as of the original problem (1). Therefore, only an approximate solution of (13) is possible. To quantify the approximation, we recall the optimality conditions relating to both the original and penalized problems.

Following the optimality theorem [26, Thm. 8.15], we define the first-order optimality conditions of problems (1) and (7):

Definition 1 (First-order optimality condition) X ∈ H satisfies the first-order optimality condition of (1) if

  0 ∈ ∇f(X) + λ ∂dMr(X) + H⊥,   (14)

where λ is the Lagrangian multiplier. Similarly, we say Xν ∈ H satisfies the first-order optimality condition of the penalty problem (7) if

  0 ∈ ∇f(Xν) + ρν ∂gr(Xν) + H⊥.   (15)

We generate Xν ∈ H such that condition (15) is approximately satisfied:

  ‖PH(∇f(Xν) + ρν(Xν − ΠMr(Xν)))‖ ≤ εν,   (16)

where εν ↓ 0. We can establish the following convergence result.

Theorem 1 We assume the sequence {ρν} goes to ∞ and {εν} decreases to 0. Suppose each approximate solution Xν is generated to satisfy (16). Let X̄ be an accumulation point of {Xν} and assume

  ∂dMr(X̄) ∩ H⊥ = {0}.   (17)

Then X̄ satisfies the first-order optimality condition (14).

Proof Suppose X̄ is the limit of the subsequence {Xν}K. We consider the following two cases.

Case 1 There exists an infinite subsequence K1 of K such that rank(Xν) ≤ r for ν ∈ K1. This would imply ∂gr(Xν) = {0}, which with (16) implies ‖PH(∇f(Xν))‖ → 0. Hence (14) holds at X̄ with the choice λ = 0.

Case 2 There exists an index ν0 such that Xν ∉ Mr for all ν0 ≤ ν ∈ K. In this case, we assume that there exists an infinite subsequence K2 of K such that

  (Xν − ΠMr(Xν))/dMr(Xν)

has the limit v. We note that (Xν − ΠMr(Xν))/dMr(Xν) ∈ ∂dMr(Xν) for ν ≥ ν0 by [26, (8.53)]. Therefore, its limit v ∈ ∂dMr(X̄) by the upper semicontinuity. By the assumption (17), we have v ∉ H⊥ because v has unit length. Since H is a subspace, PH(·) is a linear operator. It follows from (16) that

  ρν ‖PH(Xν − ΠMr(Xν))‖ − ‖PH(∇f(Xν))‖ ≤ ‖PH(∇f(Xν) + ρν(Xν − ΠMr(Xν)))‖ ≤ εν.

Hence

  ‖PH(Xν − ΠMr(Xν))‖ ≤ (1/ρν)(εν + ‖PH(∇f(Xν))‖),

which, for ν ≥ ν0, is equivalent to

  dMr(Xν) ‖PH((Xν − ΠMr(Xν))/dMr(Xν))‖ ≤ (1/ρν)(εν + ‖PH(∇f(Xν))‖).

Taking limits on {Xν}ν∈K2 and using the fact ρν → ∞ leads to dMr(X̄)‖PH(v)‖ = 0. Since v ∉ H⊥, we have ‖PH(v)‖ > 0, which implies dMr(X̄) = 0. That is, X̄ is a feasible point of (1). Now let λν := ρν dMr(Xν); we then have

  λν (Xν − ΠMr(Xν))/dMr(Xν) = −∇f(Xν) + ξν,   ξν := ∇f(Xν) + ρν(Xν − ΠMr(Xν)).

Projecting both sides onto H yields

  λν PH((Xν − ΠMr(Xν))/dMr(Xν)) = PH(−∇f(Xν)) + PH(ξν).   (18)

Computing the inner product on both sides with PH((Xν − ΠMr(Xν))/dMr(Xν)), taking limits on the sequence indexed by K2, and using the fact PH(ξν) → 0 due to (16), we obtain

  lim_{ν∈K2} λν ‖PH(v)‖² = −⟨v, PH(∇f(X̄))⟩.

We then have

  λ̄ := lim_{ν∈K2} λν = −⟨v, PH(∇f(X̄))⟩ / ‖PH(v)‖².

Taking limits on both sides of (18) yields

  PH(∇f(X̄) + λ̄ v) = 0,

which is sufficient for

  0 ∈ ∇f(X̄) + λ̄ ∂dMr(X̄) + H⊥.

This completes our result. □


Remark 1 Condition (17) can be interpreted as saying that any 0 ≠ v ∈ ∂dMr(X̄) is linearly independent of any basis of H⊥. Therefore, (17) can be seen as a generalization of the linear independence assumption required in the classical quadratic penalty method for a similar convergence result, where all the functions involved are assumed to be continuously differentiable, see [24, Thm. 17.2]. In fact, what we really need in our proof is that there exists a subsequence {(Xν − ΠMr(Xν))/dMr(Xν)} in Case 2 whose limit v does not belong to H⊥. That could be much weaker than the sufficient condition (17).

Theorem 1 establishes the global convergence of the quadratic penalty method when the penalty parameter approaches infinity, which drives gr(Xν) smaller and smaller. In practice, however, we often fix ρ and solve for Xν. We are interested in how far Xν is from being a first-order optimal point of the original problem. For this purpose, we introduce the approximate KKT point, which keeps the first-order optimality condition (15) with the additional requirement that gr(X) is small enough.

Definition 2 (ε-approximate KKT point) Consider the penalty problem (7) and let ε > 0 be given. We say a point X ∈ H is an ε-approximate KKT point of (1) if

  0 ∈ ∇f(X) + ρ ∂gr(X) + H⊥   and   gr(X) ≤ ε.
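Both quantities can be monitored numerically. Below is a small sketch of our own (not the authors' code): it evaluates the residual ‖PH(∇f(X) + ρ(X − ΠMr(X)))‖ appearing in (16) and in Theorem 2(ii), together with gr(X), using ∇f(X) = W(2) ◦ (X − A); ΠH is implemented by anti-diagonal averaging and ΠMr by truncated SVD.

```python
import numpy as np

def proj_rank(X, r):
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def proj_hankel(X):
    """Pi_H(X): average X along its anti-diagonals (orthogonal projection onto the Hankel subspace)."""
    k, l = X.shape
    out = np.empty_like(X)
    for d in range(k + l - 1):
        i = np.arange(max(0, d - l + 1), min(k, d + 1))
        out[i, d - i] = X[i, d - i].mean()
    return out

def kkt_quantities(X, A, W, rho, r):
    """Return ||P_H(grad f(X) + rho*(X - Pi_Mr(X)))|| and g_r(X) = 0.5*d_Mr(X)^2."""
    grad_f = (W * W) * (X - A)
    residual = np.linalg.norm(proj_hankel(grad_f + rho * (X - proj_rank(X, r))))
    g_r = 0.5 * np.linalg.norm(X - proj_rank(X, r)) ** 2
    return residual, g_r
```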

3 The method of pMAP

This section develops a new algorithm that solves the penalty problem (7), in particular for finding an approximate KKT point of (1). The first part is devoted to the construction of a majorization function for the distance function dist(X, Mr). We then describe pMAP based on the majorization introduced and establish its global and local convergence.

3.1 Majorization and DC interpretation

We first recall the essential properties that a majorization function should have. Let θ(·) be a real-valued function defined on a finite-dimensional space X. For a given y ∈ X, we say a function θ(m)(·, y) : X → IR is a majorization of θ(·) at y if

  θ(m)(x, y) ≥ θ(x), ∀ x ∈ X   and   θ(m)(y, y) = θ(y).   (19)

The motivation for employing the majorization is that the squared distance function gr(X) is hard to minimize when coupled with f(X) under the Hankel matrix structure. It is noted that

  gr(X) = (1/2)‖X − ΠMr(X)‖² ≤ (1/2)‖X − ΠMr(Z)‖² =: gr(m)(X, Z),   ∀ X, Z ∈ Ck×ℓ,

where the inequality used the fact that ΠMr(X) is a nearest point in Mr to X. It is easy to verify that gr(m)(X, Z) is a majorization function of gr(X) at Z.


The following way of deriving the majorization is crucial to our convergence analysis. We recall

  hr(X) = (1/2)‖X‖² − gr(X) = (1/2)‖X‖² − (1/2) min {‖X − Z‖² : Z ∈ Mr}
        = max {⟨X, Z⟩ − (1/2)‖Z‖² : Z ∈ Mr}.

Being the pointwise maximum of linear functions (one for each given Z ∈ Mr), hr(X) is convex. The convexity of hr(X) and (12) yield

  hr(X) ≥ hr(Z) + ⟨M, X − Z⟩,   ∀ X, Z ∈ Rk×ℓ, M ∈ P_Mr(Z),   (20)

which further implies

  gr(X) = (1/2)‖X‖² − hr(X)
        ≤ (1/2)‖X‖² − hr(Z) − ⟨ΠMr(Z), X − Z⟩
        = (1/2)‖X − ΠMr(Z)‖² − (1/2)[ ‖ΠMr(Z)‖² + ‖Z‖² − ‖Z − ΠMr(Z)‖² − 2⟨ΠMr(Z), Z⟩ ]
        = (1/2)‖X − ΠMr(Z)‖² = gr(m)(X, Z),

where the term in square brackets equals 0. In other words, gr(X) can be seen as a difference of convex functions, a so-called DC function. Using a subgradient is a common way to majorize DC functions, see [12].
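The majorization property is also easy to verify numerically. The sketch below (our own, on randomly chosen matrices) checks both conditions in (19) for gr(m)(·, Z).

```python
import numpy as np

def proj_rank(X, r):
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def g_r(X, r):
    return 0.5 * np.linalg.norm(X - proj_rank(X, r)) ** 2

def g_r_major(X, Z, r):
    """g_r^(m)(X, Z) = 0.5 * ||X - Pi_Mr(Z)||^2, a majorization of g_r at Z."""
    return 0.5 * np.linalg.norm(X - proj_rank(Z, r)) ** 2

rng = np.random.default_rng(2)
r = 3
for _ in range(100):
    X, Z = rng.standard_normal((7, 11)), rng.standard_normal((7, 11))
    assert g_r(X, r) <= g_r_major(X, Z, r) + 1e-12        # theta^(m)(x, y) >= theta(x)
    assert abs(g_r(Z, r) - g_r_major(Z, Z, r)) < 1e-12    # theta^(m)(y, y) = theta(y)
```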

3.2 The pMAP algorithm

We recall that our main problem is (7). Our first step is to construct a majorized function of Fρ(X) at the current iterate Xν:

  Fρ(m)(X, Xν) = (1/2)‖W ◦ (X − A)‖² + ρ gr(m)(X, Xν)
              = (1/2)‖W ◦ (X − A)‖² + (ρ/2)‖X − ΠMr(Xν)‖²
              = (1/2)‖W ◦ X‖² + (ρ/2)‖X‖² − ⟨W(2) ◦ A + ρ ΠMr(Xν), X⟩ + (1/2)‖W ◦ A‖² + (ρ/2)‖ΠMr(Xν)‖²
              = (1/2)‖√(ρ + W(2)) ◦ (X − Xνρ)‖² + (1/2)‖W ◦ A‖² + (ρ/2)‖ΠMr(Xν)‖² − (1/2)‖√(ρ + W(2)) ◦ Xνρ‖²,

where

  Xνρ := (ρ ΠMr(Xν) + W(2) ◦ A) / (ρ + W(2)).   (21)

Note that the division is componentwise. The subproblem to be solved at iteration ν is

  Xν+1 = arg min Fρ(m)(X, Xν)   s.t.   X ∈ H
       = arg min (1/2)‖√(ρ + W(2)) ◦ (X − Xνρ)‖²   s.t.   X ∈ H
       = ΠH(Xνρ),   (22)

where Xνρ is defined in (21). The last equation in (22) holds because Wρ := √(ρ + W(2)) is Hankel, so computing Xν+1 in (22) is equivalent to averaging Xνρ along all of its anti-diagonals. Since A, ρ/(ρ + W(2)) and W(2)/(ρ + W(2)) are all Hankel matrices (due to Assumption 1), Xν+1 can be calculated through (9).

Algorithm 1 (pMAP)

1: Input data: Matrix A, weight matrix W, penalty parameter ρ, rank r, and the initial X0. Set ν := 0.

2: Update X: Compute Xν+1 by (9).

3: Convergence check: Terminate if some stopping criterion is satisfied.
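To make the update concrete, here is a minimal numpy sketch of Algorithm 1 (our own illustration under the stated setting, not the authors' reference code); the names pmap, proj_rank, proj_hankel and the tolerance/iteration parameters are ours. It implements ΠMr by truncated SVD, ΠH by anti-diagonal averaging, and the update (9), stopping when ‖Xν+1 − Xν‖ falls below a tolerance.

```python
import numpy as np

def proj_rank(X, r):
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def proj_hankel(X):
    """Average X along anti-diagonals (orthogonal projection onto the Hankel subspace)."""
    k, l = X.shape
    out = np.empty_like(X)
    for d in range(k + l - 1):
        i = np.arange(max(0, d - l + 1), min(k, d + 1))
        out[i, d - i] = X[i, d - i].mean()
    return out

def pmap(A, W, rho, r, X0, tol=1e-8, max_iter=5000):
    """pMAP update (9): X <- W^2/(rho+W^2) o A + rho/(rho+W^2) o Pi_H(Pi_Mr(X))."""
    W2 = W * W
    c_data, c_map = W2 / (rho + W2), rho / (rho + W2)   # componentwise convex weights
    X = X0
    for _ in range(max_iter):
        X_new = c_data * A + c_map * proj_hankel(proj_rank(X, r))
        if np.linalg.norm(X_new - X) <= tol:
            return X_new
        X = X_new
    return X
```

Consistent with the discussion after (9), setting W ≡ 0 (or letting ρ → ∞) makes the data term vanish and the loop reduces to plain MAP (5).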

Remark 2 Being a direct consequence of employing the majorization technique, the following decreasing property holds:

  Fρ(Xν+1) ≤ Fρ(m)(Xν+1, Xν)   (property of majorization (19))
           ≤ Fρ(m)(Xν, Xν)     (because of (22))
           = Fρ(Xν)            (property of majorization (19)).

If Fρ is coercive (i.e., Fρ(X) → ∞ as ‖X‖ → ∞, which would be the case if we require W > 0), the sequence {Xν} is bounded.

A widely used stopping criterion is ‖Xν+1 − Xν‖ ≤ ε for some small tolerance ε > 0. We will see below that ‖Xν+1 − Xν‖ approaches zero and hence such a convergence check will eventually be satisfied. For our theoretical analysis, we assume that pMAP generates an infinite sequence (e.g., let ε = 0).

3.3 Convergence of pMAP

We now present several results on the convergence of pMAP.


Theorem 2 Let {Xν} be the sequence generated by pMAP. The following hold.

(i) We have

  Fρ(Xν+1) − Fρ(Xν) ≤ −(ρ/2)‖Xν+1 − Xν‖²,   ν = 0, 1, 2, . . . .

Furthermore, ‖Xν+1 − Xν‖ → 0.

(ii) Let X̄ be an accumulation point of {Xν}. We then have

  ∇f(X̄) + ρ(X̄ − ΠMr(X̄)) ∈ H⊥.

Moreover, for a given ε > 0, if X0 ∈ Mr ∩ H and

  ρ ≥ ρε := f(X0)/ε,

then X̄ is an ε-approximate KKT point of (1).

Proof We will use a number of facts to establish (i). The first fact is due to the convexity of f(X):

  f(Xν) − f(Xν+1) ≥ ⟨∇f(Xν+1), Xν − Xν+1⟩.   (23)

The second fact is the identity

  ‖Xν+1‖² − ‖Xν‖² = 2⟨Xν+1 − Xν, Xν+1⟩ − ‖Xν+1 − Xν‖².   (24)

The third fact is due to the convexity of hr(X) defined in (11) and ΠMr(X) ∈ ∂hr(X):

  hr(Xν+1) − hr(Xν) ≥ ⟨ΠMr(Xν), Xν+1 − Xν⟩.   (25)

The last fact is the optimality condition of problem (22):

  ∇f(Xν+1) + ρ(Xν+1 − ΠMr(Xν)) ∈ H⊥.   (26)


Combining all the facts above leads to a sufficient decrease in Fρ(Xν):

  Fρ(Xν+1) − Fρ(Xν)
    = f(Xν+1) − f(Xν) + ρ gr(Xν+1) − ρ gr(Xν)
    ≤ ⟨∇f(Xν+1), Xν+1 − Xν⟩ + ρ gr(Xν+1) − ρ gr(Xν)                                 (by (23))
    = ⟨∇f(Xν+1), Xν+1 − Xν⟩ + (ρ/2)(‖Xν+1‖² − ‖Xν‖²) − ρ(hr(Xν+1) − hr(Xν))
    = ⟨∇f(Xν+1) + ρ Xν+1, Xν+1 − Xν⟩ − (ρ/2)‖Xν+1 − Xν‖² − ρ(hr(Xν+1) − hr(Xν))     (by (24))
    ≤ ⟨∇f(Xν+1) + ρ Xν+1 − ρ ΠMr(Xν), Xν+1 − Xν⟩ − (ρ/2)‖Xν+1 − Xν‖²                (by (25))
    ≤ −(ρ/2)‖Xν+1 − Xν‖².                                                           (by (26))   (27)

In the above we also used the fact that Xν+1 − Xν ∈ H. Since the sequence {Fρ(Xν)} is non-increasing and is bounded from below by 0, we have ‖Xν+1 − Xν‖² → 0.

(ii) Suppose X̄ is the limit of the subsequence {Xν}ν∈K. It follows from ‖Xν+1 − Xν‖ → 0 that X̄ is also the limit of {Xν+1}ν∈K. Taking limits on both sides of (26) and using the upper semi-continuity of the projections P_Mr(·) yields

  ∇f(X̄) + ρ(X̄ − ΠMr(X̄)) ∈ H⊥.

For the second claim, we use the fact that {Fρ(Xν)} is non-increasing to get

  f(X0) = f(X0) + ρ gr(X0) = Fρ(X0) ≥ lim Fρ(Xν) = Fρ(X̄) = f(X̄) + ρ gr(X̄) ≥ ρ gr(X̄).

The first equality holds because gr(X0) = 0 when X0 ∈ Mr. As a result,

  gr(X̄) ≤ f(X0)/ρ ≤ f(X0)/ρε = ε.   (28)

Therefore, X̄ is an ε-approximate KKT point of (1). □

We note that the first result (i) in Theorem 2 is standard in a majorization-minimization scheme and can be proved in different ways, see, e.g., [32, Thm. 3.7].

3.4 Final rank and linear convergence

This part reports two results. One is on the final rank of the output of pMAP: the rank is always bigger than the desired rank r unless A is already an optimal solution


of (1). The other is on the conditions that ensure a linear convergence rate of pMAP.

For this purpose, we need the following result.

Proposition 1 [11, Thm. 25] Given an integer r > 0, consider X ∈ IRk×ℓ of rank (r + p) with p ≥ 0. Suppose the SVD of X is represented as X = Σ_{i=1}^{r+p} σi ui viT, where σ1(X) ≥ σ2(X) ≥ · · · ≥ σr+p(X) are the singular values of X and ui, vi, i = 1, . . . , r + p, are the left and right (normalized) singular vectors. We assume σr(X) > σr+1(X) so that the projection operator ΠMr(X) is uniquely defined in a neighbourhood of X. Then ΠMr(X) is differentiable at X and the directional derivative along the direction Y is given by

  ∇ΠMr(X)(Y) = ΠTMr(X)(Y) + Σ_{1≤i≤r, 1≤j≤p} [ (σr+j/(σi − σr+j)) ⟨Y, Φ+i,r+j⟩ Φ+i,r+j − (σr+j/(σi + σr+j)) ⟨Y, Φ−i,r+j⟩ Φ−i,r+j ],

where TMr(X) is the tangent subspace of Mr at X and

  Φ±i,r+j = (1/√2)(ur+j viT ± ui vr+jT).

Theorem 3 Assume that W > 0 and let X̄ be an accumulation point of {Xν}. The following hold.

(i) rank(X̄) > r unless A is already the optimal solution of (1).

(ii) Suppose X̄ has rank (r + p) with p > 0. Let σ1 ≥ σ2 ≥ · · · ≥ σk be the singular values of X̄. Define

  w0 := min{Wij} > 0,   ε0 := w0²/ρ,   ε1 := ε0/(4 + 3ε0),   c := 1/(1 + ε1) < 1.

Under the condition

  σr/σr+1 ≥ 8pr/ε0 + 1,

it holds that

  ‖Xν+1 − X̄‖ ≤ c ‖Xν − X̄‖   for ν sufficiently large.

Consequently, the whole sequence {Xν} converges linearly to X̄.

Proof (i) Suppose X̄ is the limit of the subsequence {Xν}ν∈K. We assume rank(X̄) ≤ r. It follows from Theorem 2 that

  {Xν+1}ν∈K → X̄   and   lim_{ν∈K} ΠMr(Xν) = ΠMr(X̄) = X̄.

Taking limits on both sides of (9) and using the fact that X̄ is Hankel, we get

  X̄ = W(2)/(ρ + W(2)) ◦ A + ρ/(ρ + W(2)) ◦ ΠH(ΠMr(X̄)) = W(2)/(ρ + W(2)) ◦ A + ρ/(ρ + W(2)) ◦ X̄.

Under the assumption W > 0, we have X̄ = A. Consequently, rank(A) ≤ r, implying that A is the optimal solution of (1). Therefore, we must have rank(X̄) > r if the given matrix A is not already optimal.

(ii) Let φ(X) := ΠH(ΠMr(X)). Since ΠMr(·) is differentiable at X̄, so is φ(·). Moreover, the directional derivative of φ(·) at X̄ along the direction Y is given by

  ∇φ(X̄)Y = ΠH(∇ΠMr(X̄)Y)   and   ‖∇φ(X̄)Y‖ ≤ ‖∇ΠMr(X̄)Y‖.   (29)

The inequality above holds because ΠH(·) is an orthogonal projection onto a subspace and its operator norm is 1. The matrices in Proposition 1 have the following bounds:

  ‖Φ±i,r+j‖ ≤ (1/√2)(‖ur+j viT‖ + ‖ui vr+jT‖) ≤ (1/√2)(1 + 1) = √2,
  ‖⟨Y, Φ±i,r+j⟩ Φ±i,r+j‖ ≤ ‖Φ±i,r+j‖² ‖Y‖ ≤ 2‖Y‖.

Therefore,

  ‖ Σ_{1≤i≤r, 1≤j≤p} [ (σr+j/(σi − σr+j)) ⟨Y, Φ+i,r+j⟩ Φ+i,r+j − (σr+j/(σi + σr+j)) ⟨Y, Φ−i,r+j⟩ Φ−i,r+j ] ‖
    ≤ 4 Σ_{1≤i≤r, 1≤j≤p} (σr+j/(σi − σr+j)) ‖Y‖ ≤ 4pr (σr+1/(σr − σr+1)) ‖Y‖ ≤ (w0²/(2ρ)) ‖Y‖ = (1/2)ε0 ‖Y‖.   (30)

In the above, we used the fact that ψ(t) := t/(σr − t) is an increasing function of t for t < σr. Proposition 1, (29) and (30) imply

  ‖∇φ(X̄)Y‖ ≤ ‖ΠTMr(X̄)(Y)‖ + (ε0/2)‖Y‖ ≤ ‖Y‖ + (ε0/2)‖Y‖ = (1 + ε0/2)‖Y‖.

The second inequality above used the fact that the operator norm of ΠTMr(X̄) is not greater than 1, due to TMr(X̄) being a subspace. Since φ(·) is differentiable at X̄, there exists ε > 0 such that

  ‖φ(X) − φ(X̄) − ∇φ(X̄)(X − X̄)‖ ≤ (1/4)ε0 ‖X − X̄‖,   ∀ X ∈ Bε(X̄).

Therefore,

  ‖φ(X) − φ(X̄)‖ ≤ ‖φ(X) − φ(X̄) − ∇φ(X̄)(X − X̄)‖ + ‖∇φ(X̄)(X − X̄)‖
               ≤ (1/4)ε0 ‖X − X̄‖ + (1 + ε0/2)‖X − X̄‖ = (1 + 3ε0/4)‖X − X̄‖.

Now we are ready to quantify the error between Xν+1 and X̄ whenever Xν ∈ Bε(X̄):

  ‖Xν+1 − X̄‖ = ‖ρ/(ρ + W(2)) ◦ (φ(Xν) − φ(X̄))‖ ≤ (ρ/(ρ + w0²)) ‖φ(Xν) − φ(X̄)‖
             ≤ ((1 + 3ε0/4)/(1 + ε0)) ‖Xν − X̄‖ = c ‖Xν − X̄‖.

Consequently, Xν+1 ∈ Bε(X̄). Since {Xν}ν∈K converges to X̄, Xν will eventually fall in Bε(X̄), which implies that the whole sequence {Xν} converges to X̄ and eventually converges at a linear rate. □

Remark 3 (Implication for MAP) When the weight matrix W = 0, pMAP reduces to MAP according to (9). Theorem 2(i) implies

  ‖Xν+1 − ΠMr(Xν+1)‖² − ‖Xν − ΠMr(Xν)‖² ≤ −‖Xν+1 − Xν‖²,   (31)

which obviously implies

  ‖Xν+1 − ΠMr(Xν+1)‖ ≤ ‖Xν − ΠMr(Xν)‖.   (32)

The decrease property (32) was known in [5, Eq. (4.1)] and was used there to ascertain that MAP is a descent algorithm. Our improved bound (31) says slightly more: the decrease in the function ‖X − ΠMr(X)‖ at each step is strict unless the update becomes unchanged. In this case (W = 0), the penalty parameter is just a scaling factor in the objective, hence the KKT result in Theorem 2(ii) does not apply to MAP. This probably explains why it is difficult to establish similar results for MAP.

Remark 4 (On linear convergence) In the general context of matrix completion, Lai and Varghese [20] established local linear convergence of MAP under the following two assumptions, which we describe in terms of Hankel matrix completion. (i) The partially observed data a can be completed to a rank-r Hankel matrix M. (ii) A transversality condition (see [20, Thm. 2]) holds at M. We emphasize that the result of [20] is a local result: it requires that the initial point of MAP be close enough to M, and the rank-r assumption on M is also crucial to their analysis, which also motivated our proof. In contrast, our result is a global one and enjoys a linear convergence rate near the limit under the more realistic assumption σr ≫ σr+1. One may have noticed that the convergence rate c, though strictly less than 1, may be close to 1. It is indeed often observed numerically that MAP converges slowly. But the more important point here is that in such a situation our result ensures that the whole sequence converges. This global convergence justifies the widely used stopping criterion ‖Xν+1 − Xν‖ ≤ ε.


4 Extension to complex-valued matrices

The results obtained in the previous sections are for real-valued matrices and they can be extended to complex-valued matrices by employing what is known as the Wirtinger calculus [30]. We note that not all algorithms for Hankel matrix optimization have a straightforward extension from the real case to the complex case, see [6] for comments on some algorithms. We explain our extension below.

Suppose f : Cn → IR is a real-valued function in the complex domain. We write z ∈ Cn as z = x + jy with x, y ∈ IRn. The conjugate is z̄ := x − jy. Then we can write the function f(z) in terms of its real variables x and y. With a slight abuse of notation, we still denote it as f(x, y). In the case where the optimization of f(z) can be equivalently represented as optimization of f in terms of its real variables, the partial derivatives ∂f(x, y)/∂x and ∂f(x, y)/∂y would be sufficient. In other cases, where algorithms are preferred to be executed in the complex domain, the Wirtinger calculus [30] is more convenient to use; it is well explained (and derived) in [19]. The R-derivative and the conjugate R-derivative of f in the complex domain are defined respectively by

  ∂f/∂z = (1/2)(∂f/∂x − j ∂f/∂y),   ∂f/∂z̄ = (1/2)(∂f/∂x + j ∂f/∂y).

The R-derivatives in the complex domain play the same role as the derivatives in the real domain because the following two first-order expansions are equivalent:

  f(x + Δx, y + Δy) = f(x, y) + ⟨∂f/∂x, Δx⟩ + ⟨∂f/∂y, Δy⟩ + o(‖Δx‖ + ‖Δy‖),
  f(z + Δz) = f(z) + 2 Re(⟨∂f/∂z̄, Δz⟩) + o(‖Δz‖).   (33)

Here, we treat the partial derivatives as column vectors and Re(x) is the real part of x. Note that the first-order expansion of f(z + Δz) uses the conjugate R-derivative. Hence, we define the complex gradient to be ∇f(z) := 2∂f/∂z̄, when it exists. When f is not differentiable, we can extend the subdifferential of f from the real case to the complex case by generalizing (33).
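As a quick numerical illustration (our own sketch, not from the paper), take f(z) = (1/2)‖z − a‖² for a fixed complex vector a; its conjugate R-derivative is ∂f/∂z̄ = (1/2)(z − a), and the first-order expansion in (33) can be checked against a small perturbation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
dz = 1e-6 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

f = lambda v: 0.5 * np.linalg.norm(v - a) ** 2
df_dzbar = 0.5 * (z - a)                       # conjugate R-derivative of f at z

lhs = f(z + dz) - f(z)
rhs = 2 * np.real(np.vdot(df_dzbar, dz))       # 2 Re <df/dzbar, dz>, the first-order term in (33)
print(abs(lhs - rhs))                           # O(||dz||^2), i.e. negligibly small
```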

In order to extend Theorem 1, we need to characterize ∂dMr(X) in the complex domain. We may follow the route of [26] to conduct the extension. For example, we may extend the regular subgradient of dMr(X) [26, Def. 8.3] to its complex counterpart by replacing the conjugate gradient in the first-order expansion in (33) by a regular subgradient. We then define the subdifferential through regular subgradients. With this definition in the complex domain, we may extend [26, (8.53)] to derive formulae for ∂dMr(X). What we need in the proof of Theorem 1 is that (X − ΠMr(X))/dMr(X) ∈ ∂dMr(X) when X ∉ Mr. The proof of this result follows a straightforward extension of the corresponding part in [26, (8.53)] and, if reproduced here, would take up much space. Hence we omit it.

In order to extend the results in Sect. 3, we need the subdifferential of hr(X) in order to majorize gr(X). Since hr(X) is convex, its subdifferential is easy to define. We
