
Trace Norm Regularization: Reformulations, Algorithms, and Multi-task Learning

Ting Kei Pong

Paul Tseng

Shuiwang Ji

Jieping Ye§

June 23, 2009

Abstract

We consider a recently proposed optimization formulation of multi-task learning based on trace norm regularized least squares. While this problem may be formulated as a semidefinite program (SDP), its size is beyond the reach of general SDP solvers. Previous solution approaches apply proximal gradient methods to solve the primal problem. We derive new primal and dual reformulations of this problem, including a reduced dual formulation that involves minimizing a convex quadratic function over an operator-norm ball in matrix space. This reduced dual problem may be solved by gradient-projection methods, with each projection involving a singular value decomposition. The dual approach is compared with existing approaches and its practical effectiveness is illustrated on simulations and an application to gene expression pattern analysis.

Key words. Multi-task learning, gene expression pattern analysis, trace norm regularization, convex optimization, duality, semidefinite programming, proximal gradient method.

1 Introduction

In various applications we seek a minimum rank solution W ∈ R^{n×m} of the linear matrix equations AW = B, where A ∈ R^{p×n} and B ∈ R^{p×m}. This problem is NP-hard in general, and various approaches have been proposed to solve it approximately. One promising approach regularizes the objective by the trace norm (also called the nuclear norm) ‖W‖_∗, defined as the sum of the singular values of W. This results in a convex optimization problem that can be reformulated as a semidefinite program (SDP) and solved by various methods. This approach has been applied to minimum-order system realization [18], low-rank matrix completion [12, 13], low-dimensional Euclidean embedding [19], dimension reduction in multivariate linear regression [30, 51], multi-class classification [2], and multi-task learning [1, 5, 38]. We consider the regularized least squares formulation

υ := min_W (1/2)‖AW − B‖²_F + µ‖W‖_∗,   (1)

where µ > 0 and ‖·‖_F denotes the Frobenius norm; see [1, 5, 30, 38, 44]. By dividing A and B by √µ, we can without loss of generality assume that µ = 1. However, this is computationally inefficient

Department of Mathematics, University of Washington, Seattle, WA 98195, USA (tkpong@math.washington.edu)

Department of Mathematics, University of Washington, Seattle, WA 98195, USA (tseng@math.washington.edu)

Department of Computer Science and Engineering, Center for Evolutionary Functional Genomics, The Biodesign In- stitute, Arizona State University, Tempe, AZ 85287, USA (shuiwang.ji@asu.edu)

§Department of Computer Science and Engineering, Center for Evolutionary Functional Genomics, The Biodesign In- stitute, Arizona State University, Tempe, AZ 85287, USA (jieping.ye@asu.edu)


when solving (1) for multiple values of µ. We may view µ as the Lagrange multiplier associated with the constrained formulation

min_W (1/2)‖AW − B‖²_F subject to ‖W‖_∗ ≤ ω,   (2)

for some ω ≥ 0 [51], or view 1/µ as the Lagrange multiplier associated with the alternative constrained formulation

min_W ‖W‖_∗ subject to (1/2)‖AW − B‖²_F ≤ ρ,   (3)

for some ρ ≥ 0.

Our motivation stems from multi-task learning [14], which has recently received attention in broad areas such as machine learning, data mining, computer vision, and bioinformatics [3–5, 21, 49]. Multi-task learning aims to improve the generalization performance of classifiers by learning from multiple related tasks. This can be achieved by learning the tasks simultaneously while exploiting their intrinsic relatedness, so that informative domain knowledge is shared across the tasks, facilitating “cooperative” task learning. This approach is particularly effective when only limited training data is available for each task.

Multi-task learning has been investigated by many researchers using different approaches, such as sharing hidden units of neural networks among similar tasks [7, 14], modeling task relatedness using a common prior distribution in hierarchical Bayesian models [6], extending kernel methods and regularization networks to multi-task learning [17], and multi-task learning with clustered tasks [23]. Recently, there has been growing interest in studying multi-task learning in the context of feature learning and selection [2, 4, 5, 37].

Specifically, [4] proposes the alternating structure optimization (ASO) algorithm to learn the predictive structure from multiple tasks. In ASO, a separate linear classifier is trained for each of the tasks and dimension reduction is applied on the predictor (classifier) space, finding low-dimensional structures with the highest predictive power. This approach has been applied successfully in several applications [3, 39].

However, the optimization formulation is non-convex and the alternating optimization procedure used is not guaranteed to find a global optimum [4]. Recently, trace norm regularization has been proposed for multi-task learning [1, 5, 38], resulting in the convex optimization problem (1). We explain its motivation in more detail below.

Assume that we are given m supervised (binary-class) learning tasks. Each learning task ℓ is associated with a predictor f_ℓ and training data {(x₁, y₁^ℓ), . . . , (x_p, y_p^ℓ)} ⊂ R^n × {−1, 1} (ℓ = 1, . . . , m). We focus on linear predictors f_ℓ(x) = w_ℓᵀx, where w_ℓ is the weight vector for the ℓth task. The convex multi-task learning formulation based on trace norm regularization is the following optimization problem:

min_{w₁,...,w_m}  Σ_{ℓ=1}^m Σ_{i=1}^p L(w_ℓᵀx_i, y_i^ℓ) + µ‖W‖_∗,   (4)

where L is the loss function, W = [w₁ · · · w_m] ∈ R^{n×m}, and ‖W‖_∗ is the trace norm of W, given by the sum of the singular values of W. For the least squares loss L(s, t) = (1/2)(s − t)², (4) reduces to (1) with A = [x₁, · · · , x_p]ᵀ ∈ R^{p×n} and with the (i, ℓ)-th entry of B ∈ R^{p×m} equal to y_i^ℓ. Trace norm regularization has the effect of inducing W to have low rank.
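As a quick illustration (ours, not the paper's), the primal objective of (1) is easy to evaluate in NumPy; the dimensions and data below are arbitrary toy choices.

```python
import numpy as np

def trace_norm(W):
    # Trace (nuclear) norm: the sum of the singular values of W.
    return np.linalg.svd(W, compute_uv=False).sum()

def p_obj(W, A, B, mu):
    # Primal objective of (1): (1/2)||AW - B||_F^2 + mu * ||W||_*.
    return 0.5 * np.linalg.norm(A @ W - B, "fro") ** 2 + mu * trace_norm(W)

# Toy multi-task data: p shared inputs with n features, m tasks.
rng = np.random.default_rng(0)
p, n, m = 10, 5, 3
A = rng.standard_normal((p, n))           # row i is the input x_i
B = np.sign(rng.standard_normal((p, m)))  # B[i, l] is the label y_i^l
W = rng.standard_normal((n, m))
print(p_obj(W, A, B, mu=1.0))
```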

A practical challenge in using trace norm regularization is to develop efficient methods for solving the convex, but nonsmooth, optimization problem (1). It is well known that a trace norm minimization problem can be formulated as an SDP [18, 43]. However, such a formulation is computationally expensive for existing SDP solvers. To overcome the nonsmoothness, nonconvex smooth reformulations of (1) have been proposed that are solved by conjugate gradient or alternating minimization methods [40, 47, 48].

However, the nonconvex reformulation introduces spurious stationary points, and convergence to a global minimum is not guaranteed. Recently, proximal gradient methods have been applied to solve (1). These methods have good theoretical convergence properties and their practical performance seems promising.

In [30], A is assumed to have full column rank and a variant of Nesterov's smooth method [46, Algorithm 3] is applied to solve a certain saddle point reformulation of (1). Numerical results on random instances of A, B, with m up to 60 and n = 2m, p = 10m, show that the method solves (1) much faster than the general SDP solver SDPT3. In [31], a proximal gradient method is used to solve (1) (see (26)), with µ gradually decreased to accelerate convergence. Each iteration of this method involves a singular value decomposition (SVD) of an n × m matrix. It solved randomly generated matrix completion problems of size up to 1000 × 1000. In [25, 44], an accelerated proximal gradient method is used. In [44], additional techniques to accelerate convergence and reduce SVD computation are developed, and matrix completion problems of size up to 10^5 × 10^5 are solved. For the constrained version of the problem, corresponding to (3) with ρ = 0, a primal-dual interior-point method [28] and a smoothed dual gradient method [12] have been proposed. The successive regularization approach in [31] entails solving a sequence of problems of the form (1).

In this paper, we derive alternative primal and dual reformulations of (1) with varying advantages for numerical computation. In particular, we show that (1) is reducible to the case where A has full column rank, i.e., r = n, where

r := rank(A);

see Section 2. We then show that this reduced problem has a dual that involves minimizing a quadratic function over the unit-operator-norm ball in R^{r×m}; see Section 3. This reduced dual problem may be solved by a conditional gradient method and (accelerated) gradient-projection methods, with each projection involving an SVD of an r × m matrix. The primal and dual gradient methods complement each other in the sense that the iteration complexity of the former is proportional to the largest eigenvalue of AᵀA while that of the latter is inversely proportional to the smallest nonzero eigenvalue of AᵀA; see Section 4. For multi-task learning, m is small relative to r, so that computing an SVD of an r × m matrix is relatively inexpensive. Alternative primal formulations that may be advantageous when r < m are also derived; see Section 5. Numerical tests on randomly generated data suggest that the dual problem is often easier to solve than the primal problem, particularly by gradient-projection methods; see Section 6. In Section 7 we report the results of experiments conducted on the automatic gene expression pattern image annotation task [24, 45]. The results demonstrate the efficiency and effectiveness of the proposed multi-task learning formulation for m up to 60, n = 3000, and p up to 3000.

In our notation, for any W, Z ∈ R^{n×m}, ⟨W, Z⟩ = tr(WᵀZ), so that ‖Z‖_F = √⟨Z, Z⟩. For any W ∈ R^{n×m}, σmax(W) denotes its largest singular value. For any symmetric C, D ∈ R^{n×n}, λmin(C) and λmax(C) denote, respectively, the smallest and the largest eigenvalues of C, and C ⪰ D (respectively, C ≻ D) means C − D is positive semidefinite (respectively, positive definite) [22, Section 7.7].

2 Problem reduction when A lacks full column rank

Suppose A lacks full column rank, i.e., r < n. (Recall that r = rank(A).) We show below that (1) is reducible to one for which r = n.

We decompose

A = R [Ã  0] Sᵀ,

where R ∈ R^{p×p} and S ∈ R^{n×n} are orthogonal, and Ã ∈ R^{p×r}. In a QR decomposition of A, Ã is lower triangular with positive diagonals and R is a permutation matrix. In a singular value decomposition (SVD) of A, Ã is diagonal with positive diagonals. In general, a QR decomposition is much faster to compute than an SVD. The following result shows that A and B in (1) are reducible to Ã and RᵀB.

Proposition 1.

υ = min_{W̃ ∈ R^{r×m}} (1/2)‖ÃW̃ − RᵀB‖²_F + µ‖W̃‖_∗.

Proof. For any W ∈ R^{n×m}, let

[W̃; Ŵ] = SᵀW,

where W̃ ∈ R^{r×m}, Ŵ ∈ R^{(n−r)×m}, and [X; Y] denotes the matrix obtained by stacking X on top of Y. Then

‖W‖_∗ = ‖SᵀW‖_∗ = ‖[W̃; Ŵ]‖_∗ ≥ ‖W̃‖_∗,

where the inequality uses [W̃ᵀ Ŵᵀ][W̃; Ŵ] = W̃ᵀW̃ + ŴᵀŴ ⪰ W̃ᵀW̃ and the order-preserving property of the eigenvalue map [22, Corollary 7.7.4(c)]. Hence

p(W) := (1/2)‖AW − B‖²_F + µ‖W‖_∗   (5)
      = (1/2)‖R [Ã 0] SᵀW − B‖²_F + µ‖W‖_∗
      = (1/2)‖[Ã 0][W̃; Ŵ] − RᵀB‖²_F + µ‖SᵀW‖_∗
      ≥ (1/2)‖ÃW̃ − RᵀB‖²_F + µ‖W̃‖_∗
      =: p̃(W̃).

It follows that υ = min_W p(W) ≥ min_{W̃} p̃(W̃). On the other hand, for each W̃ ∈ R^{r×m}, letting

W = S [W̃; 0]

yields WᵀW = W̃ᵀW̃, so that ‖W‖_∗ = ‖W̃‖_∗. Hence

p̃(W̃) = (1/2)‖ÃW̃ − RᵀB‖²_F + µ‖W̃‖_∗
      = (1/2)‖R [Ã 0][W̃; 0] − B‖²_F + µ‖W‖_∗
      = (1/2)‖R [Ã 0] SᵀS [W̃; 0] − B‖²_F + µ‖W‖_∗
      = (1/2)‖AW − B‖²_F + µ‖W‖_∗
      = p(W).

Hence min_{W̃} p̃(W̃) ≥ min_W p(W) = υ and the proposition is proved.
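Proposition 1 is easy to sanity-check numerically. The sketch below (ours; sizes arbitrary) builds a rank-deficient A, forms the decomposition via a full SVD, and confirms that the full and reduced objectives agree on points of the form W = S[W̃; 0].

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, m, r = 8, 6, 3, 4
A = rng.standard_normal((p, r)) @ rng.standard_normal((r, n))  # rank(A) = r < n
B = rng.standard_normal((p, m))
mu = 0.5

# Decompose A = R [A_tilde 0] S^T via a full SVD.
R, sig, Vt = np.linalg.svd(A, full_matrices=True)
S = Vt.T
A_tilde = np.zeros((p, r))
A_tilde[:r, :] = np.diag(sig[:r])  # the r nonzero singular values

tn = lambda X: np.linalg.svd(X, compute_uv=False).sum()

# For W = S [W_tilde; 0], the full and reduced objectives coincide.
W_tilde = rng.standard_normal((r, m))
W = S @ np.vstack([W_tilde, np.zeros((n - r, m))])
p_full = 0.5 * np.linalg.norm(A @ W - B, "fro") ** 2 + mu * tn(W)
p_red = 0.5 * np.linalg.norm(A_tilde @ W_tilde - R.T @ B, "fro") ** 2 + mu * tn(W_tilde)
print(abs(p_full - p_red))  # tiny, up to rounding
```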


3 A reduced dual problem

By Proposition 1, (1) is reducible to the case of r = n, which we assume throughout this section. Using this, we will derive a dual of (1), defined on the same space R^{r×m}, that has additional desirable properties and in some sense complements the primal problem. In our numerical experience, this dual problem is often easier to solve by gradient methods; see Sections 6 and 7.

First, note that the trace norm is the dual norm of the operator norm, namely

‖W‖_∗ = max_{σmax(Λ)≤1} −⟨Λ, W⟩ = max_{ΛᵀΛ⪯I} −⟨Λ, W⟩,

where Λ ∈ R^{r×m}. Thus (1) can be rewritten as the minimax problem

υ = min_W max_{ΛᵀΛ⪯I} (1/2)‖AW − B‖²_F − µ⟨Λ, W⟩.

Switching “min” and “max” yields its dual:

max_{ΛᵀΛ⪯I} min_W (1/2)‖AW − B‖²_F − µ⟨Λ, W⟩.   (6)

Since the dual feasible set {Λ | ΛᵀΛ ⪯ I} is convex and compact, there is no duality gap, i.e., the optimal value of (6) equals υ; see [42, Corollary 37.3.2]. The minimization in W is attained where Aᵀ(AW − B) − µΛ = 0. Solving for W yields W = µCΛ + E, where we let

M := AᵀA,  C := M⁻¹,  E := CAᵀB.   (7)

Plugging this into (6) yields

υ = max_{ΛᵀΛ⪯I} −(µ²/2)⟨Λ, CΛ⟩ − µ⟨E, Λ⟩ − (1/2)⟨E, AᵀB⟩ + (1/2)‖B‖²_F.

Negating the objective and scaling Λ by µ, this dual problem reduces to

min_{ΛᵀΛ⪯µ²I} d(Λ) := (1/2)⟨Λ, CΛ⟩ + ⟨E, Λ⟩ + (1/2)⟨E, AᵀB⟩ − (1/2)‖B‖²_F.   (8)

(We scale Λ to make the dual objective independent of µ.) This is a convex quadratic minimization over the µ-operator-norm ball (since ΛᵀΛ ⪯ µ²I if and only if σmax(Λ) ≤ µ). Moreover, given any dual solution Λ*, a primal solution W* can be recovered by

W* = CΛ* + E.   (9)

On the other hand, given any primal solution W*, a dual solution Λ* can be recovered by

Λ* = M(W* − E).   (10)

As we shall see in Section 4, this provides a practical termination criterion for methods solving either the primal problem (1) or the dual problem (8).
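To make the pieces concrete, here is a small NumPy sketch (ours; data arbitrary) of the dual objective d, the primal recovery (9), and the nonnegativity of the duality gap p(W) + d(Λ) at an arbitrary feasible Λ, made feasible by clipping its singular values at µ.

```python
import numpy as np

rng = np.random.default_rng(2)
p, r, m = 12, 6, 3            # assume A already has full column rank (r = n)
A = rng.standard_normal((p, r))
B = rng.standard_normal((p, m))
mu = 0.7

M = A.T @ A
C = np.linalg.inv(M)
E = C @ A.T @ B

def d_obj(L):
    # Dual objective of (8).
    return (0.5 * (L * (C @ L)).sum() + (E * L).sum()
            + 0.5 * (E * (A.T @ B)).sum() - 0.5 * np.linalg.norm(B, "fro") ** 2)

def p_obj(W):
    # Primal objective of (1).
    return (0.5 * np.linalg.norm(A @ W - B, "fro") ** 2
            + mu * np.linalg.svd(W, compute_uv=False).sum())

# A feasible dual point: clip singular values at mu so sigma_max <= mu.
U, s, Vt = np.linalg.svd(rng.standard_normal((r, m)), full_matrices=False)
Lam = U @ np.diag(np.minimum(s, mu)) @ Vt
W = C @ Lam + E               # primal point recovered via (9)
gap = p_obj(W) + d_obj(Lam)   # >= 0, and = 0 exactly at an optimal pair
print(gap)
```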

The following result shows that, when r ≥ m, minimizing a linear function over this set and projecting onto this set each require only an SVD of an r × m matrix. This will be key to the efficient implementation of solution methods for (8).


Proposition 2. For any Φ ∈ R^{r×m} with r ≥ m, let Φ = R [D; 0] Sᵀ be its SVD, i.e., R ∈ R^{r×r}, S ∈ R^{m×m} are orthogonal and D ∈ R^{m×m} is diagonal with Dii ≥ 0 for all i. Then a minimizer of

min_{ΛᵀΛ⪯µ²I} ⟨Φ, Λ⟩   (11)

is R [−µI; 0] Sᵀ, and the unique minimizer of

min_{ΛᵀΛ⪯µ²I} ‖Φ − Λ‖²_F   (12)

is R [min{D, µI}; 0] Sᵀ, where the minimum is taken entrywise.

Proof. Letting [F; G] = RᵀΛS with F ∈ R^{m×m} and G ∈ R^{(r−m)×m}, we can rewrite (11) as

min_{F,G} ⟨[D; 0], [F; G]⟩ = Σ_{i=1}^m Dii Fii subject to FᵀF + GᵀG ⪯ µ²I.   (13)

Notice that the constraint implies FᵀF ⪯ µ²I, which in turn implies Σ_{j=1}^m F²_ij ≤ µ² for all i, which in turn implies F²_ii ≤ µ² for all i. Thus

min_{F,G} Σ_{i=1}^m Dii Fii subject to F²_ii ≤ µ², i = 1, . . . , m,   (14)

is a relaxation of (13). Since Dii ≥ 0, it is readily seen that a minimizer of (14) is Fii = −µ for all i, Fij = 0 for all i ≠ j, and G = 0. Since this solution is feasible for (13), it is a minimizer of (13) also.

Similarly, we can rewrite (12) as

min_{F,G} ‖[D; 0] − [F; G]‖²_F = ‖D − F‖²_F + ‖G‖²_F subject to FᵀF + GᵀG ⪯ µ²I.   (15)

Again, the constraint implies F²_ii ≤ µ² for all i. Thus

min_{F,G} ‖D − F‖²_F + ‖G‖²_F subject to F²_ii ≤ µ², i = 1, . . . , m,   (16)

is a relaxation of (15). Since ‖D − F‖²_F = Σ_i |Dii − Fii|² + Σ_{i≠j} |Fij|², the relaxed problem (16) decomposes entrywise. Since Dii ≥ 0, it is readily seen that its unique minimizer is Fii = min{Dii, µ} for all i, Fij = 0 for all i ≠ j, and G = 0. Since this solution is feasible for (15), it is a minimizer of (15) also.
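A direct NumPy transcription of Proposition 2 (ours; Φ arbitrary). Both operations reduce to one thin SVD; the check below uses the closed-form optimal value of (11), which is −µ times the trace norm of Φ.

```python
import numpy as np

def linear_minimizer(Phi, mu):
    # A minimizer of <Phi, Lam> over Lam^T Lam <= mu^2 I, per (11).
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    return -mu * U @ Vt

def project(Phi, mu):
    # The nearest point to Phi in the mu-operator-norm ball, per (12):
    # clip the singular values of Phi at mu.
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    return U @ np.diag(np.minimum(s, mu)) @ Vt

rng = np.random.default_rng(3)
Phi = rng.standard_normal((7, 4))   # the r >= m case
mu = 1.5
Lam_lin = linear_minimizer(Phi, mu)
# Optimal value of (11) equals -mu * (sum of singular values of Phi).
print((Phi * Lam_lin).sum(), -mu * np.linalg.svd(Phi, compute_uv=False).sum())
```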

It can be seen that Proposition 2 readily extends to the case of r < m by replacing

[D; 0],  [−µI; 0],  [min{D, µI}; 0]

with, respectively,

[D 0],  [−µI 0],  [min{D, µI} 0],

where D ∈ R^{r×r} is diagonal with Dii ≥ 0 for all i. Specifically, in its proof we replace [F; G] with [F G], so that (13) becomes

min_{F,G} Σ_{i=1}^r Dii Fii subject to [FᵀF FᵀG; GᵀF GᵀG] ⪯ µ²I

and (15) becomes

min_{F,G} ‖D − F‖²_F + ‖G‖²_F subject to [FᵀF FᵀG; GᵀF GᵀG] ⪯ µ²I.

The remainder of the proof proceeds as before.

How does the dual problem (8) compare with the primal problems (1) and (2)? Unlike (1), the dual problem has a compact feasible set, which allows a duality gap bound (23) to be derived. It can also be solved by a wider range of algorithms; see Section 4.1. While (2) also has a compact feasible set, that set has a more complicated structure; see [30]. Specifically, projection onto this set has no closed form solution. Notice that the feasible sets of (2) and (8) scale with ω and µ, respectively. However, when µ is varied, warm starting by projecting the current feasible solution onto the feasible set of (8) appears to work better than scaling; see Section 6. Finally,

‖∇d(Λ) − ∇d(Λ′)‖_F = ‖C(Λ − Λ′)‖_F ≤ λmax(C)‖Λ − Λ′‖_F  ∀Λ, Λ′ ∈ R^{r×m},   (17)

with Lipschitz constant

L_D := λmax(C) = λmin(M)⁻¹   (18)

(since C = M⁻¹). Thus the gradient of the objective function of (8) is Lipschitz continuous with Lipschitz constant L_D = λmin(M)⁻¹. In contrast, the gradient of the quadratic function in (1) and (2) is Lipschitz continuous with Lipschitz constant L_P = λmax(M); see (25). As we shall see, the Lipschitz constant affects the iteration complexity of gradient methods for solving (1) and (8). Thus the primal and dual problems complement each other in the sense that L_P is small when λmax(M) is small and L_D is small when λmin(M) is large. The ‘hard case’ is when λmax(M) is large and λmin(M) is small.
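A small numerical aside (ours): the product L_P · L_D equals the spectral condition number of M, so the instances that are hard for both the primal and the dual gradient methods are exactly the ill-conditioned ones.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 6))   # full column rank with probability 1
M = A.T @ A
eig = np.linalg.eigvalsh(M)        # eigenvalues of M, ascending
L_P = eig[-1]                      # primal Lipschitz constant, (25)
L_D = 1.0 / eig[0]                 # dual Lipschitz constant, (18)
# L_P * L_D = lambda_max(M) / lambda_min(M) = cond(M).
print(L_P * L_D, np.linalg.cond(M))
```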

4 Solution approaches

In this section we consider practical methods for solving the primal problem (1) and the dual problem (8). Due to the large problem size in multi-task learning, we consider first-order methods, specifically gradient methods. (We also considered a primal-dual interior-point method using the Nesterov-Todd direction, which is a second-order method, but its computational cost per iteration appears prohibitive for our applications.) In view of Proposition 1, we assume without loss of generality that r = n. Then it can be seen that (1) has a unique solution W_µ which approaches the least squares solution E as µ → 0. Moreover,

W_µ = 0 ⟺ µ ≥ µ₀ := σmax(AᵀB),   (19)

which follows from the optimality condition 0 ∈ −AᵀB + µ∂‖0‖_∗ and ∂‖0‖_∗ being the unit-operator-norm ball.

4.1 Dual gradient methods

In this subsection, we consider methods for solving the reduced dual problem (8). Since its feasible set is compact convex and, by Proposition 2, minimizing a linear function over or projecting onto this feasible set requires only an SVD of an r × m matrix, we can use either a conditional gradient method or a gradient-projection method. In what follows, we assume for simplicity that r ≥ m, which holds for multi-task learning. Extension to the r < m case is straightforward.


In the conditional gradient method, proposed originally by Frank and Wolfe for quadratic programming, given a feasible point Λ, we replace the objective function by its linear approximation at Λ (using ∇d(Λ) = CΛ + E):

min_{ΓᵀΓ⪯µ²I} ⟨CΛ + E, Γ⟩   (20)

and then do a line search on the line segment joining Λ and a solution Λ̂ of (20); see [10, Section 2.2] and references therein. Specifically, letting ∆ = Λ̂ − Λ, it can be seen that

d(Λ + α∆) = d(Λ) + α⟨CΛ + E, ∆⟩ + (α²/2)⟨∆, C∆⟩.

Minimizing this over α ∈ [0, 1] yields

α = min{1, −⟨CΛ + E, ∆⟩/⟨∆, C∆⟩}.   (21)

By Proposition 2, the linearized problem (20) can be solved from the SVD of CΛ + E. We thus have the following method for solving (8):

Dual conditional gradient (DCG) method:

0. Choose any Λ satisfying ΛᵀΛ ⪯ µ²I. Go to Step 1.

1. Compute the SVD

CΛ + E = R [D; 0] Sᵀ,

where R ∈ R^{r×r}, S ∈ R^{m×m} are orthogonal, and D ∈ R^{m×m} is diagonal with Dii ≥ 0 for all i. Set

Λ̂ = R [−µI; 0] Sᵀ,  ∆ = Λ̂ − Λ.

Compute α by (21) and update Λ_new = Λ + α∆. If a termination criterion is not met, go to Step 1.

It is known that every cluster point of the Λ-sequence generated by the DCG method is a solution of (8). A common choice of termination criterion is ‖Λ − Λ_new‖_F ≤ tol, where tol > 0. Another is d(Λ) − d(Λ_new) ≤ tol. However, these criteria are algorithm dependent and offer no guarantee on the solution accuracy. In fact, for a fixed tol, the solution accuracy obtained can vary wildly across problem instances, which is undesirable. We will use a termination criterion, based on the duality gap, with guaranteed solution accuracy. In particular, (9) shows that, as Λ approaches a solution of the dual problem (8), W = CΛ + E approaches a solution of the primal problem (1). This suggests terminating the method when the relative duality gap is small, namely,

|p(W) + d(Λ)| / (|d(Λ)| + 1) < tol,   (22)

where p(W) denotes the objective function of (1); see (5). To save computation, we can check this criterion every, say, 10 or r iterations.
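A minimal NumPy sketch of the DCG loop (ours): it folds together the SVD step, the exact line search (21), and the relative-gap stopping rule (22), and assumes A has full column rank.

```python
import numpy as np

def dcg(A, B, mu, tol=1e-2, max_iter=2000):
    # Conditional-gradient (Frank-Wolfe) sketch for the reduced dual (8).
    M = A.T @ A
    C = np.linalg.inv(M)
    E = C @ A.T @ B
    Lam = np.zeros(E.shape)                      # feasible start
    p = lambda W: (0.5 * np.linalg.norm(A @ W - B, "fro") ** 2
                   + mu * np.linalg.svd(W, compute_uv=False).sum())
    d = lambda L: (0.5 * (L * (C @ L)).sum() + (E * L).sum()
                   + 0.5 * (E * (A.T @ B)).sum()
                   - 0.5 * np.linalg.norm(B, "fro") ** 2)
    for k in range(max_iter):
        G = C @ Lam + E                          # gradient of d at Lam
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        Delta = -mu * U @ Vt - Lam               # step toward the solution of (20)
        denom = (Delta * (C @ Delta)).sum()
        alpha = min(1.0, -(G * Delta).sum() / denom) if denom > 0 else 1.0
        Lam = Lam + alpha * Delta                # exact line search, (21)
        W = C @ Lam + E                          # primal recovery, (9)
        if k % 10 == 0 and abs(p(W) + d(Lam)) / (abs(d(Lam)) + 1) < tol:
            break
    return W, Lam
```

On small random instances this reaches a modest relative duality gap; as noted next, the DCG method can be slow compared with gradient projection.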

Notice that we only need the first m columns of R in the SVD, which can be found in O(m²r + m³) floating point operations (flops) [20, page 254]. It takes O(m²r) flops to compute Λ̂, and O(mr²) flops to compute CΛ + E, as well as the numerator and denominator in (21). The rest takes O(mr) flops. Thus the work per iteration is O(mr²) (since r ≥ m). If A is sparse, then instead of storing the r × r matrix C = (AᵀA)⁻¹, which is typically dense, it may be more efficient to store C in a factorized form (e.g., using a combination of Cholesky factorization and rank-1 updates to handle dense columns).

The above DCG method can be slow. Gradient-projection methods, proposed originally by Goldstein and by Levitin and Polyak, are often faster; see [10, Section 2.3] and references therein. Gradient projection involves moving along the negative gradient direction and projecting onto the feasible set of (8), which, by Proposition 2, can be done efficiently through an SVD. Since the objective function of (8) is quadratic, exact line search can be used. The following describes a basic implementation.

Dual gradient-projection (DGP) method:

0. Choose any Λ satisfying ΛᵀΛ ⪯ µ²I and any L > 0. Go to Step 1.

1. Compute the SVD

Λ − (1/L)(CΛ + E) = R [D; 0] Sᵀ,

where R ∈ R^{r×r}, S ∈ R^{m×m} are orthogonal, and D ∈ R^{m×m} is diagonal with Dii ≥ 0 for all i. Set

Λ̂ = R [min{D, µI}; 0] Sᵀ,  ∆ = Λ̂ − Λ.

Compute α by (21), and update Λ_new = Λ + α∆. If a termination criterion is not met, go to Step 1.

Since the feasible set of (8) is compact, it can be shown that 0 ≤ υ + d(Λ) ≤ ε after O(L_D/ε) iterations; see [52, Theorem 5.1]. Like the DCG method, the above DGP method takes O(mr²) flops per iteration. Although global convergence of this method is guaranteed for any L > 0, for faster convergence the constant L should be smaller than λmax(C), the Lipschitz constant for ∇d (see (17)), but not too small. In our implementation, we use L = λmax(C)/8. For the termination criterion, we use (22) with W = CΛ + E. A variant of the DGP method uses L > L_D/2 and constant stepsize α = 1, which has the advantage of avoiding the computation of (21).
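The DGP loop is nearly identical to DCG, swapping the Frank-Wolfe vertex for a projected gradient step. A sketch (ours; fixed iteration budget instead of the gap test, and L = λmax(C)/8 as in the text):

```python
import numpy as np

def dgp(A, B, mu, max_iter=500):
    # Gradient-projection sketch for (8) with exact line search;
    # assumes A has full column rank.
    M = A.T @ A
    C = np.linalg.inv(M)
    E = C @ A.T @ B
    L = np.linalg.eigvalsh(C)[-1] / 8            # step constant, as in the text
    Lam = np.zeros(E.shape)
    for _ in range(max_iter):
        G = C @ Lam + E                          # gradient of d at Lam
        U, s, Vt = np.linalg.svd(Lam - G / L, full_matrices=False)
        Delta = U @ np.diag(np.minimum(s, mu)) @ Vt - Lam   # projected step
        denom = (Delta * (C @ Delta)).sum()
        if denom <= 0:
            break                                # Delta = 0: already optimal
        Lam = Lam + min(1.0, -(G * Delta).sum() / denom) * Delta
    return C @ Lam + E, Lam                      # primal via (9), and dual
```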

The gradient-projection method can be accelerated using Nesterov's extrapolation technique [32–35]. Here we use Algorithm 2 described in [46], which is the simplest; see also [8] for a similar method. It is an extension of the method in [32] for unconstrained problems. (Other accelerated methods can also be used; see [33, 35, 46].) This method applies gradient projection from a point Ψ that is extrapolated from the Λ of the last two iterations. It also requires a Lipschitz constant for ∇d; by (17), L_D is such a constant. We describe this method below.

Dual accelerated gradient-projection (DAGP) method:

0. Choose any Λ satisfying ΛᵀΛ ⪯ µ²I. Initialize Λ⁻ = Λ and θ⁻ = θ = 1. Set L = L_D. Go to Step 1.

1. Set

Ψ = Λ + θ(1/θ⁻ − 1)(Λ − Λ⁻).

Compute the SVD

Ψ − (1/L)(CΨ + E) = R [D; 0] Sᵀ,

where R ∈ R^{r×r}, S ∈ R^{m×m} are orthogonal, and D ∈ R^{m×m} is diagonal with Dii ≥ 0 for all i. Update

Λ_new = R [min{D, µI}; 0] Sᵀ,  Λ⁻_new = Λ,

θ_new = (√(θ⁴ + 4θ²) − θ²)/2,  θ⁻_new = θ.

If a termination criterion is not met, go to Step 1.

It can be shown that θ ≤ 2/(k + 2) after k iterations. Moreover, 0 ≤ υ + d(Λ) ≤ ε after O(√(L_D/ε)) iterations; see [8, 32] and [46, Corollary 2 and the proof of Corollary 1]. Like the previous two methods, the above DAGP method takes O(mr²) flops per iteration. Unlike in the gradient-projection method, L cannot be chosen arbitrarily here (and line search does not seem to help). We also tested two other variants of the DAGP method [46, Algorithms 1 and 3], but neither performed better. We use the termination criterion (22) either with W = CΨ + E or with W initialized to the zero matrix and updated in Step 1 by

W_new = (1 − θ)W + θ(CΨ + E).

For the latter choice of W, it can be shown, by using d(Λ) = max_W ⟨Λ, W⟩ − (1/2)‖AW − B‖²_F (see (6)) and the proof of [46, Corollary 2], that

0 ≤ p(W_new) + d(Λ_new) ≤ θ²L_D ∆,   (23)

where ∆ := max_{ΓᵀΓ⪯µ²I} (1/2)‖Γ − Λ_init‖²_F. Since θ ≤ 2/(k + 2) after k iterations, (22) is satisfied after at most O(√(L_D/tol)) iterations. Such a duality gap bound, shown originally by Nesterov for related methods [34, 35], is a special property of these accelerated methods. In our implementation, we check (22) for both choices of W.
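The DAGP iteration, including the averaged primal sequence used in (23), can be sketched as follows (ours; fixed iteration budget, no termination test):

```python
import numpy as np

def dagp(A, B, mu, max_iter=300):
    # Accelerated gradient-projection sketch for (8); assumes A has
    # full column rank. Returns the averaged primal iterate and the dual.
    M = A.T @ A
    C = np.linalg.inv(M)
    E = C @ A.T @ B
    L = np.linalg.eigvalsh(C)[-1]                # L_D = lambda_max(C), (18)
    Lam = Lam_prev = np.zeros(E.shape)
    theta = theta_prev = 1.0
    W = np.zeros(E.shape)                        # primal average for (23)
    for _ in range(max_iter):
        Psi = Lam + theta * (1 / theta_prev - 1) * (Lam - Lam_prev)
        grad = C @ Psi + E
        U, s, Vt = np.linalg.svd(Psi - grad / L, full_matrices=False)
        Lam_prev, Lam = Lam, U @ np.diag(np.minimum(s, mu)) @ Vt
        W = (1 - theta) * W + theta * grad       # note grad = C @ Psi + E
        theta_prev, theta = theta, (np.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return W, Lam
```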

4.2 Primal gradient method

The objective function of the primal problem (1) is the sum of a convex quadratic function

f_P(W) := (1/2)‖AW − B‖²_F

and a “simple” nonsmooth convex function µ‖W‖_∗. Moreover, we have from ∇f_P(W) = MW − AᵀB that

‖∇f_P(W) − ∇f_P(W′)‖_F = ‖M(W − W′)‖_F ≤ L_P‖W − W′‖_F  ∀W, W′ ∈ R^{n×m},   (24)

with Lipschitz constant

L_P := λmax(M).   (25)

We can apply accelerated proximal gradient methods [8, 36, 46] to solve (1). Here we consider Algorithm 2 in [46], which is the simplest; see also FISTA in [8] for a similar method. (Other accelerated methods can also be used; see [36, 46].) The same type of method was used by Toh and Yun for matrix completion [44]. It extrapolates from the W of the last two iterations to obtain a Y ∈ R^{n×m}, and then solves

min_W ⟨∇f_P(Y), W⟩ + (L/2)‖W − Y‖²_F + µ‖W‖_∗   (26)

to obtain the new W, where L is set to L_P, the Lipschitz constant for ∇f_P. Like the dual gradient methods of Section 4.1, each iteration requires only one SVD of the n × m matrix Y − (1/L)∇f_P(Y); see [12, 44]. Unlike for the dual problem, we need not reduce A to have full column rank before applying this method. We describe the basic method below.

Primal accelerated proximal gradient (PAPG) method:

0. Choose any W. Initialize W⁻ = W, θ⁻ = θ = 1. Set L = L_P. Go to Step 1.

1. Set

Y = W + θ(1/θ⁻ − 1)(W − W⁻).

Compute the SVD

Y − (1/L)(MY − AᵀB) = R [D; 0] Sᵀ,

where R ∈ R^{n×n}, S ∈ R^{m×m} are orthogonal, and D ∈ R^{m×m} is diagonal with Dii ≥ 0 for all i. Update

W_new = R [max{D − (µ/L)I, 0}; 0] Sᵀ,  W⁻_new = W,

θ_new = (√(θ⁴ + 4θ²) − θ²)/2,  θ⁻_new = θ.

If a termination criterion is not met, go to Step 1.

It can be shown that 0 ≤ p(W) − υ ≤ ε after O(√(L_P/ε)) iterations; see [8] and [46, Corollary 2 and the proof of Corollary 1]. In our implementation, we use the termination criterion (22) with

Λ = Proj_D(M(W − E)),   (27)

where Proj_D(·) denotes the nearest-point projection onto the feasible set D of the dual problem (8). By (10), Λ approaches a solution of the reduced dual problem (8) as W approaches a solution of (1). To save computation, we can check this criterion every, say, 10 or r iterations.
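The PAPG iteration is a singular-value soft-thresholding loop. A compact sketch (ours; fixed iteration budget, no termination test; it works for any A, since no matrix inverse is formed):

```python
import numpy as np

def papg(A, B, mu, max_iter=300):
    # Accelerated proximal gradient sketch for the primal problem (1);
    # each step soft-thresholds the singular values of Y - (1/L) grad.
    M = A.T @ A
    L = np.linalg.eigvalsh(M)[-1]        # L_P = lambda_max(M), (25)
    W = W_prev = np.zeros((A.shape[1], B.shape[1]))
    theta = theta_prev = 1.0
    for _ in range(max_iter):
        Y = W + theta * (1 / theta_prev - 1) * (W - W_prev)
        U, s, Vt = np.linalg.svd(Y - (M @ Y - A.T @ B) / L, full_matrices=False)
        W_prev, W = W, U @ np.diag(np.maximum(s - mu / L, 0)) @ Vt
        theta_prev, theta = theta, (np.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return W
```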

A variant of PAPG, based on a method of Nesterov for smooth constrained convex optimization [35], replaces the proximity term ‖W − Y‖²_F in (26) by ‖W‖²_F and replaces ∇f_P(Y) by a sum of ∇f_P(Y)/θ over all past iterations; see [46, Algorithm 3]. Thus the method uses a weighted average of all past gradients instead of the most recent gradient. It also computes Y and updates W differently, and uses the specific initialization W = 0. In our tests with fixed µ, this variant performed more robustly than PAPG and another variant [46, Algorithm 2]; see Section 6. We describe this method below.

Primal accelerated proximal gradient-average (PAPGavg) method:

0. Initialize Z = W = 0, θ = 1, ϑ = 0, and G = 0. Set L = L_P. Go to Step 1.

1. Set Y = (1 − θ)W + θZ, and update

G_new = G + (1/θ)(MY − AᵀB),  ϑ_new = ϑ + 1/θ.

Compute the SVD

−(1/L)G_new = R [D; 0] Sᵀ,

where R ∈ R^{n×n}, S ∈ R^{m×m} are orthogonal, and D ∈ R^{m×m} is diagonal with Dii ≥ 0 for all i. Update

Z_new = R [max{D − (µϑ_new/L)I, 0}; 0] Sᵀ,  W_new = (1 − θ)W + θZ_new,

θ_new = (√(θ⁴ + 4θ²) − θ²)/2.

If a termination criterion is not met, go to Step 1.
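A sketch of the PAPGavg loop (ours; fixed budget), showing the running gradient sum G and the weight ϑ:

```python
import numpy as np

def papg_avg(A, B, mu, max_iter=300):
    # Sketch of the gradient-averaging variant: Z is driven by the
    # weighted sum G of all past gradients, with weight vartheta.
    M = A.T @ A
    L = np.linalg.eigvalsh(M)[-1]        # L_P
    n, m = A.shape[1], B.shape[1]
    Z = W = np.zeros((n, m))
    G = np.zeros((n, m))
    theta, vartheta = 1.0, 0.0
    for _ in range(max_iter):
        Y = (1 - theta) * W + theta * Z
        G = G + (M @ Y - A.T @ B) / theta
        vartheta = vartheta + 1 / theta
        U, s, Vt = np.linalg.svd(-G / L, full_matrices=False)
        Z = U @ np.diag(np.maximum(s - mu * vartheta / L, 0)) @ Vt
        W = (1 - theta) * W + theta * Z
        theta = (np.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return W
```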

How do the primal and dual gradient methods compare? Since C = M⁻¹, we have that L_D = 1/λmin(M). The iteration complexity results suggest that when L_P = λmax(M) is not too large, a primal gradient method (e.g., PAPG, PAPGavg) should be applied; and when λmin(M) is not too small, so that L_D is not too large, a dual gradient method (e.g., DCG, DGP, DAGP) should be applied. The ‘tricky’ case is when λmax(M) is large and λmin(M) is small.

5 Alternative primal formulations

In this section we derive alternative formulations of (1) that may be easier to solve than (1) and (8) when r ≤ m. Although this case does not arise in the multi-task learning applications we consider later, it can arise when there are relatively few data points or few features.

An equivalent reformulation of the trace norm is given by (see [18, Lemma 1])

‖W‖_∗ = (1/2) inf_{Q≻0} ⟨W, Q⁻¹W⟩ + ⟨I, Q⟩,

where Q ∈ R^{n×n} is symmetric and positive definite. Then the primal problem (1) can be written in augmented form as

υ = inf_{Q≻0, W} (1/2) f(Q, W),   (28)

where

f(Q, W) := ‖AW − B‖²_F + µ⟨W, Q⁻¹W⟩ + µ⟨I, Q⟩.   (29)

By (5), p(W) = inf_{Q≻0} (1/2) f(Q, W).

Any global minimizer (Q, W ) of (28) is a stationary point of f , i.e., ∇f(Q, W ) = 0. On the other hand, by [9, Proposition 8.5.15(xiv)], (Q, W ) 7→ WTQ−1W is an operator-convex mapping. Hence f is convex, and any stationary point of f is a global minimizer. However, a stationary point may not exist due to Q becoming singular along a minimizing sequence (e.g., B = 0). In fact, we shall see that A having full column rank (i.e., r = n) is a necessary condition for a stationary point of f to exist.

5.1 Existence/uniqueness of stationary point for augmented problem

We derive below a simple necessary and sufficient condition for a stationary point of f to exist. Moreover, the stationary point, if it exists, is unique, and is obtainable by taking one matrix square root and two matrix inversions of r × r symmetric positive definite matrices.

Proposition 3. (a) If a stationary point of f exists, then it is unique and is given by

Q = (EEᵀ)^{1/2} − µC,   (30)

W = (µQ⁻¹ + M)⁻¹AᵀB,   (31)

with M ≻ 0, where M, C and E are given by (7).

(b) A stationary point of f exists if and only if M ≻ 0 and

(EEᵀ)^{1/2} ≻ µC.   (32)

Proof. (a) It is straightforward to verify that

∇f(Q, W) = (−µQ⁻¹WWᵀQ⁻¹ + µI, 2µQ⁻¹W + 2Aᵀ(AW − B)).

Suppose (Q, W) is a stationary point of f, so that Q ≻ 0 and ∇f(Q, W) = 0. Then

(WᵀQ⁻¹)ᵀ(WᵀQ⁻¹) = I,   (33)

µQ⁻¹W + MW = AᵀB.   (34)

By (33), the columns of WᵀQ⁻¹ are orthonormal and hence rank(W) = n. Since M ⪰ 0, so that µQ⁻¹ + M ≻ 0, this implies that the left-hand side of (34) has rank n. Thus rank(AᵀB) = n, implying r = rank(A) = n and hence M = AᵀA ≻ 0.

Upon multiplying each side of (34) by its transpose and using (33), we have

(µI + MQ)(µI + QM) = AᵀBBᵀA.

Since M is invertible, multiplying both sides by M⁻¹ on the left and on the right yields

(µM⁻¹ + Q)² = M⁻¹AᵀBBᵀAM⁻¹.

Since µM⁻¹ + Q ≻ 0, this in turn yields

µM⁻¹ + Q = (M⁻¹AᵀBBᵀAM⁻¹)^{1/2}.

Solving for Q and using (7) yields (30). The formula (31) for W follows from (34). This shows that the stationary point of f, if it exists, is unique and given by (30) and (31).

(b) If a stationary point (Q, W) of f exists, then Q ≻ 0 and, by (a), M ≻ 0 and Q is given by (30). This proves (32). Conversely, assume M ≻ 0 and (32). Let Q be given by (30). Then Q ≻ 0. Hence the right-hand side of (31) is well defined. Let W be given by (31). Then (Q, W) satisfies (34). Let U = WᵀQ⁻¹. Then

UᵀU = Q⁻¹WWᵀQ⁻¹
    = Q⁻¹(µQ⁻¹ + M)⁻¹AᵀBBᵀA(µQ⁻¹ + M)⁻¹Q⁻¹
    = (µI + MQ)⁻¹AᵀBBᵀA(µI + QM)⁻¹.   (35)

We also have from (7) and (30) that

Q = (M⁻¹AᵀBBᵀAM⁻¹)^{1/2} − µM⁻¹
⟺ (µM⁻¹ + Q)² = M⁻¹AᵀBBᵀAM⁻¹
⟺ (µI + MQ)(µI + QM) = AᵀBBᵀA,

which together with (35) yields UᵀU = I. It follows that (Q, W) satisfies (33) and (34) and hence is a stationary point of f.

Proposition 3 shows that r = n is necessary for (28) to have a stationary point. By Proposition 1, we can without loss of generality assume that r = n, so that M ≻ 0. It then remains to check condition (32), which is relatively cheap to do. Since E is r × m, a necessary condition for (32) to hold is r ≤ m. For Gaussian or uniformly distributed A and B, (32) appears to hold with probability near 1 whenever r ≤ m. Thus when r ≤ m, it is worth checking (32) first. A sufficient condition for (32) to hold is A^TBB^TA ≻ µ^2 I (since this implies M^{-1}A^TBB^TAM^{-1} ≻ µ^2 M^{-2} and the matrix square root is operator-monotone), which is even easier to check than (32).
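As a sanity check, the construction in Proposition 3 is easy to exercise numerically. The NumPy sketch below uses hypothetical random data (the dimensions, seed, and the helper psd_sqrt are illustrative, not from the paper; the paper's own code is in Matlab): it builds Q and W from (30)–(31) and verifies the stationarity conditions (33) and (34).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, m, mu = 40, 8, 12, 0.01   # r = n here, so M = A^T A is positive definite
A = rng.standard_normal((p, n))
B = rng.standard_normal((p, m))

M = A.T @ A
C = np.linalg.inv(M)
E = C @ (A.T @ B)               # least-squares solution, as in (7)

def psd_sqrt(S):
    """Symmetric PSD square root via an eigen-factorization."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

Q = psd_sqrt(E @ E.T) - mu * C                            # (30)
assert np.linalg.eigvalsh(Q).min() > 0, "condition (32) fails for this instance"
W = np.linalg.solve(mu * np.linalg.inv(Q) + M, A.T @ B)   # (31)

# stationarity: orthonormality (33) and the linear equation (34)
U = W.T @ np.linalg.inv(Q)
print(np.allclose(U.T @ U, np.eye(n), atol=1e-6))
print(np.allclose(mu * np.linalg.inv(Q) @ W + M @ W, A.T @ B, atol=1e-6))
```

Here (34) holds by construction of W, so the substantive check is (33), which holds precisely because Q was built so that µC + Q = (EE^T)^{1/2}.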


5.2 A reduced primal formulation

By Proposition 1, we can without loss of generality assume that r = n, so that M ≻ 0. Although the minimum in (28) still may not be attained in Q, we show below that in this case (28) is reducible to a 'nice' convex optimization problem in Q for which the minimum is always attained. Since Q is r × r and symmetric, this reduced problem has lower dimension than (1) and (8) when r ≤ m.

Consider the reduced primal function

h(Q) := inf_W f(Q, W),   Q ≻ 0.

Since f(Q, ·) is convex quadratic, its infimum is attained at the W given by (31). Plugging this into (29) and using M = A^TA and (µQ^{-1} + M)^{-1} = C − µC(µC + Q)^{-1}C yields

h(Q) = µ⟨I, Q⟩ + µ⟨EE^T, (µC + Q)^{-1}⟩ − ⟨E, A^TB⟩ + ‖B‖_F^2. (36)

Since f is convex, this and the continuity of h(Q) on Q ⪰ 0 imply that h(Q) is convex on Q ⪰ 0; see for example [11, Section 3.2.5] and [42, Theorem 5.7]. Moreover, we have

υ = min_{Q ⪰ 0} h(Q). (37)

The following result shows that (37) always has a minimizer and that, from any minimizer of (37), we can construct an ε-minimizer of (28) for any ε > 0.

Lemma 1. The problem (37) has a minimizer. Let Q̄ ⪰ 0 be any of its minimizers. For any ε > 0, define

Qε := Q̄ + εI,   Wε := (µQε^{-1} + M)^{-1}A^TB. (38)

Then lim_{ε↓0} f(Qε, Wε) = inf_{Q≻0, W} f(Q, W).

Proof. We first show that a minimizer of (37) exists. By (29), we have f(Q, W) ≥ µ⟨I, Q⟩ for all Q ≻ 0 and W. Hence h(Q) ≥ µ⟨I, Q⟩. Since h(Q) is continuous on Q ⪰ 0, this holds for all Q ⪰ 0. Then, for any α ∈ R, the set {Q ⪰ 0 | h(Q) ≤ α} is contained in {Q ⪰ 0 | µ⟨I, Q⟩ ≤ α}, which is bounded (since µ > 0). Hence a minimizer of (37) exists.

Next, for any ε > 0, define Qε and Wε as in (38). It follows from (31) and the definition of Wε that f(Qε, Wε) = h(Qε). Then the continuity of h and the definition of Q̄ yield that

lim_{ε↓0} f(Qε, Wε) = lim_{ε↓0} h(Qε) = h(Q̄) = min_{Q ⪰ 0} h(Q).

Since inf_{Q≻0, W} f(Q, W) = min_{Q ⪰ 0} h(Q), the conclusion follows.

In view of Lemma 1, it suffices to solve (37), which then yields the optimal value and a nearly optimal solution of (28). Direct calculation using (36) yields

∇h(Q) = µI − µ(µC + Q)^{-1}EE^T(µC + Q)^{-1}. (39)

It can be shown, using R^{-1} − S^{-1} = R^{-1}(S − R)S^{-1} for any nonsingular R, S ∈ R^{r×r}, that ∇h is Lipschitz continuous on the feasible set of (37) with Lipschitz constant

Lh = (2/µ^2) λmax(EE^T) λmax(M)^3.


Thus gradient-projection methods can be applied to solve (37). Computing ∇h(Q) by (39) requires O(r^3) flops, as does projection onto the feasible set of (37), which requires an eigen-factorization. The feasible set of (37) is unbounded, so we cannot use the conditional gradient method nor obtain a duality gap bound analogous to (23). For the multi-task learning applications we consider later, r is much larger than m, so (37) has a much higher dimension than (1) and (8). However, when r ≤ m or Lh is smaller than LP and LD, (37) may be easier to solve than (1) and (8). Unlike LP and LD, Lh also depends on B (through E) and is small when ‖B‖_F is small. However, Lh grows cubically with λmax(M).
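The expressions (36) and (39), and a gradient-projection step on (37), can be checked directly. Below is a NumPy sketch with hypothetical random data and sizes (the paper's experiments are in Matlab): it compares ∇h against a central finite difference and takes one projected-gradient step with stepsize 1/Lh, projecting onto the PSD cone by clipping eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, m, mu = 30, 6, 10, 0.5
A = rng.standard_normal((p, n))
B = rng.standard_normal((p, m))
M = A.T @ A
C = np.linalg.inv(M)
E = C @ (A.T @ B)

def h(Q):           # reduced primal function (36)
    R = np.linalg.inv(mu * C + Q)
    return (mu * np.trace(Q) + mu * np.sum((E @ E.T) * R)
            - np.sum(E * (A.T @ B)) + np.linalg.norm(B, 'fro') ** 2)

def grad_h(Q):      # gradient (39)
    R = np.linalg.inv(mu * C + Q)
    return mu * np.eye(n) - mu * R @ (E @ E.T) @ R

# finite-difference check of (39) along a random symmetric direction
Q = np.eye(n)
D = rng.standard_normal((n, n)); D = (D + D.T) / 2
t = 1e-6
fd = (h(Q + t * D) - h(Q - t * D)) / (2 * t)
print(abs(fd - np.sum(grad_h(Q) * D)) < 1e-4)

# one gradient-projection step; the PSD projection uses an eigen-factorization
Lh = (2 / mu ** 2) * np.linalg.eigvalsh(E @ E.T).max() * np.linalg.eigvalsh(M).max() ** 3

def proj_psd(S):
    w, V = np.linalg.eigh((S + S.T) / 2)
    return V @ np.diag(np.clip(w, 0, None)) @ V.T

Q1 = proj_psd(Q - grad_h(Q) / Lh)
print(h(Q1) <= h(Q))   # a 1/Lh projected-gradient step cannot increase h
```

Note how large Lh is even for this tiny instance, reflecting the cubic growth in λmax(M) discussed above.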

5.3 An alternative SDP reformulation

By using (36), we can rewrite (37) as

min_{Q,U}  µ⟨I, Q⟩ + µ⟨I, U⟩ − ⟨E, A^TB⟩ + ‖B‖_F^2
subject to  Q ⪰ 0,  U ⪰ E^T(Q + µC)^{-1}E.

Since Q + µC is invertible for any Q ⪰ 0, it follows from a basic property of the Schur complement that this problem is equivalent to

min_{Q,U}  µ⟨I, Q⟩ + µ⟨I, U⟩ − ⟨E, A^TB⟩ + ‖B‖_F^2
subject to  Q ⪰ 0,
            [ Q + µC   E ]
            [   E^T    U ]  ⪰ 0, (40)

where Q and U are symmetric r × r and m × m matrices, respectively. It can be shown that the Lagrangian dual of (40) is reducible to the dual problem (8). For n ≥ 1000, the size of the SDP (40) is beyond existing SDP solvers such as SeDuMi, CSDP, and SDPT3. Whether interior-point methods can be implemented to solve (40) efficiently requires further study.
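The Schur-complement step used above can be illustrated numerically: for Q + µC ≻ 0, the block matrix in (40) is positive semidefinite exactly when U ⪰ E^T(Q + µC)^{-1}E. A small NumPy sketch with hypothetical data (sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
r, m, mu = 5, 4, 0.3
G = rng.standard_normal((10, r))
C = np.linalg.inv(G.T @ G)          # stand-in for C = M^{-1}, positive definite
E = rng.standard_normal((r, m))
Q = np.eye(r)

S = np.linalg.solve(Q + mu * C, E)  # (Q + µC)^{-1} E
U_min = E.T @ S                     # Schur-complement bound E^T (Q + µC)^{-1} E

def is_psd(X, tol=1e-9):
    return np.linalg.eigvalsh((X + X.T) / 2).min() >= -tol

# block PSD  <=>  U - U_min PSD, tested on, above, and below the boundary
for U in (U_min, U_min + 0.1 * np.eye(m), U_min - 0.1 * np.eye(m)):
    block = np.block([[Q + mu * C, E], [E.T, U]])
    print(is_psd(block) == is_psd(U - U_min))
```

This is exactly why, at an optimum of (40), U is the Schur complement itself: any larger U only increases the objective term µ⟨I, U⟩.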

6 Comparing solution methods

In this section we compare the solution methods described in Section 4 on simulated and real data, with µ fixed or varying. We consider different initialization strategies, as well as warm start when µ is varying.

All methods are coded in Matlab, using Matlab’s rank, QR decomposition, and SVD routines. Reported times are obtained using Matlab’s tic/toc timer on an HP DL360 workstation, running Red Hat Linux 3.5 and Matlab Version 7.2.

We use eight data sets in our tests. The first six are simulated data, with the entries of A generated independently and uniformly in [0, 1] and the entries of B generated independently in {−1, 1} with equal probabilities. These simulate the data in our application of gene expression pattern analysis, which we discuss in more detail in Section 7. The last two are real data from this application, with A having entries in [0, 1] and B having entries in {−1, 1} (10–30% of the entries are 1). These tests are used to select the methods best suited for our application. (We also ran tests with random Gaussian A and B, and the relative performances of the methods seem largely unchanged.) When r < n, we reduce the columns of A using its QR decomposition as described in Section 2. Table 1 shows the size of the data sets, as well as λmax(M), λmax(C), and µ0 for the reduced problem, with M and C given by (7) and µ0 given by (19). We see that λmax(M) is generally large, while λmax(C) is generally small except when n is near p. As we shall see, this has a significant effect on the performance of the solution methods. Table 1 also reports timered, the time (in seconds) to compute r = rank(A) and, when r < n, to compute the reduced data matrices; see Proposition 1.


set    m     n     p   λmax(M)  λmax(C)   µ0   timered
 1    50   500   500     6e4      3e4     2e3    1e0
 2    50   500  1500     2e5     4e-2     3e3    2e0
 3    50  2000  1500     8e5     3e-1     6e3    9e1
 4    50  2000  3500     2e6     6e-2     8e3    1e2
 5    50  3000  1500     1e6     5e-2     8e3    1e2
 6    50  3000  3500     3e6     6e-1     1e4    3e2
 7    10  3000  2228     2e3      4e2     3e3    3e2
 8    60  3000  2754     2e3      4e3     2e4    4e2

Table 1: Statistics for the eight data sets used in our tests. Floating point numbers are shown with their leading digits only.

We first consider solving (1) with fixed µ < µ0. Based on the values of µ0 in Table 1, we choose µ = 100, 10, 1 in our tests. We implemented nine methods: DCG, DGP, a DGP variant with L = LD/1.95 and α = 1, DAGP and its two variants, and PAPG, PAPGavg, and their other variant. All methods are terminated based on the relative duality gap criterion described in Section 4, with tol = 10^{-3}; see (22).

This criterion is checked every 500 iterations. For DCG and DGP, we also terminate when the line search stepsize (21) is below 10^{-8}. For the other methods, we also terminate when the iterate changes by less than 10^{-8}, i.e., ‖Λnew − Λ‖_F < 10^{-8} for DAGP and its variants, and ‖Wnew − W‖_F < 10^{-8} for PAPG and its variants. These additional termination criteria are checked at each iteration. We cap the number of iterations at 5000. All methods except PAPGavg require an initial iterate. For µ large, the solution of (1) is near 0, suggesting the "zero-W (ZW) initialization" W = 0 for primal methods and, correspondingly, Λ = ProjD(−ME) for dual methods; see (27). We also tried scaling −ME instead of projecting, but this resulted in slower convergence. For µ small, the solution of (1) is near the least-squares solution E, suggesting the "least-squares (LS) initialization" W = E for primal methods and, correspondingly, Λ = 0 for dual methods.
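The projection ProjD onto the dual feasible set, an operator-norm ball of radius µ, is the main per-iteration cost of the dual methods and amounts to clipping singular values at µ. A minimal NumPy sketch (the function name and data are illustrative; the paper's code is in Matlab):

```python
import numpy as np

def proj_opnorm_ball(L, mu):
    """Euclidean projection of L onto {Λ : ||Λ||_2 <= µ}: clip singular values at µ."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return U @ np.diag(np.minimum(s, mu)) @ Vt

rng = np.random.default_rng(3)
L = rng.standard_normal((8, 5))
P = proj_opnorm_ball(L, 1.0)
print(np.linalg.norm(P, 2) <= 1.0 + 1e-12)                    # projected point is feasible
print(np.allclose(proj_opnorm_ball(0.5 * P, 1.0), 0.5 * P))   # feasible points are fixed
```

One SVD per iteration is what makes the dual iterations O(min(r, m) r m) plus the cost of the gradient.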

In our tests, DGP consistently outperforms DCG, while DGP performs comparably to its α = 1 variant; DAGP mostly outperforms its two variants; and PAPG and PAPGavg mostly outperform their other variant. Also, DGP using the ZW initialization performs comparably to using the LS initialization, and DAGP using the LS initialization is never 10% slower than using the ZW initialization and is often faster. Thus we report results only for DGP (ZW initialization), DAGP (LS initialization), PAPG (ZW and LS initializations), and PAPGavg. Specifically, we report in Table 2 the run time (in seconds, including timered), the number of iterations (iter), and the relative duality gap (gap), given by (22), on termination.

In general, the dual methods perform better (i.e., lower time, iter, and gap) when µ is small, while the primal method performs better when µ is large. As might be expected, when µ is large, the ZW initialization works better and, when µ is small, the LS initialization works better. Also, the primal method performs worse when λmax(M ) is large, while the dual methods perform worse when λmax(C) is large, which corroborates the analysis at the end of Section 4.2. The dual methods DGP and DAGP have comparable performance when λmax(C) and µ are not large, while DAGP significantly outperforms DGP when λmax(C) is large. DAGP is also the most robust in the sense that it solves all but two instances under 5000 iterations. DGP is the next most robust, followed by PAPGavg and then PAPG, whose performance is very sensitive to the initialization. However, PAPG is useful for warm start, in contrast to PAPGavg. When p < n are both large, column reduction takes a significant amount of time. Thus the methods can all benefit from more efficient rank computation and QR decomposition.

Since QR decomposition is computationally expensive when p < n are both large, we also applied


set   µ    PAPG (ZW init)   PAPG (LS init)   PAPGavg          DGP (ZW init)    DAGP (LS init)
           (iter/time/gap)  (iter/time/gap)  (iter/time/gap)  (iter/time/gap)  (iter/time/gap)
 1   100   500/3e1/3e-4     max/3e2/3e-1     500/3e1/6e-4     max/4e2/5e-1     max/3e2/3e-2
      10   4000/2e2/6e-4    max/3e2/2        2000/1e2/7e-4    max/4e2/3e-1     2000/1e2/6e-4
       1   max/3e2/2e-2     max/3e2/2        max/3e2/1e-2     3500/3e2/5e-4    500/3e1/1e-4
 2   100   1000/6e1/3e-4    1000/6e1/7e-4    1000/6e1/5e-4    38/7/2e-16       176/1e1/0
      10   1500/8e1/7e-4    500/3e1/5e-4     1000/6e1/8e-4    10/4/1e-16       17/4/1e-16
       1   2000/1e2/7e-4    500/3e1/4e-5     2000/1e2/5e-4    4/4/0            7/4/2e-16
 3   100   2000/8e2/6e-4    max/2e3/2e-3     2500/9e2/6e-4    64/1e2/6e-15     459/3e2/5e-15
      10   max/2e3/1e-2     max/2e3/2e-3     max/2e3/2e-3     22/1e2/2e-14     38/1e2/3e-14
       1   max/2e3/4e-1     max/2e3/2e-3     max/2e3/3e-1     5/1e2/4e-13      12/1e2/2e-13
 4   100   2500/2e3/3e-4    2000/1e3/1e-4    2000/1e3/8e-4    22/2e2/2e-15     85/2e2/2e-15
      10   max/3e3/1e-3     1000/7e2/9e-5    4000/2e3/6e-4    9/2e2/1e-15      16/2e2/2e-15
       1   max/3e3/3e-3     1000/7e2/1e-5    max/3e3/2e-3     7/2e2/3e-15      7/2e2/3e-15
 5   100   3500/1e3/8e-4    1500/6e2/3e-4    3000/1e3/8e-4    21/2e2/6e-15     81/2e2/7e-15
      10   max/2e3/1e-2     3500/1e3/7e-4    max/2e3/2e-3     7/2e2/7e-14      15/2e2/6e-14
       1   max/2e3/4e-1     3500/1e3/7e-4    max/2e3/3e-1     5/2e2/7e-13      7/2e2/5e-13
 6   100   2500/3e3/1e-3    max/6e3/7e-3     2500/3e3/1e-3    84/9e2/2e-15     500/1e3/6e-16
      10   max/6e3/1e-2     3500/5e3/7e-4    max/6e3/8e-3     17/5e2/3e-15     41/5e2/2e-15
       1   max/6e3/3e-2     3500/5e3/4e-4    max/6e3/3e-2     7/5e2/7e-15      10/5e2/2e-15
 7   100   500/4e2/2e-8     1000/5e2/2e-4    500/4e2/8e-7     max/2e3/9e-3     2000/7e2/7e-4
      10   500/4e2/9e-4     max/1e3/2e-3     1000/5e2/9e-5    500/5e2/2e-4     500/4e2/4e-6
       1   5000/1e3/9e-4    max/1e3/5e-3     3500/9e2/7e-4    74/4e2/3e-14     275/4e2/1e-15
 8   100   500/1e3/1e-5     1500/2e3/2e-4    500/1e3/2e-6     max/1e4/9e-2     max/7e3/3e-3
      10   1000/2e3/9e-4    max/7e3/2e-3     1000/2e3/2e-4    max/1e4/2e-3     1000/2e3/5e-4
       1   max/7e3/2e-3     max/7e3/2e-2     4000/6e3/7e-4    500/2e3/1e-5     500/1e3/2e-7

Table 2: Comparing the performances of PAPG, PAPGavg, DGP, and DAGP on the data sets from Table 1, with fixed µ and one of two initialization strategies. Floating point numbers are shown with their leading digits only. ("max" indicates that the number of iterations reached the maximum limit of 5000. The lowest time is highlighted in each row.)

PAPG to the primal problem (1) without the column reduction, which we call PAPG orig. However, without column reduction, M may be singular, in which case the dual function d given by (8) is undefined and we cannot use the duality gap to check for termination. Instead, we use the same termination criterion as in [44], namely,

‖Wnew − W‖_F / (‖W‖_F + 1) < 10^{-4}.

For comparison, the same termination criterion is used for PAPG. The initialization W = 0 is used for both. Perhaps not surprisingly, PAPG and PAPG orig terminate in the same number of iterations with the same objective value. Their run times are comparable (within 20% of each other) on data sets 1, 2, 4, and 6. On 3 and 5, PAPG orig is about 1.5–3 times slower than PAPG for all three values of µ. On 7 and 8, PAPG is about 3 times slower than PAPG orig for µ = 100 and comparable for the other values of µ.

This seems to be due to a combination of (i) very high cost of column reduction in PAPG (see Table 1) and (ii) relatively few iterations in PAPG orig (due to 0 being near the solution), so the latter’s higher computational cost per iteration is not so significant. Thus, it appears that PAPG orig has an advantage over PAPG mainly when p < n are both large (so column reduction is expensive) and a good initial W is known, such as when µ is large (W = 0) or when µ is small (W being the least squares solution).

We next consider solving (1) with varying µ < µ0, for which warm start can be used to re-solve the problem as µ varies. In our tests, we use µ = 100, 50, 10, 5, 1, 0.1 in descending order, which roughly

(18)

covers the range of µ values used in our application in Section 7. For simplicity, we use only the last four data sets from Table 1. As for the solution method, the results in Tables 1 and 2 suggest using PAPG when µ and λmax(C) are large, and otherwise using DAGP. After some experimentation, we settled on the following switching rule: use PAPG when

µ ≥ λmax(M) µ0 / (√r (2λmax(C) + λmax(M))), (41)

and otherwise use DAGP, where µ0 is given by (19) and r = rank(A). We initialize PAPG with W = 0 and DAGP with Λ = 0, consistent with Table 2. The column reduction described in Section 2 is done once at the beginning; the reduced data matrices are then used for all values of µ. The termination criteria for each µ value are the same as in our tests with fixed µ. Each time µ is decreased, we use the solutions W and Λ found for the old µ value to initialize either PAPG or DAGP for the new µ value.

For DAGP, we further project Λ onto the feasible set of (8) corresponding to the new µ value (since the new feasible set is smaller than before). We also tried scaling Λ instead of projecting, but the resulting method performed worse.

In Table 3, we report the performance of this primal-dual method. It shows, for each µ value, the method used (p/d), the (cumulative) run time (in seconds), the (cumulative) number of iterations (iter), and the relative duality gap (gap), computed as in (22), on termination. Compared with Table 2, the switching rule (41) seems effective in selecting the “right” method for each µ.

set   µ = 100             µ = 50              µ = 10              µ = 5               µ = 1               µ = 0.1
      (alg/iter/time/gap) (alg/iter/time/gap) (alg/iter/time/gap) (alg/iter/time/gap) (alg/iter/time/gap) (alg/iter/time/gap)
 5    d/81/2e2/7e-15      d/121/2e2/1e-14     d/136/2e2/4e-14     d/146/2e2/1e-13     d/151/2e2/5e-13     d/153/2e2/3e-12
 6    d/500/1e3/2e-15     d/708/1e3/7e-16     d/750/1e3/6e-15     d/773/1e3/4e-15     d/783/1e3/6e-15     d/787/1e3/6e-15
 7    p/500/4e2/2e-8      p/1000/5e2/4e-5     d/1500/6e2/1e-6     d/2000/7e2/1e-8     d/2317/7e2/6e-14    d/2340/7e2/2e-13
 8    p/500/1e3/1e-5      p/1000/2e3/8e-5     p/2000/3e3/9e-4     d/2500/4e3/5e-5     d/3000/4e3/1e-7     d/3452/5e3/4e-12

Table 3: The performance of the primal-dual method on the last four data sets from Table 1, with varying µ and warm start. Floating point numbers are shown with their leading digits only. ("p" stands for PAPG and "d" stands for DAGP.)

7 Experiment

In this section, we evaluate the effectiveness of the proposed methods on biological image annotation.

The Drosophila gene expression pattern images document the spatial and temporal dynamics of gene expression and provide valuable resources for elucidating gene functions, interactions, and networks during Drosophila embryogenesis. To support text-based pattern searching, the images in the Berkeley Drosophila Genome Project (BDGP) study are annotated with ontology terms manually by human curators [45]. In particular, images in the BDGP study are annotated collectively in small groups based on the genes and the developmental stages. We propose to encode each image group as a feature vector based on the bag-of-words and sparse coding representations [15, 26, 50]. Both schemes rely on a pre-computed codebook, which consists of representative local visual features computed on the images. The codebook is obtained by applying the k-means clustering algorithm to a subset of the local features; the cluster centers are then used as the visual words of the codebook. Each feature in an image is then quantized based on the computed codebook, and an entire image group is represented as a global histogram counting the number of occurrences of each word in the codebook. We call this
