**for Multi-task Learning**

Paul Tseng

Mathematics, University of Washington Seattle

Optimization Seminar, Univ. Washington January 27, 2009

Joint work with Ting Kei Pong and Jieping Ye (ASU)

**Prologue**

This story began with an innocent-looking email..

**A Question..**

On Thu, 18 Sep 2008, Jieping Ye wrote:

Dr. Tseng,

I recently came across your interesting work on the block coordinate descent method for non-differentiable optimization. I wonder whether the convergence result will apply for the matrix case where each block is a positive definite matrix, i.e., min f(X, Y, Z), where X, Y, Z are positive definite matrices.

I will appreciate it if you can provide some relevant references on this if any. Thanks!

Best, Jieping

**The Problem**

min_{Q ≻ 0, W} f(Q,W) := tr(W Q^{−1}W^{T}) + tr(Q) + ‖WA − B‖^{2}_{F}

Q ∈ ℝ^{n×n}, W ∈ ℝ^{m×n} (A ∈ ℝ^{n×p} and B ∈ ℝ^{m×p} are given), ‖B‖_{F} = (Σ_{i,j} B_{ij}^{2})^{1/2}

### Note

• f(Q,W) is diff., convex in Q for each W, convex in W for each Q.

• The min is finite, but may not be attained (e.g., when B = 0).

• If min is attained, it's attained at a critical pt, i.e., ∇f(Q,W) = 0.
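To make the objective concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that evaluates f(Q,W) on small random data; the sizes and the choice Q = I are arbitrary.

```python
import numpy as np

def f(Q, W, A, B):
    """Objective f(Q,W) = tr(W Q^{-1} W^T) + tr(Q) + ||WA - B||_F^2."""
    return (np.trace(W @ np.linalg.solve(Q, W.T))
            + np.trace(Q)
            + np.linalg.norm(W @ A - B, 'fro') ** 2)

rng = np.random.default_rng(0)
n, m, p = 5, 3, 7                # illustrative sizes
A = rng.random((n, p))           # given data
B = rng.random((m, p))
Q = np.eye(n)                    # any positive definite Q
W = rng.standard_normal((m, n))
print(f(Q, W, A, B))
```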

On Wed, 24 Sep 2008, Jieping Ye wrote:

Dear Paul, ...

The multi-task learning problem comes from our biological application: Drosophila gene expression pattern analysis (funded by NSF and NIH).

...

Thanks, Jieping

**First Try**

∇f(Q,W) = (−Q^{−1}W^{T}W Q^{−1} + I, 2W Q^{−1} + 2(WA − B)A^{T})

So ∇f(Q,W) = 0 implies

(W Q^{−1})^{T}W Q^{−1} = I,  W Q^{−1} + WM = BA^{T}

where M := AA^{T}.

So rank(W) = n, rank(BA^{T}) = n, ... , and

(I + MQ)(I + QM) = AB^{T}BA^{T}

### Prop. 1

If f has a stationary pt, then rank(BA^{T}) = n, M ≻ 0, and

Q = (M^{−1}AB^{T}BA^{T}M^{−1})^{1/2} − M^{−1},  W = BA^{T}(QM + I)^{−1}Q
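A small NumPy sketch (mine, not from the talk) of Prop. 1: build Q and W from the closed-form expressions and check that the gradient from the First Try slide vanishes. Scaling B up is only a heuristic so that the positive-definiteness condition on Q tends to hold for random data; sqrtm_psd is a helper defined inline.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric PSD square root via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(1)
n, m, p = 3, 4, 6                     # m >= n so rank(B A^T) = n is possible
A = rng.standard_normal((n, p))
B = 10.0 * rng.standard_normal((m, p))   # scaled up so Q below tends to be > 0
M = A @ A.T                              # assumed nonsingular here (rank A = n)
Minv = np.linalg.inv(M)

Q = sqrtm_psd(Minv @ A @ B.T @ B @ A.T @ Minv) - Minv    # Prop. 1
W = B @ A.T @ np.linalg.inv(Q @ M + np.eye(n)) @ Q       # Prop. 1

# Q must be positive definite for this to be a valid stationary point
print("min eig of Q:", np.linalg.eigvalsh(Q).min())

Qinv = np.linalg.inv(Q)
grad_Q = -Qinv @ W.T @ W @ Qinv + np.eye(n)               # from the First Try slide
grad_W = 2 * W @ Qinv + 2 * (W @ A - B) @ A.T
print("||grad_Q||:", np.linalg.norm(grad_Q))              # ~ 0
print("||grad_W||:", np.linalg.norm(grad_W))              # ~ 0
```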

**But..**

Date: Sat, 25 Oct 2008 16:48:20 -0700

Dear Paul,

Thanks for the writeup. Very interesting.

...

Unfortunately, M is commonly not positive definite in our applications.

...

Thanks, Jieping

**Second Try**

Suppose M = AA^{T} is singular, so r := rank(A) < n.

Use SVD or QR decomp. of A:

A = R [ Ã ; 0 ] S^{T}   (Ã stacked over a zero block)

with Ã ∈ ℝ^{r×p}, R^{T}R = I and S^{T}S = I. Let B̃ := BS.

### Prop. 2

min_{Q ≻ 0, W} f(Q,W) = min_{Q̃ ≻ 0, W̃} f̃(Q̃,W̃), where

f̃(Q̃,W̃) := tr(W̃ Q̃^{−1}W̃^{T}) + tr(Q̃) + ‖W̃Ã − B̃‖^{2}_{F}.

Then recover Q, W from Q̃, W̃.

Moreover, f̃ has a stationary pt iff

(M̃^{−1}ÃB̃^{T}B̃Ã^{T}M̃^{−1})^{1/2} ≻ M̃^{−1}

where M̃ := ÃÃ^{T}.
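One way to realize this reduction numerically, sketched with NumPy (the slides allow either SVD or QR; this uses the full SVD, and all sizes are illustrative): Ã collects the rows belonging to the nonzero singular values, and B̃ := BS.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p, r = 6, 3, 8, 4
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))  # rank r < n, so M = AA^T is singular
B = rng.standard_normal((m, p))

# Full SVD: A = R [Atil; 0] S^T with Atil of full row rank
R, sig, St = np.linalg.svd(A, full_matrices=True)
S = St.T
r_eff = int(np.sum(sig > 1e-10 * sig[0]))        # numerical rank
Sigma = np.zeros((n, p))
Sigma[:r_eff, :r_eff] = np.diag(sig[:r_eff])
Atil = Sigma[:r_eff, :]                          # r x p block [diag(sigma) 0]
Btil = B @ S                                     # B-tilde := B S

# sanity checks
block = np.vstack([Atil, np.zeros((n - r_eff, p))])
print(np.allclose(A, R @ block @ S.T))                    # A is reconstructed
print(np.linalg.matrix_rank(Atil @ Atil.T) == r_eff)      # M-tilde is nonsingular
```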

Done?

**But..**

Date: Thu, 30 Oct 2008 10:44:56 -0700

Dear Tseng,

Thanks. I like the derivation.

It seems the condition in Eq. (4) is the key.

We need to somehow relax this condition.

Will perturbation solve this problem?

Best, Jieping

**Third Try**

Assume w.l.o.g. rank(A) = n. Let

h(Q) := inf_{W} f(Q,W)
      = inf_{W} tr(W Q^{−1}W^{T}) + tr(Q) + ‖WA − B‖^{2}_{F}
      = tr(Q) + tr(E^{T}E(Q + C)^{−1}) + const.

with C := M^{−1} ≻ 0 and E := BA^{T}C. (M = AA^{T})

Then h(Q) is cont. over Q ⪰ 0 (!), so

min_{Q ⪰ 0} h(Q) = inf_{Q ≻ 0, W} f(Q,W).

Moreover, (Q,W) ↦ W Q^{−1}W^{T} is operator-convex, so f is convex, and hence h is convex.
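A quick NumPy check (my own sketch, not from the slides) of the partial minimization: the inner minimizer W(Q) = BA^{T}(Q^{−1} + M)^{−1} comes from setting the W-gradient on the First Try slide to zero, and the two expressions for h should differ by the same additive constant for every Q ≻ 0.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 4, 3, 6
A = rng.standard_normal((n, p))   # rank A = n assumed, per the slide
B = rng.standard_normal((m, p))
M = A @ A.T
C = np.linalg.inv(M)              # C := M^{-1}
E = B @ A.T @ C                   # E := B A^T C

def h_formula(Q):
    # tr(Q) + tr(E^T E (Q + C)^{-1}), dropping the additive constant
    return np.trace(Q) + np.trace(E.T @ E @ np.linalg.inv(Q + C))

def h_by_minimizing_W(Q):
    # W(Q) = B A^T (Q^{-1} + M)^{-1} zeroes the W-gradient 2WQ^{-1} + 2(WA - B)A^T
    W = B @ A.T @ np.linalg.inv(np.linalg.inv(Q) + M)
    return (np.trace(W @ np.linalg.solve(Q, W.T)) + np.trace(Q)
            + np.linalg.norm(W @ A - B, 'fro') ** 2)

# the gap between the two expressions is the same constant for every Q > 0
for _ in range(3):
    G = rng.standard_normal((n, n))
    Q = G @ G.T + np.eye(n)
    print(h_by_minimizing_W(Q) - h_formula(Q))
```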

### Prop. 3

min_{Q ⪰ 0} h(Q) is attained, and

∇h(Q) = I − (Q + C)^{−1}E^{T}E(Q + C)^{−1}

is Lipschitz cont. over Q ⪰ 0.

Moreover, using the Schur complement, min_{Q ⪰ 0} h(Q) reduces to an SDP:

min tr(Q) + tr(U)  s.t.  Q ⪰ 0,  [ Q 0 ; 0 U ] + [ C E^{T} ; E 0 ] ⪰ 0

Recall C ≻ 0 is n × n and E is m × n. This SDP is solvable by existing IP solvers (SeDuMi, SDPT3, CSDP, Mosek, ..) for around m + n ≤ 500.
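A numerical illustration (mine) of the Schur-complement step: for any Q ⪰ 0, the smallest U allowed by the block constraint is E(Q + C)^{−1}E^{T}, and then tr(Q) + tr(U) reproduces h(Q) up to the additive constant.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, p = 4, 3, 6
A = rng.standard_normal((n, p))
B = rng.standard_normal((m, p))
C = np.linalg.inv(A @ A.T)            # C := M^{-1} (n x n, positive definite)
E = B @ A.T @ C                       # E (m x n)

G = rng.standard_normal((n, n))
Q = G @ G.T                           # any Q >= 0
U = E @ np.linalg.inv(Q + C) @ E.T    # smallest U allowed by the Schur complement

Z = np.block([[Q + C, E.T],
              [E,     U  ]])          # = [Q 0; 0 U] + [C E^T; E 0]
print("min eig of block matrix:", np.linalg.eigvalsh(0.5 * (Z + Z.T)).min())  # >= 0 up to roundoff
print("SDP objective tr(Q)+tr(U):", np.trace(Q) + np.trace(U))
print("h(Q) without the constant: ", np.trace(Q) + np.trace(E.T @ E @ np.linalg.inv(Q + C)))
```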

**But..**

Date: Mon, 1 Dec 2008 09:33:13 -0700 ...

In our application, n is around 1000-2000 and m is around 50-100.

...

It contains 1000-3000 rows depending on the feature extraction scheme. In general, X is dense. However, one of our recent feature extraction schemes produces sparse X. By the way, the columns of X correspond to biological images.

Best, Jieping

For m = 100, n = 2000, (Q,W) comprises 2,201,000 variables (n(n+1)/2 + mn). A ∈ ℝ^{n×p} may be dense.


**Fourth Try**

Lesson from my graduate student days:

“When stuck, look at the dual”

Consider the dual problem

max_{Λ ⪰ 0} min_{Q ⪰ 0, U} L(Q,U,Λ),

with Lagrangian (⟨W, Z⟩ = tr(W^{T}Z))

L(Q,U,Λ) := ⟨I,Q⟩ + ⟨I,U⟩ − ⟨ [ Λ_{1} Λ^{T}_{2} ; Λ_{2} Λ_{3} ], [ Q 0 ; 0 U ] + [ C E^{T} ; E 0 ] ⟩

= ⟨I − Λ_{1}, Q⟩ + ⟨I − Λ_{3}, U⟩ − ⟨Λ_{1}, C⟩ − 2⟨Λ_{2}, E⟩

For dual feas., need I ⪰ Λ_{1}, I = Λ_{3}, Λ_{1} ⪰ Λ^{T}_{2}Λ_{2}. Dual problem reduces to

min_{I ⪰ Λ_{1} ⪰ Λ^{T}_{2}Λ_{2}} ⟨C, Λ_{1}⟩ + 2⟨E, Λ_{2}⟩

Since C ≻ 0, the minimum w.r.t. Λ_{1} is attained at Λ_{1} = Λ^{T}_{2}Λ_{2}.

The dual problem reduces to (recall Λ_{2} ∈ ℝ^{m×n})

min_{I ⪰ Λ^{T}_{2}Λ_{2}} d_{2}(Λ_{2}) := (1/2)⟨C, Λ^{T}_{2}Λ_{2}⟩ + ⟨E, Λ_{2}⟩.

• No duality gap since the primal problem has an interior soln.

• Recovers Q as the Lagrange multiplier assoc. with I ⪰ Λ^{T}_{2}Λ_{2}.

• ∇d_{2}(Λ_{2}) = Λ_{2}C + E is Lipschitz cont. with constant L = λ_{max}(C).
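A small NumPy check (my own sketch) of the reduced dual: the gradient formula ∇d_{2}(Λ_{2}) = Λ_{2}C + E against finite differences, and weak duality, i.e., −2 d_{2}(Λ_{2}) ≤ tr(Q) + tr(U) for any dual-feasible Λ_{2} and primal-feasible (Q,U).

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, p = 4, 3, 6
A = rng.standard_normal((n, p))
B = rng.standard_normal((m, p))
C = np.linalg.inv(A @ A.T)          # C := M^{-1}
E = B @ A.T @ C

def d2(L2):
    return 0.5 * np.trace(C @ L2.T @ L2) + np.trace(E.T @ L2)

def grad_d2(L2):
    return L2 @ C + E

# finite-difference check of the gradient formula
L2 = rng.standard_normal((m, n))
D = rng.standard_normal((m, n))
t = 1e-6
fd = (d2(L2 + t * D) - d2(L2 - t * D)) / (2 * t)
print(fd, np.sum(grad_d2(L2) * D))              # directional derivatives should agree

# weak duality: -2*d2(feasible Lambda_2) <= tr(Q) + tr(U) for any primal-feasible (Q, U)
L2_feas = L2 / np.linalg.norm(L2, 2)            # largest singular value <= 1, so I >= L2^T L2
G = rng.standard_normal((n, n))
Q = G @ G.T                                     # any Q >= 0
U = E @ np.linalg.inv(Q + C) @ E.T              # smallest feasible U (Schur complement)
print(-2 * d2(L2_feas), "<=", np.trace(Q) + np.trace(U))
```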

What about the constraint I ⪰ Λ^{T}_{2}Λ_{2}?

### Prop. 4

For any Λ_{2} ∈ ℝ^{m×n} (m ≤ n) with SVD Λ_{2} = R [ D 0 ] S^{T},

Proj(Λ_{2}) := arg min_{I ⪰ Ψ^{T}_{2}Ψ_{2}} ‖Λ_{2} − Ψ_{2}‖^{2}_{F} = R [ min{D, I} 0 ] S^{T}
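Prop. 4 says the projection simply clips the singular values at 1. A minimal NumPy sketch (mine; sizes are illustrative):

```python
import numpy as np

def proj(L2):
    """Nearest point (Frobenius norm) with I >= Psi^T Psi: clip singular values at 1 (Prop. 4)."""
    R, d, St = np.linalg.svd(L2, full_matrices=False)
    return (R * np.minimum(d, 1.0)) @ St

rng = np.random.default_rng(6)
L2 = 3.0 * rng.standard_normal((3, 5))          # m = 3, n = 5
P = proj(L2)
print(np.linalg.norm(P, 2) <= 1 + 1e-12)        # feasibility: largest singular value <= 1
print(np.allclose(proj(P), P))                  # projecting a feasible point leaves it unchanged
```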

**Solving the reduced dual**

Coded 3 methods in Matlab: Frank–Wolfe, grad.-proj. with line search (Goldstein; Levitin–Polyak), and accel. grad.-proj. (Nesterov).

Accel. grad.-proj. seems most efficient.

**0. Choose** Λ_{2} with I ⪰ Λ^{T}_{2}Λ_{2}. Set Λ^{prev}_{2} = Λ_{2}, θ^{prev} = θ = 1. Fix L = λ_{max}(C). Go to 1.

**1. Set**

Λ^{ext}_{2} = Λ_{2} + (θ/θ^{prev} − θ)(Λ_{2} − Λ^{prev}_{2}).

Update Λ^{prev}_{2} ← Λ_{2}, θ^{prev} ← θ, and

Λ_{2} ← Proj(Λ^{ext}_{2} − (1/L)∇d_{2}(Λ^{ext}_{2})),

θ ← (√(θ^{4} + 4θ^{2}) − θ^{2}) / 2.

If relative duality gap ≤ tol, stop. Else go to 1.
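A self-contained NumPy sketch (mine) of the accelerated gradient projection iteration above, on small random data. The starting point, problem sizes, iteration cap, and the simplified stopping test (negligible change in Λ_{2}) are my choices; the talk's code stops on the relative duality gap instead.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, p = 40, 8, 60
A = rng.standard_normal((n, p))
B = rng.standard_normal((m, p))
C = np.linalg.inv(A @ A.T)                        # C := M^{-1}
E = B @ A.T @ C

d2 = lambda L: 0.5 * np.trace(C @ L.T @ L) + np.trace(E.T @ L)
grad = lambda L: L @ C + E                        # grad d2, Lipschitz with constant lambda_max(C)

def proj(L):                                      # Prop. 4: clip singular values at 1
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return (U * np.minimum(s, 1.0)) @ Vt

# Step 0
Lam = np.zeros((m, n))                            # feasible start: I >= Lam^T Lam
Lam_prev, th_prev, th = Lam.copy(), 1.0, 1.0
Lip = np.linalg.eigvalsh(C).max()                 # L = lambda_max(C)

for k in range(500):
    # Step 1
    Lam_ext = Lam + (th / th_prev - th) * (Lam - Lam_prev)
    Lam_prev, th_prev = Lam, th
    Lam = proj(Lam_ext - grad(Lam_ext) / Lip)
    th = (np.sqrt(th**4 + 4 * th**2) - th**2) / 2
    if np.linalg.norm(Lam - Lam_prev) <= 1e-9 * max(1.0, np.linalg.norm(Lam)):
        break                                     # simplified stop test (slide uses relative duality gap)

print("iterations:", k + 1, "  dual objective -2*d2:", -2 * d2(Lam))
```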

**Test Results (Preliminary)**

Tested on random data: A ∼ U[0, 1]^{n×p} and B ∼ U[0, 1]^{m×p}, tol = 0.001.

n = 2000, m = 100, p = 1000, tol = 0.001
reduce A to have full row rank: done reducing A, time: 38.9895
done computing C and E, time: 4.05682
termination due to negligible change in U = 3.4469e-11
iter = 10, dobj = -96.7469, dual feas = 8.88178e-15
pobj = -96.7469, primal feas = 1.43293e-15
accel. grad-proj: iter = 10, total_time = 67.9021
fmin = 193.494, fval = 193.494

n = 2000, m = 100, p = 3000, tol = 0.001
done computing C and E, time: 31.5357
termination due to negligible change in U = 7.32917e-11
iter = 10, dobj = -137.14, dual feas = 6.21725e-15
pobj = -137.14, primal feas = 3.06165e-15
accel. grad-proj: iter = 10, total_time = 190.652
fmin = 8632.32, fval = 8632.32

**A Puzzle**

When n ≈ p, #iterations becomes very large.
**Maybe finally..**

Date: Sun, 11 Jan 2009 11:09:15 -0700

Dear Paul,

Sorry for the delay.

Some preliminary results prepared by my student are attached. Overall, it performs well, especially when the number of labels is large. We will conduct more extensive experimental studies and keep you updated.

Best, Jieping

*In a preliminary result on Drosophila gene expression pattern annotation, a group of images is associated with a variable number of terms using a controlled vocabulary.*

k-means clustering and feature extraction are used to obtain a global histogram counting the number of features closest to the visual words obtained from the clustering algorithm (with 3000 clusters), etc.

• n = 3000 (dim. of data)

• 10 ≤ m ≤ 60 (#terms/tasks)

• 2200 ≤ p ≤ 2800 (#samples).

| Measure | m | MTL | SVM | ML_{LS} |
|---|---|---|---|---|
| AUC (Area Under the Curve) | 10 | 87.38 | 86.66 | 87.72 |
| | 20 | 84.57 | 83.25 | 84.61 |
| | 30 | 82.76 | 81.13 | 82.92 |
| | 40 | 81.61 | 79.56 | 77.25 |
| | 50 | 80.52 | 78.17 | 79.38 |
| | 60 | 80.17 | 77.18 | 78.11 |

Table 1: AUC performance of MTL, SVM, ML_{LS} (multi-label).

| Measure | m | MTL | SVM | ML_{LS} |
|---|---|---|---|---|
| macro F1 | 10 | 64.43 | 62.97 | 64.91 |
| | 20 | 49.98 | 50.45 | 49.90 |
| | 30 | 42.48 | 41.92 | 42.57 |
| | 40 | 34.72 | 35.26 | 32.26 |
| | 50 | 28.70 | 29.74 | 29.13 |
| | 60 | 24.78 | 25.49 | 25.18 |

Table 2: macro F1 performance of MTL, SVM, ML_{LS} (multi-label).

| Measure | m | MTL | SVM | ML_{LS} |
|---|---|---|---|---|
| micro F1 | 10 | 67.85 | 66.67 | 68.27 |
| | 20 | 58.25 | 55.66 | 57.62 |
| | 30 | 53.74 | 48.11 | 53.02 |
| | 40 | 50.68 | 44.26 | 46.92 |
| | 50 | 49.45 | 43.53 | 48.76 |
| | 60 | 48.79 | 42.84 | 48.02 |

Table 3: micro F1 performance of MTL, SVM, ML_{LS} (multi-label).

**Conclusions & Extensions**

1. A seemingly nasty problem arising from an application is tamed by a mix of convex/matrix analysis and modern algorithms.

2. Extension to related convex optimization problems in learning?

3. Better algorithms to handle the case of p ≈ n?
