
Iteration Complexity of Feasible Descent Methods for Convex Optimization


(1)

Iteration Complexity of Feasible Descent Methods for Convex Optimization

Chih-Jen Lin

Department of Computer Science National Taiwan University

Joint work with Po-Wei Wang

(2)

Outline

Introduction

Feasible descent methods and linear-convergence proof

Rate of the linear convergence

Discussions and conclusions

(3)

Introduction

Outline

Introduction

Feasible descent methods and linear-convergence proof

Rate of the linear convergence

Discussions and conclusions

(4)

Introduction

Problem

min_{x ∈ X} f(x).

f(x) is convex and differentiable; X is closed and convex.

We want to know

Iterations to reach f(x^r) − f* ≤ ε.

Specifically, we investigate algorithms with linear convergence:

f(x^{r+1}) − f* ≤ (1 − 1/c)(f(x^r) − f*), ∀r.

[Figure: f(x^r) − f* versus iterations on a logarithmic scale, illustrating linear convergence.]
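A short consequence of this definition (my own arithmetic, not on the slide): with x^0 the starting point, the linear rate answers the "iterations to reach ε" question directly, since (1 − 1/c)^r ≤ e^{−r/c}:

\[
f(x^r) - f^* \le \Bigl(1 - \tfrac{1}{c}\Bigr)^{r}\bigl(f(x^0) - f^*\bigr) \le \epsilon
\quad\text{whenever}\quad
r \ge c \,\log\frac{f(x^0) - f^*}{\epsilon},
\]

so a linearly convergent method needs O(c log(1/ε)) iterations, and the whole question becomes how c depends on the problem.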

(5)

Introduction

Motivation

The dual problem of support vector classification is

min_α  (1/2) wᵀw − 1ᵀα
subject to  w = Eα, 0 ≤ α_i ≤ C, i = 1, …, l,

where E = [y₁z₁, …, y_l z_l] is the data matrix, (y_i, z_i) is a label-instance pair, and 1 is the vector of ones.

wᵀw/2 is strongly convex in w, but the objective (1/2)αᵀEᵀEα − 1ᵀα may not be strongly convex in α because EᵀE can be singular.

The coordinate descent method is commonly used, but ...
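To make the setting concrete, here is a minimal sketch (my own illustration, not code from the talk) of cyclic coordinate descent on this dual; Z is the l×n matrix whose rows are the instances z_i and y holds the ±1 labels:

import numpy as np

def dual_cd_svc(Z, y, C, iters=100):
    """Cyclic coordinate descent for the SVC dual:
    min_alpha 0.5*||w||^2 - sum(alpha), w = sum_i alpha_i*y_i*z_i, 0 <= alpha_i <= C."""
    l, n = Z.shape
    alpha = np.zeros(l)
    w = np.zeros(n)                       # maintained so that w = sum_i alpha_i*y_i*z_i
    sqnorm = np.einsum('ij,ij->i', Z, Z)  # ||z_i||^2 for each instance
    for _ in range(iters):
        for i in range(l):
            if sqnorm[i] == 0:
                continue
            grad = y[i] * w.dot(Z[i]) - 1.0                          # d f / d alpha_i
            new_ai = min(max(alpha[i] - grad / sqnorm[i], 0.0), C)   # exact one-variable solve, clipped to [0, C]
            w += (new_ai - alpha[i]) * y[i] * Z[i]                   # keep w consistent with alpha
            alpha[i] = new_ai
    return alpha, w

Each inner update solves the one-variable subproblem exactly and projects the result back onto [0, C], which is why this method falls under the feasible descent framework introduced later.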

(6)

Introduction

Difficulties

For some convex but not strongly convex problems, we have Asymptotic Linear Convergence (Luo and Tseng, 1993):

∃r₀ such that f(x^{r+1}) − f* ≤ (1 − 1/c)(f(x^r) − f*), ∀r ≥ r₀.

Usually we only know the existence of r₀ but not its relation to problem parameters.

To estimate iteration numbers, we hope to have Global Linear Convergence:

f(x^{r+1}) − f* ≤ (1 − 1/c)(f(x^r) − f*), ∀r.

(7)

Introduction

Difficulties (Cont’d)

We also hope to know more about the convergence rate

That is, how the rate is related to the data.

Properties of the data include the range of feature values, the number of instances, the number of features, etc.

(8)

Introduction

Past Studies

• We are interested in deterministic algorithms (e.g., cyclic coordinate descent)

• Interestingly, more studies have been done on the complexity of randomized coordinate descent:

Linear convergence for strongly convex f(·) (Nesterov, 2012; Richtárik and Takáč, 2014; Tappenden et al., 2013)

Sub-linear convergence for non-strongly convex f(·) (Shalev-Shwartz and Tewari, 2009; Nesterov, 2012; Shalev-Shwartz and Zhang, 2013a,b)

(9)

Introduction

Past Studies (Cont’d)

• Past work on complexity of cyclic coordinate descent:

Linear convergence for l2-loss SVM (Chang et al., 2008); smooth and strongly convex f(·) (Beck and Tetruashvili, 2013)

Sub-linear convergence for non-strongly convex f(·) (Tseng and Yun, 2009; Saha and Tewari, 2013)

(10)

Feasible descent methods and linear-convergence proof

Outline

Introduction

Feasible descent methods and linear-convergence proof

Rate of the linear convergence

Discussions and conclusions

(11)

Feasible descent methods and linear-convergence proof

Framework: Feasible Descent Methods

A sequence {x^r} is generated by a feasible descent method if, for every iteration index r, {x^r} satisfies

x^{r+1} = [x^r − ω_r∇f(x^r) + e^r]⁺_X,
‖e^r‖ ≤ β‖x^r − x^{r+1}‖,
f(x^r) − f(x^{r+1}) ≥ γ‖x^r − x^{r+1}‖²,

where inf_r ω_r > 0, β > 0, and γ > 0.

Coordinate descent is a special case
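As a sanity check, here is a small sketch (my own, on an illustrative box-constrained quadratic that is not from the talk) of the exact projected gradient method, which satisfies the three conditions above with e^r = 0:

import numpy as np

# Illustrative problem (my own choice): f(x) = 0.5*x'Qx - p'x over the box X = [0, 1]^2.
Q = np.array([[2.0, 0.0], [0.0, 0.5]])
p = np.array([1.0, 1.0])
f = lambda x: 0.5 * x.dot(Q).dot(x) - p.dot(x)
grad = lambda x: Q.dot(x) - p
proj = lambda y: np.clip(y, 0.0, 1.0)    # [y]^+_X for the box X

omega = 0.4                               # fixed step size, so inf_r omega_r > 0
x = np.array([1.0, 0.0])
for r in range(30):
    x_new = proj(x - omega * grad(x))     # e^r = 0, so ||e^r|| <= beta*||x^r - x^{r+1}|| holds trivially
    decrease = f(x) - f(x_new)            # >= (1/omega - rho/2)*||x - x_new||^2 = 1.5*||x - x_new||^2 here
    x = x_new
print(x)                                  # approaches [0.5, 1.0], the constrained minimizer

The coordinate descent solver sketched earlier is also covered by this framework, with a suitable nonzero e^r accounting for the coordinate-wise updates.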

(12)

Feasible descent methods and linear-convergence proof

Examples of Feasible Descent Methods for Machine Learning

Coordinate descent methods for dual Support Vector Classification (SVC)

Coordinate descent methods for dual Support Vector Regression (SVR)

Inexact coordinate descent for primal SVC

Inexact: one-variable sub-problem approximately solved

Gauss-Seidel method for solving linear systems
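For the last item, a minimal sketch (standard textbook Gauss-Seidel, my own illustration): solving Ax = d with a symmetric positive definite A is the same as minimizing the convex quadratic (1/2)xᵀAx − dᵀx, and each Gauss-Seidel sweep is exactly one pass of cyclic coordinate descent on that quadratic.

import numpy as np

def gauss_seidel(A, d, iters=100):
    """Gauss-Seidel sweeps for Ax = d with A symmetric positive definite;
    equivalently, exact cyclic coordinate descent on 0.5*x'Ax - d'x."""
    n = len(d)
    x = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            # minimize the quadratic over x_i with the other coordinates fixed
            x[i] = (d[i] - A[i, :].dot(x) + A[i, i] * x[i]) / A[i, i]
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
d = np.array([1.0, 2.0])
print(gauss_seidel(A, d), np.linalg.solve(A, d))  # the two outputs should nearly match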

(13)

Feasible descent methods and linear-convergence proof

Projected Gradient

We need the following tools.

Definition (Convex Projection)

[y]⁺_X ≡ arg min_{x ∈ X} ‖x − y‖.

Definition (Projected Gradient)

∇⁺f(x) ≡ x − [x − ∇f(x)]⁺_X.

Lemma (Optimality Condition)

∇⁺f(x) = 0 ⇔ x is optimal.
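A tiny worked example (my own, to illustrate the lemma): take f(x) = (x − 2)²/2 on X = [0, 1]. At x = 1,

\[
\nabla f(1) = -1,\qquad [\,1 - \nabla f(1)\,]^+_X = [\,2\,]^+_X = 1,\qquad \nabla^+ f(1) = 1 - 1 = 0,
\]

so x = 1 is optimal even though ∇f(1) ≠ 0; at x = 0.5 instead, ∇⁺f(0.5) = 0.5 − 1 = −0.5 ≠ 0.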

(14)

Feasible descent methods and linear-convergence proof

Existing Techniques to Prove Asymptotic Linear Convergence

Luo and Tseng (1993) prove the following error bound:

min_{x̄ ∈ X̄} ‖x^r − x̄‖ ≤ κ‖∇⁺f(x^r)‖, ∀r ≥ r₀,

where X̄ is the set of optimal solutions.

We call this a local error bound because of r0.

We aim to prove a global error bound and to understand more about κ

(15)

Feasible descent methods and linear-convergence proof

Existing Techniques to Prove Asymptotic Linear Convergence (Cont’d)

• In a sense, a local error bound is also global: if X is compact, there exists κ̄ such that

min_{x̄ ∈ X̄} ‖x^r − x̄‖ ≤ κ̄‖∇⁺f(x^r)‖, ∀r ≥ 0

• Based on the existence of such bounds, linear convergence has recently been established (e.g., Hong et al., 2014; Kadkhodaie et al., 2014) for problems not covered in Luo and Tseng (1993)

• However, we are interested in rate analysis here, so we ...

(16)

Feasible descent methods and linear-convergence proof

Sufficient Condition for Global Linear Convergence

We proved that feasible descent methods have global linear convergence if the following condition holds.

Global Error Bound from the Beginning

‖x − x̄‖ ≤ κ‖∇⁺f(x)‖ for all x satisfying

x ∈ X and f(x) − f* ≤ M,

where x̄ is the nearest optimum to x, f* is the optimal value, and M ≡ f(x⁰) − f*.
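A rough sketch of why this condition suffices (my own summary of the standard argument; the paper's proof is more careful): using the error bound together with the Lipschitz continuity of ∇f, one can show that a feasible descent iterate satisfies f(x^{r+1}) − f* ≤ φ‖x^r − x^{r+1}‖² for some constant φ depending on κ, ρ, β, and ω. Combining this with the sufficient-decrease condition f(x^r) − f(x^{r+1}) ≥ γ‖x^r − x^{r+1}‖² gives

\[
f(x^{r+1}) - f^* \le \frac{\varphi}{\gamma}\bigl(f(x^r) - f(x^{r+1})\bigr)
\quad\Longrightarrow\quad
f(x^{r+1}) - f^* \le \frac{\varphi}{\varphi + \gamma}\bigl(f(x^r) - f^*\bigr),
\]

which is exactly the linear rate stated later in the talk.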

(17)

Feasible descent methods and linear-convergence proof

Who Has A Global Error Bound from the Beginning?

Assumption (Strongly Convex)

f(x) is σ-strongly convex and ∇f is ρ-Lipschitz continuous.

A global error bound was proved in Pang (1987). However, recall that our goal is to study non-strongly convex problems such as the SVM dual.

(18)

Feasible descent methods and linear-convergence proof

Who Has A Global Error Bound from the Beginning? (Cont’d)

Assumption (Strongly Convex Composition)

X is a polyhedral set {x | Ax ≤ d} and

f(x) = g(Ex) + bᵀx,    (1)

where g(·) is σ_g-strongly convex and ∇f is ρ-Lipschitz continuous.

Our main result: global error bound for (1)

Then we can prove global linear convergence of feasible descent methods
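To connect this assumption with the earlier motivation (my own restatement; the mapping is implicit in the SVM slides): the SVC dual fits (1) with

\[
f(\alpha) = g(E\alpha) + b^\top \alpha,\qquad
g(w) = \tfrac{1}{2}\,w^\top w,\qquad
b = -\mathbf{1},\qquad
X = \{\alpha \mid 0 \le \alpha_i \le C,\ i = 1,\dots,l\},
\]

where g is 1-strongly convex even though f itself need not be strongly convex in α.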

(19)

Feasible descent methods and linear-convergence proof

Key Ideas in Our Proof

The optimal solution set is a polyhedral set:

Ex = t̄, bᵀx = s̄, and Ax ≤ d.

We use Hoffman's bound (Hoffman, 1952) to bound the distance between x and a polyhedron; we proved a modified version of the bound in Li (1994):

‖x − x̄‖ ≤ θ(A, [E; bᵀ]) · ‖[ E(x − x̄) ; bᵀ(x − x̄) ]‖,

where θ(A, [E; bᵀ]) is a constant related to A, E, and b.

Finally, we bound ‖E(x − x̄)‖² and (bᵀ(x − x̄))².

(20)

Rate of the linear convergence

Outline

Introduction

Feasible descent methods and linear-convergence proof

Rate of the linear convergence

Discussions and conclusions

(21)

Rate of the linear convergence

The Error Bound Constants

We proved

‖x − x̄‖ ≤ κ‖∇⁺f(x)‖ with

κ = θ²(1 + ρ)(1 + 2‖∇g(t̄)‖²/σ_g + 4M) + 2θ‖∇f(x̄)‖.

Recall that

f(x) = g(Ex) + bᵀx,

where g(·) is σ_g-strongly convex and ∇f is ρ-Lipschitz continuous.

If X = R^l or b = 0, κ can be simplified to 2(1 + ρ) …

(22)

Rate of the linear convergence

The Convergence Rate

With an error bound, the feasible descent method

x^{r+1} = [x^r − ω_r∇f(x^r) + e^r]⁺_X,
‖e^r‖ ≤ β‖x^r − x^{r+1}‖,
f(x^r) − f(x^{r+1}) ≥ γ‖x^r − x^{r+1}‖²,

converges linearly with

f(x^{r+1}) − f* ≤ φ/(φ + γ) · (f(x^r) − f*), ∀r ≥ 0,

where

φ = (ρ + (1 + β)/ω)(1 + κ(1 + β)/ω) and ω ≡ min(1, inf_r ω_r).
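A quick reading of this rate (my own arithmetic, not on the slide): since φ/(φ + γ) = 1 − γ/(φ + γ), reaching f(x^r) − f* ≤ ε takes on the order of

\[
r \;\ge\; \frac{\varphi + \gamma}{\gamma}\,\log\frac{f(x^0) - f^*}{\epsilon}
\]

iterations, i.e., O((φ/γ) log(1/ε)) with the constant made explicit through κ, ρ, β, and ω.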

(23)

Rate of the linear convergence

Examples of the Error Bound Constant

The dual problem of l1-loss support vector classification is

min_α  (1/2) wᵀw − 1ᵀα
subject to  w = Eα, 0 ≤ α_i ≤ C, i = 1, …, l,

where E = [y₁z₁, …, y_l z_l] is the data matrix, (y_i, z_i) is a label-instance pair, and 1 is the vector of ones.

If coordinate descent methods are used and each instance is normalized to unit length,

κ = O(ρθ²Cl).
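Two observations behind this scaling (my own back-of-the-envelope reasoning, consistent with but not stated on the slide): the Hessian of the dual objective is EᵀE, whose trace is l when instances have unit length, so the Lipschitz constant satisfies ρ ≤ ‖EᵀE‖ ≤ l; and starting from α⁰ = 0,

\[
M \;=\; f(\alpha^0) - f^* \;=\; -f^* \;\le\; \mathbf{1}^\top \alpha^* \;\le\; C\,l,
\]

which is where the Cl factor in κ comes from.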

(24)

Rate of the linear convergence

Examples of the Convergence Rate

For the dual problem of l1-loss support vector classification, the cyclic coordinate descent method has global linear convergence:

f(x^{r+1}) − f* ≤ (1 − 1/(2φ + 1))(f(x^r) − f*), ∀r,

where

φ = O(lρ²κ) = O(ρ³θ²Cl²).

(25)

Discussions and conclusions

Outline

Introduction

Feasible descent methods and linear-convergence proof

Rate of the linear convergence

Discussions and conclusions

(26)

Discussions and conclusions

Conclusions

• For some non-strongly convex functions, we provide rate analysis of linear convergence for feasible descent methods

• The key idea is to prove an error bound between any point and the optimal solution set

• Our result enables the global linear convergence of optimization methods for some machine learning problems

• Details of the proof can be found in: P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 2014.
