Iteration Complexity of Feasible Descent Methods for Convex Optimization
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Joint work with Po-Wei Wang
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Introduction
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Introduction
Problem
min_{x∈X} f(x).
f(x) is convex and differentiable; X is closed and convex.
We want to know
the number of iterations needed to reach f(x^r) − f* ≤ ε.
Specifically, we investigate algorithms with linear convergence:
f(x^{r+1}) − f* ≤ (1 − 1/c)(f(x^r) − f*), ∀r.
[Figure: f(x^r) − f* versus iterations (log scale), illustrating linear convergence]
Introduction
Motivation
The dual problem of support vector classification is
min_α (1/2) w^T w − 1^T α
subject to w = Eα, 0 ≤ α_i ≤ C, i = 1, ..., l,
where E = [y_1 z_1, ..., y_l z_l] is the data matrix, (y_i, z_i) is a label-instance pair, and 1 is the vector of ones.
w^T w / 2 is strongly convex in w, but the Hessian E^T E of the dual objective can be singular, so the objective may not be strongly convex in α.
Coordinate descent methods are commonly used, but rate analysis is difficult without strong convexity.
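As a rough sketch of what such a coordinate descent step looks like (a minimal illustration with assumed variable names and toy data handling, not the implementation referenced in this talk):

```python
# Minimal sketch (not the talk's code): cyclic dual coordinate descent for the SVC
# dual above, maintaining w = sum_i alpha_i * y_i * z_i so each update is cheap.
import numpy as np

def dual_cd_svc(Z, y, C, sweeps=10):
    """Z: l x n matrix of instances z_i, y: labels in {+1, -1}, C: box upper bound."""
    l, n = Z.shape
    alpha = np.zeros(l)
    w = np.zeros(n)
    Qii = (Z ** 2).sum(axis=1)                      # diagonal entries z_i^T z_i of the Hessian
    for _ in range(sweeps):
        for i in range(l):                          # deterministic cyclic order
            if Qii[i] == 0.0:
                continue
            G = y[i] * (w @ Z[i]) - 1.0             # partial derivative w.r.t. alpha_i
            new_ai = min(max(alpha[i] - G / Qii[i], 0.0), C)
            w += (new_ai - alpha[i]) * y[i] * Z[i]  # keep w = E * alpha consistent
            alpha[i] = new_ai
    return alpha, w
```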
Introduction
Difficulties
For some convex but not strongly convex problems, asymptotic linear convergence has been proved (Luo and Tseng, 1993):
∃ r0 such that f(x^{r+1}) − f* ≤ (1 − 1/c)(f(x^r) − f*), ∀r ≥ r0.
Usually only the existence of r0 is known, not its relation to the problem parameters.
To estimate iteration numbers, we hope to have Global Linear Convergence:
f(x^{r+1}) − f* ≤ (1 − 1/c)(f(x^r) − f*), ∀r.
Introduction
Difficulties (Cont’d)
We also hope to know more about the convergence rate; that is, how the rate is related to the data.
Properties of the data include the range of feature values, the number of instances, the number of features, etc.
Introduction
Past Studies
• We are interested in deterministic algorithms (e.g., cyclic coordinate descent)
• Interestingly, more studies have been done on the complexity of randomized coordinate descent:
Linear convergence for strongly convex f(·) (Nesterov, 2012; Richtárik and Takáč, 2014; Tappenden et al., 2013)
Sub-linear convergence for non-strongly convex f (·)
(Shalev-Shwartz and Tewari, 2009; Nesterov, 2012; Shalev-Shwartz and Zhang, 2013a,b)
Introduction
Past Studies (Cont’d)
• Past work on complexity of cyclic coordinate descent:
Linear convergence for l2-loss SVM (Chang et al., 2008); smooth and strongly convex f (·) (Beck and Tetruashvili, 2013)
Sub-linear convergence for non-strongly convex f (·) (Tseng and Yun, 2009; Saha and Tewari, 2013)
Feasible descent methods and linear-convergence proof
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Feasible descent methods and linear-convergence proof
Framework: Feasible Descent Methods
A sequence {x^r} is generated by a feasible descent method if, for every iteration index r, it satisfies
x^{r+1} = [x^r − ω_r ∇f(x^r) + e^r]^+_X,
‖e^r‖ ≤ β‖x^r − x^{r+1}‖,
f(x^r) − f(x^{r+1}) ≥ γ‖x^r − x^{r+1}‖²,
where inf_r ω_r > 0, β > 0, and γ > 0.
Coordinate descent is a special case
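As a concrete instance (a toy sketch on an assumed box-constrained quadratic, not code from the paper), projected gradient descent satisfies these conditions with e^r = 0:

```python
# Minimal sketch: projected gradient descent on
#   min 0.5 x^T Q x - p^T x   subject to 0 <= x <= C,
# i.e., x^{r+1} = [x^r - omega * grad f(x^r)]^+_X with no inexactness (e^r = 0).
import numpy as np

def projected_gradient(Q, p, C, iters=100):
    """Q: symmetric positive definite; fixed step size 1/L with L = largest eigenvalue."""
    omega = 1.0 / np.linalg.eigvalsh(Q)[-1]
    x = np.zeros(len(p))
    for _ in range(iters):
        grad = Q @ x - p
        x = np.clip(x - omega * grad, 0.0, C)   # projection onto the box X = [0, C]^l
    return x
```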
Feasible descent methods and linear-convergence proof
Examples of Feasible Descent Methods for Machine Learning
Coordinate descent methods for dual Support Vector Classification (SVC)
Coordinate descent methods for dual Support Vector Regression (SVR)
Inexact coordinate descent for primal SVC
Inexact: one-variable sub-problem approximately solved
Gauss-Seidel method for solving linear systems
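For the last item, a minimal Gauss-Seidel sketch (a toy illustration assuming A is symmetric positive definite, so it is exact cyclic coordinate descent on f(x) = ½ x^T A x − b^T x):

```python
# Minimal sketch: Gauss-Seidel for Ax = b, equivalent to exact cyclic coordinate
# descent on the unconstrained quadratic f(x) = 0.5 x^T A x - b^T x.
import numpy as np

def gauss_seidel(A, b, sweeps=50):
    x = np.zeros_like(b, dtype=float)
    for _ in range(sweeps):
        for i in range(len(b)):
            # Exactly minimize f over x_i with all other coordinates fixed.
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```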
Feasible descent methods and linear-convergence proof
Projected Gradient
We need the following tools.
Definition (Convex Projection)
[y]^+_X ≡ arg min_{x∈X} ‖x − y‖.
Definition (Projected gradient)
∇⁺f(x) ≡ x − [x − ∇f(x)]^+_X.
Lemma (Optimality condition)
∇⁺f(x*) = 0 ⇔ x* is optimal.
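To make the definitions concrete, a small sketch (assuming the box constraint X = [0, C]^l from the SVM dual; an illustration, not the paper's code) evaluates the projection and uses ‖∇⁺f(x)‖ as an optimality measure:

```python
# Minimal sketch: convex projection onto a box and the projected-gradient norm,
# which is zero exactly at an optimum of min_{x in X} f(x).
import numpy as np

def project_box(y, C):
    """[y]^+_X for X = {x : 0 <= x_i <= C}."""
    return np.clip(y, 0.0, C)

def projected_gradient_norm(x, grad, C):
    """|| x - [x - grad f(x)]^+_X ||, i.e., the norm of nabla^+ f(x)."""
    return np.linalg.norm(x - project_box(x - grad, C))
```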
Feasible descent methods and linear-convergence proof
Existing Techniques to Prove Asymptotic Linear Convergence
Luo and Tseng (1993) prove the following error bound:
min_{x*∈X*} ‖x^r − x*‖ ≤ κ‖∇⁺f(x^r)‖, ∀r ≥ r0, where X* is the set of optimal solutions.
We call this a local error bound because of the requirement r ≥ r0.
We aim to prove a global error bound and to know more about κ.
Feasible descent methods and linear-convergence proof
Existing Techniques to Prove Asymptotic Linear Convergence (Cont’d)
• In a sense, a local error bound is also global: if X is compact, there exists κ̄ such that
min_{x*∈X*} ‖x^r − x*‖ ≤ κ̄‖∇⁺f(x^r)‖, ∀r ≥ 0
• Based on the existence of such bounds, linear convergence has recently been established (e.g., Hong et al., 2014; Kadkhodaie et al., 2014) for problems not covered by Luo and Tseng (1993)
• However, we are interested in rate analysis here, so we need to know how the constant κ is related to the problem
Feasible descent methods and linear-convergence proof
Sufficient Condition for Global Linear Convergence
We proved that feasible descent methods have global linear convergence if the following condition holds.
Global Error Bound from the Beginning
‖x − x̄‖ ≤ κ‖∇⁺f(x)‖ for all x satisfying x ∈ X and f(x) − f* ≤ M,
where x̄ is the optimum nearest to x, f* is the optimal value, and M ≡ f(x^0) − f*.
Feasible descent methods and linear-convergence proof
Who Has A Global Error Bound from the Beginning?
Assumption (Strongly Convex)
f(x) is σ-strongly convex and ∇f is ρ-Lipschitz continuous.
A global error bound has been proved in Pang (1987).
However, recall that our goal is to study non-strongly convex problems such as the SVM dual.
Feasible descent methods and linear-convergence proof
Who Has A Global Error Bound from the Beginning? (Cont’d)
Assumption (Strongly Convex Composition)
X is a polyhedral set {x | Ax ≤ d} and
f(x) = g(Ex) + b^T x,   (1)
where g(·) is σ_g-strongly convex and ∇f is ρ-Lipschitz continuous.
Our main result: global error bound for (1)
Then we can prove global linear convergence of feasible descent methods.
Feasible descent methods and linear-convergence proof
Key Ideas in Our Proof
The optimal solution set is a polyhedral set:
Ex* = t*, b^T x* = s*, and Ax* ≤ d.
We use Hoffman's bound (Hoffman, 1952) to bound the distance between x and a polyhedron; we proved a modified version following Li (1994):
‖x − x̄‖ ≤ θ(A, [E; b^T]) ‖[E(x − x̄); b^T(x − x̄)]‖,
where θ(A, [E; b^T]) is a constant related to A, E, and b.
Finally, we bound ‖E(x − x̄)‖² and (b^T(x − x̄))².
Rate of the linear convergence
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Rate of the linear convergence
The Error Bound Constants
We proved
‖x − x̄‖ ≤ κ‖∇⁺f(x)‖ with
κ = θ²(1 + ρ)(1 + 2‖∇g(t*)‖²/σ_g + 4M) + 2θ‖∇f(x̄)‖.
Recall that
f(x) = g(Ex) + b^T x,
where g(·) is σ_g-strongly convex and ∇f is ρ-Lipschitz continuous.
If X = R^l or b = 0, κ can be simplified to θ²(1 + ρ)/σ_g.
Rate of the linear convergence
The Convergence Rate
With an error bound, the feasible descent method
x^{r+1} = [x^r − ω_r ∇f(x^r) + e^r]^+_X,
‖e^r‖ ≤ β‖x^r − x^{r+1}‖,
f(x^r) − f(x^{r+1}) ≥ γ‖x^r − x^{r+1}‖²,
converges linearly with
f(x^{r+1}) − f* ≤ (φ/(φ + γ))(f(x^r) − f*), ∀r ≥ 0,
where
φ = (ρ + (1 + β)/ω)(1 + κ(1 + β)/ω) and ω ≡ min(1, inf_r ω_r).
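To see how such a rate translates into an iteration count, here is a small sketch; the plugged-in constants are purely hypothetical and only illustrate how the quantities above combine:

```python
# Minimal sketch: contraction factor phi/(phi + gamma) and the number of iterations
# needed to bring the objective gap below a tolerance eps (hypothetical constants).
import math

def contraction_factor(rho, beta, gamma, omega, kappa):
    t = (1.0 + beta) / omega
    phi = (rho + t) * (1.0 + kappa * t)
    return phi / (phi + gamma)

def iterations_needed(eps, initial_gap, rate):
    """Smallest r with rate**r * initial_gap <= eps."""
    return math.ceil(math.log(eps / initial_gap) / math.log(rate))

rate = contraction_factor(rho=1.0, beta=1.0, gamma=0.5, omega=1.0, kappa=10.0)
print(rate, iterations_needed(1e-3, 1.0, rate))
```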
Rate of the linear convergence
Examples of the Error Bound Constant
The dual problem of l1-loss support vector classification:
min_α (1/2) w^T w − 1^T α
subject to w = Eα, 0 ≤ α_i ≤ C, i = 1, ..., l,
where E = [y_1 z_1, ..., y_l z_l] is the data matrix, (y_i, z_i) is a label-instance pair, and 1 is the vector of ones.
If coordinate descent methods are used and each instance is normalized to unit length,
κ = O(ρθ²Cl).
Rate of the linear convergence
Examples of the Convergence Rate
For the dual problem of l1-loss support vector classification, the cyclic coordinate descent method has global linear convergence:
f(x^{r+1}) − f* ≤ (1 − 1/(2φ + 1))(f(x^r) − f*), ∀r, where
φ = O(lρ²κ) = O(ρ³θ²Cl²).
Discussions and conclusions
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Discussions and conclusions
Conclusions
• For some non-strongly convex functions, we provide a rate analysis of the linear convergence of feasible descent methods
• The key idea is to prove an error bound between any point and the optimal solution set
• Our results establish global linear convergence of optimization methods for several machine learning problems
• Details of the proof can be found in: P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 2014.