### Iteration Complexity of Feasible Descent Methods for Convex Optimization

Chih-Jen Lin

Department of Computer Science, National Taiwan University

Joint work with Po-Wei Wang

### Outline

Introduction

Feasible descent methods and linear-convergence proof

Rate of the linear convergence

Discussions and conclusions


### Problem

$$\min_{x \in X} f(x),$$

where $f(x)$ is convex and differentiable, and $X$ is closed and convex.

We want to know the number of iterations needed to reach $f(x^r) - f^* \le \epsilon$.

Specifically, we investigate algorithms with linear convergence:

$$f(x^{r+1}) - f^* \le \left(1 - \frac{1}{c}\right)\left(f(x^r) - f^*\right), \quad \forall r.$$

[Figure: $f(x^r) - f^*$ on a log scale ($10^{-3}$ to $10^{0}$) versus iterations (2 to 10); legend entries "Linearly" and "1/".]
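As a rough illustration of why a global rate is useful: if the linear-convergence inequality holds for every $r$ with a known constant $c > 1$, then $(1 - 1/c)^r \le e^{-r/c}$, so $r \ge c \log((f(x^0) - f^*)/\epsilon)$ iterations suffice. The following is a minimal sketch under that assumption; the function names and the sample values are illustrative and not from the talk.

```python
import math

def iterations_for_tolerance(c, initial_gap, eps):
    """Iterations r with (1 - 1/c)**r * initial_gap <= eps, assuming c > 1."""
    if initial_gap <= eps:
        return 0
    rate = 1.0 - 1.0 / c
    return math.ceil(math.log(eps / initial_gap) / math.log(rate))

def simple_upper_bound(c, initial_gap, eps):
    """Looser closed form: r >= c * log(initial_gap / eps) always suffices."""
    return max(0, math.ceil(c * math.log(initial_gap / eps)))

# Hypothetical values: c = 100, initial gap 1, target accuracy 1e-3.
r_exact = iterations_for_tolerance(c=100.0, initial_gap=1.0, eps=1e-3)
r_bound = simple_upper_bound(c=100.0, initial_gap=1.0, eps=1e-3)
print(r_exact, r_bound)
```

The closed form is slightly pessimistic but shows directly that the iteration count grows linearly in $c$ and only logarithmically in $1/\epsilon$.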


### Motivation

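The dual problem of support vector classification is

$$\min_{\alpha} \ \frac{1}{2} w^\top w - \mathbf{1}^\top \alpha \quad \text{subject to} \quad w = E\alpha, \ 0 \le \alpha_i \le C, \ i = 1, \ldots, l,$$

where $E = [y_1 z_1, \ldots, y_l z_l]$ is the data matrix, $(y_i, z_i)$ is a label-instance pair, and $\mathbf{1}$ is the vector of ones.

$w^\top w / 2$ is strongly convex in $w$, but the objective may not be strongly convex in $\alpha$, because its Hessian $E^\top E$ may be singular.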
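To make the lack of strong convexity in $\alpha$ concrete, here is a small sketch with hypothetical toy data (one feature, two identical instances). With $w = E\alpha$ substituted, the objective is $f(\alpha) = \frac{1}{2}\|E\alpha\|^2 - \mathbf{1}^\top\alpha$, and it is constant along any direction $v$ with $Ev = 0$ and $\mathbf{1}^\top v = 0$.

```python
def dual_objective(alpha, E):
    """f(alpha) = 0.5 * ||E alpha||^2 - sum(alpha); E is given as a list of rows."""
    w = [sum(row[j] * alpha[j] for j in range(len(alpha))) for row in E]
    return 0.5 * sum(wi * wi for wi in w) - sum(alpha)

# Hypothetical toy data: one feature, two instances with y_i * z_i = 1.
E = [[1.0, 1.0]]
alpha = [0.25, 0.5]
v = [1.0, -1.0]  # E v = 0 and sum(v) = 0, so f has no curvature along v

values = [dual_objective([a + t * d for a, d in zip(alpha, v)], E)
          for t in (-0.2, 0.0, 0.2)]
print(values)  # the three values agree up to rounding
```

A strongly convex function would have to grow quadratically along every direction, so this flat direction rules strong convexity out for this instance.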

Coordinate descent method is commonly used, but


### Difficulties

For some convex but non-strongly convex problems, asymptotic linear convergence has been shown (Luo and Tseng, 1993):

$$\exists r_0 \ \text{such that} \ f(x^{r+1}) - f^* \le \left(1 - \frac{1}{c}\right)\left(f(x^r) - f^*\right), \quad \forall r \ge r_0.$$

Usually we only know the existence of $r_0$ but not its relation to problem parameters.

To estimate the number of iterations, we hope to have global linear convergence:

$$f(x^{r+1}) - f^* \le \left(1 - \frac{1}{c}\right)\left(f(x^r) - f^*\right), \quad \forall r.$$


### Difficulties (Cont’d)

We also hope to know more about the convergence rate, that is, how the rate is related to the data.

Properties of the data include the range of feature values, the number of instances, the number of features, etc.


### Past Studies

• We are interested in deterministic algorithms (e.g., cyclic coordinate descent)

• Interestingly, more studies have been done on the complexity of randomized coordinate descent:

  ◦ Linear convergence for strongly convex $f(\cdot)$ (Nesterov, 2012; Richtárik and Takáč, 2014; Tappenden et al., 2013)

  ◦ Sub-linear convergence for non-strongly convex $f(\cdot)$ (Shalev-Shwartz and Tewari, 2009; Nesterov, 2012; Shalev-Shwartz and Zhang, 2013a,b)


### Past Studies (Cont’d)

• Past work on complexity of cyclic coordinate descent:

  ◦ Linear convergence for l2-loss SVM (Chang et al., 2008) and for smooth and strongly convex $f(\cdot)$ (Beck and Tetruashvili, 2013)

  ◦ Sub-linear convergence for non-strongly convex $f(\cdot)$ (Tseng and Yun, 2009; Saha and Tewari, 2013)


### Framework: Feasible Descent Methods

A sequence $\{x^r\}$ is generated by a feasible descent method if, for every iteration index $r$, $\{x^r\}$ satisfies

$$
\begin{aligned}
x^{r+1} &= [x^r - \omega_r \nabla f(x^r) + e^r]^+_X, \\
\|e^r\| &\le \beta \|x^r - x^{r+1}\|, \\
f(x^r) - f(x^{r+1}) &\ge \gamma \|x^r - x^{r+1}\|^2,
\end{aligned}
$$

where $\inf_r \omega_r > 0$, $\beta > 0$, and $\gamma > 0$.

Coordinate descent is a special case.
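The simplest instance of this framework is plain projected gradient descent with $e^r = 0$ and a constant step. The sketch below assumes a box-constrained quadratic with hypothetical data and checks the sufficient-decrease condition numerically. The choice $\omega = 1/L$ with $L = \mathrm{trace}(Q)$ and the constant $\gamma = L/2$ are assumptions of this example: they follow from the standard descent-lemma argument and are not constants stated in the talk.

```python
def project_box(y, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(v, lo), hi) for v in y]

def f(x, Q, p):
    """f(x) = 0.5 * x^T Q x - p^T x."""
    n = len(x)
    Qx = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
    return 0.5 * sum(a * b for a, b in zip(x, Qx)) - sum(a * b for a, b in zip(p, x))

def grad(x, Q, p):
    """Gradient Q x - p."""
    n = len(x)
    return [sum(Q[i][j] * x[j] for j in range(n)) - p[i] for i in range(n)]

# Hypothetical data: Q = E^T E is positive semidefinite and singular.
Q = [[2.0, 1.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
p = [1.0, 1.0, 1.0]
lo, hi = 0.0, 1.0

L = sum(Q[i][i] for i in range(3))  # trace(Q) bounds the largest eigenvalue
omega = 1.0 / L                     # constant step size omega_r
gamma = L / 2.0                     # sufficient-decrease constant for this step

x = [0.0, 0.0, 0.0]
decrease_ok = True
for r in range(200):
    g = grad(x, Q, p)
    x_new = project_box([xi - omega * gi for xi, gi in zip(x, g)], lo, hi)  # e^r = 0
    step_sq = sum((a - b) ** 2 for a, b in zip(x, x_new))
    # Check f(x^r) - f(x^{r+1}) >= gamma * ||x^r - x^{r+1}||^2 (small float slack).
    decrease_ok = decrease_ok and (f(x, Q, p) - f(x_new, Q, p) >= gamma * step_sq - 1e-12)
    x = x_new

print(decrease_ok, x)
```

Because $e^r = 0$, the error condition holds for any $\beta > 0$, so all three requirements of the framework are met in this toy run.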


### Examples of Feasible Descent Methods for Machine Learning

• Coordinate descent methods for dual Support Vector Classification (SVC)

• Coordinate descent methods for dual Support Vector Regression (SVR)

• Inexact coordinate descent for primal SVC (inexact: the one-variable sub-problem is solved approximately)

• Gauss-Seidel method for solving linear systems
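To see why Gauss-Seidel belongs on this list: for a symmetric positive definite system $Qx = b$, each Gauss-Seidel update exactly minimizes $\frac{1}{2}x^\top Q x - b^\top x$ over one coordinate, so a sweep is one pass of cyclic coordinate descent. A minimal sketch, with a hypothetical 3-by-3 system:

```python
def gauss_seidel_sweep(Q, b, x):
    """One Gauss-Seidel sweep for Q x = b, updating x in place.

    Each update exactly minimizes 0.5 * x^T Q x - b^T x over coordinate i,
    so one sweep is one pass of cyclic coordinate descent.
    """
    n = len(b)
    for i in range(n):
        s = sum(Q[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / Q[i][i]
    return x

# Hypothetical symmetric, strictly diagonally dominant (hence positive definite) system.
Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = [0.0, 0.0, 0.0]
for _ in range(100):
    gauss_seidel_sweep(Q, b, x)

residual = [sum(Q[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
print(x, residual)
```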


### Projected Gradient

We need the following tools.

Definition (Convex Projection)

$$[y]^+_X \equiv \arg\min_{x \in X} \|x - y\|.$$

Definition (Projected gradient)

$$\nabla^+ f(x) \equiv x - [x - \nabla f(x)]^+_X.$$

Lemma (Optimality condition)

$$\nabla^+ f(x^*) = 0 \iff x^* \text{ is optimal}.$$
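As a quick check of these definitions, the sketch below evaluates the projected gradient for a hypothetical toy problem, $f(x) = \frac{1}{2}(x_1 - 2)^2 + \frac{1}{2}(x_2 + 1)^2$ over the box $X = [0, 1]^2$, whose constrained minimizer is $(1, 0)$. The problem and the helper names are assumptions made for illustration.

```python
def project_box(y, lo=0.0, hi=1.0):
    """Convex projection [y]^+_X for the box X = [lo, hi]^n."""
    return [min(max(v, lo), hi) for v in y]

def grad_f(x):
    """Gradient of the toy objective f(x) = 0.5*(x1 - 2)^2 + 0.5*(x2 + 1)^2."""
    return [x[0] - 2.0, x[1] + 1.0]

def projected_gradient(x):
    """x - [x - grad f(x)]^+_X."""
    g = grad_f(x)
    proj = project_box([xi - gi for xi, gi in zip(x, g)])
    return [xi - pi for xi, pi in zip(x, proj)]

pg_opt = projected_gradient([1.0, 0.0])    # at the constrained minimizer
pg_other = projected_gradient([0.5, 0.5])  # at a non-optimal feasible point
print(pg_opt, pg_other)
```

The ordinary gradient at $(1, 0)$ is $(-1, 1)$, which is nonzero, while the projected gradient vanishes there. This is why error bounds for constrained problems are stated in terms of $\nabla^+ f$.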


### Existing Techniques to Prove Asymptotic Linear Convergence

Luo and Tseng (1993) prove the following error bound:

$$\min_{x^* \in X^*} \|x^r - x^*\| \le \kappa \|\nabla^+ f(x^r)\|, \quad \forall r \ge r_0,$$

where $X^*$ is the set of optimal solutions.

We call this a local error bound because of $r_0$.

We aim at proving a global error bound and knowing more about $\kappa$.


### Existing Techniques to Prove Asymptotic Linear Convergence (Cont’d)

• In a sense, a local error bound can also be regarded as global: if $X$ is compact, there exists $\bar{\kappa}$ such that

$$\min_{x^* \in X^*} \|x^r - x^*\| \le \bar{\kappa} \|\nabla^+ f(x^r)\|, \quad \forall r \ge 0.$$

• Based on the existence of such bounds, linear convergence has recently been established (e.g., Hong et al., 2014; Kadkhodaie et al., 2014) for problems not covered in Luo and Tseng (1993)

• However, we are interested in rate analysis here, so we


### Sufficient Condition for Global Linear Convergence

We proved that feasible descent methods have global linear convergence if the following condition holds.

Global Error Bound from the Beginning

$$\|x - \bar{x}\| \le \kappa \|\nabla^+ f(x)\|$$

for all $x$ satisfying

$$x \in X \quad \text{and} \quad f(x) - f^* \le M,$$

where $\bar{x}$ is the nearest optimum to $x$, $f^*$ is the optimal value, and $M \equiv f(x^0) - f^*$.


### Who Has A Global Error Bound from the Beginning?

Assumption (Strongly Convex)

$f(x)$ is $\sigma$-strongly convex and $\nabla f$ is $\rho$-Lipschitz continuous.

A global error bound has been proved in Pang (1987).

However, recall that our goal is to study non-strongly convex problems such as the SVM dual.


### Who Has A Global Error Bound from the Beginning? (Cont’d)

Assumption (Strongly Convex Composition)

$X$ is a polyhedral set $\{x \mid Ax \le d\}$ and

$$f(x) = g(Ex) + b^\top x, \tag{1}$$

where $g(\cdot)$ is $\sigma_g$-strongly convex and $\nabla f$ is $\rho$-Lipschitz continuous.

Our main result: a global error bound for (1).

Then we can prove global linear convergence of feasible descent methods.
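For instance, the dual SVC problem from the motivation fits form (1). The identification below is a worked sketch. Stacking the box constraints into $A$ and $d$ is one possible choice, made here for illustration.

```latex
\begin{aligned}
f(\alpha) &= \tfrac{1}{2}\,\alpha^\top E^\top E\,\alpha - \mathbf{1}^\top \alpha
           = g(E\alpha) + b^\top \alpha,
\qquad g(t) = \tfrac{1}{2}\, t^\top t, \quad b = -\mathbf{1}, \\
X &= \{\alpha \mid A\alpha \le d\},
\qquad A = \begin{pmatrix} I \\ -I \end{pmatrix}, \quad
d = \begin{pmatrix} C\,\mathbf{1} \\ \mathbf{0} \end{pmatrix}.
\end{aligned}
```

Here $g$ is strongly convex with $\sigma_g = 1$, and $\nabla f(\alpha) = E^\top E \alpha - \mathbf{1}$ is Lipschitz continuous with $\rho$ equal to the largest eigenvalue of $E^\top E$, even though $f$ itself need not be strongly convex in $\alpha$.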


### Key Ideas in Our Proof

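The optimal solution set is a polyhedral set:

$$Ex^* = t^*, \quad b^\top x^* = s^*, \quad \text{and} \quad Ax^* \le d.$$

We use Hoffman's bound (Hoffman, 1952) to bound the distance between $x$ and a polyhedron. We proved a modified version from Li (1994):

$$\|x - \bar{x}\| \le \theta\!\left(A, \begin{pmatrix} E \\ b^\top \end{pmatrix}\right) \left\| \begin{pmatrix} E(x - \bar{x}) \\ b^\top (x - \bar{x}) \end{pmatrix} \right\|,$$

where $\theta\!\left(A, \begin{pmatrix} E \\ b^\top \end{pmatrix}\right)$ is a constant related to $A$, $E$, and $b$.

Finally, we bound $\|E(x - \bar{x})\|^2$ and $(b^\top (x - \bar{x}))^2$.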


### The Error Bound Constants

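We proved

$$\|x - \bar{x}\| \le \kappa \|\nabla^+ f(x)\|$$

with

$$\kappa = \theta^2 (1 + \rho)\left(\frac{1 + 2\|\nabla g(t^*)\|^2}{\sigma_g} + 4M\right) + 2\theta \|\nabla f(\bar{x})\|.$$

Recall that

$$f(x) = g(Ex) + b^\top x,$$

where $g(\cdot)$ is $\sigma_g$-strongly convex and $\nabla f$ is $\rho$-Lipschitz continuous.

If $X = \mathbb{R}^l$ or $b = 0$, $\kappa$ can be simplified to

$$\theta^2 \, \frac{1 + \rho}{\sigma_g}.$$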


### The Convergence Rate

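With an error bound, the feasible descent method

$$
\begin{aligned}
x^{r+1} &= [x^r - \omega_r \nabla f(x^r) + e^r]^+_X, \\
\|e^r\| &\le \beta \|x^r - x^{r+1}\|, \\
f(x^r) - f(x^{r+1}) &\ge \gamma \|x^r - x^{r+1}\|^2,
\end{aligned}
$$

converges linearly with

$$f(x^{r+1}) - f^* \le \frac{\phi}{\phi + \gamma}\left(f(x^r) - f^*\right), \quad \forall r \ge 0,$$

where

$$\phi = \left(\rho + \frac{1 + \beta}{\omega}\right)\left(1 + \kappa\,\frac{1 + \beta}{\omega}\right), \quad \text{and} \quad \omega \equiv \min\left(1, \inf_r \omega_r\right).$$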
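To get a feel for how these constants combine, the sketch below reads the rate on this slide as $\phi = (\rho + (1+\beta)/\omega)(1 + \kappa(1+\beta)/\omega)$ with per-iteration factor $\phi/(\phi + \gamma)$, and turns it into an iteration count. The function names and numeric values are hypothetical, chosen only to exercise the formulas.

```python
import math

def phi(rho, beta, omega, kappa):
    """phi = (rho + (1 + beta)/omega) * (1 + kappa * (1 + beta)/omega)."""
    a = (1.0 + beta) / omega
    return (rho + a) * (1.0 + kappa * a)

def contraction_factor(rho, beta, omega, kappa, gamma):
    """Per-iteration factor phi / (phi + gamma); lies in (0, 1) for positive inputs."""
    p = phi(rho, beta, omega, kappa)
    return p / (p + gamma)

def iterations_needed(factor, initial_gap, eps):
    """Iterations r such that factor**r * initial_gap <= eps."""
    if initial_gap <= eps:
        return 0
    return math.ceil(math.log(eps / initial_gap) / math.log(factor))

# Hypothetical constants, not taken from any real data set.
factor = contraction_factor(rho=1.0, beta=1.0, omega=1.0, kappa=2.0, gamma=0.5)
r = iterations_needed(factor, initial_gap=1.0, eps=1e-6)
print(factor, r)
```

Since the factor equals $1 - \gamma/(\phi + \gamma)$, a larger error bound constant $\kappa$ inflates $\phi$ and pushes the factor toward 1, which is how the data enter the iteration count.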


### Examples of the Error Bound Constant

The dual problem of l1-loss support vector classification is

$$\min_{\alpha} \ \frac{1}{2} w^\top w - \mathbf{1}^\top \alpha \quad \text{subject to} \quad w = E\alpha, \ 0 \le \alpha_i \le C, \ i = 1, \ldots, l,$$

where $E = [y_1 z_1, \ldots, y_l z_l]$ is the data matrix, $(y_i, z_i)$ is a label-instance pair, and $\mathbf{1}$ is the vector of ones.

If coordinate descent methods are used and each instance is normalized to unit length,

$$\kappa = O(\rho \theta^2 C l).$$


### Examples of the Convergence Rate

For the dual problem of l1-loss support vector classification, the cyclic coordinate descent method has global linear convergence:

$$f(x^{r+1}) - f^* \le \left(1 - \frac{1}{2\phi + 1}\right)\left(f(x^r) - f^*\right), \quad \forall r,$$

where

$$\phi = O(l \rho^2 \kappa) = O(\rho^3 \theta^2 C l^2).$$
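For reference, a cyclic coordinate descent pass on this dual problem can be sketched as follows. This is a minimal illustration with hypothetical toy data and a fixed number of sweeps, not the authors' implementation. Each step exactly minimizes the objective over one $\alpha_i$ within $[0, C]$ while maintaining $w = E\alpha$.

```python
def dual_cd_svc(E_cols, C, sweeps=50):
    """Cyclic coordinate descent for
        min_alpha 0.5 * ||E alpha||^2 - sum(alpha),  0 <= alpha_i <= C,
    where E_cols[i] = y_i * z_i. Maintains w = E alpha throughout."""
    l, n = len(E_cols), len(E_cols[0])
    alpha = [0.0] * l
    w = [0.0] * n
    history = []  # objective value after each sweep
    for _ in range(sweeps):
        for i, e in enumerate(E_cols):
            q_ii = sum(v * v for v in e)  # i-th diagonal entry of E^T E
            if q_ii == 0.0:
                continue
            grad_i = sum(wk * ek for wk, ek in zip(w, e)) - 1.0
            new_alpha = min(max(alpha[i] - grad_i / q_ii, 0.0), C)  # exact 1-D minimizer
            delta = new_alpha - alpha[i]
            w = [wk + delta * ek for wk, ek in zip(w, e)]
            alpha[i] = new_alpha
        history.append(0.5 * sum(wk * wk for wk in w) - sum(alpha))
    return alpha, w, history

# Hypothetical toy data: unit-length instances z_i with labels y_i.
Z = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]
y = [1.0, 1.0, -1.0, -1.0]
E_cols = [[yi * v for v in z] for yi, z in zip(y, Z)]
alpha, w, history = dual_cd_svc(E_cols, C=1.0)
print(alpha, history[-1])
```

Because every update is an exact one-variable minimization over a feasible interval, the objective never increases and the iterates stay in the box, which matches the feasible descent conditions.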


### Conclusions

• For some non-strongly convex functions, we provide rate analysis of linear convergence for feasible descent methods

• The key idea is to prove an error bound between any point and the optimal solution set

• Our result enables the global linear convergence of optimization methods for some machine learning problems

• Details of the proof can be found at: P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 2014.