Iteration Complexity of Feasible Descent Methods for Convex Optimization
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Joint work with Po-Wei Wang
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Introduction
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Introduction
Problem
min_{x∈X} f(x).
f(x) is convex and differentiable; X is closed and convex.
We want to know
the number of iterations needed to reach f(x^r) − f* ≤ ε.
Specifically, we investigate algorithms with linear convergence:
f(x^{r+1}) − f* ≤ (1 − 1/c)(f(x^r) − f*), ∀r.
[Figure: f(x^r) − f* versus iterations (log scale), illustrating linear convergence]
Introduction
Motivation
The dual problem of support vector classification is
min_α (1/2) w^T w − 1^T α
subject to w = Eα, 0 ≤ α_i ≤ C, i = 1, ..., l,
where E = [y_1 z_1, ..., y_l z_l] is the data matrix, (y_i, z_i) is a label-instance pair, and 1 is the vector of ones.
w^T w / 2 is strongly convex in w, but the Hessian E^T E of the dual objective can be singular, so the objective may not be strongly convex in α.
Coordinate descent methods are commonly used, but rate analysis is difficult without strong convexity.
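As a rough sketch of what such a coordinate descent step looks like (a minimal illustration with assumed variable names and toy data handling, not the implementation referenced in this talk):

```python
# Minimal sketch (not the talk's code): cyclic dual coordinate descent for the SVC
# dual above, maintaining w = sum_i alpha_i * y_i * z_i so each update is cheap.
import numpy as np

def dual_cd_svc(Z, y, C, sweeps=10):
    """Z: l x n matrix of instances z_i, y: labels in {+1, -1}, C: box upper bound."""
    l, n = Z.shape
    alpha = np.zeros(l)
    w = np.zeros(n)
    Qii = (Z ** 2).sum(axis=1)                      # diagonal entries z_i^T z_i of the Hessian
    for _ in range(sweeps):
        for i in range(l):                          # deterministic cyclic order
            if Qii[i] == 0.0:
                continue
            G = y[i] * (w @ Z[i]) - 1.0             # partial derivative w.r.t. alpha_i
            new_ai = min(max(alpha[i] - G / Qii[i], 0.0), C)
            w += (new_ai - alpha[i]) * y[i] * Z[i]  # keep w = E * alpha consistent
            alpha[i] = new_ai
    return alpha, w
```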
Introduction
Difficulties
For some convex but not strongly convex problems, asymptotic linear convergence has been proved (Luo and Tseng, 1993):
∃ r0 such that f(x^{r+1}) − f* ≤ (1 − 1/c)(f(x^r) − f*), ∀r ≥ r0.
Usually only the existence of r0 is known, not its relation to the problem parameters.
To estimate iteration numbers, we hope to have Global Linear Convergence:
f(x^{r+1}) − f* ≤ (1 − 1/c)(f(x^r) − f*), ∀r.
Introduction
Difficulties (Cont’d)
We also hope to know more about the convergence rate; that is, how the rate is related to the data.
Properties of the data include the range of feature values, the number of instances, the number of features, etc.
Introduction
Past Studies
• We are interested in deterministic algorithms (e.g., cyclic coordinate descent)
• Interestingly, more studies have been done on the complexity of randomized coordinate descent:
Linear convergence for strongly convex f(·) (Nesterov, 2012; Richtárik and Takáč, 2014; Tappenden et al., 2013)
Sub-linear convergence for non-strongly convex f (·)
(Shalev-Shwartz and Tewari, 2009; Nesterov, 2012; Shalev-Shwartz and Zhang, 2013a,b)
Introduction
Past Studies (Cont’d)
• Past work on complexity of cyclic coordinate descent:
Linear convergence for l2-loss SVM (Chang et al., 2008); smooth and strongly convex f (·) (Beck and Tetruashvili, 2013)
Sub-linear convergence for non-strongly convex f (·) (Tseng and Yun, 2009; Saha and Tewari, 2013)
Feasible descent methods and linear-convergence proof
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Feasible descent methods and linear-convergence proof
Framework: Feasible Descent Methods
A sequence {x^r} is generated by a feasible descent method if, for every iteration index r, it satisfies
x^{r+1} = [x^r − ω_r ∇f(x^r) + e^r]^+_X,
‖e^r‖ ≤ β‖x^r − x^{r+1}‖,
f(x^r) − f(x^{r+1}) ≥ γ‖x^r − x^{r+1}‖²,
where inf_r ω_r > 0, β > 0, and γ > 0.
Coordinate descent is a special case
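As a concrete instance (a toy sketch on an assumed box-constrained quadratic, not code from the paper), projected gradient descent satisfies these conditions with e^r = 0:

```python
# Minimal sketch: projected gradient descent on
#   min 0.5 x^T Q x - p^T x   subject to 0 <= x <= C,
# i.e., x^{r+1} = [x^r - omega * grad f(x^r)]^+_X with no inexactness (e^r = 0).
import numpy as np

def projected_gradient(Q, p, C, iters=100):
    """Q: symmetric positive definite; fixed step size 1/L with L = largest eigenvalue."""
    omega = 1.0 / np.linalg.eigvalsh(Q)[-1]
    x = np.zeros(len(p))
    for _ in range(iters):
        grad = Q @ x - p
        x = np.clip(x - omega * grad, 0.0, C)   # projection onto the box X = [0, C]^l
    return x
```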
Feasible descent methods and linear-convergence proof
Examples of Feasible Descent Methods for Machine Learning
Coordinate descent methods for dual Support Vector Classification (SVC)
Coordinate descent methods for dual Support Vector Regression (SVR)
Inexact coordinate descent for primal SVC
Inexact: one-variable sub-problem approximately solved
Gauss-Seidel method for solving linear systems
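For the last item, a minimal Gauss-Seidel sketch (a toy illustration assuming A is symmetric positive definite, so it is exact cyclic coordinate descent on f(x) = ½ x^T A x − b^T x):

```python
# Minimal sketch: Gauss-Seidel for Ax = b, equivalent to exact cyclic coordinate
# descent on the unconstrained quadratic f(x) = 0.5 x^T A x - b^T x.
import numpy as np

def gauss_seidel(A, b, sweeps=50):
    x = np.zeros_like(b, dtype=float)
    for _ in range(sweeps):
        for i in range(len(b)):
            # Exactly minimize f over x_i with all other coordinates fixed.
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```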
Feasible descent methods and linear-convergence proof
Projected Gradient
We need the following tools.
Definition (Convex Projection)
[y]^+_X ≡ arg min_{x∈X} ‖x − y‖.
Definition (Projected gradient)
∇⁺f(x) ≡ x − [x − ∇f(x)]^+_X.
Lemma (Optimality condition)
∇⁺f(x*) = 0 ⇔ x* is optimal.
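To make the definitions concrete, a small sketch (assuming the box constraint X = [0, C]^l from the SVM dual; an illustration, not the paper's code) evaluates the projection and uses ‖∇⁺f(x)‖ as an optimality measure:

```python
# Minimal sketch: convex projection onto a box and the projected-gradient norm,
# which is zero exactly at an optimum of min_{x in X} f(x).
import numpy as np

def project_box(y, C):
    """[y]^+_X for X = {x : 0 <= x_i <= C}."""
    return np.clip(y, 0.0, C)

def projected_gradient_norm(x, grad, C):
    """|| x - [x - grad f(x)]^+_X ||, i.e., the norm of nabla^+ f(x)."""
    return np.linalg.norm(x - project_box(x - grad, C))
```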
Feasible descent methods and linear-convergence proof
Existing Techniques to Prove Asymptotic Linear Convergence
Luo and Tseng (1993) prove the following error bound:
min_{x*∈X*} ‖x^r − x*‖ ≤ κ‖∇⁺f(x^r)‖, ∀r ≥ r0, where X* is the set of optimal solutions.
We call this a local error bound because of the requirement r ≥ r0.
We aim to prove a global error bound and to know more about κ.
Feasible descent methods and linear-convergence proof
Existing Techniques to Prove Asymptotic Linear Convergence (Cont’d)
• In a sense, a local error bound is also global: if X is compact, there exists κ̄ such that
min_{x*∈X*} ‖x^r − x*‖ ≤ κ̄‖∇⁺f(x^r)‖, ∀r ≥ 0
• Based on the existence of such bounds, linear convergence has recently been established (e.g., Hong et al., 2014; Kadkhodaie et al., 2014) for problems not covered by Luo and Tseng (1993)
• However, we are interested in rate analysis here, so we need to know how the constant κ is related to the problem
Feasible descent methods and linear-convergence proof
Sufficient Condition for Global Linear Convergence
We proved that feasible descent methods have global linear convergence if the following condition holds.
Global Error Bound from the Beginning
‖x − x̄‖ ≤ κ‖∇⁺f(x)‖ for all x satisfying x ∈ X and f(x) − f* ≤ M,
where x̄ is the optimum nearest to x, f* is the optimal value, and M ≡ f(x^0) − f*.
Feasible descent methods and linear-convergence proof
Who Has A Global Error Bound from the Beginning?
Assumption (Strongly Convex)
f(x) is σ-strongly convex and ∇f is ρ-Lipschitz continuous.
A global error bound has been proved in Pang (1987).
However, recall that our goal is to study non-strongly convex problems such as the SVM dual.
Feasible descent methods and linear-convergence proof
Who Has A Global Error Bound from the Beginning? (Cont’d)
Assumption (Strongly Convex Composition)
X is a polyhedral set {x | Ax ≤ d} and
f(x) = g(Ex) + b^T x,   (1)
where g(·) is σ_g-strongly convex and ∇f is ρ-Lipschitz continuous.
Our main result: global error bound for (1)
Then we can prove global linear convergence of feasible descent methods.
Feasible descent methods and linear-convergence proof
Key Ideas in Our Proof
The optimal solution set is a polyhedral set:
Ex* = t*, b^T x* = s*, and Ax* ≤ d.
We use Hoffman's bound (Hoffman, 1952) to bound the distance between x and a polyhedron; we proved a modified version following Li (1994):
‖x − x̄‖ ≤ θ(A, [E; b^T]) ‖[E(x − x̄); b^T(x − x̄)]‖,
where θ(A, [E; b^T]) is a constant related to A, E, and b.
Finally, we bound ‖E(x − x̄)‖² and (b^T(x − x̄))².
Rate of the linear convergence
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Rate of the linear convergence
The Error Bound Constants
We proved
‖x − x̄‖ ≤ κ‖∇⁺f(x)‖ with
κ = θ²(1 + ρ)(1 + 2‖∇g(t*)‖²/σ_g + 4M) + 2θ‖∇f(x̄)‖.
Recall that
f(x) = g(Ex) + b^T x,
where g(·) is σ_g-strongly convex and ∇f is ρ-Lipschitz continuous.
If X = R^l or b = 0, κ can be simplified to θ²(1 + ρ)/σ_g.
Rate of the linear convergence
The Convergence Rate
With an error bound, the feasible descent method
x^{r+1} = [x^r − ω_r ∇f(x^r) + e^r]^+_X,
‖e^r‖ ≤ β‖x^r − x^{r+1}‖,
f(x^r) − f(x^{r+1}) ≥ γ‖x^r − x^{r+1}‖²,
converges linearly with
f(x^{r+1}) − f* ≤ (φ/(φ + γ))(f(x^r) − f*), ∀r ≥ 0,
where
φ = (ρ + (1 + β)/ω)(1 + κ(1 + β)/ω) and ω ≡ min(1, inf_r ω_r).
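To see how such a rate translates into an iteration count, here is a small sketch; the plugged-in constants are purely hypothetical and only illustrate how the quantities above combine:

```python
# Minimal sketch: contraction factor phi/(phi + gamma) and the number of iterations
# needed to bring the objective gap below a tolerance eps (hypothetical constants).
import math

def contraction_factor(rho, beta, gamma, omega, kappa):
    t = (1.0 + beta) / omega
    phi = (rho + t) * (1.0 + kappa * t)
    return phi / (phi + gamma)

def iterations_needed(eps, initial_gap, rate):
    """Smallest r with rate**r * initial_gap <= eps."""
    return math.ceil(math.log(eps / initial_gap) / math.log(rate))

rate = contraction_factor(rho=1.0, beta=1.0, gamma=0.5, omega=1.0, kappa=10.0)
print(rate, iterations_needed(1e-3, 1.0, rate))
```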
Rate of the linear convergence
Examples of the Error Bound Constant
The dual problem of l1-loss support vector classification:
min_α (1/2) w^T w − 1^T α
subject to w = Eα, 0 ≤ α_i ≤ C, i = 1, ..., l,
where E = [y_1 z_1, ..., y_l z_l] is the data matrix, (y_i, z_i) is a label-instance pair, and 1 is the vector of ones.
If coordinate descent methods are used and each instance is normalized to unit length,
κ = O(ρθ²Cl).
Rate of the linear convergence
Examples of the Convergence Rate
For the dual problem of l1-loss support vector classification, the cyclic coordinate descent method has global linear convergence:
f(x^{r+1}) − f* ≤ (1 − 1/(2φ + 1))(f(x^r) − f*), ∀r, where
φ = O(lρ²κ) = O(ρ³θ²Cl²).
Discussions and conclusions
Outline
Introduction
Feasible descent methods and linear-convergence proof
Rate of the linear convergence
Discussions and conclusions
Discussions and conclusions
Conclusions
• For some non-strongly convex functions, we provide a rate analysis of the linear convergence of feasible descent methods
• The key idea is to prove an error bound between any point and the optimal solution set
• Our results establish global linear convergence of optimization methods for several machine learning problems
• Details of the proof can be found in: P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 2014.