
to appear in Computational Optimization and Applications, 2018

### Two smooth support vector machines for ε-insensitive regression

Weizhe Gu

Department of Mathematics School of Science Tianjin University Tianjin 300072, P.R. China E-mail: weizhegu@yahoo.com.cn

Wei-Po Chen

Department of Mathematics National Taiwan Normal University

Taipei 11677, Taiwan E-mail: weaper@gmail.com

Chun-Hsu Ko

Department of Electrical Engineering I-Shou University

Kaohsiung 840, Taiwan E-mail: chko@isu.edu.tw

Yuh-Jye Lee

Department of Applied Mathematics National Chiao Tung University

Hsinchu 300, Taiwan

E-mail: yuhjye@math.nctu.edu.tw

Jein-Shan Chen 1 Department of Mathematics National Taiwan Normal University

Taipei 11677, Taiwan E-mail: jschen@math.ntnu.edu.tw

1The author’s work is supported by Ministry of Science and Technology, Taiwan.


June 7, 2017

(1st revised on September 21, 2017) (2nd revised on November 6, 2017)

Abstract. In this paper, we propose two new smooth support vector machines for ε-insensitive regression. Based on these two smooth support vector machines, we construct two systems of smooth equations from two novel families of smoothing functions, from which we seek the solution to ε-support vector regression (ε-SVR). More specifically, using the proposed smoothing functions, we employ the smoothing Newton method to solve the systems of smooth equations. The algorithm is shown to be globally and quadratically convergent without any additional conditions. Numerical comparisons among different values of the parameter are also reported.

Key words. support vector machine, ε-insensitive loss function, ε-smooth support vector regression, smoothing Newton algorithm

### 1 Introduction

Support vector machine (SVM) is a popular and important statistical learning technology [1, 9, 10, 16, 17, 18, 19]. Generally speaking, there are two main categories of support vector machines (SVMs): support vector classification (SVC) and support vector regression (SVR). The model produced by SVR depends on a training data set $S = \{(A_1, y_1), \ldots, (A_m, y_m)\} \subseteq \mathbb{R}^n \times \mathbb{R}$, where $A_i \in \mathbb{R}^n$ is the input data and $y_i \in \mathbb{R}$ is called the observation. The main goal of ε-insensitive regression with the idea of SVMs is to find a linear or nonlinear regression function $f$ that has at most ε deviation from the actually obtained targets $y_i$ for all the training data, and at the same time is as flat as possible. This problem is called ε-support vector regression (ε-SVR).

For pedagogical reasons, we begin with the linear case, in which the regression function $f(\varpi)$ is defined as

$$f(\varpi) = \varpi^T x + b \quad \text{with } x \in \mathbb{R}^n,\ b \in \mathbb{R}. \tag{1}$$

Flatness in the case of (1) means that one seeks a small $x$. One way to ensure this is to minimize the norm of $x$; the problem ε-SVR can then be formulated as a constrained minimization problem:

$$\begin{array}{ll} \min & \dfrac{1}{2} x^T x + C \displaystyle\sum_{i=1}^m (\xi_i + \xi_i^*) \\[2mm] \text{s.t.} & y_i - A_i^T x - b \le \varepsilon + \xi_i, \\ & A_i^T x + b - y_i \le \varepsilon + \xi_i^*, \\ & \xi_i,\ \xi_i^* \ge 0, \quad i = 1, \cdots, m. \end{array} \tag{2}$$

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. This corresponds to dealing with a so called ε-insensitive loss function |ξ|ε described by

|ξ|ε = max{0, |ξ| − ε}.
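As a small illustration of this definition, the ε-insensitive loss can be evaluated directly. The sketch below is only illustrative; the function name `eps_insensitive` and the sample values are our own choices, not the paper's.

```python
def eps_insensitive(xi, eps):
    """Return |xi|_eps = max(0, |xi| - eps)."""
    return max(0.0, abs(xi) - eps)


# A deviation inside the eps-tube costs nothing;
# a larger one is charged linearly beyond the tube.
inside = eps_insensitive(0.05, 0.1)
outside = eps_insensitive(-0.30, 0.1)
```

Deviations with |ξ| ≤ ε are ignored entirely, which is what makes the loss "insensitive" near zero.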

The formulation (2) is a convex quadratic minimization problem with n+1 free variables, 2m nonnegative variables, and 2m inequality constraints, which enlarges the problem size and could increase computational complexity.

In fact, the problem (2) can be reformulated as an unconstrained optimization problem:

$$\min_{(x,b) \in \mathbb{R}^{n+1}} \ \frac{1}{2}\left(x^T x + b^2\right) + \frac{C}{2} \sum_{i=1}^m \left| A_i^T x + b - y_i \right|_\varepsilon^2. \tag{3}$$

This formulation has been proposed in active set support vector regression [8] and solved in its dual form. The objective function is strongly convex; hence, the problem has a unique global optimal solution. However, because the objective function is not twice continuously differentiable, Newton-type algorithms cannot be applied to solve (3) directly.

Lee, Hsieh and Huang [9] apply a smoothing technique to (3). The smooth function

$$f_\varepsilon(x, \alpha) = x + \frac{1}{\alpha} \log\left(1 + e^{-\alpha x}\right), \tag{4}$$

which is the integral of the sigmoid function $\frac{1}{1 + e^{-\alpha x}}$, is used to smooth the plus function $[x]_+$. More specifically, the smooth function $f_\varepsilon(x, \alpha)$ approaches $[x]_+$ as $\alpha$ goes to infinity. Then, the problem (3) is recast as a strongly convex unconstrained minimization problem with the smooth function $f_\varepsilon(x, \alpha)$, and a Newton-Armijo algorithm is proposed to solve it. It is proved that when the smoothing parameter $\alpha$ approaches infinity, the unique solution of the reformulated problem converges to the unique solution of the original problem [9, Theorem 2.2]. However, the smoothing parameter $\alpha$ is fixed in the proposed algorithm, and in the implementation of this algorithm, $\alpha$ cannot be set large enough.
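To make the role of α in (4) concrete, the following sketch evaluates the smooth function and its gap to the plus function at x = 0, where the gap equals log(2)/α. The names `plus` and `smooth_plus` are ours, and the overflow-safe rewriting used in the code is an implementation choice rather than something taken from [9].

```python
import math


def plus(x):
    """The plus function [x]_+ for a scalar x."""
    return max(x, 0.0)


def smooth_plus(x, alpha):
    """Evaluate x + log(1 + exp(-alpha * x)) / alpha without overflow.

    Uses the identity
        x + log(1 + e^{-a x}) / a = max(x, 0) + log(1 + e^{-a |x|}) / a,
    so that only non-positive arguments are exponentiated.
    """
    return max(x, 0.0) + math.log1p(math.exp(-alpha * abs(x))) / alpha


# Gap to the plus function at x = 0 for increasing alpha.
gaps = [smooth_plus(0.0, a) - plus(0.0) for a in (1.0, 10.0, 100.0)]
```

The gap shrinks like 1/α, so a fixed moderate α leaves a visible approximation error, while a very large α makes a naive evaluation of the exponential numerically delicate.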

In this paper, we introduce two smooth support vector machines for ε-insensitive regression. For the first smooth support vector machine, we reformulate ε-SVR as a strongly convex unconstrained optimization problem with one type of smoothing functions $\phi_\varepsilon(x, \alpha)$. Then, we define a new function $H_\phi$, which corresponds to the optimality condition of the unconstrained optimization problem. From the solution of $H_\phi(z) = 0$, we can obtain the solution of ε-SVR. For the second smooth support vector machine, we smooth the optimality condition of the strongly convex unconstrained optimization problem (3) with another type of smoothing functions $\psi_\varepsilon(x, \alpha)$. Accordingly, we define the function $H_\psi$, which possesses the same properties as $H_\phi$ does. For either $H_\phi = 0$ or $H_\psi = 0$, we consider the smoothing Newton method to solve it. The algorithm is shown to be globally convergent; specifically, the iterative sequence converges to the unique solution of (3). Furthermore, the algorithm is shown to be locally quadratically convergent without any assumptions.

The paper is organized as follows. In Section 2 and Section 3, we introduce two smooth support vector machine reformulations by two types of smoothing functions. In Section 4, we propose a smoothing Newton algorithm and study its global and local quadratic convergence. Numerical results and comparisons are reported in Section 5.

Throughout this paper, K := {1, 2, · · · }, all vectors will be column vectors. For a given vector x = (x1, . . . , xn)T ∈ IRn, the plus function [x]+ is defined as

([x]+)i = max{0, xi}, i = 1, · · · , n.

For a differentiable function $f$, we denote by $\nabla f(x)$ and $\nabla^2 f(x)$ the gradient and the Hessian matrix of $f$ at $x$, respectively. For a differentiable mapping $G : \mathbb{R}^n \to \mathbb{R}^m$, we denote by $G'(x) \in \mathbb{R}^{m \times n}$ the Jacobian of $G$ at $x$. For a matrix $A \in \mathbb{R}^{m \times n}$, $A_i^T$ is the $i$-th row of $A$. A column vector of ones and the identity matrix of arbitrary dimension will be denoted by $\mathbf{1}$ and $I$, respectively. We denote the sign function by

$$\mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x > 0, \\ [-1, 1] & \text{if } x = 0, \\ -1 & \text{if } x < 0. \end{cases}$$

### 2 The first smooth support vector machine

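As mentioned in [9], it is known that ε-SVR can be reformulated as the strongly convex unconstrained optimization problem (3). Denote $\omega := (x, b) \in \mathbb{R}^{n+1}$ and $\bar{A} := (A, \mathbf{1})$, and let $\bar{A}_i^T$ be the $i$-th row of $\bar{A}$. Then the smooth support vector regression (3) can be rewritten as

$$\min_{\omega} \ \frac{1}{2} \omega^T \omega + \frac{C}{2} \sum_{i=1}^m \left| \bar{A}_i^T \omega - y_i \right|_\varepsilon^2. \tag{5}$$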

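Note that $|\cdot|_\varepsilon^2$ is smooth, but not twice differentiable, which means the objective function is not twice continuously differentiable. Hence, Newton-type methods cannot be applied to solve (5) directly.

In view of this fact, we propose a family of twice continuously differentiable functions $\phi_\varepsilon(x, \alpha)$ to replace $|x|_\varepsilon^2$. The family of functions $\phi_\varepsilon(x, \alpha) : \mathbb{R} \times \mathbb{R}_+ \to \mathbb{R}_+$ is given by

$$\phi_\varepsilon(x, \alpha) = \begin{cases} (|x| - \varepsilon)^2 + \frac{1}{3}\alpha^2 & \text{if } |x| - \varepsilon \ge \alpha, \\[1mm] \frac{1}{6\alpha}(|x| - \varepsilon + \alpha)^3 & \text{if } \big||x| - \varepsilon\big| < \alpha, \\[1mm] 0 & \text{if } |x| - \varepsilon \le -\alpha, \end{cases} \tag{6}$$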

where $0 < \alpha < \varepsilon$ is a smoothing parameter. The graphs of $\phi_\varepsilon(x, \alpha)$ are depicted in Figure 1. From this geometric view, it is clear that $\phi_\varepsilon(x, \alpha)$ is a class of smoothing functions for $|x|_\varepsilon^2$.

Figure 1: Graphs of φε(x, α) with ε = 0.1 and α = 0.03, 0.06, 0.09.

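Besides the geometric approach, we now show by algebraic verification that $\phi_\varepsilon(x, \alpha)$ is a class of smoothing functions for $|x|_\varepsilon^2$. To this end, we compute the partial derivatives of $\phi_\varepsilon(x, \alpha)$ as follows:

$$\nabla_x \phi_\varepsilon(x, \alpha) = \begin{cases} 2(|x| - \varepsilon)\,\mathrm{sgn}(x) & \text{if } |x| - \varepsilon \ge \alpha, \\[1mm] \frac{1}{2\alpha}(|x| - \varepsilon + \alpha)^2\,\mathrm{sgn}(x) & \text{if } \big||x| - \varepsilon\big| < \alpha, \\[1mm] 0 & \text{if } |x| - \varepsilon \le -\alpha. \end{cases} \tag{7}$$

$$\nabla^2_{xx} \phi_\varepsilon(x, \alpha) = \begin{cases} 2 & \text{if } |x| - \varepsilon \ge \alpha, \\[1mm] \frac{|x| - \varepsilon + \alpha}{\alpha} & \text{if } \big||x| - \varepsilon\big| < \alpha, \\[1mm] 0 & \text{if } |x| - \varepsilon \le -\alpha. \end{cases} \tag{8}$$

$$\nabla^2_{x\alpha} \phi_\varepsilon(x, \alpha) = \begin{cases} 0 & \text{if } |x| - \varepsilon \ge \alpha, \\[1mm] \frac{(|x| - \varepsilon + \alpha)(\alpha - |x| + \varepsilon)}{2\alpha^2}\,\mathrm{sgn}(x) & \text{if } \big||x| - \varepsilon\big| < \alpha, \\[1mm] 0 & \text{if } |x| - \varepsilon \le -\alpha. \end{cases} \tag{9}$$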
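The closed forms (7)-(9) can be sanity-checked against central finite differences of (6). The sketch below does this on a small grid; the helper names, the grid, the step size `h`, and the parameter values ε = 0.1, α = 0.06 are illustrative assumptions of ours.

```python
def sgn(v):
    return (v > 0) - (v < 0)


def phi(x, alpha, eps):            # equation (6)
    t = abs(x) - eps
    if t >= alpha:
        return t * t + alpha * alpha / 3.0
    if t <= -alpha:
        return 0.0
    return (t + alpha) ** 3 / (6.0 * alpha)


def dphi_x(x, alpha, eps):         # equation (7)
    t = abs(x) - eps
    if t >= alpha:
        return 2.0 * t * sgn(x)
    if t <= -alpha:
        return 0.0
    return (t + alpha) ** 2 / (2.0 * alpha) * sgn(x)


def d2phi_xx(x, alpha, eps):       # equation (8)
    t = abs(x) - eps
    if t >= alpha:
        return 2.0
    if t <= -alpha:
        return 0.0
    return (t + alpha) / alpha


def d2phi_xa(x, alpha, eps):       # equation (9)
    t = abs(x) - eps
    if abs(t) >= alpha:
        return 0.0
    return (t + alpha) * (alpha - t) / (2.0 * alpha ** 2) * sgn(x)


eps, alpha, h = 0.1, 0.06, 1e-6
grid = [i / 100.0 - 0.25 for i in range(51)]

# Largest deviation between each closed form and its finite-difference estimate.
err_x = max(abs((phi(x + h, alpha, eps) - phi(x - h, alpha, eps)) / (2 * h)
                - dphi_x(x, alpha, eps)) for x in grid)
err_xx = max(abs((dphi_x(x + h, alpha, eps) - dphi_x(x - h, alpha, eps)) / (2 * h)
                 - d2phi_xx(x, alpha, eps)) for x in grid)
err_xa = max(abs((dphi_x(x, alpha + h, eps) - dphi_x(x, alpha - h, eps)) / (2 * h)
                 - d2phi_xa(x, alpha, eps)) for x in grid)
```

The second-order estimates are slightly less accurate near the junction points |x| − ε = ±α, where the second derivative has a kink, so a looser tolerance is appropriate there.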

With the above, the following lemma shows some basic properties of φε(x, α).

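Lemma 2.1. Let $\phi_\varepsilon(x, \alpha)$ be defined as in (6). Then, the following hold.

(a) For $0 < \alpha < \varepsilon$, there holds $0 \le \phi_\varepsilon(x, \alpha) - |x|_\varepsilon^2 \le \frac{1}{3}\alpha^2$.

(b) The function $\phi_\varepsilon(x, \alpha)$ is twice continuously differentiable with respect to $x$ for $0 < \alpha < \varepsilon$.

(c) $\lim_{\alpha \to 0} \phi_\varepsilon(x, \alpha) = |x|_\varepsilon^2$ and $\lim_{\alpha \to 0} \nabla_x \phi_\varepsilon(x, \alpha) = \nabla(|x|_\varepsilon^2)$.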

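Proof. (a) To complete the argument, we discuss four cases.

(i) For $|x| - \varepsilon \ge \alpha$, it is clear that $\phi_\varepsilon(x, \alpha) - |x|_\varepsilon^2 = \frac{1}{3}\alpha^2$.

(ii) For $0 < |x| - \varepsilon < \alpha$, i.e., $0 < x - \varepsilon < \alpha$ or $0 < -x - \varepsilon < \alpha$, there are two subcases.

If $0 < x - \varepsilon < \alpha$, letting $f(x) := \phi_\varepsilon(x, \alpha) - |x|_\varepsilon^2 = \frac{1}{6\alpha}(x - \varepsilon + \alpha)^3 - (x - \varepsilon)^2$ gives

$$\begin{cases} f'(x) = \dfrac{(x - \varepsilon + \alpha)^2}{2\alpha} - 2(x - \varepsilon), & \forall x \in (\varepsilon, \varepsilon + \alpha), \\[2mm] f''(x) = \dfrac{x - \varepsilon + \alpha}{\alpha} - 2 < 0, & \forall x \in (\varepsilon, \varepsilon + \alpha). \end{cases}$$

This indicates that $f'(x)$ is monotone decreasing on $(\varepsilon, \varepsilon + \alpha)$, which further implies

$$f'(x) \ge f'(\varepsilon + \alpha) = 0, \quad \forall x \in (\varepsilon, \varepsilon + \alpha).$$

Thus, $f(x)$ is monotone increasing on $(\varepsilon, \varepsilon + \alpha)$. With this, we have $f(x) \ge f(\varepsilon) = \frac{1}{6}\alpha^2 > 0$ and $f(x) \le f(\varepsilon + \alpha) = \frac{1}{3}\alpha^2$, which yields

$$0 \le \phi_\varepsilon(x, \alpha) - |x|_\varepsilon^2 \le \frac{1}{3}\alpha^2, \quad \forall x \in (\varepsilon, \varepsilon + \alpha).$$

If $0 < -x - \varepsilon < \alpha$, the arguments are similar to those above, and we omit them.

(iii) For $-\alpha < |x| - \varepsilon \le 0$, it is clear that $\phi_\varepsilon(x, \alpha) - |x|_\varepsilon^2 = \frac{1}{6\alpha}(|x| - \varepsilon + \alpha)^3 \le \frac{\alpha^3}{6\alpha} \le \frac{1}{3}\alpha^2$.

(iv) For $|x| - \varepsilon \le -\alpha$, we have $\phi_\varepsilon(x, \alpha) - |x|_\varepsilon^2 = 0$. Then, the desired result follows.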

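(b) To prove the twice continuous differentiability of $\phi_\varepsilon(x, \alpha)$, we need to check that $\phi_\varepsilon(\cdot, \alpha)$, $\nabla_x \phi_\varepsilon(\cdot, \alpha)$ and $\nabla^2_{xx} \phi_\varepsilon(\cdot, \alpha)$ are all continuous. Since they are piecewise functions, it suffices to check the junction points.

First, we check that $\phi_\varepsilon(\cdot, \alpha)$ is continuous.

(i) If $|x| - \varepsilon = \alpha$, then both adjacent branches of (6) give $\phi_\varepsilon(x, \alpha) = \frac{4}{3}\alpha^2$.

(ii) If $|x| - \varepsilon = -\alpha$, then both adjacent branches give $\phi_\varepsilon(x, \alpha) = 0$. Hence, $\phi_\varepsilon(\cdot, \alpha)$ is continuous.

Next, we check that $\nabla_x \phi_\varepsilon(\cdot, \alpha)$ is continuous.

(i) If $|x| - \varepsilon = \alpha$, then $\nabla_x \phi_\varepsilon(x, \alpha) = 2\alpha\,\mathrm{sgn}(x)$.

(ii) If $|x| - \varepsilon = -\alpha$, then $\nabla_x \phi_\varepsilon(x, \alpha) = 0$. From the above, it is clear that $\nabla_x \phi_\varepsilon(\cdot, \alpha)$ is continuous.

Now we show that $\nabla^2_{xx} \phi_\varepsilon(\cdot, \alpha)$ is continuous.

(i) If $|x| - \varepsilon = \alpha$, then $\nabla^2_{xx} \phi_\varepsilon(x, \alpha) = 2$.

(ii) If $|x| - \varepsilon = -\alpha$, then $\nabla^2_{xx} \phi_\varepsilon(x, \alpha) = 0$. Hence, $\nabla^2_{xx} \phi_\varepsilon(\cdot, \alpha)$ is continuous.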

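(c) It is clear that $\lim_{\alpha \to 0} \phi_\varepsilon(x, \alpha) = |x|_\varepsilon^2$ holds by part (a). It remains to verify $\lim_{\alpha \to 0} \nabla_x \phi_\varepsilon(x, \alpha) = \nabla(|x|_\varepsilon^2)$. First, we compute that

$$\nabla(|x|_\varepsilon^2) = \begin{cases} 2(|x| - \varepsilon)\,\mathrm{sgn}(x) & \text{if } |x| - \varepsilon \ge 0, \\ 0 & \text{if } |x| - \varepsilon < 0. \end{cases} \tag{10}$$

In light of (10), we proceed by discussing four cases.

(i) For $|x| - \varepsilon \ge \alpha$, we have $\nabla_x \phi_\varepsilon(x, \alpha) - \nabla(|x|_\varepsilon^2) = 0$. Then, the desired result follows.

(ii) For $0 < |x| - \varepsilon < \alpha$, we have

$$\nabla_x \phi_\varepsilon(x, \alpha) - \nabla(|x|_\varepsilon^2) = \frac{1}{2\alpha}(|x| - \varepsilon + \alpha)^2\,\mathrm{sgn}(x) - 2(|x| - \varepsilon)\,\mathrm{sgn}(x),$$

which yields

$$\lim_{\alpha \to 0}\left(\nabla_x \phi_\varepsilon(x, \alpha) - \nabla(|x|_\varepsilon^2)\right) = \lim_{\alpha \to 0} \frac{(|x| - \varepsilon + \alpha)^2 - 4\alpha(|x| - \varepsilon)}{2\alpha}\,\mathrm{sgn}(x).$$

We notice that $|x| \to \varepsilon$ when $\alpha \to 0$, and hence $\frac{(|x| - \varepsilon + \alpha)^2 - 4\alpha(|x| - \varepsilon)}{2\alpha}$ takes the form $\frac{0}{0}$. Then, applying L'Hôpital's rule yields

$$\lim_{\alpha \to 0} \frac{(|x| - \varepsilon + \alpha)^2 - 4\alpha(|x| - \varepsilon)}{2\alpha} = \lim_{\alpha \to 0}\left(\alpha - (|x| - \varepsilon)\right) = 0.$$

This implies $\lim_{\alpha \to 0}\left(\nabla_x \phi_\varepsilon(x, \alpha) - \nabla(|x|_\varepsilon^2)\right) = 0$, which is the desired result.

(iii) For $-\alpha < |x| - \varepsilon \le 0$, we have $\nabla_x \phi_\varepsilon(x, \alpha) - \nabla(|x|_\varepsilon^2) = \frac{1}{2\alpha}(|x| - \varepsilon + \alpha)^2\,\mathrm{sgn}(x)$. Then, applying L'Hôpital's rule gives

$$\lim_{\alpha \to 0} \frac{(|x| - \varepsilon + \alpha)^2}{2\alpha} = \lim_{\alpha \to 0}(|x| - \varepsilon + \alpha) = 0.$$

Thus, we prove that $\lim_{\alpha \to 0}\left(\nabla_x \phi_\varepsilon(x, \alpha) - \nabla(|x|_\varepsilon^2)\right) = 0$ in this case.

(iv) For $|x| - \varepsilon \le -\alpha$, we have $\nabla_x \phi_\varepsilon(x, \alpha) - \nabla(|x|_\varepsilon^2) = 0$. Then, the desired result follows clearly. $\Box$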
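The bound in Lemma 2.1(a) is easy to probe numerically. The following sketch evaluates the gap between φε(x, α) from (6) and |x|²ε on a grid for the three values of α used in Figure 1; the function names and the grid are our own illustrative choices.

```python
def eps_loss_sq(x, eps):
    """Squared eps-insensitive loss |x|_eps^2."""
    return max(0.0, abs(x) - eps) ** 2


def phi(x, alpha, eps):
    """Smoothing function phi_eps(x, alpha) from equation (6)."""
    t = abs(x) - eps
    if t >= alpha:
        return t * t + alpha * alpha / 3.0
    if t <= -alpha:
        return 0.0
    return (t + alpha) ** 3 / (6.0 * alpha)


eps = 0.1
grid = [i / 1000.0 - 0.3 for i in range(601)]   # x in [-0.3, 0.3]

# Smallest and largest gap phi - |x|_eps^2 for each smoothing parameter.
gap_range = {}
for alpha in (0.03, 0.06, 0.09):
    gaps = [phi(x, alpha, eps) - eps_loss_sq(x, eps) for x in grid]
    gap_range[alpha] = (min(gaps), max(gaps))
```

On this grid the gap stays between 0 and α²/3, and the upper value is attained wherever |x| − ε ≥ α.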

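Now, we use the family of smoothing functions $\phi_\varepsilon$ to replace the square of the ε-insensitive loss function in (5) to obtain the first smooth support vector regression. In other words, we consider

$$\min_{\omega} \ F_{\varepsilon,\alpha}(\omega) := \frac{1}{2}\omega^T \omega + \frac{C}{2}\mathbf{1}^T \Phi_\varepsilon\left(\bar{A}\omega - y, \alpha\right), \tag{11}$$

where $\omega := (x, b) \in \mathbb{R}^{n+1}$, and $\Phi_\varepsilon(Ax + \mathbf{1}b - y, \alpha) \in \mathbb{R}^m$ is defined by

$$\Phi_\varepsilon(Ax + \mathbf{1}b - y, \alpha)_i = \phi_\varepsilon(A_i^T x + b - y_i, \alpha).$$

This is a strongly convex unconstrained optimization problem with a twice continuously differentiable objective function. Noting $\lim_{\alpha \to 0} \phi_\varepsilon(x, \alpha) = |x|_\varepsilon^2$, we see that

$$\min_{\omega} \ F_{\varepsilon,0}(\omega) := \lim_{\alpha \to 0} F_{\varepsilon,\alpha}(\omega) = \frac{1}{2}\omega^T \omega + \frac{C}{2}\sum_{i=1}^m \left|\bar{A}_i^T \omega - y_i\right|_\varepsilon^2, \tag{12}$$

which is exactly the problem (5).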

The following theorem shows that the unique solution of the smooth problem (11) approaches the unique solution of the problem (12) as α → 0. Indeed, it plays the same role as [9, Theorem 2.2].

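Theorem 2.1. Let $F_{\varepsilon,\alpha}(\omega)$ and $F_{\varepsilon,0}(\omega)$ be defined as in (11) and (12), respectively. Then, the following hold.

(a) There exists a unique solution $\bar\omega_\alpha$ to $\min_{\omega \in \mathbb{R}^{n+1}} F_{\varepsilon,\alpha}(\omega)$ and a unique solution $\bar\omega$ to $\min_{\omega \in \mathbb{R}^{n+1}} F_{\varepsilon,0}(\omega)$.

(b) For all $0 < \alpha < \varepsilon$, we have the following inequality:

$$\|\bar\omega_\alpha - \bar\omega\|^2 \le \frac{1}{6} C m \alpha^2. \tag{13}$$

Moreover, $\bar\omega_\alpha$ converges to $\bar\omega$ as $\alpha \to 0$ with an upper bound given by (13).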

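Proof. (a) In view of $\phi_\varepsilon(x, \alpha) - |x|_\varepsilon^2 \ge 0$ in Lemma 2.1(a), we see that the level sets

$$L_v(F_{\varepsilon,\alpha}(\omega)) := \left\{\omega \in \mathbb{R}^{n+1} \mid F_{\varepsilon,\alpha}(\omega) \le v\right\}, \qquad L_v(F_{\varepsilon,0}(\omega)) := \left\{\omega \in \mathbb{R}^{n+1} \mid F_{\varepsilon,0}(\omega) \le v\right\}$$

satisfy

$$L_v(F_{\varepsilon,\alpha}(\omega)) \subseteq L_v(F_{\varepsilon,0}(\omega)) \subseteq \left\{\omega \in \mathbb{R}^{n+1} \mid \omega^T \omega \le 2v\right\} \tag{14}$$

for any $v \ge 0$. Hence, $L_v(F_{\varepsilon,\alpha}(\omega))$ and $L_v(F_{\varepsilon,0}(\omega))$ are compact (closed and bounded) subsets of $\mathbb{R}^{n+1}$. Then, by the strong convexity of $F_{\varepsilon,0}(\omega)$ and $F_{\varepsilon,\alpha}(\omega)$ with $\alpha > 0$, each of the problems $\min_{\omega \in \mathbb{R}^{n+1}} F_{\varepsilon,\alpha}(\omega)$ and $\min_{\omega \in \mathbb{R}^{n+1}} F_{\varepsilon,0}(\omega)$ has a unique solution.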

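(b) From the optimality condition and strong convexity of $F_{\varepsilon,0}(\omega)$ and $F_{\varepsilon,\alpha}(\omega)$ with $\alpha > 0$, we know that

$$F_{\varepsilon,0}(\bar\omega_\alpha) - F_{\varepsilon,0}(\bar\omega) \ge \nabla F_{\varepsilon,0}(\bar\omega)^T(\bar\omega_\alpha - \bar\omega) + \frac{1}{2}\|\bar\omega_\alpha - \bar\omega\|^2 \ge \frac{1}{2}\|\bar\omega_\alpha - \bar\omega\|^2, \tag{15}$$

$$F_{\varepsilon,\alpha}(\bar\omega) - F_{\varepsilon,\alpha}(\bar\omega_\alpha) \ge \nabla F_{\varepsilon,\alpha}(\bar\omega_\alpha)^T(\bar\omega - \bar\omega_\alpha) + \frac{1}{2}\|\bar\omega - \bar\omega_\alpha\|^2 \ge \frac{1}{2}\|\bar\omega - \bar\omega_\alpha\|^2. \tag{16}$$

Note that $F_{\varepsilon,\alpha}(\omega) \ge F_{\varepsilon,0}(\omega)$ because $\phi_\varepsilon(x, \alpha) - |x|_\varepsilon^2 \ge 0$. Then, adding up (15) and (16) along with this fact yields

$$\begin{aligned} \|\bar\omega_\alpha - \bar\omega\|^2 &\le \left(F_{\varepsilon,\alpha}(\bar\omega) - F_{\varepsilon,0}(\bar\omega)\right) - \left(F_{\varepsilon,\alpha}(\bar\omega_\alpha) - F_{\varepsilon,0}(\bar\omega_\alpha)\right) \\ &\le F_{\varepsilon,\alpha}(\bar\omega) - F_{\varepsilon,0}(\bar\omega) \\ &= \frac{C}{2}\mathbf{1}^T \Phi_\varepsilon(\bar{A}\bar\omega - y, \alpha) - \frac{C}{2}\sum_{i=1}^m \left|\bar{A}_i^T \bar\omega - y_i\right|_\varepsilon^2 \\ &= \frac{C}{2}\sum_{i=1}^m \phi_\varepsilon(\bar{A}_i^T \bar\omega - y_i, \alpha) - \frac{C}{2}\sum_{i=1}^m \left|\bar{A}_i^T \bar\omega - y_i\right|_\varepsilon^2 \\ &\le \frac{1}{6} C m \alpha^2, \end{aligned}$$

where the last inequality is due to Lemma 2.1(a). It is clear that $\bar\omega_\alpha$ converges to $\bar\omega$ as $\alpha \to 0$ with an upper bound given by the above. Then, the proof is complete. $\Box$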
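The bound (13) can be observed on a toy instance. The sketch below makes a strong simplifying assumption that is ours, not the paper's: the model has only the intercept b (that is, n = 0 and ω = b), so both minimizers can be found by bisection on the derivative of the objective. The data `y`, the constant `C`, and all function names are hypothetical.

```python
def sgn(v):
    return (v > 0) - (v < 0)


def dphi_x(x, alpha, eps):
    """Derivative of phi_eps with respect to x, equation (7)."""
    t = abs(x) - eps
    if t >= alpha:
        return 2.0 * t * sgn(x)
    if t <= -alpha:
        return 0.0
    return (t + alpha) ** 2 / (2.0 * alpha) * sgn(x)


def grad_smooth(b, y, C, eps, alpha):
    # Derivative of F_{eps,alpha}(b) = b^2/2 + (C/2) * sum_i phi_eps(b - y_i, alpha).
    return b + 0.5 * C * sum(dphi_x(b - yi, alpha, eps) for yi in y)


def grad_limit(b, y, C, eps):
    # Derivative of F_{eps,0}(b) = b^2/2 + (C/2) * sum_i |b - y_i|_eps^2.
    return b + C * sum(max(abs(b - yi) - eps, 0.0) * sgn(b - yi) for yi in y)


def root(g, lo, hi, iters=200):
    """Bisection for a strictly increasing function g with g(lo) < 0 < g(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


y, C, eps = [0.0, 0.5, 1.0, 3.0], 10.0, 0.1
M = max(abs(v) for v in y) + 1.0            # the minimizers lie in [-M, M]
b_limit = root(lambda b: grad_limit(b, y, C, eps), -M, M)

# For each alpha: (squared distance between minimizers, right-hand side of (13)).
results = {}
for alpha in (0.03, 0.06, 0.09):
    b_alpha = root(lambda b: grad_smooth(b, y, C, eps, alpha), -M, M)
    results[alpha] = ((b_alpha - b_limit) ** 2, C * len(y) * alpha ** 2 / 6.0)
```

Bisection is used only because the toy problem is one-dimensional; it is not the method proposed in the paper, which is described in Section 4.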

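Next, we focus on the optimality condition of the minimization problem (11), which is both necessary and sufficient for (11) and has the form

$$\nabla_\omega F_{\varepsilon,\alpha}(\omega) = 0.$$

With this, we define a function $H_\phi : \mathbb{R}^{n+2} \to \mathbb{R}^{n+2}$ by

$$H_\phi(z) = \begin{bmatrix} \alpha \\ \nabla_\omega F_{\varepsilon,\alpha}(\omega) \end{bmatrix} = \begin{bmatrix} \alpha \\ \omega + C\sum_{i=1}^m \nabla_x \phi_\varepsilon(\bar{A}_i^T \omega - y_i, \alpha)\,\bar{A}_i \end{bmatrix}, \tag{17}$$

where $z := (\alpha, \omega) \in \mathbb{R}^{n+2}$. From Lemma 2.1 and the strong convexity of $F_{\varepsilon,\alpha}(\omega)$, it is easy to see that if $H_\phi(z) = 0$, then $\alpha = 0$ and $\omega$ solves (11) with $\alpha = 0$, that is, problem (12); and for any $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$, the function $H_\phi$ is continuously differentiable. In addition, the Jacobian of $H_\phi$ can be calculated as follows:

$$H_\phi'(z) = \begin{bmatrix} 1 & 0 \\ \nabla^2_{\omega\alpha} F_{\varepsilon,\alpha}(\omega) & \nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega) \end{bmatrix}, \tag{18}$$

where

$$\nabla^2_{\omega\alpha} F_{\varepsilon,\alpha}(\omega) = C\sum_{i=1}^m \nabla^2_{x\alpha} \phi_\varepsilon(\bar{A}_i^T \omega - y_i, \alpha)\,\bar{A}_i,$$

$$\nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega) = I + C\sum_{i=1}^m \nabla^2_{xx} \phi_\varepsilon(\bar{A}_i^T \omega - y_i, \alpha)\,\bar{A}_i \bar{A}_i^T.$$

From (8), we can see $\nabla^2_{xx} \phi_\varepsilon(x, \alpha) \ge 0$, which implies that $C\sum_{i=1}^m \nabla^2_{xx} \phi_\varepsilon(\bar{A}_i^T \omega - y_i, \alpha)\,\bar{A}_i \bar{A}_i^T$ is positive semidefinite. Hence, $\nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega)$ is positive definite. This helps us to prove that $H_\phi'(z)$ is invertible at any $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$. In fact, if there exists a vector $d := (d_1, d_2) \in \mathbb{R} \times \mathbb{R}^{n+1}$ such that $H_\phi'(z)d = 0$, then we have

$$\begin{bmatrix} d_1 \\ d_1 \nabla^2_{\omega\alpha} F_{\varepsilon,\alpha}(\omega) + \nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega)\,d_2 \end{bmatrix} = 0.$$

The first row gives $d_1 = 0$, and then the positive definiteness of $\nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega)$ forces $d_2 = 0$. This implies that $d = 0$, and hence $H_\phi'(z)$ is invertible at any $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$.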

### 3 The second smooth support vector machine

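In this section, we consider another type of smoothing functions $\psi_{\varepsilon,p}(x, \alpha) : \mathbb{R} \times \mathbb{R}_+ \to \mathbb{R}_+$, which is defined by

$$\psi_{\varepsilon,p}(x, \alpha) = \begin{cases} 0 & \text{if } 0 \le |x| \le \varepsilon - \alpha, \\[1mm] \dfrac{\alpha}{p-1}\left[\dfrac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^p & \text{if } \varepsilon - \alpha < |x| < \varepsilon + \dfrac{\alpha}{p-1}, \\[3mm] |x| - \varepsilon & \text{if } |x| \ge \varepsilon + \dfrac{\alpha}{p-1}. \end{cases} \tag{19}$$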

Here $p \ge 2$. The graphs of $\psi_{\varepsilon,p}(x, \alpha)$ are depicted in Figure 2, which clearly show that $\psi_{\varepsilon,p}(x, \alpha)$ is a family of smoothing functions for $|x|_\varepsilon$.

Figure 2: Graphs of ψε,p(x, α) with ε = 0.1, α = 0.03, 0.06, 0.09 and p = 2.
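For readers who want to reproduce curves like those in Figure 2, definition (19) translates into a few lines of code. The sketch below also measures the jump of ψε,p across its two junction points; the function name `psi`, the offset 1e-9, and the parameter values are our own illustrative choices.

```python
def psi(x, alpha, eps, p=2):
    """Smoothing function psi_{eps,p}(x, alpha) from equation (19)."""
    ax = abs(x)
    if ax <= eps - alpha:
        return 0.0
    if ax >= eps + alpha / (p - 1):
        return ax - eps
    u = (p - 1) * (ax - eps + alpha) / (p * alpha)
    return alpha / (p - 1) * u ** p


eps, alpha = 0.1, 0.06
jumps = {}
for p in (2, 3, 5):
    lo = eps - alpha                  # left junction point
    hi = eps + alpha / (p - 1)        # right junction point
    jump_lo = abs(psi(lo + 1e-9, alpha, eps, p) - psi(lo, alpha, eps, p))
    jump_hi = abs(psi(hi - 1e-9, alpha, eps, p) - psi(hi, alpha, eps, p))
    jumps[p] = (jump_lo, jump_hi)
```

The jumps are of the order of the offset itself, which is what continuity at the junction points predicts. Larger p shrinks the right junction point toward ε, so the curve leaves the exact loss |x|ε over a narrower band.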

As shown in Lemma 3.1 below, $\psi_{\varepsilon,p}(x, \alpha)$ is a family of smoothing functions for $|x|_\varepsilon$; hence, $\psi_{\varepsilon,p}^2(x, \alpha)$ is also a family of smoothing functions for $|x|_\varepsilon^2$. Then, we can employ $\psi_{\varepsilon,p}^2$ to replace the square of the ε-insensitive loss function in (5) in the same way as in Section 2. The graphs of $\psi_{\varepsilon,p}^2(x, \alpha)$ in comparison with $\phi_\varepsilon(x, \alpha)$ are depicted in Figure 3. In fact, there is a relation between $\psi_{\varepsilon,p}^2(x, \alpha)$ and $\phi_\varepsilon(x, \alpha)$, which is shown in Proposition 3.1.

Figure 3: Graphs of |x|²ε, φε(x, α) and ψ²ε,p(x, α) with ε = 0.1, α = 0.06, 0.09 and p = 2.

In other words, we obtain an alternative strongly convex unconstrained optimization problem for (5):

$$\min_{\omega} \ \frac{1}{2}\omega^T \omega + \frac{C}{2}\sum_{i=1}^m \psi_{\varepsilon,p}^2\left(\bar{A}_i^T \omega - y_i, \alpha\right). \tag{20}$$

However, the smooth function $\psi_{\varepsilon,p}^2(x, \alpha)$ is not twice differentiable with respect to $x$, and hence the objective function of (20) is not twice differentiable although it is smooth. Thus, we still cannot apply Newton-type methods to solve (20). To overcome this, we adopt another smoothing technique. Before presenting the idea of this smoothing technique, we need the following two lemmas regarding properties of $\psi_{\varepsilon,p}(x, \alpha)$. To this end, we also compute the partial derivatives of $\psi_{\varepsilon,p}(x, \alpha)$ as follows:

$$\nabla_x \psi_{\varepsilon,p}(x, \alpha) = \begin{cases} 0 & \text{if } 0 \le |x| \le \varepsilon - \alpha, \\[1mm] \mathrm{sgn}(x)\left[\dfrac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^{p-1} & \text{if } \varepsilon - \alpha < |x| < \varepsilon + \dfrac{\alpha}{p-1}, \\[3mm] \mathrm{sgn}(x) & \text{if } |x| \ge \varepsilon + \dfrac{\alpha}{p-1}. \end{cases}$$

$$\nabla_\alpha \psi_{\varepsilon,p}(x, \alpha) = \begin{cases} 0 & \text{if } 0 \le |x| \le \varepsilon - \alpha, \\[1mm] \dfrac{(\varepsilon - |x|)(p-1) + \alpha}{p\alpha}\left[\dfrac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^{p-1} & \text{if } \varepsilon - \alpha < |x| < \varepsilon + \dfrac{\alpha}{p-1}, \\[3mm] 0 & \text{if } |x| \ge \varepsilon + \dfrac{\alpha}{p-1}. \end{cases}$$

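Lemma 3.1. Let $\psi_{\varepsilon,p}(x, \alpha)$ be defined as in (19). Then, we have

(a) $\psi_{\varepsilon,p}(x, \alpha)$ is smooth with respect to $x$ for any $p \ge 2$;

(b) $\lim_{\alpha \to 0} \psi_{\varepsilon,p}(x, \alpha) = |x|_\varepsilon$ for any $p \ge 2$.

Proof. (a) To prove the result, we need to check that both $\psi_{\varepsilon,p}(\cdot, \alpha)$ and $\nabla_x \psi_{\varepsilon,p}(\cdot, \alpha)$ are continuous.

(i) If $|x| = \varepsilon - \alpha$, then $\psi_{\varepsilon,p}(x, \alpha) = 0$.

(ii) If $|x| = \varepsilon + \frac{\alpha}{p-1}$, then $\psi_{\varepsilon,p}(x, \alpha) = \frac{\alpha}{p-1}$. From (i) and (ii), it is clear that $\psi_{\varepsilon,p}(\cdot, \alpha)$ is continuous.

Moreover,

(i) If $|x| = \varepsilon - \alpha$, then $\nabla_x \psi_{\varepsilon,p}(x, \alpha) = 0$.

(ii) If $|x| = \varepsilon + \frac{\alpha}{p-1}$, then $\nabla_x \psi_{\varepsilon,p}(x, \alpha) = \mathrm{sgn}(x)$. In view of (i) and (ii), we see that $\nabla_x \psi_{\varepsilon,p}(\cdot, \alpha)$ is continuous.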

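(b) To proceed, we discuss four cases.

(1) If $0 \le |x| \le \varepsilon - \alpha$, then $\psi_{\varepsilon,p}(x, \alpha) - |x|_\varepsilon = 0$. Then, the desired result follows.

(2) If $\varepsilon - \alpha \le |x| \le \varepsilon$, then $\psi_{\varepsilon,p}(x, \alpha) - |x|_\varepsilon = \frac{\alpha}{p-1}\left[\frac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^p$. Hence,

$$\lim_{\alpha \to 0}\left(\psi_{\varepsilon,p}(x, \alpha) - |x|_\varepsilon\right) = \lim_{\alpha \to 0}\frac{\alpha}{p-1}\left[\frac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^p = \lim_{\alpha \to 0}\frac{\alpha}{p-1} \cdot \lim_{\alpha \to 0}\left[\frac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^p.$$

It is clear that the first limit is zero, so we only need to show that the second limit is bounded. To this end, we rewrite it as

$$\lim_{\alpha \to 0}\left[\frac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^p = \lim_{\alpha \to 0}\left(\frac{p-1}{p}\right)^p\left(\frac{|x| - \varepsilon + \alpha}{\alpha}\right)^p.$$

We notice that $|x| \to \varepsilon$ when $\alpha \to 0$, so that $\frac{|x| - \varepsilon + \alpha}{\alpha}$ takes the form $\frac{0}{0}$. Therefore, by applying L'Hôpital's rule, we obtain

$$\lim_{\alpha \to 0}\frac{|x| - \varepsilon + \alpha}{\alpha} = 1,$$

which implies that $\lim_{\alpha \to 0}\left(\psi_{\varepsilon,p}(x, \alpha) - |x|_\varepsilon\right) = 0$ in this case.

(3) If $\varepsilon \le |x| \le \varepsilon + \frac{\alpha}{p-1}$, then

$$\psi_{\varepsilon,p}(x, \alpha) - |x|_\varepsilon = \frac{\alpha}{p-1}\left[\frac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^p - (|x| - \varepsilon).$$

We have shown in case (2) that

$$\lim_{\alpha \to 0}\frac{\alpha}{p-1}\left[\frac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^p = 0.$$

It is also obvious that $\lim_{\alpha \to 0}(|x| - \varepsilon) = 0$. Hence, we obtain $\lim_{\alpha \to 0}\left(\psi_{\varepsilon,p}(x, \alpha) - |x|_\varepsilon\right) = 0$ in this case.

(4) If $|x| \ge \varepsilon + \frac{\alpha}{p-1}$, the desired result follows since it is clear that $\psi_{\varepsilon,p}(x, \alpha) - |x|_\varepsilon = 0$.

From all the above, the proof is complete. $\Box$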

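Lemma 3.2. Let $\psi_{\varepsilon,p}(x, \alpha)$ be defined as in (19). Then, we have

(a) $\psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x)$ is smooth with respect to $x$ for any $p \ge 2$;

(b) $\lim_{\alpha \to 0} \psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x) = |x|_\varepsilon\,\mathrm{sgn}(x)$ for any $p \ge 2$.

Proof. (a) First, we observe that $\psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x)$ can be written as

$$\psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x) = \begin{cases} 0 & \text{if } 0 \le |x| \le \varepsilon - \alpha, \\[1mm] \dfrac{\alpha}{p-1}\left[\dfrac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^p \mathrm{sgn}(x) & \text{if } \varepsilon - \alpha < |x| < \varepsilon + \dfrac{\alpha}{p-1}, \\[3mm] (|x| - \varepsilon)\,\mathrm{sgn}(x) & \text{if } |x| \ge \varepsilon + \dfrac{\alpha}{p-1}. \end{cases}$$

Note that $\mathrm{sgn}(x)$ is continuous at $x \ne 0$ and $\psi_{\varepsilon,p}(x, \alpha) = 0$ at $x = 0$; then applying Lemma 3.1(a) shows that $\psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x)$ is continuous. Furthermore, by simple calculations, we have

$$\nabla_x\left(\psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x)\right) = \nabla_x \psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x) = \begin{cases} 0 & \text{if } 0 \le |x| \le \varepsilon - \alpha, \\[1mm] \left[\dfrac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^{p-1} & \text{if } \varepsilon - \alpha < |x| < \varepsilon + \dfrac{\alpha}{p-1}, \\[3mm] 1 & \text{if } |x| \ge \varepsilon + \dfrac{\alpha}{p-1}. \end{cases} \tag{21}$$

Mimicking the arguments in Lemma 3.1(a), we can verify that $\nabla_x\left(\psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x)\right)$ is continuous. Thus, the desired result follows.

(b) By Lemma 3.1(b), it is easy to see that $\lim_{\alpha \to 0} \psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x) = |x|_\varepsilon\,\mathrm{sgn}(x)$. Then, the desired result follows. $\Box$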

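Note that $|x|_\varepsilon^2$ is smooth with

$$\nabla(|x|_\varepsilon^2) = 2|x|_\varepsilon\,\mathrm{sgn}(x) = \begin{cases} 2(|x| - \varepsilon)\,\mathrm{sgn}(x) & \text{if } |x| > \varepsilon, \\ 0 & \text{if } |x| \le \varepsilon, \end{cases}$$

being continuous (but not differentiable). Then, we consider the optimality condition of (12), that is,

$$\nabla_\omega F_{\varepsilon,0}(\omega) = \omega + C\sum_{i=1}^m \left|\bar{A}_i^T \omega - y_i\right|_\varepsilon \mathrm{sgn}(\bar{A}_i^T \omega - y_i)\,\bar{A}_i = 0, \tag{22}$$

which is both necessary and sufficient for (5). Hence, solving (22) is equivalent to solving (5).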

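Using the family of smoothing functions $\psi_{\varepsilon,p}$ to replace the ε-loss function in (22) leads to a system of smooth equations. More specifically, we define a function $H_\psi : \mathbb{R}^{n+2} \to \mathbb{R}^{n+2}$ by

$$H_\psi(z) = H_\psi(\alpha, \omega) = \begin{bmatrix} \alpha \\ \omega + C\sum_{i=1}^m \psi_{\varepsilon,p}(\bar{A}_i^T \omega - y_i, \alpha)\,\mathrm{sgn}(\bar{A}_i^T \omega - y_i)\,\bar{A}_i \end{bmatrix},$$

where $z := (\alpha, \omega) \in \mathbb{R}^{n+2}$. From Lemma 3.1, it is easy to see that if $H_\psi(z) = 0$, then $\alpha = 0$ and $\omega$ is the solution of the equations (22), i.e., the solution of (12). Moreover, for any $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$, the function $H_\psi$ is continuously differentiable with

$$H_\psi'(z) = \begin{bmatrix} 1 & 0 \\ E(\omega) & I + D(\omega) \end{bmatrix}, \tag{23}$$

where

$$E(\omega) = C\sum_{i=1}^m \nabla_\alpha \psi_{\varepsilon,p}(\bar{A}_i^T \omega - y_i, \alpha)\,\mathrm{sgn}(\bar{A}_i^T \omega - y_i)\,\bar{A}_i,$$

$$D(\omega) = C\sum_{i=1}^m \nabla_x \psi_{\varepsilon,p}(\bar{A}_i^T \omega - y_i, \alpha)\,\mathrm{sgn}(\bar{A}_i^T \omega - y_i)\,\bar{A}_i \bar{A}_i^T.$$

Because $\nabla_x \psi_{\varepsilon,p}(\bar{A}_i^T \omega - y_i, \alpha)\,\mathrm{sgn}(\bar{A}_i^T \omega - y_i)$ is nonnegative for any $\alpha > 0$ by (21), we see that $I + D(\omega)$ is positive definite at any $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$. Following arguments similar to those in Section 2, we obtain that $H_\psi'(z)$ is invertible at any $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$.

Proposition 3.1. Let $\phi_\varepsilon(x, \alpha)$ be defined as in (6) and $\psi_{\varepsilon,p}(x, \alpha)$ be defined as in (19). Then, the following hold.

(a) For $p \ge 2$, we have $\phi_\varepsilon(x, \alpha) \ge \psi_{\varepsilon,p}^2(x, \alpha) \ge |x|_\varepsilon^2$.

(b) For $p \ge q \ge 2$, we have $\psi_{\varepsilon,q}(x, \alpha) \ge \psi_{\varepsilon,p}(x, \alpha)$.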

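Proof. (a) First, we show that $\phi_\varepsilon(x, \alpha) \ge \psi_{\varepsilon,p}^2(x, \alpha)$ holds. To proceed, we discuss four cases.

(i) If $|x| \le \varepsilon - \alpha$, then $\phi_\varepsilon(x, \alpha) = 0 = \psi_{\varepsilon,p}^2(x, \alpha)$.

(ii) If $\varepsilon - \alpha < |x| < \varepsilon + \frac{\alpha}{p-1}$, then $|x| \le \varepsilon + \frac{\alpha}{p-1}$, which is equivalent to $\frac{1}{|x| - \varepsilon + \alpha} \ge \frac{p-1}{\alpha p}$. Thus, we have

$$\frac{\phi_\varepsilon(x, \alpha)}{\psi_{\varepsilon,p}^2(x, \alpha)} = \frac{\alpha^{2p-3} p^{2p}}{6(p-1)^{2p-2}(|x| - \varepsilon + \alpha)^{2p-3}} \ge \frac{p^3}{6(p-1)} \ge 1,$$

which implies $\phi_\varepsilon(x, \alpha) \ge \psi_{\varepsilon,p}^2(x, \alpha)$.

(iii) For $\varepsilon + \frac{\alpha}{p-1} \le |x| < \varepsilon + \alpha$, letting $t := |x| - \varepsilon \in \left[\frac{\alpha}{p-1}, \alpha\right)$ yields

$$\phi_\varepsilon(x, \alpha) - \psi_{\varepsilon,p}^2(x, \alpha) = \frac{1}{6\alpha}(t + \alpha)^3 - t^2 = t\left(\frac{1}{6\alpha}t^2 - \frac{1}{2}t + \frac{1}{2}\alpha\right) + \frac{1}{6}\alpha^2 \ge 0.$$

Here the last inequality follows from the fact that the discriminant of $\frac{1}{6\alpha}t^2 - \frac{1}{2}t + \frac{1}{2}\alpha$ is less than 0 and $\frac{1}{6\alpha} > 0$. Then, $\phi_\varepsilon(x, \alpha) - \psi_{\varepsilon,p}^2(x, \alpha) > 0$.

(iv) If $|x| \ge \varepsilon + \alpha$, then it is clear that $\phi_\varepsilon(x, \alpha) = (|x| - \varepsilon)^2 + \frac{1}{3}\alpha^2 \ge (|x| - \varepsilon)^2 = \psi_{\varepsilon,p}^2(x, \alpha)$.

Now we show the other part, $\psi_{\varepsilon,p}(x, \alpha) \ge |x|_\varepsilon$, which is equivalent to verifying $\psi_{\varepsilon,p}^2(x, \alpha) \ge |x|_\varepsilon^2$. Again, we discuss four cases.

(i) If $|x| \le \varepsilon - \alpha$, then $\psi_{\varepsilon,p}(x, \alpha) = 0 = |x|_\varepsilon$.

(ii) If $\varepsilon - \alpha < |x| \le \varepsilon$, then $\varepsilon - \alpha < |x|$, which says $|x| - \varepsilon + \alpha > 0$. Thus, we have $\psi_{\varepsilon,p}(x, \alpha) \ge 0 = |x|_\varepsilon$.

(iii) For $\varepsilon < |x| < \varepsilon + \frac{\alpha}{p-1}$, we let $t := |x| - \varepsilon \in \left(0, \frac{\alpha}{p-1}\right)$ and define

$$f(t) = \frac{\alpha}{p-1}\left[\frac{(p-1)(t + \alpha)}{p\alpha}\right]^p - t,$$

which is a function on $\left[0, \frac{\alpha}{p-1}\right]$. Note that $f(|x| - \varepsilon) = \psi_{\varepsilon,p}(x, \alpha) - |x|_\varepsilon$ for $|x| \in \left(\varepsilon, \varepsilon + \frac{\alpha}{p-1}\right)$ and observe that

$$f'(t) = \left[\frac{(p-1)(t + \alpha)}{p\alpha}\right]^{p-1} - 1 \le \left[\frac{(p-1)\left(\frac{\alpha}{p-1} + \alpha\right)}{p\alpha}\right]^{p-1} - 1 = 0.$$

This means $f(t)$ is monotone decreasing on $\left(0, \frac{\alpha}{p-1}\right)$. Since $f\left(\frac{\alpha}{p-1}\right) = 0$, we have $f(t) \ge 0$ for $t \in \left(0, \frac{\alpha}{p-1}\right)$, which implies $\psi_{\varepsilon,p}(x, \alpha) \ge |x|_\varepsilon$ for $|x| \in \left(\varepsilon, \varepsilon + \frac{\alpha}{p-1}\right)$.

(iv) If $|x| \ge \varepsilon + \frac{\alpha}{p-1}$, then it is clear that $\psi_{\varepsilon,p}(x, \alpha) = |x| - \varepsilon = |x|_\varepsilon$.

(b) For $p \ge q \ge 2$, it is obvious that

$$\psi_{\varepsilon,q}(x, \alpha) = \psi_{\varepsilon,p}(x, \alpha) \quad \text{for } |x| \in [0, \varepsilon - \alpha] \cup \left[\varepsilon + \frac{\alpha}{q-1}, +\infty\right).$$

If $|x| \in \left[\varepsilon + \frac{\alpha}{p-1}, \varepsilon + \frac{\alpha}{q-1}\right)$, then $\psi_{\varepsilon,p}(x, \alpha) = |x|_\varepsilon \le \psi_{\varepsilon,q}(x, \alpha)$ from the above. Thus, we only need to prove the case of $|x| \in \left(\varepsilon - \alpha, \varepsilon + \frac{\alpha}{p-1}\right)$.

Consider $|x| \in \left(\varepsilon - \alpha, \varepsilon + \frac{\alpha}{p-1}\right)$ and $t := |x| - \varepsilon + \alpha$; we observe that $\frac{\alpha}{t} \ge \frac{p-1}{p}$. Then, we verify that

$$\begin{aligned} \frac{\psi_{\varepsilon,q}(x, \alpha)}{\psi_{\varepsilon,p}(x, \alpha)} &= \frac{(q-1)^{q-1} p^p}{(p-1)^{p-1} q^q} \cdot \left(\frac{\alpha}{t}\right)^{p-q} \\ &\ge \frac{(q-1)^{q-1} p^p}{(p-1)^{p-1} q^q} \cdot \left(\frac{p-1}{p}\right)^{p-q} \\ &= \left(\frac{p}{q}\right)^q \cdot \left(\frac{q-1}{p-1}\right)^{q-1} \\ &= \frac{\left(1 + \frac{p-q}{q}\right)^q}{\left(1 + \frac{p-q}{q-1}\right)^{q-1}} \\ &\ge 1, \end{aligned}$$

where the last inequality is due to $\left(1 + \frac{p-q}{x}\right)^x$ being increasing for $x > 0$. Thus, the proof is complete. $\Box$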
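The two orderings in Proposition 3.1 can also be checked numerically on a grid. The sketch below implements (6) and (19) and tests both chains of inequalities up to a small floating-point tolerance; the grid, the tolerance `tol`, and the chosen values of p and q are illustrative assumptions of ours.

```python
def phi(x, alpha, eps):
    """phi_eps(x, alpha) from equation (6)."""
    t = abs(x) - eps
    if t >= alpha:
        return t * t + alpha * alpha / 3.0
    if t <= -alpha:
        return 0.0
    return (t + alpha) ** 3 / (6.0 * alpha)


def psi(x, alpha, eps, p):
    """psi_{eps,p}(x, alpha) from equation (19)."""
    ax = abs(x)
    if ax <= eps - alpha:
        return 0.0
    if ax >= eps + alpha / (p - 1):
        return ax - eps
    u = (p - 1) * (ax - eps + alpha) / (p * alpha)
    return alpha / (p - 1) * u ** p


eps, alpha, tol = 0.1, 0.06, 1e-12
grid = [i / 1000.0 - 0.3 for i in range(601)]   # x in [-0.3, 0.3]

# Part (a): phi >= psi^2 >= |x|_eps^2 for several p.
order_a = all(
    phi(x, alpha, eps) + tol >= psi(x, alpha, eps, p) ** 2
    and psi(x, alpha, eps, p) ** 2 + tol >= max(abs(x) - eps, 0.0) ** 2
    for x in grid for p in (2, 3, 5)
)

# Part (b): psi_q >= psi_p whenever p >= q >= 2.
order_b = all(
    psi(x, alpha, eps, q) + tol >= psi(x, alpha, eps, p)
    for x in grid for (q, p) in ((2, 3), (2, 5), (3, 5))
)
```

A grid check of this kind does not replace the proof, but it is a quick guard against transcription errors in the piecewise formulas.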

### 4 A smoothing Newton algorithm

In Section 2 and Section 3, we constructed two systems of smooth equations: $H_\phi(z) = 0$ and $H_\psi(z) = 0$. We briefly describe the difference between them.

The two systems are derived in slightly different ways. To obtain $H_\phi(z) = 0$, we first use the twice continuously differentiable functions $\phi_\varepsilon(x, \alpha)$ to replace $|x|_\varepsilon^2$ in problem (5), and then write out its KKT condition. In contrast, to obtain $H_\psi(z) = 0$, we write out the KKT condition of problem (5) first, and then use the smoothing functions $\psi_{\varepsilon,p}(x, \alpha)$ to replace the ε-loss function in (22). For convenience, we denote $\tilde{H}(z) \in \{H_\phi(z), H_\psi(z)\}$. In either case, $\tilde{H}(z)$ possesses the property that if $\tilde{H}(z) = 0$, then $\alpha = 0$ and $\omega$ solves (12). In view of this, we apply a Newton-type method to the system of smooth equations $\tilde{H}(z) = 0$ at each iteration while letting $\alpha \to 0$, so that the solution to the problem (12) can be found.

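Algorithm 4.1. (A smoothing Newton method)

Step 0 Choose $\delta \in (0, 1)$, $\sigma \in (0, \frac{1}{2})$, and $\alpha_0 > 0$. Take $\tau \in (0, 1)$ such that $\tau\alpha_0 < 1$. Let $\omega^0 \in \mathbb{R}^{n+1}$ be an arbitrary vector. Set $z^0 := (\alpha_0, \omega^0)$. Set $e^0 := (1, 0, \ldots, 0) \in \mathbb{R}^{n+2}$.

Step 1 If $\|\tilde{H}(z^k)\| = 0$, stop.

Step 2 Define the functions $\Gamma$ and $\beta$ by

$$\Gamma(z) := \|\tilde{H}(z)\|^2 \quad \text{and} \quad \beta(z) := \tau\min\{1, \Gamma(z)\}. \tag{24}$$

Compute $\Delta z^k := (\Delta\alpha_k, \Delta\omega^k)$ by

$$\tilde{H}(z^k) + \tilde{H}'(z^k)\Delta z^k = \alpha_0\beta(z^k)e^0.$$

Step 3 Let $\theta_k$ be the maximum of the values $1, \delta, \delta^2, \cdots$ such that

$$\Gamma(z^k + \theta_k\Delta z^k) \le \left[1 - 2\sigma(1 - \tau\alpha_0)\theta_k\right]\Gamma(z^k). \tag{25}$$

Step 4 Set $z^{k+1} := z^k + \theta_k\Delta z^k$ and $k := k + 1$. Go to Step 1.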
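The steps of Algorithm 4.1 can be sketched in code for the system built from φε. The version below is a minimal illustration under several assumptions of ours: the input is one-dimensional (n = 1, so ω = (x, b) and the Newton system is solved by hand), the coefficient in front of the sum is taken as it appears in (17) and (18), the exact stopping test of Step 1 is relaxed to a tolerance `tol`, and all function names, default parameters, and data are hypothetical.

```python
import math


def sgn(v):
    return (v > 0) - (v < 0)


def phi_derivs(r, alpha, eps):
    """Return the three derivatives of phi_eps at r given by (7), (8), (9)."""
    t, s = abs(r) - eps, sgn(r)
    if t >= alpha:
        return 2.0 * t * s, 2.0, 0.0
    if t <= -alpha:
        return 0.0, 0.0, 0.0
    return ((t + alpha) ** 2 / (2.0 * alpha) * s,
            (t + alpha) / alpha,
            (t + alpha) * (alpha - t) / (2.0 * alpha ** 2) * s)


def system(z, A, y, C, eps):
    """Return H(z) as in (17) plus the Hessian and mixed blocks of (18)."""
    alpha, x, b = z
    g = [x, b]                       # omega part of H
    J = [[1.0, 0.0], [0.0, 1.0]]     # Hessian block, starts from I
    e = [0.0, 0.0]                   # mixed-derivative block
    for a_i, y_i in zip(A, y):
        d1, d2, dm = phi_derivs(a_i * x + b - y_i, alpha, eps)
        row = (a_i, 1.0)             # i-th row of (A, 1)
        for j in range(2):
            g[j] += C * d1 * row[j]
            e[j] += C * dm * row[j]
            for k in range(2):
                J[j][k] += C * d2 * row[j] * row[k]
    return [alpha] + g, J, e


def merit(z, A, y, C, eps):
    """Gamma(z) = ||H(z)||^2 from (24)."""
    H, _, _ = system(z, A, y, C, eps)
    return sum(h * h for h in H)


def smoothing_newton(A, y, C, eps, alpha0=0.05, tau=0.5, delta=0.5,
                     sigma=1e-4, tol=1e-10, max_iter=200):
    z = [alpha0, 0.0, 0.0]                                   # Step 0
    for it in range(max_iter):
        H, J, e = system(z, A, y, C, eps)
        G = sum(h * h for h in H)
        if math.sqrt(G) <= tol:                              # Step 1 (relaxed)
            break
        beta = tau * min(1.0, G)                             # Step 2
        d_alpha = alpha0 * beta - z[0]                       # first row of the Newton system
        rhs = [-H[1] - e[0] * d_alpha, -H[2] - e[1] * d_alpha]
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
        dx = (rhs[0] * J[1][1] - J[0][1] * rhs[1]) / det     # Cramer's rule
        db = (J[0][0] * rhs[1] - J[1][0] * rhs[0]) / det
        step = [d_alpha, dx, db]
        theta = 1.0                                          # Step 3: backtracking for (25)
        for _ in range(60):
            trial = [zi + theta * di for zi, di in zip(z, step)]
            if merit(trial, A, y, C, eps) <= (1.0 - 2.0 * sigma * (1.0 - tau * alpha0) * theta) * G:
                break
            theta *= delta
        z = trial                                            # Step 4
    return z, math.sqrt(merit(z, A, y, C, eps)), it


A = [0.0, 1.0, 2.0, 3.0]
y = [0.2, 0.9, 2.3, 2.8]
z_final, residual, iterations = smoothing_newton(A, y, C=1.0, eps=0.1)
```

Because the first row of the Jacobian in (18) is (1, 0), the update of α decouples from the update of ω, which is why the smoothing parameter is driven to zero by the iteration itself rather than being fixed in advance.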

Proposition 4.1. Suppose that the sequence $\{z^k\}$ is generated by Algorithm 4.1. Then, the following results hold.

(a) $\{\Gamma(z^k)\}$ is monotonically decreasing.

(b) $\{\|\tilde{H}(z^k)\|\}$ and $\{\beta(z^k)\}$ are monotonically decreasing.

(c) Let $N(\tau) := \{z \in \mathbb{R}_+ \times \mathbb{R}^{n+1} : \alpha_0\beta(z) \le \alpha\}$. Then $z^k \in N(\tau)$ for any $k \in K$ and $0 < \alpha_{k+1} \le \alpha_k$.

(d) The algorithm is well defined.

Proof. Since the proof is very similar to [6, Remark 2.1], we omit it here. $\Box$

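Lemma 4.1. Let $\bar\lambda := \max_i \lambda_i\left(\sum_{j=1}^m \bar{A}_j \bar{A}_j^T\right)$. Then, for any $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$, we have

(a) $1 \le \lambda_i(H_\phi'(z)) \le 1 + 2\bar\lambda$, $i = 1, \cdots, n + 2$;

(b) $1 \le \lambda_i(H_\psi'(z)) \le 1 + \bar\lambda$, $i = 1, \cdots, n + 2$.

Proof. (a) $H_\phi$ is continuously differentiable at any $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$, and by (18), it is easy to see that $\{1, \lambda_1(\nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega)), \cdots, \lambda_{n+1}(\nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega))\}$ are the eigenvalues of $H_\phi'(z)$. From the representation of $\nabla^2_{xx}\phi_\varepsilon$ in (8), we have $0 \le \nabla^2_{xx}\phi_\varepsilon(\bar{A}_i^T \omega - y_i, \alpha) \le 2$. Since $\nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega) = I + \sum_{i=1}^m \nabla^2_{xx}\phi_\varepsilon(\bar{A}_i^T \omega - y_i, \alpha)\,\bar{A}_i \bar{A}_i^T$, it follows that

$$1 \le \lambda_i(\nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega)) \le 1 + 2\bar\lambda \quad (i = 1, \cdots, n + 1). \tag{26}$$

Thus, the result in (a) holds.

(b) Note that

$$\nabla_x \psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x) = \begin{cases} 0 & \text{if } 0 \le |x| \le \varepsilon - \alpha, \\[1mm] \left[\dfrac{(p-1)(|x| - \varepsilon + \alpha)}{p\alpha}\right]^{p-1} & \text{if } \varepsilon - \alpha < |x| < \varepsilon + \dfrac{\alpha}{p-1}, \\[3mm] 1 & \text{if } |x| \ge \varepsilon + \dfrac{\alpha}{p-1}, \end{cases}$$

which says $0 \le \nabla_x \psi_{\varepsilon,p}(x, \alpha)\,\mathrm{sgn}(x) \le 1$. Then, following arguments similar to those in part (a), the result of part (b) can be proved. $\Box$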

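Proposition 4.2. $\tilde{H}(\alpha, \omega)$ is coercive for any fixed $\alpha > 0$, i.e., $\lim_{\|\omega\| \to +\infty}\|\tilde{H}(\alpha, \omega)\| = +\infty$.

Proof. We first claim that $H_\phi(\alpha, \omega)$ is coercive for any fixed $\alpha > 0$. By the definition of $H_\phi(\alpha, \omega)$ in (17), $\|H_\phi(\alpha, \omega)\|^2 = \alpha^2 + \|\nabla_\omega F_{\varepsilon,\alpha}(\omega)\|^2$. Then for any fixed $\alpha > 0$,

$$\lim_{\|\omega\| \to +\infty}\|H_\phi(\alpha, \omega)\| = +\infty \iff \lim_{\|\omega\| \to +\infty}\|\nabla_\omega F_{\varepsilon,\alpha}(\omega)\| = +\infty.$$

By (26), we have $\|\nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\omega)\| \ge 1$. For any fixed $\omega^0 \in \mathbb{R}^{n+1}$,

$$\begin{aligned} \|\nabla_\omega F_{\varepsilon,\alpha}(\omega)\| + \|\nabla_\omega F_{\varepsilon,\alpha}(\omega^0)\| &\ge \|\nabla_\omega F_{\varepsilon,\alpha}(\omega) - \nabla_\omega F_{\varepsilon,\alpha}(\omega^0)\| \\ &= \|\nabla^2_{\omega\omega} F_{\varepsilon,\alpha}(\hat\omega)(\omega - \omega^0)\| \\ &\ge \|\omega - \omega^0\|, \end{aligned}$$

where $\hat\omega$ lies between $\omega^0$ and $\omega$. Hence, $\lim_{\|\omega\| \to +\infty}\|\nabla_\omega F_{\varepsilon,\alpha}(\omega)\| = +\infty$.

By a similar proof, we obtain that $H_\psi(\alpha, \omega)$ is coercive for any fixed $\alpha > 0$.

From the above, $\tilde{H}(\alpha, \omega) \in \{H_\phi(\alpha, \omega), H_\psi(\alpha, \omega)\}$ is coercive for any fixed $\alpha > 0$. $\Box$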

Lemma 4.2. Let Ω ⊆ IRn+1 be a compact set and Γ(α, ω) be defined as in (24). Then, for every ς > 0, there exists a ¯α > 0 such that

|Γ(α, ω) − Γ(0, ω)| ≤ ς for all ω ∈ Ω and all α ∈ [0, ¯α].

Proof. The function Γ(α, ω) defined as in (24) is continuous on the compact set [0, ¯α]×Ω.

The lemma is then an immediate consequence of the fact that every continuous function on a compact set is uniformly continuous there. 2

Lemma 4.3. (Mountain Pass Theorem [12, Theorem 9.2.7]) Suppose that $g : \mathbb{R}^m \to \mathbb{R}$ is a continuously differentiable and coercive function. Let $\Omega \subset \mathbb{R}^m$ be a nonempty and compact set and $\xi$ be the minimum value of $g$ on the boundary of $\Omega$, i.e., $\xi := \min_{y \in \partial\Omega} g(y)$. Assume that there exist points $a \in \Omega$ and $b \notin \Omega$ such that $g(a) < \xi$ and $g(b) < \xi$. Then, there exists a point $c \in \mathbb{R}^m$ such that $\nabla g(c) = 0$ and $g(c) \ge \xi$.

Theorem 4.1. Suppose the sequence $\{z^k\}$ is generated by Algorithm 4.1. Then, the sequence $\{z^k\}$ is bounded, and $\omega^k = (x^k, b^k)$ converges to the unique solution $\omega^{\mathrm{sol}} = (x^{\mathrm{sol}}, b^{\mathrm{sol}})$ of problem (12).

Proof. (a) We first show that the sequence $\{z^k\}$ is bounded. It is clear from Proposition 4.1(c) that the sequence $\{\alpha_k\}$ is bounded. Assuming that $\{\omega^k\}$ is unbounded, we will derive a contradiction. By passing to a subsequence if necessary, we assume $\|\omega^k\| \to +\infty$ as $k \to +\infty$. Then, we discuss two cases.

(i) If $\alpha^* = \lim_{k \to +\infty}\alpha_k > 0$, applying Proposition 4.1(b) yields that $\{\tilde{H}(z^k)\}$ is bounded. In addition, by Proposition 4.2, we have

$$\lim_{k \to +\infty}\|\tilde{H}(\alpha^*, \omega^k)\| = \lim_{\|\omega^k\| \to +\infty}\|\tilde{H}(\alpha^*, \omega^k)\| = +\infty. \tag{27}$$

This contradicts the boundedness of $\{\tilde{H}(z^k)\}$.

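(ii) If $\alpha^* = \lim_{k \to +\infty}\alpha_k = 0$, by assuming that $\{\omega^k\}$ is unbounded, there exists a compact set $\Omega \subset \mathbb{R}^{n+1}$ with

$$\omega^{\mathrm{sol}} \in \Omega \quad \text{and} \quad \omega^k \notin \Omega \tag{28}$$

for all $k$ sufficiently large. Since

$$\bar{m} := \min_{\omega \in \partial\Omega}\Gamma(0, \omega) > 0,$$

we can apply Lemma 4.2 with $\varsigma := \bar{m}/4$ and conclude that

$$\Gamma(\alpha_k, \omega^{\mathrm{sol}}) \le \frac{1}{4}\bar{m} \tag{29}$$

and

$$\min_{\omega \in \partial\Omega}\Gamma(\alpha_k, \omega) \ge \frac{3}{4}\bar{m}$$

for all $k$ sufficiently large. Since $\alpha_k \to 0$ in this case, combining (24) and Proposition 4.1(c) gives

$$\Gamma(\alpha_k, \omega^k) = \beta(\alpha_k, \omega^k) \le \frac{\alpha_k}{\alpha_0}.$$

Hence,

$$\Gamma(\alpha_k, \omega^k) \le \frac{1}{4}\bar{m} \tag{30}$$

for all $k$ sufficiently large. Now let us fix an index $k$ such that (29) and (30) hold. Applying the Mountain Pass Theorem (Lemma 4.3) with $a := \omega^{\mathrm{sol}}$ and $b := \omega^k$, we obtain the existence of a vector $c \in \mathbb{R}^{n+1}$ such that

$$\nabla_\omega\Gamma(\alpha_k, c) = 0 \quad \text{and} \quad \Gamma(\alpha_k, c) \ge \frac{3}{4}\bar{m} > 0. \tag{31}$$

To derive a contradiction, we need to show that $c$ is a global minimizer of $\Gamma(\alpha_k, \omega)$. Since $\Gamma(\alpha_k, \omega) \ge \alpha_k^2$, it is sufficient to show $\Gamma(\alpha_k, c) = \alpha_k^2$. We discuss this in two cases: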

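• If $\tilde{H} = H_\phi$, then

$$\nabla_\omega\Gamma(\alpha_k, c) = 2\nabla^2_{\omega\omega} F_{\varepsilon,\alpha_k}(c)\,\bar{H}_\phi(\alpha_k, c),$$

where $\bar{H}_\phi$ denotes the last $n + 1$ components of $H_\phi$, i.e., $\bar{H}_\phi = H_\phi(2 : n + 2)$. Then, using (31) and the fact that $\nabla^2_{\omega\omega} F_{\varepsilon,\alpha_k}(c)$ is invertible for $\alpha_k > 0$, we have $\bar{H}_\phi(\alpha_k, c) = 0$. Furthermore,

$$\Gamma(\alpha_k, c) = \|H_\phi(\alpha_k, c)\|^2 = \alpha_k^2.$$

• If $\tilde{H} = H_\psi$, then

$$\nabla_\omega\Gamma(\alpha_k, c) = 2\left(I + D(c)\right)\bar{H}_\psi(\alpha_k, c),$$

where $I + D(\cdot)$ is given by (23) and $\bar{H}_\psi$ denotes the last $n + 1$ components of $H_\psi$. Since $I + D(c)$ is invertible for $\alpha_k > 0$, we obtain $\Gamma(\alpha_k, c) = \alpha_k^2$ in the same way as in the above case.

In both cases $\Gamma(\alpha_k, c) = \alpha_k^2$. Since $\alpha_k \to 0$, choosing $k$ large enough gives $\alpha_k^2 < \frac{3}{4}\bar{m}$, which contradicts (31).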

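(b) From Proposition 4.1, we know that the sequences $\{\|\tilde{H}(z^k)\|\}$ and $\{\Gamma(z^k)\}$ are nonnegative and monotonically decreasing, and hence they are convergent. In addition, by the first result of this theorem, the sequence $\{z^k\}$ is bounded. Passing to a subsequence if necessary, we may assume that there exists a point $z^* = (\alpha^*, \omega^*) \in \mathbb{R}_+ \times \mathbb{R}^{n+1}$ such that $\lim_{k \to +\infty} z^k = z^*$, and hence,

$$\lim_{k \to +\infty}\|\tilde{H}(z^k)\| = \|\tilde{H}(z^*)\| \quad \text{and} \quad \lim_{k \to +\infty}\Gamma(z^k) = \Gamma(z^*).$$

If $\|\tilde{H}(z^*)\| = 0$, then by a simple continuity argument, $\omega^*$ is a solution to problem (12). For the case of $\|\tilde{H}(z^*)\| > 0$, and hence $\alpha^* > 0$, we will derive a contradiction.

First, by the assumption that $\|\tilde{H}(z^*)\| > 0$, we have $\lim_{k \to +\infty}\theta_k = 0$. Thus, for any sufficiently large $k$, the stepsize $\hat\theta_k := \theta_k/\delta$ does not satisfy the line search criterion (25), i.e.,

$$\Gamma(z^k + \hat\theta_k\Delta z^k) > \left[1 - 2\sigma(1 - \tau\alpha_0)\hat\theta_k\right]\Gamma(z^k),$$

which implies that

$$\frac{\Gamma(z^k + \hat\theta_k\Delta z^k) - \Gamma(z^k)}{\hat\theta_k} > -2\sigma(1 - \tau\alpha_0)\Gamma(z^k).$$

Since $\alpha^* > 0$, it follows that $\Gamma$ is continuously differentiable at $z^*$. Letting $k \to +\infty$ in the above inequality gives

$$\begin{aligned} -2\sigma(1 - \tau\alpha_0)\Gamma(z^*) &\le 2\tilde{H}(z^*)^T\tilde{H}'(z^*)\Delta z^* \\ &= 2\tilde{H}(z^*)^T\left(-\tilde{H}(z^*) + \alpha_0\beta(z^*)e^0\right) \\ &= -2\tilde{H}(z^*)^T\tilde{H}(z^*) + 2\alpha_0\beta(z^*)\tilde{H}(z^*)^T e^0 \\ &\le 2(-1 + \tau\alpha_0)\Gamma(z^*). \end{aligned}$$

This indicates that $-1 + \tau\alpha_0 + \sigma(1 - \tau\alpha_0) \ge 0$, which contradicts the fact that $\tau\alpha_0 < 1$. Thus, we must have $\tilde{H}(z^*) = 0$.

Because the unique solution to problem (12) is $\omega^{\mathrm{sol}}$, we have $z^* = (0, \omega^{\mathrm{sol}})$ and the whole sequence $\{z^k\}$ converges to $z^*$, that is,

$$\lim_{k \to +\infty} z^k = (0, \omega^{\mathrm{sol}}).$$

Then, the proof is complete. $\Box$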

In the following, we discuss the local convergence of Algorithm 4.1. To this end, we need the concept of semismoothness, which was originally introduced by Mifflin [7] for functionals and was further extended to the setting of vector-valued functions by Qi and Sun [14]. A locally Lipschitz continuous function $F : \mathbb{R}^n \to \mathbb{R}^m$, which has the generalized Jacobian $\partial F(x)$ in the sense of Clarke [2], is said to be semismooth (respectively, strongly semismooth) at $x \in \mathbb{R}^n$ if $F$ is directionally differentiable at $x$ and

$$F(x + h) - F(x) - Vh = o(\|h\|) \quad (= O(\|h\|^2), \text{ respectively})$$

holds for any $V \in \partial F(x + h)$.

Lemma 4.4. (a) Suppose that the sequence $\{z^k\}$ is generated by Algorithm 4.1. Then, $\|\tilde{H}'(z^k)^{-1}\| \le 1$.

(b) $\tilde{H}(z)$ is strongly semismooth at any $z = (\alpha, \omega) \in \mathbb{R}^{n+2}$.

Proof. (a) By Proposition 4.1(c), we know that $\alpha_k > 0$ for any $k \in K$. This together with Lemma 4.1 leads to the desired result.

(b) We only provide the proof for the case of $\tilde{H}(\alpha, \omega) = H_\phi(\alpha, \omega)$. For the other case of $\tilde{H}(\alpha, \omega) = H_\psi(\alpha, \omega)$, the proof is similar and is omitted. First, we observe that $H_\phi'(z)$ is continuously differentiable and Lipschitz continuous at $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$ by Lemma 4.1(a). Thus, $H_\phi(z)$ is strongly semismooth at $z \in \mathbb{R}_{++} \times \mathbb{R}^{n+1}$. It remains to verify that $H_\phi(z)$ is strongly semismooth at $z \in \{0\} \times \mathbb{R}^{n+1}$. To see this, we recall that

$$\nabla_x \phi_\varepsilon(x, 0) = \begin{cases} 2(|x| - \varepsilon)\,\mathrm{sgn}(x) & \text{if } |x| - \varepsilon \ge 0, \\ 0 & \text{if } |x| - \varepsilon < 0. \end{cases}$$

It is a piecewise linear function, and hence $\nabla_x \phi_\varepsilon(x, 0)$ is a strongly semismooth function. In summary, $H_\phi(z)$ is strongly semismooth at $z \in \{0\} \times \mathbb{R}^{n+1}$. $\Box$

Theorem 4.2. Suppose that z = (µ, x) is an accumulation point of {zk} generated by Algorithm 4.1. Then, we have
