
Two smooth support vector machines for ε-insensitive regression

Weizhe Gu1 · Wei-Po Chen2 · Chun-Hsu Ko3 · Yuh-Jye Lee4 · Jein-Shan Chen2

Received: 6 June 2017 / Published online: 18 December 2017

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Abstract In this paper, we propose two new smooth support vector machines for ε-insensitive regression. According to these two smooth support vector machines, we construct two systems of smooth equations based on two novel families of smoothing functions, from which we seek the solution to ε-support vector regression (ε-SVR).

More specifically, using the proposed smoothing functions, we employ the smoothing Newton method to solve the systems of smooth equations. The algorithm is shown to be globally and quadratically convergent without any additional conditions. Numerical comparisons among different values of the parameter are also reported.

J.-S. Chen's work is supported by the Ministry of Science and Technology, Taiwan.


Jein-Shan Chen jschen@math.ntnu.edu.tw
Weizhe Gu weizhegu@yahoo.com.cn
Wei-Po Chen weaper@gmail.com
Chun-Hsu Ko chko@isu.edu.tw
Yuh-Jye Lee yuhjye@math.nctu.edu.tw

1 Department of Mathematics, School of Science, Tianjin University, Tianjin 300072, People's Republic of China
2 Department of Mathematics, National Taiwan Normal University, Taipei 11677, Taiwan
3 Department of Electrical Engineering, I-Shou University, Kaohsiung 840, Taiwan
4 Department of Applied Mathematics, National Chiao Tung University, Hsinchu 300, Taiwan


Keywords Support vector machine · ε-insensitive loss function · ε-smooth support vector regression · Smoothing Newton algorithm

1 Introduction

Support vector machine (SVM) is a popular and important statistical learning technology [1,7,8,16–19]. Generally speaking, there are two main categories of support vector machines (SVMs): support vector classification (SVC) and support vector regression (SVR). The model produced by SVR depends on a training data set S = {(A_1, y_1), . . . , (A_m, y_m)} ⊆ IR^n × IR, where A_i ∈ IR^n is the input data and y_i ∈ IR is called the observation. The main goal of ε-insensitive regression with the idea of SVMs is to find a linear or nonlinear regression function f that has at most ε deviation from the actually obtained targets y_i for all the training data, and at the same time is as flat as possible. This problem is called ε-support vector regression (ε-SVR).

For pedagogical reasons, we begin with the linear case, in which the regression function f is defined as

f(A) = A^T x + b with x ∈ IR^n, b ∈ IR.        (1)

Flatness in the case of (1) means that one seeks a small x. One way to ensure this is to minimize the norm of x; then the problem ε-SVR can be formulated as a constrained minimization problem:

min_{x,b,ξ,ξ*}  (1/2) x^T x + C Σ_{i=1}^m (ξ_i + ξ_i^*)
s.t.  y_i − A_i^T x − b ≤ ε + ξ_i,
      A_i^T x + b − y_i ≤ ε + ξ_i^*,
      ξ_i, ξ_i^* ≥ 0,  i = 1, . . . , m.        (2)

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. This corresponds to dealing with the so-called ε-insensitive loss function |ξ|_ε described by

|ξ|_ε = max{0, |ξ| − ε}.

The formulation (2) is a convex quadratic minimization problem with n + 1 free variables, 2m nonnegative variables, and 2m inequality constraints, which enlarges the problem size and could increase computational complexity.
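To make the loss concrete, the following is a minimal NumPy sketch of the ε-insensitive loss; the function and variable names are ours, not from the paper:

import numpy as np

def eps_insensitive_loss(xi, eps):
    """|xi|_eps = max{0, |xi| - eps}, applied elementwise."""
    return np.maximum(0.0, np.abs(xi) - eps)

# Residuals inside the eps-tube contribute zero loss.
r = np.array([-0.3, -0.05, 0.0, 0.08, 0.5])
print(eps_insensitive_loss(r, eps=0.1))   # [0.2  0.   0.   0.   0.4]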

In fact, the problem (2) can be reformulated as an unconstrained optimization problem:

min_{(x,b)∈IR^{n+1}}  (1/2)(x^T x + b^2) + (C/2) Σ_{i=1}^m |A_i^T x + b − y_i|_ε^2.        (3)

This formulation has been proposed in active set support vector regression [11] and solved in its dual form. The objective function is strongly convex; hence, the problem (3) has a unique global optimal solution. However, since the objective function is not twice continuously differentiable, Newton-type algorithms cannot be applied to solve (3) directly.
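For illustration, here is a NumPy sketch of the unconstrained objective in (3) (our own naming, not the authors' code); the squared ε-insensitive term is differentiable, but its second derivative jumps where |A_i^T x + b − y_i| = ε, which is why Newton-type methods do not apply directly:

import numpy as np

def objective_eq3(x, b, A, y, C, eps):
    """0.5*(x'x + b^2) + (C/2) * sum_i |A_i' x + b - y_i|_eps^2, cf. (3)."""
    r = A @ x + b - y
    loss = np.maximum(0.0, np.abs(r) - eps)   # elementwise |.|_eps
    return 0.5 * (x @ x + b * b) + 0.5 * C * np.sum(loss ** 2)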

Lee, Hsieh and Huang [7] apply a smoothing technique to (3). The smooth function

f_ε(x, α) = x + (1/α) log(1 + e^{−αx}),        (4)

which is the integral of the sigmoid function 1/(1 + e^{−αx}), is used to smooth the plus function [x]_+. More specifically, the smooth function f_ε(x, α) approaches [x]_+ as α goes to infinity. Then, the problem (3) is recast as a strongly convex unconstrained minimization problem with the smooth function f_ε(x, α), and a Newton–Armijo algorithm is proposed to solve it. It is proved that when the smoothing parameter α approaches infinity, the unique solution of the reformulated problem converges to the unique solution of the original problem [7, Theorem 2.2]. However, the smoothing parameter α is fixed in the proposed algorithm, and in the implementation of this algorithm, α cannot be set large enough.
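As a reference point, a small sketch of the smoothing function (4) used in [7]; np.logaddexp evaluates log(1 + e^{−αx}) stably (our own code, not the implementation of [7]):

import numpy as np

def smooth_plus(x, alpha):
    """f(x, alpha) = x + (1/alpha) * log(1 + exp(-alpha*x)):
    the integral of the sigmoid, a smooth approximation of [x]_+."""
    return x + np.logaddexp(0.0, -alpha * x) / alpha

x = np.linspace(-1.0, 1.0, 5)
print(smooth_plus(x, alpha=100.0))    # close to np.maximum(x, 0.0)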

In this paper, we introduce two smooth support vector machines for ε-insensitive regression. For the first smooth support vector machine, we reformulate ε-SVR as a strongly convex unconstrained optimization problem with one type of smoothing functions φ_ε(x, α). Then, we define a new function H_φ, which corresponds to the optimality condition of the unconstrained optimization problem. From the solution of H_φ(z) = 0, we can obtain the solution of ε-SVR. For the second smooth support vector machine, we smooth the optimality condition of the strongly convex unconstrained optimization problem (3) with another type of smoothing functions ψ_ε(x, α).

Accordingly, we define the function H_ψ, which possesses the same properties as H_φ does. For either H_φ(z) = 0 or H_ψ(z) = 0, we consider the smoothing Newton method to solve it. The algorithm is shown to be globally convergent; specifically, the iterative sequence converges to the unique solution of (3). Furthermore, the algorithm is shown to be locally quadratically convergent without any assumptions.

The paper is organized as follows. In Sects. 2 and 3, we introduce two smooth support vector machine reformulations based on two types of smoothing functions. In Sect. 4, we propose a smoothing Newton algorithm and study its global and local quadratic convergence. Numerical results and comparisons are reported in Sect. 5. Throughout this paper, K := {1, 2, . . .} and all vectors are column vectors. For a given vector x = (x_1, . . . , x_n)^T ∈ IR^n, the plus function [x]_+ is defined as

([x]_+)_i = max{0, x_i},  i = 1, . . . , n.

For a differentiable function f, we denote by ∇f(x) and ∇²f(x) the gradient and the Hessian matrix of f at x, respectively. For a differentiable mapping G : IR^n → IR^m, we denote by G'(x) ∈ IR^{m×n} the Jacobian of G at x. For a matrix A ∈ IR^{m×n}, A_i^T is the i-th row of A. A column vector of ones and an identity matrix of arbitrary dimension will be denoted by 1 and I, respectively. We denote the sign function by


sgn(x) = { 1        if x > 0,
           [−1, 1]  if x = 0,
           −1       if x < 0.

2 The first smooth support vector machine

As mentioned in [7], it is known that ε-SVR can be reformulated as the strongly convex unconstrained optimization problem (3). Denote ω := (x, b) ∈ IR^{n+1} and Ā := (A, 1), and let Ā_i^T be the i-th row of Ā; then the smooth support vector regression (3) can be rewritten as

min_ω  (1/2) ω^T ω + (C/2) Σ_{i=1}^m |Ā_i^T ω − y_i|_ε^2.        (5)

Note that | · |_ε^2 is smooth but not twice differentiable, which means the objective function is not twice continuously differentiable. Hence, Newton-type methods cannot be applied to solve (5) directly.

In view of this fact, we propose a family of twice continuously differentiable functions φ_ε(x, α) to replace |x|_ε^2. The family of functions φ_ε(x, α) : IR × IR_+ → IR_+ is given by

φ_ε(x, α) = { (|x| − ε)^2 + (1/3)α^2       if |x| − ε ≥ α,
              (1/(6α))(|x| − ε + α)^3      if ||x| − ε| < α,
              0                             if |x| − ε ≤ −α,        (6)

where 0 < α < ε is a smoothing parameter. The graphs of φ_ε(x, α) are depicted in Fig. 1. From this geometric view, it is clear that φ_ε(x, α) is a class of smoothing functions for |x|_ε^2.
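A minimal NumPy sketch of the piecewise definition (6), evaluated elementwise (our own naming; it assumes 0 < α < ε as stated above):

import numpy as np

def phi_eps(x, eps, alpha):
    """phi_eps(x, alpha) from (6): a smoothing of |x|_eps^2 for 0 < alpha < eps."""
    t = np.abs(x) - eps
    out = np.where(t >= alpha, t ** 2 + alpha ** 2 / 3.0, 0.0)   # also covers t <= -alpha
    mid = np.abs(t) < alpha
    return np.where(mid, (t + alpha) ** 3 / (6.0 * alpha), out)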

Besides the geometric approach, we now show that φ_ε(x, α) is a class of smoothing functions for |x|_ε^2 by algebraic verification. To this end, we compute the partial derivatives of φ_ε(x, α) as below:

∇_x φ_ε(x, α) = { 2(|x| − ε) sgn(x)                  if |x| − ε ≥ α,
                  (1/(2α))(|x| − ε + α)^2 sgn(x)     if ||x| − ε| < α,
                  0                                   if |x| − ε ≤ −α.        (7)

∇²_{xx} φ_ε(x, α) = { 2                    if |x| − ε ≥ α,
                      (|x| − ε + α)/α      if ||x| − ε| < α,
                      0                    if |x| − ε ≤ −α.        (8)

∇²_{xα} φ_ε(x, α) = { 0                                                 if |x| − ε ≥ α,
                      ((|x| − ε + α)(α − |x| + ε)/(2α^2)) sgn(x)        if ||x| − ε| < α,
                      0                                                 if |x| − ε ≤ −α.        (9)
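The three partial derivatives (7)–(9) can be coded with the same branch structure; this is our own sketch, mirroring the formulas above:

import numpy as np

def phi_eps_dx(x, eps, alpha):
    """Partial derivative (7) of phi_eps with respect to x."""
    t, s = np.abs(x) - eps, np.sign(x)
    out = np.where(t >= alpha, 2.0 * t * s, 0.0)
    return np.where(np.abs(t) < alpha, (t + alpha) ** 2 * s / (2.0 * alpha), out)

def phi_eps_dxx(x, eps, alpha):
    """Second partial derivative (8) of phi_eps with respect to x."""
    t = np.abs(x) - eps
    out = np.where(t >= alpha, 2.0, 0.0)
    return np.where(np.abs(t) < alpha, (t + alpha) / alpha, out)

def phi_eps_dxa(x, eps, alpha):
    """Mixed partial derivative (9) of phi_eps with respect to x and alpha."""
    t, s = np.abs(x) - eps, np.sign(x)
    mid = (t + alpha) * (alpha - t) * s / (2.0 * alpha ** 2)
    return np.where(np.abs(t) < alpha, mid, 0.0)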


Fig. 1 Graphs of φ_ε(x, α) with ε = 0.1 and α = 0.03, 0.06, 0.09

With the above, the following lemma shows some basic properties of φ_ε(x, α).

Lemma 2.1 Let φ_ε(x, α) be defined as in (6). Then, the following hold.

(a) For 0 < α < ε, there holds 0 ≤ φ_ε(x, α) − |x|_ε^2 ≤ (1/3)α^2.

(b) The function φ_ε(x, α) is twice continuously differentiable with respect to x for 0 < α < ε.

(c) lim_{α→0} φ_ε(x, α) = |x|_ε^2 and lim_{α→0} ∇_x φ_ε(x, α) = ∇(|x|_ε^2).

Proof (a) To complete the arguments, we need to discuss four cases.

(i) For |x| − ε ≥ α, it is clear that φ_ε(x, α) − |x|_ε^2 = (1/3)α^2.

(ii) For 0 < |x| − ε < α, i.e., 0 < x − ε < α or 0 < −x − ε < α, there are two subcases.

If 0 < x − ε < α, letting f(x) := φ_ε(x, α) − |x|_ε^2 = (1/(6α))(x − ε + α)^3 − (x − ε)^2 gives

f'(x) = (x − ε + α)^2/(2α) − 2(x − ε), ∀x ∈ (ε, ε + α),
f''(x) = (x − ε + α)/α − 2 < 0, ∀x ∈ (ε, ε + α).

This indicates that f'(x) is monotone decreasing on (ε, ε + α), which further implies f'(x) ≥ f'(ε + α) = 0, ∀x ∈ (ε, ε + α).

Thus, we obtain that f(x) is monotone increasing on (ε, ε + α). With this, we have f(x) ≤ f(ε + α) = (1/3)α^2, which yields

φ_ε(x, α) − |x|_ε^2 ≤ (1/3)α^2, ∀x ∈ (ε, ε + α).


If 0 < −x − ε < α, the arguments are similar to the above, and we omit them.

(iii) For −α < |x| − ε ≤ 0, it is clear that φ_ε(x, α) − |x|_ε^2 = (1/(6α))(|x| − ε + α)^3 ≤ α^3/(6α) = α^2/6 ≤ α^2/3.

(iv) For |x| − ε ≤ −α, we have φ_ε(x, α) − |x|_ε^2 = 0. Then, the desired result follows.

(b) To prove the twice continuous differentiability of φ_ε(x, α), we need to check that φ_ε(·, α), ∇_x φ_ε(·, α) and ∇²_{xx} φ_ε(·, α) are all continuous. Since they are piecewise functions, it suffices to check the junction points.

First, we check that φ_ε(·, α) is continuous.

(i) If |x| − ε = α, then φ_ε(x, α) = (4/3)α^2 from both branches, which implies φ_ε(·, α) is continuous.

(ii) If |x| − ε = −α, then φ_ε(x, α) = 0. Hence, φ_ε(·, α) is continuous.

Next, we check that ∇_x φ_ε(·, α) is continuous.

(i) If |x| − ε = α, then ∇_x φ_ε(x, α) = 2α sgn(x).

(ii) If |x| − ε = −α, then ∇_x φ_ε(x, α) = 0. From the above, it is clear to see that ∇_x φ_ε(·, α) is continuous.

Now we show that ∇²_{xx} φ_ε(·, α) is continuous.

(i) If |x| − ε = α, then ∇²_{xx} φ_ε(x, α) = 2.

(ii) If |x| − ε = −α, then ∇²_{xx} φ_ε(x, α) = 0. Hence, ∇²_{xx} φ_ε(·, α) is continuous.

(c) It is clear that lim_{α→0} φ_ε(x, α) = |x|_ε^2 holds by part (a). It remains to verify lim_{α→0} ∇_x φ_ε(x, α) = ∇(|x|_ε^2). First, we compute that

∇(|x|_ε^2) = { 2(|x| − ε) sgn(x)   if |x| − ε ≥ 0,
               0                    if |x| − ε < 0.        (10)

In light of (10), we proceed by discussing four cases.

(i) For |x| − ε ≥ α, we have ∇_x φ_ε(x, α) − ∇(|x|_ε^2) = 0. Then, the desired result follows.

(ii) For 0 < |x| − ε < α, we have

∇_x φ_ε(x, α) − ∇(|x|_ε^2) = (1/(2α))(|x| − ε + α)^2 sgn(x) − 2(|x| − ε) sgn(x),

which yields

lim_{α→0} (∇_x φ_ε(x, α) − ∇(|x|_ε^2)) = lim_{α→0} [((|x| − ε + α)^2 − 4α(|x| − ε))/(2α)] sgn(x).

We notice that |x| → ε when α → 0, and hence ((|x| − ε + α)^2 − 4α(|x| − ε))/(2α) → 0/0. Then, applying L'Hôpital's rule yields

lim_{α→0} ((|x| − ε + α)^2 − 4α(|x| − ε))/(2α) = lim_{α→0} (α − (|x| − ε)) = 0.

This implies lim_{α→0} (∇_x φ_ε(x, α) − ∇(|x|_ε^2)) = 0, which is the desired result.


(iii) For −α < |x| − ε ≤ 0, we have ∇_x φ_ε(x, α) − ∇(|x|_ε^2) = (1/(2α))(|x| − ε + α)^2 sgn(x). Then, applying L'Hôpital's rule gives

lim_{α→0} ((|x| − ε + α)^2)/(2α) = lim_{α→0} (|x| − ε + α) = 0.

Thus, we prove that lim_{α→0} (∇_x φ_ε(x, α) − ∇(|x|_ε^2)) = 0 under this case.

(iv) For |x| − ε ≤ −α, we have ∇_x φ_ε(x, α) − ∇(|x|_ε^2) = 0. Then, the desired result follows clearly.

Now, we use the family of smoothing functions φ_ε to replace the square of the ε-insensitive loss function in (5) to obtain the first smooth support vector regression. In other words, we consider

min_ω F_{ε,α}(ω) := (1/2) ω^T ω + (C/2) 1^T Φ_ε(Āω − y, α),        (11)

where ω := (x, b) ∈ IR^{n+1} and Φ_ε(Ax + 1b − y, α) ∈ IR^m is defined componentwise by Φ_ε(Ax + 1b − y, α)_i = φ_ε(A_i x + b − y_i, α).
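A sketch of the smoothed objective F_{ε,α}(ω) in (11); it reuses phi_eps from the sketch after (6), stacks ω = (x, b), and takes Ā = (A, 1) (our own code and names):

import numpy as np

def F_eps_alpha(omega, A_bar, y, C, eps, alpha):
    """F_{eps,alpha}(omega) = 0.5*omega'omega + (C/2)*sum_i phi_eps(A_bar_i'omega - y_i, alpha)."""
    r = A_bar @ omega - y                    # residuals A_i x + b - y_i
    return 0.5 * omega @ omega + 0.5 * C * np.sum(phi_eps(r, eps, alpha))

# Typical setup: A_bar = np.hstack([A, np.ones((m, 1))]), omega = np.append(x, b).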

This is a strongly convex unconstrained optimization problem with a twice continuously differentiable objective function. Noting that lim_{α→0} φ_ε(x, α) = |x|_ε^2, we see that

min_ω F_{ε,0}(ω) := lim_{α→0} F_{ε,α}(ω) = (1/2) ω^T ω + (C/2) Σ_{i=1}^m |Ā_i^T ω − y_i|_ε^2,        (12)

which is exactly the problem (5).

The following theorem shows that the unique solution of the smooth problem (11) approaches the unique solution of the problem (12) as α → 0. Indeed, it plays the same role as [7, Theorem 2.2].

Theorem 2.1 Let F_{ε,α}(ω) and F_{ε,0}(ω) be defined as in (11) and (12), respectively. Then, the following hold.

(a) There exists a unique solution ω̄_α to min_{ω∈IR^{n+1}} F_{ε,α}(ω) and a unique solution ω̄ to min_{ω∈IR^{n+1}} F_{ε,0}(ω).

(b) For all 0 < α < ε, we have the following inequality:

‖ω̄_α − ω̄‖^2 ≤ (1/6) C m α^2.        (13)

Moreover, ω̄_α converges to ω̄ as α → 0 with an upper bound given by (13).


Proof (a) In view of φ_ε(x, α) − |x|_ε^2 ≥ 0 in Lemma 2.1(a), we see that the level sets

L_v(F_{ε,α}(ω)) := {ω ∈ IR^{n+1} | F_{ε,α}(ω) ≤ v},
L_v(F_{ε,0}(ω)) := {ω ∈ IR^{n+1} | F_{ε,0}(ω) ≤ v}

satisfy

L_v(F_{ε,α}(ω)) ⊆ L_v(F_{ε,0}(ω)) ⊆ {ω ∈ IR^{n+1} | ω^T ω ≤ 2v}        (14)

for any v ≥ 0. Hence, we obtain that L_v(F_{ε,α}(ω)) and L_v(F_{ε,0}(ω)) are compact (closed and bounded) subsets of IR^{n+1}. Then, by the strong convexity of F_{ε,0}(ω) and F_{ε,α}(ω) with α > 0, each of the problems min_{ω∈IR^{n+1}} F_{ε,α}(ω) and min_{ω∈IR^{n+1}} F_{ε,0}(ω) has a unique solution.

(b) From the optimality conditions and the strong convexity of F_{ε,0}(ω) and F_{ε,α}(ω) with α > 0, we know that

F_{ε,0}(ω̄_α) − F_{ε,0}(ω̄) ≥ ∇F_{ε,0}(ω̄)^T(ω̄_α − ω̄) + (1/2)‖ω̄_α − ω̄‖^2 ≥ (1/2)‖ω̄_α − ω̄‖^2,        (15)
F_{ε,α}(ω̄) − F_{ε,α}(ω̄_α) ≥ ∇F_{ε,α}(ω̄_α)^T(ω̄ − ω̄_α) + (1/2)‖ω̄ − ω̄_α‖^2 ≥ (1/2)‖ω̄ − ω̄_α‖^2.        (16)

Note that F_{ε,α}(ω) ≥ F_{ε,0}(ω) because φ_ε(x, α) − |x|_ε^2 ≥ 0. Then, adding up (15) and (16) along with this fact yields

‖ω̄_α − ω̄‖^2 ≤ (F_{ε,α}(ω̄) − F_{ε,0}(ω̄)) − (F_{ε,α}(ω̄_α) − F_{ε,0}(ω̄_α))
             ≤ F_{ε,α}(ω̄) − F_{ε,0}(ω̄)
             = (C/2) 1^T Φ_ε(Āω̄ − y, α) − (C/2) Σ_{i=1}^m |Ā_i^T ω̄ − y_i|_ε^2
             = (C/2) Σ_{i=1}^m φ_ε(Ā_i^T ω̄ − y_i, α) − (C/2) Σ_{i=1}^m |Ā_i^T ω̄ − y_i|_ε^2
             ≤ (1/6) C m α^2,

where the last inequality is due to Lemma 2.1(a). It is clear that ω̄_α converges to ω̄ as α → 0 with an upper bound given by the above. Then, the proof is complete.

Next, we focus on the optimality condition of the minimization problem (11), which is indeed sufficient and necessary for (11) and has the form of

∇_ω F_{ε,α}(ω) = 0.


With this, we define a function H_φ : IR^{n+2} → IR^{n+2} by

H_φ(z) = ( α, ∇_ω F_{ε,α}(ω) ) = ( α, ω + C Σ_{i=1}^m ∇_x φ_ε(Ā_i^T ω − y_i, α) Ā_i ),        (17)

where z := (α, ω) ∈ IR^{n+2}. From Lemma 2.1 and the strong convexity of F_{ε,α}(ω), it is easy to see that if H_φ(z) = 0, then α = 0 and ω solves (11); and for any z ∈ IR_{++} × IR^{n+1}, the function H_φ is continuously differentiable. In addition, the Jacobian of H_φ can be calculated as below:

H'_φ(z) = [ 1                     0
            ∇²_{ωα} F_{ε,α}(ω)    ∇²_{ωω} F_{ε,α}(ω) ],        (18)

where

∇²_{ωα} F_{ε,α}(ω) = C Σ_{i=1}^m ∇²_{xα} φ_ε(Ā_i^T ω − y_i, α) Ā_i,
∇²_{ωω} F_{ε,α}(ω) = I + C Σ_{i=1}^m ∇²_{xx} φ_ε(Ā_i^T ω − y_i, α) Ā_i Ā_i^T.

From (8), we can see ∇²_{xx} φ_ε(x, α) ≥ 0, which implies that C Σ_{i=1}^m ∇²_{xx} φ_ε(Ā_i^T ω − y_i, α) Ā_i Ā_i^T is positive semidefinite. Hence, ∇²_{ωω} F_{ε,α}(ω) is positive definite. This helps us to prove that H'_φ(z) is invertible at any z ∈ IR_{++} × IR^{n+1}. In fact, if there exists a vector d := (d_1, d_2) ∈ IR × IR^{n+1} such that H'_φ(z) d = 0, then we have

( d_1
  d_1 ∇²_{ωα} F_{ε,α}(ω) + ∇²_{ωω} F_{ε,α}(ω) d_2 ) = 0.

This implies that d = 0, and hence H'_φ(z) is invertible at any z ∈ IR_{++} × IR^{n+1}.
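To make (17) and (18) concrete, here is a sketch (our own code) that assembles H_φ(z) and its Jacobian H'_φ(z); it reuses phi_eps_dx, phi_eps_dxx and phi_eps_dxa from the sketch after (9):

import numpy as np

def H_phi(z, A_bar, y, C, eps):
    """H_phi(z) = (alpha, omega + C * sum_i grad_x phi_eps(A_bar_i'omega - y_i, alpha) * A_bar_i)."""
    alpha, omega = z[0], z[1:]
    r = A_bar @ omega - y
    grad = omega + C * (A_bar.T @ phi_eps_dx(r, eps, alpha))
    return np.concatenate(([alpha], grad))

def H_phi_jac(z, A_bar, y, C, eps):
    """Jacobian (18): first row (1, 0); below it the mixed and second derivatives of F_{eps,alpha}."""
    alpha, omega = z[0], z[1:]
    r = A_bar @ omega - y
    n1 = omega.size
    J = np.zeros((n1 + 1, n1 + 1))
    J[0, 0] = 1.0
    J[1:, 0] = C * (A_bar.T @ phi_eps_dxa(r, eps, alpha))                                    # via (9)
    J[1:, 1:] = np.eye(n1) + C * (A_bar.T @ (phi_eps_dxx(r, eps, alpha)[:, None] * A_bar))   # via (8)
    return J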

3 The second smooth support vector machine

In this section, we consider another type of smoothing functions ψ_{ε,p}(x, α) : IR × IR_+ → IR_+, which is defined by

ψ_{ε,p}(x, α) = { 0                                            if 0 ≤ |x| ≤ ε − α,
                  (α/(p−1)) ((p−1)(|x| − ε + α)/(pα))^p        if ε − α < |x| < ε + α/(p−1),
                  |x| − ε                                       if |x| ≥ ε + α/(p−1),        (19)

where p ≥ 2. The graphs of ψ_{ε,p}(x, α) are depicted in Fig. 2, which clearly verify that ψ_{ε,p}(x, α) is a family of smoothing functions for |x|_ε.
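A NumPy sketch of the piecewise definition (19), evaluated elementwise (our own naming):

import numpy as np

def psi_eps_p(x, eps, alpha, p=2):
    """psi_{eps,p}(x, alpha) from (19): a smoothing of |x|_eps for p >= 2."""
    ax = np.abs(x)
    out = np.where(ax >= eps + alpha / (p - 1), ax - eps, 0.0)   # outer branches
    mid = (ax > eps - alpha) & (ax < eps + alpha / (p - 1))
    middle = (alpha / (p - 1)) * ((p - 1) * (ax - eps + alpha) / (p * alpha)) ** p
    return np.where(mid, middle, out)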

As shown in Lemma 3.1 below, ψ_{ε,p}(x, α) is a family of smoothing functions for |x|_ε; hence, ψ_{ε,p}^2(x, α) is also a family of smoothing functions for |x|_ε^2. Then, we can employ ψ_{ε,p}^2 to replace the square of the ε-insensitive loss function in (5) in the same way as in Sect. 2. The graphs of ψ_{ε,p}^2(x, α), in comparison with φ_ε(x, α), are depicted in Fig. 3. In fact, there is a relation between ψ_{ε,p}^2(x, α) and φ_ε(x, α), shown in Proposition 3.1.

Fig. 2 Graphs of ψ_{ε,p}(x, α) with ε = 0.1, α = 0.03, 0.06, 0.09 and p = 2

Fig. 3 Graphs of |x|_ε^2, φ_ε(x, α) and ψ_{ε,p}^2(x, α) with ε = 0.1, α = 0.06, 0.09 and p = 2


In other words, we obtain an alternative strongly convex unconstrained optimization problem for (5):

min_ω  (1/2) ω^T ω + (C/2) Σ_{i=1}^m ψ_{ε,p}^2(Ā_i^T ω − y_i, α).        (20)

However, the smooth function ψ_{ε,p}^2(x, α) is not twice differentiable with respect to x, and hence the objective function of (20) is not twice differentiable although it is smooth.

Then, we still cannot apply a Newton-type method to solve (20). To overcome this, we take another smoothing approach. Before presenting the idea of this smoothing technique, the following two lemmas regarding properties of ψ_{ε,p}(x, α) are needed. To this end, we also compute the partial derivatives of ψ_{ε,p}(x, α) as below:

∇_x ψ_{ε,p}(x, α) = { 0                                           if 0 ≤ |x| ≤ ε − α,
                      sgn(x) ((p−1)(|x| − ε + α)/(pα))^{p−1}      if ε − α < |x| < ε + α/(p−1),
                      sgn(x)                                       if |x| ≥ ε + α/(p−1).

∇_α ψ_{ε,p}(x, α) = { 0                                                                        if 0 ≤ |x| ≤ ε − α,
                      (((ε − |x|)(p−1) + α)/(pα)) ((p−1)(|x| − ε + α)/(pα))^{p−1}              if ε − α < |x| < ε + α/(p−1),
                      0                                                                        if |x| ≥ ε + α/(p−1).
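The two partial derivatives above can be coded with the same branches; again a sketch with our own names:

import numpy as np

def psi_eps_p_dx(x, eps, alpha, p=2):
    """Partial derivative of psi_{eps,p} with respect to x."""
    ax, s = np.abs(x), np.sign(x)
    out = np.where(ax >= eps + alpha / (p - 1), s, 0.0)
    mid = (ax > eps - alpha) & (ax < eps + alpha / (p - 1))
    middle = s * ((p - 1) * (ax - eps + alpha) / (p * alpha)) ** (p - 1)
    return np.where(mid, middle, out)

def psi_eps_p_da(x, eps, alpha, p=2):
    """Partial derivative of psi_{eps,p} with respect to alpha."""
    ax = np.abs(x)
    mid = (ax > eps - alpha) & (ax < eps + alpha / (p - 1))
    factor = ((eps - ax) * (p - 1) + alpha) / (p * alpha)
    middle = factor * ((p - 1) * (ax - eps + alpha) / (p * alpha)) ** (p - 1)
    return np.where(mid, middle, 0.0)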

Lemma 3.1 Let ψ_{ε,p}(x, α) be defined as in (19). Then, we have

(a) ψ_{ε,p}(x, α) is smooth with respect to x for any p ≥ 2;

(b) lim_{α→0} ψ_{ε,p}(x, α) = |x|_ε for any p ≥ 2.

Proof (a) To prove the result, we need to check that both ψ_{ε,p}(·, α) and ∇_x ψ_{ε,p}(·, α) are continuous.

(i) If |x| = ε − α, then ψ_{ε,p}(x, α) = 0.

(ii) If |x| = ε + α/(p−1), then ψ_{ε,p}(x, α) = α/(p−1). From (i) and (ii), it is clear to see that ψ_{ε,p}(·, α) is continuous.

Moreover, (i) If |x| = ε − α, then ∇_x ψ_{ε,p}(x, α) = 0.

(ii) If |x| = ε + α/(p−1), then ∇_x ψ_{ε,p}(x, α) = sgn(x). In view of (i) and (ii), we see that ∇_x ψ_{ε,p}(·, α) is continuous.

(b) To proceed, we discuss four cases.

(1) If 0 ≤ |x| ≤ ε − α, then ψ_{ε,p}(x, α) − |x|_ε = 0. Then, the desired result follows.

(2) If ε − α ≤ |x| ≤ ε, then ψ_{ε,p}(x, α) − |x|_ε = (α/(p−1)) ((p−1)(|x| − ε + α)/(pα))^p. Hence,

lim_{α→0} (ψ_{ε,p}(x, α) − |x|_ε) = lim_{α→0} (α/(p−1)) ((p−1)(|x| − ε + α)/(pα))^p
                                  = lim_{α→0} (α/(p−1)) · lim_{α→0} ((p−1)(|x| − ε + α)/(pα))^p.

It is clear that the first limit is zero, so we only need to show that the second limit is bounded. To this end, we rewrite it as

lim_{α→0} ((p−1)(|x| − ε + α)/(pα))^p = lim_{α→0} ((p−1)/p)^p ((|x| − ε + α)/α)^p.

We notice that |x| → ε when α → 0 so that (|x| − ε + α)/α → 0/0. Therefore, by applying L'Hôpital's rule, we obtain

lim_{α→0} (|x| − ε + α)/α = 1,

which implies that lim_{α→0} (ψ_{ε,p}(x, α) − |x|_ε) = 0 under this case.

(3) If ε ≤ |x| ≤ ε + α/(p−1), then

ψ_{ε,p}(x, α) − |x|_ε = (α/(p−1)) ((p−1)(|x| − ε + α)/(pα))^p − (|x| − ε).

We have shown in case (2) that

lim_{α→0} (α/(p−1)) ((p−1)(|x| − ε + α)/(pα))^p = 0.

It is also obvious that lim_{α→0} (|x| − ε) = 0. Hence, we obtain lim_{α→0} (ψ_{ε,p}(x, α) − |x|_ε) = 0 under this case.

(4) If |x| ≥ ε + α/(p−1), the desired result follows since it is clear that ψ_{ε,p}(x, α) − |x|_ε = 0. From all the above, the proof is complete.

Lemma 3.2 Let ψ_{ε,p}(x, α) be defined as in (19). Then, we have

(a) ψ_{ε,p}(x, α) sgn(x) is smooth with respect to x for any p ≥ 2;

(b) lim_{α→0} ψ_{ε,p}(x, α) sgn(x) = |x|_ε sgn(x) for any p ≥ 2.

Proof (a) First, we observe that ψ_{ε,p}(x, α) sgn(x) can be written as

ψ_{ε,p}(x, α) sgn(x) = { 0                                                      if 0 ≤ |x| ≤ ε − α,
                         (α/(p−1)) ((p−1)(|x| − ε + α)/(pα))^p sgn(x)           if ε − α < |x| < ε + α/(p−1),
                         (|x| − ε) sgn(x)                                        if |x| ≥ ε + α/(p−1).

Note that sgn(x) is continuous except at x = 0 and ψ_{ε,p}(x, α) = 0 near x = 0; then applying Lemma 3.1(a) yields that ψ_{ε,p}(x, α) sgn(x) is continuous. Furthermore, by simple calculations, we have

∇_x (ψ_{ε,p}(x, α) sgn(x)) = ∇_x ψ_{ε,p}(x, α) sgn(x)
  = { 0                                           if 0 ≤ |x| ≤ ε − α,
      ((p−1)(|x| − ε + α)/(pα))^{p−1}             if ε − α < |x| < ε + α/(p−1),
      1                                            if |x| ≥ ε + α/(p−1).        (21)

Mimicking the arguments as in Lemma 3.1(a), we can verify that ∇_x (ψ_{ε,p}(x, α) sgn(x)) is continuous. Thus, the desired result follows.

(b) By Lemma 3.1(b), it is easy to see that lim_{α→0} ψ_{ε,p}(x, α) sgn(x) = |x|_ε sgn(x). Then, the desired result follows.

Note that |x|_ε^2 is smooth with

∇(|x|_ε^2) = 2|x|_ε sgn(x) = { 2(|x| − ε) sgn(x)   if |x| > ε,
                               0                    if |x| ≤ ε,

being continuous (but not differentiable). Then, we consider the optimality condition of (12), that is,

∇_ω F_{ε,0}(ω) = ω + C Σ_{i=1}^m |Ā_i^T ω − y_i|_ε sgn(Ā_i^T ω − y_i) Ā_i = 0,        (22)

which is indeed sufficient and necessary for (5). Hence, solving (22) is equivalent to solving (5).

Using the family of smoothing functions ψ_{ε,p} to replace the ε-insensitive loss function in (22) leads to a system of smooth equations. More specifically, we define a function H_ψ : IR^{n+2} → IR^{n+2} by

H_ψ(z) = H_ψ(α, ω) = ( α, ω + C Σ_{i=1}^m ψ_{ε,p}(Ā_i^T ω − y_i, α) sgn(Ā_i^T ω − y_i) Ā_i ),

where z := (α, ω) ∈ IR^{n+2}. From Lemma 3.1, it is easy to see that if H_ψ(z) = 0, then α = 0 and ω is the solution of the equation (22), i.e., the solution of (12). Moreover, for any z ∈ IR_{++} × IR^{n+1}, the function H_ψ is continuously differentiable with

H'_ψ(z) = [ 1      0
            E(ω)   I + D(ω) ],        (23)

where

E(ω) = C Σ_{i=1}^m ∇_α ψ_{ε,p}(Ā_i^T ω − y_i, α) sgn(Ā_i^T ω − y_i) Ā_i,
D(ω) = C Σ_{i=1}^m ∇_x ψ_{ε,p}(Ā_i^T ω − y_i, α) sgn(Ā_i^T ω − y_i) Ā_i Ā_i^T.
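Analogously to the sketch after (18), H_ψ(z) and H'_ψ(z) can be assembled as follows (our own code, reusing psi_eps_p, psi_eps_p_dx and psi_eps_p_da from the earlier sketches):

import numpy as np

def H_psi(z, A_bar, y, C, eps, p=2):
    """H_psi(z) = (alpha, omega + C * sum_i psi_{eps,p}(r_i, alpha) * sgn(r_i) * A_bar_i)."""
    alpha, omega = z[0], z[1:]
    r = A_bar @ omega - y
    grad = omega + C * (A_bar.T @ (psi_eps_p(r, eps, alpha, p) * np.sign(r)))
    return np.concatenate(([alpha], grad))

def H_psi_jac(z, A_bar, y, C, eps, p=2):
    """Jacobian (23): [[1, 0], [E(omega), I + D(omega)]]."""
    alpha, omega = z[0], z[1:]
    r = A_bar @ omega - y
    s = np.sign(r)
    n1 = omega.size
    J = np.zeros((n1 + 1, n1 + 1))
    J[0, 0] = 1.0
    J[1:, 0] = C * (A_bar.T @ (psi_eps_p_da(r, eps, alpha, p) * s))             # E(omega)
    D = C * (A_bar.T @ ((psi_eps_p_dx(r, eps, alpha, p) * s)[:, None] * A_bar))
    J[1:, 1:] = np.eye(n1) + D                                                  # I + D(omega)
    return J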

Because ∇_x ψ_{ε,p}(Ā_i^T ω − y_i, α) sgn(Ā_i^T ω − y_i) is nonnegative for any α > 0 from (21), we see that I + D(ω) is positive definite at any z ∈ IR_{++} × IR^{n+1}. Following similar arguments as in Sect. 2, we obtain that H'_ψ(z) is invertible at any z ∈ IR_{++} × IR^{n+1}.

Proposition 3.1 Let φ_ε(x, α) be defined as in (6) and ψ_{ε,p}(x, α) be defined as in (19). Then, the following hold.

(a) For p ≥ 2, we have φ_ε(x, α) ≥ ψ_{ε,p}^2(x, α) ≥ |x|_ε^2.

(b) For p ≥ q ≥ 2, we have ψ_{ε,q}(x, α) ≥ ψ_{ε,p}(x, α).

Proof (a) First, we show that φ_ε(x, α) ≥ ψ_{ε,p}^2(x, α) holds. To proceed, we discuss four cases.

(i) If |x| ≤ ε − α, then φ_ε(x, α) = 0 = ψ_{ε,p}^2(x, α).

(ii) If ε − α < |x| < ε + α/(p−1), then |x| ≤ ε + α/(p−1), which is equivalent to |x| − ε + α ≤ pα/(p−1). Thus, we have

φ_ε(x, α)/ψ_{ε,p}^2(x, α) = α^{2p−3} p^{2p} / (6(p−1)^{2p−2} (|x| − ε + α)^{2p−3}) ≥ p^3/(6(p−1)) ≥ 1,

which implies φ_ε(x, α) ≥ ψ_{ε,p}^2(x, α).

(iii) For ε + α/(p−1) ≤ |x| < ε + α, letting t := |x| − ε ∈ [α/(p−1), α) yields

φ_ε(x, α) − ψ_{ε,p}^2(x, α) = (1/(6α))(t + α)^3 − t^2 = t( t^2/(6α) − t/2 + α/2 ) + α^2/6 ≥ 0,

where the last inequality follows from the fact that the discriminant of t^2/(6α) − t/2 + α/2 is less than 0 and 1/(6α) > 0. Then, φ_ε(x, α) − ψ_{ε,p}^2(x, α) > 0.

(iv) If |x| ≥ ε + α, then it is clear that φ_ε(x, α) = (|x| − ε)^2 + (1/3)α^2 ≥ (|x| − ε)^2 = ψ_{ε,p}^2(x, α).

Now we show the other part, ψ_{ε,p}(x, α) ≥ |x|_ε, which is equivalent to verifying ψ_{ε,p}^2(x, α) ≥ |x|_ε^2. Again, we discuss four cases.

(i) If |x| ≤ ε − α, then ψ_{ε,p}(x, α) = 0 = |x|_ε.

(ii) If ε − α < |x| ≤ ε, then ε − α < |x|, which says |x| − ε + α > 0. Thus, we have ψ_{ε,p}(x, α) ≥ 0 = |x|_ε.

(iii) For ε < |x| < ε + α/(p−1), we let t := |x| − ε ∈ (0, α/(p−1)) and define the function

f(t) = (α/(p−1)) ((p−1)(t + α)/(pα))^p − t

on (0, α/(p−1)). Note that f(|x| − ε) = ψ_{ε,p}(x, α) − |x|_ε for |x| ∈ (ε, ε + α/(p−1)), and observe that

f'(t) = ((p−1)(t + α)/(pα))^{p−1} − 1 ≤ ((p−1)(α/(p−1) + α)/(pα))^{p−1} − 1 = 0.

This means f(t) is monotone decreasing on (0, α/(p−1)). Since f(α/(p−1)) = 0, we have f(t) ≥ 0 for t ∈ (0, α/(p−1)), which implies ψ_{ε,p}(x, α) ≥ |x|_ε for |x| ∈ (ε, ε + α/(p−1)).

(iv) If |x| ≥ ε + α/(p−1), then it is clear that ψ_{ε,p}(x, α) = |x| − ε = |x|_ε.

(b) For p ≥ q ≥ 2, it is obvious to see that

ψ_{ε,q}(x, α) = ψ_{ε,p}(x, α) for |x| ∈ [0, ε − α] ∪ [ε + α/(q−1), +∞).

If |x| ∈ [ε + α/(p−1), ε + α/(q−1)), then ψ_{ε,p}(x, α) = |x|_ε ≤ ψ_{ε,q}(x, α) from the above. Thus, we only need to prove the case of |x| ∈ (ε − α, ε + α/(p−1)).

Consider |x| ∈ (ε − α, ε + α/(p−1)) and t := |x| − ε + α; we observe that t/α ≤ p/(p−1). Then, we verify that

ψ_{ε,q}(x, α)/ψ_{ε,p}(x, α) = [(q−1)^{q−1} p^p / ((p−1)^{p−1} q^q)] (α/t)^{p−q}
                            ≥ [(q−1)^{q−1} p^p / ((p−1)^{p−1} q^q)] ((p−1)/p)^{p−q}
                            = (p/q)^q ((q−1)/(p−1))^{q−1}
                            = (1 + (p−q)/q)^q / (1 + (p−q)/(q−1))^{q−1}
                            ≥ 1,

where the last inequality is due to (1 + (p−q)/x)^x being increasing for x > 0. Thus, the proof is complete.
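The ordering in Proposition 3.1(a) can also be checked numerically with the sketches given after (6) and (19); the values of ε, α and p below are arbitrary choices of ours:

import numpy as np

eps, alpha, p = 0.1, 0.06, 2
x = np.linspace(-0.3, 0.3, 2001)
phi = phi_eps(x, eps, alpha)
psi_sq = psi_eps_p(x, eps, alpha, p) ** 2
loss_sq = np.maximum(0.0, np.abs(x) - eps) ** 2
# Proposition 3.1(a): phi >= psi^2 >= |x|_eps^2 (up to floating-point noise).
assert np.all(phi >= psi_sq - 1e-12) and np.all(psi_sq >= loss_sq - 1e-12)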

4 A smoothing Newton algorithm

In Sects. 2 and 3, we constructed two systems of smooth equations: H_φ(z) = 0 and H_ψ(z) = 0. We briefly describe the difference between them.

In general, the ways we arrive at H_φ(z) = 0 and H_ψ(z) = 0 are slightly different. To obtain H_φ(z) = 0, we first use the twice continuously differentiable functions φ_ε(x, α) to replace |x|_ε^2 in problem (5), and then write out its optimality condition. In contrast, to obtain H_ψ(z) = 0, we write out the optimality condition of problem (5) first, and then use the smoothing functions ψ_{ε,p}(x, α) to replace the ε-insensitive loss function in (22). For convenience, we denote H̃(z) ∈ {H_φ(z), H_ψ(z)}. In other words, H̃(z) possesses the property that if H̃(z) = 0, then α = 0 and ω solves (12). In view of this, we apply a Newton-type method to the system of smooth equations H̃(z) = 0 at each iteration and let α → 0 so that the solution of the problem (12) can be found.

Algorithm 4.1 (A smoothing Newton method)

Step 0 Choose δ ∈ (0, 1), σ ∈ (0, 1/2), and α_0 > 0. Take τ ∈ (0, 1) such that τα_0 < 1. Let ω^0 ∈ IR^{n+1} be an arbitrary vector. Set z^0 := (α_0, ω^0) and e^0 := (1, 0, . . . , 0) ∈ IR^{n+2}.

Step 1 If H̃(z^k) = 0, stop.

Step 2 Define the functions Γ, β by

Γ(z) := ‖H̃(z)‖^2 and β(z) := τ min{1, Γ(z)}.        (24)

Compute Δz^k := (Δα^k, Δω^k) by

H̃(z^k) + H̃'(z^k) Δz^k = α_0 β(z^k) e^0.

Step 3 Let θ_k be the maximum of the values 1, δ, δ^2, . . . such that

Γ(z^k + θ_k Δz^k) ≤ [1 − 2σ(1 − τα_0) θ_k] Γ(z^k).        (25)

Step 4 Set z^{k+1} := z^k + θ_k Δz^k and k := k + 1. Go to Step 1.
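A compact sketch of Algorithm 4.1 (our own code, not the authors' implementation). H and H_jac stand for H̃ and H̃' (for example H_phi / H_phi_jac from the earlier sketches, with the data arguments bound by a lambda), Γ(z) = ‖H̃(z)‖^2 is the merit function, and the backtracking loop realizes (25):

import numpy as np

def smoothing_newton(H, H_jac, z0, alpha0, tau=0.5, delta=0.5, sigma=1e-4,
                     tol=1e-10, max_iter=200):
    """Algorithm 4.1: smoothing Newton iteration on H_tilde(z) = 0.
    Assumes z0[0] == alpha0 > 0 and tau * alpha0 < 1."""
    z = np.asarray(z0, dtype=float).copy()
    e0 = np.zeros_like(z); e0[0] = 1.0
    for _ in range(max_iter):
        Hz = H(z)
        gamma = Hz @ Hz                       # Gamma(z) = ||H(z)||^2
        if np.sqrt(gamma) <= tol:             # Step 1
            break
        beta = tau * min(1.0, gamma)          # (24)
        # Step 2: solve H(z) + H'(z) dz = alpha0 * beta(z) * e0
        dz = np.linalg.solve(H_jac(z), alpha0 * beta * e0 - Hz)
        # Step 3: largest theta in {1, delta, delta^2, ...} satisfying (25)
        theta = 1.0
        while theta > 1e-14:
            z_new = z + theta * dz
            Hn = H(z_new)
            if Hn @ Hn <= (1.0 - 2.0 * sigma * (1.0 - tau * alpha0) * theta) * gamma:
                break
            theta *= delta
        z = z_new                             # Step 4
    return z

# Example call (assuming A_bar, y, C, eps are defined):
# z0 = np.concatenate(([0.5], np.zeros(A_bar.shape[1])))
# z_star = smoothing_newton(lambda z: H_phi(z, A_bar, y, C, eps),
#                           lambda z: H_phi_jac(z, A_bar, y, C, eps), z0, alpha0=0.5)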

Proposition 4.1 Suppose that the sequence {z^k} is generated by Algorithm 4.1. Then, the following results hold.

(a) {Γ(z^k)} is monotonically decreasing.

(b) {‖H̃(z^k)‖} and {β(z^k)} are monotonically decreasing.

(c) Let N(τ) := {z ∈ IR_+ × IR^{n+1} : α_0 β(z) ≤ α}; then z^k ∈ N(τ) for any k ∈ K and 0 < α_{k+1} ≤ α_k.

(d) The algorithm is well defined.

Proof Since the proof is very similar to [6, Remark 2.1], we omit it here.

Lemma 4.1 Let λ̄ := max_i λ_i( Σ_{i=1}^m Ā_i Ā_i^T ). Then, for any z ∈ IR_{++} × IR^{n+1}, we have

(a) 1 ≤ λ_i(H'_φ(z)) ≤ 1 + 2λ̄, i = 1, . . . , n + 2;

(b) 1 ≤ λ_i(H'_ψ(z)) ≤ 1 + λ̄, i = 1, . . . , n + 2.

Proof (a) H_φ(z) is continuously differentiable at any z ∈ IR_{++} × IR^{n+1}, and by (18) it is easy to see that {1, λ_1(∇²_{ωω} F_{ε,α}(ω)), . . . , λ_{n+1}(∇²_{ωω} F_{ε,α}(ω))} are the eigenvalues of H'_φ(z). From the representation of ∇²_{xx} φ_ε in (8), we have 0 ≤ ∇²_{xx} φ_ε(Ā_i^T ω − y_i, α) ≤ 2.

Since ∇²_{ωω} F_{ε,α}(ω) = I + Σ_{i=1}^m ∇²_{xx} φ_ε(Ā_i^T ω − y_i, α) Ā_i Ā_i^T, we obtain

1 ≤ λ_i(∇²_{ωω} F_{ε,α}(ω)) ≤ 1 + 2λ̄, i = 1, . . . , n + 1.        (26)

Thus, the result in (a) holds.

2 Distributed classification algorithms Kernel support vector machines Linear support vector machines Parallel tree learning?. 3 Distributed clustering