to appear in Computational Optimization and Applications, 2018

### Two smooth support vector machines for ε-insensitive regression

Weizhe Gu
Department of Mathematics, School of Science, Tianjin University, Tianjin 300072, P.R. China
E-mail: weizhegu@yahoo.com.cn

Wei-Po Chen
Department of Mathematics, National Taiwan Normal University, Taipei 11677, Taiwan
E-mail: weaper@gmail.com

Chun-Hsu Ko
Department of Electrical Engineering, I-Shou University, Kaohsiung 840, Taiwan
E-mail: chko@isu.edu.tw

Yuh-Jye Lee
Department of Applied Mathematics, National Chiao Tung University, Hsinchu 300, Taiwan
E-mail: yuhjye@math.nctu.edu.tw

Jein-Shan Chen ^{1}
Department of Mathematics, National Taiwan Normal University, Taipei 11677, Taiwan
E-mail: jschen@math.ntnu.edu.tw

1The author’s work is supported by Ministry of Science and Technology, Taiwan.

June 7, 2017

(1st revised on September 21, 2017) (2nd revised on November 6, 2017)

Abstract. In this paper, we propose two new smooth support vector machines for ε-insensitive regression. For these two smooth support vector machines, we construct two systems of smooth equations based on two novel families of smoothing functions, from which we seek the solution to ε-support vector regression (ε-SVR). More specifically, using the proposed smoothing functions, we employ the smoothing Newton method to solve the systems of smooth equations. The algorithm is shown to be globally and quadratically convergent without any additional conditions. Numerical comparisons among different values of the smoothing parameter are also reported.

Key words. support vector machine, ε-insensitive loss function, ε-smooth support vector regression, smoothing Newton algorithm

### 1 Introduction

Support vector machine (SVM) is a popular and important statistical learning tech-
nology [1, 9, 10, 16, 17, 18, 19]. Generally speaking, there are two main categories
for support vector machines (SVMs): support vector classification (SVC) and support
vector regression (SVR). The model produced by SVR depends on a training data set
S = {(A_{1}, y_{1}), . . . , (A_{m}, y_{m})} ⊆ IR^{n}× IR, where A_{i} ∈ IR^{n} is the input data and y_{i} ∈ IR is
called the observation. The main goal of ε-insensitive regression with the idea of SVMs
is to find a linear or nonlinear regression function f that has at most ε deviation from
the actually obtained targets y_{i} for all the training data, and at the same time is as flat
as possible. This problem is called ε-support vector regression (ε-SVR).

For pedagogical reasons, we begin with the linear case, in which the regression function f is defined as

f(A) = A^{T}x + b with x ∈ IR^{n}, b ∈ IR.        (1)
Flatness in the case of (1) means that one seeks a small x. One way to ensure this is
to minimize the norm of x, then the problem ε-SVR can be formulated as a constrained
minimization problem:

min  (1/2) x^{T}x + C Σ_{i=1}^{m} (ξ_{i} + ξ^{∗}_{i})
s.t.  y_{i} − A^{T}_{i}x − b ≤ ε + ξ_{i},
      A^{T}_{i}x + b − y_{i} ≤ ε + ξ^{∗}_{i},
      ξ_{i}, ξ^{∗}_{i} ≥ 0,  i = 1, · · · , m.        (2)

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. This corresponds to dealing with the so-called ε-insensitive loss function |ξ|_{ε} described by

|ξ|_{ε} = max{0, |ξ| − ε}.

The formulation (2) is a convex quadratic minimization problem with n+1 free variables, 2m nonnegative variables, and 2m inequality constraints, which enlarges the problem size and could increase computational complexity.

In fact, the problem (2) can be reformulated as an unconstrained optimization problem:

min_{(x,b)∈IR^{n+1}}  (1/2)(x^{T}x + b^{2}) + (C/2) Σ_{i=1}^{m} |A^{T}_{i}x + b − y_{i}|^{2}_{ε}.        (3)

This formulation has been proposed in active set support vector regression [8] and solved in its dual form. The objective function is strongly convex; hence, the problem has a unique global optimal solution. However, since the objective function is not twice continuously differentiable, Newton-type algorithms cannot be applied to solve (3) directly.

Lee, Hsieh and Huang [9] apply a smoothing technique to (3). The smooth function

f_{ε}(x, α) = x + (1/α) log(1 + e^{−αx}),        (4)

which is the integral of the sigmoid function 1/(1 + e^{−αx}), is used to smooth the plus function [x]_{+}. More specifically, the smooth function f_{ε}(x, α) approaches [x]_{+} as α goes to infinity. Then, the problem (3) is recast as a strongly convex unconstrained minimization problem with the smooth function f_{ε}(x, α), and a Newton-Armijo algorithm is proposed to solve it. It is proved that when the smoothing parameter α approaches infinity, the unique solution of the reformulated problem converges to the unique solution of the original problem [9, Theorem 2.2]. However, the smoothing parameter α is fixed in the proposed algorithm, and in the implementation of this algorithm, α cannot be set large enough.
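As a quick numerical illustration of this smoothing idea (our own sketch, not code from the paper), the following Python snippet implements the function in (4) in an overflow-safe way and checks that it dominates [x]_{+} with a uniform gap of log(2)/α, which shrinks only as α grows — the practical limitation mentioned above.

```python
import math

def p_smooth(x, alpha):
    """Smooth approximation of the plus function [x]_+ from (4):
    p(x, alpha) = x + (1/alpha) * log(1 + exp(-alpha * x))."""
    # Stable evaluation: log(1 + e^t) = max(t, 0) + log(1 + e^{-|t|}).
    t = -alpha * x
    return x + (max(t, 0.0) + math.log1p(math.exp(-abs(t)))) / alpha

plus = lambda x: max(x, 0.0)

# The approximation always lies above [x]_+ and tightens as alpha grows;
# the worst-case gap log(2)/alpha is attained at x = 0.
for x in (-1.0, -0.1, 0.0, 0.1, 1.0):
    for alpha in (1.0, 10.0, 100.0):
        assert p_smooth(x, alpha) >= plus(x)
        assert p_smooth(x, alpha) - plus(x) <= math.log(2.0) / alpha + 1e-12
```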

In this paper, we introduce two smooth support vector machines for ε-insensitive regression. For the first smooth support vector machine, we reformulate ε-SVR as a strongly convex unconstrained optimization problem with one type of smoothing functions φ_{ε}(x, α). Then, we define a new function H_{φ}, which corresponds to the optimality condition of the unconstrained optimization problem. From the solution of H_{φ}(z) = 0, we can obtain the solution of ε-SVR. For the second smooth support vector machine, we smooth the optimality condition of the strongly convex unconstrained optimization problem (3) with another type of smoothing functions ψ_{ε}(x, α). Accordingly, we define the function H_{ψ}, which possesses the same properties as H_{φ} does. For either H_{φ} = 0 or H_{ψ} = 0, we apply the smoothing Newton method to solve it. The algorithm is shown to be globally convergent; specifically, the iterative sequence converges to the unique solution of (3). Furthermore, the algorithm is shown to be locally quadratically convergent without any additional assumptions.

The paper is organized as follows. In Section 2 and Section 3, we introduce two smooth support vector machine reformulations by two types of smoothing functions. In Section 4, we propose a smoothing Newton algorithm and study its global and local quadratic convergence. Numerical results and comparisons are reported in Section 5.

Throughout this paper, K := {1, 2, · · · } and all vectors are column vectors. For a given vector x = (x_{1}, . . . , x_{n})^{T} ∈ IR^{n}, the plus function [x]_{+} is defined as

([x]_{+})_{i} = max{0, x_{i}}, i = 1, · · · , n.

For a differentiable function f, we denote by ∇f(x) and ∇^{2}f(x) the gradient and the Hessian matrix of f at x, respectively. For a differentiable mapping G : IR^{n} → IR^{m}, we denote by G′(x) ∈ IR^{m×n} the Jacobian of G at x. For a matrix A ∈ IR^{m×n}, A^{T}_{i} is the i-th row of A. A column vector of ones and an identity matrix of arbitrary dimension are denoted by 1 and I, respectively. We denote the sign function by

sgn(x) =
  1        if x > 0,
  [−1, 1]  if x = 0,
  −1       if x < 0.

### 2 The first smooth support vector machine

As mentioned in [9], it is known that ε-SVR can be reformulated as the strongly convex unconstrained optimization problem (3). Denote ω := (x, b) ∈ IR^{n+1} and ¯A := (A, 1), and let ¯A^{T}_{i} be the i-th row of ¯A. Then the smooth support vector regression (3) can be rewritten as

min_{ω}  (1/2) ω^{T}ω + (C/2) Σ_{i=1}^{m} |¯A^{T}_{i}ω − y_{i}|^{2}_{ε}.        (5)

Note that | · |^{2}_{ε} is smooth but not twice differentiable, which means the objective function is not twice continuously differentiable. Hence, Newton-type methods cannot be applied to solve (5) directly.

In view of this fact, we propose a family of twice continuously differentiable functions φ_{ε}(x, α) to replace |x|^{2}_{ε}. The family of functions φ_{ε}(x, α) : IR × IR_{+} → IR_{+} is given by

φ_{ε}(x, α) =
  (|x| − ε)^{2} + (1/3)α^{2}   if |x| − ε ≥ α,
  (1/(6α))(|x| − ε + α)^{3}    if ||x| − ε| < α,
  0                            if |x| − ε ≤ −α,        (6)

where 0 < α < ε is a smoothing parameter. The graphs of φ_{ε}(x, α) are depicted in Figure 1. From this geometric view, it is clear that φ_{ε}(x, α) is a class of smoothing functions for |x|^{2}_{ε}.


Figure 1: Graphs of φ_{ε}(x, α) with ε = 0.1 and α = 0.03, 0.06, 0.09.
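Complementing Figure 1, here is a small Python sketch (the names `phi` and `eps_loss_sq` are ours) that evaluates φ_{ε} on a grid around the kinks and numerically verifies the bound 0 ≤ φ_{ε}(x, α) − |x|^{2}_{ε} ≤ α^{2}/3 established in Lemma 2.1(a) below.

```python
def phi(x, eps, alpha):
    """The smoothing function phi_eps(x, alpha) of (6), assuming 0 < alpha < eps."""
    s = abs(x) - eps
    if s >= alpha:
        return s * s + alpha * alpha / 3.0
    if abs(s) < alpha:   # middle piece: -alpha < |x| - eps < alpha
        return (s + alpha) ** 3 / (6.0 * alpha)
    return 0.0           # s <= -alpha

def eps_loss_sq(x, eps):
    """The squared eps-insensitive loss |x|_eps^2."""
    return max(0.0, abs(x) - eps) ** 2

eps, alpha = 0.1, 0.06
xs = [i * 1e-3 for i in range(-300, 301)]      # grid covering all three pieces
for x in xs:
    gap = phi(x, eps, alpha) - eps_loss_sq(x, eps)
    # Lemma 2.1(a): 0 <= phi - |x|_eps^2 <= alpha^2 / 3
    assert -1e-12 <= gap <= alpha * alpha / 3.0 + 1e-12
```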

Besides the geometric approach, we now show by algebraic verification that φ_{ε}(x, α) is a class of smoothing functions for |x|^{2}_{ε}. To this end, we compute the partial derivatives of φ_{ε}(x, α) as below:

∇_{x}φ_{ε}(x, α) =
  2(|x| − ε) sgn(x)                   if |x| − ε ≥ α,
  (1/(2α))(|x| − ε + α)^{2} sgn(x)    if ||x| − ε| < α,
  0                                   if |x| − ε ≤ −α.        (7)

∇^{2}_{xx}φ_{ε}(x, α) =
  2                    if |x| − ε ≥ α,
  (|x| − ε + α)/α      if ||x| − ε| < α,
  0                    if |x| − ε ≤ −α.        (8)

∇^{2}_{xα}φ_{ε}(x, α) =
  0                                                 if |x| − ε ≥ α,
  ((|x| − ε + α)(α − |x| + ε)/(2α^{2})) sgn(x)      if ||x| − ε| < α,
  0                                                 if |x| − ε ≤ −α.        (9)

With the above, the following lemma shows some basic properties of φ_{ε}(x, α).

Lemma 2.1. Let φ_{ε}(x, α) be defined as in (6). Then, the following hold.

(a) For 0 < α < ε, there holds 0 ≤ φ_{ε}(x, α) − |x|^{2}_{ε} ≤ (1/3)α^{2}.

(b) The function φ_{ε}(x, α) is twice continuously differentiable with respect to x for 0 < α < ε.

(c) lim_{α→0} φ_{ε}(x, α) = |x|^{2}_{ε} and lim_{α→0} ∇_{x}φ_{ε}(x, α) = ∇(|x|^{2}_{ε}).

Proof. (a) To complete the arguments, we need to discuss four cases.

(i) For |x| − ε ≥ α, it is clear that φ_{ε}(x, α) − |x|^{2}_{ε} = (1/3)α^{2}.

(ii) For 0 < |x| − ε < α, i.e., 0 < x − ε < α or 0 < −x − ε < α, there are two subcases. If 0 < x − ε < α, letting f(x) := φ_{ε}(x, α) − |x|^{2}_{ε} = (1/(6α))(x − ε + α)^{3} − (x − ε)^{2} gives

f′(x) = (x − ε + α)^{2}/(2α) − 2(x − ε),  ∀x ∈ (ε, ε + α),
f″(x) = (x − ε + α)/α − 2 < 0,  ∀x ∈ (ε, ε + α).

This indicates that f′(x) is monotone decreasing on (ε, ε + α), which further implies

f′(x) ≥ f′(ε + α) = 0,  ∀x ∈ (ε, ε + α).

Thus, we obtain that f(x) is monotone increasing on (ε, ε + α). With this, we have f(x) ≤ f(ε + α) = (1/3)α^{2}, which yields

φ_{ε}(x, α) − |x|^{2}_{ε} ≤ (1/3)α^{2},  ∀x ∈ (ε, ε + α).

If 0 < −x − ε < α, the arguments are similar as above, and we omit them.

(iii) For −α < |x| − ε ≤ 0, it is clear that φ_{ε}(x, α) − |x|^{2}_{ε} = (1/(6α))(|x| − ε + α)^{3} ≤ α^{3}/(6α) ≤ α^{2}/3.

(iv) For |x| − ε ≤ −α, we have φ_{ε}(x, α) − |x|^{2}_{ε} = 0. Then, the desired result follows.

(b) To prove the twice continuous differentiability of φ_{ε}(x, α), we need to check that φ_{ε}(·, α), ∇_{x}φ_{ε}(·, α) and ∇^{2}_{xx}φ_{ε}(·, α) are all continuous. Since they are piecewise functions, it suffices to check the junction points.

First, we check that φ_{ε}(·, α) is continuous.
(i) If |x| − ε = α, then both adjacent pieces give φ_{ε}(x, α) = (4/3)α^{2}, which implies φ_{ε}(·, α) is continuous.
(ii) If |x| − ε = −α, then φ_{ε}(x, α) = 0. Hence, φ_{ε}(·, α) is continuous.

Next, we check that ∇_{x}φ_{ε}(·, α) is continuous.
(i) If |x| − ε = α, then ∇_{x}φ_{ε}(x, α) = 2α sgn(x).
(ii) If |x| − ε = −α, then ∇_{x}φ_{ε}(x, α) = 0.
From the above, it is clear that ∇_{x}φ_{ε}(·, α) is continuous.

Now we show that ∇^{2}_{xx}φ_{ε}(·, α) is continuous.
(i) If |x| − ε = α, then ∇^{2}_{xx}φ_{ε}(x, α) = 2.
(ii) If |x| − ε = −α, then ∇^{2}_{xx}φ_{ε}(x, α) = 0.
Hence, ∇^{2}_{xx}φ_{ε}(·, α) is continuous.

(c) It is clear that lim_{α→0} φ_{ε}(x, α) = |x|^{2}_{ε} holds by part (a). It remains to verify that lim_{α→0} ∇_{x}φ_{ε}(x, α) = ∇(|x|^{2}_{ε}). First, we compute that

∇(|x|^{2}_{ε}) =
  2(|x| − ε) sgn(x)  if |x| − ε ≥ 0,
  0                  if |x| − ε < 0.        (10)

In light of (10), we proceed by discussing four cases.

(i) For |x| − ε ≥ α, we have ∇_{x}φ_{ε}(x, α) − ∇(|x|^{2}_{ε}) = 0. Then, the desired result follows.

(ii) For 0 < |x| − ε < α, we have

∇_{x}φ_{ε}(x, α) − ∇(|x|^{2}_{ε}) = (1/(2α))(|x| − ε + α)^{2} sgn(x) − 2(|x| − ε) sgn(x),

which yields

lim_{α→0} (∇_{x}φ_{ε}(x, α) − ∇(|x|^{2}_{ε})) = lim_{α→0} [((|x| − ε + α)^{2} − 4α(|x| − ε))/(2α)] sgn(x).

We notice that |x| → ε when α → 0, and hence ((|x| − ε + α)^{2} − 4α(|x| − ε))/(2α) → 0/0. Then, applying L'Hôpital's rule yields

lim_{α→0} ((|x| − ε + α)^{2} − 4α(|x| − ε))/(2α) = lim_{α→0} (α − (|x| − ε)) = 0.

This implies lim_{α→0} (∇_{x}φ_{ε}(x, α) − ∇(|x|^{2}_{ε})) = 0, which is the desired result.

(iii) For −α < |x| − ε ≤ 0, we have ∇_{x}φ_{ε}(x, α) − ∇(|x|^{2}_{ε}) = (1/(2α))(|x| − ε + α)^{2} sgn(x). Then, applying L'Hôpital's rule gives

lim_{α→0} ((|x| − ε + α)^{2})/(2α) = lim_{α→0} (|x| − ε + α) = 0.

Thus, we prove that lim_{α→0} (∇_{x}φ_{ε}(x, α) − ∇(|x|^{2}_{ε})) = 0 under this case.

(iv) For |x| − ε ≤ −α, we have ∇_{x}φ_{ε}(x, α) − ∇(|x|^{2}_{ε}) = 0. Then, the desired result follows clearly. □

Now, we use the family of smoothing functions φ_{ε} to replace the square of the ε-insensitive loss function in (5) and obtain the first smooth support vector regression. In other words, we consider

min_{ω}  F_{ε,α}(ω) := (1/2) ω^{T}ω + (C/2) 1^{T}Φ_{ε}(¯Aω − y, α),        (11)

where ω := (x, b) ∈ IR^{n+1} and Φ_{ε}(Ax + 1b − y, α) ∈ IR^{m} is defined componentwise by

Φ_{ε}(Ax + 1b − y, α)_{i} = φ_{ε}(A^{T}_{i}x + b − y_{i}, α).

This is a strongly convex unconstrained optimization problem with a twice continuously differentiable objective function. Noting that lim_{α→0} φ_{ε}(x, α) = |x|^{2}_{ε}, we see that

min_{ω}  F_{ε,0}(ω) := lim_{α→0} F_{ε,α}(ω) = (1/2) ω^{T}ω + (C/2) Σ_{i=1}^{m} |¯A^{T}_{i}ω − y_{i}|^{2}_{ε},        (12)

which is exactly the problem (5).
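The relation between (11) and (12) can be checked numerically. The sketch below (toy data and names are ours) evaluates both objectives at a random point and confirms 0 ≤ F_{ε,α}(ω) − F_{ε,0}(ω) ≤ Cmα^{2}/6, i.e., Lemma 2.1(a) summed over the m data terms — the same constant that appears later in Theorem 2.1(b).

```python
import random

def phi(x, eps, alpha):
    """Smoothing function (6) for the squared eps-insensitive loss."""
    s = abs(x) - eps
    if s >= alpha:
        return s * s + alpha * alpha / 3.0
    if abs(s) < alpha:
        return (s + alpha) ** 3 / (6.0 * alpha)
    return 0.0

def F(omega, A, y, C, eps, alpha):
    """F_{eps,alpha}(omega) of (11); alpha = 0 gives F_{eps,0} of (12)."""
    x, b = omega[:-1], omega[-1]
    obj = 0.5 * sum(w * w for w in omega)          # regularization term
    for Ai, yi in zip(A, y):
        r = sum(a * w for a, w in zip(Ai, x)) + b - yi
        loss = phi(r, eps, alpha) if alpha > 0 else max(0.0, abs(r) - eps) ** 2
        obj += 0.5 * C * loss
    return obj

random.seed(0)
m, n, C, eps, alpha = 20, 3, 10.0, 0.1, 0.05
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
y = [sum(Ai) + random.uniform(-0.05, 0.05) for Ai in A]
omega = [random.uniform(-1, 1) for _ in range(n + 1)]

# Summing Lemma 2.1(a) over the m terms: 0 <= F_alpha - F_0 <= C*m*alpha^2/6.
gap = F(omega, A, y, C, eps, alpha) - F(omega, A, y, C, eps, 0.0)
assert 0.0 <= gap <= C * m * alpha ** 2 / 6.0 + 1e-12
```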

The following theorem shows that the unique solution of the smooth problem (11) approaches the unique solution of the problem (12) as α → 0. Indeed, it plays the same role as [9, Theorem 2.2].

Theorem 2.1. Let F_{ε,α}(ω) and F_{ε,0}(ω) be defined as in (11) and (12), respectively. Then, the following hold.

(a) There exists a unique solution ¯ω_{α} to min_{ω∈IR^{n+1}} F_{ε,α}(ω) and a unique solution ¯ω to min_{ω∈IR^{n+1}} F_{ε,0}(ω).

(b) For all 0 < α < ε, we have the following inequality:

‖¯ω_{α} − ¯ω‖^{2} ≤ (1/6) C m α^{2}.        (13)

Moreover, ¯ω_{α} converges to ¯ω as α → 0 with an upper bound given by (13).

Proof. (a) In view of φ_{ε}(x, α) − |x|^{2}_{ε} ≥ 0 in Lemma 2.1(a), we see that the level sets

L_{v}(F_{ε,α}) := {ω ∈ IR^{n+1} | F_{ε,α}(ω) ≤ v},
L_{v}(F_{ε,0}) := {ω ∈ IR^{n+1} | F_{ε,0}(ω) ≤ v}

satisfy

L_{v}(F_{ε,α}) ⊆ L_{v}(F_{ε,0}) ⊆ {ω ∈ IR^{n+1} | ω^{T}ω ≤ 2v}        (14)

for any v ≥ 0. Hence, we obtain that L_{v}(F_{ε,α}) and L_{v}(F_{ε,0}) are compact (closed and bounded) subsets of IR^{n+1}. Then, by the strong convexity of F_{ε,0}(ω) and F_{ε,α}(ω) with α > 0, each of the problems min_{ω∈IR^{n+1}} F_{ε,α}(ω) and min_{ω∈IR^{n+1}} F_{ε,0}(ω) has a unique solution.

(b) From the optimality conditions and the strong convexity of F_{ε,0}(ω) and F_{ε,α}(ω) with α > 0, we know that

F_{ε,0}(¯ω_{α}) − F_{ε,0}(¯ω) ≥ ∇F_{ε,0}(¯ω)^{T}(¯ω_{α} − ¯ω) + (1/2)‖¯ω_{α} − ¯ω‖^{2} ≥ (1/2)‖¯ω_{α} − ¯ω‖^{2},        (15)

F_{ε,α}(¯ω) − F_{ε,α}(¯ω_{α}) ≥ ∇F_{ε,α}(¯ω_{α})^{T}(¯ω − ¯ω_{α}) + (1/2)‖¯ω − ¯ω_{α}‖^{2} ≥ (1/2)‖¯ω − ¯ω_{α}‖^{2}.        (16)

Note that F_{ε,α}(ω) ≥ F_{ε,0}(ω) because φ_{ε}(x, α) − |x|^{2}_{ε} ≥ 0. Then, adding up (15) and (16) along with this fact yields

‖¯ω_{α} − ¯ω‖^{2} ≤ (F_{ε,α}(¯ω) − F_{ε,0}(¯ω)) − (F_{ε,α}(¯ω_{α}) − F_{ε,0}(¯ω_{α}))
  ≤ F_{ε,α}(¯ω) − F_{ε,0}(¯ω)
  = (C/2) 1^{T}Φ_{ε}(¯A¯ω − y, α) − (C/2) Σ_{i=1}^{m} |¯A^{T}_{i}¯ω − y_{i}|^{2}_{ε}
  = (C/2) Σ_{i=1}^{m} φ_{ε}(¯A^{T}_{i}¯ω − y_{i}, α) − (C/2) Σ_{i=1}^{m} |¯A^{T}_{i}¯ω − y_{i}|^{2}_{ε}
  ≤ (1/6) C m α^{2},

where the last inequality is due to Lemma 2.1(a). It is clear that ¯ω_{α} converges to ¯ω as α → 0 with an upper bound given by the above. Then, the proof is complete. □

Next, we focus on the optimality condition of the minimization problem (11), which is indeed necessary and sufficient for (11) and has the form of

∇_{ω}F_{ε,α}(ω) = 0.

With this, we define a function H_{φ} : IR^{n+2} → IR^{n+2} by

H_{φ}(z) = [ α ; ∇_{ω}F_{ε,α}(ω) ] = [ α ; ω + C Σ_{i=1}^{m} ∇_{x}φ_{ε}(¯A^{T}_{i}ω − y_{i}, α) ¯A_{i} ],        (17)

where z := (α, ω) ∈ IR^{n+2}. From Lemma 2.1 and the strong convexity of F_{ε,α}(ω), it is easy to see that if H_{φ}(z) = 0, then α = 0 and ω solves (11); and for any z ∈ IR_{++} × IR^{n+1}, the function H_{φ} is continuously differentiable. In addition, the Jacobian of H_{φ} can be calculated as below:

H′_{φ}(z) = [ 1  0 ; ∇^{2}_{ωα}F_{ε,α}(ω)  ∇^{2}_{ωω}F_{ε,α}(ω) ],        (18)

where

∇^{2}_{ωα}F_{ε,α}(ω) = C Σ_{i=1}^{m} ∇^{2}_{xα}φ_{ε}(¯A^{T}_{i}ω − y_{i}, α) ¯A_{i},
∇^{2}_{ωω}F_{ε,α}(ω) = I + C Σ_{i=1}^{m} ∇^{2}_{xx}φ_{ε}(¯A^{T}_{i}ω − y_{i}, α) ¯A_{i}¯A^{T}_{i}.

From (8), we can see ∇^{2}_{xx}φ_{ε}(x, α) ≥ 0, which implies C Σ_{i=1}^{m} ∇^{2}_{xx}φ_{ε}(¯A^{T}_{i}ω − y_{i}, α) ¯A_{i}¯A^{T}_{i} is positive semidefinite. Hence, ∇^{2}_{ωω}F_{ε,α}(ω) is positive definite. This helps us to prove that H′_{φ}(z) is invertible at any z ∈ IR_{++} × IR^{n+1}. In fact, if there exists a vector d := (d_{1}, d_{2}) ∈ IR × IR^{n+1} such that H′_{φ}(z)d = 0, then we have

[ d_{1} ; d_{1}∇^{2}_{ωα}F_{ε,α}(ω) + ∇^{2}_{ωω}F_{ε,α}(ω)d_{2} ] = 0.

This implies that d = 0, and hence H′_{φ}(z) is invertible at any z ∈ IR_{++} × IR^{n+1}.
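A small numerical sketch of H_{φ} (using numpy; the data and naming are ours) illustrates this invertibility argument: the Hessian block of (18) is an identity plus a positive semidefinite sum, so its eigenvalues are bounded below by 1.

```python
import numpy as np

EPS, ALPHA, C = 0.1, 0.05, 5.0

def grad_phi(r, eps, alpha):
    """nabla_x phi_eps from (7)."""
    s = abs(r) - eps
    if s >= alpha:
        return 2.0 * s * np.sign(r)
    if abs(s) < alpha:
        return (s + alpha) ** 2 / (2.0 * alpha) * np.sign(r)
    return 0.0

def hess_phi(r, eps, alpha):
    """nabla^2_xx phi_eps from (8)."""
    s = abs(r) - eps
    if s >= alpha:
        return 2.0
    if abs(s) < alpha:
        return (s + alpha) / alpha
    return 0.0

def H_phi(alpha, omega, Abar, y):
    """H_phi(z) of (17): first component alpha, the rest the gradient of (11)."""
    r = Abar @ omega - y
    g = omega + C * sum(grad_phi(ri, EPS, alpha) * Ai
                        for ri, Ai in zip(r, Abar))
    return np.concatenate(([alpha], g))

rng = np.random.default_rng(1)
m, n = 15, 2
Abar = np.hstack([rng.uniform(-1, 1, (m, n)), np.ones((m, 1))])  # rows (A_i, 1)
y = Abar[:, :n].sum(axis=1) + rng.uniform(-0.05, 0.05, m)
omega = rng.uniform(-1, 1, n + 1)

# Hessian block of (18): identity plus a PSD sum, hence eigenvalues >= 1,
# which is what makes H_phi'(z) invertible for alpha > 0.
Hess = np.eye(n + 1) + C * sum(
    hess_phi(ri, EPS, ALPHA) * np.outer(Ai, Ai)
    for ri, Ai in zip(Abar @ omega - y, Abar))
eigs = np.linalg.eigvalsh(Hess)
assert eigs.min() >= 1.0 - 1e-10
```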

### 3 The second smooth support vector machine

In this section, we consider another type of smoothing functions ψ_{ε,p}(x, α) : IR × IR_{+} → IR_{+}, which is defined by

ψ_{ε,p}(x, α) =
  0                                              if 0 ≤ |x| ≤ ε − α,
  (α/(p−1)) [ (p−1)(|x| − ε + α)/(pα) ]^{p}      if ε − α < |x| < ε + α/(p−1),
  |x| − ε                                        if |x| ≥ ε + α/(p−1),        (19)

where p ≥ 2. The graphs of ψ_{ε,p}(x, α) are depicted in Figure 2, which clearly indicate that ψ_{ε,p}(x, α) is a family of smoothing functions for |x|_{ε}.


Figure 2: Graphs of ψ_{ε,p}(x, α) with ε = 0.1, α = 0.03, 0.06, 0.09 and p = 2.
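For concreteness, the following Python sketch (names ours) implements ψ_{ε,p} from (19) and checks numerically that it dominates |x|_{ε} (as Proposition 3.1(a) below asserts) and approaches it as α → 0, in line with Lemma 3.1(b).

```python
def psi(x, eps, alpha, p=2):
    """Smoothing function psi_{eps,p}(x, alpha) of (19) for |x|_eps, p >= 2."""
    a = abs(x)
    if a <= eps - alpha:
        return 0.0
    if a < eps + alpha / (p - 1):
        return alpha / (p - 1) * ((p - 1) * (a - eps + alpha) / (p * alpha)) ** p
    return a - eps

def eps_loss(x, eps):
    """The eps-insensitive loss |x|_eps."""
    return max(0.0, abs(x) - eps)

eps = 0.1
xs = [i * 1e-3 for i in range(-300, 301)]
for x in xs:
    for p in (2, 3):
        # psi dominates |x|_eps ...
        assert psi(x, eps, 0.06, p) >= eps_loss(x, eps) - 1e-12
        # ... and tends to it as alpha -> 0 (Lemma 3.1(b)).
        assert abs(psi(x, eps, 1e-6, p) - eps_loss(x, eps)) < 1e-5
```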

In Lemma 3.1 below, we verify that ψ_{ε,p}(x, α) is a family of smoothing functions for |x|_{ε}; hence, ψ^{2}_{ε,p}(x, α) is also a family of smoothing functions for |x|^{2}_{ε}. Then, we can employ ψ^{2}_{ε,p} to replace the square of the ε-insensitive loss function in (5) in the same way as done in Section 2. The graphs of ψ^{2}_{ε,p}(x, α), with comparison to φ_{ε}(x, α), are depicted in Figure 3. In fact, there is a relation between ψ^{2}_{ε,p}(x, α) and φ_{ε}(x, α), shown in Proposition 3.1.

Figure 3: Graphs of |x|^{2}_{ε}, φ_{ε}(x, α) and ψ^{2}_{ε,p}(x, α) with ε = 0.1, α = 0.06, 0.09 and p = 2.

In other words, we obtain an alternative strongly convex unconstrained optimization problem for (5):

min_{ω}  (1/2) ω^{T}ω + (C/2) Σ_{i=1}^{m} ψ^{2}_{ε,p}(¯A^{T}_{i}ω − y_{i}, α).        (20)

However, the smooth function ψ^{2}_{ε,p}(x, α) is not twice differentiable with respect to x, and hence the objective function of (20) is not twice differentiable although it is smooth. Then, we still cannot apply Newton-type methods to solve (20). To overcome this, we adopt another smoothing technique. Before presenting the idea of this smoothing technique, we need the following two lemmas regarding properties of ψ_{ε,p}(x, α). To this end, we also compute the partial derivatives of ψ_{ε,p}(x, α) as below:

∇_{x}ψ_{ε,p}(x, α) =
  0                                                  if 0 ≤ |x| ≤ ε − α,
  sgn(x) [ (p−1)(|x| − ε + α)/(pα) ]^{p−1}           if ε − α < |x| < ε + α/(p−1),
  sgn(x)                                             if |x| ≥ ε + α/(p−1).

∇_{α}ψ_{ε,p}(x, α) =
  0                                                                   if 0 ≤ |x| ≤ ε − α,
  (((ε − |x|)(p−1) + α)/(pα)) [ (p−1)(|x| − ε + α)/(pα) ]^{p−1}       if ε − α < |x| < ε + α/(p−1),
  0                                                                   if |x| ≥ ε + α/(p−1).

Lemma 3.1. Let ψ_{ε,p}(x, α) be defined as in (19). Then, we have

(a) ψ_{ε,p}(x, α) is smooth with respect to x for any p ≥ 2;

(b) lim_{α→0} ψ_{ε,p}(x, α) = |x|_{ε} for any p ≥ 2.

Proof. (a) To prove the result, we need to check that both ψ_{ε,p}(·, α) and ∇_{x}ψ_{ε,p}(·, α) are continuous. Again, it suffices to check the junction points.

(i) If |x| = ε − α, then ψ_{ε,p}(x, α) = 0.
(ii) If |x| = ε + α/(p−1), then ψ_{ε,p}(x, α) = α/(p−1).
From (i) and (ii), it is clear that ψ_{ε,p}(·, α) is continuous.

Moreover,
(i) if |x| = ε − α, then ∇_{x}ψ_{ε,p}(x, α) = 0;
(ii) if |x| = ε + α/(p−1), then ∇_{x}ψ_{ε,p}(x, α) = sgn(x).
In view of (i) and (ii), we see that ∇_{x}ψ_{ε,p}(·, α) is continuous.

(b) To proceed, we discuss four cases.

(1) If 0 ≤ |x| ≤ ε − α, then ψ_{ε,p}(x, α) − |x|_{ε} = 0. Then, the desired result follows.

(2) If ε − α ≤ |x| ≤ ε, then ψ_{ε,p}(x, α) − |x|_{ε} = (α/(p−1)) [ (p−1)(|x| − ε + α)/(pα) ]^{p}. Hence,

lim_{α→0} (ψ_{ε,p}(x, α) − |x|_{ε}) = lim_{α→0} (α/(p−1)) [ (p−1)(|x| − ε + α)/(pα) ]^{p}
  = lim_{α→0} (α/(p−1)) · lim_{α→0} [ (p−1)(|x| − ε + α)/(pα) ]^{p}.

It is clear that the first limit is zero, so we only need to show that the second limit is bounded. To this end, we rewrite it as

lim_{α→0} [ (p−1)(|x| − ε + α)/(pα) ]^{p} = lim_{α→0} ((p−1)/p)^{p} ((|x| − ε + α)/α)^{p}.

We notice that |x| → ε when α → 0, so that (|x| − ε + α)/α → 0/0. Therefore, by applying L'Hôpital's rule, we obtain

lim_{α→0} (|x| − ε + α)/α = 1,

which implies that lim_{α→0} (ψ_{ε,p}(x, α) − |x|_{ε}) = 0 under this case.

(3) If ε ≤ |x| ≤ ε + α/(p−1), then

ψ_{ε,p}(x, α) − |x|_{ε} = (α/(p−1)) [ (p−1)(|x| − ε + α)/(pα) ]^{p} − (|x| − ε).

We have shown in case (2) that

lim_{α→0} (α/(p−1)) [ (p−1)(|x| − ε + α)/(pα) ]^{p} = 0.

It is also obvious that lim_{α→0} (|x| − ε) = 0. Hence, we obtain lim_{α→0} (ψ_{ε,p}(x, α) − |x|_{ε}) = 0 under this case.

(4) If |x| ≥ ε + α/(p−1), the desired result follows since it is clear that ψ_{ε,p}(x, α) − |x|_{ε} = 0.

From all the above, the proof is complete. □

Lemma 3.2. Let ψ_{ε,p}(x, α) be defined as in (19). Then, we have

(a) ψ_{ε,p}(x, α) sgn(x) is smooth with respect to x for any p ≥ 2;

(b) lim_{α→0} ψ_{ε,p}(x, α) sgn(x) = |x|_{ε} sgn(x) for any p ≥ 2.

Proof. (a) First, we observe that ψ_{ε,p}(x, α) sgn(x) can be written as

ψ_{ε,p}(x, α) sgn(x) =
  0                                                     if 0 ≤ |x| ≤ ε − α,
  (α/(p−1)) [ (p−1)(|x| − ε + α)/(pα) ]^{p} sgn(x)      if ε − α < |x| < ε + α/(p−1),
  (|x| − ε) sgn(x)                                      if |x| ≥ ε + α/(p−1).

Note that sgn(x) is continuous at x ≠ 0 and ψ_{ε,p}(x, α) = 0 at x = 0; then applying Lemma 3.1(a) yields that ψ_{ε,p}(x, α) sgn(x) is continuous. Furthermore, by simple calculations, we have

∇_{x}(ψ_{ε,p}(x, α) sgn(x)) = ∇_{x}ψ_{ε,p}(x, α) sgn(x) =
  0                                        if 0 ≤ |x| ≤ ε − α,
  [ (p−1)(|x| − ε + α)/(pα) ]^{p−1}        if ε − α < |x| < ε + α/(p−1),
  1                                        if |x| ≥ ε + α/(p−1).        (21)

Mimicking the arguments as in Lemma 3.1(a), we can verify that ∇_{x}(ψ_{ε,p}(x, α) sgn(x)) is continuous. Thus, the desired result follows.

(b) By Lemma 3.1(b), it is easy to see that lim_{α→0} ψ_{ε,p}(x, α) sgn(x) = |x|_{ε} sgn(x). Then, the desired result follows. □

Note that |x|^{2}_{ε} is smooth with

∇(|x|^{2}_{ε}) = 2|x|_{ε} sgn(x) =
  2(|x| − ε) sgn(x)  if |x| > ε,
  0                  if |x| ≤ ε,

being continuous (but not differentiable). Then, we consider the optimality condition of (12), that is,

∇_{ω}F_{ε,0}(ω) = ω + C Σ_{i=1}^{m} |¯A^{T}_{i}ω − y_{i}|_{ε} sgn(¯A^{T}_{i}ω − y_{i}) ¯A_{i} = 0,        (22)

which is indeed necessary and sufficient for (5). Hence, solving (22) is equivalent to solving (5).

Using the family of smoothing functions ψ_{ε,p} to replace the ε-insensitive loss function in (22) leads to a system of smooth equations. More specifically, we define a function H_{ψ} : IR^{n+2} → IR^{n+2} by

H_{ψ}(z) = H_{ψ}(α, ω) = [ α ; ω + C Σ_{i=1}^{m} ψ_{ε,p}(¯A^{T}_{i}ω − y_{i}, α) sgn(¯A^{T}_{i}ω − y_{i}) ¯A_{i} ],

where z := (α, ω) ∈ IR^{n+2}. From Lemma 3.1, it is easy to see that if H_{ψ}(z) = 0, then α = 0 and ω is the solution of the equations (22), i.e., the solution of (12). Moreover, for any z ∈ IR_{++} × IR^{n+1}, the function H_{ψ} is continuously differentiable with

H′_{ψ}(z) = [ 1  0 ; E(ω)  I + D(ω) ],        (23)

where

E(ω) = C Σ_{i=1}^{m} ∇_{α}ψ_{ε,p}(¯A^{T}_{i}ω − y_{i}, α) sgn(¯A^{T}_{i}ω − y_{i}) ¯A_{i},
D(ω) = C Σ_{i=1}^{m} ∇_{x}ψ_{ε,p}(¯A^{T}_{i}ω − y_{i}, α) sgn(¯A^{T}_{i}ω − y_{i}) ¯A_{i}¯A^{T}_{i}.

Because ∇_{x}ψ_{ε,p}(¯A^{T}_{i}ω − y_{i}, α) sgn(¯A^{T}_{i}ω − y_{i}) is nonnegative for any α > 0 by (21), we see that I + D(ω) is positive definite at any z ∈ IR_{++} × IR^{n+1}. Following similar arguments as in Section 2, we obtain that H′_{ψ}(z) is invertible at any z ∈ IR_{++} × IR^{n+1}.
Proposition 3.1. Let φ_{ε}(x, α) be defined as in (6) and ψ_{ε,p}(x, α) be defined as in (19). Then, the following hold.

(a) For p ≥ 2, we have φ_{ε}(x, α) ≥ ψ^{2}_{ε,p}(x, α) ≥ |x|^{2}_{ε}.

(b) For p ≥ q ≥ 2, we have ψ_{ε,q}(x, α) ≥ ψ_{ε,p}(x, α).

Proof. (a) First, we show that φ_{ε}(x, α) ≥ ψ^{2}_{ε,p}(x, α) holds. To proceed, we discuss four cases.

(i) If |x| ≤ ε − α, then φ_{ε}(x, α) = 0 = ψ^{2}_{ε,p}(x, α).

(ii) If ε − α < |x| < ε + α/(p−1), then |x| ≤ ε + α/(p−1), which is equivalent to 1/(|x| − ε + α) ≥ (p−1)/(αp). Thus, we have

φ_{ε}(x, α)/ψ^{2}_{ε,p}(x, α) = α^{2p−3} p^{2p} / (6 (p−1)^{2p−2} (|x| − ε + α)^{2p−3}) ≥ p^{3}/(6(p−1)) ≥ 1,

which implies φ_{ε}(x, α) ≥ ψ^{2}_{ε,p}(x, α).

(iii) For ε + α/(p−1) ≤ |x| < ε + α, letting t := |x| − ε ∈ [α/(p−1), α) yields

φ_{ε}(x, α) − ψ^{2}_{ε,p}(x, α) = (1/(6α))(t + α)^{3} − t^{2} = t [ (1/(6α)) t^{2} − (1/2) t + (1/2) α ] + (1/6) α^{2} ≥ 0.

Here the last inequality follows from the fact that the discriminant of (1/(6α)) t^{2} − (1/2) t + (1/2) α is less than 0 and 1/(6α) > 0. Then, φ_{ε}(x, α) − ψ^{2}_{ε,p}(x, α) > 0.

(iv) If |x| ≥ ε + α, then it is clear that φ_{ε}(x, α) = (|x| − ε)^{2} + (1/3) α^{2} ≥ (|x| − ε)^{2} = ψ^{2}_{ε,p}(x, α).

Now we show the other part, ψ_{ε,p}(x, α) ≥ |x|_{ε}, which is equivalent to verifying ψ^{2}_{ε,p}(x, α) ≥ |x|^{2}_{ε}. Again, we discuss four cases.

(i) If |x| ≤ ε − α, then ψ_{ε,p}(x, α) = 0 = |x|_{ε}.

(ii) If ε − α < |x| ≤ ε, then ε − α < |x|, which says |x| − ε + α > 0. Thus, we have ψ_{ε,p}(x, α) ≥ 0 = |x|_{ε}.

(iii) For ε < |x| < ε + α/(p−1), we let t := |x| − ε ∈ (0, α/(p−1)) and define a function

f(t) = (α/(p−1)) [ (p−1)(t + α)/(pα) ]^{p} − t

on [0, α/(p−1)]. Note that f(|x| − ε) = ψ_{ε,p}(x, α) − |x|_{ε} for |x| ∈ (ε, ε + α/(p−1)), and observe that

f′(t) = [ (p−1)(t + α)/(pα) ]^{p−1} − 1 ≤ [ (p−1)(α/(p−1) + α)/(pα) ]^{p−1} − 1 = 0.

This means f(t) is monotone decreasing on (0, α/(p−1)). Since f(α/(p−1)) = 0, we have f(t) ≥ 0 for t ∈ (0, α/(p−1)), which implies ψ_{ε,p}(x, α) ≥ |x|_{ε} for |x| ∈ (ε, ε + α/(p−1)).

(iv) If |x| ≥ ε + α/(p−1), then it is clear that ψ_{ε,p}(x, α) = |x| − ε = |x|_{ε}.

(b) For p ≥ q ≥ 2, it is obvious to see that

ψ_{ε,q}(x, α) = ψ_{ε,p}(x, α) for |x| ∈ [0, ε − α] ∪ [ε + α/(q−1), +∞).

If |x| ∈ [ε + α/(p−1), ε + α/(q−1)), then ψ_{ε,p}(x, α) = |x|_{ε} ≤ ψ_{ε,q}(x, α) from the above. Thus, we only need to prove the case of |x| ∈ (ε − α, ε + α/(p−1)).

Consider |x| ∈ (ε − α, ε + α/(p−1)) and t := |x| − ε + α; we observe that α/t ≥ (p−1)/p. Then, we verify that

ψ_{ε,q}(x, α)/ψ_{ε,p}(x, α) = ((q−1)^{q−1} p^{p})/((p−1)^{p−1} q^{q}) · (α/t)^{p−q}
  ≥ ((q−1)^{q−1} p^{p})/((p−1)^{p−1} q^{q}) · ((p−1)/p)^{p−q}
  = (p/q)^{q} · ((q−1)/(p−1))^{q−1}
  = (1 + (p−q)/q)^{q} / (1 + (p−q)/(q−1))^{q−1}
  ≥ 1,

where the last inequality is due to the fact that (1 + (p−q)/x)^{x} is increasing for x > 0. Thus, the proof is complete. □
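Both parts of Proposition 3.1 are easy to check numerically. The sketch below (our own, reusing the function names from the earlier snippets) samples a grid around the kinks and verifies φ_{ε} ≥ ψ^{2}_{ε,p} ≥ |x|^{2}_{ε} together with the monotonicity in p.

```python
def phi(x, eps, alpha):
    """Smoothing function (6) for |x|_eps^2."""
    s = abs(x) - eps
    if s >= alpha:
        return s * s + alpha * alpha / 3.0
    if abs(s) < alpha:
        return (s + alpha) ** 3 / (6.0 * alpha)
    return 0.0

def psi(x, eps, alpha, p):
    """Smoothing function (19) for |x|_eps."""
    a = abs(x)
    if a <= eps - alpha:
        return 0.0
    if a < eps + alpha / (p - 1):
        return alpha / (p - 1) * ((p - 1) * (a - eps + alpha) / (p * alpha)) ** p
    return a - eps

eps, alpha = 0.1, 0.06
xs = [i * 1e-3 for i in range(-300, 301)]
for x in xs:
    le = max(0.0, abs(x) - eps)
    for p in (2, 3, 5):
        # Proposition 3.1(a): phi >= psi^2 >= |x|_eps^2
        assert phi(x, eps, alpha) >= psi(x, eps, alpha, p) ** 2 - 1e-12
        assert psi(x, eps, alpha, p) ** 2 >= le * le - 1e-12
    # Proposition 3.1(b): psi_{eps,q} >= psi_{eps,p} for p >= q
    assert psi(x, eps, alpha, 2) >= psi(x, eps, alpha, 3) - 1e-12
    assert psi(x, eps, alpha, 3) >= psi(x, eps, alpha, 5) - 1e-12
```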

### 4 A smoothing Newton algorithm

In Section 2 and Section 3, we construct two systems of smooth equations: H_{φ}(z) = 0
and Hψ(z) = 0. We briefly describe the difference between Hφ(z) = 0 and Hψ(z) = 0.

In general, the ways we arrive at H_{φ}(z) = 0 and H_{ψ}(z) = 0 are slightly different. To obtain H_{φ}(z) = 0, we first use the twice continuously differentiable functions φ_{ε}(x, α) to replace |x|^{2}_{ε} in problem (5), and then write out its optimality condition. In contrast, to obtain H_{ψ}(z) = 0, we write out the optimality condition of problem (5) first, and then use the smoothing functions ψ_{ε,p}(x, α) to replace the ε-insensitive loss function in (22). For convenience, we denote ˜H(z) ∈ {H_{φ}(z), H_{ψ}(z)}. In other words, ˜H(z) possesses the property that if ˜H(z) = 0, then α = 0 and ω solves (12). In view of this, we apply a Newton-type method to the system of smooth equations ˜H(z) = 0 at each iteration and let α → 0 so that the solution to the problem (12) can be found.

Algorithm 4.1. (A smoothing Newton method)

Step 0 Choose δ ∈ (0, 1), σ ∈ (0, 1/2), and α_{0} > 0. Take τ ∈ (0, 1) such that τα_{0} < 1. Let ω^{0} ∈ IR^{n+1} be an arbitrary vector. Set z^{0} := (α_{0}, ω^{0}) and e^{0} := (1, 0, . . . , 0)^{T} ∈ IR^{n+2}.

Step 1 If ‖ ˜H(z^{k})‖ = 0, stop.

Step 2 Define the functions Γ, β by

Γ(z) := ‖ ˜H(z)‖^{2} and β(z) := τ min{1, Γ(z)}.        (24)

Compute Δz^{k} := (Δα_{k}, Δω^{k}) by

˜H(z^{k}) + ˜H′(z^{k})Δz^{k} = α_{0}β(z^{k})e^{0}.

Step 3 Let θ_{k} be the maximum of the values 1, δ, δ^{2}, · · · such that

Γ(z^{k} + θ_{k}Δz^{k}) ≤ [1 − 2σ(1 − τα_{0})θ_{k}]Γ(z^{k}).        (25)

Step 4 Set z^{k+1} := z^{k} + θ_{k}Δz^{k} and k := k + 1. Go to Step 1.
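To illustrate Algorithm 4.1, here is a self-contained Python sketch of the smoothing Newton iteration for the first model (˜H = H_{φ}) on a one-feature toy data set; all data, constants, and names are ours, and the tolerances are illustrative rather than tuned.

```python
import numpy as np

EPS, C = 0.1, 10.0

def dphi(r, a):        # nabla_x phi_eps, eq (7)
    s = abs(r) - EPS
    if s >= a:
        return 2.0 * s * np.sign(r)
    if s > -a:
        return (s + a) ** 2 / (2.0 * a) * np.sign(r)
    return 0.0

def d2xx(r, a):        # nabla^2_xx phi_eps, eq (8)
    s = abs(r) - EPS
    if s >= a:
        return 2.0
    if s > -a:
        return (s + a) / a
    return 0.0

def d2xa(r, a):        # nabla^2_{x,alpha} phi_eps, eq (9)
    s = abs(r) - EPS
    if -a < s < a:
        return (s + a) * (a - s) / (2.0 * a * a) * np.sign(r)
    return 0.0

def H(z, Abar, y):     # H_phi(z) of (17)
    a, w = z[0], z[1:]
    r = Abar @ w - y
    g = w + C * sum(dphi(ri, a) * Ai for ri, Ai in zip(r, Abar))
    return np.concatenate(([a], g))

def Hjac(z, Abar, y):  # Jacobian (18)
    a, w = z[0], z[1:]
    r = Abar @ w - y
    k = len(w) + 1
    J = np.zeros((k, k))
    J[0, 0] = 1.0
    J[1:, 0] = C * sum(d2xa(ri, a) * Ai for ri, Ai in zip(r, Abar))
    J[1:, 1:] = np.eye(k - 1) + C * sum(d2xx(ri, a) * np.outer(Ai, Ai)
                                        for ri, Ai in zip(r, Abar))
    return J

# toy regression data: y roughly 2*t + 1 with small noise
rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, 30)
y = 2.0 * t + 1.0 + rng.uniform(-0.05, 0.05, 30)
Abar = np.column_stack([t, np.ones_like(t)])

alpha0, tau, delta, sigma = 0.05, 0.5, 0.5, 0.25   # Step 0 (tau * alpha0 < 1)
z = np.concatenate(([alpha0], np.zeros(2)))
e0 = np.array([1.0, 0.0, 0.0])

for _ in range(100):
    Hz = H(z, Abar, y)
    gamma = float(Hz @ Hz)                          # Gamma(z) of (24)
    if gamma < 1e-20:                               # Step 1
        break
    beta = tau * min(1.0, gamma)
    dz = np.linalg.solve(Hjac(z, Abar, y), alpha0 * beta * e0 - Hz)  # Step 2
    theta = 1.0
    while theta > 1e-12:                            # line search (25), Step 3
        Hn = H(z + theta * dz, Abar, y)
        if float(Hn @ Hn) <= (1 - 2 * sigma * (1 - tau * alpha0) * theta) * gamma:
            break
        theta *= delta
    z = z + theta * dz                              # Step 4

x_sol, b_sol = z[1], z[2]
assert float(np.linalg.norm(H(z, Abar, y))) < 1e-8
assert abs(x_sol - 2.0) < 0.3 and abs(b_sol - 1.0) < 0.3
```

The recovered slope and intercept come out slightly shrunk toward zero relative to (2, 1), which is the expected effect of the regularization term (1/2)ω^{T}ω in (12).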

Proposition 4.1. Suppose that the sequence {z^{k}} is generated by Algorithm 4.1. Then, the following results hold.

(a) {Γ(z^{k})} is monotonically decreasing.

(b) {‖ ˜H(z^{k})‖} and {β(z^{k})} are monotonically decreasing.

(c) Let N(τ) := {z = (α, ω) ∈ IR_{+} × IR^{n+1} : α_{0}β(z) ≤ α}. Then z^{k} ∈ N(τ) for any k ∈ K and 0 < α_{k+1} ≤ α_{k}.

(d) The algorithm is well defined.

Proof. Since the proof is similar to that of [6, Remark 2.1], we omit it here. □

Lemma 4.1. Let ¯λ := max{λ_{i}(Σ_{i=1}^{m} ¯A_{i}¯A^{T}_{i})}. Then, for any z ∈ IR_{++} × IR^{n+1}, we have

(a) 1 ≤ λ_{i}(H′_{φ}(z)) ≤ 1 + 2¯λ, i = 1, · · · , n + 2;

(b) 1 ≤ λ_{i}(H′_{ψ}(z)) ≤ 1 + ¯λ, i = 1, · · · , n + 2.

Proof. (a) H_{φ} is continuously differentiable at any z ∈ IR_{++} × IR^{n+1}, and by (18) it is easy to see that {1, λ_{1}(∇^{2}_{ωω}F_{ε,α}(ω)), · · · , λ_{n+1}(∇^{2}_{ωω}F_{ε,α}(ω))} are the eigenvalues of H′_{φ}(z). From the representation of ∇^{2}_{xx}φ_{ε} in (8), we have 0 ≤ ∇^{2}_{xx}φ_{ε}(¯A^{T}_{i}ω − y_{i}, α) ≤ 2. As ∇^{2}_{ωω}F_{ε,α}(ω) = I + Σ_{i=1}^{m} ∇^{2}_{xx}φ_{ε}(¯A^{T}_{i}ω − y_{i}, α) ¯A_{i}¯A^{T}_{i}, we obtain

1 ≤ λ_{i}(∇^{2}_{ωω}F_{ε,α}(ω)) ≤ 1 + 2¯λ, i = 1, · · · , n + 1.        (26)

Thus, part (a) holds.

(b) Note that

∇_{x}ψ_{ε,p}(x, α) sgn(x) =
  0                                        if 0 ≤ |x| ≤ ε − α,
  [ (p−1)(|x| − ε + α)/(pα) ]^{p−1}        if ε − α < |x| < ε + α/(p−1),
  1                                        if |x| ≥ ε + α/(p−1),

which says 0 ≤ ∇_{x}ψ_{ε,p}(x, α) sgn(x) ≤ 1. Then, following similar arguments as in part (a), the result of part (b) can be proved. □

Proposition 4.2. ˜H(α, ω) is coercive with respect to ω for any fixed α > 0, i.e., lim_{‖ω‖→+∞} ‖ ˜H(α, ω)‖ = +∞.

Proof. We first claim that H_{φ}(α, ω) is coercive for any fixed α > 0. By the definition of H_{φ}(α, ω) in (17), ‖H_{φ}(α, ω)‖^{2} = α^{2} + ‖∇_{ω}F_{ε,α}(ω)‖^{2}. Then, for any fixed α > 0,

lim_{‖ω‖→+∞} ‖H_{φ}(α, ω)‖ = +∞ ⇐⇒ lim_{‖ω‖→+∞} ‖∇_{ω}F_{ε,α}(ω)‖ = +∞.

By (26), all eigenvalues of ∇^{2}_{ωω}F_{ε,α}(ω) are bounded below by 1. For any ω_{0} ∈ IR^{n+1},

‖∇_{ω}F_{ε,α}(ω)‖ + ‖∇_{ω}F_{ε,α}(ω_{0})‖ ≥ ‖∇_{ω}F_{ε,α}(ω) − ∇_{ω}F_{ε,α}(ω_{0})‖ = ‖∇^{2}_{ωω}F_{ε,α}(ˆω)(ω − ω_{0})‖ ≥ ‖ω − ω_{0}‖,

where ˆω lies between ω_{0} and ω. Hence, lim_{‖ω‖→+∞} ‖∇_{ω}F_{ε,α}(ω)‖ = +∞.

By a similar argument, H_{ψ}(α, ω) is coercive for any fixed α > 0. From the above, ˜H(α, ω) ∈ {H_{φ}(α, ω), H_{ψ}(α, ω)} is coercive for any fixed α > 0. □

Lemma 4.2. Let Ω ⊆ IR^{n+1} be a compact set and Γ(α, ω) be defined as in (24). Then, for every ς > 0, there exists an ¯α > 0 such that

|Γ(α, ω) − Γ(0, ω)| ≤ ς for all ω ∈ Ω and all α ∈ [0, ¯α].

Proof. The function Γ(α, ω) defined as in (24) is continuous on the compact set [0, ¯α]×Ω.

The lemma is then an immediate consequence of the fact that every continuous function on a compact set is uniformly continuous there. 2

Lemma 4.3. (Mountain Pass Theorem [12, Theorem 9.2.7]) Suppose that g : IR^{m} → IR is a continuously differentiable and coercive function. Let Ω ⊂ IR^{m} be a nonempty and compact set, and let ξ be the minimum value of g on the boundary of Ω, i.e., ξ := min_{y∈∂Ω} g(y). Assume that there exist points a ∈ Ω and b ∉ Ω such that g(a) < ξ and g(b) < ξ. Then, there exists a point c ∈ IR^{m} such that ∇g(c) = 0 and g(c) ≥ ξ.

Theorem 4.1. Suppose the sequence {z^{k}} is generated by Algorithm 4.1. Then, the sequence {z^{k}} is bounded, and ω^{k} = (x^{k}, b^{k}) converges to the unique solution ω^{sol} = (x^{sol}, b^{sol}) of problem (12).

Proof. (a) We first show that the sequence {z^{k}} is bounded. It is clear from Proposition 4.1(c) that the sequence {α_{k}} is bounded. Suppose, to the contrary, that {ω^{k}} is unbounded. Passing to a subsequence if necessary, we may assume ‖ω^{k}‖ → +∞ as k → +∞. We derive a contradiction in each of the following two cases.

(i) If α_{∗} := lim_{k→+∞} α_{k} > 0, applying Proposition 4.1(b) yields that { ˜H(z^{k})} is bounded. On the other hand, by Proposition 4.2, we have

lim_{k→+∞} ‖ ˜H(α_{∗}, ω^{k})‖ = lim_{‖ω^{k}‖→+∞} ‖ ˜H(α_{∗}, ω^{k})‖ = +∞. (27)

Hence, a contradiction is reached.

(ii) If α_{∗} := lim_{k→+∞} α_{k} = 0, since {ω^{k}} is unbounded, there exists a compact set Ω ⊂ IR^{n+1} with ω^{sol} ∈ int Ω such that

ω^{k} ∉ Ω (28)

for all k sufficiently large. Since m̄ := min_{ω∈∂Ω} Γ(0, ω) > 0, we can apply Lemma 4.2 with ς := m̄/4 and conclude that

Γ(α_{k}, ω^{sol}) ≤ m̄/4 (29)

and

min_{ω∈∂Ω} Γ(α_{k}, ω) ≥ 3m̄/4

for all k sufficiently large. Since α_{k} → 0 in this case, combining (24) and Proposition 4.1(c) gives

Γ(α_{k}, ω^{k}) = β(α_{k}, ω^{k}) ≤ α_{k}/α_{0}.

Hence,

Γ(α_{k}, ω^{k}) ≤ m̄/4 (30)

for all k sufficiently large. Now let us fix an index k such that (28), (29) and (30) hold. Applying the Mountain Pass Theorem (Lemma 4.3) with a := ω^{sol} and b := ω^{k}, we obtain the existence of a vector c ∈ IR^{n+1} such that

∇_{ω}Γ(α_{k}, c) = 0 and Γ(α_{k}, c) ≥ 3m̄/4 > 0. (31)

To derive a contradiction, we show that c is a global minimizer of Γ(α_{k}, ·). Since Γ(α_{k}, ω) ≥ α_{k}^{2} for all ω, it is sufficient to show Γ(α_{k}, c) = α_{k}^{2}. We discuss this in two cases:

• If ˜H = H_{φ}, then

∇_{ω}Γ(α_{k}, c) = 2∇^{2}_{ωω}F_{ε,α_{k}}(c) H̄_{φ}(α_{k}, c),

where H̄_{φ} denotes the last n + 1 components of H_{φ}, i.e., H̄_{φ} = H_{φ}(2 : n + 2). Then, using (31) and the fact that ∇^{2}_{ωω}F_{ε,α_{k}}(c) is invertible for α_{k} > 0, we have H̄_{φ}(α_{k}, c) = 0. Furthermore,

Γ(α_{k}, c) = ‖H_{φ}(α_{k}, c)‖^{2} = α_{k}^{2}.

• If ˜H = H_{ψ}, then

∇_{ω}Γ(α_{k}, c) = 2(I + D(ω)) H̄_{ψ}(α_{k}, c),

where I + D(ω) is given by (23) and H̄_{ψ} denotes the last n + 1 components of H_{ψ}. Since I + D(ω) is invertible for α_{k} > 0, we obtain Γ(α_{k}, c) = α_{k}^{2} in the same way as in the previous case.

In both cases, Γ(α_{k}, c) = α_{k}^{2}, so c is a global minimizer of Γ(α_{k}, ·); in particular, Γ(α_{k}, c) ≤ Γ(α_{k}, ω^{k}) ≤ m̄/4, which contradicts (31). Therefore {ω^{k}} is bounded.

(b) From Proposition 4.1, we know that the sequences {‖ ˜H(z^{k})‖} and {Γ(z^{k})} are non-negative and monotonically decreasing, and hence convergent. In addition, by part (a), the sequence {z^{k}} is bounded. Passing to a subsequence if necessary, we may assume that there exists a point z^{∗} = (α_{∗}, ω^{∗}) ∈ IR_{+} × IR^{n+1} such that lim_{k→+∞} z^{k} = z^{∗}, and hence

lim_{k→+∞} ‖ ˜H(z^{k})‖ = ‖ ˜H(z^{∗})‖ and lim_{k→+∞} Γ(z^{k}) = Γ(z^{∗}).

If ˜H(z^{∗}) = 0, then a simple continuity argument shows that ω^{∗} is a solution to problem (12). For the case ‖ ˜H(z^{∗})‖ > 0, and hence α_{∗} > 0, we will derive a contradiction.

First, by the assumption that ‖ ˜H(z^{∗})‖ > 0, we have lim_{k→+∞} θ_{k} = 0. Thus, for any sufficiently large k, the stepsize ˆθ_{k} := θ_{k}/δ does not satisfy the line search criterion (25), i.e.,

Γ(z^{k} + ˆθ_{k}Δz^{k}) > [1 − 2σ(1 − γα_{0})ˆθ_{k}] Γ(z^{k}),

which implies that

[Γ(z^{k} + ˆθ_{k}Δz^{k}) − Γ(z^{k})] / ˆθ_{k} > −2σ(1 − γα_{0}) Γ(z^{k}).

Since α_{∗} > 0, Γ is continuously differentiable at z^{∗}. Letting k → +∞ in the above inequality gives

−2σ(1 − γα_{0}) Γ(z^{∗}) ≤ 2 ˜H(z^{∗})^{T} ˜H′(z^{∗})Δz^{∗} = 2 ˜H(z^{∗})^{T}(− ˜H(z^{∗}) + α_{0}β(z^{∗})e^{0})
= −2 ˜H(z^{∗})^{T} ˜H(z^{∗}) + 2α_{0}β(z^{∗}) ˜H(z^{∗})^{T}e^{0}
≤ 2(−1 + γα_{0}) Γ(z^{∗}).

This indicates that −1 + γα_{0} + σ(1 − γα_{0}) ≥ 0, which contradicts the facts that σ < 1 and γα_{0} < 1. Thus, there should be ˜H(z^{∗}) = 0.

Because ω^{sol} is the unique solution to problem (12), every accumulation point of {z^{k}} equals (0, ω^{sol}), and hence the whole sequence {z^{k}} converges to z^{∗} = (0, ω^{sol}), that is,

lim_{k→+∞} z^{k} = (0, ω^{sol}).

Then, the proof is complete. 2
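The global argument above can be made concrete on a small example. The sketch below is an illustration only, not the paper's Algorithm 4.1 for ε-SVR: it applies a smoothing Newton iteration with the Armijo-type line search criterion (25) to the toy problem min_{x} 0.5x^{2} + |x − 1|, smoothing |t| by sqrt(t^{2} + α^{2}). The perturbation β(z) = γ min{1, Γ(z)} and all parameter values are assumptions patterned after standard smoothing Newton methods; the paper's exact choices are defined earlier in the text.

```python
import numpy as np

# Assumed parameters: gamma * alpha0 < 1 and sigma in (0, 1/2), as the
# convergence analysis requires; delta is the backtracking factor.
alpha0, gamma, sigma, delta = 1.0, 0.2, 0.25, 0.5
e0 = np.array([1.0, 0.0])          # e^0 = (1, 0): perturbation acts on alpha only

def H(z):
    """H(z) = (alpha, gradient in x of the smoothed objective)."""
    a, x = z
    t = x - 1.0
    return np.array([a, x + t / np.sqrt(t * t + a * a)])

def Hprime(z):
    """Jacobian of H; invertible whenever alpha > 0."""
    a, x = z
    t = x - 1.0
    r3 = (t * t + a * a) ** 1.5
    return np.array([[1.0, 0.0],
                     [-t * a / r3, 1.0 + a * a / r3]])

def Gamma(z):
    """Merit function Gamma(z) = ||H(z)||^2."""
    return float(H(z) @ H(z))

z = np.array([alpha0, 0.0])        # start from alpha = alpha0 > 0
for _ in range(200):
    G = Gamma(z)
    if G < 1e-12:
        break
    beta = gamma * min(1.0, G)     # assumed form of beta(z)
    # Newton equation: H'(z) dz = -H(z) + beta(z) * alpha0 * e^0
    dz = np.linalg.solve(Hprime(z), -H(z) + beta * alpha0 * e0)
    # Backtracking line search with criterion (25)
    theta = 1.0
    while Gamma(z + theta * dz) > (1.0 - 2.0 * sigma * (1.0 - gamma * alpha0) * theta) * G:
        theta *= delta
    z = z + theta * dz

alpha_k, x_k = z
print(alpha_k, x_k)                # alpha_k -> 0 and x_k -> 1
```

For this toy problem the unique nonsmooth solution is x = 1 (where 0 ∈ 1 + ∂|x − 1|), and the iterates drive α_{k} → 0 while x_{k} approaches 1, mirroring the qualitative behavior established in Theorem 4.1.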

In the following, we discuss the local convergence of Algorithm 4.1. To this end, we need the concept of semismoothness, which was originally introduced by Mifflin [7] for functionals and was further extended to the setting of vector-valued functions by Qi and Sun [14]. A locally Lipschitz continuous function F : IR^{n} → IR^{m}, which has the generalized Jacobian ∂F(x) in the sense of Clarke [2], is said to be semismooth (respectively, strongly semismooth) at x ∈ IR^{n} if F is directionally differentiable at x and

F(x + h) − F(x) − V h = o(‖h‖) (respectively, = O(‖h‖^{2}))

holds for any V ∈ ∂F(x + h).
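As a simple illustration of this definition (an example added here, not taken from the paper), the absolute value function is strongly semismooth at the origin:

```latex
% F(x) = |x| on IR:  \partial F(0) = [-1, 1], and for h \neq 0 the
% generalized Jacobian is the singleton \partial F(h) = \{\mathrm{sgn}(h)\}.
% Hence, for V \in \partial F(0 + h) with h \neq 0,
F(0 + h) - F(0) - V h \;=\; |h| - \mathrm{sgn}(h)\,h \;=\; 0 \;=\; O(\|h\|^{2}).
```

The same reasoning shows that every piecewise linear function is strongly semismooth, which is precisely the fact invoked below for ∇_{x}φ_{ε}(x, 0) in the proof of Lemma 4.4(b).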

Lemma 4.4. (a) Suppose that the sequence {z^{k}} is generated by Algorithm 4.1. Then,

‖ ˜H′(z^{k})^{−1}‖ ≤ 1.

(b) ˜H(z) is strongly semismooth at any z = (α, ω) ∈ IR^{n+2}.

Proof. (a) By Proposition 4.1(c), we know that α_{k} > 0 for every k. This together with Lemma 4.1 leads to the desired result.

(b) We only provide the proof for the case ˜H(α, ω) = H_{φ}(α, ω); the proof for the other case ˜H(α, ω) = H_{ψ}(α, ω) is similar and is omitted. First, we observe that H_{φ}(z) is continuously differentiable with H′_{φ} locally Lipschitz continuous at any z ∈ IR_{++} × IR^{n+1} by Lemma 4.1(a). Thus, H_{φ}(z) is strongly semismooth at any z ∈ IR_{++} × IR^{n+1}. It remains to verify that H_{φ}(z) is strongly semismooth at z ∈ {0} × IR^{n+1}. To see this, we recall that

∇_{x}φ_{ε}(x, 0) = 2(|x| − ε)sgn(x) if |x| − ε ≥ 0, and ∇_{x}φ_{ε}(x, 0) = 0 if |x| − ε ≤ 0.

This is a piecewise linear function of x, and hence ∇_{x}φ_{ε}(x, 0) is strongly semismooth. In summary, H_{φ}(z) is strongly semismooth at any z ∈ {0} × IR^{n+1}. 2

Theorem 4.2. Suppose that z^{∗} = (α_{∗}, ω^{∗}) is an accumulation point of {z^{k}} generated by Algorithm 4.1. Then, we have