(1)

Support Vector Classification and Regression

Chih-Jen Lin

Department of Computer Science National Taiwan University

Short course at ITRI, 2016

(2)

1 Introduction

2 SVM and kernel methods

3 Dual problem and solving optimization problems

4 Regularization and linear versus kernel

5 Multi-class classification

6 Support vector regression

7 SVM for clustering

8 Practical use of support vector classification

9 A practical example of SVR

10 Discussion and conclusions


(3)

Outline

1 Introduction

2 SVM and kernel methods

3 Dual problem and solving optimization problems

4 Regularization and linear versus kernel

5 Multi-class classification

6 Support vector regression

7 SVM for clustering

8 Practical use of support vector classification

9 A practical example of SVR

10 Discussion and conclusions

(4)

About this Course

Last year I gave a four-day short course on

“introduction of data mining”

In that course, SVM was discussed

This year I received a request to specifically talk about SVM

So I assume that some of you would like to learn more details of SVM

(5)

About this Course (Cont’d)

Therefore, this short course will be more technical than last year

More mathematics will be involved

We will have breaks at 9:50, 10:50, 13:50, and 14:50

Course slides:
www.csie.ntu.edu.tw/~cjlin/talks/itri.pdf

I may still update the slides (e.g., if we find errors in our lectures)

(6)

Outline

1 Introduction

2 SVM and kernel methods

3 Dual problem and solving optimization problems

4 Regularization and linear versus kernel

5 Multi-class classification

6 Support vector regression

7 SVM for clustering

8 Practical use of support vector classification

9 A practical example of SVR

10 Discussion and conclusions

(7)

Support Vector Classification

Training vectors: xi, i = 1, . . . , l

Feature vectors. For example,
A patient = [height, weight, . . .]T

Consider a simple case with two classes:

Define an indicator vector y ∈ Rl

yi = 1 if xi in class 1,
     −1 if xi in class 2

A hyperplane which separates all data

(8)

[Figure: the same two-class data (circles vs. triangles) shown twice; the second copy adds the lines wTx + b = +1, 0, −1]

A separating hyperplane: wTx + b = 0

(wTxi) + b ≥ 1   if yi = 1
(wTxi) + b ≤ −1  if yi = −1

Decision function f(x) = sgn(wTx + b), x: test data

Many possible choices of w and b

(9)

Maximal Margin

Distance between wTx + b = 1 and wTx + b = −1:

2/‖w‖ = 2/√(wTw)

A quadratic programming problem (Boser et al., 1992)

min_{w,b}  (1/2)wTw
subject to  yi(wTxi + b) ≥ 1, i = 1, . . . , l.

(10)

Example

Given two training data in R1 as in the following figure:

[Figure: two points on a number line, at x = 0 and x = 1]

What is the separating hyperplane?

Now two data are x1 = 1, x2 = 0 with

y = [+1, −1]T

(11)

Example (Cont’d)

Now w ∈ R1. The optimization problem is

min_{w,b}  (1/2)w²
subject to  w · 1 + b ≥ 1,        (1)
            −1(w · 0 + b) ≥ 1.    (2)

From (2), −b ≥ 1.

Putting this into (1), w ≥ 2.

That is, for any (w, b) satisfying (1) and (2), w ≥ 2.

(12)

Example (Cont’d)

We are minimizing (1/2)w², so the smallest possibility is w = 2.

Thus, (w, b) = (2, −1) is the optimal solution.

The separating hyperplane is 2x − 1 = 0, in the middle of the two training data:

[Figure: the number line again, with the separating point marked at x = 1/2]
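As a quick numerical check of this example (not part of the original slides), the sketch below assumes scikit-learn is available and uses a very large C so that the soft-margin solver approximates the hard-margin problem; it should return roughly w = 2 and b = −1.

# Minimal check of the toy example, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [0.0]])           # x1 = 1, x2 = 0
y = np.array([1, -1])                  # y = [+1, -1]

model = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ~ hard margin
print(model.coef_, model.intercept_)   # approximately [[2.]] and [-1.]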

(13)

Data May Not Be Linearly Separable

An example:

[Figure: circles and triangles that no hyperplane can separate without error]

Allow training errors

Higher dimensional (maybe infinite) feature space

φ(x) = [φ1(x), φ2(x), . . .]T

(14)

Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)

min_{w,b,ξ}  (1/2)wTw + C Σ_{i=1}^l ξi

subject to  yi(wTφ(xi) + b) ≥ 1 − ξi,
            ξi ≥ 0, i = 1, . . . , l.

Example: x ∈ R3, φ(x) ∈ R10

φ(x) = [1, √2 x1, √2 x2, √2 x3, x1², x2², x3², √2 x1x2, √2 x1x3, √2 x2x3]T

(15)

Finding the Decision Function

w: maybe infinite variables

The dual problem: finite number of variables

min_α  (1/2)αTQα − eTα
subject to  0 ≤ αi ≤ C, i = 1, . . . , l,
            yTα = 0,

where Qij = yiyjφ(xi)Tφ(xj) and e = [1, . . . , 1]T

At optimum

w = Σ_{i=1}^l αiyiφ(xi)

A finite problem: #variables = #training data

(16)

Kernel Tricks

Qij = yiyjφ(xi)Tφ(xj) needs a closed form

Example: xi ∈ R3, φ(xi) ∈ R10

φ(xi) = [1, √2(xi)1, √2(xi)2, √2(xi)3, (xi)1², (xi)2², (xi)3², √2(xi)1(xi)2, √2(xi)1(xi)3, √2(xi)2(xi)3]T

Then φ(xi)Tφ(xj) = (1 + xiTxj)².

Kernel: K(x, y) = φ(x)Tφ(y); common kernels:

e^(−γ‖xi−xj‖²)   (Radial Basis Function or Gaussian kernel)
(xiTxj/a + b)^d   (Polynomial kernel)
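A small numerical sanity check of the kernel trick may help here; the sketch below (assuming numpy) builds the explicit degree-2 mapping above for two arbitrary points in R3 and confirms that φ(xi)Tφ(xj) equals (1 + xiTxj)².

import numpy as np

def phi(x):
    # explicit degree-2 mapping for x in R^3, as on the slide
    x1, x2, x3 = x
    r2 = np.sqrt(2.0)
    return np.array([1, r2*x1, r2*x2, r2*x3, x1**2, x2**2, x3**2,
                     r2*x1*x2, r2*x1*x3, r2*x2*x3])

xi = np.array([0.5, -1.0, 2.0])
xj = np.array([1.5,  0.3, -0.7])
print(phi(xi) @ phi(xj))        # inner product in the mapped space
print((1 + xi @ xj)**2)         # closed-form kernel value; same number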

(17)

Can be inner product in infinite dimensional space

Assume x ∈ R1 and γ > 0.

e^(−γ‖xi−xj‖²) = e^(−γ(xi−xj)²) = e^(−γxi² + 2γxixj − γxj²)

= e^(−γxi² − γxj²) (1 + 2γxixj/1! + (2γxixj)²/2! + (2γxixj)³/3! + · · ·)

= e^(−γxi² − γxj²) (1 · 1 + √(2γ/1!)xi · √(2γ/1!)xj + √((2γ)²/2!)xi² · √((2γ)²/2!)xj²
  + √((2γ)³/3!)xi³ · √((2γ)³/3!)xj³ + · · ·)

= φ(xi)Tφ(xj),

where

φ(x) = e^(−γx²) [1, √(2γ/1!)x, √((2γ)²/2!)x², √((2γ)³/3!)x³, · · ·]T.

(18)

Decision function

At optimum

w = Σ_{i=1}^l αiyiφ(xi)

Decision function

wTφ(x) + b = Σ_{i=1}^l αiyiφ(xi)Tφ(x) + b
           = Σ_{i=1}^l αiyiK(xi, x) + b

Only φ(xi) of αi > 0 used ⇒ support vectors
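This decision function can be checked numerically. The sketch below assumes scikit-learn, whose dual_coef_ attribute stores the products αiyi for the support vectors; summing αiyiK(xi, x) over the support vectors and adding the intercept should reproduce decision_function exactly. The data here are synthetic and only for illustration.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

gamma = 0.5
model = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_test = np.array([[0.3, -0.2]])
K = np.exp(-gamma * np.sum((model.support_vectors_ - x_test)**2, axis=1))
manual = K @ model.dual_coef_[0] + model.intercept_[0]   # sum_i alpha_i y_i K(x_i, x) + b
print(manual, model.decision_function(x_test)[0])        # the two values agree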

(19)

Support Vectors: More Important Data

Only φ(xi) of αi > 0 used ⇒ support vectors

[Figure: a two-class toy data set and the resulting decision boundary]

(20)

See more examples via SVM Toy available at libsvm web page

(http://www.csie.ntu.edu.tw/~cjlin/libsvm/)

(21)

Example: Primal-dual Relationship

If separable, the primal problem does not have ξi:

min_{w,b}  (1/2)wTw
subject to  yi(wTxi + b) ≥ 1, i = 1, . . . , l.

The dual problem is

min_α  (1/2) Σ_{i=1}^l Σ_{j=1}^l αiαjyiyjxiTxj − Σ_{i=1}^l αi
subject to  0 ≤ αi, i = 1, . . . , l,
            Σ_{i=1}^l yiαi = 0.

(22)

Example: Primal-dual Relationship (Cont’d)

Consider the earlier example:

[Figure: the two points on the number line, at x = 0 and x = 1]

Now two data are x1 = 1, x2 = 0 with y = [+1, −1]T

The solution is (w, b) = (2, −1)

(23)

Example: Primal-dual Relationship (Cont’d)

The dual objective function is

(1/2) [α1 α2] [1 0; 0 0] [α1; α2] − [1 1] [α1; α2]
= (1/2)α1² − (α1 + α2)

In optimization, the objective function means the function to be optimized

Constraints are

α1 − α2 = 0, 0 ≤ α1, 0 ≤ α2.

(24)

Example: Primal-dual Relationship (Cont’d)

Substituting α2 = α1 into the objective function,

(1/2)α1² − 2α1

has the smallest value at α1 = 2.

Because [2, 2]T satisfies the constraints 0 ≤ α1 and 0 ≤ α2, it is optimal

(25)

Example: Primal-dual Relationship (Cont’d)

Using the primal-dual relation,

w = y1α1x1 + y2α2x2
  = 1 · 2 · 1 + (−1) · 2 · 0
  = 2

This is the same as that obtained by solving the primal problem.

(26)

More about Support vectors

We know

αi > 0 ⇒ support vector

We have

yi(wTxi + b) < 1 ⇒ αi > 0 ⇒ support vector,
yi(wTxi + b) = 1 ⇒ αi ≥ 0 ⇒ maybe SV, and
yi(wTxi + b) > 1 ⇒ αi = 0 ⇒ not SV

(27)

Outline

1 Introduction

2 SVM and kernel methods

3 Dual problem and solving optimization problems

4 Regularization and linear versus kernel

5 Multi-class classification

6 Support vector regression

7 SVM for clustering

8 Practical use of support vector classification

9 A practical example of SVR

10 Discussion and conclusions

(28)

Convex Optimization I

Convex problems are an important class of optimization problems that possess nice properties

A function f is convex if for all x, y,

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), ∀θ ∈ [0, 1]

That is, the line segment between any two points on the graph is never below the function

(29)

Convex Optimization II

[Figure: a convex function, with the chord between (x, f(x)) and (y, f(y)) lying above the curve]

(30)

Convex Optimization III

A convex optimization problem takes the following form:

min  f0(w)
subject to  fi(w) ≤ 0, i = 1, . . . , m,     (3)
            hi(w) = 0, i = 1, . . . , p,

where f0, . . . , fm are convex functions and h1, . . . , hp are affine (i.e., linear functions):

hi(w) = aTw + b

(31)

Convex Optimization IV

A nice property of convex optimization problems is that

inf_w {f0(w) | w satisfies constraints}

is unique

The optimal objective value is unique, but the optimal w may not be

There are other nice properties, such as the primal-dual relationship, that we will use

To learn more about convex optimization, you can check the book by Boyd and Vandenberghe (2004)

(32)

Deriving the Dual

For simplification, consider the problem without ξi:

min_{w,b}  (1/2)wTw
subject to  yi(wTφ(xi) + b) ≥ 1, i = 1, . . . , l.

Its dual is

min_α  (1/2)αTQα − eTα
subject to  0 ≤ αi, i = 1, . . . , l,
            yTα = 0,

where

Qij = yiyjφ(xi)Tφ(xj)

(33)

Lagrangian Dual I

Lagrangian dual:

max_{α≥0} min_{w,b} L(w, b, α),

where

L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^l αi [yi(wTφ(xi) + b) − 1]

(34)

Lagrangian Dual II

Strong duality:

min Primal = max_{α≥0} min_{w,b} L(w, b, α)

After SVM became popular, quite a few people think that for any optimization problem the Lagrangian dual exists and strong duality holds

Wrong! We usually need

The optimization problem is convex

Certain constraint qualifications hold (details not discussed)

(35)

Lagrangian Dual III

We have both:

The SVM primal is convex and has linear constraints

(36)

Simplify the dual. When α is fixed,

min_{w,b} L(w, b, α) =
  −∞                                                 if Σ_{i=1}^l αiyi ≠ 0,
  min_w (1/2)wTw − Σ_{i=1}^l αi [yi wTφ(xi) − 1]      if Σ_{i=1}^l αiyi = 0.

If Σ_{i=1}^l αiyi ≠ 0, we can decrease

−b Σ_{i=1}^l αiyi

in L(w, b, α) to −∞

(37)

If Σ_{i=1}^l αiyi = 0, the optimum of the strictly convex function

(1/2)wTw − Σ_{i=1}^l αi [yi wTφ(xi) − 1]

happens when

∇_w L(w, b, α) = 0.

Thus,

w = Σ_{i=1}^l αiyiφ(xi).

(38)

Note that

wTw = (Σ_{i=1}^l αiyiφ(xi))T (Σ_{j=1}^l αjyjφ(xj)) = Σ_{i,j} αiαjyiyjφ(xi)Tφ(xj)

The dual is

max_{α≥0}
  Σ_{i=1}^l αi − (1/2) Σ_{i,j} αiαjyiyjφ(xi)Tφ(xj)   if Σ_{i=1}^l αiyi = 0,
  −∞                                                 if Σ_{i=1}^l αiyi ≠ 0.

(39)

Lagrangian dual: max_{α≥0} min_{w,b} L(w, b, α)

−∞ is definitely not the maximum of the dual, so a dual optimal solution does not happen when

Σ_{i=1}^l αiyi ≠ 0.

The dual is simplified to

max_{α∈Rl}  Σ_{i=1}^l αi − (1/2) Σ_{i=1}^l Σ_{j=1}^l αiαjyiyjφ(xi)Tφ(xj)
subject to  yTα = 0,
            αi ≥ 0, i = 1, . . . , l.

(40)

Our problems may be infinite dimensional (i.e., w ∈ R∞)

We can still use Lagrangian duality; see a rigorous discussion in Lin (2001)

(41)

Primal versus Dual I

Recall the dual problem is

min_α  (1/2)αTQα − eTα
subject to  0 ≤ αi ≤ C, i = 1, . . . , l,
            yTα = 0

and at optimum

w = Σ_{i=1}^l αiyiφ(xi)     (4)

(42)

Primal versus Dual II

What if we put (4) into the primal?

min_{α,ξ}  (1/2)αTQα + C Σ_{i=1}^l ξi
subject to  (Qα + by)i ≥ 1 − ξi,     (5)
            ξi ≥ 0

Note that

yiwTφ(xi) = yi Σ_{j=1}^l αjyjφ(xj)Tφ(xi) = Σ_{j=1}^l Qijαj = (Qα)i

(43)

Primal versus Dual III

If Q is positive definite, we can prove that the optimal α of (5) is the same as that of the dual

So the dual is not the only choice to obtain the model

(44)

Large Dense Quadratic Programming

min_α  (1/2)αTQα − eTα
subject to  0 ≤ αi ≤ C, i = 1, . . . , l,
            yTα = 0

Qij ≠ 0, Q: an l by l fully dense matrix

50,000 training points: 50,000 variables:

(50,000² × 8/2) bytes = 10GB RAM to store Q
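The 10GB figure is just arithmetic (8-byte doubles, and only about half of the symmetric matrix needs to be kept); a quick check, assuming Python, is below.

l = 50_000
bytes_needed = l * l * 8 / 2          # upper triangle of Q in double precision
print(bytes_needed / 1e9, "GB")       # 10.0 GB (about 9.3 GiB)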

(45)

Large Dense Quadratic Programming (Cont’d)

For quadratic programming problems, traditional optimization methods assume that Q is available in the computer memory

They cannot be directly applied here because Q cannot even be stored

Currently, decomposition methods (a type of coordinate descent method) are what is used in practice

(46)

Decomposition Methods

Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)

Similar to coordinate-wise minimization

Working set B; N = {1, . . . , l}\B is fixed

Sub-problem at the kth iteration:

min_{αB}  (1/2) [αBT (αN^k)T] [QBB QBN; QNB QNN] [αB; αN^k] − [eBT (eN^k)T] [αB; αN^k]
subject to  0 ≤ αt ≤ C, t ∈ B,
            yBTαB = −yNTαN^k

(47)

Avoid Memory Problems

The new objective function is

(1/2) [αBT (αN^k)T] [QBBαB + QBNαN^k; QNBαB + QNNαN^k] − eBTαB + constant
= (1/2)αBTQBBαB + (−eB + QBNαN^k)TαB + constant

Only |B| columns of Q are needed

In general |B| ≤ 10 is used

Columns are calculated when used: trade time for space

But is such an approach practical?
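To make the idea concrete, here is a toy sketch (my own illustration, not LIBSVM's actual algorithm) of coordinate-wise updates on the dual that computes only the needed column of Q at each step. For simplicity the bias b is dropped, so the equality constraint yTα = 0 disappears and a working set of size one suffices; real decomposition methods such as SMO keep the constraint and use |B| = 2.

import numpy as np

def kernel_column(X, y, i, gamma=1.0):
    # i-th column of Q, with Q_ji = y_j y_i K(x_j, x_i), computed on demand
    K = np.exp(-gamma * np.sum((X - X[i])**2, axis=1))
    return y[i] * y * K

def dual_coordinate_descent(X, y, C=1.0, gamma=1.0, epochs=20):
    l = X.shape[0]
    alpha = np.zeros(l)
    Qalpha = np.zeros(l)                  # keeps Q @ alpha up to date
    for _ in range(epochs):
        for i in range(l):
            Qi = kernel_column(X, y, i, gamma)
            grad = Qalpha[i] - 1.0        # gradient of (1/2) a'Qa - e'a in alpha_i
            new = np.clip(alpha[i] - grad / Qi[i], 0.0, C)
            Qalpha += (new - alpha[i]) * Qi
            alpha[i] = new
    return alpha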

(48)

How Decomposition Methods Perform?

Convergence is not very fast. This is known because only first-order information is used

But there is no need to have a very accurate α, because the decision function is

sgn(wTφ(x) + b) = sgn(Σ_{i=1}^l αiyiK(xi, x) + b)

Prediction may still be correct with a rough α

Further, in some situations,

# support vectors ≪ # training points

With the initial α1 = 0, some instances are never used

(49)

How Decomposition Methods Perform?

(Cont’d)

An example of training 50,000 instances using the software LIBSVM:

$svm-train -c 16 -g 4 -m 400 22features

Total nSV = 3370
Time 79.524s

This was done on a typical desktop

Calculating the whole Q takes more time

#SVs = 3,370 ≪ 50,000

A good case where some αi remain at zero all the time

(50)

How Decomposition Methods Perform?

(Cont’d)

Because many αi = 0 in the end, we can develop shrinking techniques

Variables are removed during the optimization procedure, so smaller problems are solved

(51)

Machine Learning Properties are Useful in Designing Optimization Algorithms

We have seen that special properties of SVM contribute to the viability of decomposition methods:

For machine learning applications, there is no need to accurately solve the optimization problem

Because some optimal αi = 0, decomposition methods may not need to update all the variables

Also, we can use shrinking techniques to reduce the problem size during decomposition methods

(52)

Differences between Optimization and Machine Learning

The two topics may have different focuses. We give the following example

The decomposition method we just discussed converges more slowly when C is large

Using C = 1 on a data set:
# iterations: 508

Using C = 5,000:
# iterations: 35,241

(53)

Optimization researchers may rush to solve difficult cases of large C

It turns out that large C is less used than small C

Recall that SVM solves

(1/2)wTw + C (sum of training losses)

A large C tends to overfit the training data

This does not give good test accuracy. More details about overfitting will be discussed later

(54)

Outline

1 Introduction

2 SVM and kernel methods

3 Dual problem and solving optimization problems

4 Regularization and linear versus kernel

5 Multi-class classification

6 Support vector regression

7 SVM for clustering

8 Practical use of support vector classification

9 A practical example of SVR

10 Discussion and conclusions

(55)

Equivalent Optimization Problem

• Recall the SVM optimization problem is

min_{w,b,ξ}  (1/2)wTw + C Σ_{i=1}^l ξi
subject to  yi(wTφ(xi) + b) ≥ 1 − ξi,
            ξi ≥ 0, i = 1, . . . , l.

• It is equivalent to

min_{w,b}  (1/2)wTw + C Σ_{i=1}^l max(0, 1 − yi(wTφ(xi) + b))

• The reformulation is useful to derive SVM from a different viewpoint

(56)

Equivalent Optimization Problem (Cont’d)

That is, at optimum,

ξi = max(0, 1 − yi(wTφ(xi) + b))

Reason: from the constraints,

ξi ≥ 1 − yi(wTφ(xi) + b) and ξi ≥ 0,

but we also want to minimize ξi

(57)

Linear and Kernel I

Linear classifier:

sgn(wTx + b)

Kernel classifier:

sgn(wTφ(x) + b) = sgn(Σ_{i=1}^l αiyiK(xi, x) + b)

Linear is a special case of kernel

An important difference is that for linear we can store w

(58)

Linear and Kernel II

For kernel, w may be infinite dimensional and cannot be stored

We will show that they are useful in different circumstances

(59)

The Bias Term b

Recall the decision function is

sgn(wTx + b)

Sometimes the bias term b is omitted:

sgn(wTx)

This is fine if the number of features is not too small

(60)

Minimizing Training Errors

For classification, naturally we aim to minimize the training error:

min_w (training errors)

To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)

Ideally we should use the 0–1 training loss:

ξ(w; x, y) = 1 if y wTx < 0,
             0 otherwise

(61)

Minimizing Training Errors (Cont’d)

However, this function is discontinuous, so the optimization problem becomes difficult

[Figure: the 0–1 loss ξ(w; x, y) plotted against −y wTx]

We need continuous approximations

(62)

Common Loss Functions

Hinge loss (l1 loss)

ξL1(w; x, y) ≡ max(0, 1 − y wTx)      (6)

Squared hinge loss (l2 loss)

ξL2(w; x, y) ≡ max(0, 1 − y wTx)²     (7)

Logistic loss

ξLR(w; x, y) ≡ log(1 + e^(−y wTx))    (8)

SVM: (6)-(7). Logistic regression (LR): (8)
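For reference, the three losses can be evaluated directly as functions of the margin m = y wTx; the small numpy sketch below is my own illustration, not part of the slides.

import numpy as np

def hinge(m):     return np.maximum(0.0, 1.0 - m)       # l1 loss (6)
def sq_hinge(m):  return np.maximum(0.0, 1.0 - m)**2    # l2 loss (7)
def logistic(m):  return np.log1p(np.exp(-m))           # LR loss (8)

for m in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(m, hinge(m), sq_hinge(m), logistic(m))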

(63)

Common Loss Functions (Cont’d)

[Figure: ξL1, ξL2, and ξLR plotted against −y wTx]

Logistic regression is very related to SVM

Their performance (i.e., test accuracy) is usually similar

(64)

Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs

(65)

Overfitting

See the illustration on the next slide

For classification, you can easily achieve 100% training accuracy

This is useless

When training a data set, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error

(66)

[Figure: a two-class example illustrating overfitting; filled symbols are training data and open symbols are testing data]

(67)

Regularization

To minimize the training error we manipulate the w vector so that it fits the data

To avoid overfitting we need a way to make w's values less extreme

One idea is to make w's values closer to zero

We can add, for example,

wTw/2  or  ‖w‖1

to the objective function

(68)

General Form of Linear Classification I

Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l, yi = ±1

l: # of data, n: # of features

min_w f(w),  f(w) ≡ wTw/2 + C Σ_{i=1}^l ξ(w; xi, yi)     (9)

wTw/2: regularization term
ξ(w; x, y): loss function
C: regularization parameter

(69)

General Form of Linear Classification II

Of course we can map data to a higher dimensional space:

min_w f(w),  f(w) ≡ wTw/2 + C Σ_{i=1}^l ξ(w; φ(xi), yi)
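A direct translation of the general form (9) with the hinge loss plugged in might look like the numpy sketch below (illustrative only; X and y are assumed to be a dense l x n array and ±1 labels).

import numpy as np

def linear_svm_objective(w, X, y, C):
    reg = 0.5 * w @ w                             # w'w / 2
    losses = np.maximum(0.0, 1.0 - y * (X @ w))   # hinge loss per instance
    return reg + C * np.sum(losses)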

(70)

SVM and Logistic Regression I

If the hinge (l1) loss is used, the optimization problem is

min_w  (1/2)wTw + C Σ_{i=1}^l max(0, 1 − yiwTxi)

It is the SVM problem we had earlier (without the bias b)

Therefore, we have derived SVM from a different viewpoint

We also see that SVM is very related to logistic regression

(71)

SVM and Logistic Regression II

However, many people wrongly think that SVM and logistic regression are very different

Reason for this misunderstanding: traditionally,

when people say SVM ⇒ kernel SVM
when people say logistic regression ⇒ linear logistic regression

Indeed we can do kernel logistic regression:

min_w  (1/2)wTw + C Σ_{i=1}^l log(1 + e^(−yiwTφ(xi)))

(72)

SVM and Logistic Regression III

A main difference from SVM is that logistic regression has a probability interpretation

We will introduce logistic regression from another viewpoint

(73)

Logistic Regression

For a label-feature pair (y, x), assume the probability model is

p(y|x) = 1 / (1 + e^(−y wTx)).

Note that

p(1|x) + p(−1|x) = 1/(1 + e^(−wTx)) + 1/(1 + e^(wTx))
                 = e^(wTx)/(1 + e^(wTx)) + 1/(1 + e^(wTx))
                 = 1

w is the parameter to be decided


(74)

Logistic Regression (Cont’d)

Idea of this model:

p(1|x) = 1 / (1 + e^(−wTx))  → 1 if wTx ≫ 0,
                             → 0 if wTx ≪ 0

Assume training instances are

(yi, xi), i = 1, . . . , l

(75)

Logistic Regression (Cont’d)

Logistic regression finds w by maximizing the following likelihood:

max_w  Π_{i=1}^l p(yi|xi).     (10)

Negative log-likelihood:

−log Π_{i=1}^l p(yi|xi) = −Σ_{i=1}^l log p(yi|xi)
                        = Σ_{i=1}^l log(1 + e^(−yiwTxi))

(76)

Logistic Regression (Cont’d)

Logistic regression:

min_w  Σ_{i=1}^l log(1 + e^(−yiwTxi)).

Regularized logistic regression:

min_w  (1/2)wTw + C Σ_{i=1}^l log(1 + e^(−yiwTxi)).     (11)

C: regularization parameter decided by users
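Written in code, the regularized objective (11) is only a few lines; the numpy sketch below is an illustration of the formula (not a solver) and uses logaddexp for numerical stability.

import numpy as np

def logreg_objective(w, X, y, C):
    margins = y * (X @ w)                       # y_i w'x_i for all i
    return 0.5 * w @ w + C * np.sum(np.logaddexp(0.0, -margins))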

(77)

Loss Functions: Differentiability

However,

ξL1: not differentiable
ξL2: differentiable but not twice differentiable
ξLR: twice differentiable

The same optimization method may not be applicable to all these losses

(78)

Discussion

We see that the same classification method can be derived in different ways

SVM:
Maximal margin
Regularization and training losses

LR:
Regularization and training losses
Maximum likelihood

(79)

Regularization

L1 versus L2: ‖w‖1 and wTw/2

wTw/2: smooth, easier to optimize

‖w‖1: non-differentiable; sparse solution, possibly many zero elements

Possible advantages of L1 regularization:
Feature selection
Less storage for w

(80)

Linear and Kernel Classification

Methods such as SVM and logistic regression can be used in two ways

Kernel methods: data mapped to a higher dimensional space

x ⇒ φ(x)

φ(xi)Tφ(xj) is easily calculated; little control on φ(·)

Feature engineering + linear classification:

We have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)

We refer to them as kernel and linear classifiers

(81)

Linear and Kernel Classification

Let's check the prediction cost:

wTx + b  versus  Σ_{i=1}^l αiyiK(xi, x) + b

If K(xi, xj) takes O(n), then the costs are

O(n)  versus  O(nl)

Linear is much cheaper

A similar difference occurs for training

(82)

Linear and Kernel Classification (Cont’d)

In a sense, linear is a special case of kernel

Indeed, we can prove that the test accuracy of linear is the same as Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)

Therefore, roughly we have

test accuracy: kernel ≥ linear
cost: kernel ≫ linear

Speed is the reason to use linear

(83)

Linear and Kernel Classification (Cont’d)

For some problems, accuracy by linear is as good as by nonlinear

But training and testing are much faster

This particularly happens for document classification:
Number of features (bag-of-words model) very large
Data very sparse (i.e., few non-zeros)

(84)

Comparison Between Linear and Kernel (Training Time & Testing Accuracy)

                Linear               RBF Kernel
Data set        Time    Accuracy     Time        Accuracy
MNIST38         0.1     96.82        38.1        99.70
ijcnn1          1.6     91.81        26.8        98.69
covtype         1.4     76.37        46,695.8    96.11
news20          1.1     96.95        383.2       96.90
real-sim        0.3     97.44        938.3       97.82
yahoo-japan     3.1     92.63        20,955.2    93.31
webspam         25.7    93.35        15,681.8    99.26

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features


(87)

Extension: Training Explicit Form of Nonlinear Mappings I

Linear-SVM method to train φ(x1), . . . , φ(xl)

Kernel not used

Applicable only if the dimension of φ(x) is not too large

Low-degree polynomial mappings:

K(xi, xj) = (xiTxj + 1)² = φ(xi)Tφ(xj)

φ(x) = [1, √2 x1, . . . , √2 xn, x1², . . . , xn², √2 x1x2, . . . , √2 xn−1xn]T

(88)

Extension: Training Explicit Form of Nonlinear Mappings II

For this mapping, # features = O(n²)

Recall O(n) for linear versus O(nl) for kernel

Now it is O(n²) versus O(nl)

Sparse data:

n ⇒ n̄, the average # of non-zeros for sparse data

n̄ ≪ n ⇒ O(n̄²) may be much smaller than O(l n̄)

When the degree is small, train the explicit form of φ(x)
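One way to try this in practice (assuming scikit-learn) is sketched below: PolynomialFeatures builds the explicit degree-2 monomials (without the √2 scaling on the slides, which only rescales features), and a linear solver such as LinearSVC is trained on the mapped data.

from sklearn.datasets import make_classification
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
phi_X = PolynomialFeatures(degree=2, include_bias=True).fit_transform(X)

clf = LinearSVC(C=1.0, max_iter=10000).fit(phi_X, y)
print(phi_X.shape, clf.score(phi_X, y))   # mapped dimension and training accuracy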

(89)

Testing Accuracy and Training Time

                Degree-2 polynomial                  Accuracy diff.
                Training time (s)
Data set        LIBLINEAR   LIBSVM      Accuracy     Linear      RBF
a9a             1.6         89.8        85.06        0.07        0.02
real-sim        59.8        1,220.5     98.00        0.49        0.10
ijcnn1          10.7        64.2        97.84        5.63        −0.85
MNIST38         8.6         18.4        99.29        2.47        −0.40
covtype         5,211.9     NA          80.09        3.74        −15.98
webspam         3,228.1     NA          98.44        5.29        −0.76

Training φ(xi) by linear: faster than kernel, but sometimes competitive accuracy

(90)

Example: Dependency Parsing I

This is an NLP application

                    Kernel                      Linear
                    RBF         Poly-2          Linear      Poly-2
Training time       3h34m53s    3h21m51s        3m36s       3m43s
Parsing speed       0.7x        1x              1652x       103x
UAS                 89.92       91.67           89.11       91.71
LAS                 88.55       90.60           88.07       90.71

We get faster training/testing while maintaining good accuracy

(91)

Example: Dependency Parsing II

We achieve this by training low-degree polynomial-mapped data with linear classification

That is, linear methods explicitly train φ(xi), ∀i

We consider the following low-degree polynomial mapping:

φ(x) = [1, x1, . . . , xn, x1², . . . , xn², x1x2, . . . , xn−1xn]T

(92)

Handling High Dimensionality of φ(x)

A multi-class problem with sparse data:

n          Dim. of φ(x)     l          n̄       w's # nonzeros
46,155     1,065,165,090    204,582    13.3    1,438,456

n̄: average # nonzeros per instance

The dimensionality of w is very high, but w is sparse

Some training feature columns of xixj are entirely zero

Hashing techniques are used to handle sparse w

(93)

Example: Classifier in a Small Device

In a sensor application (Yu et al., 2014), the classifier can use less than 16KB of RAM

Classifiers              Test accuracy    Model Size
Decision Tree            77.77            76.02KB
AdaBoost (10 trees)      78.84            1,500.54KB
SVM (RBF kernel)         85.33            1,287.15KB

Number of features: 5

We consider a degree-3 polynomial mapping:

dimensionality = (5+3 choose 3) + bias term = 57.

(94)

Example: Classifier in a Small Device

One-against-one strategy for 5-class classification:

(5 choose 2) × 57 × 4 bytes = 2.28KB

Assume single precision

Results:

SVM method          Test accuracy    Model Size
RBF kernel          85.33            1,287.15KB
Polynomial kernel   84.79            2.28KB
Linear kernel       78.51            0.24KB
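The 2.28KB figure follows from simple counting; a short check (assuming Python 3.8+ for math.comb) is below.

from math import comb

dim = comb(5 + 3, 3) + 1        # the slide's count: (5+3 choose 3) + bias term = 57
pairs = comb(5, 2)              # 10 one-against-one binary problems for 5 classes
print(pairs * dim * 4, "bytes") # 2280 bytes, about 2.28KB in single precision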

(95)

Outline

1 Introduction

2 SVM and kernel methods

3 Dual problem and solving optimization problems

4 Regularization and linear versus kernel

5 Multi-class classification

6 Support vector regression

7 SVM for clustering

8 Practical use of support vector classification

9 A practical example of SVR

10 Discussion and conclusions

(96)

Multi-class Classification

SVM and logistic regression are methods for two-class classification

We need certain ways to extend them for multi-class problems

This is not a problem for methods such as nearest neighbor or decision trees

(97)

Multi-class Classification (Cont’d)

k classes

One-against-the rest: Train k binary SVMs:

1st class vs. (2, · · · , k)th class 2nd class vs. (1, 3, . . . , k)th class

...

k decision functions

(w1)Tφ(x ) + b1

...

(wk)Tφ(x ) + bk

(98)

Prediction:

arg max_j (wj)Tφ(x) + bj

Reason: if x is in the 1st class, then we should have

(w1)Tφ(x) + b1 ≥ +1
(w2)Tφ(x) + b2 ≤ −1
...
(wk)Tφ(x) + bk ≤ −1

(99)

Multi-class Classification (Cont’d)

One-against-one: train k(k − 1)/2 binary SVMs

(1, 2), (1, 3), . . . , (1, k), (2, 3), (2, 4), . . . , (k − 1, k)

If 4 classes ⇒ 6 binary SVMs

yi = 1     yi = −1    Decision functions
class 1    class 2    f12(x) = (w12)Tx + b12
class 1    class 3    f13(x) = (w13)Tx + b13
class 1    class 4    f14(x) = (w14)Tx + b14
class 2    class 3    f23(x) = (w23)Tx + b23
class 2    class 4    f24(x) = (w24)Tx + b24
class 3    class 4    f34(x) = (w34)Tx + b34

(100)

For a testing data point, run all binary SVMs:

Classes    Winner
1 2        1
1 3        1
1 4        1
2 3        2
2 4        4
3 4        3

Select the one with the largest vote:

class      1    2    3    4
# votes    3    1    1    1

May use decision values as well
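A compact sketch of this voting scheme (my own illustration; the decision functions f[(i, j)] are hypothetical callables, positive for class i and non-positive for class j) is below.

import numpy as np

def ovo_predict(x, f, k):
    votes = np.zeros(k, dtype=int)
    for i in range(k):
        for j in range(i + 1, k):
            winner = i if f[(i, j)](x) > 0 else j   # binary SVM for the pair (i, j)
            votes[winner] += 1
    return np.argmax(votes)                         # class with the largest vote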

(101)

Solving a Single Problem

An approach by Crammer and Singer (2002):

min_{w1,...,wk}  (1/2) Σ_{m=1}^k ‖wm‖² + C Σ_{i=1}^l ξ({wm}_{m=1}^k; xi, yi),

where

ξ({wm}_{m=1}^k; x, y) ≡ max_{m≠y} max(0, 1 − (wy − wm)Tx).

We hope the decision value of xi by the model wyi is larger than the others

Prediction: same as one-against-the-rest

arg max_j (wj)Tx

(102)

Discussion

Other variants of solving a single optimization problem include Weston and Watkins (1999); Lee et al. (2004)

A comparison in Hsu and Lin (2002)

RBF kernel: accuracy similar for different methods

But 1-against-1 is the fastest for training

(103)

Maximum Entropy

Maximum Entropy: a generalization of logistic regression to multi-class problems

It is widely applied in NLP applications.

Conditional probability of label y given data x:

P(y|x) ≡ exp(wyTx) / Σ_{m=1}^k exp(wmTx)

(104)

Maximum Entropy (Cont’d)

We then minimize the regularized negative log-likelihood:

min_{w1,...,wk}  (1/2) Σ_{m=1}^k ‖wm‖² + C Σ_{i=1}^l ξ({wm}_{m=1}^k; xi, yi),

where

ξ({wm}_{m=1}^k; x, y) ≡ −log P(y|x).

(105)

Maximum Entropy (Cont’d)

Is this loss function reasonable?

If

wyiTxi ≫ wmTxi, ∀m ≠ yi,

then

ξ({wm}_{m=1}^k; xi, yi) ≈ 0

That is, no loss

In contrast, if

wyiTxi ≪ wmTxi, m ≠ yi,

then P(yi|xi) ≪ 1 and the loss is large.
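Written out with numpy, the probability model and its loss are short; the sketch below is my own illustration of the formulas, with W stacking w1, . . . , wk as rows.

import numpy as np

def maxent_prob(W, x):
    z = W @ x                     # decision values w_m' x for m = 1, ..., k
    z -= z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()            # P(y | x) for each class y

def maxent_loss(W, x, y):
    return -np.log(maxent_prob(W, x)[y])   # -log P(y | x)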

(106)

Features as Functions

NLP applications often use a function f(x, y) to generate the feature vector:

P(y|x) ≡ exp(wTf(x, y)) / Σ_{y'} exp(wTf(x, y')).     (12)

The earlier probability model is a special case with

f(x, y) = [0, . . . , 0, xT, 0, . . . , 0]T ∈ Rnk   (x placed in the yth of k blocks)

and

w = [w1; . . . ; wk].

(107)

Outline

1 Introduction

2 SVM and kernel methods

3 Dual problem and solving optimization problems

4 Regularization and linear versus kernel

5 Multi-class classification

6 Support vector regression

7 SVM for clustering

8 Practical use of support vector classification

9 A practical example of SVR

10 Discussion and conclusions

(108)

Least Square Regression I

Given training data (x1, y1), . . . , (xl, yl)

Now yi ∈ R is the target value

Regression: find a function f so that f(xi) ≈ yi

(109)

Least Square Regression II

Least square regression:

min_{w,b}  Σ_{i=1}^l (yi − (wTxi + b))²

That is, we model f(x) by

f(x) = wTx + b

An example:

(110)

Least Square Regression III

[Figure: a scatter plot of Weight (kg) versus Height (cm) with the fitted line 0.60 · Weight + 130.2; picture from http://tex.stackexchange.com/questions/119179/how-to-add-a-regression-line-to-randomly-generated-points-using-pgfplots-in-tikz]

(111)

Least Square Regression IV

This is equivalent to

min_{w,b}  Σ_{i=1}^l ξ(w, b; xi, yi)

where

ξ(w, b; xi, yi) = (yi − (wTxi + b))²
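For illustration, the least-square fit can be computed directly with numpy's lstsq; the data below are synthetic and merely resemble the earlier weight/height example.

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(50, 1) * 60 + 40                    # made-up "weights" in [40, 100]
y = 0.6 * X[:, 0] + 130 + rng.randn(50) * 3      # noisy linear target

A = np.hstack([X, np.ones((50, 1))])             # append a column of ones for b
wb, *_ = np.linalg.lstsq(A, y, rcond=None)
print(wb)                                        # roughly [0.6, 130]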

(112)

Regularized Least Square

ξ(w, b; xi, yi) is a kind of loss function

We can add regularization:

min_{w,b}  (1/2)wTw + C Σ_{i=1}^l ξ(w, b; xi, yi)

C is still the regularization parameter

Other loss functions?

(113)

Support Vector Regression I

ε-insensitive loss function (b omitted):

max(|wTxi − yi| − ε, 0)
max(|wTxi − yi| − ε, 0)²

(114)

Support Vector Regression II

[Figure: the L1 and L2 ε-insensitive losses plotted against wTxi − yi, flat and equal to zero on [−ε, ε]]

ε: errors small enough are treated as no error

This makes the model more robust (less overfitting of the data)

(115)

Support Vector Regression III

One more parameter (ε) to decide

An equivalent form of the optimization problem:

min_{w,b,ξ,ξ*}  (1/2)wTw + C Σ_{i=1}^l ξi + C Σ_{i=1}^l ξi*
subject to  wTφ(xi) + b − yi ≤ ε + ξi,
            yi − wTφ(xi) − b ≤ ε + ξi*,
            ξi, ξi* ≥ 0, i = 1, . . . , l.

This form is similar to the SVM formulation derived earlier

(116)

Support Vector Regression IV

The dual problem is

min_{α,α*}  (1/2)(α − α*)TQ(α − α*) + ε Σ_{i=1}^l (αi + αi*) + Σ_{i=1}^l yi(αi − αi*)
subject to  eT(α − α*) = 0,
            0 ≤ αi, αi* ≤ C, i = 1, . . . , l,

where Qij = K(xi, xj) ≡ φ(xi)Tφ(xj).

(117)

Support Vector Regression V

After solving the dual problem,

w = Σ_{i=1}^l (−αi + αi*)φ(xi)

and the approximate function is

Σ_{i=1}^l (−αi + αi*)K(xi, x) + b.

(118)

Discussion

SVR and least-square regression are very related

Why do people more commonly use l2 (least-square) rather than l1 losses?

It is easier because of differentiability

(119)

Outline

1 Introduction

2 SVM and kernel methods

3 Dual problem and solving optimization problems

4 Regularization and linear versus kernel

5 Multi-class classification

6 Support vector regression

7 SVM for clustering

8 Practical use of support vector classification

9 A practical example of SVR

10 Discussion and conclusions

(120)

One-class SVM I

Separate data into normal ones and outliers (Schölkopf et al., 2001)

min_{w,ξ,ρ}  (1/2)wTw − ρ + (1/(νl)) Σ_{i=1}^l ξi
subject to  wTφ(xi) ≥ ρ − ξi,
            ξi ≥ 0, i = 1, . . . , l.

(121)

One-class SVM II

Instead of the parameter C in SVM, here the parameter is ν.

wTφ(xi) ≥ ρ − ξi

means that we hope most data satisfy wTφ(xi) ≥ ρ.

That is, most data are on one side of the hyperplane

Those on the wrong side are considered as outliers

(122)

One-class SVM III

The dual problem is

min_α  (1/2)αTQα
subject to  0 ≤ αi ≤ 1/(νl), i = 1, . . . , l,
            eTα = 1,

where Qij = K(xi, xj) = φ(xi)Tφ(xj).

The decision function is

sgn(Σ_{i=1}^l αiK(xi, x) − ρ).
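A usage sketch (assuming scikit-learn's OneClassSVM, whose nu parameter is the ν above) is below; roughly a ν fraction of the points end up flagged as outliers.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(95, 2), rng.randn(5, 2) * 6 + 10])   # 95 normal points + 5 far-away points

oc = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X)
pred = oc.predict(X)                  # +1 for "normal", -1 for outliers
print(np.sum(pred == -1))             # roughly nu * l points flagged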

(123)

One-class SVM IV

The role of −ρ is similar to the bias term b earlier

From the dual problem we can see that

ν ∈ (0, 1]

Otherwise, if ν > 1, then the constraints 0 ≤ αi ≤ 1/(νl) imply

eTα ≤ 1/ν < 1,

which violates the linear constraint eTα = 1.

Clearly, a larger ν means we don't need to push ξi to zero ⇒ more data are considered as outliers
