Regression
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Short course at ITRI, 2016
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
About this Course
Last year I gave a four-day short course on
“introduction to data mining”
In that course, SVM was discussed
This year I received a request to specifically talk about SVM
So I assume that some of you would like to learn more details of SVM
About this Course (Cont’d)
Therefore, this short course will be more technical than last year
More mathematics will be involved
We will have breaks at 9:50, 10:50, 13:50, and 14:50
Course slides: www.csie.ntu.edu.tw/~cjlin/talks/itri.pdf
I may still update the slides (e.g., if we find errors during the lectures)
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Support Vector Classification
Training vectors: x_i, i = 1, ..., l
These are feature vectors. For example, a patient = [height, weight, ...]^T
Consider a simple case with two classes:
Define an indicator vector y ∈ R^l:
    y_i = +1 if x_i is in class 1, −1 if x_i is in class 2
A hyperplane which separates all data:
(Figure: circles and triangles separated by the parallel hyperplanes w^T x + b = +1, 0, −1)
A separating hyperplane: w^T x + b = 0, with
    w^T x_i + b ≥ +1 if y_i = 1
    w^T x_i + b ≤ −1 if y_i = −1
Decision function: f(x) = sgn(w^T x + b), where x is a test instance
Many possible choices of w and b
Maximal Margin
Distance between w^T x + b = 1 and w^T x + b = −1:
    2/||w|| = 2/sqrt(w^T w)
A quadratic programming problem (Boser et al., 1992)
    min_{w,b}   (1/2) w^T w
    subject to  y_i(w^T x_i + b) ≥ 1, i = 1, ..., l.
Example
Given two training data points in R^1 as in the following figure:
(Figure: a triangle at x = 0 and a circle at x = 1)
What is the separating hyperplane?
The two data points are x_1 = 1, x_2 = 0 with y = [+1, −1]^T
Example (Cont’d)
Now w ∈ R^1. The optimization problem is
    min_{w,b}   (1/2) w^2
    subject to  w · 1 + b ≥ 1,        (1)
                −(w · 0 + b) ≥ 1.     (2)
From (2), −b ≥ 1.
Putting this into (1), w ≥ 2.
That is, for any (w, b) satisfying (1) and (2), w ≥ 2.
Example (Cont’d)
We are minimizing (1/2)w^2, so the smallest possible value is w = 2.
Thus, (w , b) = (2, −1) is the optimal solution.
The separating hyperplane is 2x − 1 = 0, in the middle of the two training data:
(Figure: the triangle at x = 0, the circle at x = 1, and the separating point x = 1/2)
Data May Not Be Linearly Separable
An example:
(Figure: circles and triangles that cannot be separated by any hyperplane)
Allow training errors
Map data to a higher dimensional (maybe infinite) feature space:
    φ(x) = [φ_1(x), φ_2(x), ...]^T
Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)
    min_{w,b,ξ}  (1/2) w^T w + C sum_{i=1}^l ξ_i
    subject to   y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,
                 ξ_i ≥ 0, i = 1, ..., l.
Example: x ∈ R^3, φ(x) ∈ R^10
    φ(x) = [1, √2 x_1, √2 x_2, √2 x_3, x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3]^T
Finding the Decision Function
w: possibly infinitely many variables
The dual problem has a finite number of variables:
    min_α       (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
                y^T α = 0,
where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T
At optimum
    w = sum_{i=1}^l α_i y_i φ(x_i)
A finite problem: #variables = #training data
Kernel Tricks
Q_ij = y_i y_j φ(x_i)^T φ(x_j) needs a closed form
Example: x_i ∈ R^3, φ(x_i) ∈ R^10
    φ(x_i) = [1, √2 (x_i)_1, √2 (x_i)_2, √2 (x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2,
              √2 (x_i)_1 (x_i)_2, √2 (x_i)_1 (x_i)_3, √2 (x_i)_2 (x_i)_3]^T
Then φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)^2.
Kernel: K(x, y) = φ(x)^T φ(y); common kernels:
    e^{−γ ||x_i − x_j||^2}   (Radial Basis Function or Gaussian kernel)
    (x_i^T x_j / a + b)^d    (Polynomial kernel)
Can be an inner product in an infinite dimensional space. Assume x ∈ R^1 and γ > 0.
    e^{−γ ||x_i − x_j||^2} = e^{−γ (x_i − x_j)^2} = e^{−γ x_i^2 + 2γ x_i x_j − γ x_j^2}
    = e^{−γ x_i^2 − γ x_j^2} (1 + (2γ x_i x_j)/1! + (2γ x_i x_j)^2/2! + (2γ x_i x_j)^3/3! + ...)
    = e^{−γ x_i^2 − γ x_j^2} (1 · 1 + sqrt(2γ/1!) x_i · sqrt(2γ/1!) x_j
        + sqrt((2γ)^2/2!) x_i^2 · sqrt((2γ)^2/2!) x_j^2
        + sqrt((2γ)^3/3!) x_i^3 · sqrt((2γ)^3/3!) x_j^3 + ...)
    = φ(x_i)^T φ(x_j),
where
    φ(x) = e^{−γ x^2} [1, sqrt(2γ/1!) x, sqrt((2γ)^2/2!) x^2, sqrt((2γ)^3/3!) x^3, ...]^T.
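As a quick sanity check of the kernel trick (my own sketch, not part of the original slides), the following Python snippet numerically verifies that the explicit degree-2 mapping for x ∈ R^3 gives the same value as the closed-form kernel (1 + x^T y)^2.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 mapping for x in R^3 (10-dimensional)."""
    s = np.sqrt(2.0)
    return np.array([1.0,
                     s * x[0], s * x[1], s * x[2],
                     x[0] ** 2, x[1] ** 2, x[2] ** 2,
                     s * x[0] * x[1], s * x[0] * x[2], s * x[1] * x[2]])

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -3.0, 2.0])

lhs = phi(x) @ phi(y)          # inner product in the mapped space
rhs = (1.0 + x @ y) ** 2       # closed-form kernel value
print(lhs, rhs)                # both print the same number
```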
Decision function
At optimum
    w = sum_{i=1}^l α_i y_i φ(x_i)
Decision function:
    w^T φ(x) + b = sum_{i=1}^l α_i y_i φ(x_i)^T φ(x) + b
                 = sum_{i=1}^l α_i y_i K(x_i, x) + b
Only the φ(x_i) with α_i > 0 are used ⇒ support vectors
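A minimal sketch of how the kernel decision function above can be evaluated, assuming an RBF kernel and that alpha, y, b and the training points have already been obtained from some solver (the function names here are my own):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_value(x, training_points, alpha, y, b, gamma=1.0):
    """w^T phi(x) + b expressed through the kernel; only alpha_i > 0 matter."""
    val = b
    for x_i, a_i, y_i in zip(training_points, alpha, y):
        if a_i > 0:                      # non-support vectors contribute nothing
            val += a_i * y_i * rbf_kernel(x_i, x, gamma)
    return val

# prediction: np.sign(decision_value(...))
```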
Support Vectors: More Important Data
Only φ(xi) of αi > 0 used ⇒ support vectors
(Figure: a two-class toy example illustrating support vectors)
See more examples via SVM Toy available at libsvm web page
(http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
Example: Primal-dual Relationship
If the data are separable, the primal problem does not have ξ_i:
    min_{w,b}   (1/2) w^T w
    subject to  y_i(w^T x_i + b) ≥ 1, i = 1, ..., l.
The dual problem is
    min_α       (1/2) sum_{i=1}^l sum_{j=1}^l α_i α_j y_i y_j x_i^T x_j − sum_{i=1}^l α_i
    subject to  0 ≤ α_i, i = 1, ..., l,
                sum_{i=1}^l y_i α_i = 0.
Example: Primal-dual Relationship (Cont’d)
Consider the earlier example:
(Figure: the triangle at x = 0 and the circle at x = 1)
The two data points are x_1 = 1, x_2 = 0 with y = [+1, −1]^T
The solution is (w, b) = (2, −1)
Example: Primal-dual Relationship (Cont’d)
The dual objective function is
    (1/2) [α_1 α_2] [ 1 0 ; 0 0 ] [α_1 ; α_2] − [1 1] [α_1 ; α_2]
    = (1/2) α_1^2 − (α_1 + α_2)
In optimization, the objective function is the function to be optimized
Constraints are
α1 − α2 = 0, 0 ≤ α1, 0 ≤ α2.
Example: Primal-dual Relationship (Cont’d)
Substituting α_2 = α_1 into the objective function,
    (1/2) α_1^2 − 2 α_1
has its smallest value at α_1 = 2.
Because [2, 2]^T satisfies the constraints 0 ≤ α_1 and 0 ≤ α_2, it is optimal
Example: Primal-dual Relationship (Cont’d)
Using the primal-dual relation,
    w = y_1 α_1 x_1 + y_2 α_2 x_2 = 1 · 2 · 1 + (−1) · 2 · 0 = 2
This is the same as the solution obtained by solving the primal problem.
More about Support vectors
We know
    α_i > 0 ⇒ support vector
We have
    y_i(w^T x_i + b) < 1 ⇒ α_i > 0 ⇒ support vector,
    y_i(w^T x_i + b) = 1 ⇒ α_i ≥ 0 ⇒ maybe a support vector, and
    y_i(w^T x_i + b) > 1 ⇒ α_i = 0 ⇒ not a support vector
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Convex Optimization I
Convex problems are an important class of optimization problems that possess nice properties.
A function f is convex if for all x, y:
    f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y), ∀ θ ∈ [0, 1]
That is, the line segment between any two points on the graph is never below the function value
Convex Optimization II
(Figure: a convex function; the chord joining (x, f(x)) and (y, f(y)) lies above the curve)
Convex Optimization III
A convex optimization problem takes the following form:
    min         f_0(w)
    subject to  f_i(w) ≤ 0, i = 1, ..., m,          (3)
                h_i(w) = 0, i = 1, ..., p,
where f_0, ..., f_m are convex functions and h_1, ..., h_p are affine (i.e., a linear function plus a constant):
    h_i(w) = a^T w + b
Convex Optimization IV
A nice property of convex optimization problems is that
    inf_w { f_0(w) | w satisfies the constraints }
is unique
That is, the optimal objective value is unique, but the optimal w may not be
There are other nice properties such as the primal-dual relationship that we will use
To learn more about convex optimization, you can check the book by Boyd and Vandenberghe (2004)
Deriving the Dual
For simplicity, consider the problem without ξ_i:
    min_{w,b}   (1/2) w^T w
    subject to  y_i(w^T φ(x_i) + b) ≥ 1, i = 1, ..., l.
Its dual is
    min_α       (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i, i = 1, ..., l,
                y^T α = 0,
where
    Q_ij = y_i y_j φ(x_i)^T φ(x_j)
Lagrangian Dual I
Lagrangian dual:
    max_{α ≥ 0}  min_{w,b}  L(w, b, α),
where
    L(w, b, α) = (1/2) ||w||^2 − sum_{i=1}^l α_i [ y_i(w^T φ(x_i) + b) − 1 ]
Lagrangian Dual II
Strong duality:
    min Primal = max_{α ≥ 0} min_{w,b} L(w, b, α)
After SVM became popular, quite a few people think that for any optimization problem the Lagrangian dual exists and strong duality holds
Wrong! We usually need
    the optimization problem to be convex, and
    certain constraint qualifications to hold (details not discussed)
Lagrangian Dual III
We have both: the SVM primal is convex and has linear constraints
Simplify the dual. When α is fixed,
    min_{w,b} L(w, b, α) =
        −∞                                                            if sum_{i=1}^l α_i y_i ≠ 0,
        min_w (1/2) w^T w − sum_{i=1}^l α_i [ y_i w^T φ(x_i) − 1 ]    if sum_{i=1}^l α_i y_i = 0.
If sum_{i=1}^l α_i y_i ≠ 0, we can decrease
    −b sum_{i=1}^l α_i y_i
in L(w, b, α) to −∞.
If sum_{i=1}^l α_i y_i = 0, the optimum of the strictly convex function
    (1/2) w^T w − sum_{i=1}^l α_i [ y_i w^T φ(x_i) − 1 ]
happens when
    ∇_w L(w, b, α) = 0.
Thus,
    w = sum_{i=1}^l α_i y_i φ(x_i).
Note that
    w^T w = ( sum_{i=1}^l α_i y_i φ(x_i) )^T ( sum_{j=1}^l α_j y_j φ(x_j) ) = sum_{i,j} α_i α_j y_i y_j φ(x_i)^T φ(x_j)
The dual is
    max_{α ≥ 0}   sum_{i=1}^l α_i − (1/2) sum_{i,j} α_i α_j y_i y_j φ(x_i)^T φ(x_j)   if sum_{i=1}^l α_i y_i = 0,
                  −∞                                                                  if sum_{i=1}^l α_i y_i ≠ 0.
Recall the Lagrangian dual is max_{α ≥ 0} min_{w,b} L(w, b, α)
−∞ is definitely not the maximum of the dual, so the dual optimum does not happen when
    sum_{i=1}^l α_i y_i ≠ 0.
The dual is simplified to
    max_{α ∈ R^l}  sum_{i=1}^l α_i − (1/2) sum_{i=1}^l sum_{j=1}^l α_i α_j y_i y_j φ(x_i)^T φ(x_j)
    subject to     y^T α = 0,
                   α_i ≥ 0, i = 1, ..., l.
Our problems may be infinite dimensional (i.e., w ∈ R∞)
We can still use Lagrangian duality; see a rigorous discussion in Lin (2001)
Primal versus Dual I
Recall the dual problem is
    min_α       (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
                y^T α = 0
and at optimum
    w = sum_{i=1}^l α_i y_i φ(x_i)     (4)
Primal versus Dual II
What if we put (4) into the primal?
    min_{α,ξ}   (1/2) α^T Q α + C sum_{i=1}^l ξ_i
    subject to  (Qα + by)_i ≥ 1 − ξ_i,          (5)
                ξ_i ≥ 0
Note that
    y_i w^T φ(x_i) = y_i sum_{j=1}^l α_j y_j φ(x_j)^T φ(x_i) = sum_{j=1}^l Q_ij α_j = (Qα)_i
Primal versus Dual III
If Q is positive definite, we can prove that the optimal α of (5) is the same as that of the dual.
So the dual is not the only way to obtain the model
Large Dense Quadratic Programming
    min_α       (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
                y^T α = 0
Q_ij ≠ 0 in general, so Q is an l by l fully dense matrix
50,000 training points ⇒ 50,000 variables:
    (50,000^2 × 8 / 2) bytes = 10GB RAM to store Q
Large Dense Quadratic Programming (Cont’d)
For quadratic programming problems, traditional optimization methods assume that Q is available in the computer memory
They cannot be directly applied here because Q cannot even be stored
Currently, decomposition methods (a type of coordinate descent method) are what is used in practice
Decomposition Methods
Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)
Similar to coordinate-wise minimization
Working set B; N = {1, ..., l} \ B is fixed
Sub-problem at the kth iteration:
    min_{α_B}   (1/2) [α_B^T (α_N^k)^T] [ Q_BB Q_BN ; Q_NB Q_NN ] [α_B ; α_N^k] − [e_B^T e_N^T] [α_B ; α_N^k]
    subject to  0 ≤ α_t ≤ C, t ∈ B,
                y_B^T α_B = −y_N^T α_N^k
Avoid Memory Problems
The new objective function is
    (1/2) [α_B^T (α_N^k)^T] [ Q_BB α_B + Q_BN α_N^k ; Q_NB α_B + Q_NN α_N^k ] − e_B^T α_B + constant
    = (1/2) α_B^T Q_BB α_B + (−e_B + Q_BN α_N^k)^T α_B + constant
Only |B| columns of Q are needed
In general |B| ≤ 10 is used
Columns are calculated when used: trade time for space
But is such an approach practical?
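To illustrate the "calculated when used" idea, here is a rough Python sketch (my own illustration, not LIBSVM code): for a working set B, only the |B| columns of Q touched by the sub-problem are formed, so memory stays at O(l·|B|) instead of O(l^2).

```python
import numpy as np

def rbf_kernel_row(X, x, gamma=1.0):
    """Kernel values between every row of X and a single point x."""
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def q_columns(X, y, B, gamma=1.0):
    """Compute only the columns of Q = (y_i y_j K(x_i, x_j)) indexed by B."""
    l = X.shape[0]
    cols = np.empty((l, len(B)))
    for k, j in enumerate(B):
        cols[:, k] = y * y[j] * rbf_kernel_row(X, X[j], gamma)
    return cols

# Example: with l = 50,000 the full Q needs ~10GB, but a working set of
# size 2 needs only two columns (a few hundred KB) at a time.
```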
How Decomposition Methods Perform?
Convergence is not very fast because only first-order information is used
But there is no need to have a very accurate α. The decision function is
    sgn(w^T φ(x) + b) = sgn( sum_{i=1}^l α_i K(x_i, x) + b )
Prediction may still be correct with a rough α
Further, in some situations,
    # support vectors ≪ # training points
With the initial α^1 = 0, some instances are never used
How Decomposition Methods Perform?
(Cont’d)
An example of training 50,000 instances using the software LIBSVM
$svm-train -c 16 -g 4 -m 400 22features Total nSV = 3370
Time 79.524s
This was done on a typical desktop
Calculating the whole Q takes more time
#SVs = 3,370 ≪ 50,000
A good case where some α_i remain at zero all the time
How Decomposition Methods Perform?
(Cont’d)
Because many α_i = 0 in the end, we can develop a shrinking technique
Variables are removed during the optimization procedure. Smaller problems are solved
Machine Learning Properties are Useful in Designing Optimization Algorithms
We have seen that special properties of SVM contribute to the viability of decomposition methods
For machine learning applications, there is no need to accurately solve the optimization problem
Because some optimal α_i = 0, decomposition methods may not need to update all the variables
Also, we can use shrinking techniques to reduce the problem size during the decomposition procedure
Differences between Optimization and Machine Learning
The two topics may have different focuses. We give the following example
The decomposition method we just discussed converges more slowly when C is large
Using C = 1 on a data set: # iterations = 508
Using C = 5,000: # iterations = 35,241
Optimization researchers may rush to solve difficult cases of large C
It turns out that large C is used less often than small C
Recall that SVM solves
    (1/2) w^T w + C (sum of training losses)
A large C means overfitting the training data
This does not give good test accuracy. More details about overfitting will be discussed later
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Equivalent Optimization Problem
• Recall the SVM optimization problem is
    min_{w,b,ξ}  (1/2) w^T w + C sum_{i=1}^l ξ_i
    subject to   y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,
                 ξ_i ≥ 0, i = 1, ..., l.
• It is equivalent to
    min_{w,b}   (1/2) w^T w + C sum_{i=1}^l max(0, 1 − y_i(w^T φ(x_i) + b))
• The reformulation is useful to derive SVM from a different viewpoint
Equivalent Optimization Problem (Cont’d)
That is, at optimum,
    ξ_i = max(0, 1 − y_i(w^T φ(x_i) + b))
Reason: from the constraints,
    ξ_i ≥ 1 − y_i(w^T φ(x_i) + b) and ξ_i ≥ 0,
but we also want to minimize ξ_i
Linear and Kernel I
Linear classifier:
    sgn(w^T x + b)
Kernel classifier:
    sgn(w^T φ(x) + b) = sgn( sum_{i=1}^l α_i K(x_i, x) + b )
Linear is a special case of kernel
An important difference is that for linear we can store w
Linear and Kernel II
For kernel, w may be infinite dimensional and cannot be stored
We will show that they are useful in different circumstances
The Bias Term b
Recall the decision function is sgn(w^T x + b)
Sometimes the bias term b is omitted:
    sgn(w^T x)
This is fine if the number of features is not too small
Minimizing Training Errors
For classification we naturally aim to minimize the training error:
    min_w (training errors)
To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)
Ideally we should use the 0–1 training loss:
    ξ(w; x, y) = 1 if y w^T x < 0, and 0 otherwise
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
(Figure: the discontinuous 0–1 loss ξ(w; x, y) plotted against −y w^T x)
We need continuous approximations
Common Loss Functions
Hinge loss (l1 loss):
    ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x)           (6)
Squared hinge loss (l2 loss):
    ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)^2         (7)
Logistic loss:
    ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x})         (8)
SVM: (6)-(7). Logistic regression (LR): (8)
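For concreteness, a small sketch (my own, not from the slides) computing the three losses (6)-(8) for a single instance:

```python
import numpy as np

def hinge_loss(w, x, y):          # (6) l1 loss
    return max(0.0, 1.0 - y * np.dot(w, x))

def squared_hinge_loss(w, x, y):  # (7) l2 loss
    return max(0.0, 1.0 - y * np.dot(w, x)) ** 2

def logistic_loss(w, x, y):       # (8) used by logistic regression
    return np.log(1.0 + np.exp(-y * np.dot(w, x)))

w = np.array([0.5, -0.2])
x = np.array([1.0, 3.0])
print(hinge_loss(w, x, 1), squared_hinge_loss(w, x, 1), logistic_loss(w, x, 1))
```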
Common Loss Functions (Cont’d)
(Figure: ξ_L1, ξ_L2, and ξ_LR plotted against −y w^T x as continuous approximations of the 0–1 loss)
Logistic regression is closely related to SVM
Their performance (i.e., test accuracy) is usually similar
Common Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification, you can easily achieve 100% training accuracy. This is useless
When training a data set, we should
    avoid underfitting: small training error
    avoid overfitting: small testing error
(Figure: an overfitting illustration; filled symbols are training data and triangles are testing data)
Regularization
To minimize the training error, we manipulate the w vector so that it fits the data
To avoid overfitting, we need a way to make w's values less extreme
One idea is to make w's values closer to zero
We can add, for example,
    w^T w / 2   or   ||w||_1
to the objective function
General Form of Linear Classification I
Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l , yi = ±1 l : # of data, n: # of features
    min_w f(w),   f(w) ≡ w^T w / 2 + C sum_{i=1}^l ξ(w; x_i, y_i)     (9)
w^T w / 2: regularization term
ξ(w; x, y): loss function
C: regularization parameter
General Form of Linear Classification II
Of course we can map data to a higher dimensional space
    min_w f(w),   f(w) ≡ w^T w / 2 + C sum_{i=1}^l ξ(w; φ(x_i), y_i)
SVM and Logistic Regression I
If the hinge (l1) loss is used, the optimization problem is
    min_w   (1/2) w^T w + C sum_{i=1}^l max(0, 1 − y_i w^T x_i)
It is the SVM problem we had earlier (without the bias term b)
Therefore, we have derived SVM from a different viewpoint
We also see that SVM is very related to logistic regression
SVM and Logistic Regression II
However, many people wrongly think that they are very different
The reason for this misunderstanding: traditionally,
    when people say SVM ⇒ they mean kernel SVM
    when people say logistic regression ⇒ they mean linear logistic regression
Indeed we can do kernel logistic regression:
    min_w   (1/2) w^T w + C sum_{i=1}^l log(1 + e^{−y_i w^T φ(x_i)})
SVM and Logistic Regression III
A main difference from SVM is that logistic regression has a probability interpretation
We will introduce logistic regression from another viewpoint
Logistic Regression
For a label-feature pair (y, x), assume the probability model is
    p(y | x) = 1 / (1 + e^{−y w^T x}).
Note that
    p(1 | x) + p(−1 | x) = 1/(1 + e^{−w^T x}) + 1/(1 + e^{w^T x})
                         = e^{w^T x}/(1 + e^{w^T x}) + 1/(1 + e^{w^T x}) = 1
w is the parameter to be decided
Logistic Regression (Cont’d)
Idea of this model:
    p(1 | x) = 1/(1 + e^{−w^T x})   → 1 if w^T x ≫ 0,   → 0 if w^T x ≪ 0
Assume training instances are
    (y_i, x_i), i = 1, ..., l
Logistic Regression (Cont’d)
Logistic regression finds w by maximizing the following likelihood:
    max_w   prod_{i=1}^l p(y_i | x_i).          (10)
Negative log-likelihood:
    − log prod_{i=1}^l p(y_i | x_i) = − sum_{i=1}^l log p(y_i | x_i) = sum_{i=1}^l log(1 + e^{−y_i w^T x_i})
Logistic Regression (Cont’d)
Logistic regression:
    min_w   sum_{i=1}^l log(1 + e^{−y_i w^T x_i}).
Regularized logistic regression:
    min_w   (1/2) w^T w + C sum_{i=1}^l log(1 + e^{−y_i w^T x_i}).     (11)
C: regularization parameter decided by users
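A short sketch (my own illustration) of evaluating the regularized logistic regression objective (11); np.logaddexp is used to compute log(1 + e^{-m}) stably:

```python
import numpy as np

def regularized_lr_objective(w, X, y, C):
    """Objective (11): 0.5 * w^T w + C * sum_i log(1 + exp(-y_i * w^T x_i)).

    X: l-by-n data matrix, y: labels in {+1, -1}, C: regularization parameter.
    """
    margins = y * (X @ w)
    # np.logaddexp(0, -m) computes log(1 + exp(-m)) without overflow
    return 0.5 * np.dot(w, w) + C * np.sum(np.logaddexp(0.0, -margins))
```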
Loss Functions: Differentiability
However,
ξ_L1: not differentiable
ξ_L2: differentiable but not twice differentiable
ξ_LR: twice differentiable
The same optimization method may not be applicable to all these losses
Discussion
We see that the same classification method can be derived in different ways
SVM:
    maximal margin
    regularization and training losses
LR:
    regularization and training losses
    maximum likelihood
Regularization
L1 versus L2
||w||_1 and w^T w / 2
    w^T w / 2: smooth, easier to optimize
    ||w||_1: non-differentiable; sparse solution (possibly many zero elements)
Possible advantages of L1 regularization:
    feature selection
    less storage for w
Linear and Kernel Classification
Methods such as SVM and logistic regression can be used in two ways
Kernel methods: data mapped to a higher dimensional space
    x ⇒ φ(x)
    φ(x_i)^T φ(x_j) is easily calculated; little control over φ(·)
Feature engineering + linear classification:
    we have x without mapping. Alternatively, we can say that φ(x) is our x; full control over x or φ(x)
We refer to them as kernel and linear classifiers
Linear and Kernel Classification
Let's check the prediction cost:
    w^T x + b   versus   sum_{i=1}^l α_i y_i K(x_i, x) + b
If computing K(x_i, x_j) takes O(n), then the costs are
    O(n) versus O(nl)
Linear is much cheaper
A similar difference occurs for training
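The cost difference can be seen in a small sketch (my own, using a linear kernel so the two forms give the same value): the linear form needs one O(n) inner product per prediction, while the kernel form needs O(nl) work.

```python
import numpy as np

l, n = 10000, 100                      # l training points, n features
rng = np.random.default_rng(0)
X, y = rng.normal(size=(l, n)), rng.choice([-1.0, 1.0], size=l)
alpha, b = rng.random(l), 0.1
w = (alpha * y) @ X                    # linear model: w = sum_i alpha_i y_i x_i
x = rng.normal(size=n)                 # one test point

# Linear prediction: O(n)
linear_val = w @ x + b

# Kernel prediction with the linear kernel K(x_i, x) = x_i^T x: O(nl)
kernel_val = np.sum(alpha * y * (X @ x)) + b

print(linear_val, kernel_val)          # identical values, very different costs
```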
Linear and Kernel Classification (Cont’d)
In a sense, linear is a special case of kernel
Indeed, we can prove that test accuracy of linear is the same as Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)
Therefore, roughly we have
test accuracy: kernel ≥ linear
cost: kernel ≫ linear
Speed is the reason to use linear
Linear and Kernel Classification (Cont’d)
For some problems, accuracy by linear is as good as nonlinear
But training and testing are much faster
This particularly happens for document classification
    Number of features (bag-of-words model) is very large
    Data are very sparse (i.e., few non-zeros)
Comparison Between Linear and Kernel (Training Time & Testing Accuracy)
Data set        Linear: Time   Accuracy    RBF Kernel: Time   Accuracy
MNIST38         0.1            96.82       38.1               99.70
ijcnn1          1.6            91.81       26.8               98.69
covtype         1.4            76.37       46,695.8           96.11
news20          1.1            96.95       383.2              96.90
real-sim        0.3            97.44       938.3              97.82
yahoo-japan     3.1            92.63       20,955.2           93.31
webspam         25.7           93.35       15,681.8           99.26
Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features
Extension: Training Explicit Form of Nonlinear Mappings I
Linear-SVM method to train φ(x_1), ..., φ(x_l); kernel not used
Applicable only if the dimension of φ(x) is not too large
Low-degree polynomial mappings:
    K(x_i, x_j) = (x_i^T x_j + 1)^2 = φ(x_i)^T φ(x_j)
    φ(x) = [1, √2 x_1, ..., √2 x_n, x_1^2, ..., x_n^2, √2 x_1 x_2, ..., √2 x_{n−1} x_n]^T
Extension: Training Explicit Form of Nonlinear Mappings II
For this mapping, # features = O(n^2)
Recall the cost is O(n) for linear versus O(nl) for kernel; now it is O(n^2) versus O(nl)
Sparse data:
    n ⇒ n̄, the average # of non-zeros per instance
    n̄ ≪ n, so O(n̄^2) may be much smaller than O(l n̄)
When the degree is small, train the explicit form of φ(x)
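A rough sketch of this approach using scikit-learn (my own assumption for illustration; the experiments reported in these slides were actually run with LIBLINEAR and LIBSVM): expand the degree-2 features explicitly, then train a linear SVM on the mapped data.

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# Explicit degree-2 polynomial mapping phi(x); dimension grows as O(n^2)
phi = PolynomialFeatures(degree=2, include_bias=True)
X_mapped = phi.fit_transform(X)

# Train the mapped data with a *linear* SVM; no kernel is used
clf = LinearSVC(C=1.0, max_iter=10000)
clf.fit(X_mapped, y)
print(clf.score(X_mapped, y))   # training accuracy, for illustration only
```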
Testing Accuracy and Training Time
Data set    Training time (s)            Degree-2 poly    Accuracy diff.
            LIBLINEAR     LIBSVM         accuracy         vs. Linear   vs. RBF
a9a         1.6           89.8           85.06            0.07         0.02
real-sim    59.8          1,220.5        98.00            0.49         0.10
ijcnn1      10.7          64.2           97.84            5.63         −0.85
MNIST38     8.6           18.4           99.29            2.47         −0.40
covtype     5,211.9       NA             80.09            3.74         −15.98
webspam     3,228.1       NA             98.44            5.29         −0.76
Training φ(xi) by linear: faster than kernel, but sometimes competitive accuracy
Example: Dependency Parsing I
This is an NLP Application
                Kernel                        Linear
                RBF          Poly-2           Linear      Poly-2
Training time   3h34m53s     3h21m51s         3m36s       3m43s
Parsing speed   0.7x         1x               1652x       103x
UAS             89.92        91.67            89.11       91.71
LAS             88.55        90.60            88.07       90.71
We get faster training/testing while maintaining good accuracy
Example: Dependency Parsing II
We achieve this by training low-degree polynomial-mapped data with linear classification
That is, linear methods are used to explicitly train φ(x_i), ∀ i
We consider the following low-degree polynomial mapping:
    φ(x) = [1, x_1, ..., x_n, x_1^2, ..., x_n^2, x_1 x_2, ..., x_{n−1} x_n]^T
Handling High Dimensionality of φ(x)
A multi-class problem with sparse data:
    n          Dim. of φ(x)      l          n̄       # nonzeros of w
    46,155     1,065,165,090     204,582    13.3    1,438,456
n̄: average # of nonzeros per instance
The dimensionality of w is very high, but w is sparse
Some training feature columns of x_i x_j are entirely zero
Hashing techniques are used to handle sparse w
Example: Classifier in a Small Device
In a sensor application (Yu et al., 2014), the classifier can use less than 16KB of RAM
Classifiers             Test accuracy   Model size
Decision Tree           77.77           76.02KB
AdaBoost (10 trees)     78.84           1,500.54KB
SVM (RBF kernel)        85.33           1,287.15KB
Number of features: 5
We consider a degree-3 polynomial mapping, whose dimensionality is
    C(5 + 3, 3) + bias term = 56 + 1 = 57.
Example: Classifier in a Small Device
One-against-one strategy for 5-class classification:
    C(5, 2) × 57 × 4 bytes = 2.28KB
Assume single precision
Results
SVM method Test accuracy Model Size
RBF kernel 85.33 1,287.15KB
Polynomial kernel 84.79 2.28KB
Linear kernel 78.51 0.24KB
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Multi-class Classification
SVM and logistic regression are methods for two-class classification
We need certain ways to extend them for multi-class problems
This is not a problem for methods such as nearest neighbor or decision trees
Multi-class Classification (Cont’d)
k classes
One-against-the-rest: train k binary SVMs:
    1st class vs. (2, ..., k)th classes
    2nd class vs. (1, 3, ..., k)th classes
    ...
k decision functions:
    (w^1)^T φ(x) + b_1
    ...
    (w^k)^T φ(x) + b_k
Prediction:
    arg max_j (w^j)^T φ(x) + b_j
Reason: if x is in the 1st class, then we should have
    (w^1)^T φ(x) + b_1 ≥ +1
    (w^2)^T φ(x) + b_2 ≤ −1
    ...
    (w^k)^T φ(x) + b_k ≤ −1
Multi-class Classification (Cont’d)
One-against-one: train k(k − 1)/2 binary SVMs:
    (1, 2), (1, 3), ..., (1, k), (2, 3), (2, 4), ..., (k − 1, k)
If there are 4 classes ⇒ 6 binary SVMs:
    y_i = 1     y_i = −1    Decision function
    class 1     class 2     f_12(x) = (w_12)^T x + b_12
    class 1     class 3     f_13(x) = (w_13)^T x + b_13
    class 1     class 4     f_14(x) = (w_14)^T x + b_14
    class 2     class 3     f_23(x) = (w_23)^T x + b_23
    class 2     class 4     f_24(x) = (w_24)^T x + b_24
    class 3     class 4     f_34(x) = (w_34)^T x + b_34
For a testing instance, predict with all binary SVMs:
    Classes     Winner
    1 vs 2      1
    1 vs 3      1
    1 vs 4      1
    2 vs 3      2
    2 vs 4      4
    3 vs 4      3
Select the class with the largest vote:
    class       1   2   3   4
    # votes     3   1   1   1
Decision values may be used as well
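A small sketch (my own, with the winners from the table above hard-coded) of the one-against-one voting procedure:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, binary_predict):
    """Vote among the k(k-1)/2 binary classifiers.

    binary_predict(x, i, j) is assumed to return the winning class (i or j)
    of the classifier trained on classes i and j.
    """
    votes = Counter(binary_predict(x, i, j) for i, j in combinations(classes, 2))
    return votes.most_common(1)[0][0]   # class with the largest vote

# Reproduce the example: winners 1, 1, 1, 2, 4, 3  =>  class 1 has 3 votes
winners = {(1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 3): 2, (2, 4): 4, (3, 4): 3}
print(one_vs_one_predict(None, [1, 2, 3, 4], lambda x, i, j: winners[(i, j)]))
```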
Solving a Single Problem
An approach by Crammer and Singer (2002)
    min_{w_1,...,w_k}   (1/2) sum_{m=1}^k ||w_m||_2^2 + C sum_{i=1}^l ξ({w_m}_{m=1}^k; x_i, y_i),
where
    ξ({w_m}_{m=1}^k; x, y) ≡ max_{m ≠ y} max(0, 1 − (w_y − w_m)^T x).
We hope the decision value of x_i by the model w_{y_i} is larger than the others
Prediction: same as one-against-the-rest,
    arg max_j (w_j)^T x
Discussion
Other variants of solving a single optimization problem include Weston and Watkins (1999); Lee et al. (2004)
A comparison in Hsu and Lin (2002)
RBF kernel: accuracy is similar for the different methods
But 1-against-1 is the fastest for training
Maximum Entropy
Maximum Entropy: a generalization of logistic regression for multi-class problems
It is widely used in NLP applications.
The conditional probability of label y given data x is
    P(y | x) ≡ exp(w_y^T x) / sum_{m=1}^k exp(w_m^T x)
Maximum Entropy (Cont’d)
We then minimize the regularized negative log-likelihood:
    min_{w_1,...,w_k}   (1/2) sum_{m=1}^k ||w_m||^2 + C sum_{i=1}^l ξ({w_m}_{m=1}^k; x_i, y_i),
where
    ξ({w_m}_{m=1}^k; x, y) ≡ − log P(y | x).
Maximum Entropy (Cont’d)
Is this loss function reasonable?
If
    w_{y_i}^T x_i ≫ w_m^T x_i, ∀ m ≠ y_i,
then
    ξ({w_m}_{m=1}^k; x_i, y_i) ≈ 0
That is, no loss
In contrast, if
    w_{y_i}^T x_i ≪ w_m^T x_i for some m ≠ y_i,
then P(y_i | x_i) ≪ 1 and the loss is large.
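A short sketch (my own illustration) of the maximum entropy probability model and its loss −log P(y|x); the max-subtraction is only for numerical stability:

```python
import numpy as np

def maxent_probability(W, x):
    """P(y | x) = exp(w_y^T x) / sum_m exp(w_m^T x) for all k classes.

    W: k-by-n matrix whose rows are w_1, ..., w_k.
    """
    scores = W @ x
    scores -= scores.max()              # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def maxent_loss(W, x, y):
    """xi({w_m}; x, y) = -log P(y | x); y is a 0-based class index here."""
    return -np.log(maxent_probability(W, x)[y])
```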
Features as Functions
NLP applications often use a function f (x , y ) to generate the feature vector
    P(y | x) ≡ exp(w^T f(x, y)) / sum_{y'} exp(w^T f(x, y')).      (12)
The earlier probability model is a special case with
    f(x, y) = [0, ..., 0, x^T, 0, ..., 0]^T ∈ R^{nk}   (x placed in the yth block, after y − 1 zero blocks)
and
    w = [w_1^T, ..., w_k^T]^T
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Least Square Regression I
Given training data (x_1, y_1), ..., (x_l, y_l)
Now y_i ∈ R is the target value
Regression: find a function so that f(x_i) ≈ y_i
Least Square Regression II
Least square regression:
    min_{w,b}   sum_{i=1}^l (y_i − (w^T x_i + b))^2
That is, we model f(x) by
    f(x) = w^T x + b
An example:
Least Square Regression III
(Figure: height (cm) versus weight (kg) with the fitted line Height = 0.60 · Weight + 130.2; picture from http://tex.stackexchange.com/questions/119179/how-to-add-a-regression-line-to-randomly-generated-points-using-pgfplots-in-tikz)
Least Square Regression IV
This is equivalent to
    min_{w,b}   sum_{i=1}^l ξ(w, b; x_i, y_i),
where
    ξ(w, b; x_i, y_i) = (y_i − (w^T x_i + b))^2
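A minimal least-square regression sketch (my own illustration) that solves min_{w,b} sum_i (y_i − (w x_i + b))^2 in closed form with numpy:

```python
import numpy as np

# Toy data: y_i roughly follows 0.6 * x_i + 130 (like the height/weight figure)
rng = np.random.default_rng(0)
x = rng.uniform(40, 100, size=50)                  # e.g., weight in kg
y = 0.6 * x + 130 + rng.normal(0, 3, size=50)      # e.g., height in cm

# Ordinary least squares on the design matrix [x, 1]
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)                                        # w ~ 0.6, b ~ 130
```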
Regularized Least Square
ξ(w, b; x_i, y_i) is a kind of loss function
We can add regularization:
    min_{w,b}   (1/2) w^T w + C sum_{i=1}^l ξ(w, b; x_i, y_i)
C is still the regularization parameter
Other loss functions?
Support Vector Regression I
ε-insensitive loss function (b omitted):
    max(|w^T x_i − y_i| − ε, 0)         (L1)
    max(|w^T x_i − y_i| − ε, 0)^2       (L2)
Support Vector Regression II
(Figure: the L1 and L2 ε-insensitive losses as functions of w^T x_i − y_i; the loss is zero on [−ε, ε])
ε: errors small enough are treated as no error
This makes the model more robust (less overfitting of the data)
Support Vector Regression III
One more parameter (ε) to decide
An equivalent form of the optimization problem:
    min_{w,b,ξ,ξ*}   (1/2) w^T w + C sum_{i=1}^l ξ_i + C sum_{i=1}^l ξ_i*
    subject to       w^T φ(x_i) + b − y_i ≤ ε + ξ_i,
                     y_i − w^T φ(x_i) − b ≤ ε + ξ_i*,
                     ξ_i, ξ_i* ≥ 0, i = 1, ..., l.
This form is similar to the SVM formulation derived earlier
Support Vector Regression IV
The dual problem is
    min_{α,α*}   (1/2) (α − α*)^T Q (α − α*) + ε sum_{i=1}^l (α_i + α_i*) + sum_{i=1}^l y_i (α_i − α_i*)
    subject to   e^T (α − α*) = 0,
                 0 ≤ α_i, α_i* ≤ C, i = 1, ..., l,
where Q_ij = K(x_i, x_j) ≡ φ(x_i)^T φ(x_j).
Support Vector Regression V
After solving the dual problem,
    w = sum_{i=1}^l (−α_i + α_i*) φ(x_i)
and the approximate function is
    sum_{i=1}^l (−α_i + α_i*) K(x_i, x) + b.
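For experimentation, this formulation is available in common packages; a hedged sketch using scikit-learn's SVR (which wraps LIBSVM) is below. The parameter values are placeholders, not recommendations.

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=100)

# RBF-kernel support vector regression; epsilon is the insensitive-tube width
model = SVR(kernel="rbf", C=1.0, gamma=0.5, epsilon=0.1)
model.fit(X, y)
print(model.predict([[1.0], [2.5]]))
```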
Discussion
SVR and least-square regression are closely related
Why do people more commonly use l2 (least-square) rather than l1 losses?
It is easier because of differentiability
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
One-class SVM I
Separate data into normal points and outliers (Schölkopf et al., 2001)
    min_{w,ξ,ρ}   (1/2) w^T w − ρ + (1/(νl)) sum_{i=1}^l ξ_i
    subject to     w^T φ(x_i) ≥ ρ − ξ_i,
                   ξ_i ≥ 0, i = 1, ..., l.
One-class SVM II
Instead of the parameter C in SVM, here the parameter is ν.
The constraint
    w^T φ(x_i) ≥ ρ − ξ_i
means that we hope most data satisfy w^T φ(x_i) ≥ ρ.
That is, most data are on one side of the hyperplane; those on the wrong side are considered outliers
One-class SVM III
The dual problem is
    min_α        (1/2) α^T Q α
    subject to   0 ≤ α_i ≤ 1/(νl), i = 1, ..., l,
                 e^T α = 1,
where Q_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j).
The decision function is
    sgn( sum_{i=1}^l α_i K(x_i, x) − ρ ).
One-class SVM IV
The role of −ρ is similar to that of the bias term b earlier
From the dual problem we can see that ν ∈ (0, 1]
Otherwise, if ν > 1, then the box constraints give
    e^T α ≤ l · 1/(νl) = 1/ν < 1,
which violates the linear constraint e^T α = 1.
Clearly, a larger ν means we don't need to push ξ_i to zero ⇒ more data are considered as outliers
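To see the effect of ν, here is a hedged sketch using scikit-learn's OneClassSVM (which wraps the LIBSVM implementation of this formulation); a larger nu typically marks more points as outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))                  # mostly "normal" data
X = np.vstack([X, rng.uniform(-6, 6, (10, 2))])      # a few far-away points

for nu in (0.05, 0.2):
    model = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu)
    labels = model.fit_predict(X)                    # +1 = normal, -1 = outlier
    print(nu, np.sum(labels == -1))                  # larger nu => more outliers
```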