Regression
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Short course at ITRI, 2016
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
About this Course
Last year I gave a four-day short course on
“introduction to data mining”
In that course, SVM was discussed
This year I received a request to specifically talk about SVM
So I assume that some of you would like to learn more details of SVM
About this Course (Cont’d)
Therefore, this short course will be more technical than last year
More mathematics will be involved
We will have breaks at 9:50, 10:50, 13:50, and 14:50
Course slides: www.csie.ntu.edu.tw/~cjlin/talks/itri.pdf
I may still update the slides (e.g., if we find errors during the lectures)
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Support Vector Classification
Training vectors: x_i, i = 1, ..., l
These are feature vectors. For example, a patient = [height, weight, ...]^T
Consider a simple case with two classes:
Define an indicator vector y ∈ R^l:
    y_i = +1 if x_i is in class 1, −1 if x_i is in class 2
A hyperplane which separates all data:
(Figure: circles and triangles separated by the parallel hyperplanes w^T x + b = +1, 0, −1)
A separating hyperplane: w^T x + b = 0, with
    w^T x_i + b ≥ +1 if y_i = 1
    w^T x_i + b ≤ −1 if y_i = −1
Decision function: f(x) = sgn(w^T x + b), where x is a test instance
Many possible choices of w and b
Maximal Margin
Distance between w^T x + b = 1 and w^T x + b = −1:
    2/||w|| = 2/sqrt(w^T w)
A quadratic programming problem (Boser et al., 1992)
    min_{w,b}   (1/2) w^T w
    subject to  y_i(w^T x_i + b) ≥ 1, i = 1, ..., l.
Example
Given two training data points in R^1 as in the following figure:
(Figure: a triangle at x = 0 and a circle at x = 1)
What is the separating hyperplane?
The two data points are x_1 = 1, x_2 = 0 with y = [+1, −1]^T
Example (Cont’d)
Now w ∈ R^1. The optimization problem is
    min_{w,b}   (1/2) w^2
    subject to  w · 1 + b ≥ 1,        (1)
                −(w · 0 + b) ≥ 1.     (2)
From (2), −b ≥ 1.
Putting this into (1), w ≥ 2.
That is, for any (w, b) satisfying (1) and (2), w ≥ 2.
Example (Cont’d)
We are minimizing (1/2)w^2, so the smallest possible value is w = 2.
Thus, (w , b) = (2, −1) is the optimal solution.
The separating hyperplane is 2x − 1 = 0, in the middle of the two training data:
(Figure: the triangle at x = 0, the circle at x = 1, and the separating point x = 1/2)
Data May Not Be Linearly Separable
An example:
(Figure: circles and triangles that cannot be separated by any hyperplane)
Allow training errors
Map data to a higher dimensional (maybe infinite) feature space:
    φ(x) = [φ_1(x), φ_2(x), ...]^T
Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)
    min_{w,b,ξ}  (1/2) w^T w + C sum_{i=1}^l ξ_i
    subject to   y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,
                 ξ_i ≥ 0, i = 1, ..., l.
Example: x ∈ R^3, φ(x) ∈ R^10
    φ(x) = [1, √2 x_1, √2 x_2, √2 x_3, x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3]^T
Finding the Decision Function
w: possibly infinitely many variables
The dual problem has a finite number of variables:
    min_α       (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
                y^T α = 0,
where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T
At optimum
    w = sum_{i=1}^l α_i y_i φ(x_i)
A finite problem: #variables = #training data
Kernel Tricks
Q_ij = y_i y_j φ(x_i)^T φ(x_j) needs a closed form
Example: x_i ∈ R^3, φ(x_i) ∈ R^10
    φ(x_i) = [1, √2 (x_i)_1, √2 (x_i)_2, √2 (x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2,
              √2 (x_i)_1 (x_i)_2, √2 (x_i)_1 (x_i)_3, √2 (x_i)_2 (x_i)_3]^T
Then φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)^2.
Kernel: K(x, y) = φ(x)^T φ(y); common kernels:
    e^{−γ ||x_i − x_j||^2}   (Radial Basis Function or Gaussian kernel)
    (x_i^T x_j / a + b)^d    (Polynomial kernel)
Can be an inner product in an infinite dimensional space. Assume x ∈ R^1 and γ > 0.
    e^{−γ ||x_i − x_j||^2} = e^{−γ (x_i − x_j)^2} = e^{−γ x_i^2 + 2γ x_i x_j − γ x_j^2}
    = e^{−γ x_i^2 − γ x_j^2} (1 + (2γ x_i x_j)/1! + (2γ x_i x_j)^2/2! + (2γ x_i x_j)^3/3! + ...)
    = e^{−γ x_i^2 − γ x_j^2} (1 · 1 + sqrt(2γ/1!) x_i · sqrt(2γ/1!) x_j
        + sqrt((2γ)^2/2!) x_i^2 · sqrt((2γ)^2/2!) x_j^2
        + sqrt((2γ)^3/3!) x_i^3 · sqrt((2γ)^3/3!) x_j^3 + ...)
    = φ(x_i)^T φ(x_j),
where
    φ(x) = e^{−γ x^2} [1, sqrt(2γ/1!) x, sqrt((2γ)^2/2!) x^2, sqrt((2γ)^3/3!) x^3, ...]^T.
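As a quick sanity check of the kernel trick (my own sketch, not part of the original slides), the following Python snippet numerically verifies that the explicit degree-2 mapping for x ∈ R^3 gives the same value as the closed-form kernel (1 + x^T y)^2.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 mapping for x in R^3 (10-dimensional)."""
    s = np.sqrt(2.0)
    return np.array([1.0,
                     s * x[0], s * x[1], s * x[2],
                     x[0] ** 2, x[1] ** 2, x[2] ** 2,
                     s * x[0] * x[1], s * x[0] * x[2], s * x[1] * x[2]])

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -3.0, 2.0])

lhs = phi(x) @ phi(y)          # inner product in the mapped space
rhs = (1.0 + x @ y) ** 2       # closed-form kernel value
print(lhs, rhs)                # both print the same number
```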
Decision function
At optimum
    w = sum_{i=1}^l α_i y_i φ(x_i)
Decision function:
    w^T φ(x) + b = sum_{i=1}^l α_i y_i φ(x_i)^T φ(x) + b
                 = sum_{i=1}^l α_i y_i K(x_i, x) + b
Only the φ(x_i) with α_i > 0 are used ⇒ support vectors
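A minimal sketch of how the kernel decision function above can be evaluated, assuming an RBF kernel and that alpha, y, b and the training points have already been obtained from some solver (the function names here are my own):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_value(x, training_points, alpha, y, b, gamma=1.0):
    """w^T phi(x) + b expressed through the kernel; only alpha_i > 0 matter."""
    val = b
    for x_i, a_i, y_i in zip(training_points, alpha, y):
        if a_i > 0:                      # non-support vectors contribute nothing
            val += a_i * y_i * rbf_kernel(x_i, x, gamma)
    return val

# prediction: np.sign(decision_value(...))
```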
Support Vectors: More Important Data
Only φ(xi) of αi > 0 used ⇒ support vectors
(Figure: a two-class toy example illustrating support vectors)
See more examples via SVM Toy available at libsvm web page
(http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
Example: Primal-dual Relationship
If the data are separable, the primal problem does not have ξ_i:
    min_{w,b}   (1/2) w^T w
    subject to  y_i(w^T x_i + b) ≥ 1, i = 1, ..., l.
The dual problem is
    min_α       (1/2) sum_{i=1}^l sum_{j=1}^l α_i α_j y_i y_j x_i^T x_j − sum_{i=1}^l α_i
    subject to  0 ≤ α_i, i = 1, ..., l,
                sum_{i=1}^l y_i α_i = 0.
Example: Primal-dual Relationship (Cont’d)
Consider the earlier example:
(Figure: the triangle at x = 0 and the circle at x = 1)
The two data points are x_1 = 1, x_2 = 0 with y = [+1, −1]^T
The solution is (w, b) = (2, −1)
Example: Primal-dual Relationship (Cont’d)
The dual objective function is
    (1/2) [α_1 α_2] [ 1 0 ; 0 0 ] [α_1 ; α_2] − [1 1] [α_1 ; α_2]
    = (1/2) α_1^2 − (α_1 + α_2)
In optimization, the objective function is the function to be optimized
Constraints are
α1 − α2 = 0, 0 ≤ α1, 0 ≤ α2.
Example: Primal-dual Relationship (Cont’d)
Substituting α_2 = α_1 into the objective function,
    (1/2) α_1^2 − 2 α_1
has its smallest value at α_1 = 2.
Because [2, 2]^T satisfies the constraints 0 ≤ α_1 and 0 ≤ α_2, it is optimal
Example: Primal-dual Relationship (Cont’d)
Using the primal-dual relation,
    w = y_1 α_1 x_1 + y_2 α_2 x_2 = 1 · 2 · 1 + (−1) · 2 · 0 = 2
This is the same as the solution obtained by solving the primal problem.
More about Support vectors
We know
    α_i > 0 ⇒ support vector
We have
    y_i(w^T x_i + b) < 1 ⇒ α_i > 0 ⇒ support vector,
    y_i(w^T x_i + b) = 1 ⇒ α_i ≥ 0 ⇒ maybe a support vector, and
    y_i(w^T x_i + b) > 1 ⇒ α_i = 0 ⇒ not a support vector
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Convex Optimization I
Convex problems are an important class of optimization problems that possess nice properties.
A function f is convex if for all x, y:
    f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y), ∀ θ ∈ [0, 1]
That is, the line segment between any two points on the graph is never below the function value
Convex Optimization II
(Figure: a convex function; the chord joining (x, f(x)) and (y, f(y)) lies above the curve)
Convex Optimization III
A convex optimization problem takes the following form:
    min         f_0(w)
    subject to  f_i(w) ≤ 0, i = 1, ..., m,          (3)
                h_i(w) = 0, i = 1, ..., p,
where f_0, ..., f_m are convex functions and h_1, ..., h_p are affine (i.e., a linear function plus a constant):
    h_i(w) = a^T w + b
Convex Optimization IV
A nice property of convex optimization problems is that
    inf_w { f_0(w) | w satisfies the constraints }
is unique
That is, the optimal objective value is unique, but the optimal w may not be
There are other nice properties such as the primal-dual relationship that we will use
To learn more about convex optimization, you can check the book by Boyd and Vandenberghe (2004)
Deriving the Dual
For simplicity, consider the problem without ξ_i:
    min_{w,b}   (1/2) w^T w
    subject to  y_i(w^T φ(x_i) + b) ≥ 1, i = 1, ..., l.
Its dual is
    min_α       (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i, i = 1, ..., l,
                y^T α = 0,
where
    Q_ij = y_i y_j φ(x_i)^T φ(x_j)
Lagrangian Dual I
Lagrangian dual:
    max_{α ≥ 0}  min_{w,b}  L(w, b, α),
where
    L(w, b, α) = (1/2) ||w||^2 − sum_{i=1}^l α_i [ y_i(w^T φ(x_i) + b) − 1 ]
Lagrangian Dual II
Strong duality:
    min Primal = max_{α ≥ 0} min_{w,b} L(w, b, α)
After SVM became popular, quite a few people think that for any optimization problem the Lagrangian dual exists and strong duality holds
Wrong! We usually need
    the optimization problem to be convex, and
    certain constraint qualifications to hold (details not discussed)
Lagrangian Dual III
We have both: the SVM primal is convex and has linear constraints
Simplify the dual. When α is fixed,
    min_{w,b} L(w, b, α) =
        −∞                                                            if sum_{i=1}^l α_i y_i ≠ 0,
        min_w (1/2) w^T w − sum_{i=1}^l α_i [ y_i w^T φ(x_i) − 1 ]    if sum_{i=1}^l α_i y_i = 0.
If sum_{i=1}^l α_i y_i ≠ 0, we can decrease
    −b sum_{i=1}^l α_i y_i
in L(w, b, α) to −∞.
If sum_{i=1}^l α_i y_i = 0, the optimum of the strictly convex function
    (1/2) w^T w − sum_{i=1}^l α_i [ y_i w^T φ(x_i) − 1 ]
happens when
    ∇_w L(w, b, α) = 0.
Thus,
    w = sum_{i=1}^l α_i y_i φ(x_i).
Note that
    w^T w = ( sum_{i=1}^l α_i y_i φ(x_i) )^T ( sum_{j=1}^l α_j y_j φ(x_j) ) = sum_{i,j} α_i α_j y_i y_j φ(x_i)^T φ(x_j)
The dual is
    max_{α ≥ 0}   sum_{i=1}^l α_i − (1/2) sum_{i,j} α_i α_j y_i y_j φ(x_i)^T φ(x_j)   if sum_{i=1}^l α_i y_i = 0,
                  −∞                                                                  if sum_{i=1}^l α_i y_i ≠ 0.
Recall the Lagrangian dual is max_{α ≥ 0} min_{w,b} L(w, b, α)
−∞ is definitely not the maximum of the dual, so the dual optimum does not happen when
    sum_{i=1}^l α_i y_i ≠ 0.
The dual is simplified to
    max_{α ∈ R^l}  sum_{i=1}^l α_i − (1/2) sum_{i=1}^l sum_{j=1}^l α_i α_j y_i y_j φ(x_i)^T φ(x_j)
    subject to     y^T α = 0,
                   α_i ≥ 0, i = 1, ..., l.
Our problems may be infinite dimensional (i.e., w ∈ R∞)
We can still use Lagrangian duality; see a rigorous discussion in Lin (2001)
Primal versus Dual I
Recall the dual problem is
    min_α       (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
                y^T α = 0
and at optimum
    w = sum_{i=1}^l α_i y_i φ(x_i)     (4)
Primal versus Dual II
What if we put (4) into the primal?
    min_{α,ξ}   (1/2) α^T Q α + C sum_{i=1}^l ξ_i
    subject to  (Qα + by)_i ≥ 1 − ξ_i,          (5)
                ξ_i ≥ 0
Note that
    y_i w^T φ(x_i) = y_i sum_{j=1}^l α_j y_j φ(x_j)^T φ(x_i) = sum_{j=1}^l Q_ij α_j = (Qα)_i
Primal versus Dual III
If Q is positive definite, we can prove that the optimal α of (5) is the same as that of the dual.
So the dual is not the only way to obtain the model
Large Dense Quadratic Programming
    min_α       (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
                y^T α = 0
Q_ij ≠ 0 in general, so Q is an l by l fully dense matrix
50,000 training points ⇒ 50,000 variables:
    (50,000^2 × 8 / 2) bytes = 10GB RAM to store Q
Large Dense Quadratic Programming (Cont’d)
For quadratic programming problems, traditional optimization methods assume that Q is available in the computer memory
They cannot be directly applied here because Q cannot even be stored
Currently, decomposition methods (a type of coordinate descent method) are what is used in practice
Decomposition Methods
Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)
Similar to coordinate-wise minimization
Working set B; N = {1, ..., l} \ B is fixed
Sub-problem at the kth iteration:
    min_{α_B}   (1/2) [α_B^T (α_N^k)^T] [ Q_BB Q_BN ; Q_NB Q_NN ] [α_B ; α_N^k] − [e_B^T e_N^T] [α_B ; α_N^k]
    subject to  0 ≤ α_t ≤ C, t ∈ B,
                y_B^T α_B = −y_N^T α_N^k
Avoid Memory Problems
The new objective function is
    (1/2) [α_B^T (α_N^k)^T] [ Q_BB α_B + Q_BN α_N^k ; Q_NB α_B + Q_NN α_N^k ] − e_B^T α_B + constant
    = (1/2) α_B^T Q_BB α_B + (−e_B + Q_BN α_N^k)^T α_B + constant
Only |B| columns of Q are needed
In general |B| ≤ 10 is used
Columns are calculated when used: trade time for space
But is such an approach practical?
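To illustrate the "calculated when used" idea, here is a rough Python sketch (my own illustration, not LIBSVM code): for a working set B, only the |B| columns of Q touched by the sub-problem are formed, so memory stays at O(l·|B|) instead of O(l^2).

```python
import numpy as np

def rbf_kernel_row(X, x, gamma=1.0):
    """Kernel values between every row of X and a single point x."""
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def q_columns(X, y, B, gamma=1.0):
    """Compute only the columns of Q = (y_i y_j K(x_i, x_j)) indexed by B."""
    l = X.shape[0]
    cols = np.empty((l, len(B)))
    for k, j in enumerate(B):
        cols[:, k] = y * y[j] * rbf_kernel_row(X, X[j], gamma)
    return cols

# Example: with l = 50,000 the full Q needs ~10GB, but a working set of
# size 2 needs only two columns (a few hundred KB) at a time.
```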
How Decomposition Methods Perform?
Convergence is not very fast because only first-order information is used
But there is no need to have a very accurate α. The decision function is
    sgn(w^T φ(x) + b) = sgn( sum_{i=1}^l α_i K(x_i, x) + b )
Prediction may still be correct with a rough α
Further, in some situations,
    # support vectors ≪ # training points
With the initial α^1 = 0, some instances are never used
How Decomposition Methods Perform?
(Cont’d)
An example of training 50,000 instances using the software LIBSVM
$svm-train -c 16 -g 4 -m 400 22features Total nSV = 3370
Time 79.524s
This was done on a typical desktop
Calculating the whole Q takes more time
#SVs = 3,370 ≪ 50,000
A good case where some α_i remain at zero all the time
How Decomposition Methods Perform?
(Cont’d)
Because many α_i = 0 in the end, we can develop a shrinking technique
Variables are removed during the optimization procedure. Smaller problems are solved
Machine Learning Properties are Useful in Designing Optimization Algorithms
We have seen that special properties of SVM contribute to the viability of decomposition methods
For machine learning applications, there is no need to accurately solve the optimization problem
Because some optimal α_i = 0, decomposition methods may not need to update all the variables
Also, we can use shrinking techniques to reduce the problem size during the decomposition procedure
Differences between Optimization and Machine Learning
The two topics may have different focuses. We give the following example
The decomposition method we just discussed converges more slowly when C is large
Using C = 1 on a data set: # iterations = 508
Using C = 5,000: # iterations = 35,241
Optimization researchers may rush to solve difficult cases of large C
It turns out that large C is used less often than small C
Recall that SVM solves
    (1/2) w^T w + C (sum of training losses)
A large C means overfitting the training data
This does not give good test accuracy. More details about overfitting will be discussed later
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Equivalent Optimization Problem
• Recall the SVM optimization problem is
    min_{w,b,ξ}  (1/2) w^T w + C sum_{i=1}^l ξ_i
    subject to   y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,
                 ξ_i ≥ 0, i = 1, ..., l.
• It is equivalent to
    min_{w,b}   (1/2) w^T w + C sum_{i=1}^l max(0, 1 − y_i(w^T φ(x_i) + b))
• The reformulation is useful to derive SVM from a different viewpoint
Equivalent Optimization Problem (Cont’d)
That is, at optimum,
    ξ_i = max(0, 1 − y_i(w^T φ(x_i) + b))
Reason: from the constraints,
    ξ_i ≥ 1 − y_i(w^T φ(x_i) + b) and ξ_i ≥ 0,
but we also want to minimize ξ_i
Linear and Kernel I
Linear classifier:
    sgn(w^T x + b)
Kernel classifier:
    sgn(w^T φ(x) + b) = sgn( sum_{i=1}^l α_i K(x_i, x) + b )
Linear is a special case of kernel
An important difference is that for linear we can store w
Linear and Kernel II
For kernel, w may be infinite dimensional and cannot be stored
We will show that they are useful in different circumstances
The Bias Term b
Recall the decision function is sgn(w^T x + b)
Sometimes the bias term b is omitted:
    sgn(w^T x)
This is fine if the number of features is not too small
Minimizing Training Errors
For classification we naturally aim to minimize the training error:
    min_w (training errors)
To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)
Ideally we should use the 0–1 training loss:
    ξ(w; x, y) = 1 if y w^T x < 0, and 0 otherwise
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
(Figure: the discontinuous 0–1 loss ξ(w; x, y) plotted against −y w^T x)
We need continuous approximations
Common Loss Functions
Hinge loss (l1 loss):
    ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x)           (6)
Squared hinge loss (l2 loss):
    ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)^2         (7)
Logistic loss:
    ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x})         (8)
SVM: (6)-(7). Logistic regression (LR): (8)
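For concreteness, a small sketch (my own, not from the slides) computing the three losses (6)-(8) for a single instance:

```python
import numpy as np

def hinge_loss(w, x, y):          # (6) l1 loss
    return max(0.0, 1.0 - y * np.dot(w, x))

def squared_hinge_loss(w, x, y):  # (7) l2 loss
    return max(0.0, 1.0 - y * np.dot(w, x)) ** 2

def logistic_loss(w, x, y):       # (8) used by logistic regression
    return np.log(1.0 + np.exp(-y * np.dot(w, x)))

w = np.array([0.5, -0.2])
x = np.array([1.0, 3.0])
print(hinge_loss(w, x, 1), squared_hinge_loss(w, x, 1), logistic_loss(w, x, 1))
```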
Common Loss Functions (Cont’d)
(Figure: ξ_L1, ξ_L2, and ξ_LR plotted against −y w^T x as continuous approximations of the 0–1 loss)
Logistic regression is closely related to SVM
Their performance (i.e., test accuracy) is usually similar
Common Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification, you can easily achieve 100% training accuracy. This is useless
When training a data set, we should
    avoid underfitting: small training error
    avoid overfitting: small testing error
(Figure: an overfitting illustration; filled symbols are training data and triangles are testing data)
Regularization
To minimize the training error, we manipulate the w vector so that it fits the data
To avoid overfitting, we need a way to make w's values less extreme
One idea is to make w's values closer to zero
We can add, for example,
    w^T w / 2   or   ||w||_1
to the objective function
General Form of Linear Classification I
Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l , yi = ±1 l : # of data, n: # of features
    min_w f(w),   f(w) ≡ w^T w / 2 + C sum_{i=1}^l ξ(w; x_i, y_i)     (9)
w^T w / 2: regularization term
ξ(w; x, y): loss function
C: regularization parameter
General Form of Linear Classification II
Of course we can map data to a higher dimensional space
    min_w f(w),   f(w) ≡ w^T w / 2 + C sum_{i=1}^l ξ(w; φ(x_i), y_i)
SVM and Logistic Regression I
If the hinge (l1) loss is used, the optimization problem is
    min_w   (1/2) w^T w + C sum_{i=1}^l max(0, 1 − y_i w^T x_i)
It is the SVM problem we had earlier (without the bias term b)
Therefore, we have derived SVM from a different viewpoint
We also see that SVM is very related to logistic regression
SVM and Logistic Regression II
However, many people wrongly think that they are very different
The reason for this misunderstanding: traditionally,
    when people say SVM ⇒ they mean kernel SVM
    when people say logistic regression ⇒ they mean linear logistic regression
Indeed we can do kernel logistic regression:
    min_w   (1/2) w^T w + C sum_{i=1}^l log(1 + e^{−y_i w^T φ(x_i)})
SVM and Logistic Regression III
A main difference from SVM is that logistic regression has a probability interpretation
We will introduce logistic regression from another viewpoint
Logistic Regression
For a label-feature pair (y, x), assume the probability model is
    p(y | x) = 1 / (1 + e^{−y w^T x}).
Note that
    p(1 | x) + p(−1 | x) = 1/(1 + e^{−w^T x}) + 1/(1 + e^{w^T x})
                         = e^{w^T x}/(1 + e^{w^T x}) + 1/(1 + e^{w^T x}) = 1
w is the parameter to be decided
Logistic Regression (Cont’d)
Idea of this model:
    p(1 | x) = 1/(1 + e^{−w^T x})   → 1 if w^T x ≫ 0,   → 0 if w^T x ≪ 0
Assume training instances are
    (y_i, x_i), i = 1, ..., l
Logistic Regression (Cont’d)
Logistic regression finds w by maximizing the following likelihood:
    max_w   prod_{i=1}^l p(y_i | x_i).          (10)
Negative log-likelihood:
    − log prod_{i=1}^l p(y_i | x_i) = − sum_{i=1}^l log p(y_i | x_i) = sum_{i=1}^l log(1 + e^{−y_i w^T x_i})
Logistic Regression (Cont’d)
Logistic regression:
    min_w   sum_{i=1}^l log(1 + e^{−y_i w^T x_i}).
Regularized logistic regression:
    min_w   (1/2) w^T w + C sum_{i=1}^l log(1 + e^{−y_i w^T x_i}).     (11)
C: regularization parameter decided by users
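A short sketch (my own illustration) of evaluating the regularized logistic regression objective (11); np.logaddexp is used to compute log(1 + e^{-m}) stably:

```python
import numpy as np

def regularized_lr_objective(w, X, y, C):
    """Objective (11): 0.5 * w^T w + C * sum_i log(1 + exp(-y_i * w^T x_i)).

    X: l-by-n data matrix, y: labels in {+1, -1}, C: regularization parameter.
    """
    margins = y * (X @ w)
    # np.logaddexp(0, -m) computes log(1 + exp(-m)) without overflow
    return 0.5 * np.dot(w, w) + C * np.sum(np.logaddexp(0.0, -margins))
```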
Loss Functions: Differentiability
However,
ξ_L1: not differentiable
ξ_L2: differentiable but not twice differentiable
ξ_LR: twice differentiable
The same optimization method may not be applicable to all these losses
Discussion
We see that the same classification method can be derived in different ways
SVM:
    maximal margin
    regularization and training losses
LR:
    regularization and training losses
    maximum likelihood
Regularization
L1 versus L2
||w||_1 and w^T w / 2
    w^T w / 2: smooth, easier to optimize
    ||w||_1: non-differentiable; sparse solution (possibly many zero elements)
Possible advantages of L1 regularization:
    feature selection
    less storage for w
Linear and Kernel Classification
Methods such as SVM and logistic regression can be used in two ways
Kernel methods: data mapped to a higher dimensional space
    x ⇒ φ(x)
    φ(x_i)^T φ(x_j) is easily calculated; little control over φ(·)
Feature engineering + linear classification:
    we have x without mapping. Alternatively, we can say that φ(x) is our x; full control over x or φ(x)
We refer to them as kernel and linear classifiers
Linear and Kernel Classification
Let's check the prediction cost:
    w^T x + b   versus   sum_{i=1}^l α_i y_i K(x_i, x) + b
If computing K(x_i, x_j) takes O(n), then the costs are
    O(n) versus O(nl)
Linear is much cheaper
A similar difference occurs for training
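The cost difference can be seen in a small sketch (my own, using a linear kernel so the two forms give the same value): the linear form needs one O(n) inner product per prediction, while the kernel form needs O(nl) work.

```python
import numpy as np

l, n = 10000, 100                      # l training points, n features
rng = np.random.default_rng(0)
X, y = rng.normal(size=(l, n)), rng.choice([-1.0, 1.0], size=l)
alpha, b = rng.random(l), 0.1
w = (alpha * y) @ X                    # linear model: w = sum_i alpha_i y_i x_i
x = rng.normal(size=n)                 # one test point

# Linear prediction: O(n)
linear_val = w @ x + b

# Kernel prediction with the linear kernel K(x_i, x) = x_i^T x: O(nl)
kernel_val = np.sum(alpha * y * (X @ x)) + b

print(linear_val, kernel_val)          # identical values, very different costs
```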
Linear and Kernel Classification (Cont’d)
In a sense, linear is a special case of kernel
Indeed, we can prove that test accuracy of linear is the same as Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)
Therefore, roughly we have
test accuracy: kernel ≥ linear
cost: kernel ≫ linear
Speed is the reason to use linear
Linear and Kernel Classification (Cont’d)
For some problems, accuracy by linear is as good as nonlinear
But training and testing are much faster
This particularly happens for document classification
    Number of features (bag-of-words model) is very large
    Data are very sparse (i.e., few non-zeros)
Comparison Between Linear and Kernel (Training Time & Testing Accuracy)
Data set        Linear: Time   Accuracy    RBF Kernel: Time   Accuracy
MNIST38         0.1            96.82       38.1               99.70
ijcnn1          1.6            91.81       26.8               98.69
covtype         1.4            76.37       46,695.8           96.11
news20          1.1            96.95       383.2              96.90
real-sim        0.3            97.44       938.3              97.82
yahoo-japan     3.1            92.63       20,955.2           93.31
webspam         25.7           93.35       15,681.8           99.26
Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features
Extension: Training Explicit Form of Nonlinear Mappings I
Linear-SVM method to train φ(x_1), ..., φ(x_l); kernel not used
Applicable only if the dimension of φ(x) is not too large
Low-degree polynomial mappings:
    K(x_i, x_j) = (x_i^T x_j + 1)^2 = φ(x_i)^T φ(x_j)
    φ(x) = [1, √2 x_1, ..., √2 x_n, x_1^2, ..., x_n^2, √2 x_1 x_2, ..., √2 x_{n−1} x_n]^T
Extension: Training Explicit Form of Nonlinear Mappings II
For this mapping, # features = O(n^2)
Recall the cost is O(n) for linear versus O(nl) for kernel; now it is O(n^2) versus O(nl)
Sparse data:
    n ⇒ n̄, the average # of non-zeros per instance
    n̄ ≪ n, so O(n̄^2) may be much smaller than O(l n̄)
When the degree is small, train the explicit form of φ(x)
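A rough sketch of this approach using scikit-learn (my own assumption for illustration; the experiments reported in these slides were actually run with LIBLINEAR and LIBSVM): expand the degree-2 features explicitly, then train a linear SVM on the mapped data.

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# Explicit degree-2 polynomial mapping phi(x); dimension grows as O(n^2)
phi = PolynomialFeatures(degree=2, include_bias=True)
X_mapped = phi.fit_transform(X)

# Train the mapped data with a *linear* SVM; no kernel is used
clf = LinearSVC(C=1.0, max_iter=10000)
clf.fit(X_mapped, y)
print(clf.score(X_mapped, y))   # training accuracy, for illustration only
```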
Testing Accuracy and Training Time
Data set    Training time (s)            Degree-2 poly    Accuracy diff.
            LIBLINEAR     LIBSVM         accuracy         vs. Linear   vs. RBF
a9a         1.6           89.8           85.06            0.07         0.02
real-sim    59.8          1,220.5        98.00            0.49         0.10
ijcnn1      10.7          64.2           97.84            5.63         −0.85
MNIST38     8.6           18.4           99.29            2.47         −0.40
covtype     5,211.9       NA             80.09            3.74         −15.98
webspam     3,228.1       NA             98.44            5.29         −0.76
Training φ(xi) by linear: faster than kernel, but sometimes competitive accuracy
Example: Dependency Parsing I
This is an NLP Application
                Kernel                        Linear
                RBF          Poly-2           Linear      Poly-2
Training time   3h34m53s     3h21m51s         3m36s       3m43s
Parsing speed   0.7x         1x               1652x       103x
UAS             89.92        91.67            89.11       91.71
LAS             88.55        90.60            88.07       90.71
We get faster training/testing while maintaining good accuracy
Example: Dependency Parsing II
We achieve this by training low-degree polynomial-mapped data with linear classification
That is, linear methods are used to explicitly train φ(x_i), ∀ i
We consider the following low-degree polynomial mapping:
    φ(x) = [1, x_1, ..., x_n, x_1^2, ..., x_n^2, x_1 x_2, ..., x_{n−1} x_n]^T
Handling High Dimensionality of φ(x)
A multi-class problem with sparse data:
    n          Dim. of φ(x)      l          n̄       # nonzeros of w
    46,155     1,065,165,090     204,582    13.3    1,438,456
n̄: average # of nonzeros per instance
The dimensionality of w is very high, but w is sparse
Some training feature columns of x_i x_j are entirely zero
Hashing techniques are used to handle sparse w
Example: Classifier in a Small Device
In a sensor application (Yu et al., 2014), the classifier can use less than 16KB of RAM
Classifiers             Test accuracy   Model size
Decision Tree           77.77           76.02KB
AdaBoost (10 trees)     78.84           1,500.54KB
SVM (RBF kernel)        85.33           1,287.15KB
Number of features: 5
We consider a degree-3 polynomial mapping, whose dimensionality is
    C(5 + 3, 3) + bias term = 56 + 1 = 57.
Example: Classifier in a Small Device
One-against-one strategy for 5-class classification:
    C(5, 2) × 57 × 4 bytes = 2.28KB
Assume single precision
Results
SVM method Test accuracy Model Size
RBF kernel 85.33 1,287.15KB
Polynomial kernel 84.79 2.28KB
Linear kernel 78.51 0.24KB
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Multi-class Classification
SVM and logistic regression are methods for two-class classification
We need certain ways to extend them for multi-class problems
This is not a problem for methods such as nearest neighbor or decision trees
Multi-class Classification (Cont’d)
k classes
One-against-the-rest: train k binary SVMs:
    1st class vs. (2, ..., k)th classes
    2nd class vs. (1, 3, ..., k)th classes
    ...
k decision functions:
    (w^1)^T φ(x) + b_1
    ...
    (w^k)^T φ(x) + b_k
Prediction:
    arg max_j (w^j)^T φ(x) + b_j
Reason: if x is in the 1st class, then we should have
    (w^1)^T φ(x) + b_1 ≥ +1
    (w^2)^T φ(x) + b_2 ≤ −1
    ...
    (w^k)^T φ(x) + b_k ≤ −1
Multi-class Classification (Cont’d)
One-against-one: train k(k − 1)/2 binary SVMs:
    (1, 2), (1, 3), ..., (1, k), (2, 3), (2, 4), ..., (k − 1, k)
If there are 4 classes ⇒ 6 binary SVMs:
    y_i = 1     y_i = −1    Decision function
    class 1     class 2     f_12(x) = (w_12)^T x + b_12
    class 1     class 3     f_13(x) = (w_13)^T x + b_13
    class 1     class 4     f_14(x) = (w_14)^T x + b_14
    class 2     class 3     f_23(x) = (w_23)^T x + b_23
    class 2     class 4     f_24(x) = (w_24)^T x + b_24
    class 3     class 4     f_34(x) = (w_34)^T x + b_34
For a testing instance, predict with all binary SVMs:
    Classes     Winner
    1 vs 2      1
    1 vs 3      1
    1 vs 4      1
    2 vs 3      2
    2 vs 4      4
    3 vs 4      3
Select the class with the largest vote:
    class       1   2   3   4
    # votes     3   1   1   1
Decision values may be used as well
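A small sketch (my own, with the winners from the table above hard-coded) of the one-against-one voting procedure:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, binary_predict):
    """Vote among the k(k-1)/2 binary classifiers.

    binary_predict(x, i, j) is assumed to return the winning class (i or j)
    of the classifier trained on classes i and j.
    """
    votes = Counter(binary_predict(x, i, j) for i, j in combinations(classes, 2))
    return votes.most_common(1)[0][0]   # class with the largest vote

# Reproduce the example: winners 1, 1, 1, 2, 4, 3  =>  class 1 has 3 votes
winners = {(1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 3): 2, (2, 4): 4, (3, 4): 3}
print(one_vs_one_predict(None, [1, 2, 3, 4], lambda x, i, j: winners[(i, j)]))
```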
Solving a Single Problem
An approach by Crammer and Singer (2002)
    min_{w_1,...,w_k}   (1/2) sum_{m=1}^k ||w_m||_2^2 + C sum_{i=1}^l ξ({w_m}_{m=1}^k; x_i, y_i),
where
    ξ({w_m}_{m=1}^k; x, y) ≡ max_{m ≠ y} max(0, 1 − (w_y − w_m)^T x).
We hope the decision value of x_i by the model w_{y_i} is larger than the others
Prediction: same as one-against-the-rest,
    arg max_j (w_j)^T x
Discussion
Other variants of solving a single optimization problem include Weston and Watkins (1999); Lee et al. (2004)
A comparison in Hsu and Lin (2002)
RBF kernel: accuracy is similar for the different methods
But 1-against-1 is the fastest for training
Maximum Entropy
Maximum Entropy: a generalization of logistic regression for multi-class problems
It is widely used in NLP applications.
The conditional probability of label y given data x is
    P(y | x) ≡ exp(w_y^T x) / sum_{m=1}^k exp(w_m^T x)
Maximum Entropy (Cont’d)
We then minimize the regularized negative log-likelihood:
    min_{w_1,...,w_k}   (1/2) sum_{m=1}^k ||w_m||^2 + C sum_{i=1}^l ξ({w_m}_{m=1}^k; x_i, y_i),
where
    ξ({w_m}_{m=1}^k; x, y) ≡ − log P(y | x).
Maximum Entropy (Cont’d)
Is this loss function reasonable?
If
    w_{y_i}^T x_i ≫ w_m^T x_i, ∀ m ≠ y_i,
then
    ξ({w_m}_{m=1}^k; x_i, y_i) ≈ 0
That is, no loss
In contrast, if
    w_{y_i}^T x_i ≪ w_m^T x_i for some m ≠ y_i,
then P(y_i | x_i) ≪ 1 and the loss is large.
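A short sketch (my own illustration) of the maximum entropy probability model and its loss −log P(y|x); the max-subtraction is only for numerical stability:

```python
import numpy as np

def maxent_probability(W, x):
    """P(y | x) = exp(w_y^T x) / sum_m exp(w_m^T x) for all k classes.

    W: k-by-n matrix whose rows are w_1, ..., w_k.
    """
    scores = W @ x
    scores -= scores.max()              # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def maxent_loss(W, x, y):
    """xi({w_m}; x, y) = -log P(y | x); y is a 0-based class index here."""
    return -np.log(maxent_probability(W, x)[y])
```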
Features as Functions
NLP applications often use a function f (x , y ) to generate the feature vector
    P(y | x) ≡ exp(w^T f(x, y)) / sum_{y'} exp(w^T f(x, y')).      (12)
The earlier probability model is a special case with
    f(x, y) = [0, ..., 0, x^T, 0, ..., 0]^T ∈ R^{nk}   (x placed in the yth block, after y − 1 zero blocks)
and
    w = [w_1^T, ..., w_k^T]^T
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Least Square Regression I
Given training data (x_1, y_1), ..., (x_l, y_l)
Now y_i ∈ R is the target value
Regression: find a function so that f(x_i) ≈ y_i
Least Square Regression II
Least square regression:
    min_{w,b}   sum_{i=1}^l (y_i − (w^T x_i + b))^2
That is, we model f(x) by
    f(x) = w^T x + b
An example:
Least Square Regression III
(Figure: height (cm) versus weight (kg) with the fitted line Height = 0.60 · Weight + 130.2; picture from http://tex.stackexchange.com/questions/119179/how-to-add-a-regression-line-to-randomly-generated-points-using-pgfplots-in-tikz)
Least Square Regression IV
This is equivalent to
    min_{w,b}   sum_{i=1}^l ξ(w, b; x_i, y_i),
where
    ξ(w, b; x_i, y_i) = (y_i − (w^T x_i + b))^2
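A minimal least-square regression sketch (my own illustration) that solves min_{w,b} sum_i (y_i − (w x_i + b))^2 in closed form with numpy:

```python
import numpy as np

# Toy data: y_i roughly follows 0.6 * x_i + 130 (like the height/weight figure)
rng = np.random.default_rng(0)
x = rng.uniform(40, 100, size=50)                  # e.g., weight in kg
y = 0.6 * x + 130 + rng.normal(0, 3, size=50)      # e.g., height in cm

# Ordinary least squares on the design matrix [x, 1]
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)                                        # w ~ 0.6, b ~ 130
```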
Regularized Least Square
ξ(w, b; x_i, y_i) is a kind of loss function
We can add regularization:
    min_{w,b}   (1/2) w^T w + C sum_{i=1}^l ξ(w, b; x_i, y_i)
C is still the regularization parameter
Other loss functions?
Support Vector Regression I
ε-insensitive loss function (b omitted):
    max(|w^T x_i − y_i| − ε, 0)         (L1)
    max(|w^T x_i − y_i| − ε, 0)^2       (L2)
Support Vector Regression II
(Figure: the L1 and L2 ε-insensitive losses as functions of w^T x_i − y_i; the loss is zero on [−ε, ε])
ε: errors small enough are treated as no error
This makes the model more robust (less overfitting of the data)
Support Vector Regression III
One more parameter (ε) to decide
An equivalent form of the optimization problem:
    min_{w,b,ξ,ξ*}   (1/2) w^T w + C sum_{i=1}^l ξ_i + C sum_{i=1}^l ξ_i*
    subject to       w^T φ(x_i) + b − y_i ≤ ε + ξ_i,
                     y_i − w^T φ(x_i) − b ≤ ε + ξ_i*,
                     ξ_i, ξ_i* ≥ 0, i = 1, ..., l.
This form is similar to the SVM formulation derived earlier
Support Vector Regression IV
The dual problem is
    min_{α,α*}   (1/2) (α − α*)^T Q (α − α*) + ε sum_{i=1}^l (α_i + α_i*) + sum_{i=1}^l y_i (α_i − α_i*)
    subject to   e^T (α − α*) = 0,
                 0 ≤ α_i, α_i* ≤ C, i = 1, ..., l,
where Q_ij = K(x_i, x_j) ≡ φ(x_i)^T φ(x_j).
Support Vector Regression V
After solving the dual problem,
    w = sum_{i=1}^l (−α_i + α_i*) φ(x_i)
and the approximate function is
    sum_{i=1}^l (−α_i + α_i*) K(x_i, x) + b.
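For experimentation, this formulation is available in common packages; a hedged sketch using scikit-learn's SVR (which wraps LIBSVM) is below. The parameter values are placeholders, not recommendations.

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=100)

# RBF-kernel support vector regression; epsilon is the insensitive-tube width
model = SVR(kernel="rbf", C=1.0, gamma=0.5, epsilon=0.1)
model.fit(X, y)
print(model.predict([[1.0], [2.5]]))
```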
Discussion
SVR and least-square regression are closely related
Why do people more commonly use l2 (least-square) rather than l1 losses?
It is easier because of differentiability
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
One-class SVM I
Separate data into normal points and outliers (Schölkopf et al., 2001)
    min_{w,ξ,ρ}   (1/2) w^T w − ρ + (1/(νl)) sum_{i=1}^l ξ_i
    subject to     w^T φ(x_i) ≥ ρ − ξ_i,
                   ξ_i ≥ 0, i = 1, ..., l.
One-class SVM II
Instead of the parameter C in SVM, here the parameter is ν.
The constraint
    w^T φ(x_i) ≥ ρ − ξ_i
means that we hope most data satisfy w^T φ(x_i) ≥ ρ.
That is, most data are on one side of the hyperplane; those on the wrong side are considered outliers
One-class SVM III
The dual problem is
    min_α        (1/2) α^T Q α
    subject to   0 ≤ α_i ≤ 1/(νl), i = 1, ..., l,
                 e^T α = 1,
where Q_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j).
The decision function is
    sgn( sum_{i=1}^l α_i K(x_i, x) − ρ ).
One-class SVM IV
The role of −ρ is similar to that of the bias term b earlier
From the dual problem we can see that ν ∈ (0, 1]
Otherwise, if ν > 1, then the box constraints give
    e^T α ≤ l · 1/(νl) = 1/ν < 1,
which violates the linear constraint e^T α = 1.
Clearly, a larger ν means we don't need to push ξ_i to zero ⇒ more data are considered as outliers
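To see the effect of ν, here is a hedged sketch using scikit-learn's OneClassSVM (which wraps the LIBSVM implementation of this formulation); a larger nu typically marks more points as outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))                  # mostly "normal" data
X = np.vstack([X, rng.uniform(-6, 6, (10, 2))])      # a few far-away points

for nu in (0.05, 0.2):
    model = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu)
    labels = model.fit_predict(X)                    # +1 = normal, -1 = outlier
    print(nu, np.sum(labels == -1))                  # larger nu => more outliers
```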