Support Vector Machines and Kernel Methods
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at International Workshop on Recent Trends in Learning,
Outline
- Basic concepts: SVM and kernels
- Training SVM
- Practical use of SVM
- Multi-class classification
- Research directions: large-scale training
- Research directions: linear SVM
- Research directions: others
- Conclusions
Support Vector Classification
Training vectors: x_i, i = 1, ..., l
These are feature vectors; for example, a patient = [height, weight, ...]^T
Consider a simple case with two classes. Define an indicator vector y:
    y_i = +1 if x_i is in class 1,  −1 if x_i is in class 2
A hyperplane which separates all data:
    w^T x + b = +1, 0, −1  (the separating hyperplane and the two margin hyperplanes)
A separating hyperplane w^T x + b = 0 satisfies
    w^T x_i + b ≥ 1   if y_i = 1
    w^T x_i + b ≤ −1  if y_i = −1
Decision function: f(x) = sgn(w^T x + b), where x is test data
Many possible choices of w and b
Maximal Margin
Distance between w^T x + b = 1 and w^T x + b = −1:
    2/‖w‖ = 2/√(w^T w)
Maximizing this margin is a quadratic programming problem (Boser et al., 1992):
    min_{w,b}  (1/2) w^T w
    subject to y_i(w^T x_i + b) ≥ 1, i = 1, ..., l.
Data May Not Be Linearly Separable
An example: [figure of a data set that no hyperplane separates]
Two remedies:
- Allow training errors
- Use a higher dimensional (maybe infinite) feature space:
    φ(x) = [φ_1(x), φ_2(x), ...]^T
Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995):
    min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
    subject to   y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0, i = 1, ..., l.
Example: x ∈ R^3, φ(x) ∈ R^10:
    φ(x) = [1, √2 x_1, √2 x_2, √2 x_3, x_1^2, x_2^2, x_3^2, √2 x_1x_2, √2 x_1x_3, √2 x_2x_3]^T
Finding the Decision Function
w: possibly infinitely many variables
The dual problem has a finite number of variables:
    min_α  (1/2) α^T Q α − e^T α
    subject to 0 ≤ α_i ≤ C, i = 1, ..., l,  y^T α = 0,
where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T
At optimum:
    w = Σ_{i=1}^{l} α_i y_i φ(x_i)
Kernel Tricks
Q_ij = y_i y_j φ(x_i)^T φ(x_j) needs a closed form
Example: x_i ∈ R^3, φ(x_i) ∈ R^10:
    φ(x_i) = [1, √2 (x_i)_1, √2 (x_i)_2, √2 (x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2,
              √2 (x_i)_1(x_i)_2, √2 (x_i)_1(x_i)_3, √2 (x_i)_2(x_i)_3]^T
Then φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)^2.
Kernel: K(x, y) = φ(x)^T φ(y); common kernels:
    e^{−γ‖x_i − x_j‖^2}     (radial basis function, RBF)
    (x_i^T x_j / a + b)^d   (polynomial)
The RBF kernel can be an inner product in an infinite dimensional space.
Assume x ∈ R^1 and γ > 0:
    e^{−γ‖x_i−x_j‖^2} = e^{−γ(x_i−x_j)^2} = e^{−γx_i^2 + 2γx_ix_j − γx_j^2}
    = e^{−γx_i^2−γx_j^2} (1 + (2γx_ix_j)/1! + (2γx_ix_j)^2/2! + (2γx_ix_j)^3/3! + ···)
    = e^{−γx_i^2−γx_j^2} (1·1 + √(2γ/1!) x_i · √(2γ/1!) x_j
          + √((2γ)^2/2!) x_i^2 · √((2γ)^2/2!) x_j^2
          + √((2γ)^3/3!) x_i^3 · √((2γ)^3/3!) x_j^3 + ···)
    = φ(x_i)^T φ(x_j),
where
    φ(x) = e^{−γx^2} [1, √(2γ/1!) x, √((2γ)^2/2!) x^2, √((2γ)^3/3!) x^3, ···]^T.
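The expansion can also be verified numerically; a small sketch that truncates the series to 20 terms (γ and the points are arbitrary):

    import numpy as np
    from math import factorial

    gamma, xi, xj = 0.5, 0.3, -0.7

    def phi(x, terms=20):
        # truncated version of the infinite-dimensional mapping above
        return np.exp(-gamma * x**2) * np.array(
            [np.sqrt((2 * gamma) ** k / factorial(k)) * x**k for k in range(terms)])

    print(np.exp(-gamma * (xi - xj) ** 2))  # exact RBF kernel value
    print(phi(xi) @ phi(xj))                # nearly identical after 20 terms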
Issues
So what kind of kernel should I use?
What kind of functions are valid kernels?
How to decide kernel parameters?
Some of these issues will be discussed later
Decision function
At optimum:
    w = Σ_{i=1}^{l} α_i y_i φ(x_i)
Decision function:
    w^T φ(x) + b = Σ_{i=1}^{l} α_i y_i φ(x_i)^T φ(x) + b
                 = Σ_{i=1}^{l} α_i y_i K(x_i, x) + b
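In code, the decision value needs only the support vectors; a sketch assuming an RBF kernel, with SVs, alpha, y, and b taken from a trained model:

    import numpy as np

    def decision_value(x, SVs, alpha, y, b, gamma):
        # f(x) = sum_i alpha_i y_i K(x_i, x) + b, with the RBF kernel
        K = np.exp(-gamma * np.sum((SVs - x) ** 2, axis=1))
        return np.dot(alpha * y, K) + b

    # predicted label: np.sign(decision_value(x, SVs, alpha, y, b, gamma))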
Support Vectors: More Important Data
Only the φ(x_i) with α_i > 0 are used ⇒ support vectors
[Figure: a two-class data set with its decision boundary; the support vectors are highlighted]
We have roughly shown the basic ideas of SVM.
A 3-D demonstration: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/svmtoy3d
More about Dual Problems
We omit detailed derivations here.
Quite a few people think that for any optimization problem, a Lagrangian dual exists. Wrong!
We usually need convex programming and constraint qualification.
We have both for SVM.
Training SVM
Large Dense Quadratic Programming
    min_α  (1/2) α^T Q α − e^T α
    subject to 0 ≤ α_i ≤ C, i = 1, ..., l,  y^T α = 0
Q_ij ≠ 0; Q is an l by l fully dense matrix
50,000 training points ⇒ 50,000 variables and
    (50,000^2 × 8 / 2) bytes ≈ 10GB RAM to store Q
Traditional optimization methods, which assume the whole Q is available, cannot be directly used
Decomposition Methods
Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)
Similar to coordinate-wise minimization
Working set B; N = {1, ..., l} \ B is fixed
Sub-problem at the kth iteration:

    min_{α_B}  (1/2) [α_B^T  (α_N^k)^T] [ Q_BB  Q_BN ] [ α_B   ]  −  [ e_B^T  e_N^T ] [ α_B   ]
                                        [ Q_NB  Q_NN ] [ α_N^k ]                      [ α_N^k ]
    subject to 0 ≤ α_t ≤ C, t ∈ B,  y_B^T α_B = −y_N^T α_N^k
Avoid Memory Problems
The sub-problem's objective function equals
    (1/2) α_B^T Q_BB α_B + (−e_B + Q_BN α_N^k)^T α_B + constant
Only |B| columns of Q are needed (|B| ≥ 2)
They are calculated when used: trade time for space
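A sketch of forming the sub-problem's linear term with only those columns; get_col is a hypothetical routine returning one kernel column, computed on demand or taken from a cache:

    import numpy as np

    def linear_term(B, N, alpha, get_col):
        # (Q_BN alpha_N)_t = sum_{j in N} Q[j, t] alpha[j], using Q's symmetry,
        # so only the |B| columns Q[:, t], t in B, are ever computed
        q = np.array([get_col(t)[N] @ alpha[N] for t in B])
        return q - 1.0   # -e_B + Q_BN alpha_N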
How Do Decomposition Methods Perform?
Convergence is not very fast
But there is no need to obtain a very accurate α: prediction is not affected much
In some situations, # support vectors ≪ # training points
With initial α^1 = 0, some instances are never used
An example of training 50,000 instances using LIBSVM:
$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time: 79.524s on a Xeon 2.0G machine
Calculating the whole Q would take more time
#SVs = 3,370 ≪ 50,000
A good case where some α_i remain at zero all the time
Issues of Decomposition Methods
Techniques for faster decomposition methods:
- storing recently used kernel elements
- working set size/selection
- theoretical issues: convergence
- and others (details not discussed here)
Major software:
- LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
- SVMlight: http://svmlight.joachims.org
Caching and Shrinking
Two techniques to speed up decomposition methods.
Caching (Joachims, 1998): store recently used kernel columns in computer memory
100K cache:
$ time ./svm-train -m 0.01 -g 0.01 a6a
13.62s
40M cache:
$ time ./svm-train -m 40 -g 0.01 a6a
11.40s
Shrinking (Joachims, 1998)
Some α elements remain bounded (at 0 or C) until the end; the problem is heuristically resized to a smaller one
See the -h 1 option in LIBSVM
After certain iterations, most bounded elements identified and not changed (Lin, 2002)
So caching and shrinking are useful
Practical Use of SVM
Let’s Try a Practical Example
A problem from astroparticle physics
1 1:2.61e+01 2:5.88e+01 3:-1.89e-01 4:1.25e+02
1 1:5.70e+01 2:2.21e+02 3:8.60e-02 4:1.22e+02
1 1:1.72e+01 2:1.73e+02 3:-1.29e-01 4:1.25e+02
...
0 1:2.39e+01 2:3.89e+01 3:4.70e-01 4:1.25e+02
0 1:2.23e+01 2:2.26e+01 3:2.11e-01 4:1.01e+02
0 1:1.64e+01 2:3.92e+01 3:-9.91e-02 4:3.24e+01
Training and testing sets available: 3,089 and 4,000 instances
Sparse format: zero values not stored
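A minimal reader for this format, as a sketch (svm-train parses it internally):

    def read_libsvm(path, n_features):
        labels, rows = [], []
        with open(path) as f:
            for line in f:
                items = line.split()
                labels.append(float(items[0]))
                x = [0.0] * n_features            # zeros are implicit in the file
                for item in items[1:]:
                    idx, val = item.split(':')
                    x[int(idx) - 1] = float(val)  # feature indices are 1-based
                rows.append(x)
        return labels, rows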
The Story Behind this Data Set
User: I am using libsvm in a astroparticle physics application .. First, let me congratulate you to a really easy to use and nice package. Unfortunately, it gives me astonishingly bad results...
Us: OK. Please send us your data.
Us: I am able to get 97% test accuracy. Is that good enough for you?
User:
Training and Testing
Training:
$ ./svm-train train.1
optimization finished, #iter = 6131
nSV = 3053, nBSV = 724
Total nSV = 3053
Testing:
$ ./svm-predict test.1 train.1.model test.1.out
Accuracy = 66.925% (2677/4000)
nSV and nBSV: numbers of SVs and bounded SVs (α_i = C)
Why this Fails
After training, nearly 100% of the training points are support vectors
Training and testing accuracy are very different:
$ ./svm-predict train.1 train.1.model o
Accuracy = 99.7734% (3082/3089)
Most kernel elements are
    K_ij = e^{−‖x_i−x_j‖^2/4},  which = 1 if i = j and → 0 if i ≠ j,
because some features are in rather large ranges
Data Scaling
Without scaling, attributes in greater numeric ranges may dominate
Linearly scale each feature to [0, 1] by
    (feature value − min) / (max − min)
There are other ways to scale
Scaling generally helps, but not always
Data Scaling: Same Factors
A common mistake
$ ./svm-scale -l -1 -u 1 train.1 > train.1.scale
$ ./svm-scale -l -1 -u 1 test.1 > test.1.scale
Instead, apply the same factors to training and testing data:
$ ./svm-scale -s range1 train.1 > train.1.scale
$ ./svm-scale -r range1 test.1 > test.1.scale
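The same idea in a short NumPy sketch, scaling to [0, 1] (the variable names are placeholders):

    import numpy as np

    def fit_scaling(X):                      # like svm-scale -s range1 train.1
        return X.min(axis=0), X.max(axis=0)

    def apply_scaling(X, lo, hi):
        return (X - lo) / (hi - lo)          # assumes hi > lo for every feature

    # lo, hi = fit_scaling(X_train)
    # X_train_scaled = apply_scaling(X_train, lo, hi)
    # X_test_scaled = apply_scaling(X_test, lo, hi)   # like svm-scale -r range1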
After Data Scaling
Train the scaled data and then predict:
$ ./svm-train train.1.scale
$ ./svm-predict test.1.scale train.1.scale.model test.1.predict
Accuracy = 96.15%
Training accuracy is now
$ ./svm-predict train.1.scale train.1.scale.model o
Accuracy = 96.439%
Default parameters: C = 1, γ = 0.25
Different Parameters
If we use C = 20, γ = 400:
$ ./svm-train -c 20 -g 400 train.1.scale
$ ./svm-predict train.1.scale train.1.scale.model o
Accuracy = 100% (3089/3089)
100% training accuracy, but
$ ./svm-predict test.1.scale train.1.scale.model o
Accuracy = 82.7% (3308/4000)
Very bad test accuracy: overfitting happens
Overfitting
In theory, you can easily achieve 100% training accuracy, but this is useless
When training and predicting data, we should
- avoid underfitting: small training error
- avoid overfitting: small testing error
[Figure: training and testing points illustrating underfitting versus overfitting]
Parameter Selection
Need to select suitable parameters: C and the kernel parameters
Example:
    γ of e^{−γ‖x_i−x_j‖^2}
    a, b, d of (x_i^T x_j / a + b)^d
How should we select them so that performance is better?
Performance Evaluation
Split available data ⇒ training and validation sets
Train on the training set; test on the validation set
k-fold cross validation (CV):
- Data randomly separated into k groups
- Each time, k − 1 groups for training and one for testing
Select the parameters/kernels with the best CV result
Selecting Kernels
RBF, polynomial, or others?
For beginners, use RBF first
Linear kernel: a special case of RBF
    Performance of linear is the same as RBF under certain parameters (Keerthi and Lin, 2003)
Polynomial: numerical difficulties
    (<1)^d → 0, (>1)^d → ∞
    More parameters than RBF
A Simple Procedure
1. Conduct simple scaling on the data
2. Consider the RBF kernel K(x, y) = e^{−γ‖x−y‖^2}
3. Use cross-validation to find the best parameters C and γ
4. Use the best C and γ to train the whole training set
5. Test
This is for beginners only; you can do a lot more. A grid-search sketch of these steps follows.
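One possible scikit-learn implementation of the procedure (random data stands in for a real scaled set, and the parameter grid is one common choice, not a prescribed one):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X, y = rng.randn(200, 4), rng.randint(0, 2, 200)   # stand-in for scaled data
    grid = {'C': [2.0**k for k in range(-5, 16, 2)],
            'gamma': [2.0**k for k in range(-15, 4, 2)]}
    search = GridSearchCV(SVC(kernel='rbf'), grid, cv=5)
    search.fit(X, y)            # steps 3-4: CV over the grid, retrain on all data
    print(search.best_params_)  # the selected C and gamma; then test (step 5)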
Contour of Parameter Selection
[Figure: contour of cross-validation accuracy over the (C, γ) plane]
The good region of parameters is quite large
SVM is sensitive to parameters, but not that sensitive
Sometimes default parameters work, but it is good to select them if time allows
Multi-class Classification
k classes
One-against-the-rest: train k binary SVMs:
    1st class vs. classes (2, ..., k)
    2nd class vs. classes (1, 3, ..., k)
    ...
k decision functions:
    (w^1)^T φ(x) + b^1
    ...
    (w^k)^T φ(x) + b^k
Prediction:
    arg max_j (w^j)^T φ(x) + b^j
Reason: if x is in the 1st class, then we should have
    (w^1)^T φ(x) + b^1 ≥ +1
    (w^2)^T φ(x) + b^2 ≤ −1
    ...
    (w^k)^T φ(x) + b^k ≤ −1
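A sketch of this prediction rule; models is assumed to be a list of trained (w_j, b_j) pairs, and x the (possibly mapped) feature vector:

    import numpy as np

    def predict_one_vs_rest(x, models):
        # pick the class whose decision value (w_j)^T x + b_j is largest
        return int(np.argmax([w @ x + b for (w, b) in models]))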
Multi-class Classification (Cont’d)
One-against-one: train k(k − 1)/2 binary SVMs:
    (1, 2), (1, 3), ..., (1, k), (2, 3), (2, 4), ..., (k − 1, k)
If 4 classes ⇒ 6 binary SVMs:

    y_i = 1    y_i = −1   Decision function
    class 1    class 2    f_12(x) = (w^12)^T x + b^12
    class 1    class 3    f_13(x) = (w^13)^T x + b^13
    class 1    class 4    f_14(x) = (w^14)^T x + b^14
    class 2    class 3    f_23(x) = (w^23)^T x + b^23
    class 2    class 4    f_24(x) = (w^24)^T x + b^24
    class 3    class 4    f_34(x) = (w^34)^T x + b^34
For a test point, predict with all binary SVMs:

    Classes    Winner
    1 vs 2     1
    1 vs 3     1
    1 vs 4     1
    2 vs 3     2
    2 vs 4     4
    3 vs 4     3

Select the class with the largest vote:

    class      1  2  3  4
    # votes    3  1  1  1

Decision values may be used as well; a small voting sketch follows.
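The voting rule as a sketch; models is assumed to map a class pair (i, j) to its trained decision function f_ij:

    import numpy as np

    def predict_one_vs_one(x, models, k):
        votes = np.zeros(k, dtype=int)
        for (i, j), f in models.items():
            votes[i if f(x) > 0 else j] += 1   # f_ij(x) > 0 votes for class i
        return int(np.argmax(votes))           # class with the largest vote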
More Complicated Forms
Solving a single optimization problem (Weston and Watkins, 1999; Crammer and Singer, 2002; Lee et al., 2004)
There are many other methods; a comparison is in Hsu and Lin (2002)
With the RBF kernel, accuracy is similar across methods, but 1-against-1 is the fastest for training
Research Directions: Large-scale Training
SVM doesn’t Scale Up
Yes, if kernels are used
Training millions of data is time consuming
Cases with many support vectors: a quadratic-time bottleneck on the kernel sub-matrix Q_{SV,SV}
For noisy data, # SVs increases linearly in the data size (Steinwart, 2003)
Some solutions:
- Parallelization
- Approximation
Parallelization
Multi-core / Shared Memory / GPU
• One-line change of LIBSVM; timings on 50,000 data (kernel evaluations take 80% of the time):

    # cores   Multicore   Shared-memory
    1         80          100
    2         48          57
    4         32          36
    8         27          28
• GPU (Catanzaro et al., 2008)
Distributed environments:
• Chang et al. (2007); Zanni et al. (2006); Zhu et al. (2009): all use MPI; reasonably good speed-ups
Approximately Training SVM
This can be done at many levels:
- Data level: sub-sampling
- Optimization level: approximately solve the quadratic program
- Other non-intuitive but effective ways (I will show one today)
Many papers have addressed this issue
Approximately Training SVM (Cont’d)
Sub-sampling: simple and often effective
More advanced techniques:
- Incremental training (e.g., Syed et al., 1999): split data into 10 parts; train the 1st part ⇒ SVs; train SVs + 2nd part, ...
- Select and train good points: KNN or heuristics; for example, Bakır et al. (2005)
Approximately Training SVM (Cont’d)
- Approximate the kernel; e.g., Fine and Scheinberg (2001); Williams and Seeger (2001)
- Use part of the kernel; e.g., Lee and Mangasarian (2001); Keerthi et al. (2006)
- Early stopping of optimization algorithms: Tsang et al. (2005) and others
And many more; some simple, some sophisticated
Approximately Training SVM (Cont’d)
Sophisticated techniques may not always be useful; they are sometimes slower than sub-sampling
covtype: 500k training and 80k testing; rcv1: 550k training and 14k testing

    covtype                        rcv1
    Training size   Accuracy       Training size   Accuracy
    50k             92.5%          50k             97.2%
    100k            95.3%          100k            97.4%
    500k            98.2%          550k            97.8%
Discussion: Large-scale Training
We don't have many large and well-labeled data sets; obtaining true labels is expensive
Specific properties of the data should be considered; we will illustrate this point using linear SVM
The design of software for very large data sets should be application dependent
Research Directions: Linear SVM
Linear SVM
Data not mapped to another space
    min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
    subject to   y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0, i = 1, ..., l.
In theory, the RBF kernel with certain parameters is at least as good as linear (Keerthi and Lin, 2003):
    test accuracy of linear ≤ test accuracy of RBF
But linear can serve as an approximation to nonlinear
Recently, linear SVM has become an important research topic
Linear SVM for Large Document Sets
Bag-of-words model (TF-IDF or others); a large # of features
Accuracy is similar with/without mapping vectors
What if training is much faster? Then linear SVM is a very effective approximation to nonlinear SVM
A Comparison: LIBSVM and LIBLINEAR
rcv1: # data > 600k, # features > 40k
Using LIBSVM (linear kernel): > 10 hours
Using LIBLINEAR (same stopping condition): computation < 5 seconds; I/O 60 seconds
Accuracy similar to nonlinear; more than 100x speedup
Training millions of data takes only a few seconds; see the results of running LIBLINEAR in Hsieh et al. (2008)
http://www.csie.ntu.edu.tw/~cjlin/liblinear
Testing Accuracy versus Training Time
[Figures: testing accuracy versus training time on news20 and yahoo-japan]
Why Training Linear SVM Is Faster?
In optimization, each iteration often needs
    ∇_i f(α) = (Qα)_i − 1
Nonlinear SVM:
    ∇_i f(α) = Σ_{j=1}^{l} y_i y_j K(x_i, x_j) α_j − 1
Cost: O(nl), where n: # features, l: # data
Linear SVM: maintain w ≡ Σ_{j=1}^{l} y_j α_j x_j; then
    ∇_i f(α) = y_i w^T x_i − 1
costs only O(n)
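A sketch of one pass over the data in this spirit, for the linear L1-SVM dual without the bias term (loosely following Hsieh et al. (2008), not LIBLINEAR's actual code):

    import numpy as np

    def dual_cd_epoch(X, y, alpha, w, C):
        # maintaining w = sum_j y_j alpha_j x_j keeps each update O(n)
        for i in range(X.shape[0]):
            G = y[i] * (w @ X[i]) - 1               # gradient at coordinate i, O(n)
            a_new = min(max(alpha[i] - G / (X[i] @ X[i]), 0.0), C)
            w += (a_new - alpha[i]) * y[i] * X[i]   # O(n) update of w
            alpha[i] = a_new
        return alpha, w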
Extension: Training Explicit Form of Nonlinear Mappings
Use a linear-SVM method to train on φ(x_1), ..., φ(x_l); the kernel is not used
Applicable only if the dimension of φ(x) is not too large
Low-degree polynomial mappings:
    K(x_i, x_j) = (x_i^T x_j + 1)^2 = φ(x_i)^T φ(x_j)
    φ(x) = [1, √2 x_1, ..., √2 x_n, x_1^2, ..., x_n^2, √2 x_1x_2, ..., √2 x_{n−1}x_n]^T
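A sketch of this explicit degree-2 mapping with a numeric check against the kernel (the points are arbitrary):

    import numpy as np

    def phi_degree2(x):
        # dimension 1 + 2n + n(n-1)/2, matching the kernel (x^T z + 1)^2
        n = len(x)
        cross = [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
        return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

    x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi_degree2(x) @ phi_degree2(z), (x @ z + 1) ** 2)   # both 4.0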
Testing Accuracy and Training Time
                Degree-2 polynomial                     Accuracy diff.
                Training time (s)        Accuracy       vs linear   vs RBF
    Data set    LIBLINEAR   LIBSVM
    a9a         1.6         89.8         85.06          0.07        0.02
    real-sim    59.8        1,220.5      98.00          0.49        0.10
    ijcnn1      10.7        64.2         97.84          5.63        −0.85
    MNIST38     8.6         18.4         99.29          2.47        −0.40
    covtype     5,211.9     NA           80.09          3.74        −15.98
    webspam     3,228.1     NA           98.44          5.29        −0.76
Training on φ(x_i) by a linear method is faster than kernel training, with sometimes competitive accuracy
Discussion: Directly Training φ(x_i), ∀i
See details in our work (Chang et al., 2010); a related development is Sonnenburg and Franc (2010)
Useful for certain applications
Linear Classification: Data Larger than Memory
Existing methods cannot easily handle this situation
See our recent KDD work (Yu et al., 2010), which received the KDD 2010 best paper award
It trains several million data (or more) on your laptop
Linear Classification: Online Learning
For extremely large data, we cannot keep all data in memory
After using new data to update the model, we may not need them any more
Online learning instead of offline learning
Training is often done by stochastic gradient descent methods, which use only a subset of data at each step
Now an important research topic (e.g., Shalev-Shwartz et al., 2007; Langford et al., 2009; Bordes et al., 2009)
Linear Classification: L1 Regularization
1-norm versus 2-norm:
    ‖w‖_1 = |w_1| + ··· + |w_n|,   ‖w‖_2^2 = w_1^2 + ··· + w_n^2
[Figure: |w| and w^2 as functions of w]
2-norm: all w_i are non-zero; 1-norm: some w_i may be zero, which is useful for feature selection
Recently a hot topic; see our survey (Yuan et al., 2010)
Research Directions: Others
Multiple Kernel Learning (MKL)
How about using
    t_1 K_1 + t_2 K_2 + ··· + t_r K_r, where t_1 + ··· + t_r = 1,
as the kernel?
Related to parameter/kernel selection: if K_1 is better ⇒ t_1 close to 1, others close to 0
Earlier development (Lanckriet et al., 2004): high computational cost
Many subsequent works (e.g., Rakotomamonjy et al., 2008)
Work is still ongoing; so far MKL is not yet a practical tool
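To make the formulation concrete, a sketch that combines two base kernels with fixed weights (learning t_1, ..., t_r is the actual MKL problem; here the weights are simply assumed):

    import numpy as np
    from sklearn.svm import SVC

    def combined_kernel(X, t1=0.7, t2=0.3, gamma=0.5):
        K1 = np.exp(-gamma * np.sum((X[:, None] - X[None, :]) ** 2, axis=2))  # RBF
        K2 = X @ X.T                                                          # linear
        return t1 * K1 + t2 * K2   # a valid kernel: a convex combination of kernels

    # clf = SVC(kernel='precomputed').fit(combined_kernel(X_train), y_train)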
Ranking
Labels become ranking information; e.g., x_1 ranks higher than x_2
RankSVM (Joachims, 2002): add the constraint
    w^T x_i ≥ w^T x_j + 1 − ξ_ij  if x_i ranks better than x_j
Many subsequent works (one standard reduction is sketched below)
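A common way to train with such constraints is to reduce them to classification on pairwise differences; a sketch of that reduction (not Joachims' SVMrank code):

    import numpy as np

    def pairwise_examples(X, ranks):
        # x_i ranked above x_j gives (x_i - x_j, +1) and (x_j - x_i, -1);
        # a linear SVM on these pairs then enforces w^T x_i >= w^T x_j + 1 - xi_ij
        P, labels = [], []
        for i in range(len(X)):
            for j in range(len(X)):
                if ranks[i] > ranks[j]:
                    P.append(X[i] - X[j]); labels.append(+1)
                    P.append(X[j] - X[i]); labels.append(-1)
        return np.array(P), np.array(labels)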
However, whether SVM is the most suitable method for ranking is an issue
Other Directions
- Semi-supervised learning: use information from unlabeled data
- Active learning: obtaining labels of data has a cost
- Cost-sensitive learning: for unbalanced data
- Structured learning: a data instance is not a Euclidean vector; it may be a parse tree of a sentence
- Feature selection
Discussion and Conclusions
SVM and kernel methods are rather mature areas, but quite a few interesting research issues remain
Many are extensions of standard classification (e.g., semi-supervised learning)
It is possible to identify more extensions through real applications
References
G. H. Bakır, L. Bottou, and J. Weston. Breaking SVM complexity with cross-training. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 81–88. MIT Press, Cambridge, MA, 2005.
A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, 2009.
B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.
B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.
E. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. Parallelizing support vector machines on distributed computers. In NIPS 21, 2007.
Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11:1471–1490, 2010. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/lowpoly_journal.pdf.
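C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.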
K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2–3):201–233, 2002.
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.
C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines.
IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002.
S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.
S. S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7:1493–1515, 2006.
G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research, 5:27–72, 2004.
J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:771–801, 2009.
Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.
Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining, 2001.
C.-J. Lin. A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE Transactions on Neural Networks, 13(5):1045–1052, 2002. URL http://www.csie.ntu.edu.tw/~cjlin/papers/stop.ps.gz.
E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of CVPR’97, pages 130–136, New York, NY, 1997. IEEE.
J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.
S. Sonnenburg and V. Franc. COFFIN : A computational framework for linear SVMs. In Proceedings of the Twenty Seventh International Conference on Machine Learning (ICML), 2010.
I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071–1105, 2003.
N. A. Syed, H. Liu, and K. K. Sung. Incremental learning with support vector machines. In Workshop on Support Vector Machines, IJCAI99, 1999.
I. Tsang, J. Kwok, and P. Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.
J. Weston and C. Watkins. Multi-class support vector machines. In M. Verleysen, editor, Proceedings of ESANN99, pages 219–224, Brussels, 1999. D. Facto Press.
C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, and V. Tresp, editors, Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large linear classification when data cannot fit in memory. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/kdd_disk_decomposition.pdf.
G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf. To appear.
L. Zanni, T. Serafini, and G. Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7:1467–1492, 2006.
Z. A. Zhu, W. Chen, G. Wang, C. Zhu, and Z. Chen. P-packSVM: Parallel primal gradient descent kernel SVM. In Proceedings of the 2009 edition of the IEEE International Conference on Data Mining, 2009.