### Optimization and Machine Learning

Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at 25th Simon Stevin Lecture, K. U. Leuven Optimization in Engineering Center, January 17, 2013

### Outline

1 Introduction

2 Optimization methods for kernel support vector machines

3 Optimization methods for linear support vector machines

4 Discussion and conclusions

### Outline

1 Introduction

2 Optimization methods for kernel support vector machines

3 Optimization methods for linear support vector machines

4 Discussion and conclusions

### What is Machine Learning

Extract knowledge from data

Representative tasks: classification, clustering, and others

[Figures: classification and clustering examples]

An old area, but many new and interesting applications/extensions: ranking, etc.

### Data Classification

Given training data in different classes (labels known)

Predict test data (labels unknown)

Classic example:

1. Find a patient's blood pressure, weight, etc.
2. After several years, know if he/she recovers
3. Build a machine learning model
4. New patient: find blood pressure, weight, etc.
5. Prediction

Two main stages: training and testing

### Data Classification (Cont’d)

Representative methods

Nearest neighbor, naive Bayes

Decision tree, random forest

Neural networks, support vector machines

### Why Is Optimization Used?

Usually the goal of classification is to minimize the test error

Therefore, many classification methods solve optimization problems

### Optimization and Machine Learning

Standard optimization packages may be directly applied to machine learning applications

However, efficiency and scalability are issues

Very often machine learning knowledge must be considered in designing suitable optimization methods

We will discuss some examples in this talk

### Outline

1 Introduction

2 Optimization methods for kernel support vector machines

3 Optimization methods for linear support vector machines

4 Discussion and conclusions

### Kernel Methods

Kernel methods are a class of classification techniques where major operations are conducted by kernel evaluations

A representative example is support vector machine

### Support Vector Classification

Training data $(x_i, y_i)$, $i = 1, \ldots, l$, $x_i \in R^n$, $y_i = \pm 1$

Maximizing the margin (Boser et al., 1992; Cortes and Vapnik, 1995)

$$\min_{w, b} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(1 - y_i (w^T \phi(x_i) + b),\, 0\bigr)$$

High dimensional (maybe infinite) feature space

$$\phi(x) = (\phi_1(x), \phi_2(x), \ldots)$$

$w$: maybe infinite variables
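The objective above can be evaluated directly whenever $\phi$ is explicit. Here is a minimal NumPy sketch (the function name and toy data are illustrative, not from any package), using the identity mapping $\phi(x) = x$:

```python
import numpy as np

def svm_primal_objective(w, b, X, y, C):
    """Primal SVM objective: 0.5 * w'w + C * sum of hinge losses.

    X plays the role of phi(x); with the identity mapping this is
    the linear SVM objective.
    """
    margins = y * (X @ w + b)
    hinge = np.maximum(1.0 - margins, 0.0)
    return 0.5 * w @ w + C * hinge.sum()

# Toy data: two points per class in R^2.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.5, 0.5])
print(svm_primal_objective(w, b=0.0, X=X, y=y, C=1.0))  # 0.25
```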

### Support Vector Classification (Cont’d)

The dual problem (finite number of variables):

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \quad y^T \alpha = 0,$$

where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$

At optimum:

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$

Kernel: $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$; closed form

Example: Gaussian (RBF) kernel: $e^{-\gamma \|x_i - x_j\|^2}$
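To make the dual concrete, here is an illustrative NumPy sketch (the helper names are ours, not from any package) that forms the Gaussian kernel matrix $K$ and the matrix $Q$ with $Q_{ij} = y_i y_j K(x_i, x_j)$:

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negatives

def dual_Q(X, y, gamma):
    """Q[i, j] = y_i * y_j * K(x_i, x_j), as in the dual problem."""
    K = rbf_kernel(X, gamma)
    return (y[:, None] * y[None, :]) * K
```

For l training points this builds the full l-by-l matrix, which is exactly what becomes infeasible for large l, as discussed below.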

### Support Vector Classification (Cont’d)

Only the $x_i$ with $\alpha_i > 0$ are used ⇒ support vectors


### Large Dense Quadratic Programming

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \quad y^T \alpha = 0$$

$Q_{ij} \ne 0$: $Q$ is an $l$ by $l$ fully dense matrix

50,000 training points: 50,000 variables:

$(50{,}000^2 \times 8 / 2)$ bytes = 10GB RAM to store $Q$
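The memory estimate is one line of arithmetic (a symmetric $Q$ stored in double precision, so only half of the $l \times l$ entries are kept):

```python
l = 50_000
bytes_needed = l * l * 8 / 2   # l^2 doubles, halved by symmetry
print(bytes_needed / 1e9)      # 10.0 (GB)
```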

### Large Dense Quadratic Programming (Cont’d)

For quadratic programming problems, traditionally we would use Newton or quasi-Newton methods

However, they cannot be directly applied here because Q cannot even be stored

Currently, decomposition methods (a type of coordinate descent method) are what is used in practice

### Decomposition Methods

Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)

Similar to coordinate-wise minimization

Working set $B$; $N = \{1, \ldots, l\} \setminus B$ is fixed

Sub-problem at the $k$th iteration:

$$\min_{\alpha_B} \quad \frac{1}{2} \begin{bmatrix} \alpha_B^T & (\alpha_N^k)^T \end{bmatrix} \begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix} - \begin{bmatrix} e_B^T & e_N^T \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}$$

$$\text{subject to} \quad 0 \le \alpha_t \le C,\ t \in B, \quad y_B^T \alpha_B = -y_N^T \alpha_N^k$$

### Avoid Memory Problems

The new objective function:

$$\frac{1}{2} \alpha_B^T Q_{BB} \alpha_B + (-e_B + Q_{BN} \alpha_N^k)^T \alpha_B + \text{constant}$$

Only $B$ columns of $Q$ are needed

$|B| \ge 2$ due to the equality constraint; in general $|B| \le 10$ is used

Calculated when used: trade time for space

But is such an approach practical?
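The trade-time-for-space idea can be sketched as follows (function names are ours, not LIBSVM's): compute only the $|B|$ kernel columns that the sub-problem needs, never the full $l \times l$ matrix:

```python
import numpy as np

def q_columns(X, y, B, gamma):
    """Compute only the columns of Q indexed by the working set B.

    Returns an l-by-|B| array; the full l-by-l Q is never formed.
    """
    XB = X[B]
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(XB**2, axis=1)[None, :]
          - 2.0 * X @ XB.T)
    K_cols = np.exp(-gamma * np.maximum(d2, 0.0))  # RBF kernel columns
    return (y[:, None] * y[B][None, :]) * K_cols
```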

### How Do Decomposition Methods Perform?

Convergence is not very fast; this is expected, because only first-order information is used

But there is no need for a very accurate $\alpha$

Decision function:

$$\sum_{i=1}^{l} \alpha_i K(x_i, x) + b$$

Prediction may still be correct with a rough $\alpha$

Further, in some situations,

# support vectors ≪ # training points

With the initial $\alpha^1 = 0$, some instances are never used
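A rough $\alpha$ with many zeros also makes prediction cheap: only the support vectors enter the sum. An illustrative sketch (our own helper; the labels $y_i$ are absorbed into $\alpha_i$, as in the decision function above):

```python
import numpy as np

def decision_value(x, X, alpha, b, gamma):
    """sum_i alpha_i K(x_i, x) + b, using only i with alpha_i != 0.

    With an RBF kernel, the labels y_i are assumed absorbed into alpha_i.
    """
    sv = np.nonzero(alpha)[0]               # support vector indices
    d2 = np.sum((X[sv] - x)**2, axis=1)     # squared distances to x
    return alpha[sv] @ np.exp(-gamma * d2) + b
```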

### How Do Decomposition Methods Perform? (Cont’d)

An example of training 50,000 instances using the software LIBSVM:

$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time 79.524s

This was done on a typical desktop

Calculating the whole $Q$ takes more time

#SVs = 3,370 ≪ 50,000

A good case where some variables remain at zero all the time

### How Do Decomposition Methods Perform? (Cont’d)

Because many $\alpha_i = 0$ in the end, we can develop a shrinking technique

Variables are removed during the optimization procedure. Smaller problems are solved

### Machine Learning Properties are Useful in Designing Optimization Algorithms

We have seen that special properties of SVM did contribute to the viability of decomposition methods

For machine learning applications, there is no need to accurately solve the optimization problem

Because some optimal $\alpha_i = 0$, decomposition methods may not need to update all the variables

Also, we can use shrinking techniques to reduce the problem size during decomposition

### Differences between Optimization and Machine Learning

The two topics may have different focuses. We give the following example

The decomposition method we just discussed converges more slowly when C is large

Using C = 1 on a data set:
# iterations: 508

Using C = 5,000:
# iterations: 35,241

Optimization researchers may rush to solve difficult cases of large C

That’s what I did before

It turns out that large C is used less often than small C

Recall that SVM solves

$$\frac{1}{2} w^T w + C \times (\text{sum of training losses})$$

A large C means overfitting the training data

This does not give good testing accuracy

### Outline

1 Introduction

2 Optimization methods for kernel support vector machines

3 Optimization methods for linear support vector machines

4 Discussion and conclusions

### Linear and Kernel Classification

We have

Kernel ⇒ map data to a higher dimensional space

Linear ⇒ use the original data

Intuitively, kernel should give higher accuracy than linear

There are even some theoretical results

We optimization people may think there is no need to specially consider linear SVM

However, this is wrong if we consider their practical use

### Linear and Kernel Classification (Cont’d)

Methods such as SVM and logistic regression can be used in two ways

Kernel methods: data mapped to a higher dimensional space

x ⇒ φ(x)

$\phi(x_i)^T \phi(x_j)$ easily calculated; little control on $\phi(\cdot)$

Linear classification + feature engineering:

We have $x$ without mapping. Alternatively, we can say that $\phi(x)$ is our $x$; full control on $x$ or $\phi(x)$

### Linear and Kernel Classification (Cont’d)

For some problems, accuracy by linear is as good as nonlinear

But training and testing are much faster

This particularly happens for document classification

Number of features (bag-of-words model) very large

Data very sparse (i.e., few non-zeros)

Recently, linear classification has become a popular research topic.

### Comparison Between Linear and Kernel (Training Time & Testing Accuracy)

| Data set | #data | #features | Time (Linear) | Accuracy (Linear) | Time (RBF) | Accuracy (RBF) |
|---|---|---|---|---|---|---|
| MNIST38 | 11,982 | 752 | 0.1 | 96.82 | 38.1 | 99.70 |
| ijcnn1 | 49,990 | 22 | 1.6 | 91.81 | 26.8 | 98.69 |
| covtype | 464,810 | 54 | 1.4 | 76.37 | 46,695.8 | 96.11 |
| news20 | 15,997 | 1,355,191 | 1.1 | 96.95 | 383.2 | 96.90 |
| real-sim | 57,848 | 20,958 | 0.3 | 97.44 | 938.3 | 97.82 |
| yahoo-japan | 140,963 | 832,026 | 3.1 | 92.63 | 20,955.2 | 93.31 |
| webspam | 280,000 | 254 | 25.7 | 93.35 | 15,681.8 | 99.26 |

Therefore, there is a need to develop optimization methods for large linear classification


### Why Is Linear Faster in Training and Testing?

Let's check the prediction cost:

$$w^T x + b \quad \text{versus} \quad \sum_{i=1}^{l} \alpha_i K(x_i, x) + b$$

If $K(x_i, x_j)$ takes $O(n)$, then

$$O(n) \quad \text{versus} \quad O(nl)$$

Linear is much cheaper; the reason:

for linear, $x_i$ is available, but

for kernel, $\phi(x_i)$ is not
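The $O(n)$ versus $O(nl)$ gap can be seen directly in code (illustrative helpers, not from LIBSVM/LIBLINEAR):

```python
import numpy as np

# Linear: one O(n) dot product per test point.
def predict_linear(w, b, x):
    return w @ x + b

# Kernel: O(nl) -- one kernel evaluation against every training point
# (here an RBF kernel, with labels absorbed into alpha).
def predict_kernel(alpha, b, X_train, x, gamma):
    d2 = np.sum((X_train - x)**2, axis=1)
    return alpha @ np.exp(-gamma * d2) + b
```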

### Optimization for Linear Classification

Now a popular topic in both machine learning and optimization

Most methods are based on first-order information: coordinate descent, stochastic gradient descent, or cutting plane methods

The reason is again that there is no need to accurately solve the optimization problems

Let’s see another development for linear classification

### Optimization for Linear Classification (Cont’d)

Martens (2010) and Byrd et al. (2011) propose the so-called “Hessian-free” approach

Let's rewrite linear SVM in the following form:

$$\min_w \quad \frac{1}{2} w^T w + \frac{C}{l} \sum_{i=1}^{l} \max(1 - y_i w^T x_i, 0)$$

What if we use a subset $B$ in the second term:

$$\frac{C}{|B|} \sum_{i \in B} \max(1 - y_i w^T x_i, 0)$$

### Optimization for Linear Classification (Cont’d)

Then both gradient and Hessian-vector products can be cheaper

That is, if there are enough data, the average training loss over a subset should be similar to that over the full set

This is a good example to take machine learning properties in designing optimization algorithms
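An illustrative sketch of the subsampling idea (function names and data are made up here): with $w$ fixed, the loss term averaged over a random subset $B$ approximates the average over all $l$ instances, and gradients or Hessian-vector products built from $B$ are correspondingly cheaper:

```python
import numpy as np

def full_loss(w, X, y, C):
    """0.5 * w'w + (C/l) * sum of hinge losses over all l instances."""
    l = len(y)
    return 0.5 * w @ w + (C / l) * np.maximum(1 - y * (X @ w), 0).sum()

def subsampled_loss(w, X, y, C, B):
    """Same objective, but the loss term is averaged over a subset B."""
    XB, yB = X[B], y[B]
    return 0.5 * w @ w + (C / len(B)) * np.maximum(1 - yB * (XB @ w), 0).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 20))
w_true = rng.standard_normal(20)
y = np.sign(X @ w_true)
B = rng.choice(len(y), size=1000, replace=False)
w = np.zeros(20)
# With enough data, the two averages are close.
print(full_loss(w, X, y, 1.0), subsampled_loss(w, X, y, 1.0, B))
```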

### Optimization for Linear Classification (Cont’d)

Lessons

We must know the practical use of machine learning in order to design suitable optimization algorithms

Here is how I started developing optimization algorithms for linear SVM

In 2006, I visited Yahoo! for six months. I learned that

1. Document classification is heavily used

2. Accuracy of linear and nonlinear is similar for documents

### Outline

1 Introduction

2 Optimization methods for kernel support vector machines

3 Optimization methods for linear support vector machines

4 Discussion and conclusions

### Machine Learning Software

Algorithms discussed in this talk are related to my machine learning software

LIBSVM (Chang and Lin, 2011):

One of the most popular SVM packages; cited more than 11,000 times on Google Scholar

LIBLINEAR (Fan et al., 2008):

A library for large linear classification; popular in Internet companies

The core of an SVM package is an optimization solver

### Machine Learning Software (Cont’d)

But designing machine learning software is quite different from optimization packages

You need to consider prediction, validation, and others

Also, issues related to users (e.g., ease of use, interface, etc.) are very important for machine learning packages

### Conclusions

Optimization has been very useful for machine learning

We need to take machine learning knowledge into account when designing suitable optimization algorithms

The interaction between optimization and machine learning is very interesting and exciting.