© Deng Cai, College of Computer Science, Zhejiang University

(1)

So Far…

Our goal (supervised learning):

• To learn a set of discriminant functions

Bayesian framework

• We could design an optimal classifier if we knew:

• $P(\omega_i)$: priors, and $P(x \mid \omega_i)$: class-conditional densities

• Using training data to estimate $P(\omega_i)$ and $P(x \mid \omega_i)$

Directly learning discriminant functions from the training data

• We only know the form of the discriminant functions

• Linear Methods for Regression

(2)


Linear Methods for Classification

Deng Cai (蔡登)

College of Computer Science Zhejiang University


(3)

Discriminant Functions and Classifiers

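Set of discriminant functions: $g_i(x)$, $i = 1, \cdots, c$

Classifier assigns a feature vector $x$ to class $\omega_i$ if:

$g_i(x) > g_j(x), \quad \forall j \neq i$

[Figure: discriminant functions feeding a classification step.]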
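A minimal sketch of the decision rule: evaluate every discriminant function and pick the class with the largest value. The linear form of the discriminants, the helper `make_linear`, and the three weight vectors are hypothetical choices for illustration only.

```python
def classify(x, discriminants):
    """Return the index i of the discriminant g_i with the largest value at x."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])


def make_linear(w, b):
    """Build a hypothetical linear discriminant g(x) = w . x + b."""
    return lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b


# Three made-up classes in a 2-D feature space.
discriminants = [
    make_linear([1.0, 0.0], 0.0),
    make_linear([0.0, 1.0], 0.0),
    make_linear([-1.0, -1.0], 0.5),
]

print(classify([2.0, 1.0], discriminants))
```

Ties are broken in favor of the lowest class index, which is an arbitrary choice of this sketch.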

(4)

Linear Regression of an Indicator Matrix

One VS. Rest

(5)


Sigmoid function

$\sigma(t) = \dfrac{e^{t}}{e^{t} + 1} = \dfrac{1}{1 + e^{-t}}$

It is the cumulative distribution function (CDF) of the standard logistic distribution.

While the input can take any value from $-\infty$ to $+\infty$, the output takes only values between 0 and 1, and hence is interpretable as a probability.

$\sigma: \mathbb{R} \to (0, 1)$, S-shaped (logistic function)
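The sigmoid $\sigma(t) = 1 / (1 + e^{-t})$ can be sketched in a few lines. The branch on the sign of `t` is an implementation choice of this sketch (not something the slides specify); it keeps `math.exp` from overflowing for large negative inputs.

```python
import math


def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t)), evaluated without overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)  # t < 0 here, so exp(t) is at most 1 and cannot overflow
    return z / (1.0 + z)


for t in (-5.0, 0.0, 5.0):
    print(t, sigmoid(t))
```

The symmetry $\sigma(t) + \sigma(-t) = 1$ is what later lets the two class probabilities of logistic regression be written as one formula.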

(6)

Logistic Regression

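Logistic Regression (LR) is a classification model used to describe the relationship between a categorical dependent variable and one or several independent variables by estimating probabilities using the sigmoid function.

$P(y = 1 \mid x, w) = \sigma(w^T x) = \dfrac{1}{1 + e^{-w^T x}}$

$P(y = -1 \mid x, w) = 1 - P(y = 1 \mid x, w) = \dfrac{1}{1 + e^{w^T x}}$

$P(y = \pm 1 \mid x, w) = \dfrac{1}{1 + e^{-y w^T x}}$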

(7)

Maximum Likelihood Estimation for Logistic Regression

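Logistic Regression:

$P(y = \pm 1 \mid x, w) = \dfrac{1}{1 + e^{-y w^T x}}$

Maximum likelihood over the training examples $(x_i, y_i)$:

$\max_w \prod_i P(y_i \mid x_i, w)$

$\Leftrightarrow \max_w \sum_i \log P(y_i \mid x_i, w)$

$\Leftrightarrow \max_w \sum_i -\log\left(1 + e^{-y_i w^T x_i}\right)$

$\Leftrightarrow \min_w \sum_i \log\left(1 + e^{-y_i w^T x_i}\right)$

• $E(w) = \sum_i \log\left(1 + e^{-y_i w^T x_i}\right)$: a convex function of $w$? (Homework)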
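As a sketch of what the maximum likelihood objective looks like in code, the block below evaluates $\sum_i \log(1 + e^{-y_i w^T x_i})$ and its gradient for labels in $\{-1, +1\}$. The toy data, the function names, and the trick of appending a constant 1 to each feature vector as a bias are assumptions made for illustration.

```python
import math


def softplus(t):
    """log(1 + exp(t)), computed without overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def neg_log_likelihood(w, X, y):
    """sum_i log(1 + exp(-y_i * w.x_i)) for labels y_i in {-1, +1}."""
    return sum(softplus(-yi * dot(w, xi)) for xi, yi in zip(X, y))


def gradient(w, X, y):
    """Gradient of the sum above: sum_i -y_i * sigma(-y_i * w.x_i) * x_i."""
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        m = yi * dot(w, xi)
        coeff = -yi * math.exp(-softplus(m))  # exp(-softplus(m)) equals sigma(-m)
        for j, xij in enumerate(xi):
            g[j] += coeff * xij
    return g


# Hypothetical toy data; the last feature is a constant 1 acting as a bias.
X = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, 0.5, 1.0]]
y = [1, 1, -1, -1]

print(neg_log_likelihood([0.0, 0.0, 0.0], X, y))
print(gradient([0.0, 0.0, 0.0], X, y))
```

At $w = 0$ every example has probability 1/2, so the objective equals the number of examples times $\log 2$.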

(8)

Objective Function of Logistic Regression:

$\min_w \sum_i \log\left(1 + e^{-y_i w^T x_i}\right)$

Objective Function of Linear Regression:

$\min_w \sum_i \left(y_i - w^T x_i\right)^2$

Minimize a Differentiable Function: Gradient Descent

(9)

Gradient Descent

(10)

Gradient Descent

A first‐order optimization algorithm.

Can find a local minimum of a function

One takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function.

Another name: steepest descent

(11)

Gradient Descent

If the multivariable function $F(x)$ is defined and differentiable in a neighborhood of a point $a$, then $F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient of $F$ at $a$, $-\nabla F(a)$.

If $b = a - \gamma \nabla F(a)$, for $\gamma > 0$ small enough, then $F(a) \geq F(b)$.

With this observation in mind, one starts with a guess $x_0$ for a local minimum of $F$, and considers the sequence $x_0, x_1, x_2, \cdots$ such that

$x_{n+1} = x_n - \gamma_n \nabla F(x_n), \quad n \geq 0$

We have

$F(x_0) \geq F(x_1) \geq F(x_2) \geq \cdots$

so hopefully the sequence $(x_n)$ converges to the desired local minimum.

Note that the value of the step size $\gamma$ is allowed to change at every iteration.
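The iteration $x_{n+1} = x_n - \gamma \nabla F(x_n)$ can be sketched as follows. The quadratic test function, the starting point, and the fixed step size 0.1 are hypothetical choices; a fixed step is only one of the options, since the step size may change at every iteration.

```python
def gradient_descent(grad, x0, step, n_iters):
    """Iterate x_{n+1} = x_n - step * grad(x_n) and return every iterate."""
    xs = [list(x0)]
    for _ in range(n_iters):
        x = xs[-1]
        g = grad(x)
        xs.append([xj - step * gj for xj, gj in zip(x, g)])
    return xs


# Hypothetical test function F(x) = (x1 - 3)^2 + 2 * (x2 + 1)^2, minimum at (3, -1).
def F(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2


def grad_F(x):
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]


iterates = gradient_descent(grad_F, x0=[0.0, 0.0], step=0.1, n_iters=100)
print(iterates[-1], F(iterates[-1]))
```

With this step size each coordinate error shrinks by a constant factor per iteration (0.8 and 0.6 here), so the function values decrease monotonically toward the minimum.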

(12)

Gradient Descent Algorithm

[Figure: the curve $L(a)$ with successive iterates $a^1, a^2, a^3, a^4$ marked along the $a$ axis.]

(13)

Minimize a Differentiable Function

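If the multivariable function $F(x)$ is defined and differentiable in a neighborhood of a point $a$, then $F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient of $F$ at $a$, $-\nabla F(a)$.

If $b = a - \gamma \nabla F(a)$, for $\gamma > 0$ small enough, then $F(a) \geq F(b)$.

Taylor series for evaluating a function:

$F(x + \Delta x) = F(x) + F'(x)\Delta x + \dfrac{F''(x)}{2!}(\Delta x)^2 + \dfrac{F'''(x)}{3!}(\Delta x)^3 + \cdots$

If we use a linear approximation:

$F(x + \Delta x) \approx F(x) + F'(x)\Delta x$

The "Zig-Zagging" nature of Gradient Descent: why?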

(14)

Minimize a Differentiable Function

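If we use a linear approximation, then Gradient Descent

If we use a quadratic approximation, then Newton's Method

Quasi-Newton

• DFP, BFGS, L-BFGS, OWL-QN

$F(x + \Delta x) = F(x) + F'(x)\Delta x + \dfrac{F''(x)}{2!}(\Delta x)^2 + \dfrac{F'''(x)}{3!}(\Delta x)^3 + \cdots$

Choose $\Delta x$ such that $F'(x)\Delta x + \dfrac{F''(x)}{2!}(\Delta x)^2$ is minimum (assuming $F''(x) > 0$):

$\dfrac{\partial}{\partial \Delta x}\left(F'(x)\Delta x + \dfrac{F''(x)}{2}(\Delta x)^2\right) = 0 \;\Rightarrow\; F'(x) + F''(x)\Delta x = 0 \;\Rightarrow\; \Delta x = -\dfrac{F'(x)}{F''(x)}$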
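A one-dimensional sketch of the Newton step $\Delta x = -F'(x) / F''(x)$, compared with plain gradient descent on the same function. The test function $F(x) = x^4 - 3x^3 + 2$ (local minimum at $x = 9/4$), the starting point 3, and the gradient descent step size 0.01 are illustrative assumptions.

```python
def newton_minimize(dF, d2F, x0, n_iters=20):
    """Repeat the Newton update x <- x - F'(x) / F''(x)."""
    x = x0
    for _ in range(n_iters):
        x = x - dF(x) / d2F(x)
    return x


def gradient_descent_1d(dF, x0, step, n_iters=20):
    """Repeat the first-order update x <- x - step * F'(x)."""
    x = x0
    for _ in range(n_iters):
        x = x - step * dF(x)
    return x


# Hypothetical test function F(x) = x^4 - 3x^3 + 2 with its first two derivatives.
def dF(x):
    return 4.0 * x ** 3 - 9.0 * x ** 2


def d2F(x):
    return 12.0 * x ** 2 - 18.0 * x


x_newton = newton_minimize(dF, d2F, x0=3.0)
x_gd = gradient_descent_1d(dF, x0=3.0, step=0.01)
print(x_newton, x_gd)
```

The sketch assumes $F''(x) > 0$ along the way, which holds for this starting point; where the second derivative is zero or negative, the plain Newton step is not a descent step.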

(15)

Regularized Logistic Regression

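L2-regularizer

$\min_w \sum_i \log\left(1 + e^{-y_i w^T x_i}\right) + \lambda \|w\|_2^2$

L1-regularizer (Sparse Logistic Regression)

$\min_w \sum_i \log\left(1 + e^{-y_i w^T x_i}\right) + \lambda \|w\|_1$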
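A small sketch of the regularized objective: the logistic loss summed over the training examples plus a penalty on the weights. The weighting coefficient `lam`, the function names, and the toy data are assumptions of this sketch, not values taken from the slides.

```python
import math


def softplus(t):
    """log(1 + exp(t)), computed without overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def regularized_objective(w, X, y, lam, penalty):
    """Logistic loss plus lam * ||w||_2^2 (penalty='l2') or lam * ||w||_1 (penalty='l1')."""
    loss = sum(softplus(-yi * dot(w, xi)) for xi, yi in zip(X, y))
    if penalty == "l2":
        reg = sum(wj * wj for wj in w)
    elif penalty == "l1":
        reg = sum(abs(wj) for wj in w)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return loss + lam * reg


# Hypothetical toy data with labels in {-1, +1}.
X = [[1.0, 2.0], [-1.0, -1.5]]
y = [1, -1]
w = [1.0, -2.0]

print(regularized_objective(w, X, y, lam=0.5, penalty="l2"))
print(regularized_objective(w, X, y, lam=0.5, penalty="l1"))
```

The L1 penalty grows linearly in each weight, which is what tends to push small weights exactly to zero and gives the sparse variant its name.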

(16)

Software

LIBLINEAR

• http://www.csie.ntu.edu.tw/~cjlin/liblinear/

(17)

Support Vector Machine

(18)

Two‐category Linearly Separable Case

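If a weight vector $w$ and bias $b$ satisfy

• $w^T x + b > 0$ for examples from the positive class,

• $w^T x + b < 0$ for examples from the negative class,

then the two classes are linearly separable, and such a weight vector is called a separating vector or a solution vector.

• Is the solution vector unique?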

(19)

Non-uniqueness of hyperplane classifier

(20)

Which one is better?

(21)

Binary Classification

Equation for hyperplane: $w^T x + b = 0$

Negative class: $w^T x + b < 0$

Positive class: $w^T x + b > 0$

(22)

Geometrical Margin

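Define $\gamma$ as the distance from $x$ to the hyperplane

• Computation: let the projection of $x$ onto the hyperplane be $x_0$, then we have

$x = x_0 + \gamma \dfrac{w}{\|w\|}$

Since $w^T x_0 + b = 0$, this gives $\gamma = \dfrac{w^T x + b}{\|w\|}$

$\tilde{\gamma} = y\gamma$: geometrical margin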
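The geometrical margin of a labeled point, $y (w^T x + b) / \|w\|$, can be sketched directly. The hyperplane and the points below are made-up numbers chosen so the arithmetic is easy to follow.

```python
import math


def geometric_margin(w, b, x, y):
    """y * (w.x + b) / ||w||: positive when x lies on the side matching its label y."""
    f = sum(wj * xj for wj, xj in zip(w, x)) + b
    return y * f / math.sqrt(sum(wj * wj for wj in w))


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0.
w, b = [3.0, 4.0], -5.0

print(geometric_margin(w, b, [3.0, 4.0], +1))
print(geometric_margin(w, b, [0.0, 0.0], -1))

# Scaling w and b by the same factor describes the same hyperplane and the same margin.
print(geometric_margin([30.0, 40.0], -50.0, [3.0, 4.0], +1))
```

Dividing by $\|w\|$ is what makes the value independent of how $w$ and $b$ are scaled.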

(23)

Geometrical Margin

[Figure: points with small and large margins relative to the hyperplane.]

If the hyperplane moves a little, points with a small margin will be affected, but points with a large margin will not.

(24)

Maximum Margin Classifier

Define the margin of a dataset as the minimum margin over its data points

Maximum margin classifier tries to achieve the maximum possible margin for a given dataset

• Thus maximizing the confidence of classifying the dataset

Goal: Find the hyperplane with the largest margin
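To make the goal concrete, the sketch below computes the margin of a dataset as the minimum geometrical margin over its points and uses it to compare two candidate hyperplanes. The data and both hyperplanes are hypothetical.

```python
import math


def geometric_margin(w, b, x, y):
    """y * (w.x + b) / ||w|| for one labeled point."""
    f = sum(wj * xj for wj, xj in zip(w, x)) + b
    return y * f / math.sqrt(sum(wj * wj for wj in w))


def dataset_margin(w, b, X, y):
    """Margin of the dataset: the smallest margin over all points."""
    return min(geometric_margin(w, b, xi, yi) for xi, yi in zip(X, y))


# Hypothetical linearly separable data with labels in {-1, +1}.
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]

margin_a = dataset_margin([1.0, 1.0], -2.0, X, y)  # hyperplane x1 + x2 = 2
margin_b = dataset_margin([1.0, 0.0], -1.5, X, y)  # hyperplane x1 = 1.5

print(margin_a, margin_b)
```

Both hyperplanes separate the data, but the first leaves a larger minimum margin, so a maximum margin classifier would prefer it over the second.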

(25)

Why Maximum Margin?

Intuitively this feels safest

If we’ve made a small error in the location of the boundary, this gives us least chance of causing misclassification

There’s some theory (using VC dimension) that is related to the proposition that this is a good thing.

Empirically it works very, very well.

(26)

Maximum Margin Classifier

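Geometrical margin is a value uniquely determined by the position of the hyperplane

If we scale $w$ and $b$, the geometrical margin $\tilde{\gamma}$ will not change as long as the hyperplane is kept fixed

Geometrical margin: $\tilde{\gamma} = \dfrac{y (w^T x + b)}{\|w\|} = \dfrac{\hat{\gamma}}{\|w\|}$, where $\hat{\gamma} = y (w^T x + b)$ is the functional margin

Margin maximized: $\max_{w, b} \tilde{\gamma}$ subject to $y_i (w^T x_i + b) \geq \hat{\gamma}$ for all $i$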

(27)

Maximum Margin Classifier

We know the functional margin $\hat{\gamma} = y (w^T x + b)$ can be made arbitrarily large without changing the hyperplane, so we simply fix $\hat{\gamma} = 1$

(28)

Maximum Margin Classifier

$\max_{w, b} \dfrac{1}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1 \text{ for all } i$

(29)

Maximum Margin Classifier

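$\min_{w, b} \dfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1 \text{ for all } i$

The square and the coefficient $\frac{1}{2}$ are added for the convenience of the derivation of the optimization; the minimizers of $\|w\|$ and $\frac{1}{2}\|w\|^2$ are obviously the same.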

(30)

Support Vector Machine

The hyperplane of maximum margin is supported by those points (vectors) on the margin. Those are called Support Vectors. Non-support vectors can move freely without affecting the position of the hyperplane as long as they don't exceed the margin.

(31)

History of SVM

SVM is related to statistical learning theory [3]

SVM was first introduced in 1992 [1]

SVM became popular because of its success in handwritten digit recognition

• 1.1% test error rate for SVM. This is the same as the error rate of a carefully constructed neural network, LeNet 4.

• See Section 5.11 in [2] or the discussion in [3] for details

SVM is now regarded as an important example of “kernel methods”, one of the key areas in machine learning

[1] Bernhard E. Boser, Isabelle M. Guyon, Vladimir N. Vapnik. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 5, 144-152, Pittsburgh, 1992.

[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th

(32)

Weakness of the Original Model

When an outlier appears, the optimal hyperplane may be pushed far away from its original/correct place. The resultant margin will also be smaller than before.

• Red solid: the original hyperplane

• Dark dashed: the new hyperplane

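Original model: $\min_{w, b} \dfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1 \text{ for all } i$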

(33)

Slack Variables

Assign a slack variable $\xi_i \geq 0$ to each data point $x_i$.

That means we allow the point to deviate from the correct margin by a distance of $\xi_i$ (actually $\xi_i / \|w\|$ when considering the geometrical margin).

(34)

New Objective Function

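Slack variables can't be arbitrarily large, so we want to minimize the sum of all slack variables

$\min_{w, b} \dfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1 - \xi_i,\ \xi_i \geq 0 \text{ for all } i$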

(35)

New Objective Function

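$\min_{w, b} \dfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1 - \xi_i,\ \xi_i \geq 0 \text{ for all } i$

For a point with slack $\xi_i$, we would pay a cost of the objective function being increased by $C\xi_i$. The parameter $C$ controls the relative weighting between the twin goals of making $\|w\|^2$ small (which makes the margin large) and of ensuring that most examples have functional margin at least 1.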
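A sketch of how the slack variables and the soft-margin objective can be evaluated for a fixed hyperplane, using the smallest feasible slack $\xi_i = \max(0, 1 - y_i (w^T x_i + b))$ for each point. The hyperplane, the data (including two deliberately misplaced negative points), and the value $C = 10$ are hypothetical.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def slacks(w, b, X, y):
    """Smallest slack per point: max(0, 1 - y_i * (w.x_i + b))."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of slacks)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))


# Hypothetical hyperplane and data; the last two negative points violate the margin.
w, b = [0.5, 0.5], -1.0
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
y = [1, 1, -1, -1, -1]

print(slacks(w, b, X, y))
print(soft_margin_objective(w, b, X, y, C=10.0))
```

A slack between 0 and 1 means the point is inside the margin but still on the correct side; a slack above 1 means it is misclassified. Increasing `C` makes each unit of slack more expensive relative to the margin term.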

(36)

Software

Lots of SVM software:

LibSVM (C++)

• http://www.csie.ntu.edu.tw/~cjlin/libsvm/

SVMLight (C)

(37)

Unconstrained Optimization Problem of SVM

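$\min_{w, b} \dfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1 - \xi_i,\ \xi_i \geq 0$

The constraints give $\xi_i \geq 1 - y_i (w^T x_i + b)$ and $\xi_i \geq 0$, so at the optimum

$\xi_i = \max\left(1 - y_i (w^T x_i + b),\ 0\right)$

Substituting and dividing by $C$:

$\min_{w, b} \sum_i \max\left(1 - y_i (w^T x_i + b),\ 0\right) + \dfrac{1}{2C}\|w\|^2$

The first term is the loss function and the second term is the regularizer.

SVM: hinge loss $\ell(f) = \max(1 - y f,\ 0)$

Linear regression: loss function $\sum_i \left(y_i - w^T x_i\right)^2$, the square loss; for $y = \pm 1$, $\ell(f) = (1 - y f)^2$

Logistic regression: loss function $\sum_i \log\left(1 + e^{-y_i w^T x_i}\right)$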
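Because the unconstrained form is a hinge loss plus a quadratic regularizer, it can be minimized with a simple subgradient method. The sketch below is one possible way to do that, not the solver used by any particular SVM package; the toy data, step size, iteration count, and $C = 10$ are all assumptions.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def svm_objective(w, b, X, y, C):
    """sum_i max(1 - y_i * (w.x_i + b), 0) + ||w||^2 / (2C)."""
    hinge = sum(max(1.0 - yi * (dot(w, xi) + b), 0.0) for xi, yi in zip(X, y))
    return hinge + dot(w, w) / (2.0 * C)


def train_svm(X, y, C=10.0, step=0.01, n_iters=2000):
    """Subgradient descent on the unconstrained soft-margin objective."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_iters):
        gw = [wj / C for wj in w]  # gradient of ||w||^2 / (2C)
        gb = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:  # hinge term is active for this point
                for j, xij in enumerate(xi):
                    gw[j] -= yi * xij
                gb -= yi
        w = [wj - step * gj for wj, gj in zip(w, gw)]
        b -= step * gb
    return w, b


# Hypothetical linearly separable data with labels in {-1, +1}.
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_svm(X, y)
print(w, b, svm_objective(w, b, X, y, C=10.0))
```

With a constant step size the iterates hover near the minimizer instead of converging exactly; a decreasing step size is the usual remedy when a precise solution is needed.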

(38)

A General formulation of classifiers

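$\min_f \sum_i \ell\left(f(x_i), y_i\right) + \lambda R(f)$

The first term is the loss function and the second term is the regularizer.

Hinge loss: $\ell = \max(1 - y f,\ 0)$ (SVM)

Square loss: $\ell = (1 - y f)^2$ (ordinary regression)

Logistic loss: $\ell = \log\left(1 + e^{-y f}\right)$ (logistic regression)

Regularizer: L2-regularizer $\|w\|_2^2$ or L1-regularizer $\|w\|_1$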
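The three losses can be compared as functions of the margin $m = y f(x)$ for labels $y = \pm 1$. This is a small illustrative sketch; the function names and the grid of margin values are arbitrary.

```python
import math


def hinge_loss(m):
    """max(1 - m, 0): zero once the margin reaches 1."""
    return max(1.0 - m, 0.0)


def square_loss(m):
    """(1 - m)^2: also penalizes margins larger than 1."""
    return (1.0 - m) ** 2


def logistic_loss(m):
    """log(1 + exp(-m)), computed without overflow."""
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


for m in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(m, hinge_loss(m), square_loss(m), logistic_loss(m))
```

All three penalize negative margins, but they differ on confidently correct points: the hinge loss is exactly zero for $m \geq 1$, the logistic loss decays toward zero without reaching it, and the square loss grows again for $m > 1$.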

(39)