Advanced Topics in Learning and Vision

(1)


Ming-Hsuan Yang


(2)

Announcements

• More course material available on the course web page

• Project web pages: Everyone needs to set up one (format details will soon be available)

• Reading (due Nov 1):

- RVM application for 3D human pose estimation [1].

- Viola and Jones: Adaboost-based real-time face detector [25].

- Viola et al: Adaboost-based real-time pedestrian detector [26].


(3)

Overview

• Linear classifier: Fisher linear discriminant (FLD), linear support vector machine (SVM).

• Nonlinear classifier: nonlinear support vector machine.

• SVM regression, relevance vector machine (RVM).

• Kernel methods: kernel principal component analysis, kernel discriminant analysis.

• Ensemble of classifiers: Adaboost, bagging, ensemble of homogeneous/heterogeneous classifiers.

(4)

Fisher Linear Discriminant

• Based on Gaussian distribution: between-class and within-class scatter matrices.

• Supervised learning for multiple classes.

• Trick often used in ridge regression: $S_w' = S_w + \lambda I$, without using PCA.

• Fisherfaces vs. Eigenfaces (FLD vs. PCA).

• Further study: Aleix Martinez’s paper on analyzing FLD for object recognition [10] [11].


(5)

Intuitive Justification for SVM

Two things come into picture:

• what kind of function do we use for classification?

(6)

Support Vector Machine

• Objective: To find an optimal hyperplane that correctly classifies as many data points as possible and separates the points of the two classes as far as possible.

• Approach: Formulate a constrained optimization problem, then solve it using constrained quadratic programming (constrained QP).

• Theorem: Structural Risk Minimization.

• Issues: VC dimension, linear separability, feature space, multiple class.

(7)

Main Idea

• Given a set of data points which belong to either of two classes, an SVM finds the hyperplane:

- leaving the largest possible fraction of points of the same class on the same side.

- and maximizing the distance of either class from the hyperplane.

• Find the optimal separating hyperplane that minimizes the risk of misclassifying the training samples and unseen test samples.

• Question: Why do we need this? Does it work well in test samples (i.e., generalization)?

• Answer: Structural Risk Minimization.

(8)

Vapnik-Chervonenkis (VC) Dimension

• A property of a set of functions (hypothesis space H, learning machine, learner) {f (α)} (we use α as a generic set of parameters: a choice of α specifies a particular function).

• Shattered: if a given set of N points can be labeled in all possible 2^N ways, and for each labeling a member of the set {f (α)} can be found which correctly assigns those labels, we say that the set of points is shattered by that set of functions.

• The functions f (α) are usually called hypotheses, and the set {f (α) : α ∈ Λ} is called the hypothesis space and denoted by H.

• The VC dimension of the set of functions {f (α)}, i.e., of the hypothesis space H, is defined as the maximum number of training points that can be shattered by H.

• Frequently used to find the sample complexity, mistake bound algorithm, neural network capacity, computational learning theory, etc.

(9)

• Example:

• While it is possible to find a set of 3 points that can be shattered by the set of oriented lines, no set of 4 points can be shattered (for any 4 points, at least one labeling cannot be realized). Thus the VC dimension of the set of oriented lines in R² is 3.
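The shattering argument can be checked numerically on small point sets. The sketch below is illustrative only: it uses a perceptron (not part of the slides) to search for a separating line for every labeling, and the helper names `separable` and `count_separable` as well as the point sets are assumptions chosen for this example. A search that fails within the epoch cap suggests, but does not prove, that no separating line exists.

```python
from itertools import product


def separable(points, labels, max_epochs=1000):
    """Perceptron search for (w, b) with y_i * (w . x_i + b) > 0 for all i."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                # Standard perceptron update on a misclassified point.
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def count_separable(points):
    """Number of the 2^N labelings that some oriented line realizes."""
    labelings = product([-1, 1], repeat=len(points))
    return sum(separable(points, labels) for labels in labelings)


three_points = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(count_separable(three_points), "of", 2 ** 3)
print(count_separable(square), "of", 2 ** 4)
```

For the three non-collinear points every labeling is realized, while for the corners of the unit square the two XOR labelings are not, so that particular set of 4 points is not shattered.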

(10)

Linear Classifiers

• Instance space: X = Rⁿ.

• Set of class labels: Y = {+1, −1}.

• Training data set: {(x₁, y₁), . . . , (x_N, y_N)}.

• Hypothesis space:

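$$H_{lin}(n) = \{h : \mathbb{R}^n \to Y \mid h(x) = \mathrm{sign}(w \cdot x + b),\ w \in \mathbb{R}^n,\ b \in \mathbb{R}\}$$

$$\mathrm{sign}(w \cdot x + b) = \begin{cases} +1 & \text{if } w \cdot x + b > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$

• VC(H_lin(2)) = 3, VC(H_lin(n)) = n + 1.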

• Idea:

- First, fix the form of the hypothesis h ∈ H_lin(n).

- Then choose the parameters (w, b) that minimize the error on unseen examples.

(11)

Expected Risk

Suppose we are given N observations. Each observation consists of a pair: a vector $x_i \in \mathbb{R}^n$, $i = 1, \ldots, N$, and the associated class label $y_i$. Assume there exists some unknown probability distribution $P(x, y)$ from which these data points are drawn, i.e., the data points are assumed to be independently drawn and identically distributed.

Consider a machine whose task is to learn the mapping $x_i \to y_i \in \{-1, 1\}$. The machine is defined by a set of possible mappings $x \to f(x, \alpha) \in \{-1, 1\}$, where the functions $f(x, \alpha)$ themselves are labeled by the adjustable parameters $\alpha$.

Expected Risk: The expectation of the test error for a trained machine is

$$R(\alpha) = \int \frac{1}{2} |y - f(x, \alpha)| \, dP(x, y) \qquad (2)$$

(12)

Upper Bound for Expected Risk

The empirical risk $R_{emp}$ is

$$R_{emp}(\alpha) = \frac{1}{2N} \sum_{i=1}^{N} |y_i - f(x_i, \alpha)| \qquad (3)$$

Under the PAC (Probably Approximately Correct) model, Vapnik shows that the following bound on the expected risk holds with probability $1 - \eta$ ($0 \le \eta \le 1$):

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}} \qquad (4)$$

where h is the VC dimension of f (x, α) (i.e., of H). The second term on the right-hand side is called the VC confidence.
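The two terms of the bound are easy to evaluate for given numbers. The following is a minimal sketch, assuming labels and predictions in {−1, +1} and η > 0; the function names `empirical_risk`, `vc_confidence`, and `risk_bound` are hypothetical and chosen only for this illustration.

```python
import math


def empirical_risk(labels, predictions):
    """R_emp = (1 / 2N) * sum_i |y_i - f(x_i)| for values in {-1, +1}."""
    n = len(labels)
    return sum(abs(y - f) for y, f in zip(labels, predictions)) / (2 * n)


def vc_confidence(h, n, eta):
    """Square-root term of the bound: depends on h, N and eta only."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def risk_bound(labels, predictions, h, eta=0.05):
    """Right-hand side of the bound: empirical risk plus VC confidence."""
    return empirical_risk(labels, predictions) + vc_confidence(h, len(labels), eta)


# One of four predictions is wrong, so the empirical risk is 1/4.
print(empirical_risk([1, -1, 1, -1], [1, 1, 1, -1]))
# The VC confidence grows with h for a fixed number of samples.
print(vc_confidence(3, 1000, 0.05), vc_confidence(30, 1000, 0.05))
```

For a fixed N, the VC confidence increases with h (as long as h < 2N), which is why a richer hypothesis space needs more data to give the same guarantee.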

(13)

Implications of the Bound for Expected Risk

$$\sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}}$$

• To achieve small expected risk, that is good generalization performance ⇒ both the empirical risk and the ratio between VC dimension and the number of data points have to be small.

• Since the empirical risk is usually a decreasing function of h, it turns out that, for a given number of data points, there is an optimal value of the VC dimension.

(14)

$$\sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}}$$

(15)

Implications of the bound for R(α) (cont)

• The choice of an appropriate value for h (which in most techniques is controlled by the number of free parameters of the model) is crucial in order to get good performance, especially when the number of data points is small.

• When using a multilayer perceptron or a radial basis function network, this is equivalent to the problem of finding an appropriate number of hidden units.

• This is known to be difficult, and it is usually solved by cross-validation techniques.

(16)

Structural Risk Minimization (SRM)

• It is not enough just to minimize the empirical risk as often done by most neural networks.

• Need to overcome the problem of choosing an appropriate VC dimension.

• Structural Risk Minimization Principle: To make the expected risk small, both terms on the right-hand side of (4) should be small.

• Minimize the empirical risk and VC confidence simultaneously:

$$\min_{H_n} \left( R_{emp}(\alpha) + \sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}} \right) \qquad (5)$$

• Introduce “structure” by dividing the entire class of functions into nested subsets.

• For each subset, we must be able to compute h or a bound of h.

(17)

• Then, SRM consists of finding that subset of functions which minimizes the bound on the actual risk.

• The problem of selecting the right subset for a given number of observations is referred to as capacity control.

• A trade-off between reducing the training error and limiting model complexity.

• Occam's razor: models should be no more complex than is sufficient to explain the data.

• “Things should be made as simple as possible – but not any simpler” (Albert Einstein).

(18)

Structural Risk Minimization (cont)

• To implement SRM, one needs the nested structure of hypothesis space:

H₁ ⊂ H₂ ⊂ . . . ⊂ H_n ⊂ . . .

with the property that h(n) ≤ h(n + 1) where h(n) is the VC dimension of H_n.

(19)

• A learning machine with larger complexity (higher VC dimension) ⇒ small empirical risk.

• A simpler learning machine (lower VC dimension) ⇒ low VC confidence.

• SRM picks a trade-off between VC dimension and empirical risk such that the risk bound is minimized.

• Problems:

- It is usually difficult to compute the VC dimension of H_n, and there are only a small number of models for which we know how to compute the VC dimension.

- Even when the VC dimension of H_n is known, it is not easy to solve the minimization problem. In most cases, one will have to minimize the empirical risk for every set H_n, and then choose the H_n that minimizes the bound (5).

(20)

Support Vector Machine (SV Algorithm)

One implementation based on Structural Risk Minimization theory

• Each particular choice of structure gives rise to a learning algorithm, consisting of performing SRM in the given structure of sets of functions.

• The SVM algorithm is based on a structure on the set of separating hyperplanes.

• For a set of hyperplane functions,

H₁ ⊂ H₂ ⊂ . . . ⊂ H_n ⊂ . . .

(21)

Margin

The margin $\gamma_i$ of a point $x_i$ with respect to a linear classifier $h(x) = \mathrm{sign}(w \cdot x + b)$ is defined as the distance of $x_i$ from the hyperplane $w \cdot x + b = 0$:

$$\gamma_i = \frac{|w \cdot x_i + b|}{\|w\|} \qquad (6)$$

The margin of a set of points $\{x_1, \ldots, x_N\}$ is defined as the margin of the point closest to the hyperplane:

$$\gamma = \min_i \gamma_i = \min_i \frac{|w \cdot x_i + b|}{\|w\|} \qquad (7)$$
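As a small worked illustration of the margin definition, the sketch below computes the distance of each point from a hyperplane and takes the minimum over the set. The function names `point_margin` and `set_margin` and the sample values are assumptions made for this example; the absolute value is used so that the result is a distance.

```python
import math


def point_margin(w, b, x):
    """Distance of x from the hyperplane w . x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))


def set_margin(w, b, points):
    """Margin of a point set: the distance of its closest point."""
    return min(point_margin(w, b, x) for x in points)


w, b = (3.0, 4.0), -5.0          # ||w|| = 5
points = [(3.0, 4.0), (1.0, 2.0), (0.0, 0.0)]
print([point_margin(w, b, x) for x in points])
print(set_margin(w, b, points))
```

Rescaling (w, b) by a positive constant leaves these distances unchanged, which is what makes the canonical normalization used later possible.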

(22)

The SV Algorithm

Consider a hyperplane parameterized by w and b, $(w, b) \in S \times \mathbb{R}$,

$$\{x \in S : (w \cdot x) + b = 0\} \qquad (8)$$

If we additionally require

$$\min_{i=1,\ldots,N} |(w \cdot x_i) + b| = 1 \qquad (9)$$

i.e., that the scaling of w and b be such that the point closest to the hyperplane has a distance of $1/\|w\|$, then the margin between the two classes, measured perpendicular to the hyperplane, is at least $2/\|w\|$.

(23)

Proposition 1. (Vapnik 95) Let R be the radius of the smallest ball $B_R(a) = \{x \in F : \|x - a\| < R\}$ ($a \in F$) containing the points $x_1, x_2, \ldots, x_N$, and let

$$f_{w,b} = \mathrm{sign}((w \cdot z) + b) \qquad (10)$$

be canonical hyperplane decision functions defined on these points. Then the set $\{f_{w,b} : \|w\| \le A\}$ has a VC dimension h satisfying

$$h < R^2 A^2 + 1 \qquad (11)$$

In other words,

Maximizing the margin $1/\|w\|$

⇒ Minimizing $\|w\|$

⇒ smallest acceptable VC dimension

⇒ Constructing an optimal hyperplane is an implementation of SRM!

(24)

Linear Support Vector Machine

Given a set S of points $x_i \in \mathbb{R}^n$ with $i = 1, 2, \ldots, N$. Each point $x_i$ belongs to either of two classes with the label $y_i \in \{-1, 1\}$.

Definition 1. The set S is linearly separable if there exist $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that

$$y_i(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N. \qquad (12)$$

The pair (w, b) defines a hyperplane of equation $w \cdot x + b = 0$ named the separating hyperplane. The signed distance $d_i$ of a point $x_i$ from the separating hyperplane (w, b) is given by

$$d_i = \frac{w \cdot x_i + b}{\|w\|} \qquad (13)$$

With (12) and (13), for all $x_i \in S$, we have

$$y_i d_i \ge \frac{1}{\|w\|} \qquad (14)$$

(25)

Linear SVM (cont)

$$\forall x_i \in S: \quad y_i d_i \ge \frac{1}{\|w\|}$$

Therefore, $1/\|w\|$ is the lower bound on the distance between the points $x_i$ and the separating hyperplane (w, b).

Definition 2. The canonical representation of the separating hyperplane is obtained by rescaling the pair (w, b) into the pair (w', b') such that the distance of the closest point, say $x_j$, equals $1/\|w'\|$.

Definition 3. Given a linearly separable set S, the optimal separating hyperplane (OSH) is the separating hyperplane for which the distance of the closest point of S is maximum (i.e., maximize $1/\|w'\|$).

(26)

Constrained Quadratic Programming

Problem 1.

$$\text{Minimize } \frac{1}{2} w \cdot w \quad \text{subject to } y_i(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N$$

Let $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ be the N nonnegative Lagrange multipliers associated with the constraints in Problem 1. The solution to Problem 1 is equivalent to determining the saddle point of the function

$$L_P = \frac{1}{2} w \cdot w - \sum_{i=1}^{N} \alpha_i \left( y_i(w \cdot x_i + b) - 1 \right)$$

with $L_P = L(w, b, \alpha)$.

(27)

Solving Constrained QP

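At the saddle point, $L_P$ has a minimum at the optimal values $w = \bar{w}$ and $b = \bar{b}$, requiring

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{N} y_i \alpha_i = 0 \qquad (15)$$

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{N} \alpha_i y_i x_i \qquad (16)$$

with

$$\frac{\partial L}{\partial w} = \left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \cdots, \frac{\partial L}{\partial w_n} \right)$$

Since these are equality constraints in the dual formulation, we can substitute them into $L_P$ to give

$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \qquad (17)$$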

(28)

Solving Constrained QP using Dual

Problem 2.

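$$\text{Maximize } -\frac{1}{2} \alpha^T D \alpha + \sum_{i=1}^{N} \alpha_i \quad \text{subject to } \sum_{i=1}^{N} y_i \alpha_i = 0, \quad \alpha \ge 0$$

where D is an N × N matrix such that

$$D_{ij} = y_i y_j \, x_i \cdot x_j \qquad (18)$$

For the solution at the saddle point, (w, b), it follows from Problem 2 that

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \qquad (19)$$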

(29)

Solving Constrained QP Using Dual

b can be determined from α, which is a solution of the dual problem, and from the Kuhn-Tucker conditions

$$\alpha_i \left( y_i(w \cdot x_i + b) - 1 \right) = 0, \quad i = 1, \ldots, N \qquad (20)$$

Recall (16) and the constraints in Problem 1:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{N} \alpha_i y_i x_i$$

$$y_i(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N.$$

Note that the only $\alpha_i$ that can be nonzero in (20) are those for which the constraints (12) are satisfied with the equality sign.

(30)

Support Vectors

$$y_i(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N, \qquad w = \sum_{i=1}^{N} \alpha_i y_i x_i$$

Most of the constraints in (12) are satisfied with strict inequality, i.e., most $\alpha_i$ solved from the dual are zero.

⇒ the vector w is a linear combination of a relatively small percentage of the points $x_i$.

⇒ these points are termed support vectors because they are the points closest to the OSH and the only points of S needed to determine the OSH (name of the game).

The problem of classifying a new data point x is now simply solved by computing

$$\mathrm{sign}(w \cdot x + b)$$
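To make the role of the multipliers concrete, the sketch below recovers (w, b) from a given dual solution and classifies new points. It is only an illustration: the three training points and the multipliers α = (0.5, 0.5, 0) are a toy example worked out by hand (they satisfy the Kuhn-Tucker conditions for this data), and the helper names `recover_hyperplane` and `classify` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recover_hyperplane(alphas, xs, ys, tol=1e-12):
    """Build w = sum_i alpha_i y_i x_i and get b from one support vector."""
    dim = len(xs[0])
    w = [sum(a * y * x[k] for a, x, y in zip(alphas, xs, ys)) for k in range(dim)]
    support = [i for i, a in enumerate(alphas) if a > tol]
    # For a support vector, y_i (w . x_i + b) = 1, and y_i is +1 or -1,
    # so b = y_i - w . x_i.
    i = support[0]
    b = ys[i] - dot(w, xs[i])
    return w, b, support


def classify(w, b, x):
    return 1 if dot(w, x) + b > 0 else -1


xs = [(1.0, 0.0), (-1.0, 0.0), (3.0, 0.0)]
ys = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]   # hand-derived toy solution; third point is not a support vector

w, b, support = recover_hyperplane(alphas, xs, ys)
print(w, b, support)
print(classify(w, b, (2.0, 5.0)), classify(w, b, (-0.5, 1.0)))
```

The point (3, 0) lies beyond the margin, gets a zero multiplier, and could be removed from the training set without changing the hyperplane.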

(31)

Soft Margin Classifier

In the case that the set S is not linearly separable, or one simply ignores whether or not the set S is linearly separable, the previous analysis can be generalized by introducing N nonnegative variables $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$ such that

$$y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, N \qquad (21)$$

Purpose: to allow for a small number of misclassified points for better generalization or computational efficiency.

(32)

Generalized OSH

The generalized OSH is then regarded as the solution to the following problem.

Problem 3.

$$\text{Minimize } \frac{1}{2} w \cdot w + C \sum_{i=1}^{N} \xi_i \quad \text{subject to } y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, N, \quad \xi \ge 0$$

Role of C:

• as a regularization parameter (cf. Radial Basis Function, fitting).

• large C ⇒ minimize the number of misclassified points.

• small C ⇒ maximize the minimum distance $1/\|w\|$.
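The effect of C can be seen by evaluating the soft-margin objective for a fixed hyperplane. This is a minimal sketch under stated assumptions: for a given (w, b) the smallest feasible slack is ξ_i = max(0, 1 − y_i(w · x_i + b)), and the data, the hyperplane, and the names `slacks` and `soft_margin_objective` are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, xs, ys):
    """Smallest xi_i >= 0 satisfying y_i (w . x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]


def soft_margin_objective(w, b, xs, ys, C):
    """Value of (1/2) w . w + C * sum_i xi_i for a fixed hyperplane."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, xs, ys))


w, b = (1.0, 0.0), 0.0
xs = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (0.5, 0.0)]
ys = [1, 1, -1, -1]

print(slacks(w, b, xs, ys))
print(soft_margin_objective(w, b, xs, ys, C=1.0))
print(soft_margin_objective(w, b, xs, ys, C=10.0))
```

Here the second point is inside the margin (0 < ξ ≤ 1) and the fourth is misclassified (ξ > 1); raising C makes these violations dominate the objective.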

(33)

Dual Problem

Problem 4.

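$$\text{Maximize } -\frac{1}{2} \alpha^T D \alpha + \sum_{i=1}^{N} \alpha_i \quad \text{subject to } \sum_{i=1}^{N} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, N$$

As before,

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i$$

and b can be determined from α, the solution of the dual Problem 4, and from the new Kuhn-Tucker conditions

$$\alpha_i \left( y_i(w \cdot x_i + b) - 1 + \xi_i \right) = 0 \qquad (22)$$

$$(C - \alpha_i) \xi_i = 0 \qquad (23)$$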

(34)

$$(C - \alpha_i) \xi_i = 0$$

Two cases:

• $\alpha_i < C$

⇒ $\xi_i = 0$

⇒ the support vectors lie at a distance $1/\|w\|$ from the OSH

⇒ called margin vectors

• $\alpha_i = C$

1. $\xi_i > 1$: misclassified points

2. $0 < \xi_i \le 1$: points correctly classified but closer than $1/\|w\|$ to the OSH

3. $\xi_i = 0$: margin vectors (rare case)

Neglecting the last rare case, we refer to all the support vectors for which $\alpha_i = C$ as errors. All the points that are not support vectors are correctly classified and lie outside the margin strip.

(35)

Nonlinear Support Vector Machine

• Note that the only way the data points appear in the training problem is in the form of dot products x_i · x_j.

• In higher dimensional space (feature space), it is very likely that a linear separator (hyperplane) can be constructed.

• E.g.: we map the data points from the input space Rⁿ to some feature space of higher dimension, R^m (m > n), using a function Φ:

Φ : Rⁿ → R^m

Example:

(36)

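The training algorithm then depends on the data only through dot products in the feature space, i.e., on functions of the form $\Phi(x_i) \cdot \Phi(x_j)$. In other words, the linear decision function

$$f(x) = w \cdot x + b = \sum_{i=1}^{N} y_i \alpha_i \, x_i \cdot x + b$$

becomes

$$f(x) = w \cdot \Phi(x) + b = \sum_{i=1}^{N} y_i \alpha_i \, \Phi(x_i) \cdot \Phi(x) + b$$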

• But the transformation operator, Φ, is computationally expensive.

• If there were a “kernel function” K such that K(x_i, x_j) = Φ(x_i) · Φ(x_j), we would only need to use K in the training algorithm.

• One example: $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$.

• All the previous derivations in linear SVM hold (substituting the dot product with the kernel function), since we are still doing a linear separation, but in a different space.
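The kernel idea can be verified on a case where Φ is small enough to write out. The sketch below is an illustration not taken from the slides: it assumes 2D inputs, the homogeneous quadratic kernel K(x, y) = (x · y)², and the explicit map Φ(x) = (x₁², √2·x₁x₂, x₂²), and checks that the kernel value equals the dot product in feature space.

```python
import math


def phi(x):
    """Explicit feature map R^2 -> R^3 for the kernel (x . y)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def quadratic_kernel(x, y):
    """K(x, y) = (x . y)^2, computed without leaving input space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
print(quadratic_kernel(x, y))   # kernel evaluated in input space
print(dot(phi(x), phi(y)))      # same quantity via the explicit map
```

The two printed values agree up to floating-point rounding, while the kernel evaluation never forms the 3D feature vectors.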

(37)

• Map the training data nonlinearly into a higher-dimensional feature space via Φ, and construct a separating hyperplane with maximum margin there.

• This yields a nonlinear decision boundary in input space. By the use of a kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.

Figure: Mapping between input space and feature space [9].

(38)

Mercer’s Condition for Kernel Function

• The idea of constructing support vector networks comes from considering general forms of the dot product in a Hilbert space:

$$\Phi(u) \cdot \Phi(v) \equiv K(u, v) \qquad (24)$$

• Question: For which kernels does there exist a pair {H, Φ} such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$?

• Answer: Mercer’s condition. It tells us whether or not a prospective kernel is actually a dot product in some space.

• According to the Hilbert-Schmidt theory, any symmetric function K(u, v), with K(u, v) ∈ L₂, can be expanded in the form

$$K(u, v) = \sum_i \lambda_i \Phi_i(u) \Phi_i(v) \qquad (25)$$

(39)

where $\lambda_i \in \mathbb{R}$ and $\Phi_i$ are the eigenvalues and eigenfunctions

$$\int K(u, v) \Phi_i(u) \, du = \lambda_i \Phi_i(v)$$

of the integral operator defined by the kernel K(u, v).

• A sufficient condition to ensure that (24) defines a dot product in a feature space is that all the eigenvalues in the expansion (25) are positive. To guarantee that these coefficients are positive, it is necessary and sufficient (Mercer’s theorem) that the condition

$$\iint K(u, v) g(u) g(v) \, du \, dv > 0$$

is satisfied for all g such that

$$\int g^2(u) \, du < \infty$$

(40)

Some Kernel Functions in SVM

Simple dot product:

$$K(x, y) = x \cdot y$$

Vovk’s polynomial:

$$K(x, y) = (x \cdot y + 1)^p$$

Radial basis function (RBF):

$$K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$$

Two-layer neural network:

$$K(x, y) = \tanh(\kappa \, x \cdot y - \delta)$$
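The four kernels listed above translate directly into code. The following is a minimal sketch; the function names, the default parameter values (p, σ, κ, δ), and the `gram_matrix` helper are assumptions made for illustration, not part of the slides.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, p=2):
    return (dot(x, y) + 1) ** p


def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def tanh_kernel(x, y, kappa=1.0, delta=0.0):
    return math.tanh(kappa * dot(x, y) - delta)


def gram_matrix(kernel, points):
    """Matrix of kernel values K(x_i, x_j), the quantity used in the dual."""
    return [[kernel(a, b) for b in points] for a in points]


points = [(1.0, 2.0), (3.0, 4.0), (0.0, -1.0)]
for row in gram_matrix(rbf_kernel, points):
    print(row)
```

Replacing the dot products $x_i \cdot x_j$ in the matrix D of the dual problem with entries of such a Gram matrix is all that changes when moving from the linear to the nonlinear SVM.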

(41)

Some Properties of SVM

Decision surface in (a) by a polynomial classifier, and in (b) by a RBF where the support vectors are indicated in dark fill. Note the reduced number of them and their position close to the boundary. In (b), the support vectors are the

(42)

Architecture for General Support Vector Machine

• Linear SVM: Kernel function is just a dot product in input space

• Nonlinear SVM: Need to choose an appropriate kernel function

(43)

Multiple Classes

For K-class pattern recognition problem, several approaches have been proposed

• One-against-the-rest: construct a hyperplane between class k and the K − 1 other classes ⇒ K SVMs.

• One-against-one: construct a hyperplane for any two classes ⇒ K(K − 1)/2 SVMs.

• K-class SVM: by Watkins.

• John Platt’s DAG method

(44)

SVM: Training and Testing

• Training: Solve a complex quadratic optimization problem

- Speed up: Chunking, Sequential Minimal Optimization (SMO).

• Testing: The number of support vectors may be large

- Speed up: Reduced set of support vectors, etc.

(45)

SVM Regression

• Introduce ε-insensitive loss

$$|y - f(x)|_\varepsilon = \max\{0, |y - f(x)| - \varepsilon\} \qquad (26)$$

• Estimate $f(x) = w \cdot x + b$ by minimizing

$$\frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} |y_i - f(x_i)|_\varepsilon \qquad (27)$$
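The ε-insensitive loss and the regression objective can be evaluated directly for a candidate (w, b). The sketch below is illustrative only; the names `eps_insensitive` and `svr_objective` and the sample values are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def eps_insensitive(y, f, eps):
    """|y - f|_eps: zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(y - f) - eps)


def svr_objective(w, b, xs, ys, C, eps):
    """(1/2) ||w||^2 + (C / N) * sum_i |y_i - f(x_i)|_eps with f(x) = w . x + b."""
    n = len(xs)
    loss = sum(eps_insensitive(y, dot(w, x) + b, eps) for x, y in zip(xs, ys))
    return 0.5 * dot(w, w) + (C / n) * loss


print(eps_insensitive(1.0, 1.05, 0.1))   # residual inside the tube
print(eps_insensitive(1.0, 1.5, 0.1))    # residual outside the tube
print(svr_objective((1.0,), 0.0, [(1.0,), (2.0,)], [1.0, 3.0], C=2.0, eps=0.5))
```

Residuals smaller than ε cost nothing, which is what lets many training points drop out of the solution.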

(46)

Relevance Vector Machine [Tipping NIPS00]

• Main idea [22]

- Extend SVM to have probabilistic interpretations.

- Further sparsify and have fewer support vectors, i.e., relevance vectors (prototypes).

- Bayesian learning: put priors over parameters (i.e., introducing hyperparameters).

• Approach: Put priors on the target values $y_i$ and the weights $w_i$.

$$p(y \mid w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \|y - \Phi w\|^2 \right\} \qquad (28)$$

$$p(w \mid \beta) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \beta_i^{-1}) \qquad (29)$$

where Φ is a design matrix with $\Phi_{nm} = K(x_n, x_{m-1})$ and $\Phi_{n1} = 1$, and β is a vector of N + 1 hyperparameters.

• Hyperparameters: σ and β.

(47)

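• With that, we have the posterior of the weights

$$p(w \mid y, \beta, \sigma^2) = (2\pi)^{-(N+1)/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (w - \mu)^T \Sigma^{-1} (w - \mu) \right\} \qquad (30)$$

$$\Sigma = (\Phi^T B \Phi + A)^{-1}, \qquad \mu = \Sigma \Phi^T B y \qquad (31)$$

where A is the diagonal matrix $A = \mathrm{diag}(\beta_0, \beta_1, \ldots, \beta_N)$ and $B = \sigma^{-2} I$.

• Integrating out the weights, we have the marginal likelihood for the hyperparameters:

$$p(y \mid \beta, \sigma^2) = (2\pi)^{-N/2} |B^{-1} + \Phi A^{-1} \Phi^T|^{-1/2} \exp\left\{ -\frac{1}{2} y^T (B^{-1} + \Phi A^{-1} \Phi^T)^{-1} y \right\} \qquad (32)$$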

(48)

• For the function sinc(x) = |x|⁻¹ sin |x| with 100 training samples, SVM regression uses 39 support vectors, whereas the RVM uses 9 relevance vectors.

• Williams et al. apply RVM to learn a regression function of in-plane image displacement for visual tracking [27] [28].

• Agarwal and Triggs apply RVM to learn a regression function of human joint angles from silhouette images [1].

(49)

Applications

• Pattern Recognition: handwritten digit recognition, 3D object recognition, face detection, face recognition, pedestrian detection, gender classification, visual tracking, expression recognition, speaker identification, text classification.

• Regression: time series prediction, relevance vector machine

• Signal Processing: seismic signal classification, density estimation, DNA sequence classification.

(50)

Review: Optical Flow

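Let image intensity be given by I(x, y, t). Using a first-order Taylor expansion,

$$I(x + dx, y + dy, t + dt) = I(x, y, t) + \frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt + \text{h.o.t.} \qquad (33)$$

Based on the brightness constancy assumption,

$$I(x + dx, y + dy, t + dt) = I(x, y, t) \qquad (34)$$

Thus

$$\frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt = 0 \qquad (35)$$

Let

$$\frac{dx}{dt} = u, \quad \frac{dy}{dt} = v \qquad (36)$$

We have

$$-\frac{\partial I}{\partial t} = \frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v \qquad (37)$$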

(51)

Lucas-Kanade

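$$E(u, v) = \sum_{x, y \in ROI} \left( I(x + u, y + v, t + dt) - I(x, y, t) \right)^2 \qquad (38)$$

Minimize E(u, v):

$$\begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} -\sum I_t I_x \\ -\sum I_t I_y \end{pmatrix} \qquad (39)$$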
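The 2 × 2 Lucas-Kanade system can be solved in closed form once the image derivatives over the region of interest are available. The sketch below is a minimal illustration: the function name `lucas_kanade`, the flat lists of derivatives, and the synthetic values are assumptions, and the temporal derivative is generated from a known motion so that the solution can be checked.

```python
def lucas_kanade(ix, iy, it):
    """Solve the 2x2 normal equations for (u, v) by Cramer's rule.

    ix, iy, it are lists of the x, y and t derivatives at the ROI pixels.
    """
    sxx = sum(a * a for a in ix)
    sxy = sum(a * b for a, b in zip(ix, iy))
    syy = sum(b * b for b in iy)
    bx = -sum(a * t for a, t in zip(ix, it))
    by = -sum(b * t for b, t in zip(iy, it))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        # Gradients all point the same way: the motion is not determined.
        raise ValueError("singular system (aperture problem)")
    u = (bx * syy - sxy * by) / det
    v = (sxx * by - sxy * bx) / det
    return u, v


# Synthetic derivatives consistent with a known motion (u, v) = (0.5, -0.25).
ix = [1.0, 0.0, 1.0, 2.0]
iy = [0.0, 1.0, 1.0, -1.0]
true_u, true_v = 0.5, -0.25
it = [-(a * true_u + b * true_v) for a, b in zip(ix, iy)]

print(lucas_kanade(ix, iy, it))
```

The determinant check matters in practice: in a region whose gradients are all parallel, the matrix is singular and only the motion component along the gradient can be estimated.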

(52)

SVM Tracking [Avidan CVPR01]

• Incorporate optical flow with SVM [2] [3].

• Estimate the displacement direction using optical flow

• Use the SVM score to determine the most likely target location

• Given I, use a first-order Taylor expansion

$$I^* = I + u I_x + v I_y \qquad (40)$$

where $I_x$, $I_y$ are the x and y derivatives of I, and u, v are motion parameters.

• With the SVM score

$$f(x) = \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) + b$$

$$f(I^*) = \sum_{i=1}^{N} y_i \alpha_i K(I + u I_x + v I_y, x_i) + b \qquad (41)$$

(53)

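• Use a dot-product kernel for K, here $K(x, y) = (x^T y)^2$, and maximize the above function

$$E(u, v) = \sum_{i=1}^{N} y_i \alpha_i \left( (I + u I_x + v I_y)^T x_i \right)^2 \qquad (42)$$

• Take partial derivatives w.r.t. u and v

$$\frac{\partial E}{\partial u} = 2 \sum_{i=1}^{N} y_i \alpha_i \, (I_x^T x_i) \, (I + u I_x + v I_y)^T x_i = 0 \qquad (43)$$

$$\frac{\partial E}{\partial v} = 2 \sum_{i=1}^{N} y_i \alpha_i \, (I_y^T x_i) \, (I + u I_x + v I_y)^T x_i = 0 \qquad (44)$$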

(54)

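Conditions (43) and (44) give the linear system

$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \qquad (45)$$

where

$$\begin{aligned}
A_{11} &= \sum_{i=1}^{N} \alpha_i y_i (x_i^T I_x)^2 \\
A_{12} = A_{21} &= \sum_{i=1}^{N} \alpha_i y_i (x_i^T I_x)(x_i^T I_y) \\
A_{22} &= \sum_{i=1}^{N} \alpha_i y_i (x_i^T I_y)^2 \\
b_1 &= -\sum_{i=1}^{N} \alpha_i y_i (x_i^T I_x)(x_i^T I) \\
b_2 &= -\sum_{i=1}^{N} \alpha_i y_i (x_i^T I_y)(x_i^T I)
\end{aligned} \qquad (46)$$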

• Similar to optical flow in equation form

• Also use pyramid to search over scale

(55)

Life Beyond SVM

• Mistake-Bound On-Line Learning: Winnow, SNoW

• Ensemble of Homogeneous Classifiers: Boosting, Bagging

• Ensemble of Heterogeneous Classifiers: Kittler’s method

• Random Subspace Method: Monte Carlo approach

• Kernel methods: Kernel PCA, Kernel Fisher Linear Discriminant

• Generative Models, Graphical Models, Nonlinear PCA, Probabilistic PCA, Mixture of Probabilistic PCA, etc.

• Maximum entropy approach

(56)

SVM vs. SNoW on Face Detection

A benchmark on SVM and SNoW based on 5732 training samples and 500 testing samples using a Sun Ultra Sparc 10. Each sample is a 20 × 20 image.

                     Nonlinear SVM   Linear SVM   SNoW
Training Accuracy    100%            100%         100%
Testing Accuracy     100%            96%          97%
Memory Requirement   83MB            24MB         7MB
Wall Clock Time      5.8hr           3.7hr        0.6hr

(57)

Concluding Remarks

Pros

• Optimal hyperplane.

• Some kernels have infinite VC dimension.

• Can deal with high dimensional data.

Cons

• Numerical stability problems in solving constrained QP.

• Usually require positive/negative examples.

(58)

References

Introductory articles: [9] [21] [16] [14] [4] [6]

Book: [8] [23] [24] [18] [7]

Ph.D. Thesis: [5] [17] [20]

Vision papers: [16] [14] [12] [15]

Comparison with RBF: [19]

Kernel Machines web site: http://www.kernel-machines.org

Boosting web site: http://www.boosting.org/

References

[1] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 882–888, 2004.

[2] S. Avidan. Support vector tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 184–191, 2001.

[3] S. Avidan. Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1064–1072, 2004.

[4] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.

(59)

[5] C. Cortes. Prediction of Generalization Ability in Learning Machines. PhD thesis, Department of Computer Science, University of Rochester, 1995.

[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20, 1995.

[7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[8] S. Haykin. Neural Networks : A Comprehensive Foundation. Prentice Hall, 1998.

[9] M. A. Hearst, B. Scholkopf, S. Dumais, E. Osuna, and J. Platt. Trends and controversies - support vector machines. IEEE Intelligent Systems, 13(4):18–28, 1998.

[10] A. M. Martinez and A. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.

[11] A. M. Martinez and M. Zhu. Where are linear feature extraction methods applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1934–1944, 2005.

[12] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 193–199, 1997.

[13] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AI MEMO 1602, MIT AI Lab, 1997.

(60)

Proceedings of the Fifth International Conference on Computer Vision, pages 555–562, 1998.

[16] M. Pontil and A. Verri. Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.

[17] B. Scholkopf. Support Vector Learning. PhD thesis, Informatik der Technischen Universitat Berlin, 1997.

[18] B. Scholkopf, C. Burges, and A. Smola, editors. Advances in Kernel Methods. MIT Press, 1998.

[19] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, and T. Poggio. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758–2765, 1997.

[20] A. Smola. Learning with Kernels. PhD thesis, GMD, 1998.

[21] A. J. Smola and B. Scholkopf. A tutorial on support vector regression. Technical Report TR-1998-030, Neuro COLT, GMD First, 1998.

[22] M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems, pages 46–53. MIT Press, 2000.

[23] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

[24] V. Vapnik. Statistical Learning Theory (Adaptive and Learning Systems for Signal Processing, Communications, and Control). John Wiley & Sons, 1998.

[25] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

(61)

[26] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2, pages 734–741, 2003.

[27] O. Williams, A. Blake, and R. Cipolla. A sparse probabilistic learning algorithm for real-time tracking. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 1, pages 353–360, 2003.

[28] O. Williams, A. Blake, and R. Cipolla. Sparse Bayesian learning for efficient visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1292–1304, 2005.