Advanced Topics in Learning and Vision
Ming-Hsuan Yang
mhyang@csie.ntu.edu.tw
Announcements
• More course material available on the course web page
• Project web pages: Everyone needs to set up one (format details will soon be available)
• Reading (due Nov 1):
- RVM application for 3D human pose estimation [1].
- Viola and Jones: Adaboost-based real-time face detector [25].
- Viola et al: Adaboost-based real-time pedestrian detector [26].
Overview
• Linear classifier: Fisher linear discriminant (FLD), linear support vector machine (SVM).
• Nonlinear classifier: nonlinear support vector machine.
• SVM regression, relevance vector machine (RVM).
• Kernel methods: kernel principal component, kernel discriminant analysis.
• Ensemble of classifiers: Adaboost, bagging, ensemble of homogeneous/heterogeneous classifiers.
Fisher Linear Discriminant
• Based on Gaussian distribution: between-scatter/within-scatter matrices.
• Supervised learning for multiple classes.
• Trick often used in ridge regression: $S_w' = S_w + \lambda I$, without using PCA.
• Fisherfaces vs. Eigenfaces (FLD vs. PCA).
• Further study: Aleix Martinez’s paper on analyzing FLD for object recognition [10] [11].
Intuitive Justification for SVM
Two things come into picture:
• what kind of function do we use for classification?
Support Vector Machine
• Objective: To find an optimal hyperplane that correctly classifies data
points as much as possible and separates the points of two classes as far as possible.
• Approach: Formulate a constrained optimization problem, then solve it using constrained quadratic programming (constrained QP).
• Theorem: Structural Risk Minimization.
• Issues: VC dimension, linear separability, feature space, multiple class.
Main Idea
• Given a set of data points which belong to either of two classes, an SVM finds the hyperplane:
- leaving the largest possible fraction of points of the same class on the same side.
- and maximizing the distance of either class from the hyperplane.
• Find the optimal separating hyperplane that minimizes the risk of misclassifying the training samples and unseen test samples.
• Question: Why do we need this? Does it work well in test samples (i.e., generalization)?
• Answer: Structural Risk Minimization.
Vapnik-Chervonenkis (VC) Dimension
• A property of a set of functions (hypothesis space H, learning machine, learner) {f (α)} (we use α as a generic set of parameters: a choice of α specifies a particular function).
• Shattered: if a given set of N points can be labeled in all possible $2^N$ ways, and for each labeling a member of the set {f(α)} can be found which correctly assigns those labels, we say that the set of points is shattered by that set of functions.
• The functions f(α) are usually called hypotheses, and the set {f(α) : α ∈ Λ} is called the hypothesis space and denoted by H.
• The VC dimension for the set of functions {f (α)} , i.e., the hypothesis
space H, is defined as the maximum number of training points that can be shattered by H.
• Frequently used to find the sample complexity, mistake bound algorithm, neural network capacity, computational learning theory, etc.
• Example:
• While it is possible to find a set of 3 points that can be shattered by the set of oriented lines, no set of 4 points can be shattered (for any 4 points, at least one labeling cannot be realized). Thus the VC dimension of the set of oriented lines in $\mathbb{R}^2$ is 3.
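The shattering definition can be checked by brute force on small point sets. The sketch below is only an illustration: it uses a perceptron with an epoch cap as a separability test (an assumption of this example, not part of the lecture), so a `False` result means "no separating line found within the cap" rather than a proof. The function names and the two point sets are hypothetical.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron test: True if a strictly separating line is found in time."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            # A point is wrong if it is not strictly on its own side.
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def shattered(points):
    """True if every one of the 2^N labelings is realized by some line."""
    return all(
        linearly_separable(points, labels)
        for labels in product((-1, 1), repeat=len(points))
    )


three = [(0, 0), (1, 0), (0, 1)]          # non-collinear triple
square = [(0, 0), (1, 1), (1, 0), (0, 1)]  # contains the XOR labeling

print(shattered(three), shattered(square))
```

The square fails because of the XOR labeling (opposite corners share a label). This shows only that this particular 4-point set is not shattered; the statement that no 4-point set in the plane can be shattered needs a separate argument.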
Linear Classifiers
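• Instance space: $X = \mathbb{R}^n$.
• Set of class labels: Y = {+1, −1}.
• Training data set: $\{(x_1, y_1), \ldots, (x_N, y_N)\}$.
• Hypothesis space:
$$H_{lin}(n) = \{h : \mathbb{R}^n \to Y \mid h(x) = \mathrm{sign}(w \cdot x + b),\ w \in \mathbb{R}^n,\ b \in \mathbb{R}\}$$
$$\mathrm{sign}(w \cdot x + b) = \begin{cases} +1 & \text{if } w \cdot x + b > 0 \\ -1 & \text{otherwise} \end{cases} \tag{1}$$
• $VC(H_{lin}(2)) = 3$, $VC(H_{lin}(n)) = n + 1$.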
• Idea:
- First, restrict the search to hypotheses $h \in H_{lin}(n)$.
- Then find the parameters (w, b) that minimize the error on unseen examples.
Expected Risk
Suppose we are given N observations. Each observation consists of a pair: a vector $x_i \in \mathbb{R}^n$, i = 1, . . . , N, and the associated class label $y_i$. Assume there exists some unknown probability distribution P(x, y) from which these data points are drawn, i.e., the data points are assumed to be independent and identically distributed.
Consider a machine whose task is to learn the mapping $x_i \to y_i \in \{-1, 1\}$. The machine is defined by a set of possible mappings $x \to f(x, \alpha) \in \{-1, 1\}$, where the functions f(x, α) themselves are labeled by the adjustable parameters α.
Expected Risk: The expectation of the test error for a trained machine is
$$R(\alpha) = \int \frac{1}{2} |y - f(x, \alpha)| \, dP(x, y) \tag{2}$$
Upper Bound for Expected Risk
The empirical risk $R_{emp}$ is
$$R_{emp}(\alpha) = \frac{1}{2N} \sum_{i=1}^{N} |y_i - f(x_i, \alpha)| \tag{3}$$
Under the PAC (Probably Approximately Correct) model, Vapnik shows that the following bound on the expected risk holds with probability 1 − η (0 ≤ η ≤ 1):
$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}} \tag{4}$$
where h is the VC dimension of f(x, α) (i.e., of H). The second term on the right-hand side is called the VC confidence.
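To get a feel for bound (4), the VC confidence term can be evaluated numerically. The following is a minimal sketch; the function names and the sample values of h, N, and η are illustrative assumptions, and η must be strictly positive for the logarithm to be defined.

```python
import math


def vc_confidence(h, n, eta=0.05):
    """Second term of bound (4): sqrt((h*(log(2n/h) + 1) - log(eta/4)) / n)."""
    if h <= 0 or n <= 0 or not (0 < eta <= 1):
        raise ValueError("need h > 0, n > 0 and 0 < eta <= 1")
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def risk_bound(emp_risk, h, n, eta=0.05):
    """Right-hand side of (4): empirical risk plus VC confidence."""
    return emp_risk + vc_confidence(h, n, eta)


# The confidence term grows with h and shrinks with N (illustrative values).
for h in (3, 30):
    for n in (1000, 10000):
        print(h, n, round(vc_confidence(h, n), 3))
```

The printed values make the qualitative behavior of the bound visible: for fixed N the term increases with the VC dimension, and for fixed h it decreases as more data become available.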
Implications of the Bound for Expected Risk
$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}}$$
• To achieve small expected risk, that is, good generalization performance ⇒ both the empirical risk and the ratio between the VC dimension and the number of data points have to be small.
• Since the empirical risk is usually a decreasing function of h, it turns out that, for a given number of data points, there is an optimal value of the VC dimension.
Implications of the bound for R(α) (cont)
• The choice of an appropriate value for h (which in most techniques is
controlled by the number of free parameters of the model) is crucial in order to get good performance, especially when the number of data points is
small.
• When using a multilayer perceptron or a radial basis function network, this is equivalent to the problem of finding an appropriate number of hidden units.
• This is known to be difficult, and it is usually solved by cross-validation techniques.
Structural Risk Minimization (SRM)
• It is not enough just to minimize the empirical risk as often done by most neural networks.
• Need to overcome the problem of choosing an appropriate VC dimension.
• Structural Risk Minimization Principle: To make the expected risk small, both terms on the right-hand side of (4) should be small.
• Minimize the empirical risk and VC confidence simultaneously:
$$\min_{H_n} \left( R_{emp}(\alpha) + \sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}} \right) \tag{5}$$
• Introduce “structure” by dividing the entire class of functions into nested subsets.
• For each subset, we must be able to compute h or a bound of h.
• Then, SRM consists of finding that subset of functions which minimizes the bound on the actual risk.
• The problem of selecting the right subset for a given number of observations is referred to as capacity control.
• A trade off between reducing the training error and limiting model complexity.
• Occam's razor: models should be no more complex than is sufficient to explain the data.
• “Things should be made as simple as possible – but not any simpler” (Albert Einstein).
Structural Risk Minimization (cont)
• To implement SRM, one needs the nested structure of hypothesis space:
H1 ⊂ H2 ⊂ . . . ⊂ Hn ⊂ . . .
with the property that h(n) ≤ h(n + 1) where h(n) is the VC dimension of Hn.
• A learning machine with larger complexity (higher VC dimension) ⇒ small empirical risk.
• A simpler learning machine (lower VC dimension) ⇒ low VC confidence.
• SRM picks a trade-off between VC dimension and empirical risk such that the risk bound is minimized.
• Problems:
- It is usually difficult to compute the VC dimension of $H_n$, and there are only a small number of models for which we know how to compute the VC dimension.
- Even when the VC dimension of $H_n$ is known, it is not easy to solve the minimization problem. In most cases, one has to minimize the empirical risk for every set $H_n$, and then choose the $H_n$ that minimizes the bound (5).
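The selection step described above (minimize the empirical risk in each $H_n$, then pick the $H_n$ with the smallest bound) can be sketched as follows. The VC dimensions and empirical risks in `candidates` are made-up numbers chosen only to show the trade-off; they are not results from any experiment, and the function names are hypothetical.

```python
import math


def vc_confidence(h, n, eta=0.05):
    """VC confidence term of bound (4)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def select_by_srm(candidates, n, eta=0.05):
    """Pick the (name, vc_dim, emp_risk) triple that minimizes bound (5)."""
    return min(candidates, key=lambda c: c[2] + vc_confidence(c[1], n, eta))


# Hypothetical nested structure H1 ⊂ H2 ⊂ H3 ⊂ H4: the VC dimension grows
# while the (already minimized) empirical risk shrinks.
candidates = [("H1", 2, 0.30), ("H2", 10, 0.12), ("H3", 50, 0.05), ("H4", 200, 0.01)]

best = select_by_srm(candidates, n=2000)
print(best)
```

With these illustrative numbers the richest class H4 has the lowest training error but a large VC confidence, so the bound is minimized by an intermediate class.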
Support Vector Machine (SV Algorithm)
One implementation based on Structural Risk Minimization theory
• Each particular choice of structure gives rise to a learning algorithm, consisting of performing SRM in the given structure of sets of functions.
• The SVM algorithm is based on a structure on the set of separating hyperplanes.
• For a set of hyperplane functions,
H1 ⊂ H2 ⊂ . . . ⊂ Hn ⊂ . . .
Margin
The margin $\gamma_i$ of a point $x_i$ with respect to a linear classifier h(x) = sign(w · x + b) is defined as the distance of $x_i$ from the hyperplane w · x + b = 0:
$$\gamma_i = \frac{|w \cdot x_i + b|}{\|w\|} \tag{6}$$
The margin of a set of points $\{x_1, \ldots, x_N\}$ is defined as the margin of the point closest to the hyperplane:
$$\gamma = \min_i \gamma_i = \min_i \frac{|w \cdot x_i + b|}{\|w\|} \tag{7}$$
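As a small worked example of (6) and (7), the margin of a point set can be computed directly from w and b. This is a sketch with hypothetical points and a hand-picked hyperplane; the helper name `margin` is an assumption of this example.

```python
import math


def margin(points, w, b):
    """Margin (7): smallest distance |w.x + b| / ||w|| over all points."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm for x in points
    )


points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
w, b = (1.0, 1.0), -2.0  # hyperplane x1 + x2 = 2

print(margin(points, w, b))
# Rescaling (w, b) by a positive constant leaves the margin unchanged.
print(margin(points, (2.0, 2.0), -4.0))
```

The second call illustrates that the geometric margin does not depend on the scale of (w, b), which is why a normalization of w and b is needed to make the representation unique.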
The SV Algorithm
Consider a hyperplane parameterized by w and b, (w, b) ∈ S × R,
{x ∈ S : (w · x) + b = 0} (8)
If we additionally require
$$\min_{i=1,\ldots,n} |(w \cdot x_i) + b| = 1 \tag{9}$$
i.e., that the scaling of w and b be such that the point closest to the hyperplane has a distance of $1/\|w\|$. Thus, the margin between the two classes, measured perpendicular to the hyperplane, is at least $2/\|w\|$.
Proposition 1. (Vapnik 95) Let R be the radius of the smallest ball $B_R(a) = \{x \in F : \|x - a\| < R\}$ ($a \in F$) containing the points $x_1, x_2, \ldots, x_N$, and let
$$f_{w,b} = \mathrm{sign}((w \cdot z) + b) \tag{10}$$
be canonical hyperplane decision functions defined on these points. Then the set $\{f_{w,b} : \|w\| \le A\}$ has a VC dimension h satisfying
$$h < R^2 A^2 + 1 \tag{11}$$
In other words,
Maximizing the margin $1/\|w\|$
⇒ Minimizing ||w||
⇒ smallest acceptable VC dimension
⇒ Constructing an optimal hyperplane is an implementation of SRM!
Linear Support Vector Machine
Given a set S of points $x_i \in \mathbb{R}^n$ with i = 1, 2, . . . , N. Each point $x_i$ belongs to either of two classes with the label $y_i \in \{-1, 1\}$.
Definition 1. The set S is linearly separable if there exist $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that
$$y_i(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N. \tag{12}$$
The pair (w, b) defines a hyperplane with equation w · x + b = 0, called the separating hyperplane. The signed distance $d_i$ of a point $x_i$ from the separating hyperplane (w, b) is given by
$$d_i = \frac{w \cdot x_i + b}{\|w\|} \tag{13}$$
With (12) and (13), for all $x_i \in S$, we have
$$y_i d_i \ge \frac{1}{\|w\|} \tag{14}$$
Linear SVM (cont)
$$\forall x_i \in S: \quad y_i d_i \ge \frac{1}{\|w\|}$$
Therefore, $1/\|w\|$ is the lower bound on the distance between the points $x_i$ and the separating hyperplane (w, b).
Definition 2. The canonical representation of the separating hyperplane is obtained by rescaling the pair (w, b) into the pair (w', b') such that the distance of the closest point, say $x_j$, equals $1/\|w'\|$.
Definition 3. Given a linearly separable set S, the optimal separating hyperplane (OSH) is the separating hyperplane for which the distance of the closest point of S is maximum (i.e., maximize $1/\|w'\|$).
Constrained Quadratic Programming
Problem 1.
Minimize $\frac{1}{2} w \cdot w$
subject to $y_i(w \cdot x_i + b) \ge 1$, i = 1, 2, . . . , N
Let $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ be the N nonnegative Lagrange multipliers associated with the constraints of Problem 1. The solution to Problem 1 is equivalent to determining the saddle point of the function
$$L_P = \frac{1}{2} w \cdot w - \sum_{i=1}^{N} \alpha_i \left( y_i(w \cdot x_i + b) - 1 \right)$$
with $L_P = L(w, b, \alpha)$.
Solving Constrained QP
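At the saddle point, $L_P$ has a minimum for $w = \bar{w}$ and $b = \bar{b}$, requiring
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{N} y_i \alpha_i = 0 \tag{15}$$
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{N} \alpha_i y_i x_i \tag{16}$$
with $\frac{\partial L}{\partial w} = \left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \cdots, \frac{\partial L}{\partial w_n} \right)$.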
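Since these are equality constraints in the dual formulation, we can substitute them into $L_P$ to give
$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \tag{17}$$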
Solving Constrained QP using Dual
Problem 2.
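Maximize $-\frac{1}{2} \alpha^T D \alpha + \sum_{i=1}^{N} \alpha_i$
subject to $\sum_{i=1}^{N} y_i \alpha_i = 0$, $\alpha \ge 0$
where D is an N × N matrix such that
$$D_{ij} = y_i y_j \, x_i \cdot x_j \tag{18}$$
For the solution at the saddle point, $(\bar{w}, \bar{b})$, it follows from Problem 2 that
$$\bar{w} = \sum_{i=1}^{N} \bar{\alpha}_i y_i x_i \tag{19}$$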
Solving Constrained QP Using Dual
b can be determined from α, which is a solution of the dual problem, and from the Kuhn-Tucker conditions
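$$\alpha_i \left( y_i(w \cdot x_i + b) - 1 \right) = 0, \quad i = 1, \ldots, N \tag{20}$$
Recall (16) and the constraints in Problem 1:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{N} \alpha_i y_i x_i$$
$$y_i(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N.$$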
Note that the only αi that can be nonzero in (20) are those for which the constraints (12) are satisfied with the equality sign.
Support Vectors
$$y_i(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N, \qquad w = \sum_{i=1}^{N} \alpha_i y_i x_i$$
Most of the constraints in (12) are satisfied with strict inequality, i.e., most $\alpha_i$ solved from the dual are zero.
⇒ the vector w is a linear combination of a relatively small percentage of the points $x_i$.
⇒ these points are termed support vectors because they are the closest points to the OSH and the only points of S needed to determine the OSH (hence the name).
The problem of classifying a new data point x is now simply solved by computing
$$\mathrm{sign}(w \cdot x + b)$$
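For the smallest possible training set, one positive and one negative point, the dual can be solved by hand, which makes the roles of α, w, and b concrete. The constraint Σ yᵢαᵢ = 0 forces both multipliers to share one value α, the dual objective reduces to 2α − ½α²‖x₊ − x₋‖², and its maximizer is α = 2/‖x₊ − x₋‖². The sketch below encodes this special case only; it is not a general QP solver, and the function names are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def two_point_svm(x_pos, x_neg):
    """Closed-form hard-margin SVM for one positive and one negative point."""
    d = [p - q for p, q in zip(x_pos, x_neg)]
    sq_dist = dot(d, d)
    if sq_dist == 0:
        raise ValueError("the two points must differ")
    alpha = 2.0 / sq_dist            # maximizer of 2a - 0.5 * a^2 * ||d||^2
    w = [alpha * v for v in d]       # (16): w = sum_i alpha_i y_i x_i
    b = 1.0 - dot(w, x_pos)          # KKT (20) with equality at x_pos
    return alpha, w, b


def classify(w, b, x):
    """Decision rule sign(w . x + b), with -1 on the boundary as in (1)."""
    return 1 if dot(w, x) + b > 0 else -1


alpha, w, b = two_point_svm((2.0, 2.0), (0.0, 0.0))
print(alpha, w, b)
print(classify(w, b, (3.0, 3.0)), classify(w, b, (0.0, 0.5)))
```

Both points are support vectors here (both constraints hold with equality), and the resulting hyperplane is the perpendicular bisector of the segment joining them.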
Soft Margin Classifier
In the case that the set S is not linearly separable, or one simply ignores whether or not the set S is linearly separable, the previous analysis can be generalized by introducing N nonnegative variables $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$ such that
$$y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, N \tag{21}$$
Purpose: to allow for a small number of misclassified points for better generalization or computational efficiency.
Generalized OSH
The generalized OSH is then regarded as the solution to
Problem 3.
Minimize $\frac{1}{2} w \cdot w + C \sum_{i=1}^{N} \xi_i$
subject to $y_i(w \cdot x_i + b) \ge 1 - \xi_i$, i = 1, . . . , N, $\xi \ge 0$
Role of C:
• as a regularization parameter (cf. Radial Basis Function, fitting).
• large C ⇒ minimize the number of misclassified points.
• small C ⇒ maximize the minimum distance $1/\|w\|$.
Dual Problem
Problem 4.
Maximize $-\frac{1}{2} \alpha^T D \alpha + \sum_{i=1}^{N} \alpha_i$
subject to $\sum_{i=1}^{N} y_i \alpha_i = 0$, $0 \le \alpha_i \le C$, i = 1, 2, . . . , N
As before,
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i$$
and b can be determined from α, the solution of the dual Problem 4, and from the new Kuhn-Tucker conditions
$$\alpha_i \left( y_i(w \cdot x_i + b) - 1 + \xi_i \right) = 0 \tag{22}$$
$$(C - \alpha_i) \xi_i = 0 \tag{23}$$
Two cases:
• $\alpha_i < C$
⇒ $\xi_i = 0$
⇒ the support vectors lie at a distance $1/\|w\|$ from the OSH
⇒ called margin vectors
• $\alpha_i = C$
1. $\xi_i > 1$, misclassified points
2. $0 < \xi_i \le 1$, points correctly classified but closer than $1/\|w\|$ to the OSH
3. $\xi_i = 0$, margin vectors (rare case)
Neglecting the last rare case, we refer to all the support vectors for which αi = C as errors. All the points that are not support vectors are correctly classified and lie outside the margin strip.
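The slack variables and the cases above can be made concrete for a fixed hyperplane. In this sketch the smallest feasible slack for each point is ξᵢ = max(0, 1 − yᵢ(w · xᵢ + b)), and the points are sorted into the categories of the slide by the value of ξᵢ. The hyperplane, the data, and the function names are hypothetical; no optimization is performed.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def slack(w, b, x, y):
    """Smallest xi satisfying constraint (21) for the point (x, y)."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def categorize(xi, tol=1e-9):
    """Map a slack value to the cases listed for the soft margin classifier."""
    if xi > 1.0:
        return "misclassified"
    if xi > tol:
        return "inside margin"
    return "on or outside margin"


def primal_objective(w, b, data, C):
    """Objective of Problem 3: 0.5 * w.w + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in data)


w, b = (0.5, 0.5), -1.0  # hypothetical hyperplane
data = [((2.0, 2.0), 1), ((1.5, 1.5), 1), ((0.0, 0.0), 1)]

for x, y in data:
    print(x, y, categorize(slack(w, b, x, y)))
print(primal_objective(w, b, data, C=1.0))
```

Increasing C in `primal_objective` makes each unit of slack more expensive, which matches the role of C as the weight on training errors.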
Nonlinear Support Vector Machine
• Note that the only way the data points appear in the training problem is in the form of dot products xi · xj.
• In higher dimensional space (feature space), it is very likely that a linear separator (hyperplane) can be constructed.
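• E.g., we map the data points from the input space $\mathbb{R}^n$ to some feature space of higher dimension $\mathbb{R}^m$ (m > n) using a function Φ:
$$\Phi : \mathbb{R}^n \to \mathbb{R}^m$$
• The training algorithm then depends on the data only through dot products in the feature space, i.e., on functions of the form $\Phi(x_i) \cdot \Phi(x_j)$. In other words,
$$f(x) = w \cdot x + b = \sum_{i=1}^{N} y_i \alpha_i \, x_i \cdot x + b$$
becomes
$$f(x) = w \cdot \Phi(x) + b = \sum_{i=1}^{N} y_i \alpha_i \, \Phi(x_i) \cdot \Phi(x) + b$$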
• But the transformation operator, Φ, is computationally expensive.
• If there were a “kernel function” K such that K(xi, xj) = Φ(xi) · Φ(xj), we would only need to use K in the training algorithm.
• One example: $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$.
• All the previous derivations in linear SVM hold (substituting dot product with kernel function), since we are still doing a linear separation, but in a
different space.
• Map the training data nonlinearly into a higher-dimensional feature space via Φ, and construct a separating hyperplane with maximum margin there.
• This yields a nonlinear decision boundary in input space. By the use of
kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.
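The point that a kernel evaluates a feature-space dot product without carrying out the map can be verified numerically for a case where Φ is small enough to write down. The sketch below uses the degree-2 polynomial kernel (x · y + 1)² on ℝ², whose explicit feature map has six components; the function names are hypothetical.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def phi(x):
    """Explicit feature map R^2 -> R^6 for the kernel (x.y + 1)^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2)


def poly2_kernel(x, y):
    """Same quantity computed in input space, without building phi."""
    return (dot(x, y) + 1.0) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(x), phi(y)), poly2_kernel(x, y))
```

Both numbers agree, but the kernel needs one dot product in ℝ² while the explicit route needs two feature vectors in ℝ⁶; the gap widens quickly with the input dimension and the polynomial degree.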
Mapping between input space and feature space [9].
Mercer’s Condition for Kernel Function
• The idea of constructing support vector networks comes from considering general forms of the dot product in a Hilbert space:
$$\Phi(u) \cdot \Phi(v) \equiv K(u, v) \tag{24}$$
• Question: For which kernels K does there exist a pair {H, Φ} such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$?
• Answer: Mercer’s condition. It tells us whether or not a prospective kernel is actually a dot product in some space.
• According to the Hilbert-Schmidt theory, any symmetric function K(u, v), with $K(u, v) \in L_2$, can be expanded in the form
$$K(u, v) = \sum_i \lambda_i \Phi_i(u) \Phi_i(v) \tag{25}$$
where $\lambda_i \in \mathbb{R}$ and $\Phi_i$ are the eigenvalues and eigenfunctions
$$\int K(u, v) \Phi_i(u) \, du = \lambda_i \Phi_i(v)$$
of the integral operator defined by the kernel K(u, v).
• A sufficient condition to ensure that (24) defines a dot product in a feature space is that all the eigenvalues in the expansion (25) are positive. To guarantee that these coefficients are positive, it is necessary and sufficient (Mercer’s theorem) that the condition
$$\int \int K(u, v) g(u) g(v) \, du \, dv > 0$$
is satisfied for all g such that
$$\int g^2(u) \, du < \infty$$
Some Kernel Functions in SVM
Simple dot product:
$$K(x, y) = x \cdot y$$
Vovk’s polynomial:
$$K(x, y) = (x \cdot y + 1)^p$$
Radial basis function (RBF):
$$K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$$
Two-layer neural network:
$$K(x, y) = \tanh(\kappa \, x \cdot y - \delta)$$
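The four kernels listed above translate directly into code. This is a plain-Python sketch; the function names, the default parameter values, and the `gram_matrix` helper (which builds the matrix of pairwise kernel values used in place of xᵢ · xⱼ) are assumptions of this example.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, p=2):
    return (dot(x, y) + 1.0) ** p


def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def tanh_kernel(x, y, kappa=1.0, delta=0.0):
    return math.tanh(kappa * dot(x, y) - delta)


def gram_matrix(kernel, points):
    """Matrix of pairwise kernel values K(x_i, x_j)."""
    return [[kernel(a, b) for b in points] for a in points]


points = [(1.0, 2.0), (3.0, -1.0), (0.0, 0.5)]
G = gram_matrix(rbf_kernel, points)
for row in G:
    print([round(v, 4) for v in row])
```

The Gram matrix is symmetric for each of these kernels, and for the RBF kernel its diagonal is identically 1 because every point has zero distance to itself.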
Some Properties of SVM
Decision surface in (a) by a polynomial classifier, and in (b) by a RBF where the support vectors are indicated in dark fill. Note the reduced number of them and their position close to the boundary. In (b), the support vectors are the
Architecture for General Support Vector Machine
• Linear SVM: Kernel function is just a dot product in input space
• Nonlinear SVM: Need to choose an appropriate kernel function
Multiple Classes
For K-class pattern recognition problem, several approaches have been proposed
• One-against-the-rest: construct a hyperplane between class k and the K − 1 other classes ⇒ K SVMs.
• One-against-one: construct a hyperplane for any two classes ⇒ $K(K-1)/2$ SVMs.
• K-class SVM: by Watkins.
• John Platt’s DAG method
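The one-against-one scheme can be sketched as majority voting over all K(K−1)/2 class pairs. In the example below, `pairwise_decide` stands in for a trained two-class SVM; the toy rule based on class centers on a line is purely hypothetical and only makes the sketch runnable.

```python
from collections import Counter
from itertools import combinations


def one_against_one_predict(x, classes, pairwise_decide):
    """Let every pair of classes vote and return the class with most votes."""
    votes = Counter(pairwise_decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Hypothetical stand-in for trained pairwise classifiers: pick the class
# whose center (on a 1-D line) is closer to x.
centers = {"a": 0.0, "b": 5.0, "c": 10.0}


def nearest_center(a, b, x):
    return a if abs(x - centers[a]) <= abs(x - centers[b]) else b


classes = list(centers)
print(len(list(combinations(classes, 2))))  # K(K-1)/2 pairwise classifiers
print(one_against_one_predict(4.0, classes, nearest_center))
```

Ties between classes are broken here by whichever class was counted first, which is an arbitrary choice of this sketch; practical systems usually need an explicit tie-breaking rule.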
SVM: Training and Testing
• Training: Solve a complex quadratic optimization problem
- Speed up: Chunking, Sequential Minimal Optimization (SMO).
• Testing: The number of support vectors may be large
- Speed up: Reduced set of support vectors, etc.
SVM Regression
• Introduce the ε-insensitive loss
$$|y - f(x)|_\varepsilon = \max\{0, |y - f(x)| - \varepsilon\} \tag{26}$$
• Estimate f(x) = w · x + b by minimizing
$$\frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} |y_i - f(x_i)|_\varepsilon \tag{27}$$
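The ε-insensitive loss (26) and the regression objective (27) are easy to evaluate for a given linear model. The sketch below only computes the objective for fixed w and b; it does not perform the minimization, and the data and function names are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def eps_insensitive_loss(y, fx, eps):
    """Loss (26): zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(y - fx) - eps)


def svr_objective(w, b, data, C, eps):
    """Objective (27): 0.5 * ||w||^2 + (C / N) * sum of eps-insensitive losses."""
    n = len(data)
    total = sum(eps_insensitive_loss(y, dot(w, x) + b, eps) for x, y in data)
    return 0.5 * dot(w, w) + (C / n) * total


w, b = (1.0,), 0.0
data = [((1.0,), 1.05), ((2.0,), 2.5)]  # (input vector, target) pairs

print(eps_insensitive_loss(1.05, 1.0, eps=0.1))  # inside the tube
print(eps_insensitive_loss(2.5, 2.0, eps=0.1))   # outside the tube
print(svr_objective(w, b, data, C=2.0, eps=0.1))
```

Only the second sample contributes to the objective, since the first lies within ε of the prediction; this is the mechanism that makes the solution depend on a subset of the training points.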
Relevance Vector Machine [Tipping NIPS00]
• Main idea [22]
- Extend SVM to have probabilistic interpretations.
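- Further sparsify and have fewer support vectors, i.e., relevance vectors (prototypes).
- Bayesian learning: put priors over parameters (i.e., introducing hyperparameters).
• Approach: Put priors on the target values $y_i$ and the weights $w_i$.
$$p(y \mid w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \|y - \Phi w\|^2 \right\} \tag{28}$$
$$p(w \mid \beta) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \beta_i^{-1}) \tag{29}$$
where Φ is a design matrix with $\Phi_{nm} = K(x_n, x_{m-1})$ and $\Phi_{n1} = 1$, and β is a vector of N + 1 hyperparameters.
• Hyperparameters: σ and β.
• With that, we have the posterior of the weights
$$p(w \mid y, \beta, \sigma^2) = (2\pi)^{-(N+1)/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (w - \mu)^T \Sigma^{-1} (w - \mu) \right\} \tag{30}$$
$$\Sigma = (\Phi^T B \Phi + A)^{-1}, \qquad \mu = \Sigma \Phi^T B y \tag{31}$$
where A is the diagonal matrix $A = \mathrm{diag}(\beta_0, \beta_1, \ldots, \beta_N)$ and $B = \sigma^{-2} I_N$.
• Integrating out the weights, we have the marginal likelihood for the hyperparameters:
$$p(y \mid \beta, \sigma^2) = (2\pi)^{-N/2} |B^{-1} + \Phi A^{-1} \Phi^T|^{-1/2} \exp\left\{ -\frac{1}{2} y^T (B^{-1} + \Phi A^{-1} \Phi^T)^{-1} y \right\} \tag{32}$$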
• Example: for the function $\mathrm{sinc}(x) = |x|^{-1} \sin |x|$ with 100 training samples, SVM regression uses 39 support vectors, whereas RVM uses 9 relevance vectors.
• Williams et al. apply RVM to learn a regression function of in-plane image displacement for visual tracking [27] [28].
• Agarwal and Triggs apply RVM to learn a regression function of human joint angles from silhouette images [1].
Applications
• Pattern Recognition: hand digit recognition, 3D object recognition, face detection, face recognition, pedestrian detection, gender classification, visual tracking, expression recognition, speaker identification, text
classification.
• Regression: time series prediction, relevance vector machine
• Signal Processing: seismic signal classification, density estimation, DNA sequence classification.
Review: Optical Flow
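Let image intensity be given by I(x, y, t). Using a first-order Taylor expansion,
$$I(x + dx, y + dy, t + dt) = I(x, y, t) + \frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt + \text{h.o.t.} \tag{33}$$
Based on the brightness constancy assumption,
$$I(x + dx, y + dy, t + dt) = I(x, y, t) \tag{34}$$
Thus
$$\frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt = 0 \tag{35}$$
Let
$$\frac{dx}{dt} = u, \quad \frac{dy}{dt} = v \tag{36}$$
We have
$$-\frac{\partial I}{\partial t} = \frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v \tag{37}$$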
Lucas-Kanade
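$$E(u, v) = \sum_{x, y \in ROI} \left( I(x + u, y + v, t + dt) - I(x, y, t) \right)^2 \tag{38}$$
Minimize E(u, v):
$$\begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} -\sum I_t I_x \\ -\sum I_t I_y \end{bmatrix} \tag{39}$$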
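System (39) is a 2 × 2 linear system and can be solved in closed form. The sketch below takes per-pixel derivative samples over the region of interest as flat lists; computing those derivatives from images is outside its scope, and the synthetic samples and function name are assumptions of this example.

```python
def lucas_kanade(ix, iy, it):
    """Solve (39) for (u, v) from per-pixel derivatives Ix, Iy, It."""
    sxx = sum(a * a for a in ix)
    sxy = sum(a * b for a, b in zip(ix, iy))
    syy = sum(b * b for b in iy)
    bx = -sum(t * a for t, a in zip(it, ix))
    by = -sum(t * b for t, b in zip(it, iy))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("singular system: gradients do not constrain the flow")
    # Cramer's rule for the 2x2 system.
    u = (bx * syy - sxy * by) / det
    v = (sxx * by - sxy * bx) / det
    return u, v


# Synthetic derivatives consistent with (37) for a known flow (0.5, -0.25).
true_u, true_v = 0.5, -0.25
ix = [1.0, 0.0, 1.0, 2.0]
iy = [0.0, 1.0, 1.0, -1.0]
it = [-(a * true_u + b * true_v) for a, b in zip(ix, iy)]

print(lucas_kanade(ix, iy, it))
```

The singular case corresponds to a region whose gradients all point in one direction, where only the flow component along the gradient can be recovered.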
SVM Tracking [Avidan CVPR01]
• Incorporate optical flow with SVM [2] [3].
• Estimate the displacement direction using optical flow
• Use the SVM score to determine the most likely target location
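• Given I, use a first-order Taylor expansion
$$I^* = I + u I_x + v I_y \tag{40}$$
where $I_x$, $I_y$ are the x and y derivatives of I, and u, v are motion parameters.
• With the SVM score
$$f(x) = \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) + b, \qquad f(I^*) = \sum_{i=1}^{N} y_i \alpha_i K(I + u I_x + v I_y, x_i) + b \tag{41}$$
• Use the squared dot-product kernel $K(x, x_i) = (x^T x_i)^2$, and maximize the above function
$$E(u, v) = \sum_{i=1}^{N} y_i \alpha_i \left( (I + u I_x + v I_y)^T x_i \right)^2 \tag{42}$$
• Take partial derivatives w.r.t. u and v
$$\frac{\partial E}{\partial u} = \sum_{i=1}^{N} y_i \alpha_i (I_x^T x_i) (I + u I_x + v I_y)^T x_i = 0 \tag{43}$$
$$\frac{\partial E}{\partial v} = \sum_{i=1}^{N} y_i \alpha_i (I_y^T x_i) (I + u I_x + v I_y)^T x_i = 0 \tag{44}$$
• Rearranging (43) and (44) gives the linear system
$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \tag{45}$$
where
$$A_{11} = \sum_{i=1}^{N} \alpha_i y_i (x_i^T I_x)^2, \quad A_{12} = A_{21} = \sum_{i=1}^{N} \alpha_i y_i (x_i^T I_x)(x_i^T I_y), \quad A_{22} = \sum_{i=1}^{N} \alpha_i y_i (x_i^T I_y)^2,$$
$$b_1 = -\sum_{i=1}^{N} \alpha_i y_i (x_i^T I_x)(x_i^T I), \quad b_2 = -\sum_{i=1}^{N} \alpha_i y_i (x_i^T I_y)(x_i^T I) \tag{46}$$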
• Similar to optical flow in equation form
• Also use pyramid to search over scale
Life Beyond SVM
• Mistake-Bound On-Line Learning: Winnow, SNoW
• Ensemble of Homogeneous Classifiers: Boosting, Bagging
• Ensemble of Heterogeneous Classifiers: Kittler’s method
• Random Subspace Method: Monte Carlo approach
• Kernel methods: Kernel PCA, Kernel Fisher Linear Discriminant
• Generative Models, Graphical Models, Nonlinear PCA, Probabilistic PCA, Mixture of Probabilistic PCA, etc.
• Maximum entropy approach
SVM vs. SNoW on Face Detection
A benchmark of SVM and SNoW based on 5732 training samples and 500 testing samples using a Sun Ultra Sparc 10. Each sample is a 20 × 20 image.
                      Nonlinear SVM    Linear SVM    SNoW
Training Accuracy     100%             100%          100%
Testing Accuracy      100%             96%           97%
Memory Requirement    83MB             24MB          7MB
Wall Clock Time       5.8hr            3.7hr         0.6hr
Concluding Remarks
Pros
• Optimal hyperplane.
• Some kernels have infinite VC dimension.
• Can deal with high dimensional data.
Cons
• Numerical stability problems in solving constrained QP.
• Usually require positive/negative examples.
References
Introductory articles: [9] [21] [16] [14] [4] [6]
Book: [8] [23] [24] [18] [7]
Ph.D. Thesis: [5] [17] [20]
Vision papers: [16] [14] [12] [15]
Comparison with RBF: [19]
Kernel Machines web site: http://www.kernel-machines.org
Boosting web site: http://www.boosting.org/
References
[1] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 882–888, 2004.
[2] S. Avidan. Support vector tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 184–191, 2001.
[3] S. Avidan. Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1064–1072, 2004.
[4] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
[5] C. Cortes. Prediction of Generalization Ability in Learning Machines. PhD thesis, Department of Computer Science, University of Rochester, 1995.
[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20, 1995.
[7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[8] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1998.
[9] M. A. Hearst, B. Scholkopf, S. Dumais, E. Osuna, and J. Platt. Trends and controversies - support vector machines. IEEE Intelligent Systems, 13(4):18–28, 1998.
[10] A. M. Martinez and A. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
[11] A. M. Martinez and M. Zhu. Where are linear feature extraction methods applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1934–1944, 2005.
[12] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 193–199, 1997.
[13] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AI MEMO 1602, MIT AI Lab, 1997.
Proceedings of the Fifth International Conference on Computer Vision, pages 555–562, 1998.
[16] M. Pontil and A. Verri. Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.
[17] B. Scholkopf. Support Vector Learning. PhD thesis, Informatik der Technischen Universitat Berlin, 1997.
[18] B. Scholkopf, C. Burges, and A. Smola, editors. Advances in Kernel Methods. MIT Press, 1998.
[19] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, and T. Poggio. Comparing support vector machines with gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758–2765, 1997.
[20] A. Smola. Learning with Kernels. PhD thesis, GMD, 1998.
[21] A. J. Smola and B. Scholkopf. A tutorial on support vector regression. Technical Report TR-1998-030, Neuro COLT, GMD First, 1998.
[22] M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems, pages 46–53. MIT Press, 2000.
[23] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[24] V. Vapnik. Statistical Learning Theory (Adaptive and Learning Systems for Signal Processing, Communications, and Control). John Wiley & Sons, 1998.
[25] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[26] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2, pages 734–741, 2003.
[27] O. Williams, A. Blake, and R. Cipolla. A sparse probabilistic learning algorithm for real-time tracking. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 1, pages 353–360, 2003.
[28] O. Williams, A. Blake, and R. Cipolla. Sparse bayesian learning for efficient visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1292–1304, 2005.