Multiple Kernel Learning for Dimensionality Reduction

Yen-Yu Lin, Tyng-Luh Liu, Member, IEEE, and Chiou-Shann Fuh, Member, IEEE

Abstract—In solving complex visual learning tasks, adopting multiple descriptors to more precisely characterize the data has been a feasible way for improving performance. The resulting data representations are typically high-dimensional and assume diverse forms.

Hence, finding a way of transforming them into a unified space of lower dimension generally facilitates the underlying tasks such as object recognition or clustering. To this end, the proposed approach (termed MKL-DR) generalizes the framework of multiple kernel learning for dimensionality reduction, and distinguishes itself with the following three main contributions: First, our method provides the convenience of using diverse image descriptors to describe useful characteristics of various aspects about the underlying data.

Second, it extends a broad set of existing dimensionality reduction techniques to consider multiple kernel learning, and consequently improves their effectiveness. Third, by focusing on the techniques pertaining to dimensionality reduction, the formulation introduces a new class of applications with the multiple kernel learning framework to address not only the supervised learning problems but also the unsupervised and semi-supervised ones.

Index Terms—Dimensionality reduction, multiple kernel learning, object categorization, image clustering, face recognition.

1 INTRODUCTION

The fact that most visual learning problems deal with high-dimensional data has made dimensionality reduction an inherent part of the current research. Besides having the potential for a more efficient approach, working with a new space of lower dimension can often gain the advantage of better analyzing the intrinsic structures in the data for various applications. For example, dimensionality reduction can be performed to compress data for a compact representation [25], [56], to visualize high-dimensional data [40], [47], to exclude unfavorable data variations [8], or to improve the classification power of the nearest neighbor rule [9], [54].

Despite their great applicability, existing dimensionality reduction methods often suffer from two main restrictions.

First, many of them, especially the linear ones, require data to be represented in the form of feature vectors. The limitation may eventually reduce the effectiveness of the overall algorithms when the data of interest could be more precisely characterized in other forms, e.g., bag-of-features [2], [33], matrices, or high-order tensors [54], [57]. Second, there seems to be a lack of a systematic way of integrating multiple image features for dimensionality reduction. When addressing applications where no single descriptor can appropriately depict the whole data set, this shortcoming

becomes even more evident. Alas, this is usually the case in today's vision applications, such as the recognition task on the Caltech-101 data set [14] or the classification and detection tasks in the Pascal VOC challenge [13]. On the other hand, the advantage of using multiple features has indeed been consistently pointed out in a number of recent research efforts, e.g., [7], [18], [31], [50], [51].

Aiming to overcome the above-mentioned restrictions, we introduce a framework called MKL-DR that incorporates multiple kernel learning (MKL) into the training process of dimensionality reduction (DR) methods. It works with multiple base kernels, each of which is created based on a specific kind of data descriptor, and fuses the descriptors in the domain of kernel matrices. We will illustrate the formulation of MKL-DR with graph embedding [54], which provides a unified view for a large family of DR methods.

Any DR technique expressible by graph embedding can therefore be generalized by MKL-DR to boost its power by simultaneously taking account of the data characteristics captured by different descriptors. It follows that the proposed approach can extend the MKL framework to address, as the corresponding DR methods would do, not only supervised learning problems but also unsupervised and semi-supervised ones.

2 RELATED WORK

Since the relevant literature is quite extensive, our survey instead emphasizes the key concepts crucial to the establishment of the proposed framework.

2.1 Dimensionality Reduction

Techniques to perform dimensionality reduction for high-dimensional data can vary considerably from each other due to, e.g., different assumptions about the data distribution or the availability of the data labeling. We categorize them as follows:

. Y.-Y. Lin and T.-L. Liu are with the Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan. E-mail: {yylin, liutyng}@iis.sinica.edu.tw.

. C.-S. Fuh is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan. E-mail: fuh@csie.ntu.edu.tw.



2.1.1 Unsupervised DR

Principal component analysis (PCA) [25] is the most well-known technique; it finds a linear mapping by maximizing the projected variances. For nonlinear DR techniques, isometric feature mapping (Isomap) [47] and locally linear embedding (LLE) [40] both exploit the manifold assumption to yield the embeddings. To resolve the out-of-sample problem in Isomap and LLE, locality preserving projections (LPP) [23] are proposed to uncover the data manifold by a linear relaxation.

2.1.2 Supervised DR

Linear discriminant analysis (LDA) assumes that the data of each class have a Gaussian distribution, and derives a projection from simultaneously maximizing the between-class scatter and minimizing the within-class scatter.

Alternatively, marginal Fisher analysis (MFA) [54] and local discriminant embedding (LDE) [9] adopt the assumption that the data of each class spread as a submanifold, and seek a discriminant embedding over these submanifolds.

2.1.3 Semi-Supervised DR

If the observed data are partially labeled, dimensionality reduction can be performed by carrying out discriminant analysis over the labeled ones while preserving the intrinsic geometric structures of the remaining. Such techniques are useful, say, for vision applications where user interactions are involved, e.g., semi-supervised discriminant analysis (SDA) [6]

for content-based image retrieval with relevance feedback.

2.1.4 Kernelization

It is possible to kernelize a certain type of linear DR techniques into nonlinear ones. As shown in [6], [9], [23], [34], [41], [54], the kernelized versions generally can achieve significant improvements. In addition, kernelization provides a convenient way for DR methods to handle data not in vector form by specifying an associated kernel, e.g., the pyramid matching kernel [21] for data in the form of bag-of-features or the dissimilarity kernel [38] based on the pairwise distances.

2.2 Graph Embedding

A number of dimensionality reduction methods focus on modeling the pairwise relationships among data and utilize graph-based structures. In particular, the framework of graph embedding [54] provides a unified formulation for a broad set of such DR algorithms. Let $\mathcal{X} = \{x_i \in \mathbb{R}^d\}_{i=1}^N$ be the data set. A DR scheme accounted for by graph embedding involves a complete graph $G$ whose vertices are over $\mathcal{X}$. A corresponding affinity matrix $W = [w_{ij}] \in \mathbb{R}^{N \times N}$ is used to record the edge weights that characterize the similarity relationships between pairs of training samples. Then, the optimal linear embedding $v^* \in \mathbb{R}^d$ can be obtained by solving

$$v^* = \arg\min_{\substack{v:\ v^\top X D X^\top v = 1,\\ \text{or } v^\top X L' X^\top v = 1}} \; v^\top X L X^\top v, \qquad (1)$$

where $X = [x_1 \, x_2 \, \cdots \, x_N]$ is the data matrix and $L = \mathrm{diag}(W \mathbf{1}) - W$ is the graph Laplacian of $G$. Depending on the property of a problem, one of the two constraints in (1) will be used in the optimization. If the first constraint is chosen, a diagonal matrix $D = [d_{ij}] \in \mathbb{R}^{N \times N}$ is included for scale normalization. Otherwise, another complete graph $G'$ over $\mathcal{X}$ is required for the second constraint, where $L'$ and $W' = [w'_{ij}] \in \mathbb{R}^{N \times N}$ are, respectively, the graph Laplacian and affinity matrix of $G'$. The optimization problem (1) has an intuitive interpretation: $v^\top X = [v^\top x_1 \, \cdots \, v^\top x_N]$ represents the projected data; the graph Laplacian $L$ (or $L'$) is to explore the pairwise distances of the projected data, while the diagonal matrix $D$ is to weightedly combine their distances to the origin. More precisely, the meaning of (1) can be better understood with the following equivalent problem:

$$\min_v \; \sum_{i,j=1}^N \| v^\top x_i - v^\top x_j \|^2 \, w_{ij} \qquad (2)$$

$$\text{subject to} \; \sum_{i=1}^N \| v^\top x_i \|^2 \, d_{ii} = 1, \quad \text{or} \qquad (3)$$

$$\sum_{i,j=1}^N \| v^\top x_i - v^\top x_j \|^2 \, w'_{ij} = 1. \qquad (4)$$

The constrained optimization problem (2) implies that only distances to the origin or pairwise distances of the projected data (in the form of $v^\top x$) are modeled by the framework. By specifying $W$ and $D$ (or $W$ and $W'$), Yan et al. [54] show that a set of dimensionality reduction methods, such as PCA [25], LPP [23], LDA, and MFA [54], can be expressed by (1). Clearly, LDE [9] and SDA [6] are also in the class of graph embedding.
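To make the graph-embedding recipe concrete, the following is a minimal sketch (our own illustration, not code from the paper) that solves (1) under the second constraint by reducing it to a generalized eigenvalue problem; `X`, `W`, and `W_prime` are assumed to be given as numpy arrays.

```python
import numpy as np
from scipy.linalg import eigh

def graph_embedding(X, W, W_prime, n_dims=2):
    """Linear graph embedding: minimize v'XLX'v subject to v'XL'X'v = 1 (eq. 1).

    X: d x N data matrix; W, W_prime: N x N affinity matrices.
    Returns a d x n_dims projection whose columns are the smallest eigenvectors.
    """
    L = np.diag(W.sum(axis=1)) - W                     # graph Laplacian of G
    L_prime = np.diag(W_prime.sum(axis=1)) - W_prime   # graph Laplacian of G'
    A = X @ L @ X.T
    B = X @ L_prime @ X.T
    # Generalized eigenproblem A v = lambda B v; a tiny ridge keeps B invertible.
    eigvals, eigvecs = eigh(A, B + 1e-8 * np.eye(B.shape[0]))
    return eigvecs[:, :n_dims]
```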

2.3 Multiple Kernel Learning

MKL refers to the process of learning a kernel machine with multiple kernel functions or kernel matrices. Recent research efforts on MKL, e.g., [1], [20], [29], [39], [45], have shown that learning SVMs with multiple kernels not only increases the accuracy but also enhances the interpretability of the resulting classifiers. Our MKL formulation is to find an optimal way to linearly combine the given kernels.

Suppose we have a set of base kernel functions $\{k_m\}_{m=1}^M$ (or base kernel matrices $\{K_m\}_{m=1}^M$). An ensemble kernel function $k$ (or an ensemble kernel matrix $K$) is then defined by

$$k(x_i, x_j) = \sum_{m=1}^M \beta_m \, k_m(x_i, x_j), \quad \beta_m \geq 0, \qquad (5)$$

$$K = \sum_{m=1}^M \beta_m K_m, \quad \beta_m \geq 0. \qquad (6)$$

Consequently, an often-used MKL model learned from binary-class data $\{(x_i, y_i \in \{\pm 1\})\}_{i=1}^N$ is

$$f(x) = \sum_{i=1}^N \alpha_i y_i \, k(x_i, x) + b \qquad (7)$$

$$\phantom{f(x)} = \sum_{i=1}^N \alpha_i y_i \sum_{m=1}^M \beta_m k_m(x_i, x) + b. \qquad (8)$$

Optimizing over both the coefficients $\{\alpha_i\}_{i=1}^N$ and $\{\beta_m\}_{m=1}^M$ is one particular form of the MKL problems. Our approach utilizes such an MKL optimization to yield more flexible dimensionality reduction schemes for data in different feature representations.
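As a small illustration of (5) and (6), the sketch below (with toy kernel matrices, not taken from the paper) forms an ensemble kernel as a nonnegative weighted sum of base kernel matrices.

```python
import numpy as np

def ensemble_kernel(base_kernels, beta):
    """Combine base kernel matrices K_m with nonnegative weights beta_m (eq. 6)."""
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0), "kernel weights must be nonnegative"
    return sum(b * K for b, K in zip(beta, base_kernels))

# Usage: two toy base kernels over N = 3 samples, equally weighted.
K1 = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])
K2 = np.eye(3)
K = ensemble_kernel([K1, K2], beta=[0.5, 0.5])
```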


2.4 Dimensionality Reduction with Multiple Kernels

Our approach is related to the work of Kim et al. [27], where learning an optimal kernel over a given convex set of kernels is coupled with kernel Fisher discriminant analysis (KFDA) for binary-class data. Motivated by their idea of learning an optimal kernel for improving the KFDA performance, we instead consider establishing a general framework of dimensionality reduction for data in various feature representations via multiple kernel learning [32]. As we will show later, MKL-DR can be used to conveniently deal with image data depicted by different descriptors, and effectively tackle not only supervised but also semi-supervised and unsupervised learning tasks. To the best of our knowledge, such a generalization of multiple kernel learning is novel.

3 THE MKL-DR FRAMEWORK

We first discuss the construction of base kernels from multiple descriptors, and then explain how to integrate them for dimensionality reduction. Finally, we present an optimization procedure to complete the framework.

3.1 Kernel as a Unified Feature Representation

Consider again a data set $\mathcal{X}$ of $N$ samples, and $M$ kinds of descriptors to characterize each sample. Let $\mathcal{X} = \{x_i\}_{i=1}^N$, $x_i = \{x_{i,m} \in \mathcal{X}_m\}_{m=1}^M$, and let $d_m : \mathcal{X}_m \times \mathcal{X}_m \to \{0\} \cup \mathbb{R}^+$ be the distance function for the data representation under the $m$th descriptor. In general, the domains resulting from distinct descriptors, e.g., feature vectors, histograms, or bags of features, are different. To eliminate such variances in representation, we express data under each descriptor as a kernel matrix. There are several ways to accomplish this goal, such as using the RBF kernel for data in the form of vectors or the pyramid match kernel [21] for data in the form of bag-of-features. We may also convert pairwise distances between data samples to a kernel matrix [50], [58]. By coupling each representation with its corresponding distance function, we obtain a set of $M$ dissimilarity-based kernel matrices $\{K_m\}_{m=1}^M$, where

$$K_m(i, j) = k_m(x_i, x_j) = \exp\!\left( \frac{-d_m^2(x_{i,m}, x_{j,m})}{\sigma_m^2} \right) \qquad (9)$$

and $\sigma_m$ is a positive constant. Our use of dissimilarity-based kernels is convenient and advantageous in solving visual learning tasks, especially due to the fact that a number of well-designed descriptors and their associated distance functions have been introduced over the years. However, $K_m$ in (9) is not always guaranteed to be positive semidefinite. Following [58], we resolve this issue by first computing the smallest eigenvalue of $K_m$. Then, if it is negative, we add its absolute value to the diagonal of $K_m$. With (5), (6), and (9), determining a set of optimal ensemble coefficients $\{\beta_1, \beta_2, \ldots, \beta_M\}$ can now be interpreted as finding appropriate weights for best fusing the $M$ feature representations.
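A sketch of this construction, assuming a precomputed pairwise distance matrix `dist` for the $m$th descriptor, might look as follows; the diagonal shift mirrors the remedy borrowed from [58].

```python
import numpy as np

def dissimilarity_kernel(dist, sigma):
    """Build K_m(i, j) = exp(-d_m(x_i, x_j)^2 / sigma_m^2) from a distance matrix (eq. 9)."""
    K = np.exp(-(dist ** 2) / (sigma ** 2))
    # K is not guaranteed to be PSD: if the smallest eigenvalue is negative,
    # add its absolute value to the diagonal.
    lam_min = np.linalg.eigvalsh(K).min()
    if lam_min < 0:
        K = K + np.abs(lam_min) * np.eye(K.shape[0])
    return K
```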

Note that in our formulation, accessing the data is restricted to referencing the resulting M kernels defined in (9). The main advantage of doing so is that it enables our approach to work with different descriptors and distance functions, without the need to explicitly handle the variations among the representations.

3.2 The MKL-DR Algorithm

Instead of designing a specific dimensionality reduction algorithm, we choose to describe MKL-DR upon graph embedding. This way we can emphasize the flexibility of the proposed approach: If a dimensionality reduction scheme is explained by graph embedding, then it will also be extendible by MKL-DR to handle data in multiple feature representations. Recall that there are two possible types of constraints in graph embedding. For ease of presentation, we discuss how to develop MKL-DR subject to constraint (4). However, the derivation can be analogously applied when using constraint (3).

Kernelization in MKL-DR is accomplished in a similar way to that in kernel PCA [41], but with the key difference of using multiple kernels $\{K_m\}_{m=1}^M$. Suppose the ensemble kernel $K$ in MKL-DR is generated by linearly combining the base kernels $\{K_m\}_{m=1}^M$ as in (6). Let $\phi : \mathcal{X} \to \mathcal{F}$ denote the feature mapping induced by $K$. Via $\phi$, the training data can be implicitly mapped to a high-dimensional Hilbert space, i.e.,

$$x_i \mapsto \phi(x_i), \quad \text{for } i = 1, 2, \ldots, N. \qquad (10)$$

Since optimizing (1) or (2) can be reduced to solving the generalized eigenvalue problem $X L X^\top v = \lambda X L' X^\top v$, it implies that an optimal $v$ lies in the span of the training data, i.e.,

$$v = \sum_{n=1}^N \alpha_n \, \phi(x_n). \qquad (11)$$

To show that the underlying algorithm can be reformulated in the form of inner products and accomplished in the new feature space $\mathcal{F}$, we observe that by plugging each mapped sample $\phi(x_i)$ into (2), the projection $v$ would appear exclusively in the form of $v^\top \phi(x_i)$. Hence, it suffices to show that in MKL-DR, $v^\top \phi(x_i)$ can be evaluated via the kernel trick:

$$v^\top \phi(x_i) = \sum_{n=1}^N \sum_{m=1}^M \alpha_n \beta_m \, k_m(x_n, x_i) = \alpha^\top \mathbb{K}^{(i)} \beta, \qquad (12)$$

where

$$\alpha = [\alpha_1 \, \cdots \, \alpha_N]^\top \in \mathbb{R}^N, \qquad (13)$$

$$\beta = [\beta_1 \, \cdots \, \beta_M]^\top \in \mathbb{R}^M, \qquad (14)$$

$$\mathbb{K}^{(i)} = \begin{bmatrix} K_1(1, i) & \cdots & K_M(1, i) \\ \vdots & \ddots & \vdots \\ K_1(N, i) & \cdots & K_M(N, i) \end{bmatrix} \in \mathbb{R}^{N \times M}. \qquad (15)$$

With (2) and (12), we define the constrained optimization problem for 1D MKL-DR as follows:

$$\min_{\alpha, \beta} \; \sum_{i,j=1}^N \| \alpha^\top \mathbb{K}^{(i)} \beta - \alpha^\top \mathbb{K}^{(j)} \beta \|^2 \, w_{ij} \qquad (16)$$

$$\text{subject to} \; \sum_{i,j=1}^N \| \alpha^\top \mathbb{K}^{(i)} \beta - \alpha^\top \mathbb{K}^{(j)} \beta \|^2 \, w'_{ij} = 1, \qquad (17)$$

$$\beta_m \geq 0, \quad m = 1, 2, \ldots, M. \qquad (18)$$

The additional constraints in (18) arise from the use of the ensemble kernel in (5) or (6), and are to ensure that the resulting kernel $K$ in MKL-DR is a nonnegative combination of the base kernels.


Observe from (12) that the one-dimensional projection $v$ of MKL-DR is specified by a sample coefficient vector $\alpha$ and a kernel weight vector $\beta$. The two vectors, respectively, account for the relative importance among the samples and the base kernels in the construction of the projection.

To generalize the formulation to uncover a multidimensional projection, we consider a set of $P$ sample coefficient vectors, denoted by

$$A = [\alpha_1 \, \alpha_2 \, \cdots \, \alpha_P]. \qquad (19)$$

With $A$ and $\beta$, each 1D projection $v_i$ is determined by a specific sample coefficient vector $\alpha_i$ and the (shared) kernel weight vector $\beta$. The resulting projection $V = [v_1 \, v_2 \, \cdots \, v_P]$ will map samples to a $P$-dimensional euclidean space. Analogously to the 1D case, a projected sample $x_i$ can be written as

$$V^\top \phi(x_i) = A^\top \mathbb{K}^{(i)} \beta \in \mathbb{R}^P. \qquad (20)$$

The optimization problem (16) can now be extended to accommodate the multidimensional projection:

$$\min_{A, \beta} \; \sum_{i,j=1}^N \| A^\top \mathbb{K}^{(i)} \beta - A^\top \mathbb{K}^{(j)} \beta \|^2 \, w_{ij} \qquad (21)$$

$$\text{subject to} \; \sum_{i,j=1}^N \| A^\top \mathbb{K}^{(i)} \beta - A^\top \mathbb{K}^{(j)} \beta \|^2 \, w'_{ij} = 1, \qquad (22)$$

$$\beta_m \geq 0, \quad m = 1, 2, \ldots, M. \qquad (23)$$

Before specifying the details of how to solve the constrained optimization problem (21) in the next section, we give an illustration of the four kinds of spaces related to MKL-DR and the connections among them in Fig. 1. The four spaces, in order, are the input space of each feature representation, the reproducing kernel Hilbert space (RKHS) induced by each base kernel and the ensemble kernel, and the projected euclidean space.

3.3 Optimization

Since direct optimization of (21) is difficult, we instead adopt an iterative, two-step strategy to alternately optimize $A$ and $\beta$. At each iteration, one of $A$ and $\beta$ is optimized while the other is fixed, and then the roles of $A$ and $\beta$ are switched. Iterations are repeated until convergence or until a maximum number of iterations is reached.

On optimizing A. By fixing $\beta$ and using the property $\|u\|^2 = \mathrm{trace}(uu^\top)$ for a column vector $u$, the optimization problem (21) is reduced to

$$\min_A \; \mathrm{trace}(A^\top S_W^\beta A) \quad \text{subject to} \quad \mathrm{trace}(A^\top S_{W'}^\beta A) = 1, \qquad (24)$$

where

$$S_W^\beta = \sum_{i,j=1}^N w_{ij} \, (\mathbb{K}^{(i)} - \mathbb{K}^{(j)}) \, \beta \beta^\top \, (\mathbb{K}^{(i)} - \mathbb{K}^{(j)})^\top, \qquad (25)$$

$$S_{W'}^\beta = \sum_{i,j=1}^N w'_{ij} \, (\mathbb{K}^{(i)} - \mathbb{K}^{(j)}) \, \beta \beta^\top \, (\mathbb{K}^{(i)} - \mathbb{K}^{(j)})^\top. \qquad (26)$$

The optimization problem (24) is a trace ratio problem, i.e., $\min_A \mathrm{trace}(A^\top S_W^\beta A) / \mathrm{trace}(A^\top S_{W'}^\beta A)$. Following [9] and [52], one can obtain a closed-form solution by transforming (24) into the corresponding ratio trace problem, i.e., $\min_A \mathrm{trace}[(A^\top S_{W'}^\beta A)^{-1} (A^\top S_W^\beta A)]$. Consequently, the columns of the optimal $A = [\alpha_1 \, \alpha_2 \, \cdots \, \alpha_P]$ are the eigenvectors corresponding to the first $P$ smallest eigenvalues of

$$S_W^\beta \, \alpha = \lambda \, S_{W'}^\beta \, \alpha. \qquad (27)$$

On optimizing $\beta$. By fixing $A$ and using $\|u\|^2 = u^\top u$, the optimization problem (21) becomes

$$\min_\beta \; \beta^\top S_W^A \beta \quad \text{subject to} \quad \beta^\top S_{W'}^A \beta = 1 \ \text{ and } \ \beta \geq 0, \qquad (28)$$

where

$$S_W^A = \sum_{i,j=1}^N w_{ij} \, (\mathbb{K}^{(i)} - \mathbb{K}^{(j)})^\top A A^\top (\mathbb{K}^{(i)} - \mathbb{K}^{(j)}), \qquad (29)$$

$$S_{W'}^A = \sum_{i,j=1}^N w'_{ij} \, (\mathbb{K}^{(i)} - \mathbb{K}^{(j)})^\top A A^\top (\mathbb{K}^{(i)} - \mathbb{K}^{(j)}). \qquad (30)$$

The additional constraints $\beta \geq 0$ cause the optimization problem (28) to no longer be formulatable as a generalized eigenvalue problem.

Fig. 1. Four kinds of spaces in MKL-DR: (a) the input space of each feature representation, (b) the RKHS induced by each base kernel, (c) the RKHS by the ensemble kernel, and (d) the projected euclidean space.


Indeed, it now becomes a nonconvex quadratically constrained quadratic programming (QCQP) problem, which is known to be hard to solve. We instead consider solving its convex relaxation by adding an auxiliary variable $B$ of size $M \times M$:

$$\min_{\beta, B} \; \mathrm{trace}(S_W^A B) \qquad (31)$$

$$\text{subject to} \; \mathrm{trace}(S_{W'}^A B) = 1, \qquad (32)$$

$$e_m^\top \beta \geq 0, \quad m = 1, 2, \ldots, M, \qquad (33)$$

$$\begin{bmatrix} 1 & \beta^\top \\ \beta & B \end{bmatrix} \succeq 0, \qquad (34)$$

where $e_m$ in (33) is a column vector whose elements are 0 except that its $m$th element is 1, and the constraint in (34) means that the square matrix is positive semidefinite. The optimization problem (31) is a semidefinite programming (SDP) relaxation of the nonconvex QCQP problem (28), and can be efficiently solved by an SDP solver. One can verify the equivalence between the two optimization problems (28) and (31) by replacing the constraint (34) with $B = \beta\beta^\top$. Since the constraint $B = \beta\beta^\top$ is nonconvex, it is relaxed to $B \succeq \beta\beta^\top$. Applying the Schur complement lemma, $B \succeq \beta\beta^\top$ can be equivalently expressed by the constraint in (34). (Refer to [49] for the details.) Concerning the computational complexity, we note that the numbers of constraints and variables in (31) are, respectively, linear and quadratic in $M$, the number of adopted descriptors. In practice, the value of $M$ is often small ($M = 4$ to $10$ in our experiments). Thus, like most other DR methods, the computational bottleneck of MKL-DR is still in solving the generalized eigenvalue problems, whose complexity is $O(N^3)$.

Listed in Algorithm 1 (Fig. 2), the procedure of MKL-DR requires an initial guess for either $A$ or $\beta$ in the alternating optimization. We have tried two possibilities: 1) $\beta$ is initialized by setting all of its elements to 1 to equally weight the base kernels; 2) $A$ is initialized by assuming $AA^\top = I$. In our empirical testing, the second initialization strategy gives more stable performances and is thus adopted in the experiments. Pertaining to the convergence of the optimization procedure, since SDP relaxation has been used, the values of the objective function are not guaranteed to monotonically decrease throughout the iterations. Still, the optimization procedure rapidly converges after only a few iterations in all of our experiments.
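For concreteness, the alternating procedure can be outlined as below. This is a schematic sketch under our own implementation choices rather than the authors' code: the A-step solves the generalized eigenvalue problem (27) with SciPy, the β-step solves the SDP relaxation (31)-(34) with the cvxpy modeling package, and `KK[i]` stands for the N × M matrix 𝕂^(i).

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import eigh

def scatter(KK, G, transform):
    """Weighted sum over pairs of (K_i - K_j), as in (25)-(26) or (29)-(30)."""
    S, N = 0.0, len(KK)
    for i in range(N):
        for j in range(N):
            if G[i, j] != 0:
                D = KK[i] - KK[j]
                S = S + G[i, j] * transform(D)
    return S

def mkl_dr(KK, W, W_prime, P, n_iters=5):
    N, M = KK[0].shape
    A = np.eye(N, P)        # orthonormal-column initialization (cf. the AA^T = I assumption)
    for _ in range(n_iters):
        # beta-step: SDP relaxation (31)-(34) with A fixed.
        SWA = scatter(KK, W, lambda D: D.T @ A @ A.T @ D)          # (29)
        SWA0 = scatter(KK, W_prime, lambda D: D.T @ A @ A.T @ D)   # (30)
        beta = cp.Variable(M)
        B = cp.Variable((M, M), symmetric=True)
        lmi = cp.bmat([[np.ones((1, 1)), cp.reshape(beta, (1, M))],
                       [cp.reshape(beta, (M, 1)), B]])             # (34)
        prob = cp.Problem(cp.Minimize(cp.trace(SWA @ B)),
                          [cp.trace(SWA0 @ B) == 1, beta >= 0, lmi >> 0])
        prob.solve()
        b = beta.value
        # A-step: generalized eigenvalue problem (27) with beta fixed.
        SW = scatter(KK, W, lambda D: D @ np.outer(b, b) @ D.T)          # (25)
        SW0 = scatter(KK, W_prime, lambda D: D @ np.outer(b, b) @ D.T)   # (26)
        _, vecs = eigh(SW, SW0 + 1e-8 * np.eye(N))
        A = vecs[:, :P]     # eigenvectors of the P smallest eigenvalues
    return A, b
```

In practice one would also track the objective (21) across iterations, as is done in Fig. 6.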

3.4 Novel Sample Embedding

After accomplishing the training procedure of MKL-DR, we are ready to project a testing sample, say z, into the learned space of lower dimension by

$$z \mapsto A^\top \mathbb{K}^{(z)} \beta, \quad \text{where} \qquad (35)$$

$$\mathbb{K}^{(z)} \in \mathbb{R}^{N \times M} \ \text{ and } \ \mathbb{K}^{(z)}(n, m) = k_m(x_n, z). \qquad (36)$$

Depending on the application, some postprocessing, such as the nearest neighbor rule for classification or k-means clustering for data grouping, is then applied to the projected sample(s) to complete the task. In the remainder of this paper, we specifically discuss three sets of experimental results to demonstrate the effectiveness of MKL-DR, including supervised learning for object categorization, unsupervised learning for image clustering, and semi-supervised learning for face recognition.
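A minimal sketch of this embedding step (hypothetical helper names: `kernel_funcs[m]` evaluates $k_m$ and `train_data[m]` holds the training samples under the $m$th descriptor):

```python
import numpy as np

def embed_new_sample(z_feats, train_data, kernel_funcs, A, beta):
    """Project a test sample z into the learned P-dimensional space (eqs. 35-36)."""
    N, M = len(train_data[0]), len(kernel_funcs)
    Kz = np.zeros((N, M))
    for m in range(M):
        for n in range(N):
            Kz[n, m] = kernel_funcs[m](train_data[m][n], z_feats[m])
    return A.T @ Kz @ beta      # a P-dimensional vector
```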

4 EXPERIMENTAL RESULTS: SUPERVISED LEARNING FOR OBJECT CATEGORIZATION

Applying MKL-DR to object categorization is appropriate as the complexity of the task often requires the use of multiple feature descriptors. And in our experiments, the effectiveness of MKL-DR will be investigated through a supervised learning formulation.

4.1 Data Set

The Caltech-101 data set [14], collected by Fei-Fei et al., is used in our experiments for object categorization. It consists of 101 object categories and one additional class of background images. The total number of categories is 102, and each category contains roughly 40 to 800 images.

Although each target object often appears in the central region of an image, the large class number and the substantial intraclass variations still make the data set very challenging. Indeed, the data set provides a good test bed to demonstrate the advantage of using multiple image descriptors for complex recognition tasks. Note that as the images in Caltech-101 are not of the same size, we resize them to around 60,000 pixels, without changing their aspect ratio. Fig. 3 shows an image example from each category of the data set.

Fig. 2. Algorithm 1.

To implement Algorithm 1 for object recognition, we need to decide a set of descriptors for depicting the diverse objects and the underlying graph-based DR method to be generalized. Based on them, we can then derive a set of base kernels and a pair of affinity matrices, respectively. The details are described as follows.

4.2 Image Descriptors and Base Kernels

Ten different image descriptors are considered and they, respectively, yield the following base kernels (denoted below in bold and in abbreviation):

. GB-Dist: For a given image, we randomly sample 400 edge pixels, and apply the geometric blur descriptor [3] to them. With these image features, we adopt the distance function, as is suggested in (2) of the work by Zhang et al. [58], to obtain the dissimilarity-based kernel.

. GB: The base kernel is constructed in the same way as that of GB-Dist, except that the geometric distortion term is excluded in evaluating the distance.

. SIFT-Dist: The base kernel is analogously constructed as in GB-Dist, except that now the SIFT descriptor [33] is used to extract features.

. SIFT-SPM: We apply the SIFT descriptor with three different scales to an evenly sampled grid of each image, and use k-means clustering to generate visual words from the resulting local features of all images.

Then, the base kernel is built by matching spatial pyramids, which is proposed in [30].

. SS-Dist/SS-SPM: The two base kernels are, respectively, constructed as in SIFT-Dist and SIFT-SPM, except that the SIFT descriptor is replaced with the self-similarity descriptor [43]. Note that for the latter descriptor, we set the size of each patch to 5 × 5, and the radius of the window to 40.

. C2-SWP/C2-ML: Biologically inspired features are also adopted. Specifically, both the C2 features derived by Serre et al. [42] and by Mutch and Lowe [35] have been considered. For each of the two kinds of C2 features, an RBF kernel is obtained.

. PHOG: We also adopt the PHOG descriptor [5] to capture image features, and limit the pyramid level up to 2. Together with the χ² distance, it yields the resulting base kernel.

. GIST: The images are resized to 128 × 128 pixels prior to applying the gist descriptor [37]. Then, an RBF kernel is established.

The parameters in the above descriptors and distance functions are tuned independently. Namely, for each descriptor, we sample a set of parameter values and try to find a good way to linearly combine the corresponding pairwise distance matrices. To that end, we begin with an initial weight distribution focusing solely on the one yielding the best performance. We then separately and sequentially adjust each individual weight, and repeat the process until no further improvement can be attained. Such a scheme is to ensure that the resulting base kernels individually achieve their best performances.

4.3 Dimensionality Reduction Methods

We investigate two supervised DR techniques, namely, LDA and LDE [9], and show how MKL-DR can generalize them.

Both LDA and LDE perform discriminant learning on a fully labeled data set $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^N$, but make different assumptions about the data distribution: While, in LDA, data of each class are supposed to form a Gaussian, in LDE, they are assumed to spread as a submanifold. Nevertheless, both techniques can be specified by a pair of affinity matrices to fit the formulation of graph embedding (2). For convenience, the resulting MKL dimensionality reduction schemes are, respectively, termed MKL-LDA and MKL-LDE.

4.3.1 Affinity Matrices for LDA

Fig. 3. The Caltech-101 data set. One example comes from each of the 102 categories. All of the 102 categories are used in the experiments of supervised object recognition, while the 20 categories marked by the red bounding boxes are used in the following experiments of unsupervised image clustering.

The two affinity matrices $W = [w_{ij}]$ and $W' = [w'_{ij}]$ are defined as

$$w_{ij} = \begin{cases} 1/n_{y_i}, & \text{if } y_i = y_j, \\ 0, & \text{otherwise}, \end{cases} \qquad (37)$$

$$w'_{ij} = \frac{1}{N}, \qquad (38)$$

where $n_{y_i}$ is the number of data points that belong to class $y_i$. See [54] for the derivation.
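A direct way to materialize (37) and (38) from a label vector (our own sketch, not the authors' code) is:

```python
import numpy as np

def lda_affinities(y):
    """W and W' for LDA in the graph-embedding form (eqs. 37-38)."""
    y = np.asarray(y)
    N = len(y)
    counts = {c: np.sum(y == c) for c in np.unique(y)}     # n_{y_i} per class
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if y[i] == y[j]:
                W[i, j] = 1.0 / counts[y[i]]
    W_prime = np.full((N, N), 1.0 / N)
    return W, W_prime
```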

4.3.2 Affinity Matrices for LDE

In LDE, not only the data labels but also the neighborhood relationships are simultaneously considered, namely,

$$w_{ij} = \begin{cases} 1, & \text{if } y_i = y_j \wedge [\, i \in N_k(j) \vee j \in N_k(i) \,], \\ 0, & \text{otherwise}, \end{cases} \qquad (39)$$

$$w'_{ij} = \begin{cases} 1, & \text{if } y_i \neq y_j \wedge [\, i \in N_{k'}(j) \vee j \in N_{k'}(i) \,], \\ 0, & \text{otherwise}, \end{cases} \qquad (40)$$

where $i \in N_k(j)$ means that sample $x_i$ is one of the $k$ nearest neighbors of sample $x_j$. The definitions of the affinity matrices are faithful to those in LDE [9]. However, there are now multiple image descriptors and each of them would yield an affinity matrix. Since we typically do not know/assume in advance which would be more important to a given task, we simply average the resulting affinity matrices to derive a unified one.
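Analogously, (39) and (40) can be sketched from the labels and a pairwise distance matrix (e.g., the averaged one just mentioned); the neighborhood sizes k and k' are assumed inputs.

```python
import numpy as np

def lde_affinities(y, dist, k=5, k_prime=10):
    """W and W' for LDE (eqs. 39-40) from labels and a pairwise distance matrix."""
    y = np.asarray(y)
    N = len(y)
    order = np.argsort(dist, axis=1)[:, 1:]          # exclude self (column 0)
    knn = {i: set(order[i, :k]) for i in range(N)}
    knn_p = {i: set(order[i, :k_prime]) for i in range(N)}
    W = np.zeros((N, N))
    W_prime = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            near = (i in knn[j]) or (j in knn[i])
            near_p = (i in knn_p[j]) or (j in knn_p[i])
            if y[i] == y[j] and near:
                W[i, j] = 1.0
            if y[i] != y[j] and near_p:
                W_prime[i, j] = 1.0
    return W, W_prime
```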

4.4 Quantitative Results

Like in [2], [50], [58], we randomly pick 30 images from each of the 102 categories, and split them into two disjoint subsets: One contains $N_{\text{train}}$ images per category, and the other consists of the rest. The two subsets are, respectively, used as the training and testing data. Via MKL-DR, the data are projected to the learned space, and the recognition task is accomplished there by enforcing the nearest-neighbor rule.

To relieve the effect of sampling, the whole process of performance evaluation is redone 20 times by using different random splits between the training and testing subsets. The recognition rates are measured in the cases where the value of $N_{\text{train}}$ is, respectively, set as 5, 10, 15, 20, and 25.

Coupling the 10 base kernels with the affinity matrices of LDA and LDE, we can, respectively, derive MKL-LDA and MKL-LDE using Algorithm 1. Their effectiveness is investigated by comparing with KFD (kernel Fisher discriminant) [34] and KLDE (kernel LDE) [9]. Since KFD considers only one base kernel at a time, we implement four strategies to take account of the information from the 10 resulting KFD classifiers, including

1. KFD-Voting: It is constructed based on the voting result of the 10 KFD classifiers. If there is any ambiguity in the voting result, the next nearest neighbor in each KFD classifier will be considered, and the process is continued until a decision on the class label can be made.

2. KFD-Concatenate: For each sample, we concatenate its separately learned feature vectors, each of which is normalized by dividing the standard deviation of the pairwise distances among the projected training data.

3. KFD-AvgKernel: KFD is reapplied to the average kernel of the 10 base ones.

4. KFD-SAMME: By viewing each KFD classifier as a multiclass weak learner, we boost them by SAMME [59], which is a multiclass generalization of AdaBoost.

Analogously, the four strategies are also adopted for the KLDE classifiers.

The values of the parameters $\{\sigma_m\}$ in (9) are critical to the performance of MKL-DR. However, it is almost infeasible to find their optimal values exhaustively. We instead adopt the following procedure, which gives satisfactory results. Observe that the larger the $\sigma_m$, the more evenly the entries in $K_m$ are distributed. Fixing some values of, say, $s$ and $t$, we adjust the value of $\sigma_m$ by binary search such that the largest $s$ entries in $K_m$ take up $t$ percent of the sum of all entries. Given $s$ and $t$, the values of $\{\sigma_m\}$ can thus be determined. We set $s$ as a constant and exhaustively seek the optimal $t$. The resulting $\{\sigma_m\}$ then serve as the initialization to the procedure described at the end of Section 4.2.
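Our reading of this heuristic can be sketched as follows, with assumed search bounds and with t expressed as a fraction:

```python
import numpy as np

def tune_sigma(dist, s, t, lo=1e-3, hi=1e3, iters=50):
    """Binary-search sigma so that the s largest entries of K account for
    fraction t of the sum of all entries (larger sigma -> flatter K)."""
    for _ in range(iters):
        sigma = np.sqrt(lo * hi)                     # geometric midpoint
        K = np.exp(-(dist ** 2) / (sigma ** 2))
        frac = np.sort(K.ravel())[-s:].sum() / K.sum()
        if frac > t:        # kernel too peaked: increase sigma to flatten it
            lo = sigma
        else:
            hi = sigma
    return np.sqrt(lo * hi)
```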

Table 1 summarizes the mean recognition rates and the standard deviations of the KFD classifiers and the MKL-LDA classifiers when different amounts of training data are available. By focusing on $N_{\text{train}} = 15$, we observe that MKL-LDA achieves a significant performance gain of 13.9 percent (= 74.5% − 60.6%) over the best recognition rate by the 10 KFD classifiers. It suggests that the 10 base kernels tend to complement each other, and our approach can effectively fuse them to result in a more powerful classifier.

TABLE 1. Recognition Rates of LDA-Based Classifiers on the Caltech-101 Data Set [mean ± std percent].

On the other hand, while KFD-Voting, KFD-Concatenate, and KFD-SAMME try to combine the separately trained KFD classifiers, MKL-LDA jointly integrates the 10 kernels into the learning process. The quantitative results show that MKL-LDA can make the most of fusing various feature descriptors, and improve the recognition rates from 69.8, 71.7, and 72.3 percent to 74.5 percent. Besides, MKL-LDA outperforms KFD-AvgKernel. That is, the ensemble kernel based on the kernel weight vector learned by MKL-DR is more effective than the average kernel. Similar improvements can be observed in cases where different numbers of training data per class are used.

The quantitative results of KLDE and MKL-LDE are reported in Table 2. Like MKL-LDA, MKL-LDE achieves similar degrees of improvements over the KLDE classifiers and their combinations. In addition, we explore the effect of the dimensions of the unified feature space by MKL-DR.

Illustrated with MKL-LDE and KLDE, we evaluate their recognition rates over a range of embedding dimensions, i.e., $P$ in (20). The results are plotted in Fig. 4. We see that the recognition rates by MKL-LDE and KLDE all converge around the dimensions of 90–110. Compared with KLDE, MKL-LDE can achieve similar degrees of accuracy with fewer dimensions.

When the number of training data per class from Caltech-101 is set as 15, the recognition rates of 74.5 percent by MKL-LDA and 74.9 percent by MKL-LDE are favorably comparable to those by most existing approaches. In [2], Berg et al. report a recognition rate of 48 percent based on deformable shape matching. Using the pyramid matching kernel over data in the bag-of-features representation, the recognition rate by Grauman and Darrell [21] is 50 percent. Subsequently, Lazebnik et al. [30] improve it to 56.4 percent by considering the spatial pyramid matching kernel. In [58], Zhang et al. combine the geometric blur descriptor and spatial information to achieve 59.05 percent. Our related work [31] that performs adaptive feature fusing via locally combining kernel matrices has a recognition rate of 59.8 percent, while merging 12 kernel matrices from the support kernel machines (SKMs) [1] by Kumar and Sminchisescu [28] yields 57.3 percent. Frome et al. [16] propose to learn a local distance for each training sample, and derive 60.3 percent. To tackle the weakly labeled attribute of Caltech-101, Bosch et al. [5] suggest finding the ROIs of images before performing recognition, and report an accuracy rate of 70.4 percent. By exploring the features from subcategories, Todorovic and Ahuja [48] report a recognition rate around 73 percent. Similar results are obtained by Christoudias et al. [10] using localized Gaussian processes with multiple kernels. Gehler and Nowozin [19] carry out multiple kernel learning in a boosting manner, and achieve 74.6 percent. In Fig. 5, we summarize the recognition rates of our approach and several published techniques, including [2], [4], [5], [14], [17], [19], [21], [24], [26], [30], [35], [42], [55], [58], under different sizes of training data.

We complete the section by discussing the convergence property of our algorithm. Take, for example, learning MKL-LDA with different sizes of training data per class.

The values of the objective function (21) through the iterative optimization are, respectively, shown in Figs. 6a, 6b, 6c, 6d, and 6e. In each iteration, two such values are plotted to account for updating either $A$ or $\beta$. It can be observed that all of the optimization procedures rapidly converge after a few iterations. Also, increasing the size of training data tends to speed up the convergence, which is reasonable since sufficient information generally facilitates solving an optimization task.

TABLE 2. Recognition Rates of LDE-Based Classifiers on the Caltech-101 Data Set [mean ± std percent].

Fig. 4. Recognition rates versus different dimensions of the projected data when $N_{\text{train}} = 15$.

5 EXPERIMENTAL RESULTS: UNSUPERVISED LEARNING FOR IMAGE CLUSTERING

To explain the link between MKL-DR and unsupervised learning, we investigate the problem of image clustering. In this case, MKL-DR can be viewed as a preprocessing tool to enrich the capacity of an existing clustering technique.

There are two main advantages for so doing. First, since MKL-DR can learn a unified space for image data in multiple representations, it enables the underlying clustering algorithm to simultaneously consider characteristics captured by distinct descriptors. Second, a majority of clustering algorithms, e.g., k-means, are designed to work only in the euclidean space. With MKL-DR, no matter which original spaces the data reside in, they all can be projected to the learned euclidean space, and consequently our formulation can extend the applicability of such a clustering method.

5.1 Data Set

We follow the setting in [12], where affinity propagation [15] is used for unsupervised image categorization, and select the same 20 categories from Caltech-101 for the image clustering experiments. Examples from the 20 image categories are shown in Fig. 3, and each is marked with a bold red bounding box. Due to the category-wise differences in the number of images, we randomly select 30 images from each category to form a data set of 600 images.

5.2 Image Descriptors and Base Kernels

Since the data set is now a subset of Caltech-101, it is convenient to use the same 10 descriptors and distance functions that are discussed in Section 4.2 to establish the base kernels for MKL-DR.

5.3 Dimensionality Reduction Method

For image clustering, we consider implementing MKL-DR with LPP [23], and denote it as MKL-LPP. The LPP technique is known to be an unsupervised DR scheme that can uncover a low-dimensional subspace by preserving the neighborhood structures. The property is particularly useful since respecting the locality information often plays a key role in the clustering outcomes. To carry out MKL-LPP, we need to reduce LPP to the formulation of graph embedding (2). This is accomplished by defining $W = [w_{ij}]$ and $D = [d_{ij}]$ as

$$w_{ij} = \begin{cases} 1, & \text{if } i \in N_k(j) \vee j \in N_k(i), \\ 0, & \text{otherwise}, \end{cases} \qquad (41)$$

$$d_{ij} = \begin{cases} \sum_{n=1}^N w_{in}, & \text{if } i = j, \\ 0, & \text{otherwise}. \end{cases} \qquad (42)$$

Note that LPP is specified by an affinity matrix $W$ and a diagonal matrix $D$, instead of a pair of affinity matrices $W$ and $W'$. In Section 3.3, although we only discuss how to derive MKL-DR with DR methods that can be expressed by a pair of $W$ and $W'$, the derivation for those specified by a pair of $W$ and $D$ is indeed analogous, and the details are omitted here for the sake of space.
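A minimal sketch of (41) and (42), assuming a pairwise distance matrix `dist` with zero diagonal:

```python
import numpy as np

def lpp_matrices(dist, k=5):
    """W (symmetric kNN adjacency) and diagonal D for LPP (eqs. 41-42)."""
    N = dist.shape[0]
    order = np.argsort(dist, axis=1)[:, 1:k + 1]     # k nearest neighbors of each sample
    W = np.zeros((N, N))
    for i in range(N):
        W[i, order[i]] = 1.0
    W = np.maximum(W, W.T)                           # i in N_k(j) OR j in N_k(i)
    D = np.diag(W.sum(axis=1))
    return W, D
```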

5.4 Quantitative Results

Coupling LPP with the base kernels, MKL-LPP would project the given image data to a learned space, where clustering algorithms will be performed. In the experiments, we restrict the number of clusters to the number of classes in the data set, i.e., 20, for all of the tested clustering algorithms, and evaluate their performances with the following two criteria: normalized mutual information (NMI) [46], and clustering accuracy (ACC). (Refer to [53] for their definitions.)

For the purpose of comparison, we first consider affinity propagation [15] for clustering (without any data preprocessing). The clustering technique is devised to detect representative exemplars (clusters) by taking the similarities between data pairs as input. When image data are represented by each of the 10 feature representations, the pairwise similarities are set to the negative distances measured by the corresponding distance function. The clustering results evaluated based on NMI and ACC are reported in the third column of Table 3.

Fig. 5. Recognition rates of several published systems on Caltech-101 versus different amounts of training data.

Fig. 6. The values of the objective function of MKL-LDA through the iterative optimization procedure when $N_{\text{train}}$ is set as (a) 5, (b) 10, (c) 15, (d) 20, and (e) 25, respectively.

We then, respectively, adopt kernel LPP and MKL-LPP to preprocess the data. The main difference between the two is that kernel LPP learns a projection by taking one base kernel into account at a time, while MKL-LPP considers the 10 base kernels simultaneously. In the fourth column of Table 3, we show the clustering results of applying affinity propagation to the projected data by both schemes. It can be observed that with the advantage of exploring data characteristics from various aspects, MKL-LPP can achieve significant improvements in the clustering outcomes: NMI is increased from 0.621 to 0.737 and ACC is improved from 59.2 to 78.3 percent. Furthermore, owing to its better use of complementary image descriptors, MKL-LPP also outperforms KLPP-Concatenate and KLPP-AvgKernel.

The same experiments are repeated by replacing affinity propagation with k-means, and the results are given in the last two columns of Table 3. Note that if no additional preprocessing is performed, k-means is only applicable to data under the representations of C2-SWP, C2-ML, and GIST in that the others lead to non-euclidean spaces. Again, considerable performance gains with MKL-LPP can be concluded from the clustering results.

Besides demonstrating the usefulness of MKL-LPP with the quantitative results, it would be insightful if the projected data can be visually compared. However, for kernel LPP and MKL-LPP, directly embedding the data into a 2D space for visualization is not practical since both would not yield good clustering results in such a low-dimensional space. Instead, we first use kernel LPP and MKL-LPP to embed the data to low-dimensional spaces in which they, respectively, achieve their best clustering performance. Then, we apply multidimensional scaling (MDS) [11] to find the 2D projections. In Fig. 7, we show 2D visualizations of the projected data, respectively, obtained by kernel LPP with the base kernels GB-Dist and GIST, and by MKL-LPP with all the 10 base kernels. Each point in the figures represents a data sample, and its color indicates its class label. For the purpose of better illustration, only data from 10 of the 20 classes (i.e., even-numbered classes) are plotted. Fig. 7 reveals that MKL-LPP can effectively utilize the data characteristics extracted by different descriptors and results in a more meaningful projection. That is, data of the same class tend to gather together, while data from different classes are kept apart. This property facilitates a clustering algorithm to identify representative clusters and achieve better performances.

TABLE 3. Clustering Performances on the 20-Class Image Data Set [NMI/ACC] percent.

Fig. 7. The 2D visualizations of the projected data. Each point represents a data sample, and its color indicates its class label. The projections are learned by (a) kernel LPP with base kernel GB-Dist (KLPP + GB-Dist), (b) kernel LPP with base kernel GIST (KLPP + GIST), and (c) MKL-LPP with all the 10 base kernels (MKL-LPP + All kernels).

6 EXPERIMENTAL RESULTS: SEMI-SUPERVISED LEARNING FOR FACE RECOGNITION

Our last set of experiments focuses on evaluating the performance gains of applying MKL-DR to semi-supervised learning tasks. In solving such problems, the given data set is often partially labeled, and one is required to make use of both the class information of the labeled data and the intrinsic relationships of the unlabeled ones to accomplish the learning tasks. Specifically, we consider the face recognition problem to demonstrate the advantages of our approach, and exploit the property that the face images of an identity spread as a submanifold if they are sufficiently sampled to guide and regularize the optimization process in learning a more effective face recognition system.

6.1 Data Set

The CMU PIE database [44] is used in our experiments of face recognition. It is comprised of face images of 68 subjects. For a practical setting, we divide the 68 people into four equal-size disjoint groups, each of which contains face images from 17 subjects characterized by a certain kind of variations. (See Fig. 8 for an overview.) Specifically, for each subject in the first group, we consider only the images of the frontal pose (C27) taken in varying lighting conditions (those under the directory “lights”). For subjects in the second and third groups, the images with near frontal poses (C05, C07, C09, C27, and C29) under the directory

“expression” are used. While each image from the second group is rotated by a randomly sampled angle within [−45°, 45°], each from the third group is instead occluded by a nonface patch whose area is about 10 percent of the face region. Finally, for subjects in the fourth group, the images with out-of-plane rotations are selected under the directory “expression” and with the poses (C05, C11, C27, C29, and C37). All images are cropped and resized to 51 × 51 pixels.

Performing face recognition over the resulting data set is challenging because the distances among data of the same class (identity) could be even larger than those among data of distinct classes if improper descriptors are used. On the other hand, adding the aforementioned variations to the data set is useful for emulating the practical situations, which are often caused, say, by imperfect face detectors or in uncontrolled environments.

6.2 Image Descriptors and Base Kernels

Again, our objective is to select a set of visual features that can well capture the subjects' characteristics as well as tolerate the large intraclass variations. In total, we consider using four image descriptors and their respective distance functions. Via (9), they result in four dissimilarity-based kernels described as follows:

. RsL2: Each sample is represented by its pixel intensities in raster scan order. Also, the euclidean (L2) distance is used to correlate two images. This is a widely used representation for face images.

. RsLTS: The base kernel is similar to RsL2, except that the distance function is now based on the least trimmed squares (LTS) with 20 percent outliers allowed. It is designed to take account of the partial occlusions in a face image.

. DeLight: The underlying feature representation is obtained from the delighting algorithm [22], and the corresponding distance function is set as $1 - \cos\theta$, where $\theta$ is the angle between a pair of samples under the representation. Some delighting results are shown in Fig. 9. It can be seen that variations caused by different lighting conditions are significantly alleviated under the representation.

. LBP: As is illustrated in Fig. 10, we divide each image into 96 = 24 × 4 regions, and use a rotation-invariant local binary pattern (LBP) operator [36] (with operator setting $\mathrm{LBP}_{8,1}^{riu2}$) to detect 10 distinct binary patterns. Thus, an image can be represented by a 960-dimensional vector, where each dimension records the number of occurrences that a specific pattern is detected in the corresponding region. To achieve rotation invariance, the distance between two such vectors, say, $x_i$ and $x_j$, is the minimal one among the 24 values computed from the distance function $1 - \mathrm{sum}(\min(x_i, x_j)) / \mathrm{sum}(\max(x_i, x_j))$ by circularly shifting the starting radial axis for $x_j$. Clearly, the base kernel is constructed to deal with variations resulting from rotations (a sketch of this distance is given after the figure captions below).

Fig. 8. Four kinds of intraclass variations caused by (a) different lighting conditions, (b) in-plane rotations, (c) partial occlusions, and (d) out-of-plane rotations.

Fig. 9. Images obtained by applying the delighting algorithm [22] to the five images in Fig. 8a. Clearly, variations caused by different lighting conditions are alleviated.

Fig. 10. Each image is divided into 96 regions. The distance between two images is obtained by circularly shifting the starting radial axis.
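Returning to the LBP distance just described, the rotation-invariant matching can be sketched as below; the (24, 4, 10) array layout (24 angular sectors × 4 radial bands × 10 patterns) is our assumption about how the 960-dimensional vector is organized.

```python
import numpy as np

def lbp_distance(h1, h2):
    """Minimum over the 24 circular shifts of 1 - sum(min)/sum(max)."""
    dists = []
    for shift in range(24):
        h2_rot = np.roll(h2, shift, axis=0)          # shift the starting radial axis
        dists.append(1.0 - np.minimum(h1, h2_rot).sum() / np.maximum(h1, h2_rot).sum())
    return min(dists)

# h1, h2: arrays of shape (24, 4, 10) holding LBP pattern counts per region.
```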

6.3 Dimensionality Reduction Method

We adopt SDA [6] as the semi-supervised DR technique to be generalized by MKL-DR. SDA carries out discriminant learning over labeled data while preserving the geometric structure of unlabeled data. Analogously to LDA and LDE in Section 4.3, SDA can be specified by two affinity matrices $W = [w_{ij}]$ and $W' = [w'_{ij}]$, when a partially labeled data set $\mathcal{X} = \{(x_p, y_p)\}_{p=1}^{N_l} \cup \{x_q\}_{q=N_l+1}^{N_l+N_u}$ is available:

$$w_{ij} = \begin{cases} 1/n_{y_i} + \gamma \, s_{ij}, & \text{if } y_i = y_j, \\ \gamma \, s_{ij}, & \text{otherwise}, \end{cases} \qquad (43)$$

$$w'_{ij} = \begin{cases} 1/N_l, & \text{if both } x_i \text{ and } x_j \text{ are labeled}, \\ 0, & \text{otherwise}, \end{cases} \qquad (44)$$

where

$$s_{ij} = \begin{cases} 1, & \text{if } i \in N_k(j) \vee j \in N_k(i), \\ 0, & \text{otherwise}, \end{cases} \qquad (45)$$

and $\gamma$ is a positive parameter to adjust the relative importance between the label information and the neighborhood relationships.
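A sketch of (43)-(45), where unlabeled samples are marked with the label -1 and `gamma` stands for the trade-off parameter in (43) (the symbol and the marker are our notation):

```python
import numpy as np

def sda_affinities(y, dist, k=5, gamma=0.1):
    """W and W' for SDA (eqs. 43-45); y[i] == -1 marks an unlabeled sample."""
    y = np.asarray(y)
    N = len(y)
    labeled = y != -1
    n_l = int(labeled.sum())
    counts = {c: np.sum(y == c) for c in np.unique(y[labeled])}
    order = np.argsort(dist, axis=1)[:, 1:k + 1]
    S = np.zeros((N, N))
    for i in range(N):
        S[i, order[i]] = 1.0
    S = np.maximum(S, S.T)                           # s_ij of eq. (45)
    W = gamma * S
    W_prime = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if labeled[i] and labeled[j]:
                W_prime[i, j] = 1.0 / n_l
                if y[i] == y[j]:
                    W[i, j] += 1.0 / counts[y[i]]
    return W, W_prime
```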

6.4 Quantitative Results

In the experiments, we randomly select 12 images from each subject. Three of them serve as the labeled training data, while the other nine serve as the unlabeled ones. In total, we have 816 (i.e., 12 × 68) images for training (204 of them are labeled), and the remainder for testing. Since the numbers of images for subjects in different groups may not be the same, an accuracy rate is first computed for each subject, and the reported recognition rate is the average of them. The overall procedure is repeated eight times to reduce the effect of sampling.

We report the mean recognition rates and the standard deviations in the third column of Table 4. It can be observed that MKL-SDA achieves a significant performance gain of 16.8 percent (= 78.8% − 62.0%) over the best recognition rate by the four kernel SDA (or KSDA for short) classifiers. It also outperforms KSDA-Voting, KSDA-Concatenate, KSDA-AvgKernel, and KSDA-SAMME, and improves the recognition rates from 65.7–70.8 percent to 78.8 percent.

To evaluate the effect of using unlabeled training data in SDA, we compare MKL-SDA with MKL-LDA and MKL-LDE. The main difference among them is that MKL-SDA considers both labeled and unlabeled training data, while MKL-LDA and MKL-LDE use only the labeled ones. The quantitative results in Table 4 show that MKL-SDA can boost the recognition rate by about 10 percent by making use of the additional information from the unlabeled training data.

We also provide the recognition rates with respect to each of the four groups in the last four columns of Table 4. (We name each group according to the type of its intraclass variation.) Note that each such recognition rate is computed by considering only the data in a particular group, and no new classifiers are trained. As expected, the four base kernels generally result in classifiers that produce good performances in dealing with some specific kinds of intraclass variations. For example, the base kernel DeLight achieves a near perfect result for subjects in the Lighting group, and RsLTS yields satisfactory results in the Occlusion group. However, none of them is good enough for dealing with the whole data set. On the other hand, MKL-DR can effectively combine the four base kernels to complement them, and leads to a remarkable increase in accuracy.

Finally, we discuss the learned combination weights over the four base kernels, i.e., $\beta$ in Algorithm 1. In Fig. 11, we plot the averages and the standard deviations of the $\beta$ learned by MKL-SDA. Observe first that the weights are not directly in proportion to the individual performances of the base kernels. This is mostly due to the fact that the scales of the four base kernels are different. Further, the degrees of information complement and redundancy among these base kernels should also be taken into account since they are considered jointly in learning the weights. The other useful property readily inferred from Fig. 11 is that the standard deviations in the combination weights are uniformly small, and it highlights the desirable stability of MKL-DR.

TABLE 4. Recognition Rates of Several Classifiers on the CMU PIE Data Set [mean ± std percent].

7 CONCLUSIONS

The proposed MKL-DR introduces a new paradigm to fortify a broad scope of existing dimensionality reduction techniques. Its main advantage lies in the ability of learning a unified space of low dimension for data in multiple feature representations. Such a flexibility is important in tackling complicated vision problems, and allows one to explore more prior knowledge for effectively analyzing a given data set, including choosing a proper set of visual features to better characterize the data and adopting a graph-based DR method to appropriately model the relationship among the data points. Throughout this work, MKL-DR has been comprehensively evaluated in three important computer vision applications, including supervised object recognition, unsupervised image clustering, and semi-supervised face recognition. The promising experimental results further consolidate the usefulness of our approach.

Also, as we have demonstrated, MKL-DR can extend the multiple kernel learning framework to address not only the supervised learning problems but also the unsupervised and semi-supervised ones. This aspect of generalization introduces a new frontier in applying MKL to solving vision and learning applications.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their comments. This work is supported in part by grants 95-2221-E-001-031-MY3 and 97-2221-E-001-019-MY3.

REFERENCES

[1] F. Bach, G. Lanckriet, and M. Jordan, “Multiple Kernel Learning, Conic Duality, and the SMO Algorithm,” Proc. Int’l Conf. Machine Learning, 2004.

[2] A. Berg, T. Berg, and J. Malik, “Shape Matching and Object Recognition Using Low Distortion Correspondences,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 26-33, 2005.

[3] A. Berg and J. Malik, “Geometric Blur for Template Matching,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 607-614, 2001.

[4] O. Boiman, E. Shechtman, and M. Irani, “In Defense of Nearest-Neighbor Based Image Classification,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.

[5] A. Bosch, A. Zisserman, and X. Muñoz, “Image Classification Using Random Forests and Ferns,” Proc. IEEE Int’l Conf. Computer Vision, 2007.

[6] D. Cai, X. He, and J. Han, “Semi-Supervised Discriminant Analysis,” Proc. IEEE Int’l Conf. Computer Vision, 2007.

[7] J. Carreira and C. Sminchisescu, “Constrained Parametric Min-Cuts for Automatic Object Segmentation,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.

[8] C.-P. Chen and C.-S. Chen, “Lighting Normalization with Generic Intrinsic Illumination Subspace for Face Recognition,” Proc. IEEE Int’l Conf. Computer Vision, pp. 1089-1096, 2005.

[9] H.-T. Chen, H.-W. Chang, and T.-L. Liu, “Local Discriminant Embedding and Its Variants,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 846-853, 2005.

[10] M. Christoudias, R. Urtasun, and T. Darrell, “Bayesian Localized Multiple Kernel Learning,” technical report, Electrical Eng. and Computer Science Dept., Univ. of California, Berkeley, 2009.

[11] T. Cox and M. Cox, Multidimensional Scaling. Chapman & Hall, 1994.

[12] D. Dueck and B. Frey, “Non-Metric Affinity Propagation for Unsupervised Image Categorization,” Proc. IEEE Int’l Conf. Computer Vision, 2007.

[13] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge (VOC2007) Results, 2007.

[14] L. Fei-Fei, R. Fergus, and P. Perona, “Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories,” Proc. IEEE Computer Vision and Pattern Recognition Workshop Generative-Model Based Vision, 2004.

[15] B. Frey and D. Dueck, “Clustering by Passing Messages between Data Points,” Science, vol. 315, pp. 972-976, 2007.

[16] A. Frome, Y. Singer, and J. Malik, “Image Retrieval and Classification Using Local Distance Functions,” Advances in Neural Information Processing Systems, pp. 417-424, MIT Press, 2006.

[17] A. Frome, Y. Singer, F. Sha, and J. Malik, “Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification,” Proc. IEEE Int’l Conf. Computer Vision, 2007.

[18] C. Galleguillos, B. McFee, S. Belongie, and G. Lanckriet, “Multi-Class Object Localization by Combining Local Contextual Interactions,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.

[19] P. Gehler and S. Nowozin, “On Feature Combination for Multiclass Object Classification,” Proc. IEEE Int’l Conf. Computer Vision, 2009.

[20] M. Gönen and E. Alpaydin, “Localized Multiple Kernel Learning,” Proc. Int’l Conf. Machine Learning, pp. 352-359, 2008.

[21] K. Grauman and T. Darrell, “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features,” Proc. IEEE Int’l Conf. Computer Vision, pp. 1458-1465, 2005.

[22] R. Gross and V. Brajovic, “An Image Preprocessing Algorithm for Illumination Invariant Face Recognition,” Proc. Int’l Conf. Audio- and Video-Based Biometric Person Authentication, pp. 10-18, 2003.

[23] X. He and P. Niyogi, “Locality Preserving Projections,” Advances in Neural Information Processing Systems, MIT Press, 2003.

[24] A. Holub, M. Welling, and P. Perona, “Combining Generative Models and Fisher Kernels for Object Recognition,” Proc. IEEE Int’l Conf. Computer Vision, pp. 136-143, 2005.

[25] I. Joliffe, Principal Component Analysis. Springer-Verlag, 1986.

[26] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell, “Gaussian Processes for Object Categorization,” Int’l J. Computer Vision, vol. 88, no. 2, pp. 169-188, 2010.

[27] S.-J. Kim, A. Magnani, and S. Boyd, “Optimal Kernel Selection in Kernel Fisher Discriminant Analysis,” Proc. Int’l Conf. Machine Learning, pp. 465-472, 2006.

[28] A. Kumar and C. Sminchisescu, “Support Kernel Machines for Object Recognition,” Proc. IEEE Int’l Conf. Computer Vision, 2007.

[29] G. Lanckriet, N. Cristianini, P. Bartlett, L. Ghaoui, and M. Jordan, “Learning the Kernel Matrix with Semidefinite Programming,” J. Machine Learning Research, vol. 5, pp. 27-72, 2004.

[30] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2169-2178, 2006.

Fig. 11. The learned kernel weights by MKL-SDA.
