Because KPP is a linear dimension reduction that can be computed efficiently, it inherently addresses the resource limitations of mobile devices. The remaining objectives of the new projection matrix learning algorithm are (1) preserving the classification accuracy of the original features and (2) reducing the storage overhead of the projection matrix.
The notation is as follows. $X \in \mathbb{R}^{D \times N}$ denotes the data set containing $N$ instances, where the column vector $x_i$ denotes data point $i$ in $D$-dimensional space. $K$ denotes the kernel matrix with $K_{ij} = k(x_i, x_j)$, where $k(\cdot, \cdot)$ is the kernel function.
3.2.1 Projection learning to approximate kernel matrix
Our primary goal is to find a linear feature transformation (by projection) such that the classification performance of the resulting features is similar to that of the original ones. The goal is similar to that of feature map methods [50], which try to find an explicit feature transformation $\bar{\Phi}(x)$ where
$$k(x_i, x_j) \approx \bar{\Phi}(x_i) \cdot \bar{\Phi}(x_j). \tag{3.1}$$
Because kernel functions determine the input space of an SVM (in dual form) through an implicit feature transformation, the feature transformation $\bar{\Phi}(x)$ yields an SVM similar to that of the kernel function $k(\cdot, \cdot)$ and thus similar performance.
Figure 3.3: Illustration of kernel preserving projection (KPP). Conventionally, a kernel function is the inner product after a feature transformation to a higher or even infinite-dimensional space (left). Our proposal, KPP, goes the other way (right): it is a linear feature transformation by projection that "reduces" the dimension of the original features. The inner product of the signatures generated by KPP approximates the original kernel. The justification for the approximation is given in Section 3.2.4.
Traditionally, a feature map $\bar{\Phi}$ increases the feature dimension and therefore limits scalability in both data size and feature dimension. Gavves et al. [24] proposed a feature selection and weighting method for additive kernels that learns the weights of feature dimensions such that the kernel matrix of the resulting low-dimensional features approximates the original kernel. Although the method improves scalability, it does not consider cross-dimension correlations and applies only to additive kernels, whereas our method applies to general kernels.
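To make the dimension growth of an explicit feature map concrete, the following minimal sketch uses the homogeneous quadratic kernel $k(x, y) = (x \cdot y)^2$, for which an exact map is known. This kernel and the helper name `quadratic_feature_map` are illustrative assumptions; they are not one of the approximate feature maps of [50].

```python
import numpy as np

def quadratic_feature_map(x):
    """Explicit map for k(x, y) = (x . y)^2: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
y = rng.standard_normal(8)

# Kernel evaluated directly in the original 8-dimensional space.
kernel_value = np.dot(x, y) ** 2
# The same value as an inner product of mapped features.
mapped_value = np.dot(quadratic_feature_map(x), quadratic_feature_map(y))
# The mapped features have 8 * 8 = 64 dimensions.
mapped_dim = quadratic_feature_map(x).shape[0]
```

Even for this simple kernel, the mapped dimension grows quadratically with $D$, which is the scalability issue described above.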
Motivated by the feature map methods and by the applicability to unseen images in mobile recognition, we aim to learn a projection matrix $P \in \mathbb{R}^{d \times D}$ such that the kernel matrix of the resulting low-dimensional signatures approximates the kernel of the original features,
$$K \approx (PX)^T (PX) = X^T P^T P X, \tag{3.2}$$
as illustrated in Fig. 3.3. As a first step, we derive the projection matrix as
$$P^* = \arg\min_P \| K - X^T P^T P X \|_F, \tag{3.3}$$
where $\| \cdot \|_F$ denotes the Frobenius norm. The formulation is similar to multidimensional scaling, with the kernel values $k(x_i, x_j)$ playing the role of the pairwise distances.
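The objective in Eq. 3.3 can be evaluated directly for a candidate projection. The sketch below is illustrative only: the function names, the random data, and the use of an RBF target kernel (the choice discussed in Section 3.2.4) with `gamma=1.0` are assumptions, not settings from the experiments.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """K_ij = exp(-gamma * ||x_i - x_j||^2) for the columns x_i of X (D x N)."""
    sq_norms = np.sum(X * X, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    # Clip tiny negative values caused by round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def kernel_approximation_error(P, X, K):
    """Frobenius error ||K - X^T P^T P X||_F of Eq. 3.3."""
    S = P @ X  # d x N low-dimensional signatures
    return np.linalg.norm(K - S.T @ S, ord="fro")

rng = np.random.default_rng(0)
D, N, d = 20, 50, 5                      # hypothetical sizes
X = rng.standard_normal((D, N))
X /= np.linalg.norm(X, axis=0)           # unit-length columns
K = rbf_kernel_matrix(X, gamma=1.0)
P = rng.standard_normal((d, D))          # arbitrary candidate projection
err = kernel_approximation_error(P, X, K)
```

As a sanity check, with the linear kernel $K = X^T X$ and $P$ equal to the identity, the error is zero.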
3.2.2 Information-theoretic-based regularization
To prevent the projection matrix $P$ from overfitting, we add a regularization term to Eq. 3.3 that maximizes the variance of the pairwise similarities of the training data. In other words, we want the distribution of similarities to spread as widely as possible, similar to the equal-partition objective in hashing algorithms such as SPLH and RMMH. From an information-theoretic point of view, if a random variable $X$ follows a normal distribution, its entropy is a function of its variance:
$$H(X) = \frac{1}{2} \ln(2 \pi e \sigma_X^2). \tag{3.4}$$
Therefore, maximizing the variance maximizes the entropy.
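A short numerical illustration of Eq. 3.4, assuming natural logarithms (entropy in nats); the function name and the sample variances are hypothetical.

```python
import math

def gaussian_entropy(variance):
    """Differential entropy (in nats) of a normal distribution, Eq. 3.4."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

# Entropy grows monotonically with the variance.
entropies = [gaussian_entropy(v) for v in (0.5, 1.0, 2.0, 4.0)]
```

Each doubling of the variance adds the constant $\frac{1}{2} \ln 2$ to the entropy, so a wider spread of similarities always corresponds to higher entropy under the normality assumption.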
Assuming the data distribution has zero mean, maximizing the variance of the kernel values of the signatures leads to
$$P^* = \arg\max_P \| X^T P^T P X \|_F^2. \tag{3.5}$$
Combining this with the original objective function, the resulting objective function can be formulated as
$$P^* = \arg\min_P \| K - X^T P^T P X \|_F - \lambda \| X^T P^T P X \|_F^2, \tag{3.6}$$
where $\lambda$ weights the regularization term.
3.2.3 Sparse projection matrix
The last requirement for the projection matrix is to minimize the storage overhead. Given the input and output dimensions of the projection matrix, storage can be reduced by imposing a sparsity constraint on the projection matrix (i.e., increasing the number of zero entries). To this end, we add an L1 penalty to the objective function. The final objective function becomes
$$P^* = \arg\min_P \| K - X^T P^T P X \|_F - \lambda \| X^T P^T P X \|_F^2 + \eta \| P \|_1, \tag{3.7}$$
where $\eta$ weights the sparsity penalty.
Beyond the practical need for a sparse projection matrix, the projection also acts as a distance metric on high-dimensional input features, and such a metric should be sparse, as argued in [56, 32]. Our experimental results also show that even a very sparse projection matrix (i.e., 12% non-zero entries) can achieve competitive performance for mobile visual recognition.
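The three terms of Eq. 3.7 can be evaluated as in the following minimal sketch. The function names, the toy data, and the weights `lam` and `eta` are illustrative assumptions; `nonzero_fraction` is a hypothetical helper for measuring the kind of sparsity quoted above.

```python
import numpy as np

def kpp_objective(P, X, K, lam, eta):
    """Value of Eq. 3.7 with M = X^T P^T P X:
    ||K - M||_F - lam * ||M||_F^2 + eta * ||P||_1."""
    S = P @ X
    M = S.T @ S
    fit = np.linalg.norm(K - M, ord="fro")        # kernel preservation
    spread = np.linalg.norm(M, ord="fro") ** 2    # variance regularizer
    l1 = np.abs(P).sum()                          # entrywise L1 penalty
    return fit - lam * spread + eta * l1

def nonzero_fraction(P):
    """Fraction of non-zero entries, i.e. the storage actually needed."""
    return np.count_nonzero(P) / P.size

# Toy data: D = 2 features, N = 3 instances, linear target kernel.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
K = X.T @ X
P = np.eye(2)
value = kpp_objective(P, X, K, lam=0.0, eta=1.0)
density = nonzero_fraction(np.array([[1.0, 0.0, 0.0],
                                     [0.0, 0.0, 2.0]]))
```

In this toy case the identity projection reproduces the linear kernel exactly, so with `lam=0.0` the objective reduces to the L1 penalty alone.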
3.2.4 Learning cross dimension correlations through RBF kernel
In Eq. 3.7, the target kernel $K$ can be any kernel, and it has to be specified before learning. Because a linear projection is closely related to a distance metric, as mentioned in Section 2.4, the target kernel should share the theoretical benefits of distance metrics.
In particular, as argued in [38, 56, 48], a distance metric captures the correlations between different dimensions; therefore, the target kernel should also contain cross-dimension correlations. Among the popular kernels for visual recognition (e.g., linear, $\chi^2$, intersection, RBF), RBF is the one that captures cross-dimension correlations, as explained in the next paragraph, and our experiments also show that the RBF kernel has better classification accuracy (cf. Section 3.3.2). Therefore, we choose RBF as the target kernel.
Conventionally, the justification for the RBF kernel is that it induces an infinite-dimensional feature transformation, but for high-dimensional features, the contribution of the high-order terms of the transformation is actually very small. Consider the Taylor expansion of the RBF kernel with the feature vectors normalized to unit length:
$$e^{-\gamma \|x - y\|^2} = e^{-\gamma (2 - 2 x \cdot y)} = e^{-2\gamma} \sum_{n=0}^{\infty} \frac{(2\gamma)^n}{n!} (x \cdot y)^n. \tag{3.8}$$
Bingham [6] shows that in a very high-dimensional space, two random vectors $x, y$ are close to orthogonal; more precisely, given $x, y \in \mathbb{R}^D$,
$$x^T y \approx 0 \tag{3.9}$$
when $D$ is very large. So Eq. 3.8 is dominated by the leading terms
$$e^{-\gamma \|x - y\|^2} \approx c_0 + c_1 \sum_i x_i y_i + c_2 \sum_{i,j} x_i x_j y_i y_j. \tag{3.10}$$
Notice that the first two terms have the same form as the linear kernel, so the actual benefit of the RBF kernel over the linear kernel comes from the third term, which introduces the correlations between different dimensions. In other words, the RBF kernel outperforms the linear kernel because it considers the correlations between dimensions of high-dimensional features, just as a linear distance metric does. Therefore, we can approximate the correlations introduced by the RBF kernel using the projection matrix $P$, as discussed in Section 2.4.
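The argument can be checked numerically. For unit-length vectors, expanding $e^{-\gamma(2 - 2 x \cdot y)}$ in powers of $x \cdot y$ gives $c_0 = e^{-2\gamma}$, $c_1 = 2\gamma e^{-2\gamma}$, and $c_2 = 2\gamma^2 e^{-2\gamma}$, and $\sum_{i,j} x_i x_j y_i y_j = (x \cdot y)^2$. The sketch below compares the RBF kernel with this second-order truncation for two random unit vectors; the dimension, the seed, and $\gamma = 1$ are arbitrary assumptions.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def rbf_unit(x, y, gamma):
    """RBF kernel for unit vectors: exp(-gamma * (2 - 2 x.y))."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.exp(-gamma * (2.0 - 2.0 * dot))

def rbf_second_order(x, y, gamma):
    """Taylor series truncated after the quadratic term, as in Eq. 3.10."""
    dot = sum(a * b for a, b in zip(x, y))
    c0 = math.exp(-2.0 * gamma)
    c1 = 2.0 * gamma * c0
    c2 = 2.0 * gamma ** 2 * c0
    return c0 + c1 * dot + c2 * dot ** 2

rng = random.Random(0)
gamma = 1.0
x = random_unit_vector(2000, rng)
y = random_unit_vector(2000, rng)
dot_xy = sum(a * b for a, b in zip(x, y))   # close to zero for large D
gap = abs(rbf_unit(x, y, gamma) - rbf_second_order(x, y, gamma))
```

Because the dot product of the two random vectors is small, the neglected cubic and higher-order terms contribute very little, which is the sense in which Eq. 3.8 is dominated by its leading terms.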
3.2.5 Optimization solver
The final step is to solve the optimization problem in Eq. 3.7. Note that the problem is not convex, so the initial guess of $P$ affects the result. Instead of following the standard procedure for non-convex optimization problems, which starts from several random initial guesses and selects the one with the best result, we choose the initial guess as follows.
Because the main goal of our objective function is to find a projection matrix that preserves kernel values, which is captured by the first term in Eq. 3.7, we choose a $P_0$ that optimizes the first term. The first term attains its minimum when $K = X^T P^T P X$, which leads to an approximate solution for $P_0^T P_0$:
$$P_0^T P_0 \approx (X^T)^\dagger K X^\dagger, \tag{3.11}$$
where $X^\dagger$ denotes the Moore-Penrose pseudoinverse. An approximate $P_0$ is then computed by eigenvalue decomposition. Experimental results show that this initial guess is fairly successful.
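One possible way to compute such an initial guess is sketched below. The text does not specify how the eigenvalue decomposition is turned into $P_0$; keeping the $d$ largest eigenpairs and clipping negative eigenvalues to zero are assumptions of this sketch, as are the function name and the synthetic data.

```python
import numpy as np

def initial_projection(X, K, d):
    """Rank-d P0 (d x D) with P0^T P0 ~= (X^T)^+ K X^+, following Eq. 3.11."""
    X_pinv = np.linalg.pinv(X)               # N x D
    A = X_pinv.T @ K @ X_pinv                # D x D, since (X^T)^+ = (X^+)^T
    A = 0.5 * (A + A.T)                      # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(A)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]      # indices of the d largest
    # Assumption: negative eigenvalues are clipped so the square root is real.
    scales = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return scales[:, None] * eigvecs[:, top].T

# Synthetic check: a kernel that is exactly representable by some d x D matrix Q.
rng = np.random.default_rng(0)
D, N, d = 6, 30, 3
X = rng.standard_normal((D, N))
Q = rng.standard_normal((d, D))
K = (Q @ X).T @ (Q @ X)
P0 = initial_projection(X, K, d)
```

In this synthetic case $X$ has full row rank, so $P_0^T P_0$ recovers $Q^T Q$ and the kernel is reproduced up to round-off. For a real target kernel such as RBF, the result is only an approximation and serves as a starting point.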
Starting from the initial guess, we solve the optimization problem using the L1General solver [60], a general-purpose solver for optimization problems with weighted L1 regularization. We choose the active-set variant of the projected scaled sub-gradient algorithm.
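For readers without access to L1General, the structure of the problem (a smooth part plus an L1 penalty) can be illustrated with a single proximal-gradient step. This is a deliberately simpler, swapped-in technique, not the projected scaled sub-gradient method used here. The gradient formula assumes a symmetric $K$ and a non-zero residual, and the function names and step size are hypothetical.

```python
import numpy as np

def smooth_part(P, X, K, lam):
    """Smooth terms of Eq. 3.7 and their gradient with respect to P.
    Assumes K is symmetric and the residual K - M is non-zero."""
    S = P @ X
    M = S.T @ S
    R = K - M
    r_norm = np.linalg.norm(R, ord="fro")
    value = r_norm - lam * np.linalg.norm(M, ord="fro") ** 2
    grad = -2.0 * (P @ X @ R @ X.T) / r_norm - 4.0 * lam * (P @ X @ M @ X.T)
    return value, grad

def soft_threshold(P, t):
    """Proximal operator of t * ||P||_1: shrink every entry toward zero."""
    return np.sign(P) * np.maximum(np.abs(P) - t, 0.0)

def proximal_step(P, X, K, lam, eta, step):
    """One proximal-gradient update for Eq. 3.7 (illustrative only)."""
    _, grad = smooth_part(P, X, K, lam)
    return soft_threshold(P - step * grad, step * eta)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 6))
A = rng.standard_normal((6, 6))
K = A @ A.T                                  # symmetric toy kernel
P = rng.standard_normal((2, 4))
P_next = proximal_step(P, X, K, lam=0.01, eta=0.1, step=1e-3)
```

The soft-thresholding step is what produces exact zeros in $P$. Since the problem is non-convex, such a scheme offers no global guarantee, which is why the initial guess matters.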