
The support vector machine (SVM) [55] is a statistically robust classification algorithm which yields state-of-the-art performance. The SVM applies the kernel trick to implicitly map data to a high-dimensional feature space and finds an optimal separating hyperplane there [46, 55]. The rich features of kernel functions provide good separating ability to the SVM. With the kernel trick, the SVM does not really map the data but achieves the effect of performing classification in the high-dimensional feature space.

The price of the powerful classification performance brought by the kernel trick is that the resulting decision function can only be represented as a linear combination of kernel evaluations with the training instances, not as an actual separating hyperplane:

f (x) = Σ_{i=1}^m αi yi K(xi, x) + b

where xi ∈ Rn and yi ∈ {+1, −1}, i = 1, . . . , m, are the feature vectors and labels of the n-dimensional training instances, the αi's are the corresponding weights of each instance, b is the bias term, and K is a nonlinear kernel function. Although only the instances near the optimal separating hyperplane receive nonzero weights and become support vectors, for large-scale datasets the number of support vectors can still be very large.
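For concreteness, the decision function above can be evaluated as in the following numpy sketch, where sv, sv_alpha, sv_y, and b are hypothetical arrays holding the support vectors, their weights αi, their labels yi, and the bias; the point is that each prediction costs one kernel evaluation per support vector.

import numpy as np

def gaussian_kernel(X, Z, g=0.5):
    # K(x, z) = exp(-g * ||x - z||^2) for all pairs of rows in X and Z
    d2 = np.square(X[:, None, :] - Z[None, :, :]).sum(axis=-1)
    return np.exp(-g * d2)

def decision_function(x, sv, sv_alpha, sv_y, b, g=0.5):
    # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over support vectors only,
    # so the test-time cost grows with the number of support vectors
    k = gaussian_kernel(sv, x[None, :], g).ravel()
    return float(np.dot(sv_alpha * sv_y, k) + b)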

The SVM is formulated as a quadratic programming optimization problem. Because training on a dataset of m instances has O(m²) space complexity, there is a scalability issue in solving the optimization problem: it may not fit into memory.

Decomposition methods such as sequential minimal optimization (SMO) [40] and LIBSVM [7] are popular approaches to this scalability problem. They are very efficient for moderate-scale datasets and achieve good classification accuracy, but they still suffer from slow convergence on large-scale datasets: in each iteration of the optimization, the computational cost grows linearly with the number of support vectors, and a large number of support vectors incurs many kernel evaluations, costing O(mn) per iteration. This heavy computational load causes decomposition methods to converge slowly, so they remain challenged by large-scale data. Furthermore, too many support vectors also make testing inefficient.

In contrast, the linear SVM, which does not use a nonlinear kernel function, can be solved by much more efficient techniques such as LIBLINEAR [15] and SVMperf [24]. The linear SVM obtains an explicit optimal separating hyperplane for the decision function

f (x) = w· x + b

where only a weight vector w ∈ Rn and the bias term b need to be maintained during the optimization of the linear SVM. Therefore, the computational load in each iteration of the optimization is only O(n), which is lower than that of nonlinear SVMs. Compared to nonlinear SVMs, the linear SVM can thus be much more efficient at handling large-scale datasets. For example, on the Forest cover type dataset [5], training with LIBLINEAR takes merely several seconds, while training with LIBSVM and a nonlinear kernel function consumes several hours. Despite its efficiency on large-scale data, the applicability of the linear SVM is constrained: it is only appropriate for tasks with linearly separable data, such as text classification. For ordinary classification problems, the accuracy of the linear SVM is usually lower than that of nonlinear ones.
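As a rough illustration of this efficiency gap (a sketch on synthetic data, not the experiment on the Forest cover type dataset), the two kinds of solvers can be compared through scikit-learn, whose LinearSVC wraps LIBLINEAR and whose SVC wraps LIBSVM; actual timings depend on the data and parameters.

import time
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

# synthetic stand-in for a large dataset
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

for name, clf in [("LIBLINEAR (LinearSVC)", LinearSVC(C=1.0)),
                  ("LIBSVM (SVC, Gaussian kernel)", SVC(C=1.0, kernel="rbf", gamma=0.1))]:
    t0 = time.time()
    clf.fit(X, y)
    print(f"{name}: trained in {time.time() - t0:.1f} s")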

One approach to leveraging efficient linear SVM solvers for training a nonlinear SVM is to explicitly list the features induced by the nonlinear kernel function:

K(x, y) = ϕ(x)· ϕ(y)

where ϕ(x) and ϕ(y) are explicit features of x and y induced by the kernel function K.

The explicitly mapped instances ϕ(xi), i = 1, . . . , m, are then used as the input to a linear SVM solver. If the number of induced features is not too large, training a nonlinear SVM in this way can be very fast. For example, the work of [8] explicitly lists the features of low-degree polynomial kernel functions and feeds the explicit features into a linear SVM solver. However, the technique of explicitly listing the feature mapping is only applicable to kernel functions that induce a low-dimensional feature mapping, such as the low-degree polynomial kernel function [8]. It is difficult to apply to high-degree polynomial kernel functions, since the induced mapping is very high-dimensional, and it is not applicable to the commonly used Gaussian kernel function, whose implicit feature mapping is infinite-dimensional. Restricting the polynomial kernel function to low degrees loses some of the power of the nonlinearity, and the polynomial kernel function is less widely used than the Gaussian kernel function since, at the same computational cost, its accuracy is usually lower [8].
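To make the idea concrete, the following is a minimal sketch (not the implementation of [8]) of the explicit feature map for the degree-2 polynomial kernel K(x, y) = (x· y + 1)²; the inner product of the mapped vectors reproduces the kernel exactly, and the mapped data could then be fed to a linear SVM solver.

import numpy as np

def poly2_features(X):
    # explicit map for K(x, y) = (x . y + 1)^2:
    # features are 1, sqrt(2)*x_i, x_i^2, and sqrt(2)*x_i*x_j for i < j
    m, n = X.shape
    iu = np.triu_indices(n, k=1)                      # index pairs with i < j
    cross = np.sqrt(2.0) * X[:, iu[0]] * X[:, iu[1]]
    return np.hstack([np.ones((m, 1)), np.sqrt(2.0) * X, X ** 2, cross])

# sanity check: the explicit inner products match the kernel values
X = np.random.default_rng(0).normal(size=(6, 4))
Z = poly2_features(X)
assert np.allclose(Z @ Z.T, (X @ X.T + 1.0) ** 2)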

The feature mapping of the Gaussian kernel function can be uniformly approximated by random Fourier features [43, 44]. However, random Fourier features are dense, and a large number of them are needed to reduce the variance. Too many features lower the efficiency of the linear SVM solver and require much storage space, while too few random Fourier features leave a large variance that degrades the precision of the approximation and results in poor accuracy. Although linear SVM solvers do handle very high-dimensional text data, the features of text data are sparse, i.e., there are only a few non-zero features in each instance of the text data.
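For reference, a minimal hand-rolled version of the random Fourier feature approximation of [43, 44] is sketched below (D is an arbitrary feature count chosen for illustration); note that the resulting features are dense, in contrast to sparse text features.

import numpy as np

def random_fourier_features(X, g=0.5, D=2000, seed=0):
    # z(x) = sqrt(2/D) * cos(W x + b), with W ~ N(0, 2g I) and b ~ U(0, 2*pi),
    # so that z(x) . z(y) approximates exp(-g * ||x - y||^2) in expectation
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * g), size=(n, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)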

In this chapter, we propose a compact feature mapping that approximates the feature mapping of the Gaussian kernel function by Taylor polynomial-based monomial features, which sufficiently approximate the infinite-dimensional implicit feature mapping of the Gaussian kernel function with low-dimensional features. We can then explicitly list the approximated features of the Gaussian kernel function and use them with a linear SVM solver to train a Gaussian kernel SVM. This technique takes advantage of the efficiency of the linear SVM solver and achieves classification performance close to that of the Gaussian kernel SVM.

We first transform the Gaussian kernel function into an infinite series and show that its infinite-dimensional feature mapping can be represented as a Taylor series of monomial features. By keeping only the low-order terms of the series, we obtain a feature mapping ¯ϕ which consists of low-degree Taylor polynomial-based monomial features. The Gaussian kernel evaluation can then be approximated by the inner product of the explicitly mapped data:

K(x, y) ≈ ¯ϕ(x) · ¯ϕ(y).

Hence we can utilize the mapping ¯ϕ to transform the data into low-degree Taylor polynomial-based monomial features, and then use the transformed data as the input to an efficient linear SVM solver.
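The following numpy sketch illustrates such a mapping ¯ϕ (an illustrative reconstruction, not the exact implementation evaluated later: it realizes the degree-k terms as full tensor-product monomials rather than a compressed monomial set) and checks that ¯ϕ(x) · ¯ϕ(y) is close to the Gaussian kernel value.

import numpy as np
from math import factorial

def taylor_gaussian_features(X, g=0.5, degree=2):
    # K(x, y) = exp(-g||x||^2) * exp(-g||y||^2) * exp(2g x.y); replace exp(2g x.y)
    # by its Taylor polynomial up to the given degree, realizing the degree-k
    # term (2g)^k/k! * (x.y)^k as tensor-product monomials of degree k
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    blocks = []
    for k in range(degree + 1):
        coeff = np.sqrt((2.0 * g) ** k / factorial(k))
        M = np.ones((m, 1))
        for _ in range(k):                      # degree-k monomials, shape (m, n**k)
            M = (M[:, :, None] * X[:, None, :]).reshape(m, -1)
        blocks.append(coeff * M)
    Z = np.hstack(blocks)
    return Z * np.exp(-g * np.square(X).sum(axis=1))[:, None]

# quick check of the approximation quality on small random data
rng = np.random.default_rng(0)
X = 0.3 * rng.normal(size=(5, 4))
Z = taylor_gaussian_features(X, g=0.5, degree=2)
K_exact = np.exp(-0.5 * np.square(X[:, None, :] - X[None, :, :]).sum(-1))
print(np.abs(Z @ Z.T - K_exact).max())   # small when 2g|x.y| is small

The matrix Z returned above is exactly the kind of transformed data that would be handed to a linear SVM solver such as LIBLINEAR.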

Unlike the uniform approximation of random Fourier features, which requires a large number of features to reduce the variance, approximating by Taylor polynomial-based monomial features concentrates the important information of the Gaussian kernel function in the features of the low-degree terms. Therefore, only the monomial features in the low-degree terms of the Taylor polynomial are needed to precisely approximate the Gaussian kernel function. Merely a few low-degree monomial features can achieve good approximation precision, and hence result in classification accuracy similar to a normal Gaussian kernel SVM. Furthermore, if the features of the original data have some degree of sparseness, the Taylor polynomial-based monomial features will also be sparse.

Hence it will be very efficient to work with linear SVM solvers. By approximating the feature mapping of the Gaussian kernel function with a compact feature set and leveraging the efficiency of linear SVM solvers, we can perform fast classification on large-scale data and obtain classification performance similar to that of nonlinear kernel SVMs.

The experimental results show that the proposed method is useful for classifying large-scale datasets. Although it is slightly slower than the linear SVM, it achieves better accuracy, very close to that of a normal nonlinear SVM solver, while remaining very fast. Compared to using random Fourier features or the explicit features of low-degree polynomial kernel functions with linear SVM solvers, our Taylor polynomial-based monomial feature technique achieves higher accuracy at similar computational cost.

The rest of this chapter is organized as follows: In Section 4.2, we discuss related work and briefly review the SVM as preliminaries. In Section 4.3, we propose the method of approximating the infinite-dimensional implicit feature mapping of the Gaussian kernel function by a low-dimensional Taylor polynomial-based monomial feature mapping. In Section 4.4, we demonstrate the approach for efficiently training the Gaussian kernel SVM with the Taylor polynomial-based monomial features and a linear SVM solver. Section 4.5 presents the experimental results, and finally, we conclude the chapter in Section 4.6.

4.2 Preliminary

In this section, we first survey related work on training the SVM on large-scale data, and then review the SVM to give the preliminaries of this work.

4.2.1 Related Work

In the following, we briefly review related work on large-scale SVM training. Decomposition methods are very popular approaches to tackling the scalability problem of training the SVM [7, 37, 40]. The quadratic programming (QP) optimization problem of the SVM is decomposed into a series of QP sub-problems, where each sub-problem optimizes over only a subset of instances. The work of [37] proved that optimizing a QP sub-problem reduces the objective function, and hence the procedure converges. The sequential minimal optimization (SMO) [40] is an extreme decomposition: the QP problem is decomposed into the smallest possible sub-problems, each of which works on only two instances and can be solved analytically, avoiding numerical QP solvers. The popular SVM implementation LIBSVM [7] is an SMO-like algorithm with improved working set selection strategies. Decomposition methods consume a constant amount of memory and run fast. However, they still suffer from slow convergence when training on very large-scale data.

There are also SVM training methods that do not directly solve the QP optimization problem, for example the reduced SVM (RSVM) [28] and the core vector machine (CVM) [52]. The RSVM adopts a reduced kernel matrix, a rectangular sub-matrix of the full kernel matrix, to formulate an L2-loss SVM problem. The reduced problem is then approximated by a smooth optimization problem and solved by a fast Newton method. The CVM [52] models an L2-loss SVM as a minimum enclosing ball problem, whose solution is also the solution of the SVM: the data are viewed as points in the kernel-induced feature space, and the goal is to find a minimum ball that encloses all the points. A fast variant of the CVM is the ball vector machine (BVM) [51], which simply moves a pre-defined, sufficiently large ball to enclose all points.

Explicitly mapping the data with the kernel-induced feature mapping is a way to capitalize on efficient linear SVM solvers for solving nonlinear kernel SVMs. This method is simple and can build on existing packages of linear SVM solvers such as LIBLINEAR [15] and SVMperf [24]. The work of [8] is most related to ours: it explicitly maps the data by the feature mapping corresponding to low-degree polynomial kernel functions, and then uses a linear SVM solver to find an explicit separating hyperplane in the explicit feature space. Since the dimensionality of its explicit feature mapping grows combinatorially with the degree, this approach is only applicable to low-degree polynomial kernel functions. Because the degree is a parameter of the polynomial kernel, the dimensionality that increases with the degree constrains the degree to small values, which loses some of the nonlinearity of the polynomial kernel. In contrast, our method is a Taylor polynomial-based approximation of the implicit feature mapping of the Gaussian kernel function, and the dimensionality of our approximated feature mapping increases with the degree of the Taylor polynomial; this degree is not a kernel parameter and hence does not constrain the nonlinearity of the kernel function. Although a higher degree yields better approximation precision and hence usually better accuracy, our experimental results show that degree 2, which results in a low-dimensional explicit mapping, is enough to obtain accuracy similar to the Gaussian kernel SVM. Moreover, the Gaussian kernel function is more commonly used than the polynomial kernel function since it usually achieves better accuracy at similar computational cost.

Random Fourier features [43, 44] uniformly approximate the implicit feature mapping of the Gaussian kernel function. However, the random Fourier features are dense, and a large number of features is required to reduce the variance; too few features lead to a very large variance, which causes a poor approximation and results in low accuracy.

4.2.2 Review of the SVM

The SVM [55] is a statistically robust learning method with state-of-the-art classification performance. The SVM trains a classifier by finding an optimal separating hyperplane that maximizes the margin between the two classes of data. Without loss of generality, suppose there are m instances of training data. Each instance is a pair (xi, yi), where xi ∈ Rn denotes the n features of the i-th instance and yi ∈ {+1, −1} is its class label. The SVM finds the optimal separating hyperplane w· x + b = 0 by solving the quadratic programming optimization problem:

arg min_{w,b,ξ}  (1/2)||w||² + C Σ_{i=1}^m ξi

subject to  yi(w· xi + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , m.

Minimizing (1/2)||w||² in the objective function corresponds to maximizing the margin between the two classes of data. Each slack variable ξi denotes the extent to which xi falls into the erroneous region, and C > 0 is the cost parameter which controls the trade-off between maximizing the margin and minimizing the slacks. The decision function is f (x) = w· x + b, and a testing instance x is classified by sign(f (x)) to determine which side of the optimal separating hyperplane it falls on.

The SVM’s optimization problem is usually solved in dual form to apply the kernel trick:

arg min_{α}  (1/2) Σ_{i=1}^m Σ_{j=1}^m αi αj yi yj K(xi, xj) − Σ_{i=1}^m αi

subject to  Σ_{i=1}^m yi αi = 0,  0 ≤ αi ≤ C,  i = 1, . . . , m.

The function K(xi, xj) is called the kernel function, which implicitly maps xi and xj into a high-dimensional feature space and computes their inner product there. By applying the kernel trick, the SVM implicitly maps the data into the kernel-induced high-dimensional space to find an optimal separating hyperplane. A commonly used kernel function is the Gaussian kernel K(x, y) = exp(−g||x − y||²) with parameter g > 0, whose implicit feature mapping is infinite-dimensional. The original inner product is called the linear kernel K(x, y) = x· y. The corresponding decision function of the dual-form SVM is f (x) = Σ_{i=1}^m αi yi K(xi, x) + b, where the αi, i = 1, . . . , m, are called supports and denote the weight of each instance in composing the optimal separating hyperplane in the feature space. The instances with nonzero supports are called support vectors; only the support vectors are involved in constituting the optimal separating hyperplane. With the kernel trick, the weight vector w becomes a linear combination of kernel evaluations with the support vectors: w = Σ_{i=1}^m αi yi K(xi, ·). In contrast, with the linear kernel one can obtain an explicit weight vector w = Σ_{i=1}^m αi yi xi.
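As a quick sanity check of this last relation (a small sketch unrelated to the chapter's experiments), scikit-learn's SVC with a linear kernel exposes the products αi yi of the support vectors in dual_coef_, and recombining them with the support vectors reproduces the explicit weight vector reported in coef_.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors
w = clf.dual_coef_ @ clf.support_vectors_
assert np.allclose(w, clf.coef_)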

4.3 Approximating the Gaussian Kernel Function by