
3.3 SVM and Privacy-Preservation

We first briefly review the SVM in Section 3.3.1 to give the preliminaries of this work.

Then in Section 3.3.2, we discuss the privacy violation problem of the SVM classifier, namely that a subset of the training data will inevitably be disclosed.

3.3.1 Review of the SVM

The SVM is a statistically robust learning method based on structural risk minimization [55]. It trains a classifier by finding an optimal separating hyperplane which maximizes the margin between two classes of data in the kernel induced feature space. Without loss of generality, suppose that there are m instances of training data. Each instance consists of an (x_i, y_i) pair, where x_i ∈ R^N is a vector containing the attributes of the i-th instance and y_i ∈ {+1, −1} is the class label of the instance. The objective of the SVM is to find the optimal separating hyperplane w · x + b = 0 between the two classes of data. To classify a testing instance x, the decision function is

f(x) = w · x + b    (3.1)

The corresponding classifier is sgn(f(x)).
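To make the decision function concrete, here is a minimal sketch in Python with NumPy (our own illustration; the values of w, b, and x are arbitrary and not taken from this work):

```python
import numpy as np

# Hypothetical hyperplane parameters and a testing instance.
w = np.array([0.8, -0.5])   # normal vector of the separating hyperplane
b = 0.1                     # bias term
x = np.array([1.0, 2.0])    # testing instance

# Decision function f(x) = w . x + b, Eq. (3.1).
f_x = np.dot(w, x) + b

# The classifier is sgn(f(x)): the sign decides which side of the hyperplane x lies on.
label = 1 if f_x >= 0 else -1
print(f_x, label)
```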

The SVM finds the optimal separating hyperplane by solving the following quadratic programming optimization problem:

arg min_{w,b,ξ}  (1/2)||w||² + C ∑_{i=1}^{m} ξ_i

subject to  y_i(w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  for i = 1, ..., m    (3.2)

Figure 3.3: The SVM maximizes the margin between two classes of data. Squared points are support vectors.

In the objective function, minimizing (1/2)||w||² corresponds to maximizing the margin between w · x + b = 1 and w · x + b = −1. The constraints aim to put the instances with positive labels at one side of the margin, w · x + b ≥ 1, and the ones with negative labels at the other side, w · x + b ≤ −1. The variables ξ_i, i = 1, ..., m, are called slacks.

Each ξ_i denotes the extent to which x_i falls outside its corresponding region. C is called the cost parameter; it is a positive constant specified by the user and denotes the penalty on the slacks. The objective function of the optimization problem is a trade-off between maximizing the margin and minimizing the slacks. A larger C corresponds to assigning a higher penalty to the slacks, which results in smaller slacks but a smaller margin.

The value of the cost parameter C is usually determined by cross-validation. Fig. 3.3 gives an example to illustrate the concept of the formulation of the SVM.
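As a concrete sketch of training a soft-margin SVM and choosing C by cross-validation, the following Python example uses scikit-learn (the library, the toy data, and the candidate grid for C are our own assumptions, not part of this work):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy two-class data standing in for the m training pairs (x_i, y_i).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.7, size=(50, 2)),
               rng.normal(+1.0, 0.7, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# A larger C penalizes slacks more heavily (smaller slacks, smaller margin);
# its value is typically selected by cross-validation over a candidate grid.
search = GridSearchCV(SVC(kernel='linear'), {'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```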

The optimization problem of the SVM is usually solved in its dual form, derived by applying Lagrange multipliers and the KKT conditions [6, 55]. Solving the dual problem is equivalent to solving the primal problem. The dual form of the SVM's optimization problem implies the applicability of the kernel trick, since the data vectors of the training instances {x_1, x_2, ..., x_m} and the testing instance x appear only in dot product computations, both in the optimization problem and in the decision function. A kernel function K(x, y) implicitly maps the data x and y into some high-dimensional space and computes their dot product there without actually mapping the data [55]. By replacing the dot products with kernel functions, the kernelized dual form of the SVM's optimization

problem is

arg max_α  ∑_{i=1}^{m} α_i − (1/2) ∑_{i=1}^{m} ∑_{j=1}^{m} α_i α_j y_i y_j K(x_i, x_j)

subject to  ∑_{i=1}^{m} α_i y_i = 0,  0 ≤ α_i ≤ C,  for i = 1, ..., m    (3.3)

Since w = ∑_{i=1}^{m} α_i y_i x_i in the duality, the kernelized decision function in the dual form is

f(x) = ∑_{i=1}^{m} α_i y_i K(x_i, x) + b    (3.4)

The bias term b can be calculated from the KKT complementarity conditions [6, 55] after solving the optimization problem.
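For illustration, the dual decision function (3.4) can be evaluated by hand from a fitted model and checked against the library's own output. This sketch assumes scikit-learn, whose SVC stores the products α_i y_i in dual_coef_, the support vectors in support_vectors_, and the bias b in intercept_:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy data with a nonlinear class boundary.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

g = 0.5
clf = SVC(kernel='rbf', gamma=g, C=1.0).fit(X, y)

x_test = np.array([[0.3, -0.8]])

# f(x) = sum_i alpha_i y_i K(x_i, x) + b, Eq. (3.4).
K = rbf_kernel(clf.support_vectors_, x_test, gamma=g)        # K(SV_i, x) as a column
f_manual = (clf.dual_coef_ @ K + clf.intercept_).item()

print(f_manual, clf.decision_function(x_test)[0])   # the two values coincide
```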

By applying the kernel trick, the SVM implicitly maps data into a high-dimensional space and finds an optimal separating hyperplane there. The testing is also done in the kernel induced high-dimensional space by the kernelized decision function. The kernel induced mapping and high-dimensional space are usually called the feature mapping and the feature space, respectively. The original dot product is called the linear kernel, i.e., K(x, y) = x · y. With the linear kernel, the optimal separating hyperplane is found in the original space without feature mapping. The feature mapping of nonlinear kernel functions can be very complex, and we may not even know the actual mapping. A commonly used kernel function is the Gaussian kernel

K(x, y) = exp(−g||x − y||²)    (3.5)

where g > 0 is a parameter. The Gaussian kernel represents each instance by a kernel-shaped function centered on the instance; each instance is represented by its similarity to all other instances. The mapping induced by the Gaussian kernel is infinite-dimensional [46, 55].
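A direct evaluation of the Gaussian kernel (3.5) in Python, compared against scikit-learn's rbf_kernel under the assumption that g corresponds to the library's gamma parameter (the vectors below are arbitrary examples):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_kernel(x, y, g):
    """K(x, y) = exp(-g * ||x - y||^2), Eq. (3.5)."""
    return np.exp(-g * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
g = 0.3

print(gaussian_kernel(x, y, g))
print(rbf_kernel(x.reshape(1, -1), y.reshape(1, -1), gamma=g)[0, 0])  # same value
```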

In the decision function of the dual form (3.4), it is seen that only the non-zero α_i's and the corresponding (x_i, y_i) pairs are required to be kept in the decision function. The (x_i, y_i) pairs with non-zero α_i are called support vectors. They are the instances falling outside their corresponding region after solving the optimization problem (the squared points in Fig. 3.3). Support vectors are the informative points that make up the SVM classifier. All training data except the support vectors are discarded after training. For ease of exposition, we will denote the support vectors, i.e., the (x_i, y_i) pairs with non-zero α_i after training, as (SV_i, y_i), and the number of support vectors as m. So in the following paragraphs the decision function will be represented as

f(x) = ∑_{i=1}^{m} α_i y_i K(SV_i, x) + b    (3.6)
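To make this concrete, the following sketch (assuming scikit-learn; the toy data are ours) shows that a trained classifier retains only the support vectors, typically far fewer than the m training instances:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)),
               rng.normal(+1.5, 1.0, size=(100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

clf = SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)

# Only the (SV_i, y_i) pairs with nonzero alpha_i enter the decision
# function (3.6); all other training instances can be discarded.
print("training instances:", len(X))
print("support vectors kept in the classifier:", len(clf.support_vectors_))
```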

3.3.2 Privacy Violation of the SVM Classifiers

From the decision function of the SVM classifier (3.6), we note that the support vectors contained in the SVM classifier are a subset of the training data. Parts of the training data are kept in their original content in the decision function for performing kernel evaluations with the testing instance. Releasing the SVM classifier therefore violates privacy due to the inclusion of this sensitive content.
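The disclosure can be checked directly: in the following sketch (assuming scikit-learn), every support vector stored inside a Gaussian kernel classifier is a verbatim row of the training data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 2))
y = np.where(X[:, 0] + X[:, 1] ** 2 > 0.5, 1, -1)

clf = SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)

# clf.support_ holds the indices of the training instances kept as support
# vectors; their stored content is identical to the original training records.
leaked = clf.support_vectors_
print(np.array_equal(leaked, X[clf.support_]))   # True: intact training tuples
```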

The linear kernel SVM is an exception. The SVM classifier learned with the linear kernel is inherently privacy-preserving. With the linear kernel, the support vectors incorporated in the decision function f(x) = ∑_{i=1}^{m} α_i y_i SV_i · x + b can be linearly combined into one vector w = ∑_{i=1}^{m} α_i y_i SV_i, so

f(x) = w · x + b    (3.7)

Hence the classifier sgn(f(x)) of the linear kernel SVM can simply be represented by a hyperplane w · x + b = 0. The vector w is a linear combination of all support vectors. The sensitive content of each individual support vector is destroyed by the weighted summation, and therefore the classifier does not include individual private information of the training data.
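This collapse can be verified numerically. The sketch below (assuming scikit-learn, where dual_coef_ stores α_i y_i for the support vectors and coef_ stores the already-combined w of a linear kernel SVM) recomputes w from the support vectors and compares it with the library's value:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2.0, 1.0, size=(60, 2)),
               rng.normal(+2.0, 1.0, size=(60, 2))])
y = np.array([-1] * 60 + [+1] * 60)

clf = SVC(kernel='linear', C=1.0).fit(X, y)

# w = sum_i alpha_i y_i SV_i: the weighted sum blends all support vectors
# into a single vector, so no individual training record survives in w.
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
print(np.allclose(w, clf.coef_.ravel()))   # True
```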

Fig. 3.3 shows a linear kernel SVM classifier. Merely the separating hyperplane w · x + b = 0 is enough to classify the data. No individual support vector (squared points in Fig. 3.3) needs to be kept in the classifier. Hence the linear kernel SVM classifier is inherently privacy-preserving.

Figure 3.4: A Gaussian kernel SVM classifier: All support vectors must be kept in the classifier, which violates privacy.

Since the linear kernel SVM is only suitable for learning a classifier on linearly separable data, its usability for classification is limited. For linearly inseparable data, the linear kernel is inappropriate. A large part of the power of the SVM comes from the kernel trick.

Without applying kernel functions, the SVM is merely a linear separator suitable only for linearly separable data. By replacing the dot products with kernel functions in the SVM formulation, data are nonlinearly mapped into a high-dimensional feature space, and the SVM learns a linear classifier there. Since data in a high-dimensional space are highly sparse, it is easy to separate the data there with a linear separator.
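A small illustration of this point (our own sketch with scikit-learn): on XOR-like data, which is not linearly separable, a linear kernel SVM performs near chance level while a Gaussian kernel SVM fits the data well.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: the two classes cannot be separated by a hyperplane
# in the original two-dimensional space.
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

linear_acc = SVC(kernel='linear', C=1.0).fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf', gamma=1.0, C=10.0).fit(X, y).score(X, y)

print("linear kernel training accuracy:", linear_acc)   # near chance level
print("Gaussian kernel training accuracy:", rbf_acc)    # typically close to 1
```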

However, the inherent privacy-preserving property of the linear kernel SVM classifier disappears when a nonlinear kernel is applied. In the nonlinear kernel SVM, the w in the decision function f(x) = w · x + b cannot be computed explicitly as in the linear kernel case. The vector w exists in the kernel induced feature space as w = ∑_{i=1}^{m} α_i y_i Φ(SV_i), where Φ(·) denotes the feature mapping induced by the kernel function. Since the feature mapping is done implicitly, w can only be stated as a linear combination of kernel evaluations, w = ∑_{i=1}^{m} α_i y_i K(SV_i, ·), and the decision function is f(x) = ∑_{i=1}^{m} α_i y_i K(SV_i, x) + b. This restriction makes us unable to linearly combine the support vectors into one vector w. The classifier keeps all support vectors in their original content to make possible the kernel evaluations K(SV_i, x) between the testing instance x and each support vector.
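The contrast with the linear kernel case can be seen directly in a sketch (assuming scikit-learn): for a nonlinear kernel the classifier exposes no combined weight vector w, only the raw support vectors needed for the kernel evaluations.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)

# With a nonlinear kernel, the support vectors cannot be collapsed into a
# single w; scikit-learn reflects this by refusing to expose coef_.
try:
    print(clf.coef_)
except AttributeError as e:
    print("no combined w available:", e)

# The classifier must therefore carry the support vectors themselves
# to evaluate f(x) = sum_i alpha_i y_i K(SV_i, x) + b for new instances.
print("support vectors stored in the classifier:", clf.support_vectors_.shape)
```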

Fig. 3.4 illustrates a Gaussian kernel SVM trained on a small dataset. The three curves in the figure are the sets of points where f(x) evaluates to −1, +1, and 0. They correspond to the hyperplanes w · Φ(x) + b = −1, +1, and 0 in the kernel induced feature space.

The support vectors are the instances falling into the wrong region in the feature space. The curve corresponding to f(x) = 0 is the decision boundary in the original space, which is the optimal separating hyperplane in the feature space. All support vectors (the squared points) are required to be kept in the classifier in order to do kernel evaluations with the testing instance, i.e., to compute the decision function (3.6), and decide which side of the separating hyperplane in the feature space the testing instance falls on. Releasing the classifier will expose the private content of the support vectors, which are intact tuples of a subset of the training data, therefore violating privacy.