Secure Kernel Matrix with the Reduced SVM

2.4 Secure SVM Outsourcing with Random Linear Transformation

2.4.1 Secure Kernel Matrix with the Reduced SVM

Since the full kernel matrix Q of the conventional SVM formulation (2.1) with common kernel functions is built upon the dot products or Euclidean distance among all training instances, if the service provider needs to build the full kernel matrix to solve SVM prob-lems, there will be similar security weakness like the rotationally/translationally trans-formed data. We avoid the use of the full kernel matrix by applying the reduced SVM (RSVM) with random reduced set [27, 28, 34, 35] for solving SVMs, which helps to pre-vent the disclosure of dot product/Euclidean distance relationships among instances and also plays an important role in our scheme for the utilization of random linear transfor-mation to perturb the data in outsourcing the SVM.

The Reduced SVM with Random Reduced Set

In the following, we first briefly describe the RSVM, and then explain the RSVM with completely random vectors as the reduced set. The RSVM is a SVM scaling up method, which utilizes a reduced kernel matrix. Each element of the reduced kernel matrix is computed from an instance in the training dataset and an instance in the reduced set, which can be a subset of the training dataset. The number of instances in the reduced set is typically 1% to 10% of the size of the training dataset [27, 28]. Hence the reduced kernel matrix is much smaller than a full kernel matrix and can easily fit into the main memory.

Without loss of generality, let x_i ∈ Rⁿ, i = 1, . . . , m denote the instances of the training dataset, and y_i ∈ {1, −1}, i = 1, . . . , m are their corresponding labels. Let

R = {rj|rj ∈ Rⁿ, j = 1, . . . , ¯m} denotes the reduced set, where ¯m << m. The original RSVM paper adopted a subset of the training dataset as the reduced set [28]. The reduced kernel matrix K is an m× ¯m matrix where

K_i,j = k(x_i, r_j), i = 1, . . . , m, j = 1, . . . , ¯m. (2.3)

The RSVM problem is formulated as

arg min

The optimization problem of the RSVM can be solved by a normal linear SVM solver [30] or the smooth SVM [29] used in the original RSVM paper [28]. Empirical studies showed that the RSVM can achieve similar classification performance to a conventional SVM [27, 28, 30].

An interesting property of the RSVM is that the reduced set R is not necessary to be a subset of the training dataset [27]. Completely random vectors can act as the in-stances of the reduced set [34, 35]. In the RSVM, the inin-stances of the reduced set work as pre-defined support vectors. Unlike the conventional SVM which selects the instances near the optimal separating hyperplane as support vectors, the RSVM fits the optimal separating hyperplane to pre-defined support vectors by determining appropriate support values [27]. A larger value of the cost parameter C is usually required to let the RSVM fit well to the pre-defined support vectors [28, 30].

If random vectors are adopted as the reduced set, then an element in the reduced kernel

matrix is the kernel evaluation between an instance in the training dataset and a random vector in the reduced set but not the kernel evaluation between the training instances like the kernel matrix of a conventional SVM. Therefore, revealing the reduced kernel matrix with random vectors as the reduced set does not reveal the information of the dot product or Euclidean distance relationships among instances. This constitutes the secure kernel matrix, which avoids the security weakness from disclosing the kernel matrix of a conventional SVM given that the random vectors of the reduced set are kept secret, i.e., revealing the secure kernel matrix to the service provider for solving the RSVM problem will not incur the risk of disclosing dot product or Euclidean distance relationships of the training data.

Note that just building and sending the secure kernel matrix to the service provider is not appropriate for outsourcing the SVM training since a reduced kernel matrix is com-puted from a fixed kernel parameter, while there are various SVMs with different kernel parameters to be trained in the parameter search process. The data owner will be imposed much computation load as well as much communication cost to build and send many se-cure kernel matrices with different kernel parameters to the service provider. Our goal is to outsource the SVM training which must minimize as much load of the data owner as possible.

In the following, we first discuss the robustness of the secure kernel matrix, and then in next subsection, we design a scheme which enables the service provider to build the secure kernel matrix from the data perturbed by random linear transformation.

Robustness of the Secure Kernel Matrix

In the following, we prove that the service provider who has the secure kernel matrix K along with some leaked training instances obtained from some external information sources is not able to derive the content of both the random vectors in the reduced set and the remaining secret training instances.

Lemma 3 The service provider cannot obtain the content of random vectors of the re-duced set from the secure kernel matrix with leaked training instances.

Proof 3 Suppose the service provider has obtained m− 1 training instances and at least n of them are linearly independent, where m is the number of training instances and n is the dimensions of the data and m > n. If the service provider is able to know which ele-ments of the matrix K the n linearly independent leaked training instances involve with, it can derive the content of rj’s by first inferring the underlying dot product/Euclidean distance of the kernel values and then utilizing the leaked instances and the inferred dot product/Euclidean distance to set up equations to derive the content of r_j’s. However, the service provider cannot identify which elements the leaked instances are involved because this requires the knowledge of the random vectors. In the case of m < n, it is straight-forward that the random vectors of the reduced set cannot be derived due to insufficient number of linearly independent instances to set up simultaneous equations.

Without knowing the content of the random vectors in the reduced set, the service provider cannot derive the content of secret training instances from the secure kernel matrix even if there is only one training instance not leaked.

Corollary 1 The service provider cannot derive the content of unknown training in-stances even if m− 1 of all m training instances are leaked.

Proof 4 Since each element of the secure kernel matrix K consists of K_i,j = k(x_i, r_j), i.e., it is evaluated from a training instance and a secret random vector, to derive any x_i from elements of K, the content of r_j’s is required. However, the service provider cannot obtain the content of random vectors in the reduced set as shown in Lemma 3. Hence the service provider is not able to derive the content of remaining secret training instances.

As shown in Lemma 3 and Corollary 1, as long as the random vectors of the reduced set are kept secret, the secure kernel matrix is robust in security even if part of training instances are leaked.

2.4.2 Building the Secure Kernel Matrix from Data Perturbed by

在文檔中隱私保存的高效率資料分類方法 (頁 32-36)