
We propose a privacy-preserving outsourcing scheme for the SVM which protects the data by a random linear transformation. It provides stronger protection of data privacy than existing works while achieving classification accuracy similar to that of a normal SVM classifier.

The random linear transformation-based outsourcing scheme protects the privacy of both the data and the generated classifiers, and imposes very little overhead on the data owner.

Chapter 3

On the Design and Analysis of the Privacy-Preserving SVM Classifier

3.1 Introduction

Concern about the privacy of personal information has grown in recent years as commercial corporations hold increasing amounts of electronic data. Data mining techniques [11] have been viewed as a threat to the sensitive content of personal information. This privacy issue has motivated research on privacy-preserving data mining techniques [2, 4, 32]. One of the important data mining tasks is classification.

A classification algorithm learns a classification model (i.e., the classifier) from labeled training data for later use in classifying unseen data. Many privacy-preserving schemes have been designed for various classification algorithms [1, 4]. The support vector machine (SVM) [6, 55], a powerful classification algorithm with state-of-the-art performance, has also attracted considerable attention from researchers studying privacy-preserving data mining [9, 26, 54, 58, 59].

However, one problem has not been addressed in existing privacy-preserving SVM work: the classifier learned by the SVM contains some intact instances of the training data. The classification model of the SVM therefore inherently violates privacy, and revealing the classifier also reveals the private content of some individuals in the training data.

Consequently, the classifier learned by the SVM cannot be publicly released or shipped to clients while preserving privacy.

There is a significant difference between the SVM and other popular classification algorithms: the classifier learned by the SVM contains some intact instances of the training data. The subset of the training data kept in the SVM classifier is called the support vectors, which are the informative entries making up the classifier. The support vectors are intact instances taken from the training data. The inclusion of these intact training instances prevents the SVM classifier from being publicly released or shipped to client users, since releasing the SVM classifier would disclose individual privacy and may violate privacy-preservation requirements for legal or commercial reasons.

For instance, HIPAA regulations require that medical data not be released without appropriate anonymization [21]. The leakage of personal information is also prohibited by law in many countries.

Most popular classification algorithms do not suffer from such a direct violation of individual privacy. For example, in the decision tree classifier, each node of the decision tree stands for an attribute and denotes splitting points of the attribute values for proceeding to the next level [42]. The naïve Bayesian classifier consists of the prior probabilities of each class and the class-conditional probabilities of each attribute value [20]. The neural network classifier possesses simply a set of weights and biases, accompanied by an activation function [20]. Unlike the SVM classifier, which contains some intact training instances, these classifiers merely hold aggregate statistics of the training data. Disclosing aggregate statistics also breaches privacy to some extent, since the actual content of some training instances may be derived from the aggregate statistics with the help of external information sources [36]. However, the direct privacy violation of the SVM classifier, which discloses some intact training instances without any protection, is much more severe. As far as the privacy-preserving issue is concerned, this is the fundamental difference between the SVM and other popular classification algorithms. The classifier of the SVM inherently violates privacy: it incorporates a subset of the training data, and hence releasing the classifier will violate the privacy of individuals. Another example of a classification algorithm that directly violates individual privacy in its classification model is the k-nearest neighbor (kNN) classifier, which requires all training instances to be kept in the classifier [20].
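
To make the violation concrete, the following minimal sketch (our own illustration using the scikit-learn library, with entirely hypothetical data) shows that a trained Gaussian kernel SVM exposes verbatim training records through its support vectors:

import numpy as np
from sklearn.svm import SVC

# Toy training data: each row is one (hypothetical) patient record.
X = np.array([[52.0, 1.0, 130.0],
              [61.0, 0.0, 145.0],
              [35.0, 1.0, 118.0],
              [47.0, 0.0, 150.0]])
y = np.array([1, 1, -1, -1])          # disease / no-disease labels

clf = SVC(kernel="rbf", gamma=0.01)   # Gaussian (RBF) kernel SVM
clf.fit(X, y)

# The fitted classifier stores intact training instances as support vectors.
print(clf.support_vectors_)           # verbatim rows copied from X
print(clf.support_)                   # indices of those rows in the training set
# Shipping this classifier to a third party therefore ships these records too.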

The violation of privacy in the classification model restricts the applicability of the SVM. Consider the following application scenario: a hospital or medical institute has collected a large number of medical records. The institute intends to capitalize on the collected medical records to learn an SVM classifier for predicting whether or not a patient is susceptible to a disease. Because some medical records are included in the classifier, releasing the classifier to other hospitals or research institutes would expose the sensitive information of some patients. This violation of privacy limits the applicability of the learned SVM classifier. Even if the identifier field of each record has been removed, the identity of an individual may still be recognized from quasi-identifiers such as gender, blood type, age, date of birth, and zip code [48].

There is also an increasing trend to outsource IT services to external service providers.

Major IT companies like Google and Microsoft are constructing infrastructures to run Software as a Service. This enables small companies to run applications in the cloud-computing environment. Outsourcing can save much of the investment in hardware, software, and personnel, but data privacy is a critical concern in outsourcing since the external service providers may be malicious or compromised. For using SVM classifiers in the cloud-computing environment, the private information of the training data should not be disclosed to unauthorized parties. Fig. 3.1 illustrates a general application scenario: the training data owner trains a classifier, and then publishes or ships the classifier to client users, or deploys it to the cloud-computing environment.

Although the anonymous data publishing technique k-anonymity [49] can be applied to data mining tasks [23], the performance may be degraded due to the distortion of data caused by generalized and suppressed quasi-identifiers. Furthermore, k-anonymity actually breaches privacy since the identity may be recognized from generalized quasi-identifiers and unmodified attributes with the help of external information sources.

Figure 3.1: Application scenario: Releasing the learned SVM classifier to clients or outsourcing it to cloud-computing service providers without exposing the sensitive content of the training data.

Existing works on privacy-preserving SVMs [9, 26, 54, 58, 59] mainly focused on privacy preservation at training time. The privacy violation inherent in the classification model of the SVM, and hence in releasing the SVM classifier, has not been addressed. The methods proposed in [26, 54, 58, 59] aim to prevent the training data from being revealed to other parties when the training data are separately held by several parties. Testing must be cooperatively performed by the holders of the training data. The work of [9] considered a scenario in which the training data owner delivers perturbed training data to an untrustworthy third party to learn an SVM classifier.

To the best of our knowledge, there has not been work extending the notion of privacy-preservation to the release of the SVM classifier. In this chapter, we propose the Privacy-Preserving SVM Classifier (abbreviated as PPSVC) to protect the sensitive content of support vectors in the SVM classifier. The PPSVC is designed for the SVM classifier trained with the commonly used Gaussian kernel function. It post-processes the SVM classifier to destroy the attribute values of support vectors, and outputs a function which precisely approximates the decision function of the original SVM classifier to act as a privacy-preserving SVM classifier. Fig. 3.2 shows the concept of the PPSVC. The sup-port vectors in the decision function of the SVM classifier are transformed to a Taylor polynomial of linear combinations of monomial feature mapped support vectors, where the sensitive content of individual support vectors are destroyed by the linear combination.
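
As a rough sketch of the transformation (the exact derivation is given in Section 3.4; the notation below is only indicative), the Gaussian kernel decision function

\[ f(\mathbf{x}) = \operatorname{sign}\Big( \sum_i \alpha_i y_i\, e^{-\gamma \lVert \mathbf{x}-\mathbf{x}_i \rVert^2} + b \Big) \]

can be rewritten by factoring each kernel value as

\[ e^{-\gamma \lVert \mathbf{x}-\mathbf{x}_i \rVert^2} = e^{-\gamma \lVert \mathbf{x} \rVert^2}\, e^{-\gamma \lVert \mathbf{x}_i \rVert^2} \sum_{k=0}^{\infty} \frac{(2\gamma)^k}{k!}\, (\mathbf{x}^{\top}\mathbf{x}_i)^k , \]

and since \( (\mathbf{x}^{\top}\mathbf{x}_i)^k = \phi_k(\mathbf{x})^{\top}\phi_k(\mathbf{x}_i) \) for the degree-\(k\) monomial feature map \(\phi_k\), every support vector \(\mathbf{x}_i\) enters the decision function only through aggregated coefficient vectors \( \mathbf{w}_k = \sum_i \alpha_i y_i\, e^{-\gamma \lVert \mathbf{x}_i \rVert^2} \phi_k(\mathbf{x}_i) \). Truncating the series at a small degree and releasing only \(b\) and the \(\mathbf{w}_k\) hides the individual support vectors behind the linear combination.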

We prove that the PPSVC is robust against adversarial attacks, and our experiments on real data verify that the PPSVC achieves classification accuracy comparable to the original SVM classifier.

Figure 3.2: The PPSVC post-processes the SVM classifier to transform it into a privacy-preserving SVM classifier which does not disclose the private content of the training data.

The PPSVC can be viewed as a general scheme which offers a proper compromise between the approximation precision and the computational complexity of the resulting classifier. A higher approximation degree yields a classifier whose accuracy is closer to the original, at the cost of higher computational complexity. A PPSVC with a low approximation degree, i.e., low computational complexity, is enough to precisely approximate the SVM classifier and hence achieves comparable classification accuracy. In the experiments, we demonstrate that a Taylor polynomial of degree ≤ 5 in the PPSVC obtains almost the same accuracy as the original SVM classifier.
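
A back-of-the-envelope justification for why a low degree suffices (our own remark, with \(z\) denoting the argument of the exponential series being truncated) is the Lagrange remainder bound

\[ \Big| e^{z} - \sum_{k=0}^{d} \frac{z^k}{k!} \Big| \le \frac{|z|^{d+1}}{(d+1)!}\, e^{|z|} , \]

so if the kernel parameter is chosen such that \(|z| \le 1\), the degree-5 truncation error is already below \(e/6! \approx 4\times 10^{-3}\), which is negligible relative to the sign decision of the classifier.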

The privacy-preserving release of the SVM classifier enabled by the PPSVC can benefit users other than the data owner without compromising privacy. For example, in addition to learning an SVM from medical records, learning from the financial transactions collected by a bank is useful for predicting the credit of customers, and learning a spam filter from a mail server or a network intrusion detector from a network server's logs are also important applications of classification. The privacy violation of the SVM classifier restricts its use to those who can collect the data, but collecting the data is usually an expensive task or can only be performed by professional institutes. Since the PPSVC enables the release of the SVM classifier without violating privacy, SVM classifiers are no longer restricted to the data owners and can benefit users who are not able to collect a large amount of training data.

The following summarizes our contributions:

• We address the privacy violation problem of releasing or publishing the SVM classifier. We propose the PPSVC, which precisely approximates the decision function of the Gaussian kernel SVM classifier in a privacy-preserving form. The PPSVC is realized by transforming the original decision function of the SVM classifier into an infinite series of linear combinations of monomial feature mapped support vectors, which is then approximated by a Taylor polynomial. The releasable PPSVC provides classifier users with the good classification performance of the SVM without violating the individual privacy of the training data.

• We study the influence of the SVM kernel parameter on the approximation precision of the PPSVC, and provide a simple but subtle strategy for selecting the kernel parameter to obtain good approximation precision in the PPSVC. We also study the security of the PPSVC by considering adversarial attacks aided by external information sources.

• Extensive experiments are conducted to evaluate the performance of the PPSVC. Experimental results on real data show that the PPSVC achieves almost the same accuracy as the original SVM classifier. The effect of the kernel parameter selection strategy is also evaluated experimentally, and the results validate the claim that it does not noticeably affect the classification performance.

The rest of this chapter is organized as follows: Section 3.2 briefly reviews related work on privacy-preserving data mining and privacy-preserving SVMs. Section 3.3 reviews the SVM and discusses the privacy violation of its classification model. Section 3.4 constructs the PPSVC. In Section 3.5, we discuss the security and approximation precision issues of the PPSVC. Section 3.6 shows the experimental results. Section 3.7 concludes this chapter.

3.2 Related Work

In this section, we first briefly review some privacy-preserving data mining works, and then focus on the works related to privacy-preserving SVMs.

The work of [4] utilized a randomization-based perturbation approach to perturb the data. The data are individually perturbed by adding noise randomly drawn from a known distribution. A decision tree classifier is then learned from the reconstructed aggregate distributions of the perturbed data. In [1], a condensation-based approach is proposed. Data are first clustered into groups, and then pseudo-data are generated from those clustered groups. Data mining tasks are then performed on the generated synthetic data instead of the original data.

The k-anonymity [49] is an anonymous data publishing technique. It generalizes or suppresses the values of the quasi-identifier attributes so that each quasi-identifier value maps indistinguishably to at least k records. The l-diversity [33] enhances k-anonymity by requiring each sensitive value to appear no more than m/l times in a quasi-identifier group of m tuples. The k-anonymity technique has been successfully utilized in data mining. For example, the work of [23] studied the performance of the SVM built upon the anonymized data, and upon the anonymized data with additional statistics of the generalized fields. The distortion of data in k-anonymity may degrade the data mining performance, and the privacy is actually breached due to the disclosure of generalized values and unmodified sensitive attributes, which may incur the risk of identification with the help of external information sources.

Another family of privacy-preserving data mining algorithms is distributed methods [39]. The distributed methods perform data mining over an entire dataset which is separately held by several parties without compromising the data privacy of each party. The dataset may be horizontally partitioned, vertically partitioned, or arbitrarily partitioned. A distributed privacy-preserving data mining algorithm exchanges the necessary information between parties to compute aggregate results without sharing the actual private content. These methods capitalize on secure multi-party computation from cryptography. Several privacy-preserving SVM works [26, 54, 58, 59] also belong to this family.

In the following, we detail the works on privacy-preserving SVMs. The works of [26, 54, 58, 59] designed privacy-preserving protocols to exchange the information necessary for training the SVM on data partitioned among different parties without revealing the actual content of each party's data to the others. In [54, 58, 59], secure multi-party integer sum protocols are utilized to cooperatively compute the Gram matrix in the SVM formulation from the data separately held by several parties. In [26], a privacy-preserving protocol that performs the kernel adatron algorithm for training the SVM on data separately held by different parties is designed based on an additively homomorphic public-key cryptosystem. In these distributed methods, at the end of running the protocols, each party holds a share of the learned SVM classifier. Testing must be cooperatively performed by all involved parties since the support vectors, which come from the training data, are separately held. The goal of these distributed methods is to train an SVM classifier from the whole data separately held by different parties without compromising each party's privacy, and is orthogonal to our work of releasing the learned SVM classifier without violating the privacy of the support vectors.
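
For intuition (our own illustration rather than the exact protocols of [54, 58, 59]), when the data are vertically partitioned so that party \(p\) holds the attribute subset \(A_p\) of every record, a linear Gram matrix entry decomposes into per-party partial sums,

\[ K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top}\mathbf{x}_j = \sum_{p} \sum_{a \in A_p} x_{i,a}\, x_{j,a} = \sum_{p} s_p^{(i,j)} , \]

so only the protected sum of the local shares \(s_p^{(i,j)}\) needs to be revealed by a secure summation protocol; Gaussian kernel entries follow from \( \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 = K(\mathbf{x}_i,\mathbf{x}_i) - 2K(\mathbf{x}_i,\mathbf{x}_j) + K(\mathbf{x}_j,\mathbf{x}_j) \).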

The work of [9] exploits the rotation-invariant property of common kernel functions and applies a rotation matrix to transform the data, so that the training of the SVM can be outsourced to an external service provider without revealing the actual content of the data.

The purpose of that work is also orthogonal to our work of privacy-preserving release of the SVM classifier. The privacy-preserving scheme used there for outsourcing the SVM training cannot be utilized for privacy-preserving release of the SVM classifier: the testing data must be rotationally transformed by the same matrix applied to the training data, yet that matrix must be kept secret, since otherwise the original content of the rotationally transformed support vectors could be recovered by multiplying by the inverse of the matrix.
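
The rotation-invariance property being exploited is simply that an orthogonal matrix preserves Euclidean distances: for \(R^{\top}R = I\),

\[ \lVert R\mathbf{x} - R\mathbf{z} \rVert^2 = (\mathbf{x}-\mathbf{z})^{\top} R^{\top} R\, (\mathbf{x}-\mathbf{z}) = \lVert \mathbf{x} - \mathbf{z} \rVert^2 , \quad\text{hence}\quad e^{-\gamma \lVert R\mathbf{x} - R\mathbf{z} \rVert^2} = e^{-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert^2} , \]

so an SVM trained on the rotated data produces identical Gaussian kernel values, but every test point must also be multiplied by the secret matrix \(R\) before classification.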

Compared to existing privacy-preserving SVM works, where [9] aims at outsourcing the SVM training without revealing the actual content of the data and [26, 54, 58, 59] aim at cooperatively training the SVM without revealing each party's own data when the data are separately held, our work addresses the inherent privacy violation problem of the SVM classifier, which incorporates a subset of the training data, and designs a mathematical transformation method to protect the private content of the support vectors so that the SVM classifier can be released. Compared to anonymous data publishing techniques, our scheme achieves better performance and provides stronger privacy protection by hiding all the feature values.