2.7 Experimental Analysis

In the experiments, we first compare the classification performance of a conventional SVM and the RSVM with random vectors as the reduced set, to evaluate the effectiveness of the proposed scheme for classification. We then measure the computational time imposed on the data owner by our privacy-preserving outsourcing SVM scheme, and compare it with the computational time of training the SVM locally, to demonstrate the computational load saved by outsourcing. Finally, we compare the classification performance with SVMs trained from anonymized data, since the anonymous data publishing technique [50] is suitable for releasing datasets where only the identities of instances are of concern.

Since training SVMs on large datasets is very time-consuming, for ease of experimentation we choose datasets of moderate size; even at this size, the difference in the scale of computing time between outsourcing and local training is clear enough to demonstrate the efficacy of our scheme. The datasets used in the experiments are available at the UCI machine learning repository [5]. We select several medical datasets and bank credit datasets, which carry stronger privacy concerns, to evaluate the effectiveness of the scheme. The medical datasets, which contain medical records of patients, include Wisconsin breast cancer, Pima Indian diabetes, Liver disorder, and Statlog heart disease.

The bank credit datasets are Australian credit and the German credit numeric version, which contain personal information of bank customers. Besides, we also adopt some datasets with fewer privacy concerns, including Ionosphere, which collects radar data of free electrons in the ionosphere, and some other datasets from the LIBSVM website [7], namely Fourclass and Svmguide3. The statistics of the datasets are shown in Table 2.1. The programs for the random linear transformation and the RSVM are written in Matlab. The experimental platform is a PC with an Intel Core 2 Q6600 CPU and 4GB of RAM, running Windows XP.

Table 2.1: Dataset Statistics

Dataset             Number of instances   Number of attributes
Heart                      270                     13
Breast                     683                     10
Australian credit          690                     14
Liver                      345                      6
German credit             1000                     24
Diabetes                   768                      8
Ionosphere                 351                     34
Fourclass                  862                      2
Svmguide3                 1243                     22

2.7.1 Utility of Classification

In this section, we compare the classification performance of a conventional SVM implementation, LIBSVM [7], and the RSVM with a random reduced set, to show the effectiveness of our scheme for classification.

We solve the L2-norm RSVM problem (problem (2.4) in Section 2.4.1) by the smooth SVM method [28, 29, 30]. Randomly generated vectors are adopted as the reduced set for training the RSVM, and the size of the reduced set is set to 10% of the size of the training dataset. The kernel function adopted in both the RSVM and LIBSVM is the Gaussian kernel. The cost/kernel parameters for training the RSVM and LIBSVM are determined by grid search with cross-validation, where the search range is the default of LIBSVM's parameter search tool [7, 22].
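To make the procedure concrete, the following Python sketch trains an L2-norm RSVM whose reduced set consists of randomly generated vectors. All function and variable names are ours, and a generic quasi-Newton solver stands in for the smooth SVM method; this is an illustrative sketch, not our Matlab implementation.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, gamma):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def train_rsvm_random_reduced(X, y, gamma=0.1, C=1.0, ratio=0.1, seed=0):
    """Train an L2-norm RSVM using random vectors as the reduced set.

    X: m x n training matrix, y: labels in {-1, +1}.
    The reduced-set size is 10% of the training set, as in the experiments."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    r = max(1, int(np.ceil(ratio * m)))
    R = rng.standard_normal((r, n))       # random reduced set
    K = gaussian_kernel(X, R, gamma)      # m x r reduced kernel matrix

    def objective(w):
        u, b = w[:r], w[r]
        slack = np.maximum(0.0, 1.0 - y * (K @ u + b))
        return 0.5 * (u @ u) + C * (slack ** 2).sum()  # squared hinge (L2-norm)

    w = minimize(objective, np.zeros(r + 1), method="L-BFGS-B").x
    return R, w[:r], w[r]                 # reduced set, coefficients, bias

def rsvm_predict(X_test, R, u, b, gamma=0.1):
    return np.sign(gaussian_kernel(X_test, R, gamma) @ u + b)
```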

Figure 2.2 shows the results of the classification performance comparison.

The reported accuracy is the 5-fold cross-validation average. It is seen that the classification accuracy of the RSVM with random vectors as the reduced set is similar to that of a conventional SVM, which validates that our scheme is effective for classification.

Figure 2.2: Comparison of the classification accuracy between the RSVM with random reduced set and a conventional SVM.

2.7.2 Efficiency of Outsourcing

To demonstrate the benefits of outsourcing, we measure the computational overhead imposed on the data owner by the privacy-preserving outsourcing scheme, and compare it with the computational time of training the SVM locally, to show how much computation the data owner saves by outsourcing.

Table 2.2 shows the comparison of the computing time required of the data owner with and without outsourcing. The SVM training includes the parameter search process and training the final classifier with the selected parameter combination.

The search range adopted here is also the default of LIBSVM's parameter search tool [7, 22]. The training time of both the RSVM and LIBSVM is listed for reference.

Note that we do not aim to compare the training time of the two training methods, since they are different implementations of the SVM. When using the outsourcing scheme on a dataset with m instances of n attributes, the data owner perturbs the data by generating an n × n random matrix for the transformation, generating ⌈m/10⌉ random n-dimensional vectors as the reduced set of the RSVM, and transforming the m training instances and the ⌈m/10⌉ random vectors by matrix multiplication. These computations can be executed very fast: on all of the datasets, they complete within 0.5 milliseconds.

In contrast, locally training the SVM takes at least several seconds, costing the data owner more than 10,000 times the computing time of outsourcing.

The difference in the scale of computing time between outsourcing and local training is very large. This validates the claim that the proposed privacy-preserving outsourcing scheme imposes only a small computational overhead on the data owner. The computational load of the data owner is significantly reduced, which clearly justifies the efficacy of the privacy-preserving outsourcing scheme.
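In code, the owner-side perturbation amounts to two random draws and two matrix products. A minimal NumPy sketch follows (names are ours; a full implementation would also verify that the random matrix is invertible):

```python
import numpy as np

def perturb_for_outsourcing(X, seed=None):
    """Data owner's pre-outsourcing computation (sketch).

    X: m x n training matrix. Draws a secret n x n random matrix M and
    ceil(m/10) random n-dimensional vectors as the RSVM reduced set, then
    transforms both by matrix multiplication."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    M = rng.standard_normal((n, n))                     # secret transformation
    R = rng.standard_normal((int(np.ceil(m / 10)), n))  # random reduced set
    return X @ M, R @ M, M   # perturbed data, perturbed reduced set, secret key
```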

Table 2.2: Time comparison of training SVMs with/without outsourcing

Dataset             Privacy-preserving   Locally Training   Locally Training
                    Outsourcing          (RSVM)             (LIBSVM)
Heart disease       0.12 ms              2.9 s              6.5 s
Australian credit   0.17 ms              9.7 s              40.4 s
German credit       0.35 ms              19.7 s             141.9 s
Breast cancer       0.14 ms              9.8 s              12.4 s
Diabetes            0.13 ms              10.8 s             123.4 s
Liver disorder      0.07 ms              2.6 s              32.3 s
Ionosphere          0.29 ms              3.2 s              9.4 s
Fourclass           0.12 ms              11.7 s             35.9 s
Svmguide3           0.49 ms              28.5 s             155.2 s

Scalability of Random Linear Transformation

We measure the computing time of perturbation with the random linear transformation on large-scale synthetic datasets to evaluate its scalability. The number of instances in the synthetic datasets ranges from 10,000 to 50,000, with dimensionalities of 500 and 1,000, respectively. The computing time of the random linear transformation is shown in Figure 2.3. It is seen that the random linear transformation scales well with the number of instances: randomly transforming 50,000 instances of 1,000 dimensions takes less than 5 seconds.
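A measurement of this kind is straightforward to reproduce. The sketch below times the transformation of a synthetic 50,000-instance, 1,000-dimensional dataset in NumPy; absolute timings will of course differ from the Matlab figures reported here.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 1_000))   # synthetic 50,000 x 1,000 dataset
M = rng.standard_normal((1_000, 1_000))    # random transformation matrix

start = time.perf_counter()
X_perturbed = X @ M                        # the random linear transformation
print(f"transformed in {time.perf_counter() - start:.2f} s")
```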

Efficiency of Outsourcing the Testing

The overhead imposed on the data owner for outsourcing the testing is randomly transforming the testing instances. We randomly generate 10,000 testing instances for each dataset to compare the time of outsourced testing and local testing, where the classifiers for each dataset are the ones trained above. The results are reported in Table 2.3.

Figure 2.3: Computing time of perturbing data by random linear transformation.

It is seen that the outsourcing scheme saves tens to hundreds of times of computational load for the data owner. Since SVM testing is already efficient, the difference between outsourced testing and local testing is not as significant as in the training case.
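At test time, the owner-side work is a single matrix product per batch of queries; a sketch (names are ours):

```python
import numpy as np

def outsource_testing(X_test, M):
    """Owner-side cost of outsourced testing: one matrix product.

    X_test holds the query instances; M is the owner's secret matrix
    from the training phase. The result is sent to the service provider."""
    return X_test @ M
```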

Table 2.3: Time comparison of testing 10,000 instances with/without outsourcing

Dataset             Privacy-preserving   Locally Testing   Locally Testing
                    Outsourcing          (RSVM)            (LIBSVM)
Heart disease       1.40 ms              43.29 ms          233.20 ms
Australian credit   1.46 ms              111.35 ms         1380.08 ms
German credit       3.13 ms              195.74 ms         1606.66 ms
Breast cancer       1.23 ms              103.21 ms         113.17 ms
Diabetes            0.96 ms              109.69 ms         816.05 ms
Liver disorder      0.39 ms              46.79 ms          430.28 ms
Ionosphere          5.22 ms              79.49 ms          269.06 ms
Fourclass           0.17 ms              106.61 ms         265.85 ms
Svmguide3           3.09 ms              233.19 ms         1302.06 ms

2.7.3 Utility Comparison with k-Anonymity

In this section, we compare the classification performance of the RSVM with a random reduced set against SVM classifiers trained from data anonymized by the k-anonymity technique [23, 49]. If only the identities of the data are of concern, the anonymous data publishing technique can be adopted to send the anonymized data to the service provider for outsourcing the SVM.

Three of the above datasets contain quasi-identifier attributes: Statlog heart has {age, sex}, Pima Indian diabetes has {age, number of pregnant, body mass index}, and German credit has {purpose, credit amount, personal status and sex, present residence since, age, job}. Value generalization hierarchies are first built on the quasi-identifiers of each dataset, and then the Datafly algorithm [49] is performed to achieve k-anonymity. Since the SVM is a value-based algorithm, for numerical attributes each generalized range is represented by its mean value, and for categorical attributes the generalized category is represented by exhibiting all children categories [23]. The cost/kernel parameters for training the SVMs on the anonymized data are determined by grid search using cross-validation.
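For the numerical quasi-identifiers, the mean-value representation can be sketched as below; the bracketed range format is our assumption, since the actual encoding depends on the generalization hierarchy.

```python
import re

def range_to_mean(value):
    """Map a generalized numeric range such as '[30-39]' to its mean value,
    so the anonymized attribute can be fed to a value-based learner like
    the SVM. Ungeneralized values pass through unchanged."""
    m = re.match(r"\[?(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\]?", str(value))
    if m:
        lo, hi = float(m.group(1)), float(m.group(2))
        return (lo + hi) / 2.0
    return float(value)
```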

The performance comparison between the RSVM with a random reduced set and the SVMs trained from k-anonymized data with k = 32 and k = 128 is shown in Figure 2.4. The reported accuracy is the 5-fold cross-validation average. On the German credit dataset, the accuracy with k = 32 is similar to that of the RSVM with a random reduced set, but it drops at k = 128 due to the more severe distortion of the quasi-identifier values. On the Heart and Diabetes datasets, k = 32 is already enough to significantly distort the quasi-identifier values and thus results in lower accuracy.

Figure 2.4: Classification performance comparison between the RSVM with random reduced set and the SVMs trained from k-anonymized data.

It is seen that the distortion of quasi-identifiers required to achieve k-anonymity hurts the performance of the SVM, and the performance may get worse when a larger k is applied for better identity protection. Compared with outsourcing the SVM by k-anonymity, our scheme hardly hurts the performance of the SVM, and it provides better protection of data privacy since all attributes are perturbed by the random linear transformation.