


3.6.1 Approximating the SVM Classifier

The objective of the PPSVC is to precisely approximate the SVM classifier without compromising the privacy of the training instances which are selected as support vectors. We test the approximation ability of the PPSVC by comparing the accuracy between the original SVM classifiers and their corresponding PPSVCs with different approximation degrees du.

We consider several public real-world datasets available from the UCI machine learning repository [5] to evaluate the performance of the PPSVC. We select some medical datasets to test the effectiveness of the PPSVC on medical applications, as mentioned in Section 3.1. The classifiers trained from such medical datasets are for predicting whether a patient is subject to a specific disease. The Wisconsin breast cancer dataset, which contains clinical cases of breast cancer detection, is for predicting whether a tumor is benign or malignant. The liver disorders dataset contains various blood tests and drinking behavior records to learn a classifier for predicting liver disorders arising from excessive alcohol consumption. The Pima Indian diabetes dataset contains medical records of females of Pima Indian heritage, which are used to learn classifiers to predict whether a patient is subject to diabetes, and the Statlog heart dataset is a heart disease database. We also select two credit datasets to test the effectiveness of the PPSVC for predicting the credit of customers. The Statlog Australian credit approval dataset concerns credit card applications. The Statlog German credit dataset is for classifying people as good or bad credit risks. This dataset comes in two formats, where one contains both categorical and numeric attributes and the other is purely numeric. We adopt the purely numeric version for ease of use with the SVM.

Two physical datasets, ionosphere and sonar, are also selected to test the effectiveness of the PPSVC on various applications. The targets of the radar data in the ionosphere dataset are free electrons in the ionosphere, and the label indicates whether the signal shows evidence of some type of structure in the ionosphere. The sonar dataset is for training a classifier to discriminate whether sonar signals bounced off metal or rock. For ease of experimentation, the chosen datasets are all binary-class problems. For multi-class problems, the popular one-against-one or one-against-all methods [7] can also be applied to the PPSVC. The statistics of the datasets are given in the table below.

Dataset         Heart   Ionosphere   Liver   Diabetes
# instances       270          351     345        768
# attributes       13           34       6          8

Dataset         Australian   German   Sonar   Breast
# instances            690     1000     208      683
# attributes            14       24      60       10

All attribute values have been scaled to [−1, 1] or [0, 1] in a preprocessing step to prevent attributes with large value ranges from dominating and to avoid numerical difficulties. We use LIBSVM [7] as our tool to train the SVM classifiers. The values of the cost parameter C and the kernel parameter g used to train the SVM are determined by a grid search with cross-validation, where the upper bound of g's search range is the reciprocal of the number of attributes of each dataset, as discussed in Section 3.5.2. The SVM classifiers trained by LIBSVM are then transformed into PPSVCs to protect the support vectors. The experimental results comparing the classification accuracy of the original SVM classifier and the PPSVCs with du = 1 to du = 5 are shown in Figure 3.8.
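The parameter selection described above can be sketched as follows. This is an illustrative sketch rather than the exact setup of the experiments: it uses scikit-learn's SVC (a wrapper around LIBSVM), a stand-in dataset, and an assumed log-scale grid for C and g, with the g candidates capped at 1/#attributes.

    # Sketch: scale attributes to [0, 1], then grid-search C and the RBF kernel
    # parameter g with cross-validation, capping g's search range at 1/#attributes.
    # The dataset and grid values are illustrative assumptions.
    from sklearn.datasets import load_breast_cancer   # stand-in for the UCI datasets
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)
    X = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)   # scale to [0, 1]

    n_attributes = X.shape[1]
    gamma_upper = 1.0 / n_attributes                  # upper bound of g's search range
    param_grid = {
        "C":     [2.0 ** k for k in range(-5, 11)],   # assumed coarse log2 grid
        "gamma": [g for g in (2.0 ** k for k in range(-15, 4)) if g <= gamma_upper],
    }

    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print("best C =", search.best_params_["C"],
          "best g =", search.best_params_["gamma"])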


Figure 3.8: Classification Accuracy of the original SVM classifier and PPSVCs with du = 1 to 5.

The classification accuracy reported in the figure is the average over 5-fold cross-validation.

It can be seen that the PPSVC with du = 1 usually does not approximate the original SVM classifier well. With du = 1, the truncation of the infinite series in the privacy-preserving decision function is analogous to approximating exp(x) by 1 + x, and such a linear approximation usually cannot achieve good precision.
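To make the analogy concrete, the following small sketch (illustrative only; the sample point x is arbitrary) compares exp(x) with its Taylor partial sums of degree du = 1 to 5, showing how quickly the truncation error shrinks as the degree grows.

    # Illustration of the analogy above: approximate exp(x) by its Taylor
    # partial sum of degree du, i.e. sum_{k=0}^{du} x^k / k!.
    # du = 1 gives the linear approximation 1 + x mentioned in the text.
    import math

    def exp_truncated(x, du):
        """Partial sum of the exponential series up to degree du."""
        return sum(x ** k / math.factorial(k) for k in range(du + 1))

    x = -0.5  # an arbitrary sample point (illustrative only)
    for du in range(1, 6):
        approx = exp_truncated(x, du)
        print(f"du={du}: approx={approx:.6f}, true={math.exp(x):.6f}, "
              f"abs error={abs(approx - math.exp(x)):.2e}")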

On the medical datasets, the PPSVC achieves the same accuracy as the original SVM classifier at du = 2 on the breast cancer, heart disease, and diabetes datasets; the PPSVC handles these problems well at a low approximation degree. On the liver disorders dataset, the PPSVC does not reach the same accuracy until du = 5.

Since the two classes of instances in this dataset overlap heavily, even a small difference in the decision boundary results in a large variation in the classification results. A more precise approximation of the decision boundary is therefore required to obtain good classification performance.

On the credit datasets, German and Australian, a low approximation degree is enough for the PPSVC to give a very good approximation. In these datasets, many attributes are indicator variables transformed from the original categorical attributes during preprocessing. The values of the indicator variables in the two classes of instances are clearly separated, which leaves only a few instances in the region close to the decision boundary.

Hence a rough approximation is enough to achieve similar classification accuracy. Note that the better accuracy obtained by the PPSVC with du = 1 on the Australian dataset does not mean that the PPSVC achieves better performance. It is an artifact of the poor linear approximation at du = 1, where some overlapping instances happen to be classified into their correct labels by the imprecise approximate decision boundary.

On the physical problems, the PPSVC achieves a satisfying approximation at du = 2 on the ionosphere dataset, while the sonar dataset needs du = 4 to obtain similar accuracy. This may come from the larger kernel parameter determined by cross-validation on this dataset, which reaches the upper bound 1/#attributes. A larger kernel parameter results in lower approximating precision, and hence a higher approximation degree is required.
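The effect of the kernel parameter on the truncation error can be illustrated numerically; the squared distance and the two values of g below are illustrative assumptions, not values taken from the experiments.

    # Illustration: for the RBF kernel exp(-g * d2), a larger kernel parameter g
    # pushes the exponent further from zero, so a fixed-degree truncation of the
    # exponential series leaves a larger error.  d2 and the g values are assumed.
    import math

    def exp_truncated(x, du):
        return sum(x ** k / math.factorial(k) for k in range(du + 1))

    d2 = 1.0   # an assumed squared distance between two scaled instances
    du = 2     # a fixed low approximation degree
    for g in (1.0 / 60, 1.0 / 6):   # a small vs. a relatively large kernel parameter
        x = -g * d2
        err = abs(exp_truncated(x, du) - math.exp(x))
        print(f"g={g:.4f}: absolute truncation error at du={du} is {err:.2e}")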

In general, with du = 2, the classification accuracy obtained by the PPSVC is already close to that of the original SVM classifier. With du = 3, the PPSVC achieves almost the same classification accuracy as the original SVM classifier on most datasets, and on all datasets the PPSVC reaches the same classification accuracy as the original SVM classifier by du = 5. This verifies our claim that the PPSVC can precisely approximate the original SVM classifier with a low approximation degree du, and hence results in a classifier of moderate complexity. The PPSVC with a low approximation degree can effectively approximate the SVM classifier while possessing the privacy-preserving property that protects the private content of the support vectors.