4.5.2 Comparison of Accuracy and Efficiency

We report the accuracy, training time, and testing time of applying the degree-2 TPM feature mapping with linear SVM solvers, and compare against normal Gaussian kernel SVMs.

We also compare with two other explicit feature mappings used with linear SVM solvers: random Fourier features [43] and the degree-2 explicit feature mapping of the polynomial kernel [8].

The authors of [8] provide a program that integrates the degree-2 polynomial mapping with LIBLINEAR, and we use it in the experiments. For the TPM and random Fourier feature mappings, we first map all data separately and then use the mapped data as the input to LIBLINEAR. From Table 4.3, the average number of nonzero features in the degree-2 TPM feature mapped data ranges between 90.3 and 118.1. Since random Fourier features are dense, we use 200 features for the random Fourier mapping so that its accuracy is compared at a training complexity similar to that of the degree-2 TPM feature mapping with the linear SVM. The degree-2 explicit polynomial feature mapped data has the same number of nonzero features as the degree-2 TPM feature mapped data.
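As an illustration of this pipeline, the sketch below maps data with random Fourier features and trains a linear SVM on the mapped data. It is a minimal example under stated assumptions: scikit-learn's LinearSVC (a LIBLINEAR wrapper) stands in for the command-line LIBLINEAR used in the experiments, the data and the kernel parameter gamma are placeholders, and the Fourier mapping is a textbook implementation rather than the exact program we used.

```python
import numpy as np
from sklearn.svm import LinearSVC  # LIBLINEAR-backed linear SVM

def random_fourier_features(X, gamma, n_components=200, seed=0):
    """Map X so that z(x).dot(z(y)) approximates exp(-gamma * ||x - y||^2)."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

# Placeholder data; in the experiments the mapped data were written to disk
# and then fed to LIBLINEAR as a separate step.
X_train = np.random.randn(1000, 22)
y_train = np.random.randint(0, 2, size=1000)

Z_train = random_fourier_features(X_train, gamma=0.5)  # dense, 200 features
clf = LinearSVC(C=1.0).fit(Z_train, y_train)           # linear SVM on mapped data
```

The TPM-2 and Poly-2 runs follow the same pattern, only with their respective (sparse) mappings in place of random_fourier_features.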

All parameters for training are determined by cross-validation.¹ The training times and testing accuracies of the three methods are reported in Table 4.4, and the testing times are reported in Table 4.5. For ease of comparison, we also show the differences in time and accuracy relative to the Gaussian kernel SVM.

¹The degree-2 polynomial kernel function is K(x, y) = (g x·y + r)², where we fix r to 1, following [8].

We first consider the results of our proposed degree-2 TPM feature mapping (TPM-2). On the IJCNN2001 and Adult datasets, the resulting accuracy is similar to that of the Gaussian kernel SVM, while the training time is much shorter. On the Forest cover type dataset, the accuracy is not as good as that of a normal Gaussian kernel SVM.

The reason is that this dataset needs a large value of the Gaussian kernel parameter g to separate the two classes of data, but the approximation precision of the TPM feature mapping decreases as g increases. The TPM feature mapping therefore has to use a smaller g to work with the SVM, and a small value of g does not separate the data well, resulting in lower accuracy. However, the training takes only several minutes to complete, compared to several hours for the Gaussian kernel SVM. Although the accuracy is not as high as that of a normal Gaussian kernel SVM, the improvement in training time is large and provides a good trade-off between accuracy and efficiency. The results show that the low-degree TPM feature mapping with a linear SVM solver can closely approximate the classification ability of the Gaussian kernel SVM at a very low computational cost.
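To make the role of g explicit, assume (as in the construction developed in the earlier chapters) that the TPM mapping truncates the Taylor series of the exponential factor of the Gaussian kernel; the following is a sketch of the argument rather than a precise error bound:

\[
K(x, y) = e^{-g\|x-y\|^2}
        = e^{-g\|x\|^2}\, e^{-g\|y\|^2}\, e^{2g\, x \cdot y},
\qquad
e^{2g\, x \cdot y} \;\approx\; \sum_{k=0}^{d} \frac{(2g\, x \cdot y)^k}{k!}.
\]

Expanding (x·y)^k yields the explicit monomial features of degree at most d. The terms dropped by the degree-d truncation grow with |2g x·y|, so a large g requires a higher degree to keep the error small, whereas for a moderate g the degree-2 truncation is already accurate.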

The degree-2 polynomial mapping (Poly-2) also results in similar accuracy on the IJCNN2001 and Adult datasets, but on the Forest cover type dataset it does not perform well and is only slightly better than the linear SVM. Since the degree is one of the parameters of the polynomial kernel function, the nonlinear ability of the polynomial kernel is restricted by the low degree, so it cannot separate this dataset well.
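For concreteness, the explicit feature space behind Poly-2 can be written out. With r = 1 as in the footnote above, the standard expansion of the degree-2 polynomial kernel is

\[
(g\, x \cdot y + 1)^2 = \phi(x) \cdot \phi(y),
\qquad
\phi(x) = \bigl[\,1,\ \sqrt{2g}\,x_1,\ldots,\sqrt{2g}\,x_n,\ g\,x_1^2,\ldots,g\,x_n^2,\ \sqrt{2}\,g\,x_1x_2,\ldots,\sqrt{2}\,g\,x_{n-1}x_n\,\bigr].
\]

Whatever the value of g, this feature space contains only the constant, linear, and pairwise quadratic monomials, which is why Poly-2 cannot separate the Forest cover type data much better than the linear SVM.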

In contrast, the degree of our TPM feature mapping controls the precision of the approximation and is not a parameter of the Gaussian kernel function; degree 2 is usually enough to approximate the kernel well and hence achieves better accuracy. The computing time of the explicit polynomial feature mapping is usually shorter here because the program provided by its authors integrates the feature mapping: it reads the original data from disk and performs the mapping in memory, which is fast. Our prototype of the TPM is a separate feature mapping, so the linear SVM solver must read the larger mapped data from disk; since disk reading is slow, it usually takes longer than Poly-2. The difference is more apparent in testing. From Table 4.5, the resulting classifiers of TPM-2 and Poly-2 have a similar number of nonzero features in the weight vector w. Because Poly-2 reads the original data and performs in-memory feature mapping, it runs faster than TPM-2, which reads the larger mapped data from disk. We leave the integration of the TPM feature mapping with the linear SVM solver as future work.
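As an illustration of what such an integration looks like, the following sketch maps a test instance in memory (using the degree-2 expansion shown above) and applies the dense weight vector directly, so no mapped data is written to or read from disk. The function and variable names (poly2_map, w, b) are hypothetical, and the real Poly-2 program works on LIBLINEAR's sparse format rather than dense NumPy arrays.

```python
import numpy as np

def poly2_map(x, g):
    """Degree-2 explicit mapping of one instance, matching (g * x.dot(y) + 1)**2."""
    n = x.size
    cross = np.sqrt(2.0) * g * np.outer(x, x)[np.triu_indices(n, k=1)]
    return np.concatenate(([1.0], np.sqrt(2.0 * g) * x, g * x ** 2, cross))

def predict(w, b, x, g):
    # Map on the fly and evaluate the linear model; the mapped vector lives
    # only in memory for the duration of this call.
    return np.sign(w.dot(poly2_map(x, g)) + b)
```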

We then consider the random Fourier features (Fourier-200). The accuracy of Fourier-200 is poor because 200 features are still too few to approximate the Gaussian kernel function well. The random Fourier features method requires a large number of features to reduce the variance of the approximation, yet with 200 features it already consumes more training time than TPM-2 and Poly-2 on the Adult and IJCNN2001 datasets. In the comparison of testing efficiency, although there are only 200 nonzero features in the weight vector w of Fourier-200, it still runs slower than TPM-2 and Poly-2. Because the random Fourier features are dense, every mapped testing instance also has 200 nonzero features, while the TPM-2 and Poly-2 feature mapped data are sparse. Hence Fourier-200 runs slower in testing than TPM-2 and Poly-2, which have dense weight vectors but sparse testing data.

4.6 Summary

We propose the Taylor polynomial-based monomial (TPM) feature mapping, which approximates the infinite-dimensional implicit feature mapping of the Gaussian kernel function by low-dimensional features, and then utilize the TPM feature mapped data with a fast linear SVM solver to approximately train a Gaussian kernel SVM. The experimental results show that the TPM feature mapping with a linear SVM solver can achieve accuracy similar to that of a Gaussian kernel SVM while consuming much less time.

Chapter 5 Conclusion

In this dissertation, we study both privacy and efficiency issues in utilizing support vector machines. We show that existing works are not secure for privacy-preserving outsourcing of the SVM, and we address the inherent privacy violation problem of the SVM classifier. We propose solutions for these problems and prove the security of the proposed techniques. We also develop an efficient SVM training scheme for large-scale data.

In Chapter 2, we propose a privacy-preserving outsourcing scheme for the SVM which protects the data by a random linear transformation. It achieves classification accuracy similar to that of a normal SVM classifier and provides stronger data privacy than existing works based on geometric transformations. The privacy of both the data and the generated classifiers is protected, and the overhead imposed on the data owner is very small.

In Chapter 3, we propose the privacy-preserving SVM classifier to tackle the inherent privacy violation problem of the SVM classification model, in which some intact instances of the training data, called support vectors, are revealed. The Gaussian kernel SVM classifier is post-processed into a privacy-preserving classifier which precisely approximates the prediction ability of the SVM classifier without disclosing the private content of the support vectors. By protecting the content of the support vectors, the privacy-preserving SVM classifier can be publicly released without violating individual privacy.

In Chapter 4, based on the kernel approximation technique of Chapter 3, we propose the Taylor polynomial-based monomial feature mapping, which closely approximates the infinite-dimensional implicit feature mapping of the Gaussian kernel function by explicit low-dimensional features, and then utilize the explicitly feature mapped low-dimensional data with a fast linear SVM solver to approximately train a Gaussian kernel SVM. The experimental results show that the proposed scheme can achieve classification accuracy similar to that of a normal Gaussian kernel SVM while consuming much less time.

Bibliography

[1] C. C. Aggarwal and P. S. Yu, “A condensation approach to privacy preserving data mining,” in Proceedings of the 9th International Conference on Extending Database Technology (EDBT), 2004.

[2] D. Agrawal and C. C. Aggarwal, “On the design and quantification of privacy preserving data mining algorithms,” in Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2001.

[3] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order preserving encryption for numeric data,” in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2004.

[4] R. Agrawal and R. Srikant, “Privacy preserving data mining,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2000.

[5] A. Asuncion and D. Newman, UCI Machine Learning Repository, 2007, available at http://www.ics.uci.edu/mlearn/MLRepository.html.

[6] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[7] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

[8] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, “Training and testing low-degree polynomial data mappings via linear SVM,” Journal of Machine Learning Research, vol. 11, pp. 1471–1490, 2010.

[9] K. Chen and L. Liu, “Privacy preserving data classification with rotation perturbation,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 2005.

[10] K. Chen, G. Sun, and L. Liu, “Towards attack-resilient geometric data perturbation,” in Proceedings of the 7th SIAM International Conference on Data Mining (SDM), 2007.

[11] M.-S. Chen, J. Han, and P. S. Yu, “Data mining: An overview from database perspective,” IEEE Trans. Knowl. Data Eng., vol. 8, no. 6, pp. 866–883, 1996.

[12] Y.-W. Chen and C.-J. Lin, “Combining SVMs with various feature selection strategies,” in Feature extraction, foundations and applications. Springer, 2006.

[13] R. Collobert, S. Bengio, and Y. Bengio, “A parallel mixture of SVMs for very large scale problems,” Neural Computation, vol. 14, no. 5, pp. 1105–1114, 2002.

[14] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, “Privacy preserving mining of association rules,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2002.

[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008, software available at http://www.csie.ntu.edu.tw/cjlin/liblinear.

[16] C. Gentry, “Computing arbitrary functions of encrypted data,” Communications of the ACM, vol. 53, no. 3, pp. 97–105, 2010.

[17] S. Goldwasser and S. Micali, “Probabilistic encryption,” Journal of Computer and System Sciences, vol. 28, no. 2, pp. 270–299, 1984.

[18] R. P. Grimaldi, Discrete and Combinatorial Mathematics: An Applied Introduction. Pearson Education, 2004.

[19] H. Hacıgümüş, B. Iyer, C. Li, and S. Mehrotra, “Executing SQL over encrypted data in the database-service-provider model,” in Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.

[20] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.

[21] HIPAA, Standard for privacy of individually identifiable health information, 2001. [Online]. Available: http://www.hhs.gov/ocr/privacy/index.html

[22] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” Department of Computer Science, National Taiwan University, http://www.csie.ntu.edu.tw/cjlin/papers/guide/guide.pdf, Tech. Rep., 2003.

[23] A. Inan, M. Kantarcioglu, and E. Bertino, “Using anonymized data for classification,” in Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE), 2009.

[24] T. Joachims, “Training linear SVMs in linear time,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[25] M. Kantarcioglu and C. Clifton, “Privacy-preserving distributed mining of association rules on horizontally partitioned data,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026–1037, 2004.

[26] S. Laur, H. Lipmaa, and T. Mielikäinen, “Cryptographically private support vector machines,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[27] Y.-J. Lee and S.-Y. Huang, “Reduced support vector machines: A statistical theory,” IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 1–13, 2007.

[28] Y.-J. Lee and O. L. Mangasarian, “RSVM: Reduced support vector machines,” in Proceedings of the 1st SIAM International Conference on Data Mining (SDM), 2001.

[29] ——, “SSVM: A smooth support vector machine for classification,” Computational Optimization and Applications, vol. 20, no. 1, pp. 5–22, 2001.

[30] K.-M. Lin and C.-J. Lin, “A study on reduced support vector machines,” IEEE Transactions on Neural Networks, vol. 14, no. 6, pp. 1449–1459, 2003.

[31] K.-P. Lin and M.-S. Chen, “Releasing the SVM classifier with privacy-preservation,” in Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), 2008.

[32] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” Journal of Cryptology, vol. 15, pp. 177–206, 2002.

[33] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” in Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE), 2006.

[34] O. L. Mangasarian, E. W. Wild, and G. M. Fung, “Privacy-preserving classification of vertically partitioned data via random kernels,” ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 3, pp. 12:1–12:16, 2008.

[35] O. L. Mangasarian and T. Wild, “Privacy-preserving classification of horizontally partitioned data via random kernels,” Data Mining Institute, Computer Sciences Department, University of Wisconsin-Madison, Tech. Rep. 07-03, 2007.

[36] B. Mozafari and C. Zaniolo, “Publishing naive Bayesian classifiers: Privacy without accuracy loss,” in Proceedings of the 35th International Conference on Very Large Data Bases (VLDB), 2009.

[37] E. Osuna, R. Freund, and F. Girosi, “An improved training algorithm for support vector machines,” in Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing (NNSP), 1997.

[38] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in Advances in Cryptology - EUROCRYPT’99, vol. 1592 of Lecture Notes in Computer Science. Springer-Verlag Berlin Heidelberg, 1999, pp. 223–238.

[39] B. Pinkas, “Cryptographic techniques for privacy-preserving data mining,” ACM SIGKDD Explorations Newsletter, vol. 4, no. 2, pp. 12–19, 2002.

[40] J. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” in Advances in Kernel Methods: Support Vector Learning. MIT Press, 1998.

[41] D. Prokhorov, “IJCNN 2001 neural network competition,” Slide presentation in IJCNN’01, Ford Research Laboratory, Tech. Rep., 2001.

[42] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[43] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems 20 (NIPS), 2008.

[44] ——, “Uniform approximation of functions with random bases,” in Proceedings of the 46th Annual Allerton Conference on Communication, Control, and Computing, 2008.

[45] P. Samarati, “Protecting respondents’ identities in microdata release,” IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 6, pp. 1010–1027, 2001.

[46] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[47] A. J. Smola, B. Schölkopf, and K.-R. Müller, “The connection between regularization operators and support vector kernels,” Neural Networks, vol. 11, pp. 637–649, 1998.

[48] L. Sweeney, “Uniqueness of simple demographics in the U.S. population,” LIDAP-WP4, Carnegie Mellon University, Laboratory for International Data Privacy, 2000.

[49] ——, “Achieving k-anonymity privacy protection using generalization and suppression,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 571–588, 2002.

[50] ——, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557–570, 2002.

[51] I. W. Tsang, A. Kocsor, and J. T. Kwok, “Simpler core vector machines with enclosing balls,” in Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

[52] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, “Core vector machines: Fast SVM training on very large data sets,” Journal of Machine Learning Research, vol. 6, pp. 363–392, 2005.

[53] J. Vaidya and C. Clifton, “Privacy-preserving association rule mining in vertically partitioned data,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2002.

[54] J. Vaidya, H. Yu, and X. Jiang, “Privacy-preserving SVM classification,” Knowledge and Information Systems, vol. 14, pp. 161–178, 2008.

[55] V. N. Vapnik, Statistical Learning Theory. John Wiley and Sons, 1998.

[56] W. K. Wong, D. W. Cheung, E. Hung, B. Kao, and N. Mamoulis, “Security in outsourcing of association rule mining,” in Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), 2007.

[57] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, “Secure kNN computation on encrypted databases,” in Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD), 2009.

[58] H. Yu, X. Jiang, and J. Vaidya, “Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data,” in Proceedings of the 2006 ACM Symposium on Applied Computing (SAC), 2006.

[59] H. Yu, J. Vaidya, and X. Jiang, “Privacy-preserving SVM classification on vertically partitioned data,” in Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2006.