CHAPTER IV EXPERIMENT RESULTS
4.2 DISCUSSION
In this research, we presented a novel approach to Chinese e-government document classification that integrates a word segmentation method based on the Internet. Comparing the overall performance of the two segmentation techniques, we found no significant difference between CKIP and GSP.
However, it is noteworthy that our proposed segmentation method does not require an extra corpus that must be pre-built manually, nor does it require tedious manual maintenance of such a corpus.
In addition, in a detailed comparison between two models, one using full text and one using titles only, we found significant differences in performance. The results in Table 4.4, Table 4.5, and Fig. 4.7 show that classification from titles performs better than classification from full text.
Two main problems that arise in Chinese e-government document classification were tackled in this work: (1) overcoming the OOV problem and (2) reducing dimensionality. Comparing the results reported in Table 4.5, we found that the OOV word “兩公約” (International Covenant on Civil and Political Rights and International Covenant on Economic, Social and Cultural Rights) was successfully identified by GSP, whereas CKIP could not segment it out. “兩公約” is in fact an absolute keyword for categorizing documents into Student Affairs. In addition, the results in Table 4.2 show that the longer the segmented words were, the fewer words were extracted.
Table 4.2 and Table 4.5 show that GSP extracts fewer words in total. Longer extracted words and fewer segmented words support our assumption: the computational complexity is reduced because GSP extracts fewer words in total than CKIP does.
Chapter V
Conclusion and Future Work
In this chapter, we discuss the conclusions drawn from our empirical results and directions for future work. Section 5.1 presents four conclusions, concerning the eight term weighting methods, the two test data sets, the multi-class SVM classifiers, and the Chinese word segmentation systems (CKIP and our proposed system). Section 5.2 describes the limitations of our research and future work.
5.1 Conclusion
In conclusion, we have discussed two Chinese word segmentation implementations for Chinese e-government documents, applied to full text and to titles. We compared them using eight term weighting methods and several multi-class classifiers: OAOSVM, OAASVM, BSVM, and BTMSVM.
Experiments on Chinese e-government document classification showed that our proposed Chinese word segmentation system, GSP, has no statistically significant difference from CKIP. Thus, relying on a web corpus without a pre-built dictionary is a feasible way to detect Chinese words.
Comparing the performance of the eight term weighting methods, we found that binary, tf, and tfchi consistently perform best, while tfig and tfgr consistently perform worst across all experiments. The tfidf method, however, performs notably well with the BTMSVM classifier on the title data set, and we believe that BTMSVM favors tfidf. We therefore suggest that binary, tf, tfidf, and tfchi are more suitable for practical use.
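As a minimal sketch of three of the weighting schemes compared above (binary, tf, and tfidf), the following computes each weight for a toy corpus that has already been segmented. The idf form log(N/df) is one common variant and is an assumption here, as are the function name and the sample tokens; the exact formulas used in our experiments are those defined in the earlier chapters.

```python
import math
from collections import Counter

def term_weights(docs):
    """Return per-document binary, tf, and tfidf weights for tokenized docs."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            term: {
                "binary": 1,
                "tf": count,
                # Assumed idf variant: log(N / df); other variants exist.
                "tfidf": count * math.log(n_docs / df[term]),
            }
            for term, count in tf.items()
        })
    return weighted

# Hypothetical segmented documents (titles), for illustration only.
docs = [["兩公約", "宣導", "宣導"], ["會議", "宣導"], ["會議", "通知"]]
weights = term_weights(docs)
print(weights[0])
```

Under this variant, a term that occurs in every document receives a tfidf weight of zero, whereas binary and tf still keep it as a feature.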
Using only the title for Chinese e-government document classification is much more efficient. We considered the overall performance of the different data sets used in the experiments: data sets containing only titles were superior overall to data sets containing full text. Consider the e-government document shown in Fig. 2.1. By their nature, e-government documents are designed for exchanging information between units, so it makes sense that the title of a document already contains its main points. In addition, titles are much shorter than full texts. Thus, selecting the titles of e-government documents for segmentation and classification is efficient.
Regarding the multi-class classifiers, we found that BTMSVM consistently achieves the best overall performance among the four multi-class SVMs.
Although OAOSVM performs well, and in some cases comparably to BTMSVM, as mentioned earlier it may require prohibitively expensive computing resources in k-class problems when k is large.
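To make the growth in cost concrete, the sketch below counts how many binary SVMs each decomposition trains for a k-class problem. It uses the standard counts for one-against-one and one-against-all, and it assumes that a binary-tree architecture trains one SVM per internal node of a tree with k leaves; the function name is hypothetical, and the mapping of these counts onto the exact classifiers used in our experiments is an assumption.

```python
def classifier_counts(k):
    """Number of binary SVMs trained for a k-class problem."""
    if k < 2:
        raise ValueError("need at least two classes")
    return {
        "one_against_one": k * (k - 1) // 2,  # one SVM per pair of classes
        "one_against_all": k,                 # one SVM per class
        "binary_tree": k - 1,                 # one SVM per internal tree node
    }

for k in (4, 10, 50):
    print(k, classifier_counts(k))
```

The number of classifiers is only one part of the cost, since each one-against-one classifier is trained on the data of just two classes, but the quadratic growth in the number of models illustrates why the approach becomes expensive when k is large.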
5.2 Limitations and future work
Limitations
This research presents a study on Chinese word segmentation via the Internet. Two limitations introduce uncertainties that may affect the performance of our method.
1. The lexicon corpus we rely on, Google Suggest, emerged only in recent years, and the suggested terms it returns have not been well studied in the research literature for linguistic applications. According to Google’s online documentation, Google Suggest provides term completions based on the overall popularity of search strings. Therefore, it may return typographical errors (typos) made by Internet users who misspelled words.
2. Moreover, converting between the simplified and traditional Chinese words returned from Wikipedia remains complex in our proposed system. For example, three traditional Chinese characters, “檯”, “臺”, and “颱”, map to the same simplified Chinese character, “台”. In addition, the simplified character “台” is itself commonly used in Taiwan. Therefore, although we have carefully excluded such characters from the mapping table, the complexity of translation remains another unavoidable source of uncertainty in our investigation.
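The second limitation can be illustrated with a minimal sketch of a character mapping table that leaves ambiguous characters untouched. The pair list, function names, and conversion direction are assumptions made for illustration; this is not the mapping table used in our system.

```python
from collections import defaultdict

# Hypothetical (traditional, simplified) pairs; a real table would be far larger.
PAIRS = [("檯", "台"), ("臺", "台"), ("颱", "台"), ("灣", "湾"), ("國", "国")]

def build_table(pairs):
    """Keep only simplified characters with exactly one traditional form."""
    candidates = defaultdict(set)
    for traditional, simplified in pairs:
        candidates[simplified].add(traditional)
    return {s: next(iter(t)) for s, t in candidates.items() if len(t) == 1}

def to_traditional(text, table):
    """Convert character by character; unmapped or ambiguous characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

TABLE = build_table(PAIRS)
print(to_traditional("台湾", TABLE))  # 台 is ambiguous, so only 湾 is converted
```

Excluding one-to-many characters avoids wrong conversions, at the price of leaving some characters in their simplified form.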
Future work
Future work includes integration with other word segmentation systems, automatic construction of synonym mapping tables, parallel segmentation, integration of typo correction, and ranking of term values according to their position.
First, our proposed method, GSP, can be integrated with other Chinese word segmentation methods such as CKIP. We believe that our system, which needs no pre-built dictionary, can be combined with other systems to achieve better efficiency.
Second, another direction is to establish synonym mapping tables automatically. From the results in Table 4.5, we found that the two absolute keywords “兩公約” (International Covenant on Civil and Political Rights and International Covenant on Economic, Social and Cultural Rights) and “兩公約施行法” (Act to Implement the International Covenant on Civil and Political Rights and the International Covenant on Economic, Social and Cultural Rights) can be treated as synonyms. Therefore, if synonym mapping tables can be built automatically, they will help reduce the dimensionality.
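The effect of such a table on dimensionality can be sketched as follows. The table here is written by hand and holds a single hypothetical entry; building it automatically is the open problem, and the function name is an assumption.

```python
from collections import Counter

# Hypothetical synonym table: variant term -> canonical term.
SYNONYMS = {"兩公約施行法": "兩公約"}

def merge_synonyms(tokens, synonyms):
    """Count terms after replacing each variant with its canonical form."""
    return Counter(synonyms.get(tok, tok) for tok in tokens)

tokens = ["兩公約", "兩公約施行法", "宣導"]
features = merge_synonyms(tokens, SYNONYMS)
print(len(set(tokens)), "->", len(features))  # 3 distinct terms become 2 features
```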
Third, another interesting direction is to experiment with parallel segmentation based on forward maximum matching in GSP. Developing parallel segmentation would improve the speed of the segmentation procedure.
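A minimal sketch of this idea is given below, assuming that documents can be segmented independently of one another. The in-memory lexicon is a hypothetical stand-in for the web-based lookups that GSP performs, and the function names and worker count are also assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory lexicon standing in for the web-based lookups.
LEXICON = {"兩公約", "施行法", "宣導", "會議"}
MAX_LEN = max(len(w) for w in LEXICON)

def forward_max_match(text, lexicon=LEXICON, max_len=MAX_LEN):
    """Greedily take the longest lexicon word starting at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

def segment_all(texts, workers=4):
    """Segment independent texts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(forward_max_match, texts))

print(segment_all(["兩公約施行法宣導", "會議通知"]))
```

Threads are chosen here on the assumption that lookups are dominated by network latency; for purely in-memory lookups in CPython, a process pool would be the more suitable choice because of the global interpreter lock.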
Fourth, e-government documents may contain typos. We are therefore also interested in developing a word-correction function for our system by adapting techniques such as Levenshtein distance.
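As a sketch of how Levenshtein distance could support such a correction function, the following computes the edit distance with the standard dynamic-programming recurrence and returns vocabulary entries close to a possibly misspelled word. The suggest helper, its threshold, and the sample vocabulary are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def suggest(word, vocabulary, max_distance=1):
    """Return vocabulary entries within max_distance edits of word."""
    return [v for v in vocabulary if levenshtein(word, v) <= max_distance]

print(suggest("兩工約", ["兩公約", "施行法", "宣導"]))
```

Because Chinese words are short, a small threshold such as one edit is assumed here; a larger threshold would match many unrelated words.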
Finally, it is worth investigating whether the position or font of a term can contribute different ranking values to assist information retrieval. In our experiments, we found that the terms in document titles contain keywords. In future work, we may examine whether different term positions or fonts provide more discriminative power for information retrieval.
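One simple way to realize position-dependent ranking is to count each title occurrence of a term more heavily than a body occurrence. The sketch below is only an illustration of that idea; the function name and the boost value of 3.0 are arbitrary assumptions and have not been evaluated in our experiments.

```python
from collections import Counter

def position_weighted_tf(title_tokens, body_tokens, title_boost=3.0):
    """Term frequency where each title occurrence counts title_boost times."""
    weights = Counter()
    for tok in title_tokens:
        weights[tok] += title_boost
    for tok in body_tokens:
        weights[tok] += 1.0
    return dict(weights)

print(position_weighted_tf(["兩公約"], ["兩公約", "宣導"]))
```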