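Department of Digital Content and Technology, National Taichung University (國立臺中教育大學數位內容科技學系)
In-service Master's Program, Master's Thesis
Advisor: Dr. Chih-Hung Wu (吳智鴻)
網路斷詞技術應用於電子公文分類系統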
A Novel Web-based Word Segmentation Method
for Chinese E-Government Documents
Classification System
Graduate student: Chia-Hsiang Wu (吳嘉祥)
June 2010 (Republic of China year 99)
Acknowledgements
I would like to thank Dr. Chih-Hung Wu for advising me on the ideas, structure, and general problems of my thesis, and for being a great friend, teacher, and mentor.
Special thanks to Dr. Gwo-Jen Hwang and Dr. Ming-Hsiung Ying. With their advice and supervision, my thesis became more complete and solid.
I would also like to dedicate this thesis to my parents and my wife’s parents for their unwavering love and support and for serving as incredible models throughout my life.
Finally, I would like to thank my wife Jzybor and my son Webber for their support and encouragement. Thank you for being there for me when I needed you the most, and for providing consistent distractions to sustain my positive spirits.
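網路斷詞技術應用於電子公文分類系統
Department of Digital Content and Technology, National Taichung University
Graduate student: Chia-Hsiang Wu (吳嘉祥) Advisor: Dr. Chih-Hung Wu (吳智鴻)
Abstract (translated from the Chinese abstract)
Electronic official document management is an important part of integrating government information and knowledge. This study is concerned with how to classify the ever-growing volume of Chinese official documents correctly and promptly. Although many document classification techniques have been proposed, two problems remain in Chinese document classification: first, the system cannot segment new, out-of-vocabulary (OOV) words; second, the vector space is too large. When the system fails to segment OOV words, key terms are likely to be missed, which keeps classification accuracy from improving. Conventional word segmentation methods produce words that are too fragmentary or simply wrong, which inflates the vector space. When the vector space is too large, computation becomes too slow or infeasible. This study proposes a hybrid word segmentation method that uses the Web as a dictionary. Because the Web offers a more heterogeneous corpus than a conventional dictionary, it helps segment new words that may be keywords, as well as longer and more complete words. The segmented words are then expanded into vector space models with different term weighting methods in order to find a suitable weighting method that reduces vector space computation. Finally, a multi-class support vector machine is used to classify the electronic official documents. The proposed segmentation method makes it more feasible to segment new words automatically and to classify electronic official documents without a pre-built lexicon. We also suggest that using only the subject (主旨) part for Chinese word segmentation and document classification already gives good performance. Finally, regarding the choice of term weighting method and multi-class classifier, the results show that binary, tf, tfidf and tfchi combined with BTMSVM are suitable for practical use.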
A Novel Web-based Word Segmentation Method for
Chinese E-Government Documents Classification System
Department of Digital Content and Technology
National Taichung University
Student: Chia-Hsiang Wu Advisor: Dr. Chih-Hung Wu
Abstract
While managing corporate or government knowledge and information has many aspects, one key issue is how to classify e-government documents into the correct categories efficiently and promptly as their volume grows to unprecedented levels. Although many text classification methods have been proposed, two problems remain: segmenting out-of-vocabulary (OOV) words in Chinese documents, and huge-scale dimensionality. The first problem is that failing to segment OOV words is highly likely to miss absolute keywords in document classification. The second problem is that conventional Chinese word segmentation approaches may produce misspelled or over-productive segmentation results. Over-productive results lead to huge-scale dimensionality, which costs a great deal of computing time. In this thesis, we propose a novel hybrid classification method that overcomes the problems in segmenting OOV words and reduces the huge-scale dimensionality by means of segmented OOV words and longer, complete words, ideal term weighting methods, and suitable classifiers. First, we use the Internet as a corpus for its heterogeneous information; next, we consider absolute keywords for flexible feature selection. Then, the segmentation results are computed with different term weighting methods and expanded into vector space models (VSMs). Finally, the data sets in the VSMs are used as the input to an emerging technology in computational intelligence, the multi-class support vector machine (SVM), to classify Chinese e-government documents. The research results show the following. First, the proposed method, which establishes Chinese word segmentation without a pre-built lexicon or corpus, performs well overall in classifying Chinese e-government documents. Second, we suggest using the title of a Chinese e-government document for classification. Third, regarding the combination of feature selection methods and multi-class classifiers, the binary, tf, tfidf and tfchi term weighting methods combined with the BTMSVM classifier are suitable for practical use.
Keywords: Web-based Chinese word segmentation, term weighting methods, multi-class Chinese e-documents management
CONTENT
CHAPTER I INTRODUCTION
1.1 BACKGROUND AND RESEARCH MOTIVATION
1.2 RESEARCH OBJECTIVE
1.3 THESIS OUTLINE
CHAPTER II LITERATURE REVIEW
2.1 THE GROWING IMPORTANCE OF E-GOVERNMENT DOCUMENTS MANAGEMENT
2.2 METHODS OF CHINESE WORD SEGMENTATION
2.2.1 The smallest unit of a Chinese word
2.2.2 Existing methods of Chinese word segmentation
2.2.3 The benefits of relying on Web-based Corpus
2.3 CONCEPTS OF SUPPORT VECTOR MACHINE
2.3.1 Non-linear support vector machine
2.3.2 Multi-class support vector machines
2.4 REPRESENTATION OF TERM WEIGHTING METHODS
2.4.1 Vector Space Model
2.4.2 Bag of Words - Eight Term Weighting Methods
2.5 EVALUATION MEASURES
CHAPTER III DESIGN OF THE WEB-BASED CHINESE WORD SEGMENTATION
3.1 RESEARCH MODEL
3.2 CHINESE E-GOVERNMENT DOCUMENTS DATA
3.3 DATA PREPROCESSING
3.4 THE DESIGN OF PROPOSED GSP
3.4.1 The Internet Corpus as GSP Dictionaries
3.4.2 Maximum Matching Algorithm
3.5 TERM WEIGHT CALCULATION AND VECTOR SPACE MODEL
3.6 PERFORMANCE EVALUATION
3.7 EXAMPLES IN PRACTICAL OPERATION FOR OUR RESEARCH MODEL
CHAPTER IV EXPERIMENT RESULTS
4.1 EXPERIMENT RESULTS
4.2 DISCUSSION
CHAPTER V CONCLUSION AND FUTURE WORK
5.1 CONCLUSION
LIST OF TABLES
TABLE 1.1 THE COMPARISON OF WORD SEGMENTATION BY CKIP AND GSP
TABLE 1.2 DIFFERENT RESULTS AFFECT THE VSM SIZE
TABLE 2.1 ALL FREE LEGAL WORDS OF A SENTENCE
TABLE 3.1 DATASET WERE SUB-DIVIDED INTO FOUR CATEGORIES
TABLE 3.2 PUNCTUATION LIST
TABLE 3.3 RESULTS FROM QUERYING WIKIPEDIA
TABLE 3.4 THE REPRESENTATION OF EIGHT WEIGHTING METHODS
TABLE 3.5 TERM “教育部” IN ARTICLES WAS DENOTED A, B, C AND D
TABLE 3.6 TF MATRIX
TABLE 3.7 COMPARISON VALUES OF KEYWORDS BY DIFFERENT BOW MODELS
TABLE 3.8 THE RANDOM DISTRIBUTION INTO 5 FOLDS OF DATASET
TABLE 3.9 CONFUSION MATRIX AND COMMON EVALUATION MATRIX
TABLE 3.10 THE EVALUATION METHODS
TABLE 3.11 COLLECTED DOCUMENTS IN RAW CONTENTS AND SUB-DIVIDED CATEGORIES
TABLE 3.12 REMOVE NON CHINESE CHARACTERS
TABLE 3.13 THE COMPARISON OF RAW CONTENT AND EXTRACTED RESULTS
TABLE 3.14 THE REPRESENTATION OF BINARY MATRICES
TABLE 3.15 THE REPRESENTATION OF TF MATRICES
TABLE 3.16 THE REPRESENTATION OF TFIDF MATRICES
TABLE 3.17 THE REPRESENTATION OF TFCHI MATRICES
TABLE 3.18 THE REPRESENTATION OF TFRF MATRICES
TABLE 3.19 THE REPRESENTATION OF TFOR MATRICES
TABLE 3.20 THE REPRESENTATION OF TFIG MATRICES
TABLE 3.21 THE REPRESENTATION OF TFGR MATRICES
TABLE 4.1 CONTRIBUTION TO CHINESE WORD SEGMENTATION FROM TWO SOURCE CORPUS
TABLE 4.2 THE AVERAGE LENGTH OF EXTRACTED WORDS
TABLE 4.3 THE AVERAGE PERFORMANCE OF 5-FOLD CROSS VALIDATION WHERE DATASET IS FULL TEXT
TABLE 4.4 THE AVERAGE PERFORMANCE OF 5-FOLD CROSS VALIDATION WHERE DATASET IS TITLE
LIST OF FIGURES
FIG. 1.1 THE IDEAL EXCHANGE SERVICE OF E-GOVERNMENT DOCUMENTS
FIG. 1.2 RESEARCH SCHEME
FIG. 2.1 THE XML STRUCTURE OF A CHINESE E-GOVERNMENT DOCUMENT
FIG. 2.2 THE ABRIDGED RETURN OF WIKIPEDIA QUERY IN XML FORMAT
FIG. 2.3 THE ABRIDGED RETURN OF GOOGLE SUGGEST QUERY IN XML FORMAT
FIG. 2.4 THE ILLUSTRATION OF BTMSVM PROCESS
FIG. 2.5 THE REPRESENTATION OF VECTOR SPACE MODEL IN TEXT DOCUMENTS
FIG. 3.1 OVERVIEW OF THE RESEARCH MODEL
FIG. 3.2 THE DESIGN OF PROPOSED GSP
FIG. 3.3 THE PSEUDO CODE OF GSP
FIG. 3.4 THE PROCESSING OF WORD SEGMENTATION BY GSP
FIG. 3.5 DOCUMENTS WERE DENOTED WITH A, B, C AND D (LAN, ET AL., 2009)
FIG. 3.6 TAICHUNG HSIEN’S GOVERNMENT BULLETIN
FIG. 3.7 THE ONLINE DEMONSTRATION OF CKIP
FIG. 4.1 THE PERFORMANCE OF EXACT MATCH RATIO ON FULL TEXT
FIG. 4.2 THE PERFORMANCE OF MICRO-F1 ON FULL TEXT
FIG. 4.3 THE PERFORMANCE OF MACRO-F1 ON FULL TEXT
FIG. 4.4 THE PERFORMANCE OF EXACT MATCH RATIO ON TITLE
FIG. 4.5 THE PERFORMANCE OF MICRO-F1 ON TITLE
FIG. 4.6 THE PERFORMANCE OF MACRO-F1 ON TITLE
Chapter I
Introduction
In this chapter, we introduce the background and research motivation of the present study in Section 1.1 and discuss why the topic is important. We then derive the research objective in Section 1.2. In Section 1.3, we briefly outline the research scheme.
1.1 Background and Research Motivation
For a decade now, information exchange has become faster and faster as investment in building electronic government (e-government) website services in Taiwan has increased (Wen & Cheng, 2007). Since the mid-1990s, public organizations across the world and at different governmental levels have been applying Internet technologies in innovative ways to deliver services, engage citizens, and improve performance, and Taiwan is no exception (Lee, Xin, & Silvana, 2005).
However, the Internet is only the pipeline and storage system for knowledge exchange; it does not by itself create knowledge sharing in a corporate culture. With the convenience of Internet technology, the number of published documents is increasing enormously, and the ensuing need to organize these digital documents is unavoidable (Sebastiani, 2002; Wen & Cheng, 2007). Without proper management of these vast numbers of electronic documents, the Internet merely acts as a distributed, highly scattered storehouse (McDermott, 1999).
Carter and Bélanger (2005) indicate that citizens’ intention to use a local e-government service increases if they perceive the service to be easy to use. The government is not only collecting mountains of information but also making it accessible and usable for ordinary citizens (http://www.america.gov/st/usg-english/2009/June/20090623100625ajesrom0.2781488.html). In addition, Sprague Jr. (1995) indicated how document management can improve information management tasks and generate business value. Thus, managing e-government documents has become an important task, and many studies have focused on electronic document management systems (EDMS) (Hung, Tang, Chang, & Ke, 2009). The expectation of this thesis is illustrated in Fig. 1.1.
Fig. 1.1 The ideal exchange service of e-government documents.
However, despite its importance, there has so far been little investigation of Chinese e-government document management (An, 2009). Overall, while managing corporate or government knowledge and information has many aspects (Devadoss, Pan, & Huang, 2003; Layne & Lee, 2001), one key issue is how to classify Chinese e-government documents into the correct categories efficiently and promptly as the volume of electronic documents grows to unprecedented levels.
Out-of-vocabulary (OOV) words in Chinese word segmentation (Sproat & Emerson, 2003) and huge-scale dimensionality in the bag-of-words (BOW) model (Sahlgren & Cöster, 2004) are the two major challenges in document classification that interest us in this research, which focuses on information management for Chinese e-government documents.
OOV words take three forms: abbreviations (Chang & Lai, 2004; Sun, Wang, & Zhang, 2006), named entities (Gao, Li, Wu, & Huang, 2005), and compound nouns (Zhou, 2000). All three forms appear frequently in Chinese e-government documents.
1. The first is abbreviations. Government agencies sometimes abbreviate words in official electronic documents, e.g., “98 高中課程綱要” (98 senior course syllabuses) is abbreviated as “98 課綱”.
2. The second is named entities, e.g., “體健科” (Education Department Physical and Health Education Section) is the name of an official unit of Taichung Hsien, and “替代役” (substitute military service) is a new proper noun.
3. The third is compound nouns, which are often coined for new policies in Chinese e-government documents, e.g., “教育優先區” (educational priority area) combines “教育” (education), “優先” (priority), and “區” (area).
In natural language processing, many studies show that these three forms of OOV words not only worsen the performance of Chinese word segmentation but also lower the accuracy of document classification, since they cannot be segmented correctly. New words are one of the major problems in document classification because they cannot be recognized from a pre-built dictionary or corpus (Emerson, 2005).
Sproat and Emerson (2003) indicated that more than 60% of word segmentation errors result from new words that are not in the dictionary. Table 1.1 illustrates that even the most well-known Chinese word segmentation system, Chinese Knowledge Information Processing (CKIP) (Chen, C.-M., Liu, Chiu, & Lee, 2006; Xue, Xia, Chiou, & Palmer, 2005), has problems segmenting OOV words. In short, these so-called part-of-speech (POS) words have become frequently used words in Chinese electronic official documents, and they occur especially often in third-tier Chinese e-government documents.
Table 1.1 The comparison of word segmentation by CKIP and GSP.
String | CKIP | GSP
98 課綱 (98 senior course syllabus) | 98(Neu) 課(Na) 綱(Na) | 98 課綱
教育優先區 (Educational priority area) | 教育(Na) 優先(VH) 區(Nc) | 教育優先區
行政院衛生署 (Department of Health, Executive Yuan) | 行政院(Nc) 衛生署(Nc) | 行政院衛生署
攜手計畫課後扶助 (After School Alternative Program) | 攜手(D) 計劃(VF) 課(Na) 後(Ng) 扶助(VC) | 攜手計畫課後扶助
健康促進學校 (Taiwan Health Promoting School) | 健康(VH) 促進(VK) 學校(Nc) | 健康促進學校
國民教育教學輔導團 (Compulsory education advisory group) | 國民教育(Na) 教學(VA) 輔導團(Na) | 國民教育 教學輔導
資訊科技融入教學 (Integrating Technology into Instruction) | 資訊(Na) 科技(Na) 融入(VC) 教學(VA) | 資訊科技融入教學
新流感 (H1N1) | 新流(Na) 感(VK) | 新流感
Fortunately, using the Web to find candidate words for word segmentation appears to be a feasible method. Recently, many studies have tried to use the Web as a corpus to overcome OOV problems. Su, Lin, et al. (2007) incorporated Wikipedia as a live dictionary for Multilingual Information Retrieval (MLIR), which also deals with OOV terms. In addition, several studies have developed named entity disambiguation methods based on named entity denotations in Wikipedia (Bunescu & Pasca, 2006; Cucerzan, 2007). The method of Zhang and Vines (2004) automatically and correctly extracts translations of OOV terms, in both the Chinese-English and English-Chinese directions, from the Web by using the Google and Yahoo! search engines.
Second, huge-scale dimensionality is another problem. In the vector space model (VSM), most traditional text classification methods are based on variations of the “bag of words” (BOW) model, which rely on term frequency statistics. In the BOW model, a document is represented as a vector in the term space, i.e., $d = (f_1, \ldots, f_k)$, where $k$ is the size of the term set and each value $f_i$ is usually between 0 and 1 (Lan, Tan, Su, & Lu, 2009; Sebastiani, 2002).
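As a minimal sketch of the BOW representation described above, the following Python snippet builds a vector $d = (f_1, \ldots, f_k)$ from a segmented document. The vocabulary, the sample document, and the choice of relative term frequency as the value of each $f_i$ are assumptions made for illustration; they are not the weighting schemes evaluated in this thesis.

```python
from collections import Counter


def bow_vector(terms, vocabulary):
    """Return d = (f_1, ..., f_k), one relative frequency per vocabulary term."""
    counts = Counter(terms)
    total = sum(counts[t] for t in vocabulary)
    if total == 0:
        # No vocabulary term occurs in the document: all-zero vector.
        return [0.0] * len(vocabulary)
    # Dividing by the total keeps every f_i between 0 and 1.
    return [counts[t] / total for t in vocabulary]


# Hypothetical term set (k = 3) and hypothetical segmented document.
vocabulary = ["教育部", "新流感", "課綱"]
document = ["教育部", "新流感", "教育部", "課綱"]

d = bow_vector(document, vocabulary)
print(d)
```

Each position of the vector corresponds to one term of the vocabulary, so the length of `d` equals the term set size $k$.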
However, BOW models usually lead to high-dimensional computation because of misspelled Chinese word segmentation, corrupted collections (Hu, et al., 2008; Ko & Lee, 2001), and redundant term extraction (Kwok, 1997), since the composition of Chinese words is open-ended. Table 1.2 illustrates how different segmentation results affect the size of the vector space. The CKIP system segments the string “行政院衛生署疾病管制局” (Centers for Disease Control, Department of Health, Executive Yuan) as “行政院(Nc) 衛生署(Nc) 疾病(Na) 管制局(Nc)”. As named entity expressions, these segments are not wrong, but they occupy more space in the VSM, and a larger VSM means a higher cost in computing the BOW models. Our goal is to present another kind of correct named entity and to minimize the VSM, so these terms are recognized as one complete chunk, “行政院衛生署疾病管制局”. Thus, detecting more complete noun phrases is the main task of this research. The longer the segmented terms are, the smaller the VSM becomes (Fu, Kit, & Webster, 2008).
Table 1.2 Different results affect the VSM size.
Legal results of word segmentation | Occupied vector space size
行政院衛生署疾病管制局 | 1
行政院 衛生署 疾病 管制局 | 4
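The effect shown in Table 1.2 can be reproduced with a short Python sketch that counts the distinct terms produced by each segmentation. The helper name `vsm_dimension` is hypothetical, and the two token lists are simply the two rows of the table.

```python
def vsm_dimension(segmented_docs):
    """Count distinct terms, i.e. the number of dimensions of the vector space."""
    return len({term for doc in segmented_docs for term in doc})


# The same string segmented in the two ways listed in Table 1.2.
fine_grained = [["行政院", "衛生署", "疾病", "管制局"]]
single_chunk = [["行政院衛生署疾病管制局"]]

print(vsm_dimension(fine_grained), vsm_dimension(single_chunk))
```

Over a whole collection the saving depends on how often the shorter pieces also occur on their own, so this one-string example only illustrates the direction of the effect.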
Furthermore, another factor affects the dimensionality of the vector space: the term weighting method. The choice among different feature calculation models can have considerable impact on both quality and performance. An ideal term weighting method not only discriminates among candidate words and reduces the dimensionality of the vector space for efficient computation, but also eliminates noise from the feature set (Houle & Grira, 2007).
In electronic document classification, most studies pursue higher accuracy. Classification accuracy is indeed important when comparing with past classification by domain experts, but for high-dimensional, large-scale datasets such as electronic documents, vector space size and accuracy should be considered together to support further analysis (Wu, C.-H., Fang, Wu, Lin, & Li, 2009).
the problems in segmenting OOV words and reduce the huge-scale dimensionality, to assist Chinese e-government document managers in knowledge/information management. With such web dictionaries in hand, we can then look at how to use them for Chinese word segmentation so that proper nouns, abbreviations, and compound nouns are segmented more completely.
1.2 Research Objective
Here we propose a Good Segmentation aPproach (GSP). As mentioned earlier, the motivation for this thesis is to explore ways of finding and using the Web in Chinese word segmentation for e-government document classification. First, we use the Internet as a corpus for its heterogeneous information, so that OOV words can be segmented precisely. We focus on one main idea: the longer the segmented words are, the lower the dimensionality of the resulting vector space. Table 1.1 illustrates that our proposed model, GSP, successfully overcomes the OOV problem, whereas CKIP shatters the same OOV words into fragments.
We want to compare the results of our proposed system with those of CKIP. After text segmentation as pre-processing, we choose some popular feature selection methods, e.g., tfrf and tfidf, since many BOW models have been proposed as feature selection methods in the VSM (Ko & Lee, 2001). The representations computed by these term weighting methods are then used as the input dataset for an emerging technology in computational intelligence, the support vector machine (SVM), to classify Chinese e-government documents. SVM is a supervised learning algorithm that has achieved state-of-the-art results for binary document classification (Joachims, 1998). In addition, owing to the nature of Chinese e-government documents, the classification is a multi-class task. Therefore, we adopt a hierarchical SVM classifier, BTMSVM, to tackle our problem (Cheng, L., Zhang, Yang, & Ma, 2008).
1.3 Thesis Outline
The structure of this thesis is as follows. Chapter I introduces the background, motivation, and objective of the thesis. Chapter II briefly discusses related work on Chinese word segmentation and gives an overview of the state-of-the-art technical concepts relevant to information retrieval, followed by a summary of other relevant work. Chapter III provides an overview of the GSP model, its process, and the test corpus. Chapter IV presents the experimental results for the proposed method. Finally, Chapter V gives conclusions and discussion. An overview of the present study is illustrated in Fig. 1.2.
Chapter II
Literature Review
Sebastiani (2002) formally defined document classification: classification must be accomplished on the basis of endogenous knowledge only, namely, knowledge extracted from the documents. Thus, extracting document content precisely as index units is the first step. Section 2.1 describes the dataset. Section 2.2 discusses existing methods of Chinese word segmentation. Section 2.3 introduces the concepts of the support vector machine and multi-class support vector machine classifiers. Section 2.4 introduces the representation of term weighting methods in the VSM for feature selection. Section 2.5 introduces the evaluation measures.
2.1 The Growing Importance of E-government Documents Management
The past two decades have seen growing importance placed on intergovernmental research in electronic government (Hung, et al., 2009). One focus area is providing services for exchanging official electronic documents. Official documents are important tools by which government agencies perform their duties and communicate their views. Since official documents must be up to date and acted on speedily if they are to achieve their function, they are closely connected with administrative efficiency (http://www.rdec.gov.tw/ct.asp?xItem=4088069&CtNode=10091&mp=100).
The e-government program in Taiwan was initiated in 2001. One project of the program is to develop an electronic document exchange service. The Document and File Management Computerization Guidelines (1999 revision), enacted by the Executive Yuan in December 1999, call for changing the electronic document exchange format to XML (eXtensible Markup Language), a standard format employed across the Internet. By providing integrated web-based government services, e-documents can readily serve the goals of Government to Citizen (G2C), Government to Business (G2B), Government to Government (G2G), and Government to Employees (G2E) (http://www.rdec.gov.tw/ct.asp?xItem=13673&ctNode=3702&mp=100).
Currently, over 85% of official documents in the central government have adopted the electronic interchange mechanism (bimonthly publication of the Research, Development and Evaluation Commission), and over 70% in other government agencies. The revised guidelines also divide electronic document exchange at government agencies into three classes: Class 1 is the use of a third-party central processing service, Class 2 is direct point-to-point transmission, and Class 3 is posting by the issuer to an online bulletin board. Third-tier documents can be easily posted to an online bulletin board and read on the WWW (http://www.odbbs.gov.tw). Among these tiers, the third tier is the one most frequently relied on for immediate notification between government units.
The number of third-tier documents is increasing enormously because they are convenient and can be exchanged in a timely manner. Unlike documents of the other tiers, a third-tier document does not need to be exchanged between agencies or companies through a complex procedure. Publishing a third-tier document merely requires posting it on the bulletin board, which greatly reduces the time needed for notification.
The XML structure of Chinese e-government documents
rule in the not long future (Anttiroiko, 2005) (http://www.intelligenttaiwan.nat.gov.tw/). To make exchange between government agencies or companies easy, the National Archives Administration (NAA) in Taiwan initiated the National Archives Information System (NAIS) project for 2003-2006. Specifically, the NAA attempted to address several core implementation challenges, such as creating baseline rules for computerized record management and developing electronic record management systems to support a national electronic archives retrieval system that would meet security and authentication requirements. All agencies must periodically provide the NAA with a catalog of their records and archives in a pre-specified XML data format, via e-mail or on a website.
XML is designed to be easily readable by both humans and machines. The XML format, produced by the World Wide Web Consortium, aims at simplicity, generality, and usability over the Internet. Based on these rules, XML documents are flexible in declaring tags (W3C, XML 1.0). Even a simple XML document contains tags annotated with attribute-value pairs. In other words, tags defined in XML are self-describing and are readable not only by machines, through a parser program, but also by humans (Khare & Rifkin, 1997). Fig. 2.1 illustrates a sample XML document of a Class 3 Chinese e-government document. The tags are not only structural; their names, at the beginning and end of each element, also annotate their function. For example, the text between the pair of tags <description> and </description> is “主旨:檢送本縣「98學年度疑似身心障礙特殊教育學生鑑定及安置工作」…”. We can easily see that the text between the “description” tags is the main content of the Chinese e-government document, and the tag name annotates its role. Therefore, Chinese e-government documents in the following XML format can be easily read by people and also parsed by computers.
    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
      <channel>
        <title>
          特教科-◎最速件◎有關本縣 98 學年度疑似身心障礙特殊教育學生鑑定及安置工作說明會相關事宜。
        </title>
        <link>
          http://www.tcc.edu.tw/news/board_show.php?bo_no=62330
        </link>
        <description>
          主旨:檢送本縣「98 學年度疑似身心障礙特殊教育學生鑑定及安置工作」說明會日程表1 份,請各校確依說明段辦理,請 查照。
          說明: 2009-- 各校輔導主任應請親自出席工作說明會,如未能親自出席,請務必指派負責特教業務之承辦人(或組長)參加,以利鑑定工作推動…
        </description>
        <pubDate>
          2009-09-23
        </pubDate>
        <guid>
          http://www.tcc.edu.tw/news/board_show.php?bo_no=62330
        </guid>
      </channel>
    </rss>
Fig. 2.1 The XML structure of a Chinese e-government document.
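To illustrate how such a document can be parsed by a program, the following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The string `SAMPLE_RSS` is an abridged copy of the sample in Fig. 2.1, and the function name `extract_fields` and the choice of fields are assumptions for illustration, not part of the system described in this thesis.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the Fig. 2.1 sample (XML declaration omitted for brevity).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>特教科-◎最速件◎有關本縣 98 學年度疑似身心障礙特殊教育學生鑑定及安置工作說明會相關事宜。</title>
    <link>http://www.tcc.edu.tw/news/board_show.php?bo_no=62330</link>
    <description>主旨:檢送本縣「98 學年度疑似身心障礙特殊教育學生鑑定及安置工作」說明會日程表1 份</description>
    <pubDate>2009-09-23</pubDate>
  </channel>
</rss>"""


def extract_fields(rss_text):
    """Return the title, description and publication date of the first channel."""
    channel = ET.fromstring(rss_text).find("channel")
    return {
        tag: (channel.findtext(tag) or "").strip()
        for tag in ("title", "description", "pubDate")
    }


fields = extract_fields(SAMPLE_RSS)
print(fields["pubDate"])
print(fields["description"])
```

Because the tag names describe their content, the title and the description can be selected by name without any knowledge of the document layout.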
2.2 Methods of Chinese Word Segmentation
2.2.1 The smallest unit of a Chinese word
In a Chinese sentence, the smallest unit is a single character, but the smallest word is composed of at least two adjacent characters. Unlike English, a Chinese sentence has no obvious blanks as word boundaries to separate words. In addition, every Chinese character can either be a morpheme or be freely combined with others to form a legal word (Tsai, Sung, & Hsu, 2003; Wong & Chan, 1996). Taking “對抗新流感” (fight against H1N1) as an example, the composition of the string can be defined as:
對 抗 新 流 感
$S = C_1 C_2 C_3 C_4 C_5$
where $C_1, C_2, \ldots, C_n$ are the $n$ characters of the string.
This yields the results in Table 2.1, which shows that a string of five characters can be segmented into various acceptable results. Although even a single character can be a morpheme, representing every character in the VSM is conceptually unfeasible and implausible because of the huge dimensionality and inefficient computation. Therefore, Chinese word segmentation is always the first and a vital step in natural language processing (NLP) (Chen, Aitao, He, Xu, Gey, & Meggs, 1997; Chen, K.-J. & Bai, 1998; Chen, K.-J. & Liu, 1992).
Table 2.1 All free legal words of a sentence.
Morphemes/words | Description | Denotation | Length
對抗 | fight against | $C_1 C_2$ | 2
新流感 | H1N1 | $C_3 C_4 C_5$ | 3
對 | right | $C_1$ | 1
抗 | resist | $C_2$ | 1
新 | new | $C_3$ | 1
流 | flow | $C_4$ | 1
感 | feel | $C_5$ | 1
流感 | flu | $C_4 C_5$ | 2
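The entries of Table 2.1 can be enumerated mechanically. The sketch below lists every contiguous substring of the example string that appears in a small lexicon; the lexicon contents and the function name `candidate_words` are assumptions chosen to match the table, not a resource used in this thesis.

```python
def candidate_words(sentence, lexicon):
    """List every contiguous substring of `sentence` that appears in `lexicon`.

    Each result is (word, first, last) with 1-based character positions,
    so ("流感", 4, 5) corresponds to C4 C5.
    """
    found = []
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, n + 1):
            piece = sentence[start:end]
            if piece in lexicon:
                found.append((piece, start + 1, end))
    return found


# Hypothetical lexicon holding exactly the entries of Table 2.1.
LEXICON = {"對", "抗", "新", "流", "感", "對抗", "流感", "新流感"}

words = candidate_words("對抗新流感", LEXICON)
for word, first, last in words:
    print(word, first, last, len(word))
```

A string of n characters has n(n+1)/2 contiguous substrings, which is why the number of candidates grows quickly with sentence length.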
In this study, we also extract only words that are at least two characters long, whether by CKIP or by our proposed GSP system, since many studies of Chinese word segmentation rely on units of at least two adjacent characters (Chen, Aitao, 2003; Chen, Aitao, et al., 1997; Wu, A. & Jiang, 2000).
2.2.2 Existing methods of Chinese word segmentation
A number of Chinese word segmentation techniques have been explored in the literature. Generally speaking, they can be classified into dictionary-based methods, statistical-based methods, semantic knowledge-based methods (Teahan, Rodger, Yingying, & Ian, 2000), and hybrid word identification methods (Hong, Chen, & Chiu, 2009).
Dictionary-based word identification is the basic method in Chinese word segmentation. Given a pre-built dictionary, an input string is compared with the words in the dictionary to find the one that matches the greatest number of characters of the string.
Maximum matching is one of the most popular dictionary-based structural segmentation algorithms for Chinese texts. This method favors long words and is a greedy algorithm by design (Wong & Chan, 1996). There are two basic strategies for Chinese word identification with a maximum matching algorithm that refers to a lexicon: matching may proceed from left to right (i.e., forward match) or from right to left (i.e., backward match) (Cheng, K.-S., Young, & Wong, 1999; Goh, Asahara, & Matsumoto, 2005; Teahan, et al., 2000). The concept rests on two elements:
1. the longest-matching heuristic, and
2. the direction of matching, which is either from the beginning of a sentence to its end or from the end to its beginning.
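The two elements above can be sketched in Python as follows. This is a generic illustration of forward and backward maximum matching, not the GSP procedure of this thesis; the lexicon, the `max_len` limit, and the fallback to single characters when nothing matches are assumptions.

```python
def forward_max_match(text, lexicon, max_len=6):
    """Greedy longest match, scanning from the beginning of the text."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len=6):
    """Greedy longest match, scanning from the end of the text."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from right to left
    return tokens


LEXICON = {"對抗", "新流感", "流感"}  # hypothetical lexicon
print(forward_max_match("對抗新流感", LEXICON))
print(backward_max_match("對抗新流感", LEXICON))
```

For this string both directions agree; on ambiguous strings the two scans can produce different segmentations, which is why the direction is treated as a separate design choice.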
Chinese Knowledge Information Processing system
One of the most well-known segmentation systems in research on Chinese word segmentation is Chinese Knowledge Information Processing (CKIP) (Yang, W.-S. & Jan, 2009). It is based on a variant of the dictionary-based approach that improves new-word guessing ability (Chen, C.-M. & Liu, 2009). CKIP is a research team formed in 1986 by the Institute of Information Science and the Institute of Linguistics of Academia Sinica, and it provides an online demonstration interface at http://ckipsvr.iis.sinica.edu.tw/ (Su, et al., 2007). The CKIP system has more than 80,000 entries. Given a word-segmented Chinese corpus of 5 million words, 3.51% of the words are not listed in the CKIP lexicon (Chen, K.-J. & Ma, 2002).
The statistical-based approach can be defined as a purely mathematical process that extracts words from sentences. Given a sequence of characters, it uses a statistical measure (e.g., mutual information) to find the relationship between adjacent characters and then to predict the likelihood of the next character.
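As one possible reading of the mutual information measure mentioned above, the sketch below computes pointwise mutual information (PMI) for each pair of adjacent characters. The three-sentence corpus is invented for illustration, and estimating probabilities by simple relative frequency is an assumption; the cited statistical systems may define the measure differently.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Return log2(p(ab) / (p(a) * p(b))) for every adjacent character pair."""
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)                     # single-character counts
        pairs.update(zip(sentence, sentence[1:]))  # adjacent-pair counts
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    pmi = {}
    for (a, b), count in pairs.items():
        p_ab = count / n_pairs
        p_a = chars[a] / n_chars
        p_b = chars[b] / n_chars
        pmi[a + b] = math.log2(p_ab / (p_a * p_b))
    return pmi


# Hypothetical toy corpus; a real system would need far more text.
corpus = ["對抗新流感", "新流感疫苗", "流感疫苗"]
scores = adjacent_pmi(corpus)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 2))
```

A positive score means the two characters co-occur more often than chance would predict. On a corpus this small the scores are dominated by rare characters, which reflects the dependence on large training data discussed in this section.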
Flaws of existing methods
These approaches certainly have drawbacks. First, the dictionary-based method performs well only after a corpus that is as complete as possible has been built manually; however, it is unfeasible and unnecessary to have a truly complete dictionary containing all possible words used in Chinese texts. Second, the effectiveness of the statistical-based method depends significantly on large quantities of particular text as training data (Gao, et al., 2005). Third, the semantic knowledge method needs to involve more linguistic information (Teahan, et al., 2000).
2.2.3 The benefits of relying on Web-based Corpus
In recent publications, more and more studies have tried to use the Internet for word segmentation instead of predefined dictionaries or raw corpora. Kilgarriff and Grefenstette (2003) observed that the Web contains enormous quantities of text, in numerous languages and language types, on a vast array of topics; in their view, the Web is a fabulous linguists’ playground. Bunescu and Pasca (2006) and Cucerzan (2007) used Wikipedia’s knowledge to detect and disambiguate named entities. Bøhn (2009) used Wikipedia to automatically generate a dictionary of named entities and synonyms that all refer to the same entity. Tan and Peng (2008) considered query segmentation to be similar to Chinese word segmentation; they therefore proposed a model that combines query segmentation with the Web, especially Wikipedia, to provide additional evidence for concept discovery. Su, et al. (2007) used Wikipedia as a tool to translate OOV terms in multilingual information retrieval. Gabay, Eliahu, and Elhadad (2008) incorporated Wikipedia as a live dictionary to solve MLIR problems.
Wikipedia (http://www.wikipedia.org) is a web-based, free-content encyclopedia project in which anyone can edit or create articles. There are more than 270 thousand Chinese articles, and because so many volunteers all over the world co-edit Wikipedia, the number of articles is still growing. In short, Wikipedia is a very dynamic and quickly growing resource, and articles about newsworthy events are often added within days of their occurrence. For a given query, Wikipedia returns, in XML format, links to the entries that describe the same topic, if such entries exist. The service is available at:
http://zh.wikipedia.org/w/api.php?action=opensearch&limit=10&format=xml&search=qs
where qs is the sent query string. The following is the abridged result while we
set qs =”教育”: Therefore, after parsing the structure tags, we view texts as
candidate words between the pair tags “<Text >”, which were belonged in each pair tags “<item>”
<SearchSuggestion version="2.0"> <Query xml:space="preserve">教育</Query> <Section> <Item> <Text xml:space="preserve">教育</Text> <Description xml:space="preserve"> 教育,通常有廣義和狹義兩種概念… </Description> <Url xml:space="preserve">http://zh.wikipedia.org/zh-tw/…</Url> </Item> <Item> <Text xml:space="preserve">教育優先區</Text>
<Description xml:space="preserve">教育優先區(educational priority area,簡稱EPA)</Description> <Url xml:space="preserve">http://zh.wikipedia.org/zh-tw/… </Url> </Item> </Section> </SearchSuggestion>
Fig. 2.2 The abridged return of Wikipedia query in XML format.
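The parsing step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `wikipedia_candidates` is hypothetical, the sample string is an abridged copy of Fig. 2.2 with descriptions and URLs left out, and the response is assumed to carry no XML namespace, as in the figure (a live response may declare one, in which case the tag names would need the namespace prefix).

```python
import xml.etree.ElementTree as ET

# Abridged copy of the response in Fig. 2.2 (descriptions and URLs omitted).
WIKI_SAMPLE_XML = """<SearchSuggestion version="2.0">
  <Query xml:space="preserve">教育</Query>
  <Section>
    <Item><Text xml:space="preserve">教育</Text></Item>
    <Item><Text xml:space="preserve">教育優先區</Text></Item>
  </Section>
</SearchSuggestion>"""


def wikipedia_candidates(xml_text):
    """Return the text of every <Text> element nested in an <Item> element."""
    root = ET.fromstring(xml_text)
    return [text.text
            for item in root.iter("Item")
            for text in item.iter("Text")]


wiki_candidates = wikipedia_candidates(WIKI_SAMPLE_XML)
```

Each returned string is a candidate word; deciding which candidate to keep is left to the segmentation method itself.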
Google Suggest, provided by Google, is an assistive function of the search box. As we type, Google Suggest offers suggested terms; as we type more, it keeps communicating with Google and returns longer, related terms that match what has been typed (http://www.google.com/support/websearch/bin/answer.py?hl=en&answer=106230). According to Google’s online documentation, this facility suggests term completions based on the overall popularity of search strings. Because the suggested terms are ranked by how often people query them and are meaningful and complete, we can treat them as candidate words in our word segmentation work (Cock & Cornelis, 2005). Google Suggest also returns its suggestions in XML format. The service is available at:
http://www.google.com/complete/search?output=toolbar&hl=zh-TW&js=true&qu=qs
where qs is the query string sent. The following is the abridged result when we set qs = “新流”. After parsing the structure tags, we treat the value of the data attribute of each <suggestion> tag, which is nested in a pair of <CompleteSuggestion> tags, as a candidate word.
<toplevel>
  <CompleteSuggestion>
    <suggestion data="新流感"/>
    <num_queries int="15300000"/>
  </CompleteSuggestion>
  <CompleteSuggestion>
    <suggestion data="新流感疫苗"/>
    <num_queries int="6580000"/>
  </CompleteSuggestion>
</toplevel>
Fig. 2.3 The abridged return of Google Suggest query in XML format.
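A similar sketch applies to the Google Suggest response. The function name `google_candidates` is hypothetical, the sample string copies the abridged result of Fig. 2.3, and sorting the candidates by the `num_queries` count is an illustrative choice rather than a step stated in the text.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the response in Fig. 2.3.
GOOGLE_SAMPLE_XML = """<toplevel>
  <CompleteSuggestion>
    <suggestion data="新流感"/>
    <num_queries int="15300000"/>
  </CompleteSuggestion>
  <CompleteSuggestion>
    <suggestion data="新流感疫苗"/>
    <num_queries int="6580000"/>
  </CompleteSuggestion>
</toplevel>"""


def google_candidates(xml_text):
    """Return (candidate word, query count) pairs, most popular first."""
    root = ET.fromstring(xml_text)
    pairs = []
    for entry in root.iter("CompleteSuggestion"):
        word = entry.find("suggestion").get("data")
        count = int(entry.find("num_queries").get("int"))
        pairs.append((word, count))
    # Assumption: popularity order is useful when choosing among candidates.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


google_pairs = google_candidates(GOOGLE_SAMPLE_XML)
```

Both suggestions extend the two-character query “新流” into longer, complete words, which is the property the segmentation method relies on.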
2.3 Concepts of Support Vector Machine
The SVM was first developed by Vapnik (Vapnik, 1999) and Cortes and Vapnik (Cortes & Vapnik, 1995) and is mainly used for classifying multidimensional data. It performs well in application areas that include text categorization, image recognition, hand-written digit recognition and bioinformatics. The SVM is based on statistical learning theory; it uses the structural risk minimization (SRM) principle in place of the empirical risk minimization (ERM) principle used by most learning methods.
Further, it seeks a hyperplane in a high-dimensional space that separates the data so as to minimize the error ratio. SRM has been shown to be superior to the traditional ERM principle employed by conventional neural networks: SRM minimizes an upper bound of the generalization error, whereas ERM minimizes the error on the training data. Thus, the solution of the SVM may be a global optimum while other neural network models tend to fall into a local optimum, and overfitting is unlikely to occur with the SVM (Cristianini & Shawe-Taylor, 2000; Hearst, 1998; Kim, 2003).
2.3.1 Non-linear support vector machine
Given training data $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i$ are the input vectors and $y_i$ are the associated output values of $x_i$ $(1 \le i \le n)$, the basic idea in the non-linear SVM is to map the data $x$ into a high-dimensional feature space via a mapping function $\Phi(x)$, which is determined by a kernel function selected by the user in advance.
By replacing the inner product with a kernel function, a linear method can handle a non-linear pattern problem: the kernel implicitly performs a non-linear mapping from the input space to a high-dimensional feature space $F$ (Vapnik, 1999). The feature map for a Mercer kernel satisfies $k(x, y) = \Phi(x)^T \Phi(y)$.
According to the concept of SVMs, a data point is viewed as an $n$-dimensional vector (a list of $n$ numbers), and we want to know whether such points can be separated by an $(n-1)$-dimensional hyperplane. Given a finite set of observed data vectors $x_i$, $i = 1, \ldots, n$, where $x \in R^n$, the basic idea in designing a non-linear SVM model is to map the input vector $x \in R^n$ into vectors $z$ of a higher-dimensional feature space $F$ ($z = \Phi(x)$, where $\Phi$ denotes the mapping $R^n \to R^f$), and to solve a linear classification problem in the feature space:
\[ x \in R^n \to z(x) = [a_1\phi_1(x), a_2\phi_2(x), \ldots, a_n\phi_n(x)]^T \in R^f \tag{1} \]
Currently, the popular kernel functions in machine learning theories are as follows (Campbell, 2002; Vojislav, 2001):
Gaussian (RBF) kernel:
\[ k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \tag{2} \]
Polynomial kernel:
\[ k(x_i, x_j) = (x_i^T x_j + t)^d \tag{3} \]
Linear kernel:
\[ k(x_i, x_j) = x_i^T x_j \tag{4} \]
Multilayer perceptron:
\[ k(x_i, x_j) = \tanh(x_i^T x_j + b) \tag{5} \]
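The four kernels listed above can be written as plain functions. This is a minimal sketch, not the implementation used in the study: the function names and default parameter values are illustrative assumptions, with `sigma` standing for the standard deviation of the Gaussian kernel, `t` and `d` for the intercept and degree of the polynomial kernel, and `b` for the constant of the multilayer perceptron kernel.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def polynomial_kernel(x, y, t=1.0, d=2):
    """Polynomial kernel: (x^T y + t)^d."""
    return (dot(x, y) + t) ** d


def linear_kernel(x, y):
    """Linear kernel: x^T y."""
    return dot(x, y)


def mlp_kernel(x, y, b=0.0):
    """Multilayer perceptron kernel: tanh(x^T y + b)."""
    return math.tanh(dot(x, y) + b)
```

For example, `polynomial_kernel([1, 2], [3, 4], t=1, d=2)` evaluates $(11 + 1)^2$, and `rbf_kernel` returns 1 whenever both arguments are the same point.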
In Eq. (2), $x_i$ and $x_j$ are input space vectors, and $\sigma^2$ denotes the variance in the Gaussian kernel. In Eq. (3), $t$ is the intercept constant term and $d$ is the degree of the polynomial. In Eq. (5), a certain value of $b$ is used; this constant appears only in the multilayer perceptron kernel. The learning algorithm for a non-linear classifier SVM follows the design of an optimal separating hyperplane in a feature space. The procedure is associated with hard and soft margin classifier SVMs in the x-space. Accordingly, the dual Lagrangian in the z-space is
\[ L_d(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j z_i^T z_j \tag{6} \]
Further, by using the chosen kernels, we can maximize the Lagrangian as follows:
Maximize:
\[ L_d(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j k(x_i, x_j) \tag{7} \]
Subject to:
\[ \alpha_i \ge 0, \quad i = 1, \ldots, l, \tag{8} \]
\[ \sum_{i=1}^{l} \alpha_i y_i = 0. \tag{9} \]
Note that the constraints must be revised for use in a non-linear soft margin classifier SVM. The only difference between these constraints and those of the separable non-linear classifier concerns the upper bound $C$ on the Lagrange multipliers $\alpha_i$. Consequently, the constraints of the optimization problem become
Subject to:
\[ C \ge \alpha_i \ge 0, \quad i = 1, \ldots, l, \tag{10} \]
\[ \sum_{i=1}^{l} \alpha_i y_i = 0. \tag{11} \]
In this way, the influence of training data points that remain on the wrong side of the separating non-linear hypersurface is limited. The decision hypersurface $d(x)$ and the indicator function $i_F(x)$, which are determined by the non-linear SVM classifier, are as follows:
\[ d(x) = \sum_{i=1}^{l} y_i \alpha_i k(x_i, x) + b, \tag{12} \]
\[ i_F(x) = \mathrm{sign}(d(x)) = \mathrm{sign}\left(\sum_{i=1}^{l} y_i \alpha_i k(x_i, x) + b\right). \tag{13} \]
Depending upon the chosen kernel, the bias term $b$ may be an implicit part of the kernel function. For example, $b$ is not required when Gaussian RBFs are used as kernels. With other kernel functions, for which $b$ is included, the non-linear SVM classifier is as follows:
\[ i_F(x) = \mathrm{sign}(d(x)) = \mathrm{sign}\left(\sum_{i=1}^{l} y_i \alpha_i k(x_i, x) + b\right) = \mathrm{sign}\left(\sum_{s=1}^{\text{number of SVs}} y_s \alpha_s k(x_s, x) + b\right). \tag{14} \]
2.3.2 Multi-class support vector machines
Because the SVM is by nature a binary classifier, a multi-class task requires a strategy for combining several classifiers.
Review of strategies of multi-class support vector machines
At present, there are some basic types of approaches for multi-class SVMs – including OAOSVM, OAASVM, and BSVM. One is considering “One Against One (OAO) (Ulrich & el, 1999)” method: all data were to decompose the multi-class (e.g.
Although the OAO approach demonstrates superior performance, it may require prohibitively expensive computing resources for many real-world problems. Another is the “One Against All (OAA)” method (Liu & Zheng, 2005), which shows somewhat lower accuracy but still demands heavy computing resources, especially for real-time applications (Cheong, oh, & Lee, 2004). The third is “BSVM” (Hsu & Lin, 2002b), which handles multi-class classification using Crammer and Singer's formulation (Koby & Yoram, 2002).
OAOSVM (Ulrich & el, 1999)
OAOSVM constructs $k(k-1)/2$ classifiers where each one is trained on data from two classes. For training data from the $i$th and the $j$th classes, we solve the following binary classification problem:
\[ \min_{w^{ij},\, b^{ij},\, \xi^{ij}} \ \frac{1}{2}(w^{ij})^T w^{ij} + C \sum_t \xi_t^{ij} \]
\[ (w^{ij})^T \psi(x_t) + b^{ij} \ge 1 - \xi_t^{ij}, \quad \text{if } y_t = i, \]
\[ (w^{ij})^T \psi(x_t) + b^{ij} \le -1 + \xi_t^{ij}, \quad \text{if } y_t = j, \]
\[ \xi_t^{ij} \ge 0. \tag{15} \]
In classification, the “Max Wins” voting strategy is suggested: if $\mathrm{sign}((w^{ij})^T \psi(x) + b^{ij})$ says $x$ is in the $i$th class, then the vote for the $i$th class is increased by one. Otherwise, the vote for the $j$th class is increased by one. Then we predict that $x$ is in the class with the largest vote (Hsu & Lin, 2002a).
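The “Max Wins” voting strategy can be sketched as follows. The function `max_wins` and the toy pairwise rule are hypothetical: in practice each pairwise decision would come from a trained binary SVM, and breaking ties in favour of the smallest class label is an assumption made here for determinism.

```python
from collections import Counter
from itertools import combinations


def max_wins(x, classes, pairwise_decision):
    """Vote over all k(k-1)/2 class pairs and return the class with most votes.

    pairwise_decision(i, j, x) must return either i or j.
    """
    votes = Counter({label: 0 for label in classes})
    for i, j in combinations(classes, 2):
        votes[pairwise_decision(i, j, x)] += 1
    best = max(votes.values())
    # Assumption: ties are broken by taking the smallest label.
    return min(label for label in classes if votes[label] == best)


# Toy stand-in for the trained pairwise SVMs: pick the nearer class center.
CENTERS = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}


def nearest_center(i, j, x):
    return i if abs(x - CENTERS[i]) <= abs(x - CENTERS[j]) else j


oao_prediction = max_wins(2.2, [0, 1, 2, 3], nearest_center)
oao_pair_count = len(list(combinations([0, 1, 2, 3], 2)))
```

With four classes, as in the data set of this study, six pairwise classifiers take part in the vote.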
OAASVM (Liu & Zheng, 2005)
OAASVM constructs $k$ SVM classifiers, where $k$ is the number of classes. The $i$th SVM is trained with all of the examples in the $i$th class with positive labels, and all other examples with negative labels. Thus, given $l$ training data $(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i \in R^n$, $i = 1, \ldots, l$, and $y_i \in \{1, \ldots, k\}$ is the class of $x_i$, the $i$th SVM solves the following problem:
\[ \min_{w^{i},\, b^{i},\, \xi^{i}} \ \frac{1}{2}(w^{i})^T w^{i} + C \sum_{j=1}^{l} \xi_j^{i} \]
\[ (w^{i})^T \psi(x_j) + b^{i} \ge 1 - \xi_j^{i}, \quad \text{if } y_j = i, \]
\[ (w^{i})^T \psi(x_j) + b^{i} \le -1 + \xi_j^{i}, \quad \text{if } y_j \ne i, \]
\[ \xi_j^{i} \ge 0, \quad j = 1, \ldots, l, \tag{16} \]
where the training data $x_i$ are mapped to a higher dimensional space by the function $\psi$ and $C$ is the penalty parameter.
The decision function assigns $x$ to the class whose decision value is the largest:
\[ \text{class of } x \equiv \arg\max_{i = 1, \ldots, k} \left((w^{i})^T \psi(x) + b^{i}\right) \tag{17} \]
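The OAA decision rule amounts to an arg max over the k decision values. The sketch below assumes, purely for illustration, that the mapping is the identity (a linear kernel) so that each classifier is a weight vector and a bias; the function name `oaa_predict` and the numbers are hypothetical, and classes are indexed from 0.

```python
def oaa_predict(x, weights, biases):
    """Return the index of the class with the largest decision value.

    weights[i] and biases[i] describe the i-th one-against-all classifier;
    the feature mapping is assumed to be the identity.
    """
    scores = [
        sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Three hypothetical classifiers in a two-dimensional input space.
OAA_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
OAA_BIASES = [0.0, 0.0, 0.0]

oaa_prediction = oaa_predict([0.2, 0.9], OAA_WEIGHTS, OAA_BIASES)
```

Unlike the one-against-one vote, only k classifiers are evaluated, but each one is trained on the whole data set.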
BSVM (Hsu & Lin, 2002a)
BSVM uses Crammer and Singer’s formulation (Koby & Yoram, 2002). They proposed an approach for the multi-class problem by formulating the following primal problem:
\[ \min_{\{w_m\},\, \{\xi_i\}} \ \frac{1}{2} \sum_{m} \|w_m\|^2 + C \sum_{i} \xi_i \quad \text{s.t.} \quad w_{y_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i, \quad \forall m, i \tag{18} \]
where $C > 0$ is the regularization parameter, $w_m$ is the weight vector associated with class $m$, $e_i^m = 1 - \delta_{y_i,m}$, and $\delta_{y_i,m} = 1$ if $y_i = m$, $\delta_{y_i,m} = 0$ if $y_i \ne m$. Note that, in (18), the constraint for $m = y_i$ corresponds to the non-negativity constraint $\xi_i \ge 0$. The decision function is
\[ \arg\max_{m} \ w_m^T x. \tag{19} \]
Several works have investigated combining binary SVMs in a hierarchical structure (Cheng, L., et al., 2008; Cheong, et al., 2004). A hierarchical multi-class SVM with binary tree architecture divides the data set into two subsets at each level, from the root to the leaves, until every subset consists of only one class. Hierarchical SVM models take advantage of both the efficient computation of the tree architecture and the high classification accuracy of SVMs; one example is the binary tree multi-class support vector machine (BTMSVM) (Cheng, L., et al., 2008).
Improved hierarchical multi-class SVM
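BTMSVM (Cheng, L., et al., 2008) is based on class similarity in the feature space. The definitions are as follows:
\[ m_i = \frac{1}{l_i} \sum_{s=1}^{l_i} \Phi(x_s), \tag{20} \]
where $m_i$ is the center of class $i$ in the feature space and $l_i$ is the number of samples in class $i$.
\[ d_{ij} = \|m_i - m_j\|, \tag{21} \]
\[ \|m_i - m_j\|^2 = \left\| \frac{1}{l_i}\sum_{s=1}^{l_i}\Phi(x_s) - \frac{1}{l_j}\sum_{t=1}^{l_j}\Phi(x_t) \right\|^2 = \frac{1}{l_i^2}\sum_{s,s'=1}^{l_i} K(x_s, x_{s'}) - \frac{2}{l_i l_j}\sum_{s=1}^{l_i}\sum_{t=1}^{l_j} K(x_s, x_t) + \frac{1}{l_j^2}\sum_{t,t'=1}^{l_j} K(x_t, x_{t'}), \tag{22} \]
where $d_{ij}$ is the distance between class $i$ and class $j$ in the feature space, $x_s$ ranges over the samples of class $i$ and $x_t$ over the samples of class $j$.
\[ R_i^2 = \max_{t = 1, \ldots, l_i} \|\Phi(x_t) - m_i\|^2 = \max_{t = 1, \ldots, l_i} \left[ K(x_t, x_t) - \frac{2}{l_i}\sum_{s=1}^{l_i} K(x_t, x_s) + \frac{1}{l_i^2}\sum_{s,s'=1}^{l_i} K(x_s, x_{s'}) \right], \tag{23} \]
where $R_i$ is the radius of the class $i$ in the feature space.
The hierarchical order is constructed by searching for the biggest similarity between classes. Hence,
\[ \mathrm{similarity}(i, j) = \frac{R_i^2 + R_j^2}{\|m_i - m_j\|^2} \tag{24} \]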
Then we can get the construction order by the following steps (see Fig. 2.4).
Step 1: Compute the similarity between each pair of classes based on Eq. (24).
Step 2: Select the two classes with the biggest similarity$(i, j)$ to train an SVM, and unite these two classes.
Step 3: Repeat the computation of the maximum similarity in each node of the sub-tree to decide how the classes are divided into two groups to form the next node, in the same way as Step 2.
Step 4: Execute Step 2 and Step 3 while the number of classes is bigger than 2.
Step 5: Construct the uppermost SVM of the binary tree when the number of classes equals 2.
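Steps 1 and 2 can be sketched with kernel evaluations only, since the class centers never need to be formed explicitly. The sketch below is illustrative and rests on assumptions: the similarity is taken as $(R_i^2 + R_j^2) / \|m_i - m_j\|^2$, which is an assumed reading of Eq. (24); a linear kernel is used by default; and all function names and the toy data are hypothetical.

```python
from itertools import combinations


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def mean_kernel(A, B, k):
    """Average of k(a, b) over all a in A and b in B."""
    return sum(k(a, b) for a in A for b in B) / (len(A) * len(B))


def center_distance_sq(A, B, k=linear_kernel):
    """Squared distance between the two class centers in feature space."""
    return mean_kernel(A, A, k) - 2.0 * mean_kernel(A, B, k) + mean_kernel(B, B, k)


def radius_sq(A, k=linear_kernel):
    """Largest squared distance from the class center to one of its samples."""
    self_term = mean_kernel(A, A, k)
    return max(k(x, x) - 2.0 * mean_kernel([x], A, k) + self_term for x in A)


def similarity(A, B, k=linear_kernel):
    # Assumed form of the similarity: (R_i^2 + R_j^2) / ||m_i - m_j||^2.
    return (radius_sq(A, k) + radius_sq(B, k)) / center_distance_sq(A, B, k)


def most_similar_pair(classes, k=linear_kernel):
    """classes maps a label to its list of sample vectors (Steps 1 and 2)."""
    return max(
        combinations(sorted(classes), 2),
        key=lambda pair: similarity(classes[pair[0]], classes[pair[1]], k),
    )


TOY_CLASSES = {
    "a": [[0.0, 0.0], [0.0, 2.0]],
    "b": [[1.0, 0.0], [1.0, 2.0]],
    "c": [[10.0, 0.0], [10.0, 2.0]],
}
closest_pair = most_similar_pair(TOY_CLASSES)
```

In the toy data, classes "a" and "b" have nearby centers and are selected first; the remaining steps would repeat the same search on the merged classes.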
2.4 Representation of Term weighting Methods
2.4.1 Vector Space Model
The Vector Space Model (VSM) (Salton, G., Wong, & Yang, 1975) is one of the most frequently used indexing models for information retrieval (see Fig. 2.5). In the VSM, text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Most top-performing text categorization methods, such as SVM and kNN, require probabilistic, statistical or weight information (Mineau, 2005). In the text representation of the VSM, terms are words, phrases, or any other indexing units used to identify the contents of a text. However, no matter which indexing units are in use, each term in a document vector must be associated with a value (weight) which measures the importance of this term and denotes how much this term contributes to the categorization task of the document (Lan, et al., 2009).
Categorized Documents    keywords
Category  Document  f1  f2  f3  f4  f5  f6  f7  f8  f9  f10
C1        d1        y   y   y   -   y   -   -   -   -   y
C1        d2        -   y   -   y   y   -   -   -   y   -
C1        d3        -   y   y   y   y   -   y   -   -
C2        d4        -   -   -   y   -   -   y   -   y   y
C2        d5        -   y   y   -   -   y   -   y   y   -
C2        d6        -   -   y   -   -   -   y   y   y   y
Fig. 2.5 The representation of Vector Space Model in text documents.
2.4.2 Bag of Words - Eight Term Weighting Methods
Based on the VSM, there are many statistical bag-of-words (BOW) representations. Eight methods are introduced in this study, each of which uses a term-goodness criterion threshold to achieve a desired degree of term elimination from the full vocabulary of a document corpus; some are traditional and some are state of the art. These criteria are:
Binary, Term Frequency alone (tf), Inverse Document Frequency (idf), Relevant Frequency (rf), Information Gain (ig), Gain Ratio (gr), Odds Ratio (or), and Chi-square (chi) (Lan, et al., 2009; Zheng, Z., Wu, & Srihari, 2004).
Binary and tf alone (Salton, Gerard & Buckley, 1988)
In the VSM, one simple and common representation is the binary representation, in which the appearance of an index term is indicated by a 1 in the document vector and all absent words have a weight of 0. The other is the raw term frequency (tf), which counts the raw occurrences of a term; the tf of a term in a document closely represents the content of that document.
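The binary and raw tf representations can be sketched as follows for documents that have already been segmented into words. The function name `build_vsm`, the alphabetical ordering of the vocabulary and the two toy documents are assumptions made for illustration.

```python
from collections import Counter


def build_vsm(documents):
    """Return (vocabulary, binary rows, raw tf rows) for lists of tokens."""
    vocabulary = sorted({term for doc in documents for term in doc})
    binary_rows, tf_rows = [], []
    for doc in documents:
        counts = Counter(doc)
        # Raw tf: number of occurrences of each vocabulary term.
        tf_rows.append([counts[term] for term in vocabulary])
        # Binary: 1 if the term appears at least once, otherwise 0.
        binary_rows.append([1 if counts[term] else 0 for term in vocabulary])
    return vocabulary, binary_rows, tf_rows


# Two hypothetical segmented documents.
TOY_DOCS = [["教育", "經費", "教育"], ["輔導", "教育"]]
vocabulary, binary_rows, tf_rows = build_vsm(TOY_DOCS)
```

Each row is one document vector and each column corresponds to one term of the vocabulary, as in the layout of Fig. 2.5.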
Matrices of combining multiplication operation. (Lan, et al., 2009)
There are still many collection frequency factors borrowed from Information Retrieval. In these methods, not only the number of occurrences of each term in a document is considered, but also the total number of documents. Let $N$ be the total number of documents in the system or the collection. Then we can get:
Term Frequency ($tf_{ij}$) (Salton, Gerard & Buckley, 1988)
In the VSM, Term Frequency - Inverse Document Frequency (TF-IDF) is the most common weighting method used to describe documents, particularly in IR problems (Mineau, 2005). The tf is defined as follows.
Let $f_{ij}$ be the raw frequency count of term $t_i$ in document $d_j$. Then, the normalized term frequency (denoted by $tf_{ij}$) of $t_i$ in $d_j$ is given by
\[ tf_{ij} = \frac{f_{ij}}{\max\{f_{1j}, f_{2j}, \ldots, f_{|V|j}\}}, \tag{25} \]
where the maximum is computed over all terms that appear in document $d_j$. If term $t_i$ does not appear in $d_j$ then $tf_{ij} = 0$. Recall that $|V|$ is the vocabulary size of the collection.
Inverse Document Frequency (idf) (Salton, Gerard & Buckley, 1988)
Let $df_i$ be the number of documents in which term $t_i$ appears at least once. Thus, the inverse document frequency (denoted by $idf_i$) of term $t_i$ is given by:
\[ idf_i = \log\frac{N}{df_i}, \tag{26} \]
The intuition here is that if a term appears in a large number of documents in the collection, it is probably not important or not discriminative. The final TF-IDF term weight is given by:
\[ w_{ij} = tf_{ij} \times idf_i, \tag{27} \]
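The normalized term frequency, the inverse document frequency and their product can be computed from a matrix of raw counts as sketched below. The function name `tfidf_matrix` is hypothetical, the natural logarithm is an assumption (the text does not state the base), and here each row holds one document.

```python
import math


def tfidf_matrix(counts):
    """counts[j][i] is the raw frequency of term i in document j."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[i] > 0) for i in range(n_terms)]
    # Assumption: natural logarithm; terms that never occur get idf 0.
    idf = [math.log(n_docs / df_i) if df_i else 0.0 for df_i in df]
    weights = []
    for row in counts:
        peak = max(row)
        # Normalized tf: raw count divided by the largest count in the document.
        tf = [f / peak if peak else 0.0 for f in row]
        weights.append([tf_i * idf_i for tf_i, idf_i in zip(tf, idf)])
    return weights


TOY_COUNTS = [[2, 1, 0], [1, 0, 1]]
toy_tfidf = tfidf_matrix(TOY_COUNTS)
```

A term that occurs in every document, such as the first column of the toy matrix, receives weight 0, which matches the intuition that such a term is not discriminative.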
Relevant Frequency (rf) (Lan, et al., 2009)
Relevant Frequency (rf) is a new, simple term weighting method based on the VSM, proposed by Lan, et al. (2009). In their view, the more concentrated a high-frequency term is in the positive category than in the negative category, the more it contributes to separating the positive samples from the negative samples.
Therefore, rf is defined as:
\[ rf = \log\left(2 + \frac{a}{\max(1, c)}\right) \tag{28} \]
where:
1. a is the number of documents in the positive category which contain this term
2. c is the number of documents in the negative category which contain this term
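The rf weight depends only on the two document counts a and c. In the sketch below the function name `relevance_frequency` is hypothetical and the base-2 logarithm is an assumption, since the formula as given here does not show the base.

```python
import math


def relevance_frequency(a, c):
    """rf = log(2 + a / max(1, c)); base 2 is assumed here.

    a: positive-category documents containing the term
    c: negative-category documents containing the term
    """
    return math.log2(2 + a / max(1, c))


def tf_rf(tf, a, c):
    """Combined tf.rf weight of a term in one document."""
    return tf * relevance_frequency(a, c)


rf_concentrated = relevance_frequency(6, 2)  # term mostly in the positive class
rf_absent = relevance_frequency(0, 0)        # term seen in neither class
```

The `max(1, c)` guard keeps the ratio defined when the term never occurs in the negative category.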
Chi-square (chi) (Galavotti, Sebastiani, & Simi, 2000)
Chi-square measures the lack of independence between a term $t$ and a category $c_i$ and can be compared to the chi-square distribution with one degree of freedom to judge extremeness. It is defined as:
\[ \chi^2(t, c_i) = \frac{N\,[P(t, c_i)\,P(\bar{t}, \bar{c}_i) - P(t, \bar{c}_i)\,P(\bar{t}, c_i)]^2}{P(t)\,P(\bar{t})\,P(c_i)\,P(\bar{c}_i)} \tag{29} \]
where:
N is the total number of documents.
1. $(t; c_i)$: presence of $t$ and membership in $c_i$.
2. $(t; \bar{c}_i)$: presence of $t$ and non-membership in $c_i$.
3. $(\bar{t}; c_i)$: absence of $t$ and membership in $c_i$.
4. $(\bar{t}; \bar{c}_i)$: absence of $t$ and non-membership in $c_i$.
where $t$ and $c_i$ represent a term and a category respectively. The frequencies of the four tuples in the collection are denoted by A, B, C and D respectively. The first and last tuples represent the positive dependency between $t$ and $c_i$, while the other two represent the negative dependency.
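When the probabilities are estimated from the four tuple frequencies A, B, C and D, the chi-square statistic reduces to a ratio of counts, because each probability is a count divided by N. The sketch below shows that form; the function name `chi_square` and the convention of returning 0 when a marginal count is zero are assumptions.

```python
def chi_square(a, b, c, d):
    """Chi-square of a term and a category from the four tuple frequencies.

    a: term present, document in the category
    b: term present, document not in the category
    c: term absent, document in the category
    d: term absent, document not in the category
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # Assumption: an undefined statistic is reported as 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


chi_independent = chi_square(10, 10, 10, 10)
chi_dependent = chi_square(20, 0, 0, 20)
```

A term spread evenly over both categories scores 0, while a term that occurs in exactly the documents of the category reaches the maximum value N.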
Information Gain (ig) (Galavotti, et al., 2000)
Information gain (ig) measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. The information gain of term $t$ and category $c_i$ is defined to be:
\[ IG(t, c_i) = \sum_{c \in \{c_i, \bar{c}_i\}} \ \sum_{t' \in \{t, \bar{t}\}} P(t', c) \log\frac{P(t', c)}{P(t')\,P(c)}, \tag{30} \]
Gain ratio (gr) (Mori, 2002)
The gain ratio compensates for the number of attributes by normalizing by the information encoded in the split itself.
\[ GR(t, c_i) = \frac{\sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c) \log\frac{P(t', c)}{P(t')\,P(c)}}{-\sum_{c \in \{c_i, \bar{c}_i\}} P(c) \log_2 P(c)} \tag{31} \]
Odds ratio (or) (Zheng, Z. & Srihari, 2003)
Odds ratio measures the odds of a term occurring in the positive class normalized by that of the negative class. The basic idea is that the distribution of features on the relevant documents is different from the distribution of features on the nonrelevant documents. It has been used by Mladenić for selecting terms in text categorization. It is defined as follows:
\[ OR(t, c_i) = \log\frac{P(t \mid c_i)\,[1 - P(t \mid \bar{c}_i)]}{[1 - P(t \mid c_i)]\,P(t \mid \bar{c}_i)}, \tag{32} \]
2.5 Evaluation Measures
The commonly used performance evaluation criteria for multi-class classification are Exact Match Ratio, Macro-average F1, and Micro-average F1 for text categorization (Tang, Rajan, & Narayanan, 2009).
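The three measures named above can be sketched for 0/1 label vectors as follows, using their usual definitions: the Exact Match Ratio is the fraction of instances whose predicted vector equals the true one, the macro-average F1 is the unweighted mean of the per-label F1 values, and the micro-average F1 pools all decisions before computing a single F1. Function names and the toy data are hypothetical, and returning 0 when a label has neither true nor predicted positives is an assumption.

```python
def exact_match_ratio(y_true, y_pred):
    """Fraction of instances whose predicted label vector is exactly right."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def label_f1(y_true, y_pred, j):
    """F1 of label j: 2 * hits / (true positives count + predicted count)."""
    hits = sum(t[j] * p[j] for t, p in zip(y_true, y_pred))
    total = sum(t[j] for t in y_true) + sum(p[j] for p in y_pred)
    # Assumption: a label with no true and no predicted positives scores 0.
    return 2 * hits / total if total else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 values."""
    d = len(y_true[0])
    return sum(label_f1(y_true, y_pred, j) for j in range(d)) / d


def micro_f1(y_true, y_pred):
    """F1 computed over all instances and labels together."""
    d = len(y_true[0])
    hits = sum(t[j] * p[j] for t, p in zip(y_true, y_pred) for j in range(d))
    total = sum(sum(t) for t in y_true) + sum(sum(p) for p in y_pred)
    return 2 * hits / total if total else 0.0


# Four instances, three labels; the third instance is misclassified.
Y_TRUE = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
Y_PRED = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]]
```

On the toy data the macro-average is pulled down by the label that is never predicted, whereas the micro-average weights every decision equally.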
Exact Match Ratio (Zhu, Ji, Xu, & Gong, 2005) is defined as:
Given $l$ testing instances, let $y_i$ be the true label vector for the $i$th instance and $\hat{y}_i$ be the predicted one, and let the indicator function be:
\[ I[s] = \begin{cases} 1 & \text{if } s \text{ is true}, \\ 0 & \text{otherwise}. \end{cases} \]
\[ \text{Exact Match Ratio} = \frac{1}{l} \sum_{i=1}^{l} I[y_i = \hat{y}_i], \tag{33} \]
Macro-average F1 and Micro-average F1 (Jean, 1981)
One of the most commonly used performance measures for information retrieval systems is the F1. F1 combines the precision (P) and recall (R) scores into a single value using the formula:
\[ F_1 = \frac{2PR}{P + R}, \tag{34} \]
Precision is the percentage of the positively classified examples which are actually positive. Recall is the fraction of positive examples which have been classified as positive. For label $j$,
\[ P^j = \frac{\sum_{i=1}^{l} y_i^j \hat{y}_i^j}{\sum_{i=1}^{l} \hat{y}_i^j} \quad \text{and} \quad R^j = \frac{\sum_{i=1}^{l} y_i^j \hat{y}_i^j}{\sum_{i=1}^{l} y_i^j}, \tag{35} \]
The F1 for label $j$ can be written as:
\[ F_1^j = \frac{2 \sum_{i=1}^{l} y_i^j \hat{y}_i^j}{\sum_{i=1}^{l} y_i^j + \sum_{i=1}^{l} \hat{y}_i^j}, \tag{36} \]
There are two methods developed to extend the F1 from single-label to multi-label (Jean, 1981). The macro-average F1 is the unweighted mean of the label F1 values,
\[ \text{Macro-average } F_1 = \frac{1}{d} \sum_{j=1}^{d} \frac{2 \sum_{i=1}^{l} y_i^j \hat{y}_i^j}{\sum_{i=1}^{l} y_i^j + \sum_{i=1}^{l} \hat{y}_i^j}, \tag{37} \]
The other measure is the micro-average F1, which considers predictions from all instances together and calculates the F1 across all labels:
\[ \text{Micro-average } F_1 = \frac{2 \sum_{j=1}^{d} \sum_{i=1}^{l} y_i^j \hat{y}_i^j}{\sum_{j=1}^{d} \sum_{i=1}^{l} y_i^j + \sum_{j=1}^{d} \sum_{i=1}^{l} \hat{y}_i^j}, \tag{38} \]
Chapter III
Design of the Web-based Chinese word Segmentation
In this chapter, we describe the design of our proposed system, GSP. The research model is introduced in the first section. We describe how we deal with data preprocessing in section 3.2. The basic idea of Chinese word segmentation in the GSP system is described in section 3.3. In section 3.4, we introduce the collection of Chinese e-government documents. In section 3.5, we introduce how we extract sentences from articles. In section 3.6, we introduce the eight term weighting calculations used in the VSM, and in section 3.7 we demonstrate the research model in practical operation.
3.1 Research Model
We design a model in which Chinese word segmentation, feature selection and classification work together. The research model of the thesis is portrayed in Fig. 3.1.
Fig. 3.1 Overview of the research model.
Two Chinese word segmentation systems are combined with state-of-the-art soft computing techniques, including feature selection methods and multi-class classifiers, to cope with the classification of Chinese e-government documents. The two Chinese word segmentation methods compared in this study are CKIP and our proposed system, GSP.
The first step is collecting the Chinese e-government documents and preprocessing the data. We manually classify the collected documents into four categories and eliminate non-Chinese words, including digits, stop words and English words. In the second step, after collecting the Chinese e-government documents of Taichung Hsien, we extract sentences from them and divide the data into two models: (1) data containing the full text and (2) data containing the title only. In the third step, the data are segmented by the Chinese word segmentation systems and expanded into vector space models. Because there are two Chinese word segmentation systems, we obtain four vector space models: full text-CKIP, full text-GSP, title-CKIP, and title-GSP. In the fourth step, we apply different term weighting methods to these vector space models for feature selection, namely the binary, tf, tfidf, tfchi, tfig, tfor, tfrf and tfgr matrices. In the fifth step, because Chinese e-government documents are by nature exchanged among different government agencies, we need multi-class classifiers to classify the four-category data set; OAASVM, OAOSVM, BSVM, and BTMSVM are adopted in this step. In the evaluation step, we compare the performance of the two Chinese word segmentation methods by Exact Match Ratio, Macro-average F1, and Micro-average F1.
3.2 Chinese E-government Documents Data
The 3rd tier Chinese e-government documents of Taichung Hsien meet our needs for a working dataset. We chose the 3rd tier documents of the local government for the following two reasons: 1. 3rd tier documents are the most frequently used of all; 2. among the 3rd tier documents exchanged between agencies so far, those issued by the Taichung Hsien government are the first to follow the XML format defined by the NAA’s rules.
We collected 391 articles for the working data set from November 2009 to December 2009. The collecting period lasted two months for two reasons related to the 3rd tier documents in the Taichung Hsien government bulletin: 1. some routine reporting documents are issued monthly; 2. news articles published in the bulletin usually stay for less than one month and never longer than two months.
The collected articles were manually classified into 4 categories following the structure of school administration: Academic Affairs, Student Affairs, General Affairs and Counseling Office. Academic Affairs includes affairs of students, curriculum and teaching; Student Affairs includes living education, citizen education and health education; General Affairs includes renewing and maintaining the campus, fee rate determination and fee revenue management; Counseling Office includes students’ psychological and physical guidance and counseling. The resulting subdivision is shown in Table 3.1.
Table 3.1 The dataset sub-divided into four categories.
Category            Number of articles
Academic Affairs    158
Student Affairs     138
General Affairs     54
Counseling Office   41
Total               391
3.3 Data Preprocessing
In our research, we want to focus on the comparison results of Chinese word segmentation; therefore we remove non Chinese characters and replace them as