BACKGROUND AND RESEARCH MOTIVATION - Applying Web-Based Word Segmentation to an Electronic Official Document Classification System (網路斷詞技術應用於電子公文分類系統)

CHAPTER I INTRODUCTION

1.1 BACKGROUND AND RESEARCH MOTIVATION

Over the past decade, information exchange has become steadily faster as investment in electronic government (e-government) website services in Taiwan has increased (Wen & Cheng, 2007). Since the mid-1990s, public organizations across the world and at different governmental levels have been applying Internet technologies in innovative ways to deliver services, engage citizens, and improve performance, and Taiwan is no exception (Lee, Xin, & Silvana, 2005).

However, the Internet is only the pipeline and storage system for knowledge exchange; by itself it does not create knowledge sharing in a corporate culture. With the convenience of Internet technology, the number of published documents is growing enormously, and the need to organize these digital documents is unavoidable (Sebastiani, 2002; Wen & Cheng, 2007). Without proper management of these vast collections of electronic documents, the Internet merely acts as a distributed, highly scattered storehouse (McDermott, 1999).

Carter and Bélanger's research indicates that citizens' intention to use a local e-government service increases if they perceive the service to be easy to use (Carter & Bélanger, 2005). The government is not only collecting mountains of information but also making it accessible and usable for ordinary citizens (http://www.america.gov/st/usg-english/2009/June/20090623100625ajesrom0.2781488.html).

In addition, Sprague Jr. (1995) indicated how document management can improve information management tasks and generate business value. Thus, managing e-government documents has become an important task, and many studies have focused on electronic document management systems (EDMS) (Hung, Tang, Chang, & Ke, 2009). The expectation of this thesis is illustrated in Fig. 1.1.

Fig. 1.1 The ideal exchange service of e-government documents.

However, despite its importance, the management of Chinese e-government documents has so far received little investigation (An, 2009). While managing corporate or government knowledge and information has many aspects (Devadoss, Pan, & Huang, 2003; Layne & Lee, 2001), one key issue is how to classify Chinese e-government documents into the correct categories efficiently and in a timely manner as the volume of electronic documents grows to unprecedented levels.

Out-of-vocabulary (OOV) words in Chinese word segmentation (Sproat & Emerson, 2003) and the huge-scale dimensionality of the bag-of-words (BOW) model (Sahlgren & Cöster, 2004) are the two major challenges in document classification that this research addresses, with a focus on information management for Chinese e-government documents.

The OOV problem takes three forms: abbreviated words (Chang & Lai, 2004; Sun, Wang, & Zhang, 2006), named entities (Gao, Li, Wu, & Huang, 2005), and compound nouns (Zhou, 2000). All three forms also appear frequently in Chinese e-government documents.

1. The first is abbreviated words. Government agencies sometimes abbreviate words and use them in official electronic documents, e.g., “98 高中課程綱要” (98 senior course syllabuses) is abbreviated as “98 課綱”.

2. The second is named entities, e.g., “體健科” (Education Department Physical and Health Education Section) is the name of an official unit of Taichung Hsien, and “替代役” (substitute military service) is a new proper noun.

3. The third is compound nouns, which are often coined for new policies in Chinese e-government documents, e.g., “教育優先區” (educational priority area) combines “教育” (education), “優先” (priority), and “區” (area).

In natural language processing, many studies show that these three forms of OOV words not only degrade the performance of Chinese word segmentation but also lower the accuracy of document classification, since they cannot be segmented correctly.

New words are one of the major problems in document classification, since they cannot be recognized from a pre-built dictionary or corpus (Emerson, 2005).

Sproat and Emerson (2003) indicated that more than 60% of word segmentation errors result from new words that are not in a dictionary. Table 1.1 illustrates that even the most well-known Chinese word segmentation system, Chinese Knowledge Information Processing (CKIP) (Chen, C.-M., Liu, Chiu, & Lee, 2006; Xue, Xia, Chiou, & Palmer, 2005), has problems segmenting OOV words. In short, these so-called part-of-speech (POS) words have become frequently used in Chinese electronic official documents, and they occur especially often in third-tier Chinese e-government documents.

Table 1.1 Comparison of word segmentation by CKIP and GSP.

String | CKIP | GSP
98 課綱 (98 senior course syllabus) | 98 (Neu) 課(Na) 綱(Na) | 98 課綱
教育優先區 (Educational priority area) | 教育(Na) 優先(VH) 區(Nc) | 教育優先區
行政院衛生署 (Department of Health, Executive Yuan) | 行政院(Nc) 衛生署(Nc) | 行政院衛生署
攜手計畫課後扶助 (After School Alternative Program) | |
(Taiwan Health Promoting School) | |
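The dictionary gap behind Table 1.1 can be illustrated with a minimal sketch. It uses forward maximum matching, a simple dictionary-based segmenter chosen here only for illustration; it is not the algorithm used by CKIP or GSP. The function name and the two toy dictionaries are assumptions introduced for this example.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy longest-match segmentation (illustrative stand-in, not CKIP).

    Characters that match no dictionary entry are emitted as single tokens,
    which is how an out-of-vocabulary compound ends up split apart.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionaries (hypothetical): the second one also contains the compound.
base_dictionary = {"教育", "優先"}
extended_dictionary = base_dictionary | {"教育優先區"}

print(forward_max_match("教育優先區", base_dictionary))      # ['教育', '優先', '區']
print(forward_max_match("教育優先區", extended_dictionary))  # ['教育優先區']
```

With the base dictionary the compound is split into three tokens, as in the CKIP column of Table 1.1; once the compound is present in the dictionary it is kept whole.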

Fortunately, using the Web to find candidate words for word segmentation appears to be a feasible approach. Recently, many studies have tried to use the Web as a corpus to overcome OOV problems. Su, Lin, et al. (2007) incorporated Wikipedia as a live dictionary for Multilingual Information Retrieval (MLIR), which also has to deal with OOV terms. In addition, several studies have developed named entity disambiguation methods based on named entity denotations in Wikipedia (Bunescu & Pasca, 2006; Cucerzan, 2007). The method of Zhang and Vines (2004) automatically and correctly extracts translations of OOV terms in both the Chinese-English and English-Chinese directions from the Web, using the Google and Yahoo! search engines.

Second, huge-scale dimensionality is another problem. In the vector space model (VSM), most traditional text classification methods are based on variations of the “bag of words (BOW)” model, which relies on term frequency statistics. In the BOW model, a document is represented as a vector in the term space, i.e., $d = (f_1, \ldots, f_k)$, where $k$ is the term set size and each value $f_i$ is usually between 0 and 1 (Lan, Tan, Su, & Lu, 2009; Sebastiani, 2002).
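As a minimal sketch of this representation, the following builds a BOW vector over a fixed vocabulary. Scaling each count by the largest count in the document is an assumption made here so that the values fall between 0 and 1; the cited works describe several other normalizations. The function name and the toy vocabulary are hypothetical.

```python
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Return (f_1, ..., f_k) for one document over an ordered vocabulary.

    Each f_i is the term count divided by the largest in-vocabulary count
    (an assumed normalization), so every value lies in [0, 1].
    """
    counts = Counter(tokens)
    peak = max((counts[term] for term in vocabulary), default=0)
    if peak == 0:
        return [0.0] * len(vocabulary)
    return [counts[term] / peak for term in vocabulary]


vocabulary = ["教育", "優先", "區", "課綱"]   # k = 4 terms (toy example)
document = ["教育", "優先", "區", "教育"]      # an already segmented document

print(bow_vector(document, vocabulary))  # [1.0, 0.5, 0.5, 0.0]
```

The vector length equals the vocabulary size $k$, so every additional distinct token produced by the segmenter adds one dimension to every document vector.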

However, BOW models usually lead to high-dimensional computation problems because of erroneous Chinese word segmentation, corrupted collections (Hu, et al., 2008; Ko & Lee, 2001), and redundant term extraction (Kwok, 1997), since the composition of Chinese words is open ended. Table 1.2 illustrates how different segmentation results affect the vector space size. For the string “行政院衛生署疾病管制局” (Centers for Disease Control, Department of Health, Executive Yuan), the CKIP system produces the segmentation “行政院(Nc) 衛生署(Nc) 疾病(Na) 管制局(Nc)”. As named entity expressions, the segmented results are not wrong, but they occupy more space in the VSM, and occupying more space in the VSM makes computing the BOW model more costly. Our goal is to present another kind of correct named entity and to minimize the VSM, so these terms can be recognized as one complete chunk, “行政院衛生署疾病管制局”. Thus, detecting more complete noun phrases is the main task of this research. The longer the segmented terms are, the smaller the VSM becomes (Fu, Kit, & Webster, 2008).

Table 1.2 Different segmentation results affect the VSM size.

Legal results of word segmentation | Occupied vector space size
行政院衛生署疾病管制局 | 1
行政院衛生署疾病 管制局 |
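The effect of segmentation granularity on the vector space can be sketched by counting distinct tokens. The example below compares the four-token CKIP segmentation quoted in the text with the single-chunk segmentation for the same string; the helper name is hypothetical, and a one-string corpus is used only to keep the illustration small.

```python
def vocabulary_size(segmented_documents):
    """Number of distinct tokens, i.e. the dimensionality of the VSM."""
    return len({token for doc in segmented_documents for token in doc})


# The same string under two legal segmentations.
fine_grained = [["行政院", "衛生署", "疾病", "管制局"]]   # CKIP-style split
single_chunk = [["行政院衛生署疾病管制局"]]                # one complete chunk

print(vocabulary_size(fine_grained))  # 4
print(vocabulary_size(single_chunk))  # 1
```

For this string, the fine-grained segmentation occupies four dimensions while the complete chunk occupies one.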

Furthermore, another factor affects the dimensionality of the vector space: the term weighting method. The choice of feature weighting model can have a considerable impact on both quality and performance. An ideal term weighting method not only discriminates among candidate words and reduces the dimensionality of the vector space for efficient computation, but also eliminates noise from the feature set (Houle & Grira, 2007).
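How a weighting scheme can both discriminate terms and shrink the feature set may be sketched with TF-IDF, a common weighting method used here purely as a stand-in; the text does not state which scheme this thesis adopts. The function names, the threshold, and the three toy documents are assumptions introduced for illustration.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Per-document TF-IDF weights: (count / doc length) * log(N / doc freq)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def select_features(weights, threshold):
    """Keep only terms whose weight exceeds the threshold in some document."""
    return {term for doc in weights for term, w in doc.items() if w > threshold}


# Toy segmented documents (hypothetical); "公告" occurs in every document.
documents = [
    ["公告", "教育優先區", "補助"],
    ["公告", "替代役", "申請"],
    ["公告", "課綱", "修訂"],
]

kept = select_features(tfidf_weights(documents), threshold=0.0)
print(len(kept))        # 6
print("公告" in kept)   # False
```

A term that appears in every document receives a weight of zero and is dropped, so the feature set falls from seven terms to six while the discriminative terms remain.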

In electronic document classification, most studies pursue higher accuracy. Classification accuracy is indeed important when comparing against past classification by domain experts, but for high-dimensional, large-scale datasets such as electronic documents, the vector space size and accuracy should be considered together to support further analysis (Wu, C.-H., Fang, Wu, Lin, & Li, 2009).

This thesis explores how to automatically overcome the problems of segmenting OOV words and reduce the huge-scale dimensionality, in order to assist managers of Chinese e-government documents in knowledge and information management. With such web dictionaries in hand, we can then examine how to use them to handle Chinese word segmentation so that proper nouns, abbreviated words, and compound nouns are segmented more completely.
