Organization of This Dissertation - 台語文處理技術：以變調及詞性標記為例

Chapter 1 Introduction

1.4 Organization of This Dissertation

This dissertation is divided into six chapters.

We introduce the overall background in Chapter 1. A researcher with a background in computer science may not be familiar with the Taiwanese language, given the monolingual education in Taiwan. Therefore, we devote space to describing the background of the language, including its history, speaker population, different types of scripts, and abbreviations.

Chapter 2 describes the resources and our survey of written Taiwanese processing. We omit the plentiful research results in the Mandarin and English fields for the sake of space. Written Taiwanese processing is an almost uncultivated field and has received very little attention. Generally speaking, most journal editors are not interested in this field; therefore, we cite numerous websites rather than academic papers. With regard to the digital resources of written Taiwanese, we introduce fonts, dictionaries, corpora, electronic books, etc. We also introduce recent written Taiwanese processing techniques, including input methods, word segmentation, tagging, script conversion, text-to-speech, translation, and parsing.

In Chapter 3, we introduce the coding and I/O of POJ and text processing for written Taiwanese. English and Mandarin have their own processing problems. For example, English processing must handle word stemming and prepositional-phrase modifiers, while Mandarin processing must handle Han character encoding and word segmentation. POJ likewise has fundamental problems of its own, including encoding, display, and search, which differ from those of English and Mandarin. We first introduce the POJ character codes and present numbered POJ as the interchange code for various POJ encodings. Then, we propose a two-stage search strategy: perform string matching and then filter the results. In addition, we propose query expansions, including toneless, glottal stop, checked syllable, and vowel searches, because someone with a Mandarin education finds these distinctions difficult to make. We also describe the display method for POJ and some POJ word processing utilities, including phoneme segmentation, a spelling checker, and syllable/word/sentence count utilities. At the end of this chapter, we describe a word segmentation method for HR mixed script.
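The two-stage search with toneless query expansion can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names are hypothetical, and the second-stage filter (keep only hits where the query matches whole syllables) is an assumed example of what such a filter might do.

```python
import re
import unicodedata

# Combining marks that POJ uses for tones: grave, acute, circumflex,
# macron, vertical line above. U+0358 (the dot of "o͘") marks a vowel
# distinction, not a tone, so it is deliberately kept.
TONE_MARKS = {"\u0300", "\u0301", "\u0302", "\u0304", "\u030d"}

def strip_tones(text):
    """Toneless form of a POJ string, lower-cased for comparison."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept).lower()

def syllables(text):
    """Split on hyphens, spaces, and punctuation (assumed delimiters)."""
    return [s for s in re.split(r"[^\w\u0358]+", text) if s]

def contains_seq(haystack, needle):
    """True if the list `needle` occurs contiguously inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))

def two_stage_search(query, lines):
    key = strip_tones(query)
    # Stage 1: cheap substring matching on the toneless text.
    hits = [line for line in lines if key in strip_tones(line)]
    # Stage 2: filter out hits that only match part of a syllable.
    needle = syllables(key)
    return [line for line in hits
            if contains_seq(syllables(strip_tones(line)), needle)]
```

With this sketch, the toneless query `kong` finds a line containing `kóng`, while the query `ang` is rejected in stage 2 for a line that only contains `kang-hu`.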

In Chapter 4, we propose a rule-based tone sandhi algorithm. We address some problems raised by the Taiwanese tone sandhi system by describing a set of computational rules that approximate this system, together with the results obtained from our implementation. Using POJ text as the source, we took the sentence as the unit, translated every word into Mandarin via the OTMD, and obtained POS information from the CED made by the CKIP group at Academia Sinica. Using the POS data and linguistically motivated tone sandhi rules, we then tagged each syllable with its post-sandhi tone marker.

Finally, we implemented a Taiwanese tone sandhi processing system that takes a POJ script sentence as the input and outputs the tone markers. Our system achieved accuracy rates of 97.4% and 89.0% with the observation and test data, respectively.

For example, if a user inputs the POJ sentence:

“Chhin-chhiūⁿ án-ni lâi kóng, chāi lán Tâi-ôan kīn-kīn chi̍t-tiap-á-kú ê kang-hu, ài soaⁿ chiū ū soaⁿ, ài hái chiū ū hái, beh jo̍ah chiū ū jo̍ah, kôaⁿ chiū ū kôaⁿ”

Our tone sandhi algorithm adds the tone sandhi markers:

“Chhin-chhiūⁿ án-ni# lâi kóng#, chāi lán Tâi-ôan# kīn-kīn chi̍t-tiap-á-kú# ê kang-hu#, ài soaⁿ# chiū ū soaⁿ#, ài hái# chiū ū hái#, beh jo̍ah# chiū ū jo̍ah#, kôaⁿ# chiū ū kôaⁿ#.”

We then concatenate the sound files for the corresponding syllables into a single MP3 sound file and return it to the user. The purpose of the Taiwanese tone sandhi algorithm is to implement a real-time Taiwanese tone sandhi system.
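To make the role of the markers concrete, the following Python sketch turns a marked sentence into post-sandhi tones. It rests on several assumptions that do not come from the dissertation: the input is numbered POJ without punctuation, a trailing `#` means the last syllable of that word keeps its base tone, and the tone changes follow a common textbook table (dialects differ, for example tone 5 may become 3 rather than 7). Special cases such as suffixes and neutral tones are ignored, and the function names are hypothetical.

```python
# Textbook sandhi table for unchecked tones (assumption, not from the source).
SANDHI = {"1": "7", "2": "1", "3": "2", "5": "7", "7": "3"}
# Checked syllables: finals -p, -t, -k versus final -h (assumption).
CHECKED_STOP = {"4": "8", "8": "4"}
CHECKED_GLOTTAL = {"4": "2", "8": "3"}

def sandhi(syllable):
    """Return the post-sandhi form of one numbered POJ syllable."""
    body, tone = syllable[:-1], syllable[-1]
    if tone in CHECKED_STOP:
        table = CHECKED_GLOTTAL if body.endswith("h") else CHECKED_STOP
        return body + table[tone]
    return body + SANDHI.get(tone, tone)

def apply_sandhi(marked):
    """Apply sandhi to every syllable except the one right before '#'."""
    out = []
    for word in marked.split():
        keeps_base = word.endswith("#")
        parts = word.rstrip("#").split("-")
        for i, syl in enumerate(parts):
            is_last = i == len(parts) - 1
            out.append(syl if (keeps_base and is_last) else sandhi(syl))
    return out

print(apply_sandhi("lan2 Tai5-oan5# e5 kang1-hu1#"))
# ['lan1', 'Tai7', 'oan5', 'e7', 'kang7', 'hu1']
```

The resulting list of syllables is what a lookup of per-syllable sound files would consume before concatenation.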

In Chapter 5, we propose a POS tagging method for Taiwanese that uses the OTMD and 10 million words of Mandarin training data. The literary written Taiwanese corpora are available in both POJ script and HR mixed script, with genres that include prose, novels, and drama. We followed the tagset drawn up by CKIP. We developed a word alignment checker to assist with the word alignment work for the two scripts, then used the OTMD to find the corresponding Mandarin candidate words, selected the most suitable Mandarin word using an HMM probabilistic model estimated from the Mandarin training data, and finally tagged the word using an MEMM (Maximum Entropy Markov Model) classifier. We achieved an accuracy rate of 91.5% in the Taiwanese POS tagging work and analyzed the errors.

For example, the original data was a paragraph-by-paragraph parallel corpus in POJ and HR mixed scripts, such as:

góa chiong chháu-bō-á kòa tī piah-téng, hêng-lí khêng khêng leh, chē tòa sió-tiàm ê tha-tha-mì téng-kôan, …

我將草帽仔掛tī壁頂，行李 khêng khêng leh，坐tòa小店ê tha-thá-mì頂kôan，…

First, our word alignment program rearranged the data as:

“我[góa] 將[chiong] 草帽仔[chháu-bō-á] 掛[kòa] tī[tī] 壁頂[piah-téng] ，[,] 行李[hêng-lí] khêng[khêng] khêng[khêng] leh[leh] ，[,] 坐[chē] tòa[tòa] 小店[sió-tiàm] ê[ê] tha-thá-mì[tha-tha-mì] 頂kôan[téng-kôan] ，[,] …”

Second, we consulted the OTMD and added the Mandarin translation(s) for every word. We call these Mandarin translations candidate words. We performed this step because we intended to use the Mandarin language model:

“我[góa]{我} 將[chiong]{將} 草帽仔[chháu-bō-á]{@草帽仔} 掛[kòa]{帶;掛;戴} tī[tī]{在} 壁頂[piah-téng]{牆壁上} ，[,]{，} 行李[hêng-lí]{行李} khêng[khêng]{收拾;盤點} khêng[khêng]{收拾;盤點} leh[leh]{咧} ，[,]{，} 坐[chē]{坐} tòa[tòa]{住} 小店[sió-tiàm]{@小店} ê[ê]{的} tha-thá-mì[tha-tha-mì]{塌塌米} 頂kôan[téng-kôan]{上面} ，[,]{，} …”
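The candidate lookup in this step can be sketched in Python as follows. This is an illustration under stated assumptions, not the dissertation's code: the dictionary is modeled as a plain mapping from a POJ word to a list of Mandarin translations, the `@` prefix is taken to flag a word that was not found, and `candidates` and `annotate` are hypothetical names.

```python
def candidates(hr_word, poj_word, otmd):
    """Mandarin candidates for one aligned word.

    `otmd` is a toy stand-in for the dictionary: POJ word -> translations.
    If the word is missing, fall back to the HR mixed script form,
    prefixed with '@' to flag the fallback (assumed convention).
    """
    found = otmd.get(poj_word)
    if found:
        return list(found)
    return ["@" + hr_word]

def annotate(pairs, otmd):
    """Render aligned (hr_word, poj_word) pairs as word[poj]{c1;c2;...}."""
    parts = []
    for hr, poj in pairs:
        cand = ";".join(candidates(hr, poj, otmd))
        parts.append(f"{hr}[{poj}]{{{cand}}}")
    return " ".join(parts)

toy_otmd = {"kòa": ["帶", "掛", "戴"], "tòa": ["住"]}
print(annotate([("掛", "kòa"), ("小店", "sió-tiàm")], toy_otmd))
# 掛[kòa]{帶;掛;戴} 小店[sió-tiàm]{@小店}
```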

Note that because the words “草帽仔” and “小店” are not found in the OTMD (they are marked with “@”), we treat the HR mixed script word itself as the Mandarin candidate word. Third, we use a Hidden Markov Model to select the most suitable Mandarin word from the candidate words:

“{我}<我> {將}<將> {@草帽仔}<草帽仔> {帶;掛;戴}<帶> {在}<在> {牆壁上}<牆壁上> {，}<，> {行李}<行李> {收拾;盤點}<收拾> {收拾;盤點}<收拾> {咧}<咧> {，}<，> {坐}<坐> {住}<住> {@小店}<小店> {的}<的> {塌塌米}<塌塌米> {上面}<上面> {，}<，> …”
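Selecting one word per position from the candidate lists is a shortest-path problem that Viterbi decoding solves. The Python sketch below is a simplified stand-in for the HMM described here: it uses only bigram transition probabilities between adjacent Mandarin words and treats emission probabilities as uniform, which is an assumption. The probabilities are toy values, unseen bigrams get a small floor, and `select_words` is a hypothetical name.

```python
import math

def select_words(candidate_lists, bigram, floor=1e-6):
    """Pick one word per position, maximizing the product of bigram
    probabilities along the sequence (Viterbi in log space).

    candidate_lists: non-empty list of non-empty candidate lists.
    bigram: dict mapping (previous_word, word) -> probability.
    """
    # best[w] = (log-score of the best path ending in w, that path)
    best = {w: (0.0, [w]) for w in candidate_lists[0]}
    for cands in candidate_lists[1:]:
        step = {}
        for w in cands:
            score, path = max(
                (s + math.log(bigram.get((prev, w), floor)), p)
                for prev, (s, p) in best.items()
            )
            step[w] = (score, path + [w])
        best = step
    return max(best.values())[1]

toy_bigram = {("行李", "收拾"): 0.2, ("行李", "盤點"): 0.01}
print(select_words([["行李"], ["收拾", "盤點"], ["咧"]], toy_bigram))
# ['行李', '收拾', '咧']
```

Any `@` prefix on a fallback candidate would need to be stripped before the lookup in `bigram`.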

Fourth, we used the MEMM classifier to tag each selected Mandarin word with its POS:

“<我>(Nh) <將>(D) <草帽仔>(Na) <帶>(VC) <在>(P) <牆壁上>(Nc) <，>(COMMACATEGORY) <行李>(Na) <收拾>(VC) <收拾>(VC) <咧>(T) <，>(COMMACATEGORY) <坐>(VA) <住>(VCL) <小店>(Na) <的>(DE) <塌塌米>(Na) <上面>(Ncd) <，>(COMMACATEGORY) …”

Finally, we got the Taiwanese POS tagging result:

“我[góa](Nh) 將[chiong](D) 草帽仔[chháu-bō-á](Na) 掛[kòa](VC) tī[tī](P) 壁頂[piah-téng](Nc) ，[,](COMMACATEGORY) 行李[hêng-lí](Na) khêng[khêng](VC) khêng[khêng](VC) leh[leh](T) ，[,](COMMACATEGORY) 坐[chē](VA) tòa[tòa](VCL) 小店[sió-tiàm](Na) ê[ê](DE) tha-thá-mì[tha-tha-mì](Na) 頂kôan[téng-kôan](Ncd) ，[,](COMMACATEGORY) …”
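This last step amounts to copying each tag from the selected Mandarin word back onto the aligned Taiwanese word in the same position. A minimal Python sketch, assuming a strict one-to-one alignment between the two sequences; `project_tags` is a hypothetical name:

```python
def project_tags(aligned, tags):
    """Attach one POS tag to each aligned (hr_word, poj_word) pair.

    aligned: list of (hr_word, poj_word) tuples from the alignment step.
    tags: POS tags predicted for the selected Mandarin words, same order.
    """
    if len(aligned) != len(tags):
        raise ValueError("alignment and tag sequence differ in length")
    return " ".join(f"{hr}[{poj}]({tag})"
                    for (hr, poj), tag in zip(aligned, tags))

print(project_tags([("tòa", "tòa"), ("ê", "ê")], ["VCL", "DE"]))
# tòa[tòa](VCL) ê[ê](DE)
```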

We hope that this POS tagging system can help us develop a Taiwanese parser.

Chapter 6 summarizes our work. This dissertation is not the end of our work on written Taiwanese processing tasks, so Chapter 6 also proposes future directions for written Taiwanese processing research.
