Chapter 5 POS Tagging Method

5.2 POS Tagging Methods

Fig. 5-1 shows the system architecture:

Fig. 5-1 Taiwanese Language POS Tagging System Architecture Diagram

5.2.1 Origin of the Corpus

The corpus we chose was the result of the DADWT project of the NMTL. It contains over 2.58 million syllables in both POJ and HR mixed scripts, aligned paragraph by paragraph, and includes novels, prose, dramas, and poems (Iunn, 2007b).

5.2.2 Word for Word Alignment

First, we developed a word alignment program to aid manual processing. Starting from the two scripts, whose paragraphs were already aligned, we aligned them word for word. The program not only compares the number of syllables in the two scripts, it also checks the two scripts against the contents of the OTMD. If the program does not find the two words within the OTMD, the word may be an unknown word, an inconsistent usage of Han characters, or a typo.
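The two checks described above (syllable counts and dictionary lookup) can be sketched as follows. This is a minimal illustration, not the program used in the study: the function names, the syllable-counting rule (one Han character or one hyphen-separated romanized run per syllable), and the toy dictionary entry are all assumptions.

```python
import re

# One "syllable unit" is either a single Han character or a run of
# romanized letters; hyphens and whitespace only separate units.
# (Assumed convention for this sketch.)
_UNIT = re.compile(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\-\s]+")


def count_syllables(word: str) -> int:
    """Count syllables in a POJ word or an HR mixed script word."""
    return len(_UNIT.findall(word))


def check_pair(poj: str, hr: str, dictionary: set) -> str:
    """Classify a candidate word pair so that a human can review it."""
    if count_syllables(poj) != count_syllables(hr):
        return "syllable_mismatch"
    if (poj, hr) not in dictionary:
        # Possibly an unknown word, inconsistent Han usage, or a typo.
        return "not_in_dictionary"
    return "ok"


# Toy dictionary (illustrative entry, not taken from the OTMD).
toy_dictionary = {("tâi-oân", "台灣")}

print(check_pair("tâi-oân", "台灣", toy_dictionary))
print(check_pair("tâi-oân", "臺灣", toy_dictionary))
print(check_pair("tâi-oân", "台", toy_dictionary))
```

Pairs flagged as anything other than "ok" would be the ones passed on for manual inspection.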

The OTMD contains over 62,000 entries, each with the POJ script, the Taiwanese HR mixed script, a Mandarin translation, and, for about 10,000 records, an English translation. This online dictionary provides synthesized Taiwanese pronunciations and receives about 2,700 inquiries per day on average (counted from 1/1/2008 to 12/31/2008). The English field was added in 2007; however, the data are still incomplete (Iunn, 2000).

5.2.3 Searching for the Corresponding Mandarin Candidate Words

Next, we searched the OTMD for the Mandarin candidate words corresponding to each POJ and HR mixed script word pair. The mapping is one-to-many: a Taiwanese word pair may have more than one Mandarin counterpart. For example, “ài [愛]” in Taiwanese has the meanings “愛 ‘love (a person),’” “喜歡 ‘like (a thing),’” “要 ‘want to,’” “需要 ‘need to,’” etc., in Mandarin. However, we were unable to find counterparts for certain words because they were not contained in the OTMD. We also found some word pairs whose HR mixed script usage differed from that in the dictionary.

For instance, the word pair that appears as “較贏 [khah-iâⁿ]” in the corpus appears as “khah贏 [khah-iâⁿ]” in the dictionary. We handled problems of this nature as follows: if the POJ and HR mixed script word pair could not be found, we temporarily set aside the HR mixed script and searched for the Mandarin counterpart again using the POJ script alone. If the HR mixed script consisted entirely of Han characters, we also regarded those Han characters as a Mandarin candidate word (assuming that the word is common to Taiwanese and Mandarin).

This method might increase the number of Mandarin candidate words, especially for single-syllable words. For instance, the word pair “廱[chцan]” appears in the text. We could not find an entry that contained both “廱” and “chцan” in the OTMD. The corresponding Mandarin translations of “chцan” in the dictionary are “㈕” and “ᶲ.” We added “廱” as a supplementary Mandarin translation, but the meanings of these three words differ.

If this strategy still produced no result, the HR mixed script was taken directly as the Mandarin candidate word. For instance, no dictionary entry was found for the word pair “有形 [iú-hêng]” in the text, nor could one be found by searching with the POJ script “iú-hêng” alone. The HR mixed script “有形” was therefore taken directly as the Mandarin candidate word (Lau, 2007).
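The three-stage fallback described in this section can be sketched as below. This is only an illustration under stated assumptions: the two lookup tables (`by_pair`, `by_poj`), the function name, and the toy entries are hypothetical and do not reflect the actual structure of the OTMD.

```python
import re

_ALL_HAN = re.compile(r"[\u4e00-\u9fff]+")

# Hypothetical toy lookup tables standing in for the dictionary.
by_pair = {("ài", "愛"): ["愛", "喜歡", "要", "需要"]}
by_poj = {"ài": ["愛", "喜歡", "要", "需要"], "tâi-oân": ["台灣"]}


def mandarin_candidates(poj, hr, by_pair, by_poj):
    """Return the Mandarin candidate words for one (POJ, HR) word pair."""
    # Stage 1: exact match on the (POJ, HR mixed script) pair.
    if (poj, hr) in by_pair:
        return list(by_pair[(poj, hr)])

    # Stage 2: drop the HR script and search by POJ alone. If the HR
    # script is all Han characters, add it as a supplementary candidate.
    candidates = list(by_poj.get(poj, []))
    if candidates:
        if _ALL_HAN.fullmatch(hr) and hr not in candidates:
            candidates.append(hr)
        return candidates

    # Stage 3: nothing found, so use the HR script itself.
    return [hr]


print(mandarin_candidates("ài", "愛", by_pair, by_poj))
print(mandarin_candidates("tâi-oân", "臺灣", by_pair, by_poj))
print(mandarin_candidates("iú-hêng", "有形", by_pair, by_poj))
```

As the text notes, stage 2 can enlarge the candidate set with a word whose meaning differs from the dictionary translations, so the later selection step has to choose among them.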

5.2.4 Selecting the Best Mandarin Translation

We employed a hidden Markov model (HMM) and the Viterbi algorithm, using bigram word training data from the ten-million-word balanced Sinica corpus of the CKIP Group of Academia Sinica, to select the most appropriate Mandarin translation.

Assume that a particular sentence contains $m$ words. The first word, $w_1$, is selected from the candidate words $w_{11}, w_{12}, \ldots, w_{1n_1}$; the second word, $w_2$, is selected from the candidate words $w_{21}, w_{22}, \ldots, w_{2n_2}$; and the $m$th word, $w_m$, is selected from the candidate words $w_{m1}, w_{m2}, \ldots, w_{mn_m}$. The most probable word sequence, $\hat{S} = w_1 w_2 \cdots w_m$, is selected from the candidate words such that $P(\hat{S} = w_1 w_2 \cdots w_m)$ is maximized.

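The HMM assumes that the word $w_i$ is influenced only by the previous word $w_{i-1}$; thus $P(\hat{S} = w_1 w_2 \cdots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})$. Therefore, it searches for the word sequence $\hat{S} = w_1 w_2 \cdots w_m$ that maximizes $\sum_{i=1}^{m} \log P(w_i \mid w_{i-1})$. The resulting word sequence may not be a legal Mandarin sentence (Samuelsson, 2003).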

In practice, we use the Viterbi algorithm to eliminate repeated computations and reduce the time complexity from exponential to polynomial. If a sentence S has $m$ words and every word has $n$ candidate words, exhaustive search has time complexity $O(n^m)$. The Viterbi algorithm reduces this to $O(n^2 \times m)$ (Manning & Schütze, 1999).
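To make the dynamic programming concrete, here is a minimal sketch of Viterbi selection over a lattice of candidate words scored by bigram log-probabilities. The function names, the start symbol `<s>`, the toy probabilities, and the floor value for unseen bigrams are all assumptions made for illustration; they are not estimates from the Sinica corpus.

```python
import math

# Toy bigram probabilities (hypothetical values).
_BIGRAM = {
    ("<s>", "我"): 0.5,
    ("我", "愛"): 0.4,
    ("我", "要"): 0.3,
    ("愛", "你"): 0.5,
    ("要", "你"): 0.1,
}


def bigram_logp(prev, cur):
    """log P(cur | prev), with a small floor for unseen bigrams."""
    return math.log(_BIGRAM.get((prev, cur), 1e-6))


def viterbi(candidates, logp, start="<s>"):
    """Pick one word per position so the summed bigram log-prob is maximal.

    candidates: list of candidate-word lists, one list per position.
    Returns (best_path, best_score).
    """
    if not candidates:
        return [], 0.0
    prev_scores = {start: 0.0}
    backptrs = []
    for options in candidates:            # m positions
        scores, ptrs = {}, {}
        for w in options:                 # n candidates
            # n predecessors per candidate: O(n^2) work per position.
            best_prev = max(prev_scores,
                            key=lambda p: prev_scores[p] + logp(p, w))
            scores[w] = prev_scores[best_prev] + logp(best_prev, w)
            ptrs[w] = best_prev
        backptrs.append(ptrs)
        prev_scores = scores
    # Trace the back-pointers from the best final word.
    last = max(prev_scores, key=prev_scores.get)
    path = [last]
    for ptrs in reversed(backptrs[1:]):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path, prev_scores[last]


lattice = [["我"], ["愛", "要", "喜歡"], ["你"]]
best_path, best_score = viterbi(lattice, bigram_logp)
print(best_path, best_score)
```

Each position only keeps the best score per candidate, which is why the work grows as $n^2$ per position rather than enumerating all $n^m$ sequences.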

5.2.5 Selecting the Most Appropriate POS According to the Corresponding Mandarin Word

We applied the Maximum Entropy Markov Model (MEMM) to the POS tag selection.

Manning and Schütze (1999) stated that “Maximum entropy modeling is a framework for integrating information from many heterogeneous information sources for classification. The data for a classification problem is described as a number of features. Each feature corresponds to a constraint on the model. … Choosing the maximum entropy model is motivated by the desire to preserve as much uncertainty as possible.”

The MEMM includes a set of possible word and tag contexts, or “histories” (H), and the POS tag set (T). […] likelihood of the training data using p: […]

[…] matching strategy; thus, $w_i = m_1 m_2 \cdots m_n$ and, under certain circumstances, […]. If $w_i$ is a known word, the three morpheme features are set to null. Moreover, if $w_i$ is at the beginning or end of a sentence, certain features are likewise given a null value. For instance, when $i = 1$, the feature values of […]

The ten-million-word POS-tagged balanced Sinica corpus of the CKIP Group was used as the training data, and the Viterbi algorithm was used to search for the most probable POS sequence $t$, that is, the sequence for which […] was maximized (Berger, Pietra, & Pietra, 1996; McCallum, Freitag, & Pereira, 2000; Rabiner, 1989; Ratnaparkhi, 1996; Samuelsson, 2003; Tai, 2007; Y.-f. Tsai & Chen, 2004).
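As a rough illustration of how a maximum entropy Markov model tags a sentence, the sketch below scores each tag with a log-linear model over features of the history and decodes with Viterbi. Everything specific here is an assumption: the two-tag set, the feature templates, and the weights are hypothetical, the morpheme features for unknown words are omitted, and in a real system the weights would be trained on the tagged corpus. Only the use of a NULL value at sentence boundaries follows the description above.

```python
import math

TAGS = ["N", "V"]  # hypothetical toy tag set

# Hypothetical feature weights: (feature name, tag) -> weight.
WEIGHTS = {
    ("word=我", "N"): 2.0,
    ("word=愛", "V"): 2.0,
    ("word=你", "N"): 2.0,
    ("prev_tag=N", "V"): 1.0,
    ("prev_tag=V", "N"): 1.0,
    ("prev_tag=NULL", "N"): 0.5,
}


def features(words, i, prev_tag):
    """Active features of the history at position i (NULL at boundaries)."""
    prev_word = words[i - 1] if i > 0 else "NULL"
    next_word = words[i + 1] if i + 1 < len(words) else "NULL"
    return [f"word={words[i]}", f"prev_word={prev_word}",
            f"next_word={next_word}", f"prev_tag={prev_tag}"]


def local_logp(words, i, prev_tag, tag):
    """log p(tag | history) under a log-linear model normalized over TAGS."""
    feats = features(words, i, prev_tag)

    def score(t):
        return sum(WEIGHTS.get((f, t), 0.0) for f in feats)

    log_z = math.log(sum(math.exp(score(t)) for t in TAGS))
    return score(tag) - log_z


def decode(words):
    """Viterbi search for the tag sequence with the highest total log-prob."""
    prev = {"NULL": (0.0, [])}  # tag -> (score, best path ending in tag)
    for i in range(len(words)):
        cur = {}
        for t in TAGS:
            cur[t] = max(
                ((s + local_logp(words, i, pt, t), path + [t])
                 for pt, (s, path) in prev.items()),
                key=lambda item: item[0],
            )
        prev = cur
    return max(prev.values(), key=lambda item: item[0])[1]


print(decode(["我", "愛", "你"]))
```

Because each local distribution is normalized over the tag set, the product of the local probabilities along a path is the probability of that tag sequence, which is the quantity the Viterbi search maximizes.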