Chapter 5 POS Tagging Method
5.4 Error Analysis
5.4.4 Propagation Error
Four of the POS tagging errors were probably due to the occurrence of a previous POS tagging error. These are categorized as propagation errors and include one unknown word.
5.4.5 Other Cases
The personal name “⣑岄” in “⣑岄 ah/Thian-sù ah” (not an unknown word) was tagged as “A,” and the suffix “ah” was tagged as “T” or “Di.” This occurred twice: in one instance the selected Mandarin word was “⓲,” and in the other it was “Ḯ.”
The Taiwanese word “⮵/tùi” is generally synonymous with the Mandarin word “⽆.” This word appeared nine times in the test data. The system selected the Mandarin word “⮵” seven times and the word “⽆” twice as its counterpart. In both cases, however, the POS tag of the word was “P,” so the different word choice did not affect the accuracy of the POS tagging.
There were also 18 other errors, which occurred mainly because we could not clearly determine the proper POS tags for the words at the time.
5.4.6 Summary of Error Conditions
A summary of the causes of the errors made during the POS tagging and their frequency percentages is given in Table 5 - 7.
Table 5-7 The Reasons for the POS Tagging Errors
Reason                                     Count   Percentage (%)   Remark
Selection of inappropriate Mandarin word   13      27.1
Absence of appropriate Mandarin word       2       4.2
Unknown word                               8       16.7
Personal name                              4       8.3
Propagation error                          4       8.3              Includes an unknown word
Total                                      30      62.5             After discounting the repeat count
5.5 Discussion
5.5.1 Is Improvement Possible ?
Ideally, the foregoing errors would be resolved, and this method could then achieve an accuracy rate of 96.8% in Taiwanese POS tagging (only the 18 other errors would remain among the 564 test words). However, this goal is clearly difficult to realize.
Because the Taiwanese word order differs from the Mandarin word order, the selection of an incorrect Mandarin word, and consequently incorrect POS tagging, occurred with high probability. Although new entries can be added to the OTMD to resolve the absence of appropriate Mandarin word choices, the accuracy rate could only be raised by about 5%.
The unknown word problem was the second leading cause of POS tagging errors. From the Mandarin perspective, these words are not actually unknown words; this problem mostly resulted from the fact that translations between different languages are not one-to-one mappings. Another significant factor involves the use of hyphens in the POJ script, as their usage has not yet been standardized. It is probable that due to the use of Han characters, word boundaries are relatively vague in the different languages of the Chinese language family.
5.5.2 Hyphen Problems, Distinction between Taiwanese and Mandarin
In Taiwanese, some words are written in the POJ script and therefore use hyphens. Hyphens serve two purposes: they separate the syllables within a word, so that each syllable can correspond to a Han character, and they mark word boundaries, since hyphen-linked syllables form one word while a space separates each word. Each syllable in a hyphenated word represents a unigram. Unfortunately, Han character writing has no original word boundaries that correspond to the hyphenated words.
In addition, Taiwanese has around 3,000 legal syllables, whereas Mandarin has around 1,200 legal syllables (K.-i. Chan, 2008). Because of this, it may be said that the Taiwanese language has more single-syllable words. However, as a single-syllable word may have several corresponding Han characters, the use of two-syllable or multi-syllable words resolves most of the problems.
For instance, if the Taiwanese word “忁ᾳ” is written as “chit ê” (no hyphen used), the syllable “chit” may correspond to several Mandarin words, such as “忁,” “借,” “岒,” “䷼,” etc. The syllable “ê” may also correspond to several Mandarin words, such as “䘬,” “ᾳ,” “朳,” etc. If the word is written as “chit-ê” (hyphenated), it is usually read directly as “忁ᾳ.” Hence, under the POJ script, a writer may tend to use a hyphen to link two single-syllable words if they are likely to form one composite word or one phrase. Present practice shows that the word “忁ᾳ” may appear either hyphenated or as separate syllables, which creates inconsistencies.
The use of hyphenated words creates the problem of one Taiwanese word corresponding to two Mandarin words. If the original text is not revised and the corresponding Mandarin word turns up as an unknown word, it may be possible to simply remove the hyphen and try again. This method may reduce the chance of POS tagging errors caused by unknown words.
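The retry strategy described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `lookup_candidates`, the toy dictionary, and its pinyin-style placeholder entries are hypothetical and do not reflect the actual OTMD interface or contents.

```python
def lookup_candidates(word, dictionary):
    """Return one list of Mandarin candidates per lookup unit.

    The word is first looked up as written. If it is unknown and
    hyphenated, the hyphens are removed and each syllable is looked
    up on its own. None is returned when the lookup still fails.
    """
    if word in dictionary:
        return [dictionary[word]]
    # Split on hyphens and drop empty pieces (e.g. from a double hyphen).
    parts = [p for p in word.split("-") if p]
    if len(parts) > 1 and all(p in dictionary for p in parts):
        return [dictionary[p] for p in parts]
    return None


# Hypothetical toy dictionary: POJ syllable -> placeholder Mandarin candidates.
TOY_DICTIONARY = {
    "chit": ["zhe", "ci"],
    "ê": ["ge", "de"],
}

print(lookup_candidates("chit-ê", TOY_DICTIONARY))
```

In this sketch the hyphenated form is tried first, so an entry that exists for the whole word always takes precedence over the syllable-by-syllable fallback.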
5.5.3 The Distinction between Different Eras or Different Genres
To see whether the era or the literary genre of a text affects the accuracy rate of the POS tagging, please refer to the data shown in the following tables. Table 5-8 shows the POS tagging accuracy rates for texts of three literary genres, and Table 5-9 shows the POS tagging accuracy rates for literary works belonging to three different eras. Table 5-8 shows that the POS tagging accuracy rate for novels is comparatively lower, whereas Table 5-9 indicates no significant difference in the POS tagging accuracy rates for the literary works of different eras. However, because the amount of data available is limited, further empirical studies are necessary to confirm these findings.
Table 5-8 Tagging Accuracy Rates for Different Genres
Genre   No. of words   No. of tagging errors   Accuracy rate (%)
prose   277            21                      92.4
drama   58             4                       93.1
novel   229            23                      90.0
Table 5-9 Tagging Accuracy Rates for Different Eras
Era              No. of words   No. of tagging errors   Accuracy rate (%)
Ching Dynasty    186            15                      91.9
Japanese-ruled   212            17                      92.0
Post-war         166            16                      90.4
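The accuracy rates in the two tables are consistent with the simple ratio (words - errors) / words. The short sketch below recomputes them from the counts given above; the function and variable names are ours and purely illustrative.

```python
def accuracy_rate(n_words, n_errors):
    """Percentage of correctly tagged words, rounded to one decimal place."""
    return round(100.0 * (n_words - n_errors) / n_words, 1)


# (number of words, number of tagging errors), copied from the two tables.
GENRES = {"prose": (277, 21), "drama": (58, 4), "novel": (229, 23)}
ERAS = {"Ching Dynasty": (186, 15), "Japanese-ruled": (212, 17), "Post-war": (166, 16)}

for name, (words, errors) in {**GENRES, **ERAS}.items():
    print(name, accuracy_rate(words, errors))

# Both tables cover the same test data, so either one gives the overall rate.
total_words = sum(words for words, _ in GENRES.values())
total_errors = sum(errors for _, errors in GENRES.values())
print("overall", accuracy_rate(total_words, total_errors))
```

Summing either table gives 564 words and 48 errors, which reproduces the overall accuracy rate of 91.5%.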
5.6 Summary
We proposed a Taiwanese POS tagging method using a statistical method and Mandarin training data, and achieved an accuracy rate of 91.5%. Because Taiwanese training data are lacking, we used Mandarin as an intermediate language. We consider this an important idea, since the same strategy could also be applied to other languages that lack resources. It is preferable to select an intermediate language that is close to the target language in terms of language family.
We also developed an online Taiwanese word segmentation and POS tagging system for people who are interested in this topic. Users can input Taiwanese text and get POS tagging results. It is somewhat difficult for a user to prepare both POJ and HR mixed scripts; therefore, the system also works when one of these two scripts is absent (Lau & Iunn, 2007). However, this decreases the accuracy rate.
If we can construct a Taiwanese-Mandarin parallel corpus, we can then use other methods, like the Coerced Markov Models proposed by Fung and Wu (1995), to do the Taiwanese POS tagging task.
A more suitable tagset for Taiwanese and an electronic word dictionary based on the Taiwanese word segmentation standard are necessary for advanced searches. We hope that we can proceed to the construction of the Taiwanese Treebank.
Chapter 6 Conclusion and Future Work
6.1 Our Contributions to Written Taiwanese Resources and Processing
We have described the tasks we have performed in written Taiwanese related research.
For digital written Taiwanese resources, which form an important infrastructure, we have established:
(a) A 22,000 entry online Taiwanese syllable dictionary (OTSD). This had a total of more than 290,000 searches from more than 32,000 different IP addresses (as of January 2003), with more than 250 searches per day for the past year (as of December 30, 2008) (Iunn, 2003c, 2003f);
(b) A 62,000 entry online Taiwanese-Mandarin dictionary (OTMD). This had a total of more than 2.4 million searches from more than 125,000 different IP addresses (as of December 2002), with more than 2,700 searches per day for the past year (as of December 30, 2008); we also developed a Google gadget interface for the OTMD (Iunn, 2000, 2002, 2003g, 2007c);
(c) A Taiwanese corpus with 5,800,000 syllables in HR mixed script and 3,400,000 syllables in POJ script, and the Online Taiwanese Concordancer System (OTCS) based on this corpus. This had a total of nearly 1,900,000 searches from more than 56,400 different IP addresses (as of January 2003), with about 1,630 searches per day for the past year (as of December 30, 2008) (C.-C. Cheng et al., 2007; Iunn, 2003b, 2003e; Iunn & Lau, 2007);
(d) A Preliminary Taiwanese Word Frequency Report for the Taiwanese POJ and HR mixed scripts based on the above Taiwanese corpus (Iunn, 2005b, 2005c);
(e) A 2,580,000-word Digital Archive Database for Written Taiwanese (2nd stage) (DADWT), which contains literature data with paragraph-aligned POJ and HR mixed scripts. This had a total of more than 1,320,000 page visits (as of December 2006), with 1,672 page visits per day on average. We also developed a Google gadget interface for the DADWT, which can randomly select an article (Iunn, 2006a, 2007b, 2007d).
For the coding and I/O of POJ, we proposed a two-stage search method via string matching and a filter program. We also proposed a query expansion scheme for toneless, glottal stop, checked syllable, and vowel searches, and described a display method. These problems are quite different from those of other languages, such as English and Mandarin. The Taiwanese syllable query expansion is an important achievement since no other systems fully provide these functions. We also provide the first online Taiwanese word segmentation system.
In relation to the processing techniques, we translated every word into Mandarin via the OTMD, obtained the POS information from the CED made by the CKIP group, proposed a rule-based tone sandhi algorithm to solve the Taiwanese tone sandhi problem, and implemented an online text-to-speech system to read out the Taiwanese literature data for users (Iunn, 2006a). We achieved accuracy rates of 97.4% and 89.0% for the training and test data, respectively. These accuracy rates are higher than those reported in other research so far.
We also proposed a statistics-based POS tagging method using the OTMD and a 10-million-word Mandarin training database to tag Taiwanese text. We followed the tagset drawn up by CKIP, aligned the words of the POJ script and the HR mixed script, searched the OTMD to find corresponding Mandarin candidate words, selected the most suitable Mandarin word using an HMM probabilistic model built from the Mandarin training data, and tagged the word using an MEMM classifier. We achieved an accuracy rate of 91.5% in this work. It is difficult to make a comparison with other research since the tagsets are different.
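To illustrate the Mandarin word selection step, the following sketch runs a Viterbi search over the candidate lists returned by a dictionary lookup. It is a simplification and not our actual implementation: a bigram model over Mandarin words stands in for the HMM, every dictionary candidate is assumed to be equally likely given the Taiwanese word, and all names and probabilities below are invented for illustration.

```python
import math


def select_mandarin_words(candidates, bigram_logp, default_logp=math.log(1e-6)):
    """Pick one Mandarin word per position with a bigram Viterbi search.

    candidates:  list of non-empty candidate lists, one per Taiwanese word.
    bigram_logp: dict mapping (previous_word, word) to a log probability;
                 "<s>" marks the start of the sentence.
    Unseen bigrams receive default_logp (an assumed smoothing constant).
    """
    # best maps a word to (score of the best path ending in it, that path).
    best = {"<s>": (0.0, [])}
    for options in candidates:
        new_best = {}
        for word in options:
            score, path = max(
                (prev_score + bigram_logp.get((prev, word), default_logp), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values())[1]


# Hypothetical bigram log probabilities with pinyin placeholders for Mandarin words.
TOY_LOGP = {
    ("<s>", "wo"): math.log(0.5),
    ("wo", "cong"): math.log(0.2),
    ("wo", "dui"): math.log(0.1),
    ("cong", "nali"): math.log(0.4),
    ("dui", "nali"): math.log(0.01),
}

# The second Taiwanese word has two Mandarin candidates in this toy example.
TOY_CANDIDATES = [["wo"], ["dui", "cong"], ["nali"]]
print(select_mandarin_words(TOY_CANDIDATES, TOY_LOGP))
```

In this toy example the ambiguity in the second position is resolved by its neighbors. In the method described above, the selected Mandarin words would then be passed to the MEMM classifier for tagging, which this sketch does not cover.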
Since we have been performing this pioneering work in written Taiwanese related research, other researchers and graduate/PhD students have contacted us to get Taiwanese texts in order to do related research. Most of them have used the Taiwanese corpus and the DADWT, including K.-i. Chan (2008) and Niu (2004).
Yi-fen Huang, a PhD student at CMU1, put part of the DADWT data into the CMU SPICE2 system to include Taiwanese text-to-speech and speech recognition.
Hong-tin Teng, an assistant professor in the Department of Taiwanese Languages and Literature of National Taichung University, utilized the contents of our website to teach her students. In addition, the SMHLA project intends to use the method we have proposed to perform the POS tagging task for Hakka.
Other researchers have not contacted us, but did their research using the OTCS or DADWT, including Chang (2007), S.-L. Chen (2006), Y.-F. Cheng (2007), and Liao (2008). In the “Workshop on The Use of Language Corpora of Taiwan ‘⎘䀋 婆妨婆㕁⹓ἧ䓐ⶍἄ⛲,’ ” held on February 2 and 3, 2007, at National United University, Ying Cheng introduced Taiwan Southern Min research using the OTCS (MOE Advisory Office, 2007).
At times we have received email from strangers asking us to fix our website when it had problems. Additionally, an undergraduate studying Taiwanese languages once told us that they could not finish their homework when our website was out of service3.
2 Speech Processing - Interactive Creation and Evaluation.
3 Personal communication in July 2008.
6.2 Future Work and Prospects for Written Taiwanese Processing Research
It might be possible to improve the handling of Taiwanese tone sandhi in the following ways:
(a) Solicit assistance from linguists. It is hoped that linguists will define a standard for part-of-speech analysis and word segmentation, and that a dictionary conforming to such a standard will be built.
(b) Improve word segmentation, especially the processing of morphology, quantitative words, and proper nouns.
(c) Improve the processing of POS tags to account for ambiguity.
(d) Change the dictionary’s POS tags, such as by making use of Embree’s POS analysis (Embree, 1984).
(e) Improve the sandhi rules.
(f) Find alternative ways of modeling sandhi processing such as template theory or optimality theory.
(g) Use a machine learning method to model tone sandhi processing if we can construct a corpus with tone sandhi markers.
In relation to the Taiwanese POS tagging, if we could construct a Taiwanese-Mandarin parallel corpus, we could then use other methods, like the Coerced Markov Models proposed by Fung and Wu (1995), to do the Taiwanese POS tagging task.
Based on the results mentioned above, we outline a blueprint for our future written Taiwanese processing research.
First, in relation to the infrastructure, we need to amend the original data and prepare the following items:
(a) A suitable tagset for the Taiwanese language;
(b) An electronic word dictionary based on a word segmentation standard.
Second, we need to establish a Taiwanese corpus with syntactic tags. We also need to establish a Taiwanese corpus with semantic tags, which requires a discussion of semantic roles in Taiwanese. The suggested corpus should provide both POJ and HR mixed script transcriptions as well as tone sandhi markers.
Finally, we can construct a Taiwanese treebank.
If we submit the Taiwanese corpus to the Linguistic Data Consortium (LDC), the status of this language will be promoted (UPenn, 1992).
In addition, speech and text conversion techniques, and an OCR technique (from images to text, like a Google Book search (Google Inc., 2007)) for Taiwanese are also important.
On the other hand, once Taiwanese corpora have been established, for example at the 10-million-word level or above, it will be possible for us to develop the field of Taiwanese applied linguistics via computational research.
We think that some of the issues worth investigating include:
(a) Zipf’s law:
Zipf’s law states that, given a large corpus of natural language in which the words are listed in descending order of frequency, with $f$ the frequency of a word and $r$ its rank, then $f \propto \frac{1}{r}$. Mandelbrot derived a more general relationship between rank and frequency: $f = P(r + \rho)^{-B}$, where $P$, $B$, and $\rho$ are the parameters of a text. These parameters are different for different languages. What are the parameters of Taiwanese (Manning & Schütze, 1999)?
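As a rough illustration of how such parameters might be estimated from a segmented corpus, the sketch below builds the rank-frequency list and fits the exponent $B$ by least squares in log-log space. It covers only the special case $\rho = 0$ of Mandelbrot's formula, and the function names and toy data are our own assumptions; fitting $\rho$ as well would require a nonlinear fit.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def fit_zipf_exponent(pairs):
    """Estimate B in f = P * r**(-B) by a least-squares line through
    (log r, log f). Requires at least two ranks."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Toy corpus whose word frequencies are exactly 120 / rank.
toy_tokens = []
for rank in range(1, 6):
    toy_tokens += ["w%d" % rank] * (120 // rank)

toy_pairs = rank_frequency(toy_tokens)
print(toy_pairs)
print(fit_zipf_exponent(toy_pairs))
```

Because the toy frequencies follow $f = 120/r$ exactly, the fitted exponent is 1 up to floating-point error. Real corpus data would deviate from the line, especially at the highest and lowest ranks.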
(b) Lexicography:
“Collins COBUILD Learner’s Dictionary” is the first dictionary to use the computational corpus-based approach. Its compilers used word frequency data to select the words to include in the dictionary, with contextualized examples drawn from the corpus. There are also other corpus-based dictionaries, like “The New Oxford Dictionary of English,” “The Oxford-Hachette French Dictionary,” etc. Can we compile a Taiwanese dictionary from a corpus?
(c) Lexical change:
Every language changes day by day, but many people believe that the Taiwanese language is changing more rapidly than other languages, mainly under the influence of Japanese and Mandarin, because of political or historical factors. If we build a diachronic Taiwanese corpus with data from different time periods, we can get more precise quantitative data to describe this phenomenon (Iunn & Kao, 2004; Khu, 2008; Li, 2000; McEnery, Xiao, & Tono, 2006).
(d) Script selection for HR mixed script:
Though the written Taiwanese orthography has not yet been standardized, specific written forms are gradually being accepted through common practice. We think the mainstream written Taiwanese orthography is the HR mixed script. What percentage of the HR mixed script is written in POJ, counted by word types and by word tokens? Why does a writer select POJ or a Han character? Are there different selection attitudes in different genres? The Taiwanese corpus may give us more satisfactory answers (K.-i. Chan, 2008).
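The first of these questions could be approached with a count such as the one sketched below. It is a minimal sketch under several assumptions of ours: the text is already segmented by spaces, Han characters are detected by two Unicode ranges only, a token mixing both scripts is counted as neither, and the sample sentence is invented for illustration.

```python
def is_han(ch):
    """True for characters in the basic CJK ideograph blocks (assumed ranges)."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"


def script_of(token):
    """Classify a token as 'han', 'poj', 'mixed', or 'other' (no letters)."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "other"
    han_count = sum(is_han(ch) for ch in letters)
    if han_count == len(letters):
        return "han"
    if han_count == 0:
        return "poj"
    return "mixed"


def poj_share(text):
    """Return the share of POJ words by tokens and by types."""
    tokens = [t for t in text.split() if script_of(t) != "other"]
    if not tokens:
        return 0.0, 0.0
    types = set(tokens)
    token_share = sum(script_of(t) == "poj" for t in tokens) / len(tokens)
    type_share = sum(script_of(t) == "poj" for t in types) / len(types)
    return token_share, type_share


# Invented HR mixed script sample: six tokens, five distinct types.
sample = "我 beh 去 台北 chhit-thô 我"
print(poj_share(sample))
```

For the sample, two of the six tokens and two of the five types are written in POJ. Running the same count separately on each genre of a corpus would give a first quantitative view of the script selection question.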
(e) Co-occurrence of words:
Word usage is an important factor in language learning. It is necessary for us to establish Taiwanese collocation data via the Taiwanese corpus.
(f) Machine translation:
Taiwan is a multi-ethnic and multi-lingual society, whose languages interact with each other frequently. It is necessary to develop language translation systems, such as Taiwanese/Mandarin, Taiwanese/Hakka, Taiwanese/Aboriginal languages, etc.
On the other hand, Taiwanese/English and Taiwanese/Japanese translations are also important when we want to communicate with the international community.
Housewives are also a societal reality in Taiwan, and translation between Taiwanese and Southeast Asian languages is becoming increasingly important. However, we think that this will be difficult to realize in the near future due to a lack of resources.
Written Taiwanese processing and Taiwanese computational linguistics are nearly uncultivated fields, and need many researchers.
References
Academia Sinica. (2008). Southern Min and Hakka Language Archive. Retrieved 1/24, 2009, from http://www.ling.sinica.edu.tw/files/SelectedResearchProject971111-12.pdf
Benenson, A. Transliterator (ToCyrillic). Retrieved 11/29, 2008, from https://addons.mozilla.org/zh-TW/firefox/addon/883?id=883&application=firefox
Berger, A. L., Pietra, S. A. D., & Pietra, V. J. D. (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
Brill, E. (1993). Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the DARPA Speech and Natural Language Workshop (pp. 237-242).
Chan, K.-i. (2008). Comparison with the Usage of Academic and Non-academic Taiwanese Words '⎘婆⬠埻栆朆⬠埻栆䘬娆⼁ἧ䓐㭼庫'. National Taitung University, Taitung.
Chan, K.-k. (1997). The Discussion of Taiwanese Word Segmentation Principles '⎘婆 㕟娆⍇⇯妶婾'. In The Project Report for the Collecting, Cataloging and Select Editing of Taiwanese Literature Publications '⎘䀋㔯⬠↢䇰䈑㓞普ˣ䚖澍ˣ怠 澨䶐廗妰䔓䳸㟰⟙⏲' (pp. 45-72). Taipei: Council for Culture Affairs '㔯⺢㚫'.
Chang, C.-l. (2007). A Comparative Study on the Verb "qi-lai" in Mandarin Chinese and Taiwan Southern Min '厗救婆嵐⎹≽娆ˬ崟Ἦ˭ᷳ婆佑⎍㱽≇傥䞼䨞'. National Sun Yat-sen University '⚳䩳ᷕⰙ⣏⬠', Kaohsiung.
Chen, M. Y. (2000). Phonological Phrase as a Sandhi Domain. In Tone Sandhi: Patterns Across Chinese Dialects. Cambridge Univ. Press.
Chen, S.-L. (2006). The Use and Grammaticalization of Taiwan Southern Minˬ䓇˭ ' 冢䀋救⋿婆ˬ䓇˭䘬䓐㱽⍲℞婆㱽⊾㬟䦳'. ⚳䩳㕘䪡㔁做⣏⬠, Hsinchu.
Cheng, C.-C., Ho, D.-a., Hsiao, S.-y., Chiang, M.-h., & Chang, Y.-l. (Eds.). (2007). Multiculturalism Thinking of the Language Policy '婆妨㓧䫾䘬⣂⃫㔯⊾⿅侫'. Taipei: Institute of Linguistics of Academia Sinica.
Cheng, R. L. (1990). In the evolution of Taiwan's society and language literacy '㺼嬲ᷕǫ⎘䀋䣦㚫婆㔯'. Taipei: 冒䩳.
Cheng, R. L. (1997). Taiwanese and Mandarin Structures and Their Developmental Trends in Taiwan Book I: Taiwanese Phonology and Morphology '⎘婆䘬婆枛冯娆㱽'. Taipei City: Yuan-liou Publishing Co. '怈㳩'.
Cheng, R. L. (2002). 婆㱽㧉㜧ᶲ䘬倚婧嬲⊾ʇʇ娵䞍⍲㷔槿 'Tone Sandhi on Template Grammar -- Cognition and Test', 1st International Conference on Taiwanese Romanization. Taitung: Taiwanese Romanization Association.
Cheng, Y.-F. (2007). Patterns of Negative Words of A-not-A Questions in Taiwan Southern Min '⎘䀋救⋿婆㬋⍵⓷⎍ᷕ⏎⭂≑≽娆䘬䲣䴙'. National Tsing-hua University '⚳䩳㶭厗⣏⬠', Hsinchu.
Chhong-bi Memorial Foundation. TBTS Taiwanese Writing Forum. Retrieved 12/1, 2008, from http://chhongbi.org/index2.html
Chiang, Y.-c. (2004). Dai-im Input Method. Retrieved 12/30, 2008, from http://taiwantp.net/eternity/holodownload.htm
Chiunn, U.-b. (2008). Taiwanese and Hakka Modern Literature Website '⎘婆⍲⭊婆䎦 ẋ㔯⬠⮰柴䵚䪁'. Retrieved 11/29, 2008, from http://140.116.10.241/NCKUTaiWeb/View/index.aspx
Chou, S.-y. (2006). T3 Taiwanese Treebank and Brill Part-of-Speech Tagger 'T3⎘婆⇾㜸㧡婆㕁⹓冯Brill娆栆㧁姀'. National Tsing Hua University, Hsin-chu.
CKIP. (1993). Analysis of Chinese Part-of-speech 'ᷕ㔯娆栆↮㜸'. Taipei: The Association for Computational Linguistics and Chinese Language Processing.
CKIP. (2004). Chinese Word Segmentation and Tagging System. Retrieved 11/22, 2008, from http://ckipsvr.iis.sinica.edu.tw/
Embree, B. L. M. (1984). A dictionary of Southern Min '⎘劙录℠'. Taipei: Taipei