
We explained in Section 4.2 that we used Wlist and Blist as new datasets to test how our system would perform on unforeseen data.

Table XVI and Table XVII provided in the Appendix show that the experimental results for Wlist and Blist were not different from the results for Elist and Jlist, which we discussed in Section 5.5. The inclusion rates were good, as we discussed in Section 5.4. Using the top 10 candidate characters enabled us to catch most of the errors that we were able to capture with the complete lists. Using PMI and ENOPs to rank the candidate characters achieved performance profiles of similar quality. SS lists and SC3 lists performed best if we had to use only one of the lists to capture the phonologically and the visually related errors, respectively. In addition, SD lists complemented the SS lists to find those phonologically related errors.

Table XVII. AIRs for Ranking the Candidates for Blist Based on Frequencies and PMIs

            R1    R2    R3    R4    R5    R6    R7    R8    R9    R10
Frequency
  SS       69.1  78.3  80.3  80.9  81.5  81.8  81.8  81.8  81.8  81.8
  SD       20.1  22.0  22.9  23.9  23.9  23.9  23.9  23.9  23.9  23.9
  MS        3.2   3.2   3.2   3.2   3.2   3.2   3.2   3.2   3.2   3.2
  MD        1.6   1.9   1.9   1.9   1.9   1.9   1.9   1.9   1.9   1.9
  Phone    71.3  82.5  86.3  89.2  90.4  92.4  93.0  93.9  93.9  93.9
  SC1      64.2  71.6  76.1  76.9  77.6  77.6  77.6  77.6  77.6  77.6
  SC2      59.7  67.2  70.1  72.4  73.9  74.6  74.6  74.6  74.6  74.6
  SC3      75.4  85.1  87.3  88.1  88.8  88.8  89.6  89.6  90.3  90.3
  RS        3.0   3.7   3.7   3.7   3.7   3.7   3.7   3.7   3.7   3.7
  Visual   71.6  79.9  83.6  86.6  88.1  89.6  89.6  89.6  89.6  89.6
PMI
  SS       64.3  75.8  77.4  80.3  80.9  81.2  81.5  81.5  81.5  81.8
  SD       19.7  22.3  22.6  23.9  23.9  23.9  23.9  23.9  23.9  23.9
  MS        3.2   3.2   3.2   3.2   3.2   3.2   3.2   3.2   3.2   3.2
  MD        1.6   1.6   1.6   1.6   1.6   1.9   1.9   1.9   1.9   1.9
  Phone    67.5  83.4  86.9  89.5  90.8  92.4  93.0  93.0  93.0  93.0
  SC1      63.4  73.1  76.9  77.6  77.6  77.6  77.6  77.6  77.6  77.6
  SC2      61.9  70.9  73.9  73.9  74.6  74.6  74.6  74.6  74.6  74.6
  SC3      75.4  83.6  87.3  89.6  89.6  89.6  89.6  89.6  89.6  89.6
  RS        3.7   3.7   3.7   3.7   3.7   3.7   3.7   3.7   3.7   3.7
  Visual   74.6  82.8  87.3  89.6  90.3  91.0  91.0  91.0  91.0  91.0
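The two ranking schemes compared in this table (raw frequencies and PMIs) can be summarized with a short sketch. The fragment below is only a minimal illustration under our own assumptions: the exact counts and the pair of events whose PMI is computed follow the definitions in Section 5.5, and the function names are ours, not the system's.

```python
import math

def pmi(joint_count, count_x, count_y, total):
    """Generic pointwise mutual information from raw counts:
    PMI(x, y) = log( P(x, y) / (P(x) * P(y)) )."""
    if joint_count == 0:
        return float("-inf")
    p_xy = joint_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

def rank_candidates(candidates, score, top_k=10):
    """Sort candidate characters by a score (raw frequency or PMI), best first,
    and keep only the top k, as in the R1..R10 columns of Table XVII."""
    return sorted(candidates, key=score, reverse=True)[:top_k]
```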

6. APPLICATIONS

With the capability to capture the actual errors that occurred while people typed and wrote Chinese, we can apply our techniques to computer-assisted language learning and to the related fields that we mentioned in Section 1.

The most obvious application is to help teachers prepare test items for “Incorrect Character Correction” tests (ICC tests). In such tests, students have to find and correct an incorrect Chinese character in a given sentence, for example, “小明昨天參加了校外施行” (Ming took part in the field trip yesterday; xiao3 ming2 zuo2 tian1 can1 jia1 le1 xiao4 wai4 shi1 xing2). In this Chinese string, “施” (shi1) is incorrect and should be changed to “旅” (lu3) to make the statement correct. This is a very common type of test in the assessment of Chinese language proficiency.

To prepare such test items, teachers may wish to check whether their students recognize certain characters in their correct forms, for example, “旅” in “旅行” (lu3 xing2) in the previous example. When preparing the test items, teachers have to figure out the most appropriate incorrect character to stand in for the correct character. Depending on the level of difficulty and the purpose of the test, they may prefer incorrect characters that are visually similar or phonologically similar to the correct character. For phonologically similar characters, teachers may prefer to select incorrect characters that were recommended with the SS, SD, MS, and MD criteria. From this viewpoint, we should present candidate characters by their categories; the union lists are not the best choice.

Figure 5 shows a snapshot of the user interface of our prototype16 that aims to help teachers prepare test items for ICC tests. In this example, a teacher requested errors that were visually similar (“形體相似” in the figure; xing2 ti3 xiang1 si4) and errors that had the same sound and same tone (“同音同調” in the figure; tong2 yin1 tong2 diao4), and our system returned only the top three candidates. This is a conservative design, given that we are able to capture a large percentage of the actual incorrect characters of previously observed errors with the top 10 candidates (Sections 5.5 to 5.7).

16See http://140.119.164.139/biansz/bianszindex v2.php. This is our own service and it is always open, except when we experience power outage problems.

Fig. 5. The interface of our prototype for assisting teachers to prepare test items for the “Incorrect Character Correction” tests.

This authoring tool can be evaluated by how often the recommended characters are adopted by teachers in their test items. This style of evaluation is the same as what we carried out in Sections 5.4 through 5.7. The correct words in the lists in Table VI serve as the target test items, and the actual incorrect characters are the teachers’ choices. From this viewpoint, we have conducted an evaluation with more than 4,100 individual tests. The observed inclusion rates, which were presented in Table XI, show that our system was able to offer candidate characters that included the teachers’ choices. The accumulated inclusion rates, which were presented in Table XIII through Table XVII, further indicate that, by providing no more than 10 candidate characters in different categories of similar characters, our system maintained its efficacy for assisting the compilation of test items for ICC tests.
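The metric behind these tables is simple to state. The sketch below is our own minimal rendering of it, assuming that the accumulated inclusion rate (AIR) at rank k is the percentage of test cases whose actual incorrect character appears among the top k recommended candidates; the function name and data layout are ours.

```python
def accumulated_inclusion_rates(test_cases, max_rank=10):
    """test_cases: iterable of (actual_incorrect_char, ranked_candidates).
    Returns a list whose (k-1)-th entry is the percentage of cases whose actual
    incorrect character appears among the top-k ranked candidates."""
    hits = [0] * max_rank
    n = 0
    for actual, candidates in test_cases:
        n += 1
        for k in range(1, max_rank + 1):
            if actual in candidates[:k]:
                hits[k - 1] += 1
    return [100.0 * h / n for h in hits] if n else [0.0] * max_rank
```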

In addition to assisting the preparation of test items for Chinese tests, we can employ the lists of similar characters for the automatic detection of errors in Chinese text (e.g., Zhang et al. [2000]). The statistics discussed in Section 4.3 show that a large portion of errors in Chinese texts are related to characters that have the same or similar pronunciations, and a previous work applied phonetic information to this error-detection task based on related arguments [Huang et al. 2008]. Using both visually and phonetically similar characters along with statistical methods, we significantly improved the performance of Huang et al.’s system and of two other systems reported in the literature [Wu et al. 2010].

We plan to offer a free and open Web-based service to the research community. The service will allow users to enter queries to search for Chinese characters that meet certain conditions. With a minor change of the interface shown in Figure 5, we can offer psycholinguistic researchers the neighbor words (e.g., Lo and Hue [2008], Tsai et al. [2006]) of a given Chinese word for their studies. In fact, we are applying our system to support the design of educational games for cognition-based learning of Chinese characters (cf. Lee et al. [2010]). Moreover, our work can be used to find suggested queries when users of search engines enter incorrect words [Croft et al. 2010, p. 197]. Although we are not experts in the recognition of Chinese characters either in printed (i.e., OCR) or in written form, we can help researchers find the confusion sets for Chinese characters [Fan et al. 1995] more efficiently.

7. DISCUSSIONS

The experimental results presented in Section 5 showed that it is relatively easy to capture errors that are related to phonological similarity, and relatively harder to catch errors that are related to visual similarity.

Using the information about the pronunciations of characters that is available in Chinese lexicons is very effective for reproducing phonologically related errors. Selecting candidate characters with the SS and SD criteria was most fruitful. The main reason that our program did not catch some errors that were related to phonological similarity was that our lists of confusing phonemes (cf. Table I) did not contain the types of errors that actually occurred. This can happen when the error types are not considered significant in psycholinguistic studies but nevertheless occur once in a while in reality.
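As an illustration of how lexicon pronunciations support the SS (same sound, same tone) and SD (same sound, different tone) criteria, the following sketch groups characters by their syllables and tones. It assumes a toy pronunciation table and is not the actual candidate generator; the table entries and function name are ours.

```python
# Hypothetical pronunciation table: character -> (syllable, tone).
PRON = {"施": ("shi", 1), "師": ("shi", 1), "詩": ("shi", 1), "是": ("shi", 4), "旅": ("lv", 3)}

def phonological_candidates(char):
    """Return (SS, SD) candidate lists for char:
    SS = same syllable and same tone, SD = same syllable but different tone."""
    syllable, tone = PRON[char]
    ss, sd = [], []
    for other, (s, t) in PRON.items():
        if other == char or s != syllable:
            continue
        (ss if t == tone else sd).append(other)
    return ss, sd

# Example: phonological_candidates("施") -> (["師", "詩"], ["是"])
```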

Using the extended Cangjie codes proved to be the main reason why we could capture a larger portion of the errors that are related to visual similarity, when we compare the performances of our systems that were implemented in 2007 and 2008. The decision to divide characters into subareas further improves our ability to find similar characters. However, the steps demand subjective decisions about how characters are divided [Lee 2010a], and these decisions influence how well we find the incorrect characters. We discussed an example of the problem, that is, the “弓人一” (gong1 ren2 yi1) problem, in Section 3.4. Another example is the question of how our programs may find the similarity between “” (fu4) and “” (fu2). According to Lee [2010a], the LIDs for these two characters are 4 and 3, respectively (cf. Section 3.3). Hence, their shared component “畐” (fu2) will be saved in two different ways: as “一口田” (yi1 kou3 tian2) at P1 for the former; and as “一口” and “田” at P1 and P2, respectively, for the latter.

To alleviate the problem, we concatenated the substrings of the Cangjie codes into one string and computed the Dice coefficient (Equation (1) in Section 3.4) of the concatenated Cangjie codes for two characters. This strategy proved to be very important.
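A minimal sketch of this strategy follows. It assumes that the Dice coefficient of Equation (1) is computed over the multisets of Cangjie symbols in the two concatenated codes, which may differ in detail from the exact formulation in Section 3.4; the function name is ours.

```python
from collections import Counter

def dice_similarity(cangjie_a, cangjie_b):
    """Dice coefficient between two concatenated Cangjie code strings,
    each treated as a sequence of Cangjie symbols."""
    ca, cb = Counter(cangjie_a), Counter(cangjie_b)
    shared = sum((ca & cb).values())           # symbols common to both codes
    total = len(cangjie_a) + len(cangjie_b)
    return 2.0 * shared / total if total else 0.0

# Example with codes discussed later in this section: "弗" is 中中弓 and "予" is 弓戈弓弓.
print(dice_similarity("中中弓", "弓戈弓弓"))   # they share only the symbol 弓
```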

Using SC3 to select visually similar characters outperformed SC2 and SC1 in all of our experiments.

Although we have achieved good experimental results, using Cangjie codes as the basis for defining the visual similarity between characters does not produce perfect results. The original Cangjie codes may not reflect the complexity, for example, the number of strokes, of a component in a character. A complex component can be represented with a few simple Cangjie symbols; for example, the Cangjie code for “弗” (fu2) is “中中弓” (zhong1 zhong1 gong1). In contrast, a seemingly simple component can be represented with a longer sequence of Cangjie symbols; for example, the complete Cangjie code for “予” (yu3) is “弓戈弓弓” (gong1 ge1 gong1 gong1). This phenomenon may mislead our score functions, that is, SC1, SC2, and SC3, which rely on the lengths of the matched Cangjie codes to determine the degree of similarity between characters.

A possible solution to this problem is to use our own Cangjie codes for the basic elements, but this strategy has its own problems. For instance, we replaced the Cangjie code for “言” (yan2) with “卜一一口” (bu3 yi1 yi1 kou3) in c5, c6, and c7 in Table IV. However, such an operation is extremely subjective and labor intensive. Although we changed the Cangjie codes for a limited number of elements, we cannot guarantee that we have done enough for all possible errors that are related to visual similarity.

Moreover, we were not sure whether we were merely maximizing the performance of our systems on some particular lists of errors. This was the main reason that we collected Wlist and Blist for further experiments, after we had been using Elist and Jlist for an extended period of time. Fortunately, the experimental results for Wlist and Blist remained satisfactory.

Another problem that came up when we built the database of the extended Cangjie codes is the degree of detail to which we should recover the Cangjie codes. Consider this list of characters: “” (wu3), “列” (lie4), “例” (li4), “” (huo3), and “” (mai4). It is probably not easy for everyone to notice that they all share “夕” (xi4) somewhere inside them. To what degree do users pay attention to relatively small elements? Should we consider this factor when we design the score functions to measure the degree of similarity between two characters? This is a hard question for us. The best design may depend on the actual applications, for example, the needs of psycholinguistic experiments [Leck et al. 1995; Yeh and Li 2002].

So far, we have not touched upon the issue that the Cangjie codes do not provide a good mechanism for comparing the similarities between characters that consist of very few strokes. Examples are c1 (田, tian2), c2 (由, you2), c3 (甲, jia3), and c4 (申, shen1) in Table II. Another group of similar characters are “土” (tu3), “士” (shi4), “工” (gong1), “干” (gan1), and “千” (qian1). Differences among these characters are at the stroke level, so we cannot rely on the Cangjie codes to find their similarities. For such characters, the Wubihua encoding method [Wubihua 2010] should be applied. The Wubihua encoding method assigns identification numbers to a selected set of strokes, for example, “1” for horizontal strokes and “2” for vertical strokes. Because there is exactly one canonical way to write a Chinese character, that is, the standard order of the strokes that form the character, one can convert each of the strokes into its Wubihua digit and use the resulting sequence of digits to encode a Chinese character. The Wubihua codes for “土”, “士”, “工”, “干”, and “千” are, respectively, “121”, “121”, “121”, “112”, and “312”; and the Wubihua codes for “田”, “由”, “甲”, and “申” are, respectively, “25121”, “25121”, “25112”, and “25112”. Demanding an exact match between strings as the selection criterion, we find that “土”, “士”, and “工” are more similar to each other than to “干” and “千”. By appropriately integrating the extended Cangjie codes and the Wubihua codes, we will be able to extend our ability to find visually similar characters to a larger scope of characters.
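A minimal sketch of this stroke-level grouping follows, using the Wubihua codes listed above. The integration with the extended Cangjie codes is left open here; the stroke table, the remaining digit conventions (standard Wubihua assigns “3”, “4”, and “5” to left-falling, dot/right-falling, and turning strokes), and the function name are our own assumptions.

```python
from collections import defaultdict

# Wubihua codes quoted in the text: "1" = horizontal stroke, "2" = vertical stroke,
# "3"/"4"/"5" follow the standard Wubihua convention for the remaining stroke types.
WUBIHUA = {
    "土": "121", "士": "121", "工": "121", "干": "112", "千": "312",
    "田": "25121", "由": "25121", "甲": "25112", "申": "25112",
}

def group_by_exact_code(codes):
    """Group characters whose Wubihua codes match exactly (the selection criterion
    described above for stroke-level similarity)."""
    groups = defaultdict(list)
    for char, code in codes.items():
        groups[code].append(char)
    return dict(groups)

# Example: {'121': ['土', '士', '工'], '112': ['干'], '312': ['千'],
#           '25121': ['田', '由'], '25112': ['甲', '申']}
print(group_by_exact_code(WUBIHUA))
```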

For the study of incorrect Chinese characters, we have intentionally put aside an important class of errors at this moment. For written characters, people may write incorrect characters that merely look like correct characters, for example, writing “” (shi4) as “”.17 These so-called pseudo-characters obey the formation principles of Chinese characters but, in fact, do not belong to the language. Such incorrect characters were not considered in the current study because we could not normally enter them into our files, as they are not contained in the font files. Nevertheless, studying this type of error may uncover possible ways in which people memorize Chinese characters, and it opens another door to the mental lexicons of Chinese learners.

Song and his colleagues proposed methods for automatic proofreading of simplified Chinese text [Song et al. 2008]. They consider seven operators for building Chinese characters from their components and propose a set of rules for computing the similarity between Chinese characters. They then employ the similar characters, together with statistical information from language models, to detect possible incorrect words. This line of work is very similar to the work presented in this article, and it will be very interesting to compare the performance of Song et al.’s system and ours on some common test sets.

SJTUD [1988] not only provides a systematic way to decompose simplified Chinese characters, but also lists the decompositions of 11,254 individual characters.

17Reported in the United Daily News (http://www.udn.com.tw) on 19 May 2010.

It will be very interesting to compare the effectiveness of computing visually similar characters with the extended Cangjie codes and with the decompositions in SJTUD [1988].

8. CONCLUSIONS

We found methods to reproduce the errors found in the writing of Chinese script. The methods utilized information about the pronunciations of Chinese characters and heuristic rules derived from observations in psycholinguistic studies to judge the degree of similarity between pronunciations. The methods also employed the extended Cangjie codes and score functions to determine the degree of visual similarity between characters.

We evaluated our approach from three perspectives. In Section 5.3, we compared Web-based statistics collected at different times to show their reliability. In Section 5.4, we showed that our approach could capture the incorrect characters for a diverse scope of test data at satisfactory rates. In Sections 5.5 through 5.7, we applied and compared two different methods to rank the candidate characters in an attempt to capture the incorrect characters with shorter lists of candidates. The experiments were carried out with data that we presented in previous conference articles and with some new data that covered both traditional and simplified Chinese.

In these experiments, it was found that 76% of these errors were related to phonological similarity and that 46% were related to visual similarity between characters.

We showed that the Web-based statistics were reasonably stable when we compared the popularity of word usages by comparing the numbers of Web pages that contained the target words in both 2009 and 2010. Experimental results show that we were able to capture 97% of the 4,100 errors when we recommended 104 candidate characters. When we recommended only 10 candidate characters, we still caught more than 80% of the 4,100 errors. The reported techniques are useful for applications related to Chinese, and, in particular, we showed a real-world application that helps teachers author test items for “incorrect character correction” tests.

ACKNOWLEDGMENTS

We thank the anonymous reviewers of this journal version and of the previous conference articles for their invaluable comments, which strongly influenced this publication. Experiments were added and improved in response to the reviewers’ comments, though we did not mark each such experiment to indicate the reviewers’ credits. We would also like to thank Professor Song Rou for sharing his article with us, and Miss Moira Breen for her indispensable support for our English.

REFERENCES

CANGJIE. 2010. An introduction to the Cangjie input method. http://en.wikipedia.org/wiki/Cangjie input method.
CDL. 2010. Chinese document laboratory, Academia Sinica. http://cdp.sinica.edu.tw/cdphanzi/. (In Chinese)
CHEN, M. Y. 2000. Tone Sandhi: Patterns Across Chinese Dialects (Cambridge Studies in Linguistics 92). Cambridge University Press.
CHU, B.-F. 2010. Handbook of the Fifth Generation of the Cangjie Input Method. http://www.cbflabs.com/book/5cjbook/. (In Chinese)
CORMEN, T. H., LEISERSON, C. E., RIVEST, R. L., AND STEIN, C. 2009. Introduction to Algorithms 3rd Ed. MIT Press.
CROFT, W. B., METZLER, D., AND STROHMAN, T. 2010. Search Engines: Information Retrieval in Practice. Pearson.
DICT. 2010. An official source of information about traditional Chinese characters. http://www.cns11643.gov.tw/AIDB/welcome.do.
FAN, K.-C., LIN, C.-K., AND CHOU, K.-S. 1995. Confusion set recognition of online Chinese characters by artificial intelligence technique. Patt. Recog. 28, 3, 303–313.
FELDMAN, L. B. AND SIOK, W. W. T. 1999. Semantic radicals contribute to the visual identification of Chinese characters. J. Mem. Lang. 40, 4, 559–576.
FROMKIN, V., RODMAN, R., AND HYAMS, N. 2002. An Introduction to Language 7th Ed. Thomson.
HANDICT. 2010. A source for traditional and simplified Chinese characters. http://www.zdic.net/appendix/f19.htm.
HUANG, C.-M., WU, M.-C., AND CHANG, C.-C. 2008. Error detection and correction based on Chinese phonemic alphabet in Chinese text. Int. J. Uncertainty, Fuzziness Knowl.-Based Syst. 16, suppl. 1, 89–105.
JACKENDOFF, R. 1995. Patterns in the Mind: Language and Human Nature. Basic Books.
JUANG, D., WANG, J.-H., LAI, C.-Y., HSIEH, C.-C., CHIEN, L.-F., AND HO, J.-M. 2005. Resolving the unencoded character problem for Chinese digital libraries. In Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries (JCDL’05). 311–319.
JURAFSKY, D. AND MARTIN, J. H. 2009. Speech and Language Processing 2nd Ed. Pearson.
KUO, W.-J., YEN, T.-C., LEE, J.-R., CHEN, L.-F., LEE, P.-L., CHEN, S.-S., HO, L.-T., HUNG, D. L., TZENG, O. J.-L., AND HSIEH, J.-C. 2004. Orthographic and phonological processing of Chinese characters: An fMRI study. NeuroImage 21, 4, 1721–1731.
LECK, K.-J., WEEKES, B. S., AND CHEN, M.-J. 1995. Visual and phonological pathways to the lexicon: Evidence from Chinese readers. Mem. Cogn. 23, 4, 468–476.
LEE, C.-Y. 2009. The cognitive and neural basis for learning to read Chinese. J. Basic Educ. 18, 2, 63–85.
LEE, C.-Y., HUANG, H.-W., KUO, W.-J., TSAI, J.-L., AND TZENG, O. J.-L. 2010. Cognitive and neural basis of the consistency and lexicality effects in reading Chinese. J. Neurolinguist. 23, 1, 10–27.
LEE, C.-Y., TSAI, J.-L., HUANG, H.-W., HUNG, D. L., AND TZENG, O. J.-L. 2006. The temporal signatures of semantic and phonological activations for Chinese sublexical processing: An event-related potential study. Brain Res. 1121, 1, 150–159.
LEE, H. 2010a. Cangjie Input Methods in 30 Days 2. Foruto. http://input.foruto.com/cccls/cjzd.html. (In Chinese)
LEE, MU. 2010b. A quantitative study of the formation of Chinese characters. http://chinese.exponode.com/0 1.htm. (In Chinese)
LIU, C.-L., LAI, M.-H., CHUANG, Y.-H., AND LEE, C.-Y. 2010. Visually and phonologically similar characters in incorrect simplified Chinese words. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING’10). 739–747.
LIU, C.-L., LEE, C.-Y., TSAI, J.-L., AND LEE, C.-L. 2011. Forthcoming. A cognition-based interactive game platform for learning Chinese characters. In Proceedings of the 26th ACM Symposium on Applied Computing (SAC’11).
LIU, C.-L. AND LIN, J.-H. 2008. Using structural information for identifying similar Chinese characters. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL’08). 93–96.
LIU, C.-L., TIEN, K.-W., CHUANG, Y.-H., HUANG, C.-B., AND WENG, J.-Y. 2009a. Two applications of lexical information to computer-assisted item authoring for elementary Chinese. In Proceedings of the 22nd International Conference on Industrial Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE’09). 470–480.
LIU, C.-L., TIEN, K.-W., LAI, M.-H., CHUANG, Y.-H., AND WU, S.-H. 2009b. Capturing errors in written Chinese words. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL’09). 25–28.
LIU, C.-L., TIEN, K.-W., LAI, M.-H., CHUANG, Y.-H., AND WU, S.-H. 2009c. Phonological and logographic influences on errors in written Chinese words. In Proceedings of the 7th Workshop on Asian Language Resources (ALR’09). 84–91.
LIU, C.-L., JAEGER, S., AND NAKAGAWA, M. 2004. Online recognition of Chinese characters: The state-of-the-art. IEEE Trans. Patt. Anal. Mach. Intel. 26, 2, 198–213.
LO, M. AND HUE, C.-W. 2008. C-CAT: A computer software used to analyze and select Chinese characters and character components for psychological research. Behav. Res. Meth. 40, 4, 1098–1105.
