Visually and phonologically similar characters in incorrect simplified Chinese words


Visually and Phonologically Similar Characters in Incorrect Chinese Words: Analyses, Identification, and Applications

C.-L. LIU, M.-H. LAI, K.-W. TIEN, and Y.-H. CHUANG, National Chengchi University

S.-H. WU, Chaoyang University of Technology

C.-Y. LEE, Academia Sinica

Information about students’ mistakes opens a window to an understanding of their learning processes, and helps us design effective course work to help students avoid replication of the same errors. Learning from mistakes is important not just in human learning activities; it is also a crucial ingredient in techniques for the development of student models. In this article, we report findings of our study on 4,100 erroneous Chinese words. Seventy-six percent of these errors were related to the phonological similarity between the correct and the incorrect characters, 46% were due to visual similarity, and 29% involved both factors. We propose a computing algorithm that aims at replication of incorrect Chinese words. The algorithm extends the principles of decomposing Chinese characters with the Cangjie codes to judge the visual similarity between Chinese characters. The algorithm also employs empirical rules to determine the degree of similarity between Chinese phonemes. To show its effectiveness, we ran the algorithm to select and rank a list of about 100 candidate characters, from more than 5,100 characters, for the incorrectly written character in each of the 4,100 errors. We inspected whether the incorrect character was indeed included in the candidate list and analyzed whether the incorrect character was ranked at the top of the candidate list. Experimental results show that our algorithm captured 97% of incorrect characters for the 4,100 errors, when the average length of the candidate lists was 104. Further analyses showed that the incorrect characters ranked among the top 10 candidates in 89% of the phonologically similar errors and in 80% of the visually similar errors.
Categories and Subject Descriptors: I.2.7 [Computing Methodologies]: Artificial Intelligence—Natural language processing; J.5 [Computer Applications]: Arts and Humanities—Linguistics; K.3.1 [Computing Milieux]: Computers and Education—Computer uses in education; Computer-assisted instruction (CAI); H.3.5 [Information Systems]: Information Storage and Retrieval—Online information services; Web-based services; J.4 [Computer Applications]: Social and Behavioral Sciences—Psychology

General Terms: Design, Languages

Additional Key Words and Phrases: Error analysis of written Chinese text, student modeling, traditional Chinese, simplified Chinese, computer-assisted language learning, psycholinguistics

ACM Reference Format:

Liu, C.-L., Lai, M.-H., Tien, K.-W., Chuang, Y.-H., Wu, S.-H., and Lee, C.-Y. 2010. Visually and phonologically similar characters in incorrect Chinese words: Analyses, identification, and applications. ACM Trans. Asian Lang. Inform. Process. 10, 2, Article 10 (June 2011), 39 pages.

DOI = 10.1145/1967293.1967297 http://doi.acm.org/10.1145/1967293.1967297

This article was completed while C.-L. Liu visited the Department of Electrical Engineering and Computer Science of the University of Michigan as a visiting scholar.

This research was supported in part by the research contracts 97-2221-E-004-007, 99-2221-E-004-007, and 99-2918-I-004-008 from the National Science Council of Taiwan.

Authors’ addresses: C.-L. Liu, M.-H. Lai, K.-W. Tien, and Y.-H. Chuang, Department of Computer Science, College of Science, National Chengchi University, Taipei, Taiwan; email: {chaolin, g9523, g9627, g9804}@cs.nccu.edu.tw; S.-H. Wu, Department of Computer Science and Information Engineering, College of Informatics, Chaoyang University of Technology, Taichung, Taiwan; email: shwu@cyut.edu.tw; C.-Y. Lee, Institute of Linguistics, Academia Sinica, Taipei, Taiwan; email: chiaying@gate.sinica.edu.tw.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from the Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

© 2011 ACM 1530-0226/2011/06-ART10 $10.00


1. INTRODUCTION¹

The studies about people using incorrect characters in Chinese words are related to the education, perception, recognition, and applications of the Chinese language². Some Chinese words contain just one character, but most words comprise two or more characters. For instance, “好” (hao3)³ is a word that has just one character and means “good” in English. “語言” (yu3 yan2) is a word that is formed by two characters and means “language” in English. Experience indicates that the two most common causes of writing or typing incorrect Chinese words are phonological and visual similarity between the correct and the incorrect characters [Liu et al. 2009a, 2009b, 2009c]. For instance, one might use “素” (su4) in the place of “肅” (su4) in “嚴肅” (yan2 su4) because of the phonological similarity; one might use “施” (shi1) for “旅” (lu3) in “旅途” (lu3 tu2) due to the visual similarity.

Manipulating the similarity between characters has served as an instrumental technique in psycholinguistic studies into how people read and recognize Chinese characters. Researchers in psycholinguistics investigate the cognition processes of Chinese readers [Kuo et al. 2004; Lee et al. 2006; Tsai et al. 2006], by measuring readers’ response times to words that have various numbers of “neighbor” words. The neighbors of a Chinese word include phonologically and visually similar characters.

Phonologically and visually similar characters are also useful for computer assisted language learning (CALL). In elementary schools in Taiwan, students may be requested to identify and correct “erroneous words” in test items, where, typically, an “erroneous word” contains an incorrect character that was introduced intentionally when teachers prepared the test items. Such tests are Incorrect Character Correction tests (ICC tests). It takes effort and time to provide incorrect characters that are appropriate for different assessment purposes, and to make sure that the test items do not repeatedly use the same incorrect characters at the same time. We have built an environment for assisting the preparation of such test items [Liu et al. 2009a] by finding a way to offer phonologically and visually similar Chinese characters as candidates to serve as the incorrect characters [Liu and Lin 2008].

¹ This article is a significantly extended version, in terms of the depth of discussion and the scale of experiments, of the material reported in Liu and Lin [2008], Liu et al. [2009a, 2009b, 2009c], and Liu et al. [2010].

² In this article, we use “Chinese” to refer to Mandarin Chinese.

³ We show traditional and simplified Chinese characters followed by their Hanyu pinyin (http://en.wikipedia.org/wiki/Pinyin). The Hanyu pinyin of a Chinese character shows the sound of the character by a string of English letters, and the digit that follows the letters is the tone for the character. To simplify our presentation, we will show Chinese text only in either the traditional or the simplified form, but not both. If presented in simplified Chinese, the errors listed in the first paragraph in the Introduction will replace “肃” (su4) in “严肃” (yan2 su4) by “素” (su4) for phonological similarity and “旅” (lu3) in “旅途” (lu3 tu2) by “施” (shi1) for visual similarity. The traditional and simplified forms of a Chinese character might not differ from each other.

In addition, phonologically and visually similar characters can be applied to student modeling, optical character recognition (OCR), and information retrieval (IR) in Chinese. Bug libraries contain students’ records of previous errors [Sison and Shimura 1998; Virvou et al. 2000], and are useful for modeling student behavior. Some algorithms for optical character recognition for printed Chinese and for written Chinese try to guess the input images based on confusion sets [Fan et al. 1995; Liu et al. 2004]. Characters in a confusion set are similar to each other visually, and they help the OCR programs to confine the search space for a given image. It would be possible to reduce the computational costs and to increase recognition rates if we can pinpoint the confusion set of a character that is being recognized. The current confusion sets are hand-crafted clusters of visually similar characters. In recent years, it has become a common practice for IR service providers, such as Yahoo! and Google, to offer corrections when users enter queries that contain incorrect words. For English queries, one may apply the Levenshtein distance to compute the edit distance between the spellings and employ the Soundex system to determine the degree of similarity between the pronunciations of words [cf., Croft et al. 2010; Manning et al. 2008]. These methods are not perfect but can catch similar English words in practice. The work reported in this article can be applied to find possible corrections for Chinese queries.
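For readers unfamiliar with the Levenshtein distance mentioned for English queries, the following is a minimal sketch of the standard dynamic-programming formulation with unit costs. It is included only as background and is not part of the method proposed in this article; the function name `levenshtein` is our own choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]
```

A query-correction service could, for example, rank dictionary words by this distance to the user's query term; how such a ranking is combined with pronunciation similarity is outside the scope of this sketch.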

Some researchers state that there are more than 50,000 Chinese characters [HanDict 2010], although only thousands of characters are used in daily life. In the People’s Republic of China, a government agency selected 7,000 popular Chinese characters and highlighted 3,500 characters among these 7,000 characters as the most frequently used characters in 1988⁴. In Taiwan, 5,401 characters were selected as the most commonly used in daily life in 1984, when the BIG5 code was formulated [Dict 2010].

” (xun1) in different sizes and at different positions.

We apply an extended version of the Cangjie codes [Cangjie 2010; Chu et al. 2010] to encode the layouts and details of traditional Chinese characters for computing visually similar characters [Liu and Lin 2008; Liu et al. 2009a, 2009b, 2009c], and extend the work to compare similar characters in simplified Chinese characters [Liu 2010]. Evidence observed in psycholinguistic studies [Feldman and Siok 1999; Lee et al. 2006; Yeh and Li 2002] offers a cognition-based support for the design of our approach; namely, the use of shared components to define the visual similarity between Chinese characters.

⁴ The statistics are available on the following two Wikipedia pages: http://zh.wikipedia.org/zh-tw/%E7%8E%B0%E4%BB%A3%E6%B1%89%E8%AF%AD%E9%80%9A%E7%94%A8%E5%AD%97%E8%A1%A8 (if Chinese is available on your computers: http://zh.wikipedia.org/zh/现代汉语通用字表) and http://en.wikipedia.org/wiki/Xi%C3%A0nd%C3%A0i_H%C3%A0ny%C7%94_Ch%C3%A1ngy%C3%B2ng_Z%C3%ACbi%C7%8Eo (if Chinese is available on your computers: http://zh.wikipedia.org/zh/现代汉语常用字表). The first page is written in Chinese, and the second one is in English. The translations of “现代汉语” (xian4 dai4 han4 yu3) and “字表” (zi4 biao3) are “Modern Chinese” and “character list”, respectively. We use “popular” for “通用” (tong1 yong4) and “most frequently used” for “常用” (chang2 yong4).

The proposed method proves to be effective in capturing incorrect words for both traditional [Liu et al. 2009a, 2009b, 2009c] and simplified Chinese [Liu 2010]. We collected and analyzed approximately 4,100 errors that were reported in published books, found in students’ compositions, or posted on the Internet. Each reported error pairs a word in its correct form, such as “嚴肅”, with the form in which it was miswritten, such as “嚴素”, where “素” is used instead of “肅”. Namely, writing “嚴肅” as “嚴素” is a reported error. We found that 76% of the errors were related to phonological similarity and that 46% of the errors were related to visual similarity. More significantly, the dominance of the phonological factor was also observed in hand-written text, not just in electronic documents that were directly prepared on computers.

In experiments that aimed at reproducing the collected errors, we ran our programs to select and recommend a list of candidates from more than 5,100 Chinese characters for the correct character, that is, “肅”, and we recorded the likelihood that the candidate list actually included the incorrect character. Experimental results show that if the length of the candidate list is about 100, we achieved inclusion rates of about 97% for both traditional and simplified Chinese. If the length of the candidate list was shortened to 10, the average inclusion rates were 89% for the phonologically similar errors and 80% for the visually similar errors. We have also applied our algorithms for reproducing the reported errors to build an environment to assist teachers to prepare test items for ICC tests.
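The inclusion rate described above can be illustrated with a short sketch. We assume, for illustration only, that each reported error is a pair (correct character, incorrect character) and that a hypothetical function `recommend` returns a ranked candidate list for a correct character; the toy candidate lists below are made up and are not the output of our system.

```python
from typing import Callable, Iterable, List, Tuple

def inclusion_rate(errors: Iterable[Tuple[str, str]],
                   recommend: Callable[[str], List[str]],
                   k: int) -> float:
    """Fraction of errors whose incorrect character is in the top-k candidates."""
    errors = list(errors)
    if not errors:
        return 0.0
    hits = sum(1 for correct, incorrect in errors
               if incorrect in recommend(correct)[:k])
    return hits / len(errors)

# Toy data (assumed for illustration): two reported errors and made-up rankings.
ERRORS = [("肅", "素"), ("旅", "施")]
TOY_RANKINGS = {"肅": ["素", "速"], "旅": ["族", "施"]}

def toy_recommend(correct: str) -> List[str]:
    return TOY_RANKINGS.get(correct, [])
```

With these toy rankings, shortening the list from two candidates to one loses the second error, which mirrors the trade-off between list length and inclusion rate reported above.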

In this article, we integrate and extend the previous reports on the phonologically and visually similar characters in both traditional and simplified Chinese to capture errors in Chinese words. We go over some issues about phonological similarity in Chinese in Section 2, elaborate how we extend and apply the Cangjie codes to judge the visual similarity between Chinese characters in Section 3, explain how we acquired the reported errors and how we analyzed the phonological and visual influences on these errors in Section 4, present details about our experiments and discuss the observations in Section 5, show a real-world application of the proposed techniques to the authoring of test items for the ICC tests in Section 6, and review some of the design issues and experience in Section 7 before we summarize our work in Section 8.

Compared with the previous conference articles [Liu and Lin 2008; Liu et al. 2009a, 2009b, 2009c; Liu et al. 2010], we expanded the scale of experiments and discussions in terms of both depth and coverage. More specifically, we validated the reliability of the Web-based statistics by examining the data that we collected in 2009 and in 2010, compared the contribution of different sources of similar characters, explored the applications of alternative ranking methods, and exhibited the robustness of our approach by running our systems over new data sets.

2. PHONOLOGICALLY SIMILAR CHARACTERS

Chinese characters are monosyllabic. The pronunciation of a Chinese character involves the nucleus and a tone, where the nucleus contains a vowel that follows an optional consonant. In this article, we use the Hanyu pinyin method to denote the sound of Chinese characters, and show the tone with a digit that follows the symbol string for the sound. In Mandarin Chinese, there are four tones. (Some researchers include a fifth tone.)

Although Chinese is not an alphabetical language, it has been shown that the pronunciations of characters affect how people write Chinese [Ziegler et al. 2000]. The pronunciation of a Chinese character has two parts: sound and tone. Therefore, the phonological similarity between two characters may consider these two aspects, and we consider four categories of phonological similarity between two characters: same sound and same tone (SS), same sound and different tone (SD), similar sound and same tone (MS), and similar sound and different tone (MD).


Table I. Samples of the Similar Phonemes with Example Characters

Type       Original  Similar  A Character with the  Examples with the Similar Sound
           Phoneme   Phoneme  Original Phoneme      and the Same Tone (MS)
consonant  /s/       /sh/     肅 (su4)              數、樹、恕 (shu4)
consonant  /z/       /zh/     早 (zao3)             找、沼、爪 (zhao3)
consonant  /c/       /ch/     從 (cong2)            重、虫、崇 (chong2)
vowel      /eng/     /en/     徵 (zheng1)           真、針、貞 (zhen1)
vowel      /eng/     /ang/    徵 (zheng1)           張、章、蟑 (zhang1)

We rely on the information provided in a lexicon [Dict 2010] to determine whether two characters have the same sound or the same tone. The judgment of whether two characters have a similar sound should consider the language experience of an individual. An individual who lives in southern China and one who lives in northern China, for instance, might have quite different perceptions of similar sound. In this work, we resort to the confusion sets observed in a psycholinguistic study, conducted at the Academia Sinica in Taiwan, to obtain a list of confusion sets of vowels and consonants in Mandarin Chinese.

Some Chinese characters are heteronyms [cf., Fromkin et al. 2002]. Let C1 and C2 be two characters that have multiple pronunciations. If C1 and C2 share one of their pronunciations, we consider that C1 and C2 belong to the SS category. This principle applies when we consider phonological similarity in other categories.
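The four categories and the treatment of heteronyms can be sketched as follows. This is a minimal illustration under several assumptions of ours: syllables are written as toneless pinyin plus a tone digit, the confusable phonemes are only the five pairs listed in Table I, and the helper names (`split_syllable`, `category`, `character_categories`) and the way a syllable is split into an initial and a final are hypothetical, not a description of our actual programs.

```python
import re
from typing import Iterable, Optional, Set, Tuple

# Confusable phoneme pairs taken from Table I; similarity is symmetric.
CONFUSABLE = {frozenset(p) for p in
              [("s", "sh"), ("z", "zh"), ("c", "ch"), ("eng", "en"), ("eng", "ang")]}
# Two-letter initials come first so that "zh" is not read as "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k",
            "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(pinyin: str) -> Tuple[str, str, int]:
    """Split e.g. 'zheng1' into (initial, final, tone) = ('zh', 'eng', 1)."""
    match = re.fullmatch(r"([a-z]+)([1-5])", pinyin)
    if match is None:
        raise ValueError(f"not a pinyin syllable with a tone digit: {pinyin!r}")
    sound, tone = match.group(1), int(match.group(2))
    for initial in INITIALS:
        if sound.startswith(initial) and len(sound) > len(initial):
            return initial, sound[len(initial):], tone
    return "", sound, tone  # syllable without an initial consonant

def category(a: str, b: str) -> Optional[str]:
    """Return 'SS', 'SD', 'MS', 'MD', or None for two pronunciations."""
    ia, fa, ta = split_syllable(a)
    ib, fb, tb = split_syllable(b)
    tone_mark = "S" if ta == tb else "D"
    if (ia, fa) == (ib, fb):
        return "S" + tone_mark
    initial_ok = ia == ib or frozenset((ia, ib)) in CONFUSABLE
    final_ok = fa == fb or frozenset((fa, fb)) in CONFUSABLE
    return "M" + tone_mark if initial_ok and final_ok else None

def character_categories(prons_a: Iterable[str], prons_b: Iterable[str]) -> Set[str]:
    """Heteronyms: collect the categories over every pair of pronunciations."""
    prons_b = list(prons_b)
    return {c for a in prons_a for b in prons_b
            if (c := category(a, b)) is not None}
```

For example, a character pronounced su4 and a hypothetical heteronym pronounced both shu4 and shu3 would fall into both the MS and the MD categories under this sketch.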

With a lexicon and the list of confusion sets, our program can select a list of phonologically similar characters for a given character. Consider the example “嚴肅” […] is in the MS and MD lists for “肅” because it is a heteronym.

Table I shows more pairs of the confusing phonemes that we used in our system. Note that phonological similarity is a symmetric relationship. Namely, when phoneme X is similar to phoneme Y, phoneme Y is similar to phoneme X. To help readers focus on the symbols of the phonemes, we underline the confusing phonemes of the example characters in boldface. Notice also that, although we do not explicitly provide examples in Table I, it is possible to change both the consonant and the vowel for a character to find a phonologically similar character. For instance, “髒” (zang1) is phonologically similar to “徵” (zheng1) because we can replace the consonant /zh/ and vowel /eng/ in “徵” with /z/ and /ang/, respectively, and find “髒”.
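The substitution of confusable consonants and vowels described above can be sketched as a small generator of candidate sounds. The pairs are those of Table I; the function name `similar_sounds` is hypothetical, and the sketch makes no attempt to check that a generated string is a valid Mandarin syllable, which in practice would require consulting a lexicon.

```python
from itertools import product
from typing import Dict, List, Set

# Confusable pairs from Table I, stored symmetrically.
PAIRS = [("s", "sh"), ("z", "zh"), ("c", "ch"), ("eng", "en"), ("eng", "ang")]
SIMILAR: Dict[str, Set[str]] = {}
for x, y in PAIRS:
    SIMILAR.setdefault(x, set()).add(y)
    SIMILAR.setdefault(y, set()).add(x)

def similar_sounds(initial: str, final: str) -> List[str]:
    """Sounds reachable by replacing the initial, the final, or both."""
    initials = {initial} | SIMILAR.get(initial, set())
    finals = {final} | SIMILAR.get(final, set())
    variants = {i + f for i, f in product(initials, finals)}
    variants.discard(initial + final)  # the original sound is not a variant
    return sorted(variants)
```

Applied to /zh/ + /eng/, the sketch yields zang among the variants, which corresponds to the zheng1 and zang1 example given in the text.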

Fig. 1. Examples of visually similar characters in traditional Chinese (groups 1-5) and in simplified Chinese (groups 6-10).

One challenge in defining phonological similarity between characters is that a Chinese character may be pronounced in more than one way, and the actual pronunciation depends on the context. Tone sandhi [Chen 2000] is a frequently mentioned source of confusion. The most common example of the use of tone sandhi in Chinese is that the first third-tone character in words formed by two adjacent third-tone characters will be pronounced with the second tone. For example, although “你” (ni3) and “好” (hao3) are both third-tone characters, “你” in “你好” is pronounced with the second tone in practice. Namely, native speakers usually pronounce “你好” as ni2-hao3. At present, we ignore the influences of context when determining whether two characters are phonologically similar. (As we shall see in Section 5, doing so did not disturb the experimental results.)
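The third-tone sandhi rule just described can be written down in a few lines. The sketch below covers only the two-character case stated in the text; the function name is hypothetical, and, as noted above, our system does not apply such contextual rules when judging phonological similarity.

```python
from typing import List, Sequence

def apply_third_tone_sandhi(syllables: Sequence[str]) -> List[str]:
    """For a two-character word, change the first of two third tones to tone 2.

    Longer words are returned unchanged, because their sandhi patterns depend
    on word structure that this sketch does not model.
    """
    result = list(syllables)
    if len(result) == 2 and result[0].endswith("3") and result[1].endswith("3"):
        result[0] = result[0][:-1] + "2"
    return result
```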

Although we have confined our definition of phonological similarity to the context of Mandarin Chinese, we would like to note that the influence of sublanguages within the Chinese language family will affect the perception of phonological similarity. Dialects used in different areas in China, for example, Shanghai, Min, and Canton, share the same written forms with Mandarin Chinese, but have quite different though related pronunciation systems. Hence, people living in different areas in China might perceive phonological similarity in different ways. The study in this direction, however, is beyond the scope of the current study.

3. VISUALLY SIMILAR CHARACTERS

Figure 1 shows examples of visually similar Chinese characters. The first row contains five groups of visually similar traditional Chinese characters, and the second row contains five corresponding groups of simplified Chinese characters. The jth character (counted from left to right) in group (i + 5) is the simplified form of the jth character in group i. Notice that the traditional and simplified forms of a character may be exactly the same.

The characters in group 1 differ subtly at the stroke level, as do the characters in group 2. The characters in group 3 share the same components on their right sides. The shared components of the characters in group 4 and group 5 appear at different places within the characters.

Analogously, characters in group 6 differ subtly at the stroke level, as do the simplified characters in group 7. Characters in group 8 share the components on their right sides. The shared components of the characters in group 9 and group 10 appear at different places within the characters.

The radical of a Chinese character carries the main semantic information about the character [cf., Feldman et al. 1999], and lexicographers employ radicals to organize characters in Chinese dictionaries. Characters that belong to the same radicals are placed in the same category, and are listed sequentially by the number of strokes. Hence, it is possible to employ the information about radicals to find visually similar characters. The characters in group 1 and group 2 have the radicals “田” (tian2) and “言” (yan2), respectively. Analogously, the simplified characters in group 6 and group 7 have the radicals “田” and “讠”, respectively. (“讠” is the simplified form of “言”.) Notice that, although the radicals for group 2 and group 7 are obvious, those for group 1 and group 6 are not because “田” is not a standalone component in these groups.

Although radicals themselves provide information about the shared components of characters, the most saliently shared components of characters might not be the radicals of the characters. This problem occurs in both traditional and simplified Chinese. The shared component of the characters in group 3 is not the radical. The shared components of the characters in groups 4, 8, and 9 are not the radicals for the characters in the groups either. In these cases, the shared components carry information about the pronunciations of the characters. Hence, those characters are listed under different radicals, though they do look similar in some ways.

In some cases, one may be interested in characters that share small elements in the characters, such as “歹” (dai3) in group 5 and group 10. The shared elements in these two groups do not carry semantic or phonological information, and they are not the radicals either. It is also possible that a radical is written in different ways in the characters that have the same radical in a dictionary, for example, “泉” (quan2) and “泊” (bo2). These two characters are listed under the radical “水” (shui3). The radical appears literally in “泉”, but is written as “氵” in “泊”.

Therefore, we cannot rely only on the information about radicals of characters in typical lexicons to find visually similar characters, and we will use the extended Cangjie codes as the basis to judge the degree of similarity between Chinese characters.

3.1 Cangjie Codes for Traditional Chinese

The Cangjie input method is one of the most popular methods used for entering Chinese characters into computers. The designer of the Cangjie method selected a set of 24 basic elements occurring in characters, and proposed a set of rules to decompose Chinese characters according to these elements [Chu et al. 2010]. Because the Cangjie system is designed to help people enter Chinese characters into computers, the design of the Cangjie codes aimed at allowing its users to recall the codes for Chinese characters as easily as possible. Namely, users must be able to easily figure out the Cangjie codes for the characters that they want to enter. Given the popularity of the Cangjie input method in a wide range of Chinese speaking communities, the Cangjie codes of Chinese characters have practically shown their strong links with the formation of Chinese characters. This was an important motivation for us to try to define the similarity between two Chinese characters based on the degree of similarity between their Cangjie codes.

Table II shows the Cangjie codes for the 13 characters listed in groups 1 to 4 in Figure 1 and for five other characters. The “ID” column shows the identification number for the characters, and we will refer to the ith character by ci, where i is the ID. The “CC” column shows the Chinese characters, and the “Cangjie” column shows the Cangjie codes. Each symbol in the Cangjie codes corresponds to a key on the keyboard, for example, “田” (tian2) and “中” (zhong1) collocate with “W” and “L”, respectively. Information about the complete correspondence is available in Wikipedia⁵.

Using the Cangjie codes saves us from the need to apply image processing methods to determine the degrees of similarity between characters. Take the Cangjie codes for the characters in group 2 (c5, c6, and c7) for example. We can see that the characters share a common component based on the shared substrings of the Cangjie codes (shown in boldface), that is, “卜口” (bu3 kou3). We may also find the shared component “冓” (gou4, encoded by “廿廿月”) for the characters in group 3 (c10, c11, and c12), the shared component “力” (li4, encoded by “大尸”) in c15 and c16, and the shared component “巠” (jing1, encoded by “一一”) in c16 and c17.
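The idea of spotting shared components through shared substrings of Cangjie codes can be sketched with a plain longest-common-substring routine. The codes below are those of c5, c6, c8, c10, c11, and c16, written with the standard Cangjie key names; the function name is hypothetical, and the sketch is only an illustration of the observation above, not the similarity measure used in our experiments.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of keys shared by two Cangjie codes."""
    best = ""
    # previous[j] = length of the common suffix of a[:i-1] and b[:j]
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                if current[j] > len(best):
                    best = a[i - current[j]:i]
        previous = current
    return best

# Original (possibly simplified) Cangjie codes of a few characters.
CODES = {
    "許": "卜口人十", "訐": "卜口一十",      # c5, c6
    "購": "月金廿廿月", "溝": "水廿廿月",    # c10, c11
    "勁": "一一大尸", "政": "一一人大",      # c16, c8
}
```

The pair 勁 and 政 shares the substring “一一” even though the two characters do not look alike, which is a small instance of the ambiguity that the original codes introduce.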

⁵ See http://en.wikipedia.org/wiki/Cangjie_input_method#Keyboard_layout; last visited on 30 September 2010.

Table II. Examples of Cangjie Codes for Traditional Chinese

ID  CC  Cangjie      ID  CC  Cangjie
1   田  田           10  購  月金廿廿月
2   由  中田         11  溝  水廿廿月
3   甲  田中         12  構  木廿廿月
4   申  中田中       13  員  口月山金
5   許  卜口人十     14  圓  田口月金
6   訐  卜口一十     15  勛  口金大尸
7   計  卜口十       16  勁  一一大尸
8   政  一一人大     17  頸  一一一月金
9   元  一一山       18  經  女火一女一

Table III. Examples of Cangjie Codes for Simplified Chinese

ID  CC  Cangjie      ID  CC  Cangjie
19  田  田           28  购  月人心戈
20  由  中田         29  沟  水心戈
21  甲  田中         30  构  木心戈
22  申  中田中       31  员  口月人
23  许  戈女人十     32  圆  田口月人
24  讦  戈女一十     33  勋  口人大尸
25  计  戈女十       34  劲  弓一大尸
26  鲳  弓一日日     35  颈  弓一一月人
27  驹  弓一心口     36  经  女一弓人一

However, the original Cangjie codes are still lacking in some respects, in spite of their perceivable advantages. The Cangjie codes have been limited to contain no more than five keys, in order to maintain efficiency in inputting Chinese characters. Thus, users of the Cangjie input method must familiarize themselves with the principles for simplifying the Cangjie codes. While the simplified codes help to enhance the input efficiency, they also introduce difficulties and ambiguities when we compare the original Cangjie codes for computing similar characters. The shared component “員” (yuan2) is encoded in three different ways in c13, c14, and c15, i.e., “口月山金” (kou3 yue4 shan1 jin1), “口月金”, and “口金”. The prefix “一一” in c16 and c17 can represent “巠” (jing1), “正” (zheng4; e.g., in c8), and “二” (er4; e.g., in c9). Consequently, characters whose Cangjie codes include “一一” may contain any of these three components, but c8, c9, and c16 do not really look alike.

3.2 Cangjie Codes for Simplified Chinese

Not surprisingly, the Cangjie codes are also useful for capturing the similarities between simplified Chinese characters. Using a structure similar to Table II, Table III shows the Cangjie codes for the characters listed in groups 6 to 9 in Figure 1 and five other characters.


Fig. 2. Layouts of Chinese characters (used in Cangjie).

Again, the Cangjie codes offer the possibility to determine the degrees of similarity between characters efficiently. It is possible to find that the characters c23, c24, and c25 share a common component because their Cangjie codes share “戈女” (ge1 nu3). Using the common substrings (shown in boldface) of the Cangjie codes, we may also find the shared component “勾” (gou1, encoded by “心戈”) for the characters in group 8 (c28, c29, and c30), the shared component “员” (yuan2, encoded by “口月人”) in c31 and c32, the shared component “力” (li4, encoded by “大尸”) in c33 and c34, and the shared component “ ” (jing1, encoded by “弓一”) in c34 and c35.

Similar to the problem of using the original Cangjie codes for traditional Chinese, we would encounter ambiguity problems when comparing the similarities between simplified Chinese characters. The shared component “员” in c32 and c33 is encoded by “口月人” (kou3 yue4 ren2) and “口人”, respectively. The prefix “弓一” (gong1 yi1) in c34 and c35 can represent “ ”, “鱼” (yu2; e.g., in c26), and “马” (ma3; e.g., in c27). Characters whose Cangjie codes include “弓一” may contain any of these three components, but c26, c27, and c34 do not really look alike.

Given the observations reported in the previous subsection and in this present one, we augmented the original Cangjie codes by using the complete Cangjie codes and annotated each Chinese character with a layout identification that encodes the overall contours of the characters.

3.3 Augmenting the Cangjie Codes

Figure 2 shows the 12 possible layouts that are considered for the Cangjie codes for both traditional and simplified Chinese characters. Most of the layouts contain two or three small regions (called subareas henceforth), and the rectangles show individual subareas within a character. The subareas are assigned IDs, but to maintain readability of the figures, not all of the IDs for subareas are shown in Figure 2. From left to right and from top to bottom, each layout is assigned an identification number from 1 to 12. An example pair of characters, separated by a slash, is provided below each layout. A traditional Chinese character is on the left, and a simplified one is on the right. For example, the layout ID of “固/国” is 8. “固” (gu4) is a traditional Chinese character, and has two parts, that is, “囗” (wei2) and “古” (gu3). “国” (guo2) is a simplified Chinese character and has two parts, that is, “囗” and “玉” (yu4).

When Chinese characters are transformed from the traditional to simplified forms, the layout of the same characters may or may not be changed, and a more comprehensive discussion about the significant change in the structures of the characters is available in Lee [2010b]. Hence, we may or may not use the traditional and simplified forms of the same character as a pair in Figure 2. Except for layouts 6, 7, 8, and 10, the pairs of characters shown under the layouts are the same characters in both traditional and simplified forms. For instance, “谢” (xie4) and “謝” (xie4) are examples of layout 4, and “谢” is the simplified form of “謝”. In contrast, “固” and “国” are two different characters, but both are examples of layout 8. The traditional form of “国” is “國”, which belongs to layout 9.

Researchers have come up with other ways to decompose individual Chinese characters. A team at the Shanghai Jiao-Tong University (SJTU) report an early attempt, and they consider five major ways to decompose Chinese characters [p. 1071, SJTUD 1988]. In this study, the SJTU team report detailed analysis of the compositions of Chinese characters. Based on their analysis, “口” (kou3) and “木” (mu4) are the most frequent components in Chinese characters [p. 1027, SJTUD 1988]. Juang et al. [2005] employ four relationships for components of Chinese characters, and Sun et al. [2002] employ six relationships. The Chinese Document Laboratory at the Academia Sinica in Taiwan considers 13 possible ways to decompose Chinese characters [CDL 2010]. Lee [2010b] proposes more than 30 possible layouts. In Unicode standard 4.0.1, 12 operators are considered to build Chinese characters from a set of building blocks [UNICODE 2010].

” (zheng4) and “Β” (er4), as we have illustrated by c8, c9, and c16 in Table II. The simplified Cangjie codes for “ق” are the same as the Cangjie codes of “ ”, which is in the upper part of “ଯ” (gao1).

After finding the frequent substrings, we verify whether these frequent substrings are simplified codes for meaningful components, which, in our definition, form parts


Table IV. Examples of Extended Cangjie Codes for Traditional Chinese

ID CC LID P1 P2 P3

5

೚

2

Ν΋

΋΋α

ΓΜ

6

૷

2

Ν΋

΋΋α

΋Μ

7

_ी

2

Ν΋

΋΋α

_Μ

10

ᖼ

10

΋΋

Дξ

_α

_Дξߎ

14

༝

9

_Җ

_α

Дξ

ξߎ

15

_ര

2

αД

Дξߎ

_εν

16

_ࠂ

2

Дξ

΋ζ

ζζζ΋

_εν

17

_ᓍ

2

Дξ

΋ζ

ζζζ΋

΋Дξ

ξߎ

18

_࿶

3

ζζζ

ζЉ

ЉО

ξ

΋ζζ

ζζ

_΋

of one or more Chinese characters. For meaningful components, we replace the simplified codes with their complete codes. For instance, the Cangjie codes for “೚” (xu3) and “૷” (jie2) are extended to contain “Ν΋΋α” in Table IV, where we indicate the extended keys that did not belong to the original Cangjie codes in boldface and with a surrounding box. After recovering the dropped codes for “ق”, our programs will have the information necessary to tell “ق” and “ ” apart.

Although we have tried to employ computer programs to help us find the frequent substrings in as many instances as we can, the work to recover the simplified codes remained labor-intensive, and we had to devote particular attention to certain anomalous cases at times. Fortunately, the process to implement the extended Cangjie codes proved to be worthwhile, as we will show in the experimental studies.

Using a structure that is similar to Table IV, Table V shows the extended Cangjie codes for some of the simplified Chinese characters that we show in Table III. The “ID” column provides links between the characters listed in both Table III and Table V.

In Table V, we recover the Cangjie codes for “ ” (yan2) and “ ” (jing1). Using “Љψζ” (ge1 gong1 su3), rather than “Љζ”, for “ ” prevents us from confusing “ ” with “ᣠ” (yue4). Similarly, using “ψΓ΋” (gong1 ren2 yi1), rather than “ψ΋”, for “ ” avoids the confusion of “卺” (ma3) and “吏” (yu2) with “ ”, e.g., c26, c27, and c34 in Table III.

Replacing simplified codes with complete codes not only helps us avoid incorrect matches but also helps us find matches that would be missed due to the simplification of the Cangjie codes. If we only use the original Cangjie codes in Table III, it is not easy

_Љ

31

_䠑

5

_α

_ДΓ

32

_䡿

9

_Җ

_α

_ДΓ

33

_䟋

Γ

ζζ

ζ΋

_ψΓ

_΋

37

䦯

4

_Ј

΋΋

΋Љ

εν

to determine that c36 (“䶈”, jing1) in Table III shares the component “ ” (jing1) with c34 (“䟄”, jing4) and c35 (“勹”, jing3). In contrast, there is a chance to find the similarity with the extended Cangjie codes in Table V, given that all of the three Cangjie codes include “ψΓ΋” (gong1 ren2 yi1).

Although most of the examples provided in Table V indicate that we expanded only the first part of the Cangjie codes for the simplified Chinese, it is possible that the other parts, that is, P2 and P3, may need to be extended too. Sample c37 shows such an example.

3.4 Similarity Measure

The main differences between the original and the extended Cangjie codes are the degrees of detail about the structures of the Chinese characters. By recovering the details that were ignored in the original codes, our programs will be better equipped to find the similarities between characters.

We experiment with three different scoring methods to measure the visual similarity between two characters based on their extended Cangjie codes. Two of these methods were tried in our studies for traditional Chinese characters [Liu et al. 2009b, 2009c]. The first method, denoted SC1, considers the total number of matched keys in the matched parts. Two parts are considered as matched as long as their contents are the same. They do not have to be located at the same region within a character. Let ci denote the ith character listed in Table V. We have SC1(c33, c34) = 2 because of the matched “εν” (da4 shi1). Analogously, we have SC1(c37, c34) = 2 because of the matched “εν”. Notice that, although “΋” (yi1) is in the P1 of c34 and in the P2 of c37, it is not considered a match because the P1 of c34 and the P2 of c37 do not match as a whole.

The second method, denoted SC2, includes the score of SC1 and considers the following conditions: (1) add one point if the matched parts are located at the same place in the characters and (2) if the first condition is met, add an extra point if the characters belong to the same layout. Hence, we have SC2(c33, c34) = SC1(c33, c34) + 1 + 1 = 4 because (1) the matched “εν” is the P2 of both characters and (2) c33 and c34 belong to the same layout. If c34 belonged to layout 5 instead, then SC2(c33, c34) would become 3. In contrast, we have SC2(c37, c34) = 2. No extra points were added for “εν” in the Cangjie codes for c37 and c34 because “εν” is not at the same position in the characters. The extra points consider the spatial influences of the matched parts on the perception of similarity.
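The two scoring rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the class `CodedChar`, the helper `matched_parts`, and the three sample entries are hypothetical, the Cangjie keys are written with their Latin keyboard letters (for instance, "KS" stands for da4 shi1), and the positional bonus is applied once per matched part, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CodedChar:
    """A character's layout ID and its extended Cangjie parts (P1, P2, ...)."""
    layout: int
    parts: Tuple[str, ...]


def matched_parts(a: CodedChar, b: CodedChar) -> List[Tuple[int, int]]:
    """Pair parts with identical contents; each part of b is used at most once."""
    used_in_b = set()
    pairs = []
    for i, part_a in enumerate(a.parts):
        for j, part_b in enumerate(b.parts):
            if j not in used_in_b and part_a == part_b:
                used_in_b.add(j)
                pairs.append((i, j))
                break
    return pairs


def sc1(a: CodedChar, b: CodedChar) -> int:
    """SC1: total number of keys in the parts that match as a whole."""
    return sum(len(a.parts[i]) for i, _ in matched_parts(a, b))


def sc2(a: CodedChar, b: CodedChar) -> int:
    """SC2: SC1, plus 1 per part matched at the same position, plus 1 more
    when that holds and the two characters share a layout (assumed per part)."""
    score = sc1(a, b)
    for i, j in matched_parts(a, b):
        if i == j:
            score += 1
            if a.layout == b.layout:
                score += 1
    return score


# Hypothetical entries that mimic the relationships described in the text;
# they are not the actual rows of Table V.
char_a = CodedChar(layout=2, parts=("HQ", "KS"))
char_b = CodedChar(layout=2, parts=("NOM", "KS"))
char_c = CodedChar(layout=4, parts=("T", "MM", "KS"))

if __name__ == "__main__":
    print(sc1(char_a, char_b), sc2(char_a, char_b))
    print(sc1(char_c, char_b), sc2(char_c, char_b))
```

With these stand-ins, `char_a` and `char_b` share "KS" at P2 and have the same layout, so the pair behaves like c33 and c34, whereas `char_c` carries "KS" at P3 and earns no positional bonus, like c37.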

While splitting the extended Cangjie codes into parts allows us to tell that c33 is more similar to c34 than to c37, it also creates a new barrier in computing similarity scores. An example of this problem is that SC2(c35, c36) = 0. This is because “ψΓ΋” (gong1 ren2 yi1) at P1 in c35 can match neither “ψΓ” at P2 nor “΋” at P3 in c36.

To alleviate this problem, we consider SC3, which computes the similarity in three steps. First, we concatenate the parts of a Cangjie code for a character. Then, we compute the longest common subsequence (LCS) (cf., Cormen et al. [2009]) of the concatenated codes of the two characters being compared, and compute a Dice coefficient (cf., Croft et al. [2010]) as the similarity. The Dice coefficients are used in many applications, including defining the strength of the relatedness of two terms (or similarity of two strings) in natural language processing (cf., Manning and Schütze [1999]). Let X and Y denote the concatenated, extended Cangjie codes for two characters, and let Z be the LCS of X and Y. The similarity is defined by the following equation.

$$\mathrm{Dice}_{\mathrm{LCS}} = \frac{2 \times |Z|}{|X| + |Y|}, \quad \text{where } |S| \text{ is the length of string } S \tag{1}$$

We compute another Dice coefficient between X and Y. The formula is similar to Equation (1), except that we set Z to the longest common consecutive subsequence (LCCS). We call this score DiceLCCS. Notice that DiceLCCS ≤ DiceLCS, DiceLCCS ≤ 1, and DiceLCS ≤ 1. Using both DiceLCS and DiceLCCS allows us to compute the visual similarity from two aspects. Finally, the SC3 of two characters is the sum of their SC2, 10 × DiceLCCS, and 5 × DiceLCS. We multiply the Dice coefficients by constants to make them as influential as the SC2 component in SC3. Since the LCCS of two strings is generally considerably shorter than their LCS, we multiply DiceLCCS by a larger weight. The constants were not scientifically chosen, but were selected heuristically.
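The three steps of SC3 can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names and the sample strings are hypothetical, the Cangjie keys are written with Latin keyboard letters, and the SC2 value is passed in as a number so that the block stays self-contained. The weights default to the heuristic constants 10 and 5 given above and can be overridden.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence (gaps are allowed)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lccs_length(x: str, y: str) -> int:
    """Length of the longest common consecutive subsequence (no gaps)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def dice(common: int, x: str, y: str) -> float:
    """Dice coefficient 2|Z| / (|X| + |Y|) for a common part of length |Z|."""
    total = len(x) + len(y)
    return 2 * common / total if total else 0.0


def sc3(sc2_score: float, x: str, y: str,
        w_lccs: float = 10.0, w_lcs: float = 5.0) -> float:
    """SC3 = SC2 + w_lccs * DiceLCCS + w_lcs * DiceLCS."""
    return (sc2_score
            + w_lccs * dice(lccs_length(x, y), x, y)
            + w_lcs * dice(lcs_length(x, y), x, y))


if __name__ == "__main__":
    # Hypothetical concatenated codes: "NOM" is one part in the first string
    # but is split across two parts ("NO" + "M") in the second.
    x, y = "NOMKS", "VFNOM"
    print(lcs_length(x, y), lccs_length(x, y), sc3(0, x, y))
```

Because the parts are concatenated before comparison, the shared keys "NOM" are found even when the part boundaries differ, which is exactly the case in which SC2 alone returns 0.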

Using the extended Cangjie codes and a selected score function, we can select a list of visually similar characters for a given character. Using SC3, we can find that every character in the string “೵੄ᇸ৩ยಳમణ” (jing4 jing1 qing1 jing4 jing4 jing1 yun2 qing1) is similar to “࿶” (jing1) in some way. Interestingly, the characters in the string “೵੄ᇸ৩ยಳમణ” belong to different radicals: “僧䬔ًᢰ㜃᥋԰ᣅ” (chuo4 shui3 che1 chi4 chuang2 cao3 mi4 qi4). One may verify that not all of the characters in the string are listed under the same radical as “࿶”, so our approach offers chances to find visually similar characters that belong to different radicals.

3.5 Appropriate Similarity Measures

”. However, “ษε” (hao4 da4) and “ݯε” (zhi4 da4) are not equally attractive for the writing of “੏ε” (hao4 da4). Psycholinguistic evidence has shown that humans do not read text letter by letter for alphabetic languages or character by character for languages such as Chinese (e.g., Jackendoff [1995]). The contexts matter in determining the similarities.

As a result, the “best” similarity measure for computer software depends on the goals of the application. Do we want to build a model for how humans judge visually similar characters? Do we want to build a model for how humans process confusing words in which some characters are visually similar?

In this article, the target application is more closely related to the latter question. In another application that we are building for learning Chinese characters [Liu et al. 2011], we are more concerned with similarity between individual characters. Hence, the similarity measures that we presented in the previous section were just to find “good candidates”, and we will have another measure to compare these first-round candidates.

This is the main reason that we did not report experiments in which we carefully tuned the weights for SC3. The main function of SC3 in the current study was to find good first-round candidates.

Changing the current weights for DiceLCS and DiceLCCS also changes the order of the recommended characters. We illustrate the results of using three different sets of weights below. We show the alternative formulas for SC3 and the resulting recommended lists for “ᜩ” (yun4) and “ਈ” (juan1). From left to right, recommended characters are listed in the order of descending scores. We do not show the Hanyu pinyin symbols of all of the listed characters to avoid congesting the page with the symbols for Chinese pronunciations.

Original: SC3 = SC2 + 10 × DiceLCCS + 5 × DiceLCS

ᜩ: ཞ႗༝፞ሤდᒪᑆຣ঩ᚏᙬᖽᕮᑈᠾᠶውሽᅄᄫᄍ໸ᢌᢕᖻ

ਈ: ཞ੉ীԌ੢ܞඞ࿷ටབचڟ࿝࡮௦ܩൃኗᏹਆܑ෫୬ଯߜမᆵᇬ

Alternative 1: SC3 = SC2 + 10 × DiceLCCS + 0 × DiceLCS

ᜩ: ႗ཞ፞༝ሤდᚏᙬᖽᕮᒪᑈᑆᠾᠶውሽᅄᄫᄍ໸ᢕຣ঩

ਈ: ཞ੉ীԌ੢ܞඞ࿷ටབኗᏹ࿝࡮௦ܩൃ႗Ꮧᜩਆचܑڟ਎෫ྴጋ୬

Alternative 2: SC3 = SC2 + 0 × DiceLCCS + 5 × DiceLCS

ᜩ: ႗ཞሤდ፞༝ᒪᢌᚏᙬᖽᕮ᠔ᑈᑆᠾᠶውሽᅄᄫᄍ໸ᢕᖻຣ

ਈ: ੉ীཞ੢ටඞܞ࿷Ԍ࡮௦ܩൃਃঠ௖௤௥୞චජܕབའ

We can see that there are differences in the lists when we adopted different sets of weights in SC3. However, most, if not all, visually similar characters are included in the lists. Hence, treating these as the first-round candidates, we were satisfied with the weights that we selected. A more practical mechanism to rank these candidate characters in the context of words will be introduced in Section 4.4. Due to that ranking mechanism, the resulting performances of different weights will not differ significantly for the current application, as long as we have chosen a satisfactory set of weights.

When we focus on finding just visually similar characters, there will be no contextual information, that is, the words, available to rank the characters. In such cases, the weights certainly matter. Song et al. [2008] discuss related issues when they build a Chinese spelling checker. We [Liu et al. 2011] also face a similar problem when we need software to find Chinese characters that contain specific components.

4. DATA SOURCES AND PRELIMINARY ANALYSES

We provide information about our lexicons, the sources from which we obtained the reported errors in Chinese text, and our analyses of these reported errors in this section.

4.1 Lexicons

For both traditional and simplified Chinese, we prepare a lexicon that provides information on the pronunciation and a database that contains the extended Cangjie codes for the characters. Our programs rely on these databases to generate lists of characters that are phonologically and visually similar to a given character.

It is not difficult to acquire lexicons that contain information about standard pronunciations for Chinese characters. As we stated in Section 2, the main problem is that it is not easy to predict how people in different areas in China and Taiwan actually pronounce the characters. In the current study we employ the standards for Mandarin Chinese that are recorded in the lexicons and published by the official agency in Taiwan.⁶ Experimental results reported in Section 5 will show that the ethnic background and mother tongues did not influence the performance of our methods very much (at most 1%).

With the procedure reported in Section 3.3, we built databases of extended Cangjie codes for both the traditional and the simplified Chinese. Our database for the traditional Chinese was designed to contain 5,401 common characters in the BIG5 encoding system (between 0xa440 and 0xc67e), which was originally designed for the traditional Chinese. We will call this list of characters TCdict. We converted the traditional Chinese characters to their simplified counterparts and built the database of Cangjie codes for the simplified Chinese. Because two different traditional Chinese characters may be transformed to a common simplified form, this simplified list contains only 5,170 different characters, and we call this list of characters SCdict.
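The reason why the simplified list is shorter than the traditional one can be illustrated with a small sketch. The mapping below is a hypothetical five-entry stand-in for the real conversion table, and the names `TRAD_TO_SIMP` and `build_simplified_list` are introduced here for illustration only.

```python
from typing import Dict, Iterable, List

# Hypothetical miniature conversion table (traditional -> simplified).
# The real TCdict holds 5,401 characters; this sample only shows the effect.
TRAD_TO_SIMP: Dict[str, str] = {
    "發": "发",
    "髮": "发",  # two traditional characters share one simplified form
    "後": "后",
    "后": "后",  # already identical in both scripts
    "國": "国",
}


def build_simplified_list(tc_chars: Iterable[str],
                          mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each simplified form to the traditional characters that yield it."""
    merged: Dict[str, List[str]] = {}
    for ch in tc_chars:
        simplified = mapping.get(ch, ch)  # unmapped characters stay unchanged
        merged.setdefault(simplified, []).append(ch)
    return merged


if __name__ == "__main__":
    sc = build_simplified_list(TRAD_TO_SIMP, TRAD_TO_SIMP)
    print(len(TRAD_TO_SIMP), "traditional ->", len(sc), "simplified")
```

In this toy sample five traditional characters collapse to three simplified forms; the same many-to-one effect reduces 5,401 characters to 5,170.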

Counting from the very first day of the conception of the main ideas, it took us a long time to develop the current TCdict and SCdict. The original idea was published in Liu and Lin [2008], but we have continued to try different ideas since then. With the help of the software that we explained in Section 3.3 for analyzing the frequent substrings of the original Cangjie codes, two graduate students (the third and the fourth authors) were able to come up with a good version of the extended Cangjie codes for the 5,401 traditional Chinese characters in a couple of weeks. That initial version was modified once in a while afterward. The modification operations were motivated by results of sporadic tests we ran with some data (Elist and Jlist, to be explained in Section 4.2), so we used some new data (Wlist and Blist, also to be explained in Section 4.2) to examine the performance of our system.

⁶ See http://www.cns11643.gov.tw/AIDB/welcome en.do.

We employed our experience with the traditional Chinese to build the first and only version of the extended Cangjie codes for the simplified Chinese characters in a few weeks. Most of the work was conducted only by the second author. We did not run experiments for the simplified Chinese while we were building the extended codes. Therefore, the experimental results that we report in Section 5.6 were not already based on new data.

4.2 Sources of Incorrect Words and Their Roles in Experiments

We acquired five lists of reported errors in Chinese at different stages of our study. By 2009, we collected two lists of errors for traditional Chinese, and in 2010, we added two lists of errors for traditional Chinese and a list of errors for simplified Chinese.

All of these lists contained information about the observed errors. In order to facilitate our experiments, we saved the reported errors in a simple format. An item of a reported error contains three parts: the correct word, the correct character that will be replaced, and the actual incorrect character. For instance, the correct way to write a type of banana is “޴ᑴ” (ba1 jiao1) and sometimes people use “ડ” (ba1) for “޴” (ba1). In this case, we will maintain a data item “޴ᑴ,޴,ડ” for this error.
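The three-part item format is easy to handle in code. The sketch below is an assumption about how such items could be parsed, not the format handler used in the study: `ErrorItem` and `parse_item` are hypothetical names, and the sample line uses a stand-in incorrect character because the printed glyphs of the original example did not survive extraction.

```python
from typing import NamedTuple


class ErrorItem(NamedTuple):
    """One reported error: correct word, replaced character, incorrect character."""
    correct_word: str
    correct_char: str
    incorrect_char: str

    @property
    def incorrect_word(self) -> str:
        # Replace only the first occurrence; a word that repeats the same
        # character would need a position field, which the format lacks.
        return self.correct_word.replace(self.correct_char, self.incorrect_char, 1)


def parse_item(line: str) -> ErrorItem:
    """Parse a 'word,correct_char,incorrect_char' line into an ErrorItem."""
    word, good, bad = (field.strip() for field in line.split(","))
    if good not in word:
        raise ValueError(f"{good!r} does not occur in {word!r}")
    return ErrorItem(word, good, bad)


if __name__ == "__main__":
    # Hypothetical sample item (ba1 jiao1 written with a homophone of ba1).
    item = parse_item("芭蕉,芭,巴")
    print(item.correct_word, "->", item.incorrect_word)
```

Keeping the correct word together with both characters makes it possible to rebuild the incorrect word on demand, which is what the Web-based comparison in Section 4.4 needs.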

At the beginning of our study, we acquired two lists of reported errors for traditional Chinese. The first list was obtained from a book published by the Ministry of Education (MOE) in Taiwan [MOE 1996]. The second list was collected in 2008 from the written essays of students of the seventh and the eighth grades in a middle school in Taipei. The errors were entered into computers based on students’ writings, not including those characters that did not actually exist and could not be entered. We call the first list of errors the Elist, and the second the Jlist. Elist and Jlist contain, respectively, 1,490 and 1,718 items of errors.

Two or more different ways to write the same word incorrectly were listed in different items and counted as separate items. When the same character of a word can be written incorrectly in multiple ways, for example, writing “ᔈб” (ying4 fu4) as “ᔈߕ” (ying4 fu4) or “ᔈڦ” (ying4 fu4) in Jlist, we considered them different errors. Cases like these make it difficult for a program to find the best actual incorrect character, as we will see in Sections 5.5 and 5.6.

Repeated or semantically related errors were counted as many times as the errors were committed by writers. Writing “ᡂள׳ӳ” (bian4 de2 geng4 hao3) as “ᡂޑ׳ӳ” (bian4 de1 geng4 hao3) and writing “ᡂள׳ம” (bian4 de2 geng4 qiang2) as “ᡂޑ׳ம” (bian4 de1 geng4 qiang2) can be considered repeated errors. Writing “ᡂள׳ӳ” as “ᡂޑ׳ӳ” and writing “բளόᒱ” (zuo4 de2 bu2 cuo4) as “բޑόᒱ” (zuo4 de1 bu2 cuo4) can be considered related errors in lexical semantics. (These errors were observed in Jlist.)

These decisions helped us preserve the original distribution of the reported errors. That is, we took the test data as they were and did not try to manipulate or change the reported incorrect Chinese words. However, this also allowed a larger influence of the repeated errors on the reported experiment results.

In order to conduct further experiments, we collected two more lists of errors for traditional Chinese in 2010. The main reason for obtaining these lists was to use them as extra test data for our Cangjie codes that were improved during 2008 and 2009. Since we had access to both Elist and Jlist while we were improving the extended Cangjie codes for TCdict, we thought it would be necessary to have new test data that we had never seen before to examine the effectiveness of the improved codes.


Table VI. Quantities of Reported Errors in Different Lists

| Data Source | Original | Reduced |
|---|---|---|
| Elist | 1490 | 1333 |
| Jlist | 1718 | 1645 |
| Wlist | 199 | 188 |
| Blist | 487 | 385 |
| Ilist | 684 | 621 |

The new datasets were acquired from independent sources. The first new list was collected from the Internet,⁷ and the second new list came from errors discussed in a published book that was compiled by scholars [Tsay and Tsay 2003]. The first and the second lists contain 199 and 487 incorrect words, and we refer to these lists as Wlist and Blist, respectively.

In order to test whether our approach works for capturing errors in simplified Chinese, we searched the Internet for reported errors for simplified Chinese, and obtained two lists of errors. The first list⁸ came from the entrance examinations for senior high schools in China, and the second list⁹ contained errors that were observed at senior high schools in China. We used 160 and 524 errors from the former and the latter lists, respectively. Both of these lists of errors were produced by students at the senior high school levels, so we combined them into one list and refer to the combined list as Ilist.

We dropped some of the reported errors in our experiments because of the current scope of study. Some of the reported errors involved characters that did not belong to TCdict (for traditional Chinese) or SCdict (for simplified Chinese). Since we have extended the Cangjie codes only for characters that were included in TCdict for traditional Chinese and in SCdict for simplified Chinese, we ignored reported errors that involved characters outside TCdict or SCdict. This reduced the sizes of the lists that we collected. Table VI shows the sizes of the original and the reduced lists under the Original and Reduced columns, respectively.
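The reduction step amounts to a membership filter. The sketch below assumes that an item is kept only when both its correct and its incorrect character are covered by the dictionary; the text does not spell out the exact criterion, so this rule, the function name `reduce_list`, and the tiny sample data are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# (correct word, correct character, incorrect character)
Item = Tuple[str, str, str]


def reduce_list(items: Iterable[Item], dictionary: Set[str]) -> List[Item]:
    """Keep items whose correct and incorrect characters are both covered.

    Assumed criterion: an error is usable only if extended Cangjie codes
    exist for both characters being compared.
    """
    return [
        (word, good, bad)
        for word, good, bad in items
        if good in dictionary and bad in dictionary
    ]


if __name__ == "__main__":
    # Hypothetical sample: the second item uses a character outside the dictionary.
    sample_dict = {"芭", "巴", "蕉"}
    sample_items = [("芭蕉", "芭", "巴"), ("芭蕉", "蕉", "焦")]
    kept = reduce_list(sample_items, sample_dict)
    print(f"original={len(sample_items)} reduced={len(kept)}")
```

The two counts printed at the end play the same role as the Original and Reduced columns of Table VI.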

4.3 Preliminary Error Analyses

In order to know the main reasons that caused the production of the observed errors, we asked two native speakers to classify the causes of these errors into three categories based on whether the errors were related to phonological similarity, visual similarity, or neither. Since the annotators did not always agree on their classifications, the final results are presented in five categories: P, V, N, D, and B in Table VII. P and V indicate that the annotators agreed on the types of errors to be related to, respectively, phonological and visual similarity. N indicates that both annotators believed that the errors were not due to phonological or visual similarity. D indicates that the annotators believed that the errors were due to phonological or visual similarity, but they did not have a consensus on the category. B indicates the intersection of P and V, that is, errors that are related to both phonological and visual similarities. Table VII shows the percentages of errors in these categories.

We used the quantities of reported errors in the reduced lists as the denominators to compute the percentages in Table VII. Hence, 79.9% in the “Jlist” row indicates that 1,314 ( = 1645×0.799) errors were classified as related to phonological similarity. To get 100% for a row in the table, we need to add P, V, N, and D, and subtract B from the total.

⁷ See http://www.eyny.com/archiver/tid-2529010.html; last visited on 30 September 2010.

⁸ See http://www.0668edu.com/soft/4/12/95/2008/2008091357140.htm; last visited on 10 June 2010.

⁹ See http://gaozhong.kt5u.com/soft/2/38018.html; last visited on 30 September 2010.


Table VII. Error Analysis of the Errors: Phonological Influences Dominated in These Errors

| List | P | V | N | D | B |
|---|---|---|---|---|---|
| Elist (traditional) | 67.2% | 66.1% | 0.2% | 3.6% | 37.1% |
| Jlist (traditional) | 79.9% | 30.7% | 2.4% | 7.9% | 20.9% |
| Wlist (traditional) | 69.1% | 54.8% | 4.8% | 8.0% | 36.7% |
| Blist (traditional) | 81.6% | 34.8% | 1.6% | 4.7% | 22.6% |
| Ilist (simplified) | 83.1% | 48.3% | 0% | 3.7% | 35.1% |

In all of these five lists, phonological similarity showed a stronger influence on the reported errors than visual similarity. Most of the reported errors were related to similar pronunciations, while the percentage of errors that were related to visual similarity depended on the lists of the reported errors. It should not be very surprising that the annotators may disagree sometimes.

The weighted proportion of phonologically related errors is 76.0%. Based on the statistics shown in Table VI and Table VII, this analysis considered 4,172 errors (the total of the errors in the reduced lists). The total number of errors that were related to similar pronunciation is 1333× 0.672 + 1645 × 0.799 + 188 × 0.691 + 385 × 0.816 + 621× 0.831 = 3170.25. The result of dividing 3170.25 by 4172 is 76.0%. Similarly, we can compute that the weighted proportion of visually related errors is (1333× 0.661 + 1645× 0.307 + 188 × 0.548 + 385 × 0.348 + 621 × 0.483) ÷ 4172 = 46.1%.
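The weighted proportions can be reproduced directly from the figures in Table VI and Table VII. The following sketch hard-codes those published numbers; the names `REDUCED`, `TABLE_VII`, and `weighted_percentage` are introduced here for illustration.

```python
# Sizes of the reduced lists (Table VI).
REDUCED = {"Elist": 1333, "Jlist": 1645, "Wlist": 188, "Blist": 385, "Ilist": 621}

# Percentages per list in the column order P, V, N, D, B (Table VII).
COLUMNS = "PVNDB"
TABLE_VII = {
    "Elist": (67.2, 66.1, 0.2, 3.6, 37.1),
    "Jlist": (79.9, 30.7, 2.4, 7.9, 20.9),
    "Wlist": (69.1, 54.8, 4.8, 8.0, 36.7),
    "Blist": (81.6, 34.8, 1.6, 4.7, 22.6),
    "Ilist": (83.1, 48.3, 0.0, 3.7, 35.1),
}


def weighted_percentage(column: str) -> float:
    """Average one Table VII column, weighting each list by its reduced size."""
    idx = COLUMNS.index(column)
    total = sum(REDUCED.values())
    return sum(REDUCED[name] * TABLE_VII[name][idx] for name in REDUCED) / total


if __name__ == "__main__":
    print("total errors:", sum(REDUCED.values()))
    for col in COLUMNS:
        print(col, round(weighted_percentage(col), 1))
```

Weighting by list size lets the larger lists (Elist and Jlist) dominate the averages, which is why the overall figures stay close to the values of those two lists.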

It is particularly noticeable that although the errors in Jlist were collected from written documents, the phonological factor still dominated. It is a common belief that the dominance of pronunciation-related errors in electronic documents occurs as a result of the common habit of entering Chinese with pronunciation-based methods. The ratio between P and V, that is, P ÷ V, for the Jlist challenges this popular belief and indicates that even though the errors occurred during a writing process, rather than typing on computers, students still produced more pronunciation-related errors. Distribution over error types is not as related to input method as one may have believed. Nevertheless, the observation might still be a result of students in Taiwan being so used to entering Chinese text with a pronunciation-based method that the organization of their mental lexicons is also pronunciation related. The P ÷ V ratio for the Ilist also supports this phenomenon, suggesting that the dominance of phonological influence may be a common phenomenon in the use of both traditional and simplified Chinese. The ratio for the Elist suggests that editors of the MOE book may have chosen the examples with a special viewpoint in their minds: that of balancing pronunciation and composition related errors. (The Blist is so short that we do not consider it representative in regard to this issue.)

It is worthwhile to note that a large percentage of errors are related to either phonological or visual similarity in Chinese. The sum of the statistics under N and D columns indicates the proportion of errors that were related to neither visual nor phonological similarity. The weighted average of (N + D) for the five lists was just 7%. The lowness of this figure can be explained by the large percentage of phono-semantic compounds (xingsheng words, “׎ᖂӷ”) in Chinese.

4.4 Web-Based Statistics

In this section, we examine the effectiveness of using Web-based statistics to differentiate correct and incorrect characters. The abundance of text material on the Internet allows people to treat the Web as a corpus.¹⁰ When we send a query to Google, we will be informed of the estimated number of pages¹¹ (ENOPs) that possibly contain relevant information. If we put the query terms in quotation marks, we should find the Web pages that replicate the query forms in the exact sequence and with the same adjacency as those originally entered. Hence, it is possible for us to compare the ENOPs for two competing phrases for guessing the correct way of writing a word. For instance, at the time of this writing, Google reported 116,000 and 33,000 relevant pages, respectively, for “strong tea” and “powerful tea”. (When conducting such advanced searches with Google, the quotation marks are needed to ensure the adjacency of the individual words.) Hence, “strong” appears to be a better choice to go with “tea”. This is an idea similar to one of the approaches for computing collocations based on word frequencies [cf., Manning and Schütze 1999]. Although the idea may not work very well when using a small database, the size of the current Web should be large enough.

¹⁰ See http://webascorpus.org.

Table VIII. Reliability of Web-Based Statistics (Based on Data Collected in April 2010)

| | Elist C | Elist A | Elist I | Jlist C | Jlist A | Jlist I | Ilist C | Ilist A | Ilist I |
|---|---|---|---|---|---|---|---|---|---|
| P | 92.4% | 0.1% | 7.5% | 91.3% | 0.9% | 7.8% | 97.1% | 0.0% | 2.9% |
| V | 92.6% | 0.0% | 7.4% | 91.5% | 0.6% | 7.9% | 98.0% | 0.0% | 2.0% |

We ran experiments only for those items for which the annotators agreed on the causes of the error. Hence, for instance, we had 1285 (= 1333 × (1 − 0.036), cf. Table VI and Table VII), 1515 (= 1645 × (1 − 0.079)), and 598 (= 621 × (1 − 0.037)) such words for Elist, Jlist, and Ilist, respectively. As the information available on the Web may change over time, we also have to note that the statistics reported in Table VIII were based on experiments conducted during April 2010.

Table VIII shows the results of our investigation. For each reported error, we submitted the correct word and the incorrect word to Google and considered that we had a correct result when we found that the ENOP for the correct word was larger than the ENOP for the incorrect word. If the ENOPs were equal, we recorded an ambiguous result; and when the ENOP for the incorrect word was larger, we recorded an incorrect event. We use C, A, and I to denote correct, ambiguous, and incorrect events, respectively, in the table. We record a correct result for the “strong tea vs. powerful tea” test, for instance.
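The decision rule for C, A, and I events is simple enough to state as code. The sketch below works on page counts that have already been collected; it makes no search-engine calls, and the function names `classify` and `summarize` as well as all counts other than the “strong tea” pair are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify(enop_correct: int, enop_incorrect: int) -> str:
    """Return 'C' (correct), 'A' (ambiguous), or 'I' (incorrect) for one error."""
    if enop_correct > enop_incorrect:
        return "C"
    if enop_correct == enop_incorrect:
        return "A"
    return "I"


def summarize(pairs: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Percentage of C, A, and I events over (correct, incorrect) ENOP pairs."""
    counts = Counter(classify(c, i) for c, i in pairs)
    total = sum(counts.values())
    if total == 0:
        return {"C": 0.0, "A": 0.0, "I": 0.0}
    return {event: 100.0 * counts[event] / total for event in "CAI"}


if __name__ == "__main__":
    # The first pair is the "strong tea" vs. "powerful tea" example from the
    # text; the remaining pairs are made-up counts for illustration.
    observed = [(116_000, 33_000), (10, 10), (3, 7), (8, 2)]
    print(classify(*observed[0]))
    print(summarize(observed))
```

A row of Table VIII corresponds to one call of `summarize` over all consensus items of a list that fall into the P or V category.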

The Web-based statistics did not work very well for the Elist and Jlist, but seemed to work well enough for Ilist. The most common reason for the errors is that certain words are confusing to the extent that the majority of the Web pages showed the incorrect words. Some of the errors are so common that even one of the Chinese input methods on Windows XP offered wrong words as possible choices, for example, “໢ॆॆ” (xiong2 jiu1 jiu1; the correct one) vs. “໢ޟޟ” (xiong2 jiu1 jiu1). It is also interesting to note that people may intentionally use incorrect words on some occasions; for instance, people may choose to write homophones in advertisements.

Another possible reason for the mistakes is that whether a word is correct depends on a larger context. For instance, “λථ” (xiao3 si1) is more popular than “λት” (xiao3 si1) because the former is a popular nickname. Unless we provided more contextual information about the queried words, checking only the ENOPs of “λථ” and “λት” would lead us to choose “λት”, which would be an incorrect word if we meant to find the right way to write “λථ”. Other difficult pairs of words to distinguish are “इᒵ” (ji4 lu4) vs. “૶ᒵ” (ji4 lu4) and “໪ा” (xu1 yao4) vs. “ሡा” (xu1 yao4).

¹¹ According to Croft et al. [2010], the ENOPs may not reflect the actual number of pages on the Internet; they may result from statistical estimations.