Two Applications of Lexical Information to Computer-Assisted Item Authoring for Elementary Chinese

Chao-Lin Liu, Kan-Wen Tien, Yi-Hsuan Chuang, Chih-Bin Huang, Juei-Yu Weng

Department of Computer Science, National Chengchi University, Taiwan

{chaolin, g9627, s9436, g9614, s9403}@cs.nccu.edu.tw

Abstract.† Testing is a popular way to assess one’s competence in a language. The assessment can be conducted by students for self-evaluation or by teachers in achievement tests. In this paper, we present two applications of lexical information that assist the task of test-item authoring. Applying information implicitly contained in a machine-readable lexicon, our system offers semantically and lexically similar words to help teachers prepare test items for cloze tests. Employing information about the structures and pronunciations of Chinese characters, our system provides characters that are similar in either formation or pronunciation for the task of word correction. Experimental results indicate that our system furnishes quality recommendations for the preparation of test items, in addition to expediting the process.

Keywords: Computer-assisted test-item authoring, Chinese synonymy, Chinese-character formation, natural language processing

1 Introduction

The history of applying computing technologies to language learning and teaching dates back at least 40 years, to 1960, when the Programmed Logic for Automatic Teaching Operations, usually referred to as PLATO, was initiated [6; p. 70]. The computing power of modern computers and the access to information provided by the Internet offer an unprecedented environment for language learning.

The techniques for natural language processing (NLP) [11] are useful for designing systems for information retrieval, knowledge management, and language learning, teaching, and testing. In recent years, the applications of NLP techniques have received attention from researchers in the Computer Assisted Language Instruction Consortium (often referred to as CALICO, http://calico.org/, instituted in 1983) and from researchers in computational linguistics, e.g., in the United States of America [13] and in Europe [5]. Heift and Schulze [6] report that there are over 100 documented projects that employed NLP techniques for assisting language learning.

† Due to the subject matter of this paper, we must show Chinese characters in the text. Whenever appropriate, we provide the Chinese characters with their Romanized forms and their translations in English. We use traditional Chinese and Hanyu Pinyin, and use Arabic digits to denote the tones in Mandarin (http://en.wikipedia.org/wiki/Pinyin). Readers who do not read Chinese may treat the individual characters as isolated pictures or simply as symbols.

The applications of computing technologies to the learning of the Chinese language can also be traced back as far as 40 years, when researchers applied computers to collate and present Chinese text for educational purposes [14]. The superior computing power of modern computers enables researchers and practitioners to invent more sophisticated tools for language learning. As a result, such computer-assisted language learning applications are no longer limited to academic laboratories, and have expanded into real-world classrooms [1, 16].

In this paper, we focus on how computers may help teachers assess students’ competence in Chinese, and introduce two new applications that help teachers prepare test items for elementary Chinese. Students’ achievements in cloze tests provide a good clue to whether they have learned the true meanings of the words, and the ability to identify and correct a wrong word in the so-called word-correction tests is directly related to students’ ability in writing and reading. Our system offers semantically or lexically similar words for preparing the cloze items, and provides characters that are visually or phonetically similar to the key characters for the word-correction items.

Experimental results indicate that the confusing characters that our system recommended were competitive in quality, even when compared with those offered by native speakers of Chinese.

We explain how to identify semantically and lexically similar Chinese words in Section 2, elaborate how to find structurally and phonetically similar Chinese characters in Section 3, and report an empirical evaluation of our system in Section 4. Finally, we make concluding remarks in Section 5.

2 Semantically and Lexically Similar Words for Cloze Tests

A cloze test is a multiple-choice test in which one and only one of the candidate words is correct. The examinee has to find the correct answer that fits the blank position in the sentence. A typical item looks like the following. (A translation of the sentence used in this test item is “The governor officially ___ the Chinese teachers’ association yesterday, and discussed with the chairperson the education problems for the Chinese language in California.” The four choices, including the answer, are different ways to say “visit” or “meet” in Chinese.)

州長於昨日正式___中文教師學會，與會長深入討論加州的中文教育問題。

(a) 見面 (b) 走訪 (c) 拜訪 (d) 訪視

Cloze tests are quite common in English tests, such as GRE and TOEFL. Applying techniques for word sense disambiguation, Liu et al. [10] reported a working system that can help teachers prepare items for cloze tests for English. Our system offers a similar service for Chinese cloze tests.

To create a cloze test item, a teacher determines the word that will be the answer to the test item, and our system searches a corpus for the sentences that contain the answer and presents these sentences to the teacher. The teacher chooses one of these sentences for the test item, and our system replaces the answer with a blank in the chosen sentence (the resulting sentence is usually called the stem in computer-assisted item generation), and shows an interface for further authoring tasks.
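The stem-creation step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `find_candidate_sentences` and `make_stem` are hypothetical, the two-sentence corpus is a toy example, and plain substring matching stands in for whatever corpus search the actual system performs.

```python
def find_candidate_sentences(corpus, answer):
    """Return the corpus sentences that contain the answer word.

    Assumption: plain substring matching, with no word segmentation.
    """
    return [sentence for sentence in corpus if answer in sentence]


def make_stem(sentence, answer, blank="____"):
    """Replace the first occurrence of the answer with a blank."""
    return sentence.replace(answer, blank, 1)


# Toy corpus (hypothetical); the teacher picks "拜訪" as the answer.
corpus = [
    "州長於昨日正式拜訪中文教師學會。",
    "我們明天去喝茶。",
]
candidates = find_candidate_sentences(corpus, "拜訪")
stem = make_stem(candidates[0], "拜訪")
print(stem)
```

In practice the teacher, not the program, would choose among the candidate sentences; the sketch simply takes the first one.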

A cloze item needs to include distracters, in addition to the stem and the correct answer in the choices. To help teachers prepare the distracters, we present two types of candidate word lists. The first type of list includes the words that are semantically similar to the answer to the cloze item, and the second type of list contains the words that are lexically similar.

We have two sources of semantically similar words. The easier way is to rely on a Web-based service offered by the Institute of Linguistics at the Academia Sinica to find Chinese words of similar meanings [3], and present these words to the teachers as candidates for the distracters. We have also built our own synonym finder with HowNet (http://www.keenage.com). HowNet is a bilingual machine-readable lexicon for English and Chinese. HowNet employs a set of basic semantic units to explain Chinese words. Overlapping basic semantic units of two Chinese words indicate that these words share a portion of their meanings. Hence, we can build a synonym finder based on this observation, and offer semantically related words to the teachers when they need candidate words for the distracters of the cloze items. In addition, words that share more semantic units are more related than those that share fewer units, which gives a simple way to prioritize multiple candidate words.
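The prioritization by shared semantic units might look like the sketch below. The lexicon here is a hand-made toy: the unit labels such as "visit" and "formal" are invented for illustration and are not actual HowNet definitions, and `rank_by_shared_units` is a hypothetical name rather than part of any HowNet interface.

```python
def rank_by_shared_units(answer, lexicon):
    """Rank words by how many semantic units they share with the answer.

    `lexicon` maps each word to a set of semantic-unit labels.
    Words that share no unit with the answer are left out.
    """
    target = lexicon[answer]
    scored = []
    for word, units in lexicon.items():
        if word == answer:
            continue
        shared = len(target & units)
        if shared > 0:
            scored.append((shared, word))
    # More shared units first; ties are broken by the word itself.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored]


# Toy entries (hypothetical unit labels, not real HowNet data).
toy_lexicon = {
    "拜訪": {"visit", "human", "formal"},
    "造訪": {"visit", "human", "formal"},
    "走訪": {"visit", "human"},
    "見面": {"meet", "human"},
    "喝茶": {"drink", "tea"},
}
ranked = rank_by_shared_units("拜訪", toy_lexicon)
print(ranked)
```

Counting set overlap is only one possible reading of "share more semantic units"; a weighted or normalized measure would fit the same interface.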

When assisting the authoring of cloze items, we can obtain lists of Chinese words that are semantically similar to the answer to the cloze item with the aforementioned methods. For instance, “造訪” (zao(4) fang(3)), “拜會” (bai(4) hui(4)), and “走訪” (zou(3) fang(3)) carry a meaning similar to that of “拜訪” (bai(4) fang(3)). Teachers can either choose or avoid these words, which are semantically similar yet possibly inappropriate in the given context, for the test items.

It is common practice for teachers in Taiwan to use lexically similar words as distracters. For this reason, our system presents words that share characters with the answer as possible distracters. For instance, both “喝酒” (he(1) jiu(3)) and “奉茶” (feng(4) cha(2)) can serve as distracters for “喝茶” (he(1) cha(2)) because each shares one character with it at exactly the same position in the word. We employ HowNet to find candidate words of this category.
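A minimal sketch of the position-wise comparison follows. It assumes that candidate words have the same length as the answer, which the text does not state explicitly, and it uses a small hand-made vocabulary instead of HowNet; `lexically_similar` is a hypothetical name.

```python
def lexically_similar(answer, vocabulary):
    """Return words sharing at least one character with the answer
    at the same position.

    Assumption: only words of the same length as the answer qualify.
    """
    result = []
    for word in vocabulary:
        if word == answer or len(word) != len(answer):
            continue
        if any(a == b for a, b in zip(answer, word)):
            result.append(word)
    return result


# Toy vocabulary (hypothetical); "茶葉" contains "茶" but at a different position.
vocabulary = ["喝酒", "奉茶", "喝茶", "茶葉", "拜訪"]
distracters = lexically_similar("喝茶", vocabulary)
print(distracters)
```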

3 Visually and Phonetically Similar Words for Word Correction

In this section, we explain how our system helps teachers prepare test items for “word correction.” In this type of test, a teacher intentionally replaces a Chinese character with an incorrect character, and asks students to identify and correct this incorrect character. A sample test item for word correction follows. (A translation of this Chinese string: The wide varieties of the exhibits in the flower market dazzle the visitors.)

花市中各種展品讓人眼花繚亂 (“繚” is incorrect, and should be replaced with “撩”)

Such an incorrect character is typically similar to the correct character either visually or phonetically. Since it is usually easy to find information about how a Chinese character is pronounced, given a lexicon, we turn our attention to visually similar characters. Visually similar characters are important for learning Chinese. They are also important in psychological studies on how people read Chinese [12, 15]. We present some similar Chinese characters in the first subsection, illustrate how we encode Chinese characters in the second subsection, elaborate how we improve the encoding method to facilitate the identification of similar characters in the third subsection, and discuss the weaknesses of our current approach in the last subsection.

3.1 Examples of Visually Similar Chinese Characters

We show three categories of similar Chinese characters in Figures 1, 2, and 3. Groups of similar characters are separated by spaces in these figures. In Figure 1, characters in each group differ at the stroke level. Similar characters in every group in the first row in Figure 2 share a common component, but the shared component is not the radical of these characters. Similar characters in every group in the second row in Figure 2 share a common component, which is the radical of these characters. Similar characters in every group in Figure 2 have different pronunciations. We show six groups of homophones that also share a common component in Figure 3. Characters that are similar in both pronunciations and internal structures are most confusing to new learners.

It is not difficult to list all of the characters that have the same or similar pronunciations, e.g., “試” and “市”, if we have a machine-readable lexicon that provides information about the pronunciations of characters and if we ignore special patterns for tone sandhi in Chinese [2].
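Grouping characters by pronunciation can be sketched as below. The sketch assumes a lexicon that maps each character to a single Pinyin syllable with a tone digit, so it ignores characters with multiple readings as well as tone sandhi; the function name `group_by_sound` and the five-entry lexicon are illustrative only.

```python
from collections import defaultdict


def group_by_sound(pronunciations, ignore_tone=False):
    """Group characters whose Pinyin readings coincide.

    With ignore_tone=True the trailing tone digit is dropped, so that
    characters differing only in tone count as similar in sound.
    """
    groups = defaultdict(list)
    for char, pinyin in pronunciations.items():
        key = pinyin.rstrip("12345") if ignore_tone else pinyin
        groups[key].append(char)
    # Keep only the groups that contain at least two characters.
    return {key: chars for key, chars in groups.items() if len(chars) > 1}


# Toy lexicon: one reading per character (a simplifying assumption).
toy_readings = {"試": "shi4", "市": "shi4", "詩": "shi1", "構": "gou4", "購": "gou4"}
same_sound = group_by_sound(toy_readings)
similar_sound = group_by_sound(toy_readings, ignore_tone=True)
print(same_sound)
print(similar_sound)
```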

In contrast, it is relatively difficult to find, in an efficient manner, characters that are written in similar ways, e.g., “構” and “購”. It is tempting to resort to image processing methods to find such structurally similar characters, but the computational costs can be very high. There are more than 22000 different characters in Chinese [7], so directly computing the similarity between images of these characters demands a lot of computation: there are about 242 million character pairs. The Ministry of Education in Taiwan suggests that about 5000 characters are needed for everyday communication. Even in this case, there are about 12.5 million pairs.
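The pair counts quoted above follow from the number of unordered pairs among n characters, n(n-1)/2. The short check below uses exactly 22000 and 5000 as the character counts, which is a rounding assumption since the text gives both figures only approximately.

```python
import math

# Number of unordered character pairs: n * (n - 1) / 2.
pairs_all = math.comb(22000, 2)    # all characters, taking n = 22000
pairs_common = math.comb(5000, 2)  # everyday characters, taking n = 5000

print(f"{pairs_all:,} pairs among 22000 characters")
print(f"{pairs_common:,} pairs among 5000 characters")
```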

The number of combinations is just one of the bottlenecks. We may have to shift the positions of the characters “appropriately” to find the common component of a character pair. What counts as an appropriate shift is not easy to define, which makes the image-based method less directly useful; for instance, the common component of the characters in the rightmost group in the second row in Figure 3 appears in different places in the characters.

Lexicographers employ the radicals of Chinese characters to organize the characters into sections in dictionaries. Hence, information about radicals should be useful. The groups in the second row in Figure 2 show some examples. The shared components in these groups are radicals of the characters, so we can find the characters of the same group in the same section in a Chinese dictionary. However, information about radicals as they are defined by the lexicographers is not sufficient. The groups of characters shown in the first row in Figure 2 have shared components. Nevertheless, the shared components are not considered radicals, so the characters, e.g., “頸” and “勁”, are listed in different sections in the dictionary.

士土工干千戌戍成田由甲申母毋勿匆人入未末采釆凹凸

Fig. 1. Some similar Chinese characters

頸勁搆溝陪倍硯現裸棵搞篙列刑盆盎盂盅因困囚間閒閃開

Fig. 2. Some similar Chinese characters that have different pronunciations

形刑型踵種腫購構搆紀記計
園圓員脛逕徑痙勁

Fig. 3. Homophones with a shared component

Table 1. Cangjie codes for some characters

Character  Cangjie code
士  十一
土  土

3.2 Encoding the Chinese Characters with the Cangjie Codes

The Cangjie method is one of the most popular methods for people to enter Chinese into computers. The designer of the Cangjie method, Mr. Chu, selected a set of 24 basic elements in Chinese characters, and proposed a set of rules to decompose Chinese characters into these elements [4]. Hence, it is possible to define the similarity between two Chinese characters based on the similarity between their Cangjie codes.

Table 1 has three sections, each showing the Cangjie codes for some characters in Figures 1, 2, and 3. Every Chinese character is decomposed into an ordered sequence of elements. (We will see shortly that a subsequence of these elements comes from a major component of a character.) Evidently, computing the number of shared elements provides a viable way to determine “visual similarity” for the characters that appear in Figures 2 and 3. For instance, we can tell that “搞” and “篙” are similar because their Cangjie codes share “卜口月”, which in fact represents “高”.
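One way to compare two Cangjie codes is sketched below. The text only says that the number of shared elements is computed; finding the longest shared run with `difflib` and counting common elements with `Counter` are illustrative choices, not necessarily what the system does. The codes “手卜口月” and “竹卜口月” for “搞” and “篙” are supplied here as commonly listed Cangjie codes and should be treated as an assumption, since Table 1 is not reproduced in full.

```python
from collections import Counter
from difflib import SequenceMatcher


def longest_shared_run(code_a, code_b):
    """Return the longest contiguous run of elements found in both codes."""
    matcher = SequenceMatcher(None, code_a, code_b, autojunk=False)
    match = matcher.find_longest_match(0, len(code_a), 0, len(code_b))
    return code_a[match.a:match.a + match.size]


def shared_element_count(code_a, code_b):
    """Count the elements common to both codes, respecting multiplicity."""
    return sum((Counter(code_a) & Counter(code_b)).values())


# Assumed codes for 搞 and 篙; both end with the elements of 高.
code_gao3 = "手卜口月"  # 搞
code_gao1 = "竹卜口月"  # 篙
print(longest_shared_run(code_gao3, code_gao1))
print(shared_element_count(code_gao3, code_gao1))
```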

Unfortunately, the Cangjie codes do not appear to be as helpful for identifying the similarities between characters that differ subtly at the stroke level, e.g., “士土工干” and others listed in Figure 1. There are special rules for decomposing these relatively basic characters in the Cangjie method, and these special encodings make the resulting codes less useful for our tasks.

The Cangjie codes for characters that contain multiple components were intentionally simplified to allow users to input Chinese characters more efficiently. The average number of keystrokes needed to enter a character is a critical factor in designing input methods for Chinese. The longest Cangjie code among all Chinese characters contains five elements. As shown in Table 1, the component “巠” is represented by “一女一” in the Cangjie codes for “脛” and “徑”, but is represented only by “一一” in the codes for “頸” and “勁”. The simplification makes it relatively harder to identify visually similar characters by comparing the actual Cangjie codes.

3.3 Engineering the Cangjie Codes for Practical Applications

Though useful for the design of an input method, the simplification of Cangjie codes causes difficulties when we use the codes to find similar characters. Hence, we choose to use the complete codes for the components in our database. For instance, the complete codes for “巠”, “脛”, “徑”, “頸”, and “勁” are, respectively, “一女女一”, “月一

decomposed vertically; e.g., “盅” can be split into two smaller components, i.e., “中” and “皿”. Some characters can be decomposed horizontally; e.g., “現” consists of “王” and “見”. Some have enclosing components; e.g., “人” is enclosed in “囗” in “囚”. Hence, we can consider the locations of the components as well as the number of shared components in determining the similarity between characters.

Figure 4 illustrates the layouts of the components in Chinese characters that were adopted by the Cangjie method [9]. A sample character is placed below each of these layouts. A box in a layout indicates a component, and there can be at most three components in a character. We use digits to indicate the ordering of the components. Due to space limits in the figure, we do not show all digits for the components.

After recovering the complete code from the simplified Cangjie code of a character, we can associate the character with a tag that indicates the overall layout of its components, and separate the code sequence of the character according to that layout. Hence, the information about a character includes the tag for its layout and between one and three sequences of code elements. The layouts are numbered from left to right and from top to bottom in Figure 4.

Table 2 shows the annotated and expanded codes of the sample characters in Figure 4 and the codes for some characters that we will discuss. Elements that do not belong to the original Cangjie codes of the characters are shown in a bounding box.

Fig. 4. Layouts of Chinese characters (used in Cangjie)

Table 2. Some annotated and expanded codes (columns: Layout, Comp. 1, Comp. 2, Comp. 3)

Recovering the elements that were dropped by the Cangjie method and organizing the subsequences of elements into components facilitate the identification of similar characters. With our database, it is now easier than with the original Cangjie codes in Table 1 to find that the character (頸) that is represented by “一女女一” and “一月山金” looks similar to the character (徑) that is represented by “竹人” and “一女女一”. Checking the codes for “員” and “圓” in Table 1 and Table 2 offers additional support for our design decisions.

Computing the similarity between characters using a database of such strengthened Cangjie codes is very efficient. In the worst case, we need to compare nine pairs of code sequences for two characters that both have three components. Since these are simple string comparisons, the computation is cheap. It takes less than one second to find visually similar characters from a list of 5000 characters on a Pentium IV 2.8 GHz CPU with 2 GB of RAM. Moreover, we can offer a search service that allows psycholinguistics researchers to look for characters that contain specific components located at particular places within the characters.
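The component-wise comparison can be sketched as follows. The record type `Entry`, the function `shared_components`, and the layout number used below are hypothetical; only the component sequences for “頸” and “徑” are taken from the text. The sketch compares every component of one character with every component of the other, which is at most 3 × 3 = 9 string comparisons, and reports where each match sits.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    layout: int                   # layout tag (hypothetical numbering)
    components: Tuple[str, ...]   # one to three element sequences


def shared_components(x: Entry, y: Entry) -> List[Tuple[str, int, int]]:
    """List components that occur in both characters.

    Each result is (component, position in x, position in y).
    """
    matches = []
    for i, comp_x in enumerate(x.components):
        for j, comp_y in enumerate(y.components):
            if comp_x == comp_y:
                matches.append((comp_x, i, j))
    return matches


# Component sequences as given in the text; the layout tags are assumed.
jing3 = Entry(layout=2, components=("一女女一", "一月山金"))  # 頸
jing4 = Entry(layout=2, components=("竹人", "一女女一"))      # 徑
print(shared_components(jing3, jing4))
```

Because the positions of the matching components are returned, the same routine could support a search for characters that contain a given component at a given place, in the spirit of the search service mentioned above.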

3.4 Drawbacks of Using the Cangjie Codes

Using the Cangjie codes as the basis for comparing the similarity between characters introduces some potential problems.

It appears that the Cangjie codes for some characters, particularly the simple ones, were not assigned according to unambiguous principles. Relying on the Cangjie codes to compute the similarity between such characters can be difficult. For instance, “分” uses the fifth layout, but “兌” uses the first layout in Figure 4. The first section in Table 1 shows the Cangjie codes for some character pairs that are difficult to compare.

It appears that we need to mark the similarity among such special characters manually, perhaps with the interactive assistance of the methods proposed in this paper.

Except for the characters that use the first layout, the Cangjie method splits
