
We have changed several aspects of our system since we conducted the experiments reported in the previous subsection. We have built a new version of the extended Cangjie codes for the characters in TCdict and the first version of the extended Cangjie codes for the characters in SCdict, using the procedure discussed in Section 3.3.

Table X. AIRs for Ranking the Candidates for Elist Based on the Frequencies Collected in 2009 and 2010

       R1    R2    R3    R4    R5    R6    R7    R8    R9    R10

March 2009
SS    55.5  74.6  82.4  85.8  87.6  89.3  90.1  90.7  90.7  91.1
SD    10.0  13.8  15.4  16.3  16.6  17.1  17.4  17.6  17.9  18.3
MS     2.1   2.3   2.5   2.7   2.9   3.0   3.0   3.0   3.0   3.0
MD     1.1   1.4   1.4   1.4   1.4   1.4   1.7   1.7   1.7   1.7
SC1   33.6  49.4  55.9  61.4  65.8  67.7  68.8  71.2  72.0  73.2
SC2   30.6  43.5  50.7  56.8  60.6  64.1  65.5  67.1  68.7  70.1
RS     2.7   3.5   3.7   4.0   4.1   4.1   4.1   4.1   4.1   4.1

April 2010
SS    55.1  73.4  81.7  86.0  88.4  90.0  90.4  90.9  91.1  91.2
SD    10.0  13.2  15.1  16.1  16.6  16.9  17.3  17.6  17.8  17.9
MS     2.1   2.5   2.8   2.8   2.9   2.9   3.0   3.0   3.0   3.0
MD     1.0   1.4   1.4   1.4   1.6   1.6   1.6   1.6   1.7   1.7
SC1   31.1  48.0  57.1  62.8  66.6  68.4  69.9  70.8  72.7  73.7
SC2   29.4  43.7  52.1  58.5  62.7  64.5  65.9  67.0  68.6  69.6
RS     2.8   3.5   3.8   3.8   4.0   4.0   4.0   4.1   4.1   4.1

We have also added a new score function, SC3, which we did not have in 2009.

Furthermore, as we explained in Section 5.3, we no longer submit our queries to the ordinary Google interface.

To make the results of the experiments with traditional Chinese more convincing, we used two new lists, Wlist and Blist, that were unknown to us when we improved the Cangjie codes in TCdict. These two lists serve as unforeseen test instances for our programs and databases.

We ran ICCEval with Elist, Jlist, Wlist, Blist, and Ilist. The experiments covered all categories of phonological and visual similarity. When using SS, SD, MS, MD, and RS as the selection criteria, we did not limit the number of candidate characters: every character that conformed to the selection criterion was included in the candidate list, L, in ICCEval. When using SC1, SC2, and SC3 as the selection criteria, we limited the number of candidates to at most 30. We inspected samples of the candidate lists generated with SC1, SC2, and SC3 and found that the number of visually similar characters for a given character rarely exceeded 30; hence, this limit was chosen heuristically.
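
The following is a minimal sketch of the step-1 candidate selection just described, not the actual ICCEval implementation. The character inventory (ALL_CHARS) and the helpers is_similar and similarity_score are hypothetical stand-ins for our lexicon, similarity tables, and score functions.

```python
# A minimal sketch of step-1 candidate selection; the helpers below are
# hypothetical stand-ins, not the actual ICCEval implementation.

ALL_CHARS = []           # placeholder: the full character inventory
RULE_CRITERIA = {"SS", "SD", "MS", "MD", "RS"}
SCORE_CRITERIA = {"SC1", "SC2", "SC3"}
MAX_SCORED = 30          # heuristic cap for the score-based criteria

def is_similar(criterion, c1, c2):
    """Hypothetical rule-based similarity test (e.g., same sound, same tone)."""
    raise NotImplementedError

def similarity_score(criterion, c1, c2):
    """Hypothetical visual-similarity score function (SC1, SC2, or SC3)."""
    raise NotImplementedError

def candidate_list(char, criterion):
    """Build the candidate list L for a given character and criterion."""
    if criterion in RULE_CRITERIA:
        # Rule-based criteria: keep every character that satisfies the
        # rule, with no limit on the number of candidates.
        return [c for c in ALL_CHARS
                if c != char and is_similar(criterion, char, c)]
    # Score-based criteria: rank by score and keep at most 30 candidates.
    ranked = sorted((c for c in ALL_CHARS if c != char),
                    key=lambda c: similarity_score(criterion, char, c),
                    reverse=True)
    return ranked[:MAX_SCORED]
```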

We considered only words for which the native speakers agreed on the causes of the errors. There is a limit on the maximum number of queries that one can submit to the Google AJAX API. As a consequence, we could not complete our experiments in a short time, and the ENOPs were obtained during March and April 2010.

Table XI shows the inclusion rates, that is, the rates at which the candidate lists generated with different criteria at step 1 contained the incorrect characters in the reported errors. The columns have the same meanings as in Table IX. The rows show the statistics that we observed when using the lists in Table VI as the test data. For instance, we achieved an inclusion rate of 90.3% for the visually similar errors when we applied SC3 to generate the candidate lists for the errors in Wlist.
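
In code, the inclusion rate can be expressed as below. This sketch assumes that each reported error is a (correct character, incorrect character) pair and reuses the hypothetical candidate_list function from the previous sketch.

```python
def inclusion_rate(errors, criterion):
    """Percentage of reported errors whose incorrect character appears in
    the step-1 candidate list generated for the corresponding correct
    character. `errors` is a list of (correct, incorrect) pairs."""
    hits = sum(1 for correct, incorrect in errors
               if incorrect in candidate_list(correct, criterion))
    return 100.0 * hits / len(errors)
```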

ICCEval and our databases worked well for traditional Chinese. Although we have slightly expanded the definitions of similar sounds since 2009, the effectiveness of SS, SD, MS, and MD remains about the same for Elist and Jlist in Table IX and Table XI. The statistics for RS did not change because we used the same list in 2009 and in 2010. The statistics for SC1 and SC2 are slightly better in Table XI for both Elist and Jlist, but the improvements are not significant.

Table XI. Inclusion Rates (%) for the Different Experiments

        SS    SD   MS   MD   SC1   SC2   SC3   RS  Phone  Visual   All
Elist  91.6  18.4  3.0  1.9  77.7  76.3  87.3  4.1   99.0    89.8  96.2
Jlist  92.2  20.2  4.2  3.7  77.0  71.3  89.3  6.1   99.3    91.9  99.3
Wlist  94.6  23.1  0.8  0.8  80.6  78.6  90.3  1.1   99.2    90.3  96.9
Blist  82.2  24.2  3.2  1.9  77.6  75.4  90.3  3.7   95.2    91.8  94.7
Ilist  82.6  29.3  2.1  1.6  78.3  71.0  87.7  1.3   97.3    90.0  96.5

Using the new score function, SC3, together with the new Cangjie codes, significantly improved the inclusion rates for visually related errors. We were able to include 88.3%¹⁴ of the visually similar errors with SC3.

We were able to include approximately 10% more of the actual incorrect characters in our experiments when we used SC3 rather than SC1 or SC2 to generate the candidate lists.

Even though we had not seen the errors in Wlist and Blist¹⁵ previously, our programs showed robust performance. ICCEval achieved performance with Wlist comparable to that with Elist and Jlist. When working with Blist, ICCEval did not perform as well on phonologically similar errors, but showed similar performance on visually similar errors. The change, however, was not very significant, and the results reflected the preferences of the experts who wrote the book from which we obtained Blist. The effectiveness of the SS lists dropped, while the effectiveness of the SD lists increased. We inspected the errors in Blist closely and found many challenging instances, namely, cases in which the native speakers found that the incorrect words are becoming more frequently used in practice. For this reason, we consider that ICCEval performed reasonably well.

When running with Ilist, ICCEval achieved performance similar to that which it achieved with Blist. As in the other experiments, it was easier to find phonologically similar incorrect characters than visually similar ones.

SS and SC3 were the most effective selection criteria at step 1 in ICCEval for phonologically and visually similar characters, respectively. The contribution of SD was quite significant for Ilist.

When we used the unions of the phonologically similar characters to compute the inclusion rates, we captured 98.6% of the phonologically similar errors for the five lists.

The unions of the visually similar characters were also very effective, though they captured only about 90.5% of the visually similar errors. When we used the union of all of the candidate lists, we captured 97.4% of all the errors. These figures are weighted averages of the inclusion rates, calculated with a procedure similar to the one provided in the tenth footnote.

It is certainly desirable for applications to have the potential to capture all of the reported errors. However, the inclusion rates were achieved at different costs. For each reported error, with its actual incorrect character, ICCEval generated a candidate list at step 1. In an experiment that used a list of y errors, we would therefore have y candidate lists, and, for a particular experiment, we can calculate the average length of these y candidate lists. Table XII shows the average lengths for the experiments reported in Table XI.
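
The cost measure is simply the mean list length. A small sketch, again reusing the hypothetical candidate_list function from above:

```python
def average_length(errors, criterion):
    """Average length of the y candidate lists generated in one experiment.
    `errors` is a list of (correct, incorrect) pairs, as above."""
    lengths = [len(candidate_list(correct, criterion))
               for correct, _incorrect in errors]
    return sum(lengths) / len(lengths)
```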

¹⁴The rates are weighted averages computed considering the numbers of errors in the error lists, listed in Table VI and Table VII. For instance, 88.3 = (87.3 × 1333 × 0.661 + 89.3 × 1645 × 0.307 + 90.3 × 188 × 0.548 + 90.3 × 385 × 0.348 + 87.7 × 621 × 0.483) ÷ (1333 × 0.661 + 1645 × 0.307 + 188 × 0.548 + 385 × 0.348 + 621 × 0.483).
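
This weighted average can be reproduced directly with the numbers quoted in footnote 14. Our reading that each weight is an error-list size multiplied by the proportion of visually similar errors in that list is an assumption based on the footnote's form.

```python
# Reproducing the weighted average in footnote 14. Each weight is, by our
# reading, an error-list size times the proportion of visually similar
# errors in that list (Elist, Jlist, Wlist, Blist, Ilist).
rates   = [87.3, 89.3, 90.3, 90.3, 87.7]   # SC3 inclusion rates (%)
weights = [1333 * 0.661, 1645 * 0.307, 188 * 0.548,
           385 * 0.348, 621 * 0.483]

weighted_avg = sum(r * w for r, w in zip(rates, weights)) / sum(weights)
print(round(weighted_avg, 1))  # prints 88.3
```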

¹⁵We used only 20 of the reported errors in Blist in Liu et al. [2009a].

Table XII. Average Lengths of the Candidate Lists

        SS    SD    MS    MD   SC1   SC2   SC3   RS  Phone  Visual    All
Elist  11.3  18.6   9.7  21.8  23.3  26.7  25.4  9.2   56.6    48.8  102.1
Jlist  12.4  22.0  11.6  25.4  21.9  24.3  25.4  7.7   64.3    46.0  107.4
Wlist  11.8  17.4  10.7  22.0  22.6  26.1  25.5  9.2   56.4    48.8  102.0
Blist  14.2  22.2  10.4  22.5  22.2  25.6  25.7  8.1   62.5    47.4  106.9
Ilist  12.6  19.1   9.1  19.5  24.3  27.1  25.5  9.4   55.5    47.8  100.2

Clearly, longer candidate lists would increase the chances of achieving higher inclusion rates. Hence, a shorter candidate list that achieves the same inclusion rate as a longer one is preferable. From this perspective, the statistics in Table XII show that SS is very effective for capturing phonologically similar errors: we were able to capture more than 89.8% of the phonologically similar errors with an average of only 12.2 characters. (12.2 is the weighted average of 11.3, 12.4, 11.8, 14.2, and 12.6.) Taking the union of the SS, SD, MS, and MD lists to obtain the "Phone" lists increased the average length from 12.2 to 60.0, but increased the inclusion rate only from 89.8% to 98.5%.

Using SC3 as the selection criterion for visually similar errors offered a significant improvement in both effectiveness and efficiency. The weighted average lengths of the candidate lists selected with SC1, SC2, and SC3 were 22.8, 25.7, and 25.4, respectively, while the weighted average inclusion rates for SC1, SC2, and SC3 were 77.6%, 73.6%, and 88.6%, respectively. Using SC3 thus allowed us to achieve higher inclusion rates with shorter candidate lists. Taking the union of the SC1, SC2, SC3, and RS lists to form the "Visual" lists increased the average length of the candidate lists to 47.4, but increased the inclusion rate only marginally, to 90.9%.

To achieve the inclusion rates in the All column of Table XI, we would have to allow ICCEval to recommend 104.3 characters on average. Although a list of 104 characters is too long to be practically useful, we should keep in mind that we have reduced the search space from more than 5,100 characters to approximately 100 characters, a compression rate of about 98%. This point is particularly important to bear in mind as we seek to apply our findings to help teachers select "attractive incorrect characters" when authoring test items for the ICC tests.
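
As a quick check, assuming the compression rate is computed as one minus the ratio of the average candidate-list length to the size of the original search space:

```latex
1 - \frac{104.3}{5100} \approx 0.980
```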
