
To make the candidate lists applicable, we wish to place the actual incorrect character high in the ranked list. This will improve efficiency when supporting computer-assisted test-item writing. Having shorter lists that contain relatively more confusing characters may facilitate the data preparation for psycholinguistic studies as well.

Table XIII shows the results when we recommended only the leading ten candidates for the errors in Jlist. The table is divided into two parts. The upper part (with row heading “Frequency”) shows the results when we used the raw values of the ENOPs to rank the candidate characters, and the lower part (with row heading “PMI”) shows the results when we used Equation (4) in Section 5.1 to rank the candidate characters.

The "Ri" columns show the accumulative inclusion rates (AIRs) that we defined in Equation (6) in Section 5.2. The sub-row headings show the selection criteria that were used in the experiments. For instance, using SS as the criterion and ranking with the raw values of ENOPs, 55.1% of the phonologically related errors were included if we offered only one candidate, 70.6% were included if we offered two candidates, and so on. If we recommended only the top five candidates in SS (ranked with ENOPs), we captured the phonologically similar errors 84.3% of the time.
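
For concreteness, the following sketch (with hypothetical function and variable names; this is not the code used in our experiments) computes AIRs at each cutoff from ranked candidate lists, assuming Equation (6) counts the proportion of errors whose actual incorrect character appears among the top-i candidates:

```python
# A minimal, hypothetical sketch of computing accumulative inclusion rates
# (AIRs), assuming Equation (6) is the proportion of test errors whose
# actual incorrect character appears among the top-i ranked candidates.

def accumulative_inclusion_rates(ranked_candidates, actual_errors, max_rank=10):
    """ranked_candidates: one candidate list per test error, already sorted
    by the chosen score (e.g., raw ENOP frequency or PMI).
    actual_errors: the actual incorrect character for each test error."""
    totals = [0] * max_rank
    for candidates, actual in zip(ranked_candidates, actual_errors):
        top = candidates[:max_rank]
        if actual in top:
            rank = top.index(actual)          # 0-based position of the hit
            for i in range(rank, max_rank):   # counted at every cutoff >= rank
                totals[i] += 1
    n = len(actual_errors)
    return [round(100.0 * t / n, 1) for t in totals]  # AIRs in percent

# Example: with two errors whose actual characters rank 1st and 3rd,
# R1 = 50.0, R2 = 50.0, R3 = 100.0.
airs = accumulative_inclusion_rates(
    [["甲", "乙", "丙"], ["丁", "戊", "己"]],
    ["甲", "己"],
    max_rank=3,
)
print(airs)  # [50.0, 50.0, 100.0]
```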

Table XIII. AIRs for Ranking the Candidates for Jlist Based on ENOPs and PMIs

            R1    R2    R3    R4    R5    R6    R7    R8    R9    R10   Rnum
Frequency
  SS        55.1  70.6  77.6  81.7  84.3  86.2  87.6  88.2  89.2  89.7  92.2
  SD        10.6  13.8  15.9  16.9  17.7  18.0  18.1  18.5  18.7  18.8  20.2
  MS         3.0   3.3   3.5   3.7   3.7   3.7   3.7   3.8   3.8   3.9   4.2
  MD         2.2   2.7   2.9   3.0   3.1   3.3   3.3   3.3   3.3   3.3   3.7
  Phone     42.9  57.5  64.6  70.1  73.7  77.7  81.1  83.0  84.6  86.1  99.3
  SS+SD     43.7  56.3  64.9  71.7  75.5  78.7  81.1  83.2  84.4  85.5  95.1
  SC1       40.2  53.5  59.8  64.2  65.9  67.7  68.3  69.1  70.5  72.1  77.0
  SC2       34.7  48.5  52.1  55.4  57.6  60.0  62.0  63.0  64.4  65.5  71.3
  SC3       42.6  56.6  64.4  69.7  73.9  77.0  78.8  80.4  81.6  83.6  89.3
  RS         5.3   5.9   5.9   5.9   5.9   5.9   5.9   5.9   6.1   6.1   6.1
  Visual    35.3  50.4  57.9  61.5  66.1  69.2  72.4  74.0  76.0  77.6  91.9
PMI
  SS        47.0  62.2  70.7  75.8  79.0  82.8  84.1  85.5  86.9  87.9  92.2
  SD         9.4  12.0  14.1  15.3  15.8  16.6  16.8  17.7  18.2  18.4  20.2
  MS         2.7   3.3   3.3   3.6   3.6   3.6   3.7   3.7   3.7   3.7   4.2
  MD         2.1   2.5   2.7   2.8   2.8   3.0   3.0   3.0   3.1   3.1   3.7
  Phone     37.0  51.0  59.7  64.5  69.0  72.7  75.6  77.7  79.3  80.5  99.3
  SC1       38.2  49.9  56.4  59.4  63.0  66.1  67.5  69.1  70.5  71.7  77.0
  SC2       33.7  45.0  50.1  54.9  57.0  59.0  61.8  63.4  64.4  64.6  71.3
  SC3       39.6  54.3  63.2  69.5  73.9  75.4  78.2  79.4  80.6  82.4  89.3
  RS         3.8   5.9   5.9   5.9   5.9   5.9   5.9   5.9   6.1   6.1   6.1
  Visual    34.7  47.4  56.9  61.9  66.9  69.8  72.8  76.4  77.0  78.4  91.9

For errors that were related to visual similarity, recommending the top five candidates in SC3 (ranked with PMIs) would capture the actual incorrect characters 73.9% of the time. As we explained in Section 5.2, the AIRs must be smaller than or equal to the inclusion rates of the individual experiments. In Table XIII, we copy the inclusion rates of the Jlist row in Table XI into the Rnum column.

The statistics listed in Table XIII show the effectiveness of our ranking mechanisms, both ENOPs and PMIs. The difference (Rnum − Ri) is a good indicator of the degree of sacrifice required when we recommend only the top i candidates rather than the complete candidate lists. When we shortened the candidate lists to fewer than 10 characters, we did not sacrifice the inclusion rates significantly. When we recommended 10 characters, the differences (Rnum − R10) were not large, especially considering that we would have to put forward much longer lists of candidate characters, for example, those in Table XII, to achieve Rnum. One exception to this observation is that providing the complete candidate lists selected with the SS criterion may be worthwhile: according to Table XII, suggesting an average of 12.4 characters achieved Rnum.

Recall that using the union of the candidate lists, such as Phone and Visual in Table XII, helped us to achieve higher inclusion rates. Although higher inclusion rates are desirable, the detailed statistics in the Phone and Visual sub-rows in Table XIII shed light on the drawbacks of the union lists. If we present only the top k candidates to those who need the similar characters of a given character, the union of the lists might not provide better performance profiles than those of the individual lists.

It is not very difficult to understand the potentially inferior performance of the union lists. Assume that the rank of the actual incorrect character is j in a list, say L. This implies that there are already at least (j − 1) characters that are mistakenly considered better candidates by the score functions. After we put the lists together

Fig. 3. AIRs of the union lists might not be as good as individual lists.

and rank the joined lists, these (j − 1) characters still win against the actual incorrect character. In addition, other characters that were not in L might be ranked higher than the actual incorrect character in the union. As a consequence, the rank of the actual incorrect character might not improve in the union lists.

Consider, in one particular test, two ranked lists SS = {A, B} and SD = {C, D}, where A, B, C, and D are four different characters. Hence, we must have score(A) ≥ score(B) and score(C) ≥ score(D). Assume that B is the actual incorrect character, that score(C) ≥ score(B), and that score(B) ≥ score(D). The union of SS and SD will be {A, C, B, D}. The rank of the actual incorrect character will drop from 2 in SS to 3 in the union. This is a situation in which the joined list might not outperform the best individual list.

However, it remains possible that the joined lists perform better than the individual ones. This could happen when the actual incorrect character is included in only one of the individual lists and when its rank remains the same in the joined list. If, in the previous example, score(B) is larger than score(C), then the joined list will perform as well as SS. Moreover, if, in another test, we have SS = {E, F} and SD = {G, H}, where E, F, G, and H are different characters, and where G is the actual incorrect character, then the joined list will perform better than SS.

Given the reasons provided in the previous two paragraphs, we cannot tell whether or not the joined lists will perform better than the best performing individual lists.
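
The following sketch (with hypothetical scores, not data from our experiments) reproduces the two scenarios above: merging ranked lists can demote the actual incorrect character, or it can recover a character that one list misses entirely:

```python
# A minimal sketch (hypothetical scores) of how a merged candidate list
# may or may not preserve the rank of the actual incorrect character.

def rank_in(ranked_list, target):
    return ranked_list.index(target) + 1 if target in ranked_list else None

def merge_by_score(scores, *lists):
    # Pool the candidates and re-rank them by the shared score function.
    return sorted(set().union(*lists), key=lambda c: scores[c], reverse=True)

# Scenario 1: B is the actual error; score(C) >= score(B), so B drops
# from rank 2 in SS to rank 3 in the union of SS and SD.
scores = {"A": 4.0, "C": 3.0, "B": 2.0, "D": 1.0}
SS, SD = ["A", "B"], ["C", "D"]
union = merge_by_score(scores, SS, SD)            # ['A', 'C', 'B', 'D']
print(rank_in(SS, "B"), rank_in(union, "B"))      # 2 3

# Scenario 2: G is the actual error and appears only in SD, so the union
# captures it while SS alone misses it entirely.
scores2 = {"E": 4.0, "G": 3.5, "F": 3.0, "H": 1.0}
SS2, SD2 = ["E", "F"], ["G", "H"]
union2 = merge_by_score(scores2, SS2, SD2)        # ['E', 'G', 'F', 'H']
print(rank_in(SS2, "G"), rank_in(union2, "G"))    # None 2
```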

Figure 3 shows four pairs of examples for the experiments with Jlist. In these cases, the joined lists did not perform as well as the best-performing individual lists. The charts were drawn based on the statistics listed in Table XIII.

For instance, the curve “Freq-SS” in the chart with the title “Jlist-Freq” was based on the data in the sub-row “SS” in the “Frequency” part in Table XIII. The performance profiles of SS lists dominated those of Phone lists, and the performance profiles of SC3 lists dominated those of Visual lists in these cases. When we ranked the candidate characters with PMIs, the results were similar, and are shown in the chart titled “Jlist-PMI.”

It is interesting to explore whether we may improve the performance of the Phone list by not considering the characters that were in the MS and MD lists. We conducted such an experiment, and, in the middle of Table XIII, the sub-row SS + SD shows the AIRs of the union list that was formed from the candidate characters originally in the SS and SD lists. We can compare the performances of the Phone list and the SS + SD list. Overall, the SS + SD list provides better, but not significantly better, performance.

Table XIV. AIRs for Ranking the Candidates for Elist Based on ENOPs and PMIs

            R1    R2    R3    R4    R5    R6    R7    R8    R9    R10
Frequency
  SS        55.2  72.7  81.9  85.2  87.5  88.9  89.7  90.0  90.5  90.7
  SD        11.3  14.4  15.6  16.2  16.8  17.2  17.6  17.7  17.8  17.9
  MS         2.0   2.2   2.6   2.7   2.9   2.9   2.9   2.9   3.0   3.0
  MD         1.3   1.6   1.6   1.6   1.6   1.6   1.8   1.8   1.8   1.8
  Phone     39.7  57.1  68.0  72.8  77.1  79.9  82.5  84.6  86.0  87.1
  SS+SD     46.7  62.8  73.5  78.9  83.7  86.2  88.6  89.9  90.6  92.0
  SC1       32.4  46.8  55.7  62.0  65.1  67.7  69.4  71.0  72.7  73.1
  SC2       27.7  41.6  49.2  55.3  59.2  62.6  64.9  66.8  67.9  69.5
  SC3       33.6  49.3  57.6  63.2  68.4  71.4  74.4  77.7  79.2  81.5
  RS         2.8   3.6   3.6   3.7   3.8   4.0   4.0   4.1   4.1   4.1
  Visual    27.7  41.7  49.1  55.1  59.8  63.7  66.7  69.3  71.7  74.1
PMI
  SS        51.8  73.9  82.2  85.6  88.1  89.0  89.6  89.7  90.2  90.5
  SD        10.5  14.5  16.1  16.9  17.3  17.3  17.4  17.6  17.8  17.8
  MS         1.7   2.1   2.6   2.6   2.6   2.7   2.8   2.8   2.9   2.9
  MD         1.1   1.6   1.7   1.7   1.7   1.8   1.8   1.8   1.9   1.9
  Phone     40.5  61.6  72.5  77.7  81.6  84.5  86.6  88.2  89.6  91.0
  SC1       35.5  50.5  59.1  63.7  67.7  69.8  71.0  72.2  73.1  74.4
  SC2       32.5  47.1  53.8  58.6  62.5  66.1  67.9  69.6  70.4  71.3
  SC3       36.4  53.2  64.1  69.4  74.9  77.5  79.7  80.7  82.0  82.9
  RS         2.6   3.7   3.8   4.0   4.1   4.1   4.1   4.1   4.1   4.1
  Visual    30.5  47.1  56.5  62.6  67.3  70.7  72.9  75.2  76.8  78.7

In addition to ranking the candidate characters directly with their ENOPs, we also ranked the characters with their PMIs, shown in Equation (2) and Equation (4) in Section 5.1, and repeated the experiments with Elist and Jlist. The lower part of Table XIII shows the observed statistics. Qualitatively, the statistics in the upper and the lower parts of Table XIII do not show different trends: SS and SC3 remain the most effective methods for finding phonologically and visually similar errors, respectively. Overall, finding phonologically similar errors is easier than finding visually similar errors. Providing candidate lists that contained only 10 characters achieved reasonable performance.

The most noticeable difference between the upper and the lower parts of Table XIII is in the R1 column. It appears that, if we recommend only one candidate character, ranking with the raw values of ENOPs offers better inclusion rates than ranking with PMIs. Although this observation holds for the statistics in Table XIII, which we collected from the experiments that used Jlist, the trend did not survive in our experiments with Elist.

Table XIV shows exactly the same sets of statistics as those shown in Table XIII. The only difference is that we used Elist, rather than Jlist, to repeat all of the experiments that produced Table XIII. The statistics in Table XIV indicate the same trends as those suggested by most of the statistics in Table XIII, so we do not repeat the same statements.

A major contribution of the statistics in Table XIV appears in Figure 4, which shows that using PMIs or ENOPs did not guarantee differences in performance. We drew the left chart based on the data in Table XIII and the right chart based on the data in Table XIV. The curves in the left chart show that using PMIs offered inferior performance compared with using the raw values of ENOPs, while the curves in the right chart show the opposite trend.

Fig. 4. Using PMI does not necessarily outperform using ENOPs.

Table XV. Ranking the Candidates Based on ENOPs and PMIs for Ilist

            Frequencies                     PMIs
            R1    R2    R3    R4    R5      R1    R2    R3    R4    R5
  SS        70.3  77.7  80.6  81.0  81.6    66.7  76.0  80.2  81.0  81.6
  SD        25.6  28.3  28.9  28.9  29.3    24.8  28.1  28.9  29.1  29.3
  MS         1.4   1.7   1.7   1.7   1.7     1.6   2.1   2.1   2.1   2.1
  MD         1.6   1.6   1.6   1.6   1.6     1.2   1.6   1.6   1.6   1.6
  Phone     76.7  89.1  93.6  94.8  95.5    70.5  86.2  92.1  94.4  95.7
  SC1       64.7  72.0  76.0  78.0  78.3    61.7  72.3  75.0  76.7  77.7
  SC2       58.0  64.7  68.3  70.7  71.0    53.7  66.3  69.3  70.3  70.3
  SC3       71.3  80.0  85.0  86.7  87.0    67.0  80.0  84.7  86.0  86.3
  RS         1.3   1.3   1.3   1.3   1.3     1.3   1.3   1.3   1.3   1.3
  Visual    71.0  82.3  86.3  89.3  89.3    65.0  81.7  86.7  88.3  89.0

Although PMIs are frequently used to measure the co-occurrence of two events, including the collocation of words, examining Formula (4) discussed in Section 5.1 reveals an intuitive interpretation of the PMIs in our application. The formula measures the percentage of observing the incorrect words (i.e., C ∧ X) given that the candidate character appeared (i.e., X in the formula; recall that this is the character that would replace the correct character). Such percentages can be diluted if X happens to be a high-frequency character.
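
Assuming that Equation (4) takes the standard PMI form (a sketch of this interpretation only; the exact normalization in Section 5.1 may differ), the reading above corresponds to
\[
\mathrm{PMI}(C, X) \;=\; \log \frac{P(C \wedge X)}{P(C)\,P(X)} \;=\; \log \frac{P(C \wedge X)/P(X)}{P(C)} ,
\]
where the ratio P(C ∧ X)/P(X) is the percentage described above; the larger P(X) is, i.e., the more frequent the candidate character X, the smaller this ratio and hence the PMI score become.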

Similar to an experiment that we conducted for the Jlist, we also created the union list SS + SD for the experiments with Elist; the results are shown in the middle of Table XIV. This time, the SS + SD list outperformed the Phone list by a margin of about 5%. However, the resulting performance profile of SS + SD is still inferior to that of SS. Interestingly, when we consider the top 10 candidate characters, the SS + SD list marginally outperformed not only the Phone list but also the SS list.
