
Regarding research question 1, MFRM can display the interactions among rater severity, rater experience, test-taker ability, and item difficulty through the Facets variable map and its logit scale.

First, the Facets variable map is shown in Figure 4. This map displays the distribution of the four main facets in this study: test takers, rater experience, rater severity, and items. The logit scale appears in the left-hand column, with the zero point of the metric set at the average item difficulty. The test takers are listed by measure in the second column, rater experience appears in the third column, and item difficulty is in the second column from the right.

Figure 4. Facets Variable Map of the Study

From this variable map, most of the test takers' performances are plotted between +1 and −1 logits. That is to say, the test takers' abilities are approximately normally distributed, and the sample covers a reasonable ability range. In addition, half of the test takers' proficiency levels were good enough to complete this translation test; most of them answered the items according to their proficiency levels, and their answering behaviors were largely appropriate. The test items were also balanced between easy and difficult levels, and the rater facet shows a balance between experienced and novice raters, with all raters near average severity. When conducting research, it is important to check whether test takers pass or fail a test based more on ability than on luck or guessing. As can be seen in Figure 4, therefore, the data from the 225 samples in this study could properly be used to examine the four facets of test-taker ability, rater severity, rater experience, and item difficulty (in terms of grammatical features).

However, there are still some extreme values, lying between +1 and +2 and between −1 and −2 logits.

Several reasons could be proposed to explain this phenomenon. There is a single outlier near +2 whose proficiency level is unusually high, such that he or she could provide translations that were scored highly. The reasons behind the logits below −1, in contrast, might be more complicated; some are discussed as follows.

First, the lower scores might be due to lower proficiency levels: on the logit scale, some test takers' proficiency levels are far below average. Second, differences in testing time might cause different response behaviors, because fatigue or other personal physical conditions can depress scores. Third, under the schedule of Taiwan's college entrance system, some test takers had already entered college through the Recommendation-Selection Admission Program in April; as a result, some of them could not maintain good focus while taking the test.

Table 4. Students' Measurement Report

With extremes, Model, Populn: RMSE = .21, Adj (True) S.D. = .84, Separation = 3.96, Reliability = .94
With extremes, Model, Fixed (all same) chi-square: 3470.7, d.f.: 218, significance (probability): .00


Table 4 shows the separation reliability for the test-taker facet. The value of 0.94 indicates strong differentiation of ability among these test takers. The fixed χ2 tests the hypothesis that all test takers have the same ability; this χ2 is highly significant (p < .05), leading us to reject the hypothesis that they all share the same proficiency level.
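The separation and reliability figures in Table 4 follow directly from the RMSE and adjusted (true) SD that Facets reports. As a quick arithmetic check, a sketch using only the values quoted above:

```python
# Person separation statistics from the values reported in Table 4:
# RMSE = .21 (average measurement error), Adj (True) S.D. = .84.
rmse = 0.21      # root mean-square standard error of the person measures
true_sd = 0.84   # observed spread with measurement error removed

separation = true_sd / rmse                        # G = "true" SD / RMSE
reliability = separation**2 / (1 + separation**2)  # R = G^2 / (1 + G^2)

print(round(separation, 2))   # 4.0 (Facets reports 3.96 from unrounded inputs)
print(round(reliability, 2))  # 0.94, matching Table 4
```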

Here, fit statistics should be used to detect problematic items and person performances (Bond & Fox, 2012). According to Linacre (2000), the most ideal and productive item is one whose mean-square fit ranges from 0.5 to 1.5. If the index is smaller than 0.5, the item is less productive for measurement due to overfit; that is, it is insufficiently sensitive to variance in test-taker ability to form a useful part of the test. This situation does not degrade overall test quality, but it may produce misleadingly high reliability and separation values. A value larger than 1.5 but smaller than 2.0 implies that the item is not productive for constructing the measurement, but does not degrade the quality of the test either. A value larger than 2.0 suggests that the item is so badly written that it may distort or degrade overall test quality. Generally speaking, misfit occurs when an item has an index greater than 1.5, and mean-square values over 2.0 indicate serious misfit. Table 5 summarizes the implications of mean-square values for measurement (Linacre, 2002).

Table 5. The Implication for Measurement of Mean-square Values (Linacre, 2002)

Mean-square Value   Implication for Measurement
> 2.0               The item is so badly written that it may distort or degrade the overall test quality.
1.5 – 2.0           The item is not productive for constructing the measurement, but does not degrade the quality of the test either.
0.5 – 1.5           The item is the most ideal and productive.
< 0.5               The item is less productive for measurement.
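The bands in Table 5 can be expressed as a small helper. The function below is only an illustration of Linacre's cut-offs, not part of the Facets output:

```python
def interpret_mean_square(ms: float) -> str:
    """Classify an infit/outfit mean-square value using Linacre's (2002) bands."""
    if ms > 2.0:
        return "may distort or degrade the measurement"
    if ms > 1.5:
        return "unproductive, but not degrading"
    if ms >= 0.5:
        return "productive for measurement"
    return "less productive (overfit); may inflate reliability"

print(interpret_mean_square(1.0))  # productive for measurement
print(interpret_mean_square(1.7))  # unproductive, but not degrading
print(interpret_mean_square(2.4))  # may distort or degrade the measurement
```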


Table 6. Raters' Measurement Report

Model, Fixed (all same) chi-square: 14.5, d.f.: 5, significance (probability): .01

In Table 6, the six raters' mean-square outfit values fall between 0.5 and 1.5, indicating that their judgments on these 10 items are stable. In addition, the values are close to the expected value of one, which shows that the quality of the ratings is high. Also, the rater separation reliability of 0.60 indicates that the raters are reliably similar rather than reliably different. Although the rating quality is high, when the χ2 is taken into account, there are still significant differences among the raters (p < 0.05). These differences can be examined further when rater experience is taken into account.
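The fixed chi-square reported in Table 6 (χ2 = 14.5, df = 5) can be sanity-checked without statistical tables: a chi-square variate with five degrees of freedom is the sum of five squared standard-normal draws, so its upper-tail probability can be simulated. This is a Monte Carlo sketch, not the procedure Facets itself uses:

```python
import random

random.seed(0)  # make the simulation repeatable
chi2_observed, df, trials = 14.5, 5, 200_000

# Fraction of simulated chi-square(df) variates at least as large as 14.5
exceed = sum(
    1 for _ in range(trials)
    if sum(random.gauss(0, 1) ** 2 for _ in range(df)) >= chi2_observed
)
p_value = exceed / trials
print(p_value)  # close to the exact tail probability of about .013, i.e. p < .05
```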

This also bears on the second research question: to what extent do the characteristics of raters, in terms of their experience, affect scores on the translation items? The following reports help to elucidate this question.


Table 7. Raters’ Experience Measurement Report

Model, Fixed (all same) chi-square: 7.6, d.f.: 1, significance (probability): .01

As shown in Table 7, the mean-square outfit values of the expert and novice raters range from 0.5 to 1.5, well within the generally accepted range of good fit. This indicates that both groups of raters, expert and novice, exhibit reliable scoring behavior that is conducive to measurement. Moreover, the rater separation reliability of 0.74 indicates that the raters are reliably similar rather than reliably different.

Although the mean-square values are acceptable, the significant χ2 value (χ2 = 7.6, df = 1, p = .01) indicates that raters' experience does affect their scoring in some way.
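For Table 7's fixed test, the tail probability has a closed form, because a chi-square with one degree of freedom is a squared standard normal: P(χ²₁ ≥ x) = erfc(√(x/2)). A quick check of the reported χ2 = 7.6 with df = 1:

```python
import math

chi2_observed = 7.6  # fixed "all same" chi-square from Table 7, df = 1

# For df = 1, the upper-tail probability is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2_observed / 2))
print(round(p_value, 2))  # 0.01, matching the reported significance
```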

Table 8. Task Measurement Report



In Table 8, the item separation reliability of 0.99 means that the sample size and the variance in item difficulty are sufficiently large to locate the items precisely along the continuum of the ability trait. Thus, item difficulty and person ability could be estimated accurately. In addition, the outfit mean-square values under 1.5 and the correlations around 0.7 indicate that the measurement quality of these items is good. Items ten and five are the easiest, at −0.32 logits, while item six is the most difficult, at 0.43 logits.

Table 9. The Interaction between Items and Experience


Table 9 shows that novices were harsh in rating items 1, 9, 4, and 10, whereas they were lenient on items 7, 3, 5, and 6. Examination of the item difficulties for the respective rater groups (listed under Measure) shows that the differences between the two groups are minor, because only two of the ten items exhibit significantly different difficulty estimates: item eight (t = 2.20, p = .0284) and item two (t = −3.23, p = .0013). The significance of these differences indicates that experts and novices view these particular items differently. Based on Table 8, the difficulty level of item eight is high (0.27), and novice raters scored it more severely than expert raters (novice severity 0.34 > expert severity 0.20). Conversely, the difficulty level of item two is low (−0.04), and novice raters were more lenient than expert raters (novice severity −0.14 < expert severity 0.06). According to Table 6, although intra-rater grading behavior is stable, there are still significant differences among the raters. In addition, it is reasonable for raters to hold different opinions when rating the same item.
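The t statistics in Table 9 compare the two groups' difficulty estimates for a single item: the difference in measures divided by the pooled standard error. The sketch below uses the item-eight severities quoted above (novice 0.34, expert 0.20); the standard errors are hypothetical, since Table 9's error column is not reproduced here:

```python
import math

novice_measure, expert_measure = 0.34, 0.20  # from the text, item 8
se_novice, se_expert = 0.05, 0.04            # HYPOTHETICAL standard errors

# Pairwise interaction t: difference over the root-sum-square of the SEs
t = (novice_measure - expert_measure) / math.sqrt(se_novice**2 + se_expert**2)

# Two-tailed p under a normal approximation to the t distribution
p_two_tailed = math.erfc(abs(t) / math.sqrt(2))
print(round(t, 2))  # 2.19 with these assumed SEs (Facets reports 2.20)
```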

The Chinese prompt of item eight is “這十年來,許多台灣創作的影片已經逐漸受到國際注目,甚至贏得了全球性獎項。” The expected answer is “Over the past decade/ For ten years/ For the decade, many films (which were) created/made/produced in Taiwan have gradually gained international focus/attention and even won worldwide prizes/ global awards.” Table 10 shows the classification of the lexical and grammatical components of this item. The sentence structure is a simple sentence (S + Vt + O), the tense is present perfect, and the topic is Taiwanese films. According to the College Entrance Examination Committee (CEEC), the majority of the vocabulary in this item falls between Level 1 and Level 2, with two words at Level 3 (gradual and awards). This word range is one of the reasons for the differential item difficulty. Moreover, completing this translation item perfectly requires a precise ability to transfer between the two languages, because the item is not a short sentence. Within a short testing time, test takers have to draw on their internal knowledge to convert between the two languages and find the best translation to score points; the more words an item contains, the greater its difficulty. In addition, the word “逐漸” (gradually) was ignored by some test takers. From this result, we may infer that adverbs are easily omitted by learners, because adverbs here only modify the sentence without a concrete referent. Therefore, in this translation item, the word range, the number of words, and the adverb are crucial factors in determining item difficulty.


Table 10. The Grammatical and Syntactic Features of Item 8

Grammar Structure           Word Range                        Tense            Topic
Simple sentence (S+Vt+O)    Levels 1–2 (two Level-3 words)    Present perfect  Taiwan films
