
Regarding the first research question, MFRM is capable of displaying the factors of rater severity, rating experience, test takers' ability, and item difficulty together on the Facets variable map and its common logit scale.
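For reference, the model that Facets presumably estimates here can be sketched as the standard many-facet Rasch rating scale model (Linacre); treating rating experience as a grouping of the rater facet is an assumption about the study's exact specification, not something stated in the text.

```latex
% Many-facet Rasch rating scale model, sketched for this design.
% P_{nijk}: probability that test taker n, rated by rater j on task i,
% receives category k rather than category k-1.
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
% B_n: ability of test taker n      D_i: difficulty of task i
% C_j: severity of rater j          F_k: step difficulty of category k
```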


As presented in Figure 3 and Figure 4, the two Facets variable maps display the distribution of the four main factors in the present study, i.e. test takers, rating experience, rater severity, and test items. The logit scale is situated in the left-hand column, with the zero of the metric established at the average item difficulty. The test takers are arranged by measure in the second column, the raters' rating experiences are placed in the third column, and the difficulty of the tasks appears in the second column from the right-hand side.

It is visible that the majority of the test takers' responses are plotted between the values +4 and 0. This indicates that the ability of the participating test takers in the present study is normally distributed and that the collected sample represents an agreeable range of ability. With rare exceptions, the test takers did not fail to respond to the five translation tasks for lack of an appropriate proficiency level. The five translation tasks were balanced in design, with task 2 and task 5 being easy, task 1 and task 4 moderate, and task 3 difficult. As far as the raters are concerned, both the expert and the novice raters were of average severity under both types of rating scales.

When conducting research, it is vital to distinguish those with the corresponding competence from those who simply guess or depend on sheer luck. As shown in Figure 3 and Figure 4, the data of the 60 samples were appropriately measured under the holistic and analytic scales respectively to appraise the relevant factors of the present study. However, some extreme values are situated around -1 logits under both types of rating scales. The reasons behind this may be manifold; some possible explanations are discussed below.

To begin with, even though they are college freshmen, some of them still have rather unsatisfactory English proficiency levels, and their lack of the corresponding ability resulted in such extremely undesirable scores. Another plausible reason is that some of these freshmen had been admitted to the university through the Recommendation-Selection Admission Program much earlier in the year than those who took the final examination in July. After several months of complete isolation from the subject of English, it is possible that they were not in their optimal condition when answering the examination items in this particular study. Last but not least, even though test takers were allowed to take as long as they intended to answer the questions, there are still some inevitable factors that may have influenced their performance, such as nervousness, fatigue, or other physical discomfort.


Figure 3 Holistic Scale Facets Variable Map of the Study


Figure 4 Analytic Scale Facets Variable Map of the Study


In Table 3 and Table 4, we can see the separation reliability for the test takers' facet; the values of 0.94 and 0.92 imply that the abilities of these test takers vary tremendously. The fixed χ2 tests the hypothesis that all of these test takers are equipped with the same ability. This χ2 is highly significant (p < .05), leading us to reject the hypothesis that all of them are at the same level of proficiency.
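As a hedged illustration of how the summary statistics in Tables 3 and 4 relate to one another, the sketch below derives RMSE, adjusted (true) S.D., separation, and reliability from a facet's logit measures and standard errors; the input arrays are placeholders, since the study's element-level values are not reproduced here.

```python
import numpy as np

# A minimal sketch (not Facets itself) of the Table 3/4 statistics.
# `measures` and `ses` stand in for the 60 test takers' logit measures
# and standard errors.
def separation_stats(measures, ses):
    measures = np.asarray(measures, dtype=float)
    ses = np.asarray(ses, dtype=float)
    rmse = np.sqrt(np.mean(ses ** 2))        # root mean-square error
    observed_sd = np.std(measures, ddof=1)   # spread of the estimates
    true_sd = np.sqrt(max(observed_sd ** 2 - rmse ** 2, 0.0))  # "Adj (True) S.D."
    separation = true_sd / rmse              # distinct strata of ability
    reliability = separation ** 2 / (1 + separation ** 2)
    return rmse, true_sd, separation, reliability
```

With the Table 3 values (true S.D. 1.21, RMSE .32), separation is 1.21 / .32 ≈ 3.8 and reliability 3.8² / (1 + 3.8²) ≈ .94, matching the report.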

Table 3 Holistic Scale Students Measurement Report

Model, Populn: RMSE .32, Adj (True) S.D. 1.21, Separation 3.81, Reliability .94
Model, Fixed (all same) chi-square: 1005.4, d.f.: 59, significance (probability): .00

Table 4 Analytic Scale Students Measurement Report

Model, Populn: RMSE .30, Adj (True) S.D. 1.04, Separation 3.49, Reliability .92
Model, Fixed (all same) chi-square: 788.4, d.f.: 59, significance (probability): .00


Fit analysis serves to detect potentially problematic items and performances (Bond & Fox, 2013). A mean-square fit ranging from 0.5 to 1.5 marks an item as the most suitable and productive (Linacre, 2000). If the value is lower than 0.5, the item is less productive for measurement owing to overfit, which means that it fails to detect variance in test takers' ability and thus fails to function as part of the test. Under this circumstance it does not ruin the general quality of the test, but it will probably result in deceptively high reliability and separation values. When the index is higher than 1.5 but lower than 2.0, the item is not productive for the construction of the measurement, but it does not degrade the quality of the test either. If the value is higher than 2.0, the item is so badly written that it may distort or degrade the overall test quality. On the whole, misfit takes place when the value of an item is higher than 1.5, and severe misfit when the mean-square value exceeds 2.0. Table 5 summarizes the implications of the mean-square value for measurement (Linacre, 2002).

Table 5 The Implication for Measurement of Mean-square Value (Linacre, 2002)

Mean-square Value    Implication for Measurement
> 2.0                The item is so badly written that it may distort or degrade the overall test quality.
1.5 - 2.0            The item is not productive for the construction of the measurement, but does not degrade the quality of the test either.
0.5 - 1.5            The item is the most ideal and productive.
< 0.5                The item is less productive for measurement.
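To make the fit criterion concrete, the sketch below shows the usual computation of the unweighted outfit mean-square from standardized residuals, together with the Table 5 bands; the expected scores and variances would come from the estimated model, so all inputs here are assumptions.

```python
import numpy as np

# Outfit mean-square: the unweighted mean of squared standardized
# residuals across a set of responses. `observed` are raw ratings;
# `expected` and `variance` are model-expected scores and score
# variances, which Facets would supply in practice.
def outfit_mean_square(observed, expected, variance):
    z2 = (np.asarray(observed) - np.asarray(expected)) ** 2 / np.asarray(variance)
    return float(np.mean(z2))

# Linacre's (2002) bands from Table 5.
def interpret_mean_square(ms):
    if ms > 2.0:
        return "may distort or degrade the overall test quality"
    if ms > 1.5:
        return "unproductive for measurement, but not degrading"
    if ms >= 0.5:
        return "the most ideal and productive range"
    return "less productive (overfit); may inflate reliability and separation"
```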

Overall, the 60 test takers in this study represented a wide variety of students and thus yielded a more credible result. What is intriguing is that under both types of rating scales, test taker number 28 remained ranked highest and test taker number 16 lowest. Such unanimity may serve as a possible hint that holistic scoring rubrics may be a qualified alternative to the currently adopted yet time- and effort-consuming analytic scoring rubrics.

Regarding the first research question, "To what extent do the two different types of rating rubrics – holistic vs analytic – converge on offering consistent and unbiased rating on test-takers' translation performance?", we may refer to Table 6 and Table 7, which present all the raters' rating performance under the holistic and analytic rating scales respectively. It is clear that under both types of rating scales, the four raters' mean-square values lie between 0.5 and 1.5, suggesting that their grading was consistent throughout. More specifically, the values are nearly identical to the anticipated value of one, indicating highly satisfactory rating quality.

In Table 6, the rater separation reliability of 0.94 under the holistic rating scale exceeded the rater separation reliability of 0.69 under the analytic rating scale presented in Table 7; this result supports the claim that holistic rating is a qualified alternative to analytic rating in translation tests. The relationship between the scores under the holistic rating scale and those under the analytic rating scale can be further investigated with the Pearson correlation coefficient. A very high and significant correlation between the two scales was obtained, r = 0.934, n = 60, p = 0.000. An independent-samples t-test was also conducted to compare the two distinct types of rating rubrics. There was no statistically significant difference between holistic scoring (M = 1.24, SD = 1.26) and analytic scoring (M = 1.15, SD = 1.09), t(60) = 0.42, p = 0.678. Nonetheless, distinctions among the four raters are indicated by p < 0.05 under both types of rating scales. These differences may be further investigated by taking the raters' rating experience into consideration, which happens to answer the second research question as well.
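The two follow-up analyses reported above can be reproduced with standard tools; in the sketch below the score vectors are randomly generated stand-ins (the study's 60 paired scores are not reproduced here), so the printed values will not match the reported r = 0.934 and t = 0.42.

```python
import numpy as np
from scipy import stats

# Stand-in data: placeholders for the 60 test takers' scores under each
# rubric; the real study data are not reproduced here.
rng = np.random.default_rng(0)
holistic = rng.normal(1.24, 1.26, 60)
analytic = 0.9 * holistic + rng.normal(0.0, 0.4, 60)

r, p_r = stats.pearsonr(holistic, analytic)    # study reports r = 0.934, p = .000
t, p_t = stats.ttest_ind(holistic, analytic)   # study reports t = 0.42, p = .678
print(f"r = {r:.3f} (p = {p_r:.3f}), t = {t:.2f} (p = {p_t:.3f})")
```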


Table 6 Holistic Scale Judges Measurement Report

Model, Populn: RMSE .08, Adj (True) S.D. .33, Separation 4.02, Reliability .94
Model, Fixed (all same) chi-square: 70.4, d.f.: 3, significance (probability): .00

Table 7 Analytic Scale Judges Measurement Report

Model, Populn: RMSE .07, Adj (True) S.D. .11, Separation 1.50, Reliability .69
Model, Fixed (all same) chi-square: 13.6, d.f.: 3, significance (probability): .00

As for the second research question, "Is there any significant difference between inexperienced and experienced raters in terms of rating translation items under the different approaches between analytic and holistic rubrics?", we may refer to Table 8 and Table 9, which present the raters' experience facet under the two different rating scales. The mean-square outfit of the expert raters and the novice raters under both the holistic and the analytic rating rubrics fell between 0.5 and 1.5, indicating overall good fit. This result suggests that both the expert and the novice raters recruited in this study demonstrated reliable grading patterns, which is quite ideal. More importantly, all the raters, regardless of their rating experience, manifested consistent rating behaviors under the two drastically different rating scales, making a strong argument that holistic rating is also capable of offering consistent and unbiased assessment of test takers' translation performance. Under the holistic rating scale the raters' separation reliability was 0.91, while its analytic counterpart was 0.46, implying that the severity of the two experience groups was more reliably separated under the former scale. As for the second research question itself, even though the mean-square values under both rating scales are satisfactory, the significant χ2 values (χ2 = 21.7, df = 1; χ2 = 3.7, df = 1) reveal that the raters' grading experience influenced their grading to some extent.
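For reference, a sketch of the "fixed (all same)" chi-square reported in Tables 8 and 9: it tests whether all elements of a facet share one common measure by weighting squared deviations from the precision-weighted mean. The group measures and standard errors used below are hypothetical placeholders, not values from the study.

```python
import numpy as np
from scipy import stats

# Fixed (all same) chi-square: do all elements of a facet share one
# common measure? Weights are the reciprocal error variances.
def fixed_chi_square(measures, ses):
    d = np.asarray(measures, dtype=float)
    w = 1.0 / np.asarray(ses, dtype=float) ** 2
    d_bar = np.sum(w * d) / np.sum(w)         # precision-weighted common measure
    chi2 = float(np.sum(w * (d - d_bar) ** 2))
    df = d.size - 1
    return chi2, df, stats.chi2.sf(chi2, df)  # statistic, df, p-value

# Hypothetical expert/novice group severities (logits) and standard errors.
print(fixed_chi_square([-0.18, 0.18], [0.06, 0.06]))
```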

Table 8 Holistic Scale Raters’ Experience Measurement Report

Model, Populn: RMSE .06, Adj (True) S.D. .18, Separation 3.14, Reliability .91
Model, Fixed (all same) chi-square: 21.7, d.f.: 1, significance (probability): .00

Table 9 Analytic Scale Raters’ Experience Measurement Report

Model, Populn: RMSE .05, Adj (True) S.D. .05, Separation .93, Reliability .46
Model, Fixed (all same) chi-square: 3.7, d.f.: 1, significance (probability): .05


From Table 10 and Table 11 we can see item separation reliabilities of 0.95 and 0.97 respectively. This can be interpreted to mean that the variation in translation task difficulty and the size of the sample were ample, so that in the present study task difficulty as well as test takers' ability were precisely estimated. Furthermore, the outfit mean-square values under both rating scales were all under 1.5, indicating that the rating quality of both types was excellent. Task items 2 and 5 were the easiest, at -0.45 logits under holistic rating and at -0.54 and -0.52 logits respectively under analytic rating. Task items 4 and 1 were moderate, at 0.32 and -0.3 logits under holistic rating and at 0.17 and 0.21 logits under analytic rating. Task item 3 was the most difficult, at 0.62 logits under holistic rating and 0.69 under analytic rating.
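To illustrate what these logit differences mean in practice, the sketch below converts the reported holistic task difficulties into success probabilities for a test taker of a given ability, using a dichotomous Rasch simplification with rater severity set to zero; the polytomous details of the actual model are omitted, so this is illustrative only.

```python
import math

# Holistic task difficulties in logits, as reported above.
holistic_difficulty = {1: -0.30, 2: -0.45, 3: 0.62, 4: 0.32, 5: -0.45}

# Dichotomous Rasch simplification: P(success) for ability theta against
# difficulty d, with rater severity c folded in (zero here).
def p_success(theta, d, c=0.0):
    return 1.0 / (1.0 + math.exp(-(theta - d - c)))

for task, d in sorted(holistic_difficulty.items()):
    print(f"task {task}: P(success | theta = 1.0) = {p_success(1.0, d):.2f}")
# Task 3 (0.62 logits) yields the lowest probability, matching its rank
# as the hardest item.
```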

Table 10 Holistic Scale Task Measurement Report

Model, Populn: RMSE .09, Adj (True) S.D. .41, Separation 4.51, Reliability .95
Model, Fixed (all same) chi-square: 108.0, d.f.: 4, significance (probability): .00


Table 11 Analytic Scale Task Measurement Report

Model, Populn: RMSE .08, Adj (True) S.D. .46, Separation 5.58, Reliability .97
Model, Fixed (all same) chi-square: 162.0, d.f.: 4, significance (probability): .00

The Chinese translation task number three was “藉由謙虛的態度與努力不懈的學習,他不但成功了,還獲得了成就感。”, and the provided answer for reference was “Through humble attitude and hardworking learning, he not only succeeded but also gained a sense of achievement.” Table 12 displays the grammatical and lexical composition of translation task number three: it has a compound sentence structure, the tense should be the past, and the topic is learning.

Table 12 The Grammatical and Syntactic Features of Translation Task Number 3

Item        Grammar               Structure   Word range          Tense   Topic
Number 3    Not only…but also…    compound    L1~L2, L3(2), L4    past    Learning

Different word levels among the translation task items may be one of the reasons for the different task difficulties experienced by test takers. As for translation task number three, most of its vocabulary ranged between the first and second levels of the classification from the College Entrance Examination Committee (CEEC). However, there were two words at the third level (attitude and achievement) and one at the fourth level (learning), making translation task number three the most challenging item in this study.
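As a hypothetical illustration of this word-level argument, the sketch below tallies CEEC levels for the content words of the reference answer; the level assignments follow the counts given above (two level-3 words, one level-4 word) but are otherwise invented, not the official CEEC list.

```python
from collections import Counter

# Hypothetical CEEC level assignments: only "attitude", "achievement"
# (level 3) and "learning" (level 4) come from the text above; the rest
# are illustrative placeholders.
ceec_level = {
    "humble": 2, "attitude": 3, "hardworking": 2, "learning": 4,
    "succeeded": 2, "gained": 1, "sense": 1, "achievement": 3,
}

level_counts = Counter(ceec_level.values())
print(dict(sorted(level_counts.items())))  # {1: 2, 2: 3, 3: 2, 4: 1}
```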

Besides, translating impeccably requires mastery of both the source language and the target language. The longer the sentence in a translation test, the more words there are to be processed, and the greater the challenge becomes.

Other than word level and sentence length, syntactic issues may also influence the difficulty of translation tasks. Numerous test takers' responses were problematic in terms of choosing the correct preposition for “藉由” at the very beginning of the sentence, which spoiled the entire sentence. In addition, quite a few test takers were careless with the tense: it should be the past tense, but they wrote in the present. All these grammatical and syntactic aspects made translation task number three slightly more challenging than the other items.

From the grading severity of the novice and expert groups reported in Table 8 and Table 9, we can see that the expert raters were slightly more lenient than the novice raters in grading translation task number three. This did not suggest any fault on the part of either group; instead, it suggested that there were slight distinctions between them. The rationale behind this phenomenon can be further discussed even though the difference was not statistically significant. When raters were grading, even though they were provided with the correct answer for reference and with rating criteria, it was highly possible that they encountered various answers which did not match the provided ones but could still be correct.

Under these circumstances, raters' prior experience and professional knowledge came into effect (Glaser & Chi, 1988). Expert raters may be relatively more flexible and open to varied written responses as long as they match the original meaning of the translation tasks. Novice raters, on the contrary, exhibited comparatively less flexibility and tended to follow the rating criteria strictly.

For harder translation tasks, some test takers at low proficiency levels received some points because the raters felt sympathetic and tried to encourage them. This also explains why their raw scores were slightly higher than their estimated competence, and thus why the novice raters turned out stricter than the experts in grading translation task number three. Table 13 below displays all four raters' opinions toward translation task number three.

Table 13 Excerpts from the two novice raters' and two expert raters' interviews

Interviewer: How did you feel when you graded these items? Especially for translation task number three?

Novice 1: “Humble attitude” and “hardworking learning” seemed to be pretty challenging for them. The sentence pattern and tense were also easily mistaken; I assumed that was the reason they (test takers) did not get high scores.

Novice 2: The sentence was relatively longer than the others, so students needed more vocabulary to get full marks. The structure of the sentence was also more complicated and thus made it harder for students.

Expert 1: Some vocabulary in this sentence posed difficulties for them to translate, even though at first I did not think this translation item would be extra challenging for students. I also noticed that many of them confused the word “success” with “succeed”.

Expert 2: I think the students scored lower because they seemed to have trouble translating quite a few idea units, such as “sense of achievement”, “humble / modest attitude”, even the words “succeeded” and “hardworking”. The vocabulary items required to fulfill the translating task in question 3 might be in slightly lower frequency bands. The sentence structure was also relatively more complicated when compared with other translation items. So naturally, students got lower scores for question 3. Also, “獲得” in Chinese might have quite a few equivalents, such as gain, get, receive, or obtain; some students might pick one that did not fit this context, like “receive”, which was used yet did not fit here.

Due to the complicated nature of translation tests per se, raters' backgrounds, professional knowledge, and rating experience can all influence their grading in performance assessment. Each rater may weight matters differently under various circumstances, and the cognitive processing involved in rating can be truly sophisticated, causing variance in the assessment of performance tests. Even though rater training and grading criteria are essential, rater effects still deserve further investigation (Myford & Wolfe, 2004). In the present study, the mean-square outfit values of all four raters under both the holistic and the analytic rating rubrics fell between 0.5 and 1.5, indicating their trustworthy appraisal of the five translation tasks; both types of rating scales were thus qualified to ascertain test takers' underlying competence in translation tests.

As for task 3, raters may have held separate expectations toward students of various proficiency levels, which led to the expert raters being slightly more lenient toward test takers with only basic proficiency. This result accords with Kuiken and Vedder's (2014) study, in which they discovered that raters may subconsciously adapt their grading to test takers' competence despite a conscious effort to rate every candidate on exactly the same scale. Raters are apt to be more severe toward those who write well and forgiving toward lower-ability writers (Schaefer, 2008). Accordingly, indisputably identifying a specific group of raters as harsh or lenient may not be attainable. This understanding is also reflected in quantitative studies which concluded that the difference in rating quality between experienced and inexperienced raters is minor (Lim, 2011; Shohamy, Gordon, & Kraemer, 1992; Weigle, 1998).

Table 14 below shows how the raters felt about holistic rating and analytic rating respectively; their mixed opinions constitute quite an interesting picture.


Table 14 Excerpts from the two novice raters' and two expert raters' interviews

Interviewer: What is your opinion regarding the two drastically different rating scales? Do you think one outperforms the other? Why?

Novice 1: Students tend to receive higher scores under the holistic rating scale compared to the analytic one. Personally I prefer holistic rating because it is much easier to assign scores. Analytic rating can be really bothersome when students' written responses differ from the provided answer for reference. However, holistic rating may be unfair, and I think analytic rating is a better way of assessing students' underlying competence.

Novice 2: Holistic rating is much faster and students will get higher scores. On the contrary, analytic rating is time-consuming and students will be marked lower. For low-proficiency test takers, the analytic rating scale seems much friendlier; as for high achievers, the holistic rating scale seems ideal. Overall, I like holistic rating better because it is relatively straightforward.

Expert 1: I think both types of rating have their own advantages and disadvantages. It is tricky to say whether one is better than the other. But as a personal preference I would definitely choose the holistic rating scale, because raters are allowed to judge at their own discretion instead of following preset ways of chunking.

Expert 2: Holistic rating takes long but serves as a better way of assessment, in which different mistakes can carry various score weightings. On the other hand, analytic scoring can be problematic because how to divide a sentence fairly could be a potential issue.

Based on the interviews, all four raters seemed to recognize the practicality and significance of both scales in their own regard. The results of the present study discussed earlier show that the holistic rating scale also functions as a qualified alternative to the analytic rating scale currently adopted by the College Entrance Examination Committee (CEEC). Translation has long been a quite common method of assessment in Taiwan. Rating scales and rater reliability are crucial in evaluating test takers' underlying competence; consequently, the contribution of this research is that it demonstrates the practicability of applying a holistic rating scale in translation tests. This research also justifies the importance of adopting MFRM to adjust raw scores, since raw scores can be misleading. The raw score is not a trustworthy indicator of a candidate's competence, given the inconstancy among raters (McNamara, 1996). Thus, adopting MFRM is fundamental in this research.
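As a minimal, invented illustration of the raw-score problem just described: under an MFRM-style adjustment, a measure earned from a severe rater is corrected upward by that rater's estimated severity (and downward for a lenient one), so that test takers rated by different raters become comparable. The severity values and function below are hypothetical, not the study's estimates.

```python
# Hypothetical rater severities in logits (positive = harsher).
rater_severity = {"expert": -0.10, "novice": 0.25}

def adjusted_measure(observed_logit, rater):
    # A measure earned under a severe rater understates ability, so the
    # rater's severity is added back; leniency is subtracted the same way.
    return observed_logit + rater_severity[rater]

# Observed measures from raters of different severity:
print(adjusted_measure(0.80, "novice"))  # -> 1.05: harshness compensated
print(adjusted_measure(1.05, "expert"))  # -> 0.95: leniency removed
```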

