
The present study consists of a many-facet Rasch measurement (MFRM) analysis of a Chinese-to-English (L1-to-L2) translation test comprising 10 items compiled from advanced subject tests and general scholastic ability tests. MFRM analysis of test responses from 225 test takers, scored by six raters, indicates that rater experience is a significant factor in determining test scores.

Responses from 225 test takers yielded a separation reliability of 0.94, indicating strong differentiation of test-taker ability among these candidates. The Rasch facet map also shows that the trait ability of the sample is normally distributed across an ability range that is reasonable in comparison with extant Rasch analysis studies.
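Separation reliability can be read as the proportion of observed variance in the estimated person measures that reflects true ability differences rather than measurement error. The following is a minimal sketch of that computation; the person measures and standard errors are simulated for illustration, not the study's actual estimates:

```python
import numpy as np

def separation_reliability(measures, standard_errors):
    """Rasch-style separation reliability: the share of observed
    variance in the measures that is not measurement error."""
    observed_var = np.var(measures, ddof=1)          # SD^2 of the estimates
    error_var = np.mean(np.square(standard_errors))  # mean-square error
    return (observed_var - error_var) / observed_var

# Simulated person measures in logits (illustration only):
rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.2, size=225)  # 225 test takers
se = np.full(225, 0.30)                 # hypothetical standard errors
print(round(separation_reliability(theta, se), 2))
```

A value near 1 indicates that the instrument spreads test takers widely relative to the noise in each individual estimate.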

This finding suggests that a sample of 225 or more individuals is sufficient for conducting an MFRM analysis of translation tests.

Second, the χ² value of the rater facet indicates significant differences in severity among the raters. Different raters maintain different grading rationales and levels of harshness despite training sessions intended to homogenize their interpretation of the scoring rubrics. Moreover, it is hard to draw a strong conclusion as to which group is more severe; rating differences exist even within groups. It may only be surmised that rater experience does cause differences in rater severity.
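The fixed "all same" χ² that MFRM software reports for a facet tests whether all elements of that facet could share one common measure. It can be sketched as follows; the six rater severities and standard errors below are hypothetical illustrations, not the study's values:

```python
import numpy as np
from scipy.stats import chi2

def fixed_chi_square(measures, standard_errors):
    """Facets-style fixed (all-same) chi-square: tests the null
    hypothesis that all elements of a facet (here, six raters)
    share one common severity, using information-weighted spread."""
    d = np.asarray(measures, dtype=float)
    w = 1.0 / np.square(standard_errors)               # information weights
    stat = np.sum(w * d**2) - np.sum(w * d)**2 / np.sum(w)
    df = len(d) - 1
    return stat, df, chi2.sf(stat, df)

# Hypothetical rater severities (logits) and standard errors:
severity = [0.52, 0.31, 0.05, -0.10, -0.28, -0.50]
se = [0.08] * 6
stat, df, p = fixed_chi_square(severity, se)
print(df, p < .05)  # -> 5 True: severities differ significantly
```

A significant result, as in the present study, means at least one rater's severity departs reliably from the others'.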

It was also found that easy items, such as items one, two, and ten, cause more bias than difficult ones. Therefore, raters should correct items more carefully and with a more neutral attitude, especially easy items. Better rater training is necessary to ensure that raters deduct scores according to the criteria and reach consensus on candidates' answers.

Third, the interaction between rater experience and item scores was also examined. For the most part, novice and expert raters did not differ greatly, with only two out of ten items showing significant interaction (p < .05). To conclude, although the two groups of raters have different prior knowledge, given careful adherence to the scoring criteria, experts and novices can reach agreement on item scores.
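In Facets-style bias output, each rater-group × item interaction comes with a bias measure and a standard error; dividing one by the other gives a standardized statistic, and values beyond roughly ±1.96 are flagged at p < .05. A sketch with hypothetical bias terms (not the study's values), arranged so that two of ten items are flagged:

```python
import numpy as np

# Hypothetical group-by-item bias terms (logits) and standard errors
# for ten items; positive means one group scored that item more harshly.
bias = np.array([0.90, -0.80, 0.10, 0.20, -0.10,
                 0.30, -0.20, 0.10, 0.00, -0.30])
se = np.full(10, 0.35)

z = bias / se                  # standardized bias (t-like statistic)
flagged = np.abs(z) > 1.96     # roughly p < .05, two-tailed
print(flagged.sum())           # -> 2 items show a significant interaction
```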

Fourth, raters’ prior experience facilitated awareness of item similarities and differences. This prior knowledge allowed experienced raters to recognize and reproduce patterns very quickly and accurately. This assumption was corroborated when raters corrected item two with good accuracy.

Fifth, this study demonstrates that using MFRM to adjust raw scores is important for future high-stakes testing. Performance assessment involves subjective evaluations in which raters must use their professional knowledge to judge candidates' performance. Thus, exclusive reliance on raw scores does not objectively measure test takers' real abilities, because every rater has the potential for a degree of bias on certain items. In addition, MFRM allows the simultaneous analysis of candidates' ability distribution, item difficulty, and rater severity. The relationships among these facets constitute helpful information for enhancing teachers' instructional methods and learners' study methods.
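The idea behind score adjustment can be shown with a deliberately crude raw-score analogue: estimate each rater's severity as a deviation from the grand mean and remove it before averaging. A full MFRM does this properly on the logit scale through the model's rater facet; the scores below are invented for illustration:

```python
import numpy as np

# Invented raw scores: rows = three raters, columns = four candidates
scores = np.array([
    [3.0, 2.5, 4.0, 1.5],   # rater A (harsh)
    [3.5, 3.0, 4.5, 2.0],   # rater B (average)
    [4.0, 3.5, 5.0, 2.5],   # rater C (lenient)
])

# Crude severity proxy: each rater's mean minus the grand mean
rater_effect = scores.mean(axis=1, keepdims=True) - scores.mean()

# "Fair" score: remove each rater's effect, then average per candidate
fair = (scores - rater_effect).mean(axis=0)
print(fair)  # each candidate now gets one severity-adjusted mean
```

Here the three raters rank the candidates identically and differ only in overall harshness, so removing the rater effect makes their adjusted scores coincide.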

Implications

First, it is feasible for a teacher to use MFRM to understand more about students' ability, rater severity, and item difficulty. This implication derives from the fact that the sample size of 225 in this study was large enough to show the relationships among item difficulty, rater severity, and test-taker ability. Given that one English teacher is typically responsible for 135 to 180 students in total (three to four classes of approximately 45 students each), a single teacher has a sufficient number of test takers to use MFRM for ability estimation. A teacher trained in MFRM could examine the relationships between and among the facets of the trait-ability estimates. From this process, a teacher can understand more about the students, the items, and even himself or herself.

Second, English is a core component of the national senior high school curriculum, of which translation skill is a key objective. That is to say, every English teacher should be able to teach translation and grade translation items. Both novice and expert teachers can become good raters; therefore, novice teachers should attend translation training or seminars frequently. Correcting translation involves two kinds of knowledge: the level of skilled performance, and context-dependent judgments and skills (Dreyfus, 1982). The former can be attained through principles and theory learned in a classroom; the latter can be acquired only in real situations. When teachers have opportunities to be exposed to both practical correcting and theoretical knowledge, they can become involved performers rather than detached observers (Benner, 2001). In addition, this study implies that it would be better to have two raters grade every test sheet: more than one rater is needed to reduce subjectivity and to neutralize harshness and leniency.

Third, teachers should use an official checklist of scoring criteria to guide their own judging behaviors. Novice raters tend to judge their grading by the number of rules or principles they have followed; however, these rules do not always delineate the most relevant tasks to perform in an actual situation. Experts, meanwhile, are able to weigh all criteria and use more rubric-related descriptions to justify their ratings (Wolfe, Kao, & Ranney, 1998). Therefore, a checklist not only helps raters evaluate themselves but also reveals a rater's progress in scoring translation items.
