Translation is not only a learning strategy or a teaching technique but also a mode of testing. Translation test items are a kind of writing assessment, which can show test takers’ writing performance and their knowledge of the target language. In translation items, candidates first read sentences written in their native language and then translate them into the target language. Although the ideal translation refers to “the same events in the real world as the original”, it may be impossible to produce something in one language which means exactly the same as something in another language (Cook, 2011). Since gaps may arise in the conversion between two languages, a good translator should know how to identify the essential meaning in one language and express it in the other.

In Taiwan, translation is not only a learning strategy but also a test item on the advanced subject tests as well as the general scholastic ability tests. Over 20 years ago, Buck (1992) showed that translation can serve as a test item with satisfactory reliability and validity. As a result, translation items are also included in high-stakes exams in Taiwan, and in a large-scale assessment, students’ ability must be determined with technical and statistical rigor. Hence, the development of scoring guidelines or rubrics, the training of raters, and the test takers’ ability are crucial factors in performance assessment.

Brown and Hudson (1998) proposed that there are three main item types on language tests: selected response, constructed response, and personal response. By definition, constructed-response items are those in which a student has to actually produce language through skills such as writing or speaking, rather than merely selecting a single answer. The three sub-types of constructed-response items are fill-in, short answer, and performance, and they can show students’ receptive and productive knowledge simultaneously while minimizing most of the guessing risk. Writing assessment, such as essay writing, translation items, or composition, therefore belongs to the constructed-response type and, in particular, to the sub-type of performance assessment. In performance assessment, test items may be designed around real-life contexts, and candidates have to integrate their language knowledge so that their true language ability can be examined (Brown & Hudson, 1998).

By the definition of performance assessment, examinees have to perform tasks that should be as authentic as possible, and candidates’ performances should be rated by qualified raters (Brown, Hudson, Norris, & Bonk, 2001). Therefore, in writing assessment, candidates generate appropriate content by drawing on the knowledge in their knowledge base.

Again, performance assessment involves complex human performance and requires acts of interpretation of test takers’ responses. For this reason, reducing rater disagreement over these interpretations to acceptable levels is a main goal of performance assessment (McNamara, 1996). Writing assessment, as a type of performance assessment, needs to take many dimensions into consideration. Researchers recognize that rater judgments include raters’ subjective opinions, so even the same rater may vary in his or her judgments when scoring different items. Many studies have shown that the assessment of linguistic performance may be influenced by various factors in the scoring procedure, and that the traits of the writing samples to be scored, such as content, organization, sentence structure, or genre, might influence the judgments made by raters (Cooper, 1984; Schoonen, 2005; Weigle, 2007). Because raters play a crucial role in writing assessment and because they need to temper the subjectivity of their evaluations, rater performance requires empirical examination.

Based on the arguments above, the testing process involves three main factors, raters, candidates, and test items, which form a continuum. At one end of this continuum, test takers’ writing samples indicate item complexity and candidate ability. At the other end, a standardized and reliable measurement is used as a rating basis. Raters occupy the pivotal position on this continuum, turning performances into outcomes. However, a number of elements might simultaneously cause variability in writing scores. Rater characteristics, such as rating experience, knowledge of the assessment process, familiarity with the rating criteria, or the amount of rater training, also seem to affect grading (Kuiken & Vedder, 2014). Similarly, task characteristics and types of assignment can also influence scores.

Shaw and Weir (2007) noted that scoring validity is an important guide even when tasks appear valid in all other aspects. Since the rating of exams should be examined for valid scoring, the rater is the main focus of this thesis.

Raters have to be able to give scores appropriately and consistently, and therefore rating scales are used to improve grading consistency. However, rating is a complex issue. Even with rating criteria and rater training, rating remains a complicated cognitive process which might lead to variance in performance ratings, so there is a need to study rater effects (Myford & Wolfe, 2004).

Raters can bring their own biases to exams, and McNamara listed three factors which might cause differences in raters’ grading: rater severity, rater characteristics, and rater consistency. First, in the interaction between raters, items, and candidates, raters may be either severe or lenient. Linacre (1989) used the term severity to refer both to the overall severity of a rater and to the differences between raters in the way they evaluate constructed responses. There may also be interactions between raters and other aspects of the rating situation, such as rater-item and rater-candidate interactions.
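To make the notion of severity more concrete, a common way to formalize it, following the many-facet Rasch framework associated with Linacre (1989), is sketched below; the symbols used here (examinee ability $B_n$, item difficulty $D_i$, rater severity $C_j$, and rating-scale threshold $F_k$) are illustrative and not drawn from this thesis:

\[
\log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k ,
\]

where $P_{nijk}$ is the probability that rater $j$ awards examinee $n$ a score in category $k$ on item $i$. In this sketch, a larger $C_j$ lowers the odds of reaching a higher category, so a more severe rater systematically awards lower scores than a more lenient rater for the same performance.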

In addition, raters may interpret rating scales in different ways, which can lead to different grading rationales. McNamara and Adams (1991) pointed out that differences in interpreting rating scales might result in two systematic tendencies when grading writing assessments: a centralizing tendency and a tendency to avoid scores in the middle of the scale. Another potential source of differing interpretations involves rater characteristics, such as gender, background, or even the time of day at which rating takes place. A third source of difference involves the rater’s own consistency: the extent of random error associated with their ratings might change over time, with raters sometimes being harsher and sometimes more lenient. In short, it is a challenge to maintain the consistency of rater scoring in writing assessment.

Many studies have made an effort to eliminate interrater problems and to improve scoring validity. Rater training is one way to address these problems and to support scoring validity. Training typically includes “familiarization activities, practice rating, and feedback and discussion” (Lane & Stone, 2006). Weigle (1998), however, reported that differences between experienced and inexperienced raters could not be totally eradicated by training, although inexperienced raters can make some progress after training. In other words, rater training cannot achieve complete consistency among raters, especially between expert and novice raters. These rater characteristics thus need to be taken into account.
