國立台灣師範大學英語學系 碩士論文
Master's Thesis
Graduate Institute of English, National Taiwan Normal University

整體式與分項式翻譯測驗評分之等化: 多面向羅許分析模式
Score Equivalence between Holistic and Analytic Rating Rubrics in Chinese-English Translation Tests: A Many-Facet Rasch Analysis

指導教授:曾文鐽 博士 Advisor: Dr. Wen-Ta Tseng
研究生:陳德海 Te-Hai Chen

中華民國一百零五年六月 June 2016

摘要

英文在台灣的教學環境中是主要的核心科目之一,而且在普通高級中學英文課程綱要中,培養學生中英翻譯的能力也是其主要的目標。在重要的升學入學考試如大學學科能力測驗與指定科目考試中,中譯英常常是測驗題型的其中一種。翻譯評量的給分本質上是相當複雜的,採用何種評分規準以及評分者的合適與否皆對於判斷受試者的表現產生重大的影響。因此,本研究旨在檢視現行翻譯測驗題型中分項式給分的效度,並為翻譯測驗的效用提供實證。本研究亦提出整體式評分規準為現行分項式給分之替代方案,並檢視兩者能否平等地評量考生於翻譯測驗中的表現以及評分者的專業程度與評分經驗是否在某些程度上影響他們評比受試者們的作答。本研究採用多面向羅許模式 (Many-Facet Rasch Measurement Model) 來分析實際資料,期盼能透過實例來檢視翻譯測驗的分數與評分者專業程度及評分經驗之間的關係。除此之外,整體式與分項式評分規準應用於翻譯測驗的優點與缺點將會被仔細的檢視與探討。

關鍵字: 翻譯; 測驗; 多面向羅許分析模式; 評分規準

ABSTRACT

English has been one of the core subjects in the EFL context of Taiwan, and cultivating students' translation skills is a major objective in the national senior high school curriculum. In high-stakes examinations such as the General Scholastic Ability Test and the Advanced Subjects Test, translation has always been part of the test. The nature of grading performance assessment, such as translation, is quite convoluted: the choice of rating scales and the recruitment of adequate raters become crucial in judging test takers' performance. Hence, this research aims to inspect the validity of the current analytic rating rubric for translation and to proffer empirical evidence for the utility of translation items. The author also proposes a holistic rating rubric as an alternative, to check the extent to which translation performance can be equally assessed and to examine the extent to which raters' expertise and rating experience correlate with their verdicts on test takers' performance. In this research, the Many-Faceted Rasch measurement model (MFRM) is taken as the approach to analyzing the empirical data. The expected result is to provide empirical evidence that translation scores may have significant relationships with raters' expertise and rating experience. The pros and cons of holistic and analytic rating rubrics in relation to translation tests will also be critically examined and discussed.

Keywords: translation; assessment; MFRM; rating rubric

ACKNOWLEDGEMENT

I would first like to thank my thesis advisor, Dr. Wen-Ta Tseng at National Taiwan Normal University. The door to Prof. Tseng's office was always open whenever I ran into a trouble spot or had a question about my research or writing. He consistently allowed this paper to be my own work, but steered me in the right direction whenever he thought I needed it.

I would also like to thank the committee members, Dr. Hsi-chin Janet Chu and Dr. Hsing-fu Cheng, for their insightful suggestions and critical comments that bettered this thesis. Without their passionate participation, the thesis could not have been successfully completed.

I would also like to express my gratitude to the four raters and sixty students; without their contribution, this thesis could not have been accomplished.

I would also like to acknowledge Roxanne Liu and Tiffany Wu as the readers of this thesis, and I am gratefully indebted to them for their very valuable comments.

Finally, I must express my very profound gratitude to my parents for providing me with unfailing support and continuous encouragement throughout my years of study at NTNU. This accomplishment would not have been possible without them. Thank you.

TABLE OF CONTENTS

摘要 ........ i
ABSTRACT ........ ii
ACKNOWLEDGEMENT ........ iii
TABLE OF CONTENTS ........ iv
LIST OF TABLES ........ v
LIST OF FIGURES ........ vi
CHAPTER ONE INTRODUCTION ........ 1
  Background and Purpose of the Study ........ 1
  Significance of the Study and Research Aims ........ 2
  Research Questions ........ 2
  Organization of the Thesis ........ 3
CHAPTER TWO LITERATURE REVIEW ........ 4
  From Recognition Assessment to Performance Assessment ........ 4
  Translation ........ 5
  The Merits of Translation ........ 6
  Criticisms towards Translation ........ 7
  Translation in Language Teaching ........ 9
  Holistic and Analytic Scoring ........ 10
  Novice and Expert Raters ........ 12
  Many-Faceted Rasch Measurement Model (MFRM) ........ 15
CHAPTER THREE RESEARCH METHODOLOGY ........ 22
  Participants ........ 22
  Instrument ........ 23
  Scoring and Coding ........ 25
  Procedure ........ 26
  Data Analysis ........ 26
  Expected Result and Suggestion ........ 27
CHAPTER FOUR RESULTS AND DISCUSSIONS ........ 28
  Overview of the Study ........ 28
  Major Findings ........ 28
CHAPTER FIVE CONCLUSION ........ 45
  Summary of Major Findings ........ 45
  Limitations of the Study ........ 46
  Directions for Future Research ........ 47
REFERENCES ........ 48

LIST OF TABLES

Table 1 A comparison between holistic and analytic rating scales in L2 writing (based on Weigle, 2002, p. 121) ........ 11
Table 2 The characteristics of translation items ........ 24
Table 3 Holistic Scale Students Measurement Report ........ 33
Table 4 Analytic Scale Students Measurement Report ........ 33
Table 5 The Implication for Measurement of Mean-square Value (Linacre, 2002) ........ 34
Table 6 Holistic Scale Judges Measurement Report ........ 36
Table 7 Analytic Scale Judges Measurement Report ........ 36
Table 8 Holistic Scale Raters' Experience Measurement Report ........ 37
Table 9 Analytic Scale Raters' Experience Measurement Report ........ 37
Table 10 Holistic Scale Task Measurement Report ........ 38
Table 11 Analytic Scale Task Measurement Report ........ 39

LIST OF FIGURES

Figure 1 The Facets Variable Map of This Study ........ 17
Figure 2 Measurement Model for the Writing Assessment (Engelhard, 1992, p. 174) ........ 20
Figure 3 Holistic Scale Facets Variable Map of the Study ........ 31
Figure 4 Analytic Scale Facets Variable Map of the Study ........ 32

CHAPTER ONE
INTRODUCTION

Background and Purpose of the Study

The Grammar Translation Method has been harshly criticized for giving exclusive attention to grammatical accuracy, for adopting isolated and invented example sentences in teaching, and for focusing on knowledge about a language rather than the ability to use it. After the fierce dismissal of the Grammar Translation Method, the arrival of a series of language pedagogies has yielded a more thorough yet fair exploration of the pros and cons of implementing translation in teaching a new language. A distinction between Grammar Translation and translation in teaching has begun to be drawn, and hence the potential merits of translation in teaching deserve to be further investigated and discussed. Controversies regarding the practice of translation in the past, as well as new perspectives and potential applications, will be discussed in detail in the second chapter of this thesis. In the EFL context of Taiwan, translation has not only long been deployed as an instructional means beyond the four basic language skills of listening, reading, speaking, and writing, but has also played an incessant yet vital role in assessment. Since translation is a complex process involving cognitive processing and performance output, its implementation in teaching as well as in assessment must be consolidated by comprehensive empirical evidence. More specifically, analytic scoring rubrics have been applied to translation for an extended period of time; their use is worth further scrutiny to validate it and to investigate whether holistic scoring rubrics may be a solid

alternative to the current practice. Hence, the purpose of this thesis is to proffer some new perspectives on the application of translation in language learning and teaching, accompanied by empirical experiments and interviews, in the hope of laying some groundwork for a long-neglected yet potentially vital topic.

Significance of the Study and Research Aims

This study aims to investigate the relationship between applying holistic scoring rubrics and analytic scoring rubrics in translation examinations. Through both inexperienced and experienced raters' perspectives, the goal of this study is to disclose the difference, if any, between the two types of scoring rubrics and to proffer empirical evidence, based on the data collected and analyzed, as to whether one type of scoring outperforms the other and thus yields a more comprehensive picture of test takers' real underlying capacity.

Research Questions

1. To what extent do the two different types of rating rubrics (holistic vs. analytic) converge on offering consistent and unbiased ratings of test takers' translation performance?
2. Is there any significant difference between inexperienced and experienced raters in terms of rating translation items under the analytic and holistic rubrics?

Organization of the Thesis

This thesis comprises five chapters. The first chapter presents a concise introduction to the background, states the purpose and the research questions, and specifies the significance of the study. Chapter Two reviews the literature germane to the pros and cons of translation in general and to the potential merits of applying translation in English teaching and testing. Chapter Three delineates the research design, including the participants, the data collection procedures, and the data analysis. Chapter Four reports the results of the experiment and discusses the major findings. Chapter Five recapitulates the major findings and their implications, the limitations of the present study, and suggestions for future research.

CHAPTER TWO
LITERATURE REVIEW

From Recognition Assessment to Performance Assessment

Learners' learning outcomes may not be examined thoroughly if standardized tests are the exclusive method of assessment. More specifically, whether recognition is the only ability being measured in multiple-choice tests needs more consideration and discussion (Mehrens, 1992). Beyond the recognition-only issue, the delimited domain (i.e., the coverage of test content) in traditional multiple-choice tests may raise concerns about whether all vital educational objectives can be assessed. It would deviate too far from the topic of this thesis to discuss at length whether every important aim of education can be attained and reflected in one single form of test, i.e. recognition; for the time being it is more conscientious to presume that both types of measurement have their own significance in terms of achieving educational goals. Before furthering this issue, let us return to the basics. The defining characteristic of performance assessment is that candidates are required to perform relevant tasks, rather than to demonstrate knowledge more abstractly, often by means of a pencil-and-paper test (McNamara, 1996). In language testing, teachers and educators may gain a better understanding of learners' actual use of the target language through performance assessment. As Short (1993) pointed out, the reason educators called for alternative assessments was the demand for "more accurate measures of students' knowledge". Hopefully, via various forms of assessment aside from traditional ones, students' innate ability can be truly reflected in the transcripts they receive. Recommended assessment alternatives include interviews, portfolios, and

performance-based tasks. Among them, one variant of performance-based task that has been widely used for a long time in language tests is translation.

Translation

In Greek culture, the act of translation can be traced back to the notion of "hermeneus (from Hermes, messenger of the gods—literally 'the translator', giving us the basis of today's hermeneutics in philosophy and linguistic...)" (Farquhar & Fitzsimons, 2011). Latin scholars also employed the concept of meaning transfer, using terminology such as traducere, interpres, transferre, and translatum. These terms all retain the meaning of shifting or transferring from one lexicon to another and are still visible in modern English derivatives. Cook (2010, p. 55) further noted a popular view that translation "involves a transfer of meaning from one language to another", and he mentioned the Latin root "translatum" as well, a form of the verb "transferre" meaning "to carry across". Since translation has to do with carrying meaning from one language to another, what students can achieve with it pedagogically and how test takers perform with it in language testing are worth pondering and further investigation. Researchers have tended to define the competence of translating in their own ways, though not without some ground. Bell (1991, p. 43) viewed translation competence as "the knowledge and skills the translator must possess in order to carry out a translation". Bilingual speakers may act as brokers who help bridge two different cultures via translation (McQuillan & Tse, 1995; Wenger, 1998). Another definition of translation competence derives from the PACTE research group (PACTE, 2000), which states it as "the underlying system of knowledge and skills needed to be able to translate". This definition is based on four premises: (i) translation competence is constituted in various

ways under certain circumstances, (ii) it inherently comprises operative knowledge, (iii) strategies are essential to translation competence, and (iv) most processing of translation competence occurs automatically, just like any kind of expertise. Oxford (1990) defined translation as "converting the target language expression into the native language (at various levels, from words and phrases all the way up to whole texts); or converting the native language into the target language" (p. 46). Lado (1964), in his classic book Language Testing, pointed out that being able to translate well is an art. He further stated: "In testing the ability to translate from and into a foreign language, however, translation tests have self-evident validity: they are performance tests" (p. 261). To briefly sum up, translation is no doubt a form of performance assessment per se, and the next question to pose for language educators and test developers is this: why do we incorporate translation in language tests?

The Merits of Translation

It has been contended that translation is a pedagogical tool which should not be "totally disregarded" (Matthews-Breský, 1972); it can be used (a) to test, (b) to test grammatical and structural items, and (c) to test them in conjunction with other methods. In addition to being subsumed into tests, translation has great potential to enrich the process of second language acquisition and subsequent use. Cook (2010, p. 109) illustrated that translation is an essential skill that occurs quite often in the modern world and is especially vital to engagement with international issues and the financial well-being of multinational enterprises. In addition, the functionality of translation is accentuated in contemporary translation theories, where translation exercises are invariably confined to meticulously chosen, authentic materials with an unambiguous

context and objective (Nord, 1997). Translation is regarded by this approach as a communicative activity, which can enhance learners' translation skills as well as their proficiency in both L1 and L2. Nord (2005) further stressed that metalinguistic consciousness of structural sameness and difference between the source language and the target language can be honed by analyzing the two types of texts contrastively, which enables learners to become acquainted with the patterns and customs of expression in both cultures. There are other benefits that translation may yield, including finer comprehension of the target language, the expansion of learners' vocabulary reservoir, better understanding of how language functions, and the reinforcement of target language structures for effective use (Schäffner, 1998, p. 125). As reported by Ali (2008), translation may be applied to cultivate and employ students' natural competence to process L2 information through their native language. Chellappan (1991) also noted that translation might help learners be aware of the sameness and differences between the source language and the target language; this in consequence helps learners employ certain grammatical structures precisely and vocabulary properly. Translation should not be a hindrance to the learning of a new language but a boon to systematizing it via comparative analysis. Corder (1981) characterized L1 as more of a beneficial resource to which students can refer to compensate for their limitations when picking up a new language. The notion of L1 as an "interference" should be reconsidered as an "intercession", so that the use of students' L1 is viewed as a means of strategic communication.

Criticisms towards Translation

Perhaps this long exclusion of translation from language teaching can be

best explained by the abundant criticism of translation in the literature. Carreres (2006) compiled some of the disputes over the use of translation as a pedagogical means:

1. Translation does not belong to any communicative category of pedagogy since it is unnatural and pretentious. Besides, it is confining in that only two language skills (reading and writing) are practiced.
2. Translation into L2 is inimical to learners since they are compelled to explore the foreign language always through the medium of their native language; this results in interference and a reliance on L1 which hinders unrestricted expression in L2.
3. Translation into L2 has no place in daily life and is thus an aimless practice, since translators usually translate L2 into L1, not the other way around.
4. Translation into L2 is especially discouraging and disheartening because learners can hardly achieve the accuracy or polished style demonstrated by their instructors. The nature of this type of exercise appears to induce mistakes instead of correct usage of the target language.
5. Translation may well serve literary-oriented students whose interest lies in exploring the complexity of sentence patterns and vocabulary, yet for the majority of students it may be unfit.

Yet, despite such harsh condemnation of the practice of translation, an authoritative voice, Howatt (1984), urged us to "take another look at it" since any really convincing reasons against it are missing. Besides, learners of a foreign language do refer to their mother tongue to aid the process of acquiring the L2; in other words, they "translate silently" (Titford, 1985, p. 78). As a matter of fact, a growing body of studies supports the beneficial and facilitating role of translation or L1 transfer in learning a second language (Baynham, 1983; Ellis, 1985; Atkinson, 1987; Newmark, 1991; Kim, 2011; Kobayashi & Rinnert, 1992; Prince, 1996).

Translation in Language Teaching

The mistaken connection between Translation in Language Teaching (TILT) and the notorious Grammar Translation approach has kept educators from adopting and developing TILT for the past century or so; this erroneous notion has been rooted in the collective consciousness of teachers for so long that it seems absolutely natural to equate the two (Cook, 2010). The consequence has been sluggish progress in the application as well as the evolution of TILT, and severe impairment to language teaching in general. Nevertheless, despite this long stagnation, TILT is gradually being accepted and adopted in many areas globally (Allford, 1999; Butzkamm & Caldwell, 2009; Cook, 2009; Klapper, 2006; Lems, Miller, & Soro, 2010; Malmkjær, 1998; Schjoldager, 2004; Widdowson, 2003; Witte, Harden, & Ramos de Oliveira Harden, 2009). Many of students' language skills, including reading and writing, may be cultivated in the process of well-designed translation tasks (Gonzalez-Davies, 2004, p. 2). A positive relationship between the inclusion of students' L1 and the experience of learning a foreign language is also revealed in the study conducted by Brooks-Lewis (2009). In the Iranian context, Tavakoli, Ghadiri, and Zabihi (2014) proposed that translation has a positive influence on EFL learners' writing, especially on fixed expressions, transitions, and certain grammar rules. Surprisingly, students' checklists reflected that even when they are required to write directly, they still translate mentally before writing. The authors further suggested that teachers incorporate translation into classes to instruct students in implementing effective strategies to achieve the best possible effect in various situations. In Laufer and Girsai's study (2008), integrating contrastive analysis and translation activities into courses proved beneficial for the retention of

new vocabulary and collocations. The results of Buck's (1992) study also suggested that the prevailing rejection of translation as a means of language testing was not grounded on solid empirical evidence, if any. Students who took translation classes in Modern Languages at the University of Cambridge reflected, through feedback questionnaires, that translation is one of the most beneficial activities for picking up a new language (Carreres & Noriega-Sanchez, 2011, p. 282). Martínez, Orellana, and Pacheco (2008) upheld the incorporation of translation into language courses for the sake of cultivating learners' metalinguistic awareness as well as bettering their writing. A recent study conducted by Kelly and Bruen (2015) also attested to the revival of translation as a practical way to reflect actual language use. Both college lecturers and students hold positive attitudes toward the implementation of TILT on the premise that it is theoretically grounded and supported. Translation may allow students to ponder what they truly would like to utter in the process of thinking, and consequently enables them to communicate more precisely and effectively (Naznean, 2014). Still, the treatment of the translation process in cross-language humanities studies remains unsatisfactory (Douglas & Craig, 2007; Liamputtong, 2010); research focusing on the translation process is, euphemistically speaking, quite limited.

Holistic and Analytic Scoring

Typical approaches to scoring translation include holistic and analytic scoring. Holistic scoring involves the assignment of a single score to a piece of work on the basis of an overall impression of it (Hughes, 2007). By contrast, the analytic approach requires a separate score for each of a number of aspects of a task. The application of both analytic and holistic scoring is worth further investigation since

translation itself is multi-faceted, involving grammar usage, appropriate word selection, spelling, punctuation, and so on. Weigle (2002) provides a useful comparison summarizing the pros and cons of analytic and holistic rating scales (Table 1). Inherent advantages and disadvantages are visible in both types of rating scales; hence, the choice of rating approach depends on the purposes to be achieved and the resources available. In high-stakes examinations, which carry significant consequences, it is vital to reconsider each examination tool and method so that the scores can best represent test takers' underlying competence.

Table 1
A comparison between holistic and analytic rating scales in L2 writing (based on Weigle, 2002, p. 121)

Reliability
  Holistic scoring: lower than analytic but still acceptable.
  Analytic scoring: higher than holistic.
Construct validity
  Holistic scoring: the holistic scale assumes that all relevant aspects of writing develop at the same rate and can thus be captured in a single score; holistic scores correlate with superficial aspects such as length and handwriting.
  Analytic scoring: analytic scales are more appropriate for L2 writers, as different aspects of writing ability develop at different rates.
Practicality
  Holistic scoring: relatively fast and easy.
  Analytic scoring: time-consuming; expensive.
Impact
  Holistic scoring: a single score may mask an uneven writing profile and may be misleading for placement.
  Analytic scoring: more scales provide useful diagnostic information for placement and/or instruction; more useful for rater training.
Authenticity
  Holistic scoring: White (1985) argues that reading holistically is a more natural process than reading analytically.
  Analytic scoring: raters may read holistically and adjust analytic scores to match the holistic impression.

Novice and Expert Raters

In professional fields of various types, such as the sciences and the humanities, nurturing novices into experts is a continuous process. Ideally, novices can transform into experts after undergoing certain phases. A five-stage model of skill acquisition, moving from novice to expert, was introduced by Dreyfus and Dreyfus (1980); the stages are, in sequence, novice, advanced beginner, competent, proficient, and expert. Mastery of knowledge is delineated as "knowing how" instead of "knowing that", and expertise fuses with actual action. Experts know not only what to do, but how to attain the objective by getting the job done. Experts are capable of discriminating various conditions based on their prior experiences and then making appropriate reactions (Scribner, 1985). Novices and experts may not look very different at first glance, but, equipped with abundant past experience, experts take critical actions that are drastically distinct. For experts, each situation has been encountered before; these are not new difficulties but something they have dealt with previously. The more experience experts acquire, the faster and more expedient they become in resorting to their long-term memory for proper reactions. This is no doubt a conscious mental process. In cognitive psychology, it is presumed that even when novices and experts encounter the same issue, the actions they take will not be the same (Alexander & Judy, 1988; Chi, Glaser & Rees, 1982). Experts deduce more from the existing conditions and compile them into sub-groups that make sense (Chi, Feltovich,

& Glaser, 1981; Feltovich, Prietula, & Ericsson, 2006). Experts demonstrate their inherent knowledge subconsciously in the actions they take and are flexible in adapting themselves to meet new challenges without unnecessary hesitation (Benner, 1982). Novices may only see the superficial aspects of an issue, while experts classify the issue and then make the optimal reaction. This difference exists because there is higher-order processing in experts' minds to organize each incident, resulting in proper reactions and solutions (Chi, Glaser, & Rees, 1982). Experts heed even the slightest details in context and procure useful information in the process. By contrast, novices often fail to accomplish this since they tend to address issues at the surface level (Hobus, Schmidt, Boshuizen, & Patel, 1987). What makes an expert distinct from a novice is that the former can consult prior experience and is willing to acquire new knowledge (Berliner, 1994). Based on the above-mentioned literature, a conclusion can be drawn that experts are schema-driven instead of data-driven; prior experience plays a vital role and allows them to adjust their judgment in various circumstances. To briefly sum up, the idiosyncrasies of experts include sufficient prior experience, judicious judgment of the situation, and appropriate ways of dealing with it. The transition from novice to expert is an ongoing process, though it can also be viewed as a continuum with two ends: on one side stands the novice, on the other the expert. Novice raters are instructed to rate objectively; since they are relatively less experienced, they may require clear principles to consult so that they can make correct judgments (Benner, 1982). Experts, on the other hand, may devise a better solution by clustering information and analyzing it (Ross, Shafer, & Klein, 2006; Voss, Tyler, & Yengo, 1983). Experts also realize the importance of supervising their own performance, are more aware of errors, and are more flexible in catering to the requirements of particular contexts. Novices have the potential to achieve

expert status through copious hands-on tasks, and under appropriate circumstances they are likely to develop professional expertise (Chi, 2006). The quality disparity between the final products of experts and novices may lie in the different cognitive patterns in their minds and their different ways of addressing information. To briefly sum up, the way an expert categorizes information is relatively more complicated yet more organized than that of a less experienced counterpart. Presenting simulated classroom discipline problems to both novice and expert teachers, Swanson (1990) discovered a difference in the mental processing of the two groups: while novice teachers try to figure out feasible solutions, experts identify the problem first and then deal with the situation in order of priority. To reach expert status and expertise, experience is considered crucial. Farrell (2013) proposed five representative features of experts: knowledge of learners and learning, engagement in critical reflection, access to past experiences, informed lesson planning, and active student involvement. So experience is essential, yet experience alone may not be sufficient to transform a novice into an expert; other determinants such as one's own teaching experience, reflection on one's own judgments and decisions, and continuous learning are vital as well. One distinction needs to be drawn: experience does not necessarily lead to expertise. A common error is to assume that copious experience equals expertise (Bereiter & Scardamalia, 1993). If one can gain from prior experience through continual reflection, then gradually the amassing of experience will likely transform into expertise (Tsui, 2003). Inexperienced raters tend to be more stringent and, at the same time, less consistent (Weigle, 1998). It requires constant reflection, along with numerous experiences and rater training, to transform a novice rater into an expert. In the present study, the contrast between expert rating and novice rating will be the

focal point. Research has shown that raters are critical in writing assessment. Experts can be defined as those whose rating judgment is consistently excellent (Cummins, 1990). Raters have been categorized into three levels (competent, intermediate, or proficient) based on how well their ratings align with others' (Wolfe, Kao, & Ranney, 1998). According to these scholars, raters at the highest level, i.e. proficient raters, tend to rate writing similarly, take all criteria into consideration, and are able to justify their judgments based on the rubrics. In performance assessment, aside from the test taker's own competence and the test items, raters are crucial since they are the ones who judge the performances and give the final verdict. Nonetheless, raters may hold more or less subjective perspectives, and the result is that there may be inconsistent judgments even within the same group of raters. Their grading consistency, stringency, and personal characteristics may be affected by various factors. Grading involves a complex cognitive process, and variance in performance assessment grading seems to be common. In the present study, the raters' characteristics, i.e. novice versus expert, are the focus: to what extent do these differences in experience and expertise influence their grading of translation? To answer this research question, it is wise to resort to MFRM.

Many-Faceted Rasch Measurement Model (MFRM)

In oral or writing tests, or performance assessment of any kind, several factors or facets may occur concurrently and result in unexpected or unfair judgements, such as test items, test takers, and raters. People of various proficiency levels, ethnic backgrounds, or even genders may be appraised drastically differently in performance assessment if one single variable, such as their background, is taken into account. Variables such as the task itself, the raters, and other related factors may

produce a wide range of results and thus misrepresent the real competence of the test taker. Due to the complicated nature of performance assessment, i.e. its manifold variables, it is essential to estimate reliably the extent to which test takers can perform. Hence, the Many-Faceted Rasch measurement model has been introduced to compensate for the deficiencies of raw scores; it is suitable for gauging the corresponding competence of test takers, the comparative difficulty of test items, and the stringency of raters on a logit scale. Further, a model can be created to forecast any test taker's likelihood of a given rating from a certain rater of particular stringency on a specific test item. This matters tremendously, in performance assessment in particular, since the objective is to present credible and stable scores, which may be affected by various factors collectively and concurrently (Bachman, 2004; McNamara, 1996). The Many-Faceted Rasch model derives from the basic Rasch measurement model, which is suitable for processing dichotomous data: each response takes the value 1 or 0, with 1 signifying success as opposed to 0. In addition, raw scores can be converted into a linear, calculable, and replicable estimation. The parameters for each test item, test taker, and rater become distinct in the Rasch model, which allows researchers to compare and contrast them without being subjective. The advantage of MFRM lies in the feature that each parameter can be conditioned on during the processing (Brentari & Golia, 2007). This empowers the researcher to place the facets, such as item difficulty, personal competence, and rater severity, on the same logit scale for comparison. In addition to specific factors, latent factors such as personal characteristics can also be revealed based on the calibration of a Rasch map (Figure 1).
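For reference, the basic dichotomous Rasch model from which MFRM extends can be written as follows; this is the standard textbook form rather than a formula quoted from the thesis itself:

```latex
% Basic dichotomous Rasch model (standard form, shown for reference)
% P(X_{ni} = 1): probability that person n succeeds on item i
% B_n: ability of person n; D_i: difficulty of item i (both in logits)
P(X_{ni} = 1 \mid B_n, D_i) = \frac{\exp(B_n - D_i)}{1 + \exp(B_n - D_i)}
```

MFRM keeps this logistic form but subtracts additional facet parameters (e.g., rater severity) from the person-minus-item term, as the model presented later in this chapter shows.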

Figure 1. The Facets Variable Map of This Study

Unidimensionality and local independence are two fundamental assumptions of the Rasch measurement model that need to be distinguished. The former refers to the condition in which a single attribute is assessed at a time and the entire set of assessment items gauges exclusively a single construct. All examination questions help assess an individual attribute, with the estimations of test takers' competence and test item difficulty in the data matrix taken into consideration. Unidimensionality is not unusual in educational statistics, and it holds when an individual quality or facet in an examination is adopted to construe the major measurements of the test score sums (Bond & Fox, 2007).

Local independence also plays a vital role in the functioning of the Rasch measurement model under the unidimensionality scheme. It is presumed that each test item is linked to a latent trait value of its own. Latent trait refers to the idea that the data comply with an aligned measurement line, or the fundamental construct. Every individual examination item carries a value of its own, and the people who take the test respond to these items independently. The test taker's competence, the corresponding test item difficulty, and the stringency of raters can thus be outlined on a logit scale, and a model can be produced to forecast each individual's likelihood of a given rating from a single rater with a given stringency on a specific item (McNamara, 1996). Researchers are informed about the interrelatedness of the data by the fit analysis that comes with MFRM, which is able to demonstrate how well each examination item conforms to the intended construct. Abnormal signs, if any, will be pointed out by the fit analysis; unfitting examination items are recommended to be redesigned, substituted, or simply restated in other forms (Bond & Fox, 2007). Besides examination items, researchers also suggest that, with the help of the fit analysis, aberrant raters can be identified, whether misfitting or overfitting, thus pinpointing raters who need to improve, or even prompting the recruitment of new ones (Bachman, 2004; McNamara, 1996). Hence, based on the anticipation of the model, any varying characteristic of the raters will be revealed and compiled by the fit statistics (McNamara, 1996). In the meantime, investigators are able to inspect the condition of the examination items and keep track of them through the fit analysis. Furthermore, Rasch analysis can yield two indicators, infit and outfit. Infit is an information-weighted mean square, based on squared standardized residuals weighted by the model variance of the observations; outfit is the unweighted mean of the squared standardized residuals and is therefore sensitive to unexpected, outlying ratings. In contrast to standard statistical methods, infit and outfit mean squares tend to be less sensitive to sample size, and

their values are determined by test takers' response patterns, which is why they are broadly adopted and welcomed (Smith, Schumacker, & Bush, 1995). To sum up, MFRM is an essential measurement approach whose functions include gauging the characteristics of the participants and of the examination itself under common, suitable measurement circumstances. It is devised to discern the likelihood of success established by the difference between an individual's competence and item difficulty, presented in a table of expected likelihood. Investigators are able to deduce each individual's true competence and optimally interpret the examination since the probabilistic estimates are based on actual test performance. It is also worth mentioning that in the Rasch model test takers can be ordered according to their competence, and examination items can likewise be sequenced by difficulty. This is a vital feature since raters are crucial to the outcome of the test result in performance assessment, and MFRM may serve as a remedy for the fairness of grades given by raters. Engelhard (1992) indicated that MFRM is an impartial and trustworthy mechanism, especially applicable to the assessment of writing competence. Traditionally, if one is judged simply by the raw scores given, his or her real competence may be under- or overrated. Besides, each individual rater may be relatively more or less stringent on each examination, and it is possible that the exact same test taker will receive drastically different grades from different raters. Even if raters receive rating training in advance, some disparities may still exist, in high-stakes examinations in particular. The application of MFRM hence becomes critical, since giving raw scores only may yield a misrepresentation of test takers' true competence. McNamara (1996) indicated that raw scores should not be considered a trustworthy index of the test taker's true competence since many variables may affect the result of one single examination. The measure produced by the Rasch model takes miscellaneous rating circumstances into

consideration and is thus more reliable than raw scores. Figure 2 demonstrates the measurement model for the writing assessment, in which the theoretical prototype is clearly illustrated.

Figure 2. Measurement Model for the Writing Assessment (Engelhard, 1992, p. 174)

The MFRM model that reflects the conceptual model of writing ability takes the following general form (Engelhard, 2002, p. 175):

log[P_{nijmk} / P_{nijm(k-1)}] = B_n - T_i - R_j - D_m - F_k

where
P_{nijmk} = probability of student n being rated k on translation task i by rater j for domain m
P_{nijm(k-1)} = probability of student n being rated k-1 on translation task i by rater j for domain m
B_n = translation ability of student n
T_i = difficulty of translation task i
R_j = severity of rater j
D_m = difficulty of domain m
F_k = difficulty of rating step k relative to step k-1
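To make the model concrete, the minimal Python sketch below computes the rating-category probabilities implied by the adjacent-category form above for one hypothetical student-task-rater-domain combination. The parameter values are invented for illustration and are not estimates from this study.

```python
import math

def mfrm_category_probs(B, T, R, D, F):
    """Category probabilities for one student x task x rater x domain cell
    under the many-facet form log(P_k / P_{k-1}) = B - T - R - D - F_k.

    B : student ability (logits)   T : task difficulty
    R : rater severity             D : domain difficulty
    F : list of step difficulties F_1..F_K (K = top category)
    Returns probabilities for categories 0..K (they sum to 1).
    """
    # Category 0 contributes exp(0); category k contributes the exponential
    # of the cumulative sum of the first k adjacent log-odds terms.
    cumulative, logits = 0.0, [0.0]
    for Fk in F:
        cumulative += B - T - R - D - Fk
        logits.append(cumulative)
    expos = [math.exp(v) for v in logits]
    denom = sum(expos)
    return [e / denom for e in expos]

# Illustrative (invented) values: an able student, an average task, a slightly
# severe rater, and five steps on the 0-5 rating scale used in this study.
probs = mfrm_category_probs(B=1.2, T=0.0, R=0.3, D=0.0,
                            F=[-1.5, -0.5, 0.0, 0.5, 1.5])
print([round(p, 3) for p in probs])  # probabilities for scores 0..5
```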

In this model, the observed rating is the dependent variable. The three primary facets that characterize the intervening variables are domain difficulty, difficulty of the writing task, and rater severity; the test taker's writing ability is the fourth underlying facet. Apart from the above, the structure of the rating scale is a vital factor as well. Conventionally, inter-rater reliability is adopted to inspect rater effects, that is, to see whether the rating patterns among raters deviate or not. Nonetheless, inter-rater reliability is unable to distinguish each rater's distinctness in terms of rating stringency and leniency (Bond & Fox, 2007). What makes MFRM valuable is that the latent variability among raters can be revealed; after the raters' variability is adjusted for, the test taker's real competence becomes crystal clear. According to Bond and Fox (2007), the issue with inter-rater reliability lies in the fact that even though the rank orders of test takers are in conformity, the stringency or leniency variation between raters fails to be exhibited. MFRM is capable of mending this gap by modelling the relatedness among raters and thus safeguarding the consistency of rating (Bond & Fox, 2007). Accordingly, since inter-rater reliability is deficient in proffering an exhaustive rating pattern between raters, the implementation of MFRM becomes essential. Another noteworthy feature of MFRM lies in the interplay between a certain rater and a specific facet of significance. In performance assessment of any kind, the interplay of facets may lead to bias; it could be a specific rater under some particular rating circumstances. Many-Faceted Rasch measurement includes a feature called bias analysis, which is capable of detecting these sub-patterns analytically. With this function of MFRM, the differences between observed and expected values, called residuals, can be analyzed in contrast. In-depth analysis will reveal whether any sub-patterns exist, such as systematic interplay within groups or between groups (McNamara, 1996). Therefore, FACETS will be put into use for the current study.
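The residual-based fit statistics (infit and outfit) introduced earlier can be stated concretely. The sketch below follows the standard Rasch definitions (outfit as the unweighted mean of squared standardized residuals, infit as the information-weighted counterpart); it is illustrative only and not the exact computation performed by the FACETS program.

```python
def fit_mean_squares(observed, expected, variance):
    """Standard Rasch fit statistics for one element (e.g., one rater or item).

    observed : observed ratings
    expected : model-expected values for the same observations
    variance : model variances of those observations
    Returns (infit_mnsq, outfit_mnsq); values near 1.0 indicate good fit.
    """
    residuals_sq = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(residuals_sq, variance)) / len(observed)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(residuals_sq) / sum(variance)
    return infit, outfit

# Toy example with made-up values; both statistics come out close to 1.0.
obs = [4, 3, 5, 2, 4]
exp = [3.6, 3.1, 4.4, 2.3, 3.9]
var = [0.8, 0.9, 0.7, 0.9, 0.8]
print(fit_mean_squares(obs, exp, var))
```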

CHAPTER THREE
RESEARCH METHODOLOGY

The goal of the present study is to investigate whether the adoption of a holistic scoring rubric or an analytic scoring rubric results in any significant difference in grading in performance assessment. In order to achieve this goal, other factors such as test takers' competence and test item difficulty are also analyzed concurrently. The Many-Faceted Rasch measurement model is utilized as the research methodology to process these various factors.

Participants

The participants in the present study were 60 Taiwanese college freshmen from one of the most prominent national universities in northern Taiwan. Most participants had received at least ten years of formal English instruction; according to the current English educational policy of Taiwan, English is officially introduced into the curriculum in the third year of elementary school. The participants were therefore expected to be equipped with moderate English proficiency. As for the selection of expert raters, the two raters were experienced teachers from other universities with excellent reputations; they had received formal rater training in assessment and had abundant experience in rating. The other two raters, the novices, were TESOL-majored graduates from one of the most prominent universities in northern Taiwan. Like the expert raters, they had received formal TESOL courses, but they had relatively scarce rating

experience.

Instrument

In sum, there were sixty participants, two experienced raters, and two inexperienced raters. The translation questions were adapted from the Scholastic Aptitude English Test and thus carry credible validity and reliability.

Translation items of the current study:
1. 食用含有油炸的食物,可能導致嚴重健康問題。因此,政府已經對於某些食物頒布禁令。
2. 對於辛勤工作的人們而言,有效的時間規劃是重要課題。
3. 藉由謙虛的態度與努力不懈的學習,他不但成功了,還獲得了成就感。
4. 政府正提倡省水的觀念,以避免缺水。
5. 台灣的自然景觀吸引許多來自外國的觀光遊客。

Provided answers for analytic rating:
1. (Eating/Consuming oil-fried food) (may lead to) (serious health problems.) (Hence, the government) (has posed bans on some food.)
2. (For hard-working/diligent) (people, efficient/effective) (time planning) (is an) (important lesson.)
3. (Through humble attitude) (and hard-working learning,) (he not only succeeded) (but also gained) (a sense of achievement.)
4. (The government is) (promoting the concept) (of saving water) (to avoid) (water shortage.)
5. (The natural) (scenery of Taiwan) (has attracted) (many tourists) (from overseas.)

Provided holistic rating scale for raters:

Five points: 內容能充分表達題意;文段組織、連貫性甚佳,能充分掌握句型結構;用字遣詞、文法、拼字、標點及大小寫幾乎無誤。
Four points: 內容適切表達題意;文段組織、連貫性及句型結構大致良好;用字遣詞、文法、拼字、標點及大小寫偶有錯誤,但不妨礙題意之表達。
Three points: 內容未能完全表達題意;文段組織鬆散,連貫性不足,未能完全掌握句型結構;用字遣詞及文法時有錯誤,妨礙題意之表達,拼字、標點及大小寫也有錯誤。
Two points: 僅能局部表達原文題意;文段組織不良並缺乏連貫性,句型結構掌握欠佳,大多難以理解;用字遣詞、文法、拼字、標點及大小寫錯誤嚴重。
One point: 內容無法表達題意;句型結構掌握差,無法理解;用字遣詞、文法、拼字、標點及大小寫之錯誤多且嚴重。
Zero: 未答/等同未答

Table 2
The characteristics of translation items

Translation 1: Grammar S+Vt+O; Structure compound; Word range L1~L2, L3(1), L5(2); Tense present / present perfect; Collocation impose a ban; Topic Health
Translation 2: Grammar S+Vi+SC; Structure simple; Word range L1~L2; Tense present; Topic Life
Translation 3: Grammar Not only…but also…; Structure compound; Word range L1~L2, L3(2), L4; Tense present / past; Collocation a sense of achievement; Topic Learning
Translation 4: Grammar S+Vt+O+OC; Structure simple; Word range L1~L2, L3(1), L4(1), L5(1), L6(1); Tense present progressive; Collocation water conservation / water shortage; Topic Energy
Translation 5: Grammar S+Vt+O; Structure simple; Word range L1~L2, L3(2); Tense present; Topic Culture

Scoring and Coding

The four raters were divided into two groups, expert raters and novice raters, and each of them rated the translation items analytically based on the criteria the CEEC has proposed. According to the CEEC criteria, raters have to follow four basic principles. First, each item is divided into five chunks, and each chunk is worth 1 point, so the maximum for each item is 5 points. Second, each error deducts 0.5 point, with at most 1 point deducted for two or more errors within a single chunk; too many errors within one chunk do not lead to deductions in other chunks. In other words, each chunk is scored separately, and the sum of the five chunks is the total score of the item. Third, the same repeated spelling error is counted only once. Fourth, punctuation errors deduct at most 0.5 point. Hence, in the scoring process, a fully correct response to an item received 5 points and was coded 5, while imperfect responses were coded from 0 to 4.5 according to the answer. As a result, raw scores are assigned on a scale of 0-5.
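As an illustration of the CEEC-style analytic rule described above, the sketch below scores a single five-chunk item from per-chunk error counts. It assumes that the once-only rule for repeated spelling errors and the 0.5-point cap on punctuation errors have already been folded into the counts; it is a hedged illustration, not an official CEEC implementation.

```python
def analytic_item_score(errors_per_chunk):
    """Score one translation item analytically.

    Each of the five chunks is worth 1 point; every error costs 0.5 point, and
    the deduction within a chunk is capped at 1 point (two errors or more).
    `errors_per_chunk` holds five non-negative error counts, with repeated
    spelling errors counted once and punctuation capped beforehand.
    """
    assert len(errors_per_chunk) == 5, "each item is divided into five chunks"
    score = 0.0
    for n_errors in errors_per_chunk:
        deduction = min(0.5 * n_errors, 1.0)  # cap the loss per chunk at 1 point
        score += 1.0 - deduction
    return score  # raw score on the 0-5 scale

# Example: two chunks with one error each, one with three, two flawless chunks.
print(analytic_item_score([0, 1, 1, 3, 0]))  # 5 - 0.5 - 0.5 - 1.0 = 3.0
```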

Procedure

Under time pressure, test takers may make unnecessary errors, a variable that should be eliminated in the present study; the translation items were therefore taken untimed by the participants. All the raters first rated the collected data analytically based on the rating criteria released by the CEEC, and then rated the scripts again holistically at their leisure. All the response sheets were randomly assigned to raters to avoid any memory effect. The test takers' performance was then analyzed with the Many-Faceted Rasch model (MFRM). Hence, each student received two scores, one derived from the holistic scoring rubric and the other from the analytic scoring rubric. A Pearson correlation was performed to examine the extent to which the two scorings are related. Post hoc interviews with the raters were further implemented to probe into the similarities and differences between the two contrasting types of scoring.

Data Analysis

The Many-Faceted Rasch measurement model expands the basic Rasch model and enables researchers to add the facet of judge severity (or another facet of interest) to person ability and item difficulty so that they can be placed on the same logit scale for comparison. Rater variability can thus be adjusted for, providing a more accurate picture of test takers' actual underlying competence. For the research questions, the researcher investigates how the rating scales, the raters and their rating experience, and the difficulty of the items (in terms of grammatical features) interact with one another. This interaction can be examined by MFRM, and the fit analysis is also applied to investigate how well the raters'

grading fits into the Many-Faceted Rasch measurement model of test performance.

Expected Result and Suggestion

It is expected that the translation scores will have significant relationships with mastery of the source language and the target language, as evidence for the inclusion of translation in tests. Further, the holistic rating system can gain adequate empirical validity as support for its continued use in translation rating, and the same expectation holds for the utility of the analytic rating rubric. The advantages and disadvantages of holistic and analytic rating rubrics in translation tests will be critically presented and discussed.
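Since each student ends up with one holistic and one analytic total, the Pearson correlation planned in the Procedure section can be computed in a few lines. The sketch below uses SciPy; the arrays are placeholders, not the actual data of this study.

```python
from scipy.stats import pearsonr  # SciPy's standard Pearson correlation test

# Placeholder totals for a handful of students (the real study has 60).
holistic_totals = [22.0, 18.5, 24.0, 15.0, 20.5, 12.0]
analytic_totals = [21.5, 19.0, 23.5, 14.0, 21.0, 11.5]

r, p_value = pearsonr(holistic_totals, analytic_totals)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")
```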

CHAPTER FOUR
RESULTS AND DISCUSSIONS

To answer the two research questions, this chapter offers a detailed discussion of the quantitative results and their underlying meaning. The estimates of rater reliability, test taker reliability, item reliability, and fit statistics are presented, followed by explanations of these findings and excerpts from the rater interviews.

Overview of the Study

The Many-Faceted Rasch Measurement Model (MFRM) was deployed to examine the collected data of the translation test in order to answer the two research questions, and Facets served as the means of analysis. The research questions are as follows:
1. To what extent do the two different types of rating rubrics (holistic vs. analytic) converge on offering consistent and unbiased ratings of test takers' translation performance?
2. Is there any significant difference between inexperienced and experienced raters in terms of rating translation items under the analytic and holistic rubrics?

Major Findings

Regarding the first research question, MFRM is capable of demonstrating the factors of rater severity, rating experience, test taker ability, and item difficulty via the Facets variable maps and the logit scale.

As presented in Figure 3 and Figure 4, the two Facets variable maps display the distribution of the four main factors in the present study, i.e. test takers, rating experience, rater severity, and test items. The logit scale is situated in the left-hand column, with the zero of the metric established at the average item difficulty. The test takers are arranged by measure in the second column, raters' rating experience is placed in the third column, and the difficulty of the tasks is in the second column from the right. It is visible that the majority of the test takers' measures are plotted between the values +4 and 0, which indicates that the ability of the participating test takers is normally distributed and that the collected samples represent an agreeable range of ability. Rarely did they fail to respond to the five translation tasks for lack of appropriate proficiency. The five translation tasks were designed in a balanced way, with task 2 and task 5 being easy, task 1 and task 4 moderate, and task 3 difficult. As far as the raters are concerned, both the expert and the novice raters were of about average severity under both types of rating scales. When conducting research, it is vital to distinguish those with corresponding competence from those who simply guess or depend on sheer luck. As shown in Figure 3 and Figure 4, the data of the 60 samples were appropriately measured under the holistic and analytic scales respectively to appraise the relevant factors of the present study. However, some extreme values are situated around -1 logits under both types of rating scales. The reasons behind this may be manifold; some possible explanations are discussed below. To begin with, even though they are college freshmen, some of them still have rather unsatisfactory English proficiency, and their lack of the appropriate corresponding ability resulted in such extremely undesirable scores. Another plausible reason is that some of these freshmen had been admitted to the university through the

(37) Recommendation-Selection Admission Program much earlier in the year than those took the final examination in July. After several months of completely isolation from the subject of English, it is possible that they are not in their optimal conditions in terms of answering the examination items in this particular study. Last but not least, even though test takers are allowed to answer the questions as long as they intend, still there are some inevitable factors that may influence their performance, such as nervousness, fatigue or other physical discomfort that may cause such poor performance.. 30.

Figure 3
Holistic Scale Facets Variable Map of the Study

Figure 4
Analytic Scale Facets Variable Map of the Study

In Table 3 and Table 4, we can see the separation reliability for the test-taker facet; the values of 0.94 and 0.92 imply that the abilities of these test takers vary considerably. The fixed χ² statistic tests the hypothesis that all of the test takers are equipped with the same ability. This χ² is highly significant (p < .05), leading one to reject the hypothesis that all of them fall within the same level of proficiency.

Table 3
Holistic Scale Students Measurement Report

Obsvd Score | Obsvd Count | Obsvd Average | Fair-M Average | Measure | Model S.E. | Infit MnSq | Infit ZStd | Outfit MnSq | Outfit ZStd | Num | Students
88 | 20 | 4.40 | 4.43 | 3.58 | .37 | .87 | -.3 | .85 | -.4 | 28 | 28
88 | 20 | 4.40 | 4.43 | 3.58 | .37 | 1.23 | .8 | 1.52 | 1.5 | 35 | 35
87 | 20 | 4.35 | 4.37 | 3.44 | .36 | 1.35 | 1.1 | 1.27 | .9 | 22 | 22
(middle rows omitted)
35 | 20 | 1.75 | 1.79 | -1.51 | .26 | 1.82 | 2.3 | 2.00 | 2.7 | 34 | 34
32 | 20 | 1.60 | 1.62 | -1.70 | .25 | .78 | -.7 | .88 | -.3 | 16 | 16
66.1 | 20.0 | 3.31 | 3.32 | 1.24 | .32 | 1.01 | -.2 | 1.02 | -.2 | Mean (Count: 60)
12.7 | .0 | .64 | .63 | 1.25 | .02 | .77 | 1.6 | .79 | 1.7 | S.D. (Population)
Model, Populn: RMSE .32; Adj (True) S.D. 1.21; Separation 3.81; Reliability .94
Model, Fixed (all same) chi-square: 1005.4; d.f.: 59; significance (probability): .00

Table 4
Analytic Scale Students Measurement Report

Obsvd Score | Obsvd Count | Obsvd Average | Fair-M Average | Measure | Model S.E. | Infit MnSq | Infit ZStd | Outfit MnSq | Outfit ZStd | Num | Students
93 | 20 | 4.65 | 4.67 | 3.79 | .44 | .65 | -1.0 | .68 | -.9 | 28 | 28
90 | 20 | 4.50 | 4.52 | 3.28 | .39 | .71 | -.9 | .69 | -1.0 | 26 | 26
89 | 20 | 4.45 | 4.47 | 3.13 | .38 | 1.06 | .2 | 1.23 | .7 | 35 | 35
(middle rows omitted)
42 | 20 | 2.10 | 2.12 | -.73 | .23 | .61 | -1.4 | .61 | -1.4 | 29 | 29
22 | 20 | 1.10 | 1.06 | -1.85 | .25 | .90 | -.2 | 1.09 | .3 | 16 | 16
68.7 | 20.0 | 3.44 | 3.45 | 1.15 | .29 | .99 | -.3 | 1.01 | -.3 | Mean (Count: 60)
13.4 | .0 | .67 | .67 | 1.08 | .04 | .75 | 1.8 | .80 | 1.9 | S.D. (Population)
Model, Populn: RMSE .30; Adj (True) S.D. 1.04; Separation 3.49; Reliability .92
Model, Fixed (all same) chi-square: 788.4; d.f.: 59; significance (probability): .00
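As a quick check on how these summary indices relate to one another (a sketch based on the standard Rasch definitions of separation and reliability, not on the Facets program itself), the separation index is the adjusted ("true") standard deviation divided by the root-mean-square standard error, and reliability is the ratio of true variance to observed variance:

# Sketch: recover the separation and reliability indices reported by Facets
# from the RMSE and the adjusted ("true") standard deviation, using the
# standard Rasch definitions (assumed here, not taken from the Facets program).
def separation_and_reliability(rmse, true_sd):
    """Return (separation, reliability) for one facet."""
    separation = true_sd / rmse                           # G = true SD / average error
    reliability = true_sd**2 / (true_sd**2 + rmse**2)     # R = true variance / observed variance
    return separation, reliability

# Test-taker facet, holistic scale (Table 3): RMSE .32, Adj (True) S.D. 1.21
print(separation_and_reliability(0.32, 1.21))   # ~ (3.78, 0.93); Table 3 reports 3.81 and .94
# Test-taker facet, analytic scale (Table 4): RMSE .30, Adj (True) S.D. 1.04
print(separation_and_reliability(0.30, 1.04))   # ~ (3.47, 0.92); Table 4 reports 3.49 and .92

The small discrepancies from the tabled values simply reflect rounding in the reported statistics.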

Fit analysis is useful for detecting potentially problematic items and performances (Bond & Fox, 2013). A mean-square fit value between 0.5 and 1.5 indicates that an item is suitable and productive for measurement (Linacre, 2000). An item with a value lower than 0.5 is less productive for measurement owing to overfit: it fails to detect variance in test takers' ability and thus contributes little to the test. Under this circumstance it does not ruin the general quality of the test, but it may yield misleadingly high reliability and separation indices. When the index is higher than 1.5 but lower than 2.0, the item is not productive for the construction of the measurement, though it does not degrade the quality of the test either. If the value exceeds 2.0, the item is so badly written that it may distort or degrade the overall test quality. On the whole, misfit occurs when an item's value exceeds 1.5, and severe misfit when the mean-square value exceeds 2.0. Table 5 summarizes the implications of mean-square values for measurement (Linacre, 2002).

Table 5
The Implication for Measurement of Mean-square Value (Linacre, 2002)

Mean-square Value | Implication for Measurement
> 2.0 | The item is so badly written that it may distort or degrade the overall test quality.
1.5 - 2.0 | The item is not productive for the construction of the measurement, but does not degrade the quality of the test either.
0.5 - 1.5 | The item is the most ideal and productive.
< 0.5 | The item is less productive for measurement.

Overall, the 60 test takers in this study represented a wide range of students and thus yielded a more credible result. Intriguingly, under both types of rating scales, test taker number 28 ranked highest and test taker number 16 lowest. Such unanimity may serve as a hint that the holistic scoring rubric could be a qualified alternative to the currently adopted yet time- and effort-consuming analytic scoring rubric.
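For quick reference, the thresholds in Table 5 can be expressed as a small classification helper (a sketch of the Linacre (2002) guidelines as summarized above; the function name and boundary handling are illustrative):

# Sketch: classify a mean-square fit value according to the Linacre (2002)
# guidelines summarized in Table 5.
def classify_mean_square(mnsq):
    """Return the measurement implication of a mean-square fit value."""
    if mnsq > 2.0:
        return "severe misfit: may distort or degrade the overall test quality"
    if mnsq > 1.5:
        return "misfit: unproductive for measurement, but does not degrade the test"
    if mnsq >= 0.5:
        return "productive: the most ideal range for measurement"
    return "overfit: less productive, may inflate reliability and separation"

# Example: the Outfit MnSq of 2.00 for test taker 34 under the holistic scale (Table 3)
print(classify_mean_square(2.00))   # the 2.0 boundary is treated here as misfit, not severe misfit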

Regarding the first research question, "To what extent do the two different types of rating rubrics – holistic vs. analytic – converge on offering consistent and unbiased ratings of test takers' translation performance?", Table 6 and Table 7 help answer it; they present all the raters' rating performance under the holistic and the analytic rating scale respectively. It is clear that under both types of rating scales the four raters' mean-square values lie between 0.5 and 1.5, suggesting that their grading was consistent throughout. To be more specific, the values are close to the expected value of 1, indicating highly satisfactory rating quality. In Table 6, the rater separation reliability was 0.94 under the holistic rating scale, compared with 0.69 under the analytic rating scale shown in Table 7; on this basis, holistic rating appears to be a qualified alternative to analytic rating in translation tests. The relationship between the holistic and analytic scale scores was further investigated with the Pearson correlation coefficient. A very high and significant correlation between the two scales was obtained, r = 0.934, n = 60, p < .001. An independent-samples t-test was also conducted to compare the two types of rating rubrics. No statistically significant difference was found between holistic scoring (M = 1.24, SD = 1.26) and analytic scoring (M = 1.15, SD = 1.09), t(60) = 0.42, p = 0.678. Nonetheless, distinctions among the four raters are indicated by p < .05 under both types of rating scales. These differences can be further investigated by taking the raters' rating experience into consideration, which also speaks to the second research question.
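A minimal sketch of this comparison, assuming the per-test-taker Facets measures under each scale are available as two arrays (the variable names and the randomly generated placeholder values are illustrative only, so the printed statistics will not reproduce the reported ones):

# Sketch: compare the holistic and analytic measures obtained for the same 60 test takers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # placeholder data for demonstration only
holistic = rng.normal(1.24, 1.26, 60)     # in the study: the 60 holistic-scale measures (logits)
analytic = rng.normal(1.15, 1.09, 60)     # in the study: the 60 analytic-scale measures (logits)

r, p_r = stats.pearsonr(holistic, analytic)     # strength of association between the two scales
t, p_t = stats.ttest_ind(holistic, analytic)    # independent-samples t-test, as reported above
print(f"r = {r:.3f} (p = {p_r:.3f}); t = {t:.2f} (p = {p_t:.3f})")

Because the two sets of measures come from the same test takers, a paired-samples test (scipy.stats.ttest_rel) would be a natural supplementary check on the same conclusion.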

Table 6
Holistic Scale Judges Measurement Report

Obsvd Score | Obsvd Count | Obsvd Average | Fair-M Average | Measure | Model S.E. | Infit MnSq | Infit ZStd | Outfit MnSq | Outfit ZStd | Num | Judges
881 | 300 | 2.94 | 3.07 | .53 | .08 | .89 | -1.3 | .89 | -1.3 | 4 | Expert 2
1021 | 300 | 3.40 | 3.32 | .01 | .08 | 1.16 | 1.8 | 1.16 | 1.9 | 1 | Novice 1
989 | 300 | 3.30 | 3.40 | -.16 | .08 | .99 | .0 | .98 | -.2 | 3 | Expert 1
1078 | 300 | 3.59 | 3.51 | -.38 | .08 | 1.04 | .5 | 1.06 | .7 | 2 | Novice 2
992.3 | 300.0 | 3.31 | 3.33 | .00 | .08 | 1.02 | .2 | 1.02 | .3 | Mean (Count: 4)
71.7 | .0 | .24 | .16 | .34 | .00 | .10 | 1.2 | .10 | 1.2 | S.D. (Population)
Model, Populn: RMSE .08; Adj (True) S.D. .33; Separation 4.02; Reliability .94
Model, Fixed (all same) chi-square: 70.4; d.f.: 3; significance (probability): .00

Table 7
Analytic Scale Judges Measurement Report

Obsvd Score | Obsvd Count | Obsvd Average | Fair-M Average | Measure | Model S.E. | Infit MnSq | Infit ZStd | Outfit MnSq | Outfit ZStd | Num | Judges
975 | 300 | 3.25 | 3.40 | .23 | .07 | .90 | -1.2 | .90 | -1.2 | 4 | Expert 2
1054 | 300 | 3.51 | 3.57 | -.05 | .08 | 1.12 | 1.3 | 1.10 | 1.1 | 2 | Novice 2
1035 | 300 | 3.45 | 3.59 | -.09 | .07 | 1.04 | .4 | 1.04 | .5 | 3 | Expert 1
1061 | 300 | 3.54 | 3.59 | -.09 | .08 | 1.02 | .2 | 1.01 | .1 | 1 | Novice 1
1031.3 | 300.0 | 3.44 | 3.54 | .00 | .07 | 1.02 | .2 | 1.01 | .1 | Mean (Count: 4)
33.8 | .0 | .11 | .08 | .13 | .00 | .08 | .9 | .07 | .9 | S.D. (Population)
Model, Populn: RMSE .07; Adj (True) S.D. .11; Separation 1.50; Reliability .69
Model, Fixed (all same) chi-square: 13.6; d.f.: 3; significance (probability): .00

As for the second research question, "Is there any significant difference between inexperienced and experienced raters in rating translation items under the analytic and holistic rubrics?", Table 8 and Table 9, which present the raters grouped by experience under the two rating scales, help answer it. The Outfit mean-square values of the expert and novice raters under both the holistic and the analytic rating rubrics fell between 0.5 and 1.5, indicating overall good fit. This result suggests that both the expert and the novice raters recruited in this study demonstrated reliable grading patterns, which is quite ideal.

More importantly, all the raters, regardless of their rating experience, manifested dependable rating behavior under the two drastically different rating scales, which strengthens the argument that holistic rating is also capable of offering consistent and unbiased assessment of test takers' translation performance. The separation reliability for the rater-experience facet was 0.91 under the holistic rating scale and 0.46 under the analytic counterpart, implying that the expert and novice groups were reliably separated in severity under the holistic scale but much less so under the analytic scale. With respect to the second research question, even though the mean-square values under both rating scales are satisfactory, the significant χ² values (holistic: χ² = 21.7, df = 1, p = .00; analytic: χ² = 3.7, df = 1, p = .05) revealed that the raters' grading experience influenced their grading to some extent.

Table 8
Holistic Scale Raters' Experience Measurement Report

Obsvd Score | Obsvd Count | Obsvd Average | Fair-M Average | Measure | Model S.E. | Infit MnSq | Infit ZStd | Outfit MnSq | Outfit ZStd | Num | Experience
1870 | 600 | 3.12 | 3.24 | .19 | .06 | .94 | -1.0 | .94 | -1.1 | 2 | Expert
2099 | 600 | 3.50 | 3.42 | -.19 | .06 | 1.10 | 1.6 | 1.11 | 1.8 | 1 | Novice
1984.5 | 600.0 | 3.31 | 3.33 | .00 | .06 | 1.02 | .3 | 1.02 | .4 | Mean (Count: 2)
114.5 | .0 | .19 | .09 | .19 | .00 | .08 | 1.4 | .09 | 1.5 | S.D. (Population)
Model, Populn: RMSE .06; Adj (True) S.D. .18; Separation 3.14; Reliability .91
Model, Fixed (all same) chi-square: 21.7; d.f.: 1; significance (probability): .00

Table 9
Analytic Scale Raters' Experience Measurement Report

Obsvd Score | Obsvd Count | Obsvd Average | Fair-M Average | Measure | Model S.E. | Infit MnSq | Infit ZStd | Outfit MnSq | Outfit ZStd | Num | Experience
2010 | 600 | 3.35 | 3.50 | .07 | .05 | .96 | -.5 | .97 | -.5 | 2 | Expert
2115 | 600 | 3.53 | 3.58 | -.07 | .05 | 1.07 | 1.1 | 1.05 | .9 | 1 | Novice
2062.5 | 600.0 | 3.44 | 3.54 | .00 | .05 | 1.02 | .3 | 1.01 | .2 | Mean (Count: 2)
52.5 | .0 | .09 | .04 | .07 | .00 | .05 | .9 | .04 | .7 | S.D. (Population)
Model, Populn: RMSE .05; Adj (True) S.D. .05; Separation .93; Reliability .46
Model, Fixed (all same) chi-square: 3.7; d.f.: 1; significance (probability): .05
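The "fixed (all same)" chi-square reported at the bottom of these tables can be approximated directly from the measures and standard errors shown above (a sketch of the usual homogeneity test for a facet; small discrepancies from the Facets output come from rounding in the tables):

# Sketch: approximate the Facets "fixed (all same)" chi-square, which tests
# whether all elements of a facet share one common measure.
def fixed_chi_square(measures, standard_errors):
    weights = [1 / se**2 for se in standard_errors]
    weighted_mean = sum(w * m for w, m in zip(weights, measures)) / sum(weights)
    chi_sq = sum(w * (m - weighted_mean) ** 2 for w, m in zip(weights, measures))
    return chi_sq, len(measures) - 1       # statistic and degrees of freedom

# Rater-experience facet, holistic scale (Table 8): measures .19 / -.19, S.E. .06
print(fixed_chi_square([0.19, -0.19], [0.06, 0.06]))   # ~ (20.1, 1); Facets reports 21.7
# Rater-experience facet, analytic scale (Table 9): measures .07 / -.07, S.E. .05
print(fixed_chi_square([0.07, -0.07], [0.05, 0.05]))   # ~ (3.9, 1); Facets reports 3.7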

Table 10 and Table 11 show item separation reliabilities of 0.95 and 0.97 respectively, which can be interpreted to mean that the variation in translation task difficulty and the sample size were ample, so that task difficulty as well as test-taker ability were precisely estimated in the present study. Furthermore, the Outfit mean-square values under both rating scales were all below 1.5, solid evidence that the rating quality was excellent under both rubrics. Task items 2 and 5 were the easiest, both at -0.45 logits under holistic rating and at -0.54 and -0.52 logits respectively under analytic rating. Task items 4 and 1 were moderate at 0.32 and -0.03 logits under holistic rating, and at 0.17 and 0.21 logits under analytic rating. Task item 3 was the most difficult, at 0.62 logits under holistic rating and 0.69 logits under analytic rating.

Table 10
Holistic Scale Task Measurement Report

Obsvd Score | Obsvd Count | Obsvd Average | Fair-M Average | Measure | Model S.E. | Infit MnSq | Infit ZStd | Outfit MnSq | Outfit ZStd | Num | Task
718 | 240 | 2.99 | 3.03 | .62 | .09 | .79 | -2.3 | .78 | -2.4 | 3 | 3
756 | 240 | 3.15 | 3.18 | .32 | .09 | 1.11 | 1.1 | 1.11 | 1.1 | 4 | 4
799 | 240 | 3.33 | 3.34 | -.03 | .09 | .79 | -2.3 | .81 | -2.2 | 1 | 1
848 | 240 | 3.53 | 3.55 | -.45 | .09 | 1.19 | 1.9 | 1.12 | 2.3 | 2 | 2
848 | 240 | 3.53 | 3.55 | -.45 | .09 | 1.23 | 2.3 | 1.19 | 2.0 | 5 | 5
793.8 | 240.0 | 3.31 | 3.33 | .00 | .09 | 1.02 | .1 | 1.02 | .2 | Mean (Count: 5)
51.1 | .0 | .21 | .20 | .42 | .00 | .19 | 2.1 | .19 | 2.1 | S.D. (Population)
Model, Populn: RMSE .09; Adj (True) S.D. .41; Separation 4.51; Reliability .95
Model, Fixed (all same) chi-square: 108.0; d.f.: 4; significance (probability): .00

Table 11
Analytic Scale Task Measurement Report

Obsvd Score | Obsvd Count | Obsvd Average | Fair-M Average | Measure | Model S.E. | Infit MnSq | Infit ZStd | Outfit MnSq | Outfit ZStd | Num | Task
720 | 240 | 3.00 | 3.10 | .69 | .08 | .93 | -.7 | .94 | -.6 | 3 | 3
798 | 240 | 3.33 | 3.41 | .21 | .08 | .72 | -3.2 | .75 | -2.9 | 1 | 1
804 | 240 | 3.35 | 3.44 | .17 | .08 | 1.15 | 1.5 | 1.12 | 1.2 | 4 | 4
900 | 240 | 3.75 | 3.83 | -.52 | .09 | 1.13 | 1.3 | 1.05 | .6 | 5 | 5
903 | 240 | 3.76 | 3.84 | -.54 | .09 | 1.22 | 2.2 | 1.19 | 1.9 | 2 | 2
825.0 | 240.0 | 3.44 | 3.52 | .00 | .08 | 1.03 | .2 | 1.01 | .1 | Mean (Count: 5)
69.1 | .0 | .29 | .28 | .47 | .00 | .18 | 2.0 | .15 | 1.7 | S.D. (Population)
Model, Populn: RMSE .08; Adj (True) S.D. .46; Separation 5.58; Reliability .97
Model, Fixed (all same) chi-square: 162.0; d.f.: 4; significance (probability): .00

The Chinese translation task number three was "藉由謙虛的態度與努力不懈的學習，他不但成功了，還獲得了成就感。", and the reference answer provided to the raters was "Through humble attitude and hardworking learning, he not only succeeded but also gained a sense of achievement." Table 12 displays the grammatical and lexical composition of translation task number three: it has a compound sentence structure, the tense should be past, and the topic is learning.

Table 12
The Grammatical and Syntactic Features of Translation Task Number 3

Task | Grammar | Structure | Word range | Tense | Topic
Number 3 | Not only…but also… | compound | L1~L2, L3(2), L4 | past | Learning

Differences in word level among the translation task items may be one reason for the different task difficulties experienced by test takers. As for translation task number three, most of its vocabulary falls within the first and second levels of the classification issued by the College Entrance Examination Center (CEEC). However, two words are at the third level (attitude and achievement) and one at the fourth level (learning), making translation task number three the most challenging item in this study.

Besides, translating impeccably requires mastery of both the source language and the target language. The longer the sentence in a translation test, and hence the more words to be processed, the greater the challenge becomes. Apart from word level and sentence length, syntactic issues may also influence the difficulty of translation tasks. Numerous test takers had trouble choosing a correct preposition for "藉由" at the very beginning of the sentence and consequently derailed the entire sentence. In addition, quite a few test takers were careless with the tense: it should have been past tense, but they wrote in the present. All of these grammatical and syntactic aspects made translation task number three slightly more challenging than the other items.

Comparing the grading severity of the novice and expert groups on this task shows that the expert raters were slightly more lenient than the novice raters in grading translation task number three. This does not suggest any fault on the part of either group; rather, it indicates slight distinctions between them. The rationale behind this phenomenon is worth further discussion even though the difference was not statistically significant. When grading, even though the raters were provided with a reference answer and rating criteria, they very likely encountered responses that did not match the reference answer but could still be correct. Under these circumstances, raters' prior experience and professional knowledge came into play (Glaser & Chi, 1988). Expert raters may be relatively flexible and open to varied responses as long as they preserve the original meaning of the translation task, whereas the novice raters exhibited comparatively less flexibility and tended to follow the rating criteria strictly. For harder translation tasks, some low-proficiency test takers received partial credit because raters felt sympathetic and tried to encourage them; this also explained
