
Many-Facet Rasch Measurement Model (MFRM)

In the area of performance assessment, such as writing and speaking tests, several factors, or facets, interact simultaneously, such as test takers, items, and raters. Candidates with different backgrounds, proficiency levels, or genders may perform differently on a performance assessment because of those characteristics alone. Raters, tasks, and other relevant aspects of the exam setting are also likely to introduce variability into grades and to affect the interpretation of a candidate's true ability. Because of these multiple sources of variability, a stable and fair estimate of how well learners can manage the relevant tasks is necessary. For these reasons, the Many-Facet Rasch measurement model is an important tool to compensate for the deficits of raw scores: it can map the relative ability of candidates, the relative difficulty of items, and the severity of raters on a common logit scale, and it can predict a candidate's odds of receiving a given rating from a rater of given leniency on a given item. This is especially the case in performance tests, where valid and reliable scoring is crucial because scores are often influenced by multiple facets simultaneously and interactively (Davies, 1999; Bachman, 2004; McNamara, 1996).

The Many-Facet Rasch measurement model (Linacre, 1996) is an extension of the basic Rasch measurement model (Wu, 2010) and can be used to examine dichotomous data whose two values carry more than nominal meaning: the value of 1 is meaningfully greater than the value of 0, not merely different from it, so the model can show that one value is superior to the other. In addition, the Rasch measurement model converts raw scores into measurable, linear, and reproducible measurements. From a Rasch analysis, researchers obtain separable person and item parameters, the statistics for those parameters, and conjoint additivity; these features allow objective comparisons of persons and items, and each parameter can be conditioned out of the estimation procedure for the others (Brentari & Golia, 2007). MFRM expands the basic Rasch measurement model so that researchers can place judge severity, person ability, and item difficulty together on the same logit scale for comparison. Not only can explicit factors be examined with this model, but implicit factors, such as test taker or rater characteristics, can also be checked from the calibration of a Rasch map (Figure 1).
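In its standard form, the basic dichotomous Rasch model described here expresses the probability of success as a function only of the difference between person ability and item difficulty, where B_n denotes the ability of person n and D_i the difficulty of item i (the usual Rasch notation):

P(X_ni = 1) = exp(B_n − D_i) / [1 + exp(B_n − D_i)],  or equivalently  log[P(X_ni = 1) / P(X_ni = 0)] = B_n − D_i

A difference of zero logits therefore corresponds to a 50% chance of success, and each additional logit of ability multiplies the odds of success by a factor of about 2.72.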

In the past decades, many measurement models have been proposed for writing assessment, such as the Poisson Process model (Andrich, 1973) and the Partial Credit model (Masters, 1982), and De Gruijter (1984) proposed two models for rater effects, one additive and the other nonlinear. However, most of these were designed to measure only two facets (Engelhard, 1992). The many-facet Rasch measurement model proposed by Linacre in 1989 can examine many facets and check the relationships among these factors simultaneously. Although the Rasch measurement model accounts for only one parameter, it can still be used to analyze rating scales and items that are given partial credit (Wright & Masters, 1982).


Figure 1. Facets Variable Map (Bond & Fox, 2007, p. 55).

The two basic assumptions of the Rasch measurement model that must be understood are unidimensionality and local independence. Unidimensionality means that only one attribute is assessed at a time and that all test items measure a single construct. Every question thus contributes to the measure of one attribute, and the estimates of person ability and item difficulty in the data matrix are taken into account. Unidimensionality is common in educational testing and holds when a single factor or trait in a test can explain a large proportion of the total test score variance (Bond & Fox, 2007).

Within the framework of unidimensionality, the concept of local independence is also crucial to the workings of the Rasch measurement model. Local independence assumes that item responses are independent of one another given a subject's latent trait value. The latent trait concerns whether the data adhere to a single underlying construct, a straight measurement line. Each item has an independent value, and test takers' responses to the items are independent of each other. The many-facet Rasch measurement model can therefore map the ability of test takers, the relative difficulty of items, and the severity of raters on a logit scale and form a model to predict a candidate's odds of a given rating from a rater of given leniency on a given item (McNamara, 1996).

The fit analysis from MFRM provides investigators with information for making interrelated decisions about the data, because the fit statistics show how well each item fits the underlying construct that the indicators are intended to define. Any aberrant performance is highlighted by the fit statistics, and an item that does not fit the construct should be rewritten, replaced, or expressed in some other way (Bond & Fox, 2007). Relevant studies have also shown that, with fit analysis, test developers can identify items or raters that do not fit the model, by either overfitting or misfitting, and decide which item may need to be revised or which rater may need to be retrained or replaced (Bachman, 2004; McNamara, 1996). The fit statistics can therefore summarize and identify variations from raters according to the expectations of the model (McNamara, 1996). At the same time, researchers can check the quality of the test items and monitor them through the fit statistics.

Rasch analysis also provides two fit indices, infit and outfit. Outfit is based on the unweighted sum of the squared standardized residuals and is sensitive to outlying, unexpected ratings, whereas infit weights each squared standardized residual by its model variance and is therefore more sensitive to unexpected responses close to the expected value. Infit and outfit mean squares are widely used by researchers because they are less sensitive to sample size and their values are determined by examinees' response patterns, unlike standardized fit statistics (Smith, Schumacker, & Bush, 1995).
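As a concrete illustration of how the two indices are built from the residuals, the sketch below computes outfit and infit mean squares for a handful of ratings; the observed, expected, and variance values are invented for illustration, and the formulas follow the standard definitions (unweighted versus information-weighted means of squared standardized residuals), not the exact computations of any particular software package.

import numpy as np

# Hypothetical observed ratings, model-expected ratings, and model variances
# for one item across six examinees (all values are illustrative only).
observed = np.array([3, 4, 2, 5, 1, 4])
expected = np.array([3.2, 3.8, 2.5, 4.1, 2.0, 3.9])
variance = np.array([0.9, 0.8, 1.0, 0.7, 0.9, 0.8])

# Standardized residuals: distance of each rating from its expectation,
# in model standard-deviation units.
z = (observed - expected) / np.sqrt(variance)

# Outfit mean square: unweighted mean of squared standardized residuals,
# so a single outlying rating can inflate it sharply.
outfit_ms = np.mean(z ** 2)

# Infit mean square: squared residuals weighted by the model variance, so
# well-targeted (information-rich) observations dominate and distant
# outliers carry less weight.
infit_ms = np.sum(variance * z ** 2) / np.sum(variance)

print(f"outfit MS = {outfit_ms:.2f}, infit MS = {infit_ms:.2f}")

Values near 1.0 indicate ratings that vary about as much as the model predicts, while values well above or below 1.0 flag misfit or overfit, respectively.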

In summary, MFRM is a fundamental measurement model that can be used across similar measurement situations to estimate the properties of persons and tests. The Rasch model also yields a table of expected probabilities of success, based on the difference between the ability of the person and the difficulty of the item. Since these probabilistic estimates are based on test performance, researchers can infer every test taker's real ability from his or her performance and arrive at the best interpretation of each exam.

Another feature of the Rasch model is that it can order examinees according to their ability and order items according to their difficulty. The theory of conjoint measurement applies when the levels of one attribute increase along with increases in the values of two other attributes (Bond & Fox, 2007).

In performance assessment, raters play an important role in deciding candidates' ability, and MFRM can also be used to safeguard the fairness of writing assessment scores. Engelhard (1992) added that MFRM is an objective and fair tool for measuring writing ability. With raw scores alone, writing ability may be over- or underestimated. Raters differ in severity, which can lead to different ratings even when the same student is graded. He also pointed out that even with rater training beforehand, some differences among raters' grading will remain, especially in high-stakes tests. Using MFRM to adjust the raw score is therefore necessary, because raw scores can easily lead to misleading outcomes. Given that many sources of variability in the test may influence the outcome for a candidate, the raw score is not a reliable guide to candidate ability (McNamara, 1996). The measures generated by the Rasch measurement model, which accounts for the various rating conditions in which examinees were placed, are more stable than raw scores. To achieve higher item reliability and stabilize item estimates, testing more people or including items with a wider difficulty range may help (Rasch Measurement Software and Publications, 2010).

The measurement model underlying the writing assessment used here is presented graphically in Figure 2 (Engelhard, 1992, p. 174). Figure 2 shows the conceptual model for a prototypical performance assessment of writing ability. The observed rating is the dependent variable in the model. The three major facets that define the intervening variables used to make the latent variable observable are rater severity, domain difficulty, and task difficulty. The fourth facet is writing ability. An important intervening component is also the structure of the rating scale.

Figure 2. Measurement Model for the Writing Assessment (Engelhard, 1992, p. 174).

The MFRM model that reflects the conceptual model of writing ability takes the following general form (Engelhard, 2002, p. 175):

log[P_nijmk / P_nijm(k-1)] = B_n − T_i − R_j − D_m − F_k

where

P_nijmk = probability of student n being rated k on writing task i by rater j for domain m
P_nijm(k-1) = probability of student n being rated k−1 on writing task i by rater j for domain m
B_n = writing ability of student n
T_i = difficulty of writing task i
R_j = severity of rater j
D_m = difficulty of domain m
F_k = difficulty of rating step k relative to step k−1
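To make the working of this formula concrete, the short sketch below computes the probability of each rating category implied by the model by chaining the adjacent-category odds; every parameter value in it is hypothetical and chosen only for illustration, not estimated from any data in this study.

import math

def category_probabilities(B, T, R, D, F):
    # B: writing ability of the student, T: task difficulty, R: rater severity,
    # D: domain difficulty (all in logits); F: list of step difficulties
    # F_1..F_K for a rating scale with categories 0..K.
    psi = B - T - R - D
    # Category 0 serves as the reference category (empty sum of steps).
    log_numerators = [0.0]
    for F_k in F:
        # log P_k - log P_(k-1) = psi - F_k, so accumulate the differences.
        log_numerators.append(log_numerators[-1] + psi - F_k)
    denom = sum(math.exp(v) for v in log_numerators)
    return [math.exp(v) / denom for v in log_numerators]

# Hypothetical example: a fairly able student on an average task, rated by a
# slightly severe rater, on a 0-4 scale with four step difficulties.
probs = category_probabilities(B=1.2, T=0.0, R=0.3, D=-0.2, F=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])

The probabilities sum to one, and raising the rater severity R or the task difficulty T shifts probability mass toward the lower categories, which is the kind of adjustment MFRM exploits when it compensates for differences among raters.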

An outstanding feature of MFRM is that it can account for the true variability between raters. Traditionally, rater effects are examined through interrater reliability, whose purpose is to check the extent to which rater behavior is consistent. However, interrater reliability fails to pinpoint raters' individual differences in the severity or leniency with which they assign scores to examinees' performances (Bond & Fox, 2007).

MFRM adjusts for rater variability and offers a clearer picture of ability. In the words of Bond and Fox, the problem with intercorrelations between raters' grades is that they only indicate consistency in the rank ordering of candidates; severity or leniency differences between judges are not highlighted. MFRM, however, can model the measurement relationship between raters and thereby ensure that raters' ratings are consistent (Bond & Fox, 2007). Interrater reliability alone is therefore not sufficient to give a comprehensive picture of raters' behavior.

Interactions between particular raters and particular conditions of each facet of interest are an additional feature of the many-facet Rasch measurement model. Interactions between facets might bias the assessment of writing ability, and MFRM can be used to examine this potential problem. There may, for example, be an interaction between a rater and some other aspect of the rating situation. The identification of these systematic sub-patterns is achieved in many-facet Rasch measurement through the so-called bias analysis, the study of interaction effects. MFRM compares expected and observed values in a set of data, and the differences between expected and observed values are called residuals. Further analysis of the residuals can show whether there are any sub-patterns, such as systematic interactions between or within groups (McNamara, 1996). In this study, the FACETS program (Linacre, 2006) is applied.
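As a simplified illustration of the logic behind such a bias analysis (not the actual algorithm of the FACETS program), the sketch below computes residuals as observed minus expected ratings and averages them within rater-by-task cells; the data are hypothetical, and cells whose mean residual departs markedly from zero would be candidates for a rater-task interaction.

import pandas as pd

# Hypothetical long-format ratings: one row per student x task x rater, with
# the observed rating and the rating expected under the fitted MFRM.
ratings = pd.DataFrame({
    "rater":    ["R1", "R1", "R2", "R2", "R1", "R2", "R1", "R2"],
    "task":     ["T1", "T2", "T1", "T2", "T1", "T2", "T2", "T1"],
    "observed": [4, 3, 2, 5, 4, 4, 3, 3],
    "expected": [3.6, 3.4, 2.5, 4.2, 3.3, 4.1, 3.2, 2.8],
})

# Residual = observed - expected; a mean residual far from zero within a
# rater-by-task cell points to a systematic sub-pattern (potential bias).
ratings["residual"] = ratings["observed"] - ratings["expected"]
bias_table = ratings.groupby(["rater", "task"])["residual"].agg(["mean", "count"])
print(bias_table.sort_values("mean"))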


CHAPTER THREE

METHOD

The present study investigates how raters' characteristics affect the grading of translation items. In line with this main goal, other factors, such as test takers' ability and item difficulty, are also examined simultaneously. The research methodology employed in this study was quantitative, applying the Many-Facet Rasch Measurement Model to examine multiple factors.

Participants

The participants in this study were 225 Taiwanese third-year senior high school students from northern Taiwan. According to the English education policy in Taiwan, English instruction officially begins in the third year of elementary school. Therefore, most of these participants had received at least eight years of formal English instruction in elementary and junior high school.

The study examined the effects of raters' characteristics, in terms of expert versus novice, on the scores of revised translation items from the Advanced Subject Tests and the General Scholastic Ability Tests. Three expert raters and three novice raters each scored the translation items analytically according to the rating criteria. The expert raters were teachers from different universities who already had rating experience and had received formal rater training in assessment. They each had more than five years of teaching experience and good reputations in English teaching. The novice raters, on the other hand, were graduate students who aimed to become junior or senior high school English teachers and who had no prior training or experience in translation rating.
