
2.3 Assumption of the Use of Latent Trait Models

2.3.1 Normative versus Criterion-referenced Contexts

Before discussing the assumption these models make in detecting rater effects, it is essential to first make a distinction that is central to this study. This distinction, rarely made explicit in the rater effects literature, is between rater deviations from the ratings of a group or community of raters and rater deviations from the ideal of error-free measurement. Following Wolfe (2004), we can consider the former type as relevant to normative-referenced settings and the latter type as relevant to criterion-referenced settings. In a normative-referenced setting, scores “derive their meaning from their relative standing in the distribution of ratings” (Wolfe, 2004, p. 39), and rater deviations are understood as deviations from a consensus understanding of ability. In a criterion-referenced setting, examinee performances are evaluated against an articulated standard, and rater deviations represent errors in relation to that standard.

Since a standard setting situation is clearly a criterion-referenced setting, the focus here will be on detecting rater deviations from the ideal of error-free measurement.

When the MFRM is used to detect rater effects, the frame of reference normally used is the ‘internal frame of reference’ constructed by the joint ratings of the raters themselves. When only this judge-generated or internal frame of reference is used, the only claim that can be unproblematically made is that the effects occur as deviations from the ratings of the other members of the group. As just noted, this is appropriate for ‘normative-referenced settings.’ However, latent trait/MFRM indices have also been proposed for use in standard setting, a criterion-referenced setting (Eckes, 2009; Engelhard, 2007, 2009, 2011; Engelhard & Anderson, 1998; Engelhard & Cramer, 1997; Engelhard & Gordon, 2000; Engelhard & Stone, 1998). Indeed, as one of the advantages of the MFRM-based approach, Engelhard and Anderson explicitly state that it “does not require data from examinees to define indexes of rating quality” (1998, p. 227, emphasis added; the authors do, however, immediately add that “Estimates of item difficulties based on examinee data can be used to examine and validate” the results, a suggestion taken up in this study, as detailed below).
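For reference, one common way of writing the many-facet Rasch model for such rating data (the notation here is illustrative and follows Linacre-style rating scale parameterizations rather than any particular study) is

$$\log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \beta_i - \lambda_j - \tau_k,$$

where P_nijk is the probability that rater j assigns examinee n a rating in category k rather than k−1 on item i, θ_n is examinee ability, β_i is item difficulty, λ_j is rater severity, and τ_k is the threshold between categories k−1 and k. Because all of these parameters are estimated jointly from the raters’ own data, a rater’s severity λ_j is located only relative to the rest of the rater group.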

The further claim that ratings deviate from error-free measurement requires the critical assumption that “rater effects are randomly distributed and are exhibited by only a minority of raters in the pool” (Wolfe, 2004, p. 47). Put differently, “one must assume that the group of raters assigns, on average, scores that are unbiased” (Wolfe & McVay, 2010, p. 6), or that no group-level rater effects exist. As Wolfe further emphasizes:

A very important topic that has received little attention in the literature relating to rater effects is the interpretive frame of reference within which rater effects are portrayed. A serious shortcoming of the methods described in this article is their reliance on an implicit assumption that rater effects are distributed in the pool of raters in a non-systematic manner. (Wolfe, 2004, p. 48)

A key purpose of this study is to investigate how plausible this assumption is in the context of an Angoff standard setting and to assess how robust the model is to violations of this assumption.

2.3.2 Frames of Reference: Internal versus External

The assumptions being evaluated in this study can be stated in terms of different measurement frames of reference. An internal frame of reference “depicts the characteristics of a particular rater in the context of the characteristics of the pool of raters of whom the rater is a member. To create a relative frame of reference, rating data from the pool of raters is scaled, and parameters are jointly estimated for examinees and raters” (Wolfe & McVay, 2010, p. 9). An external frame of reference “depicts the characteristics of a particular rater in the context of the characteristics of scores that are external to the pool of raters of whom the rater is a member. These external scores could have been produced by a pool of expert raters, or the scores could be based on the examinees’ performance on an external test.” To construct an external frame of reference, “rating data from the pool of raters is scaled while fixing the characteristics (i.e., anchoring the parameters) of examinees on measures that are based on external scores” (Wolfe & McVay, 2010, p. 10).
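To make the operational difference concrete, the following sketch contrasts the two frames of reference under a deliberately simplified model: dichotomous ratings, a Rasch-type model with only examinee ability and rater severity facets, and simulated data. It is illustrative only and is not the estimation procedure used in this study; the variable names and the use of scipy are assumptions of the sketch.

```python
# Illustrative sketch: internal vs. external frames of reference for rater parameters.
# Simplified model: P(rating = 1) = logistic(theta_n - lambda_j), where theta_n is
# examinee ability and lambda_j is rater severity. All names and data are hypothetical.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)

n_examinees, n_raters = 100, 8
true_theta = rng.normal(0.0, 1.0, n_examinees)
true_lambda = rng.normal(0.5, 0.5, n_raters)          # a group-level severity effect
ratings = rng.binomial(1, expit(true_theta[:, None] - true_lambda[None, :]))

def neg_log_lik(params, anchored_theta=None):
    """Negative log-likelihood. If anchored_theta is supplied, only rater
    severities are free (external frame); otherwise examinee and rater
    parameters are estimated jointly (internal frame)."""
    if anchored_theta is None:
        theta, lam = params[:n_examinees], params[n_examinees:]
    else:
        theta, lam = anchored_theta, params
    eta = theta[:, None] - lam[None, :]
    return -(ratings * eta - np.log1p(np.exp(eta))).sum()

# Internal frame: joint estimation from the raters' own data.
res_int = minimize(neg_log_lik, np.zeros(n_examinees + n_raters), method="L-BFGS-B")
lam_internal = res_int.x[n_examinees:]
lam_internal = lam_internal - lam_internal.mean()     # severities relative to the rater group

# External frame: examinee parameters anchored at external (here, simulated true) measures.
res_ext = minimize(neg_log_lik, np.zeros(n_raters), args=(true_theta,), method="L-BFGS-B")
lam_external = res_ext.x

print("simulated mean severity:        ", round(true_lambda.mean(), 3))
print("mean severity, internal frame:  ", round(lam_internal.mean(), 3))   # 0 by construction
print("mean severity, external frame:  ", round(lam_external.mean(), 3))   # near the simulated value
```

The point of the contrast is the one made above: when parameters are estimated jointly from the raters’ own data, any severity (or other effect) shared by the whole group is absorbed into the scale and cannot be seen, whereas anchoring examinee parameters at external values leaves such group-level effects visible.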

In an Angoff standard setting, the ability level of the just-proficient student set by the judges within the standard setting frame of reference is assumed to correspond to the location of such a student within the test frame of reference. The many-facet Rasch model has been proposed for use in evaluating this assumption. However, the use of these models requires the further assumption that no group-level rater effects exist. This is so because the expected values against which the observed values are compared are generated from within the ‘internal’ frame of reference created by the raters or Angoff judges themselves. The claim that these indices can detect deviations from error-free measurement therefore requires the assumption that the expected values from the internal frame of reference are the same as the expected values from an external ‘error-free’ frame of reference.
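This dependence on expected values is easiest to see in the residual-based fit statistics typically used as MFRM rater-effect indices. In one common (illustrative) formulation, the standardized residual and the information-weighted (infit) mean square for rater j are

$$z_{nij} = \frac{x_{nij} - E[x_{nij}]}{\sqrt{\operatorname{Var}(x_{nij})}}, \qquad \text{Infit}_j = \frac{\sum_{n,i}\left(x_{nij} - E[x_{nij}]\right)^2}{\sum_{n,i}\operatorname{Var}(x_{nij})},$$

where the expectations and variances are computed from the estimated model parameters. Whether those parameters come from the internal frame or from an external, anchored frame is precisely what determines which deviations such indices can detect.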

An Angoff standard setting differs from most rating situations in that an external frame of reference often does exist. The test items are scaled within both frames of reference: the external frame of reference resulting from the administration of the original test, and the internal frame of reference constructed from the judges’ estimates for each of the items. As the task of Angoff judges is to consider how a just-proficient student from the same student population as the actual examinees would perform on the items, it seems reasonable to use results from the administration of the exam to construct the external frame of reference. In fact, in the form of correlations with empirical p-values (Cizek & Bunch, 2007) or indices such as van der Linden’s consistency index (van der Linden, 1982), results from the test frame of reference are commonly used to evaluate the performance of Angoff judges. In this study, results from the external frame are additionally used to examine the central assumption underlying the use of the MFRM for detecting rater effects.
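As an illustration of the first of these checks, the sketch below correlates each judge’s Angoff item estimates with the items’ empirical p-values. The data are hypothetical and generated only to show the form of the computation; this is not the analysis reported in this study.

```python
# Illustrative sketch: correlating Angoff judges' item estimates with empirical
# p-values from the test administration (cf. Cizek & Bunch, 2007). Hypothetical data.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

n_items, n_judges = 40, 10
p_values = rng.uniform(0.3, 0.9, n_items)             # empirical item p-values
# Judges' Angoff estimates: probability that a just-proficient examinee answers
# each item correctly; simulated as p-values shifted down plus judge-specific noise.
angoff = np.clip(p_values - 0.15 + rng.normal(0.0, 0.08, (n_judges, n_items)), 0.05, 0.95)

for j in range(n_judges):
    r, _ = pearsonr(angoff[j], p_values)
    print(f"Judge {j + 1:2d}: correlation with empirical p-values = {r:.2f}")
```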