Centrality/Extremism
2.3 Assumptions Underlying the Use of Latent Trait Models
2.3.1 Normative versus Criterion-referenced Contexts
Before discussing the assumptions these models make in detecting rater effects, it is essential first to draw a distinction that is central to this study. This distinction, rarely made explicit in the rater effects literature, is between rater deviations from
the ratings of a group or community of raters and rater deviations from the ideal of error-free measurement. Following Wolfe (2004), we can consider the former type as
relevant to normative-referenced settings and the latter type as relevant to criterion-referenced settings. In a normative-referenced setting, scores “derive their meaning
from their relative standing in the distribution of ratings” (Wolfe, 2004, p. 39). Rater deviations are understood as deviations from a consensus understanding of ability. In a criterion-referenced setting, examinee performances are evaluated against an articulated standard, and rater deviations represent errors in relation to that standard. Since a standard setting situation is clearly a criterion-referenced setting, the focus here will be on detecting rater deviations from the ideal of error-free measurement.
When the MFRM is used to detect rater effects, the frame of reference normally used is the ‘internal frame of reference’ constructed by the joint ratings of the raters themselves. When only this judge-generated or internal frame of reference is used, the only claim that can be unproblematically made is that the effects occur as
deviations from the ratings of the other members of the group. As just noted, this is
appropriate for normative-referenced settings. However, latent trait/MFRM indices have also been proposed for use in standard setting, a criterion-referenced setting (Eckes, 2009; Engelhard, 2007, 2009, 2011; Engelhard & Anderson, 1998; Engelhard & Cramer, 1997; Engelhard & Gordon, 2000; Engelhard & Stone, 1998). Indeed, as one of the advantages of the MFRM-based approach, Engelhard and Anderson explicitly state that it “does not require data from examinees to define indexes of
rating quality” (1998, p. 227, emphasis added; the authors do, however, immediately
add that “Estimates of item difficulties based on examinee data can be used to examine and validate” the results, a suggestion taken up in this study, as detailed below).
The further claim that ratings deviate from error-free measurement requires the critical assumption that “rater effects are randomly distributed and are exhibited by only a minority of raters in the pool” (Wolfe, 2004, p. 47). Put differently, “one must assume that the group of raters assigns, on average, scores that are unbiased” (Wolfe &
McVay, 2010, p. 6), or that no group-level rater effects exist. As Wolfe further emphasizes:
A very important topic that has received little attention in the literature relating to rater effects is the interpretive frame of reference within which rater effects are portrayed. A serious shortcoming of the methods described in this article is their reliance on an implicit assumption that rater effects are distributed in the pool of raters in a non-systematic manner. (Wolfe, 2004, p. 48)
A key purpose of this study is to investigate how plausible this assumption is in the context of an Angoff standard setting and to assess how robust the model is to violations of this assumption.
2.3.2 Frames of Reference: Internal versus External
The assumptions being evaluated in this study can be stated in terms of different measurement frames of reference. An internal frame of reference “depicts the characteristics of a particular rater in the context of the characteristics of the pool of raters of whom the rater is a member. To create a relative frame of reference, rating data from the pool of raters is scaled, and parameters are jointly estimated for examinees and raters” (Wolfe & McVay, 2010, p. 9). An external frame of reference
“depicts the characteristics of a particular rater in the context of the characteristics of scores that are external to the pool of raters of whom the rater is a member. These external scores could have been produced by a pool of expert raters, or the scores could be based on the examinees’ performance on an external test.” To construct an external frame of reference, “rating data from the pool of raters is scaled while fixing the characteristics (i.e., anchoring the parameters) of examinees on measures that are based on external scores” (Wolfe & McVay, 2010, p. 10).
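The contrast between the two frames can be sketched in code. The following toy simulation is a hedged illustration only: it uses a simple linear stand-in for a many-facet rating model rather than the MFRM itself, and all variable names, sample sizes, and parameter values are illustrative assumptions, not drawn from the study. It scales rater severities internally, from the pool’s own ratings, and externally, by anchoring examinee measures on known external values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data under a linear stand-in for a many-facet rating model:
#   x[i, j] = ability[i] - severity[j] + noise.
# Sizes and parameter values are illustrative assumptions.
n_examinees, n_raters = 500, 9
ability = rng.normal(0.0, 1.0, n_examinees)
severity = rng.normal(0.0, 0.5, n_raters)
x = (ability[:, None] - severity[None, :]
     + rng.normal(0.0, 0.2, (n_examinees, n_raters)))

# Internal frame: parameters are scaled jointly from the pool's own
# ratings, so severities are deviations from the pool mean and are
# centred at zero by construction.
internal_severity = x.mean() - x.mean(axis=0)

# External frame: examinee measures are anchored on externally known
# values (here the true abilities stand in for an external test).
external_severity = ability.mean() - x.mean(axis=0)

print(internal_severity.mean())                    # zero up to floating-point error
print(np.abs(external_severity - severity).max())  # small: anchoring recovers severity
```

In the unbiased case simulated here the two frames tell the same story; the difference between them only becomes consequential when the pool as a whole departs from the external standard.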
In an Angoff standard setting, the ability level of the just-proficient student set by the judges within the standard setting frame of reference is assumed to correspond to the location of such a student within the test frame of reference. The many-facet
Rasch model has been proposed for use in evaluating this assumption. However, the use of these models requires the further assumption that no group-level rater effects exist. This is so because the expected values against which the observed values are compared were generated from within the ‘internal’ frame of reference created by the raters or Angoff judges themselves. The claim that these indices can detect deviations from error-free measurement requires the assumption that the expected values from the internal frame of reference are the same as expected values from an external
‘error-free’ frame of reference.
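The consequence of violating this assumption can be made concrete with a small simulation (again a hedged sketch using a linear stand-in for the MFRM, with illustrative values): when every rater in the pool shares the same severity shift, the internal frame absorbs the shift into the examinee measures and the estimated severities still centre at zero, while anchoring on external values exposes the group-level effect:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative violation of the assumption: every rater in the pool
# is severe by the same amount (a group-level effect).
n_examinees, n_raters = 500, 9
shared_bias = 0.8          # illustrative group-level severity
ability = rng.normal(0.0, 1.0, n_examinees)
severity = shared_bias + rng.normal(0.0, 0.3, n_raters)
x = (ability[:, None] - severity[None, :]
     + rng.normal(0.0, 0.2, (n_examinees, n_raters)))

# Internal frame: expected values come from the pool itself, so the
# shared bias is absorbed into the examinee measures ...
internal_ability = x.mean(axis=1)
# ... and the estimated severities still centre at zero.
internal_severity = x.mean() - x.mean(axis=0)

# External frame: anchoring examinee measures on external values
# exposes the group-level shift.
external_severity = ability.mean() - x.mean(axis=0)

print(internal_severity.mean())             # zero: the shift is invisible internally
print(external_severity.mean())             # close to shared_bias
print((ability - internal_ability).mean())  # close to shared_bias: absorbed by examinees
```

The sketch shows why internal-frame fit indices cannot, on their own, detect a group-level effect: the expected values against which raters are compared already contain the shared shift.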
An Angoff standard setting differs from most rating situations in that an