Assumption of Latent Trait Models for Detecting Rater Effects

Latent trait models in general, and the many-facet Rasch model (MFRM) in particular, have come into widespread use for the detection of rater biases or effects. Typically, internal data alone (i.e., the ratings produced by the judges themselves) are used for this purpose.

A number of researchers have proposed using the MFRM for evaluating standard setting results, and the model's ability to detect biases based solely on internal data has been noted as an advantage (Engelhard & Anderson, 1998). However, this use of the model relies on the assumption that no group-level rater effects exist or, put differently, that the group-level data can be taken to represent error-free measurement.
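For reference, a standard formulation of the MFRM in its generic three-facet form is shown below (the precise parameterization used in the study may differ):

    \log \left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \delta_i - \alpha_j - \tau_k

where \theta_n is the location of person n, \delta_i the difficulty of item i, \alpha_j the severity of judge j, and \tau_k the threshold between rating categories k - 1 and k. Because the judge severities are identified only up to a constant (they are conventionally constrained to sum to zero), a bias shared by the entire panel is absorbed into the origin of the internal scale and is, by construction, invisible to an analysis conducted purely within that frame; this is the assumption at issue.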

If this were true, in an Angoff setting we would expect the results from the internal frame of reference to correspond closely to the results of the same analysis conducted within an external frame of reference constructed from the item response data of the original exam. As item response data were available for the current standard setting, it was possible to test this assumption directly. In the first part of the study, the estimates of the Angoff judges were used to construct an internal frame of reference, and the item difficulty parameter estimates from the original test administration were used to construct an external frame of reference. The ratings were analyzed separately in each frame using the MFRM, and the results were compared. The values of the indices for leniency/severity, accuracy, and centrality/extremity were found to differ across the two frames, and it was thus concluded that the key assumption underlying use of the MFRM did not hold in the present case.
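To make the distinction between the two frames concrete, the sketch below contrasts them on simulated data. It is a minimal illustration under invented assumptions, not the estimation actually carried out in the study (which would normally be done in dedicated MFRM software such as Facets); all names (ratings, beta_ext, and so on) are hypothetical, and judge severity is approximated as a mean logit deviation rather than estimated by the model.

    import numpy as np

    # Hypothetical setup: ratings[j, i] holds judge j's Angoff probability
    # estimate for item i; beta_ext[i] is item i's difficulty in logits
    # from the original administration (the external frame).
    rng = np.random.default_rng(0)
    J, I = 15, 40
    beta_ext = rng.normal(0.0, 1.0, I)
    judge_severity = rng.normal(0.3, 0.5, (J, 1))   # shared 0.3 + individual
    noise = rng.normal(0.0, 0.4, (J, I))
    ratings = 1.0 / (1.0 + np.exp(beta_ext + judge_severity + noise))

    def to_logit(p, eps=1e-3):
        """Map probability estimates onto the logit scale, clipping extremes."""
        p = np.clip(p, eps, 1.0 - eps)
        return np.log(p / (1.0 - p))

    # Internal frame: item difficulties implied by the panel's own ratings.
    beta_int = -to_logit(ratings).mean(axis=0)

    def severity(ratings, beta):
        """Crude severity index: mean shortfall of a judge's logit ratings
        below the value the frame expects (-beta). Higher = more severe."""
        return (-beta - to_logit(ratings)).mean(axis=1)

    sev_int = severity(ratings, beta_int)   # deviation from group consensus
    sev_ext = severity(ratings, beta_ext)   # deviation from operational scale
    print(sev_int.mean(), sev_ext.mean())   # ~0 internally; ~0.3 externally

The shared severity component (0.3 logits in this toy example) is invisible internally, because the internal frame's origin is defined by the panel itself, but it appears directly against the external frame.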

For leniency/severity, the differences were relatively minor, and only one additional judge was flagged in the external frame who had not been flagged in the internal frame. Note that the comparison between the internal and external frames was indirect since, of course, no external data were available for actual B1 students. (Making such a comparison would be possible using, e.g., the results of a standard setting conducted with a different method, but this was beyond the scope of the present study.)

For inaccuracy, there were marked differences both in the results and in the judges flagged for displaying the effect. Overall, the external frame of reference revealed more rater effects than the internal frame of reference, and the correlational indices in the external frame suggested generally less positive, or ‘optimistic’, views of judge performance.
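A correlational index of this kind can be sketched in the same toy setup (again, hypothetical code, not the study's actual statistic). One reason internal indices tend to look more ‘optimistic’ is part-whole overlap: each judge's ratings help define the very consensus they are then correlated against.

    def accuracy(ratings, beta):
        """Correlational accuracy index: per-judge correlation between
        logit ratings and item easiness (-beta). Lower = less accurate."""
        x = to_logit(ratings)
        return np.array([np.corrcoef(row, -beta)[0, 1] for row in x])

    acc_int = accuracy(ratings, beta_int)   # vs. the panel consensus
    acc_ext = accuracy(ratings, beta_ext)   # vs. operational difficulties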

For centrality/extremity, there were marked substantive differences, with a much larger number of judges flagged within the external frame of reference. Indeed, judges who were flagged for showing an extremity effect in the internal frame were actually found to show a centrality effect within the external frame.
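This reversal has a simple mechanical explanation: if the panel as a whole compresses the scale, the internally derived difficulties span a narrower range than the external ones, so a judge's slope measured against the internal scale is inflated relative to the same judge's slope measured externally. A slope-based sketch, continuing the hypothetical toy data (slopes below 1 suggest centrality, above 1 extremity):

    def centrality_slope(ratings, beta):
        """OLS slope of each judge's logit ratings on item easiness (-beta).
        Slope < 1 suggests centrality (compression); > 1 extremity."""
        x = to_logit(ratings)
        e = -beta - (-beta).mean()                 # centred easiness
        xc = x - x.mean(axis=1, keepdims=True)     # centred logit ratings
        return xc @ e / (e @ e)

    # Toy group-level compression: every judge shrinks their logits by 0.6.
    compressed = 1.0 / (1.0 + np.exp(-0.6 * to_logit(ratings)))
    beta_c = -to_logit(compressed).mean(axis=0)    # internal frame, rebuilt
    print(centrality_slope(compressed, beta_c))    # ~1.0: unremarkable internally
    print(centrality_slope(compressed, beta_ext))  # ~0.6: centrality externally

A judge whose personal slope on this compressed group scale is 1.3 (and so flagged as extreme internally) has an external slope of roughly 0.6 × 1.3 ≈ 0.8, i.e., centrality, which is exactly the kind of reversal observed.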

Overall, these results suggest that analysts need to carefully distinguish between normative (or ‘rater consensus’) situations and criterion-referenced situations.

Without an external frame of reference, latent-trait models for detecting rater effects can be used to make claims about the former type of situation, based on deviations from the ratings of the group of raters as a whole. However, the stronger, criterion-referenced type of claim requires considerable justification.

Group-level Effects. On a more positive note, the group-level effect indicators for centrality and inaccuracy within the internal frame of reference did suggest group-level effects, and analysis within the external frame confirmed that group-level effects did occur. This is of considerable importance, since in most rating situations there are no data available from which an external frame of reference might be constructed.
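Where external difficulties do exist, one direct way to confirm a group-level effect of this kind is to compare the dispersion of the two sets of item estimates. In the toy example above (a hypothetical check, not the index used in the study):

    # Ratio of internal to external spread of item difficulties:
    # values well below 1 indicate group-level centrality.
    print(beta_c.std() / beta_ext.std())   # ~0.6 for the compressed toy data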

Indices capable of detecting group-level rater effects could be used to evaluate the assumption of the model even in the absence of an external frame of reference. If no group-level rater effects are found, it is more likely that the assumption is met and that the judge frame of reference does not differ significantly from an external, ‘error-free’ frame of reference. On the other hand, finding group-level effects would serve to caution the analyst away from making criterion-referenced interpretations where only an internal frame of reference exists.

Assumptions of the Angoff Method

It was also found that the assumptions of the Angoff method appear to have been violated. Given previous findings in the literature, this was not particularly surprising.

In terms of the first assumption (representation of the BPS), it was difficult to make clear statements about how accurately the cut score was located, for the simple reason that there was no external standard for comparison. In terms of the second assumption, and in line with previous research, the present study showed that a high degree of inaccuracy was present in the first round of item estimates made by the judges. More optimistically, less inaccuracy was found in the listening test, suggesting that judges do become more accurate with experience. Finally, also in line with a number of earlier studies, raters were shown to have difficulty quantifying their decisions using the probability scale, and almost all raters displayed a central tendency bias, compressing the internal scale.
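For context, in probability-based Angoff procedures the recommended cut score is conventionally obtained by summing each judge's probability estimates over items and then averaging across the panel; with the hypothetical ratings array from the earlier sketches:

    cut_per_judge = ratings.sum(axis=1)   # each judge's implied raw cut score
    panel_cut = cut_per_judge.mean()      # panel-level recommended cut score

Because the cut score is an aggregate, central tendency in the item-level estimates may partly cancel out of it, which is one reason item-level indices of the kind examined here matter even when the resulting cut score looks reasonable.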

It was also observed that judges who were non-native English speakers (as opposed to native English speakers) and judges with administrative roles (as opposed to judges with only teaching roles) were more severe, more accurate, and less likely to display centrality. The sample was too small to draw firm conclusions, but it is tempting to speculate that non-native speakers might be better able to perform the task required in an Angoff study, since they have more direct experience of assessing the difficulty of target-language texts, compared with native speakers, for whom it may represent more of an abstract exercise.

5.2 Implications and Suggestions