Leniency and severity effects manifest as scores that are consistently lower or higher than those assigned by other raters. MFRM indices rely primarily on the estimated severity measures for the judges and on the separation statistics. For detecting whether individual judges were lenient or severe in relation to the group, a number of indicators are available.

1. Mean scores. Directly comparing the mean scores of the ratings assigned by each judge is the standard indicator within a raw score framework.

Within the MFRM framework, a number of further indicators exist.

2. Judge severity measures. Leniency and severity can be examined directly by comparing the values for the different judges on the judge severity parameter, λr. (In an Angoff standard setting, where the only ‘examinee’ is the BPS as imagined by the different judges, the severity parameter can be omitted and judge severity would appear as different values for the βn parameter, representing the location of the cut score on the latent variable.)

3. Fixed chi-square test of the hypothesis that the judges share the same level of severity. A significant difference would indicate that at least two judges differed in severity.

4. Follow-up t-tests. Significant findings on the above chi-square test can be followed up with t-tests between pairs of judges, using the judge severity measures and associated standard errors to determine whether the two judges differ significantly in their displayed levels of severity (see the sketch following this list).

5. Judge separation ratio. This ratio expresses the spread of the judge severity measures relative to their precision.

6. Judge separation index. This index indicates the number of statistically distinct severity levels among the raters.

7. Reliability of the judge separation index. This measures the reliability with which the judges have been separated. A value of 0.0 would indicate that the panelists were exchangeable, while higher values indicate that the judges were reliably separated in terms of their severity.
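As a concrete illustration of indicators 4 through 7, the sketch below computes pairwise t-statistics and the separation statistics from a set of judge severity measures and standard errors. The judge labels and values are hypothetical, and the formulas are the commonly cited Rasch definitions (separation ratio G as the error-corrected spread of the measures relative to their average measurement error, strata index H = (4G + 1)/3, and reliability as the ratio of ‘true’ to observed variance); the exact values reported by a given MFRM program may differ slightly.

```python
import numpy as np
from itertools import combinations

# Hypothetical judge severity measures (logits) and standard errors,
# such as an MFRM analysis would report; all values are illustrative.
judges   = ["J1", "J2", "J3", "J4", "J5"]
severity = np.array([-0.62, -0.15, 0.08, 0.21, 0.48])
se       = np.array([ 0.11,  0.10, 0.12, 0.10, 0.13])

# 4. Pairwise t-statistics for differences in severity between judges.
for i, j in combinations(range(len(judges)), 2):
    t = (severity[i] - severity[j]) / np.sqrt(se[i]**2 + se[j]**2)
    if abs(t) > 2.0:  # rough two-sided flagging criterion
        print(f"{judges[i]} vs {judges[j]}: t = {t:.2f}")

# 5.-7. Separation statistics (commonly cited Rasch formulas).
obs_var  = severity.var()            # observed variance of the severity measures
mse      = np.mean(se**2)            # mean square measurement error
true_var = max(obs_var - mse, 0.0)   # error-corrected ('true') variance
G = np.sqrt(true_var / mse)          # judge separation ratio
H = (4 * G + 1) / 3                  # judge separation (strata) index
R = true_var / obs_var               # reliability of judge separation
print(f"G = {G:.2f}, H = {H:.2f}, reliability = {R:.2f}")
```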

There are no agreed-upon criteria for the above indices. Their value lies in providing information about the degree to which rater severity levels diverged. Actual interpretation remains largely a matter of judgment. In a standard setting, ‘interchangeability’ of judges is not normally expected. While all of the judges are subject-area experts, they are also chosen to represent diverse backgrounds and may be expected to come to different but defensible interpretations of the performance level descriptors which articulate the standard. It thus becomes a question of judgment on the part of those evaluating the judges’ performance as to how much difference is acceptable. Myford & Wolfe (2004b) suggest using t-tests to identify judges whose measures differ significantly from one another. Wolfe (2004) flags raters who differ significantly from the group mean. Given the above consideration concerning standard setting judges, another approach would be to define ‘problem judges’ as those who are at least 2 standard errors (SEs) in distance from any members of the main cluster of judges.
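One possible operationalization of the ‘2 SE’ rule just described is sketched below: a judge is flagged when his or her severity measure is not within two joint standard errors of any other judge’s measure. Both this exact reading of the rule and the values used are assumptions made for illustration.

```python
import numpy as np

# Hypothetical severity measures (logits) and standard errors; J5 is placed
# apart from the others to show how an isolated judge would be flagged.
judges   = ["J1", "J2", "J3", "J4", "J5"]
severity = np.array([-0.25, -0.10, 0.05, 0.20, 1.10])
se       = np.array([ 0.11,  0.10, 0.12, 0.10, 0.13])

for i, name in enumerate(judges):
    # A judge is 'near' another judge if the two measures differ by less
    # than twice their joint standard error.
    near_someone = any(
        abs(severity[i] - severity[j]) < 2 * np.sqrt(se[i]**2 + se[j]**2)
        for j in range(len(judges)) if j != i
    )
    if not near_someone:
        print(f"{name} sits apart from the main cluster (possible problem judge)")
```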

For leniency/severity, there are no clear indicators of group-level effects, which would indicate when most or all of the members of the group were displaying leniency or severity. The only indicators available to detect group-level leniency/severity are the group-level category usage statistics. The problem with attempting to use these to identify group-level effects in an Angoff standard-setting is that it presupposes some prior expectation concerning which categories should be used.

Inaccuracy

Inaccurate ratings are typically diagnosed through correlations and patterns in statistical indicators that are based on residuals.

Two raw-score indices are used here.

1. Raw-score correlations. When scores from an external framework are available, as is often the case in operational Angoff standard settings, each judge’s ratings can be correlated with them. The critical value of the correlation coefficient can be used to flag problematic raters.

2. Single Rater/Rest of Rater (SR/ROR) Correlations. Inter-rater correlations are often used when no external scores are available. The Facets software package calculates this raw score statistic. The critical value of the correlation coefficient can be used to flag problematic raters.
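Although Facets reports the SR/ROR statistic, it can also be computed directly, which may help when checking or replicating the reported values. The sketch below assumes a simple judges-by-items matrix of ratings; the matrix, and the judge whose pattern runs against the group, are hypothetical.

```python
import numpy as np

# Hypothetical ratings matrix: rows are judges, columns are items.
# The last judge's pattern runs against the group to show a low SR/ROR value.
ratings = np.array([
    [0.60, 0.45, 0.70, 0.30, 0.55, 0.40],
    [0.55, 0.50, 0.65, 0.35, 0.60, 0.45],
    [0.70, 0.40, 0.75, 0.25, 0.50, 0.35],
    [0.20, 0.80, 0.30, 0.70, 0.25, 0.75],
])

# Single Rater / Rest of Raters correlation: each judge's ratings are
# correlated with the mean rating of all remaining judges on the same items.
for i in range(ratings.shape[0]):
    rest_mean = np.delete(ratings, i, axis=0).mean(axis=0)
    r = np.corrcoef(ratings[i], rest_mean)[0, 1]
    print(f"Judge {i + 1}: SR/ROR r = {r:.2f}")
```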

In addition to these raw score statistics, four MFRM indices have been proposed for use in investigating individual level inaccuracy effects.

3. Point-measure correlation. This is the correlation between the scores assigned to a group of ratees (items) by a particular judge and the Rasch parameter estimates or measures for the same ratees. Low consistency between the assigned scores and the measures should be reflected in a low correlation. The critical value of the correlation coefficient can be used to flag problematic raters.

4. Score-expected correlations. The Facets software program generates an expected score for each rater-ratee interaction. A low correlation between observed and expected scores would indicate inaccuracy. The critical value of the correlation coefficient can be used to flag problematic raters.

5. Standard deviation of the residuals. Accurate ratings would result in small, randomly distributed residuals. A large standard deviation of the residuals would thus indicate inaccuracy. Wolfe (2004) ‘arbitrarily’ defined large as 1.25 and small as 0.75 (see the sketch following this list).

6. Judge fit statistics. These should be sensitive to rater inaccuracy. For mean square fit indicators, values above 1.4 are typically used to flag raters for misfit. For standardized fit statistics, values above 2.0 are used.
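As an illustration of indicators 5 and 6, the sketch below computes the standard deviation of a judge’s standardized residuals and the judge’s infit and outfit mean squares. It assumes that the observed score, the model-expected score, and the model variance are available for each of the judge’s ratings (quantities an MFRM program such as Facets reports); the values are illustrative, and reading ‘standard deviation of the residuals’ as referring to standardized residuals is an assumption consistent with the 1.25/0.75 thresholds cited above.

```python
import numpy as np

# One judge's ratings: observed scores, model-expected scores, and model
# variances for each rating. All values are illustrative.
observed = np.array([3, 2, 4, 1, 3, 2], dtype=float)
expected = np.array([2.6, 2.1, 3.5, 1.4, 2.9, 2.3])
variance = np.array([0.8, 0.7, 0.6, 0.5, 0.9, 0.7])

residual = observed - expected
z = residual / np.sqrt(variance)                     # standardized residuals

sd_resid  = z.std()                                  # 5. SD of the standardized residuals
outfit_ms = np.mean(z**2)                            # 6. outfit (unweighted) mean square
infit_ms  = np.sum(residual**2) / np.sum(variance)   # 6. infit (information-weighted) mean square

print(f"SD of standardized residuals = {sd_resid:.2f}")
print(f"Outfit MS = {outfit_ms:.2f}, Infit MS = {infit_ms:.2f}")
```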

Indices have also been proposed for detecting group-level effects.

7. Item separation statistics. Myford & Wolfe (2004) suggested examining the item separation statistics for the ratees for evidence that the raters or judges did not effectively discriminate between or ‘separate’ the ratees: a non-significant fixed chi-square test of the hypothesis that the items share the same measure, a low item separation ratio, a low item separation index, or a low reliability of the item separation index would each suggest a group-level effect.