
Listening. The item separation statistics for listening in Table 4.10 show an even smaller degree of separation than did the reading results and are also consistent with a

4.1.3 Centrality/Extremity

Individual-level Effects

The residual-expected and residual-measure correlations yielded essentially identical results, as did the residual-based indices (standard deviation of the residuals, fit statistics). Thus, to simplify the presentation of results, only results for the residual-expected correlations, infit mean square statistics and the raw score standard deviations are shown below. Results for all indicators can be found in Appendix D.

Values for all indices also appear in the correlation matrices below, which indicate the high level of agreement found.

Reading. Table 4.11 displays the standard deviation of ‘raw score’ estimates and the MFRM indices of centrality for each judge. In the internal frame, the residual-expected correlations flag judges J02, J14, J15, J16 and J18 for displaying extremity effects and judges J05, J08, J10 and J12 for centrality. The infit mean square values indicate overfit to the model for J01, J02, J05, J08, J09, J10, J13, J14 and J17. No judges were flagged for misfit.

Results differ markedly in the external framework. All judges have negative correlations, suggesting centrality, and the correlations are significant at the .05 level for all judges except for J03 and J18. Looking at the infit mean square values, all judges have values above 1.0, and only J13 is not flagged for misfit or extremity.

Table 4.11. Indices of Centrality/Extremity for Reading, Internal v. External

                        Raw Score    r(exp,res)            InFit Mean Square
Judge                   SD           Internal   External   Internal   External
J01                     11.9          0.08      -0.62*     0.56*      1.87**
J02                     15.4          0.36**    -0.55*     0.39*      1.55**
J03                     17.8          0.24      -0.30      1.11       1.89**
J04                     11.2         -0.31      -0.72*     0.63       1.77**
J05                      9.6         -0.51*     -0.84*     0.42*      1.64**
J06                     13.7         -0.19      -0.70*     0.70       2.36**
J07                     14.7         -0.31      -0.76*     1.01       3.58**
J08                      9.8         -0.39*     -0.77*     0.56*      1.73**
J09                     12.2         -0.14      -0.71*     0.43*      1.66**
J10                      8.3         -0.62*     -0.88*     0.55*      2.33**
J11                     13.2         -0.28      -0.77*     0.69       2.78**
J12                     11.9         -0.32*     -0.75*     0.61       2.03**
J13                     10.7         -0.20      -0.73*     0.35*      1.05
J14                     16.3          0.39**    -0.45*     0.57*      1.64**
J15                     19.4          0.51**    -0.43*     0.81       2.65**
J16                     19.7          0.55**    -0.33*     0.81       2.07**
J17                     16.3          0.27      -0.58*     0.57*      2.28**
J18                     20.1          0.59**    -0.23      0.78       1.55**

*Overfit/Centrality                   4          16        9          0
No Effect                             9          2         9          1
**Underfit/Extremity                  5          0         0          17

*Negative correlation, p < .05; or, infit mn. sq. < .6.
**Positive correlation, p < .05; or, infit mn. sq. > 1.4.
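The infit flagging criteria given in the notes to Table 4.11 can be sketched in a few lines of Python. This is only an illustration: the function name `flag_infit` and the dictionary layout are assumptions made here, and the values are simply copied from the infit columns of Table 4.11.

```python
from collections import Counter

# (internal, external) infit mean squares for reading, copied from Table 4.11.
INFIT_READING = {
    "J01": (0.56, 1.87), "J02": (0.39, 1.55), "J03": (1.11, 1.89),
    "J04": (0.63, 1.77), "J05": (0.42, 1.64), "J06": (0.70, 2.36),
    "J07": (1.01, 3.58), "J08": (0.56, 1.73), "J09": (0.43, 1.66),
    "J10": (0.55, 2.33), "J11": (0.69, 2.78), "J12": (0.61, 2.03),
    "J13": (0.35, 1.05), "J14": (0.57, 1.64), "J15": (0.81, 2.65),
    "J16": (0.81, 2.07), "J17": (0.57, 2.28), "J18": (0.78, 1.55),
}


def flag_infit(mnsq, low=0.6, high=1.4):
    """Classify an infit mean square using the cut-offs in the table notes."""
    if mnsq < low:
        return "overfit/centrality"
    if mnsq > high:
        return "underfit/extremity"
    return "no effect"


# Tally the flags separately for each framework.
internal_counts = Counter(flag_infit(i) for i, _ in INFIT_READING.values())
external_counts = Counter(flag_infit(e) for _, e in INFIT_READING.values())

print("internal:", dict(internal_counts))
print("external:", dict(external_counts))
```

The tallies reproduce the summary rows of the table for the two infit columns. The correlational flags additionally depend on a significance test, which is not reproduced in this sketch.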

Figure 4.10 provides a visual comparison of results across both frameworks. The values of the correlational indices are shifted to all negative correlations in the external framework. As was observed above, the fit values appear relatively ‘compressed’ in the internal framework.

[Figure: scatterplot of InFit Mean Square (Reading), with INTERNAL and EXTERNAL axes.]

Figure 4.10. Centrality/extremity indices for reading, internal v. external.

The correlation matrices in Table 4.12 allow the correlational indices to be compared with the residuals-based indices. While the two sets of indices overlap somewhat in the internal framework, they move in opposite directions in the external framework. Taking the raw score standard deviation (first column) as an indicator of centrality in the external framework, the correlational indices are in good agreement with it in both frameworks. In the internal framework, there are reasonably strong positive correlations between the raw score standard deviation and the residuals-based indices, meaning that low values on both were suggestive of centrality. However, in the external framework, this relationship nearly disappears: the correlations are low and non-significant.

Table 4.12. Correlation Matrices of Centrality/Extremity Indices for Reading

                 Raw Score   SD          InFit    InFit    OutFit   OutFit   Res-Exp
                 SD          Residuals   MnSq     Z        MnSq     Z        Corr
INTERNAL
SD Residuals     .624**
InFit MnSq       .587*       .966**
InFit Z          .578*       .968**      .992**
OutFit MnSq      .588*       .985**      .992**   .988**
OutFit Z         .578*       .983**      .984**   .994**   .993**
Res-Exp Corr     .915**      0.289       0.290    0.283    0.270    0.264
Res-Meas Corr    .913**      0.287       0.288    0.282    0.269    0.263    1.000**

EXTERNAL
SD Residuals     0.220
InFit MnSq       0.161       .981**
InFit Z          0.179       .983**      .992**
OutFit MnSq      0.029       .973**      .981**   .971**
OutFit Z         0.035       .976**      .975**   .980**   .992**
Res-Exp Corr     .911**      -0.138      -0.176   -0.150   -0.318   -0.306
Res-Meas Corr    .901**      -0.148      -0.176   -0.150   -0.325   -0.311   .997**

* Significant at 0.05 level (2-tailed). **Significant at 0.01 level (2-tailed).
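As an illustration of how one cell of Table 4.12 is obtained, the sketch below computes the Pearson correlation between the raw score standard deviations and the internal infit mean squares listed in Table 4.11. It assumes the matrix entries are ordinary Pearson correlations across the 18 judges; the helper `pearson` is written here for the example and is not taken from the analysis software used in the study.

```python
import math

# Raw score SDs and internal infit mean squares for J01..J18 (Table 4.11).
RAW_SD = [11.9, 15.4, 17.8, 11.2, 9.6, 13.7, 14.7, 9.8, 12.2,
          8.3, 13.2, 11.9, 10.7, 16.3, 19.4, 19.7, 16.3, 20.1]
INFIT_INTERNAL = [0.56, 0.39, 1.11, 0.63, 0.42, 0.70, 1.01, 0.56, 0.43,
                  0.55, 0.69, 0.61, 0.35, 0.57, 0.81, 0.81, 0.57, 0.78]


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r_internal = pearson(RAW_SD, INFIT_INTERNAL)
print(f"raw score SD v. internal infit MnSq: r = {r_internal:.3f}")
```

Under that assumption, the result should land close to the .587 reported for InFit MnSq in the internal panel, up to rounding of the tabled inputs.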

The scatterplots in Figure 4.11 suggest that in the external framework, when the residuals are less ‘compressed’, the correlation with the raw score standard deviation disappears (recall from above, that in the external framework, the residuals-based indices appeared to become more sensitive to inaccuracy and correlated more strongly with the score/p-value correlations).

[Figure: scatterplots against SD Raw Scores, in INTERNAL and EXTERNAL panels; the surviving axis label is InFit Mean Square.]

Figure 4.11. Centrality/extremity indices v. raw score standard deviations, reading, internal v. external.

Listening. Table 4.13 displays the standard deviation of ‘raw score’ estimates and the MFRM indices of centrality for each judge. In the internal frame, the residual-expected correlation flags judges J04 and J16 for extremity and J07 and J11 for centrality. For the infit mean square values in Table 4.13, 17 judges overfit the model, with J11 being the only judge showing acceptable fit. In the external framework, results again differ sharply. The residual-expected correlation flags all judges for centrality. The infit mean square values, however, flag only two judges (J07 and J11), both for underfit. The other 16 judges had acceptable fit values.

Table 4.13. Indices of Centrality/Extremity for Listening, Internal v. External

                        Raw Score    r(exp,res)            InFit Mean Square
Judge                   SD           Internal   External   Internal   External
J01                     11.0          0.15      -0.64*     0.29*      1.06
J02                     10.0          0.09      -0.61*     0.30*      0.61
J03                      8.1         -0.08      -0.70*     0.28*      0.89
J04                     14.1          0.34**    -0.45*     0.51*      1.14
J05                      8.9         -0.05      -0.72*     0.31*      1.20
J06                      8.6         -0.19      -0.78*     0.22*      1.02
J07                     10.0         -0.37*     -0.79*     0.47*      1.57**
J08                      8.7          0.11      -0.69*     0.25*      1.12
J09                     11.1          0.13      -0.65*     0.28*      0.98
J10                     13.7         -0.03      -0.55*     0.57*      0.99
J11                     11.4         -0.45*     -0.78*     0.89       2.37**
J12                     10.2          0.05      -0.65*     0.28*      0.71
J13                      8.6         -0.19      -0.76*     0.23*      0.69
J14                     12.2          0.15      -0.50*     0.50*      1.01
J15                     12.2          0.22      -0.53*     0.36*      0.76
J16                     12.1          0.32**    -0.52*     0.34*      0.87
J17                      9.9          0.13      -0.74*     0.19*      1.25
J18                      9.4          0.06      -0.73*     0.20*      1.06

*Overfit/Centrality                   2          18        17         0
No Effect                             14         0         1          16
**Underfit/Extremity                  2          0         0          2

*Negative correlation, p < .05; or, infit mn. sq. < .6.
**Positive correlation, p < .05; or, infit mn. sq. > 1.4.

Figure 4.12 provides a visual comparison of results across both frameworks. As with reading, the values of the correlational indices are shifted to all negative correlations in the external framework. As was observed above, the fit values were again relatively ‘compressed’ in the internal framework.

[Figure: two scatterplots, Residual-Expected Correlation (Listening) and InFit Mean Square (Listening), each with INTERNAL and EXTERNAL axes.]

Figure 4.12. Centrality/extremity indices for listening, internal v. external.

The correlation matrices in Table 4.14 allow closer investigation of how the different indices performed in the two frameworks. In the internal framework, the residuals-based indices actually correlate more strongly with the raw score standard deviations than do the residual-expected and residual-measure correlations, which have positive but non-significant correlations with the standard deviations. In the external framework, this situation is reversed. The correlations of SD Residuals and of the fit values with the raw score standard deviations fall to near zero and are no longer significant, while the residual-expected and residual-measure correlations approach 0.8 and are significant at the .01 level.

Table 4.14. Correlation Matrices of Centrality/Extremity Indices for Listening

                 Raw Score   SD          InFit    InFit    OutFit   OutFit   Res-Exp
                 SD          Residuals   MnSq     Z        MnSq     Z        Corr
INTERNAL
SD Residuals     .639**
InFit MnSq       .585*       .983**
InFit Z          .636**      0.985       .987**
OutFit MnSq      .559*       0.984       .999**   .985**
OutFit Z         .616**      0.988       .985**   .998**   .986**
Res-Exp Corr     0.425       -0.360      -0.371   -0.300   -0.405   -0.329
Res-Meas Corr    0.417       -0.368      -0.378   -0.308   -0.411   0.337    -.999**

EXTERNAL
SD Residuals     0.155
InFit MnSq       0.092       .981**
InFit Z          0.092       .981**      .988**
OutFit MnSq      0.055       .983**      .995**   .983**
OutFit Z         0.053       .983**      .982**   .994**   .988**
Res-Exp Corr     .796**      -0.379      -0.395   -0.381   -0.441   -0.432
Res-Meas Corr    .781**      -0.372      -0.379   -0.364   0.432    -0.422   .995**

* Significant at 0.05 level (2-tailed). **Significant at 0.01 level (2-tailed).

Figure 4.13 depicts the same general pattern noted above. Residuals are compressed within the internal framework; in the external framework, the correlation with the raw score standard deviations tends to disappear for the fit values while strengthening for the residual-expected correlations. Judging from the plot, the lower-than-expected correlations between the residual-expected correlations and the raw score standard deviations may have resulted from a small number of outlying observations.

[Figure: scatterplots against SD Raw Scores, in INTERNAL and EXTERNAL panels; the surviving axis label is InFit Mean Square.]

Figure 4.13. Centrality/extremity indices v. raw score standard deviations, listening, internal v. external.

Group-level Effects

Item separation statistics and item fit indices have been suggested for detecting group-level centrality effects. As item separation statistics are also indicators of group-level inaccuracy, they were presented above. There it was noted that the items were clearly less separated in the internal framework and that this was consistent with a group-level inaccuracy effect. However, this outcome is also consistent with a group-level centrality effect. Given the outcomes for individual judges on the centrality indices above, it seems likely that the low level of separation was in fact due to a group-level centrality effect.

It was also suggested that item mean square values significantly below 1.0 might indicate a group-level centrality effect. Tables 4.15 and 4.16 compare the item infit mean square values across frameworks for reading and listening, respectively. In the internal framework, a large number of items have infit values below 1.0 (36 of 40 for reading, and 40 of 40 for listening). Figure 4.14 shows the same pattern of highly compressed values in the internal framework once again. These results are consistent with a group-level centrality effect.
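The item counts quoted above can be checked directly against the internal infit columns of Tables 4.15 and 4.16. The sketch below is illustrative only: the list names and the helper `count_below` are assumptions made here, and the values are copied from the tables in item order (iR01 to iR40, iL1 to iL40).

```python
# Internal-framework item infit mean squares, copied from Table 4.15 (reading).
READING_INTERNAL = [
    0.45, 0.69, 0.42, 0.38, 0.93, 0.48, 0.82, 0.88, 0.68, 1.37,
    0.29, 0.71, 0.57, 0.83, 0.63, 0.77, 0.32, 0.48, 0.93, 0.75,
    0.49, 0.49, 1.17, 1.53, 1.05, 0.87, 0.43, 0.54, 0.36, 0.60,
    0.19, 0.28, 0.53, 0.77, 0.86, 0.50, 0.30, 0.25, 0.30, 0.31,
]

# Internal-framework item infit mean squares, copied from Table 4.16 (listening).
LISTENING_INTERNAL = [
    0.28, 0.25, 0.39, 0.31, 0.42, 0.36, 0.43, 0.19, 0.07, 0.30,
    0.28, 0.40, 0.44, 0.66, 0.59, 0.88, 0.44, 0.32, 0.30, 0.29,
    0.28, 0.30, 0.28, 0.33, 0.59, 0.26, 0.42, 0.24, 0.21, 0.36,
    0.60, 0.13, 0.34, 0.42, 0.48, 0.56, 0.08, 0.25, 0.18, 0.26,
]


def count_below(values, threshold=1.0):
    """Number of items whose infit mean square falls below the threshold."""
    return sum(1 for v in values if v < threshold)


reading_below = count_below(READING_INTERNAL)
listening_below = count_below(LISTENING_INTERNAL)

print(f"reading: {reading_below} of {len(READING_INTERNAL)}")
print(f"listening: {listening_below} of {len(LISTENING_INTERNAL)}")
```

Note that this counts values below 1.0 only; it does not test whether a value is significantly below 1.0, which would require the standardized fit statistics.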

Table 4.15. Item Fit Indices for Reading, Internal v. External

        INTERNAL           EXTERNAL                   INTERNAL           EXTERNAL
Items   Measure  InFitMS   Measure  InFitMS   Items   Measure  InFitMS   Measure  InFitMS
iR01     0.11    0.45       1.04    1.69      iR21    -0.22    0.49       0.72    1.98
iR02     0.42    0.69      -0.31    2.52      iR22     0.60    0.49       1.48    1.26
iR03    -0.28    0.42      -1.52    2.93      iR23     0.75    1.17       0.10    3.28
iR04     0.00    0.38      -1.32    3.73      iR24     0.63    1.53       2.28    5.49
iR05     0.16    0.93       0.35    1.24      iR25     0.42    1.05       1.85    3.93
iR06    -0.58    0.48      -0.61    0.66      iR26    -0.52    0.87      -2.00    4.68
iR07     0.22    0.82      -0.63    2.76      iR27    -0.34    0.43      -0.29    0.56
iR08     0.65    0.88       1.20    1.29      iR28    -0.52    0.54      -1.92    3.55
iR09     0.42    0.68       0.32    1.10      iR29    -0.65    0.36      -0.99    0.54
iR10     0.58    1.37       0.72    1.93      iR30     0.00    0.60      -1.15    3.46
iR11    -0.75    0.29      -2.01    1.92      iR31    -0.92    0.19      -0.24    1.34
iR12     0.19    0.71      -0.31    1.62      iR32    -0.37    0.28      -0.70    0.48
iR13     0.32    0.57       0.91    1.08      iR33    -0.55    0.53      -1.19    1.13
iR14     0.45    0.83       1.11    1.42      iR34    -0.16    0.77       0.90    2.71
iR15    -0.16    0.63      -0.43    0.98      iR35    -0.52    0.86      -0.40    1.20
iR16     0.24    0.77      -0.19    1.66      iR36    -0.03    0.50      -1.03    2.57
iR17    -0.11    0.32       1.18    2.96      iR37     0.27    0.30      -0.06    0.79
iR18     0.16    0.48       0.86    1.19      iR38    -0.34    0.25      -1.15    1.19
iR19     0.83    0.93       1.62    1.62      iR39    -0.34    0.30       0.81    2.52
iR20    -0.11    0.75       1.55    5.35      iR40     0.03    0.31      -0.51    1.03

Table 4.16. Item Fit Indices for Listening, Internal v. External Frames

        INTERNAL           EXTERNAL                   INTERNAL           EXTERNAL
Items   Measure  InFitMS   Measure  InFitMS   Items   Measure  InFitMS   Measure  InFitMS
iL1      0.08    0.28       0.38    0.41      iL21    -0.65    0.28      -1.52    1.21
iL2      0.16    0.25      -0.77    2.11      iL22    -0.43    0.3       -0.88    0.65
iL3     -0.01    0.39      -0.75    1.53      iL23    -0.5     0.28      -1.11    0.81
iL4      0.34    0.31       0.76    0.52      iL24    -0.04    0.33      -0.28    0.56
iL5      0.47    0.42      -0.2     1.78      iL25     0.22    0.59       0.57    0.81
iL6      0.28    0.36       0.11    0.61      iL26    -0.26    0.26       0.11    0.48
iL7     -0.26    0.43      -1.47    2.84      iL27     0.58    0.42       1.4     1.14
iL8     -0.07    0.19      -0.71    0.93      iL28     0.42    0.24       0.92    0.48
iL9     -0.69    0.07      -0.78    0.09      iL29    -0.43    0.21      -0.75    0.37
iL10    -0.26    0.3       -1.12    1.41      iL30    -0.3     0.36       0.76    2
iL11     0.31    0.28       0.89    0.65      iL31     0.05    0.6        1.15    2.34
iL12    -0.01    0.4       -0.55    1.08      iL32    -0.47    0.13      -1.33    1.03
iL13     0.25    0.44       0.16    0.64      iL33     0.45    0.34       0.65    0.41
iL14     0.72    0.66       2.06    2.93      iL34    -0.14    0.42      -0.2     0.53
iL15     0.28    0.59      -0.21    1.41      iL35     0.16    0.48       1.43    2.71
iL16     0.66    0.88       1.3     1.39      iL36     0.05    0.56       0.48    0.83
iL17    -0.07    0.44       0.02    0.53      iL37    -0.54    0.08      -1.84    2.09
iL18     0.47    0.32       0.67    0.4       iL38    -0.43    0.25      -1.06    0.81
iL19     0.42    0.3        0.83    0.47      iL39    -0.47    0.18       0.14    0.76
iL20    -0.11    0.29       0.88    1.68      iL40    -0.2     0.26      -0.14    0.34

[Figure: two scatterplots, Item InFit Mean Square (Reading) and Item InFit Mean Square (Listening), each with INTERNAL and EXTERNAL axes.]

Figure 4.14. Item infit mean square values in logits for reading and listening, internal v. external frameworks.

Summary. The results for the internal and external frames clearly do not match, with the internal frame failing to flag a large number of judges showing a centrality bias, for both the listening and the reading exams. Indeed, judges flagged for possibly demonstrating an ‘extremity’ bias appeared to display centrality within the external frame. Importantly, however, the group-level indicators within the internal frame did appear to correctly indicate the presence of a group-level centrality effect.

4.1.4 Summary

Above, it was seen that results from the internal and external frameworks did not agree. Moreover, within each framework, different indices gave different results.

Before summarizing the results for the two frameworks, it is first necessary to decide which indicators to use, based on correlations with the raw score statistics and findings from earlier research. For the external framework, it seems reasonable to use the score-expected and point-measure correlations as the MFRM indicators for inaccuracy and the residual-expected and residual-measure correlations as the MFRM indicators for centrality/extremity. For borderline cases, where these indicators reached different decisions, the raw score statistic (score/p-value correlation for inaccuracy and raw score standard deviation for centrality/extremity) was given the deciding ‘vote.’ For the internal framework, the SR/ROR correlation was in closest agreement with the raw score indicator and was used to indicate inaccuracy. For centrality/extremity, the residual-expected and residual-measure correlations were used for the reading test, and the SD of the residuals was used for the listening test. (Again, note that some of these indicators were not shown above; Appendix D provides the values for all indicators.)
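The borderline-case rule for the external framework can be written out as a small function. This is a minimal sketch under the assumption that each indicator yields a single categorical decision per judge; the function name `resolve_flag` and the label strings are hypothetical and are not part of the original analysis.

```python
def resolve_flag(indicator_a, indicator_b, raw_score_flag):
    """Combine two MFRM indicator decisions for one judge.

    If the two MFRM indicators agree, their shared decision stands.
    If they disagree (a borderline case), the raw score statistic
    casts the deciding vote.
    """
    if indicator_a == indicator_b:
        return indicator_a
    return raw_score_flag


# Hypothetical example: the two correlational indices disagree,
# so the raw score standard deviation decides.
decision = resolve_flag("centrality", "no effect", "centrality")
print(decision)
```

For inaccuracy, the two indicators would be the score-expected and point-measure correlations with the score/p-value correlation as the tie-breaker; for centrality/extremity, the residual-expected and residual-measure correlations with the raw score standard deviation.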

The final results are summarized in Table 4.17. There are clear differences between the results for the internal and external frameworks, suggesting that the assumption underlying the use of the MFRM for detecting rater effects in a standard setting situation using internal data was not met, and that the model was not robust to violations of this assumption. On a more positive note, the item separation and fit statistics were sensitive to the presence of group-level effects.

Table 4.17. Summary of Flagged Raters, Reading and Listening, Internal v. External

Leniency
  Reading, Internal: J11, J07
  Reading, External: J11, J07
  Listening, Internal: J07, J10
  Listening, External: J07, J10

Severity
  Reading, Internal: J01, J08
  Reading, External: J01, J08
  Listening (Internal or External column not recoverable): J03

Inaccuracy
  Reading, Internal: J04, J07, J08, J10, J11, J12
  Reading, External: J01, J04, J06, J07, J08, J10, J11, J12, J17
  Listening, Internal: J07, J11
  Listening, External: J05, J07, J11, J17

Centrality
  Reading, Internal: J05, J08, J10, J12
  Reading, External: J01, J02, J04, J05, J06, J07, J08, J09, J10, J11, J12, J13, J14, J15, J16, J17, J18
  Listening, Internal: J01, J02, J03, J06, J08, J09, J12, J13, J17, J18
  Listening, External: J01, J02, J03, J04, J05, J06, J07, J08, J09, J10, J11, J12, J13, J14, J15, J16, J17, J18

Extremism
  Reading, Internal: J02, J14, J15, J16, J18
  Listening, Internal: J11