

3.6 Data Analysis

Data analysis was performed with several statistical tests that are described below.

3.6.1 Repeated Measures Analysis of Variance

Repeated measures analysis of variance (rANOVA) was performed initially on the data to determine whether the feedback portion of the standard setting between rounds had the intended effect of altering judges' item estimates. The rANOVA was performed separately on each of the three different measures of judge accuracy to ensure that judges' scores responded to the announcement of feedback data between the different rounds in the direction suggested by the Angoff model. It was expected that if the three different measures of judge accuracy had been affected as intended by the feedback, the within-subjects measures of the rANOVA would show a significant difference between the rounds where the feedback data was announced. In addition, the power of each rANOVA is reported.
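The study reports the multivariate (Wilks' Lambda) statistics for these analyses; the univariate within-subjects partition below is a simpler sketch of the same logic, written in Python. The judge-by-round scores are invented for illustration and are not taken from the study.

```python
import numpy as np
from scipy import stats

def repeated_measures_anova(data):
    """One-way repeated-measures ANOVA (univariate approach).

    data: 2-D array, rows = subjects (judges), columns = conditions (rounds).
    Returns (F, df_effect, df_error, p).
    """
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    # Partition the total sum of squares into condition, subject, and error terms.
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_total = ((data - grand) ** 2).sum()
    ss_error = ss_total - ss_cond - ss_subj
    df_cond, df_error = k - 1, (n - 1) * (k - 1)
    f = (ss_cond / df_cond) / (ss_error / df_error)
    p = stats.f.sf(f, df_cond, df_error)
    return f, df_cond, df_error, p

# Invented accuracy scores for 4 judges across 3 rounds.
scores = [[3, 5, 7], [4, 6, 9], [2, 4, 6], [5, 8, 8]]
f, df1, df2, p = repeated_measures_anova(scores)
print(f"F({df1},{df2}) = {f:.1f}, p = {p:.4f}")
```

A significant within-subjects F here would correspond to the intended effect described above: judges' estimates shifting between the rounds where feedback was announced.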

3.6.2 Pearson Correlation

Pearson correlations were performed in a series of analyses to test the main research hypothesis of this study, as well as to illustrate the convergent validity of the different measures used in this study. Significance testing was used in each case, based on the n size of the analysis, to determine whether the correlations were significant at the .05 level.
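As a sketch of this procedure, a single correlation of this kind can be computed and tested for significance with scipy; the judge estimates and empirical item p-values below are invented for illustration.

```python
from scipy import stats

# Invented data: one judge's item difficulty estimates and the
# items' empirical p-values (proportion of examinees answering correctly).
judge_estimates = [0.35, 0.50, 0.62, 0.70, 0.81, 0.40, 0.55, 0.66]
empirical_pvalues = [0.30, 0.48, 0.65, 0.72, 0.85, 0.38, 0.60, 0.70]

r, p = stats.pearsonr(judge_estimates, empirical_pvalues)
significant = p < .05  # the significance criterion adopted in this study
print(f"r = {r:.2f}, p = {p:.4f}, significant at .05: {significant}")
```

The two-tailed p-value returned by `pearsonr` already reflects the sample size, which is why the critical value of r shrinks as n grows.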


CHAPTER 4 RESULTS

In the following tests, significance is assumed at p < .05. This value was chosen because of the small number of judges (N = 18) used in the standard setting. Selecting the stricter value of p < .01 makes significance much harder to obtain, and hence values that could indicate a meaningful difference might be lost. However, choosing a significance level of p < .05 requires careful reflection: five percent of correlations will reach significance simply by chance. In addition to the correlations reported here, a large number of exploratory correlations were performed. With this in mind, patterns of findings need to be interpreted with caution. All statistics and numbers are rounded to the second decimal place, where appropriate. Cutscore statistics for the standard setting are contained in Appendices 11 and 12.

The Performance of Different Angoff Panels

Table 4.1 shows the p-value correlation, the Root Mean Square Error (RMSE), and the Cutscore Judgement (CSJ) for each of the judges on the reading panel. Table 4.2 shows the same values for the listening panel. A one-way repeated measures ANOVA was performed separately for the reading and listening tests for each of the measures of judge accuracy (i.e., separate analyses of the p-value correlations, the RMSE, and the CSJ). The purpose of this analysis was to determine whether the feedback data shown to the judges between rounds had the intended effect. As such, it was expected that the within-subjects measures for each of the three separate measures of judge accuracy would show a significant difference between the three different rounds of the standard setting.
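Two of these accuracy measures can be computed directly from a judge's estimates and the empirical item p-values. The Python sketch below, with invented numbers on a 0-100 scale, shows the p-value correlation and the RMSE; the CSJ is computed as described elsewhere in this study and is omitted here.

```python
import numpy as np

def judge_accuracy(estimates, empirical):
    """p-value correlation and RMSE between a judge's item estimates
    and the items' empirical p-values (both on the same scale)."""
    est = np.asarray(estimates, dtype=float)
    emp = np.asarray(empirical, dtype=float)
    p_corr = np.corrcoef(est, emp)[0, 1]
    rmse = np.sqrt(np.mean((est - emp) ** 2))
    return p_corr, rmse

# Invented example: a judge who tracks relative item difficulty well
# but systematically underestimates it by roughly 10 points.
empirical = [45, 55, 60, 70, 80, 85]
estimates = [35, 44, 50, 61, 69, 77]
p_corr, rmse = judge_accuracy(estimates, empirical)
print(f"p-value correlation = {p_corr:.2f}, RMSE = {rmse:.2f}")
```

Note how a high p-value correlation can coexist with a large RMSE when the judge's error is mostly a constant bias.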

Within-subject effects were significant for all of the measures of judge accuracy for the reading panels (Table 4.1). For the p-value correlation measure of the reading panel, Wilks' Lambda = .162, F(2,16) = 41.50, p < .01, η2 = .84. For the RMSE, Wilks' Lambda = .373, F(2,16) = 13.47, p < .01, η2 = .627. For the CSJ, Wilks' Lambda = .620, F(2,16) = 4.91, p = .02, η2 = .38.

Within-subject effects were significant for all of the measures of judge accuracy for the listening panels (Table 4.2), except the CSJ. For the p-value correlations of the listening panel, Wilks' Lambda = .17, F(2,16) = 39.40, p < .01, η2 = .83. For the RMSE, Wilks' Lambda = .32, F(2,16) = 17.40, p < .01, η2 = .69. For the CSJ, Wilks' Lambda = .85, F(2,16) = 1.43, p = .29, η2 = .15.

All of the within-subjects ANOVA results were significant, except for the CSJ listening measure in Table 4.2. Some caution should be taken in interpreting the individual results for each of the subjects in Tables 4.1 and 4.2. Particularly in Table 4.2, the listening panel, variation in the measures is very large; as a result, the ANOVA is not significant. However, the standard deviations of the between-subject grand means, as expected, decrease between Round 1 and Round 2 for both the reading and the listening panels. And while the standard deviation does not decrease for Round 3 of the reading panel, the value is essentially the same as the value obtained for Round 2.

This indicates that, generally, the standard setting feedback between Rounds 1 and 2 and between Rounds 2 and 3 had the intended effect. This point will be returned to in the conclusion; it also points to some problems with the CSJ measures of the listening panel, which should be interpreted with some caution.

Table 4.1

Measures of Judge Accuracy - Reading

Table 4.2

Measures of Judge Accuracy – Listening

Issues Concerning the p-value Correlation

It is important to note that some of the correlations in Table 4.1 and Table 4.2 during Round 3 are as high as, or even higher than, .90. Such high correlations for this kind of task are improbable and are seldom achieved under conditions where participants do not have the kind of feedback information and discussion typically provided for standard setting judges (Brandon, 2004; Clauser et al., 2002; Clauser et al., 2009; Clauser et al., 2013; Clauser et al., 2014; Margolis & Clauser, 2014). There are two explanations that have been suggested for such high correlations.

The first of these can be thought of as the anti-high correlation position, which is derived from experimental results. This position is supported by data from a large number of related studies indicating that judges are typically unable to accurately estimate the difficulty measures of items when asked to do so (Goodwin, 1999; Impara & Plake, 1998; Lorge & Kruglov, 1953; Linn & Shepard, 1997; Norcini et al., 1987; Norcini, Shea & Kanya, 1988; Shepard, 1994; Smith & Smith, 1988; Taube, 1997).

Brian Clauser and his colleagues have tried to address this problem experimentally. For example, in one study (Clauser et al., 2009) judges were shown items and asked to rate their difficulty values, then asked to discuss their estimates with other judges without the benefit of feedback information. After decisions were made without feedback, judges were then provided with feedback information similar to that provided in an operational Angoff standard setting (Clauser et al., 2009, p. 1). Discussion of the items decreased the variance of the judges' estimates, just as it would during a typical standard setting, but it did not improve the relationship between the judgments and the difficulty values of the items. However, once the judges were given performance data for a subset of the items, their judgments showed a substantial increase in correspondence with the true difficulty measures of the items. Clauser has interpreted this kind of result as demonstrating that without performance data, judges are unable to accurately gauge the difficulty of items. As such, judges in an operational Angoff standard setting will not be able to accurately estimate the difficulty of an item without first being told how well students perform on the item.

The pro-high correlation position is derived from Item Response Theory (IRT) and the IRT formula. It proposes that because judges are not able to accurately calibrate the difficulty scores for the items, they need additional information to calculate the minimum value of the cut score for the item. High p-value correlations are thus a necessary part of a correctly conducted Angoff standard setting, demonstrating that judges understand their role as judges, as well as the role of feedback information (Cizek, 2012; Hambleton et al., 2012; Loomis, 2012).


This position is consistent with the interpretation of the Angoff method from a psychometric point of view.

pi = exp(ϴj − bi) / (1 + exp(ϴj − bi))

where

pi = the estimated probability of a correct answer

ϴj = the judge’s true mental image of the ability level for the borderline candidate, in this case, the CEFR B1 level

bi = the true difficulty of the item under high stakes conditions

Feedback information appears to be a necessary but not sufficient condition for an accurate judgment. Providing the feedback information merely allows the judge a greater understanding of bi. Since the real value of ϴj remains unknown, knowing the value of bi merely gives the judge a better basis for estimating ϴj, but does not tell them what it would be. This is particularly important for teachers whose students are drawn from special populations of high or low performers and whose estimates of ϴj might be skewed as a result.
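The formula above is the one-parameter (Rasch) IRT model. A direct transcription in Python, using the symbols defined above, illustrates the point: knowing bi fixes one term of the model, but pi still depends on the unknown ϴj.

```python
import math

def p_correct(theta, b):
    """Rasch-model probability that a borderline candidate of ability
    theta answers an item of difficulty b correctly."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

# When ability equals item difficulty, the model predicts a 50% chance.
print(p_correct(0.0, 0.0))             # 0.5
# An easier item (b one logit below theta) raises the probability.
print(round(p_correct(0.0, -1.0), 2))  # 0.73
# The same item looks different for a different theta, which is why
# feedback about b alone cannot pin down the judgment.
print(round(p_correct(-1.0, -1.0), 2))  # 0.5
```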

As mentioned earlier, a focus group was conducted with Group 3, where the issue of feedback data was raised by the moderator. Data from this focus group support the idea that judges use the feedback data as a source of additional information to make decisions about their estimates of student ability.


Moderator: One of the things I’m particularly interested in is your impression of the impact data [feedback data]…What you thought of that when I presented it. How that affected you when you made your decisions about things…

Judge Agf 36: For the morning part…it shows that I am underestimating the students…I underestimate them, their ability. I underestimate their test taking skills…

Moderator: So you saw it as a source of additional information?

Judge Agf 36: Yes

Moderator: So how did you use that in reevaluating their scores…Did it have any impact on your image of the B1 student?

Judge Agf 36: Yes…I think it increased the percentage a little bit. Not just the original student in my mind. I started to think about, not just the original PE midterms and finals, those who can get more than 80. In the beginning, I started to think about 90…

Moderator: Did it have any impact on the idea of what a B1 student…

Judge Agf 36: I was thinking the student will make this or that mistake. And when I compare to general student performance, I kind of find out my score kind of matched the PE more than the B1 level. So then even though I was doing it unconsciously, then I feel maybe I was thinking about the average student instead of the B1 student. So then when I do the listening, I kind of raise it [the estimate of B1 cutscore].

Moderator speaks now to Judge Agf 31.


Moderator: Now you teach at another school, and your students are different from ours in many ways. Did the impact data help you to understand the relationship between the test and the students better?

Judge Agf 31: Yes.

Moderator: In the way that Judge Agf 36 was talking about?

Judge Agf 31: Yes…When I saw the results you provided for us that helped me to understand the real performance for our students. That would be maybe higher than the B1 according to the descriptors, so I raised my criteria for the second level.

Despite theoretical and empirical reasons to accept the value of feedback, there is still one possibility raised by the interpretation that Clauser and his colleagues bring to the standard setting. It remains possible that incorporating feedback data into the mental calculations demanded of standard setting judges is so difficult that at least some judges cannot perform the task without losing track of the performance standard and its descriptors. Rather than using the feedback data to mentally position their students relative to the cutscore, they would abandon the performance standard altogether and simply follow the ups and downs of the feedback data. This, however, was not what the judges in Panel 3 described doing.


1. Does knowledge and training in Performance Level Descriptors (PLDs) work effectively to predict an Angoff standard setting judge’s ability?

Performance Level Descriptors (PLDs) are among the key concepts in describing the mental work of a standard setting judge. In an Angoff standard setting, judges must match real items with their mental image of a description of a person at the cutscore. In the IRT model of the p-value formula shown above, this ability to identify the items that represent the judge's mental image of a person who is barely at the cutscore is represented by ϴj. Thus, for a judge to perform accurately, they must have a clear and consistent image of ϴj. The assumption that training in the PLDs of the standard setting produces better judges is implicit in the design of the Angoff standard setting method (Council of Europe, 2009). Large portions of the Council of Europe 2009 manual are devoted to the theory and practice behind the claim that knowledge about PLDs will improve the judge's ability to perform their responsibilities in the standard setting (Council of Europe, 2009). Large portions of the training for this standard setting were committed to this assumption. Certainly the idea carries a great deal of face validity. Is it supported by empirical findings?

The 18 judges were separated into 3 panels of 6 people each for the purpose of training and tested on their ability to match levels of the Common European Framework of Reference with descriptors taken from the CEFR Scales. This assessment is described in greater detail in Assessment 1. Tables 4.1 and 4.2 show the number of descriptors correctly categorized for each of the judges according to their group membership.

As Table 4.1 illustrates, the best performing panel in terms of correctly identifying CEFR reading descriptors is clearly Panel 2. Despite this, by Round 3, the Group Mean for their p-value correlation (r = 0.45) was the lowest of all three groups (Panel 1, r = 0.80; Panel 3, r = 0.87), and their RMSE remained higher (RMSE = 21.68) than that of the other two groups (Panel 1, RMSE = 17.55; Panel 3, RMSE = 15.58). These results indicate that knowledge of the PLDs did not assure that a judge could accurately estimate the difficulty of an item.

The scores from Table 4.2, the listening descriptors, were slightly different. While Panel 2 continued to score lower than the other panels on the p-value correlations (r = .40, .66, and .70), Panels 1 and 3 showed a different pattern (Panel 1, r = .50, .79, and .87; Panel 3, r = .55, .82, and .84).

This was especially true for the RMSE. While the Group Mean for Panel 1 on the listening panel started out slightly higher (RMSE = 20.10) than the Group Mean for Panel 3 (RMSE = 19.10), by the end of Round 3, Panel 1 had a slightly lower Group Mean (Panel 1, RMSE = 14.79; Panel 3, RMSE = 16.07), but in fact, their performance was about the same during each of the rounds.

Table 4.3 shows the Pearson correlations between the PLD Test scores and the accuracy of the judges during the standard setting. For N = 18, a correlation of r = .44 is required for significance at p < .05. Although there are some singularly high correlations, the absolute value of only 3 of the 18 correlations reaches significance. This shows that the relationship between knowledge of the PLDs and the three measures of judge accuracy is not strong, and that there may be other important variables that are unmeasured in this model. In conclusion, the PLD Test is not an important predictor of judge accuracy.

Table 4.3

Correlation between PLD Test and Standard Setting Performance

Round        p-corr              RMSE               CSJ
             R1    R2    R3     R1   R2   R3      R1    R2    R3
READING     -.14  -.23  -.48*   .20  .23  .36    -.38  -.45* -.48*
LISTENING   -.29  -.13  -.13    .12  .18  .28     .32   .17   .02

*indicates p < .05


These results seem to suggest that there is little or no relationship between a judge's knowledge of how to use the PLDs and their performance in an Angoff standard setting. As such, even if a judge performed poorly on a test of PLDs, they would still be able to act competently as a judge in such a standard setting. This is easier to understand if you consider that a standard setting involves a large number of skills other than just knowledge of the PLDs. Lack of understanding of the PLDs could be compensated for by strengths in other areas of performance. On the other hand, knowledge of the PLDs could be a necessary but not sufficient condition for competent performance as an Angoff standard setting judge.

A second explanation for the poor relationship between the PLD Test and the judges' standard setting performance addresses the test itself. A great deal of difficulty was observed in the construction of the PLD Test. A number of the original items had to be deleted to create a psychometrically acceptable measurement instrument. It should be noted that a formal test was never developed, and the PLD Test was used only because its Cronbach's Alpha was acceptable. Many indexes of reliability and performance were not checked. It is possible that the items used to test the judges' PLD knowledge do not provide a completely sound measurement instrument.


Table 4.4 shows the correlation matrix for the standard setting that examined reading ability. The matrix provides convergent validity (AERA, APA, & NCME, 2014; Cronbach, 1988; Cronbach & Meehl, 1955; Kane, 2006; Loevinger, 1957; Messick, 1981, 1989, 1998) for the results of this standard setting. Correlations of RMSE scores with other RMSE scores should theoretically be positive; the same is true for p-value correlations with other p-value correlations. On the other hand, all correlations between p-value correlation scores and RMSE scores should theoretically be negative. Correlations with the CSJ could be either positive or negative for both the p-value correlation and the RMSE. All of the correlations between the RMSE and the p-value correlation measures were significant. While measures of CSJ were highly correlated with each other at different rounds, they were only moderately correlated with the p-value correlation, with 5 of 9 correlations reaching significance; with the RMSE, only 1 of the 9 correlations was significant.

In Table 4.4, the absolute values of the correlation scores range from r = 0.48 to r = 0.89. By Round 3, the end of the reading panel, the absolute values of 5 of the 8 correlations exceeded the significance threshold of r = 0.44. The only correlations that failed to meet significance were those between the CSJ and the RMSE.

In addition, all the correlations between p-value correlation scores and RMSE scores were negative and correlations between the same measures of ability were positive.


Table 4.4

Matrix Correlation for Measures of Reading Ability

          p-corr 1  p-corr 2  p-corr 3  RMSE 1  RMSE 2  RMSE 3  CSJ 1  CSJ 2  CSJ 3
p-corr 1    1.00      .71*      .77*     -.79*   -.58*   -.61*   .41*   .43*   .65*
p-corr 2     X        1.00      .87*     -.48*   -.88*   -.75*   .11    .28    .50*
p-corr 3     X         X        1.00     -.68*   -.82*   -.89*   .26    .41*   .60*
RMSE 1       X         X         X        1.00    .50*    .60*  -.32   -.22   -.46*
RMSE 2       X         X         X         X     1.00     .89*   .00    .03   -.23
RMSE 3       X         X         X         X       X     1.00   -.05   -.09   -.24
CSJ 1        X         X         X         X       X       X    1.00    .43*   .61*
CSJ 2        X         X         X         X       X       X      X    1.00    .88*
CSJ 3        X         X         X         X       X       X      X      X    1.00

*indicates p < .05


Table 4.5 shows the same matrix of correlations for judge accuracy of listening ability.

The correlations in this matrix are not as large as those in the judges' reading ability matrix. The absolute values of the correlation scores range from r = .22 to r = .84, with the highest correlations coming from measures of CSJ correlating with measures of CSJ in other rounds. Only 13 of the 36 correlations reach the significance threshold of r = .44; three of those came from measures of CSJ correlating with measures of CSJ in other rounds. Measures of CSJ all correlated significantly with the other measures of CSJ during other rounds; however, only 1 of their correlations with a measure of RMSE was significant, and none of their correlations with the p-value correlation were significant.

As in the judges' reading ability matrix, all the correlation scores are in the expected direction: correlations between p-value correlation scores and RMSE scores are all negative, and correlation scores between the same type of measure of judge ability are all positive. However, at Round 3, only 3 of the final 5 correlations are significant.

The data indicates that the p-value correlation and the RMSE are not measuring the same things. If the correlations were very high, approaching unity, this would indicate that the two measures were probably measuring the same kinds of error. However, the largest correlation was r = 0.89, with most of the other correlations falling between r = .50 and r = .80. As a result, it is more probable that p-value correlation and RMSE are tapping into at least two different types of measurement error. As stated above, RMSE is rarely reported for the Angoff standard setting, and instead a large body of understanding about the p-value correlation has developed. A wider use of RMSE could lead to a better understanding of the Angoff standard setting and how judges assign scores to items.
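The distinction can be made concrete with a small Python simulation using invented numbers: adding a constant bias to a set of judge estimates leaves the p-value correlation unchanged, because correlation is invariant to a uniform shift, while the RMSE grows with the size of the bias.

```python
import numpy as np

empirical = np.array([45.0, 55.0, 60.0, 70.0, 80.0, 85.0])
noise = np.array([2.0, -3.0, 1.0, -2.0, 3.0, -1.0])

unbiased = empirical + noise       # small random-looking error only
biased = empirical + noise - 15.0  # same error plus a constant shift

def rmse(a, b):
    """Root mean square error between two sets of scores."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

r_unbiased = np.corrcoef(unbiased, empirical)[0, 1]
r_biased = np.corrcoef(biased, empirical)[0, 1]

# The correlations are identical, but the biased judge's RMSE is far larger.
print(f"r: {r_unbiased:.3f} vs {r_biased:.3f}")
print(f"RMSE: {rmse(unbiased, empirical):.2f} vs {rmse(biased, empirical):.2f}")
```

This is one sense in which the two statistics can tap different types of measurement error: the p-value correlation is blind to systematic over- or under-estimation, while the RMSE is not.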
