Accurate Representation of Item Functioning. This has surely been the most controversial and well-researched assumption of the Angoff method (Brennan & Lockwood, 1980; Chang, 1999; Chang et al., 1996; Clauser et al., 2009; Fehrmann, Woehr & Arthur, 1991; Goodwin, 1999; Hurtz & Jones, 2009; Impara & Plake, 1998; Lorge & Kruglov, 1953; Plake & Impara, 2001; Plake, Impara & Irwin, 1999; Shepard, Glaser, Linn, & Bohrnstedt, 1993; Van Der Linden, 1982). The overwhelming consensus that has emerged from this research is that judges are indeed quite limited in their ability to represent item functioning. Most such studies have reported correlations between the means of modified Angoff judges' item estimates and actual difficulty levels (i.e., empirical p-values). Brandon's (2004) review of the literature on the Angoff method reported that, across the 29 correlations reviewed, the averages were .63 for operational standard settings and .51 for non-operational standard settings. This moderate level of success in meeting the assumption has remained the rule in studies published since Brandon's review (e.g., Clauser et al., 2009).
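As a concrete illustration of the statistic these studies report, the following minimal sketch (Python, with entirely made-up values) computes the Pearson correlation between judges' mean item estimates and the corresponding empirical p-values.

```python
import numpy as np

# Hypothetical data: mean Angoff estimates (averaged across judges) and
# empirical p-values for the same six items. All values are illustrative.
mean_estimates = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.58])
empirical_p    = np.array([0.70, 0.41, 0.85, 0.33, 0.76, 0.52])

# Pearson correlation between estimated and empirical difficulty, the
# statistic most commonly reported in the studies reviewed above.
r = np.corrcoef(mean_estimates, empirical_p)[0, 1]
print(round(r, 2))
```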

Research in this area has increasingly sought to investigate the variables influencing accuracy in assessing item difficulty. Panelist background and expertise have been the focus of one line of research, with inconclusive results. Van De Watering and Van Der Rijt (2006) found that students were more accurate than their teachers, but the Verhoeven studies discussed above failed to find a difference between panelists with different backgrounds.

Assumption 3: Quantification. After developing representations of the BPS and of item functioning, Angoff judges next need to juxtapose these representations, imagine how the just-proficient student would interact with the item, conceptualize the degree of challenge posed by the task, and 'quantify' this by estimating the probability of the BPS answering correctly. The ability of panelists to quantify their expectations as probabilities has rarely been explicitly discussed. This is curious, as there is little reason to expect this to be a natural task for most people and, conceptually, it is not clear how a panelist is expected to perform it.
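While the cognitive task itself is difficult to specify, the arithmetic performed on the resulting estimates is straightforward. The sketch below is a minimal illustration, assuming a hypothetical judges-by-items matrix of probability estimates, of how such quantified judgments are typically aggregated into a cut score: each judge's estimates are summed across items, and the panel cut score is taken as the mean of the judges' sums.

```python
import numpy as np

# Hypothetical ratings: rows are judges, columns are items. Each entry is a
# judge's estimated probability that the borderline proficient student (BPS)
# answers the item correctly.
ratings = np.array([
    [0.60, 0.45, 0.80, 0.35, 0.70],
    [0.55, 0.50, 0.75, 0.40, 0.65],
    [0.65, 0.40, 0.85, 0.30, 0.75],
])

# Each judge's implied cut score is the sum of their item estimates
# (the expected raw score of the BPS under that judge's ratings).
judge_cut_scores = ratings.sum(axis=1)

# The panel cut score is typically the mean of the judges' cut scores.
panel_cut_score = judge_cut_scores.mean()

print(judge_cut_scores)   # [2.90 2.85 2.95]
print(panel_cut_score)    # 2.90
```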

Furthermore, previous research offers reason to believe that the central tendency or centrality effect, in particular, may commonly occur when the Angoff method is used. The centrality effect has long been known to influence judgments made in settings similar to that of the Angoff. Indeed, over a century ago, Hollingworth observed that judgments of stimuli such as movement, length, area, and the size of angles "have all shown the same tendency to gravitate toward a mean magnitude, the result being that stimuli above that point in the objective scale were underestimated and stimuli below overestimated" (Hollingworth, 1910, p. 426). This effect has been consistently found within the psychophysics tradition: Stevens and Greenbaum (1966) reviewed a series of experiments demonstrating the same effect, which they referred to as a "regression effect." More than a decade later, Poulton provided an updated review of the literature concerning this tendency, which he referred to as "contraction bias" and described as "a general characteristic of human behavior" (Poulton, 1979, p. 778). Unfortunately, this literature has rarely been referred to in relation to the Angoff method, despite its obvious relevance.

If the centrality effect were present in an Angoff setting, it would manifest as a tendency for judges to overestimate the difficulty of relatively easy items and to underestimate the difficulty of relatively difficult items. The standard deviation of judges’ estimates would also be smaller than the standard deviation of the empirical item difficulties (i.e., those derived from the actual administration of the test to the relevant student population).
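These two signatures can be checked directly from study data. The sketch below, using illustrative values only, computes a pair of simple diagnostics that are sensitive to a centrality effect: the ratio of the standard deviation of the mean estimates to that of the empirical p-values, and the slope from regressing the estimates on the empirical values (both fall well below 1 when estimates are compressed toward the middle of the scale).

```python
import numpy as np

# Hypothetical item-level data: mean Angoff estimates (across judges) and
# empirical p-values from an actual administration. All values are made up.
mean_estimates = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.58, 0.44, 0.69])
empirical_p    = np.array([0.70, 0.41, 0.85, 0.33, 0.76, 0.52, 0.22, 0.90])

# Diagnostic 1: spread of the estimates relative to the empirical spread.
# A ratio well below 1 indicates 'scale shrinkage'.
sd_ratio = mean_estimates.std(ddof=1) / empirical_p.std(ddof=1)

# Diagnostic 2: slope of the regression of estimates on empirical p-values.
# A slope well below 1 indicates regression toward a mean magnitude.
slope, intercept = np.polyfit(empirical_p, mean_estimates, 1)

print(round(sd_ratio, 2), round(slope, 2))
```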

Precisely this pattern of results has been found in a number of studies. In Lorge and Kruglov's (1953) study of the ability of judges to estimate item difficulty, the standard deviation of the judges' estimates was 16.3, compared to 23.7 for the empirical item difficulties. Shepard (1994) found that trained Angoff judges systematically overestimated examinee performance on difficult items and underestimated examinee performance on easy items. In Goodwin (1999), 14 judges made estimates for all examinees and for the borderline examinees on a 140-item financial certification exam. The standard deviations of the judges' estimates were .09 for the total group and .10 for the borderline group; the corresponding standard deviations from the actual exam results were .19 and .18, respectively. Heldsinger and Humphry (2005) and Heldsinger (2006) reported results from a study in which 27 judges used a modified Angoff procedure with 35 items from a Year 7 reading test. The standard deviation of the item difficulties set by the panelists was 0.5 logits, less than half the standard deviation of 1.16 logits from the actual exam results. The authors used the ratio of the standard deviations to re-scale the Angoff results and found that it significantly altered the final cut score. Schulz (2006), in addition to providing one of the first attempts to theoretically elucidate the nature of this bias as it relates to standard setting, reported results from a pilot study in which 21 Angoff panelists made estimates for items from the 2005 NAEP Grade 12 math exam. The results suggested 'scale shrinkage' which, significantly, persisted even through the third round of ratings. Finally, Clauser et al. (2009) reported results from two operational standard-setting exercises for a physician credentialing examination, with six Angoff judges making estimates for 200 items (34 of which had associated empirical data) on one, and six judges and 195 items (43 with empirical data) on the other. Even though items with "very high" or "very low" p-values were excluded from the study, the judges were still found to "systematically overestimate the probability of success on difficult items and underestimate the probability of success on easy items" (Clauser et al., 2009, p. 17).
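The kind of re-scaling reported by Heldsinger and Humphry can be illustrated schematically. The sketch below uses made-up logit-scale values and expands the Angoff-based difficulties' deviations from their mean by the inverse of the standard-deviation ratio; the choice to center on the panel mean is an assumption made here for illustration, not necessarily the authors' exact procedure.

```python
import numpy as np

# Hypothetical logit-scale values: the panel's Angoff-based item difficulties
# and the corresponding empirical item difficulties.
angoff_difficulties    = np.array([-0.6, -0.2, 0.1, 0.4, 0.7])
empirical_difficulties = np.array([-1.4, -0.5, 0.2, 0.9, 1.6])

# Ratio of the standard deviations (compressed Angoff scale vs. empirical scale).
ratio = angoff_difficulties.std(ddof=1) / empirical_difficulties.std(ddof=1)

# Re-scale by expanding each Angoff value's deviation from the panel mean
# by the inverse of the ratio, so the spread matches the empirical spread.
center = angoff_difficulties.mean()
rescaled = center + (angoff_difficulties - center) / ratio

print(round(ratio, 2))
print(np.round(rescaled, 2))
```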

In fact, results consistent with a centrality effect appear to have been found every time they have been looked for. The one seeming exception is a study by Impara and Plake, in which, according to the authors, panelists "did not systematically overestimate (or underestimate) performance on easy items or overestimate (or underestimate) performance on hard items" (Impara & Plake, 1998, p. 77). However, the particular methodology used in that study makes it difficult to compare their results directly with the studies mentioned above. In their study, the authors asked 26 sixth-grade science teachers to estimate the probabilities of success on each item in a 50-item science test for two groups: the borderline ("D/F") students in their class, and the class as a whole. They also asked the teachers to assign and record the class grades for each student. The researchers then compared predicted with actual performance for both groups, with the borderline group defined by the teacher-assigned class grades. They found that the teachers overestimated the performance of the class as a whole but underestimated the performance of the borderline group. They then examined the relationship between predicted and actual item difficulty levels for both groups, categorizing estimates as overestimates (more than .10 over the actual p-value), underestimates (more than .10 under the p-value), and accurate estimates (within .10 of the actual p-value). The results were then further divided according to actual item difficulty level (with the easiest category comprising items with p-values above .66). They concluded that "these results did not show a consistent variation in accuracy of prediction simply as a function of item difficulty" (p. 77).
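Impara and Plake's categorization rule, as described above, amounts to a simple threshold comparison; the sketch below expresses it as a small function, with hypothetical estimate and p-value pairs.

```python
# Hypothetical (estimated, empirical) p-value pairs for a few items.
pairs = [(0.55, 0.70), (0.62, 0.60), (0.48, 0.30), (0.75, 0.82)]

def categorize(estimate, p_value, tolerance=0.10):
    """Label an estimate relative to the empirical p-value."""
    diff = estimate - p_value
    if diff > tolerance:
        return "overestimate"
    if diff < -tolerance:
        return "underestimate"
    return "accurate"

for est, p in pairs:
    print(est, p, categorize(est, p))
```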

This study certainly speaks to the ability of panelists to estimate the performance of particular students and may be of particular interest in comparing the modified Angoff method with student-centered standard setting methods, such as the contrasting groups method. Nonetheless, their results cannot be compared directly with results from the studies mentioned above, for at least two reasons. First, as noted in Clauser et al. (2009), Impara and Plake defined the borderline group in terms of the class grades assigned by the teachers. In order to make a direct comparison of estimated and observed difficulty levels, the authors would have needed to define the groups statistically, in accordance with the modified Angoff method: by the number of items each group was predicted to answer correctly. Doing so would have resulted in a different set of proportion-correct ('p') values, a different categorization of items into the three levels of item difficulty, and different percentages of estimates falling into each of the accuracy categories used by the authors (overestimates, accurate estimates, and underestimates). In other words, the relevant comparison is with students who performed around the mean score derived from the teachers' item-by-item estimates. Second, the authors provide no information on the dispersion of estimates, such as the range or standard deviation. Without these, and given the above issue of category definition, Impara and Plake's findings cannot be used as evidence either for or against the presence of a centrality bias.
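The statistically defined comparison described above could be constructed roughly as follows. This sketch uses simulated response data and a single hypothetical judge's estimates; the two-point raw-score band around the implied cut score is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 0/1 responses for 200 examinees on 50 items, plus a single
# hypothetical judge's probability estimates for the borderline student.
responses = rng.integers(0, 2, size=(200, 50))
borderline_estimates = rng.uniform(0.2, 0.9, size=50)

# Cut score implied by the judge's item-by-item estimates.
implied_cut = borderline_estimates.sum()

# Define the borderline group statistically: examinees whose total scores
# fall within a small band around the implied cut score.
totals = responses.sum(axis=1)
borderline_group = responses[np.abs(totals - implied_cut) <= 2]

# Empirical p-values computed within that statistically defined group,
# which is the comparison the text argues would be needed for a direct test.
borderline_p = borderline_group.mean(axis=0)
print(borderline_group.shape[0])
print(np.round(borderline_p[:5], 2))
```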

In short, then, based on previous research, there is strong reason to believe that the Angoff method is highly vulnerable to a central tendency bias which has the potential to undermine one of its core assumptions.

2.1.3 Rater Effects

An important part of the validation process generally is to identify possible threats to validity, formulate them as hypotheses, and then seek to refute them empirically (APA/AERA/NCME, 1999; Kane, 1994; Messick, 1998). For standard setting, and for subjective rating situations more broadly, such hypotheses can be explicitly formulated in terms of the presence of possible 'rater effects,' defined as a "broad category of effects [resulting in] systematic variance in performance ratings that is associated in some way with the rater and not with the actual performance of the ratee" (Scullen, Mount, & Goff, 2000, p. 957). These rater effects have been investigated in some depth within two broad research traditions. The first of these has focused on the psychological processes involved in making subjective evaluations and on the potential sources of rater effects (Pula & Huot, 1993). The second tradition has focused on detecting and diagnosing rater effects by searching for their characteristic patterns in ratings data. Research within this latter tradition has resulted in a variety of criteria for evaluating the psychometric quality of ratings across different measurement frameworks, including classical test theory, analysis of variance, regression analysis, generalizability theory, and Rasch measurement/item response theory (Saal, Downey, & Lahey, 1980; Stemler, 2004; Stemler & Tsai, 2008). Within this broad literature, rater effects have been defined in various ways (Myford & Wolfe, 2003, 2004; Saal, Downey, & Lahey, 1980). The present study will follow Wolfe's division of rater effects into three categories: leniency/severity, inaccuracy, and centrality/extremism (Wolfe, 2004). These are discussed in turn.

Leniency/Severity. This effect is present when raters give scores that are consistently either too high or too low. In terms of an Angoff standard setting, leniency/severity is present when a judge's probability estimates are uniformly either lower or higher than is warranted by the performance-level descriptors. A judge displaying a leniency bias would assign comparatively low probability estimates to the items, resulting in a lower cut score and a higher percentage of students meeting the standard. Conversely, a judge displaying a severity bias would attribute to the BPS more ability than warranted by the PLDs and would thus assign higher probability estimates, resulting in a higher cut score and a lower percentage of students meeting the standard.
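The consequence of such a uniform shift is easy to demonstrate with simulated data: adding a constant to every estimate moves the implied cut score and, with it, the percentage of examinees meeting the standard. All values in the sketch below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical estimates for 20 items and simulated examinee total scores.
item_estimates = rng.uniform(0.3, 0.9, size=20)
examinee_scores = rng.binomial(20, 0.6, size=1000)

def cut_and_pass_rate(estimates, scores):
    """Cut score = sum of item estimates; pass rate = share of scores at or above it."""
    cut = estimates.sum()
    return cut, (scores >= cut).mean()

# An unbiased judge versus the same judge with a uniform severity shift of
# +0.10 on every item (capped at 1.0): the cut score rises and the pass rate falls.
print(cut_and_pass_rate(item_estimates, examinee_scores))
print(cut_and_pass_rate(np.clip(item_estimates + 0.10, 0, 1), examinee_scores))
```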

Inaccuracy. To the extent that this effect is present, ratings will appear unrelated to the presence or absence of the latent trait being rated. In an Angoff standard setting, this effect would create inaccuracies in the representations of item functioning.

(It should be noted that, within the broader category of inaccuracy, it is