Training and the Angoff Standard Setting Method

Merriam-Webster defines training as “a process by which someone is taught the skills that are needed for an art, profession, or a job.” This definition implies both learners and teachers, and all standard setting methods have both. In addition, the idea of ‘training’ implies preparing learners for a task that they cannot or do not perform naturally without instruction. In principle, learners who perform well during training should be better prepared for the task than those who perform less well.

The training that has emerged for the Angoff method of standard setting has two fundamentally different manifestations. In the United States, where most of the published research on standard setting is generated, training has focused on preparing judges to identify as accurately as possible where the categories of borderline test takers begin and end, or at least this has become the emphasis of the activities used during standard setting training. In Europe, on the other hand, the Common European Framework of Reference (CEFR) has for years held the dominant position as a standard in language testing. The CEFR is fundamentally nothing more than lists of performance level descriptors (PLDs) aimed at describing competency. As such, the training for standard setting that has emerged from Europe is based largely on whether judges are able to use, and perform tasks based on, lists of the skills associated with different levels of competency.

The American sense of standard setting is very clearly laid out around the concept that the judge’s job is to identify the borderline test taker. Many researchers describe in detail the tasks that such a test taker should be able to do. For example, this quote from the widely cited Raymond & Reid (2001, p. 147) illustrates the point.

Training should give participants an opportunity to practice the steps for assigning MPLs (Minimum Passing Level) under conditions similar to the conditions they will experience when assigning actual MPLs (p. 144). Asking the participant with the lowest MPL and the participant with the highest MPL to explain their rationale is a common training technique.

Similar descriptions can be found in more recent examples of the training of standard setters.

Raymond & Reid (2001, pp. 150-1) provide explanations of a training program for standard setting judges. Loomis (2012) provides details of the preparation of standard setters for NAEP.

While Loomis describes how NAEP selects and prepares its standard setters, both her work and that of Raymond & Reid (2001) fail to provide any evidence that their training methods actually produce the knowledge and skills the authors deem necessary. It is not at all clear that asking judges to explain their reasoning has any effect on how well they perform the task. Although such instructions seem to make sense, adopting them on that basis alone amounts to relying on the face validity of the activities.

In other disciplines and areas of psychology, the efficacy of training methods would be assessed through clinical trials. Instead, judges are asked to fill out self-report surveys describing their knowledge, their feelings about the training, and their personal success at mastering it. Gregory Cizek (2001, 2012, 2012a; Cizek & Bunch, 2007) is one of the leading researchers in standard setting today. He has discussed in great detail the use of these self-report surveys to monitor progress during the standard setting and to clarify judges’ level of knowledge and attitudes during and after the procedure. While Cizek has published several major academic books on standard setting, they are better thought of as manuals on how to conduct a valid standard setting. Cizek (2012a, p. 170) states that,

Minimally, two essential validity related questions are addressed by the surveys. (a) Is there evidence that the standard setting participants received appropriate and effective training in the standard setting method, key conceptualization, and data sources to be used in the procedure? (b) Is there evidence that the participants believe they were able to complete the process successfully; yielding recommended cutscores that they believe can be implemented as valid and appropriate demarcations of the relevant categories.

Cizek (2012a) continues by providing extremely detailed examples of ‘evaluations’ timed for the “End of Orientation” (p. 174), the “End of Method Training Session” (p. 175), the “End of Round One” (p. 175), “Round Two” (p. 176), “Round Three” (p. 177), the “Final Evaluation” (p. 177), and a final form dealing with “Level of Reliance on Information” (p. 178).

The study reported here was planned long before Cizek (2012a) was published, and it also has a different agenda. As such, its schedule only roughly follows the one Cizek (2012a) suggests. Of the seven different types of assessment used in this study, five were self-report surveys, addressing the following:

1. knowledge of standard setting procedures
2. knowledge of the CEFR
3. knowledge of the Practical English Test
4. beginning of Day 2, prior to the beginning of the operational standard setting
5. final evaluation

In one final suggestion for training, Loomis (2012) and Cizek (2001) describe the slightly different approach used by NAEP. NAEP treats as a key element, and as “the most essential part of the process” (Loomis, 2012, p. 123), the concept that the training process should produce in judges a common understanding of the achievement levels. As a result, one of the procedures is that all judges take a version of the test to try to understand what it will be like for the actual test takers.

A similar situation exists in Europe. European standard setting of language tests is based largely around the Common European Framework of Reference (CEFR). Some CEFR manuals are simply lists of PLDs for various language situations (Council of Europe, 2001). Others are lists of PLDs together with guidance on how to interpret them. Exercises for the training of judges can be constructed from these instructions (Council of Europe, 2001; Council of Europe, 2009). Some of these exercises are very interesting and appear to have very strong face validity. But, as with their American counterparts, no formal test of their ability to predict standard setting outcomes has yet been reported.

The situation described here, in which training for judges is suggested without any quality control other than the face validity of the procedure, is in fact much more significant than it first appears. It is difficult to find any source dealing with the training of standard setting judges that reports, or even describes a need for, predictive checks on training activities.
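One form such a predictive check could take is sketched below. It is a minimal illustration under stated assumptions, not a procedure drawn from any of the sources cited here. It assumes, hypothetically, that each judge has a score on a training exercise and a measure of error in his or her operational Angoff ratings (for instance, the mean absolute difference between the judge's item estimates and the observed proportion correct among borderline test takers). The function name, variable names, and data are all invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    ss_x = sum(d * d for d in dx)
    ss_y = sum(d * d for d in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(ss_x * ss_y)


# Hypothetical data, one entry per judge (invented for illustration only).
training_scores = [55, 60, 70, 80, 90]          # percent correct on a training exercise
rating_errors = [0.30, 0.28, 0.20, 0.15, 0.10]  # mean absolute error of Angoff ratings

# If training performance predicts operational accuracy, judges with higher
# training scores should show smaller rating errors, i.e. a negative r.
r = pearson_r(training_scores, rating_errors)
print(f"r = {r:.2f}")
```

A clearly negative correlation would be consistent with the training activity predicting judge accuracy; a correlation near zero would suggest that the activity rests on face validity alone. With panels as small as those typical of standard setting, any such estimate would be imprecise and should be read with caution.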

Hambleton & Pitoniak (2006) spend a great deal of time in their chapter “Setting Performance Standards” in Educational Measurement discussing the importance of training. Many standards from the APA’s Standards for Educational and Psychological Testing are cited in Hambleton & Pitoniak (2006) as the authors refer to the training of judges. For example, on page 434, they state,

Standard 4.20 addresses the desirability of obtaining external evidence to support the validity of test score interpretations associated with performance category descriptions.

Standard 4.21 stresses the importance of designing where panelists can optimally use the knowledge that they have to influence the process.

The chapter itself contains numerous sections that deal directly with the training of judges, such as “2.4 Step 4: Train Panelists to Use the Method” (p. 437) and “6. Training Panelists” (pp. 453-455), which cite Raymond & Reid (2001) extensively. At no point in any of these references is there mention of an empirical validation of these training methods, or of how such training methods are connected to the standard setting method in a way that makes them more valid as methods of performing the procedure.
