
Problems with the Angoff Method

The Angoff method suffers from all of the general problems that plague standard setting, chief among them the rejection of the idea that standard setting is a psychometric technique capable of discovering a knowable or estimable parameter (Cizek & Bunch, 2007), rather than an abstraction that is useful but does not correspond to a real value (Cizek, 1996; Cizek, 2001; Cizek & Bunch, 2007; Cizek, Bunch & Koons, 2004). Or, as Dixie McGinty (2005) put it, the statistics describing standard setting performance are not in themselves indicators of validity; while useful in demonstrating validity, they are not measures of it.

In addition, the Angoff standard setting presents judges with a series of special problems. The most prominent of these stem from the fact that Angoff judges must estimate the difficulty of the items. Human minds are limited in their ability to make such estimates (Brandon, 2004; Goodwin, 1999; Impara & Plake, 1998; Linn & Shepard, 1997; Lorge & Kruglov, 1953; Norcini et al., 1987; Norcini, Shea & Kanya, 1988; Shepard, 1994; Smith & Smith, 1988; Taube, 1997), and because the Angoff procedure requires judges to make them anyway, judges look for extra sources of information on which to base their estimates, such as other judges' estimates or the feedback information provided between rounds; this also gives rise to the problem of restricted range.

Copying of Judges' Estimates and Feedback Information

One source of information that may produce convergence of the p-value estimates is the provision of feedback information. This phenomenon has been widely studied, but its root cause is not clearly understood. It is generally interpreted as an indication that judges' performance improves as they receive more information about the items (Cizek, 1996; Cizek, 2001; Cizek & Bunch, 2007; Cizek, Bunch & Koons, 2004). A second explanation, discussed later, is that the judges are simply copying the information they receive about the items during the feedback sessions. In its modified form, the Angoff standard setting method asks judges to rate the difficulty of test items. The ability of expert judges to make such estimates is crucial to the validity of the method, and a large and comprehensive research literature has developed to address the issue.

P-value correlations as high as those seen in the later rounds of an Angoff standard setting are virtually never observed unless judges have already been told the p-values of the items in the feedback provided before Rounds 2 and 3. A large body of research questions the ability of even the most highly trained experts to accurately estimate the difficulty of test items, and documents the way in which asking judges to make an estimate affects the magnitude of that estimate (Brandon, 2004; Goodwin, 1999; Impara & Plake, 1998; Linn & Shepard, 1997; Lorge & Kruglov, 1953; Norcini et al., 1987; Norcini, Shea & Kanya, 1988; Shepard, 1994; Smith & Smith, 1988; Taube, 1997).


Reviewing Angoff method procedures, Brandon (2004) concluded that the values obtained by correlating empirical p-values with the estimates made by Angoff judges typically range from about 0.40 to 0.70, indicating that, at best, actual estimation of the p-value can rarely account for more than half of the variance in a judged estimate. He concludes (p. 71), “results of this level, show that the ordering of item estimates - particularly those in operational standard setting studies - can be expected to mirror moderately the ordering of item difficulty.”
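To make the relationship between a correlation and explained variance concrete, the short illustration below computes a Pearson correlation between a set of judge estimates and empirical p-values and squares it. All of the numbers and variable names are invented for illustration; they are not drawn from Brandon (2004) or from any study cited here.

```python
# Illustrative only: hypothetical judge estimates and empirical p-values.
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)

empirical_p = [0.35, 0.48, 0.55, 0.62, 0.70, 0.81, 0.88]   # hypothetical item p-values
judge_est   = [0.55, 0.50, 0.62, 0.66, 0.58, 0.60, 0.72]   # hypothetical Angoff estimates

r = pearson_r(judge_est, empirical_p)
print(f"r = {r:.2f}, shared variance r^2 = {r**2:.2f}")
# With these invented values r is about 0.69, so r^2 is roughly 0.48:
# even near the top of Brandon's reported range, the estimates share
# just under half of their variance with the empirical p-values.
```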

The clustering of scores around a central point is referred to as ‘restricted range’; when the scores cluster around the middle of the rating scale, this is referred to as ‘centrality’ (Saal et al., 1980). One indication of centrality is that the estimated difficulty values have a smaller standard deviation than the standard deviation of the measured values (Saal et al., 1980): the estimates for the easiest and most difficult items miss the mark, while estimates for items closer to the mean or median are more accurate. This is, in fact, a commonly observed finding in the research. Lavallee (2012, p. 14) reviewed the literature related to this issue and concluded, “…results consistent with a centrality effect have been found every time they have been looked for” (italics in the original).
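A straightforward way to look for this pattern in a data set is to compare the standard deviation of the judges' estimates with the standard deviation of the empirical p-values, and to examine how the errors behave toward the extremes. The sketch below does this with invented numbers chosen only to show the shape of a centrality effect; it is not data from any study cited here.

```python
# Minimal sketch of a centrality check on hypothetical values.
import statistics

empirical_p = [0.25, 0.40, 0.55, 0.70, 0.85]   # hypothetical observed item p-values
judge_est   = [0.45, 0.50, 0.55, 0.60, 0.65]   # hypothetical estimates pulled toward the middle

print(f"SD of empirical p-values: {statistics.pstdev(empirical_p):.2f}")  # about 0.21
print(f"SD of judges' estimates:  {statistics.pstdev(judge_est):.2f}")    # about 0.07

# Errors are largest for the easiest and hardest items and smallest near the middle.
for p, e in zip(empirical_p, judge_est):
    print(f"empirical {p:.2f}  estimate {e:.2f}  error {e - p:+.2f}")
```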

In addition, the tendency of judges to cluster their estimates in tighter distributions than the actual values has been the subject of comment for almost as long as there has been systematic scientific investigation of standard setting results.

Lorge and Kruglov’s original (1953) study found a standard deviation of 16.3 for the judges’ estimates compared with 23.7 for the empirical difficulty values. Goodwin (1999), in her study of the results of a financial planner licensing exam, reported that the judges’ estimated p-values were “more homogeneous” than the actual results obtained from administering the items to candidates: the standard deviations of the estimates for the total group and for borderline examinees were .09 and .10, respectively, while the actual observed values were .19 and .18. Van de Watering and van der Rijt (2006) compared teachers’ and students’ estimates of difficulty values and found high rates of inaccuracy in both groups. Interestingly, their student group did not overestimate the difficulty of easy items, although they showed dramatic underestimation of difficult items. Teachers’ estimates of easy items showed much more centrality and systematically underestimated the easiest items.

More recently, Brian Clauser and his team have expanded on this theme in a series of attempts to determine which factors drive p-value estimates and how. Clauser et al. (2013) confirmed that providing judges with information about the empirical p-values of items was what produced the characteristic distribution of judges’ estimates in a standard setting. Mee et al. (2013) found that varying the instructions judges received could also affect their final cutscore. Clauser et al. (2014), using a generalizability theory framework, found that the greatest source of variability within the Angoff standard setting was between tables rather than between individual judges, suggesting that something like social influence could be affecting estimates within the tables.
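A rough sense of what a "between tables" effect means can be conveyed by splitting the variance in judges' ratings into a between-table component and a within-table component. The decomposition below is deliberately simplified and uses hypothetical ratings and table labels; it is not the generalizability analysis reported by Clauser et al. (2014).

```python
# Hypothetical illustration: partitioning rating variance between and within tables.
import statistics

# Mean Angoff rating for one item from each judge, grouped by table (all values invented).
tables = {
    "table_1": [0.62, 0.64, 0.63],
    "table_2": [0.48, 0.50, 0.49],
    "table_3": [0.71, 0.69, 0.70],
}

all_ratings = [r for ratings in tables.values() for r in ratings]
grand_mean = statistics.mean(all_ratings)

# Between-table variance: spread of the table means around the grand mean.
table_means = [statistics.mean(r) for r in tables.values()]
between = statistics.mean([(m - grand_mean) ** 2 for m in table_means])

# Within-table variance: average spread of judges around their own table's mean.
within = statistics.mean([statistics.pvariance(r) for r in tables.values()])

print(f"between-table variance: {between:.4f}")
print(f"within-table variance:  {within:.4f}")
# When the between-table component dominates, something operating at the table level
# (for example, social influence during discussion) is a plausible explanation.
```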


CHAPTER 3 METHODS

3.1 Materials

The data used in this study are drawn from standard setting meetings held at a Taiwan university (hereafter referred to as The University) to link a university-level English proficiency exam to the Common European Framework of Reference (CEFR) (see Appendix 1). The test used in this study, the English Proficiency Test (EPT), is an examination of English as a Foreign Language.

The EPT is a series of in-house language proficiency tests developed to meet the needs of the Practical English (PE) program adopted at The University. The EPT exams are multiple-choice exams that test a range of listening, reading and vocabulary skills in a number of different practical contexts. The PE program is divided into 8 sections, with students progressing through 2 sections of Practical English each year: PE 1 and 2 are taught to freshmen (1st year), PE 3 and 4 to sophomores (2nd year), PE 5 and 6 to juniors (3rd year), and PE 7 and 8 to graduating seniors (4th year). An outline of the test organization is detailed in Table 3.1 (Lavalle, 2012). Items on the EPT are linked to vocabulary suggestions contained in the PE textbooks. The CEFR was not taken into account when the items used in this standard setting were written; they are tied to topics covered in the textbooks rather than to the CEFR scales. The textbooks at The University were not designed with the CEFR in mind; matching the textbooks and the items written within the PE program to the standards of the CEFR is one of the goals of the standard setting.

Because the same items may appear on more than one PE test, The University maintains a strict control policy over them. As a result, no examples of items can be provided in this research.

Items for the EPT are written by the classroom teachers of the PE program under the supervision of test editors assigned by the school. Items are then sent to a proofreader and finally returned to the editors. The test editor returns the test to the school, which then prints and distributes the test forms to students. The various tests of the EPT are administered on a single day: all freshman students receive their test at the same time, all sophomores receive theirs at the same time but at a different time from the freshmen, and so on for the other year levels.

Following student examinations, test results are collated, sent to a test coordinator and calibrated with Winsteps Rasch modeling software (Linacre, 2012). All test items are placed on a single difficulty scale. Items are sorted by their point-biserial correlations and difficulty values, and stored in an item bank for later use. Currently, most items that appear on the EPT are drawn from this item bank, although teachers continue to write new items to expand the item bank.
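Although Winsteps performs the actual calibration, the two statistics mentioned here can be illustrated with a small amount of code. The sketch below shows the dichotomous Rasch model probability and a point-biserial correlation computed on a tiny invented response matrix; the data, item labels and parameter values are hypothetical and have no connection to the EPT item bank or to Winsteps output.

```python
# Illustrative only: the dichotomous Rasch model and a point-biserial index,
# computed on a tiny hypothetical response matrix (1 = correct, 0 = incorrect).
import math
import statistics

def rasch_p(theta, b):
    """Probability of a correct response given ability theta and item difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def point_biserial(item_scores, total_scores):
    """Correlation between a dichotomous item score and the total test score."""
    mean_total = statistics.mean(total_scores)
    sd_total = statistics.pstdev(total_scores)
    p = statistics.mean(item_scores)            # proportion answering the item correctly
    q = 1 - p
    mean_correct = statistics.mean(
        t for s, t in zip(item_scores, total_scores) if s == 1
    )
    return (mean_correct - mean_total) / sd_total * math.sqrt(p / q)

# Five examinees by three items, all values invented.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]
totals = [sum(row) for row in responses]
item_1 = [row[0] for row in responses]

print(f"item 1 point-biserial: {point_biserial(item_1, totals):.2f}")    # about 0.77
print(f"P(correct | theta = 0.5, b = -0.2) = {rasch_p(0.5, -0.2):.2f}")  # about 0.67
```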

The test items used in this standard setting were drawn from several different midterm examinations. All items had been calibrated onto a single scale using Rasch modeling. This standard setting project was designed to establish cutscores along the scale used to calibrate all items in the item bank and not along a raw score scale corresponding to a single test form.

Accordingly, the test form used in the project was actually a composite, with its items drawn from a series of different test forms administered during the midterm examination period for first-, second-, third- and fourth-year students. The tests shared a number of common items illustrated in Table 3.1, which were used to equate them and calibrate them together onto the same scale.
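Because the composite form pulls together items calibrated on different test forms, the common items are what allow everything to sit on one scale. One simple linking approach is a mean-shift adjustment based on the shared items; the sketch below illustrates that idea with hypothetical Rasch difficulties and item names, and is not a description of the equating actually carried out for the EPT.

```python
# Hypothetical mean-shift linking of two forms via their common items.
import statistics

# Rasch difficulties (logits) for the common items, as estimated on each form (invented).
common_on_form_a = {"item_01": -0.40, "item_02": 0.10, "item_03": 0.85}
common_on_form_b = {"item_01": -0.10, "item_02": 0.45, "item_03": 1.10}

# Linking constant: how much Form B's scale must shift to match Form A's.
shift = statistics.mean(
    common_on_form_a[i] - common_on_form_b[i] for i in common_on_form_a
)

# Apply the shift to Form B's unique items to place them on the Form A scale.
form_b_unique = {"item_10": 0.20, "item_11": -0.75}
form_b_on_a_scale = {i: b + shift for i, b in form_b_unique.items()}

print(f"linking constant: {shift:.2f}")   # -0.30 with these invented values
print(form_b_on_a_scale)
```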

Table 3.1.
Contents of the English Proficiency Test (EPT)
[Table columns: Skill, Question Type, Description, Items, Time]

3.2 Judges

Judges were selected primarily from faculty and staff of The University. Several external judges were selected to provide diversity to the standard setting decisions. These judges were selected because of their experience teaching students at similar universities in Taiwan. Two of the external judges were faculty members at universities in the Taipei area and one was a doctoral candidate at another university but had taught remedial classes for the university at which she was studying. Table 3.2 (Lavalle, 2012) provides a summary of the judges and a brief description of the background of each.

Table 3.2.
Summary of the Judges

Judge   Gender   NS/NNS   Background
Agf11   F        NNS      Administrator, former teacher
Agf12   F        NNS      Teaching Assistant, recently graduated student
Agf13   F        NNS      Teacher
Agf22   M        NS       Teacher, External University
Agf23   F        NNS      Teacher
Agf33   F        NNS      Administrator, Teacher
Agf34   F        NNS      Administrator, Teacher
Agf35   F        NNS      Teacher
Agf36   F        NNS      Teacher

F = female, M = male, NS = native English speaker, NNS = non-native English speaker

3.3 Procedures

A one-day training/orientation session was held on Saturday, July 10, 2010 for all the participants. The judges were then separated into three panels, which met on Monday, July 12, Wednesday, July 14, and Friday, July 16, 2010. The panels were conducted on three separate days to ensure that proper procedures were followed, particularly during the discussion period: the facilitator of the discussion sessions acted as the moderator for each of the panels, which required that the panels meet on separate days. A group of six judges participated on each day.

Introduction to Training

As noted in Table 3.1, the test form presented to each of the panels was a composite drawn from tests in the EPT series. The items came from test forms administered as part of the annual EPTs for all four year levels of the program, and the composite differed slightly from the EPT exam described earlier. Table 3.3 (Lavalle, 2012) summarizes the question types used in the composite form that each of the judges worked with. The form itself was composed of a listening section and a reading section.

Table 3.3.
Contents of the Test Form Used in the Standard Setting

Listening
    What’s Next?              16 items
    Dialogues                 12 items (3 listening texts)
    Extended Listening        12 items (3 listening texts)

Reading
    Fill in the Blank         10 items
    Text Completion           16 items
    Reading with Questions    14 items


To acclimatize judges to the difficulties encountered in taking the test and to give them the experience of taking the exam, a training form was created with the same format as the regular exam (Loomis, 2012). The test form used in the operational standard setting did not contain the scripts for the listening passages, so a separate form was created for the listening test that contained both the listening scripts and the associated items. In the training session, judges took the test using the training form. During the operational standard setting of the listening test, judges were not able to hear the recorded version of the questions but were provided with the scripts of the listening questions instead. The week prior to the training session, an email was sent to all judges that contained:

1. an introductory letter with a link to a CEFR familiarization website, www.CEFtrain.net;
2. an agenda for the training session, consisting of adapted versions of pages 33-36 from the CEFR (2009);
3. the training materials: the listening and reading components of the CEFR (2009) self-assessment grid (CEFR Table 2), and a link to the website;
4. two forms collecting personal information and agreements concerning test security and informed consent for the research portion of the project (see Appendix 2 and 3).

As homework, judges were asked to refer to the website and the level summaries, and to use the self-assessment grid to assess themselves (in any second language) and their students in terms of the CEFR levels (Council of Europe, 2009).

Training of judges was extremely conventional and followed suggestions given in such authoritative sources as Cizek (2001), Cizek and Bunch (2007), and the Council of Europe (2009).

Day 1 of Training: Introduction to the CEFR

On the first day of training, judges were given a brief PowerPoint presentation explaining the purpose of the project, a description of the EPT, and an explanation of how it was developed and validated. Following guidelines provided by Cizek (2001), Cizek and Bunch (2007) and the Council of Europe (2009), a great deal of effort was expended during training to familiarize judges with the descriptors used for the panels. The judges then took part in a CEFR familiarization process. After a brief description of the CEFR, their understanding of the descriptors was tested.

Judges were given a sheet containing the descriptors from the CEFR Table of Global Descriptors. The descriptors had been rearranged, and the judges were asked to sort them back into the correct order (first individually, then in pairs). After being provided with a copy of the original CEFR table and discussing the correct answers, the judges were asked to take out their ‘homework’ activity, in which they had rated their own ability and that of their students using the CEFR levels, and to discuss their answers in pairs.

PLD Test of CEFR Descriptors

The session then shifted to the CEFR reading Performance Level Descriptors (PLDs). The first activity was another sorting activity, in which judges were asked to sort 20 CEFR reading and 20 CEFR listening descriptors from CEFR levels A1 to B2. They were then given a sheet containing CEFR reading descriptors from the scales used in the study, for CEFR levels A1 to B2. These descriptors were in a randomized order. Judges were asked to sort the descriptors into an order from least difficult to most difficult that they felt made the most intuitive sense and assign a CEFR competency level to each descriptor.


The training for the listening PLDs was conducted in parallel fashion. Judges were asked to individually sort 20 CEFR listening descriptors taken from the CEFR A1 to B2 levels. After they finished, the correct answers were provided, along with a full list of the listening PLDs from the scales used in the study. The scores on these activities were recorded and analyzed later as a measure of how well each judge could use the CEFR descriptors for levels A1 to B2.
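The scoring rule for these sorting activities is not spelled out here, so the snippet below shows only one plausible way of turning a judge's answers into a score: counting exact level matches against the key and averaging how many levels each descriptor was misplaced by. The descriptor labels and responses are hypothetical and are not the scheme or data used in this study.

```python
# Hypothetical scoring of a descriptor-sorting task (not the scheme used in this study).
import statistics

# CEFR levels mapped to ranks so misplacement distances can be computed.
levels = {"A1": 1, "A2": 2, "B1": 3, "B2": 4}

# key: the intended level of each descriptor; judge_sort: one judge's answers (invented).
key        = {"d1": "A1", "d2": "A2", "d3": "A2", "d4": "B1", "d5": "B2"}
judge_sort = {"d1": "A1", "d2": "A2", "d3": "B1", "d4": "B1", "d5": "B2"}

# Exact-match score: how many descriptors were placed at exactly the right level.
exact = sum(judge_sort[d] == key[d] for d in key)

# Distance score: on average, how many levels away from the key each placement was.
mean_distance = statistics.mean(
    abs(levels[judge_sort[d]] - levels[key[d]]) for d in key
)

print(f"exact matches: {exact} / {len(key)}")                  # 4 / 5
print(f"mean misplacement (in levels): {mean_distance:.2f}")   # 0.20
```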

Discussing Difficulty

After a break for lunch, judges took the practice test described above. The judges were then divided into three groups of six, corresponding to the operational panels, and were asked to sit together in a circle with the other members of their standard setting panel. A group leader was chosen, and each panel was asked to go through the test form item by item, discussing as a group what knowledge, skills and abilities were required to answer each item and how the items differed in terms of difficulty. One hour and fifteen minutes was allotted to this task. The discussions were recorded by the facilitators for later transcription.

Discussing the Barely Proficient Student

Following this activity, the judges were introduced to the concept of the barely proficient B1 student (B1 BPS). They were then given a form which contained space for their notes on the BPS and told to refer to their listening and reading PLDs for the A2 and B1 levels, and summarize on the forms what they felt to be the key characteristics of a B1 BPS for both listening and reading.

They were then asked to discuss their summaries in pairs or small groups. This was the final training activity of the day. Judges were then given the opportunity to ask any questions they had about what had been discussed to that point. They were told that when they returned for the actual meeting, they would complete one training round before performing the actual standard setting. This concluded Day 1 of the training session.
