
CHAPTER THREE METHODS

3.3 Angoff Standard Setting

In the Angoff study, 18 judges were asked to set a cut score for the B1 level of listening and reading proficiency on the Common European Framework of Reference proficiency scale. A one-day training/orientation session was held on Saturday, July 10, 2010, and three separate Angoff meetings were held on Monday, July 12, Wednesday, July 14, and Friday, July 16. The participants and procedures are described below.

3.3.1 Participants

The 18 Angoff judges were chosen for their experience in language learning/teaching and assessment in a Taiwanese university setting. Amongst judges with such experience, diverse perspectives were sought. Thus, an attempt was made to include native and non-native English-speaking teachers at the university, individuals with both EFL teaching experience and administrative responsibilities within the university, teachers from outside institutions, and recently graduated students currently employed in the EFL field. All 18 judges received training on the same day; they were then divided into sub-panels of six members each to perform the actual standard setting. The composition of the three Angoff panels is presented in Table 3.3.

Table 3.3. Angoff Judges

Judge   Gender   NS/NNS   Role                                             Panel (Day)
J01     F        NNS      Administrator, former teacher                    I (Mon)
J04     F        NS       Teacher                                          I (Mon)
J05     F        NNS      Teacher                                          I (Mon)
J06     F        NNS      Teaching Assistant, recently graduated student   II (Wed)
J07     M        NS       Teacher                                          II (Wed)
J08     M        NS       Teacher, External University                     II (Wed)
J09     F        NNS      Teacher                                          II (Wed)
J10     M        NNS      Teacher                                          II (Wed)
J11     F        NNS      Teaching Assistant, recently graduated student   II (Wed)
J12     M        NS       Teacher                                          III (Fri)
J13     F        NNS      Administrator, Teacher                           III (Fri)
J14     F        NNS      Administrator, former teacher                    III (Fri)
J15     F        NNS      Teacher                                          III (Fri)
J16     F        NNS      Teacher, External University                     III (Fri)

*NS = ‘native speaker’; NNS = ‘non-native speaker’.

3.3.2 Procedures

A group of six judges participated on each day. (The procedure was conducted on three separate days to ensure that proper procedures were followed, particularly during the discussion period. In some implementations of the Angoff procedure, larger groups are accommodated by having ‘table leaders’, who are themselves participants, take charge of the discussion. Given the absence of experienced table leaders, it was decided that it would be preferable to have the facilitators present for each discussion session. This required having the groups meet on separate days.)

Five days prior to the training session, an email was sent to all judges containing an introductory letter, an agenda for the training session, the training materials, and two forms collecting personal information and agreements concerning test security and informed consent for the research portion of the project. The introductory letter contained a link to a CEFR familiarization website, <www.CEFtrain.net>. The training materials consisted of pages 33-36 from the CEFR (in slightly adapted form), which summarize the CEFR levels; the listening and reading components of the CEFR self-assessment grid (CEFR Table 2); and a link to the website. As homework, judges were asked to refer to the website and level summaries, and to use the self-assessment grid to assess themselves (in any second language) and their students in terms of the CEFR levels. These homework tasks were developed after referring to page 18 of the CEFR linking manual (Council of Europe, 2009).

On the day of the training, judges were given a brief PowerPoint presentation explaining the purpose of the project, describing the EPT, and explaining how it was developed and validated. They then commenced the CEFR familiarization process. After a brief description of the CEFR, they were given a sheet containing the global level descriptors from the CEFR Table of Global Descriptors. The descriptors had been rearranged, and the judges were asked to sort them back into the correct order (first individually, then in pairs). After receiving a copy of the original CEFR table and discussing the correct answers, the judges were asked to take out their ‘homework’ activity, in which they had rated their own ability and that of their students using the CEFR levels, and to discuss their answers in pairs.

The session then shifted to the performance level descriptors (PLDs) for the CEFR reading scale. (The CEFR scales used for the development of both the reading and listening PLDs are listed in Appendix B, along with the global descriptors, which provide a general idea of the content of the different levels.) The first activity was another sorting activity, in which judges were asked to sort (individually, and then in pairs) 20 CEFR reading descriptors from CEFR levels A1 to B2. Then they were given the CEFR reading descriptors from the scales used in the study, for CEFR levels A1 to B2. Next, judges were given a 13-item reading test, taken from training material made available by the Council of Europe (CoE). For each item, the judges were asked first to attempt to answer the item and then to assign the item to a CEFR level, based on the skills required to answer it correctly. After sharing their answers in pairs, the judges were shown the answers from the original CoE study, which were then discussed. This concluded training for the reading PLDs.

The training for the listening PLDs was conducted in parallel fashion. Judges were asked, individually and in pairs, to sort 20 PLDs from the CEFR A1 to B2 levels. After they finished, correct answers were provided, along with a full list of the listening PLDs from the scales used in the study for levels A1 to B2. Judges were then given a 6-item listening test, taken from the CoE training material mentioned above, and asked to attempt each item and then assign a CEFR level to it based on its perceived difficulty. Judges shared their answers in pairs, and the recommended answers from the CoE study were then shown and discussed by the whole group. This concluded training for the listening PLDs.

After a break for lunch, the judges took both tests. The purpose of this was to enhance their ability to judge what makes items difficult by experiencing the items, to some extent, from a student’s perspective. The judges were then asked to sit together in a circle with the other members of their standard setting panel. A group leader was chosen, and each group was asked to go through the test form, item by item. As a group, they were asked to discuss what knowledge, skills, and abilities were required to answer each item, and how the items differed in terms of difficulty. One hour and fifteen minutes was allotted to this task. Following this activity, the judges were introduced to the concept of the barely proficient B1 student (B1 BPS). They were then given a form which contained space for their notes on the BPS. They were asked to refer to their listening and reading PLDs for the A2 and B1 levels, and to summarize on the forms what they felt were the key characteristics of a B1 BPS for both listening and reading. They were then asked to discuss their summaries in pairs or small groups.

This was the final training activity of the day. Judges were then given the opportunity to ask any questions they had about what had been discussed to that point. They were then told that when they returned for their standard setting meeting, they would complete one practice round before performing the actual standard setting. This concluded the training session.

At each standard setting meeting, standards were set for the reading test in the morning and the listening test in the afternoon. Before beginning, judges were given a brief review of the notion of the B1 BPS. They were then asked to estimate, based on their knowledge of students in the university’s English program (or of Taiwanese university students in general, for judges from other universities), the percentage of students who had reached the B1 level for the skill in question. They were asked to write this estimate down. Then, the test form and the round 1 rating form for the reading test were distributed to the judges. The rating form contained a single column for each item being rated, with each column containing a list of probabilities in increments of 0.1, ranging from 0.1 to 0.9, with a space between each figure (see Appendix C). Judges were asked to “circle or insert” the probability that a just-B1 level student would answer the item correctly, and to write their answer at the bottom of the column. Judges were then given a practice round, in which they were asked to pencil in their ratings for the first couple of items. It was made clear that this was simply a practice round, designed to ensure that they understood the procedure, and that they could change their answers later. The facilitators circulated from judge to judge while they were performing the practice round, to make sure the procedures were understood. Once all judges had finished, they were asked if there were any remaining questions. After questions were answered, the first round of ratings was conducted.

Two further rounds were conducted. Between each round, the judges were shown empirical p-values from the actual administration of the test. Between the second and third rounds, judges were also shown the number of students who would reach B1 based on the cut score they (as individuals) had set in the second round. Judges were also asked to share and discuss their ratings between rounds. Since the purpose of the present study is to assess whether the assumptions of the method hold in the absence of such feedback data, only the first round of results is used for the main analyses conducted below.
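To make the mechanics of the round-1 ratings concrete, the sketch below shows one common way an Angoff cut score can be derived from such ratings: each judge's item probabilities are summed to give that judge's recommended raw cut score, and the panel value is taken as the mean across judges. This is an illustrative sketch under common Angoff conventions, not the exact computation used in the study; the judge labels and ratings shown are hypothetical.

```python
# Minimal sketch of a round-1 Angoff cut score computation (illustrative only).
# Each judge rates, for every item, the probability that a just-B1 student
# answers it correctly (values 0.1-0.9, as on the rating form).

# Hypothetical ratings: {judge: [p_item1, p_item2, ...]}
ratings = {
    "J01": [0.6, 0.4, 0.8, 0.5],
    "J04": [0.7, 0.3, 0.9, 0.6],
    "J05": [0.5, 0.5, 0.7, 0.4],
}

# Each judge's recommended raw cut score = sum of that judge's item probabilities.
judge_cuts = {judge: sum(probs) for judge, probs in ratings.items()}

# Panel cut score = mean of the judges' recommended cut scores.
panel_cut = sum(judge_cuts.values()) / len(judge_cuts)

for judge, cut in judge_cuts.items():
    print(f"{judge}: recommended cut = {cut:.1f} out of {len(ratings[judge])} items")
print(f"Panel round-1 cut score: {panel_cut:.2f}")
```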

The same procedures were followed for the listening test. (Since the test form did not contain the scripts for the listening passages, a separate form was created for the listening test which contained both the listening scripts and the associated items for the judges to refer to during the actual standard setting meeting.)

3.4 Analysis

Scoring data were fitted to a Rasch rating scale model using the Facets software program (Linacre, 2009). For both the reading and listening exams, scaling and parameter estimation were carried out twice, to estimate parameters for two distinct frames of reference: the ‘internal’ frame of reference constructed by the Angoff judges, and the ‘external’ frame of reference using the item difficulty information from the administered exams. The binomial model was used to estimate parameters (Eckes, 2009; Engelhard & Anderson, 1998; J.M. Linacre, personal communication, July 7, 2010). Following Engelhard & Anderson (1998), the 101 possible category levels corresponding to the probability estimates were collapsed into an 11-point scale by recoding the counts obtained from the judges as follows:

0-5 = 0
6-15 = 1
16-25 = 2
26-35 = 3
36-45 = 4
46-55 = 5
56-65 = 6
66-75 = 7
76-85 = 8
86-95 = 9
96-100 = 10

Although some information is lost through recoding, the empty cells and small numbers of counts in many other cells would otherwise make MFRM analysis difficult.
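For illustration, a brief sketch of this recoding is given below, assuming each judge's probability rating has already been expressed as a count out of 100; the function name is hypothetical.

```python
def recode_to_11_point(count):
    """Collapse a 0-100 probability count onto the 11-point scale
    used for the MFRM analysis (after Engelhard & Anderson, 1998)."""
    if count <= 5:
        return 0
    if count >= 96:
        return 10
    # 6-15 -> 1, 16-25 -> 2, ..., 86-95 -> 9
    return (count + 4) // 10

# Quick check against the recoding table above.
assert [recode_to_11_point(c) for c in (0, 5, 6, 15, 16, 50, 95, 96, 100)] == \
       [0, 0, 1, 1, 2, 5, 9, 10, 10]
```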

The analysis was conducted in two phases. The first phase tested the assumption underlying the use of the MFRM for detecting rater effects by comparing results from the internal and external frameworks. The second phase used these indices to test the assumptions of the modified Angoff method.
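As a minimal illustration of the kind of framework comparison made in the first phase, the sketch below correlates per-rater severity estimates from the two frames of reference; it assumes the estimates have already been exported from the two Facets runs, and the judge labels and values shown are hypothetical.

```python
# Illustrative comparison of rater severity estimates from the two frames of
# reference (internal vs. external); the values here are made up for the sketch.
from statistics import mean

internal = {"J01": 0.42, "J04": -0.15, "J05": 0.08, "J06": -0.51, "J07": 0.27}
external = {"J01": 0.35, "J04": -0.22, "J05": 0.14, "J06": -0.44, "J07": 0.31}

judges = sorted(internal)
x = [internal[j] for j in judges]
y = [external[j] for j in judges]

# Pearson correlation as a simple index of agreement between the frameworks.
mx, my = mean(x), mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5

print(f"Agreement between internal and external severity estimates: r = {r:.2f}")
```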

3.4.1 Evaluating the Assumption of the MFRM

In the first phase, parameters were estimated for the internal and external frameworks. For both frameworks, values for the latent trait indices used to detect rater effects were generated for each rater. Results from the internal and external frameworks were then compared to determine their level of agreement. As a further check, ‘classical’ or ‘raw score’ indices for detecting rater effects were also included and used to assess the performance of the different latent trait indices. The reading test was analyzed first, followed by the listening test, to see whether the results were replicated. Details for each rater effect are as follows.