An Evaluation of Judges in an Angoff Standard Setting


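Doctoral dissertation, Department of Educational Psychology and Counseling, National Taiwan Normal University. Advisor: Dr. Chen Po-Hsi. An Evaluation of Judges in an Angoff Standard Setting (Angoff 標準設定之判斷者的評估). Graduate student: Michael Scott Sommers (張夏石). July 2017.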


ACKNOWLEDGMENTS

There are many things I’ve wanted to say over the years about my life here that I have never been able to say without sounding awkward or clumsy. And now, here in this place where almost no one will read them, I will finally try to put these words to rest. This dissertation was possible only because there is a place like Taiwan. Taiwan is a place where no one thinks it’s strange or unusual that you would want to read academic books for fun or conduct academic research because it just seems like the best way to spend your free time. There have been many countries created by guns and power. Many nations have been constructed through their workers, managers and the economies they have built. But the idea that there is a modern, free, and affluent society invented by men and women who value the search for, recording, and teaching of knowledge above all else is a strange idea that I think can be found today in only one place across this globe, and that place is Taiwan. In the 20 years or so that I have lived here, I have never been refused admission to a library or forbidden from reading any books. Even some of the most expensive books in the country have been made available to me without question. Curators of private libraries and their rare contents, as well as libraries at the most prestigious public institutes of learning, have never questioned my interest in their books. I have received extensive help finding books and rare collections of reading material that are now only available on microfilm. It was as if the mere fact that I wanted to know what was in their books made it impossible for the keepers of these volumes to refuse me access.

So it was that without the help of the government, the people of Taiwan, and their combined attitude toward education, research and learning, I could never have been here or done the work that you are about to read. It is to all the people that made a country where these principles are valued, perhaps like no other place in the world, that I dedicate this research, and where my long road of acknowledgments and my personal thanks must begin. I have to thank the National Taiwan Normal University (NTNU), where I conducted this research, and my employer, Ming Chuan University (MCU). At NTNU, the College of Education and the Department of Educational Psychology and Counseling are part of a proud institution, and they are deservedly so. For this, I’d like to thank my advisor Dr. Chen Po-Hsi. I would also like to thank Dr. Lin Sieh-Hwa particularly for his help with and teaching of many difficult concepts in measurement and modern mental testing, and Dr. Song Yao-Ting for the difference his class and tests made in my understanding of experimental design. Finally, it was only through the help and guidance of Grace Lin that I was able to manage the bureaucracy of the National Taiwan Normal University.

During the time it took to finish my classes and produce this dissertation, I taught at Ming Chuan University in Taipei and Taoyuan. My experience with the school has only been positive. For that I have to thank my directors Ada Hong and Dr. Chris Liu. It is important that I also thank Dolly Chang, as both she and Ada Hong were responsible for arranging a teaching schedule and other aspects of my job that made it easier to finish my research work at NTNU while still meeting my work responsibilities at MCU. Many of my colleagues at MCU were helpful as friends and acted as a positive inspiration on my work. However, one of my colleagues played a very special role in my work and as such deserves special mention. Dr. Joe Lavallee is not only my colleague and friend; he is an alumnus of the Department of Educational Psychology and Counseling. It was Dr. Lavallee who found the program and its professors. He is the first alumnus of the department who entered through the pathway that I also entered. His help and inspiration throughout my studies were not only invaluable, they were irreplaceable.

A senior manager of the school who made my work possible is Dr. Nelly Chuang. At the time, Dr. Chuang was the Dean of Research and Development at Ming Chuan University and allotted the money used to create the data in this dissertation. Her assistance in obtaining this money, and her flexibility in allowing us to use it in the way we felt we needed, are greatly appreciated. I also owe a debt of gratitude to Dr. Bao De Ming, the founder of MCU, and her son Dr. Lee Chuan, who is now the president of MCU. Their leadership and emphasis on faculty development made it possible for this research to happen in a way that it could not have in other schools.

There are many others whose help was important, not because of the academic knowledge it brought or the freedom it gave me to do the work, but because these people provided me with friendship, love, and inspiration during the years it took to produce all the different parts of this research. My daughter, little Olyvia “Cookie” Sommers, is the most important person in the world to me. Every day while I worked on this project, I thought of her, I looked at her pictures, and thought of the strange and funny words that come from the mind of a 4-year-old. She is the star of my life and the little person who gives me meaning every single day. I also thank Dr. Ann Heylen, who has been my friend and colleague for decades, May Chen, Paul Jackson, Hans Thom, Glenn Pluckhan, Rodney Szasz, Quentin Brand, my training partners and coaches at Taiwan BJJ, Vaughn Anderson, Professor Makoto Ogasawara, Dr. Warren Wang, Professor Andy Wang, and my family: my father Doug Sommers and my mother Sonya May, my brother Dr. Jeff Sommers and his partner Joanne Hochu, and my sister Megan. And finally, the man who saved me physically and kept me out of jail and from being deported, Hung Ming Hsieh.

Of all the named and unnamed people who contributed to my education and work, some were more important than others in helping me, leading me through serious academic troubles, while others were there for me to make sure that I stayed sane and on track. But two of my teachers were different: Norm Cameron and Dale Beyerstein.

And finally there is one point that made this dissertation special and challenging and horrible beyond belief. During the production of a dissertation things happen that are unexpected. They are not people or things that you can touch or even ideas. They can be good, and they can be bad. I suppose you could call these things luck, but since there is no such thing, a more appropriate term might just be random events. They fill your dissertation life, even as you don’t know they’re happening, and there’s nothing you can do about them except hope like some superstitious mountain man that ‘things go your way’, as he rattles his bones or picks his lucky numbers. So it is with the greatest of caution I tell the writers of the billions of dissertations that are certain to follow mine, there are many things you don’t want to have happen while your dissertation is whirling around you. You can get married and finish your dissertation with both love and knowledge. You can use drugs. I don’t recommend this, but many dissertations have been finished under the influence of mind-altering substances. You can move your house, have babies, lose important documents or even take on too much work to really do a good job. But never, never, never get divorced while you are writing your doctoral dissertation.

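An Evaluation of Judges in an Angoff Standard Setting. Michael Scott Sommers (張夏石).

Abstract (Chinese)

In a standard setting, expert judges match Performance Level Descriptors (PLDs) to scores on a standardized test and use them to classify students' performance. This procedure often determines what scores mean for students and how policy makers use the test, for example in Pass/Fail or Excellent/Average/Fail decisions; that is, these decisions are closely tied to the evaluations made by standard setting judges. In a typical standard setting, panels of expert judges are trained, evaluate whether examinees at a given performance level could answer the test items correctly, and then discuss their judgments with one another. The organizers of the standard setting provide feedback so that judges understand how their decisions would affect examinees' pass and fail rates and the other uses of the test. In addition, throughout the standard setting, judges are asked to give self-reports of their familiarity with and confidence in their understanding of the concepts and ideas introduced in training, and of whether they are applying them correctly. The Angoff method is one of the most widely used methods for setting cutscores. In this method, panels of expert judges make judgments about the ability of students to correctly answer test items presented one at a time. Despite the importance of this procedure, little is still known about how to prepare judges for their role in a standard setting. Data for this study were gathered from a standard setting that matched items from a foreign language test developed locally at a university in Taiwan with the Common European Framework of Reference (CEFR). Both a listening panel and a reading panel were conducted.

The study evaluated two commonly used methods of preparing judges for an Angoff standard setting and their relationship with judge accuracy. Judge accuracy was measured by the p-value correlation, the Root Mean Square Error (RMSE), and the Cutoff Score Judgment (CSJ). In the first evaluation, judges were trained in the PLDs and then tested on their knowledge of the PLDs, and the test results were compared with judge accuracy. In the second evaluation, familiarity with and confidence in the concepts and ideas introduced during training were correlated with the measured accuracy of the judges; no relationship was found between familiarity or confidence and the final measured accuracy. Beyond the main findings, it was further observed that the exact wording of the instructions is very important to the accuracy of the judges. It was also observed that RMSE and CSJ led to different decisions about accuracy than the p-value correlation. Conclusions and suggestions for future research on the training of Angoff standard setting judges are offered, and the limitations of the study are noted.

Keywords: Angoff, judges, standard setting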

An Evaluation of Judges in an Angoff Standard Setting. MICHAEL SCOTT SOMMERS (張夏石).

Abstract

In a standard setting, groups of expert judges evaluate verbal descriptions of performance (Performance Level Descriptors or PLDs) contained in a standard and match these with scores on a standardized test that place students in categories of performance. This procedure is often used to make decisions about what scores mean for the students and policy makers who use the tests. For example, Pass/Fail decisions, as well as Excellent/Average/Fail decisions, are often tied to how tests are evaluated by standard setting judges. In a typical standard setting, panels of expert judges are trained, evaluate test items, and are then given time to discuss their results with other judges. Feedback is provided by standard setting organizers that allows judges to know how their decisions would affect students' Pass/Fail rates and other decisions the test will be used to make. In addition, throughout the standard setting, judges are asked to give self-reports about their familiarity with and confidence in their understanding of the concepts and ideas introduced during the training, and about whether or not they are applying them correctly. The Angoff standard setting method is one of the most widely used methods for setting cutscores. In this method, panels of expert judges make judgments about the ability of students to correctly answer test items listed one at a time. Despite the importance of this procedure, little is known about how best to prepare judges for their role in the standard setting. Data were gathered from a standard setting held at a Taiwan university to match items from a locally developed foreign language test with the Common European Framework of Reference (CEFR). The study then evaluated two commonly used methods to prepare judges for an Angoff standard setting and their relationship with judge accuracy. Both a listening and a reading panel were conducted. Accuracy of judges was measured by the p-value correlation, the Root Mean Square Error (RMSE), and the Cutoff Score Judgment (CSJ). For the first evaluation, judges were trained in the PLDs and then tested on their knowledge of the PLDs, and the test scores were compared with the three measures of judge accuracy. No relationship was found between tested knowledge of the PLDs and judge accuracy. The second evaluation correlated familiarity with and confidence in the concepts and ideas introduced during the training period with the measured accuracy of the judge. Once again, no relationship was found between familiarity or confidence and the final measured accuracy of the judge. In addition to the main findings, it was also observed that the exact wording of the instructions to judges is very important to the accuracy of the judges. RMSE and CSJ were observed to make different decisions about accuracy than the p-value correlation. Future directions for research on the training of Angoff standard setting judges are suggested, as are the limitations of this study.

Keywords: Angoff, judges, standard setting

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT (CHINESE)
ABSTRACT (ENGLISH)
TABLE OF CONTENTS
LIST OF TABLES
CHAPTER 1 INTRODUCTION
    1.1. Significance of the Current Study
    1.2. Research Questions
    1.3. Terminology
CHAPTER 2 LITERATURE REVIEW
    2.1. Standard Setting Method
    2.2. The Angoff Method
    2.3. Training and the Angoff Standard Setting Method
    2.4. Problems with the Angoff Method
CHAPTER 3 METHODS
    3.1. Materials
    3.2. Judges
    3.3. Procedures
    3.4. Assessment Tools
    3.5. Assessment Expectations
    3.6. Data Analysis
CHAPTER 4 RESULTS
CHAPTER 5 CONCLUSIONS & DISCUSSION
    5.1. Summary of Results
    5.2. Other Important Findings
    5.3. Future Research Directions
    5.4. Limitations of the Present Study
REFERENCES
APPENDIXES
    Appendix 1. Common European Framework of Reference - Global Scale
    Appendix 2. Informed Consent Form
    Appendix 3. Security Form
    Appendix 4. Angoff Panelist Record Form
    Appendix 5. Panelist Information Form
    Appendix 6. PART I. Procedures
    Appendix 7. PART II. Common European Framework
    Appendix 8. PART III. The University Practical English Test
    Appendix 9. Review of Standard Setting Procedures
    Appendix 10. Angoff Standard Setting. Final Evaluation
    Appendix 11. Cutscore statistics for the Standard Setting - reading
    Appendix 12. Cutscore statistics for the Standard Setting - listening

LIST OF TABLES

Table 3.1. Contents of the English Proficiency Test (EPT)
Table 3.2. Angoff Judges
Table 3.3. Contents of the Test Form Used in the Standard Setting
Table 4.1. Measures of Judge Accuracy - Reading
Table 4.2. Measures of Judge Accuracy - Listening
Table 4.3. Correlation between PLD Test and Standard Setting
Table 4.4. Matrix Correlation for Measures of Reading Ability
Table 4.5. Matrix Correlation for Measures of Listening Ability
Table 4.6. Within-judges Correlation between Estimates for p-value Correlations and the Squared Residual Value - Reading
Table 4.7. Within-judges Correlation between Estimates for p-value Correlations and the Squared Residual Value - Listening
Table 4.8. Correlation of Measures of Judge Accuracy and Self-report Surveys for Round 3

CHAPTER 1 INTRODUCTION

1.1. Significance of the Current Study

This research is about the training and preparation of judges for the Angoff method of standard setting. In particular, its goal is to fill a gap in the current research concerning the efficacy of training and procedures for the method. There currently exists little information that addresses questions about how the training and procedures affect the performance of Angoff judges in a standard setting.

Sometime during China’s Han Dynasty, it was discovered that a sample of someone’s knowledge about a particular subject could be used to estimate the total amount of knowledge known by that person on that subject. This was the discovery that became what we now call ‘the test’ (Elman, 2000). Over thousands of years, testing has spread across the globe and its use has become ubiquitous in selecting the most suitable candidates. Despite this, its design has remained largely unchanged for almost all of those thousands of years. It was not until the 1960s and 70s that the question of what a score means about the test-taker placed demands on the test that it could not yet handle (Glaser, 1963). With this pressure came the realization that understanding of this new idea of testing lagged far behind the actual practice. The decades that followed, the 1990s and 2000s, saw an explosion in work on this problem, and standard setting became the established method for determining how the scores on a test would be understood. This procedure came to be a key aspect of what is now called criterion-referenced testing.

The design of a criterion-referenced test has now become highly standardized. Manuals and standard designs dominate training, procedures, and materials (Egan et al., 2012; Loomis, 2012) through national organizations and guidelines that define how a ‘quality’ standard setting is conducted. The current complexity with which standard setting operates leaves an observer feeling that a high level of ability has been obtained, and that standard setting is an advanced procedure. Sophisticated methods in standard setting, such as vertically-moderated standard setting (Huynh & Schneider, 2005; Lissitz & Huynh, 2003; Lissitz & Wei, 2008), are now used to produce results (Raymond & Reid, 2001). Despite this, many aspects of the standard setting have yet to be explored. Very little is known about the implications of the training and procedures of standard setting judges. All standard setting methods call for training and procedures, and some of these are quite complex. It seems strange to say that, even for the most widely used methods, there is little understanding about the ways in which training methods and procedures prepare participants in the standard setting to perform their tasks, and that the reasons to believe that someone has been adequately trained to set the standard are drawn largely from their face validity (Holden, 2010).

The purpose of this study is thus to examine the relationship between conventional training procedures for the Angoff standard setting method and the final outcome of the standard setting. The idea that training and the procedures of the method should have a positive effect on judges’ ability is almost too obvious to state. Yet because of the lack of real data on the subject matter, it is not entirely clear that this is true and in what way it could be true. The study that follows is an attempt to clarify this with empirical data drawn from an actual operational Angoff method of standard setting.

1.2. Research Questions

The analysis conducted in this study seeks to address the following research questions:

1. Do knowledge of and training in Performance Level Descriptors (PLDs) work effectively to predict an Angoff standard setting judge’s ability?
2. Do self-report measures of familiarity and confidence with one’s knowledge of procedures and materials work effectively to predict an Angoff standard setting judge’s ability?

1.3. Terminology

PLD - The abbreviation for Performance Level Descriptor. PLDs are verbal descriptions of what a candidate can, and sometimes cannot, do at a particular score on a given test.

Social Influence - Judges in a standard setting may be affected by a range of factors. Social influences refer to those influences that originate in the personality and individual differences between judges. These may vary from judge to judge, in contrast to influences that are, for example, demographic or procedural.

Familiarity - Familiarity refers to the degree to which a judge feels his or her exposure to the concept has made it understandable. This is often measured with a Likert-type scale that asks judges to indicate how much they believe their evaluation is an accurate evaluation of the rating of a particular characteristic. In this study, familiarity is measured by a series of self-report Likert-type surveys.

Confidence - This is a concept related to familiarity. Confidence refers to the degree to which a judge believes that his or her decisions are correct. This is often measured with a Likert-type scale that asks judges to indicate how much they believe their evaluation is an accurate evaluation of the rating of a particular characteristic. In this study, confidence is measured by a series of self-report Likert-type surveys.

Accuracy - Accuracy refers to how close a judge’s judgment is to the actual value of something. In this study, judges are asked to estimate a cutscore for items used in an Angoff standard setting. The accuracy of their judgments is assessed using two different measures, the p-value correlation and the Root Mean Square of the Error (RMSE).
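As a rough illustration of the two accuracy measures named above, the following Python sketch computes a p-value correlation and an RMSE for a single judge. It assumes that accuracy is evaluated against empirical item p-values (the proportion of examinees answering each item correctly). The function names and all data are hypothetical and are not taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def rmse(estimates, observed):
    """Root mean square error between a judge's estimates and observed values."""
    squared_errors = [(e - o) ** 2 for e, o in zip(estimates, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical empirical p-values for five items (assumed criterion values).
empirical_p = [0.55, 0.50, 0.85, 0.20, 0.65]

# Hypothetical judge who ranks the items perfectly but overestimates
# every item by 0.2.
shifted_judge = [p + 0.2 for p in empirical_p]

# The correlation is 1 (up to rounding) while the RMSE is about 0.2,
# so the two measures can rate the same judge very differently.
print(pearson(shifted_judge, empirical_p), rmse(shifted_judge, empirical_p))
```

The shifted judge shows why the two measures need not agree: the correlation reflects only how well a judge orders the items, while the RMSE also penalizes a constant over- or underestimate.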

CHAPTER 2 LITERATURE REVIEW

2.1. Standard Setting Method

This section reviews some important aspects of the standard setting procedure, and some of the identified problems that make it difficult to interpret the meaning of standard setting scores.

Standard setting refers to the family of procedures used to establish cutscores on a scaled examination. Cutscores separate scaled scores into categories of performance defined in a performance standard (Cizek, 1996; Cizek, 2001; Cizek & Bunch, 2007; Cizek, Bunch & Koons, 2004). Standard setting is mostly used in criterion-referenced examinations to match standardized test scores with a verbal description defined in performance level descriptors (PLDs) of the performance standards. Panels of judges use different methods to compare PLDs with different types of information about items or examinees. The term "standard setting" is used to refer to the different procedures and materials used to make these cutscore decisions. Since the first suggestion of this idea in the 1950s (Nedelsky, 1954), dozens of different procedures have been developed. In one survey (Kaftandjieva, 2010), more than 60 different methods were identified, with more than 15 appearing since the year 2000.

Standard setting grew out of the expanded role of “criteria” in testing. Examinations can be defined as norm-referenced or criterion-referenced (Glaser, 1963; Shepard, 1980). Norm-referenced tests produce results that allow for comparison between individuals and dominated high stakes testing for much of the last century. Such tests are limited by an inability to indicate what the score means for examinee ability. Criterion-referenced tests produce results that assign a defined meaning to a particular score. These abilities are typically defined in descriptions ranking them from least to most capable. Such descriptions are referred to as a 'performance standard' and the descriptions that define individual categories of performance as 'performance level descriptors' or PLDs. The standard setting allows for these ranked descriptions - the PLDs - to be placed along scaled test scores, providing latent trait scores that correspond with the different categories of ability defined in the standard. Cizek and Bunch (2007, p. 13) have stated that,

    Standard setting refers to the process of establishing one or more cutscores on a test...Cutscores function to separate a test score scale into two or more regions, creating categories of performance or classifications of examinees.

A large number of different standard setting procedures have been developed (Hambleton & Pitoniak, 2006; Kaftandjieva, 2010; Cizek & Bunch, 2007). While these procedures vary enormously in their details, they all share one property. These procedures present panels of trained experts (the judges) with performance standards and different types of information about items and examinees. The judges are then asked to use these procedures to decide what score on the test is the cutoff point between the different categories of performance. The actual procedures used can vary considerably and different procedures may use a wide range of different types of information. A typical convention in contemporary standard setting procedures is to permit a significant amount of input to inform judges about the impact of their cutscore decisions. For example, one common way to handle this is for panel organizers to allow discussion between judges about their decisions, and then tell them what percentage of an actual examinee population would fall above and below their cutscore decisions.
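The impact feedback described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `impact`, the rule that a score equal to the cutscore counts as at or above it, and the examinee scores are all hypothetical and are not drawn from any particular standard setting.

```python
def impact(scores, cutscore):
    """Return (proportion below, proportion at or above) a cutscore.

    Assumption: an examinee whose score equals the cutscore is counted
    as falling at or above it.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    at_or_above = sum(1 for s in scores if s >= cutscore)
    proportion_at_or_above = at_or_above / len(scores)
    return 1.0 - proportion_at_or_above, proportion_at_or_above


# Hypothetical total scores for ten examinees.
examinee_scores = [12, 18, 21, 25, 25, 30, 33, 37, 40, 44]

# Feedback a panel might receive for a provisional cutscore of 25.
below, at_or_above = impact(examinee_scores, cutscore=25)
print(f"{below:.0%} below, {at_or_above:.0%} at or above")
```

Organizers could recompute these proportions after each round of discussion so that judges see how a revised cutscore would change the classification of examinees.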

As a result of this wide range of methods and procedures, different panels do not always agree on the cutscore decision, even for the same test items and with the provision of the same feedback information about pass/fail rates. It has long been known that different methods produce estimates of cutscore decisions that are systematically different based on their differing procedures (Buckendahl et al., 2002; Green et al., 2003; Hambleton & Pitoniak, 2006; Reckase, 2006; Yin & Schultz, 2005). Even small changes in standard setting procedures can result in changes in judges’ decisions (Cross et al., 1984; Hertz & Chinn, 2002; Jaeger, 1982). Judges, or even the same judge, may not make the same judgments under apparently identical conditions (George, Haque & Oyebode, 2006).

Very little has been written about the validity of the various standard setting procedures. The concept of ‘validity’ is itself a complex and contested issue. Many different definitions have been suggested. The National Council on Measurement in Education (NCME, 2015) defines it as “…a general term used to describe whether or not the interpretation of a theory is plausible.” One widely cited definition (Kane, 2006) attempts to explain the complexities of the term.

    Measurement uses limited samples of observations to draw general and abstract conclusions about persons and other units (e.g., classes, schools). To validate an interpretation or use of measurement is to evaluate the rationale, or argument, for the purposes being made…Ultimately, the need for validation derives from the scientific and social requirement that public claims and decisions be justified. (2006, p. 17)

This is similar to the definition used in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014, p. 9) that suggests, “Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use.”

In contemporary psychological testing, a general theory of validity, sometimes referred to as the argument-based concept of construct validity, has emerged as the dominant model (Cronbach, 1988; Cronbach & Meehl, 1955; Kane, 2006; Loevinger, 1957; Messick, 1981, 1989, 1998). An argument-based concept of validity “…first lays out a network of inferences and assumptions leading from the observed performances to the conclusions and decisions based on the performance” (Kane, 2006, p. 23).

Because the structure of a standard setting is so different from that of an experiment or correlational study, the discourse about validity in standard setting is quite different and separate from these more mainstream models of psychological and educational testing. Building on the definitions for standard setting validity suggested by Cizek, Kane and other modern standard setting theorists reject the conceptualization of standard setting as a psychometric technique with knowable or estimable parameters (Cizek & Bunch, 2007, p. 18). Given Cizek’s (2001a) belief that cutscores are arbitrary and that their importance comes from their usefulness rather than their validity, Kane (2001) stresses the point that procedural evidence is the most significant aspect when validating a standard setting; in other words, what matters is whether the standard setting was done the way it was supposed to be done. Examinations of validity in standard setting methods appear to be based on a series of ad hoc principles (Kane, 1992, 2001) and derived from the approach that accepts "just because a standard setting is arbitrary does not mean it is not useful" (Hambleton, 1980, p. 102).

Following in this tradition, Hambleton (2001; see also Schafer, 2005) built on this suggestion that further information is necessary to determine the 'usefulness' of the standard setting. Kane (2001, p. 63) states that,

    Procedural evidence is especially important in evaluating the appropriateness of performance standards. In most cases, few if any solid empirical checks on the performance standards are available.

So rather than the conventional forms of validity defined for conventional psychology, a different model of validity has been suggested for standard setting. This includes (1) a Definitional perspective - that “To be called performance standards, there must be operationally defined, mutually exclusive, exhaustive ordered categories and a decision process based on one or more assessments to place tested subjects in those categories”, (2) a Psychometric perspective - that “…form a scale that can be evaluated using the well-known criteria of reliability, validity, and utility”, (3) a Legal perspective - that “Performance standards are a part of a decision-making process. Assuming the decisions have importance (i.e., stakes, which imply the possibility of harm), the process may be held to criteria that courts have determined are crucial for legal acceptability”, and (4) an Institutional perspective - that a standard setting must be consistent with the goals of the institution sponsoring it. These points should also cover the legal defensibility, the generation of assets and the efficient uses of resources in the construction and use of the standard (Kane, 2001; Schafer, 2005, p. 62).

Cizek and Bunch (2007) suggest that panel organizers should report a number of statistical tests to support their argument for validity. In contrast, McGinty (2005) has pointed out that such statistical tests are more accurately thought of as indicators of reliability and, while useful in demonstrating validity, are not themselves measures of validity. As a result of this confusion, in comparison with other psychological assessment procedures, a scientific justification for the validity of a particular procedural decision, such as choice of a method or variation in a procedure, is very rarely given, and when this is done, such justifications are typically operational. Given this definition of validity for a standard setting, explaining choice of a method or variation in a procedure may not be possible or even necessary.

It is widely stated that standard setting procedures are dominated by two methods that are historically linked - the modified-Angoff method and the Bookmark method (Cizek & Bunch, 2007; Engelhard, 2007). The modified-Angoff method is derived from an original method named after William Angoff who, ironically, only briefly mentioned it as a note, and attributed the idea named after him to his colleague Ledyard Tucker (Cizek & Bunch, 2007). The main principle of the method is that items are examined one-at-a-time and judged in various ways for their suitability to make decisions about examinee classification. Since the Angoff method is the main focus of this study, much more will be said about it in the following sections; however, the Angoff method is widely cited as being "the most commonly used method for setting performance standards in contemporary use in licensure and certification context" (Cizek & Bunch, 2007, p. 82). Regardless of the literal accuracy of this statement, it is unquestionably a widely used method to produce cutscore decisions for high stakes tests.

The other widely used standard setting method is the Bookmark method. The Bookmark method emerged from procedural difficulties with the Angoff method. It was first suggested by Mitzel et al. (2001), although Cizek and Bunch (2007) trace its roots back to procedures extended from the Angoff method and used in the 1990s by researchers at American College Testing (ACT) for the National Assessment of Educational Progress (NAEP). In the Bookmark method, items are placed in a booklet, referred to as the Ordered Item Booklet (OIB), where they are ranked by their difficulty measures. Judges then place a marker on the item that separates the various categories of the performance standards. Engelhard (2007) speculates that, because of its widespread use in assessments related to the American educational policy No Child Left Behind (NCLB), the Bookmark method may have been the most widely used standard setting method.

Standard setting is now a routine aspect of test development. Large numbers of these procedures are performed regularly during the development of state and private tests. Standard setting panels were conducted as part of the NCLB network of accountability tests used in the USA (Linn, Baker, & Betebenner, 2002; Linn, 2003), as well as in other public education accountability projects throughout the world. Standard setting also plays a role in the development of the examinations that establish standards for a wide variety of occupations and professions (Nelson, 1994). In addition, panels similar to those in the standard setting are increasingly used for other purposes. For example, Roach, McGrath, Wixson & Talapatra (2010) describe a procedure similar to a standard setting panel to 'align' two or more different types of assessments whose content is not directly comparable. The results of their study resemble what could be produced from a mathematical equating of different assessment procedures. Their application of the panel comparison, instead of an equating, stems from the limited use of the assessments and hence limited numbers of observations available to perform an equating.

2.2. The Angoff Method

This study deals specifically with the Angoff method of standard setting. The section that follows briefly describes its history and the procedural aspects of the method that distinguish it from other methods of standard setting.

As mentioned above, the Angoff method is named after William Angoff (Angoff, 1971), who attributed the source of the method to his colleague Ledyard Tucker (Cizek & Bunch, 2007). The Angoff standard setting method is one of the oldest methods and is reputed to be among the most widely used methods in the world for setting cutscores (Cizek & Bunch, 2007). From a research point of view, the Angoff method is particularly useful because it produces many discrete values at points throughout the procedure, permitting the application of techniques derived from classical as well as latent trait theories, such as Item Response Theory (Embretson & Reise, 2000) and Rasch modeling (Bond & Fox, 2001).

There are many different versions of the Angoff method in use today. For this reason, methods that belong to the Angoff family of standard setting methods are sometimes described as “modified-Angoff” methods. It has been suggested that there is no general agreement on a definition of the Angoff method (Brandon, 2004; Reckase, 2000), although Brandon (2004) lists five steps he believes characterize the modified-Angoff procedure:

1. selecting judges
2. training judges
3. defining and describing the performance level descriptors
4. estimating examinee performance at the level of each item
5. review of empirical information by judges and discussion of item estimates

This definition, while widely cited, is difficult to use. All of these points are routine aspects of other standard setting methods, and only step (4) is distinctive to the Angoff family of standard setting methods. While estimation of examinee performance at the item level is found in other methods, such as the Nedelsky method (Nedelsky, 1954), the way it is done in the Angoff methods offers a true distinction between the modified-Angoff and other standard setting methods.

The modified-Angoff is distinctive in its procedures for estimating the cutscore in that:

1. Judges are presented with items one-at-a-time.
2. Judges are asked to estimate an examinee’s ability to answer the item correctly.
3. Estimation of examinee ability to correctly answer the item is done item-by-item, and items are not necessarily presented in any particular order.

The second point, estimation of an examinee’s ability to answer the item correctly, has been done in many different ways. Brandon (2004, p. 60 note 2) provides a partial list of some of these different ways. Sometimes percentages are recorded instead of probabilities. Sometimes judges specify the number of candidates out of 100 who could answer the problem correctly (e.g., Engelhard & Anderson, 1998; Impara & Plake, 1998). Sometimes judges are given a choice from a range of percentages or proportions. For example, Cross, Impara, Frary and Jaeger (1984) and Plake and Giraud (1998) instructed judges to select from deciles. Halpin, Sigmon and Halpin (1983) printed the lowest acceptable probability and the highest probability. Cizek and Bunch (2007) list several different versions of the modified-Angoff, including the yes/no Angoff

procedure in which judges indicate only a yes or a no concerning their judgment of examinee ability to answer the item correctly.

In addition, modified-Angoff standard settings conventionally incorporate a number of other procedures to produce a convergence of scores across judges. These are referred to in step (5) of Brandon’s (2004) list, and include:

1. Judges have several opportunities to refine their estimations, referred to as ‘rounds’. The current convention is to perform a standard setting in sometimes two, but often three, rounds (Cizek & Bunch, 2007).
2. In between rounds, judges have the opportunity to compare their estimations with each other and discuss why they made their individual decisions. This is referred to as ‘discussion’ (Cizek & Bunch, 2007).
3. In addition to discussion, judges are presented with data that reflect the impact of their decisions. For example, judges may be shown the percentage of examinees who would fall above or below their estimated cutscores. This is referred to as ‘impact data’ or ‘feedback’.

Virtually all standard settings, no matter which method is used, when conducted with multiple rounds, discussion among judges between rounds, and the provision of feedback, demonstrate a convergence of judges’ cutscore decisions across rounds. So characteristic is this result that Cizek (2001a, p. 10) refers to it as a “common feature of standard settings”. This convergence is not unanticipated. Experts, given the opportunity to discuss data relevant to their expertise, will develop elaborate explanations for the data based on information drawn from their shared background (Chi, Glaser & Farr, 1988; Johnson, 1988; Larkin, McDermott, Simon & Simon,

1980). It is thus reasonable to interpret the convergence of cutscore decisions as a growing expert consensus about the contents of the standard setting and its panels.

However, the exact origin of these effects is not well understood, and much discussion has been generated about it. Many types of effects have been suggested as potential factors in the converging scores of the judges. For example, some standard setting literature has examined whether social influences during the discussion drive cutscores toward agreement (Fitzpatrick, 1989; Hurtz & Auerbach, 2003; Hertz & Chinn, 2002; Wessen, 2010). Social influences, such as the effects of dominant individuals or group conformity, may be driving judges to report cutscore decisions that are more and more similar to each other. The mechanism of these biases has not been well established. Despite widespread speculation about the role of these social influences (Fitzpatrick, 1989) and some empirical examinations (Hurtz & Auerbach, 2003; Hertz & Chinn, 2002; Wessen, 2010), little is really understood about the social influences on the standard setting. Attempts to measure them have been largely unsuccessful. All of them seem to be derived from an ad hoc ‘common sense’ idea of what effects could be operating in the standard setting rather than from a theoretical description of what kinds of factors exist in the procedure. In fact, it is not even clearly understood how they could operate, or even whether they exist in a fashion that would affect the outcome of the standard setting. As a result, there continues to be confusion about how judges’ accuracy is influenced, or even what could be influencing it.
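The convergence across rounds described in this section can be illustrated with a minimal sketch. It assumes the conventional Angoff aggregation, which this section does not spell out: each judge's cutscore is the sum of that judge's item probabilities, and the panel cutscore is the mean across judges. The judge labels, item counts and ratings are invented for illustration and are not data from this study.

```python
from statistics import mean, stdev

# Hypothetical ratings for illustration only: three judges, four items, three rounds.
# Each value is a judge's estimated probability that a borderline examinee
# answers the item correctly.
ratings = {
    1: {"J1": [0.3, 0.5, 0.7, 0.4], "J2": [0.6, 0.8, 0.9, 0.7], "J3": [0.2, 0.4, 0.5, 0.3]},
    2: {"J1": [0.4, 0.5, 0.7, 0.5], "J2": [0.5, 0.7, 0.8, 0.6], "J3": [0.3, 0.5, 0.6, 0.4]},
    3: {"J1": [0.4, 0.6, 0.7, 0.5], "J2": [0.5, 0.6, 0.8, 0.5], "J3": [0.4, 0.5, 0.7, 0.5]},
}


def judge_cutscores(round_ratings):
    """Sum each judge's item probabilities to get that judge's raw cutscore
    (assumed aggregation rule, not taken from the source)."""
    return {judge: sum(probs) for judge, probs in round_ratings.items()}


def panel_summary(round_ratings):
    """Return (panel cutscore, spread): mean and sample SD of the judges' cutscores."""
    cuts = list(judge_cutscores(round_ratings).values())
    return mean(cuts), stdev(cuts)


for rnd in sorted(ratings):
    cut, spread = panel_summary(ratings[rnd])
    print(f"Round {rnd}: panel cutscore = {cut:.2f}, SD across judges = {spread:.2f}")
```

In this invented example the spread across judges shrinks from round to round. A shrinking spread shows only that the judges agree more closely; by itself it does not show whether the agreement comes from growing expertise or from social influence.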

(29) 2.3. Training and the Angoff Standard Setting Method. Merriam-Webster tells us that training is, “a process by which someone is taught the skills that are needed for an art, profession, or a job.” This definition implies both learners and teachers. All standard setting methods would have both. But in addition, the idea of ‘training’ implies that it is preparing learners for a task that they cannot or do not perform naturally without instruction. In principle, learners who perform well during training should be better prepared for the task than those who do not perform as well. The training that has emerged for the Angoff method of standard setting has two fundamentally different manifestations. In the United States, where most of the published research on standard setting is generated, training has focused on preparing judges to identify as accurately as possible where categories of the borderline student test takers end and begin - or at least this has become the emphasis of activities used during standard setting training. In Europe, on the other hand, the Common European Framework of Reference (CEFR) has for years held the dominant position as a standard in language testing. The CEFR is fundamentally nothing more than lists of the PLDs that are aimed at describing competency. As such, training for standard setting that has emerged from Europe is based largely on whether judges are able to use and perform tasks based in lists of the skills associated with different levels of competency. The American sense of standard setting is very clearly laid out around the concept that the judge’s job is to identify the borderline test taker. Many researchers describe in detail the tasks that such a student should be able to do. For example this quote from the widely cited Raymond & Reid (2001, p. 147) illustrates this point.. 16.

(30) Training should give participants an opportunity to practice the steps for assigning MPLs (Minimum Passing Level) under conditions similar to the conditions they will experience when assigning actual MPLs (p. 144). Asking the participant with the lowest MPL and the participant with the highest MPL to explain their rationale is a common training technique. Similar descriptions can be found in more recent examples of the training of standard setters. Raymond & Reid (2001, pp. 150-1) provide explanations of a training program for standard setting judges. Loomis (2012) provides details of the preparation of standard setters for NAEP. While she gives descriptions of how it is that NAEP selects and prepares their standard setters, both her work and that of Raymond & Reid (2001) fail to provide any evidence that their training methods can actually produce the knowledge and skills deemed necessary by the authors. It is not at all clear that asking judges to explain their reasoning has any effect on their actual ability to perform the task with more or less efficacy. Although such instructions seem to make sense, to do so, in effect, is relying on the face validity of the activities. In other disciplines and areas of psychology, this form of work would be done through the use of clinical trials to assess the efficacy of training methods. Instead, judges are asked to fill out self-report surveys describing self-reflections on their knowledge and feelings about training and personal success at mastering the training. Gregory Cizek (2001, 2012, 2012a; Cizek & Bunch, 2007) is one of the leading researchers in standard setting today. He has discussed in great detail the use of these self-report surveys to monitor progress during the standard setting and to clarify judges’ level of knowledge and attitudes during and after the procedure. 
While Cizek has published several major academic books on standard setting, they are better thought of as manuals concerning how to conduct a valid standard setting. Cizek (2012a, p.170) states that,. 17.

Minimally, two essential validity related questions are addressed by the surveys. (a) Is there evidence that the standard setting participants received appropriate and effective training in the standard setting method, key conceptualization, and data sources to be used in the procedure? (b) Is there evidence that the participants believe they were able to complete the process successfully; yielding recommended cutscores that they believe can be implemented as valid and appropriate demarcations of the relevant categories.

Cizek (2012a) continues by providing extremely detailed examples of ‘evaluations’ timed for the “End of Orientation” (p. 174), the “End of Method Training Session” (p. 175), the “End of Round One” (p. 175), “Round Two” (p. 176), “Round Three” (p. 177), the “Final Evaluation” (p. 177), and a final form dealing with “Level of Reliance on Information” (p. 178). The study reported here was planned long before Cizek (2012a) wrote this and, in addition, has a different agenda in mind. As such, its schedule only roughly follows the one suggested by Cizek (2012a). Of the seven different types of assessment used in this study, five were self-report surveys, addressing:

1. knowledge of standard setting procedures
2. knowledge of the CEFR
3. knowledge of the Practical English Test
4. beginning of Day 2, prior to the beginning of the operational standard setting
5. final evaluation

In one final suggestion for training, Loomis (2012), and also Cizek (2001), describe the slightly different version of this used by NAEP. NAEP uses as a key element, and “the most essential

part of the process” (Loomis, 2012, p. 123), the concept that the training process should produce in judges a common understanding of the achievement levels. As a result, one of the procedures is that all judges take a version of the test to try to understand what it will be like for the actual test takers.

A similar situation exists in Europe. European standard setting of language tests is based largely around the Common European Framework of Reference (CEFR). Some CEFR manuals are simply lists of PLDs for various language situations (Council of Europe, 2001). Others are lists of PLDs and how to interpret them. Exercises for the training of judges can be constructed from these instructions (Council of Europe, 2001; Council of Europe, 2009). Some of these exercises are very interesting and appear to have very strong face validity. But like their American counterparts, a formal test of their ability to predict standard setting outcomes has yet to be reported.

The situation described here, where training for judges is suggested without any quality control other than the face validity of the procedure, is in fact much more significant than first indicated. It is difficult to find any source anywhere dealing with the training of standard setting judges that reports or even describes a need for predictive checks on training activities. Hambleton & Pitoniak (2006) spend a great deal of time in their chapter “Setting Performance Standards” in the APA publication Educational Measurement discussing the importance of training. Many standards of the APA’s Standards for Educational and Psychological Testing are cited in Hambleton & Pitoniak (2006) as the authors refer to the training of judges. For example, on page 434, they state,

Standard 4.20 addresses the desirability of obtaining external evidence to support the validity of test score interpretations associated with performance category descriptions.

Standard 4.21 stresses the importance of designing where panelists can optimally use the knowledge that they have to influence the process.

The paper itself contains numerous sections that deal directly with the training of judges, such as “2.4 Step 4: Train Panelists to Use the Method” (p. 437) and “6. Training Panelists” (p. 453-455) which are cited extensively in Raymond & Reid (2001). At no point in any of these references is there mention of an empirical validation of these training methods, or of how such training methods are connected to the standard setting method in a way that makes them more valid as methods of performing the procedure.

2.4. Problems with the Angoff Method

The Angoff method suffers from all of the general problems that plague standard setting, such as the rejection of the conceptualization of standard setting as a psychometric technique capable of discovering a knowable or estimable parameter (Cizek & Bunch, 2007) rather than an abstraction that is useful, but not a real value (Cizek, 1996; Cizek, 2001; Cizek & Bunch, 2007; Cizek, Bunch & Koons, 2004). Or, as Dixy McGinty (2005) put it, statistics describing standard setting performance are not really indicators of validity and, while useful in demonstrating it, are not themselves measures of that validity.

In addition, there is a series of special problems that judges experience in the Angoff standard setting. The most prominent of these are related to the fact that Angoff judges are making estimates of the difficulty of the items. Human minds are limited in their ability to make such estimates (Brandon, 2004; Goodwin, 1999; Impara & Plake, 1998; Linn & Shepard, 1997; Lorge & Kruglov, 1953; Norcini et al., 1987; Norcini, Shea & Kanya, 1988; Shepard, 1994; Smith & Smith 1988; Taube, 1997) and the fact that the Angoff method forces judges to do so as part of

the procedure produces a situation in which judges look for extra sources of information on which to base their estimates, such as the estimates of other judges or the feedback information, and which also gives rise to the problem of restricted range.

Copying of Judges’ Estimates and Feedback Information

One source of information that may be producing convergence of the p-values is the provision of feedback information. This phenomenon has been widely studied, but its root cause is not clearly understood. It is generally seen as an indication of improving performance of the judges as they receive more information about the items (Cizek, 1996; Cizek, 2001; Cizek & Bunch, 2007; Cizek, Bunch & Koons, 2004). A second explanation, which will be discussed later, is that the judges are simply copying the information they are receiving about the items during the feedback sessions.

In its modified form, the Angoff standard setting method asks judges to rate the difficulty of test items. The ability of expert judges to make such estimates is crucial to the validity of the method, and as such, a large and comprehensive research literature has been developed to address the issue. P-value correlations as high as those seen in later rounds of an Angoff method standard setting are virtually never seen unless judges have already been told the p-values of the items in the feedback information provided before Rounds 2 and 3. A large number of references are typically cited questioning the ability of even the most highly trained experts to accurately estimate the difficulty of test items, and describing the way in which asking judges to make an estimate has an effect on the magnitude of the estimate (Brandon, 2004; Goodwin, 1999; Impara & Plake, 1998; Lorge & Kruglov, 1953; Linn & Shepard, 1997; Norcini et al., 1987; Norcini, Shea & Kanya, 1988; Shepard, 1994; Smith & Smith 1988; Taube, 1997).
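The p-value correlations referred to here can be made concrete with a short sketch. It computes the Pearson correlation between a panel's mean item estimates and the empirical p-values, before and after feedback. All numbers are invented for illustration; the function name and the two-round layout are assumptions, not the procedure used in this study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: empirical proportion correct for five items, and the
# panel's mean estimates before and after seeing those p-values as feedback.
empirical_p = [0.25, 0.40, 0.55, 0.70, 0.90]
estimates_before = [0.50, 0.40, 0.60, 0.50, 0.70]
estimates_after = [0.30, 0.40, 0.55, 0.65, 0.85]

r_before = pearson_r(empirical_p, estimates_before)
r_after = pearson_r(empirical_p, estimates_after)

# r squared is the share of variance in the estimates that is linearly
# associated with the empirical p-values.
print(f"before feedback: r = {r_before:.2f}, r^2 = {r_before ** 2:.2f}")
print(f"after feedback:  r = {r_after:.2f}, r^2 = {r_after ** 2:.2f}")
```

A jump in the correlation after feedback is what both explanations predict, so the statistic alone cannot separate improved judgment from copying of the feedback values.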

Brandon’s (2004) examination of Angoff method procedures concluded that, typically, the values obtained by correlating the empirical p-values with the estimates obtained from Angoff judges range from around 0.40 to 0.70, indicating that, at best, actual estimation of the p-value can rarely account for more than half of the variance in a judged estimate. In conclusion, he states (p. 71), “results of this level, show that the ordering of item estimates - particularly those in operational standard setting studies - can be expected to mirror moderately the ordering of item difficulty.”

The clustering of scores around a central point is referred to as ‘restricted range’. If these scores are clustered around the middle of the rating scale, this is referred to as ‘centrality’ (Saal et al., 1980). One indication of centrality would be that the estimated difficulty values have a smaller standard deviation than the measured difficulty values of the items, indicating that estimated values for the easy items and for the difficult items are not correct (Saal et al., 1980), and are more correct for items closer to the mean or median. This, in fact, is a commonly observed aspect of the research. Lavallee (2012, p. 14) reviewed the literature related to this issue and concluded, “…results consistent with a centrality effect have been found every time they have been looked for” (italics in the original). In addition, the tendency for judges to cluster estimates of actual values in tighter distributions than the actual values themselves has been the subject of comment for almost as long as there has been systematic scientific investigation into standard setting results. Lorge and Kruglov’s original (1953) study found a standard deviation of 16.3 for the judges’ estimates compared with 23.7 for the empirical difficulty values. Goodwin (1999) reported, in her study of the results of a financial planner licensing exam, that the judges’ estimated p-values were “more homogeneous” than the actual results obtained from the administration of the items to candidates. The standard deviations for the estimates of total group and for borderline

examinees were .09 and .10, respectively. The actual observed values were .19 and .18. Van de Watering & van der Rijt (2006) compared the estimates of difficulty values for teachers and students. They found high rates of inaccuracy among these groups. Interestingly, their student group did not overestimate the difficulty of easy items, although they showed dramatic underestimation of difficult items. Teachers’ estimates of easy items showed much more centrality and systematically underestimated the easiest items.

More recently, Brian Clauser and his team have expanded on this theme with attempts to find out more about which factors drive p-value estimates and how they do so. Clauser et al. (2013) confirmed that providing judges with information about the empirical p-values of items was what resulted in the characteristic distribution of judges’ estimates in a standard setting. Mee et al. (2013) found that varying the instructions judges received could also affect their final cutscore. Clauser et al. (2014), using a generalizability theory framework, found that the greatest source of variability within the Angoff standard setting was between tables rather than between individual judges, suggesting that something similar to social influences could be affecting estimates inside the tables of the judges.
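The restricted-range findings quoted above can be compared on a common footing by dividing the standard deviation of the judges' estimates by the standard deviation of the empirical values. The sketch below uses only the figures reported in the text; the ratio itself is an illustrative summary chosen here, not a statistic reported by the cited authors.

```python
# (SD of judges' estimates, SD of empirical values), as reported in the text.
reported_sds = {
    "Lorge & Kruglov (1953)": (16.3, 23.7),
    "Goodwin (1999), total group": (0.09, 0.19),
    "Goodwin (1999), borderline examinees": (0.10, 0.18),
}


def sd_ratio(estimate_sd, empirical_sd):
    """Ratio below 1.0 means the estimates are more tightly clustered
    than the values they are meant to estimate."""
    return estimate_sd / empirical_sd


for study, (estimate_sd, empirical_sd) in reported_sds.items():
    print(f"{study}: SD ratio = {sd_ratio(estimate_sd, empirical_sd):.2f}")
```

All three ratios fall well below 1.0, which is the pattern expected if judges compress their estimates toward the middle of the scale.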


CHAPTER 3 METHODS

3.1. Materials

The data used in this study are drawn from a series of standard setting meetings held at a Taiwan university (hereafter referred to as The University) to link a university-level English proficiency exam to the Common European Framework of Reference (CEFR) (see Appendix 1). The test used in this study, the English Proficiency Test or EPT, is an examination of English as a Foreign Language. The EPT is a series of in-house language proficiency tests developed to meet the needs of the Practical English (PE) program adopted at The University. The EPT exams are multiple-choice exams. They test a series of listening, reading and vocabulary skills in a number of different practical contexts. The program is divided into 8 sections, with students progressing through 2 sections of Practical English each year: PE 1 and 2 are taught to freshmen (1st year), PE 3 and 4 to sophomores (2nd year), PE 5 and 6 to juniors (3rd year), and PE 7 and 8 to graduating seniors (4th year). An outline of the test organization is detailed in Table 3.1 (Lavalle, 2012).

Items on the EPT are linked to vocabulary suggestions contained in the PE textbooks. The CEFR was not taken into account for items used in this standard setting, which are tied to topics covered in the textbook rather than to the CEFR scales. The textbooks from The University were not designed with the CEFR in mind; however, one of the goals of the standard setting is to match the textbooks and items written within the PE program with the standards of the CEFR. Because the same items may appear on more than one PE test, The University maintains a strict control policy over them. As a result, no examples of items can be provided in this research.

Items for the EPT are written by the classroom teachers of the PE program under the supervision of test editors who are assigned by the school. Items are then sent to a proofreader and finally

returned to the editors. The test editor returns the test to the school, which then prints and distributes the test forms to students. The various tests of the EPT are each administered on a single day. For example, all freshman students receive their test at the same time. All sophomores also receive their test at the same time, which is different from the time for freshman students and other students. Following student examinations, test results are collated, sent to a test coordinator and calibrated with Winsteps Rasch modeling software (Linacre, 2012). All test items are placed on a single difficulty scale. Items are sorted by their point-biserial correlations and difficulty values, and stored in an item bank for later use. Currently, most items that appear on the EPT are drawn from this item bank, although teachers continue to write new items to expand it.

The test items used in this standard setting were drawn from several different midterm examinations. All items had been calibrated onto a single scale using Rasch modeling. This standard setting project was designed to establish cutscores along the scale used to calibrate all items in the item bank, and not along a raw score scale corresponding to a single test form. Accordingly, the test form used in the project was actually a composite, with its items drawn from a series of different test forms administered during the midterm examination period for first-, second-, third- and fourth-year students. The tests shared a number of common items, illustrated in Table 3.1, which were used to equate them and calibrate them together onto the same scale.
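Placing a cutscore on the calibrated scale rather than on a raw score scale can be sketched as follows. Under the dichotomous Rasch model the probability of a correct response is P = 1 / (1 + exp(-(theta - b))), and the expected raw score on a form is the sum of these probabilities over its items. One common way to move a raw-score cutscore onto the ability scale is to find the theta whose expected raw score equals the cutscore. This section does not say how the mapping was done in this study, so the approach, the function names and the item difficulties below are all assumptions for illustration.

```python
from math import exp


def rasch_p(theta, b):
    """Rasch probability that a person of ability theta answers an item
    of difficulty b correctly (both in logits)."""
    return 1.0 / (1.0 + exp(-(theta - b)))


def expected_raw_score(theta, difficulties):
    """Expected number correct on a form: the test characteristic curve."""
    return sum(rasch_p(theta, b) for b in difficulties)


def theta_for_raw_score(target, difficulties, lo=-6.0, hi=6.0, tol=1e-6):
    """Find theta whose expected raw score equals target, by bisection.
    Works because the expected raw score increases monotonically with theta."""
    if not expected_raw_score(lo, difficulties) < target < expected_raw_score(hi, difficulties):
        raise ValueError("target raw score is outside the searchable range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_raw_score(mid, difficulties) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical item difficulties (logits) and a hypothetical raw cutscore.
item_difficulties = [-1.0, 0.0, 1.0]
raw_cutscore = 1.5
theta_cut = theta_for_raw_score(raw_cutscore, item_difficulties)
print(f"raw cutscore {raw_cutscore} corresponds to theta = {theta_cut:.3f}")
```

Because the three invented difficulties are symmetric around zero, a raw cutscore of half the items maps to a theta of approximately zero, which gives a quick sanity check on the search.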

Table 3.1. Contents of the English Proficiency Test (EPT)

| Skill | Question Type | Description | Items | Time |
|---|---|---|---|---|
| Listening (L) | What’s next? | Student hears 2 conversational turns and is asked to choose the next response. | 20 | |
| | Dialogues | Student hears short conversation of about 8-14 turns and answers 3-5 comprehension questions. | 10 | |
| | Extended Listening | Student hears a short monologue and answers 3-5 comprehension questions. | 15 | |
| | Total | | 45 | ~45 min |
| Reading (R) | Fill in the Blank | Student chooses a word or short phrase to complete a sentence. | 10 | |
| | Cloze Reading | Student chooses words or short phrases to complete a short passage (multiple-choice cloze). | 10 | |
| | Reading with Questions | Student reads a short passage (150-300 words) and answers 3-5 comprehension questions based on the text. | 30 | |
| | Total | | 50 | ~55 min |
| TOTAL | | | 95 | 100 min |

3.2. Judges

Judges were selected primarily from faculty and staff of The University. Several external judges were selected to provide diversity to the standard setting decisions. These judges were selected because of their experience teaching students at similar universities in Taiwan. Two of the external judges were faculty members at universities in the Taipei area, and one was a doctoral candidate at another university who had taught remedial classes for the university at which she was studying. Table 3.2 (Lavalle, 2012) provides a summary of the judges and a brief description of the background of each.

Table 3.2. Angoff Judges, English Panel

| Panel | Judge | Gender | NS/NNS | Position |
|---|---|---|---|---|
| 1 (Mon) | Agf11 | F | NNS | Administrator, former teacher |
| | Agf12 | F | NNS | Teaching Assistant, recently graduated student |
| | Agf13 | F | NNS | Teacher |
| | Agf14 | F | NS | Teacher |
| | Agf15 | M | NNS | Teacher |
| | Agf16 | F | NNS | Teacher |
| 2 (Wed) | Agf21 | M | NS | Teacher |
| | Agf22 | M | NS | Teacher, External University |
| | Agf23 | F | NNS | Teacher |
| | Agf24 | M | NNS | Teacher |
| | Agf25 | M | NS | Teacher, External University |
| | Agf26 | F | NNS | Teaching Assistant |
| 3 (Fri) | Agf31 | F | NNS | Teacher, External University |
| | Agf32 | F | NNS | Teacher |
| | Agf33 | F | NNS | Administrator, Teacher |
| | Agf34 | F | NNS | Administrator, Teacher |
| | Agf35 | F | NNS | Teacher |
| | Agf36 | F | NNS | Teacher |

F = female, M = male, NS = native English speaker, NNS = non-native English speaker

3.3. Procedures

A one-day training/orientation session was held on Saturday, July 10, 2010 for all the participants. The judges were then separated into three different panels, which were held on Monday, July 12, Wednesday, July 14, and Friday, July 16, 2010. The individual panels were conducted on three separate days to ensure that proper procedures were followed, particularly during the discussion period. A group of six judges participated on each day. The same facilitator moderated the discussion session of each panel, which required the panels to meet on separate days.

Introduction to Training

As noted in Table 3.1, the test form presented to each of the panels was a composite drawn from tests in the EPT series of tests. The items were drawn from test forms administered as part of the annual EPTs for all four year levels of the program, and the composite differed slightly from the EPT exam described earlier. Table 3.3 (Lavalle, 2012) summarizes the question types used in the composite form that each of the judges had to work with. The form itself was composed of a listening and a reading section.

Table 3.3. Contents of the Test Form Used in the Standard Setting

| Section | Question Type | Items |
|---|---|---|
| Listening | What’s Next? | 16 items |
| | Dialogues | 12 items (3 listening texts) |
| | Extended Listening | 12 items (3 listening texts) |
| Reading | Fill in the Blank | 10 items |
| | Text Completion | 16 items |
| | Reading with Questions | 14 items |

For the purposes of acclimatizing judges to the difficulties encountered in taking the test and providing them with the experience of taking the exam, a training form was created with the same format as the regular exam (Loomis, 2012). The test form used in the operational standard setting did not contain the scripts for the listening passages, so a separate form was created for the listening test which contained both the listening scripts and the associated items. In the training session, judges took the test using the training form. During the operational standard setting of the listening test, judges were not able to hear the taped version of the questions but were provided with the scripts of the listening questions.

The week prior to the training session, an email was sent to all judges that contained

1. an introductory letter with a link to a CEFR familiarization website, www.CEFtrain.net
2. an agenda for the training session, consisting of adapted versions of pages 33-36 from the CEFR (2009).
3. the training materials, the listening and reading components of the CEFR (2009) self-assessment grid (CEFR Table 2); and a link to the website.
4. two forms collecting personal information and agreements concerning test security and informed consent for the research portion of the project (see Appendix 2 and 3).

As homework, judges were asked to refer to the website and level summaries, and to use the self-assessment grid to assess themselves (in any second language) and their students in terms of the CEFR levels (Council of Europe, 2009). Training of judges was extremely conventional and followed suggestions given in such authoritative sources as Cizek (2001), Cizek and Bunch (2007), and the Council of Europe (2009).

Day 1 of Training - Introduction to the CEFR

On the first day of training, judges were given a brief PowerPoint presentation explaining the purpose of the project, a description of the EPT and an explanation of how it was developed and validated. Following guidelines provided by Cizek (2001), Cizek and Bunch (2007) and the Council of Europe (2009), a great deal of effort was expended during training to familiarize judges with the descriptors used for the panels. They then took part in a CEFR familiarization process. After a brief description of the CEFR, their understanding of descriptors was tested. Judges were given a sheet containing the Global Descriptors from the CEFR Table of Global Descriptors. The descriptors had been rearranged, and the judges were asked to sort them back into the correct order (first individually, then in pairs). After providing them with a copy of the original CEFR Table and discussing the correct answers, the judges were asked to take out their ‘homework’ activity, in which they rated their own ability and that of their students using the CEFR levels, and to discuss their answers in pairs.

PLD Test of CEFR Descriptors

The session then shifted to the CEFR reading Performance Level Descriptors (PLDs). The first activity was another sorting activity, in which judges were asked to sort 20 CEFR reading and 20 CEFR listening descriptors from CEFR levels A1 to B2. They were then given a sheet containing CEFR reading descriptors from the scales used in the study, for CEFR levels A1 to B2. These descriptors were in a randomized order. Judges were asked to sort the descriptors into an order from least difficult to most difficult that they felt made the most intuitive sense, and to assign a CEFR competency level to each descriptor.
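A descriptor-sorting activity of this kind can be scored in several ways. The sketch below shows one hypothetical scheme: the proportion of descriptors assigned to exactly the right CEFR level, and the proportion placed within one level of the key. The scoring rule, the function name and the example answers are assumptions for illustration, not the scoring used in this study.

```python
# CEFR levels covered by the sorting activity, from lowest to highest.
LEVELS = ["A1", "A2", "B1", "B2"]


def score_sorting(assigned, key):
    """Return (exact, adjacent): the proportion of descriptors assigned to the
    keyed level, and the proportion assigned within one level of the key."""
    if len(assigned) != len(key):
        raise ValueError("assigned and key must have the same length")
    distances = [abs(LEVELS.index(a) - LEVELS.index(k)) for a, k in zip(assigned, key)]
    exact = sum(d == 0 for d in distances) / len(key)
    adjacent = sum(d <= 1 for d in distances) / len(key)
    return exact, adjacent


# Hypothetical answer key and one judge's hypothetical assignments.
answer_key = ["A1", "A2", "B1", "B2"]
judge_answers = ["A1", "B1", "B1", "A1"]

exact, adjacent = score_sorting(judge_answers, answer_key)
print(f"exact agreement = {exact:.2f}, within one level = {adjacent:.2f}")
```

Reporting both figures separates judges who miss by a single level from those who misplace descriptors by two or more levels.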

(47) The training for the listening PLDs was conducted in parallel fashion. Judges were asked to sort, individually, 20 CEFR listening descriptors taken from the CEFR A1 to B2 levels. After they finished, correct answers were provided along with a full list of the listening PLDs from the scales used in the study. The scores on these activities were recorded and analyzed later as a measure of how well each judge could use the CEFR descriptors for levels A1 to B2.

Discussing Difficulty

After a break for lunch, judges took the practice test that was described above. The judges were then divided into the three groups of six people that would make up the operational panels. The judges were asked to sit together in a circle with the other members of their standard setting panel. A group leader was chosen, and each panel was asked to go through the test form, item by item. As a group, they were asked to discuss what knowledge, skills, and abilities were required to answer each item, and how the items differed in terms of difficulty. One hour and fifteen minutes was allotted to this task. The discussions were taped by the facilitators for later transcription.

Discussing the Barely Proficient Student

Following this activity, the judges were introduced to the concept of the barely proficient B1 student (B1 BPS). They were then given a form containing space for their notes on the BPS. They were told to refer to their listening and reading PLDs for the A2 and B1 levels and to summarize on the form what they felt to be the key characteristics of a B1 BPS for both listening and reading. They were then asked to discuss their summaries in pairs or small groups. This was the final training activity of the day. Judges were then given the opportunity to ask any questions they had about what had been discussed to that point. They were told that when they returned for the

(48) actual meeting, they would have one training round prior to the meeting, and then they would perform the actual standard setting. This concluded Day 1 of the training session.

Day 2 – The Operational Standard Setting and Review

The Angoff meetings were held over a period of one week, on July 12, 14, and 16, 2010. The meetings were divided into two panels, with standards set for the reading test in the morning and the listening test in the afternoon. Before beginning, judges were given a brief review of the contents of the previous training session. This included a review of the B1 BPS. Judges were then told to estimate, based on their understanding of students in the PE program (or of Taiwanese university students in general, for judges who were instructors at other universities), the percentage of students who had reached the B1 level for the skill in question, and to write down this estimate. The test form and the Round 1 rating form (see Appendix 4) for the reading test were distributed to the judges and a practice round was conducted.

Day 2 – Round 1

The rating form contained a single column for each item being rated, with each column containing a list of probabilities in increments of 0.1, from 0.1 to 0.9, with a space between each figure. Judges were asked to “circle or insert” the probability that a just-B1 level student would answer the item correctly, and to write their answer at the bottom of the column (see Appendix 4). Judges were instructed not to attempt to include guessing in their calculation of probabilities. They were then given a practice round, in which they were asked to write their ratings for the first few items. It was made clear that this was simply a practice round, to ensure that they understood the procedure, and that they could change their answers later. The facilitators

(49) circulated from judge to judge while they were performing the practice round to make sure the procedures were understood. Once all judges had finished, they were asked if there were any remaining questions. After questions were answered, the first round of ratings was conducted.

After returning from a break, judges were given forms containing both feedback data and empirical item-difficulty data. The feedback data took the form of a distribution of actual students in the program at different score levels on the test. The rating form for the second and third rounds incorporated further feedback. The range of scale scores was divided into 40 categories of approximately equal size. A column was added to the left side of the form. Each row in the column contained one of the 40 categories, from low to high. Once again, there was one column per item and the columns contained probabilities in increments of 0.1. This time, the probabilities were placed in rows corresponding to the scale scores in the leftmost column. Based on empirical results from the spring 2010 administration of the EPT, each probability was placed in the scale-score row where it corresponded to the approximate probability that a student in that scale-score category would answer the item correctly. Judges were guided in the use of the feedback material, so that they could use their initial estimates of students at the B1 level, the distributional data, and the second rating form to contrast their Round 1 rating with what their rating would have been based on their estimate of the number of students at the B1 level. Finally, at the bottom of the column for each item was the empirical p-value for all PE students who took the midterm EPT. The listening form also contained this information for graduating students. For reading, the difference between graduating students and all students was not large, so this information was omitted.
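
The empirical information on the Round 2 and Round 3 rating forms amounts to a proportion correct on each item within each scale-score category. The following is a minimal sketch of one way such values could be computed for a single item. It assumes that "approximately equal size" means roughly equal numbers of students per category; equal-width score bands would be another reading. The function name `conditional_p_values` and the toy data are hypothetical and are not taken from the study.

```python
def conditional_p_values(scores, responses, n_categories):
    """Proportion correct on one item within each score category.

    scores: scale scores, one per student.
    responses: 0/1 scores on the item, in the same student order.
    n_categories: number of roughly equal-count groups (40 in the text).
    Returns (lowest score, highest score, proportion correct) per
    category, ordered from low to high.
    """
    if len(scores) != len(responses):
        raise ValueError("scores and responses must have the same length")
    # Student indices ordered by scale score, lowest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    rows = []
    for k in range(n_categories):
        group = order[k * n // n_categories:(k + 1) * n // n_categories]
        if not group:  # more categories than students
            continue
        p = sum(responses[i] for i in group) / len(group)
        rows.append((scores[group[0]], scores[group[-1]], p))
    return rows


if __name__ == "__main__":
    # Toy data: eight students, one item, four categories.
    scores = [10, 20, 30, 40, 50, 60, 70, 80]
    responses = [0, 0, 0, 1, 1, 0, 1, 1]
    for low, high, p in conditional_p_values(scores, responses, 4):
        print(low, high, p)
```

Because the groups are formed by count, students with tied scores can fall on either side of a category boundary; an operational form would need a rule for handling ties.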

(50) After the judges had been instructed in the use of the feedback information, a discussion session was held. For each item, the judges announced their Round 1 ratings and briefly explained why they had given those ratings. The assistant facilitator entered ratings into the computer as they were announced. Once the discussion period was finished, the cutscores were calculated and shown to the judges. Using the distribution data, judges were asked to contrast the percentage of students they had initially estimated to be at the B1 level with the percentage of students who would be classified at the B1 level based on their Round 1 ratings. They were then asked to make their Round 2 ratings. It was emphasized that they did not need to change their ratings.

Day 2 – Rounds 2 and 3

The Round 2 ratings were entered into the computer and cutscores were calculated. (There was no discussion of individual decisions following Round 2; rather, judges handed their rating forms to the facilitators, who entered their ratings into the computer while those who had finished took a break.) Judges were again asked to consider the impact (distributional) data, and were given the opportunity to ask questions or make comments. Following this, they were asked to make their final round of ratings. The ratings for the final round were used to derive the recommended cutscores.

At the opening training meeting, all participants were asked to sign a research consent form releasing all the data generated from the standard setting to the school for any research and administrative purposes that were necessary (see Appendices 2 and 3). In addition, judges’ feedback about their familiarity with, confidence in, and understanding of the training was gathered regularly throughout both the training and the operational standard setting panels.

Summary of Day 1 and Day 2

(51) Day 1
1. Pre-training assessment of individual preparation (Appendix 5).
2. Three different assessments throughout the training day assessing confidence in and familiarity with their task (Appendices 6, 7, and 8).

Day 2
3. An assessment at the opening of the operational panel to address confidence and preparation in the day's coming activities (Appendix 9).
4. Three rounds of the operational standard setting, reading panel.
5. Lunch.
6. Three rounds of the operational standard setting, listening panel.
7. A final assessment (Appendix 10) of judges’ confidence in their final cutscore decision and satisfaction with the manner in which the standard setting training and panels had been conducted. This was modeled after the sample form contained in Cizek & Bunch (2007).
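
The cutscore calculation performed after each round and the impact comparison shown to judges can be sketched as follows. This assumes the conventional raw-score Angoff computation, in which each judge's cutscore is the sum of that judge's item probabilities and the panel cutscore is the mean across judges. The text does not state the exact computation used, nor how raw cutscores were mapped to scale scores, so the function names and the toy data here are illustrative only.

```python
from statistics import mean


def judge_cutscores(ratings):
    """Sum each judge's item probabilities.

    ratings: dict mapping a judge label to a list of probabilities,
    one per item, all lists in the same item order.
    """
    return {judge: sum(probs) for judge, probs in ratings.items()}


def panel_cutscore(ratings):
    """Mean of the judges' individual cutscores (raw-score metric)."""
    return mean(list(judge_cutscores(ratings).values()))


def percent_at_or_above(raw_scores, cutscore):
    """Impact data: percentage of students scoring at or above the cutscore."""
    return 100 * sum(s >= cutscore for s in raw_scores) / len(raw_scores)


if __name__ == "__main__":
    # Toy example: three judges rating a four-item test.
    ratings = {
        "J1": [0.6, 0.4, 0.8, 0.2],
        "J2": [0.7, 0.5, 0.9, 0.3],
        "J3": [0.5, 0.3, 0.7, 0.1],
    }
    cut = panel_cutscore(ratings)
    student_scores = [0, 1, 2, 3, 4, 2, 1, 3]
    print(cut, percent_at_or_above(student_scores, cut))
```

Comparing the value returned by `percent_at_or_above` with a judge's initial estimate of the percentage of students at the B1 level mirrors the contrast that judges were asked to make before their Round 2 ratings.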

(52) Focus Group and Recording

In addition to the feedback forms, the group discussion activities described earlier were recorded and later transcribed. Following the operational standard setting, Group 3 volunteered to take part in a focus group to discuss their impressions of the standard setting. This focus group was recorded and later transcribed for use in understanding judges’ perceptions of the standard setting, its procedures, and its outcomes. These recordings were made with hand-held analog tape recorders with the full knowledge of all participants. All data-gathering practices were fully disclosed throughout.