國立台灣師範大學教育心理與輔導學系
博士論文
指導教授:林世華 博士

以多面向Rasch模式為基礎檢驗Angoff標準設定法的效度議題
(Validation Issues in an Angoff Standard Setting: A Facets-based Investigation)

研究生:李炯方 (JOSEPH P. LAVALLEE)

中華民國一百一年六月


ACKNOWLEDGMENTS

I would like to express my gratitude to my advisor, 林世華 (Lin Sieh-Hwa), for his guidance at every stage of this long process. In my many visits to his office, he always shared not only technical suggestions about my thesis but also generous doses of wisdom and perspective. I am grateful also to the additional members of my committee, 林邦傑 (Lin Pang-Chieh), 夙珍 (Nellie Cheng), 陳柏熹 (Chen Po-Hsi) and 張武昌 (Vincent Chang), whose thoughtful comments have made this a better dissertation than it otherwise would have been. I thank my friends, Scott Sommers, 洪素蘋 (Hung Su-Ping) and 黃宏宇 (Huang Hung-Yu), for all of the long and stimulating discussions about educational measurement, and my friend Quentin Brand for occasionally reminding me that other topics merited discussion as well. 林瑩玲 (Grace Lin), 詹雨臻 (Chan Yu-Chen), 林小慧 (Lin Shiao-Hui) and 方威傑 (Johnny Fang) have all gone well out of their way in the past few years to help me to navigate my way around linguistic and other barriers as I've worked my way through the program; my 'sister' Deborah Kraklow kindly offered comments on various drafts of this dissertation to help make it more readable. I thank all of them for giving so generously of their time. Finally, I thank Brian Lin for his patient support and quiet encouragement.

以多面向Rasch模式為基礎檢驗Angoff標準設定法的效度議題

李炯方 (JOSEPH P. LAVALLEE)

摘要

近年來,標準設定方法在教育實務情境中蓬勃發展,其中尤以修正版Angoff標準設定法的使用最為廣泛。Angoff法假定,經過訓練後的評分者能依據試題難度正確地估計出通過預設標準的最低能力受試者,其答對每一道試題的成功機率。由於標準設定方法的主觀評分特性,尋求適切的工具以確保評分者評分品質甚為重要。多面向Rasch模式(MFRM)已被廣泛使用於主觀評分情境,特別是在標準設定程序中,用以考驗評分過程中是否出現負向的評分者效果而影響評分品質。然而,多面向Rasch模式的基本假設為,評分者間的群體效果是不存在的。由於多數的研究除了評分資料外,並未能取得相對客觀的試題難度資料加以比對以考驗此假設,因此極少有研究檢驗該假設。使用Angoff法時,除了評分者對試題難度以及最低能力受試者能否達到預設標準的評估之外,通常還可以取得外部試題反應資料。基於此,本研究利用Angoff法所取得的外部試題反應資料以及評分者資料,來交叉驗證多面向Rasch模式的基本假設;其次,利用多面向Rasch模式來檢驗Angoff法的三個假設,以及評分資料與模式的適切程度。

在執行Angoff法時,研究者請18位外語教學(EFL)專家擔任評分者,並將英文閱讀以及聽力試題各40題對照到歐洲語言共同架構(Common European Framework of Reference)的B1等級。在負向評分者效果的偵測方面,本研究依據MFRM所提供的各項指標,偵測三種在評分過程常出現的評分者效果:嚴苛度(leniency/severity)、準確度(inaccuracy)以及趨中與極端評分(centrality/extremism)。接著,將Angoff設定法所估計的概率作為內在參照架構,並將施測所得的試題難度估計作為外在參照架構。首先,將MFRM指標用來偵測在兩個參照架構下的評分者效果,並比較兩個架構下標準設定的結果;其次,利用原始分數以及MFRM指標來考驗Angoff標準設定法的基本假定。

本研究主要的發現如下:

1. 對照兩個架構下的標準設定,評分者在嚴苛度、準確度以及評分趨中與極端程度上的結果並不一致。如此的差異使研究者對於單獨使用Angoff設定法作為設定標準分數的方式產生疑慮。有關群體效果假設的考驗也確實發現,在使用內部參照架構時,出現了群體趨中評分效果。這顯示在使用多面向Rasch模式之前,必須先考驗評分者間的群體效果是否存在。

2. 關於Angoff法的假設檢定,BPS以及試題功能方面均在一定程度上違反基本假設;其中較嚴重的缺失為,幾乎所有的評分者皆無法利用概率量尺準確量化其對最低能力受試者的評估。

關鍵字:標準設定、Angoff法、多面向Rasch模式、評分者效果、歐洲語言共同架構、評分品質

Validation Issues in an Angoff Standard Setting: A Facets-based Investigation

JOSEPH P. LAVALLEE (李炯方)

Abstract

Introduction: The use of standards-based scores in education has grown in recent years and the modified Angoff standard setting method is perhaps the most widely used procedure for establishing these standards. In this method, trained judges imagine students who just meet the standard in question and estimate the likelihood of their responding correctly to each item on the test being aligned to the standard. The method assumes that trained judges can accurately represent students who just meet the standard, represent how test items function and quantify their estimation of the likelihood of student success for each item. All three assumptions have been called into question. More generally, the subjective nature of all standard setting methods has resulted in a focused search for tools to evaluate the quality of judges' decisions. The many-facet Rasch model (MFRM) has been proposed for use in detecting rater effects generally and for evaluating standard setting results in particular. Use of the MFRM, however, relies on the further assumption that no group-level rater effects exist. Because only internal, judge-generated data is available in most cases, this assumption is usually not evaluated and little research exists on how plausible the assumption is in real settings or on how robust results are to violations of the assumption. As external item response information often is available when the Angoff method is used, an Angoff setting provides a rare opportunity to test this assumption of the MFRM. Thus, the two-fold purpose of this study is to first evaluate the suitability of the many-facet Rasch model using data from an Angoff standard setting, and then to evaluate the assumptions of the Angoff method using the MFRM.

Method: The data consisted of the first round estimates of a panel of 18 trained EFL professionals serving as judges in an operational Angoff standard setting linking two 40-item English exams (one reading, one listening) to the Common European Framework of Reference B1 proficiency level, and of the item response data from the original administration of the exams. MFRM indices were identified for the detection of three broad types of rater effects: leniency/severity, inaccuracy and centrality/extremism. These indices include estimated parameters and standard errors, residuals and residual-based indices, separation statistics and correlations between ratings and model indices. The probability estimates made by the Angoff judges were used to construct an 'internal' frame of reference, and the item difficulty estimates from the test administration were used to construct an 'external' frame of reference. Indices from the many-facet Rasch model were used to examine the subjective ratings of the Angoff judges for the presence of rater effects in both frames and the results were compared. In the second stage of the study, the assumptions of the modified Angoff method were assessed, using raw score and MFRM indices.

Results: In the first phase, results differed across frames for all three rater effects. The leniency/severity indicators suggested greater agreement between judges in the internal frame than in the external frame, although a similar number of judges were flagged (four in both the internal and external frames for reading; two in the internal and three in the external frame for listening). Inaccuracy effects were sharply underestimated within the internal frame of reference: six judges were flagged in the internal frame and nine in the external frame for reading; for the listening test, two and four judges were flagged in the internal and external frames respectively. Results for centrality/extremity differed even more markedly: for the reading test, four judges were flagged for centrality and five for extremism in the internal frame while 17 judges were flagged for centrality in the external frame; for the listening test, 10 judges were flagged for centrality and one judge for extremity in the internal frame while all 18 judges were flagged for centrality in the external frame. Group-level indicators did indicate the presence of group-level centrality and inaccuracy effects within the internal frame of reference, suggesting their possible use in evaluating the assumption of the model prior to use. In terms of the assumptions of the Angoff method, the BPS and item functioning assumptions appear to have been violated to some extent, but the most striking failure was the inability of nearly all judges to accurately quantify their assessments using the probability scale. The 'centrality' or 'central tendency' bias, in particular, was displayed by nearly all judges, compressing the Angoff metric. This compression of the scale appears to have been largely responsible for the distorted results for the MFRM leniency/severity and centrality/extremity indices in the internal frame noted above. Further, this scale compression appears to have distorted the cut scores, leading to differences in pass/fail rates: for the reading test, the pass rates within the internal frame across the three rounds of the standard setting were 46.4%, 37.8% and 37.7%, while the corresponding pass rates in the external frame were 38.1%, 29.0% and 27.2%; for the listening test, the pass rates in the internal frame were 35.4%, 35.4% and 31.5%, compared to 31.0%, 31.0% and 27.1% in the external frame.

Discussion: The critical assumption underlying use of the MFRM for detecting rater effects was found not to hold in the present case, casting doubt on the use of the model in standard setting situations for which only internal data (from the judges' estimates) is available. More positively, the group-level indicators within the internal frame were found to be sensitive to inaccuracy and centrality effects and thus may serve to help check the suitability of the model for use where no external data is available. The assumptions of the Angoff method were also found to be violated. In particular, a centrality or central tendency bias was shown to persist across all three rounds and to distort results. In view of previous research into central tendency, the present findings are consistent with the possibility that the Angoff method is inherently highly susceptible to the distorting effects of this bias. More generally, the centrality bias seems likely to pose a serious threat in many rating situations, both to the validity of ratings and to the accuracy of indicators used to evaluate these ratings. Future research should focus on refining our understanding of when the MFRM is likely to be appropriate for use; on solutions to problems with the Angoff method (perhaps in the form of procedural modifications or score adjustments); and on what rating situations are likely to be susceptible to the centrality bias and how it might be reduced or eliminated.

Keywords: standard setting, Angoff method, many-facet Rasch model, rater effects, Common European Framework of Reference, rating quality

TABLE OF CONTENTS

ACKNOWLEDGEMENTS  i
ABSTRACT (CHINESE)  ii
ABSTRACT (ENGLISH)  iv
TABLE OF CONTENTS  vii
LIST OF FIGURES  viii
LIST OF TABLES  ix

CHAPTER 1  INTRODUCTION  1
  1.1  Significance of the Current Study  1
  1.2  Research Questions  3
  1.3  Terminology  4

CHAPTER 2  LITERATURE REVIEW  6
  2.1  The Angoff Method: Assumptions and Validity Threats  6
  2.2  Detection of Rater Effects with the MFRM  15
  2.3  Assumption of the Use of the MFRM for Detecting Rater Effects  32

CHAPTER 3  METHODS  35
  3.1  Methodological Overview  35
  3.2  Exam Items and Calibrations  36
  3.3  Angoff Standard Setting  37
  3.4  Analysis  42

CHAPTER 4  RESULTS  46
  4.1  Assumption of the MFRM  46
  4.2  Assumptions of the Angoff Method  82

CHAPTER 5  DISCUSSION AND CONCLUSION  87
  5.1  Summary of Results  87
  5.2  Implications and Suggestions  90
  5.3  Limitations of the Present Study  94
  5.4  Future Research Directions  94

REFERENCES  96

APPENDICES  106
  Appendix A  Item Quality Statistics from Original Administration of Test  106
  Appendix B  CEFR Scales Used to Provide Performance Level Descriptors  108
  Appendix C  Angoff Judge Response Form  110
  Appendix D  Results for all MFRM Indices  111

LIST OF FIGURES

Figure 4.1  Reading cut scores (judge severity measures in logits), internal and external frameworks.  49
Figure 4.2  Comparison of reading cut scores (judge severity measures in logits), internal and external frameworks.  50
Figure 4.3  Listening cut scores (judge severity measures in logits), internal and external frameworks.  53
Figure 4.4  Comparison of listening cut scores (judge severity measures in logits), internal and external frameworks.  54
Figure 4.5  Inaccuracy indices for reading, internal v. external frameworks.  58
Figure 4.6  Inaccuracy indices v. score/p-value correlations, internal v. external.  60
Figure 4.7  Inaccuracy indices for listening, internal v. external frameworks.  62
Figure 4.8  Indices v. score/p-value correlations, listening, internal v. external.  64
Figure 4.9  Item difficulty in logits for reading and listening, internal v. external.  67
Figure 4.10  Centrality/extremity indices for reading, internal v. external.  70
Figure 4.11  Centrality/extremity indices v. raw score standard deviations, reading, internal v. external.  72
Figure 4.12  Centrality/extremity indices for listening, internal v. external.  74
Figure 4.13  Centrality/extremity indices v. raw score standard deviations, listening, internal v. external.  76
Figure 4.14  Item infit mean square values in logits for reading and listening, internal v. external frameworks.  79

LIST OF TABLES

Table 2.1  Summary of Indicators for Detecting Rater Effects  31
Table 3.1  Contents of the English Proficiency Test (EPT)  36
Table 3.2  Items on Test Forms Used in Angoff Standard Setting  37
Table 3.3  Angoff Judges  38
Table 3.4  Leniency/Severity - Indices and Criteria  43
Table 3.5  Inaccuracy - Indices and Criteria  44
Table 3.6  Centrality/Extremism - Indices and Criteria  45
Table 4.1  Judge Separation Statistics for Reading, Internal v. External Frames  47
Table 4.2  Reading Results (Means, Severity Measures and SEs), Internal v. External  48
Table 4.3  Judge Separation Statistics for Listening, Internal v. External Frames  51
Table 4.4  Listening Results (Means, Severity Measures and SEs), Internal v. External  52
Table 4.5  Judge Separation Statistics for Reading and Listening, Internal v. External, with Flagged Judges Removed  55
Table 4.6  Indices of Inaccuracy for Reading, Internal v. External  57
Table 4.7  Correlation Matrices of Inaccuracy Indices for Reading  59
Table 4.8  Indices of Inaccuracy for Listening, Internal v. External  61
Table 4.9  Correlation Matrices of Inaccuracy Indices for Listening  63
Table 4.10  Item Separation Statistics, Internal v. External  65
Table 4.11  Indices of Centrality/Extremity for Reading, Internal v. External  69
Table 4.12  Correlation Matrices of Centrality/Extremity Indices for Reading  71
Table 4.13  Indices of Centrality/Extremity for Listening, Internal v. External  73
Table 4.14  Correlation Matrix of Centrality/Extremity Indices for Listening  75
Table 4.15  Item Fit Indices for Reading, Internal v. External Frames  78
Table 4.16  Item Fit Indices for Listening, Internal v. External Frames  78
Table 4.17  Summary of Flagged Raters, Reading and Listening, Internal v. External  81
Table 4.18  Raw Score Statistics and Summary of Flagged Raters, Reading and Listening  82
Table 4.19  Raw Score Statistics For All Rounds, Reading and Listening  84
Table 4.20  Cut Scores, Standard Deviations and Pass Rates for All Rounds  85
Table 4.21  Judge Characteristics and Indices for Severity, Accuracy and Centrality  86

CHAPTER 1  INTRODUCTION

1.1  Significance of the Current Study

In recent years, the use of standards-based scores has become increasingly widespread, internationally as well as in Taiwan. When significant consequences are attached to meeting these standards, validity becomes an issue of obvious importance. Since most standard setting methods rely on subjective judgments made by content-area experts, assessing the validity of the results involves evaluating the quality of such judgments. However, methods for this task are themselves still under development. The two-fold purpose of the present study is to use data from a single operational standard setting both to empirically assess the methods used to evaluate subjectively-made judgments and to evaluate the assumptions of the standard setting method itself.

Standard setting methods are employed so that scores from an examination can be reported in relation to a standard. Several methodologies have been devised that make use of expert judgment to arrive at a numerical cut score linking an examination to a standard. The most commonly used standard setting methodology is the modified Angoff method (Angoff, 1971). In the modified Angoff standard setting method, a panel of content-area experts is trained to imagine a 'barely proficient student' who has just achieved the proficiency standard in question and then to work through the items in a test, estimating for each one the probability that the barely proficient student would answer it correctly. The estimates are summed and the average across raters is the recommended cut score. Typically, this is an iterative procedure involving two or three rounds, with empirical performance data from the test in question used to provide feedback when available.

The claim that the method generates valid results rests on a set of assumptions which minimally include that trained judges can: (1) develop accurate representations of the just-proficient examinee; (2) accurately represent item functioning (i.e., the features of each item that make it either more or less difficult for examinees); and (3) juxtapose these two representations to arrive at a quantitative estimate - the probability that the just-proficient examinee can respond correctly to the item. Each of these three assumptions has been questioned in the literature and there are well-established reasons for believing that they may not hold in all, or even typical, cases.

Given these known difficulties, methods for assessing the quality of judges' decisions would clearly be of considerable value, and different diagnostic indicators have been suggested for use in evaluating the results of standard setting meetings. A growing body of literature has emerged around the use of the many-facet Rasch model (MFRM) and related latent trait models for detecting a number of rater effects, including leniency/severity, inaccuracy and central tendency. In recent years, a number of authors have proposed using this model for the detection of rater effects in the context of standard setting exercises (Eckes, 2009; Engelhard, 2007, 2009, 2011; Engelhard & Anderson, 1998; Engelhard & Cramer, 1997; Engelhard & Gordon, 2000; Engelhard & Stone, 1998; Noor, 2007).

However, use of the MFRM in a standard setting context is itself based on a further assumption. Namely, it assumes that any rater effects are confined to a minority of raters and that no group-level effects are present. This is because the collective ratings of the group are used to define the model expected values, in relation to which deviations can be isolated and identified as particular rater effects. If group-level effects exist, they would influence the expected values themselves. Thus, the claim that the expected values can be used to detect rater effects depends on the assumption that, at the level of the entire group of raters, there are no rater effects.

Typically, it is very difficult to evaluate this assumption. In most rating situations, the only data available comes from the judgments of the raters themselves; there is no external data available against which it can be compared. Thus, despite the large and growing body of literature around the use of the MFRM for detecting rater effects, the viability of this assumption has rarely been questioned.

From this perspective, an Angoff standard setting provides an unusual opportunity. In many situations in which the Angoff method is used, data on item difficulty exists from the original administration of the exams. These item difficulty parameters can be used to construct an external frame of reference, which can then be used to evaluate the assumption of the MFRM. The present study is the first to explicitly attempt to determine whether this assumption holds within a particular rating situation and, if the assumption is not met, how robust results are when violations of the assumption occur.

The purpose of this study is thus two-fold. By constructing an external frame of reference from original test results and an internal frame of reference from expert judgments, it seeks to first evaluate the underlying assumption of the many-facet Rasch model, and then to use the MFRM, along with raw score indicators, to evaluate the assumptions of the Angoff standard setting method. The study thus seeks to contribute to our understanding of the use of the MFRM for detecting rater effects. In terms of standard setting procedures, it is hoped that the study will add to the literature exploring the characteristics and evaluating the assumptions of the Angoff method. Identification of specific threats to validity should be useful for refining the procedures used in future standard settings, for the design of the training conducted prior to their implementation, and for the validation process that occurs after the cut score has been established.

1.2  Research Questions

The analyses conducted in this study seek to address the following research questions:

1) Does the critical assumption required for the use of latent-trait models for the detection of rater effects in a criterion-referenced situation hold in a typical modified Angoff standard setting?

2) Are the assumptions of the modified Angoff standard setting viable?
   a) Are trained Angoff panelists able to develop accurate representations of the 'barely proficient student'?
   b) Are trained Angoff panelists able to develop accurate representations of item functioning?
   c) Are trained Angoff panelists able to juxtapose these representations to assess the likelihood of the barely proficient student answering each item correctly and to quantify this assessment using the 0-1 probability scale?

1.3  Terminology

Angoff Method – Often referred to as the 'modified Angoff' method, this is a procedure for generating a cut score linking a particular exam to the achievement of a standard. In this procedure, a trained panel of subject-matter experts goes through the test and, for each item, estimates how many of a group of 100 barely proficient students (students who just meet the standard) would respond to it correctly. Results are summed for each panelist and averaged across the panel to arrive at the cut score (a computational sketch of this step follows at the end of this section). Often two or three rounds of judgments are conducted, with empirical data used to provide feedback to panelists between rounds, if such data is available.

Barely Proficient Student (BPS) – The (imaginary) student who just meets the standard in question. In the Angoff method, panelists are asked to develop an internal image of such a student after familiarization with the performance standard. The BPS is sometimes referred to as the 'minimally competent examinee,' the 'borderline examinee' or the 'just-proficient student.' In this study, these terms, along with the 'B1 BPS' and 'just-B1 student,' will be used interchangeably.

Common European Framework of Reference (CEFR) – A manual developed by the Council of Europe (CoE, 2001) to provide common reference materials for the teaching and learning of different languages. The manual contains 54 language proficiency scales, covering various aspects of language performance. The proficiency scales consist of six basic levels, labeled, in increasing order of proficiency, A1, A2, B1, B2, C1 and C2. The CEFR scales have been adopted for use internationally in providing 'performance standards.'

Cut score – The 'cut score' is the test score which translates between the performance standard and performance on the test. A student whose test score is at or above the cut score is said to have reached the performance standard.

Facilitator – A facilitator is a person responsible for conducting the training and the actual standard setting meeting.

Judges – All standard setting methods requiring subjective judgments require panelists or judges who will make these decisions under the guidance of the meeting facilitator(s). They are typically expected to be subject-matter experts and are also sometimes expected to be representative of different stakeholder groups.

Modified Angoff Method – See "Angoff Method," above.

Performance standards – Benchmarks against which performances can be measured.

Performance Level Descriptors (PLDs) – Descriptions of the characteristics of performance at a given level. In this study, the key PLDs are the B1 level descriptors for listening and reading from the Common European Framework of Reference.
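To make the cut-score arithmetic described under 'Angoff Method' above concrete, the following minimal Python sketch (with invented judge labels and probability estimates, offered purely as an illustration rather than as part of the operational procedure) sums each judge's item estimates and averages across the panel:

# Illustrative sketch of the modified Angoff cut-score computation.
# The judges and probability estimates below are hypothetical.
angoff_estimates = {
    "Judge 1": [0.60, 0.75, 0.40, 0.85],   # estimated P(correct) for the BPS, one value per item
    "Judge 2": [0.55, 0.70, 0.50, 0.80],
    "Judge 3": [0.65, 0.80, 0.45, 0.90],
}

# Each judge's recommended cut score is the sum of his or her item estimates.
judge_cut_scores = {judge: sum(probs) for judge, probs in angoff_estimates.items()}

# The panel's recommended cut score is the mean of the judges' cut scores.
panel_cut_score = sum(judge_cut_scores.values()) / len(judge_cut_scores)

print(judge_cut_scores)           # {'Judge 1': 2.6, 'Judge 2': 2.55, 'Judge 3': 2.8}
print(round(panel_cut_score, 2))  # 2.65 (on a toy four-item test)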

CHAPTER 2  LITERATURE REVIEW

Section 2.1 introduces the Angoff method and the assumptions that it makes, along with the different types of rater errors which could lead to violations of those assumptions. Section 2.2 introduces the many-facet Rasch model (MFRM), which has been proposed for use in detecting the presence of rater effects. The model and its indices are introduced and the use of the model for investigating rater effects is described. The central assumption made for this purpose is introduced and means for evaluating this assumption are discussed.

2.1  The Angoff Method: Assumptions and Validity Threats

One of the procedures most commonly employed in standard setting is the modified Angoff method (Angoff, 1971; hereafter simply the "Angoff method"). With this method, a panel of judges, usually content-area experts, is trained to imagine a 'barely proficient student' (BPS) who has just achieved the proficiency standard in question. After being trained, the judges consider each item in the test, one by one, estimating for each the probability that the BPS would answer the item correctly. The sum of these estimates for each judge represents the 'cut score' recommended by that judge; that is, the score on the test that an examinee would need to reach to be considered as having reached the standard in question. The average cut score across the entire panel of judges is taken as the recommended cut score. Typically, this is an iterative procedure involving two or three rounds, with empirical performance data from the test in question used to provide feedback, where such data is available.

2.1.1  Assumptions of the Angoff Method

Use of a cut score to make potentially high-stakes decisions about examinees assumes that if students who perfectly exemplified the 'barely proficient' ability level for the standard took the test, they would receive the same score as did the barely proficient students imagined by the Angoff judges. This implies further assumptions. Angoff did not explicitly state these assumptions and no single 'list' is agreed upon in the various discussions of the procedure (Brandon, 2004; Impara, 1997; Impara & Plake, 1998; Ricker, 2006). Nonetheless, the method seems to assume, at a minimum, the following:

1. Accurate Representation of the Barely Proficient Student. Trained judges can develop accurate representations of the ability level of the just-proficient student.

2. Accurate Representation of Item Functioning. Trained judges can accurately represent the nature and level of knowledge, skills and abilities required to respond to the item when making their estimates.

3. Quantification. Trained judges can juxtapose their representations of the BPS and of item functioning to assess the degree of challenge posed by the item for the BPS and quantify this using the 0-1 probability scale.

Threats to these assumptions, which would call the validity of Angoff results into question, are discussed next.

2.1.2  Known Threats to the Assumptions of the Angoff Method

The Angoff method is likely the most thoroughly researched of all standard setting methods. Findings of this research as they relate to the three assumptions listed above are summarized here.

Assumption 1: Accurate Representation of the Barely Proficient Student. In developing mental representations of the just-proficient student, panelists are internalizing the construct as it is articulated in the performance level descriptors (Bourque, 2000; Egan et al., 2009; Lewis & Green, 1997; Mercado & Egan, 2005). All verbal descriptions of ability are likely to leave some degree of ambiguity. The ambiguity of the CEFR descriptors used in the present study, for example, has been widely discussed (e.g., Weir, 2005). Given this, it may be more reasonable to think of a 'zone' within which accurate BPS representations might exist rather than a single point. Put differently, within limits, different experts or judges might have different but more or less equally defensible interpretations of the written standards. Thus, 'accurate BPS representations' are here understood as those which fall within a 'zone' along the latent trait continuum. Threats to validity exist when BPS representations fall outside of this range, such that they cannot be defended as reasonable interpretations or 'translations' of the PLDs.

Research on different types of performance level descriptors (PLDs) has offered some support for the BPS assumption. Impara, Giraud & Plake (2000) found that judges set a higher cut score on the same exam when given PLDs reflecting a higher degree of proficiency. Skorupski and Hambleton (2005) found that teacher descriptions of performance levels converged after training and orientation activities. Giraud, Impara & Plake similarly found that teachers given more detailed PLDs generated more detailed descriptions of the BPS, and Fehrmann, Woehr & Arthur (1991) found that two groups of panelists who received more thorough training with practice rounds produced estimates that were in closer agreement than a third group which received minimal training.

The literature on standard setting has focused on three general variables which might lead to violations of the BPS assumption: factors other than the PLDs influencing the development of BPS representations, panelist background and panelist stakes in the outcome. In a qualitative study, McGinty (2005) found that judges seemed to be basing their representations of the BPS on particular students who had been granted degrees (indicating achievement of the standard) instead of on the PLDs describing what students should master to earn the degree. Reid (1985) found that having judges make estimates for the total group before doing so for the target group lowered the cut score for the target group considerably, suggesting the possibility that consideration of the student population as a whole influenced the development of the BPS representation. Similarly, in a study of a speaking and a listening test being linked to the CEFR, Papageorgiou (2010) found that some panelists relied on information other than the PLDs in making their judgments.

Another set of studies has focused on the performance of panelists with different backgrounds. Hamberlin (1992, in Brandon, 2004) found that non-teachers in a school (administrators, curriculum specialists, etc.) set significantly higher standards than did teachers. It may be that teachers, more familiar with the precarious nature of newly learned skills, developed a somewhat less 'able' BPS, or that administrators place a higher priority on setting higher standards that would reflect well on the school. Cross et al. (1984) found that public high school teachers and teacher-educators in universities set different cut scores on a teacher education battery. Busch and Jaeger (1990) found similar effects for public school and college/university-based judges on a similar test, noting that ratings provided by the public school judges were more influenced by item performance data than were the ratings from the college/university content specialists. Verhoeven et al. (1999, 2002) compared practicing professionals (doctors) with recently graduated students, and found that the recent graduates gave more homogeneous judgements and set a significantly more lenient cut score. Another study found that psychology graduate students setting cut scores for a psychology test had less variation in their scores than a group of undergraduate students who had just taken the course, suggesting that the graduate students may have shared more similar representations of the BPS (Maurer et al., 1991). However, other studies (e.g., Norcini, Shea, & Kanya, 1988; Plake, Impara & Potenza, 1994) have found no significant difference in the ratings provided by judges with different backgrounds.

McGinty (2005) also found that the consequences associated with different cut scores seemed to contaminate the process, with judges who were high school teachers feeling a tension between the desire to set high standards and the desire to be viewed by the public as doing a good job (which would be called into question if they set a high cut score resulting in more students failing). In that study, the majority initially wanted to set high standards but, McGinty observed, "reality set in when some participants pointed out that teacher performance would be judged by the passing rates on the test." Consistent with McGinty's conclusion, Ferdous and Plake (2005) found that judges in the U.S. who indicated that they were influenced by the consequences of cut scores in relation to the No Child Left Behind law set lower cut scores.

In the present study, the PLDs were comparatively very detailed and the training period relatively lengthy. Further, it is unlikely that the panelists anticipated significant consequences resulting from their cut score decisions. These considerations would suggest a high level of agreement about the ability level of the BPS. On the other hand, the panelists came from rather diverse backgrounds, ranging from recently graduated students, to native English speaking teachers with years of classroom experience and panelists with administrative job positions. These background differences might be expected to produce divergent BPS representations.

Assumption 2: Accurate Representation of Item Functioning. This has surely been the most controversial and well-researched assumption of the Angoff method (Brennan & Lockwood, 1980; Chang, 1999; Chang et al., 1996; Clauser et al., 2009; Fehrmann, Woehr & Arthur, 1991; Goodwin, 1999; Hurtz & Jones, 2009; Impara & Plake, 1998; Lorge & Kruglov, 1953; Plake & Impara, 2001; Plake, Impara & Irwin, 1999; Shepard, Glaser, Linn, & Bohrnstedt, 1993; Van Der Linden, 1982). The overwhelming consensus which has emerged from this research is that judges are indeed quite limited in their ability to represent item functioning. Most such studies have reported correlations between the means of modified Angoff judges' item estimates and actual difficulty levels (i.e., empirical p-values). Brandon's 2004 review of the literature on the Angoff method reported that, across the 29 correlations reported, average correlations were .63 for operational standard settings and .51 for non-operational standard settings. This moderate level of success in meeting the assumption has remained the rule in studies published since Brandon's review (e.g., Clauser et al., 2009).

Research in this area has increasingly sought to investigate the variables influencing accuracy in assessing item difficulty. Panelist background and expertise have been the focus of one line of research, with inconclusive results. Van De Watering and Van Der Rijt (2006) found that students were more accurate than their teachers, but the Verhoeven studies discussed above failed to find a difference between panelists with different backgrounds.

Assumption 3: Quantification. After developing representations of the BPS and of item functioning, Angoff judges next need to juxtapose these representations, imagine how the just-proficient student would interact with the item, conceptualize the degree of challenge posed by the task and 'quantify' this by estimating the probability of the BPS answering correctly. The ability of panelists to quantify their expectations as probabilities has rarely been explicitly discussed. This is curious, as there is little reason to expect this to be a natural task for most people and, conceptually, it is not clear how a panelist is expected to perform it.

Furthermore, previous research offers reason to believe that the central tendency or centrality effect, in particular, may commonly occur when the Angoff method is used. The centrality effect has long been known to influence judgments made in settings similar to that of the Angoff. Indeed, over a century ago, Hollingworth noted that judgments of "time, weight, force, brightness, extent of movement, length, area, size of angles, have all shown the same tendency to gravitate toward a mean magnitude, the result being that stimuli above that point in the objective scale were underestimated and stimuli below overestimated" (Hollingworth, 1910, p. 426). This effect has been consistently found within the psychophysics tradition: Stevens and Greenbaum (1966) reviewed a series of experiments demonstrating the same effect, which they referred to as a "regression effect." More than a decade later, Poulton provided an updated review of the literature concerning this tendency, which he referred to as "contraction bias" and described as "a general characteristic of human behavior" (Poulton, 1979, p. 778). Unfortunately, this literature has rarely been referred to in relation to the Angoff method, despite its obvious relevance.

If the centrality effect were present in an Angoff setting, it would manifest as a tendency for judges to overestimate the difficulty of relatively easy items and to underestimate the difficulty of relatively difficult items. The standard deviation of judges' estimates would also be smaller than the standard deviation of the empirical item difficulties (i.e., those derived from the actual administration of the test to the relevant student population). Precisely this pattern of results has been found in a number of studies. In Lorge and Kruglov's (1953) study of the ability of judges to estimate item difficulty, the standard deviation of the judges' estimates was 16.3, compared to 23.7 for the empirical item difficulties. Shepard (1994) found that trained Angoff judges systematically overestimated examinee performance on difficult items and underestimated examinee performance on easy items. In Goodwin (1999), 14 judges made estimates for all examinees and for the borderline examinees on a 140-item financial certification exam. The standard deviations of the judges' estimates were .09 for the total group and .10 for the borderline group; the corresponding standard deviations from the actual exam results were .19 and .18 respectively. Heldsinger and Humphry (2005) and Heldsinger (2006) reported results from a study in which 27 judges used a modified Angoff procedure with 35 items from a Year 7 reading test. The standard deviation of the item difficulties set by the panelists was 0.5 logits, less than half the standard deviation of 1.16 logits from the actual exam results. The authors used the ratio of the standard deviations to re-scale the Angoff results and found that it significantly altered the final cut score. Schulz (2006), in addition to providing one of the first attempts to theoretically elucidate the nature of this bias as it relates to standard setting, reported results from a pilot study with 21 Angoff panelists making estimates for items from the 2005 NAEP Grade 12 math exam. The results suggested 'scale shrinkage' which, significantly, persisted even through the third round of ratings. Finally, Clauser et al. (2009) reported results from two operational standard setting exercises for a physician credentialing examination, with six Angoff judges making estimates for 200 items (34 of which had associated empirical data) on one, and six judges and 195 items (43 with empirical data) on the other. Even though items with "very high" or "very low" p-values were excluded from the study, the judges were still found to "systematically overestimate the probability of success on difficult items and underestimate the probability of success on easy items" (Clauser et al., 2009, p. 17).

In fact, results consistent with a centrality effect appear to have been found every time they have been looked for. The one seeming exception is a study by Impara and Plake, in which, according to the authors, panelists "did not systematically overestimate (or underestimate) performance on easy items or overestimate (or underestimate) performance on hard items" (Impara & Plake, 1998, p. 77). However, the particular methodology used in that study makes it difficult to directly compare their results with the studies mentioned above. In their study, the authors asked 26 sixth-grade science teachers to estimate the probabilities of success on each item in a 50-item science test for two groups: the borderline ("D/F") students in their class, and the class as a whole. They also asked the teachers to assign and record the class grades for each student. The researchers then compared predicted with actual performance for both groups, with the borderline group defined by the teacher-assigned class grades. They found that the teachers overestimated the performance of the class as a whole but underestimated the performance of the borderline group. They then examined the relationship between predicted and actual item difficulty levels for both groups, categorizing estimates as overestimates (more than .10 over the actual p-value), underestimates (more than .10 under the p-value) and accurate estimates (within .10 of the actual p-value). The results were then further divided according to the difficulty level of the item (items with p-values below .34, between .34 and .66, and above .66). They concluded that "these results did not show a consistent variation in accuracy of prediction simply as a function of item difficulty" (p. 77).

This study certainly speaks to the ability of panelists to estimate the performance of particular students and may be of particular interest in comparing the modified Angoff method with student-centered standard setting methods, such as the contrasting groups method. Nonetheless, their results cannot be compared directly with results from the studies mentioned above, for at least two reasons. First, as noted in Clauser et al. (2009), Impara and Plake defined the borderline group in terms of the class grades assigned by the teachers. In order to make a direct comparison of estimated and observed difficulty levels, the authors would have needed to define the groups statistically, in accordance with the modified Angoff method: by the number of items each group was predicted to answer correctly. Doing so would have resulted in a different set of proportion-correct ('p') values, a different categorization of items into the three levels of item difficulty, and different percentages of estimates falling into each of the accuracy categories used by the authors (overestimates, accurate estimates and underestimates). In other words, the relevant comparison is with students who performed around the mean score derived from the teachers' item-by-item estimates. Second, the authors provide no information on the dispersion of estimates, such as the range or standard deviation. Without these, and given the above issue of category definition, Impara and Plake's findings cannot be used as evidence either for or against the presence of a centrality bias.

In short, then, based on previous research, there is strong reason to believe that the Angoff method is highly vulnerable to a central tendency bias which has the potential to undermine one of its core assumptions.

2.1.3  Rater Effects

An important part of the validation process generally is to identify possible threats to validity, formulate them as hypotheses and then seek to empirically refute them (APA/AERA/NCME, 1999; Kane, 1994; Messick, 1998). For standard setting and subjective rating situations more broadly, such hypotheses can be explicitly formulated in terms of the presence of possible 'rater effects,' defined as a "broad category of effects [resulting in] systematic variance in performance ratings that is associated in some way with the rater and not with the actual performance of the ratee" (Scullen, Mount and Goff, 2000, p. 957). These rater effects have been investigated in some depth within two broad research traditions. The first of these has focused on the psychological processes involved in making subjective evaluations and on the potential sources of rater effects (Pula & Huot, 1993). The second tradition has focused on detecting and diagnosing rater effects by searching for their characteristic patterns in ratings data. Research within this latter tradition has resulted in a variety of criteria to evaluate the psychometric quality of ratings, across different measurement frameworks, including classical test theory, analysis of variance, regression analysis, generalizability theory and Rasch measurement/item response theory (Saal, Downey, & Lahey, 1980; Stemler, 2004; Stemler & Tsai, 2008). Within this broad literature, rater effects have been defined in various ways (Myford & Wolfe, 2003, 2004; Saal, Downey & Lahey, 1980). The present study will follow Wolfe's division of rater effects into three categories: leniency/severity, inaccuracy and centrality/extremism (Wolfe, 2004). These are discussed in turn.

Leniency/Severity. This effect is present when raters give scores that are consistently either too high or too low. In terms of an Angoff standard setting, leniency/severity is present when a judge's probability estimates are uniformly either lower or higher than is warranted by the performance-level descriptors. A judge displaying a leniency bias would assign comparatively low probability estimates to the items, resulting in a lower cut score and a higher percentage of students meeting the standard. Conversely, a judge displaying a severity bias would attribute to the BPS more ability than warranted by the PLDs and would thus assign higher probability estimates, resulting in a higher cut score and a lower percentage of students meeting the standard.

Inaccuracy. To the extent that this effect is present, ratings will appear unrelated to the presence or absence of the latent trait being rated. In an Angoff standard setting, this effect would create inaccuracies in the representations of item functioning. (It should be noted that, within the broader category of inaccuracy, it is possible to make a further distinction between randomness and differential dimensionality (Wolfe & McVay, 2011). Randomness is present when a rater's ratings diverge in a non-systematic manner from error-free measurements (Wolfe, 2004), whereas differential dimensionality occurs when ratings systematically deviate from the ratings that would be assigned by an error-free process, violating the assumption of local independence and the related assumption of unidimensionality. Differential dimensionality may result from a number of specific biases which have been discussed in the literature, such as the halo effect (Saal et al., 1980, p. 474), logical error (Newcomb, 1931; Linn & Gronlund, 2000), or bias/interaction effects (Lumley & McNamara, 1995; Lynch and McNamara, 1998; Wigglesworth, 1993). However, pursuing and further specifying the cause of inaccurate ratings is beyond the scope of the present study.)

Centrality/Extremism. The centrality effect (discussed above in relation to the quantification assumption) is present when a rater clusters his or her ratings around a certain point of the rating scale or around the center of the perceived range of performances, resulting in a compressed distribution. This results in reduced variation in assigned ratings, and in ratings that are accurate at the center of the ability range but which overestimate the ability of less proficient examinees and underestimate the ability of more proficient examinees. In an Angoff setting, this would mean overestimation of the probability of success for more difficult items and underestimation of the probability of success for easier items.¹ A less frequently discussed effect is extremism, present when ratings cluster at the extreme ends of the rater's distribution of ratings (Wolfe, 2004). Where this effect is present in an Angoff setting, difficult items would tend to be judged as being even more difficult than they really are and, vice versa, easier items would be judged as being even easier than they actually are.

¹ Centrality or central tendency is often defined to occur when ratings cluster near the midpoint of the rating scale, and distinguished from range restriction, which is defined to occur when ratings cluster around any point of the rating scale (Saal, Downey & Lahey, 1980). Here, in line with an earlier tradition (Hollingworth, 1910), these are treated as a single rater effect occurring when ratings cluster around the average rating.

2.2  Detection of Rater Effects with the MFRM

In recent years, latent trait models, and the many-facet Rasch model (MFRM) in particular, have been widely proposed for use in detecting, diagnosing and, to some extent, adjusting for rater effects (Eckes, 2005; Engelhard, 1992, 1994, 1996; Linacre, 1989; Myford & Wolfe, 2003, 2004). Of particular interest here, a number of researchers have applied the MFRM to detect rater effects in the context of standard setting (Eckes, 2009; Engelhard, 2007, 2009, 2011; Engelhard & Anderson, 1998; Engelhard & Cramer, 1997; Engelhard & Gordon, 2000; Engelhard & Stone, 1998; Noor, 2007). This section first describes the relevant members of the Rasch family of models and discusses the indices which have been proposed for the detection of specific rater effects. This application of the model relies on the assumption that no group-level rater effects are present. This assumption is often left implicit and, to date, there is no instance in the literature in which it has been explicitly evaluated prior to application of the model. This assumption is thus treated at some length, and means for evaluating it are described.

2.2.1  Latent Trait (Rasch) Models: Parameters and Indices

It may be possible to infer the presence of rater effects from the parameter estimates generated by the model, from the residuals between expected and observed values, from the separation statistics, and from the correlations between ratings and the indices generated by the model. Each is discussed below.

Latent Trait Models and Model Parameters

In Rasch's original model for dichotomous data, responses are a stochastic function of person and item parameters:

ln(Pni1 / Pni0) = βn − δi                  (2.1)

where βn is the location of person n along the underlying latent trait, δi is the location of item i along the same latent variable, and Pni1 and Pni0 are the probabilities of person n on item i scoring 1 and 0, respectively. Applying the model assumes the existence of a quantitative underlying variable (e.g., EFL reading or listening ability), and when parameters are estimated from the raw response data, an interval scale along which all examinees and items can be located is generated for this variable.
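As a minimal illustration of the dichotomous model in (2.1), the following Python sketch (using hypothetical ability and difficulty values, not data from the present study) converts a person-item pair of logit parameters into the model probability of a correct response:

import math

def rasch_probability(beta_n, delta_i):
    """Probability of a correct response under the dichotomous Rasch model (2.1)."""
    # ln(P1/P0) = beta_n - delta_i  =>  P1 = exp(beta_n - delta_i) / (1 + exp(beta_n - delta_i))
    logit = beta_n - delta_i
    return math.exp(logit) / (1 + math.exp(logit))

# Hypothetical values: a person 0.5 logits above the scale origin facing items of varying difficulty.
beta = 0.5
for delta in (-1.0, 0.0, 0.5, 1.5):
    print(f"item difficulty {delta:+.1f} logits -> P(correct) = {rasch_probability(beta, delta):.3f}")

# When beta equals delta the predicted probability is exactly .50;
# easier items (lower delta) yield higher probabilities, harder items lower ones.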

The distance between any two items, any two students or any item and any student indicates a specific quantity of the attribute being measured. The origin of the scale is arbitrary (often set at the mean item difficulty location), as is the unit which partitions the latent variable into specific quantities. Such a situation describes a single specified frame of reference, understood as a collection of agents (students), a collection of objects (items), and outcomes of the interactions between them (Rasch, 1977). The frame of reference for the dichotomous model is constructed by setting up interactions between agents and objects so as to transmit variations in underlying person ability and item difficulty to the measurement outcome, performance on the test.

The Rasch rating scale model (Andrich, 1978) expands the dichotomous model to allow for polytomous cases, so that

ln(Pnik / Pni(k−1)) = βn − δi − τk                  (2.2)

where τk refers to the threshold between two adjacent rating scale categories, and where Pnik and Pni(k−1) refer to the probability that person n attempting item i endorses category k and category k−1, respectively. Note that in many empirical frames of reference involving a rating scale, the relationship between the latent variable and the measurement outcome has been dramatically changed. Variation is no longer transmitted through the direct interaction between person and item. Rather, ratings involve subjective perception and judgment concerning the outcome of the interaction between the task or item and the person. This opens the way for rater effects to influence the measurement outcome.

The many-facet Rasch model (MFRM; Linacre, 1989) provides a further extension of the model to take account of other features or 'facets' of the rating situation, starting with the severity of the different raters. Thus, a typical facets model for a three-facet situation can be formulated as follows:

ln(Pnijk / Pnij(k−1)) = βn − δi − λj − τk                  (2.3)

where βn is the ability of person n, δi is the difficulty of item i, λj is the severity of judge j and τk is the difficulty of observing category k relative to category k−1, and Pnijk and Pnij(k−1) refer to the probabilities of examinee n being graded on item i by judge j with a rating of category k and k−1, respectively.

In the above model, the rating scale is the same for all judges and all criteria. However, it is also possible to use a partial-credit model (Masters, 1982) in which each criterion has its own rating scale. This may be desirable if the use of the scale categories is expected to differ between different criteria. This model can be formulated as follows:

ln(Pnijk / Pnij(k−1)) = βn − δi − λj − τik                  (2.4)

In this model, τik refers to the difficulty of observing category k on item i.

In other situations, we may want the scale to remain the same across criteria but allow it to vary across judges. This can also be modeled, using the following formula:

ln(Pnijk / Pnij(k−1)) = βn − δi − λj − τjk                  (2.5)

In this model, the threshold of the steps between adjacent rating categories varies among judges. The two-subscript term, τjk, refers to the difficulty of observing category k used by judge j.

Further facets can be included in the model based on hypothesized sources of construct-irrelevant variance. Thus, if interaction effects (e.g., between judges and criteria or raters and ratees) or sources of differential rater functioning (DRF) such as gender or professional background are believed to be present, these can also be included:

ln(Pnijgk / Pnijg(k−1)) = βn − δi − λj − γg − φjg − τk                  (2.6)

where γg is the ratee group (e.g., gender, profession) and φjg is a bias interaction term representing the interaction between judges and a group of ratees. This facet indicates the degree to which rater j's ratings for ratee group g differ from the expected ratings of rater j for ratee group g, as predicted by a model not containing the term.

Another model which can be used for parameter estimation is the binomial trials model (Wright & Masters, 1982), which is used when the response format calls for a specific number of independent attempts at each item with a dichotomous outcome (success or failure), and the number of successes is counted. The responses are thus defined as the number of independent successes. This model is as follows:

ln(Pnix / Pni(x−1)) = βn − δi − τx                  (2.7)

where τx refers to the difficulty of achieving a count of x relative to a count of x−1, and where Pnix and Pni(x−1) refer to the probability of person n achieving a count of x on item i and a count of x−1 on item i, respectively. This model has been recommended as being particularly appropriate for a modified Angoff situation (Eckes, 2009; Engelhard & Anderson, 1998; J.M. Linacre, personal communication, July 7, 2010), since in the modified Angoff method, judges are presented with a series of dichotomous items and essentially asked: "Out of 100 barely proficient students, how many would answer this item correctly?" The probabilities obtained can thus be modeled as outcomes of binomial trials, with the number of independent trials fixed at 100 and the judges asked to count the number of successes (i.e., the number of barely proficient students who would answer the item correctly).
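Read this way, a judge's probability estimate is simply a count of successes out of 100 imagined barely proficient students. The short Python sketch below (with invented estimates; it illustrates only the data layout implied by the binomial trials treatment, not the Facets estimation itself) shows the conversion:

# Converting Angoff probability estimates into binomial-trials counts (out of 100 BPS).
# The estimates below are hypothetical; in practice there is one row per judge and one column per item.
judge_estimates = [0.85, 0.60, 0.45, 0.70]   # one judge's P(correct) estimates for four items

TRIALS = 100  # the number of 'independent attempts' fixed by the binomial trials model

counts = [round(p * TRIALS) for p in judge_estimates]
print(counts)   # [85, 60, 45, 70] -> the data analyzed as counts of successes

# These counts, rather than the raw probabilities, are what the binomial trials model (2.7)
# treats as the measurement outcome for each judge-item combination.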

Separation Statistics

The family of Rasch models generates a series of indices designed to ensure that the elements of a particular facet (e.g., judges or items in an Angoff setting) are "sufficiently well separated in difficulty to identify the direction and meaning of the variable" (Wright & Masters, 1982, p. 91). These indices depend on the standard errors of the parameter estimates and the standard deviation of the elements of the facet being analyzed. The ratee separation ratio, G, is a measure of the spread of the ratee performance measures relative to their precision, with separation expressed as a ratio of the 'true' (adjusted for measurement error) standard deviation of the measures over the average standard error.

The standard error associated with each particular estimate is calculated as the square root of the maximum score ('mN') divided by the observed score ('S') multiplied by the maximum minus the observed score; the result is multiplied by a factor ('Y') which increases with sample dispersion to control for the spread of the sample:

SE = Y √( mN / ( S (mN − S) ) )                  (2.8)

This formula inflates the denominator and results in lower standard errors for elements near the center of the distribution, which are more 'well-targeted'; the relatively less well-targeted elements at the extremes have larger standard errors.

The mean square error (MSE) is the mean of the error variances:

MSE = ( Σi=1..N SEi² ) / N                  (2.9)

The MSE can be used to adjust the observed variance of the measurements. Because each individual measurement contains error, a 'true' variance can be defined as:

True SD² = SD² − MSE                  (2.10)

The average error or root mean square error (RMSE) is then defined as the square root of the mean square error:

RMSE = √MSE                  (2.11)
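The following Python sketch (with made-up judge measures and standard errors) simply strings equations (2.9) through (2.11) together to show how the error summaries that feed the separation ratio defined next are obtained:

import math

# Hypothetical judge severity measures (logits) and their standard errors.
measures = [0.35, -0.10, 0.52, 0.05, -0.42, 0.20]
std_errors = [0.11, 0.12, 0.10, 0.13, 0.12, 0.11]

n = len(measures)
mean = sum(measures) / n
observed_var = sum((m - mean) ** 2 for m in measures) / n   # observed SD^2

mse = sum(se ** 2 for se in std_errors) / n                 # (2.9) mean square error
true_var = observed_var - mse                               # (2.10) 'true' variance
rmse = math.sqrt(mse)                                       # (2.11) root mean square error

print(f"observed SD = {math.sqrt(observed_var):.3f}")
print(f"true SD     = {math.sqrt(true_var):.3f}")
print(f"RMSE        = {rmse:.3f}")
# The separation ratio G (introduced below) is then true SD / RMSE.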

With these indices in place, an element separation ratio, G, can be computed using the adjusted or ‘true’ standard deviation and the root mean square error, as follows:

G = True SD / RMSE        (2.12)

This ratio is a measure of the spread of the measures relative to their precision. It can also be used to derive the separation index, H, which indicates the number of measurably different levels (or strata) of performance. This index is defined as:

H = (4G + 1) / 3        (2.13)

The proportion of observed variance which is not due to estimation error indicates the reliability with which the elements in the sample are separated:

R = True SD² / Observed SD² = 1 − MSE / Observed SD² = G² / (1 + G²)        (2.14)

Finally, a chi-square statistic is generated by the Facets (MFRM) program to assess the statistical significance of the differences between the elements within each facet. For the rater or judge facet in an Angoff setting, the chi-square statistic is calculated as follows:

χ² = Σ (βn² / σn²) − [Σ (βn / σn²)]² / Σ (1 / σn²)        (2.15)

where βn is the cut score for judge n and σn is the associated standard error. The statistic has an approximate chi-square distribution with N − 1 degrees of freedom, where N is the number of judges. The chi-square statistic for the items in an Angoff setting can be calculated by substituting the corresponding values for the items into the equation.
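Building on the error summaries above, the separation ratio, strata index, separation reliability and fixed chi-square statistic (equations 2.12 through 2.15) can be computed as in the following sketch; the judge measures and standard errors are again hypothetical.

import math

def separation_statistics(measures, ses):
    """Separation ratio G, strata index H, separation reliability R and
    the fixed chi-square statistic for one facet."""
    n = len(measures)
    mse = sum(se ** 2 for se in ses) / n
    mean = sum(measures) / n
    obs_var = sum((m - mean) ** 2 for m in measures) / n
    true_var = max(obs_var - mse, 0.0)
    G = math.sqrt(true_var) / math.sqrt(mse)
    H = (4 * G + 1) / 3
    R = G ** 2 / (1 + G ** 2)
    # Fixed chi-square: precision-weighted sum of squares about the
    # precision-weighted mean, with n - 1 degrees of freedom.
    w = [1 / se ** 2 for se in ses]
    chi_sq = (sum(wi * m ** 2 for wi, m in zip(w, measures))
              - sum(wi * m for wi, m in zip(w, measures)) ** 2 / sum(w))
    return G, H, R, chi_sq

# Hypothetical judge cut scores (logits) and standard errors
measures = [-0.42, -0.15, 0.03, 0.21, 0.38, 0.55]
ses = [0.11, 0.10, 0.10, 0.09, 0.10, 0.12]
print(separation_statistics(measures, ses))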

Residuals-based Indices

The family of Rasch models uses the residuals between the expected ratings generated by the model’s parameter estimates and the actual observed ratings to generate fit statistics which may help to indicate possibly mismeasured examinees, raters or tasks. The residual for an observation is defined as:

Rnij = Xnij − Enij        (2.16)

where Xnij is the observed rating and Enij is the expected rating, based on the model parameter estimates. The formula for the expected rating, Enij, is

Enij = Σ k Pnijk   (k = 0, ..., m)        (2.17)

where m is the number of rating scale categories and k is a counting index representing the value of each rating scale category. Residuals are usually standardized for interpretation, using

Znij = Rnij / √VEnij        (2.18)

where

VEnij = Σ (k − Enij)² Pnijk   (k = 0, ..., m)        (2.19)

Here, note that VEnij is the variance of an observation, which represents its statistical information; its square root is used to standardize the residual.
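As an illustration, the sketch below computes the expected rating, the variance of the observation and the standardized residual (equations 2.16 through 2.19) for a single judge-item interaction, using a hypothetical set of category probabilities.

def expected_and_residual(probs, observed):
    """Expected rating, variance of the observation and standardized
    residual for one interaction, given the model category probabilities
    and the observed rating."""
    expected = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
    residual = observed - expected
    z = residual / variance ** 0.5
    return expected, variance, z

# Hypothetical category probabilities for a five-category (0-4) scale
probs = [0.05, 0.15, 0.40, 0.30, 0.10]
print(expected_and_residual(probs, observed=4))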

Two fit statistics are generated for each parameter estimate, based on the mean of the squared standardized residuals of the observed scores from their expected scores. As they are based on model residuals, fit statistics capture and summarize deviations from expected ratings. The outfit statistic is simply the unweighted mean of these squared standardized residuals. Outfit statistics are particularly sensitive to departures in the data in the extreme rating categories. Infit statistics attach a weight to each standardized residual based on its variance, making them more sensitive to unexpected ratings that fall near the center of the rating scale. Infit and outfit statistics have an expected value of 1.00 and can range from zero to infinity. A 0.1 increase in a fit statistic is associated with a 10% increase in unmodeled error. Values less than 1.0 indicate that the model predicts the data better than expected, based on model expectations for levels of error. Outfit for rater r is calculated as follows:

Outfitr = Σn Σi z²rni / (N I)        (2.20)

where N is the number of examinees, I is the number of items and zrni is the standardized score residual. The formula for infit is:

Infitr = Σn Σi Wrni z²rni / Σn Σi Wrni        (2.21)

where Wrni is the variance of the score residual. Both of these statistics can be standardized to obtain the standardized infit and outfit statistics.
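The following sketch computes the two mean-square fit statistics for a single rater from a set of standardized residuals and observation variances (equations 2.20 and 2.21); the values shown are hypothetical.

def rater_fit(z_residuals, variances):
    """Unweighted (outfit) and information-weighted (infit) mean-square
    fit statistics for one rater, given the standardized residuals and
    observation variances across that rater's ratings."""
    n = len(z_residuals)
    outfit = sum(z ** 2 for z in z_residuals) / n
    infit = (sum(w * z ** 2 for w, z in zip(variances, z_residuals))
             / sum(variances))
    return outfit, infit

# Hypothetical standardized residuals and variances for one judge
z = [0.4, -1.1, 0.2, 2.3, -0.6, 0.9]
w = [0.95, 0.80, 0.99, 0.60, 0.90, 0.85]
print(rater_fit(z, w))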

According to a widely used rule of thumb for interpreting fit statistics, overfit to the model (better than expected model-data fit) is suggested when mean square fit statistics fall below 0.7 for multiple-choice exams and 0.6 for rating situations, and misfit or underfit to the model (worse than expected model-data fit) is suggested when the values are above 1.3 for multiple-choice exams and 1.4 for rating situations (Wright & Linacre, 1994). The corresponding values for standardized fit statistics are ±2.0.

Correlational Indices

Many of the indices which have been recommended for use in detecting rater effects are correlations between different model-generated indices or between model indices and raw score ratings. There are two widely used raw score correlations. The ‘single rater-rest of rater’ (SR/ROR) correlation, which is the Facets version of the point-biserial correlation, is a raw score indicator derived by calculating the correlations between the ratings of the different raters across all facets. A second raw score indicator, not specific to the Facets model but widely used in Angoff settings, is the correlation between the estimates of the judges and the empirical item p-values. In this study, these are referred to as score/p-value correlations.

A number of latent trait correlations are also widely used as indicators. The point-measure correlation is the latent trait analog to the point-biserial correlation. It is the correlation between the scores assigned by a particular rater to a group of examinees and the ability estimates for the same examinees, and thus indicates the consistency between how a particular rater ranks the examinees and how the raters collectively rank the same group. Similarly, the score-expected correlation is the correlation between the observed and model-expected scores. Finally, the expected-residual correlation is a measure of the relationship between the residual (observed minus modeled or expected score) and the modeled or expected scores themselves, while the closely related measure-residual correlation indicates the relationship between the estimated measures for the ratees (items) and the residuals for a set of judge-item interactions.
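The raw score correlational indices can be approximated as in the sketch below, which computes, for each judge in a hypothetical panel, the score/p-value correlation and a simplified single rater-rest of rater correlation (the judge’s estimates against the mean of the remaining judges’ estimates; the Facets program computes its SR/ROR statistic somewhat differently).

def pearson(x, y):
    """Plain Pearson correlation used for the raw-score indices below."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

# Hypothetical Angoff estimates: rows are judges, columns are items
ratings = [
    [60, 45, 80, 30, 55],
    [65, 50, 75, 35, 60],
    [40, 70, 50, 65, 45],   # a judge who ranks the items differently
]
p_values = [0.62, 0.48, 0.78, 0.33, 0.57]   # external item p-values

for j, judge in enumerate(ratings):
    # Score/p-value correlation: judge's estimates vs. empirical p-values
    r_pv = pearson(judge, p_values)
    # Simplified SR/ROR: judge vs. the mean of the other judges
    rest = [sum(ratings[k][i] for k in range(len(ratings)) if k != j)
            / (len(ratings) - 1) for i in range(len(p_values))]
    r_sr = pearson(judge, rest)
    print(f"judge {j + 1}: score/p-value r = {r_pv:.2f}, SR/ROR r = {r_sr:.2f}")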

2.2.2. Indices for Detecting Rater Effects

This study focuses on MFRM indices for detecting rater effects. As ‘classical’ or raw score statistics will be used for the sake of comparison, these are also introduced below.

Leniency v. Severity

Leniency and severity effects manifest as scores which are either lower or higher than those of other raters. MFRM indices rely primarily on the estimated severity measures for the judges and on the separation statistics. For detecting whether individual judges were lenient or severe in relation to the group, a number of indicators are available.

1. Mean scores. Directly comparing the mean scores of the ratings assigned by each judge is the standard indicator within a raw score framework.

Within the MFRM framework, a number of further indicators exist.

2. Judge severity measures. Leniency and severity can be examined directly by comparing the values for the different judges on the judge severity parameter, λj. (In an Angoff standard setting, where the only ‘examinee’ is the BPS as imagined by the different judges, the severity parameter can be omitted and judge severity would appear as different values for the βn parameter, representing the location of the cut score on the latent variable.)

3. Fixed chi-square test of the hypothesis that the judges share the same level of severity. A significant result would indicate that at least two judges differed in severity.

4. Follow-up t-tests. Significant findings on the above chi-square test can be followed up with t-tests between pairs of judges, using the judge severity measures and associated standard errors to determine whether the two judges differ significantly in their displayed levels of severity.

5. Judge separation ratio. This ratio measures the spread of the measures for the different judges relative to their precision.

6. Judge separation index. This index indicates the number of statistically distinct severity levels among the raters.

7. Reliability of the judge separation index. This measures the reliability with which the judges have been separated. A value of 0.0 would indicate that the panelists were exchangeable, while higher values indicate that the judges were reliably separated in terms of their severity.

There are no agreed-upon criteria for the above indices. Their value lies in providing information about the degree to which rater severity levels diverged. Actual interpretation remains largely a matter of judgment. In a standard setting, ‘interchangeability’ of judges is not normally expected. While all of the judges are subject-area experts, they are also chosen to represent diverse backgrounds and may be expected to come to different but defensible interpretations of the performance level descriptors which articulate the standard. It thus becomes a question of judgment on the part of those evaluating the judges’ performance as to how much difference is acceptable. Myford & Wolfe (2004b) suggest using t-tests to identify judges whose measures differ significantly from one another. Wolfe (2004) flags raters who differ significantly from the group mean. Given the above consideration concerning standard setting judges, another approach would be to define ‘problem judges’ as those who are at least 2 standard errors (SEs) in distance from any members of the main cluster of judges.

For leniency/severity, there are no clear indicators of group-level effects, which would indicate when most or all of the members of the group were displaying leniency or severity. The only indicators available to detect group-level leniency/severity are the group-level category usage statistics. The problem with attempting to use these to identify group-level effects in an Angoff standard setting is that it presupposes some prior expectation concerning which categories should be used. If such information existed, standard setting would be unnecessary.
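One way to operationalize the ‘two standard errors from the main cluster’ heuristic described above is sketched below; the judge measures and standard errors are hypothetical, and the threshold remains a matter of judgment rather than a fixed rule.

def flag_severity_outliers(measures, ses, threshold=2.0):
    """Flag judges whose severity measure is separated from every other
    judge by more than `threshold` joint standard errors."""
    flagged = []
    for j, (bj, sj) in enumerate(zip(measures, ses)):
        distances = []
        for k, (bk, sk) in enumerate(zip(measures, ses)):
            if j == k:
                continue
            # Pairwise t statistic for the difference between two judges
            t = abs(bj - bk) / (sj ** 2 + sk ** 2) ** 0.5
            distances.append(t)
        if min(distances) > threshold:
            flagged.append(j)
    return flagged

# Hypothetical judge cut-score measures (logits) and standard errors;
# the last judge sits well above the rest of the panel.
measures = [-0.30, -0.12, 0.02, 0.15, 0.24, 1.40]
ses = [0.10, 0.11, 0.10, 0.09, 0.10, 0.12]
print(flag_severity_outliers(measures, ses))   # -> [5]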

Inaccuracy

Inaccurate ratings are typically diagnosed through correlations and through patterns in statistical indicators that are based on residuals. Two raw-score indices are used here.

1. Raw-score correlations. When scores from an external framework are available, as is often the case in operational Angoff standard settings, these are used. The critical value of the correlation coefficient can be used to flag problematic raters.

2. Single Rater/Rest of Rater (SR/ROR) correlations. Inter-rater correlations are often used when no external scores are available. The Facets software package calculates this raw score statistic. The critical value of the correlation coefficient can be used to flag problematic raters.

In addition to these raw score statistics, four MFRM indices have been proposed for use in investigating individual-level inaccuracy effects.

3. Point-measure correlation. This is the correlation between the scores assigned to a group of ratees (items) by a particular rater and the Rasch parameter estimates or measures for the same ratees. Low consistency between these two sets of scores should be reflected in a low correlation. The critical value of the correlation coefficient can be used to flag problematic raters.

4. Score-expected correlations. The Facets software program generates an expected score for each rater-ratee interaction. A low correlation between observed and expected scores would indicate inaccuracy. The critical value of the correlation coefficient can be used to flag problematic raters.

5. Standard deviation of the residuals. Accurate ratings would result in small, randomly distributed residuals. A large standard deviation of the residuals would thus indicate inaccuracy. Wolfe (2004) ‘arbitrarily’ defined large as 1.25 and small as .75.

6. Judge fit statistics. These should be sensitive to rater inaccuracy. For mean square fit indicators, values above 1.4 are typically used to flag raters for misfit. For standardized fit statistics, values above 2.0 are used.

Indices have also been proposed for detecting group-level effects.

7. Item separation statistics. Myford & Wolfe (2004) suggested examining the item separation statistics for the ratees for evidence that the raters or judges did not effectively discriminate between or ‘separate’ the ratees: a non-significant result on the fixed chi-square test of the hypothesis that the items share the same measure, a low item separation ratio, a low item separation index, or a low reliability of the item separation index would each suggest a group-level effect.
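As a simple illustration of the residual-based indicators above, the following sketch flags judges whose standardized residuals have a large standard deviation, applying the 1.25 cut-off attributed to Wolfe (2004); the residuals are hypothetical.

def inaccuracy_flags(z_residuals_by_judge, large=1.25):
    """Flag judges whose standardized residuals have a standard deviation
    above the 'large' cut-off, one of the inaccuracy indicators above."""
    flagged = []
    for j, z in enumerate(z_residuals_by_judge):
        n = len(z)
        mean = sum(z) / n
        sd = (sum((v - mean) ** 2 for v in z) / n) ** 0.5
        if sd > large:
            flagged.append((j, round(sd, 2)))
    return flagged

# Hypothetical standardized residuals for three judges across five items;
# the third judge's ratings scatter far more widely than expected.
z_by_judge = [
    [0.3, -0.5, 0.1, 0.6, -0.4],
    [-0.2, 0.4, -0.6, 0.2, 0.5],
    [1.9, -2.2, 1.5, -1.8, 2.1],
]
print(inaccuracy_flags(z_by_judge))   # -> [(2, ...)]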

Centrality/Extremism

Centrality results in ratings regressing towards the perceived mean of the stimuli range, while extremism results in ratings that cluster near the extremes of the distribution. For detecting individual-level rater centrality and extremism effects, the following indices have been proposed. One raw-score index is commonly used in detecting centrality/extremism effects.

1. Standard deviation. With centrality, because the observed ratings form a narrower, more tightly compressed distribution around the mean than do the expected ratings, a standard deviation that is smaller for observed than for expected ratings is an indicator of the presence of the effect. Conversely, since the distribution of ratings around the mean is more dispersed where extremism is present, this effect is indicated by a standard deviation that is larger for observed than for expected ratings. A weakness of the standard deviation as an indicator is that random error would also be expected to inflate the standard deviation, making it difficult to distinguish between accuracy and centrality, on the one hand, and inaccuracy and extremism on the other (Wolfe, 2004; Yue, 2011). Nonetheless, in a simulation study, Yue (2011) found the standard deviation to be one of the better indices for detecting centrality.

Additionally, a number of latent trait indices have been suggested.

3. Standard deviation of the residuals. Centrality would likely result in relatively small residuals, whereas raters displaying extremism will produce residuals with large standard deviations. Wolfe (2004) ‘arbitrarily’ defined large as 1.25 and small as .75.

4. Judge fit statistics. Although widely used, there is considerable ambiguity concerning how fit statistics might respond to rater centrality/extremism. Research has indicated that centrality may not manifest consistently in fit indices (Wolfe et al., 2000), and Myford & Wolfe (2004) argue that centrality might manifest in fit statistics that are either too low or too high. In her simulation study, Yue (2011) found that fit was not an effective indicator of centrality.

5. Expected-residual correlation (rexp,res). Proposed by Wolfe (2004), this is the correlation between model-predicted ratings and the residual (observed minus expected rating) for each rater-ratee interaction. Negative correlations indicate centrality, and positive correlations indicate extremism. This is so because raters displaying centrality would assign higher than expected ratings to ratees with low expected values, resulting in positive residuals, and lower than expected ratings to ratees with high expected values, resulting in negative residuals. Raters displaying extremism would show precisely the opposite pattern (negative residuals for ratees with low expected values and positive residuals for ratees with high expected values), resulting in a positive correlation. In a simulation study, Yue (2011) found this to be
