van der Linden, W.J. (1982). A latent trait method for determining intrajudge consistency in the Angoff and Nedelsky techniques of standard setting. Journal of Educational Measurement, 19, 295-308.

van de Watering, G., & van der Rijt, J. (2006). Teachers’ and students’ perceptions of assessments: A review and a study into the ability and accuracy of estimating the difficulty levels of assessment items. Educational Research Review, 1(2), 133-147.

Verhoeven, B.H., Van der Steeg, A.F.W., Scherpbier, A.J.J.A., Muijtjens, A.M.M., Verwijnen, G.M., & Van der Vleuten, C.P.M. (1999). Reliability and credibility of an Angoff standard setting procedure in progress testing using recent graduates as judges. Medical Education, 33, 832-837.

Verhoeven, B.H., Verwijnen, G.M., Muijtjens, A.M.M., Scherpbier, A.J.J.A., & Van der Vleuten, C.P.M. (2002). Panel expertise for an Angoff standard setting procedure in progress testing: Item writers compared to recently graduated students. Medical Education, 36, 860-867.

Ward, L.M. (1973). Repeated magnitude estimations with a variable standard: Sequential effects and other properties. Perception & Psychophysics, 14, 193-200.

Weigle, S.C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287.

Weir, C.J. (2005). Limitations of the Common European Framework for developing comparable examinations and tests. Language Testing, 22(3), 281-300.

Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10, 305-335.

Wilson, M., & Case, H. (2000). An examination of variation in rater severity over time: A study in rater drift. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (Vol. 5). Stamford, CT: Ablex.

Wolfe, E.W. (2004). Identifying rater effects using latent trait models. Psychology Science, 46, 35-51.

Wolfe, E.W., Chiu, C.W.T., & Myford, C.M. (2000). Detecting rater effects with a multi-faceted Rasch rating scale model. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (Vol. 5, pp. 147-164). Stamford, CT: Ablex.

Wolfe, E.W., & McVay, A. (2010). Rater effects as a function of rater training context. Retrieved from http://www.pearsonassessments.com/NR/rdonlyres/6435A0AF-0C12-46F7-812E-908CBB7ADDFF/0/RaterEffects_101510.pdf

Wolfe, E.W., & McVay, A. (2011, April). Application of latent trait models to identifying substantively interesting raters. Presented at the Annual Conference of the American Educational Research Association, New Orleans. Retrieved from http://www.pearsonassessments.com/hai/images/PDF/AERA_Application_Latent_Trait_Models.pdf

Wright, B.D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.

Wright, B.D., & Masters, G.N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA.

Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA.

Yue, Xiaohui. (2011). Detecting rater centrality effect using simulation methods and Rasch measurement analysis (Doctoral dissertation, Virginia Polytechnic Institute and State University). Retrieved from http://scholar.lib.vt.edu/theses/available/etd-07272011-104720/unrestricted/Yue_X_D_2011.pdf

Appendix A-1. Item Quality Statistics from Original Administration of Test, Reading

Item  Measure   InFit   OutFit  Pt        Item  Measure   InFit   OutFit  Pt
      (logits)  Mn Sq   Mn Sq   Bsrl            (logits)  Mn Sq   Mn Sq   Bsrl

iR1     1.04    1.02    1.08    0.34      iR26   -2.00    0.95    0.75    0.20
iR2    -0.31    1.02    0.98    0.34      iR27   -0.29    1.00    1.00    0.37
iR3    -1.52    0.88    0.67    0.44      iR28   -1.92    0.84    0.70    0.44
iR4    -1.32    0.96    0.86    0.24      iR29   -0.99    0.94    0.84    0.40
iR5     0.35    0.91    0.84    0.39      iR30   -1.15    0.96    0.91    0.38
iR6    -0.61    0.94    0.85    0.42      iR31   -0.24    0.96    0.93    0.41
iR7    -0.63    1.01    0.96    0.34      iR32   -0.70    0.93    0.85    0.43
iR8     1.20    1.01    1.03    0.26      iR33   -1.19    0.94    0.91    0.40
iR9     0.32    1.03    1.05    0.22      iR34    0.90    1.12    1.20    0.28
iR10    0.72    1.02    1.01    0.25      iR35   -0.40    0.94    0.90    0.43
iR11   -2.01    0.92    0.75    0.31      iR36   -1.03    0.90    0.76    0.33
iR12   -0.31    1.13    1.24    0.15      iR37   -0.06    0.99    0.97    0.27
iR13    0.91    1.04    1.07    0.30      iR38   -1.15    0.93    0.92    0.30
iR14    1.11    1.01    1.05    0.31      iR39    0.81    0.94    0.93    0.31
iR15   -0.43    0.94    0.87    0.41      iR40   -0.51    0.87    0.77    0.36
iR16   -0.19    0.94    0.92    0.36
iR17    1.18    1.07    1.10    0.21
iR18    0.86    1.02    1.02    0.28
iR19    1.62    1.17    1.29    0.08
iR20    1.55    1.05    1.08    0.23
iR21    0.72    1.14    1.17    0.12
iR22    1.48    0.99    1.00    0.27
iR23    0.10    1.08    1.11    0.15
iR24    2.28    1.03    1.13    0.12
iR25    1.85    1.00    1.04    0.18

Appendix A-2. Item Quality Statistics from Original Administration of Test, Listening

Item  Measure   InFit   OutFit  Pt        Item  Measure   InFit   OutFit  Pt
      (logits)  Mn Sq   Mn Sq   Bsrl            (logits)  Mn Sq   Mn Sq   Bsrl

iL1     0.38    1.04    1.05    0.30      iL26    0.11    0.92    0.86    0.42
iL2    -0.77    0.93    0.88    0.37      iL27    1.40    0.93    0.94    0.37
iL3    -0.75    0.96    0.90    0.35      iL28    0.92    1.00    0.99    0.33
iL4     0.76    1.02    1.05    0.32      iL29   -0.75    1.00    0.95    0.32
iL5    -0.20    0.89    0.82    0.45      iL30    0.76    0.90    0.86    0.45
iL6     0.11    0.93    0.91    0.41      iL31    1.15    1.07    1.11    0.26
iL7    -1.47    0.92    0.76    0.36      iL32   -1.33    0.96    0.91    0.31
iL8    -0.71    0.93    0.85    0.39      iL33    0.65    1.05    1.10    0.31
iL9    -0.78    0.97    0.96    0.33      iL34   -0.20    1.03    1.03    0.32
iL10   -1.12    0.97    0.86    0.32      iL35    1.43    0.96    0.98    0.38
iL11    0.89    0.97    0.98    0.38      iL36    0.48    0.96    0.93    0.37
iL12   -0.55    0.93    0.82    0.40      iL37   -1.84    0.94    0.72    0.31
iL13    0.16    1.00    1.00    0.32      iL38   -1.06    0.93    0.81    0.37
iL14    2.06    1.15    1.39    0.11      iL39    0.14    0.90    0.82    0.44
iL15   -0.21    0.97    0.97    0.34      iL40   -0.14    0.96    0.94    0.36
iL16    1.30    0.94    0.95    0.37
iL17    0.02    1.04    1.07    0.29
iL18    0.67    1.05    1.07    0.28
iL19    0.83    0.94    0.93    0.39
iL20    0.88    1.09    1.20    0.23
iL21   -1.52    0.91    0.81    0.37
iL22   -0.88    0.94    0.94    0.37
iL23   -1.11    0.92    0.83    0.38
iL24   -0.28    0.88    0.81    0.46
iL25    0.57    1.09    1.09    0.23
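
Note on the statistics in Appendices A-1 and A-2: under the dichotomous Rasch model, the InFit and OutFit mean squares are, respectively, the information-weighted and unweighted means of the squared standardized residuals, with values near 1.00 indicating acceptable fit (Wright & Linacre, 1994), and Pt Bsrl is the point-biserial correlation between item scores and total test scores. The sketch below shows one way such statistics can be computed; it is illustrative only (the function name rasch_item_fit and the use of NumPy are assumptions, and the values above were presumably produced with dedicated Rasch software rather than this code).

```python
import numpy as np

def rasch_item_fit(X, theta, delta):
    """Illustrative item fit statistics for dichotomous Rasch data.

    X     : (n_persons, n_items) matrix of 0/1 scored responses
    theta : (n_persons,) person ability estimates, in logits
    delta : (n_items,) item difficulty estimates, in logits
    """
    # Model probability of success for every person-item pair
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
    W = P * (1.0 - P)   # model variance of each response
    R = X - P           # raw residuals (observed minus expected)

    # OutFit: unweighted mean of squared standardized residuals
    outfit = np.mean(R**2 / W, axis=0)
    # InFit: information-weighted mean square, which down-weights
    # responses from persons far from the item's difficulty
    infit = np.sum(R**2, axis=0) / np.sum(W, axis=0)

    # Point-biserial: correlation of each item with the total raw score
    total = X.sum(axis=1)
    pt_bsrl = np.array([np.corrcoef(X[:, i], total)[0, 1]
                        for i in range(X.shape[1])])
    return infit, outfit, pt_bsrl
```

Because InFit down-weights residuals from person-item pairs about which the model says little, it can diverge from OutFit for very easy or very hard items, as with iR26 above (InFit 0.95 versus OutFit 0.75).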

Appendix B-1. CEFR Scales Used to Provide Performance Level Descriptors (PLDs)

Scale No. Scale Title (CEFR page)

LISTENING

1    Common Reference Levels: global scale (p. 24)
13   OVERALL LISTENING COMPREHENSION (p. 66)
14   UNDERSTANDING CONVERSATION BETWEEN NATIVE SPEAKERS (p. 66)
15   LISTENING AS A MEMBER OF A LIVE AUDIENCE (p. 67)
16   LISTENING TO ANNOUNCEMENTS AND INSTRUCTIONS (p. 67)
17   LISTENING TO AUDIO MEDIA AND RECORDINGS (p. 68)
24   IDENTIFYING CUES AND INFERRING (p. 72)
26   UNDERSTANDING A NATIVE SPEAKER INTERLOCUTOR (p. 75)
27   CONVERSATION (p. 76)
28   INFORMAL DISCUSSION (WITH FRIENDS) (p. 77)
29   FORMAL DISCUSSION AND MEETINGS (p. 78)
30   GOAL-ORIENTED CO-OPERATION (p. 79)
31   TRANSACTION TO OBTAIN GOODS AND SERVICES (p. 80)
32   INFORMATION EXCHANGE (p. 81)

READING

1    Common Reference Levels: global scale (p. 24)
2    Common Reference Levels: self-assessment grid (p. 26)
20   OVERALL READING COMPREHENSION (p. 69)
21   READING CORRESPONDENCE (p. 69)
22   READING FOR ORIENTATION (p. 70)
23   READING FOR INFORMATION AND ARGUMENT (p. 70)
24   READING INSTRUCTIONS (p. 71)
26   IDENTIFYING CUES AND INFERRING (Spoken & Written) (p. 72)

Appendix B-2. CEFR Global Scale (CEFR, p. 24)

Proficient User

C2

Can understand with ease virtually everything heard or read. Can summarise information from different spoken and written sources, reconstructing arguments and accounts in a coherent presentation. Can express him/herself spontaneously, very fluently and precisely, differentiating finer shades of meaning even in more complex situations.

C1

Can understand a wide range of demanding, longer texts, and recognise implicit meaning. Can express him/herself fluently and spontaneously without much obvious searching for expressions. Can use language flexibly and effectively for social, academic and professional purposes. Can produce clear, well-structured, detailed text on complex subjects, showing controlled use of organisational patterns, connectors and cohesive devices.

Independent User

B2

Can understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation. Can interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible without strain for either party. Can produce clear, detailed text on a wide range of subjects and explain a viewpoint on a topical issue giving the advantages and disadvantages of various options.

B1

Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken. Can produce simple connected text on topics which are familiar or of personal interest. Can describe experiences and events, dreams, hopes and ambitions and briefly give reasons and explanations for opinions and plans.

Basic User

A2

Can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g. very basic personal and family information, shopping, local geography, employment). Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters. Can describe in simple terms aspects of his/her background, immediate environment and matters in areas of immediate need.

A1

Can understand and use familiar everyday expressions and very basic phrases aimed at the satisfaction of needs of a concrete type. Can introduce him/herself and others and can ask and answer questions about personal details such as where he/she lives, people he/she knows and things he/she has. Can interact in a simple way provided the other person talks slowly and clearly and is prepared to help.

Appendix C. Angoff Judge Response Form

(Note: Only the first page is reproduced below. The pages for the remaining 32 items are identical.)

Circle or insert the probability that a just-B1 level student