
CHAPTER 5 CONCLUSIONS & DISCUSSION

5.4 Limitations of the Present Study

The present study examined whether standard procedures for training and preparing judges for the Angoff standard setting method could accurately predict judges' final performance. As with all research, the findings of this study were limited by a range of factors.

Standard setting is an inherently weak context for research. Although research is necessary to investigate the procedures and results of standard setting, standard setting is a policy procedure, organized to inform policy makers, rather than researchers, about a test. Given the small number of individuals involved, conventional statistical analysis and measurement are of limited use. It is very difficult to isolate the effects that are the aim of the study, or the effects of other important variables such as gender, ethnicity, or personal experience. In this study, an attempt was made to control for this by making the panels as homogeneous as possible while still maintaining representativeness. Despite these efforts, in standard setting, researchers can never be certain that results are not influenced by, or even the product of, outside factors.


One response to this issue was to evaluate significance at the conventional p < .05 level rather than at a stricter, corrected threshold. This carried its own risks. This study reported the results of 392 different correlation calculations, and many additional exploratory correlations were conducted but not reported. Setting significance at p < .05 made it possible to observe the larger patterns in the standard setting, but it also allowed many of the observations to occur by chance: at least 19 of the correlations reported in this study would be expected to reach significance by chance alone. Which ones these are is, of course, unknown, so while the trends in the results of this study are clear, individual findings need to be interpreted with caution.
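The figure of 19 follows from a simple expected-value calculation (a rough sketch, assuming all 392 reported tests were independent and each was evaluated at α = .05):

$$E[\text{significant by chance}] = N \times \alpha = 392 \times .05 = 19.6$$

That is, roughly 19 to 20 of the reported correlations would be expected to reach significance even if no true relationships existed.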

The small size of the panels involved in standard setting has made it difficult to apply classical test theory, even though panels of this size, or smaller, are not uncommon in high stakes testing, such as the testing mandated under No Child Left Behind (Michigan State Department of Education, 2007). Attempts at applying item response theory (IRT) to standard setting have gone largely ignored, and as a result, standard setting has remained outside the domain of psychometrics, continuing to involve an almost artistic sense in interpreting the tests.

Finally, while all the participants were selected because of their extensive experience in language teaching, this does not guarantee that the standard setting went as planned. The results of a standard setting are limited by the same training and skill that qualify the participants to be part of it. It is possible that some of the findings of this standard setting resulted from biases and prejudices the judges already held. The recommended style of training judges for standard setting is designed with this in mind and focuses directly on familiarizing judges with the CEFR and its proper use. Judges were asked several times about their understanding of the CEFR. However, this is no guarantee that the training worked as intended.


It is possible that judges entered the standard setting using a standard different from the CEFR, and that the findings reflect a standard other than the one intended by the organizers and moderators. However, the risk of this was kept to a minimum. On the morning of Day 2, just before the operational standard setting, judges were asked whether they understood the CEFR. On a scale of 1 to 4, the mean for all 18 judges was 3.06. So while this remains a possibility, the judges themselves did not consider it a problem.

Despite careful planning, the results also reflect the skills and abilities of the organizers and moderators of the standard setting event. Standard setting is a difficult and complicated process, and the decisions it is used to inform are often very important. In the case of the standard setting reported in this study, it is possible that the findings also reflect limitations in the experience of the organizers and moderators.


REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational Measurement (pp. 508-600). Washington, DC: American Council on Education.

Bond, T. G., & Fox, C. M. (2001). Applying the Rasch Model: Fundamental Measurement in Human Sciences. Mahwah, NJ: Erlbaum.

Brandon, P. R. (2004). Conclusions about frequently studied modified Angoff standard-setting topics. Applied Measurement in Education, 17(1), 59–88.

Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2002). A comparison of Angoff and Bookmark standard setting methods. Journal of Educational Measurement, 39(3), 253-263.

Chi, M. T. H., Glaser, R., & Farr, M. J. (Eds.). (1988). The Nature of Expertise. Hillsdale, NJ: Erlbaum.

Cizek, G. J. (1996). An NCME instructional module on setting passing scores. Educational Measurement: Issues and Practice, 15(2), 20-31.

Cizek, G. J. (2001). Conjectures on the rise and call of standard setting: An introduction to context and practice. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 3-17). Mahwah, NJ: Erlbaum.

Cizek, G. J. (Ed.). (2012). Setting performance standards: Foundations, methods, and innovations (2nd ed.). New York, NY: Routledge.

Cizek, G. J. (2012a). The forms and functions of evaluations in the standard setting process. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 165-178). New York, NY: Routledge.

Cizek, G. J., & Bunch, M. B. (2007). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests. Thousand Oaks, CA: Sage.

Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23(4), 31–50.

Clauser, J. C., Margolis, M. J., & Clauser, B. E. (2014). An examination of the replicability of Angoff standard setting results within a generalizability theory framework. Journal of Educational Measurement, 51(2), 127-140.

Clauser, B. E., Mee, J., Baldwin, S. G., Margolis, M. J., & Dillon, G. F. (2009). Judges' use of examinee performance data in an Angoff standard-setting exercise for a medical licensing examination: An experimental study. Journal of Educational Measurement, 46(4), 390-407.

Clauser, B. E., Mee, J., & Margolis, M. J. (2013). The effect of data format on integration of performance data into Angoff judgments. International Journal of Testing, 13(1), 65-85.

Clauser, B. E., Swanson, D. B., & Harik, P. (2002). A multivariate generalizability analysis of the impact of training and examinee performance information on judgments made in an Angoff-style standard-setting procedure. Journal of Educational Measurement, 39(4), 269–290.

Council of Europe. (2001). Common European framework of reference for languages. Cambridge: Cambridge University Press.


Council of Europe. (2009). Manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.

Crocker, L., & Zieky, M. (1994). Joint Conference on Standard Setting for Large-Scale Assessments. Washington, DC: National Assessment Governing Board.

Cronbach, L. J. (1988). Five perspectives on validation argument. In H. Wainer & H. Braun (Eds.), Test Validity (pp. 3–17). Hillsdale, NJ: Erlbaum.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.

Cross, L. H., Impara, J. C., Frary, R. B., & Jaeger, R. M. (1984). A comparison of three methods for establishing minimum standards on the National Teacher Examinations. Journal of Educational Measurement, 21(2), 113-129.

Egan, S. J., Dick, M., & Allen, P. J. (2012). An experimental investigation of standard setting in clinical perfectionism. Behaviour Change, 29(3), 183-195.

Elman, B. A. (2000). A cultural history of civil examinations in late imperial China. University of California Press.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Engelhard, G. (2007). Evaluating bookmark judgments. Rasch Measurement Transactions, 21, 1097-1098.

Engelhard, G., & Anderson, D. W. (1998). A binomial trials model for examining the ratings of standard setting judges. Applied Measurement in Education, 11(3), 209-230.


Fitzpatrick, A. R. (1989). Social influences in standard setting: The effects of social interaction on group judgments. Review of Educational Research, 59(3), 315-328.

George, S., Haque, M. S., & Oyebode, F. (2006). Standard setting: Comparison of two methods. BMC Medical Education, 6(1), 46.

Glaser, R. (1963). Instructional technology and the measurement of learning outcomes. American Psychologist, 18(8), 519–522.

Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing levels for a group of minimally competent examinees. Applied Measurement in Education, 12(1), 13-28.

Green, D. R., Trimble, C. S., & Lewis, D. M. (2003). Interpreting the results of three different standard setting procedures. Educational Measurement: Issues and Practice, 22(1), 22–32.

Halpin, G., Sigmon, G., & Halpin, G. (1983). Minimum competency standards set by three divergent groups of raters using three judgmental procedures. Educational and Psychological Measurement, 47(1), 977-983.

Hambleton, R. K. (1980). Test score validity and standard-setting methods. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art (pp. 80-123). Baltimore, MD: Johns Hopkins University Press.

Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 89-116). Mahwah, NJ: Erlbaum.

Hambleton, R. K., Pitoniak, M. J., & Copella, J. M. (2012). Essential steps in setting performance standards on educational tests and strategies for assessing the reliability of results. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 47–76). New York, NY: Routledge.


Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. Educational Measurement, 4, 433-470.

Hertz, N. R., & Chinn, R. N. (2002, April). The role of deliberation style in standard setting for licensing and certification examinations. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Holden, R. (2010). Face validity. In I. B. Weiner & W. E. Craighead (Eds.), The Corsini Encyclopedia of Psychology (4th ed., pp. 637-638). Hoboken, NJ: Wiley.

Hurtz, G. M., & Auerbach, M. A. (2003). A meta-analysis of the effects of modifications to the Angoff method on cutoff scores and judgment consensus. Educational and Psychological Measurement, 63(4), 584–601.

Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688.

Huynh, H., & Schneider, C. (2005). Vertically moderated standards: Background, assumptions, and practices. Applied Measurement in Education, 18(1), 99-113.

Impara, J.C., & Plake, B.S. (1998). Teachers’ ability to estimate item difficulty: A test of the assumptions in the Angoff standard-setting method. Journal of Educational Measurement, 35(1), 69-81.

Jaeger, R. M. (1991). Selection of judges for standard‐setting. Educational Measurement: Issues and Practice, 10(2), 3-14.

Johnson, E. J. (1988). Expertise and decision under uncertainty: Performance and process. In M. T. H. Chi, R. Glaser, & M. J. Farr (Eds.), The Nature of Expertise (pp. 209-228). Hillsdale, NJ: Lawrence Erlbaum Associates.

Kaftandjieva, F. (2010). Methods for Setting Cut Scores in Criterion-referenced Achievement Tests: A Comparative Analysis of Six Recent Methods with an Application to Tests of Reading in EFL. EALTA publication. Retrieved March 25, 2013 from http://www.ealta.eu.org/documents/resources/FK_second_doctorate.pdf

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527-535.

Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 19–51). Mahwah, NJ: Lawrence Erlbaum Associates.

Kane, M. T. (2006). Validation. Educational Measurement, 4, 17-64.

Larkin, J. H., McDermott, J., Simon, D. P., & Simon, H. A. (1980). Expert and novice performance in solving physics problems. Science, 208, 1335-1342.

Lavallee, J. (2012). Validation issues in an Angoff standard setting: A Facets-based investigation. Unpublished PhD dissertation, Department of Counseling and Educational Psychology, National Taiwan Normal University, Taipei, Taiwan.

Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher, 32, 3-13.

Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems: Implications of requirements of the No Child Left Behind Act of 2001. Educational Researcher, 31, 3–16.

Linn, R. L., & Shepard, L. A. (1997). Item-by-item standard setting: Misinterpretations of judges' intentions due to less than perfect item inter-correlations. Paper presented at the Council of Chief State School Officers National Conference on Large Scale Assessment, Colorado Springs, CO.

Lissitz, R. W., & Huynh, H. (2003). Vertical equating for state assessments: Issues and solutions in determination of adequate yearly progress and school accountability. Practical Assessment, Research & Evaluation, 8(10). Retrieved March 25, 2012 from http://pareonline.net/getvn.asp?v=8&n=10

Lissitz, R. W., & Wei, H. (2008). Consistency of standard setting in an augmented state testing system. Educational Measurement: Issues and Practice, 27(2), 46-56.

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635–694.

Loomis, S. C. (2012). Selecting and training standard setting participants. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 107-134). New York, NY: Routledge.

Lorge, I., & Kruglov, L. (1953). A suggested technique for the improvement of difficulty prediction of test items. Educational and Psychological Measurement, 12(4), 554-561.

Margolis, M. J., & Clauser, B. E. (2014). The impact of examinee performance information on judges' cut scores in modified Angoff standard-setting exercises. Educational Measurement: Issues and Practice, 33(1), 15-22.

McGinty, D. (2005). Illuminating the "Black Box" of standard setting: An exploratory qualitative study. Applied Measurement in Education, 18(3), 269–287.

Mee, J., Clauser, B. E., & Margolis, M. J. (2013). The impact of process instructions on judges' use of examinee performance data in Angoff standard setting exercises. Educational Measurement: Issues and Practice, 32(3), 27-35.

Messick, S. (1981). Constructs and their vicissitudes in educational and psychological measurement. Psychological Bulletin, 89(3), 575–588.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13–103). Washington, DC: American Council on Education and National Council on Measurement in Education.


Messick, S. (1998). Test validity: A matter of consequence. Social Indicators Research, 45(1-3), 35–44.

Michigan State Department of Education. (2007, February). Retrieved from http://www.michigan.gov/documents/mde/MI-ELPA_Tech_Report_final_199596_7.pdf

Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark procedure: Psychological perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ: Erlbaum.

National Council on Measurement in Education. (2015). Retrieved from http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorV

Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14(2), 3-19.

Nelson, D. S. (1994). Job analysis for licensure and certification exams: Science or politics? Educational Measurement: Issues and Practice, 13(3), 29-35.

Norcini, J., Lipner, R., Langdon, L., & Strecker, C. (1987). A comparison of three variations on a standard-setting method. Journal of Educational Measurement, 24(1), 56-64.

Norcini, J. J. & Shea, J. A. (1997). The credibility and comparability of standards. Applied Measurement in Education, 10(1), 39–59.

Plake, B., & Giraud, G. (1998). Effect of a modified Angoff strategy for obtaining item performance estimates in a standard setting study. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Plake, B. S., Melican, G. J., & Mills, C. N. (1991). Factors Influencing Intrajudge Consistency During Standard‐Setting. Educational Measurement: Issues and Practice, 10(2), 15-16.


Raymond, M. R., & Reid, J. B. (2001). Who made thee a judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 119-157). Mahwah, NJ: Erlbaum.

Reckase, M. D. (2000). The ACT/NAGB standard setting process: How "modified" does it have to be before it is no longer a modified-Angoff process? Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Reckase, M. D. (2006). Rejoinder: Evaluating standard setting methods using error models proposed by Schulz. Educational Measurement: Issues and Practice, 25(3), 14-17.

Roach, A. T., McGrath, D., Wixon, C., & Talapatra, D. (2010). Aligning an early childhood assessment to state kindergarten content standards: application of a nationally recognized alignment framework. Educational Measurement: Issues and Practice, 29(1), 25-37.

Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413-428.

Schafer, W. D. (2005). Criteria for standard setting from the sponsor's perspective. Applied Measurement in Education, 18(1), 61-81.

Schoonheim‐Klein, M., Muijtjens, A., Habets, L., Manogue, M., Van Der Vleuten, C., & Van der Velden, U. (2009). Who will pass the dental OSCE? Comparison of the Angoff and the borderline regression standard setting methods. European Journal of Dental Education, 13(3), 162-171.

Shepard, L.A. (1980). Standard setting issues and methods. Applied Psychological Measurement, 4(4), 447-467.

Shepard, L. A. (1994). Implications for standard setting of the National Academy of Education evaluation of the National Assessment of Educational Progress achievement levels. In Proceedings of the joint conference on standard setting for large-scale assessments of the National Assessment Governing Board and the National Center for Educational Statistics (pp. 143–159). Washington, DC: U.S. Government Printing Office.

Smith, R. L., & Smith, J. S. (1988). Differential use of item information by judges using Angoff and Nedelsky procedures. Journal of Educational Measurement, 25(4), 259-274.

Taube, K.T. (1997). The incorporation of empirical item difficulty data in the Angoff standard-setting procedure. Evaluation and the Health Professions, 20(4), 479-498.

Taylor, J. (2014, July 17). Difference between within-subject and between-subject [Blog post]. Retrieved from http://www.statsmakemecry.com/smmctheblog/within-subject-and-between-subject-effects-wanting-ice-cream.html

van de Watering, G., & van der Rijt, J. (2006). Teachers’ and students’ perceptions of assessments: A review and a study into the ability and accuracy of estimating the difficulty levels of assessment items. Educational Research Review, 1(2), 133-147.

Verhoeven, B. H., Verwijnen, G. M., Muijtjens, A. M. M., Scherpbier, A. J. J. A., & Van der Vleuten, C. P. M. (2002). Panel expertise for an Angoff standard setting procedure in progress testing: item writers compared to recently graduated students. Medical Education, 36(9), 860-867.

Wessen, C. (2010). Analysis of Pre- and Post-Discussion Angoff ratings for evidence of social influence effects. Unpublished MA Dissertation, Department of Psychology, University of California, Sacramento.

Wiley, A., & Guille, R. (2002). The occasion effect for “at-home” Angoff ratings. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.


Yin, P. & Schultz, E. M. (2005). A comparison of cut scores and cut score variability from Angoff-based and Bookmark-based procedures in standard setting. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.


Appendix 1. Common European Framework of Reference - Global Scale Performance Level Descriptors

C2

Can understand with ease virtually everything heard or read. Can summarise information from different spoken and written sources, reconstructing arguments and accounts in a coherent presentation. Can express him/herself spontaneously, very fluently and precisely, differentiating finer shades of meaning even in more complex situations.

C1

Can understand a wide range of demanding, longer texts, and recognise implicit meaning. Can express him/herself fluently and spontaneously without much obvious searching for expressions. Can use language flexibly and effectively for social, academic and professional purposes. Can produce clear, well-structured, detailed text on complex subjects, showing controlled use of organisational patterns, connectors and cohesive devices.

B2

Can understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation. Can interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible without strain for either party. Can produce clear, detailed text on a wide range of subjects and explain a viewpoint on a topical issue giving the advantages and disadvantages of various options.

B1

Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken. Can produce simple connected text on topics which are familiar or of personal interest. Can describe experiences and events, dreams, hopes and ambitions and briefly give reasons and explanations for opinions and plans.

A2

Can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g. very basic personal and family information, shopping, local geography, employment). Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters. Can describe in simple terms aspects of his/her background, immediate environment and matters in areas of immediate need.

A1

Can understand and use familiar everyday expressions and very basic phrases aimed at the satisfaction of needs of a concrete type. Can introduce him/herself and others and can ask and answer questions about personal details such as where he/she lives, people he/she knows and things he/she has. Can interact in a simple way provided the other person talks slowly and clearly and is prepared to help.

Source: CoE, 2001, p. 24.

Appendix 2. Informed Consent Form

Informed Consent Form for ELC Standard Setting Pilot Studies (July 2010)

The ELC assessment committee is doing research on the standard setting process used to link tests to the Common European Framework of Reference (CEFR). The results of this process are important, because they determine what test scores count as ‘proof’ that a student has reached a certain ability level. However, the process itself is very subjective, and there is no way to prove that a given score means that a student has “really” reached a given ability level. The purpose of this study is thus to help us better understand the factors that influence the decision-making process, as part of the longer-term goal of improving the process.

You are being invited to take part because of your background in the TEFL field. Your participation is entirely voluntary, and your choice will have no bearing on your job or on any work-related evaluations or reports. If you accept, you will be asked to complete a short preparatory assignment and to participate in two one-day workshops. At these meetings, you will receive more training and then you will be asked to make a series of judgments concerning the