Exploring Teachers’ Test-constructing Processes and Students’ Test-taking Processes


Doctoral Dissertation
Department of English, National Taiwan Normal University

Exploring Teachers’ Test-constructing Processes and Students’ Test-taking Processes

Advisor: Dr. Yuh-show Cheng
Graduate student: Fan-ping Tseng

August 2014

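CHINESE ABSTRACT

This study investigates three questions. First, how do senior high school teachers construct a mock test for the Scholastic Ability English Test (SAET), and how do experienced and novice teachers differ in what they consider when constructing items? Second, how do senior high school students answer the items on an SAET mock test, and how do the answering strategies of higher-proficiency and lower-proficiency students differ? Third, are students’ considerations in answering consistent with teachers’ considerations in constructing the items? Four senior high school English teachers and forty-eight senior high school students participated in this study. The teachers’ task was to construct an SAET mock test containing twenty-eight multiple-choice items covering vocabulary, cloze, and reading comprehension; the students’ task was to answer the items the teachers had constructed. All participants were required to think aloud while performing their tasks, and these verbal reports served as the main data for analysis. The main findings are as follows. First, experienced and novice teachers differed somewhat in their considerations: the experienced teachers’ considerations were more student-centered, whereas the novice teachers’ considerations conformed more closely to the item-writing principles of assessment. Moreover, the items written by the experienced teachers were not better than those written by the novice teachers, and a considerable number of the items written by the four teachers were judged by experts to be flawed or inappropriate and in need of revision and improvement. Second, students generally adopted different strategies for different item types. However, they used “elimination” on all three item types, which indicates that elimination was the strategy students used most often in this study. In addition, higher-proficiency students drew on vocabulary knowledge, grammar knowledge, and deductive reasoning more often than lower-proficiency students, whereas lower-proficiency students resorted to “guessing” more often than higher-proficiency students on every item type. The study also found that students’ considerations differed greatly from teachers’ considerations; the consistency rate between the two was only 33%. Furthermore, students’ thinking was more consistent with that of the novice teachers and less consistent with that of the experienced teachers. Higher-proficiency students’ considerations diverged more from the teachers’ on cloze items, while lower-proficiency students’ considerations diverged more from the teachers’ on reading comprehension items.

Keywords: test construction, test-constructing process, test-taking process, strategy use, vocabulary test, cloze test, reading comprehension test, think-aloud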

ABSTRACT

This study aims to investigate three research questions. First, how did experienced and novice teachers construct mock tests for the Scholastic Ability English Test (SAET)? Second, how did higher- and lower-proficiency students take those mock tests? Third, were students’ considerations for answering the tests consistent with teachers’ test-constructing considerations? Four senior high school teachers and forty-eight senior high school students participated in this study. All participants were asked to think aloud while performing their tasks. The teachers were asked to construct twenty-eight multiple-choice items on vocabulary, cloze, and reading comprehension. The students were asked to answer the questions constructed by the teachers. Major findings of this study are summarized as follows. First, the experienced teachers and novice teachers seemed to make different types of considerations in constructing their tests. The experienced teachers took more student-oriented factors into account, while the novice teachers took more test-construction principles into consideration. Despite their different considerations in test-constructing processes, the two experienced teachers did not seem to produce better test items than the two novice teachers. All four teachers had constructed some items that were deemed poor, problematic, or inappropriate from the authority’s perspective. Second, students generally used different strategies when answering different types of questions. However, they seemed to use the strategy of “elimination” very frequently on all three types of items. In terms of proficiency levels, higher-proficiency students tended to use their vocabulary knowledge, grammar knowledge, and deductive reasoning more frequently than lower-proficiency students in answering the items. On the other hand, lower-proficiency students tended to use the strategy of “guessing” more frequently than higher-proficiency students across the three types of questions. Third, students’ considerations for answering test items clashed with teachers’ test-constructing considerations to a great extent; the overall consistency rate between them was only about 33% in this study. Furthermore, students generally thought in a way more congruent with novice teachers than with experienced teachers. In addition, higher-proficiency students’ considerations clashed more with teachers’ considerations on cloze items, while lower-proficiency students’ considerations clashed more with teachers’ considerations on reading comprehension questions.

Key words: test-construction, test-constructing process, test-taking process, strategy use, vocabulary test, cloze test, reading comprehension test, think-aloud

ACKNOWLEDGEMENTS

I owe a great deal to many people who have given me love, support, and assistance during my long years of doctoral study at NTNU. Without them, I would never have got this degree, nor would I have been able to complete this dissertation.

My first sincere gratitude goes to my esteemed advisor, Dr. Yuh-show Cheng, for her clear guidance and valuable advice during these years of writing this dissertation. Had it not been for her constant and timely encouragement, I would have given up long before finishing this research project. I am truly grateful for everything she has done for me at so many critical moments.

I am also indebted to my exceptional committee members, Dr. Chiou-lan Chern, Dr. His-nan Yeh, Dr. Vincent Wu-chang Chang, and Dr. Hsueh-ying Yu, for their insightful comments and constructive remarks on my dissertation. Their valuable suggestions have contributed a lot to the improvement of this dissertation.

I would also like to thank all the excellent professors who have taught me during my studies at NTNU. Thank you for leading me into the research field, and thank you for nurturing me so that I could become what I am now. These eminent professors are Dr. Vincent Wu-chang Chang, Dr. Yuh-show Cheng, Dr. His-nan Yeh, Dr. Chiou-lan Chern, Dr. Hsi-chin Chu, Dr. Howard Hao-jan Chen, Dr. Wen-Ta Tseng, Dr. Ho-ping Feng, Dr. Chyiruey Chen, Dr. Chien-Jer Charles Lin, and Dr. Miao-Hsia Chang.

My deepest appreciation also extends to all the high school teachers and students who participated in this study. The precious time and effort they devoted to this research helped make this dissertation a reality. I would also like to thank Dr. Hsueh-o Lin and Dr. Jung-Han Chen for their critical comments on the test items used in this study. I greatly appreciate their generous help.

I owe a debt of gratitude to my family for their love and unfailing support throughout my study. I would like to thank my dearest mother, sister, brother, sister-in-law, and niece, who took on the sweet burden of taking care of my aging father most of the time while I was away working on my dissertation. Their unselfish sacrifice gave me the extra time to take on such a demanding research project, and thus contributed a lot to the completion of this dissertation.

I would like to express my profound gratitude to my husband, son, and daughter for their faithful love and support during my long years of study at NTNU. Although they did not quite understand what I had been working on all these years, they still supported me wholeheartedly. Without their company and encouragement, I would never have had the strength and motivation to finish this dissertation. Dear ones, I love you all.

Finally, may all glory be to God alone! Soli Deo gloria! To my Lord in heaven, I dedicate this dissertation.

TABLE OF CONTENTS

CHINESE ABSTRACT
ENGLISH ABSTRACT
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER ONE INTRODUCTION
  Motivation and Background
  Statement of the Problem and Research Rationale
  Purpose of the Study
  Research Questions
  Delimitations
  Significance of the Study

CHAPTER TWO LITERATURE REVIEW
  Overview of Language Testing Research
  Studies on Students’ Test-taking Process
    Early Attempts
    Studies on Multiple-choice Reading Comprehension Tests
    Studies on Cloze Tests
  Studies on Teachers’ Test Construction
    Training in Teachers’ Test Construction
    Studies on Test Constructor Effect
  Research into the Relationship Between Test-constructing and Test-taking Processes
  Verbal Report in Language Testing

CHAPTER THREE METHODOLOGY
  Participants
  Instruments
    Background Questionnaire
    Feedback Sheet
    Foreign Language Proficiency Test
    Two Sets of Materials for Test Construction
    Four Mock Tests for the Scholastic Ability English Test
  Data Collection Procedures
    Collection of Teachers’ Verbal Reports
    Collection of Students’ Verbal Reports
  Data Analysis Procedures

CHAPTER FOUR RESULTS AND DISCUSSION ON TEACHERS’ TEST CONSTRUCTION
  Results of Teachers’ Background Questionnaires
    Experienced Teacher 1 (ET 1)
    Experienced Teacher 2 (ET 2)
    Novice Teacher 1 (NT 1)
    Novice Teacher 2 (NT 2)
  Analyses of Teachers’ Think-aloud Protocols
    Construction of Vocabulary Test Items
      The Construction Processes and Considerations of Experienced Teacher 1 (ET 1)
      The Construction Processes and Considerations of Experienced Teacher 2 (ET 2)
      The Construction Processes and Considerations of Novice Teacher 1 (NT 1)
      The Construction Processes and Considerations of Novice Teacher 2 (NT 2)
    Construction of Cloze Test Items
      The Construction Processes and Considerations of Experienced Teacher 1 (ET 1)
      The Construction Processes and Considerations of Experienced Teacher 2 (ET 2)
      The Construction Processes and Considerations of Novice Teacher 1 (NT 1)
      The Construction Processes and Considerations of Novice Teacher 2 (NT 2)
    Construction of Reading Comprehension Questions
      The Construction Processes and Considerations of Experienced Teacher 1 (ET 1)
      The Construction Processes and Considerations of Experienced Teacher 2 (ET 2)
      The Construction Processes and Considerations of Novice Teacher 1 (NT 1)
      The Construction Processes and Considerations of Novice Teacher 2 (NT 2)
  Results of Teachers’ Feedback Sheets
  Analyses of Teacher-constructed SAET Mock Tests
    Analyses of Vocabulary Items
    A Critique of Vocabulary Items
      Problems with the stems
      Problems with the options
      General discussion on vocabulary items
    Analyses of Cloze Items
    A Critique of Cloze Items
      Problems with the choice of blanks (or testing points)
      Problems with the options
      General discussion on cloze items
    Analyses of Reading Comprehension Questions
    A Critique of Reading Comprehension Questions
      Problems with the question stems
      Problems with the options
      General discussion on reading comprehension questions
  General Discussion on the Four Teachers’ Test Construction Performances

CHAPTER FIVE RESULTS AND DISCUSSION ON STUDENTS’ STRATEGIES TO ANSWER TEST QUESTIONS
  Results of Students’ Performances on the Four Mock Tests
    Noteworthy Items on Form A
    Noteworthy Items on Form B
    Noteworthy Items on Form C
    Noteworthy Items on Form D
    General Discussion on Students’ Performances on the Noteworthy Items
  Results on Students’ Strategies to Answer Questions
    Results of Students’ Strategies for Answering Vocabulary Items
    Results of Students’ Strategies for Answering Cloze Items
    Results of Students’ Strategies for Answering Reading Comprehension Questions
    General Discussion on Students’ Strategies for Answering Test Questions
  Results of Students’ Opinions about Think-aloud Method and This Study

CHAPTER SIX RESULTS AND DISCUSSION ON THE CONSISTENCY BETWEEN TEACHERS’ TEST-CONSTRUCTING AND STUDENTS’ TEST-TAKING CONSIDERATIONS
  Results of Comparisons Between Teachers’ and Students’ Considerations
  Items That Caused Inconsistency Between Teachers’ and Students’ Considerations
    Items on Form A
    Items on Form B
    Items on Form C
    Items on Form D
  General Discussion on the Inconsistency Between Teachers’ and Students’ Considerations on the Four Forms

CHAPTER SEVEN CONCLUSION
  Summary of the Major Findings
  Pedagogical Implications
  Limitations of the Study
  Directions for Future Research

REFERENCES

APPENDICES
  Appendix A Research Consent Form for Teachers
  Appendix B Background Questionnaire
  Appendix C Feedback Sheet
  Appendix D Shortened Version of FLPT
  Appendix E Research Consent Form for Students
  Appendix F Materials for Test Construction
  Appendix G Four Forms of the SAET Mock Tests
  Appendix H Dates of Data Collection
  Appendix I Teacher-constructed SAET Mock Tests
  Appendix J Words Chosen by Different Teachers in Their Tests
  Appendix K Students’ Answers to the Items on Each Form
  Appendix L Frequencies of the Comparisons Between Students’ Test-taking Strategies and Teachers’ Test-constructing Considerations

LIST OF TABLES

Table 1. Three Variations of the Verbal Report Procedure
Table 2. Participants’ FLPT Scores and Exams Averages
Table 3. Comparison of Material A and Material B
Table 4. Results of Teachers’ Background Questionnaires
Table 5. Teachers’ Considerations in Constructing Vocabulary Items
Table 6. Teachers’ Considerations in Constructing Cloze Items
Table 7. Teachers’ Considerations in Constructing Reading Comprehension Questions
Table 8. Results of Teachers’ Feedback Sheets
Table 9. Distribution of Items Testing on Different Parts of Speech
Table 10. Words Teachers Chose As Correct Options
Table 11. Frequencies of the Problems with the Stem in Vocabulary Items
Table 12. Frequencies of the Problems with the Options in Vocabulary Items
Table 13. Results of the Appropriateness Checklist for Vocabulary Items
Table 14. Types of cloze items the teachers constructed
Table 15. Distribution of the cloze item types teachers constructed
Table 16. Frequencies of the Problems with the Choice of Blanks in Cloze Items
Table 17. Frequencies of the Problems with the Options in Cloze Items
Table 18. Results of the Appropriateness Checklist for Cloze Items
Table 19. Distribution of the reading comprehension question types teachers constructed
Table 20. Frequencies of the Problems with the Question Stems in Reading Comprehension Items
Table 21. Frequencies of the Problems with the Options in Reading Comprehension Items
Table 22. Results of the Appropriateness Checklist for Reading Comprehension Questions
Table 23. Means of Students’ Scores on the Mock Tests
Table 24. Items Worthy of Note on the Four Forms
Table 25. Noteworthy Items Constructed by Four Teachers
Table 26. Students’ Strategies for Answering Vocabulary Items
Table 27. Frequencies of Each Strategy Students Used in Answering Vocabulary Items
Table 28. Students’ Strategies for Answering Cloze Items
Table 29. Frequencies of Each Strategy Students Used in Answering Cloze Items
Table 30. Students’ Strategies for Answering Reading Comprehensions Questions
Table 31. Frequencies of Each Strategy Students Used in Answering Reading Comprehension Questions
Table 32. Frequencies of Students’ Opinions about Think-aloud and This Study
Table 33. Comparisons Between Teachers’ and Students’ Considerations
Table 34. Comparisons Between Teachers’ and Students’ Considerations Across Two Proficiency Levels
Table 35. Comparisons Between Teachers’ and Students’ Considerations on Three Types of Items
Table 36. Comparisons Between Teachers’ and Students’ Considerations on Three Types of Items Across Two Proficiency Levels

LIST OF FIGURES

Figure 1. Procedures for Producing Four Forms of Tests

CHAPTER ONE
INTRODUCTION

Motivation and Background

Tests seem to play a prominent role in Taiwan’s high school language classrooms. As an English teacher in a senior high school in Taiwan, I find both myself and my students constantly facing language tests of all kinds in a semester, such as class quizzes, weekly tests, midterms, and finals. Among these different tests, midterms and finals are considered most important by students, since these are formal, school-required exams, the results of which profoundly affect their academic records. Thus, students work very hard to prepare for the exams. After they take the exams, the results are always analyzed quantitatively, with different scores presented to teachers and school authorities for comparison. However, I do not think the raw scores give us enough information about the students’ understanding of what is being tested on the exams. In other words, students’ test scores may not faithfully represent their true language abilities; other factors, such as guessing, might be involved. This assumption aroused my interest in how students take the exams. I think that the results of an investigation into students’ test-taking processes would add substantial meaning to test scores.

The school-required exams (i.e., midterms or finals) are also taken seriously by teachers, who are mostly responsible for preparing the exams. In my school, the formal exam is prepared by one teacher alone. Most of the time, the teacher who is assigned to construct a formal exam is under great pressure, and I am no exception. As far as I am concerned, constructing a formal exam is no easy task, and the test-constructing process is a laborious one. Since I always struggle through the test-constructing process, I am curious about how other teachers go through such a process and what factors they take into consideration when they construct a test.

My inquiries into how teachers construct a test and how students take a test led me into the ample and diverse research of the language testing field. The blossoming language testing research has covered many aspects of testing practice, such as test types, test validity, test-scoring methods, test-takers, and test-taking processes, to name just a few. Taken together, this abundant body of research seems to center around two major themes: one on issues concerning “tests,” and the other on issues regarding “test-takers” (the recipients of tests). Yet tests are not born in a vacuum; instead, they are produced or written by teachers or researchers (the initiators of tests). To my surprise, research on the initiators of tests, or on how tests are constructed, has received little attention, leaving not only a missing piece in the testing field but also a large potential area for further investigation.

Bachman (2000), in his state-of-the-art article on language testing, concludes that he believes “there are two areas in which language testing and language testers must continue to grow and develop: the professionalization of the field, and validation research” (p. 18). I think my present research accords well with Bachman’s (2000) arguments. For one, Bachman’s first area, the professionalization of the field, has two major thrusts: “the training of language testing professionals; and the development of standards of practice and mechanisms for their implementation and enforcement” (p. 19). I believe the results of my study on teachers’ test-constructing processes might shed some light on teacher education curriculum, which is in line with Bachman’s (2000) focus. For another, the results of my research on students’ test-taking processes might contribute to test validation research, which is one of Bachman’s (2000) major concerns. Taken together, this study is motivated by the desire to resolve a puzzle in my teaching career as well as the possibility of bridging a research gap in the language testing field.

Statement of the Problem and Research Rationale

The present study consists of three aspects: (1) an investigation into how teachers construct tests, (2) an exploration into how students take tests, and (3) a comparison between teachers’ test-constructing considerations and students’ considerations for answering the tests constructed by the teachers.

In testing research as a whole, the factor of test-constructors does not seem to receive its due attention compared with the issues concerning tests and test-takers. Yet, as Jafarpur (2003) points out, in a program where test-construction is one individual teacher’s responsibility, the role of the test-constructor is more important than in a program where test-construction is carried out by a committee. I agree with Jafarpur’s (2003) observation, since test-construction is usually one teacher’s responsibility in many of Taiwan’s English teaching contexts, especially in middle schools. Thus, research on test-construction might add insights into testing practice in Taiwan.

There have been some studies investigating teachers’ test-constructing skills (e.g., Carter, 1984; Coniam, 2009), and some describing training courses on test-construction (e.g., Kirschner, Spector-Cohen, & Wexler, 1996; Johnson, Becker, & Olive, 1999). Jafarpur (2003) took this further, exploring the test-developer as a facet of test variance. Among those studies, though Johnson, Becker, & Olive (1999) and Coniam (2009) have reported some teachers’ reflections on their test development process after finishing the test items, no study, to my knowledge, has directly examined teachers’ test-constructing process by using the think-aloud method. Therefore, I think it is worthwhile to conduct a thorough investigation and analysis of teachers’ test-constructing process through the think-aloud method.
While investigating the test-constructing process, I also hope to examine a possible test-constructor effect on test variance, especially the effect of years of teaching experience. Jafarpur’s (2003) examination of teacher-produced tests shows that there was a test-constructor effect on the performance of test-takers using multiple-choice reading comprehension tests that had no specifications. Jafarpur’s (2003) results aroused my interest in the test-constructor effect on test variance using tests with specifications. Based on research findings on the rater effect in performance tests, I assume that the test constructor may have an effect on the variance of multiple-choice tests. For example, Brown (1995) explored the influence of rater backgrounds (native/nonnative; with/without industry and teaching experience) on assessments in an oral test of Japanese for tour guides. Her results showed that there were significant differences in ratings awarded for some individual criteria, though there were no significant differences between different types of rater in terms of the overall grade awarded. In another study, Lim (2011) examined new and experienced raters’ performance longitudinally over multiple time points in writing assessment. The results showed that novice raters, who initially differed in performance from their experienced counterparts, learned to rate appropriately relatively quickly, and that raters were able to maintain rating quality over time. Since studies such as Brown (1995) and Lim (2011) have suggested that a rater effect exists even when rating criteria are provided, it is reasonable to assume, by analogy, that there might also be a test-constructor effect in test development even when test specifications are provided. Therefore, in addition to investigating teachers’ test-constructing processes, I want to explore how the tests produced by novice teachers differ from those produced by experienced teachers.

Unlike the scant studies on test-constructing processes, research into test-taking processes has received more attention, and has been recognized as part of construct validation research (Anderson et al., 1991; Bachman, 2000; Cohen, 2006).
Cohen (1984) was probably one of the earliest researchers to explore L2 test-taking strategies through verbal report data. Later on, more studies followed the trend, and have shed some light on how students were actually thinking while they were taking tests. For instance, Nevo (1989), Anderson et al. (1991), and Rupp, Ferne, and Choi (2006) investigated students’ test-taking process in multiple-choice reading comprehension tests, while Storey (1997), Sasaki (2000), Yamashita (2003), and Moghaddam (2010) examined students’ test-taking process in cloze tests. Among these studies, Yamashita (2003) also compared the test-taking processes of skilled readers with those of less skilled readers, and the results did show that the two groups drew on different information in answering the gap-filling cloze test. These test-taking process studies on L2 learners are valuable in that they have captured and described part of the cognitive processes at work while L2 learners were taking their tests. Yet L2 learners are unique in each context, and their test-taking processes might vary from culture to culture. Since there have been no published studies, to the best of my knowledge, examining Taiwanese EFL learners’ test-taking processes, I would like to investigate how Taiwanese high school students actually take their English tests. Moreover, motivated by Yamashita (2003), I would also like to examine whether there is any difference between the test-taking processes of high-proficiency students and those of their low-proficiency counterparts.

In his pioneering study on the L2 test-taking process, Cohen (1984) stated that “the purpose of such research has been to explore the closeness-of-fit between the tester’s presumptions about what is being tested and the actual processes that the test taker goes through” (p. 70). Later, Nevo (1989) also commented that “the examiners’ assumptions regarding what they test and their expectations from the respondents often do not match the actual processes which the respondents undergo during testing” (p. 200). Both researchers have pointed out the phenomenon that test takers may answer test items in ways different from what test constructors have expected.
Nevertheless, it is a pity that after two decades, there is still little research empirically examining the degree of fitness between test-constructors’ considerations and 5.

(17) students’ test-taking considerations. One of such rare studies is Gierl (2001), which compared cognitive representations of test developers and those of students on a mathematics test. As far as I know, no such similar comparison has been conducted in the language testing field. Hence, I think a comparison between EFL teachers’ test-constructing considerations and EFL students’ considerations for answering tests would be a research line worth pursuing. Given the motivation and research rationale stated above, I would like to investigate EFL teachers’ test-constructing processes and EFL students’ test-taking processes by using the think aloud method. In addition, I will also compare teachers’ test-constructing considerations and students’ considerations for answering the test items produced by the teachers. Purpose of the Study The purpose of the study is threefold: (1) to investigate how EFL teachers construct tests; (2) to examine how EFL students take tests; and (3) to explore whether there is any match or mismatch between teachers’ test-constructing considerations and students’ considerations for answering the test items produced by the teachers. The present study was situated in a context I am familiar with. In other words, the participants in the study were Taiwanese senior high school teachers and students, and the tests they constructed or took were mock tests for the Scholastic Ability English Test (SAET). The reasons for selecting the SAET as the research tool are as follows. To begin with, the SAET is a nationwide test administered by the College Entrance Examination Center (CEEC). All Taiwanese senior high school students are familiar with the test since they have to take either the SAET or the Department Required English Test (DRET), the other important test administered by the CEEC, to enter university or college. Since the SAET, usually held in January, occurs prior to 6.

(18) the DRET, held in July, most students will take the SAET, and some will skip the DRET if they are admitted to the universities they want with their SAET grades. The statistics released by the CEEC also indicate that the number of students taking the SAET is usually much larger than that of students taking the DRET1. Thus, the SAET is considered a very important college entrance exam by both the senior high school teachers and students. I believe that studies on examining how teachers and students treat the SAET would yield more valuable fruits than those examining other tests or exams. Although I regard the SAET as a crucial entrance exam for Taiwanese senior high school students, the present study probed into the processes of how teachers constructed “mock tests” of SAET and how students answered them. The main cause for this substitution of mocks tests for the real test is that I do not have the legitimate access to the real SAET, which is usually prepared by the CEEC committee. Hence, it is impossible for me to employ the real SAET as my investigation instrument. Despite this, I think the mock test of SAET can still serve a good purpose for this study for the following three reasons. First, my main research goal is to explore the processes of test-constructing and test-taking, not to examine the purpose of the test itself. Regarding this, even though the SAET and its mock test may serve different purposes, the former being an achievement test and the latter more like a diagnostic one2, the differences between them would not influence my study to a great extent. Thus,. 1. 
The numbers of students taking the SAET and the DRET in the years of 2011, 2012, and 2013 are as follows: (year) 2011 2012 2013 Number of students taking the SAET 146,302 154,560 150,030 Number of students taking the DRET 82,164 75,839 65,966 sources: http://www.ceec.edu.tw/AbilityExam/SatStat/學測歷年報名人數 1030103.pdf http://www.ceec.edu.tw/AppointExam/DrseStat/102DrseStat/指考歷年報名人數 1020613.pdf 2 A diagnostic test, according to Hughes (2003), is used to identify learners’ strengths and weaknesses. Therefore, I think an SAET mock test serves the purpose of a diagnostic test because it shows students what their weaknesses are in preparing for the SAET. 7.

having no access to the real SAET, I consider its mock test a good substitute. Second, among the ready-made SAET mock tests on the market, many, as I observe, are constructed by individual teachers instead of by a committee. This resembles the testing practice in most English classrooms in Taiwan, where a test is usually prepared by an individual teacher alone. Consequently, I think investigating how an individual teacher constructs a mock test may reveal more of the true testing practice than exploring how a committee prepares a formal test. Third, it is common practice for students to take several mock tests before sitting for the SAET. Therefore, the use of mock tests as research tools in the present study will not seem peculiar to students or teachers, since they are quite familiar with mock tests. Given the above three reasons, I think the mock tests of the SAET are legitimate research instruments in the study.

Research Questions

To achieve the above-mentioned research purposes, the present study addressed the following three research questions:

1. What considerations do teachers take into account when they construct mock tests for the Scholastic Ability English Test (SAET)? How do the tests constructed by novice teachers differ from those constructed by experienced teachers?

2. What strategies do students use to answer the SAET mock tests? How do the higher-proficiency students use strategies to answer the test items differently from the lower-proficiency students?

3. Are students’ considerations for answering the SAET mock tests consistent with teachers’ test-constructing considerations?

Delimitations

This study focused merely on the reading part of the mock test for the SAET, though the original formal test of the SAET consists of two sections: reading and writing. The reasons for narrowing down my research scope to reading are as follows. First, the reading section of the SAET accounts for 72 points out of 100, while the writing section accounts for only 28 points. To make my study more focused and compact, I decided to explore only the reading section, which is the major part of the SAET. Second, the reading section of the SAET is tested in the format of multiple-choice questions, one that is favored in many large-scale tests in Taiwan because its scoring is reliable, rapid, and economical. As many scholars (e.g., Heaton, 1988; Weir, 1990; Cohen, 1994; Hughes, 2003) have indicated that good multiple-choice questions are notoriously difficult to construct, I think it is highly worthwhile to investigate how teachers construct the multiple-choice reading questions.

Significance of the Study

The significance of this study can be discussed from three perspectives: (1) design of language testing research, (2) teacher education curriculum, and (3) language testing research in Taiwan.

To begin with, the design of this study is pioneering in that it is the first study to explore EFL teachers’ test-constructing process by using the think-aloud method and that it is also the first study to investigate the match and mismatch between EFL teachers’ test-constructing considerations and EFL students’ considerations for answering the tests constructed by the teachers. Therefore, the design of this study not only adds a new research direction to language testing research as a whole, but also offers a new dimension for examining test validation in particular.

Secondly, the results of this study will shed new light on the teacher education curriculum, in particular, on the testing course for EFL teachers.
For one, the study examines the processes by which novice and experienced teachers construct a test; the findings may have implications for test construction guidelines. Teachers in training might also benefit from this study as they would know in detail how a test is constructed from scratch. For another, the analyses of teachers’ test-constructing processes and students’ test-taking processes will help teachers know what students are thinking about while taking tests and whether the students are thinking in line with them. Thus, the results can familiarize teachers with EFL learners’ test-taking behavior or strategies, giving them the background to derive pedagogical implications for their own teaching practice.

Lastly, this study is unique in language testing research in the context of Taiwan. To date, there seems to have been no published study examining teachers’ test-constructing considerations by using the think-aloud method in Taiwan. It is hoped that the present study will generate more similar studies to help portray Taiwanese EFL learners’ test-taking behavior and EFL teachers’ test-constructing considerations. Most importantly, if the present study, which used the SAET mock test as the research tool, receives some critical acclaim in the testing field in Taiwan, it is hoped that a study using the real SAET as the research tool can be conducted by the CEEC committee some day. Such studies, I believe, will help improve the quality and validity of the SAET, one of the standardized tests in Taiwan.

CHAPTER TWO
LITERATURE REVIEW

In this chapter, research concerning the following themes will be reviewed. First, I will give an overview of language testing research. Second, I will review studies on how students take tests and on how teachers construct tests. Third, I will review research into the relationship between teachers’ test-constructing processes and students’ test-taking processes. Finally, I will review literature concerning the technique of verbal report in language testing.

Overview of Language Testing Research

Language testing research, a well-established branch of applied linguistics, has evolved and expanded through the years. Bachman (2000), in his state-of-the-art article, chronicled the major developments of testing research in the last two decades of the 20th century and also predicted the future directions for testing research in the 21st century. To gain a rough understanding of testing research as a whole and to situate the present study in the testing field, I will briefly review Bachman (2000) in the following.

According to Oller (1979, cited in Bachman, 2000), language testing research, from the mid-1960s through the 1970s, was dominated by the hypothesis that language proficiency consisted of a single unitary trait, and the research methodology used was often quantitative and statistical. Then, the 1980s saw the influence of second language acquisition (SLA) research on testing research. Research in SLA spurred language testers to investigate not only a wide range of factors affecting language test performance (e.g., Douglas & Selinker, 1985; Chapelle, 1988; Hale, 1988), but also the strategies involved in the process of test-taking itself (e.g., Cohen, 1984). It was during this period that research on the test-taking process began to emerge. Toward the end of the 1980s, language testers were challenged by Pienemann et al. (1988) to explicitly take into consideration language learners’ developmental sequence in the design of language tests and in the interpretation of test scores.

Testing research in the 1990s witnessed expansions in five major areas: (1) research methodology; (2) practical advances; (3) factors that affect performance on language tests; (4) performance assessment; and (5) ethical issues. Each of the five areas will be summarized briefly below.

Methodological approaches employed in language testing research in the 1990s became increasingly sophisticated and diverse. Newer and more powerful quantitative methods, such as criterion-referenced measurement (Lynch & Davidson, 1997), generalizability theory (Bachman, 1997), item response theory (Pollitt, 1997), and structural equation modeling (Kunnan, 1998), superseded classical norm-referenced reliability coefficients and exploratory factor analysis. Moreover, qualitative approaches were also applied to language testing research. They include expert judgments, introspective and retrospective verbal reports, observations, questionnaires and interviews, text analysis, conversational analysis, and discourse analysis (Banerjee & Luoma, 1997).

Concerning practical issues, the testing research agenda began to see advances in the areas of cross-cultural pragmatics (e.g., Hudson et al., 1992; 1995), languages for specific purposes (e.g., Douglas, 2000), computer-based assessment (Gruba & Corbel, 1997), and a renaissance in research into the testing of vocabulary (e.g., Read, 2000) and the development of new kinds of vocabulary tests (e.g., Nation, 1990; Laufer & Nation, 1999).

Regarding factors affecting performance on language tests, research has mainly focused on characteristics of the testing procedure (e.g., Fulcher, 1996; Riley & Lee, 1996), characteristics of test takers (e.g., Hill, 1993), and the test-taking process (e.g., Storey, 1997). A number of the test-taking process studies have used the qualitative methodologies mentioned above, such as verbal reports, questionnaires, and discourse analysis.

“Performance” assessment (McNamara, 1997), or “alternative” or “authentic” assessment (Herman et al., 1992; Wiggins, 1993), in the 1990s was spurred largely by widespread dissatisfaction with standardized multiple-choice tests in the communicative language teaching context, and by the developments in task-based language teaching and assessment (Norris et al., 1998). Performance assessment measures include classroom observation, portfolios, conferences, journals, questionnaires, interviews, self- and peer-assessment, group oral assessment, etc. (Brown, 1998).

Ethical issues in the 1990s included research into washback on instruction (e.g., Alderson & Wall, 1993; Wall & Alderson, 1993), ethics of test use (e.g., Lynch, 1997; Shohamy, 1997), and professionalization of the testing field (e.g., Stansfield, 1993), which includes two interrelated activities: professional training and a code of practice (Davies, 1997).

After reviewing the major developments of testing research in the last two decades of the 20th century, Bachman (2000) also suggested some future directions for testing research in the 21st century. He believes that “there are two areas in which language testing and language testers must continue to grow and develop: the professionalization of the field, and validation research. However, rather than being two disparate directions,…these are two virtually related areas that lie on the same path” (Bachman, 2000, p. 18).

According to Bachman (2000), the professionalization of language testing has two major thrusts: (1) the training of language testing professionals; and (2) the development of standards of practice and mechanisms for their implementation and enforcement. Bachman (2000) further argues that “we will need not only to develop standards of professional competence in language assessment, but also to become more active advocates for the inclusion of such standards in the standards for the training and certification of language teachers” (p. 20).

In regard to validation research, Bachman (2000) believes that the research of the past decade into factors and processes that affect language test performance and test scores will continue to blossom. In addition, “the debate over methodological issues has … moved from an overly simplistic view of the incompatibility of quantitative and qualitative approaches to a greater appreciation of their complementarity and of the necessity for including a range of approaches in our research agendas” (Bachman, 2000, p. 22).

In concluding his article, Bachman (2000) reiterates that the professionalization of the field and validation research will continue to be vital to language testing. Bachman (2000) believes that:

Language testing will grow as a profession in the twenty-first century to the extent that it effectively marshals the resources at its disposal to continue to vigorously investigate the validity of the inferences we make on the basis of test scores and the fairness of the uses we make of these scores. Validity and fairness are issues that are at the heart of how we define ourselves as professionals, not only as language tester, but also as applied linguists. (p. 25)

Having reviewed Bachman’s (2000) overview of testing research, I think the present study fits well into the future directions he identifies. On the one hand, my research focus on teachers’ test-constructing processes is in line with research into “the professionalization of the field.” On the other hand, my concern with students’ test-taking processes contributes to “validation research.” It is against this backdrop that the present study unfolds.

Studies on Students’ Test-taking Process

In this section, we will review several verbal report studies on students’ test-taking process in reading tests, the focus of the current study.

Early Attempts

Cohen (1984) is one of the early efforts to examine the test-taking process by using verbal report data. The main purpose of Cohen (1984), which described the results of five student course papers, was to discuss methods for obtaining verbal report data on L2 test-taking strategies and to report on some types of the findings obtained. The verbal report methods explored in Cohen (1984) included think-aloud and self-observation (i.e., introspection, immediate retrospection, and delayed retrospection). The data obtained concerned EFL and ESL students’ test-taking strategies on cloze tests and multiple-choice reading comprehension tests. The results, in general, showed that not all of the students read the entire cloze passage or reading passage before answering the test items although they were requested to do so. In terms of cloze tests, it was found that some students did not use the context to find clues for filling in the blank, and that students would use the strategy of translation in doing cloze tests. Moreover, when not knowing how to fill in a blank, poor students would leave it blank, and better students would make guesses. In terms of multiple-choice tests, students reported either reading the questions first or reading just part of the article and then looking for the corresponding questions. Moreover, students would use a strategy of matching material from the passage with material in the item stem and in the alternatives. In sum, Cohen (1984) demonstrated some ways in which verbal report data can be obtained, and the paper concluded that “there is value in striving for a closer fit between how test constructors intend for their tests to be taken and how respondents actually take them” (Cohen, 1984, p. 79).

Following Cohen (1984), we will first review other verbal report studies on multiple-choice reading comprehension tests (e.g., Nevo, 1989; Anderson et al., 1991; Rupp, Ferne & Choi, 2006), and then review those on cloze tests (e.g., Storey, 1997; Sasaki, 2000; Yamashita, 2003; Moghaddam, 2010).

Studies on Multiple-choice Reading Comprehension Tests

Nevo (1989) examined students’ test-taking strategies on a multiple-choice reading comprehension test by adopting the methods of immediate introspective verbal report and retrospective report. Forty-two Hebrew-speaking students studying French participated in the study, and they were asked to complete a multiple-choice test on four reading passages (two in Hebrew and the other two in French). An innovation of Nevo (1989) is that a checklist of fifteen strategies was provided for students to facilitate their reporting of strategy use after completing each item of the test. The results showed that there was a transfer of strategies from L1 (Hebrew) to L2 (French), and that the most frequently used strategies in both languages were returning to the passage and using clues in the text. It was also found that students used more strategies that did not lead to the correct answer in their L2 than in their L1. Finally, the major contribution of Nevo (1989) to the verbal report method is to show that, by providing a checklist, it is possible to obtain feedback from students about their strategy use on an item-by-item basis.

Anderson et al. (1991) presented the results of an exploratory study that examined three types of information (test-taking strategies, item content, and item performance) in an investigation into the construct validity of a reading comprehension test. The participants were twenty-eight Spanish-speaking students, and they were asked to produce retrospective think-aloud protocols while taking an English reading comprehension test, which contained forty-five multiple-choice questions. The results were as follows. First, there was a statistically significant association between students’ reported strategies and the three question types determined by the test developers. Second, students’ strategy use was significantly related to item difficulty and to item discrimination.
More specifically, five strategies were worthy of note in the study. First, the strategy of stating failure to understand occurred more frequently on inference test items, was used fewer times on easy items, and was used more times on items that discriminated well among those students who scored high on the test. Second, paraphrasing occurred more frequently on items asking students to identify the direct statement of the passage, and was used more times on items classified as acceptable in terms of discrimination. Third, guessing was reported more times on inference items, fewer times on easy items, and occurred about as often on acceptable and rejected items in terms of item discrimination. Fourth, matching the stem with a previous portion of the text was reported fewer times on items directed at identifying the main idea, reported fewer times on easy items, and reported fewer times on acceptable items. Fifth, making references to time allocations was reported fewer times on inference questions and more times on acceptable items. This study has thus shown us the value of test-taking protocols, along with other data sources, in the investigation of the construct validity of a reading comprehension test.

Rupp, Ferne, and Choi (2006) examined test-takers’ use of strategies on a multiple-choice reading comprehension test, with the purpose of investigating the equivalence of reading processes and strategy use in testing and non-testing reading conditions. The participants were ten adult ESL learners, who were first asked to verbally report their test-taking process in a semi-structured interview and then were asked to do concurrent think-aloud. The results showed that reading processes in a testing condition were strikingly different from those in a non-testing context. Moreover, the construct of reading comprehension was shown to be assessment specific and was fundamentally determined through item design and text selection. In terms of learner strategies, the study presented three findings.
First, learners viewed responding to multiple-choice questions as a problem-solving task rather than a comprehension task. Second, learners selected a variety of unconditional and conditional response strategies to deliberately select choices. Third, learners combined a variety of mental resources interactively when determining an appropriate choice. In sum, the authors concluded that their findings support the development of response process models that are specific to different item types, the design of further experimental studies of test method effects on response processes, and the development of questionnaires that profile response processes and strategies specific to different item types.

Studies on Cloze Tests

Storey (1997), employing the methods of concurrent think-aloud and immediate retrospection, investigated twenty-five Hong Kong EFL students’ test-taking process in a 13-item, multiple-choice, discourse cloze test. The purpose of the study was to provide introspective validation of the testing technique and the test items by assessing observed test-taking behavior against a predicted model of ideal performance. The results revealed that different items entailed varying degrees of construct validity. Some students were found to have used theoretically expected reading processes, while others merely considered information at the within-sentence level. Although there was a mismatch between the theoretically assumed processes and the actual processes applied by some test-takers (such as use of the strategies of elimination and surface matching), the items were capable of generating construct-relevant processing, and the test was judged to have a good degree of construct validity.

Sasaki (2000) investigated how content schemata activated by culturally familiar words might have influenced students’ test-taking processes in a cloze test. Sixty Japanese EFL students were divided into two groups, each completing either a culturally familiar or an unfamiliar version of a cloze test. The participants were asked to produce immediate retrospective protocols while taking the test, and then to recall the passage after they had completed the whole test.
The results showed that those who read the culturally familiar cloze text tried to solve more items and generally understood the text better, which resulted in better performances than those of the students who read the unfamiliar text. The paper concluded that it had demonstrated the merits of using multiple data sources for investigating students’ test-taking processes, and that the results also supported the claim that cloze tests can measure higher-order processing abilities.

Replicating Sasaki’s (2000) experiment, Moghaddam (2010) examined the effects of cultural schemata on Iranian students’ test-taking processes in a cloze test. The participants were 116 Iranian university students, who were divided into two groups, each completing either a culturally familiar or a culturally unfamiliar version of a cloze test. They were asked to produce retrospective protocols of their test-taking process and recalls of the cloze passage. Similar to the findings of Sasaki (2000), the results of Moghaddam (2010) showed that students who read the culturally familiar cloze text generally understood the text to a greater extent and obtained higher scores than those who read the unfamiliar text. Both Sasaki (2000) and Moghaddam (2010) suggested that cultural schemata have a certain effect on students’ test-taking processes in cloze tests.

Yamashita (2003) compared skilled and less skilled readers in their processes of taking a gap-filling cloze test. Twelve Japanese EFL students (six skilled and six less skilled) were required to complete a 16-item gap-filling test while thinking aloud about their test-taking processes; afterward, they were interviewed informally by the researcher. The results demonstrated that both skilled and less skilled students used text-level information more frequently than other types of information (such as clause-level, sentence-level, and extra-textual information). However, the skilled readers used text-level information more frequently than the less skilled readers.
In sum, the gap-filling test generated processes that made readers utilize text-level constraints, and overall differentiated well between skilled and less skilled readers.

We have reviewed several studies on students’ test-taking processes or strategies so far. Although those studies were conducted for different purposes (e.g., validation, comparison of L1 and L2, comparison of skilled and less skilled readers, or the effects of cultural schemata), they all employed the method of verbal report. Clearly, verbal report has been widely utilized as a means of collecting qualitative data. As Sasaki (2000) aptly comments, “The product- and process-oriented data complemented each other, providing insights that could not have been gained in the absence of one or the other” (p. 107). In the present study, I will also use verbal report to examine Taiwanese students’ test-taking process in a reading test.

Studies on Teachers’ Test Construction

“Classroom teachers are in the front line of introducing students to formal learning, including assessment” (Leighton et al., 2010, p. 7). That is, the first test students take in class is usually made by their teachers, and it is also their classroom teachers who prepare them for the formal, large-scale tests. Therefore, teachers’ “assessment literacy” (Stiggins, 1991) is very important. According to Stiggins (1991), “teacher assessment literacy [emphasis in the original] is characterized by understanding what it takes to produce high-quality achievement data for both classroom and large-scale tests, scrutinizing achievement data and not accepting it at face value, and being sufficiently confident to ask questions about technical information and complicated summaries of test scores” (as cited in Leighton et al., 2010, p. 9). Although assessment literacy is important, many teachers do not seem to be equipped with a solid grounding in the basic knowledge of assessment principles or practices (Leighton et al., 2010). Many educators have also noted that, for teachers, producing good tests is a demanding task (Davidson & Lynch, 2002).
The general inadequacies of teachers’ knowledge of test-constructing skills are shown in the studies reviewed below.

Training in Teachers’ Test Construction

To begin with, Carter (1984), almost three decades ago, investigated teachers’ competence in test item development by asking them to identify and write specific items aimed at measuring particular reading skills (main idea, detail, inference, and prediction), and by interviewing them about their test-constructing perceptions and processes. The results showed that teachers had more difficulty in identifying and developing items tapping higher-level reading skills (i.e., inference and prediction) than in identifying and writing items to test lower-level cognitive skills (i.e., main idea and detail). The interview data suggested that teachers felt insecure about their knowledge of basic principles for item writing and that they might possess a limited repertoire of test-constructing skills. Based on these results, Carter (1984) argued for an emphasis on the testing course in preservice and inservice teacher education.

To equip teachers with test-construction principles and to improve their test-writing skills, many teacher education programs began to include language testing courses. Kathleen M. Bailey and James D. Brown have reported the results of two questionnaire surveys (Bailey & Brown, 1996; Brown & Bailey, 2008) of instructors of language testing courses worldwide. They found that the contents of the language testing courses were quite diversified, covering topics such as hands-on experiences, general topics, item analysis, descriptive statistics, test consistency, and test validity. Although these diverse topics could enhance teachers’ assessment literacy, some preservice or inservice teachers did not take the testing course either as an elective or as a requirement (Brown & Bailey, 2008). Consequently, it is possible that some classroom teachers might be ill-prepared to face the challenges of classroom assessment because they have not had the opportunity to learn how to do so.
Among the various topics a testing course might offer, “test-writing,” one of the