Doctoral Dissertation
Department of English, National Taiwan Normal University
(國立臺灣師範大學英語學系博士論文)

九十八年學測閱讀測驗考生作答策略之初探
Strategies in Response to the Reading Comprehension Items on the 2009 General Scholastic Ability English Test

Advisor: Dr. Wu-chang Chang
Graduate Student: Jui-chih Harriet Cheng

August, 2012

ABSTRACT (CHINESE)

This study investigates the response processes and strategy use of 12th-grade examinees on the reading comprehension subtest of the 2009 General Scholastic Ability English Test (GSAET). The analysis addresses the examinees' strategy use on the reading comprehension subtest in general, on different item types, and on items of different difficulty levels. The reading comprehension subtest of the 2009 GSAET consisted of four passages, each 248 to 298 words in length. Eighteen 12th graders (six each of high, average, and low proficiency) participated in the study within two days after the 2009 GSAET was administered. During data collection, the eighteen participants worked on the reading comprehension subtest of the same test paper and verbalized in detail their reading processes, their response process for each item, and the basis for each option they selected. The researcher adapted Cohen and Upton's (2006) coding rubrics for reading and test-taking strategies, coded the participants' verbal reports accordingly, and conducted further analysis.

The study found that, in responding to the sixteen reading comprehension items on the 2009 GSAET, the eighteen participants used a total of eighteen reading strategies and thirty-six test-taking strategies. The high achievers selected or discarded options on the basis of textual meaning on more than 80% of the items; the average achievers, on nearly 50% of the items; and the low achievers, on only 30% of the items.

In answering the sixteen reading comprehension items, most high achievers used consistent reading and response strategies across different item types, whereas the average and low achievers resorted to a variety of test-taking strategies. In answering questions on the main idea or purpose of a passage, most high achievers selected or discarded options directly on the basis of the gist of the passage. In answering questions on details, word meaning, and reference, most high achievers read a portion of the passage carefully in search of clues and then selected or discarded options on the basis of textual meaning; they rarely resorted to surface-level test-wiseness techniques.

In answering the most difficult item on the 2009 GSAET reading subtest (Q44), examinees had to integrate information across paragraphs and judge the truth of the four options. Most of the eighteen participants, after reading the question, went back and read a portion of the passage carefully in search of clues; however, only the high achievers actually selected or discarded options on the basis of textual meaning, while none of the average or low achievers did so. Unable to comprehend the passage and the meaning of the options, the average and low achievers used more test-taking strategies, most commonly drawing on background knowledge and surface-level test-wiseness techniques.

In answering one of the easiest items on the 2009 GSAET reading subtest (Q48), examinees had to read the question, read a portion of the passage carefully in search of clues, and then determine the referent of a noun on the basis of textual meaning. The participants used relatively few response strategies, and all eighteen answered the item correctly. However, among the average and low achievers, the strategy of determining the referent through textual meaning was used far less often than surface-level test-wiseness techniques. Half of the average achievers and all of the low achievers did not determine the referent through textual meaning at all; instead, they directly applied test-wiseness techniques and selected the option containing key words from the passage. Judging from the participants' response strategies, the design of this item does not seem to measure examinees' ability to identify a noun's referent effectively.

The study found that examinees of different proficiency levels processed the passages and items of the 2009 GSAET reading subtest rather differently. The high achievers already possessed the English ability the test was intended to measure and were thus able to read the passages efficiently and answer most of the items successfully. The low achievers had not yet reached the required level of English ability and could not comprehend the passages or the meaning of most options, and therefore relied more heavily on surface-level test-wiseness techniques.

The findings offer English teachers in Taiwan and the College Entrance Examination Center a reference for future item writing and review. In item design, items whose correct option contains key words from the passage appear to be easier than items whose correct option does not; items whose distractors contain key words from the passage but state something untrue appear to be more challenging than items whose distractors do not. Item writers should avoid items whose correct options or distractors can be judged true or false from examinees' background knowledge alone, and should attend to the ordering of items by difficulty, placing easier items before more difficult ones to give examinees a sense of accomplishment and enhance their motivation to read.

The findings also offer implications for English teaching. English teachers can strengthen students' basic language ability and their use of reading strategies to help average and low achievers overcome difficulties in reading, and should select a wide range of texts appropriate to students' proficiency levels for extensive reading, so that students improve their reading ability through reading.

Keywords: General Scholastic Ability English Test, reading comprehension test, test-taking strategies, reading strategies, high achievers, average achievers, low achievers, item difficulty, test validity, foreign language assessment.

ABSTRACT

This study describes the response strategies that test takers used on the reading comprehension subtest of the 2009 General Scholastic Ability English Test (GSAET). The investigation focused on the examinees' response strategies for: (1) the 16 reading comprehension items in general; (2) 7 types of reading comprehension items; and (3) the most and least challenging test items. Verbal report data were collected from 18 12th graders across proficiency levels, i.e., 6 high achievers, 6 average achievers, and 6 low achievers. The participants worked on the reading comprehension subtest of the 2009 GSAET, which contained four 248-298 word passages, each followed by four items. The participants' verbal reports were coded for strategy use based on modified versions of Cohen and Upton's (2006) coding rubrics for reading and test-taking strategies.

The participants used a total of 18 reading strategies and 36 test-taking strategies in response to the 16 reading comprehension items. The high achievers selected or discarded options through textual meaning 80% of the time; the average achievers, 50% of the time; and the low achievers, only 30% of the time. In response to the reading comprehension items, the high achievers generally showed a consistent pattern in their use of the main response strategies across different types of questions, whereas the average and low achievers resorted to a variety of test-taking strategies. In response to global questions, e.g., questions on the main idea or purpose of the passage, the high achievers generally selected the options through the overall meaning of the passage. In response to local questions, e.g., questions asking for vocabulary meaning, a specific referent, an inference, specific information, a cause-effect relationship, or details of the passage, the high achievers generally went back to the passage, read carefully for clues, and selected the options through vocabulary, sentence, or paragraph meaning. They generally selected or discarded options through textual meaning rather than test-wiseness strategies.

In response to the most challenging question, Q44, which required respondents to integrate information conveyed across paragraphs and judge the correctness of the statements in four options, most of the participants read the question and then went back to the passage, carefully reading a portion of it to look for clues. While the high achievers generally selected and discarded options through textual meaning, none of the average and low achievers selected the option through textual meaning. Showing difficulty comprehending the passage and wrestling with the meaning of the options, the average and low achievers used more test-taking strategies in response to this question. They relied more on their background knowledge and the key-word association strategy in their option selection.

In response to one of the least challenging items, Q48, all of the participants used relatively few response strategies and successfully selected the option. But the strategy of verifying the referent was used at a much lower frequency than the test-wiseness strategy of key-word matching among the average and low achievers. Half of the average achievers and all of the low achievers selected the option through the key-word matching strategy. The response strategies thus did not seem appropriate for the purpose of the item and provided weak evidence for theory-based validity.

This study showed that examinees of different proficiency levels processed the passages/tasks differently. The high achievers, whose English proficiency had reached the level required by the 2009 GSAET, were able to read the passages efficiently and completed most of the test items successfully. The low achievers, whose English proficiency had not reached the level needed to cope with most of the test items, wrestled with the meanings of the words in the passages/tasks and failed to process the passages/tasks globally.

The findings provide insights into the construction of L2 reading tests. They suggest that questions whose correct option contains key words from the passage are likely to be easier than questions whose correct option does not; questions with distracters containing words from the passage but describing something irrelevant to the passage are likely to be more challenging than those without; and questions with options involving statements that can be judged wrong from the examinees' background knowledge do not make attractive distracters. They also confirm the importance of sequencing test items by difficulty, with easy items preceding challenging ones. The findings also carry pedagogical implications, suggesting that L2 teachers may assist learners to cope with difficulties in reading by improving word-level competences and promoting the use of comprehension strategies across a range of texts appropriate to learners' proficiency levels.

Keywords: GSAET, reading comprehension, reading strategy, test-taking strategy, high achiever, average achiever, low achiever, item difficulty, validity, second language assessment.

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the assistance and support of the kind people around me.

First and foremost, I would like to express my utmost gratitude to Dr. Wu-chang Chang, my dissertation advisor, for his encouragement, guidance, and support in the development and completion of this study. I would also like to express my very great appreciation to my committee members, Dr. Ing Cherry Li, Dr. Chiou-lan Chern, Dr. Hsi-nan Yeh, and Dr. Hsueh-ying Yu, for their encouraging and constructive remarks on this research work.

I am heartily thankful to Dr. Yu-hwei Shih and Dr. Chun-yin Doris Chen, for providing a stimulating and fun environment for me to learn and grow through textbook writing, which motivated me to pursue doctoral research. I am also grateful to Dr. Wu-chang Chang, Dr. Ing Cherry Li, Dr. Chiou-lan Chern, Dr. Hsi-nan Yeh, Dr. Yuh-show Cheng, Dr. Hsueh-o Lin, Dr. Hao-jan Chen, Dr. Hsi-chin Chu, Dr. Miao-Hsia Chang, and Dr. Chyiruey Chen, for the many thought-provoking lectures and for all the assistance they provided.

I would like to extend my special thanks to all of the helpful participants in this research work, for their willingness to give their time generously and provide rich and valuable sources of data for analysis. I am particularly grateful to Ms. Yu-lan Chen, for her professional assistance with the coding of my data. I would also like to thank Ms. Chin-lan Kuo and Ms. Hui-chi Yang, for their assistance with the recruitment of participants.

I would like to thank Meg Li, Vincent Su, Beryl Li, Karen Tien, and Eric Lin at NTNU, and my friends and colleagues, especially Ms. Yu-jin Li, for their kindness, friendship, and emotional support. I would also like to thank Mr. Yiao-chung Chang, Principal of Song-shan High School of Commerce and Home Economics, for his support in my pursuit of professional growth.

Lastly, but most importantly, I would like to express my deepest appreciation to my family, for their love and unfailing support throughout my study. I am grateful to my parents and grandmother, for their love and caring for me and my family. I am thankful to my sister and aunt, for all the wonderful adventures they made with my daughter after school. I am thankful to my brother, for his inspiring words. I am particularly thankful to my husband, Peter, for his love, support, and encouragement at all times; and to my precious daughter, Ruby, for her laughter, sweet smile, and all the surprises she has brought to me. To them I dedicate this dissertation.

TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures

CHAPTER ONE  INTRODUCTION
    Background and Rationale
    Purposes of the Study
    Significance of the Study

CHAPTER TWO  LITERATURE REVIEW
    Validity in Assessment
        Downing's Framework for Test Development
        Weir's Framework for Language Test Development
    Research in Second Language Reading
        Reading Comprehension and Strategy Use
        Issues in Second Language Reading
    Research in Second Language Reading Assessment
        Construct of Second Language Reading Tests
        Verbal Report in Assessment
        Individual Differences in Strategy Use
        Item Difficulty in Assessment
    General Scholastic Ability English Test (GSAET) & Department Required English Test (DRET)
        Characteristics of GSAET & DRET
        Main Statistical Specifications on Item Analysis in GSAET & DRET
        Test Specifications of 2009 GSAET & 2009 DRET

CHAPTER THREE  RESEARCH METHODOLOGY
    Participants
    Instruments
        Reading Comprehension Subtest of the 2009 GSAET
        Vocabulary Levels Test: Test B
    Data Collection Procedures
    Data Analysis

CHAPTER FOUR  RESULTS AND DISCUSSION
    Strategy Use in Reading Comprehension Subtest of the 2009 GSAET
        Strategy Use in General
        Distribution of Response Strategies
            Response strategies among the high achievers
            Response strategies among the average achievers
            Response strategies among the low achievers
    Response Strategies for Different Item Types
        Response Strategies for Questions on Main Idea or Purpose
        Response Strategies for the Question on Vocabulary Meaning
        Response Strategies for Questions on a Specific Detail
        Response Strategies for the Question on Referent
        Response Strategies for Questions on Inference
        Response Strategies for Questions on Multiple Details
        Response Strategies for Questions on Cause-Effect Relationship
    Response Strategies for the Most/Least Challenging Items
        Response Strategies for the Most Challenging Items
        Response Strategies for the Least Challenging Items
        Passage with the Most Challenging Items
        Passage with the Least Challenging Items
        Perceived Difficulty of Test Items

CHAPTER FIVE  CONCLUSION
    Responses to Research Questions
        Response to Research Question 1
        Response to Research Question 2
        Response to Research Question 3
    Concluding Remarks
    Implications
    Limitations
    Suggestions for Future Studies

REFERENCES

Appendix A: Reading Comprehension Items on the 2009 GSAET
Appendix B: A Vocabulary Levels Test: Test B
Appendix C: Distribution of Responses on Reading Comprehension Items in 2009 GSAET
Appendix D: Item Facility & Item Discrimination on Reading Comprehension Items in 2009 GSAET
Appendix E: Letter of Invitation
Appendix F: Consent Form
Appendix G: Participants' Response to Reading Comprehension Items on the 2009 GSAET
Appendix H: Reading Strategies Coding Rubric (R)
Appendix I: Test-Management Strategies Coding Rubric (T)
Appendix J: Test-Wiseness Strategies Coding Rubric (TW)

List of Tables

Table 1. Twelve Steps for Effective Test Development
Table 2. Types of Reading
Table 3. Test Specifications of 2009 GSAET & DRET
Table 4. Test Specifications of Reading Comprehension Subtests of 2009 GSAET & DRET
Table 5. Participants' Profiles
Table 6. Texts and Test Items in Reading Comprehension Subtest of 2009 GSAET
Table 7. Revised Reading Strategies Coding Rubric (R)
Table 8. Revised Test-Management Strategies Coding Rubric (T)
Table 9. Revised Test-Wiseness Strategies Coding Rubric (TW)
Table 10. Uses of Abbreviations and Font Styles in Examples of Verbal Reports
Table 11. Reading Strategies Used by Participants across Proficiency Levels
Table 12. Test-taking Strategies Used by Participants across Proficiency Levels
Table 13. Strategies Responding to the Reading Comprehension Items on the 2009 GSAET
Table 14. Strategies Responding to Reading Comprehension Items among High Achievers
Table 15. Strategies Responding to Reading Comprehension Items among Average Achievers
Table 16. Strategies Responding to Reading Comprehension Items among Low Achievers
Table 17. Response Strategies for Questions on Main Idea or Purpose
Table 18. Response Strategies for the Question on Vocabulary Meaning
Table 19. Response Strategies for Questions on a Specific Detail
Table 20. Response Strategies for the Question on Referent
Table 21. Response Strategies for Questions on Inference
Table 22. Response Strategies for Questions on Multiple Details
Table 23. Response Strategies for Questions on Cause-Effect Relationship
Table 24. Most Challenging Items in the Reading Comprehension Subtest
Table 25. Least Challenging Items in the Reading Comprehension Subtest
Table 26. Response Strategies for the Most Challenging Question (Q44)
Table 27. Response Strategies for the Second Most Challenging Question (Q49)
Table 28. Response Strategies for the Least Challenging Question (Q46)
Table 29. Response Strategies for the Third Least Challenging Question (Q45)

List of Figures

Figure 1. A Socio-cognitive Framework for Validating Reading Skills
Figure 2. A Heuristic for Thinking about Reading Comprehension
Figure 3. Urquhart and Weir's Model of the Reading Process

CHAPTER ONE
INTRODUCTION

As a high school English teacher in Taiwan, I have long been interested in the topic of language assessment as well as its influence on learning. I have witnessed how students go through hundreds or thousands of tests during three years of learning. Concerned about the impact that each test might have on learners, I have always been cautious about each of the test items I construct. From pop quizzes, mid-terms, and finals to large-scale mock exams, I have attempted to construct each test item for a purpose and have always been concerned about the issue of construct validity. Despite all the planning and thought that go into constructing a test, there is always room for improvement after each test is administered. Over the years, I have learned from students' feedback that tests, if well constructed, can mean much more than a necessary evil. They can be encouraging and educational in students' learning rather than simply a means of measurement.

In Taiwan, the General Scholastic Ability English Test (GSAET) and the Department Required English Test (DRET) are held annually for 12th graders for the purpose of college admission. After each of these high-stakes tests is held, there are voices concerning the validity of the test design and test scores. There are always disputes about the proportion of difficult and easy items in the test. Some teachers hold the view that questions should be made difficult so that high achievers might not feel "disfavored" or "sacrificed" in the test. Some teachers prefer a low proportion of difficult questions so as to encourage the average and the low achievers. Most of the voices, however, come from teachers and test designers. Wouldn't it be nice if we could listen to the voices of the examinees as well?

The present study thus attempts to investigate 12th graders' use of reading and test-taking strategies in response to the reading comprehension items in one of the high-stakes tests, i.e., the 2009 General Scholastic Ability English Test (College Entrance Examination Center [CEEC], 2009a), as a process of construct validation of the test. Drawing on the verbal report data of a sample of examinees across proficiency levels, we will gain a better understanding of the examinees' mental processes and their strategy use in response to the reading comprehension subtest of the test. We will also investigate the examinees' response strategies to different items and further discuss the appropriateness of test content and item design in the test. In-depth interviews with the examinees are also conducted to triangulate with the interpretation of test scores.

Background and Rationale

In Taiwan, the General Scholastic Ability English Test and the Department Required English Test are held annually for 12th graders for the purpose of college admission. Both tests are intended to measure the examinees' scholastic or academic achievement in English in a well-established high school curriculum (CEEC, 2007a, 2007b, 2011a, 2011b). Both tests are high-stakes tests and require systematic and comprehensive validity studies for effective test development. In both tests, the examinees' reading ability is assessed in the format of multiple-choice items. While descriptive statistics of item responses (e.g., item facility and discrimination) are publicly reported and used as one source of validity evidence (e.g., CEEC, 2009b, 2009c), most other validity studies on both tests conducted by the test developers are not publicly displayed for security reasons.

To validate the interpretations of test scores, test developers need to conduct both quantitative and qualitative analyses (Bachman, 1990; Brown, 2005; Fulcher and Davidson, 2007; Weir, 2005). As Weir states, "Validation can be seen as a form of evaluation where a variety of quantitative and qualitative methodologies are used to generate evidence to support inferences from test scores" (Weir, 2005: 15). Beyond statistical reports of test scores, qualitative analysis of the examinees' performance also provides valuable insights into the evaluation and construction of appropriate tests.

Research has shown that understanding reader purposes and the mental processes involved in reading is crucial in foreign language assessment (Afflerbach, 2000; Alderson, 2000, 2005; Cohen & Upton, 2006, 2007; Enright, Grabe, Koda, Mosenthal, Mulcahy-Ernt, & Schedl, 2000; Green, 1998). As Alderson notes, "The importance of a detailed record of a learner's responses would also surely strike a chord among researchers of foreign language learning in the twenty-first century" (2005: 24). Qualitative analysis of the more proficient examinees' test-taking processes allows us to explore the nature of the test (e.g., what reading skills and strategies are used; whether the examinees choose the right answer based on expected reading skills and strategies rather than construct-irrelevant reasons). Qualitative analysis of the examinees' test-taking processes across proficiency levels allows us to explore the nature of language learning (e.g., what makes reading difficult; what learning obstacles are experienced by slow learners; how learners of different proficiency levels process the task; what teachers can do to help learners cope with difficulties in reading).

Purposes of the Study

The General Scholastic Ability English Test is a high-stakes test intended to measure the examinees' academic performance in English. The results of the test are generally used as a key criterion for college admission. Therefore, the demand for test validity is undoubtedly high. While the College Entrance Examination Center provides descriptive statistics of item responses (e.g., item facility and item discrimination) as one source of validity evidence, most other validity studies on the test are not publicly displayed for security reasons. To validate test construction, we need both quantitative and qualitative analyses to generate evidence to support inferences from test scores. Research has shown that understanding reader purposes and the mental processes involved in reading is crucial in foreign language assessment. Qualitative analysis of the test-taking processes among the more proficient examinees allows us to explore the nature of the test. Qualitative analysis of the examinees' test-taking processes across proficiency levels allows us to explore the nature of language learning.

The purposes of this study are threefold. First, this study aims to investigate the test-taking processes, especially the use of reading and test-taking strategies, among the 12th graders across proficiency levels in the reading comprehension subtest of the 2009 GSAET. Second, this study aims to explore the test-taking processes, especially the use of reading and test-taking strategies, on different question/item types among the 12th graders across proficiency levels, as part of the process of construct validation of the 2009 GSAET. Third, this study aims to investigate the test-taking processes, especially the use of reading and test-taking strategies, on the most and the least challenging questions among the 12th graders across proficiency levels, so as to identify factors that contribute to the difficulty of reading comprehension items.

This study will address the following research questions:

1. What are the strategies used by the 12th graders across proficiency levels in response to the reading comprehension items on the 2009 GSAET?
2. What are the strategies used by the 12th graders across proficiency levels in response to different item types on the reading comprehension subtest of the 2009 GSAET?
3. What are the strategies used by the 12th graders across proficiency levels in response to the most and the least challenging items on the reading comprehension subtest of the 2009 GSAET?

Significance of the Study

The findings of this study will contribute to research in second language reading assessment and learning. The quantitative analysis of the examinees' strategy use presents the most prevalent strategies used by examinees across proficiency levels in response to the reading comprehension items of a high-stakes national test, the 2009 General Scholastic Ability English Test. The qualitative analysis of the examinees' response strategies aids test developers in understanding the mental processes that respondents actually go through while taking the test, which provides valuable insights into the evaluation and construction of appropriate tests. The qualitative analysis of the examinees' test-taking processes across proficiency levels also aids language teachers in exploring the nature of language learning. With an understanding of what obstacles learners come across while reading, language teachers may assist learners to cope with difficulties in reading more efficiently.
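Since item facility and item discrimination are cited throughout this chapter as the item statistics the CEEC reports publicly, a conventional formulation of the two is sketched below for reference. This is a generic textbook definition, not the CEEC's published specification, and the proportion used to form the high- and low-scoring groups is an assumption here.

% A minimal sketch of the two item statistics; the grouping proportion is an assumption
\[
\mathrm{IF}_i = \frac{C_i}{N}, \qquad
\mathrm{ID}_i = \mathrm{IF}_i^{H} - \mathrm{IF}_i^{L}
\]

where \(C_i\) is the number of examinees answering item \(i\) correctly, \(N\) is the total number of examinees, and \(\mathrm{IF}_i^{H}\) and \(\mathrm{IF}_i^{L}\) are the facilities of item \(i\) computed within the highest- and lowest-scoring groups (often the top and bottom 25%-33% of examinees ranked by total score; the exact proportion varies by testing program). On this formulation, an easy item has an IF near 1, and an item that separates high scorers from low scorers well has a large positive ID.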

CHAPTER TWO
LITERATURE REVIEW

This chapter presents an overview of research relevant to the current study. Since this study aimed to investigate the test-taking processes and response strategies among the 12th graders across proficiency levels in the reading comprehension subtest of the 2009 GSAET, we will first address the most important issue of all in language testing: validity. Next, we will focus on research in second language reading and in second language reading assessment. Research in second language reading involves issues in reading comprehension, reading strategies, and second language reading. Research in second language reading assessment involves theoretical accounts of second language reading assessment, the use of verbal report data in assessment, strategy use in reading tests, and item difficulty in reading assessment. Lastly, we will briefly describe the construct of the two high-stakes tests in Taiwan held for the 12th graders for the purpose of college admission: the General Scholastic Ability English Test (GSAET) and the Department Required English Test (DRET).

Validity in Assessment

Validity is placed at the center of psychological, educational, and social testing (Alderson, Clapham, & Wall, 1995; American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999; Bachman, 1990; Bachman and Palmer, 1996; Cronbach, 1988, 1989; Cronbach and Meehl, 1955; Downing, 2006; Fulcher and Davidson, 2007; Haladyna, 2004, 2006; Linn, 2006; Messick, 1989, 1995; Weir, 2005). According to the Standards for Educational and Psychological Testing, validity is defined as "the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of test scores" and is thus the most fundamental consideration in test development and evaluation (AERA, APA, & NCME, 1999: 9). Different types of validity studies generate validity evidence to support the intended test score interpretations and uses (Bachman, 1990; Messick, 1995; Haladyna, 2006; Weir, 2005), and "none of these by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores" (Bachman, 1990: 237). The higher the stakes associated with test scores, the greater the concern for validity (Downing, 2006; Linn, 2006; Weir, 2005).

Downing's Framework for Test Development

Based on a comprehensive review of research literature and empirical practice, Downing (2006) presents a twelve-step process for effective and efficient test development, as shown in Table 1. In this framework, each step consists of a series of "interrelated activities," which provide multiple sources of "validity evidence" for a testing program. As Downing notes, the "intensity" of each activity depends on "the type of test under development, the test's purposes and its intended inferences, the stakes associated with the test scores, the resources and technical training of the test developers" (p. 4).

Table 1
Twelve Steps for Effective Test Development

1. Overall plan: Systematic guidance for all test development activities: construct; desired test interpretation; test format(s); major sources of validity evidence; clear purpose; desired inferences; psychometric model; timeliness; security; quality control
2. Content definition: Sampling plan for domain/universe; various methods related to purpose of assessment; essential source of content-related validity evidence; delineation of construct
3. Test specifications: Operational definitions of content; framework for validity evidence related to systematic, defensible sampling of content domain; norm or criterion referenced; desired item characteristics
4. Item development: Development of effective stimuli; formats; validity evidence related to adherence to evidence-based principles; training of item writers, reviewers; effective item editing; CIV owing to flaws
5. Test design and assembly: Designing and creating test forms; selecting items for specified test forms; operational sampling by planned blueprint; pretesting considerations
6. Test production: Publishing activities; printing or CBT packaging; security issues; validity issues concerned with quality control
7. Test administration: Validity issues concerned with standardization; ADA issues; proctoring; security issues; timing issues
8. Scoring test responses: Validity issues; quality control; key validation; item analysis
9. Passing scores: Establishing defensible passing scores; relative vs. absolute; validity issues concerning cut scores; comparability of standards; maintaining constancy of score scale (equating, linking)
10. Reporting test results: Validity issues: accuracy, quality control; timely; meaningful; misuse issues; challenges; retakes
11. Item banking: Security issues; usefulness, flexibility; principles for effective item banking
12. Test technical report: Systematic, thorough, detailed documentation of validity evidence; 12-step organization; recommendations

Note. CIV = construct-irrelevant variance; ADA = Americans with Disabilities Act; CBT = computer-based testing. From Twelve steps for effective test development (p. 5), by S. M. Downing, 2006, in S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates.

In Downing's framework, purposes are crucial to the development of an overall plan for the test. They are critical in determining test content, in selecting the format of test items, and in guiding the evaluation of test uses and interpretations of scores. Once the purposes of the test are clarified, the test content should then be precisely specified. Clear content specifications "specify the number or proportion of items that assess each content and process/skill area; the format of items, responses, and scoring rubrics and procedures; and the desired psychometric properties of the items and test such as the distribution of item difficulty and discrimination indices" (AERA, APA, & NCME, 1999, p. 183).

As Downing notes, appropriate identification of test content facilitates effective test items, which are designed to "measure important content at an appropriate cognitive level." Downing emphasizes the importance of effective item creation and describes it as "more art than science." In many large-scale high-stakes tests, the multiple-choice format is widely used because multiple-choice items can be administered in a short time and the test takers' responses can be objectively and efficiently scored. Despite research evidence for the principles of writing effective multiple-choice items (Alderson, Clapham, & Wall, 1995; Haladyna, 2004; Haladyna & Downing, 1989, 2004; Haladyna, Downing, & Rodriguez, 2002), the creation of effective test items remains a challenge for test developers. As Downing states:

…The principles of writing effective, objectively scored multiple-choice items are well-established and many of these principles have a solid basis in the research literature… Yet, knowing the principles of effective item writing is no guarantee of an item writer's ability to actually produce effective test questions. Knowing is not necessarily doing. Thus, one of the more important validity issues associated with test development concerns the selection and training of item writers… The most essential characteristic of an effective item writer is content expertise… many other item writer characteristics such as regional geographic balance, content subspecialization, and racial, ethnic, and gender balance must be considered in the selection of item writers. (Downing, 2006, p. 11)

Weir's Framework for Language Test Development

Based on current developments in theory and practice, Weir (2005) presents a coherent "evidence-based validity" framework for language test design and implementation, which involves providing evidence relating to context validity, theory-based validity, criterion-based validity, scoring validity, and consequential validity. According to Weir, the framework is specifically designed for English for Speakers of Other Languages (ESOL) but also applies to all forms of educational assessment. In the construction of tests, test developers must provide all five types of validity evidence to "justify the correctness of our interpretations of abilities from test scores" (p. 2). Among the five types of evidence, context validity and theory-based validity evidence, which are collected before the test event, are concerned with what abilities a test is intended to measure and how the choice of tasks in a test is representative of the abilities required in "real life language use." The other three types of validity evidence (i.e., scoring validity, criterion-based validity, and consequential validity), which are generated after the test has been administered, are concerned with the reliability of test scores, the extent to which test scores correlate with external criteria of real-life performance, and the consequences of test use for test stakeholders: learners, teachers, parents, government and official bodies, and the marketplace.

Weir further describes in detail a socio-cognitive framework specifically designed for validating reading tests, which is presented as a flowchart of boxes, from test taker characteristics, theory-based validity, and context validity, to scoring validity, consequential validity, and criterion-related validity. As shown in Figure 1, Weir's framework provides us with insights into what type of evidence can be collected at different stages of reading test construction and how the different types of validity evidence fit together.

Figure 1. A Socio-cognitive Framework for Validating Reading Skills
From Language testing and validation: An evidence-based approach (p. 44), by C. J. Weir, 2005. New York: Palgrave Macmillan.

Research in Second Language Reading

This section provides an overview of research in second language reading. We will begin with theoretical accounts of reading comprehension and reading strategies, which pave the way for a later review of research in second language reading.

Reading Comprehension and Strategy Use

Reading comprehension has been discussed from a number of perspectives. In a review of reading comprehension research, Pressley (2000) summarizes that comprehension depends on a number of lower-order processes (e.g., skilled decoding of words) and higher-order processes (e.g., relating text content to background knowledge; use of comprehension strategies). As Pressley notes, reading comprehension "begins with decoding of words, processing of those words in relation to one another to understand the many small ideas in the text, and then, both unconsciously and consciously, operating on the ideas in the text to construct the overall meaning encoded in the text" (p. 551). Along with previous research, Pressley confirms that accurate and fluent (automatic) word recognition is a prerequisite for reading comprehension (Carver, 1997; LaBerge & Samuels, 1974; Perfetti, 1997; Pressley, 1998, 2000). During the process of reading, language comprehension processes interact with higher-level processes. Readers may automatically relate text content to prior knowledge and/or consciously activate comprehension strategies. When readers' activation of schematic knowledge is relevant to the information in the text, reading is successful. While good readers typically make inferences based on prior knowledge directly relevant to the ideas in the text, poor readers make "unwarranted and unnecessary" inferences by drawing on prior knowledge not directly relevant to the most important ideas in the text (Anderson & Pearson, 1984; Hudson, 1990; Hudson & Nelson, 1983; Rosenblatt, 1978; Williams, 1993).

In terms of strategy use, good readers use a variety of strategies, including being aware of reading purposes, overviewing the text, reading selectively, making associations, evaluating and revising hypotheses, revising prior knowledge, figuring out unknown words in text, underlining and making notes, interpreting text, evaluating the text, reviewing the text, and using the information in the text (Pressley & Afflerbach, 1995). Given the importance of both lower-order and higher-order processes in reading comprehension, Pressley further suggests that teachers promote learners' comprehension abilities by improving word-level competences, building background knowledge, and promoting the use of comprehension strategies.

In developing a proposed research agenda for reading comprehension, the RAND Reading Study Group (2002) defines reading comprehension as "the process of simultaneously extracting and constructing meaning through interaction and involvement with written language" (p. 11). According to the proposal, three key elements are essential in reading comprehension: the reader, the text, and the activity (e.g., purpose for reading, processes while reading, and consequences of reading). In reading comprehension, the three elements are interrelated within a larger sociocultural context which interacts with all three elements, as illustrated in Figure 2.

Figure 2. A Heuristic for Thinking about Reading Comprehension
From Reading for understanding: Toward a R&D program in reading comprehension (p. 12), by RAND Reading Study Group, 2002. Santa Monica, CA: Science and Technology Policy Institute, RAND Education.

In this framework, good readers have a wide range of capacities and abilities, including cognitive capacities (e.g., attention, memory, critical analytic ability, inferencing, visualization ability), motivation (e.g., a purpose for reading, an interest in the content being read, self-efficacy as a reader), and different types of knowledge (e.g., vocabulary, domain and topic knowledge, linguistic and discourse knowledge, knowledge of specific comprehension strategies). Before reading, readers have purposes in mind. While reading, they process the text with regard to those purposes. They construct various representations of the text that are important for comprehension, including the surface code (e.g., the exact wording of the text), the text base (e.g., idea units representing the meaning), and a representation of the mental models embedded in the text. Reading activities may have direct consequences in knowledge, application, and engagement, or other long-term consequences. In reading comprehension, all three key elements, the reader, the text, and the activity, are interrelated within a sociocultural context.

Issues in Second Language Reading

In the context of second language reading, research has stressed the interactive nature of bottom-up and top-down processing (Bernhardt, 1991; Carrell, Devine & Eskey, 1988). While reading, readers engage in both bottom-up and top-down processing. In bottom-up processing, readers "begin with the printed words, recognize graphic stimuli, decode them to sound, recognize words and decode meaning." In top-down processing, readers "activate what they consider to be relevant existing schemata, and map incoming information onto them" (Alderson, 2000, pp. 16-17). Drawing on an extensive review of research in reading comprehension, Alderson concludes that bottom-up and top-down approaches are both important in reading and "the balance between the two approaches is likely to vary with text, reader, and purpose" (p. 20). According to Alderson, the variables that affect the nature of reading are mainly "the interaction between reader and text variables in the process of reading" (p. 32). Reader variables include schemata and background knowledge, knowledge of language, knowledge of genre/text type, metalinguistic knowledge and metacognition, content schemata, knowledge of subject matter/topic, knowledge of the world, cultural knowledge, reader skills and abilities, reader motivation, reader affect, etc. Text variables include text topic and content, text type and genre, text organization, linguistic variables, text readability, typographical features, the medium of text presentation, etc.

Bernhardt and Kamil (1995) make a thorough review of research and claim that second language reading is an interaction of L1 reading ability and L2 linguistic knowledge (e.g., word knowledge and syntax). While L1 literacy accounts for 20% of the variance in L2 reading ability, L2 linguistic ability accounts for 30% of the variance (27% for word knowledge and 3% for syntax). A number of studies have confirmed the contribution of L1 to L2 reading development (Grabe, 2009; Guthrie, 1988; Koda, 2005; Rutherford, 1983). Koda (2005) argues that L1 processing experience influences the development of L2 reading skills. Grabe (2009) also suggests that L1 reading abilities such as metalinguistic awareness and basic cognitive skills are likely to transfer to L2 reading contexts. Alderson (1984, 2000) concludes from a number of studies that both L2 language knowledge and L1 reading knowledge are important factors in second language reading, with L2 language knowledge being a more powerful factor than L1 reading ability. He also confirms that a "linguistic threshold" exists and that L2 learners transfer their L1 reading ability to L2 reading contexts only when they reach a certain proficiency level. In other words, less proficient L2 learners need to improve their linguistic knowledge so as to engage themselves in L2 reading. Alderson (2000) further suggests that learners' linguistic threshold varies with task: "the more demanding the task, the higher the linguistic threshold."

Research in second language reading has shown that fluent word recognition, processing efficiency, and reading rate are vital in reading comprehension (Alderson, 2000; Bernhardt, 1991, 2000; Grabe, 1991; Grabe & Stoller, 2002; Koda, 1996, 1997). Insufficient linguistic knowledge constrains second language reading processes (Alderson, 2000; Bernhardt, 1991, 2000; Brisbois, 1995). Vocabulary difficulty has consistently been shown to have an effect on comprehension for L1 and L2 readers (Alderson, 2000; Carver, 1994; Freebody & Anderson, 1983; Hu & Nation, 2000; Laufer, 1992; Nation, 1990, 2001; Read, 2000; Williams & Dallas, 1984). Brisbois (1995) argues that L2 knowledge is critical in reading comprehension, especially among learners at the beginning levels. Other studies also suggest that insufficient vocabulary hinders L2 reading performance (Hu & Nation, 2000; Segalowitz, 1986; Segalowitz, Poulsen, & Komoda, 1991), and that lower-level processing predominates in the reading process among beginning L2 learners (Clarke, 1979; Horiba, 1993). Meanwhile, reading speed is related to fluency: with increased L2 proficiency, reading rate improves (Favreau & Segalowitz, 1982; Haynes & Carr, 1990) and error rate decreases (Bernhardt, 1991).

Research in Second Language Reading Assessment

This section provides an overview of research in second language reading assessment. We will first highlight the model proposed by Urquhart and Weir (1998) and Weir (2005) in the construct of second language reading tests. Next, we will address issues in assessment, including the use of verbal report in assessment, individual differences in strategy use, and item difficulty in reading assessment.

Construct of Second Language Reading Tests

In the context of second language reading assessment, how reading ability is defined affects the construct of a test. One prevalent perspective is to view reading as a set of comprehension processes (Alderson, 2000; Grabe, 1991, 1999, 2000; Grabe and Stoller, 2002; Urquhart & Weir, 1998; Weir, 2005) that can be broken down into reading skills and strategies for testing purposes (Urquhart & Weir, 1998; Weir, 2005). Based on Urquhart and Weir's model (1998), Weir (2005) develops a model of the reading process, as presented in Figure 3, to account for four types of reading, as shown in Table 2.

Figure 3. Urquhart and Weir's Model of the Reading Process
From Language testing and validation: An evidence-based approach (p. 92), by C. J. Weir, 2005. New York: Palgrave Macmillan. Adapted from Reading in a second language: Process, product and practice (p. 106), by A. H. Urquhart & C. J. Weir, 1998. Harlow: Longman.

(35) Table 2 Types of Reading Global Level. Local Level. Establishing accurate comprehension of explicitly stated main ideas and supporting details Making propositional inferences Expeditious Skimming quickly to establish: Discourse topic and main ideas, or Reading structure of text, or relevance to needs. Search reading to locate quickly and understand information relevant to predetermined needs. Careful Reading. Identifying lexis Understanding syntax. Scanning to locate specific points of information.. Note. From Language testing and validation: An evidence-based approach (p. 90), by C. J. Weir, 2005. New York: Palgrave Macmillan. It is adapted from Reading in a second language: Process, product and practice (p. 123), by A. H. Urquhart & C. J. Weir, 1998. Harlow: Longman.. In this model, Goalsetter and Monitor, which are metacognitive mechanisms, “mediate among different processing skills and knowledge sources available to a reader” and “enable a reader to activate different levels of strategies and skills to cope with different reading purposes” (Weir, 2005: 95-96). Once the test takers have clear purposes for reading, they choose the most appropriate strategies in response to the task demand. The higher the demand of a task is, the more components of the model are involved (Urquhart and Weir, 1998; Weir 2005). As illustrated in Table 2, the process of reading involves the use of different skills and strategies. According to Urquhart and Weir (1998) and Weir (2005), reading can be global or local comprehension. Global reading is comprehension beyond the 23.

(36) sentence level such as reading for main idea or important details, whereas local reading is comprehension within the sentence level, such as reading for word meaning or pronominal reference. In a reading test, the demand of a careful reading item at the global level is usually higher than that of a scanning item since the former requires the test taker to go through the whole text and activate all components of the model in use while the latter might just involve a few components. Weir (2005) further points out that test developers should consider the appropriateness of different questions and reading strategies for different types of texts. In a scanning test, for example, the text should provide sufficient and varied specific details for readers. In a careful reading test, the text should include enough main ideas or important points. In an inferencing test, the text should include pieces of information that can be linked together. In a skimming or search reading test, the text should have a clear organization and provide explicit ideas in the surface level. Verbal Report in Assessment Research in reading comprehension assessment has consistently recognized the importance of investigating the examinees’ cognitive processing, thought process, and strategy use through verbal report measures as part of the process of test validation (Afflerbach, 2007; Anderson, 1991; Anderson, Bachman, Perkins, & Cohen, 1991; Cheng, Fox, & Zheng, 2007; Cohen, 1984, 1988, 2000; Cohen & Upton, 2006, 2007; 24.

(37) Ericsson & Simon, 1993; Gass & Mackey, 2000; Green, 1998; Perkins, 1992; Phakiti, 2003; Pressley & Afflerbach, 1995; Urquhart & Weir, 1998; Weir, 2005; Weir, Yang, & Jin, 2000). Green (1998) defines verbal reports or verbal protocols as “the data gathered from an individual under special conditions, where the person is asked to either think aloud or to talk aloud” (p. 1). According to Green, verbal protocols may be gathered as the task is carried out concurrently (i.e., when the task is carried out) or retrospectively (i.e., after the task has been carried out). In either concurrent or retrospective verbal reports, the prompts given to the individual can be non-mediate (e.g., requests such as ‘keep talking’) or mediate (e.g., requests for explanations or justifications). Green provides a comprehensive and in-depth overview of the use of verbal protocols in language assessment and concludes that verbal protocol analysis has the potential to “elucidate the abilities that need to be measured, and also to provide a means for identifying relevant test methods and selecting appropriate test content” (p.120). Verbal protocol analysis is widely used to probe into the examinees’ use of reading and test taking strategies during the test (Alderson, 2005; Anderson, Bachman, Perkins, & Cohen, 1991; Cheng, Fox, & Zheng, 2007; Urquhart & Weir, 1998; Weir, 2005; Weir, Yang, & Jin, 2000). As Ellis (2004) states, “collecting verbal explanations…would appear, on the face of it, to provide the most valid measure of a 25.

(38) learner’s explicit knowledge” (p. 263). Weir (2005) suggests that test developers and teachers use verbal report measures to investigate the examinees’ mental process while taking a test. The analysis of verbal reports allows test developers and teachers to: (1) evaluate whether the test measures what it is intended to measure; (2) compare the use of reading skills and strategies between good and poor readers. Individual Differences in Strategy Use In a review of reading comprehension research, Perfetti (1997) claims that research on individual differences among readers is crucial to understanding the nature of reading abilities. In other words, if we want to understand the nature of reading comprehension, we need to know the sources of individual differences between good and poor readers. Perfetti suggests that good readers differ from poor readers in the following aspects: processing efficiencies (e.g., speed & automaticity of word recognition), word knowledge, processing efficiencies in working memory, fluency in syntactic parsing and proposition integration, and the development of an accurate and reasonably complete text model of comprehension. In the evaluation of second language reading assessment, Weir (2005) claims that when proficient readers process different reading tasks (e.g., skimming, scanning, search reading, careful reading) through skills and strategies appropriate to the purposes of the tasks, then the test measures what it is intended to measure and is 26.

(39) valid in terms of theory-based validity. On the contrary, if examinees successfully process the tasks through test taking strategies instead of applying appropriate reading skills and strategies, then the test does not measure what it is intended to measure and provides weak evidence for theory-based validity. According to Weir, typical test taking strategies include: (1) matching words in the question with the same words in the text; (2) using clues in other questions to answer the question under consideration; (3) using prior knowledge to answer the questions; (4) blind guessing not based on any particular rationale (p. 94). Pressley and Afflerbach (1995) classify reading strategies into three types: planning and identifying strategies, by which readers construct the text meaning; monitoring strategies, by which readers regulate comprehension and learning; and evaluating strategies, by which readers reflect or respond to the text. Research in second language learning has shown that readers use the same range of strategies to comprehend, interpret, and evaluate texts (Carrell & Grabe, 2002; Cohen & Upton, 2007; Upton, Lee-Thompson, & Li-Chun, 2001). Extensive studies have demonstrated that readers use their prior knowledge to determine the importance of information in the text and make inferences about the text. While good readers typically make inferences based on prior knowledge directly relevant to the ideas in the text, poor readers make inferences by drawing on prior 27.

(40) knowledge not directly relevant to the most important ideas in the text (Afflerbach, 1986, 2007; Anderson & Pearson, 1984; Hudson, 1990; Hudson & Nelson, 1983; Rosenblatt, 1978; Williams, 1993). Anderson (1991) examined L2 readers’ use of reading strategies across different proficiency levels. The results showed that good readers used significantly more strategies than poor readers but there was no difference in the number of different strategies used. L2 readers used similar types of strategies across proficiency levels and across tasks while reading and while taking a reading test. Anderson, Bachman, Perkins, and Cohen (1991) investigate the relationship among test taking strategies, item content evaluation, and item performance statistics. Sixty-five Spanish speakers enrolled in an ESL program were asked to respond to a reading test of 15 passages, each containing 44-135 words and followed by two to four multiple choice questions. The respondents were asked to make retrospective think-aloud protocols on the use of reading and test-taking strategies. Their study suggests that item type can have a considerable influence on the reading and test-taking strategies that are used and on how they are used. Cohen and Upton (2006) provide a detailed description of test takers’ use of reading and test-taking strategies in response to different item types in the reading section of the prototype of the new TOEFL. Thirty-two nonnative speakers of English 28.

(41) from four language groups, whose proficiency levels were placed at about the 75th percentile in relation to other examinees, were asked to read two 600-700 word passages with 12-13 items and then made verbal report on the use of strategies. Drawing on the respondents’ verbal reports, the researchers categorize the test takers’ response strategies into 28 reading strategies, 28 test-management strategies, and 3 test-wiseness strategies. The results reveal that the test takers, while approaching the test-taking task, view it as a test-taking task and aim to get the answer right instead of learning from the reading passages. Besides, the test takers tend to answer the questions through their understanding and interpretation of the passages, which further confirms that the success in the new TOEFL requires academic reading skills. Item Difficulty in Assessment Research has confirmed the importance of sequencing test items by difficulty, with less challenging items being presented prior to more challenging ones. The purpose is to support motivation and provide examinees with experiences of success on items early in the test (Afflerbach, 2007; Alderson, 2000, 2005; Enright, Grabe, Koda, Mosenthal, Mulcahy-Ernt, & Schedl, 2000; Urquhart & Weir, 1998; Weir, 2005). As is previously reviewed, research in native and second language reading comprehension has stressed the interaction between bottom-up and top-down 29.

As previously reviewed, research in native and second language reading comprehension has stressed the interaction between bottom-up and top-down processing, and the variables that affect the nature of reading lie mainly in the interaction between reader variables and text variables (Alderson, 2000). In the context of reading assessment, Grabe (2009) summarizes from theoretical accounts that the variables contributing to the difficulty of a task include text topic, text language, background knowledge, and task type. Based on an extensive review of the literature, Alderson (2000) concludes that the difficulty of a reading test depends on the interaction among the different characteristics of texts and items, including the language of the questions (e.g., the wording and frequency of the vocabulary used in items and options), the level of information required by a question (e.g., local or global), the relationship between the question and the required information (e.g., textually explicit or implicit questions, the location of the information, the type of question), text topic (e.g., the extent to which it engages background knowledge), text length, text structure, text wording, and the number of questions asked about a text.

Assessment tasks, items, and prompts can vary in difficulty. It is generally believed that the higher the cognitive and linguistic demands of an item, the more difficult the item is. Research has shown that items intended to test examinees' critical reading comprehension are more challenging since they require more integrative abilities and more processing time (Afflerbach, 2007). Items requiring careful reading at the global level are usually more challenging than scanning items, since the former require more cognitive and linguistic processing (Weir, 2005).

In the TOEFL monograph TOEFL 2000 Reading Framework: A Working Paper, the test developers provide a framework for test design that includes four academic reading purposes: reading to find information (e.g., search reading), reading for basic comprehension (e.g., understanding main ideas or the main theme), reading to learn (e.g., integrating information), and reading to integrate information across multiple texts. According to the test developers, items designed for reading to learn or reading to integrate information across texts are expected to be more challenging and to require more processing abilities than items designed for reading to find discrete information or reading for general comprehension (Enright, Grabe, Koda, Mosenthal, Mulcahy-Ernt, & Schedl, 2000). However, when Cohen and Upton (2007) examined L2 examinees' performance on three types of questions (i.e., basic comprehension, inferencing, and reading to learn), they found that reading-to-learn items, though requiring more time to complete, are not necessarily the most challenging. Of the three item types, the inferencing items were the most challenging (77% success rate), the basic comprehension items the next most challenging (83% success rate), and the reading-to-learn items the easiest (90% success rate).

Research in reading assessment has shown that L2 learners have more difficulty with longer items and with linguistically complex items (Abedi, Lord, & Plummer, 1997), and that linguistic modification reduces the gap in test scores between L1 and L2 readers (Abedi, Lord, Hofstetter, & Baker, 2000). In line with these studies, Alderson (2000) suggests that items should be written in simpler language than the text. Afflerbach (2007) likewise argues that the language of comprehension items should be comprehensible to the examinees, since the purpose is to evaluate their understanding of the reading text; items with complex sentence structures and relatively difficult vocabulary may confuse the examinees and reduce the effectiveness of the task. Cohen and Upton (2007) analyzed the verbal report data of examinees taking the new TOEFL and concluded that items with complex question intent or option meaning are more challenging. Freedle and Kostin (1993) also found that negations correlate with item difficulty: the more negations are used in the options, the harder the item is.

Text features also play a role in determining the difficulty of a reading task. Research has demonstrated that readers identify the main idea of coherent texts more easily than that of incoherent texts (Kintsch & Yarborough, 1982; Williams & Moran, 1989). Texts with abstract information are assumed to be cognitively and linguistically more complex, and therefore more difficult, than texts with concrete information (Enright, Grabe, Koda, Mosenthal, Mulcahy-Ernt, & Schedl, 2000; Freedle & Kostin, 1993; Weir, 2005). Syntactic complexity, such as competing NP arguments and long referential distance, may lower readers' processing efficiency and thus increase difficulty (Carpenter, Miyake, & Just, 1994; Just & Carpenter, 1992).

Freedle and Kostin (1993) also found that item difficulty correlates with factors such as negations, referential pronouns, rhetorical organizers, fronted structures, vocabulary, sentence length, paragraph length, number of paragraphs, concreteness, location of the main idea, passage length, and lexical overlap. In other words, items with more negations in the options are harder. Texts become harder when more referential pronouns overlap within the text clauses, or when the main idea appears in the middle of the passage. Texts with more fronted structures are harder than texts with fewer fronted structures. Texts with longer sentences or paragraphs are more difficult than texts with shorter ones, and the more paragraphs a text contains, the harder the text is. The more polysyllabic words a text uses, the harder it is. On the other hand, texts with more lexical overlap are easier than texts with less lexical overlap, and texts with more concrete information are easier than texts with more abstract ideas.
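Several of these correlates are simple surface properties that can be counted directly. The sketch below is a rough Python illustration, not Freedle and Kostin's actual operationalizations: it counts negations, mean sentence length, and an approximate polysyllabic-word ratio for a text, where the regular expressions and the vowel-group syllable heuristic are simplifying assumptions.

```python
import re

NEGATIONS = {"no", "not", "never", "none", "nothing", "neither", "nor"}

def surface_features(text: str) -> dict:
    """Count a few surface features associated with text difficulty.

    These are heuristic approximations, not validated measures.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Rough syllable proxy: a word with three or more vowel groups
    # is treated as polysyllabic.
    polysyllabic = sum(len(re.findall(r"[aeiouy]+", w)) >= 3 for w in words)
    return {
        "mean_sentence_length": len(words) / len(sentences),
        "negation_count": sum(w in NEGATIONS for w in words),
        "polysyllabic_ratio": polysyllabic / len(words),
    }

sample = ("The committee did not approve the proposal. "
          "Nothing in the documentation anticipated this objection.")
print(surface_features(sample))
```

On this view, a passage scoring higher on negations, sentence length, and polysyllabic ratio would be predicted, other things being equal, to be harder.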

uncommon words or idiomatic expressions are more challenging to L2 readers (Williams & Dallas, 1984). In a study with L1 readers, Carver (1994b) concludes that L1 readers are likely to encounter nearly 0% unknown words in easy reading and about 2% unknown words in difficult reading. Laufer (1992) presents evidence that L2 readers need to know at least 95% of the running words in a text for adequate comprehension. Hu and Nation (2000) find a predictable relationship between the percentage coverage of known words and comprehension: the higher the coverage of known words, the better the comprehension of a text. They further argue that L2 readers need to know 98% of the running words in a text for adequate comprehension and 95% for minimally acceptable comprehension, and that L2 readers do not comprehend texts with only 80% coverage of known words.

Research has shown that prior knowledge assists readers in constructing meaning from text (Anderson & Pearson, 1984; Bernhardt, 1991; Carver, 1994; Hudson, 1990; Hudson & Nelson, 1983; Rosenblatt, 1978; Urquhart & Weir, 1998; Williams, 1993). Accordingly, text content that is closely related to the readers' or the examinees' prior knowledge is assumed to be easier than content that is only remotely related to their prior knowledge (Afflerbach, 2007; Bernhardt, 1991; Clapham, 1996a, 1996b; Urquhart & Weir, 1998; Weir, 1990, 1993, 2005). In their review of studies on topic familiarity in tests, Urquhart and Weir (1998) conclude that the text topic in a
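The coverage figures from Laufer (1992) and Hu and Nation (2000) reviewed above lend themselves to a simple computation. The sketch below is a minimal Python illustration of counting known-word coverage over running words; the tokenization and the toy vocabulary are simplifying assumptions, and actual coverage studies typically count word families rather than raw tokens.

```python
import re

def known_word_coverage(text: str, known_words: set) -> float:
    """Percentage of running words (tokens) in the text known to the reader."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    known = sum(token in known_words for token in tokens)
    return 100.0 * known / len(tokens)

# Invented reader vocabulary and passage, for illustration only.
vocabulary = {"the", "cat", "sat", "on", "a", "mat", "and", "watched"}
passage = "The cat sat on a mat and watched the sparrows."

coverage = known_word_coverage(passage, vocabulary)
print(f"Known-word coverage: {coverage:.1f}%")  # 90.0%; "sparrows" is unknown.

# Benchmarks reviewed above: roughly 98% coverage for adequate comprehension,
# 95% for minimally acceptable comprehension, and comprehension failure at
# around 80% coverage (Laufer, 1992; Hu & Nation, 2000).
```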
