An Application of Latent Trait Theory to Developing and Validating a Vocabulary Size Test


Master's Thesis
Graduate Institute of English, National Taiwan Normal University

An Application of Latent Trait Theory to Developing and Validating a Vocabulary Size Test

Advisor: Dr. Wen-Ta Tseng (曾文鐽)
Graduate student: Ming-Hsiu Cheng (鄭名秀)

June 2014

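CHINESE ABSTRACT

Vocabulary size has always been closely tied to language learning, and it plays an important role in listening, speaking, reading and writing. Learning vocabulary requires constant review. To learn vocabulary efficiently, students need to know their current vocabulary size so that they can focus on reading and vocabulary materials that match their level. However, there is currently no English vocabulary size test designed for the needs of Taiwanese students. Based on the 6,840-word reference word list published by the College Entrance Examination Center, this study designs an English vocabulary size test and uses Latent Trait Theory to verify its reliability, validity and item quality. The test consists of 180 items, all of them four-option multiple-choice questions. The study also explores the relationship between word frequency and item difficulty, discrimination and guessing. The participants were 1,838 first- and second-year senior high school students in the Taichung area. The results show that the test as a whole has good fit, which indicates that it is valid, and that it also has high reliability (0.98). The high reliability and validity not only reflect item quality but also show that the multiple-choice format is suitable for developing vocabulary tests. Word frequency is clearly related to difficulty: the lower the frequency, the more difficult the word, and difficulty rises more sharply from Level 1 to Level 3 than from Level 3 to Level 6. This suggests that Level 1 to Level 3 words are the foundation of vocabulary learning; once learners overcome the hurdle of Levels 1 to 3, they can use extensive reading and word schema to learn more words. In addition, lower-proficiency test-takers were found to have a tendency to guess and to favor the two middle options, B and C. Item analysis shows that prefix and root clues in stems and options affect discrimination, so particular care is advised when designing items. This also shows that students are able to apply the knowledge of prefixes and roots in their word schema to new words.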

Keywords: Latent Trait Theory, English vocabulary size test, English word frequency, test development

ABSTRACT

Vocabulary size is closely related to language learners’ overall proficiency. It correlates significantly with listening, speaking, reading and writing. Incorporating a word into one’s mental lexicon requires repetition. To acquire new vocabulary effectively, learners need a clear idea of their current vocabulary size so that they can concentrate their efforts on reading and vocabulary learning materials that fit it. However, no vocabulary size test has so far been developed for the Taiwanese learning context. The present study aims to develop a vocabulary size test based on the 6,840-word reference word list released by the College Entrance Examination Center. The validity, reliability and test quality of the test are checked with the three-parameter logistic model of Latent Trait Theory. The test contains 180 items in a four-option multiple-choice format. The study also sets out to explore the interaction between frequency levels and the difficulty, discrimination and pseudo-guessing parameters. The participants were 1,838 freshmen and sophomores from a senior high school in Taichung. The findings revealed that the test exhibited satisfactory validity and reliability. Validity was supported by overall model fit and hierarchical difficulty, along with the high reliability value (0.98). With validity and reliability established,

the multiple-choice (MC) format proved to be an appropriate format for a vocabulary size test. Furthermore, the mean difficulty across frequency bands formed an upward, nonlinear slope as word frequency decreased. The rise in difficulty from Level 1 to Level 3 was much steeper than that from Level 3 to Level 6. This phenomenon suggests that the first three thousand words are the threshold for further vocabulary expansion. Once learners cross this threshold, they are equipped to apply strategies such as extensive reading and word schema to acquire new words. In addition, lower achievers were found to be prone to guessing and to favor the distractors in the middle positions. Content analysis suggests that clues in stems and distractors exert a certain influence on the discrimination power of items, which shows that learners are capable of transferring their word schema to unknown words.

Keywords: Latent Trait Theory, English vocabulary size test, word frequency, test development

ACKNOWLEDGEMENTS

My master’s thesis would not have been completed without the help and encouragement of my beloved teachers, classmates, colleagues, friends and family. It was not an easy undertaking, but I am thankful that I decided to embark on this academic journey, and I find it one of the most fulfilling and fruitful experiences of my life.

First and foremost, my sincere gratitude goes to my advisor, Dr. Wenta Tseng, whose expertise, intelligence and patience guided me through countless moments of doubt and confusion. Whenever I felt lost in my research direction, research design, or data analysis and interpretation, the timely and insightful suggestions and guidance Dr. Tseng provided never failed to clear away every shadow of doubt and point to a clear direction in which I could steer my efforts and make breakthroughs. The inspiration I gained from Dr. Tseng is invaluable and will continue to influence me throughout my teaching career.

I would also like to express my heartfelt appreciation to my committee members, Dr. Hsi-chin Chu and Dr. Chao-Chang Wang, for their constructive suggestions, warm encouragement, and infinite kindness and patience. Their valuable comments and meticulous reading of my thesis not only refined the study substantially but also gave me the confidence and motivation to work harder in both the academic and teaching arenas. I must also acknowledge my graduate school instructors, who enlightened me with their expertise, enthusiasm and philosophy of life. It has been a great privilege to study and explore the field of TESOL under their instruction and guidance.

Special thanks go to my colleague Gustav Chou, who made my data collection process smooth and delightful. Moreover, I am indebted to my colleague Jenny Cheng and my classmates Ingrid Wong and Janet Chou for their sharing and advice during my thesis writing process.
I would also like to thank all the teachers, colleagues, and friends who encouraged me along the way. Finally, my dear family, especially my parents and my husband, deserve my sincere gratitude for their unconditional support and infinite love and care. This thesis would not have been completed without them.

TABLE OF CONTENTS

CHINESE ABSTRACT
ENGLISH ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER ONE INTRODUCTION
    Background and Motivation
    Research Questions
    Significance of the Study
CHAPTER TWO LITERATURE REVIEW
    Vocabulary Knowledge
    Vocabulary Measurement
    Measurement Theories
    Vocabulary Tests
CHAPTER THREE METHOD
    Test Construction
    Participants and Test Administration
    Scoring and Coding
    Data Analysis
CHAPTER FOUR RESULTS
    Goodness of Fit Statistics
    Test Information Function
    Item Characteristics and Item Parameter Estimates
    Interaction between Frequency Bands
CHAPTER FIVE DISCUSSION
    Test Quality
    Item Parameters and Frequency Bands
CHAPTER SIX CONCLUSION
    Summary of Major Findings
    Implications
    Limitations of the Study
    Suggestions for Future Research
REFERENCES
APPENDIX A The CEEC Multiple Choice Vocabulary Test
APPENDIX B The Item Characteristic Curves for the 180 Test Items
APPENDIX C Item Parameter Estimates and Fit Statistics of the 180 Test Items

LIST OF TABLES

Table 1: What is Involved in Knowing a Word
Table 2: The Descriptive Statistics for Parameters a, b, and c
Table 3: The Descriptive Statistics for Discrimination, Difficulty and Guessing in Each Level
Table 4: Description of Test-takers’ Responses across Four Options on High-discrimination Items
Table 5: The Descriptive Statistics of Response Pattern
Table 6: The Mean Guessing Value of Items across Keyed Positions

LIST OF FIGURES

Figure 1: A Typical Vocabulary Profile of an L2 Learner
Figure 2: Frequency Profile for Japanese Learners of EFL
Figure 3: An Example of One-Parameter Logistic Model
Figure 4: An Example of Two-Parameter Logistic Model
Figure 5: An Example of Three-Parameter Logistic Model
Figure 6: Test Information Function
Figure 7: The Conditional Standard Error of Measurement Function
Figure 8: Item Characteristic Curves for Item 22 and 88
Figure 9: Item Characteristic Curves for Item 23 and 165
Figure 10: Item Characteristic Curves for Item 77 and 28
Figure 11: The Frequency Distribution of the b Parameter
Figure 12: The Frequency Distribution of the a Parameter
Figure 13: The Frequency Distribution of the c Parameter
Figure 14: The Relationship between Word Frequency Level and Difficulty
Figure 15: The Relationship between Word Frequency Level and Discrimination

CHAPTER ONE
INTRODUCTION

Background and Motivation

Vocabulary size plays a crucial role in language acquisition. It is one of the strongest predictors of reading comprehension (Bernhardt & Kamil, 1995; Hsueh-Chao & Nation, 2000; Laufer, 2010; Nation, 2006; Qian, 2002; Ulijn & Strother, 1990). Besides reading comprehension, vocabulary size also correlates significantly with listening (Nation, 2006; Stæhr, 2009), speaking (Milton, 2009) and writing ability (Stæhr, 2008). Given the considerable interaction between vocabulary and the four skills, it comes as no surprise that vocabulary size is closely associated with performance on overall proficiency tests ranging from TOEFL and IELTS to UCLES examinations (Milton, 2009).

The importance of vocabulary size has generated wide interest in pinpointing effective vocabulary learning strategies. Among these strategies, extensive reading is considered one of the most effective ways to acquire vocabulary words (Nagy, Herman, & Anderson, 1985). However, a solid language base, particularly vocabulary size, has been found to be a prerequisite for L2 or EFL learners to gain vocabulary words from reading without assistance. For instance, Nagy (1985) argues that when engaged in reading, L2 learners have to cross a certain

lexical threshold before they can employ the approach of learning from context. Laufer (1982) states that L2 vocabulary size determines whether EFL learners can successfully apply reading strategies in L2 reading, while Lee and Schallert (1997) conclude that a threshold of lexical competence must be crossed before L1 reading strategies can have a positive effect on L2 reading. Hence, vocabulary size, as a threshold for further language development, deserves deliberate attention and substantial time on the learner’s part (Milton, 2009).

The acquisition of new vocabulary words requires repetition (Hilton, 2008). Learners may need to consciously target certain vocabulary words in order to learn effectively (Milton, 2009). Nation (1990) suggests that the acquisition of Level 1 to Level 3 words takes priority over the introduction of vocabulary learning strategies. Laufer (2010) points out that low-frequency words, especially Level 5, Level 6, and Level 7 words, also deserve instruction given the considerable improvement these words can contribute to reading comprehension. The pressing need to focus on a certain frequency level raises one issue: teachers need to know a learner’s current vocabulary size in the target language before they can pinpoint the frequency band the learner needs to work on. Therefore, it is crucial to gain access to a valid measurement of students’ vocabulary size so that students and teachers can steer their efforts toward obtaining the lexical

knowledge they need.

In Taiwan, vocabulary has long been a focus of instruction in senior high school English classes. Vocabulary instruction for a single lesson can take one to two class periods, depending on the instructor’s teaching style. Whether taught in context or singled out for further explanation, vocabulary words account for a large portion of class hours. Besides textbooks, vocabulary books listing high-frequency words from Level 1 to Level 6, compiled from the word list issued by the College Entrance Examination Center, are often used as supplementary materials. Despite this seemingly substantial assistance, students still encounter difficulty acquiring new vocabulary words. Even if students devote considerable time and energy to vocabulary learning, their progress may still be hindered by poor learning strategies. Some students focus only on the quantity of words at the cost of the quality of learning. As a result, they often confuse similar vocabulary words with one another. What is worse, their partial or incorrect impression of those words prevents further development of productive vocabulary, which explains their poor writing skills. Another common problem is that some students devote needless effort to memorizing low-frequency words. For students with a sizable vocabulary, such learning behavior poses little danger. For students with a limited vocabulary, however, such misplaced focus may take its toll on their progress. To

tackle those problems in vocabulary learning, as Nation (1990) argues, what students need before any training in vocabulary learning strategies is to familiarize themselves with high-frequency words. Only in this way can they be equipped with a solid vocabulary base for employing vocabulary learning strategies. Students need a reliable vocabulary size estimate to focus their efforts on a certain frequency band. The issues Taiwanese students face can be tackled by pinpointing, for every individual learner, the frequency band requiring deliberate attention. Therefore, a well-established vocabulary size test with which Taiwanese students can measure their own vocabulary size is much needed.

Vocabulary size measurement, in light of its paramount importance, has been studied and established in many forms. Vocabulary size tests such as the VLT and the VST are widely used by researchers and teachers in the TESOL field. Nevertheless, none of these tests was developed for the learning context in Taiwan, let alone with the reference word list released by the CEEC taken into account. Thus, the main purpose of this study is to fill this void by developing a valid and reliable vocabulary size test tailored to the needs of senior high school students in Taiwan.

Research Questions of the Study

This study aims to explore the following three research questions.

1. To what extent do the data based on the multiple choice vocabulary test format

fit the 3PL Latent Trait Theory model?
2. What are the reliability and validity of the CEEC multiple choice vocabulary test as checked by the 3PL Latent Trait Theory model?
3. How do the three parameters interact with different word frequency levels?

Significance of the Study

The present study bears pedagogical and theoretical significance. Pedagogically speaking, if a multiple choice vocabulary test format can be shown to be valid and reliable, it can serve teachers as a tool for placement or diagnostic purposes. To begin with, this vocabulary size test can be used to assess students’ learning progress. Teachers in Taiwan have large classes; it is not uncommon for a senior high school class to have over 40 students. With so many students, teachers need a reliable tool to measure students’ vocabulary size in order to provide tailor-made assistance. Likewise, students can benefit from the present study because they would gain access to information on their current vocabulary size. They would therefore be able to focus their efforts on the vocabulary words requiring immediate attention, without detouring into memorizing less frequent words that contribute little to their current needs. In addition, since a learner’s vocabulary size grows with his or her proficiency level, this vocabulary size test can be included

as part of a placement test as a reliable and handy measure of proficiency level.

As for the theoretical aspect, since the test items in the study include vocabulary words from six frequency bands, the results can provide insights into individual item performance across levels. It may therefore be possible to explore how items in one frequency band differ from items in another. The findings of this study will add to the literature on the relationship across frequency bands and provide implications for test construction and pedagogy.
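The three-parameter logistic (3PL) model named in the research questions relates a test-taker’s ability to the probability of answering an item correctly through three item parameters: discrimination (a), difficulty (b) and pseudo-guessing (c). A minimal Python sketch of that item response function is given here, assuming the common logistic form without the 1.7 scaling constant. The function name and the parameter values are hypothetical illustrations, not estimates from this study.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Probability of a correct answer under a 3PL item response function.

    theta: test-taker ability
    a: discrimination, b: difficulty, c: pseudo-guessing (lower asymptote)
    Assumption: the logistic form is used without the D = 1.7 scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item; these values are illustrative, not taken from the thesis.
item = {"a": 1.2, "b": 0.0, "c": 0.25}

# Print the probability of a correct answer at three ability levels.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct_3pl(theta, **item), 3))
```

When ability equals the item difficulty (theta = b), the probability is c + (1 - c) / 2, which is 0.625 for the illustrative values above. As ability falls, the curve approaches c rather than zero, which is how the model accommodates guessing on four-option items.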

CHAPTER TWO
LITERATURE REVIEW

This chapter presents the important issues involved in developing a quality vocabulary measure. The first section presents important aspects of vocabulary knowledge, followed by a review of the CEEC word list. The next section reviews important criteria for a good vocabulary test. Measurement theories are then discussed. The final section describes the three most widely used receptive vocabulary measures.

Vocabulary Knowledge

Several important aspects of vocabulary knowledge are reviewed in this section. First, we look at how vocabulary knowledge is defined by the receptive/productive and breadth/depth distinctions and at the limitations of such dichotomies. Then the issue of vocabulary size is explored. Finally, the relationship between frequency and vocabulary size is examined.

Receptive and Productive Vocabulary

People do not know every word in their vocabulary equally well. If a word can be perceived and comprehended in a listening or reading text, it belongs to one’s receptive vocabulary; if a word can be retrieved from memory and used in spoken or written form, it falls into the productive category. Nevertheless, this simple and

intuitive definition glosses over the complex nature of vocabulary knowledge. Melka (1982), for instance, argued that receptive and productive vocabulary should be treated as a continuum instead of an arbitrary dichotomy. Corson (1995), on the other hand, looked at this issue from the perspective of use. Referring to receptive and productive vocabularies as passive and active vocabularies respectively, he proposed that passive vocabulary consists of four aspects: active vocabulary, words that are partly known, low-frequency words not readily available for use, and words that are avoided in active use. Nation (2001) devised a model showing what is involved in knowing a word receptively and productively. The model is shown below.

Table 1. What is involved in knowing a word. Adopted from Nation (2001, p. 27).

As can be seen in Table 1, vocabulary knowledge is divided into three aspects: form, meaning, and use. In each subcategory, the capitalized R refers to receptive knowledge while the capitalized P indicates productive knowledge. By merely

surveying the list, readers get the impression of a scale. A closer look at the use category reveals a touch of the perspective Corson adopted.

In view of the varied definitions and subcategories of the receptive/productive distinction, one can easily imagine the obstacles standing in the way of empirical studies. A mere difference in test format can result in questionable reliability. For instance, when productive knowledge is estimated by a recall test and receptive knowledge by a multiple choice recognition test, the guessing factor involved in the latter casts doubt on the result (Webb, 2008). Likewise, a recall test providing the partial spelling of a target word can be argued to be testing only receptive knowledge, as a partially pronounced word can still be recognized by learners (Melka, 1997; Morton, 1979). Therefore, for tests attempting to gauge the difference between the two knowledge constructs, the aspects of knowledge and ability being measured should be specified clearly to obtain reliable and valid estimates (Read, 2000).

Despite the variation in definition and assessment discussed above, it has been empirically established that a learner’s receptive vocabulary is larger than his or her productive vocabulary (Fan, 2000; Laufer, 1998; Laufer & Paribakht, 1998; Morgan & Oberdeck, 1930; Waring, 1997). Webb (2008) found that receptive vocabulary size correlates with productive vocabulary size: a learner with a bigger receptive vocabulary than others is very likely to have a bigger productive

vocabulary. Thus, receptive vocabulary tests may be of value in predicting productive vocabulary size.

Breadth and Depth of Vocabulary Knowledge

The distinction between breadth and depth of vocabulary knowledge, first proposed by Anderson and Freebody (1981), usefully separates breadth, or how many words a learner knows, from depth, or how well those words are known. Like the receptive/productive distinction, it poses a deceptively simple dichotomy. Milton (2009) points out that attempts to measure vocabulary breadth and depth can yield unreliable results because different methods are used to gauge them. A vocabulary breadth test that counts a word as known whenever the option “I have seen this word before” is checked can produce a far bigger estimate than a breadth test requiring the test-taker to translate L2 words into the L1. Vocabulary depth, on the other hand, is even more complex considering all the knowledge involved in knowing a word.

Depth of knowledge can be described in a variety of ways. Looking at depth from the perspective of a developmental scale is the most widely adopted approach. One early example is the Vocabulary Knowledge Scale developed by Paribakht and Wesche (1997), illustrated in the following.

I. I don’t remember having seen this before.
II. I have seen this word before, but I don’t know what it means.

III. I have seen this word before, and I think it means ________. (synonyms or translation)
IV. I know this word. It means ______.
V. I can use this word in a sentence: ________. (Write a sentence.) (If you do this section, please also do Section IV.)

Another approach is to perceive vocabulary knowledge as consisting of different dimensions of word knowledge. Nation (1990) conceptualized vocabulary knowledge as a list of eight aspects: spoken form, written form, grammatical patterns, collocations, frequency, appropriateness (register), meaning, and associations. Whichever perspective we adopt, it is evident that depth of vocabulary knowledge cannot be achieved through a single exposure. Instead, mastering a word is an incremental process that requires multiple exposures (Schmitt, 2010; Webb, 2005, 2007). Also, when designing a vocabulary test, it is vital to consider and specify the developmental stage or dimension the test is meant to estimate in order to prevent confusion or measurement error.

What Counts as a Word

Vocabulary breadth, in a narrower sense, can be thought of as a learner’s vocabulary size. Much effort has gone into measuring various aspects of vocabulary size (Read, 2000). Before going into detail, several terms for counting vocabulary should be clearly defined. To begin with, the number of tokens, or running words, is the total number of words occurring in a text. If an individual word appears five times

in a text, then it would be counted five times. The number of types, on the other hand, refers to the number of different words appearing in a text. An individual word is counted once regardless of the number of its occurrences in the text. When looking at vocabulary size, researchers use lemmas and word families to count words. A lemma consists of a base word and its inflected forms (third person -s, -ed, -ing; plural -s; possessive -s; comparative -er; superlative -est). A word family is made up of a base word, its inflected forms and its transparent derivations. Transparent derivations are derived forms whose meaning can be understood from knowledge of the base word when they are encountered in context (Bertram, Baayen, & Schreuder, 2000; Nagy, Anderson, Schommer, Scott, & Stallman, 1989). For instance, the word family of the base word extend contains extends, extending, extended, extensive, extensively, extension and extent (Read, 2000). One issue with word families is the inevitably subjective judgment involved in deciding whether a word belongs to a family. Also, it has been found that L2 learners have less inclusive word families than native speakers, and some native speakers may have larger word families than others (Bauer & Nation, 1993; Nation, 2006; Read, 2000).

Discrepancies in vocabulary size estimates often result from different ways of counting vocabulary words. Vocabulary sizes counted according to lemmas can have a

much more inflated figure than those counted by word families. Thus, it is vital to distinguish among these terms in order to have a clear grasp of the issue of vocabulary size and text coverage.

Vocabulary Size and Text Coverage

To have a basic idea of vocabulary size, we must know how many words there are in English. Webster’s Third International Dictionary has around 54,000 word families (Goulden, Nation, & Read, 1990). A college graduate whose native language is English has, roughly speaking, a vocabulary of 20,000 word families (Goulden et al., 1990). For EFL learners, 20,000 word families can be a formidable figure. In Taiwan, senior high school students preparing for the college entrance exam have a learning goal of approximately 7,000 lemmas based on the word list issued by the CEEC. For overseas students who pursue an advanced degree through English, a vocabulary size of 8,000 to 9,000 word families is the average estimate.

Another way is to look at how large a vocabulary is required to accomplish certain tasks. A learner needs to know 95% of the words in a text to guess an unknown word correctly from context (Laufer & Paribakht, 1998). If a learner wishes to comprehend unsimplified texts without assistance, he or she needs to know 98% of the words in a text (Carver, 1994; Hsueh-Chao & Nation, 2000). For instance, for a learner to read Alice in Wonderland independently, it requires about 5,000 word

families to achieve such coverage (Nation, 1992). For materials such as newspapers and novels like The Great Gatsby and Lady Chatterley’s Lover, about 8,000 to 9,000 word families are required (Nation, 2006). In terms of spoken language, a learner would need about 6,000 to 7,000 word families to understand 98.08% of the movie Shrek (Nation, 2006). From these figures, we can see that the nature or genre of a text has a substantial impact on the vocabulary size needed to achieve 95% to 98% coverage of the text. This observation underlines the importance of attending to the word frequency information of reading or listening materials so that learners can maximize their gains in the learning process.

Besides the issue of text coverage and comprehension, vocabulary size can serve as an indicator of proficiency level. Meara (1996) noted that “all things being equal, learners with big vocabularies are more proficient in a wide range of language skills.” It was also found that as an L2 learner’s proficiency improves, the quality and quantity of L2 lexical competence increase as well (Zareva, Schwanenflugel, & Nikolova, 2005). Furthermore, a learner’s vocabulary size can reflect the learner’s current proficiency level, indicate the frequency band the learner should work on, and serve as a reference for the choice of learning material and for placement purposes (Read, 1988). Nevertheless, it should be noted that vocabulary knowledge should not be seen as the only focus of language learning and

teaching. Simply memorizing a large quantity of words does not guarantee progress in overall proficiency. Rather, lexical knowledge should be taken as a vital and integral aspect of language proficiency which improves along with other aspects of language ability (Read, 2000).

Word Frequency

The frequency of a word refers to how often the word occurs in texts. The more frequent a word is, the more likely learners are to encounter it. Thus, frequency has a great impact on the acquisition, processing and use of vocabulary words. Schmitt (2010) hailed word frequency as “the single most important characteristic of lexis that researchers must address.”

Figure 1. A typical vocabulary profile of an L2 learner (Meara, 1992)

Frequency has long been assumed to be an indicator of difficulty. The more frequent a word is, the easier it is for learners to acquire it (Mackey, 1967; McCarthy,

(24) 1990; Palmer, 1968). This assumption was later borne out by empirical evidence. Meara (1992) proposed a typical vocabulary frequency profile of an L2 learner, noting that it should exhibit a downward slope from left to right, as demonstrated in Figure 1. Milton (2006) tested this hypothesis, using X-Lex (Meara & Milton, 2003) to test the vocabulary size of 227 Greek learners of EFL. The result was a similar distribution, a gentle downward slope from left to right. The relationship between vocabulary size and frequency bands was found to be significant.

Figure 2. Frequency profile for Japanese learners of EFL (Aizawa, 2006)

Aizawa (2006) conducted a vocabulary test on 363 Japanese learners of English based on the JACET 8000 word list. As revealed in Figure 2, the resulting profile shows a similar downward slope from left to right between the 1000 and 4000 bands. However, from the 5000 level on, the slope flattens out. This phenomenon indicates that the effect of frequency might not be as significant after crossing a

(25) certain frequency threshold. Still, a test on the most frequent bands can give valuable insights into the development of the lexicon (Aizawa, 2006). In addition to the significance of frequency for acquisition shown by these profiles, frequency has been found to be the overriding factor among those thought to influence difficulty. Milton and Daller (2007) attempted to judge the influence of frequency, word length, and degree of cognateness on word difficulty for 106 British learners of French. They found a consistent correlation between learnability and frequency, while neither of the other two factors showed much influence on learning. From the research discussed above, it is evident that frequency plays a pivotal role in vocabulary acquisition. This finding carries profound measurement and pedagogical implications. A test on frequent vocabulary words can reveal valuable information on the progress of a learner. Given the influence frequency exerts on difficulty, it would also be feasible to use frequency as a deciding factor when selecting target words for test items. As for pedagogical purposes, frequency allows teachers and curriculum developers to tell words worthy of attention from less frequent words, whose low rate of recurrence makes them less effective learning goals. Nation and Waring (1997) point out that learning the first 3000 high-frequency words should take priority over learning less frequent words. Only after learners are equipped with the ability to comprehend unsimplified texts can they start to learn

(26) vocabulary words in context. Once this learning goal is accomplished, words of lower frequency no longer require much deliberate attention in class. Efforts can instead be directed to reading strategies or knowledge of word parts, which serve as tools for learners to expand their vocabulary size. The same is true for EFL learners in Taiwan. Before going on ambitious quests such as reading unabridged novels or newspapers for extensive reading, students need a vocabulary size which allows them to cruise the world of authentic English without difficulty. This is also why word lists assume such importance for language learners and instructors alike.

The CEEC Word List

The CEEC word list, a project sponsored by the College Entrance Examination Center, was compiled to serve as a reference for the construction of the two college entrance exams in Taiwan: the Scholastic Aptitude English Test (SAET) and the Department Required English Test (DRET). The word list consists of 6480 words, with 1080 words in each of six frequency bands. Since DRET is more difficult than SAET, the first four frequency bands are intended for SAET while the remaining 2160 words are supplementary words intended for DRET. The editors collected words from three sources: nine sets of English textbooks for senior high and junior high students published in Taiwan, five sets of American elementary school readers, and 21 English

(27) word lists compiled in Britain, America, Canada, Japan, Taiwan and China. After filtering the roughly 16,000 words compiled from the above sources by frequency, the researchers went on to include words close to the life experiences of senior high school students in Taiwan, such as acne, dandruff, diploma, download, upload, calculator and curriculum. In the word list in question, words are defined as word families, each including the base word, its inflected forms and its transparent derivations. Derived forms with transparent affixes such as -ness, -ment, in-, im-, ir-, il-, non-, re-, -ing and -ed are therefore not listed as separate entries. On the other hand, derived forms with suffixes such as -ful, -able, -ion, -ation, -cation and -ition are included in the word list as separate words. For instance, respectful and dreadful earn independent entries in the word list since their meanings undergo certain changes after affixation. This approach echoes the notion that EFL and second language learners might have less inclusive word families in their mental lexicon than native speakers of English.

Vocabulary Measurement

Important Criteria for a Good Vocabulary Size Test

Heaton (1990) proposed that the two basic principles for a test are validity and reliability. According to Bachman and Palmer (1996), the most important quality of a test is its usefulness, which is determined largely by how test takers perform on the

(28) test. How test takers respond to test items and the outcome of those responses will be the main concern when it comes to the evaluation of test usefulness. Among the qualities determining the usefulness of a test, reliability and construct validity are of primary concern (Bachman, 2004). Hughes (2003) proposed the basic principles against which a test or test system should be evaluated. Such a test or system:

• consistently provides accurate measures of precisely the abilities in which we are interested;
• has a beneficial effect on teaching;
• is economical in terms of time and money.

Still, the priority is to make sure that the abilities being measured are what we intend to measure. In the following section, the concepts of reliability and validity will be reviewed.

Reliability

Reliability represents the consistency of measurement (Bachman & Palmer, 1996). Theoretically speaking, if a subject takes a test twice, he or she should get identical scores. Also, if a group of test-takers take a test twice and are ranked according to their performance, their rankings on the two occasions should be exactly the same. However, in reality, such an ideal is not easy to achieve, since many factors other than the ability being measured can stand in the way of obtaining a reliable test result. The factors interfering with test scores include test method facets, attributes of the

(29) test-taker that are not considered part of the target language ability, and unpredictable, temporary factors (Bachman, 1990). For example, if the items in a vocabulary test are too long, the test-takers might get items wrong simply because they are too fatigued to pay full attention to the content of the test. Emotional or physical conditions might also prevent the test-takers from performing as well as they should. Temporary factors such as the administrative process, and external conditions like noise or temperature, might also tamper with the test results. These factors lead to inconsistency of test results, or measurement error. Therefore, it is of paramount importance to minimize the effect such factors have on the representativeness of the test score so that a reliable test result can be obtained. The less the test results fluctuate, the more reliable a test is. In other words, if measurement error can be minimized, reliability can be ensured (Hughes, 2003). Reliability can be empirically tested by means of Classical Test Theory. One approach is called the test-retest method. Subjects are asked to take the same test twice, and the two sets of test scores are compared to obtain a reliability coefficient. However, this method is difficult to put into practice. If a subject takes the two tests at too short an interval, he or she might be able to recall the answers. If the two tests are administered at too long an interval, practice effect may get in the way (Davies et al., 1999; Schmitt, 2010).
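As a rough illustration of the test-retest approach, the reliability coefficient can be taken as the Pearson correlation between the scores from the two administrations. The Python sketch below assumes this reading; the function name and the score lists are hypothetical and are not data from any study cited here.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores of five test-takers on two administrations.
first_sitting = [12, 15, 9, 20, 17]
second_sitting = [13, 14, 10, 19, 18]
print(round(pearson_r(first_sitting, second_sitting), 3))
```

A coefficient close to 1 would indicate that the two administrations rank and space the test-takers in nearly the same way.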

(30) The other method, the split-half method, is much more popular among researchers since it requires only one administration. Instead of administering the same test twice, the test is split into two parallel halves, which are analyzed for a coefficient of internal consistency. Yet the prerequisite that the two halves of the test be proven equivalent is difficult to satisfy (Hughes, 2003; N. Schmitt, Schmitt, & Clapham, 2001). In fact, the theoretical definition of reliability as “the correlation between test scores on two parallel tests” is difficult to operationalize (Hambleton, Swaminathan, & Rogers, 1991). In view of the limitations of Classical Test Theory in accurately gauging reliability, Item Response Theory, which is capable of estimating the standard error of measurement across ability levels and test-takers, has been adopted to obtain a more appropriate estimate of reliability (Schmitt, 2010). The theoretical construct of Latent Trait Theory and how it achieves a more accurate estimate of reliability will be discussed in detail in the section on LTT.

Validity

If a test exhibits validity, it measures the ability or skill it is intended to measure. Every test has its own purpose or use, be it for placement or for assessment. If the test result does not reflect the ability the test claims to measure, the decisions made based on the test scores would lose their credibility. Hence, test

(31) developers should make sure the interpretation of test results does the test-takers justice. Also, given the washback effect of tests, the inferences we make from test scores should be correct and appropriate to prevent unwanted social and educational consequences (Messick, 1980, 1996). To design a valid test, it is necessary to devise specifications of the ability to be measured. What is more, the interpretation and use of the test scores on which further action is based should also be examined to validate a test. As Bachman (1990) notes, it is the “validity of the way we interpret or use the information gathered through the testing procedure” that we are examining (Cronbach, 1971; Messick, 1990). In a nutshell, the process of validation should be based on logical, empirical and ethical considerations. Validity is a unitary concept comprising different types of validity. Construct validity, first used in the field of psychological testing, has increasingly been deemed the overarching concept of validity in recent years (Davies et al., 1999; Hughes, 2003). Construct here means a theorized definition of the ability or skill the test aims to measure. Construct validity refers to the degree to which a test can accurately gauge the theoretical construct of the ability of a test subject (Anastasi & Urbina, 1997; Bachman, 1990; Hughes, 2003). In addition, the interpretation of test results should allow further generalization to a broader target language area (Bachman & Palmer, 1996). Under the umbrella of construct validity are the three most commonly discussed

(32) facets of validity: content validity, criterion-related validity, and face validity. The first two are generally considered empirical evidence for the validation of a test. Content validity refers to whether test items properly represent the target language skills or structures the test is intended to measure. For instance, if a grammar test focuses on subjects’ knowledge of the present tense, then the test items should capture all the concepts related to the present tense. If the present progressive is not included, then the test cannot be said to have content validity, even if it covers all the other concepts of the present tense. To achieve content validity, developers of a test must first specify the domain of the ability to be covered. That is, specifications of the skills or structures should be spelled out. Specifications of a test include content, structure, timing, medium, techniques, criterial levels of performance, and scoring procedures (Hughes, 2003). It is worth noting that “content” refers to the potential content of any version of a test rather than that of a single specific version. Empirical support for content validity can be derived by means of factor analysis or LTT analysis (Davies et al., 1999). Criterion-related validity denotes the correlation between the test in question and a well-established, reliable assessment. If the criterion is a test administered at about the same time, then it is concurrent validity that is being measured. If the criterion is a test or performance assessment which will be conducted in the future, then predictive

(33) validity is being calculated. Concurrent validity can be established in two ways. One is to examine how well a test discriminates among test-takers of different levels of proficiency. The other is to compare results of the test in question to those of a standardized or well-established test measuring the same ability or skill (Bachman, 1990). Both methods have their own issues. The former is less used by researchers due to the difficulty of defining the standard of proficiency to compare against. The latter requires the criterion measure to have undergone construct validation, the only sufficient evidence for validity, before being taken as a criterion (Messick, 1990; Thorndike, 1949). Since it is not easy to find a valid and reliable test to serve as a criterion, it is not easy to establish empirical evidence for concurrent validity. As for predictive validity, it primarily focuses on how well the test scores predict criterion behavior, or performance in the future. If a test aims to screen students for admission into a class for linguistically gifted students, its criterion would be the performance of the students who are admitted into the class. A potential pitfall in the use of predictive validity is that such a test might not cover an ability holistically. For instance, even if a vocabulary size test can accurately predict the future performance of students in such classes, this does not mean that the test amounts to a valid test of overall proficiency. As Bachman (1990) points out, without empirical evidence,

(34) “language tests developed for purposes of prediction…cannot…be interpreted as valid measures of any particular ability.” Face validity involves the perception of the test in question by untrained observers. If the test looks as if it measures what it claims to assess, then it has face validity. Although face validity has virtually no effect on the empirical establishment of construct validity, the lack of it can cause test-takers to take the test lightly and consequently jeopardize the test results (Davies et al., 1999; Hughes, 2003).

Measurement Theories

Classical Test Theory

The central idea of CTT holds that the variance of observed scores ($\sigma_X^2$) consists of true-score variance ($\sigma_t^2$) and error variance ($\sigma_e^2$), as indicated by the following formula (Anastasi & Urbina, 1997).

$$\sigma_X^2 = \sigma_t^2 + \sigma_e^2$$

Since the objective of testing is to obtain a score as close to the true score as possible, CTT uses the reliability coefficient $r_{xx'}$, from which the standard error of measurement (SEM) is derived, to show how much observed scores are affected by error. As shown by the following formula, $r_{xx'}$ is the ratio of true-score variance to observed-score variance. The higher $r_{xx'}$ is, the more reliable the test is (Bachman, 2004; Davies et al., 1999).

$$r_{xx'} = \frac{\sigma_t^2}{\sigma_X^2}$$
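To make the variance decomposition concrete, the following Python sketch computes the reliability coefficient from a true-score variance and an error variance, and then derives the standard error of measurement through the standard CTT relation $SEM = \sigma_X \sqrt{1 - r_{xx'}}$. The function names and the numbers are invented for illustration; they do not come from the present study.

```python
from math import sqrt

def reliability(true_var, error_var):
    """r_xx' = true-score variance / observed-score variance."""
    observed_var = true_var + error_var  # sigma_X^2 = sigma_t^2 + sigma_e^2
    return true_var / observed_var

def standard_error_of_measurement(observed_sd, r_xx):
    """SEM = sigma_X * sqrt(1 - r_xx')."""
    return observed_sd * sqrt(1.0 - r_xx)

# Invented values: true-score variance 80, error variance 20.
r = reliability(80.0, 20.0)
sem = standard_error_of_measurement(sqrt(80.0 + 20.0), r)
print(r, round(sem, 3))
```

Under these assumptions the squared SEM equals the error variance, which is why a higher reliability coefficient goes hand in hand with a smaller SEM.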

(35) Despite CTT’s popularity among researchers, it has its limitations and has thus come under criticism. Two key issues lie in its basic assumptions. First, item statistics depend solely on the performance of the particular group of test-takers taking the test. Therefore, if a test is taken by another group of subjects with different characteristics or levels of proficiency, a very different set of item statistics will be derived. Second, if a group of subjects take two tests of differing difficulty, their ability will be estimated differently. Such limitations make item statistics obtained from different groups of individuals incomparable. Comparison between the statistics obtained from a subject’s performance on two different tests is equally problematic. In view of these weaknesses, CTT has been challenged by Item Response Theory, which is free of the group-dependent and item-dependent problems of CTT.

Latent Trait Theory

Latent Trait Theory, or Item Response Theory, is another widely used approach in language testing. LTT assumes that two variables decide an individual’s performance on a test: the ability of the test-taker and the characteristics of the items. Underlying LTT is the belief that the performance of subjects on a test can be explained or predicted by estimating their abilities, or traits, as embodied in the test

(36) score. Traits are often referred to as latent traits as they are abstract and cannot be observed or measured directly (Hambleton & Swaminathan, 1984). As opposed to CTT, LTT is capable of producing indices that are comparable across different tests and groups of subjects. Hambleton and Swaminathan (1984) pinpointed the three key advantages of LTT:

(1) Item parameter estimates are independent of the group of examinees used.
(2) Test taker ability estimates are independent of the particular set of test items used.
(3) Precision of ability estimates is known.

The first two advantages are known as invariance, the major feature distinguishing LTT from CTT. Invariance indicates that the ability distribution of test-takers does not affect item characteristics and that item characteristics do not affect the ability estimates of test-takers. Even if a test is administered to two groups of test-takers of different ability, the resulting item characteristics will still be identical. Compared to CTT, LTT is established on stronger assumptions. One key assumption, unidimensionality, indicates that a single trait suffices to explain the test performance of a subject. That is, there is only one dominant trait that can account for the score pattern and ranking of the test takers. If more than one dominant trait is required to explain performance on a test, the model is called a multidimensional

(37) model, which lacks sufficient empirical support (Davies et al., 1999; Hambleton & Swaminathan, 1984). Another central assumption is local independence, which assumes that “when abilities influencing test performance are held constant, examinees’ responses to any pair of items are statistically independent” (Hambleton et al., 1991). If local independence holds, a test taker’s responses to different items will exhibit no relationship once ability is taken into account. This condition can be achieved when traits other than the ability to be measured are ruled out. Each test item has an item characteristic curve (ICC), which shows the relationship between the latent trait and the probability of answering the item correctly. There can be an infinite number of LTT models with different combinations of parameters. Among the most commonly used models are the one-parameter logistic model, the two-parameter logistic model, and the three-parameter logistic model. The one-parameter logistic model, also known as the Rasch model, takes into account one sole parameter: item difficulty, denoted by $b_i$. The $b_i$ parameter is the point on the ability scale at which the chance of answering correctly stands at 0.5. As Figure 3 indicates, the nearer the curve is to the right end of the ability scale, the more difficult the item is. To have at least a 50% chance of answering item 2 correctly, a test-taker needs an ability level of 2 or higher, while a test-taker only needs an ability level of -1 or higher to

(38) answer item 3 correctly. The only difference between curves in the one-parameter logistic model is their location on the ability scale. This model requires at least 100 participants and about 25 items to produce reliable results.

Figure 3. An Example of the One-Parameter Logistic Model. Adopted from Hambleton, 1991

As for the two-parameter logistic model, it takes item discrimination into account along with item difficulty. The item discrimination parameter, denoted by $a_i$, is reflected in the steepness of the slope. An item with high discrimination has a steeper curve than items with lower discrimination. The acceptable range of $a_i$ is (0, 2). The higher the value is, the more discriminating the item is. As indicated by Figure 4, ICCs in the two-parameter logistic model differ in their slopes as well as their locations. While item 1 and item 2 are the most difficult ones, item 4 possesses the best discriminating power since it has the steepest slope among the four curves. Alderson et al. (1995) proposed that it takes a minimum of 200 subjects to get

(39) the two-parameter model running, while McNamara (1996) suggested that 500 subjects and 20 items are required.

Figure 4. An Example of the Two-Parameter Logistic Model. Adopted from Hambleton, 1991.

Figure 5. An Example of the Three-Parameter Logistic Model. Adopted from Hambleton, 1991.

The three-parameter logistic model includes an additional parameter, the pseudo-chance-level parameter. It allows observation of the lower end of the ability scale, providing valuable information on tests using selected-response formats, like

(40) multiple choice questions. Figure 5 gives an example of the three-parameter model, in which the lower asymptote of each curve is no longer zero. Item 3, with its guessing value at 0.25, is more susceptible to guessing than the other items, which have lower guessing values. Since the 3PL model involves three parameters, it requires a minimum of 1000 participants and 60 items. Aside from providing information on test items and latent traits, Rasch analysis can single out problematic test items and subjects, which is done through estimation of goodness of fit, namely, fit analysis. When fit analysis is conducted, the latent trait of each examinee is estimated, and predictions of the examinee’s performance are then made based on the estimate. If the predictions match the observed behavior, then goodness of fit is attained. If there are discrepancies between the two, misfit items or persons will be located. Misfit items suggest two possible situations. On the one hand, they might be poorly written items with low discriminating power, which need further revision. On the other hand, those items might not be assessing the one latent trait the whole test claims to measure. Such information enables test developers to identify, revise, or exclude poor items and therefore refine a test. Likewise, poor person fit statistics imply that the test does not accurately estimate the ability of the misfit person. This might happen when a person randomly guesses through the whole test or when a test

(41) proves a poor measure for a large population of examinees (Davies et al., 1999; McNamara & Candlin, 1996).

Vocabulary Tests

Checklist Tests

Checklist tests, also known as Yes/No tests, were developed by Paul Meara as an efficient way to gauge vocabulary size. All examinees are required to do is check the words they think they know. Since simply putting a check requires little time, this test format allows more items to be included in a test. Nevertheless, with convenience come pitfalls. On the one hand, test-takers might have varied definitions of “knowing a word.” For two examinees with similar ability, the test results might be quite different owing to how conservative each is when deciding whether a word is known or not. On the other hand, an examinee can simply check all the items as known words despite having poor knowledge of the target items. Fortunately, there are solutions to the issues mentioned above. For the former, problems can be avoided by specifying what counts as knowing a word. For instance, if at least one meaning of an item is known, the item should be checked as known. As for the latter, pseudo-words can be added to identify unreliable test-takers. The following is a sample of a checklist test.

(42)
1. □ tort    2. □ sale    3. □ mentir    4. □ dent
5. □ laid    6. □ puis    7. □ verre     8. □ pays

Vocabulary Levels Test

Meara (1996) deems the Vocabulary Levels Test (VLT) “the nearest thing we have to a standard test in vocabulary.” As its title suggests, it focuses on four frequency bands: 2000, 3000, 5000 and 10000 word families. Each of the four frequency levels is significant. The 2000 band represents a threshold for daily conversation; the 3000 band is sufficient for basic authentic reading; a vocabulary of 5000 allows one to read authentic material independently, while the 10000 level encompasses all the high frequency words (Schmitt, 2010).

1. coach
2. darling      ______ a thin, flat piece cut from something
3. echo         ______ person who is loved very much
4. interior     ______ sound reflected back to you
5. opera
6. slice

The above shows a typical cluster of VLT. Each cluster has six options and three stems. The six options, consisting of three target words and three distractors, are

(43) in the same frequency band and word class. Each frequency band has five noun clusters, three verb clusters and two adjective clusters, reflecting the distribution of word classes in English. When stems are written, definitions are kept as short and simple as possible. It is important to use high frequency words in definitions so that examinees do not fail an item because of unknown words in the stem. For instance, stems for the 2000-level words are written with words in the 1000 frequency band. In addition, stems are arranged by length whereas options are arranged alphabetically, so that the ordering gives no clue to the answer. VLT was developed for, and is widely used for, placement and diagnostic purposes.

Vocabulary Size Test

Developed by Paul Nation, the Vocabulary Size Test (VST) is presented in multiple-choice form, with an example sentence as the stem and four definitions as options. An example of a VST item is given below.

PATIENCE: he has no patience.
a. will not wait happily
b. has no free time
c. has no faith
d. does not know what is fair

Examinees are required to choose the definition closest to the meaning of the

(44) boldfaced target word. VST covers frequency levels from 1000 to 14000, with target words based on the frequency list of the spoken section of the British National Corpus (BNC). Each frequency band contains ten items, adding up to 140 items in total. The stem is written so as to give as few clues as possible (Schmitt, 2010). According to Nation (2006), words within the 14000 band cover 99% of the running words in written and spoken discourse. Created to estimate overall vocabulary size, VST can provide valuable information on learners’ progress (Nation, 2006). VST was validated using the Rasch model by Beglar (2010), with a Rasch reliability index of 0.96.
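Because each 1000-word-family band is sampled by ten items, each item can be read as standing for 100 word families, so a raw score converts to a size estimate by simple multiplication. The Python sketch below assumes this reading of the band and item counts given above; the function name is hypothetical.

```python
BANDS = 14                 # frequency bands from 1000 to 14000
ITEMS_PER_BAND = 10        # ten items per band, 140 items in total
FAMILIES_PER_BAND = 1000   # each band spans 1000 word families

def estimate_vocabulary_size(correct_items):
    """Convert a raw score into an estimated number of known word families."""
    total_items = BANDS * ITEMS_PER_BAND
    if not 0 <= correct_items <= total_items:
        raise ValueError(f"score must be between 0 and {total_items}")
    # Assumption: every item represents an equal share of its band.
    families_per_item = FAMILIES_PER_BAND // ITEMS_PER_BAND
    return correct_items * families_per_item

print(estimate_vocabulary_size(73))
```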

(45) CHAPTER THREE

METHODOLOGY

Test Construction

The test in question consisted of 180 items. Target words were chosen from the reference word list issued by the College Entrance Examination Center. This word list comprised six levels, each including 1080 words. Thirty words were randomly taken from each frequency level as target words. To reflect the distribution of word classes in English, 15 nouns, 9 verbs and 6 adjectives were chosen from each level. Items were written in a four-choice multiple-choice (MC) format. Examinees had to choose the word closest in meaning to the definition provided in the stem. A vocabulary size test in multiple-choice format generally takes one of two forms: definition-word matching (such as VLT) or presenting the target word in an example sentence (such as VST). Since the present test was intended for senior high school students, it was important for the stem to remain as easy to understand as possible so that the participants would not fail an item simply because they misunderstood the stem. In view of this, the author chose to adopt the format of definition-word matching because it was much easier to write a definition with high frequency words with the help of dictionaries. In writing good example sentences, by contrast, it would be difficult to limit the words used to Level 1 and Level 2 words since clear contexts and

(46) appropriate collocation would definitely require words in lower frequency bands. Therefore, the format of definition-word matching was chosen, and definitions were written with words within the first two levels of the CEEC word list to minimize confusion. Cambridge Dictionary, Longman Dictionary, MacMillan Dictionary and Oxford Dictionary were consulted to keep the definitions accurate and concise. The three distractors were taken randomly from the same frequency band as the target word. To minimize the guessing effect, the distractors for target words with telling suffixes such as -er were deliberately chosen. For instance, the distractors for interpreter (an expert whose job is to change someone’s message into another language) were traitor, janitor and sneaker. Target words with obvious clues were also given specifically chosen distractors. For example, the distractors for handicraft (things made by hand) were handbasin, handicap and handcuff. To avoid possible fatigue or frustration resulting from items of lower frequency, the levels were not presented in their original order. Rather, each level was divided into two parts and arranged in the following order: Level 1, Level 5, Level 1, Level 5, Level 2, Level 6, Level 2, Level 6, Level 3, Level 4, Level 3 and Level 4. Before implementation, the test items underwent careful examination by the author, the supervising professor and an experienced fellow senior high school teacher.
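The sampling and ordering scheme described in this section can be sketched as follows. The Python code is an illustration only: the word pools are hypothetical placeholders standing in for the CEEC list, the helper names are invented, and the way each level's 30 words were divided between its two halves is not specified in the text, so the sketch simply splits a shuffled sample down the middle.

```python
import random

QUOTA = {"noun": 15, "verb": 9, "adjective": 6}     # 30 target words per level
BLOCK_ORDER = [1, 5, 1, 5, 2, 6, 2, 6, 3, 4, 3, 4]  # order of the half-level blocks

def sample_level(pool, rng):
    """Draw the quota of each word class from one level's pool."""
    chosen = []
    for word_class, count in QUOTA.items():
        chosen.extend(rng.sample(pool[word_class], count))
    rng.shuffle(chosen)
    return chosen

def assemble_test(pools, seed=0):
    """Return the target words in the block order described above."""
    rng = random.Random(seed)
    halves = {}
    for level, pool in pools.items():
        words = sample_level(pool, rng)
        mid = len(words) // 2  # assumption: two halves of 15 words each
        halves[level] = [words[:mid], words[mid:]]
    ordered = []
    for level in BLOCK_ORDER:
        ordered.extend(halves[level].pop(0))  # first half, then second half
    return ordered

# Hypothetical pools: 40 placeholder words per word class per level.
pools = {
    level: {wc: [f"L{level}_{wc}_{i}" for i in range(40)] for wc in QUOTA}
    for level in range(1, 7)
}
test_words = assemble_test(pools)
print(len(test_words))  # 180
```

Fixing the seed makes the draw reproducible, which may be convenient if the same form has to be regenerated later.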

(47) Then twelve high school seniors were asked to take the whole test as a trial so that the author could locate potential problems and make modifications accordingly. The following are some sample items. Answers are underlined. For the complete test please refer to Appendix 1.

Level 1
1. without any other
   (A) difficult  (B) stupid  (C) yummy  (D) alone
2. feeling that you want to drink something
   (A) thirsty  (B) usual  (C) excited  (D) terrible
3. in or from another country
   (A) wonderful  (B) foreign  (C) separate  (D) entire

Level 2
4. very thin
   (A) talkative  (B) perfect  (C) skinny  (D) instant
5. not expensive
   (A) wild  (B) main  (C) cheap  (D) thick
6. wanting more than you really need
   (A) classic  (B) brief  (C) recent  (D) greedy

Level 3
7. avoid something by moving quickly
   (A) dodge  (B) creep  (C) peel  (D) relate
8. laugh at and make fun of someone
   (A) scatter  (B) tease  (C) divorce  (D) predict
9. keep something so that you can use it if you need
   (A) hesitate  (B) surround  (C) reserve  (D) balance

Level 4
10. cruel
   (A) passive  (B) modest  (C) organic  (D) brutal
11. the same
   (A) identical  (B) durable  (C) thorough  (D) glorious
12. with a good smell

(48) (A) habitual  (B) fragrant  (C) peculiar  (D) critical

Level 5
13. comfortable
   (A) foul  (B) cozy  (C) apt  (D) sly
14. being alone
   (A) destructive  (B) absurd  (C) solitary  (D) feeble
15. selling things in large quantities at low prices
   (A) wholesale  (B) muscular  (C) exterior  (D) permissible

Level 6
16. spread out
   (A) boost  (B) throb  (C) unfold  (D) seduce
17. cut something heavily
   (A) hack  (B) merge  (C) vomit  (D) deem
18. make a small hole in something
   (A) concede  (B) foster  (C) pierce  (D) uncover

Participants and Test Administration

This test was administered in a senior high school located in Taichung City. It was a co-educational school with female students’ PR (percentile rank) value at 87 and male students’ at 84. Freshmen had about 6 hours of English class per week while sophomores had about 5 hours. A total of 944 freshmen and 894 sophomores took the test, totaling 1838 participants. The test was administered during a two-hour class rally by the homeroom teacher of each class. Test administrators and students were not informed that the test was part of a research project, nor were they aware of the source of the vocabulary words. Students were encouraged to take the test seriously. They were informed by their English teachers that the test results could serve as a reference for their

vocabulary size. Another incentive was that the top-ranking students would be recruited to join the national English vocabulary competition. Examination papers were collected the moment the bell rang, which allowed about 110 minutes of test time.

Scoring and Coding
There were 180 items in the test, each carrying equal weight. A correctly answered item was given 1 point while an incorrectly answered item was given 0 points. Answers were collected on computer-readable answer cards and scored by computer.

Data Analysis
The present study aims to establish a valid and reliable estimate of vocabulary size for L2 English learners and to explore how the item difficulty, discrimination and pseudo-guessing parameters interact with frequency bands. The 3PL model of Latent Trait Theory (LTT) was adopted to fulfill the goals of the study for the following reasons. To begin with, LTT would present the author with statistics of overall model fit, which would reveal whether the test as a whole was an appropriate measure of latent ability. With overall model fit established, the validity of the present vocabulary test would also be supported. Information on reliability would likewise be calculated by means of the LTT model. Second, the 3PL model allowed insights into three item parameters: difficulty, discrimination, and pseudo-guessing. Hence, the author would

be able to gather information on the three parameters across frequency bands and make comparisons. Also, the information on difficulty and discrimination would enable the author to separate well-written items from poor ones. Third, since multiple-choice items might be susceptible to the effect of guessing, the 3PL model provides critical information on the pseudo-guessing effect, allowing the author to identify unreliable individual examinees. Misfit items and test-takers would be spotted and ruled out. XCalibre was used to analyze the data.
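To make the roles of the three parameters concrete, the following is a minimal sketch of the item response function in its conventional 3PL form, P(θ) = c + (1 - c) / (1 + exp(-D·a·(θ - b))). The function name `p_correct_3pl` and the scaling constant `D` (set to 1.0 here; some programs use 1.7) are assumptions made for illustration and do not describe the settings actually used in the analysis.

```python
import math


def p_correct_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability
    a: discrimination, b: difficulty, c: pseudo-guessing (lower asymptote)
    D: scaling constant (assumed 1.0 here; 1.7 is also common)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# When ability equals difficulty (theta == b), the logistic term is 0.5,
# so the probability lies halfway between the guessing floor c and 1.
# c = 0.25 is an illustrative chance level for a four-option item.
print(p_correct_3pl(theta=0.0, a=1.0, b=0.0, c=0.25))  # 0.625
```

For a very low-ability examinee the probability approaches c rather than zero, which is why the model suits multiple-choice items where guessing is possible.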

CHAPTER FOUR
RESULTS

In this chapter, the results of the vocabulary test, examined by means of the 3PL model of Latent Trait Theory, are presented. The chapter begins with overall fit statistics. Then, item characteristic curves and the values of the three parameters are reported. In the final section, the interaction of the three parameters across frequency bands is presented.

Goodness of Fit Statistics
The overall model fit (χ²/df) was estimated to be 2.16. Since a value less than 3 indicated good fit, the vocabulary test as a whole was judged to fit the 3PL LTT model. In other words, the test in question demonstrated unidimensionality, the quality of measuring only one single latent ability, which in this case was vocabulary size. As for individual items, the majority showed good fit except for four items in Level 6 (items 76, 82, 86 and 87), as illustrated in the following.

Misfit items:
76. secret
   (A) equivalent (B) analytical (C) confidential (D) spectacular
82. spread out
   (A) boost (B) throb (C) unfold (D) seduce
86. cause something to happen
   (A) oppress (B) depict (C) specify (D) render
87. reject something with a group's agreement
   (A) quench (B) boycott (C) dispatch (D) prosecute

Test Information Function

Figure 6. Test Information Function.

Figure 7. The Conditional Standard Error of Measurement (CSEM) Function.

The Test Information Function was a function of Latent Trait Theory to calculate the amount of information the test yielded across abilities. Theta denoted the ability of test-takers. As revealed in Figure 6, the maximum information was 47.928 at theta = 0.050. There was a steep rise from theta -1 to theta 0 and a mild downward slope from theta 0.050 to theta 2. Along with the Standard Error of Measurement across the ability

continuum illustrated in Figure 7, it could be observed that the test in question provided useful and valid information on learners whose abilities ranged from theta -2 to theta 2.5 and that it provided the most information when the theta value fell between 0 and 2. Based on the unimodal distribution of test responses, the feature of unidimensionality could be captured. The fact that the distribution of learner ability was unimodal rather than bimodal indicated that the test measured only one single latent ability, which supported the vocabulary size test in question as a valid and reliable test.

Item Characteristic Curves and Item Parameter Estimates
Item Characteristic Curves (ICCs) demonstrated the relationship between ability (θ) and the probability of answering an item correctly. The shape of each ICC depended on parameters a, b, and c, which denoted discrimination, difficulty and pseudo-guessing respectively. Parameter b was represented by the location of the ICC on the ability scale. The closer the curve was to the right end of the ability scale, the more difficult the item was. Conversely, the closer the curve was to the left end, the easier the item was. For instance, the most difficult items (77, 56 and 88) lay at the right end of the scale while the easiest ones (2, 3 and 4) lay at the left end, as illustrated in Figure 8.
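Returning to the Test Information Function and the CSEM reported above, the two are linked by the standard relation CSEM(θ) = 1 / sqrt(I(θ)), where I(θ) is the sum of the item information values at θ. The sketch below assumes the conventional 3PL item information formula and a scaling constant D = 1.0; the function names are hypothetical and the code is an illustration, not the computation performed by the calibration software.

```python
import math


def p_3pl(theta, a, b, c, D=1.0):
    """3PL probability of a correct response (D = 1.0 is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information_3pl(theta, a, b, c, D=1.0):
    """Conventional 3PL item information at ability theta."""
    p = p_3pl(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def total_information(theta, items, D=1.0):
    """Test information: sum of item information over (a, b, c) triples."""
    return sum(item_information_3pl(theta, a, b, c, D) for a, b, c in items)


def csem(information):
    """Conditional standard error of measurement for a given information."""
    return 1.0 / math.sqrt(information)


# The reported maximum information of 47.928 (at theta = 0.050)
# corresponds to the smallest standard error on the ability scale.
print(round(csem(47.928), 3))  # 0.144
```

Because the standard error is the reciprocal square root of information, the region where Figure 6 peaks is the same region where Figure 7 reaches its minimum.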

Figure 8. Item Characteristic Curves for items 2 and 88.

Figure 9. Item Characteristic Curves for items 23 and 165.

Figure 10. Item Characteristic Curves for items 77 and 28.

Parameter a was indicated by the slope of the curve. The steeper the curve was, the more discriminating power the item had. As could be seen in Figure 9, item 165, which possessed the most discriminating power, manifested a very steep slope,

while item 23, which had the least discriminating power, showed a much flatter curve. Parameter c was demonstrated by the lower asymptote, or the lowest level of the ICC. As revealed in Figure 10, item 28 had a guessing value of 0.385 while item 77 had a guessing value of 0.138. The statistics showed that item 77 was much less susceptible to the influence of guessing. The ICCs for all the test items were presented in Appendix B.

                      Maximum   Minimum   Mean
Difficulty (b)         2.871    -2.63     0.058
Discrimination (a)     3.068     0.294    1.242
Guessing (c)           0.385     0.138    0.272

Table 2. The Descriptive Statistics for the Parameters a, b and c

Figure 11. The Frequency Distribution of the b Parameters

As shown in Table 2, which was derived from the summary of parameter

statistics for all the items presented in Appendix C, the difficulty values of the test in question ranged from -2.63 to 2.871. Since the difficulty value typically varied from -3 or -4 to +3 or +4, the performance of the items fitted the 3PL LTT model well. The distribution of difficulty across items was demonstrated in Figure 11.

Figure 12. The Frequency Distribution of the a Parameters

As for parameter a, discrimination values fell between 0.294 and 3.068. While the item difficulty parameter was defined on a scale from negative infinity to positive infinity, the acceptable range for parameter a was (0.6, 2.5) in practice. Items with a discrimination value lower than 0 should be eliminated from the test, since a negative discrimination value indicated that a subject with higher ability had a lower chance of getting the item correct (Brown & Hudson, 2002). As shown in Figure 12, most items remained in the acceptable range. Two items were estimated to demonstrate a higher discrimination value than usual: items 107 and 165 in Level 6.
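The screening rule described above can be sketched as a small helper. The function name `classify_discrimination` and the category labels are hypothetical, and treating the endpoints 0.6 and 2.5 as acceptable is an assumption, since the text does not say how boundary values were handled. The three input values are the minimum, mean and maximum of parameter a from Table 2.

```python
def classify_discrimination(a, low=0.6, high=2.5):
    """Label a discrimination value against the practical range (0.6, 2.5).

    Negative values are flagged for elimination; endpoints are treated
    as acceptable (an assumption, not stated in the source).
    """
    if a < 0:
        return "eliminate"
    if a < low:
        return "below range"
    if a > high:
        return "above range"
    return "acceptable"


# Summary values of parameter a taken from Table 2.
table2_a = {"minimum": 0.294, "mean": 1.242, "maximum": 3.068}
for label, a in table2_a.items():
    print(label, a, classify_discrimination(a))
```

Under these assumptions the minimum falls below the practical range, the mean falls inside it, and the maximum lies above it, consistent with the observation that a small number of items showed unusually high discrimination.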

Items are presented as follows in order of discrimination power. Items with higher discrimination values, as opposed to the ones within the acceptable range, could not only discriminate well within a small range of ability but also yield more information on the latent trait than their counterparts.

Top twelve most discriminating items:
165. a person who is paid to work for somebody
   (A) youngster (B) inspector (C) caterpillar (D) employee
107. a group of powerful, rich, smart people in a society
   (A) bias (B) freak (C) chaos (D) elite
174. the act of stopping someone from speaking
   (A) recreation (B) frustration (C) convention (D) interruption
148. build
   (A) prosper (B) identify (C) construct (D) transform
145. make or produce
   (A) manufacture (B) discipline (C) industrialize (D) eliminate
152. clothes that need washing
   (A) laundry (B) orphan (C) vehicle (D) riddle
171. an event when someone is trying to win something
   (A) enthusiasm (B) microscope (C) tendency (D) competition
135. make air, water, or land too dirty to use
   (A) pollute (B) frustrate (C) deposit (D) observe
108. the need to take a drug without being able to stop
   (A) pesticide (B) diplomacy (C) validity (D) addiction
116. a range of colors and sounds
   (A) spectrum (B) turmoil (C) obligation (D) hygiene
35. each of the 60 parts of an hour
   (A) dream (B) supper (C) popcorn (D) minute
155. something difficult or heavy
   (A) shadow (B) burden (C) regret (D) credit

Top twelve least discriminating items: