Master's Thesis
Graduate Institute of English
National Taiwan Normal University

Constructing and Validating an English Vocabulary Knowledge Test

Advisor: Dr. Hsi-Chin Chu
Graduate Student: Yu-Lin Hsiao

June 2018

Chinese Abstract

Vocabulary size occupies a pivotal position in language learning because it is closely related to every aspect of it. To learn new vocabulary efficiently, students must first know which level their vocabulary size has reached, so that they can focus their learning on materials that match their proficiency. English education in Taiwan's schools currently relies largely on the 6,840-word reference wordlist published by the College Entrance Examination Center, but some of its words have proven too difficult for today's primary and secondary school learners in Taiwan. This study therefore adopted the latest word frequency list based on the COCA corpus and selected the most frequent 6,050 words from it to construct a vocabulary test. The test comprises two forms, A and B, with 180 items in total, each a multiple-choice question with four options. Unlike the word-sampling approach of earlier vocabulary tests, this test selected only words near the margins of the six frequency levels, with the aim of interpreting a test-taker's vocabulary size in terms of whether a given level has been passed. Item quality was examined along the three dimensions of Latent Trait Theory: difficulty, discrimination, and guessing. The participants were 1,198 senior high school and university students in the Taipei area. The results show that the test exhibits good fit on all three dimensions, indicating a test of high validity and reliability and confirming that the multiple-choice format suits vocabulary testing. Word difficulty is significantly related to word frequency: the more frequent a word, the lower its difficulty and the more common the word. The rise in difficulty from Levels Four to Six is far smaller than that from Levels One to Three, suggesting that Level Three is an important watershed in vocabulary learning. Learners who have mastered the words of Levels One to Three can apply their existing word schemata to learn more new words. Moreover, the item difficulty analysis shows that affixes in stems and options can affect item difficulty, so item writers need to pay close attention to this aspect.

Keywords: Latent Trait Theory, English word frequency, item development, English vocabulary size test, word difficulty

Abstract

Vocabulary size has been considered crucial in language learning because of its strong correlation with various aspects of language acquisition. On account of its decisive status in language learning, the issue of how to learn vocabulary effectively has attracted significant attention in second language acquisition studies. Learners therefore have to know what vocabulary size they should work toward so that they can focus on suitable learning materials. Currently, most schools in Taiwan have adopted the official CEEC wordlist, but it has been found that some of its words are too difficult for language learners. In other words, some low-frequency words have been placed in Level Two or Three, which hinders learners from making progress in language learning. Thus, the present study adopted the latest word frequency list from COCA and chose the most frequent 6,050 words as the sample pool from which to develop a vocabulary test. The test was divided into two forms, Form A and Form B, in a four-option multiple-choice format. Unlike traditional vocabulary size tests, the present study chose target words only from near each frequency band boundary, in order to interpret learners' vocabulary sizes in terms of passing a certain frequency level. The quality of the test was checked through the difficulty, discrimination, and guessing parameters of Latent Trait Theory. The participants of the present study consisted of high school students and senior students at a national university. According to the results, the test presents good overall model fit, which supports its validity and reliability and further indicates that the multiple-choice format is suitable for a test of vocabulary size. Word difficulty has much to do with word frequency: the more frequently a word occurs, the easier it is. Considering the slope of difficulty, the slope from Levels Four to Six is less steep than that from Levels One to Three, which indicates that Level Three is the cutoff point for vocabulary learning. If language learners can master the words of the first three levels, they can readily use their pre-existing word schemata to acquire more new words. Moreover, the influence of telltale morphemes should be taken into consideration when analyzing the difficulty of the test: if stems contain hints of corresponding word parts, caution must be taken when designing the distractors.

Keywords: Latent Trait Theory, word frequency, word difficulty, English vocabulary size test, test development

ACKNOWLEDGEMENT

This master's thesis has finally been completed with guidance and support from many people, including my professors, classmates, friends, and family members. I have received so much help along the way that I want to express my deep thankfulness to them. I would like to express my heartfelt gratitude to Professor Wen-Da Tseng and Professor Hsi-Chin Chu, my research advisors, for their patient guidance, enthusiastic encouragement, and useful critiques of this research. I could never have accomplished this thesis without their constant and selfless assistance in keeping my progress on schedule. Whenever I felt discouraged, they were always by my side and guided me through the difficulties I faced. My grateful thanks are also extended to my committee member, Dr. Yeu-Ting Liu, for his constructive advice and kind patience with me. Because of his valuable feedback, this thesis became more comprehensive, and more details concerning the research were taken into consideration. I would also like to show my appreciation to my friends, Joanna Tu, Mia Chen, and Yi-Ting Huang, who helped me correct grammar errors and discussed unclear parts of my thesis. It was kind of them to spend their precious free time giving me suggestions and talking with me when I lost faith in myself. Finally, I owe my sincere gratitude to my beloved family. It is their understanding and love that gave me the strength to take on this challenge and walk through the whole journey confidently and without fear.

TABLE OF CONTENTS

CHINESE ABSTRACT
ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER ONE. INTRODUCTION
    Background and Motivation
    Research Questions of the Study
    Significance of the Study
CHAPTER TWO. LITERATURE REVIEW
    The Importance of Vocabulary
    Ways to Define Word Units
    How to Measure Vocabulary
    Measurement Theories
    Types of Vocabulary Tests
CHAPTER THREE. METHODOLOGY
    Test Construction
    Participants and Test Administration
    Scoring and Coding
    Data Analysis
CHAPTER FOUR. RESULTS
    Fit Statistics
    Item Characteristic Curves (ICCs)
    Values of Item Parameters
CHAPTER FIVE. DISCUSSION
    Test Quality
    Item Difficulty and Frequency Bands
CHAPTER SIX. CONCLUSION
    Major Findings
    Pedagogical Implications
    Limitations of the Research
    Directions for Future Research
REFERENCES
Appendix A. The Vocabulary Test (Form A)
Appendix B. The Vocabulary Test (Form B)
Appendix C
Appendix D
Appendix E. Item Parameter Estimates of the 180 Test Items

LIST OF TABLES

Table 1. Nation's aspects involved in knowing a word. Adopted from Nation (2001)
Table 2. The Descriptive Statistics for the Parameters a, b, and c
Table 3. The Descriptive Statistics for the Item Difficulty in Each Frequency Level
Table 4. Descriptive Statistics for the Three Parameters of the Items with Low Mean Item Difficulty
Table 5. Form B: Items with Low Item Difficulty in Level 6

LIST OF FIGURES

Figure 1. A typical profile
Figure 2. Frequency profile for Japanese learners of EFL
Figure 3. Graphical Depiction of a One-Parameter Logistic Model
Figure 4. Graphical Depiction of a Two-Parameter Logistic Model
Figure 5. Graphical Depiction of a Three-Parameter Logistic Model
Figure 6. Test Information Function
Figure 7. Conditional Standard Error of Measurement (CSEM) Function
Figure 8. Item Characteristic Curves for Items 1 and 86
Figure 9. Item Characteristic Curves for Items 26 and 116
Figure 10. Item Characteristic Curves for Items 47 and 86
Figure 11. The Frequency Distribution of the a Parameter
Figure 12. The Frequency Distribution of the b Parameter
Figure 13. The Frequency Distribution of the c Parameter
Figure 14. The Relationship between Word Frequency Level and Item Difficulty (for Items 1–90)
Figure 15. The Relationship between Word Frequency Level and Item Difficulty (for Items 91–180)

CHAPTER ONE. INTRODUCTION

Background and Motivation

Vocabulary size has been considered crucial in language learning because of its strong correlation with various aspects of language acquisition, including reading comprehension (Bernhardt & Kamil, 1995; Hsueh-Chao & Nation, 2000; Laufer, 2010; Nation, 2006; Qian, 2002; Ulijn & Strother, 1990), listening (Nation, 2006; Stæhr, 2009), speaking (Milton, 2009), and writing ability (Stæhr, 2008). Due to its close association with these four skills, vocabulary size undoubtedly plays a significant role in a wide variety of overall proficiency tests, from the TOEFL and IELTS to the UCLES examinations (Milton, 2009). On account of its decisive status in language learning, the issue of how to learn vocabulary effectively has attracted significant attention. Among the effective strategies for acquiring vocabulary, extensive reading has long been thought of as the most indispensable (Nagy, Herman, & Anderson, 1985; Nagy et al., 1989). Language learners can infer the meaning of an unknown word from the context and thus enlarge their vocabulary size. However, numerous studies have established that one of the important prerequisites for Second Language (L2) or English as a Foreign Language (EFL) learners to learn new vocabulary from reading without help is a large vocabulary size. Nagy et al. (1985) suggested that when L2 learners are reading, there is a lexical threshold they must have crossed to be able to use the context to infer or gain the meanings of unknown words. Laufer (1982) argued that L2 vocabulary size plays a major role in determining whether a strategy can be successfully applied to reading by EFL learners. Lee and Schallert (1997) proposed that First Language (L1) reading strategies could benefit L2 reading only when a certain threshold of lexical competence has been achieved. Accordingly, there remains the question of the specific threshold of vocabulary size for further language advancement (Milton, 2009).

To gain more vocabulary, repetition is key (Hilton, 2008). Effective vocabulary learning requires learners to keep targeting specific words (Milton, 2009). Nation (1990) stated that vocabulary-learning strategies should be introduced after learners have acquired the high-frequency words of Levels One to Three. Besides these high-frequency words, low-frequency words from Levels Five to Seven have their own influential status and thus need instruction because of their contribution to reading comprehension (Laufer, 2010). Hence, the specific threshold of word frequency has generated considerable recent research interest, which leads in turn to the question of what a learner's current vocabulary band might be, since instructors need to know this to determine which level of words to focus on. As a consequence, both students and teachers urgently need a valid instrument to measure and assess students' vocabulary size and thereby facilitate growth in word knowledge without wasted effort.

Vocabulary teaching makes up a great percentage of English class time in Taiwan's senior high schools. The amount of time spent on vocabulary varies from teacher to teacher; normally, vocabulary instruction takes two to five class periods per lesson. As supplementary materials, teachers usually use vocabulary books based on the list issued by the College Entrance Examination Center (CEEC, 2000), which arranges high-frequency words in frequency levels from one to six. Although engaged in what should be sufficient vocabulary learning, students actually lack the success in vocabulary acquisition these books promise, perhaps because they apply the wrong strategies for learning vocabulary. For one thing, similar words confuse some students at times, and partial knowledge of a word might hinder their progress in acquiring productive vocabulary. For another, students are prone to spend more time on low-frequency words, since these are more difficult to remember. Although this does no harm to advanced learners, low achievers with insufficient vocabulary knowledge will soon lose interest, because they will hardly ever use such low-frequency words. Given the problems mentioned above, Nation (1990) suggested that only students equipped with high-frequency words should be trained in vocabulary strategies. Cheng (2014) created a vocabulary size test based on the CEEC word list and concluded that, for vocabulary expansion, the first three levels of the CEEC word list are the threshold students must cross before word strategies can successfully be applied to acquire new words. However, Lin (2006) stated that comparisons of the amount of new vocabulary between senior high school and junior high school textbooks suggest that the required new vocabulary might be far beyond students' present vocabulary size; moreover, the density of new words was too high, and the frequency with which new words recur in the textbooks should be increased. As a consequence, it is possible that the CEEC word list applied by textbooks is problematic. The specific issue of the appropriate word frequency band has long been discussed and thus deserves our attention.

To determine this word frequency band for students in Taiwan, we need a valid test of vocabulary size based on a well-constructed vocabulary list. Many kinds of lists have been created for the purpose of measuring vocabulary size, of which the Vocabulary Levels Test (VLT) and the Vocabulary Size Test (VST) are the most common among researchers and instructors. Although the CEEC word list has been treated as the authoritative word list and applied in many textbooks, because of its problem of recondite vocabulary, an alternative approach to assessing students' vocabulary size is needed. Thus, the present study aims to fill this void by developing a valid and reliable vocabulary size test based on the new word frequency list from COCA, to help English learners in Taiwan diagnose their word band so that instructors can give them better assistance in vocabulary learning.

Research Questions of the Study

The aims of the present work are as follows:
1. How well do test items in a multiple-choice vocabulary test format fit the Latent Trait Theory model?
2. What are the reliability and validity of the newly developed vocabulary test, based on the new word frequency list from COCA and examined with Item Response Theory?
3. How well do the selected margin words of the test items reflect differences in item difficulty across word frequency levels?

Significance of the Study

The present study bears pedagogical and theoretical significance. In the pedagogical field, it can be helpful for both instructors and learners as a placement test or diagnostic assessment. Given the large size of classes in Taiwan, which average around 40 students, teachers find it hard to assess every student's learning progress. A reliable vocabulary test can be of great help to teachers in determining an individual student's learning development, especially word acquisition. On the other hand, if learners know their vocabulary size, they can spend more time on the appropriate high-frequency words that they need instead of wasting their efforts on low-frequency words. Furthermore, this vocabulary size test can serve as a timely and trustworthy measurement of learners' proficiency level, because of the strong relation between vocabulary size development and proficiency growth. In its theoretical aspect, by examining the test items in the six frequency bands, we can find out how items perform at different frequency levels. The present study can thus provide valuable knowledge across frequency bands and holds pedagogical implications for teachers constructing better tests.

CHAPTER TWO. LITERATURE REVIEW

The present study aims to develop a reliable and valid vocabulary test. This chapter consists of three major sections. The first deals with vital issues of vocabulary knowledge and then focuses on the CEEC wordlist and Mark Davies' wordlist. The second reviews the paramount standards for an exemplary vocabulary test, followed by a discussion of measurement theory. The final section of the literature review elaborates on three major vocabulary measures.

The Importance of Vocabulary

In this section, vocabulary knowledge is separated into its different aspects. We start with the definition of receptive/productive vocabulary, then focus on the distinction between the breadth and depth of vocabulary and the limitations of this opposition. Later, a review of vocabulary size will follow, with an examination of how vocabulary size and frequency are related.

Ways to Define Vocabulary

It is hard for people to acquire every word in a language and apply each one well. A word can be categorized as receptive vocabulary when it is gained or understood from a reading or listening context. On the other hand, productive vocabulary consists of the words that can be drawn from memory and used when a learner speaks or writes. Nonetheless, the complicated nature of vocabulary knowledge has not been well elaborated. Melka (1982) suggested that we should take receptive and productive vocabulary as a continuum rather than two totally opposed ideas. However, in terms of word usage, Corson (1995) considered receptive and productive vocabulary to be passive and active vocabulary, treated independently of each other. He assumed that passive vocabulary is composed of four classes of words: words that people avoid using; words with low frequency that are not readily available for use; words that were not fully acquired; and active vocabulary. Nation (2001) created a model that demonstrates what is included in receptive and productive vocabulary, as shown below.

Table 1. Nation's aspects involved in knowing a word. Adopted from Nation (2001, p. 27). [Table not reproduced.]

As shown in Table 1, vocabulary knowledge comprises three parts: form, meaning, and use. Browsing the table, one notices a scale-like progression, and closer observation reveals the ideas that Corson adopted. In empirical studies, given the variety of definitions and classifications of the receptive/productive distinction, researchers usually confront difficulties in analysis. Test format is often a factor that leads to questionable reliability. For example, a multiple-choice recognition test used to estimate learners' receptive knowledge was called into question because of the possibility of guessing (Webb, 2008). On the other hand, it might be suspected that a recall test originally used to test productive knowledge merely tested receptive knowledge, since learners could still recognize some partially pronounced words (Melka, 1997; Morton, 1979). Accordingly, in order to obtain reliable and valid results, we should clarify the aspects of knowledge and ability being measured, particularly if we wish to measure the difference between productive and receptive knowledge (Read, 2000). Although the definitions and measurements discussed above vary, in any case learners' productive vocabularies are smaller than their receptive ones, as has long been known from many studies (Fan, 2000; Laufer, 1998; Laufer & Paribakht, 1998; Morgan & Oberdeck, 1930; Waring, 1997). Webb (2008) proposed that there is a close relationship between receptive and productive vocabulary size: the larger a learner's receptive vocabulary, the larger the productive vocabulary is likely to be. Because of this close correlation, receptive vocabulary tests can play a leading role in predicting productive vocabulary sizes.

Anderson and Freebody (1981) suggested that the distinction between the breadth and depth of vocabulary knowledge is a useful idea for distinguishing words and explaining both how many words a learner knows and the quality of the words learned. Milton (2009) proposed that measuring the breadth and depth of vocabulary in different ways can yield divergent results. Take breadth tests as an example: a checklist test that counts how many items receive a check for the option "I have seen this word before" provides an estimate covering a much larger range of known words than a breadth test that asks the test-taker to translate words from L2 into L1. Vocabulary depth is a more complicated concept of how well a word is known, and it can be operationalized in different ways. The developmental scale is the approach usually adopted. Paribakht and Wesche (1997) designed the Vocabulary Knowledge Scale, of which an example is as follows:

I. I don't remember having seen this word before.
II. I have seen this word before, but I don't know what it means.
III. I have seen this word before, and I think it means ______. (synonym or translation)
IV. I know this word. It means ______.
V. I can use this word in a sentence: ______. (Write a sentence.) (If you do this section, please also do Section IV.)

Nation (1990) pointed out another approach, which conceives of vocabulary knowledge as having eight components involving aspects of word knowledge: the written form, spoken form, frequency, collocations, appropriateness, meaning, grammatical patterns, and associations. From the studies mentioned above, it is evident that vocabulary depth cannot be captured from one single aspect. To master a word, learners need diverse exposure to vocabulary (Schmitt, 2010; Webb, 2005, 2007). Furthermore, in order to avoid measurement errors, a test designer needs to take a close look at the developmental stage or the specific word knowledge targeted whenever constructing a vocabulary test.

Ways to Define Word Units

A wide variety of studies and efforts have been made to measure learners' vocabulary size, which can be seen as vocabulary breadth (Read, 2000). To dig more deeply into the matter of vocabulary knowledge, a few technical terms should be introduced. First of all, the number of running words, also called tokens, is the total number of words that appear in a text. For instance, a given word will be counted three times rather than once if it shows up in a text three times. The number of types is another important idea: the number of distinct words occurring in a text. Unlike tokens, a given word is counted once no matter how many times it shows up in a text. Different fields use different units for counting words. For vocabulary size, lemma and word family are the key terms. A lemma is a word regarded as its root form along with its inflected forms (third person -s, -ed, -ing; plural -s; possessive -s; comparative -er; superlative -est). Take the word "do" as an example: the lemma "do" consists of do together with does, doing, did, and done. Furthermore, a word family is a group of related words including a base word, its inflected forms, and transparent derivations formed from the same word (Bertram, Baayen, & Schreuder, 2000; Nagy et al., 1989). The following example can better explain the idea of transparent derivation: the word family of the base word "complete" includes completes, completing, completed, completely, completion, and completeness (Read, 2000). However, it is sometimes hard to decide whether to include a word in a word family during categorization, because of the possibility of subjective judgments. In addition, it is well established that native speakers normally have larger word families than second language learners (Bauer & Nation, 1993; Nation, 2006; Read, 2000).
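To make these counting units concrete, consider the following minimal sketch (an illustration, not part of the thesis instrument); the toy suffix-stripping rule is a deliberately crude stand-in for a real lemmatizer or a published lemma list:

```python
import re

def tokenize(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def toy_lemma(token):
    """Very rough lemma guess: strip common inflectional endings.
    A real study would use a lemmatizer or a published lemma list."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "He does the work, doing it well; the works he did are done."
tokens = tokenize(text)
types = set(tokens)
lemmas = {toy_lemma(t) for t in tokens}

print(len(tokens))  # tokens: every running word is counted
print(len(types))   # types: each distinct word form is counted once
print(len(lemmas))  # lemmas (approximate): inflected forms collapsed
```

Counting by lemma gives a smaller figure than counting by type, and counting by word family would give a smaller figure still, which is why the counting unit must be specified before any vocabulary size estimate is meaningful.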

Among the several ways of counting words mentioned above, it is clear that if we estimate vocabulary size based on lemmas, we will obtain a larger number than if we count word families. Therefore, specifying these key terms is important before moving on to the next core idea: vocabulary size and text coverage.

The Relation between Vocabulary Size and Text Coverage

When it comes to vocabulary size, it is important to know the total number of English words. There are about 54,000 word families in Webster's Third International Dictionary (Goulden, Nation, & Read, 1990). In general, English native speakers who have graduated from college know about 20,000 word families (Goulden et al., 1990), whereas second language learners with a proficient level of English are normally estimated to know 8,000 to 9,000 word families. Given a certain vocabulary size, if a learner needs to comprehend a written text, it is estimated that the learner should understand approximately 95% of the words in the text to capture the correct meaning of an unknown word from the context (Laufer & Paribakht, 1998). Furthermore, to grasp the full idea of unsimplified authentic material without help, 98% of the words should be known (Carver, 1994; Hsueh-Chao & Nation, 2000). Take Alice in Wonderland, for example: Nation (1990) proposed that a learner should have around 5,000 word families to understand the content. For other materials, such as newspapers or novels like The Great Gatsby, roughly 8,000 to 9,000 words are gauged to be needed to ensure coverage of the text (Nation, 2006). Similarly, for spoken language, Nation (2006) mentioned that comprehending 98.08% of the movie Shrek requires 6,000 to 7,000 word families. As the studies reviewed make clear, vocabulary size plays an important role in reaching 95 to 98 percent coverage of different kinds of text. Hence, we should take word frequency into consideration when giving learners reading or listening materials. Furthermore, studies show that vocabulary size has a significant influence not only on text coverage but also on learners' proficiency levels. Meara (1996) mentioned that, thanks to broad vocabulary coverage, learners are able to master different kinds of language skills. Studies have also found that the more L2 learners' vocabulary size increases, the higher the proficiency level they can reach (Zareva, Schwanenflugel, & Nikolova, 2005). In addition, Read (1988) suggested that beyond indicating proficiency level, vocabulary size could be used to detect learners' frequency threshold, and thus researchers should focus on that frequency band. With this in mind, learners could choose or change learning materials to suit their current vocabulary size. Although vocabulary size has a powerful impact on language learning, there are still other factors that can influence learners' progress in language proficiency. We should consider word knowledge essential to improvement in language learning, and as knowledge that ought to function together with other language abilities (Read, 2000).

The Importance of Frequency

Word frequency is one of the leading factors influencing word acquisition, language processing, and vocabulary use; thus it plays an extremely influential role in lexis (Schmitt, 2010). The more often a word occurs, the more easily learners acquire it; hence the difficulty of a word can be inferred from its frequency (Mackey, 1967; McCarthy, 1990; Palmer, 1968). This assumption is supported by Meara (1992), who described a typical L2 learner vocabulary frequency profile and further specified that the trend should be a slope falling from left to right, as shown in Figure 1.

Figure 1. A typical profile. Adapted from Meara (1992, p. 6).

To test this assumption, Milton (2006) used X-Lex (Meara & Milton, 2003) to assess 227 Greek EFL learners on the words they knew. A comparable distribution was found: a curve falling from left to right. Therefore, vocabulary size is taken to be highly related to frequency bands.

Figure 2. Frequency profile for Japanese learners of EFL. Adapted from Aizawa (2006).

However, according to the results of a vocabulary test, Aizawa (2006) argued that the effect of frequency is not noticeable above a certain frequency band. This is demonstrated in Figure 2, which shows a similar trend of steady decrease until Frequency Level Four; from Level Five on, the vocabulary size by band starts to fluctuate. Although Figure 2 does not as a whole show the vocabulary profile that Meara (1992) described, it is worth noting that the two share the same trend from Levels One to Four. As Figures 1 and 2 show, frequency has a great impact on EFL learners' learning, and it is the paramount factor influencing difficulty. Milton and Daller (2007) further conducted a study of 106 British learners of French, aiming to evaluate the relation of word difficulty to frequency, word length, and degree of cognateness. The results showed that learnability was highly correlated with frequency but had little relation to the other two factors. As the research above shows, frequency is of central importance to vocabulary acquisition. This principle can accordingly be applied to vocabulary measurement and to current teaching practice. To gauge learners' progress, a frequency-based vocabulary test can be a useful tool for identifying crucial information about learners. For the purposes of pedagogy, frequency can help a test designer set the priority of the target words to be learned. Nation and Waring (1997) suggested that the most frequent 3,000 words should be given higher priority in learning than less frequent words. Once learners have no difficulty understanding original texts without assistance, they are well placed to learn new words, whereas low-frequency words should not be a focus of learning, since they would hinder comprehension of the text. Instructors will then have more time to elaborate on and practice skills like word inference to enlarge learners' vocabulary size. As a consequence, a good word frequency list is an urgent need for both instructors and language learners, since it offers great help to instructors who want to ensure learners can comprehend most of the texts they are given after reaching a certain level of high-frequency words. In the next section, the dominant frequency wordlist in Taiwan will be introduced.

The Origin of the CEEC Word List

In Taiwan, the CEEC wordlist (Jeng et al., 2002), sponsored by the College Entrance Examination Center, is the official wordlist and the standard reference, especially for seventh to twelfth graders, for two influential entrance exams: the Scholastic Aptitude English Test (SAET) and the Department Required English Test (DRET). Compared with the DRET, the SAET is relatively simple; therefore, the words from Level One to Level Four are mainly for the SAET, whereas the remaining words are supplementary words intended for students who want to take the DRET. Each level is made up of 1,080 words. The CEEC wordlist is compiled from three sources (Jeng et al., 2002). The first includes five sets of American elementary school readers. The second consists of nine sets of published English textbooks for Taiwan's senior high and junior high students. The last includes 21 English word lists compiled in Asian and Western countries, including Taiwan, China, Japan, Britain, America, and Canada (Jeng et al., 2002). The editors of the CEEC wordlist collected approximately 16,000 words from the sources mentioned above. According to word frequency, the editors picked words close to teenagers' real life and classified them into levels from one to six. The words in the list are defined as word families consisting of a base word, its inflected forms, and transparent derivations. For instance, the derivational forms are words that end in -ness or -ment, or start with re-, in-, ir-, and non-, excluding those ending in -ing and -ed, which are inflections. However, derivational forms ending in -able, -ion, and -ation are considered different words. For example, manageable and handful are regarded as different words because of the meaning change due to derivation (Cheng, 2014).
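As a concrete reading of this convention, the following is a hedged sketch added for illustration; the actual CEEC classification involved editorial judgment beyond these affix rules, so the rule set below is a simplifying assumption:

```python
# Affixes that, under the convention described above, keep a derived
# form inside its base word's family (a simplified reading).
FAMILY_PREFIXES = ("re", "in", "ir", "non")
FAMILY_SUFFIXES = ("ness", "ment")
# Derivations with these suffixes count as different words.
SEPARATE_SUFFIXES = ("able", "ion", "ation")

def same_family(base: str, derived: str) -> bool:
    """Rough check of whether `derived` stays in `base`'s word family
    under the CEEC-style affix rule sketched in the text."""
    if any(derived.endswith(s) for s in SEPARATE_SUFFIXES):
        return False  # e.g., "manageable" is treated as a new word
    if any(derived == p + base for p in FAMILY_PREFIXES):
        return True
    return any(derived == base + s for s in FAMILY_SUFFIXES)

print(same_family("kind", "kindness"))      # True
print(same_family("manage", "manageable"))  # False
```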

Though the CEEC wordlist is seemingly well developed and has been applied in many English textbooks for junior and senior high schools in Taiwan, few studies have examined the word frequency of each level, let alone used other reliable wordlists to reexamine the frequency of the words in the CEEC wordlist. Some of the words are too difficult and seldom used. For example, the word "wreath" is placed in Level 5 of the CEEC wordlist, while by actual frequency of occurrence on the new frequency list (Gardner & Davies, 2013) it falls at rank 10,066, which is about Level 10. Conversely, the word "elite," placed in Level 6 of the CEEC wordlist as if it were low frequency, is at Level 3 in the new frequency list (Gardner & Davies, 2013). Because of such cases, it is reasonable to doubt that the CEEC wordlist is entirely suitable for high school students in Taiwan, since learners might miss important words or spend too much time on low-frequency words. Lin (2006) indicates that there is a gap in vocabulary size between English textbooks for junior high school and senior high school students: there are too many new words in the English textbooks for senior high school students, and book editors should increase the recurrence of new words. The reason for this might be the wrong classification of words. Therefore, a reliable wordlist is urgently needed by both educators and learners.

The New Word Frequency List

The new word frequency list (WFL) (Gardner & Davies, 2013) is based on more than 120 million words from the Corpus of Contemporary American English (COCA), of which 86 million words are from journals and 26 million from magazines. They are all texts in academic genres, and the most recent academic texts in the database were collected in 2015. The list covers nine subgenres: history; education; law and political science; social science; humanities; science and technology; medicine and health; business and finance; and philosophy. The philosophy genre also concerns religion and psychology. The frequency of a word in each of the different subgenres can be looked up. The words on the list are grouped by lemma (e.g., [laugh] = {laugh, laughed, laughing}), which makes the list easier to understand. If a word has several parts of speech, the frequencies show which one is used more often. Besides this, the words are color-coded so that a word can be identified as "technical" (appearing only in certain specific genres) or as a "general academic word." The words are organized by frequency so that educators or learners can concentrate on the more important words that appear in daily life (Gardner & Davies, 2013).

How to Measure Vocabulary

Criteria for Constructing a Good Vocabulary Test

Both validity and reliability are fundamental criteria that the test constructor should bear in mind (Heaton, 1990). Bachman and Palmer (1996) proposed that usefulness be seen as the most important quality of a test. Test takers' performance on the items and the results are the major concerns of test quality, and the evaluation of test usefulness lies mostly in reliability and construct validity (Bachman, 2004). Hughes (2003) presented the fundamental rule that a test should be gauged by the following ideas: first, it should measure accurately the abilities people are interested in; second, the test itself should have positive effects on pedagogy; third, the test should be economical in time and money.

For test designers, the intended purpose of the test should be consistent with the abilities they seek to measure. The following sections review reliability and validity.

The Importance of Reliability

Bachman and Palmer (1996) proposed that reliability means consistency of measurement: no matter how many times a subject takes the test, the same score should be obtained. In addition to scores, the rankings of a group of test takers on two identical tests should be the same as well. Owing to unpredictable factors such as test method facets or attributes of the test takers, reliable test results are hard to obtain (Bachman, 1990). For instance, an overlong test leads to test takers' fatigue and failure to finish the whole test, which influences the reliability of the test. Moreover, test-takers' emotions and physical condition are unavoidable factors that might affect their performance, and unpredictable variables in the surroundings, such as noise or temperature, might also depress test-takers' scores. The factors mentioned above contribute to measurement errors, which test designers have to minimize to obtain objective results. The fluctuation of test results should be kept as small as possible to ensure reliability (Hughes, 2003). Since assuring the reliability of a test is of paramount importance, several means of gauging reliability will be reviewed. Based on CTT, the test-retest method is one way to verify reliability: the same test is administered twice, allowing researchers to obtain reliability coefficients by comparing the two sets of scores. However, the interval between the two administrations might introduce measurement error, because during it the subjects might be affected to some degree. For example, if the interval is too short, it is possible that the test-takers may recall the answers; in addition, the test-takers might be affected by a practice effect (Davies et al., 1999; Schmitt, 2010). Compared with the test-retest method, the split-half method is more favorable: it requires only a single administration, with the division of the test into forms as the critical key. To avoid the practice effect, a single test is separated into two parallel halves, and the coefficient of internal consistency is then analyzed to verify reliability. However, the equivalence of the two halves has long been hard to prove (Hughes, 2003; Schmitt, Schmitt, & Clapham, 2001). After all, "the correlation between test scores on two parallel tests," the definition underlying reliability, is hard to verify (Hambleton, Swaminathan, & Rogers, 1991). To better measure reliability, Item Response Theory (IRT) has thus been utilized by researchers, for it can estimate the standard error of measurement for test-takers even at their different ability levels (Schmitt, 2010).

Types of Validity

A test of high validity is one whose results successfully demonstrate the ability it claims to measure (Davies et al., 1999; Hughes, 2003). Validity is made up of four elements: predictive, concurrent, content, and construct validity. Among these, construct validity is considered of greatest value (Cronbach & Meehl, 1955). Besides those four elements, some scholars have also mentioned a fifth, face validity (Davies et al., 1999; Heaton, 1975; Hughes, 2003). Face validity concerns whether the ostensible purpose of the test is achieved from the point of view of test takers, instructors, and test developers (Heaton, 1975; Hughes, 2003). For instance, if test takers are low achievers, high-frequency words would be more suitable for testing them than low-frequency words. Even where construct validity is lacking, face validity is still worth consideration (Davies et al., 1999; Hughes, 2003). Criterion validity is composed of two key elements, concurrent validity and predictive validity. Evidence of concurrent validity, judged by correlating the new test with an established measure administered at the same time, is hard to obtain, since such an accepted standard test hardly ever exists (Bachman, 1990; Hughes, 2003; Messick, 1990; Thorndike, 1949). Schmitt (2010) further indicated that because of the multifaceted nature of vocabulary, it is hard to find vocabulary tests that measure the same kind of word knowledge. In comparison with concurrent validity, predictive validity concerns the extent to which a test can accurately predict subjects' future achievement (Hughes, 2003). For example, if the correlation between a job seeker's cognitive test for job performance and the performance records from his supervisor is significant, the cognitive test has predictive validity. However, there are still problems to be solved. Take a vocabulary test used for placement, for example: a student might feel placed in the wrong class simply because he or she does not put in the effort to learn. As a result, criterion validity might have limitations for vocabulary tests. Content validity can be seen as "the degree to which the samples of test tasks are representative of the content domain of interest" (Hanna & Dettmer, 2004, p. 354). In other words, if the test items reflect the skills or ideas the test designer cares about, the test has content validity. The way to establish this validity is to compare the test specifications with the content of the test. Accordingly, clear objectives set before developing a test are more important than the test itself. The specifications of a test should include timing, techniques, procedures of scoring, criteria for performance, structure, and content. The relation between content validity and construct validity is empirically supported, for a test with unclear or wrong objectives is not likely to yield corresponding results (Hughes, 2003). The last and most important element is construct validity, defined with respect to an underlying ability hypothesized from a theory (Heaton, 1975). If the results of a test built on the theory correlate highly with the theoretical meaning of the concept, the test has construct validity (Anastasi & Urbina, 1997; Bachman, 1990; Hughes, 2003). A broader range of research areas can thus be tapped through a generalized test of the hypothesized concept of a theory. Both reliability and validity are meant to guarantee that the research is dependable and trustworthy; hence, the interpretation of empirical studies should be objective lest it mislead the related research genre (Bachman, 1990). In order to develop a well-constructed vocabulary size test, theories of vocabulary measurement are discussed below.

Measurement Theories

Classical Test Theory

Classical test theory (CTT), also referred to as true score theory, predicts the results of psychological testing, such as the ability of target subjects, by applying psychometric theory. It rests on the idea that a subject's score on a test is the sum of a true score and an error score. Anastasi and Urbina (1997) expressed the idea in the following formula:

σx² = σt² + σe²

Here, σx² stands for the variance of the observed scores, σt² is the true-score variance, which cannot be measured directly, and σe² is the error variance, which can be estimated. Given that the aim of the test is to approximate the true score, the standard error of measurement is applied to show the degree to which error influences the observed score of an exam (Bachman, 2004; Davies et al., 1999). Simply put, the goal of CTT is to interpret and thereby confirm or improve the reliability of psychological testing (Davies et al., 1999). However, because of two fatal defects, CTT has long been criticized by proponents of IRT, reviewed in the next section. Schumacker (2005) indicated that the essential statistics of CTT are item difficulty and item discrimination. Both rely merely on the statistics of the collected sample, so the reliability and validity of the results are inevitably affected by the uncertain proficiency and varied specialties of the particular test takers. In addition, test difficulty will vary from person to person: if the examinee is a high achiever, the estimated item difficulty will be comparatively lower than for a low achiever. Moreover, CTT does not model a test taker's response to any single test item. Accordingly, owing to the incomparability of results across samples, IRT is considered more favorable among researchers, for it has the two crucial traits of sample and measurement independence.

The Modern Psychometric Approach

In light of its relation to CTT, Item Response Theory (IRT) is referred to as a modern psychometric approach, also widely known as Latent Trait Theory (LTT). It is applied not only to clinical rating scales but also to other instruments, such as surveys and achievement tests, to measure their quality (Kean & Reilly, 2014). The proficiency and item traits of the subject play an important role in IRT, as these two variables determine a subject's performance. Traits, or latent traits, are usually invisible and cannot be evaluated straightforwardly; for example, rating scales with no expected correct answers have such options as often, usually, seldom, and never (Hambleton & Swaminathan, 1984). The LTT has three valuable characteristics that enable researchers to obtain results by evaluating or comparing various groups of subjects or tests without interference: 1) measures developed from the LTT are sample independent, so subjects' performance will not be influenced by the composition of their group; 2) the particular set of test items used will not affect the evaluation of subjects' ability; 3) the LTT can precisely estimate subjects' abilities. The first two share a feature called invariance, which makes the LTT stand out from the CTT (Kean & Reilly, 2014). Unidimensionality, another premise of the LTT that distinguishes it from the CTT, holds that subjects' performance is due to just one significant trait, meaning that both the subjects' performance and their rankings can be explained by that single trait. In contrast, a multidimensional model requires two or more traits to describe the resulting performance of subjects, despite little supporting evidence (Davies et al., 1999; Hambleton & Swaminathan, 1984). Still another important idea is local independence: the assumption that, with ability held constant, a subject's answer to any test item is not affected by the other test items (Hambleton et al., 1991). Three types of LTT models with different numbers of parameters are broadly adopted:

(1) One-parameter logistic model (1PL model). The 1PL model, commonly known as the Rasch model, estimates only an item difficulty parameter; the discrimination parameter is fixed at a common value for all items, and the guessing parameter is set to zero (Brown & Hudson, 2002; Hambleton et al., 1991). When using the model, two basic conditions must be satisfied: the minimum number of subjects is 100, and the number of test items should be over 25.

(2) Two-parameter logistic model (2PL model). In addition to item difficulty, a 2PL model estimates one additional parameter, discrimination. In this model, the guessing parameter is still set to zero. As for sample size, a 2PL model needs over 500 subjects and 30 test items (Kean & Reilly, 2014; McNamara & Candlin, 1996).

(3) Three-parameter logistic model (3PL model). A 3PL model includes all three of the aforementioned parameters: item difficulty, discrimination, and guessing. The minimum conditions for the model are 1,000 test-takers and 60 items (Kean & Reilly, 2014; McNamara & Candlin, 1996).

Which model to select theoretically depends on each model's minimum requirements, i.e., the sample size and the number of test items in the study. No matter which model is adopted, each item in a test has its own item characteristic curve (ICC), used to account for the relation between the latent trait (ability) and the probability of a correct response. For example, Figure 3 presents the relationship of items to one another in the graphical form of a 1PL model, including three items with different difficulty levels, i.e., –1, 0, and +1.
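The curves behind such a figure can be computed directly. The following minimal sketch (an illustration, not taken from the thesis) evaluates 1PL response probabilities for three items with difficulties of –1, 0, and +1:

```python
import math

def icc_1pl(theta, b):
    """1PL (Rasch-type) item characteristic curve: the probability of a
    correct response at ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Three items with difficulties -1, 0, and +1, as in Figure 3.
for b in (-1.0, 0.0, 1.0):
    # When ability equals difficulty (theta == b), the probability is 0.5.
    print(b, round(icc_1pl(b, b), 2), round(icc_1pl(2.0, b), 2))
```

As the ability value passed in rises above an item's difficulty, the probability approaches 1, which is exactly the relationship the next paragraph describes.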

Figure 3. Graphical Depiction of a One-Parameter Logistic Model. Adopted from Hula, Fergadiotis, and Martin (2012).

According to the assumption of the LTT, if the difficulty level of an item matches the ability level of a subject, the chance of a correct answer is 0.5. Ability is assumed to relate positively to performance on items of a given difficulty: as a subject's ability increases, the chance of answering correctly becomes higher, and vice versa (Hambleton et al., 1991; Hula et al., 2012). However, the 1PL model has the pitfall of assuming that the discrimination of all test items is the same, when in fact it varies from item to item. Under such circumstances, a 2PL model is necessary to overcome this problem. Figure 4, the graphical depiction of a 2PL model, applies an equal difficulty level to three items and shows the relationship between ability and the probability of getting correct answers. The discrimination values of the three items, Items 1, 2, and 3, are 2, 1, and 0.5, respectively (Hula et al., 2012).

Figure 4. Graphical Depiction of a Two-Parameter Logistic Model. Adopted from Hula et al. (2012).

Figure 4 shows that as the discrimination of an item increases, the slope of the item's ICC becomes steeper, which means that a test item with high discrimination separates high-ability subjects from low-ability ones quite sharply (Hula et al., 2012). Another assumption made by both the 1PL and 2PL models is that the guessing parameter is zero; that is, subjects are assumed never to answer items by guessing, so guessing will not affect the test results. Figures 3 and 4 accordingly show that the lower the ability of subjects, the poorer their odds of getting correct answers (Hula et al., 2012). This assumption is appropriate only for certain types of items, such as picture naming or word repetition; in real testing contexts there are many item types for which the assumption does not hold, including word choices, minimal-pair questions, true-false questions, etc. For these item types, guessing is inevitable. Consequently, a 3PL model, which contains a pseudo-guessing parameter, is needed to meet this problem (Hambleton et al., 1991; Hula et al., 2012).
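For reference, since the figures describe the models graphically without printing their equations, the standard logistic forms of the three models are:

$$P_i(\theta) = \frac{1}{1 + e^{-(\theta - b_i)}} \quad \text{(1PL)}, \qquad P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}} \quad \text{(2PL)},$$

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}} \quad \text{(3PL)},$$

where $b_i$ is the item's difficulty, $a_i$ its discrimination, and $c_i$ its pseudo-guessing parameter (the lower asymptote of the ICC); some formulations also insert a scaling constant D = 1.7 in the exponent. As a worked example, on a 3PL item with $a_i = 1$, $b_i = 0$, and $c_i = 0.25$, a subject of ability $\theta = 0$ answers correctly with probability $0.25 + 0.75 \times 0.5 = 0.625$, showing how the guessing parameter raises the floor of the curve.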

Figure 5. Graphical Depiction of a Three-Parameter Logistic Model. Adopted from Hula et al. (2012).

Figure 5 shows a 3PL model of three test items with the same difficulty and discrimination. The pseudo-guessing values of Items 1–3 are 0.5, 0.25, and 0, respectively. The pseudo-guessing parameter is a useful indicator for researchers, since each item can be assigned a unique guessing parameter. It is usually applied to sets of multiple-choice questions with an equal number of response options. However, as its estimation requires a large sample size, the pseudo-guessing parameter, and hence the 3PL model, is not widely applied in research (Hula et al., 2012). In addition to the subjects' abilities and the item information inferred from the LTT models mentioned above, Rasch analysis also provides information on questionable items and subjects. Fit analysis is conducted to estimate goodness of fit. Through fit analysis, precise information about subjects' latent traits can be obtained, and subjects' performance also becomes predictable. Goodness of fit is achieved only when the subjects' expected performance matches the observed results; if the two fail to match, misfitting test items or subjects are indicated (Cheng, 2014). There are two possibilities for misfit items: 1) some items are poorly developed and thus have low discrimination; 2) the items do not tap the latent trait that the whole test aims to assess. With fit analysis, problematic items can be detected, after which test developers are able to edit or redesign them (Davies et al., 1999; McNamara & Candlin, 1996).

Types of Vocabulary Tests

In the following section, the most commonly used vocabulary tests, the Vocabulary Levels Test and the Vocabulary Size Test, will be reviewed in detail.

Vocabulary Levels Test

The Vocabulary Levels Test (VLT), developed and published by Paul Nation in 1990, has been broadly used by instructors and researchers who aim to diagnose learners' vocabulary size (Read, 2000; Schmitt, Schmitt, & Clapham, 2001). "It is the nearest thing we have to a standard test in vocabulary" (Meara, 1996, p. 38). The VLT targets four frequency bands, the 2,000, 3,000, 5,000, and 10,000 word-family levels, each serving a different function of use in English. For example, the 2,000 word families are the basic requirement for daily conversation, while the 3,000 word families are set as the standard for beginning to read authentic materials. For those who want to read authentic materials independently, the 5,000 word families are the key threshold, and the 10,000 word families are essential for anyone who wants to undertake higher-level studies in English. The words on the list are basically selected from several well-known word lists, such as Thorndike and Lorge's word list (1944), Kučera and Francis's word list (1967), and the General Service List (West, 1953). The academic part of the list is composed of Campion and Elley's word list (1971) and Xue and Nation's (1984) University Word List, while in Schmitt et al.'s version (2001) the academic part is replaced by Coxhead's Academic Word List (1998, 2000).

The VLT has the format of word-definition matching. The stems include three definitions, and the corresponding options consist of three target words and three distractors, all of which have the same part of speech and fall in the same frequency band. The meaning of the stems is kept as simple as possible so that the test-takers will not be misled by long sentences. The following is an example of an adjective cluster in Schmitt et al.'s version.

[Adjective cluster example omitted.]

Each level consists of ten clusters distributed to reflect English word classes: five noun clusters, three verb clusters, and two adjective clusters. In order to prevent random guessing, all the stems are organized by word length and appear in alphabetical order.
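For readers unfamiliar with the cluster layout, a VLT-style cluster can be represented as a simple data structure, as in the hypothetical sketch below (the words, definitions, and function names are invented for illustration and are not taken from Schmitt et al.'s test):

```python
# A hypothetical VLT-style adjective cluster: six options, three of which
# are target words matched to the three definitions in the stem.
cluster = {
    "options": ["ancient", "brief", "golden", "modest", "sore", "steep"],
    "stems": {
        "very old": "ancient",
        "lasting a short time": "brief",
        "painful": "sore",
    },
}

def score_cluster(answers, cluster):
    """Count correct matches; answers maps each definition to a chosen option."""
    return sum(answers.get(d) == w for d, w in cluster["stems"].items())

print(score_cluster({"very old": "ancient", "painful": "steep"}, cluster))  # 1
```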

Vocabulary Size Test

The Vocabulary Size Test (VST), developed by Paul Nation, aims to measure language learners' receptive knowledge of vocabulary size. It is composed of 14 frequency bands ranging from the 1,000 to the 14,000 word families occurring in the British National Corpus (BNC). The format of the VST is multiple choice: an example sentence serves as the stem, with four definitions as options. The following is an example of the VST.

[VST example item omitted.]

Test-takers have to choose the best equivalent to the boldfaced word in the example sentence. For each frequency level there are ten items, and thus for the 14,000 word families there are a total of 140 stems. As the example shows, like the VLT, the stem in the VST is kept as concise as possible. However, unlike the VLT, the VST requires deeper knowledge of the target word, since test-takers must distinguish among meanings with similar elements across the four options. The VST can reveal useful information about a test-taker's overall vocabulary size (Schmitt, 2010). According to Beglar (2010), the VST was validated with the Rasch model. In that study, the test-takers consisted of 19 English native speakers and 178 Japanese native speakers. The results not only showed an adequate fit to a Rasch model, but also demonstrated a decreasing trend in test-takers' scores as the word level increased. The reliability index of the study was over .96, which suggested that the VST was well suited to measuring overall vocabulary size.

CHAPTER THREE
METHODOLOGY

Test Construction

The present vocabulary test includes 180 items. According to frequency, the most frequent 6,050 words were chosen from the new frequency list (NFL) (Gardner & Davies, 2013). The words chosen were ordered by frequency, and each group of one thousand words was assigned to the same level, yielding six levels.

In order to determine whether the subjects had reached a particular word frequency level, 30 target words were chosen from the margin between two adjacent levels, that is, from the words ranked 50 positions below to 50 positions above each 1,000-word cutoff (words 950 to 1050 for Level One). For example, in Level Two, the 30 target words were chosen randomly from the words ranked 1950 to 2050. Thirty test items were thus drawn at random for each frequency level. In order to avoid systematic error, the items were split into two forms, Form A and Form B. Each form contained 90 items, with target words from all six levels.

A four-choice MC format was adopted in this test. The subjects had to relate the definition in the stem to the word with the closest meaning. The format of definition-word matching was chosen because it was simpler to write a stem with high-frequency words. In order to avoid including words in a stem that were too difficult, which might lead to misapprehension of a word, it was essential to keep the words in a stem as easy as possible. As a result, the definitions in the stems were all written with words from the first two high-frequency levels so as to minimize the possibility that the subjects would fail the test because of the use of obscure words. Four well-known English dictionaries of high reputation, the Longman Dictionary, Oxford Dictionary, MacMillan Dictionary, and Cambridge Dictionary, were used as references to keep the stems concise and clear.

To ensure that the difficulty levels of the three distractors were identical to that of the target word, the distractors were picked from the same frequency band as the target word. To reduce the guessing effect, if the target word had an obvious suffix or prefix that would reveal information about the target word, the distractors were designed to contain it as well. Take the target word "legislation," for example. The distractors for it would be combination, motion, and contribution. The suffix -tion indicates that the part of speech is a noun; therefore, if the distractors were not in a similar form, the subjects might get the correct answer without knowing the meaning.
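The level-assignment and margin-sampling procedure described above can be summarized in a short sketch, assuming the NFL is available as a Python list of word strings ordered by descending frequency (the function and variable names are illustrative, not an existing tool):

```python
import random

def sample_margin_targets(ranked_words, levels=6, band=50, n=30, seed=1):
    """For each level k, draw n target words from the ranks
    1000*k - band to 1000*k + band (e.g., 950-1050 for Level One)."""
    rng = random.Random(seed)
    targets = {}
    for k in range(1, levels + 1):
        lo, hi = 1000 * k - band, 1000 * k + band
        candidates = ranked_words[lo - 1:hi]  # ranks are 1-based
        targets[k] = rng.sample(candidates, n)
    return targets

def split_into_forms(targets, seed=2):
    """Split each level's 30 targets into Form A and Form B, 15 each."""
    rng = random.Random(seed)
    form_a, form_b = {}, {}
    for k, words in targets.items():
        shuffled = words[:]
        rng.shuffle(shuffled)
        form_a[k], form_b[k] = shuffled[:15], shuffled[15:]
    return form_a, form_b
```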

Besides reducing the guessing factor, the test was designed to prevent possible fatigue. For instance, a continuous sequence of items from low- or high-frequency levels might make subjects tired or bored. As a result, in the present study the order of the target words did not follow the original order of the word frequency levels from one to six. Instead, they were organized in the following order: Level One, Level Two, Level Five, Level Three, Level Four, and Level Six. For each level, there were fifteen target words. The test items were closely examined by the author, three experienced senior high school teachers, the supervising professor, and a foreign consultant. After this thorough examination, ten high school students took the test as a pretest so that remaining errors could be detected and fixed. The sample items are as follows, with the correct answer shown in bold font. For the whole vocabulary test, please see Appendix A (Form A) and Appendix B (Form B).

Level 1
1. the part of the day between the afternoon and night
(A) evening (B) unit (C) adult (D) fear
2. a large formal meeting
(A) writer (B) conference (C) camera (D) chair

3. a way of doing something
(A) board (B) doctor (C) husband (D) style
Level 2
4. someone who breaks the law
(A) criminal (B) audience (C) participant (D) labor
5. to go or do something very quickly
(A) stare (B) own (C) rush (D) hurt
6. in need of rest or sleep
(A) sorry (B) tired (C) willing (D) wild
Level 3
7. not healthy
(A) gay (B) ill (C) still (D) holy
8. a large group of people
(A) round (B) tribe (C) clinic (D) mixture
9. an area of grass that is cut short
(A) peer (B) silver (C) coast (D) lawn
Level 4
10. mental or physical pain
(A) accounting (B) counterpart (C) suffering (D) departure
11. a part that you make or join to another part
(A) transaction (B) addition (C) determination (D) pension

12. a person who works in another person's house, doing jobs such as cooking and cleaning
(A) pause (B) spouse (C) servant (D) vitamin
Level 5
13. the players in a sports team who play in a particular game
(A) forecast (B) lineup (C) rebound (D) cleanup
14. to use something or someone instead of another thing or person
(A) formulate (B) obscure (C) substitute (D) underscore
15. to try to do something
(A) generator (B) endeavor (C) incidence (D) microwave
Level 6
16. showing what something is like
(A) outrageous (B) reflective (C) miniature (D) lengthy
17. relating to muscles
(A) marginal (B) muscular (C) sensible (D) naïve
18. to go regularly to and from work
(A) commute (B) multiply (C) offset (D) notify

Participants and Test Administration

This test was administered in five senior high schools and a national university, located in Taipei, Changhua, Yilan, and Keelung. The PR value of the participants ranged from 70 to 95.

English classes for tenth graders met about four hours every week, whereas eleventh graders had five hours. The college students were non-English-major senior students taking an English class as an elective course. In total, 1,198 participants took the test, of whom 849 were senior high school students and 349 were college students. The senior high school students took 50 minutes to finish either Form A or Form B as a pilot test. For each class, the test administrators were English teachers, each of whom chose one form to implement. Neither the teachers nor the participants knew that the test was part of a research project, and none of them were informed of the source of the vocabulary words. The participants' English teachers asked their students to take the test seriously in order to measure their own vocabulary size. Also, according to the test results, the English teachers would choose the top five students in each class to participate in the national English vocabulary contest. The test administrators distributed the test papers and collected them after the bell rang. After the pilot tests showed good test validity, the formal test, combining Forms A and B, was administered to the senior students in the national university. The formal test followed the same test schedule as the pilot tests; it consisted of 180 test items and lasted 100 minutes.

Scoring and Coding

The test was split into two forms. In either Form A or Form B, each item carried equal weight: correct answers received one point, while wrong ones received zero points. The participants marked their answers on computer cards, and the test results were collected and scored by computer.
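A minimal sketch of the scoring step, assuming the machine-read cards yield one response letter per item (hypothetical names; the study's actual processing was done by computer after the cards were collected):

```python
def score_form(responses, key):
    """Dichotomous scoring: 1 point for each correct answer, 0 otherwise.
    responses and key are equal-length lists of option letters."""
    return [int(r == k) for r, k in zip(responses, key)]

key = ["A", "B", "C"]                      # answer key for a 3-item excerpt
scores = score_form(["A", "D", "C"], key)  # -> [1, 0, 1]
print(sum(scores), "out of", len(key))
```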

Data Analysis

The present study aims to establish an instrument that provides valid and reliable estimates of vocabulary size for L2 English learners and to explore how the item difficulty, discrimination, and pseudo-guessing parameters interact with frequency bands. The 3PL model of Latent Trait Theory was adopted to meet the goals of the study for the following reasons. First, LTT provides statistics of overall model fit that reveal whether the test as a whole is an appropriate measure of latent ability. With overall model fit established, the validity of the present vocabulary test is also established, and information on reliability can be calculated by means of the LTT model. Second, the 3PL model allows insights into three item parameters: difficulty, discrimination, and pseudo-guessing. Hence, the author is able to gather information on, and make comparisons of, the three parameters across frequency bands. The information on difficulty and discrimination also enables the author to separate well-written items from poor ones. Third, since test items in multiple-choice form may be susceptible to the effects of guessing, the 3PL model provides critical information on the pseudo-guessing effect, allowing the author to identify unreliable individual examinees. Thus, misfit items and test-takers can be spotted and ruled out. XCalibre was used to analyze the data. Finally, the test can show to what extent the selected margin words in each vocabulary band demonstrate that the subjects have passed a particular frequency band.
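XCalibre estimates the three parameters with procedures such as marginal maximum likelihood; the simplified sketch below (abilities treated as known, one item at a time, simulated data) only illustrates the Bernoulli likelihood that such software optimizes and should not be read as the study's actual procedure:

```python
import numpy as np
from scipy.optimize import minimize

def icc_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def neg_log_lik(params, theta, x):
    """Negative Bernoulli log-likelihood of 0/1 responses x to one item."""
    a, b, c = params
    p = np.clip(icc_3pl(theta, a, b, c), 1e-6, 1 - 1e-6)
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

rng = np.random.default_rng(0)
theta = rng.normal(size=1198)              # abilities of 1,198 simulated examinees
x = (rng.random(1198) < icc_3pl(theta, 1.2, 0.3, 0.2)).astype(int)

fit = minimize(neg_log_lik, x0=[1.0, 0.0, 0.25], args=(theta, x),
               bounds=[(0.1, 3.0), (-4.0, 4.0), (0.0, 0.5)])
print("recovered (a, b, c):", fit.x)       # close to the true (1.2, 0.3, 0.2)
```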

CHAPTER FOUR
RESULTS

This chapter reports the results of the vocabulary test inspected with the three-parameter logistic Item Response Theory model (3PL IRT model). The first section presents the comprehensive fit statistics; the item characteristic curves and the values of the three parameters are then reported.

Fit Statistics

The overall model fit of the test was calculated as 1.17, which demonstrates a good fit to the 3PL LTT model. That is, the entire vocabulary test presented unidimensionality, and the test was thus judged to measure a single latent ability. In general, most of the test items showed good fit. Only Item 59 (No. 74 on the test) in Level Four failed to fit the LTT model. Since the levels of the words were rearranged on the test to prevent test fatigue, the item numbers used in the analysis do not match the numbers printed on the test; the complete coding is given in Appendix C. The misfit item is as follows:

59. someone who is kept as a prisoner by an enemy
(A) receiver (B) laughter (C) hostage (D) container
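One common way to flag a misfit item such as Item 59 is to bin examinees by estimated ability and compare observed proportions correct against the model's predictions; the chi-square-style sketch below illustrates the general idea (XCalibre's exact fit statistic may differ):

```python
import numpy as np

def item_fit_chi2(theta_hat, x, a, b, c, n_groups=10):
    """Rough item-fit check: within each ability decile, compare the
    observed proportion correct with the 3PL model's expectation."""
    p_model = c + (1 - c) / (1 + np.exp(-a * (theta_hat - b)))
    edges = np.quantile(theta_hat, np.linspace(0, 1, n_groups + 1))
    group = np.clip(np.searchsorted(edges, theta_hat, side="right") - 1,
                    0, n_groups - 1)
    chi2 = 0.0
    for g in range(n_groups):
        mask = group == g
        n = mask.sum()
        if n == 0:
            continue
        obs, exp = x[mask].mean(), p_model[mask].mean()
        chi2 += n * (obs - exp) ** 2 / (exp * (1 - exp))
    return chi2  # large values relative to a chi-square reference suggest misfit
```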

Figure 6. Test Information Function

Figure 7. Conditional Standard Error of Measurement (CSEM) Function

In Latent Trait Theory, our interest lies in estimating the value of the ability parameter, denoted by θ, for each examinee. The Test Information Function indicates the amount of information that can be obtained from the test at each ability level. Inspection of Figure 6 shows that the amount of information reached its maximum of 39.361 at θ = −0.300. Correspondingly, as shown in Figure 7, the minimum CSEM was 0.159, also at θ = −0.300. Within the range −2 < θ < 2, abilities were estimated with precision; outside this range, the amount of information decreased rapidly, and the corresponding ability levels were not estimated very well. It can be observed that the test provided useful and valid information on learners whose abilities ranged from θ = −2 to θ = 2, with the greatest amount of information near θ = −0.300. Based on the unimodal distribution shown in Figure 6, the feature of unidimensionality was confirmed, indicating that the test measured only one latent ability, which in turn supports the validity and reliability of this vocabulary size test.
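The link between the two figures follows from the standard identity CSEM(θ) = 1/√I(θ), where I(θ) is the test information, i.e., the sum of Birnbaum's 3PL item information values. A sketch with hypothetical item parameters (the real 180-item estimates are in Appendix E):

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    """Birnbaum's 3PL item information:
    I(theta) = a^2 * (q/p) * ((p - c) / (1 - c))^2."""
    p = icc_3pl(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

theta = np.linspace(-4, 4, 161)
items = [(0.9, -0.4, 0.24), (1.2, 0.1, 0.20), (0.7, -1.0, 0.25)]  # hypothetical
test_info = sum(item_info_3pl(theta, *it) for it in items)
csem = 1 / np.sqrt(test_info)
print("peak information at theta =", theta[np.argmax(test_info)])
print("minimum CSEM =", csem.min())
```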

Item Characteristic Curves (ICCs)

In Latent Trait Theory, the item characteristic curve describes the relationship between latent ability and performance on a test item. The shape of the ICC is determined by the parameters a, b, and c: a indexes the discrimination of the item, while b is the item difficulty parameter, equal to the value of θ at which the probability of a correct response equals .50 + (c/2). It follows that multiple-choice items with more positive b parameters are more difficult for examinees, as a higher trait level is required to choose the keyed response at that rate. Finally, c equals the probability that an examinee with an infinitely low θ obtains a correct response due to guessing.

In order to examine the difficulty of an item more closely, let us consider the ICCs in terms of the parameter b. The further the curve lies toward the right end of the ability scale, the more difficult the item is; conversely, the closer the curve lies to the left end, the easier the item is. The most difficult items on this test were Items 86, 109, and 84, whose curves lie at the right end of the scale. In contrast, the easiest items were Items 1, 2, and 18, whose curves lie at the left end of the scale, as shown in Figure 8.
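In symbols, using standard 3PL notation consistent with the description above, the ICC and the interpretation of b are:

```latex
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}},
\qquad
P_i(b_i) = c_i + \frac{1 - c_i}{2} = 0.50 + \frac{c_i}{2}.
```

Substituting θ = b_i makes the exponential term equal to 1, which yields the .50 + (c/2) crossing point cited above.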

Figure 8. Item Characteristic Curves for Items 1 and 86

Figure 9. Item Characteristic Curves for Items 26 and 116

Figure 10. Item Characteristic Curves for Items 47 and 86

The parameter a shows the discriminating power, indicated by the slope of the curve: the steeper the slope, the greater the item's discriminating power. From Figure 9, we see that Item 116, with the steepest slope, had the most discriminating power, whereas the flatter curve of Item 26 showed the least. As for c, it corresponds to the lower asymptote of the ICC, as illustrated in Figure 10. The guessing value of Item 47 was 0.2659, while that of Item 86 was 0.189, which shows that Item 86 was the less susceptible of the two to guessing. The complete ICC graphs are provided in Appendix D.

Values of Item Parameters

A complete summary of the parameter statistics is given in Table 2, and the parameter statistics for all 180 items are given in Appendix E. As shown in Table 2, the minimum difficulty value was −2.699, whereas the maximum was 2.840. The difficulty values thus fell within the range of −3 to 3, which means that the performance of the items fit the 3PL LTT model. The distribution of difficulty across items is presented in Figure 12.

Parameter            Items    Mean     SD       Min      Max
a (Discrimination)    180     0.888    0.185    0.410    1.401
b (Difficulty)        180    −0.352    1.127   −2.699    2.840
c (Guessing)          180     0.241    0.009    0.185    0.266

Table 2. The Descriptive Statistics for the Parameters a, b, and c

Figure 11. The Frequency Distribution of the a Parameter

Figure 12. The Frequency Distribution of the b Parameter

The frequency distribution of a is shown in Figure 11. In practice, the acceptable range of discrimination is from 0.6 to 2.5, and items with discrimination values lower than zero should be deleted, since a negative discrimination value implies that the more able the subject, the lower the chance of answering the question correctly (Brown & Hudson, 2002). The discriminating power of the items, as given in Table 2, ranged from 0.410 to 1.401; thus no item showed negative discrimination, and nearly all of the items fell within the acceptable range. The following are the top five items, those with the greatest discriminating power; the five test items with the lowest discriminating power could likewise be listed as a comparison group.

Top five most discriminating items:
116. a natural ability for being good at a particular activity
(A) extent (B) talent (C) contrast (D) airport
20. extremely big
(A) vast (B) warm (C) soft (D) afraid
137. a complete change from one thing to another
(A) portray (B) switch (C) persuade (D) function
115. an area of ground
(A) land (B) root (C) nose (D) cash
180. a large, cow-like animal with long, curved horns
(A) twilight (B) buffalo (C) sweep (D) demise
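Screening items against these rules of thumb can be expressed as a simple filter (a sketch over hypothetical item records; the actual parameter table is in Appendix E):

```python
# Hypothetical item records; only the discrimination field matters here.
items = [{"id": 116, "a": 1.401}, {"id": 26, "a": 0.410}, {"id": 20, "a": 1.35}]

ACCEPT_MIN, ACCEPT_MAX = 0.6, 2.5
below_range = [it["id"] for it in items if it["a"] < ACCEPT_MIN]
to_delete = [it["id"] for it in items if it["a"] < 0]  # negative a: discard item

print("below the acceptable range:", below_range)  # [26]
print("candidates for deletion:", to_delete)       # []
```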
