Investigating Local Item Dependence in the Vocabulary Levels Test


Doctoral Dissertation
Department of English, National Taiwan Normal University

Investigating Local Item Dependence in the Vocabulary Levels Test

Advisor: Dr. Yeu-Ting Liu (劉宇挺)
Student: Nigel P. Daly (戴禮)

June 2019

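Chinese Abstract

Keywords: vocabulary testing, local dependence, unidimensionality, Rasch model, latent trait, Vocabulary Levels Test

The Vocabulary Levels Test (VLT) is often used as a placement test, a diagnostic test, and a benchmark for learning in pre- and post-test experimental designs. Compared with other vocabulary size tests such as the VST or the Yes/No test, the VLT has received the most attention over the past 35 years, although its item-cluster format has drawn some suspicion. Each VLT cluster contains three items (definitions) and six options (words). Because the three items share the same set of options, it has been suspected that answering one item can unfairly influence (or determine) the answers to the other items in the same cluster. This kind of local dependence is called item chaining, and it clearly violates the basic assumption of item independence in Classical Test Theory and Item Response Theory. If item chaining is pervasive in the test, it also challenges another basic assumption of test theory: unidimensionality, that is, the test measuring only the ability it was designed to assess, here vocabulary knowledge. If item dependence violates both of these basic assumptions, the reliability and validity of the test are called into question.

The aim of this dissertation is to examine item independence in a shortened version of the VLT comprising three levels instead of five, using a wider range of Rasch modelling approaches to identify the existence and extent of item dependence in the VLT. Test data were collected from 302 university and graduate students, and Winsteps was used to run two types of dimensionality test, (1) Principal Components Analysis of Residuals (PCAR) and (2) Yen's Q3 statistic, which identifies locally dependent items, on 20 different data levels:

1. the three VLT levels 2, 3 and 5 combined (1 data level)
2. each individual VLT level (3 data levels)
3. four ability groups across the combined VLT levels (4 data levels)
4. four ability groups across the three individual VLT levels (12 data levels)

Two further analyses were carried out: a comparison of the empirical data with simulated data from which non-random residuals had been factored out, and a Rasch analysis of the three-item clusters as testlets. In total, this study synthesized the quantitative results of 42 different analyses and qualitatively examined the problematic testlets using (1) response patterns for answer keys, distractors and unanswered items, and (2) word frequency and dispersion data from COCA, currently the largest English corpus (Davies, 2008-).

Consistent with earlier findings in the literature, the unidimensional Rasch analyses showed acceptable fit and acceptable person and item reliability, and, when compared with the simulated data, very little unexplained variance. These figures indicate that the analysis did not uncover any obviously problematic testlets. However, the combined analyses of the 20 data levels showed that more than a third of the testlets tended to (1) contain a pair of locally dependent items that were weakly to moderately dependent (correlation of 0.3-0.7), and/or (2) contain items with substantive PCAR loadings (beyond +/- 0.3) on a dimension other than the Rasch dimension of vocabulary knowledge. Qualitative analyses were then conducted to better understand the Rasch statistical results. A subset of seven testlets that emerged in at least two of the above analyses were identified as the likeliest candidates for problematic local dependence, and these were examined for item wording and word frequency.

Although the statistical and qualitative results cannot attribute the local dependence to item chaining, the seven testlets do share characteristics that create problems and undermine the functioning of the test. These characteristics include a pair of items that differ considerably in difficulty from the third item, referred to in this dissertation as a “2-vs-1 difficulty bundle”; in fact, 19 of the 30 testlets show this tendency. When the bundled pair lie close together in difficulty within a testlet but far from the outlying third item, the Q3 analysis showed weak or moderate local dependence; this occurred in six testlets (20% of the total). When the locally dependent pair were the first two items in the testlet, with the first item harder than the second and much harder than the third (and about a quarter to a third of test-takers leaving them unanswered), the PCAR showed that the first item did not correlate with the Rasch dimension of vocabulary knowledge; this occurred in four testlets (13% of the total) in VLT3 and VLT5.

A key issue raised by this study is item difficulty in diagnostic vocabulary tests such as the VLT, which researchers have either ignored or treated as a nuisance variable (Culligan, 2015). To my knowledge, difficulty in this type of test has never been explicitly theorized, but has been tacitly taken to be a function of word frequency in a corpus. Despite some claims to the contrary (Schmitt et al. [2001] for the VLT, and Beglar [2007] for the VST), the underlying assumption is that the lower a word's frequency (i.e., the less common it is), the more difficult the item on the VLT. The results of this study show that this assumption is problematic for two reasons. First, the Schmitt et al. (2001) versions of the VLT are based on outdated and small corpora and therefore lack accurate word frequencies, especially for the lower-frequency levels VLT3 and VLT5. This is mainly because such corpora contain relatively few texts and do not account for dispersion (i.e., how many texts in the corpus contain the word), so the frequency information for these words is inconsistent and skewed. Second, and most importantly, difficulty measures are often uncorrelated with frequency data, even when dispersion is taken into account. This suggests that the second language learner's lexicon does not mirror authentic English corpora, especially beyond the first 2000 words. It is recommended that the relationship between word frequency and difficulty be investigated further.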

Abstract

Keywords: vocabulary testing, local item dependence, unidimensionality, Rasch model, latent trait, Vocabulary Levels Test

The Vocabulary Levels Test (VLT) has been used as a placement test, a diagnostic test and a benchmark for learning in pre- and post-test studies. Compared to other vocabulary size tests like the VST and the Yes/No test, the VLT has received the most attention in research publications in the last 35 years, despite widespread suspicion of its item cluster format. Since each item cluster is composed of three items (definitions) and six answer options (words), and the three items draw from the same set of answer options, it is suspected that the answering of one item can unfairly influence, or depend on, the answering of another item in the cluster. This type of Local Item Dependence (LID) is called item chaining and appears to be a flagrant violation of the basic assumption of Local Item Independence (LII) in Classical Test Theory as well as Item Response Theory. If item chaining is pervasive throughout the test, it also challenges another fundamental assumption in test theory: unidimensionality, or the test’s capacity to measure only one trait, such as vocabulary knowledge. If both of these assumptions are substantially violated by LID, the test’s reliability and validity are necessarily called into question.

The purpose of this dissertation is to investigate LID in a shortened version of the VLT (three levels instead of five) using a wider variety of Rasch modelling approaches, triangulated so as to identify the existence and extent of LID in the VLT. Specifically, data were collected from 302 Taiwanese university students or university graduates, and Winsteps was used to run two types of dimensionality test (1. Principal Components Analysis of Residuals [PCAR] and 2. Yen’s Q3 statistic, which identifies pairs of locally dependent items) on 20 different data levels:

1. the three VLT levels 2, 3 and 5 combined (1 data level)
2. each independent VLT level (3 data levels)
3. four ability groups versus combined VLT levels (4 data levels)
4. four ability groups versus three independent VLT levels (12 data levels)

Two more analyses were also conducted: simulated data with non-random residuals factored out were compared to the empirical data, and items were grouped into three-item clusters to perform a Rasch analysis of testlets. In total, this study synthesized the results of 42 different analyses and qualitatively investigated the resulting problematic testlets using 1. response patterns of answer keys, distractors and items left unanswered, and 2. word frequency and dispersion information from COCA, the largest and most up-to-date English language corpus currently available (Davies, 2008-).

Similar to previous research findings, the unidimensional Rasch analyses showed acceptable fit statistics, acceptable person and item reliability, and very little unexplained variance, especially when compared with the simulated data. The testlet analysis also did not uncover any obviously problematic testlets. However, from a combination of the above 20 levels of analysis, more than a third of the testlets appeared either to 1. have a pair of locally dependent (LD) items that were weakly to moderately dependent on each other (correlation of 0.3-0.7), and/or 2. have items with substantive PCAR loadings (beyond +/- 0.3) on a dimension that was not the Rasch dimension of vocabulary knowledge. Additional qualitative investigations were conducted in an effort to better understand and explain the Rasch statistical results. A subset of seven testlets that emerged from at least two of the above analyses were assumed to be the most likely candidates of problematic LID, and these were more closely scrutinized using qualitative procedures of checking item wording and word frequency.

Although the statistical and qualitative procedures cannot conclusively show that the cause of LID is item chaining, the seven testlets share a number of characteristics that clearly create a problematic dynamic that undermines the proper functioning of testlets. These characteristics include a pair of items that differ considerably in difficulty from the third item in the cluster, which I have called a “2-vs-1 difficulty bundle”; in fact, 19 out of 30 testlets shared this configuration. However, when the two bundled items in a testlet are fairly close together in difficulty but far apart from the outlying third item, the Q3 LID analysis identified them as either weakly or moderately locally dependent; this was the case for six testlets (20% of the total). And when this LID pair was the first two items, with the first item more difficult than the second and much more difficult than the third outlying item (and with a quarter to one third of test-takers leaving the pair unanswered), the first item in the testlet was identified by the PCAR as negatively correlating with the Rasch dimension of vocabulary knowledge; this was the case for four testlets (13% of the total) in VLT3 and VLT5.

A key issue that emerged from this investigation is item difficulty in a vocabulary diagnostic test like the VLT, which has been variously ignored or treated as a “nuisance variable” by researchers (Culligan, 2015). Difficulty in this type of test has never, to the best of my knowledge, been overtly theorized, but has been tacitly operationalized as a function of word frequency from a corpus. Despite some unargued claims to the contrary (Schmitt et al. [2001] for the VLT, and Beglar [2007] for the VST), the assumption is that the less frequent (i.e., less common) the word, the more difficult the word-item on the VLT. This study shows that this is problematic for at least two reasons. First, the Schmitt et al. (2001) VLT versions are based on outdated and small corpora that have inaccurate word frequency information for all the VLT levels, but especially for the lower-frequency VLT3 and VLT5 levels; this is primarily because word frequency information will necessarily be inconsistent and skewed for less common words when using smaller corpora that contain a relatively small number of randomly sampled texts and do not account for dispersion (i.e., how many texts in the corpus contain the word). Secondly, and most importantly, difficulty measures, even when accounting for dispersion information, are often uncorrelated with frequency information, which shows that the learner’s second language (L2) lexicon does not mirror authentic English corpora, especially beyond the first 2000 words. Suggestions are given to help bridge the gap between frequency and difficulty.
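The Q3 screening mentioned in the abstract can be illustrated with a minimal Python sketch. It is only a sketch under stated assumptions: person abilities (`thetas`) and item difficulties are taken as already estimated and are passed in as plain lists, the function names are hypothetical, and nothing here reproduces Winsteps' estimation or its output tables. Q3 is computed as the Pearson correlation between the raw residuals (observed response minus Rasch-expected probability) of two items, and pairs above the 0.3 cut-off quoted in the abstract are flagged.

```python
import math
from itertools import combinations


def rasch_probability(theta, b):
    """Dichotomous Rasch model: probability of a correct response
    for a person of ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def q3_matrix(responses, thetas, difficulties):
    """Q3 for every item pair.

    responses[p][i] is 1 if person p answered item i correctly, else 0.
    Returns a dict mapping (i, j) with i < j to the residual correlation.
    """
    n_items = len(difficulties)
    # One residual series per item, running over persons.
    residuals = [
        [row[i] - rasch_probability(theta, difficulties[i])
         for row, theta in zip(responses, thetas)]
        for i in range(n_items)
    ]
    return {
        (i, j): pearson(residuals[i], residuals[j])
        for i, j in combinations(range(n_items), 2)
    }


def flag_dependent_pairs(q3, threshold=0.3):
    """Item pairs whose residual correlation exceeds the threshold."""
    return sorted(pair for pair, r in q3.items() if r > threshold)
```

Calling `q3_matrix` on a persons-by-items 0/1 matrix and passing the result to `flag_dependent_pairs` yields the candidate locally dependent pairs. In practice the abilities and difficulties would come from a Rasch estimation step, which this sketch deliberately leaves out.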

Acknowledgements

I would like to thank my advisors Tony Liu and Thomas Tseng and Committee members Professors Hsiao, Wang and Lin for their feedback. I owe thanks to Professor Tseng for introducing me to Rasch measurement in a doctoral course, and I am truly indebted to Professor Liu for his guidance, encouragement and help navigating the pitfalls and ordeals of this dissertation process. Without his help, there is no doubt that this dissertation would never have seen the light of day.

Contents

Chapter 1. Introduction
  1.1 Importance of vocabulary in SLA
  1.2 What is vocabulary? Vocabulary knowledge - aspects, facets and dimensions
  1.3 Vocabulary testing for fluency, depth and size
    1.31 Vocabulary size tests: Yes/No, VST, VLT
    1.32 Yes/no checklist
    1.33 Vocabulary Size Test (VST)
  1.4 Vocabulary Levels Test (VLT)
    1.41 VLT popularity and test format
    1.42 VLT item difficulty as frequency
    1.43 VLT format
    1.44 Validation evidence
    1.45 Value of VLT: decontextualized, reliability and expedient format
    1.46 Suspected problems with the VLT: local dependence, guessing and unidimensionality
Chapter 2. Literature Review
  2.1 Statistical analysis: Virtues of Rasch modelling over CTT and IRT
    2.11 Classical Test Theory
    2.12 Modern Latent Trait Theory: IRT
    2.13 Rasch vs IRT: Only Rasch is an invariant measuring method
    2.14 A Rasch approach to understanding the VLT
    2.15 Using Rasch measurement in vocabulary validation studies
  2.2 The value of Residuals in the Rasch model
  2.3 Unidimensionality
  2.4 Local Item Dependence (LID)
    2.41 VLT and LID
  2.5 Guessing
  2.6 Research questions
Chapter 3. Methods
  3.1 Instrument
  3.2 Participants
  3.3 Data analysis
    3.31 Fit tests
    3.32 Principal components analysis of residuals (PCAR)
    3.33 Yen’s Q3 Statistic analysis (Q3 LID)
    3.34 Testlets vs other testlets (WTable 27)
    3.35 Simulated data
    3.36 Individual level tests
    3.37 Person ability vs combined VLT levels and individual level
    3.38 Triangulation and Qualitative item analysis: response patterns, corpus frequency and dispersion checks, comparison with all other testlets
Chapter 4. Results
  4.1 Background
    4.11 Test results
    4.12 VLT fit statistics
    4.13 Examinees – gender and ability levels
    4.14 Dispersion of item difficulty measures across the three VLT levels
  4.2 Extent of Multidimensionality and LID in the VLT
    4.21 Empirical data vs simulated data
    4.22 Principal components analysis of residuals
    4.23 Q3 LID analysis
    4.24 Comparing LID in testlets - Fit statistics for testlets
    4.25 Summary
  4.3 Extent of Multidimensionality and LID interaction with Individual VLT levels
    4.31 Item fit statistics for individual VLT levels
    4.32 PCAR and Q3 analyses for individual VLT levels
  4.4 Extent of Multidimensionality and LID interaction with 4 Ability levels vs Combined VLT levels
    Summary
  4.5 Extent of Multidimensionality and LID for 4 Different abilities vs individual VLT levels
    4.51 VLT2 vs ability
    4.52 VLT3 vs ability
    4.53 VLT5 vs ability
    4.54 Overview of ability vs level analyses
    Summary
  4.6 Guessing and Response patterns: unanswered questions and chosen distractors
Chapter 5. Discussion
  5.1 Summary statistics
  5.2 Research Question 1: What is the extent of unidimensionality and LID issues in VLT? - Triangulating the PCAR + Q3 LID + Testlet analyses
  5.3 Research Question 2: What is the extent of SLD item chaining?
    5.31 Comparison of seven problematic testlets
    5.32 PCAR and Q3 LID + item analysis
      5.32a PCAR analyses
      5.32b Q3 LID and testlet analyses
    5.33 Guessing, and distractor and blank answer analyses
    5.34 Checking item difficulty with updated word frequencies and dispersion values (COCA)
    5.35 Word frequency as item difficulty: NS vs NNS issues
    5.36 Variable difficulty values within testlets: micro-linguistic relativism and guessing as additional threats to construct validity
    5.37 Micro-relativism analysis of seven problematic testlets
      5.37a Item analysis: the aberrational AI testlet
      5.37b Item analysis: LID pairs far vs close together - AD vs AE
        AD testlet
        AE testlet
      5.37c Item analysis: Four special cases of PCAR + LID pairs for BD, BH, CA, CD
        CD testlet
        BD testlet
        BH testlet
        CA testlet
    5.38 Are there other similar testlet patterns in the other 23 VLT testlets?
  5.4 Research Question 3: Strategic Guessing vs imbalanced Difficulty – two sides of the (in)-validity coin
  5.5 Test-making implications and conclusions
Chapter 6. Conclusion
  6.1 Summary
  6.2 Limitations and future work
  6.3 Conclusion
References
Appendices
  Appendix A. Research works referring to Vocabulary breadth tests
  Appendix B. Use of VLT in research
  Appendix C. The VLT, levels 2, 3, 5
  Appendix D. Dimensionality check – all levels n=302 (WTable 23)
  Appendix E. Simulated data Dimensionality check
  Appendix F. Testlet analysis (WTable 27)
  Appendix G. Individual VLT level analyses (WTable 23)
  Appendix H. Ability vs VLT (WTable 23)
  Appendix I. 4 abilities vs VLT2
  Appendix J. 4 abilities vs VLT3
  Appendix K. 4 abilities vs VLT5
  Appendix L. Item measures

List of Tables

Table 1.1. Dimensions of vocabulary knowledge
Table 3.1. Table 23.99 sample from Linacre’s Winsteps Manual
Table 4.1. Summary statistics for the 3 VLT levels
Table 4.2a. Empirical data – Summary of test fit statistics
Table 4.2b. Simulated data – Summary of test fit statistics
Table 4.3. Standardized residual variance for 3 levels
Table 4.4. Simulated data: Standardized residual variance for 3 levels
Table 4.5. Standardized residual loadings for items with largest contrasts outside of +/- 0.3; items and answers are appended below
Table 4.6. Winsteps modified Q3 statistic for locating pairs of dependent items for all three levels
Table 4.7. Statistical summary for WTable 27 analysis of Testlets as item subtotals
Table 4.8. Fit statistics for testlets
Table 4.9. Person and Item fit statistics per individual VLT level
Table 4.10. Item dimensionality and local item dependence measures per individual VLT level
Table 4.11. Four Ability levels vs Three VLT Levels - Person and Item fit statistics per individual level
Table 4.12. Four Ability levels vs Three VLT Levels - Item dimensionality and local item dependence measures
Table 4.13. Four Ability levels vs VLT-Level 2 - Item dimensionality and local item dependence measures
Table 4.14. Four Ability levels vs VLT-Level 3 - Item dimensionality and local item dependence measures
Table 4.15. Four Ability levels vs VLT-Level 5 - Item dimensionality and local item dependence measures
Table 4.16. Summary of problematic items: 3 VLT levels vs 4 ability groups for Item dimensionality (PCAR loadings) and item dependence (residual correlations)
Table 4.17. Triangulated results of testlets that exhibit varying degrees of multidimensionality (>0.3) and local item dependence (>0.31)
Table 4.18. Response statistics (key, distractor and blanks) for each of the answer options for 6 LID pairs (highlighted in gray)
Table 5.1. Items with residual loadings outside of +/- 0.3
Table 5.2. Testlet item pairs implicated in Q3 LID, testlet, and PCAR analyses
Table 5.3. Seven problematic testlets with answers and distractors
Table 5.4. Seven problematic testlets with difficulty measures, word frequency rankings and dispersion scores for answer keys vs distractors
Table 5.5. Individual answer or distractor options that have either much lower or higher frequency ranks and dispersion scores than the overall 6-option average
Table 5.6. VLT3 and 5 Testlet LID item pairs and PCAR item(s); difficulty measures come from Appendix D
Table 5.7. Characteristics of LID testlets vs other testlets with 2-vs-1 bundles of difficulty

List of Figures

Figure 1.1. Three dimensions of lexical knowledge and ability
Figure 1.2. Quantitative comparison of vocabulary size tests in the research literature up to 2017
Figure 1.3. Uses of the VLT in research articles (based on the first 100 results of a Google Scholar search)
Figure 1.4. Samples of the NVLT multiple choice format compared with the VLT’s 6x3 clustering format
Figure 2.1. The ruler analogy for Rasch objective measurement: plotting persons/subjects and items on the same metric
Figure 2.2. Guttman scale correlating item difficulty and person ability
Figure 2.3. An s-shaped monotonic Item Characteristic Curve (ICC) indicating 3 items of different difficulty: the farthest right is the most difficult and requires more ability to answer it correctly
Figure 3.1. Sample of Standardized residual contrast plot
Figure 3.2a,b. Bubble pathway; Wright maps of seven testlets
Figure 4.1. Distribution of scores per individual and combined levels
Figure 4.2. Boxplots for score dispersions for individual and combined levels
Figure 4.3. Gender vs scores
Figure 4.4. Item-person map showing 4+ levels of person ability using Standard Deviation cut-offs
Figure 4.5. Pathway bubble chart measuring person ability and item difficulty
Figure 4.6a,b. Box plot of item difficulties per level; Normal distribution of difficulty measures in three VLT levels
Figure 4.7. Line graph of item difficulty measures (ordered 1 to 90) with Standard Error bars and trendline
Figure 4.8. Comparing VLT levels in terms of item difficulties
Figure 4.9. Standardized residual contrast plot for contrast 1
Figure 4.10. Distribution map of testlet difficulty across the 3 VLT levels
Figure 4.11. Response patterns for the 6 LID testlets
Figure 5.1. Bubble pathway plot for 7 testlets of 21 items
Figure 5.2a,b. Two analyses of 7 problematic testlets: a. Wright map of problematic testlet items and all other items; b. Q3 analysis of only the 21 items in the 7 problematic testlets
Figure 5.3. Wright map of relative testlet item difficulty and item word frequencies from COCA
Figure 5.4. Response patterns for four problematic testlets from VLT3 and 5. Note: white is how many chose the answer key, and black is how many were unanswered
Figure 5.5. Wright map of VLT2, 3, and 5 testlets ordered in vertical columns; LID pairs highlighted, PCAR >0.3 in bold, and third, outlying testlet item in bold and underlined
Figure 5.6. Zipfian distribution of percent coverage of 1000-word level words of written texts

Chapter 1. Introduction

1.1 Importance of vocabulary in SLA

Given that vocabulary is the basis of language study and that language ability can be largely described as a function of vocabulary size (Alderson, 2005, p. 88), vocabulary learning and assessment are crucial for efficient and effective language development. Criticizing the general lack of guidelines and amount of vocabulary in language learning approaches and curricula, Milton (2013) even suggests that a vocabulary size metric could be introduced into curricula in order to serve as a useful measure to assess language ability for setting targets and tracking progress (see also Cameron, 2002; Hulstijn, 2011). The Vocabulary Levels Test is the most widely used vocabulary size test and could be a good candidate for such a vocabulary size metric if its shortcomings can be shown to be minimal or can be overcome.

1.2 What is vocabulary? Vocabulary knowledge - aspects, facets and dimensions

The social sciences in the 20th and into the 21st century have been largely preoccupied with quantifying human behavior. Despite efforts to model themselves on the hard or physical sciences so as to legitimate their role in the academy and to secure funding, the “soft” sciences such as psychology, sociology, and education face challenges of measurement not found in the “hard” sciences. Unlike physical substances that have extensive and observable traits, the objects of study in the human sciences are traits such as attitudes or skills that are unobservable and latent. And while behaviorists in the 1950s and 1960s limited their investigations to only observable behavior, latent trait theorists in more recent decades posit that outward behavior is merely an indicator of internal states, or latent traits, that are hidden. In this sense, vocabulary test results can be an indicator of the hidden latent trait of vocabulary knowledge.

In the testing of vocabulary knowledge, “word” is often taken to mean “word-family”, which comprises the base form of the word and all of its inflections and derivatives, such as buys, bought, buying, buyer, and buyers for the word-family “buy”. Bauer and Nation (1993) set up a pedagogically useful scale of six levels of word-family based on the criteria of regularity, frequency, productivity and predictability, ranging from the most common and regular inflections at level 1 (e.g., “sells” and “selling”) to less common and less transparent derivatives at higher levels (e.g., “sellout”). A number of recent vocabulary size tests, such as the Vocabulary Size Test (VST; Nation & Gu, 2007; Nation & Beglar, 2007) and the New Vocabulary Levels Test (NVLT; McLean & Kramer, 2015), have used Nation’s (2012) word-family frequency wordlist divided into 1000-word levels based on the 100-million-word British National Corpus (BNC; 2007). It is operationally assumed that if learners know one word-form, they can transfer that knowledge to the other forms in the same word-family, but, as Laufer (2017) pointed out, if this assumption is incorrect, then vocabulary size tests like the VST and VLT overestimate a learner’s vocabulary. This is a complex issue to resolve and to date no research on vocabulary size testing (e.g., Beglar, 2010; McLean & Kramer, 2015; Webb et al., 2017) has even attempted to take this into account.

Knowing a word is not straightforward, and there are different aspects of knowing “words” or “word-families” that make up vocabulary knowledge. In fact, because vocabulary or lexical knowledge is a complex network composed of different components, stages and dimensions, there has been little consensus in defining vocabulary knowledge and little consistency in the usage of terminology such as model, construct, dimensions, and aspects (Gyllstad, 2013). Knowing a word involves knowing its various phonological, semantic, syntactic and even pragmatic aspects (cf. Richards, 1976; Nation, 2001). Knowledge of these aspects develops over time, from partial to precise knowledge, which can be described as a dynamic ebb-and-flow continuum moving from receptive knowledge that recognizes a word’s form and meaning to productive knowledge that enables the language user to appropriately use the word in speech and writing.

Even though a number of the vocabulary knowledge models that have been proposed do not take into account the difference between productive and receptive vocabulary (e.g., Nagy and Herman’s [1987] breadth and depth model and Meara’s [1996] size and dimension model), several models recognize the importance of this distinction, such as Henriksen’s (1999) partial/precise, depth, and receptive/productive model, and Nation’s (2001) word knowledge framework model. One classic study investigating the receptive-productive nature of lexical knowledge is Laufer and Goldstein (2004), which investigated the strength of learners’ vocabulary knowledge in four modes along the two axes of active/passive and recall/recognition. Their findings suggest that learners acquire vocabulary knowledge on a cline from passive recognition to active recall, which confirms the well-established fact that recall is much more cognitively demanding than recognition (cf. Tulving & Watkins, 1973; Griffin & Harley, 1996).

Vocabulary knowledge is, therefore, not simply a single trait, but rather multifaceted and involves different mechanisms. There are clearly distinct cognitive and physiological processes insofar as receptive knowledge depends on receiving input from the eyes or ears and productive knowledge depends on using the mouth or hands. At the same time, lexical knowledge is multidimensional, and while many researchers tend to agree on three dimensions, there has been no consensus on what these three dimensions are or how they are operationalized (Gyllstad, 2013), as Table 1.1 shows.

Table 1.1 Dimensions of vocabulary knowledge

Researchers           Dimension 1                    Dimension 2                                Dimension 3
Henriksen, 1999       Partial to precise knowledge   Depth of knowledge                         Receptive to productive ability
Meara, 2005           Vocabulary size                Vocabulary organization (& associations)   Vocabulary accessibility
Daller et al., 2007   Lexical breadth                Lexical depth                              Lexical fluency

A number of researchers, however, conceive of lexical knowledge as a theoretical three-dimensional space (e.g., Read, 2000; Daller et al., 2007; Mochizuki, 2012) that consists of breadth (i.e., overall size), depth (i.e., collocates and usage details), and accessibility or fluency (i.e., speed of access) (Figure 1.1).

Figure 1.1. Three dimensions of lexical knowledge and ability (based on Daller et al., 2007, p. 8; in Milton, 2013)

Although Milton (2009, p. 16) noted that details of this three-dimensional model are lacking, his suggestion that fluency refers to productive word knowledge whereas both breadth and depth are aspects of passive/receptive word knowledge seems to overlook the productive aspect of vocabulary depth insofar as the correct usage of a word depends largely on its use with its relevant collocates. In terms of empirical study and defining an operationalized construct that can be measured, vocabulary size (how many?) and fluency (how fast?) can be more easily quantified than depth, which is more like a quality. In fact, Milton (2013, p. 62) has criticized the notion of vocabulary depth as an “ill-defined” catchall term that contains a number of seemingly disparate aspects (e.g., less common meanings, word associations, collocations, word parts, and grammatical functions) that problematize its treatment as a single construct that can be operationalized (Milton, 2009, p. 150; Gyllstad, 2013, pp. 22-23).

1.3 Vocabulary testing for fluency, depth and size

The testing and evaluation of the three dimensions of size, depth and lexical accessibility has been the subject of much research and many testing formats. Fluency, or the speed of lexical access, in speaking and writing increases with language proficiency, and this research area has been the primary purview of psycholinguists who use computer programs designed to measure lexical access time in such tasks as locating words in a string of letters (Q_Lex; Coulson, 2005) or showing the test-taker a prime and then a target word to see how long it takes to determine whether there is a semantic relationship or not (Computer-based English Lexical Processing Task, CELP; Kadota, 2010).

Closely related to lexical fluency is vocabulary depth, which has been primarily investigated on the basis of knowledge of collocations and word partnerships. Some attempts to measure depth include affixes and association (Schmitt & Meara, 1997), synonyms and collocations (Mochizuki, 2002), and antonyms, word forms and derivations, and collocations (Koizumi, 2005; cited in Mochizuki, 2012). Perhaps the depth test most widely used by researchers is the Word Associates Format (WAF; Read, 1993), in which the test-taker is asked to identify four words (out of eight options) that are either synonyms or collocates of the target word. Another popular depth test that combines both receptive and productive vocabulary knowledge is the Vocabulary Knowledge Scale test (VKS), which uses a 5-point self-assessment scale and a rater-scored assessment of word knowledge (Wesche & Paribakht, 1996). The self-assessment requires test-takers to indicate how well they know the word, and if they feel they do, then they need to write a synonym or translation or even a sentence to demonstrate the degree of their knowledge depth.

Finally, there is the dimension of vocabulary breadth or size. Apart from vocabulary size tests that are based on a sampling of dictionary entries (e.g., D'Anna, Zechmeister & Hall, 1991), most vocabulary size tests have been devised on the hypothesis that there is a direct relationship between the frequency of a word and the probability that a learner will know it, and for this reason, these tests are based on frequency wordlists (e.g., Thorndike & Lorge, 1944; Nation, 2012; Coxhead, 2000). In many fields of knowledge and skill acquisition, researchers generally acknowledge the “ubiquity of frequency effects” (e.g., Ambridge et al., 2015), even though the effects of word frequency on second language learners’ acquisition order and mental lexicon have been difficult to either empirically prove or disprove (R. Ellis, 1994). Nonetheless, most researchers generally concur with N. Ellis (2002a,b) that there is a close relationship between language acquisition and frequency of input, with word frequency and its range across media and genres seen as the most salient criteria for determining the usefulness of a word (Koprowski, 2005). This is especially the case for the top 2000 word-families, which are clearly useful to know since they typically make up 84-90% of English usage (Webb et al., 2017), and for this reason frequency wordlists have long informed materials and test development for language learning.

While there have been vocabulary size tests measuring productive knowledge of vocabulary (e.g., Laufer and Nation’s [1999] Productive Levels Test), the bulk of research into and tests on vocabulary size have targeted receptive knowledge of form-meaning connections. Compared to tests of other dimensions of vocabulary knowledge, vocabulary size tests have received the most research attention for at least two reasons: 1. this kind of test seems to be more straightforward to create and quantify, and 2. vocabulary breadth has been shown to be a key indicator of language proficiency (Laufer, 1992; Laufer & Goldstein, 2004; Milton, 2009).

The three most popular tests of vocabulary size in the literature are the Yes/No Test (also used in the Eurocentres Vocabulary Size Test and DIALANG diagnostic test), the Vocabulary Size Test (VST), and the Vocabulary Levels Test (VLT). A comparison of publication citations from Google Scholar reveals that, compared to the other vocabulary breadth tests, there have been more research articles in which the VLT has been an important “focus”¹ or reference (see Figure 1.2).

¹ “Focus” is here operationalized by using the search terms “test name (acronym)” under the assumption, as per acronym usage conventions, that the test was mentioned more than once and was therefore an important “focus” or reference in the research article. Data were collected in August 2017. See Appendix A for a table of these statistics.
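As a rough illustration of the coverage figure cited in this section (the top 2000 word-families accounting for 84-90% of English usage), the following Python sketch computes what share of all running words is covered by the n most frequent entries of a frequency list. The function name and the toy counts are hypothetical and are not drawn from the BNC, COCA, or any published wordlist; with a real list, each key would be a word-family headword and each value its corpus frequency.

```python
from collections import Counter


def coverage_of_top_n(counts: Counter, n: int) -> float:
    """Share of all running words (tokens) covered by the n most frequent entries."""
    total = sum(counts.values())
    if total == 0 or n <= 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Hypothetical toy counts, for illustration only.
toy_counts = Counter({"the": 50, "of": 30, "cat": 15, "zebra": 5})

for band in (1, 2, 3, 4):
    print(band, round(coverage_of_top_n(toy_counts, band), 2))
```

Because a few very frequent entries account for most tokens, coverage climbs steeply at first and then flattens, which is the pattern that motivates sampling test items by frequency band.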

Figure 1.2. Quantitative comparison of vocabulary size tests in the research literature up to 2017

1.31 Vocabulary size tests: Yes/No, VST, VLT

Each of these tests has its own rationale and specific format, which largely contribute to how effectively it measures the trait of receptive vocabulary knowledge.

1.32 Yes/no checklist

In the Yes/No Test, also referred to as the “checklist” format (Schmitt, 2010), examinees are presented with words and nonwords, and they have to indicate whether they know each word (“yes”) or not (“no”). This test format enjoyed wide use in the 1990s as a computerized placement test in the Eurocentres Vocabulary Size Test (EVST; Meara & Jones, 1990) and in the 2000s as a diagnostic tool in the online European DIALANG system (Alderson, 2005). Taking Meara’s (1992) version as an example of the Yes/No Test format, the test covers the first 5000 most frequent words,

and the test-takers are asked to check the words they “know”. To counteract the overestimation of “known” words, pseudowords are also added in a proportion of 23-33% of the total number of items (Schmitt, 2010, p. 200); if test-takers mark pseudowords as known, their test scores are adjusted, or “corrected”, downward accordingly. As for the EVST, a computer program presents the test-taker with 20 words randomly selected from each 1000-word frequency band up to the 10000-word level (based on Thorndike and Lorge’s [1944] word frequency list). The advantages of this test format are that the task is straightforward for the test-taker to take and for the test-maker to make, and that a large number of items can be answered quickly. In terms of validation studies, Meara (1996) found that the Yes/No test correlated moderately well with other vocabulary tests, and Anderson and Freebody (1981) reported that the “corrected” score of the Yes/No format achieved a .84 correlation with a multiple choice version; they further conducted post-test interviews and determined from the post-interview scores for both tests that the Yes/No version gave a better estimate of true word knowledge. Unfortunately, the concept of knowing a word can vary in degree and from person to person, and the construction and use of pseudowords that resemble existing words (in either the test-takers’ L1 or L2) can be misleading.

1.33 Vocabulary Size Test (VST)

Perhaps the two most widely used vocabulary size measures are the Vocabulary Levels Test (VLT) and the Vocabulary Size Test (VST), both of which are included in Schmitt (2010) and Nation (2001), and are freely available on Nation’s and Schmitt’s personal websites as well as Cobb’s Lextutor website. The VST (Nation & Gu, 2007; Nation & Beglar, 2007) aims to quantify a learner’s vocabulary size by testing a sample of 10

words from 14 1000-word levels (based on Nation’s (2006) BNC family word lists). Each word has a non-defining example sentence whose purpose is to indicate part of speech, but not meaning, with four multiple choice definitions presented to the test-taker (see the NVLT format in Table 1.4, which was based on the VST format). Because the test items are samples from 14 levels, the test gives a wide-ranging estimate of vocabulary size, making it in theory useful for measuring learner progress in vocabulary learning (Schmitt, 2010).

The VST has been validated (Beglar, 2010) through a Rasch analysis, which found that 1. examinees’ scores decreased towards lower frequency, 2. the Rasch model accounted for 86% of the total variation of test scores, 3. the items generally performed well, and 4. the reliability figures were very impressive (.96-.98). However, it should be noted that the sample of 197 test subjects for a test of 14 levels with 140 items falls far short of the adequate sample size of 840, which is based on Linacre’s (2017)² minimal recommendation for this kind of analysis; in this light, Beglar’s (2010) Rasch results are only suggestive. As an estimate of vocabulary size, the total score out of 140 (for the version with 14 1000-word levels) multiplied by 100 can be said to approximate the total number of word families known by the learner; a score of 45 out of 140 therefore implies that the learner knows 4500 word-families. However, with no empirical support, the test relies on a questionable faith in the assumptions that only 10 words chosen out of each 1000-word level can validly represent 1. all the words in that level (Meara, 1996), and 2. learners’ knowledge, which is assumed to potentially mirror that of each level in the frequency wordlist.

2. According to Linacre, each level would presumably be treated as an independent form and should have about 30 test-takers with equal representation of ability levels (10 low, 10 intermediate, 10 high), and this number should be doubled to be safe; this means an adequate sample should have 14 levels x 30 x 2 = 840 (Linacre, Sept. 13, 2017).
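The two calculations described above (the VST size estimate and the minimal sample size derived from Linacre's recommendation) can be restated as a short sketch. The Python snippet below is only an illustration of the arithmetic in the text; the function names and default arguments are hypothetical and are not part of the VST materials or of any Rasch software.

```python
def estimate_word_families(raw_score, families_per_item=100):
    """Approximate the number of word families known from a VST raw score.

    Assumes the 14-level version, in which each of the 140 items stands
    for 100 word families (10 items sampled per 1000-word level).
    """
    return raw_score * families_per_item


def minimal_sample_size(levels, per_level=30, safety_factor=2):
    """Restate the sample-size arithmetic: levels x 30 test-takers x 2."""
    return levels * per_level * safety_factor


# Worked values taken from the text
print(estimate_word_families(45))  # score of 45 out of 140
print(minimal_sample_size(14))     # 14 levels treated as independent forms
```

Under these assumptions, a sample of 197 test-takers covers less than a quarter of the 840 that the recommendation implies for a 14-level test.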

1.4 Vocabulary Levels Test (VLT)

While the VST’s wide range of 14 levels makes it a comprehensive instrument for measuring a wide range of proficiencies, the limited number of words tested at each level affords insufficient information to evaluate the mastery of each level. The VLT, on the other hand, is not a vocabulary size test, but was rather intended to be used to profile learners’ vocabulary knowledge at particular frequency levels (Kremmel & Schmitt, 2018). With 150 items from only five 1000-word levels, the VLT has been described as a “diagnostic test” that can “let teachers quickly find out whether learners need to be working on high … or low frequency words” (Nation, 2001, pp. 21-22, 373). The VLT has also been argued to be a “more pedagogically useful measure of lexical knowledge” as a “test designed to measure the degree of mastery of the most frequent words of English” (McLean & Kramer, 2015). This means that the VLT is not technically a vocabulary size test (Beglar, 2010; Nation, 2001), but can be used for research as a diagnostic test and also as a placement test (Schmitt et al., 2001; Huhta, Alderson, Nieminen, & Ullakonoja, 2011).

1.41 VLT popularity and test format

The VLT has been made widely accessible. It has been published in paper form (Schmitt, Schmitt & Clapham, 2001; Nation, 1990; Schmitt, 2000; and Schmitt, 2010) and is freely available on the popular personal websites of Paul Nation and Norbert Schmitt, as well as on Tom Cobb’s Lextutor website. Due to its availability and perceived usefulness, the VLT has been pervasively used and cited in research. According to a Google Scholar search of the first 100 results for the keyword search terms “vlt ‘vocabulary levels test’”, 41% of the articles were position papers and 59% were experimental studies using the VLT as a vocabulary size measure, with almost one third

of the research relating to reading ability and another third using it as a measure of vocabulary gain over time or after a treatment (see Figure 1.3; see also Appendix B for a more complete summary of uses of the VLT in research). These research articles used either Nation’s (1990) version or Schmitt et al.’s (2001) version (see Appendix C for VLT 2001 version 1 for levels 2, 3, and 5).

Figure 1.3. Uses of the VLT in research articles (based on the first 100 results of a Google Scholar search)

From its inception in 1983, the VLT format traditionally covered five levels of word families (the 2000-, 3000-, 5000-, 10000-word levels, and the non-frequency-based university wordlist) (Nation, 1983, 1990; Beglar & Hunt, 1999; Schmitt et al., 2001), with frequency information coming from Thorndike and Lorge (1944) and Kučera and Francis (1967). There are two more recent versions that both focus only on the 1000 to 5000 levels (McLean & Kramer, 2015; Webb & Sasao, 2013) from the BNC corpus and the Academic Wordlist (Coxhead, 2001), but McLean and Kramer’s New Vocabulary Levels Test (NVLT) adopted the VST multiple choice format instead of the

6x3 clustering format of the traditional VLT (Figure 1.4), which differs from multiple choice insofar as three items select responses from the same pool of six response options.

Sample of NVLT multiple choice format:

1. time: They have a lot of time.
   a. money
   b. food
   c. hours
   d. friends

Sample VLT 6x3 cluster format:

1 king
2 water
3 spider
4 brush
5 shoe
6 bird

___ eight legs
___ drawing
___ fly

Figure 1.4. Samples of the NVLT (McLean & Kramer, 2015; based on the VST) multiple choice format compared with the VLT’s 6x3 clustering format (Nation, 1983, 1990; Schmitt et al., 2001)

As is evident from Figure 1.4, the VLT test items are designed to reveal “the very basic and initial stages of form-meaning link learning” (Kremmel & Schmitt, 2018). Moreover, the 30-item levels should be understood as independent tests that aim to diagnose the amount of “partial vocabulary recognition” mastery at these frequency levels. Over the years, different researchers have offered different and seemingly “arbitrary” (Xing & Fulcher, 2007) benchmarks for mastery: Nation (1983) originally proposed 66%, Schmitt et al. (2001) 87%, Xing and Fulcher (2007), upon Schmitt’s advice, 80%, and more recently, Webb et al. (2017) recommend a higher cut-off score of 97% for the first 3000 words and 80% for the subsequent levels. Because the VLT is a low-stakes diagnostic test whose sole purpose is to estimate the extent of mastery of different frequency levels, test-takers are instructed not to guess and to leave items blank if they have no idea of the answer.

1.42 VLT item difficulty as frequency

Research using the VLT has tended to be more interested in person ability, and for this reason, the issue of item difficulty has often been avoided or treated as a “nuisance variable” (Culligan, 2015). The implicit assumption, however, has been that each of the different frequency levels represents, or at least correlates with, a different level of difficulty, with higher frequency words from the 2000 level being more likely to be known (i.e., easier) than words at the 5000 level (i.e., more difficult). This is why validation studies for frequency-based vocabulary tests like the VST (Beglar, 2010) and VLT (Schmitt et al., 2007; McLean et al., 2015; Webb et al., 2017) try to show the positive correlation between VLT (frequency) level items and overall number of correct responses (though the results are not very transparent for levels past the 3000-word level; cf. Beglar, 2010; McLean et al., 2015; Webb et al., 2017). The correlation between word frequency and word-item difficulty is intuitive and is consonant with the general consensus that learners tend to acquire and access the most frequent vocabulary first (N. Ellis, 2002; N. Ellis & Larsen-Freeman, 2009; Bybee, 1995, 2006), which has prompted Milton to claim that “the importance of frequency in vocabulary learning is as near to a fact as it is possible to get in L2 acquisition” (Milton, 2009, p. 242). In the context of vocabulary testing, a higher ability student should therefore correctly answer a lower difficulty item, where a lower difficulty item derives from a higher frequency level, such as VLT level 2. Validation studies have thus emphasized that the levels in the VLT form an implicational scale insofar as mastery of one level implies mastery of the previous one (e.g., Schmitt, 2000; Schmitt et al., 2001; Webb et al., 2017).

On the other hand, Culligan (2015) has shown that word frequency correlates only “marginally better” with Rasch item difficulty measures than orthographic features do (e.g., number of letters, syllables or phonemes), with correlation indices ranging from an unimpressive -0.17 to -0.71 (where the negative value indicates the expected inverse relationship between word frequency and item difficulty). It should also be noted that in their VLT validation study, Schmitt et al. (2001, p. 70) flatly rejected this frequency-as-difficulty assumption, stating that “difficulty has no real bearing on whether a word belongs in a certain section or not”; however, they did not offer an argument for this counter-intuitive statement, and this issue will be discussed in more detail later in Section 2.3. Generally, though, the operating assumption is that there is, or should be, a correlation between word frequency level and item difficulty. In this way, the VLT levels are independent and modular subtests, and not necessary components of the VLT, such that certain levels can be removed if deemed inappropriate for the group of learners, e.g., the 2000 level may be skipped for advanced level learners.

1.43 VLT format

In terms of language representation, the VLT aims to echo the ratio of word forms in natural language. In Beglar and Hunt’s (1999) revisions of the 2000 level and University Word Level, the ratio of nouns to verbs to adjectives was 5:3:1, but Schmitt et al. (2001) examined the word-form distribution in a corpus and revised the ratio to 3:2:1 to reflect natural language use more accurately.

The VLT’s traditional item format comprises clusters of three definitions and six word choices, about which there are two noteworthy aspects: the first is that the definitions are actually the items and the words are the answer options, and the second is that the questions are decontextualized (i.e., context-independent) and aim to test partial knowledge (i.e., the most common definition of the word or word association); this form of testing receptive vocabulary knowledge allows for more questions to be

asked, thereby enhancing its reliability. It has also been suggested that the VLT 6x3 testlet format is efficient in that each level tests not only the 30 word-answers (or “keys”) for the 30 items, but also the 30 distractors, meaning that each level is actually testing 60 words from each 1000-word frequency level (Read, 1988; Culligan, 2015). This, however, is a problematic assumption because it presupposes the deployment of test-taking strategies and guessing, which runs counter to the intention of this diagnostic test to provide an accurate assessment of vocabulary level mastery. The aim of the test-taker, therefore, is not to get the highest score possible, but to answer as honestly as possible in order to understand the extent of their vocabulary knowledge.

The test format and purpose of the VLT, therefore, present two implications. The first is that each VLT level is a de facto independent test with the sole purpose of representing that frequency level with 30 items so as to diagnose the test-takers’ extent of mastery of that level. The second implication is that the testlets/items should consistently span the range of 1000 words across the frequency level. After all, 1000 words is, for most EFL learners, a substantial percentage of their vocabulary knowledge, if we take into account Laufer’s (2010) research that estimates an average vocabulary size of 2000 to 4000 words for university EFL learners across the world.

1.44 Validation evidence

Read (2000) validated the VLT’s two main assumptions: 1. high frequency words are more likely to be known by ESL learners than low frequency words, and 2. the VLT is an implicational scale, i.e., that mastery of the 5000 level implies mastery of the 2000 and 3000 levels. Schmitt et al.’s (2001) findings echoed Read’s, and later validation studies on leveled vocabulary size tests, like the VST and VLT, have

similarly sought to show that lower-frequency levels are more difficult using raw scores, Rasch item difficulty scores, and/or ANOVA comparisons of scores from different levels (Beglar, 2010; McLean et al., 2015; Webb et al., 2017). Schmitt et al. (2001) also found high reliability in their two VLT test versions, in which the number of items per level was increased from 27 (Beglar & Hunt, 1999) to 30.

In recent years, there have been a number of validation studies on vocabulary size diagnostic tests that have evaluated the tests based on Messick’s “general validity criteria or standards for all educational and psychological measurement” (Messick, 1995, p. 6). Messick (1989, 1995) outlined a comprehensive set of six criteria for a unified concept of construct validity: content, substantive, structural, generalizability, external, and consequential validity. The first three of these criteria involve the “internal” mechanics of the test and its construction. The content aspect refers to evidence of content relevance, representativeness and technical quality. The substantive aspect involves the theoretical rationales for observed consistencies in test responses, as well as empirical evidence that these processes are being used by the test-takers. The structural aspect consists of evaluating the degree to which the internal structure of the assessment, as represented by test scores, is consonant with the structure of the construct domain, which largely involves the construct-relevant sources of task or item difficulty. The remaining three aspects of construct validity are externally relational: generalizability refers to the extent that the score properties can generalize across different populations of test-takers; the external aspect marshals evidence of comparability with outside sources, such as other validated tests; and the consequential aspect involves the implications of test use and whether it is fair and unbiased.
Several of these aspects have been the focus of validation studies for the Vocabulary Size Test (Beglar, 2010), Listening Vocabulary Levels Test (LVLT; McLean et al., 2015) and Vocabulary Levels Test (Webb et al., 2017), all of which have employed a Rasch-based measurement approach to structure and provide empirical support for their validation arguments. All of these studies concluded that their collected and Rasch-measured evidence supports the validity of these instruments. The details of these studies will be discussed after the Literature Review presents a conceptual overview of Rasch measurement. Nonetheless, Kremmel and Schmitt (2018) make the pointed observation that given the VLT’s widespread adoption by teachers and researchers in the last three decades, there have been “surprisingly few studies published that investigated the validity of the instrument”, especially of the Schmitt et al. (2001) version, which has been used and cited more than any other version.

1.45 Value of VLT: decontextualized, reliability and expedient format

The VLT has undergone extensive revision since its initial release in 1983 (Nation, 1983, 1990; Beglar & Hunt, 1999; Schmitt et al., 2001; McLean & Kramer, 2015; Webb, Sasao & Ballance, 2016), and its usefulness has long been recognized, most recently by Harding, Alderson and Brunfaut (2015), who see the potential in tests like the VLT to diagnose learners’ vocabulary size in terms of bands of frequency levels. However, even though the VLT was extensively validated in 2001 (Schmitt et al., 2001), there has been in the last two decades an obvious trend away from decontextualized vocabulary testing on high-stakes tests, such as TOEFL, IELTS, TOEIC, Pearson Academic, and BULATS, which subsume vocabulary testing within larger macro-skills like reading.
This move away from decontextualized testing, such as the pre-1995 TOEFL discrete vocabulary testing section, towards “authentic” language and tasks in “language-use tasks” (Read, 2000) marks the current dominant holistic and communicative approach to skills testing

that aims to assess the four language macro skills of reading, writing, listening and speaking as “the contextualized realization of the ability to use language in the performance of specific language use tasks” (Bachman & Palmer, 1996, pp. 75-76).

But there are still advantages of and proponents for decontextualized and discrete vocabulary testing. Even though decontextualized testing is no longer in vogue, owing to concerns about negative washback (namely, that this kind of testing on high-stakes tests encourages learners to memorize vast amounts of vocabulary by rote without being able to use it), there are still those who advocate the testing of vocabulary as a discrete and unidimensional latent trait for diagnostic assessment (Alderson, 2005; Hulstijn, 2011, p. 244). Cameron (2002), in an often-cited article, appeals to common sense and similarly defends the use of decontextualized, discrete testing of vocabulary: “Eventually, after sufficient contextualized encounters, a word will be recognized when it is met in context or in isolation . . . it does not seem unreasonable to test to see how much vocabulary can be recognized without extended linguistic or textual contexts” (p. 151). Finally, Nation (2012, online information on the VST) pointed out that there can still be positive washback associated with decontextualized testing: 1. decontextualized learning with flashcards has been shown to be highly efficient (Nation, 2001, pp. 297-299), and 2. this learning seems to contribute to both explicit and implicit knowledge (Elgort, 2011).

VLT versions with 30 items (i.e., 10 bundles or testlets of 3 items) per level have been shown to have a higher reliability than the previous shorter versions (Schmitt et al., 2001). For a levels test that must consider the practicality of testing a limited number of items per level to represent 1000-word levels, the more items tested per level, the more

representative and valid the results of that level. However, the trade-off of lengthy tests is impracticality and results that are negatively affected by loss of attention and test-taker fatigue. In this light, the 6x3 testlet format means that more items can be answered in a given time than with multiple choice questions, and there are those who argue that the clusters test not only the three answer-key words, but all six options (Read, 1988), which means that 60 words represent each 1000-word level. In this sense, Culligan (2015) points out the VLT’s efficiency in allowing students to answer many questions in a short period of time. And even though the yes/no format is even more expedient than the 6x3 format, Cameron (2002) showed in a comparison study that the VLT was more useful than Meara’s (1992) Yes/No Test for diagnosing the vocabulary knowledge of secondary school students.

1.46 Suspected problems with the VLT: local dependence, guessing and unidimensionality

In addition to its benefits, the VLT has major shortcomings related to its format and content. Relating to its format, concerns have been raised about the effects of guessing and, more importantly, about the matching format, which links items together such that one response can constrain the possible choices for another. There has been a long-standing concern that the VLT’s clustering format results in Local Item Dependence (LID), with the answer to one item possibly influencing the answering of another item. Relating to both format and content is the issue of unidimensionality, that is, whether the test is testing only the dimension of vocabulary knowledge. Analysis of the VLT’s unidimensionality and LID can be achieved using either Classical Test Theory (CTT) methods or Item Response Theory (IRT) methods. The next section gives a brief overview of these methods and argues that the Rasch method (often categorized as an IRT model) will

provide more accurate results for detecting LID and unidimensionality than either CTT or other IRT methods.

Summary

One of the best indicators of language proficiency is vocabulary size. Large amounts of vocabulary are necessary for competence in the four basic skills of reading, writing, listening and speaking, and form the basis of communicative competence. For these reasons, the last 30 years or so has seen a burgeoning of research and interest in measuring the vocabulary size of language learners, resulting in hundreds of research publications on the development and use of vocabulary size tests such as the Yes/No Checklist Test, Vocabulary Size Test and the Vocabulary Levels Test.

These tests have been used as placement tests, diagnostic tests and benchmarks for learning in pre- and post-test types of studies. Each of these three vocabulary size tests has its own focus and unique format, but the VLT has received the most attention in research publications in the last 35 years, despite widespread suspicion of its item cluster format. Since the 1980s, when Nation (1983) first published the VLT, people have been concerned that the cluster format of this test results in the response to one item influencing the possibility of answering another. In other words, since each item cluster is composed of three items (definitions) and six answer options (words), it is suspected that the answering of one item can unfairly influence, or depend on, the answering of another item in the cluster, since the three cluster items draw from the same set of answer options. This type of Local Item Dependence (LID) is called item chaining and appears to be a flagrant violation of the basic assumption of Local Item Independence (LII) in Classical Test Theory as well as Item Response Theory. And if item chaining is pervasive throughout the test, this also challenges another fundamental

assumption in test theory: unidimensionality, or the test’s capacity to measure only one trait, such as vocabulary knowledge. If both of these assumptions are substantially violated by Local Item Dependence (LID), the test’s reliability and validity are necessarily called into question.

Previous VLT validation studies (e.g., Beglar & Hunt, 1999; Schmitt et al., 2001; Webb et al., 2017) using Classical Test Theory, Factor Analysis and Item Response Theory (Rasch modelling) were unable to uncover evidence for LID. The purpose of this dissertation, then, is to investigate the issue of LID in a shortened version of the VLT (three levels instead of five) using a wider variety of Rasch modelling approaches, triangulated so as to identify the existence and extent of LID in the VLT.

Chapter 2. Literature Review

2.1 Statistical analysis: Virtues of Rasch modelling over CTT and IRT

2.11 Classical Test Theory

In contrast to more modern psychometric theories, collectively known as Item Response Theory (IRT), Classical Test Theory (CTT) refers to classical psychometric theory that aims to understand and improve the reliability of psychological tests, which have for decades been the mainstay in disciplines ranging from psychology to economics to education. Evolving since Binet created his intelligence test in the early 1900s, CTT is regarded as a simple, robust model (Coaley, 2009). Based on Novick’s (1966) foundational formulation, CTT predicts the outcomes of these tests, such as the difficulty of items or the ability of test-takers. Mathematically, the theory is grounded in the idea that a person’s observed or obtained score on a test is the sum of a true score (error-free score) and an error score. The relationship between these three elements is often formulated as:

Observed Score = True Score + Error, or X = T + E.

CTT can therefore be viewed as true score theory, where the true score (T) of a person is a hypothetical construct that could be realized if the person were to complete the same test an infinite number of times. The main concern is to quantify the random error (E) part, and in test creation, to minimize the error so that the observed score (X) will approach the true score.

According to Traub’s (1997) historical analysis, CTT was the result of an evolution of three concepts: 1. the recognition that errors are intrinsic to measurements, 2. the realization that this error is a random variable, and 3. the conception of correlation and

how to measure it. Charles Spearman started the evolution in 1904 by working out how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index of reliability needed in making the correction (Traub, 1997).

In test development and validation, items require analysis. CTT item analysis is most commonly achieved using descriptive statistics and involves calculating the item mean and item variability. In this framework, more effective items have both higher variability and item means closer to the center of the distribution of the item scores. CTT can investigate an item by analyzing its distractors, difficulty, discrimination, and total correlations (Coaley, 2009, pp. 35-40).

Distractor analysis evaluates and compares the frequency with which each answer option is selected. Ideally, the distractor options are more or less equally chosen by the test-takers who answered the item incorrectly. Difficulty analysis produces a difficulty indicator, or p value, which represents the proportion of test-takers who answered the item correctly; this is calculated by simply dividing the number of people who answered the item correctly by the total number of people who answered it. A high p value approaching 1 indicates that most people got the item correct, suggesting that the item is too easy; on the other hand, a p value close to 0 suggests the item is too difficult. A p value of 0.5 indicates moderate difficulty, and such an item is better able to discriminate among test-takers. A well-balanced test will have items representing a range of difficulties (0.2-0.8), but their mean p value should be close to 0.5. Difficulty analysis can also be extended to determine if the item exhibits bias towards any group of test-takers; this is done by comparing the groups’ total correct scores on an item.
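As a minimal sketch of the difficulty analysis just described, the following Python snippet computes item p values from a small matrix of scored responses. The data are invented for illustration, the function names are hypothetical, and the snippet assumes that every test-taker answered every item, so that the denominator is simply the number of test-takers.

```python
# Hypothetical scored responses (invented data): one row per test-taker,
# one column per item; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]


def item_p_values(matrix):
    """Return the CTT difficulty index (proportion correct) for each item."""
    n_people = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[i] for row in matrix) / n_people for i in range(n_items)]


def outside_range(p_values, low=0.2, high=0.8):
    """Return the indices of items whose p value falls outside [low, high]."""
    return [i for i, p in enumerate(p_values) if not low <= p <= high]


p_values = item_p_values(responses)
mean_p = sum(p_values) / len(p_values)
flagged = outside_range(p_values)

print(p_values)          # one p value per item
print(round(mean_p, 2))  # mean difficulty across items
print(flagged)           # items that are too easy or too difficult
```

In this invented data set, the first item is answered correctly by every test-taker, so it is the one flagged as too easy to be informative.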

Discrimination analysis determines whether the response on one item is related to the responses on all of the others. This can help identify which items are effectively measuring the trait (or dimension) under investigation. People who score well are more likely to answer an item correctly, while lower scorers will be less likely. However, if lower scorers tend to answer an item correctly either more often than higher scorers (negative discrimination) or just as often (zero discrimination), this is a red flag and suggests that the item is measuring a different trait or dimension. Item discrimination is often calculated by comparing the top and bottom 27% of the distribution of scores. Specifically, discrimination, or d, is found by subtracting the proportion of people getting the item right in the low group (Pl/Nl) from that in the high group (Ph/Nh):

d = Ph/Nh - Pl/Nl

Items that discriminate well are easier for the higher group, and thus have large, positive values of d. Items with a negative value are easier for lower scorers and should be removed.

Total correlation analysis is another method to evaluate the discriminability of an item and involves determining the correlation between an item and a total score on that measure. Items with high positive item-total score correlations are more clearly related to the trait or dimension being measured. These items exhibit more variability than others with lower correlations, which indicates better ability to discriminate between high and low values. A negative value suggests that an item is negatively related to the other items on the measure, and that it is measuring a different trait.
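The two discrimination measures described above can be sketched as follows. This is an illustrative Python example only: the response matrix is invented, the function names are hypothetical, the 27% grouping is applied by simple rounding, and the item-total correlation is the uncorrected version (the item itself is included in the total score).

```python
import math

# Hypothetical scored responses (invented data): rows are test-takers,
# columns are items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]


def discrimination_index(matrix, item, fraction=0.27):
    """d = Ph/Nh - Pl/Nl, using the top and bottom score groups."""
    ranked = sorted(matrix, key=sum, reverse=True)
    n_group = max(1, round(len(ranked) * fraction))
    high, low = ranked[:n_group], ranked[-n_group:]
    p_high = sum(row[item] for row in high) / n_group
    p_low = sum(row[item] for row in low) / n_group
    return p_high - p_low


def item_total_correlation(matrix, item):
    """Pearson correlation between item scores and (uncorrected) totals."""
    x = [row[item] for row in matrix]
    y = [sum(row) for row in matrix]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when an item or the total has no variance
    return cov / (sd_x * sd_y)


for i in range(len(responses[0])):
    d = discrimination_index(responses, i)
    r = item_total_correlation(responses, i)
    print(f"item {i}: d = {d:+.2f}, item-total r = {r:+.2f}")
```

In this invented data set, the last item is answered correctly only by low scorers, so both its d value and its item-total correlation are negative, which is the pattern the text identifies as a red flag.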

Although CTT was the dominant mode of analysis in the social sciences for decades and still remains very widely used, Hambleton, Swaminathan, and Rogers (1991, pp. 4-5) pointed out almost 30 years ago that there are four important weaknesses of CTT. The first involves its definition of reliability as “the correlation between test scores on parallel forms of a test,” which is problematic because there is no consensus as to what parallel tests are. Another issue is related to its conception of standard error, which is assumed to be the same for all test-takers. Unfortunately, this assumption is difficult to accept given that scores on any test are unequally precise measures for examinees of different ability. The third shortcoming of CTT that Hambleton and colleagues identified is that examinee characteristics and test characteristics cannot be separated, which means that they can only be interpreted with reference to each other. Finally, CTT is test-oriented, i.e., based on the sum of all items. Since this approach is not oriented to individual items, CTT is unable to make predictions on how well a test-taker or even a group of test-takers might do on a particular test item.

2.12 Modern Latent Trait Theory: IRT

In response to the shortcomings of CTT, Item Response Theory (IRT) was developed to investigate the relationship between a person’s response to an item and the trait or attribute being measured. In IRT, the attribute is usually described as an underlying “latent trait”, and for this reason the theory has been called “latent trait theory”. IRT can be applied for test construction, test validation and even test scoring.

CTT has been referred to as a “weak model” because its assumptions can be easily met by traditional procedures, whereas IRT is described as a “strong model” given its

(42) stringent assumptions that the test data must meet (Kline, 2000). For traditional likelihood unidimensional IRT models, there are three assumptions that must be met (Embretson & Reise, 2000):

1. Items should have a monotonic relationship with their underlying trait, i.e., as ability increases, so does the probability of answering an item correctly;
2. Items measure or should load onto only one trait or dimension; and
3. Item responses should be independent of each other and depend only on ability.

IRT and Rasch methodologies were developed in the 1950s and 1960s by pioneers including psychometrician Frederic Lord, mathematician Georg Rasch, and sociologist Paul Lazarsfeld (Engelhard, 2013), but only gained currency by the late 1970s and early 1980s with the increasing availability of the computers necessary to run the sophisticated and highly iterative mathematical functions. In fact, it was this mathematical sophistication that provided the impetus for much suspicion and criticism of IRT, especially in comparison with the mathematically simpler and more intuitive CTT (Coaley, 2009, pp. 40-43). Although IRT and Rasch methodologies entered the field of language testing in the 1980s, skepticism about the highly technical nature of the Rasch model resulted in controversy in the 1990s in what McNamara and Koch (2012) called the “Rasch wars”.

Nevertheless, IRT and Rasch approaches have since been accepted as a methodological staple of research into language testing, especially high-stakes testing by companies like ETS (Educational Testing Service, creators of the GMAT, TOEFL, TOEIC; Davies, 2003), which have used IRT for quality control and test validation, and are viewed as a more accurate method of determining test scores than Classical Test Theory (CTT). Whereas CTT treats every test item as equal, IRT is a scaling

(43) method of measurement that identifies patterns of responses to account for items of different difficulty, as well as persons of different levels of ability. IRT is a probabilistic “model”, i.e., a simplification of reality, that assumes that persons with more ability have a higher probability of getting more difficult items correct.

The probability of a correct item response is defined as a mathematical function of both person (i.e., ability) and item parameters, of which there are three possible parameters: 1. Difficulty (i.e., location on a difficulty range), 2. Discrimination (i.e., slope, or how steeply the rate of success varies with ability), and 3. Pseudo-guessing (i.e., the probability of the least able persons guessing an item’s correct response). Corresponding to these three item parameters are three IRT models: the one-parameter logistic (1-PL) model that only takes into account the difficulty parameter, the 2-PL model that takes into account difficulty and discrimination, and the 3-PL model that includes all three item parameters.

2.13 Rasch vs IRT: Only Rasch is an invariant measuring method

The analyses in this study will use the Rasch model for measurement. Although the Rasch model is also known as a special case of the 1-PL IRT model, Rasch proponents, such as Bond and Fox (2007, p. 265), disavow its association with the 1-PL, 2-PL and 3-PL IRT models because these three IRT models are sample dependent, similar to CTT. From the Rasch point of view, the users of IRT models follow the traditional mindset of statisticians: to explain data in terms of models. The Rasch approach does the opposite by furnishing a measurement model variously described as “objective” (e.g., Zhang & Yang, 2015), “invariant” (e.g., Engelhard, 2013; Millsap, 2012), or “fundamental” (Bond & Fox, 2007), against which the data are evaluated in terms of fit or misfit. Rasch measurement thus assumes that the indices of

(44) discrimination and guessing are negligible. While guessing may be implicated in test-taking behavior on the VLT, there is reason to believe its impact is negligible, given that the test instructions urge against guessing and that Schmitt et al. (2001) found little evidence of guessing in their post-test interviews. In contrast to the 1-PL, 2-PL and 3-PL models, which use difficulty, discrimination and guessing parameters to describe (and therefore to better fit) the data, the Rasch model is “person-distribution-free” because it individually parameterizes persons; this differs from the 1-PL model, which assumes the underlying latent trait distribution to be standard normal (Castaneda, 2017). Rasch drew on the notion of a “sufficient statistic” (Fisher, 1922), which fully explains the sample insofar as the sample cannot provide any additional information to determine the test taker’s ability. As such, Rasch measurement is used to compare the data with an “objective” model that we would expect in the real world. In other words, Rasch measurement aims to provide a means to objectively, or invariantly, measure phenomena according to an interval scale of equal lengths, just as a ruler invariantly measures the length of different objects, such as a 2-centimeter length of thread, coin or piece of paper. In the Rasch model, the same ruler is used to simultaneously, or conjointly, measure person (subject) ability and item difficulty (see Figure 2.1).

Figure 2.1. The ruler analogy for Rasch objective measurement: plotting persons/subjects and items on the same metric
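The relationship between the three-parameter family and the Rasch model, and the idea of placing persons and items on one ruler, can be illustrated with a minimal Python sketch. It assumes the standard logistic form of the 3-PL model, with the Rasch model obtained by fixing discrimination at 1 and guessing at 0. The function names (`p_correct`, `rasch`, `log_odds`) and the symbols `theta` (ability), `b` (difficulty), `a` (discrimination) and `c` (pseudo-guessing) are conventional labels chosen for illustration; they are not taken from the study or from Winsteps.

```python
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """3-PL probability of a correct response: c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def rasch(theta, b):
    """Rasch probability: only the distance theta - b (in logits) matters."""
    return p_correct(theta, b, a=1.0, c=0.0)

def log_odds(p):
    """Convert a probability to log-odds (logits)."""
    return math.log(p / (1.0 - p))

# Persons and items share one logit scale (the "ruler"):
# a person located exactly at an item's difficulty succeeds half the time.
print(round(rasch(0.0, 0.0), 3))   # 0.5
print(round(rasch(1.0, 0.0), 3))   # 0.731, one logit above the item

# Invariance under the Rasch model: the log-odds gap between two persons
# equals their ability difference, whichever item is used to compare them.
for b in (-2.0, 0.0, 1.5):
    gap = log_odds(rasch(1.0, b)) - log_odds(rasch(-0.5, b))
    print(b, round(gap, 6))        # the gap is 1.5 for every item

# With a pseudo-guessing parameter, very low ability no longer implies
# a success probability near zero.
print(round(p_correct(-10.0, 0.0, c=0.2), 3))
```

The loop shows the property behind the ruler analogy: under the Rasch form, the comparison between two persons does not depend on which item is used. Once `a` varies across items or `c` is nonzero, that item-independent comparison no longer holds in general.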

(45) This claim to objective measurement cannot be made by either (1-PL, 2-PL, 3-PL) IRT or CTT models. Bond and Fox (2007, p. 272) argue that “the axioms of conjoint measurement provide the only satisfactory prescription for scientific measurement and that, in terms of widespread application to the human sciences, Rasch measurement is the only game in town”; and Boone et al. (2014) similarly point out that Rasch measurement is the only approach currently available to social scientists that “aligns” with the scientific approach to measurement and data collection. Insofar as the data are compared to an invariant model, the Rasch method is a useful quality control check for the development of measuring instruments, such as vocabulary tests. The Rasch model mathematically defines item response requirements so that they become linear measures on an interval scale instead of just numbers or ordinal scores. That is, the separate parameters of item difficulty and person ability are simultaneously located and calibrated on a continuous scale of a latent trait, e.g., vocabulary knowledge. As explained by the founder of the Rasch method, Georg Rasch (1960):

A person having a greater ability than another should have the greater probability of solving any item of the type in question, and similarly, one item being more difficult than another means that for any person the probability of solving the second item correctly is the greater one. (p. 117)

The Rasch calibration scale is based on the Guttman scale (Figure 2.2), which consists of a set of graded patterns of ideal responses based on the conjoint measurement of person ability and item difficulty. This idealized structure of the Rasch model makes it an objective or “invariant” metric. However, the difference with the Rasch model is that