Variable difficulty values within testlets: micro-linguistic relativism and guessing as additional threats to construct validity



5.36 Variable difficulty values within testlets: micro-linguistic relativism and guessing as additional threats to construct validity

If vocabulary difficulty is seen as a function of how common or uncommon a word is relative to other words, we may call this macro (linguistic) relativism. However, in a diagnostic vocabulary test like the VLT, difficulty may also depend on the local context at two micro-levels: the micro-item relativism of items within each VLT (frequency) level, and the micro-testlet relativism within the testlet for both answer options and question stems. At the VLT level, where each level is a functionally independent subtest, the operationalization of word-frequency-as-difficulty assumes that these items will be of approximately the same difficulty, or at least within a range of difficulty that does not substantially overlap with the next level. These are the implicit assumptions made by Schmitt et al. (2001), Beglar (2010), McLean et al. (2015) and Webb et al. (2017) in their VLT validation studies, which aim to show that there is an implicational scale in these frequency-leveled vocabulary tests, i.e., that the items at higher levels are also higher in difficulty.

The same micro considerations should also apply at the testlet level, with all six answer options in theory coming from the same level and falling within an acceptable range of difficulty. If the items/testlets in one level overlap in difficulty with those of another level, or if the items within a testlet overlap with items of a similar difficulty in another level, then these items/testlets cannot be considered to be performing productive measurement. And if substantial overlap exists, the very purpose of the test as a diagnostic measuring tool that can profile the word knowledge of a given level must be called into question. Although it is unreasonable and unrealistic to expect a one-to-one correspondence between any NS English corpus and any given NNS L2 English lexicon (especially beyond the 2000 or 3000 level), the greater the divergences at the micro-item and micro-testlet levels, the greater the threat to the content validity, substantive validity and structural validity of the whole test (see Section 1.44).
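The overlap condition described above can be made concrete with a small check. The following is a minimal sketch, not the procedure used in this study: the item difficulty values (in logits) are invented for illustration, and `difficulty_range` and `overlap_width` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def difficulty_range(logits: List[float]) -> Interval:
    """Return the (easiest, hardest) item difficulty of one level."""
    return (min(logits), max(logits))


def overlap_width(a: Interval, b: Interval) -> float:
    """Width of the overlap between two closed intervals; 0.0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


# Hypothetical item difficulties (logits) per frequency level; not real VLT data.
levels: Dict[str, List[float]] = {
    "2000": [-2.1, -1.6, -1.2, -0.9],
    "3000": [-1.0, -0.4, 0.1, 0.5],
    "5000": [0.8, 1.1, 1.7, 2.3],
}

# Compare each level with the next higher one.
names = list(levels)
overlaps = {
    (lower, higher): overlap_width(
        difficulty_range(levels[lower]), difficulty_range(levels[higher])
    )
    for lower, higher in zip(names, names[1:])
}

for (lower, higher), width in overlaps.items():
    status = "overlap" if width > 0 else "separated"
    print(f"{lower} vs {higher}: {status} ({width:.2f} logits)")
```

With these invented values, the 2000 and 3000 levels overlap slightly while the 3000 and 5000 levels are separated. Comparing full ranges is a deliberately strict criterion; a real analysis would more likely compare level means together with their standard errors.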


The major concern of this study, therefore, is whether the micro or contextual factors of large or inconsistent ranges of item difficulties that occur at the levels of the overall test, the VLT-level subtest and the individual testlet can also influence the responses of the test-taker. In terms of item chaining, is it possible that a testlet’s item or items of a substantially greater or smaller degree of difficulty can influence the person’s ability to “partially recognize” the other items or response choices? It must be remembered that the three items in the VLT testlets are actually definitions, and that the three correct answers derive from six answer options (three answer keys and three distractors) that are all taken to represent one frequency level. Considering just the six options, and for the moment ignoring definition appropriacy and the vocabulary difficulty of the question stems (which should be at least one level lower than the VLT-leveled item), it is not only the set of three keys competing with the set of three distractors in each testlet; each answer option also competes with the five other answer options for (partial vocabulary) recognition. This is clearly the case for learners who deliberately employ guessing strategies of answer elimination, but it is likely to be the default test-taking behavior of all VLT test-takers, since all answer options will be scrutinized in the answering process.

To minimize the local or micro-level relativism and the biases that guessing by elimination may introduce, all six of the answer options should ideally fall within a narrow range of difficulty, i.e., come from the same frequency level. If the answer options come from different frequency levels, then some items will probably be easier or more difficult than the others, and this will influence the overall appropriateness (or fit) of the testlet and its ability to measure productively. This applies to both distractor options and key options. For distractors, if the three distractor options are of a clearly higher and more difficult level than the answer keys, this may make it easier for test-takers to identify the (lower) level keys as answers; conversely, if the distractors are of an obviously lower level, they will probably be more easily recognized, leaving learners better able to (knowingly or unknowingly) employ a test-taking strategy of distractor elimination to discount them. The same logic applies to the set of three key options: if one or more items are too difficult, test-takers are disadvantaged, especially those whose ability is at or near that testlet’s alleged difficulty level; the converse is true if the answer options are too easy. For this reason, Culligan (2015) pointed out that for the VLT, vocabulary knowledge may not be the only factor influencing the probability of a correct response; the difficulty of the words in the 6x3 cluster could also be an influence. Difficulty variability was also implicated in the VLT cluster item dependence investigation by Lai et al. (under review), who proposed the VLT-sequence model, which assumes that the answer order of test-takers follows the sequence of item difficulty; they found that cluster items ranked second and third most difficult (i.e., answered second and third) were assigned increasingly inflated difficulty estimates in the Rasch analysis.
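A simple calculation illustrates how much distractor elimination can change the odds of guessing. This is a sketch under stated assumptions rather than a model from the cited studies: each option may be used at most once, only distractors are eliminated (and always correctly), and the test-taker guesses uniformly among the remaining options. The function name `p_all_keys_by_guessing` is hypothetical.

```python
from fractions import Fraction


def p_all_keys_by_guessing(n_options: int, n_stems: int, eliminated: int = 0) -> Fraction:
    """Chance of matching every stem correctly by blind guessing.

    Assumes each option is used at most once and that guesses are uniform
    over the options left after `eliminated` distractors are discarded.
    """
    remaining = n_options - eliminated
    if eliminated < 0 or remaining < n_stems:
        raise ValueError("cannot eliminate more options than there are distractors")
    p = Fraction(1)
    for used in range(n_stems):
        # One fewer option is available for each stem already answered.
        p *= Fraction(1, remaining - used)
    return p


# A VLT testlet: 6 options, 3 stems, 0 to 3 distractors eliminated.
for k in range(4):
    print(k, p_all_keys_by_guessing(6, 3, eliminated=k))
```

Under these assumptions, the chance of a perfect testlet score by guessing rises from 1/120 with no elimination to 1/6 when all three distractors are recognized and discarded, which is why distractors that are noticeably easier than the keys matter.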

Knowles and Condon (2000) observed a similar contextual effect of item sequence variability, i.e., when items were placed in different parts of a test or survey. Although their research involved considerably more semantically complex item responses that required much more subjective interpretation from respondents, they observed a noticeable “parametric drift” as a result of variable item placement within a test. The VLT could also be subject to a similar kind of parametric drift if testlet distractors, or keys, are either too easy or too difficult, as this may tempt test-takers to employ test-taking strategies to guess answers and so interfere with both ability and difficulty measures. Local item-level variability in difficulty, if extensive, could pose a threat to the test’s construct validity. This conjecture of micro-item difficulty relativism could be tested empirically by manipulating items of varying difficulty in testlets to see if substantial differences in item difficulties result in biased (inflated or deflated) item and person reliability scores.
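Before any such empirical manipulation, the direction of the expected bias can be explored with a toy calculation. The sketch below is not the design proposed here and not a model from the literature cited; it is an illustrative assumption that combines the Rasch probability of knowing a key with uniform guessing after elimination. It models only the elimination path: each of the other five options is assumed to be recognized independently with its own Rasch probability and then discarded. All names and difficulty values are hypothetical.

```python
import math
from itertools import product
from typing import Sequence


def rasch_p(theta: float, b: float) -> float:
    """Rasch probability that a person of ability theta knows an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def p_correct_with_elimination(theta: float, b_key: float, b_others: Sequence[float]) -> float:
    """P(correct on one stem) when an unknown key is guessed after elimination.

    Toy assumption: each other option is recognized independently with its
    Rasch probability, recognized options are discarded, and the guess is
    uniform over the target key plus the unrecognized options.
    """
    p_known = rasch_p(theta, b_key)
    p_recog = [rasch_p(theta, b) for b in b_others]
    expected_guess = 0.0
    # Enumerate every recognized/unrecognized pattern of the other options.
    for pattern in product((True, False), repeat=len(p_recog)):
        weight = 1.0
        for recognized, p in zip(pattern, p_recog):
            weight *= p if recognized else 1.0 - p
        unrecognized = pattern.count(False)
        expected_guess += weight / (1 + unrecognized)
    return p_known + (1.0 - p_known) * expected_guess


# Hypothetical testlet: person and all keys at 0 logits; only distractors vary.
theta, b_key = 0.0, 0.0
other_keys = [0.0, 0.0]
easy = p_correct_with_elimination(theta, b_key, other_keys + [-2.0] * 3)
hard = p_correct_with_elimination(theta, b_key, other_keys + [2.0] * 3)

print(f"Rasch only:       {rasch_p(theta, b_key):.3f}")
print(f"easy distractors: {easy:.3f}")
print(f"hard distractors: {hard:.3f}")
```

In this toy model the success probability always exceeds the pure Rasch value, and it is higher with easy distractors than with hard ones, so the same key would appear easier or harder depending only on its neighbors in the testlet. Whether real response data behave this way is exactly what the proposed manipulation would need to establish.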
