

Chapter 5. Discussion

5.1 Summary statistics

5.31 Comparison of seven problematic testlets

Figure 5.1 is a bubble path plot of the seven testlets showing their relative difficulty, precision and fit measures. The CA testlet stands out: its three items span 4.3 logits of underfit and 3.5 logits of difficulty, and CA61 is clearly the most underfitting item on the map, lying farthest to the right (4.3 ZSTD), as well as the most difficult, occupying the highest vertical position at about 3 logits. The AI testlet items are also notable for their lowest (least difficult) position and for their large bubbles, which indicate large measurement error. In addition, four testlets have pairs of items that overlap each other: CD70-71, BD41-42, AE13-14, and AI26-27.

Figure 5.1. Bubble pathway plot for 7 testlets of 21 items

The items of the seven testlets (21 items in total) were entered once more into a Q3 LID pairwise analysis to see which pairs had the most similar response patterns (Figure 5.2b). This time, only the AD10-11, BD40-41, and AE13-14 item pairs, which had the highest original residual correlations (0.41, 0.34, and 0.38, respectively), emerged, with weak correlations ranging from 0.18 to 0.36; the other previously identified pairs did not appear (original Q3 LID analyses, BH53-54: 0.27; CA61-62: 0.25; CD70-71: 0.2 [only for VLT5 level]).
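As a rough illustration of the pairwise step, a Q3 analysis correlates the Rasch residuals (observed minus expected scores) of every pair of items and lists the pairs whose correlation exceeds a chosen cut-off. The sketch below is not the software procedure used in this study; the function names, the toy residuals and the cut-off of 0.3 are assumptions made only for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def q3_pairs(residuals, threshold):
    """Return (item, item, Q3) for pairs whose |Q3| reaches the threshold.

    residuals maps an item label to that item's list of person residuals
    (observed score minus Rasch-expected score), in the same person order.
    """
    flagged = []
    for a, b in combinations(residuals, 2):
        r = pearson(residuals[a], residuals[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 2)))
    # Strongest dependence first
    return sorted(flagged, key=lambda pair: -abs(pair[2]))


# Hypothetical residuals for three items and four persons (not study data)
toy = {
    "X1": [0.5, -0.5, 0.5, -0.5],
    "X2": [0.4, -0.6, 0.6, -0.4],
    "X3": [0.5, 0.5, -0.5, -0.5],
}
flagged = q3_pairs(toy, threshold=0.3)
```

In this toy data only the X1-X2 pair is flagged, because the residuals of those two items rise and fall together across persons.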


A Wright map of the items in the seven testlets was also created (Figure 5.2a) to see how these items related to each other and to the other testlet items; the items were distributed into different vertical columns to show the item and testlet relationships more clearly. Two of the three item pairs from the Q3 LID analysis were grouped fairly closely together (AE13-14, BD40-41), while the items of the third pair, AD10-11, were farther apart on the difficulty scale, with AD10 being closer to AD12. Finally, although the BH53-54, CA61-62 and CD70-71 pairs did not emerge from the new Q3 LID analysis, each of these pairs is substantially more difficult than the third item in its testlet, lying more than 2 logits above it.

Figure 5.2. a. [...] problematic testlet items and all other items; b. Q3 analysis of only the 21 items in the seven problematic testlets (only 3 LID pairs located, for both positive and negative correlations)


5.32 PCAR and Q3 LID + item analysis

5.32a PCAR analyses

The PCAR revealed two testlet items at the bottom of cluster 3 (items 70, 71) with “substantive” contrast loadings (-0.42 and -0.44, respectively). These items, together with BI57, CG80 and BD40, have substantial residual loadings (outside of +/- 0.4) and suggest possible non-Rasch dimensions. Such potential red flags for local item dependence were in fact overlooked by Webb et al. (2017), even though they reported two testlet item pairs (139-140 and 43-44) with residual loadings outside +/- 0.4 (Tables 6-7, pp. 49-50). Linacre (Dec 18, 2018) advises that groups of testlet items appearing either at the top of cluster 1 or at the bottom of cluster 3 with loadings outside of +/- 0.4 are unusually dependent on each other and should be considered for deletion. Ten other items with residual loadings outside of +/- 0.3 are listed in Table 5.1 in an effort to look for similarities in the items/answers that may indicate another non-Rasch dimension.
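The two cut-offs used above (+/- 0.4 for substantive loadings and +/- 0.3 for items worth a closer look) amount to a simple screening rule. The sketch below shows one way to apply it; the function name and the entries X1 and X2 are hypothetical, and only the CD70 and CD71 loadings are taken from the text.

```python
def flag_loadings(loadings, review=0.3, substantive=0.4):
    """Label each item whose contrast loading lies outside the two cut-offs."""
    flags = {}
    for item, value in loadings.items():
        size = abs(value)
        if size > substantive:
            flags[item] = "outside +/- 0.4"
        elif size > review:
            flags[item] = "outside +/- 0.3"
    return flags


# CD70 and CD71 use the loadings reported above; X1 and X2 are made-up values
example = {"CD70": -0.42, "CD71": -0.44, "X1": 0.35, "X2": 0.10}
flags = flag_loadings(example)
```

Items inside +/- 0.3, such as the hypothetical X2, are simply left out of the result.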


Table 5.1

Items with residual loadings outside of +/- 0.3

| Positive loading: Item No. | Item | Answer | Negative loading: Item No. | Item | Answer |
|---|---|---|---|---|---|
| BI57* | Happening once a year | Annual | CD71* | Large group of soldiers | Legion |
| CG80* | Guess about the future | Predict | BD40* | Army officer | Lieutenant |
| CJ88 | Empty | Vacant | CD70* | Female horse | Mare |
| AB6 | Money paid regularly for doing a job | Salary | BH53 | Cut neatly | Trim |
| AC9 | Going to a far place | Journey | CA61 | Bucket | Pail |
| AG20 | Walk without purpose | Wander | CF77 | Plan or invent | Devise |
| AJ30 | Having no fear | Brave | | | |
| CH83 | Fall down suddenly | Collapse | | | |
| AD12 | Not having something | Lack | | | |

Note: * items outside +/- 0.4.

The list of items in Table 5.1 with residual loadings outside of +/- 0.3 reveals two general differences between items with positive loadings (off dimension) and those with negative loadings (negatively correlating with the Rasch dimension). First, five of the nine off-dimension items (left column) occur in VLT2, whereas the negatively correlated items (right column) include no VLT2 items and a majority of VLT5 items (four of the six). Second, the average length of the item definition is longer for the positive loading items (3.7 vs 2.3 words), and these definitions include more complicated grammatical structures, such as gerund phrases (“happening once a year”, “going to a far place”) and a complicated reduced relative clause (“money [that is] paid regularly for doing a job”) containing both an adverb and a gerund. However, the small number of items in each group and the merely “general” nature of these differences make it impossible to conclude with confidence what other dimensions, such as grammar knowledge, might be tapped besides the main Rasch dimension.
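The average definition lengths quoted for Table 5.1 can be checked by counting the words in each item definition. The short sketch below does this for the two groups of items; the variable and function names are chosen only for illustration, and "length" is assumed to mean the number of whitespace-separated words.

```python
# Item definitions from Table 5.1, grouped by the sign of the residual loading
positive = {
    "BI57": "Happening once a year",
    "CG80": "Guess about the future",
    "CJ88": "Empty",
    "AB6": "Money paid regularly for doing a job",
    "AC9": "Going to a far place",
    "AG20": "Walk without purpose",
    "AJ30": "Having no fear",
    "CH83": "Fall down suddenly",
    "AD12": "Not having something",
}
negative = {
    "CD71": "Large group of soldiers",
    "BD40": "Army officer",
    "CD70": "Female horse",
    "BH53": "Cut neatly",
    "CA61": "Bucket",
    "CF77": "Plan or invent",
}


def mean_words(definitions):
    """Mean number of whitespace-separated words per definition, to 1 decimal."""
    counts = [len(text.split()) for text in definitions.values()]
    return round(sum(counts) / len(counts), 1)


print(mean_words(positive), mean_words(negative))  # prints: 3.7 2.3
```

The positive group totals 33 words over nine definitions and the negative group 14 words over six, which gives the reported 3.7 and 2.3.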


5.32b Q3 LID and testlet analyses

Although the Q3 LID analysis for the three combined VLT levels³ did not find item pairs in testlet CD70-72, as the PCAR analysis did with items CD70-71, it did show that 5 of the 20 identified locally dependent item pairs occurred within testlets (10-11, 13-14, 40-41, 53-54, 61-62); these testlet pairs were weakly dependent, with residual correlations ranging from 0.25 to 0.41. This finding differs from Webb et al. (2017), who did not find any LID pairs in their two new versions of the VLT. And while there is no overlap of testlet item pairs between the PCAR and Q3 LID analyses, three individual items that the PCAR found to have contrast loadings < -0.3 also appeared as the first item in Q3 LID pairs (items 40, 53, 61); these items appear at the VLT3 and VLT5 levels (Table 5.2).

The inter-testlet comparison yielded high model reliability scores (except for the AG, AI, and BA testlets, with values of 0.59, 0.74, and 0.0) and fit statistics, and was thus unable to pinpoint intra-testlet problems. However, six testlets had outfit scores between 0.6 and 0.77: three from VLT2 (AC, AE, AI), two from VLT3 (BE, BG) and one from VLT5 (DJ). Of these six testlets, the only ones in common with the other analyses were AE13-15 (0.77), shared with the Q3 LID analysis above, and AI25-26 (0.6), shared with the PCAR analysis of VLT2.

³ However, the Q3 LID analysis for the intermediate ability group for the VLT5 level as a subtest did find CD70-71 to be locally dependent.


Table 5.2

Testlet item pairs implicated in Q3 LID, testlet, and PCAR analyses

| Item No. | Item | Answer | Item No. | Item | Answer |
|---|---|---|---|---|---|
| AD10+ | Gold and silver | Treasure | BD40+ | Army officer | Lieutenant |
| AD11+ | Pleasing quality | Charm | BD41 | A kind of stone | Marble |
| AE13* | Part of milk | Cream | BH53+ | Cut neatly | Trim |
| AE14* | A lot of money | Wealth | BH54 | Spin around quickly | Whirl |
| CA61+ | Bucket | Pail | CA62+ | Unusual interesting thing | Novelty |

Note: + Item identified in PCAR analysis; * testlet pairs identified by Q3 LID analysis.

The following sections present qualitative analyses of response patterns, word frequencies of answer options and item stems, and other testlet features of the problematic testlets from Table 4.17. These testlets contain items identified by at least 2 of the 5 analyses above (Q3 LID item pairs, PCAR and testlet analyses; the number in square brackets [ ] indicates the number out of 5 from Table 4.17): three testlets for VLT2 (AD, AE, AI), two for VLT3 (BD, BH) and two for VLT5 (CA, CD). It is assumed that these commonly identified testlet items are the most likely to exhibit significant local dependence. These testlet items, with their answers and distractors, are presented in full in Table 5.3.


Table 5.3

Seven problematic testlets with answers and distractors

Testlet
