VSMB vs VSM and 2PL-IRT - Construction and Testing of an Item Response Model That Allows Blank Responses in English Vocabulary Tests

First we show how to combine the blank and incorrect responses into a new “incorrect” response category. For example, the binary response vector (1, 1, 0) is composed of the two trinary response vectors (1, 1, 0) and (1, 1, 9). Table 15 shows all the conversions. For convenience, we use RVSMB to denote this reduced VSMB model. With this reduction, the VLT data at hand consist of binary responses only, and we can therefore compare the fit of the RVSMB model with that of the 2PL-IRT and VSM models using the AIC and BIC indices. As shown in Table 16, AIC and BIC are both smallest for the VSM model, which means that the VSM model is the most parsimonious model among the three. With a smaller AIC value, the RVSMB model seems to perform better than the 2PL-IRT model. However, because of its 30 additional γ parameters, the RVSMB model has the largest BIC value and, by that criterion, provides the worst fit among the three.

Table 15: Converting trinary responses of 1, 0, and 9 into binary responses 1 and 0.

| Binary (0 for incorrect) | Trinary (9 for blank, 0 for incorrect) |
|---|---|
| (1, 1, 1) | (1, 1, 1) |
| (1, 1, 0) | (1, 1, 0) + (1, 1, 9) |
| (1, 0, 1) | (1, 0, 1) + (1, 9, 1) |
| (1, 0, 0) | (1, 0, 0) + (1, 0, 9) + (1, 9, 0) + (1, 9, 9) |
| (0, 1, 1) | (0, 1, 1) + (9, 1, 1) |
| (0, 1, 0) | (0, 1, 0) + (0, 1, 9) + (9, 1, 0) + (9, 1, 9) |
| (0, 0, 1) | (0, 0, 1) + (0, 9, 1) + (9, 0, 1) + (9, 9, 1) |
| (0, 0, 0) | (0, 0, 0) + (0, 0, 9) + (0, 9, 0) + (0, 9, 9) + (9, 0, 0) + (9, 0, 9) + (9, 9, 0) + (9, 9, 9) |
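The conversion in Table 15 can be sketched in a few lines of Python. This is only an illustration of the recoding rule (9 and 0 both map to 0); the function names `to_binary` and `trinary_preimages` are hypothetical and are not taken from the thesis.

```python
from itertools import product

BLANK = 9  # code used for a blank response in the trinary data

def to_binary(trinary):
    """Recode a trinary cluster response: 1 stays 1, both 0 and 9 become 0."""
    return tuple(1 if r == 1 else 0 for r in trinary)

def trinary_preimages(binary):
    """List every trinary vector that collapses to the given binary vector."""
    # A 1 can only come from a 1; a 0 can come from an incorrect (0) or a blank (9).
    options = [(1,) if r == 1 else (0, BLANK) for r in binary]
    return list(product(*options))

if __name__ == "__main__":
    for binary in product((1, 0), repeat=3):
        print(binary, "=", " + ".join(str(t) for t in trinary_preimages(binary)))
```

A binary vector with z zeros has 2^z trinary preimages, which is why (0, 0, 0) collects eight trinary vectors in Table 15.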

Table 16: Loglikelihood, AIC and BIC for 2PL-IRT, VSM, and RVSMB.

| Model | 2PL-IRT | VSM | RVSMB |
|---|---|---|---|
| Loglikelihood | -8824.548 | -8704.047 | -8749.390 |
| AIC | 17769.1 | 17528.09 | 17678.78 |
| BIC | 18044.27 | 17803.26 | 18091.54 |
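As a rough check on Table 16, the sketch below recomputes AIC and BIC from the reported loglikelihoods using the standard definitions AIC = -2 logL + 2k and BIC = -2 logL + k ln(n). The sample size n = 725 is the number of examinees mentioned in the text. The parameter counts k (60, 60, and 90) are not stated in this section; they are assumptions backed out of the reported AIC values, and they agree with the RVSMB model having 30 more parameters than the other two.

```python
import math

N_EXAMINEES = 725  # number of examinees reported in the text

# Loglikelihoods are from Table 16. The parameter counts are ASSUMPTIONS
# inferred from the reported AIC values; they are not stated in the text.
MODELS = {
    "2PL-IRT": {"loglik": -8824.548, "n_params": 60},
    "VSM": {"loglik": -8704.047, "n_params": 60},
    "RVSMB": {"loglik": -8749.390, "n_params": 90},
}

def aic(loglik, n_params):
    """Akaike information criterion."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion."""
    return -2.0 * loglik + n_params * math.log(n_obs)

if __name__ == "__main__":
    for name, m in MODELS.items():
        a = aic(m["loglik"], m["n_params"])
        b = bic(m["loglik"], m["n_params"], N_EXAMINEES)
        print(f"{name}: AIC = {a:.2f}, BIC = {b:.2f}")
```

Because ln(725) is about 6.59, each parameter costs more than three times as much under BIC as under AIC, which is why the RVSMB model ranks second by AIC but last by BIC.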

To assess the fit for the moments, Table 17 compares the observed and expected moments under the RVSMB, VSM, and 2PL-IRT models. To better assess the difference between the observed and expected moments, we also report their standardized differences.

Almost all the standardized differences are smaller than 2 in absolute value, with the exception of the two moments (·, 1, 1) and (1, 1, 1) in Cluster 8 under the RVSMB and the 2PL-IRT models. Specifically, the standardized differences for these two moments of Cluster 8 are respectively 2.610 and 2.375 for the RVSMB model, 2.994 and 2.816 for the 2PL-IRT model, and 1.673 and 1.393 for the VSM model. In short, the VSM model provides a good fit for all the moments, the RVSMB model fits better than the 2PL-IRT model for Cluster 8, and both fit the moments of all the other clusters. In other words, although the VSMB model is rejected for the 3000-level VLT data based on the M₃ statistic, the RVSMB model provides a reasonable fit for most of the moments.

Figure 5 shows the scatterplot of the ability estimates, θ, from the VSM and the VSMB models. The ability estimates from the VSMB model tend to be greater than those obtained from the VSM model: 513 out of 725 examinees are estimated to have higher ability under the VSMB model. It appears that leaving a blank leads to a higher ability estimate than giving an incorrect answer. However, these ability differences are relatively small. Larger ability differences actually occur in a few cases where the ability estimate is higher under the VSM model, such as examinees No. 183 and No. 567, who are listed with others in Table 18. Table 18 shows the response vectors of the examinees whose two ability estimates differ by more than 0.2. We find that these examinees answered more than 20 items correctly but left at least one entire easy cluster blank. By easy clusters we mean clusters in which every item has an observed proportion correct higher than 0.75. In the 3000-level VLT data, Clusters 1, 3, 6, and 9 are such easy clusters. To conclude, for examinees who have a high proportion correct on the items they answered but leave an entire easy cluster blank, the VSMB model may underestimate their abilities.
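The response vectors in Table 18 can be summarized with a short script. The sketch below transcribes the six response vectors and the answer key from Table 18 and reports, for each examinee, which clusters were left entirely blank and how many of the answered items were correct. The names `summarize`, `RESPONSES`, and `KEY` are hypothetical; only the data come from the table.

```python
BLANK = 9
CLUSTER_SIZE = 3  # each VLT cluster in Table 18 has three items

# Answer key and response vectors transcribed from Table 18 (items 1 to 30).
KEY = [4, 5, 1, 3, 4, 6, 1, 2, 6, 3, 4, 6, 2, 3, 4,
       6, 5, 4, 1, 4, 5, 1, 5, 6, 6, 3, 1, 5, 3, 1]
RESPONSES = {
    183: [9, 9, 9, 9, 9, 9, 1, 2, 6, 3, 4, 6, 2, 3, 4,
          6, 5, 4, 1, 4, 5, 1, 5, 6, 6, 3, 1, 5, 3, 1],
    567: [4, 5, 1, 3, 4, 6, 1, 2, 6, 3, 4, 6, 2, 3, 4,
          9, 9, 9, 1, 4, 5, 9, 9, 9, 6, 3, 1, 5, 3, 1],
    657: [9, 9, 9, 9, 9, 9, 1, 2, 6, 3, 4, 6, 1, 3, 4,
          6, 5, 4, 1, 4, 5, 1, 5, 6, 6, 3, 1, 5, 3, 1],
    661: [9, 9, 9, 9, 9, 9, 1, 2, 6, 3, 4, 6, 2, 3, 4,
          6, 5, 4, 1, 4, 6, 1, 5, 6, 6, 3, 1, 5, 3, 1],
    675: [4, 5, 1, 3, 4, 6, 1, 2, 6, 3, 4, 6, 2, 3, 4,
          9, 9, 9, 1, 4, 6, 1, 5, 6, 6, 3, 1, 5, 3, 1],
    677: [9, 6, 9, 9, 9, 9, 1, 2, 6, 3, 4, 6, 9, 3, 4,
          6, 5, 4, 1, 4, 6, 1, 5, 6, 6, 3, 1, 5, 3, 9],
}

def summarize(responses, key):
    """Return (fully blank clusters, number correct, number answered)."""
    blank_clusters = []
    for c in range(len(key) // CLUSTER_SIZE):
        items = responses[c * CLUSTER_SIZE:(c + 1) * CLUSTER_SIZE]
        if all(r == BLANK for r in items):
            blank_clusters.append(c + 1)  # clusters are numbered from 1
    answered = sum(r != BLANK for r in responses)
    correct = sum(r == k for r, k in zip(responses, key))
    return blank_clusters, correct, answered

if __name__ == "__main__":
    for examinee, resp in RESPONSES.items():
        blank, correct, answered = summarize(resp, KEY)
        print(f"No. {examinee}: blank clusters {blank}, "
              f"{correct}/{answered} answered items correct")
```

The proportion correct is computed over answered items only, matching the way the text describes these examinees.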

5 Discussion

We first discuss the issue of choosing a proper number of quadrature points to approximate the numerical integration accurately. The choice of quadrature points is important in conducting simulations. We originally used 10 quadrature points. There did not seem to be any problem when the sample sizes were 500 and 1000, but when the sample size went up to 2000, local extrema occurred in optimizing the likelihood function. We later solved the problem by increasing the number of quadrature points to 20. Thus, we recommend using at least 20 quadrature points for the Gauss-Hermite quadrature method when fitting models to VLT data, especially with large sample sizes.
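To make the role of the number of quadrature points concrete, here is a minimal sketch of Gauss-Hermite quadrature for a marginal likelihood, using `numpy.polynomial.hermite.hermgauss`. It is not the estimation code used in the thesis: it assumes a standard normal ability distribution and, for simplicity, a 2PL-style logistic item response function rather than the VSM or VSMB cluster probabilities. The item parameters `a` and `b` and the response pattern are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def normal_expectation(f, n_points):
    """Approximate E[f(theta)], theta ~ N(0, 1), by Gauss-Hermite quadrature."""
    x, w = hermgauss(n_points)   # nodes and weights for the weight exp(-x^2)
    theta = np.sqrt(2.0) * x     # change of variable to the N(0, 1) scale
    return float(np.sum(w * f(theta)) / np.sqrt(np.pi))

def pattern_likelihood(theta, responses, a, b):
    """Likelihood of one binary pattern at each theta (2PL-style, assumed)."""
    theta = np.asarray(theta, dtype=float)[:, None]
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # shape: (nodes, items)
    return np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)

if __name__ == "__main__":
    a = np.full(30, 1.5)                # hypothetical discriminations
    b = np.linspace(-2.0, 2.0, 30)      # hypothetical difficulties
    responses = (b < 0.5).astype(int)   # hypothetical response pattern
    for q in (10, 20):
        marginal = normal_expectation(
            lambda t: pattern_likelihood(t, responses, a, b), q)
        print(f"{q} quadrature points: marginal likelihood = {marginal:.6e}")
```

With n nodes the rule is exact only for polynomial integrands up to degree 2n - 1. A likelihood that is sharply peaked in θ is approximated less well by few nodes, and the per-examinee errors accumulate in the total loglikelihood as the sample grows. This is one plausible reading of why 10 points sufficed at smaller sample sizes but not at 2000.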

Through the analysis of the real VLT data, we find that our assumption about leaving a blank seems too naive. In particular, Table 18 shows that some examinees whose proportion correct exceeded 0.7 left all the items within an easy cluster blank. In that case, a lack of motivation might be involved, because these examinees did reasonably well in other clusters. The VSMB model does not account for this type of blank response at all, and using a single blank parameter per item to capture the underlying mechanism of leaving blanks is too simple to be realistic. It is likely that the conditional probability of leaving a blank differs between people of higher and lower ability, even when neither knows the correct answer to the item. Furthermore, it is also possible that examinees did not deliberately “leave” blanks but had other reasons, such as fatigue, or skipping an item on first reading and not having enough time to come back to it. It is, however, difficult to tell whether an examinee leaves a blank because he or she does not know the answer or because he or she is unwilling to respond due to a lack of motivation. One possible solution is to disregard the data from clusters with all-blank responses and use only the remaining responses in parameter estimation, which would reduce the effect of other factors such as motivation. However, the current VSMB model does allow for the all-blank (9, 9, 9) response, so doing so would in turn bias the parameter estimates if the VSMB model is true. A new model that better accounts for all-blank responses is very much needed.

Table 17: Standardized differences in the moments of correct for the 3000-level VLT data fitted with RVSMB, VSM, and 2PL-IRT.

| Moment | Cluster 1 RVSMB | Cluster 1 VSM | Cluster 1 2PL-IRT | Cluster 2 RVSMB | Cluster 2 VSM | Cluster 2 2PL-IRT | Cluster 3 RVSMB | Cluster 3 VSM | Cluster 3 2PL-IRT |
|---|---|---|---|---|---|---|---|---|---|
| (1, ·, ·) | -0.336 | -0.333 | -0.449 | -0.249 | -0.247 | -0.328 | -0.275 | -0.275 | -0.359 |
| (·, 1, ·) | -0.540 | -0.721 | -0.448 | -0.130 | -0.163 | -0.334 | -0.450 | -0.407 | -0.409 |
| (·, ·, 1) | -0.307 | -0.426 | -0.500 | -0.396 | -0.749 | -0.495 | 0.140 | 0.224 | -0.241 |
| (1, 1, ·) | 0.257 | -0.048 | 0.395 | -0.375 | -0.667 | -0.383 | -0.072 | -0.119 | 0.056 |
| (1, ·, 1) | -0.155 | -0.420 | -0.278 | -0.414 | -0.954 | -0.381 | -0.738 | -0.769 | -0.856 |
| (·, 1, 1) | -0.387 | -0.734 | -0.273 | 1.605 | 0.814 | 1.872 | -0.628 | -0.652 | -0.645 |
| (1, 1, 1) | 0.151 | -0.379 | 0.363 | 1.381 | 0.334 | 1.843 | -0.610 | -0.746 | -0.395 |

| Moment | Cluster 4 RVSMB | Cluster 4 VSM | Cluster 4 2PL-IRT | Cluster 5 RVSMB | Cluster 5 VSM | Cluster 5 2PL-IRT | Cluster 6 RVSMB | Cluster 6 VSM | Cluster 6 2PL-IRT |
|---|---|---|---|---|---|---|---|---|---|
| (1, ·, ·) | -0.326 | -0.331 | -0.433 | -0.281 | -0.275 | -0.360 | -0.302 | -0.302 | -0.399 |
| (·, 1, ·) | -0.354 | -0.364 | -0.518 | -0.243 | -0.230 | -0.382 | -0.195 | -0.223 | -0.380 |
| (·, ·, 1) | -0.725 | -0.778 | -0.545 | -0.351 | -0.352 | -0.537 | -0.128 | 0.005 | -0.443 |
| (1, 1, ·) | -0.073 | -0.250 | -0.169 | -0.408 | -0.494 | -0.480 | 0.006 | -0.213 | -0.033 |
| (1, ·, 1) | -0.123 | -0.351 | 0.119 | -0.176 | -0.238 | -0.318 | 0.175 | 0.072 | 0.022 |
| (·, 1, 1) | 0.197 | -0.002 | 0.472 | -0.862 | -0.975 | -0.769 | -0.206 | -0.414 | -0.287 |
| (1, 1, 1) | 0.569 | 0.151 | 0.943 | -0.616 | -0.797 | -0.472 | -0.115 | -0.588 | 0.049 |

| Moment | Cluster 7 RVSMB | Cluster 7 VSM | Cluster 7 2PL-IRT | Cluster 8 RVSMB | Cluster 8 VSM | Cluster 8 2PL-IRT | Cluster 9 RVSMB | Cluster 9 VSM | Cluster 9 2PL-IRT |
|---|---|---|---|---|---|---|---|---|---|
| (1, ·, ·) | -0.300 | -0.314 | -0.413 | -0.268 | -0.265 | -0.346 | -0.278 | -0.277 | -0.360 |
| (·, 1, ·) | -0.389 | -0.431 | -0.528 | -0.347 | -0.364 | -0.548 | -0.332 | -0.350 | -0.393 |
| (·, ·, 1) | -0.436 | -0.613 | -0.464 | -0.589 | -1.077 | -0.554 | -0.007 | 0.186 | -0.419 |
| (1, 1, ·) | 0.116 | -0.267 | 0.271 | -0.818 | -0.913 | -0.939 | 0.040 | -0.107 | 0.065 |
| (1, ·, 1) | 0.411 | -0.040 | 0.603 | -0.593 | -1.123 | -0.487 | 0.123 | 0.179 | -0.169 |
| (·, 1, 1) | 0.361 | -0.052 | 0.663 | 2.610 | 1.673 | 2.994 | -0.541 | -0.675 | -0.537 |
| (1, 1, 1) | 1.191 | 0.406 | 1.791 | 2.375 | 1.393 | 2.816 | -0.326 | -0.614 | -0.185 |

| Moment | Cluster 10 |
|---|---|

Table 18: Response vectors for examinees with a difference in ability estimates greater than 0.2.

| Cluster | Item | No. 183 | No. 567 | No. 657 | No. 661 | No. 675 | No. 677 | Answer key |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 9 | 4 | 9 | 9 | 4 | 9 | 4 |
| | 2 | 9 | 5 | 9 | 9 | 5 | 6 | 5 |
| | 3 | 9 | 1 | 9 | 9 | 1 | 9 | 1 |
| 2 | 4 | 9 | 3 | 9 | 9 | 3 | 9 | 3 |
| | 5 | 9 | 4 | 9 | 9 | 4 | 9 | 4 |
| | 6 | 9 | 6 | 9 | 9 | 6 | 9 | 6 |
| 3 | 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 8 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | 9 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 4 | 10 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| | 11 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| | 12 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 5 | 13 | 2 | 2 | 1 | 2 | 2 | 9 | 2 |
| | 14 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| | 15 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 6 | 16 | 6 | 9 | 6 | 6 | 9 | 6 | 6 |
| | 17 | 5 | 9 | 5 | 5 | 9 | 5 | 5 |
| | 18 | 4 | 9 | 4 | 4 | 9 | 4 | 4 |
| 7 | 19 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 20 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| | 21 | 5 | 5 | 5 | 6 | 6 | 6 | 5 |
| 8 | 22 | 1 | 9 | 1 | 1 | 1 | 1 | 1 |
| | 23 | 5 | 9 | 5 | 5 | 5 | 5 | 5 |
| | 24 | 6 | 9 | 6 | 6 | 6 | 6 | 6 |
| 9 | 25 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| | 26 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| | 27 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 10 | 28 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| | 29 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| | 30 | 1 | 1 | 1 | 1 | 1 | 9 | 1 |

Note: The ability differences, listed by examinee number from low to high (No. 183 to No. 677), are 0.477, 0.210, 0.368, 0.383, 0.267, and 0.236.

Figure 5: Scatterplot of the ability estimates from the VSM versus the VSMB models.

6 Concluding Remarks

The VSM model offers a way to handle dependence within item clusters in the VLT data, but it treats a blank the same as an incorrect response. In this thesis, we consider the VSMB model, which allows for blank responses. Unlike the VSM model, where a blank is regarded as an incorrect response and is thought to reduce the remaining set of available items for later items within the cluster, the VSMB model assumes that a blank does not change the remaining set of items. Instead, we capture blanks by adding a blank parameter, which brings the model closer to reality.

We formulated the VSMB model, showed how its parameter estimates can be obtained effectively, and established the validity of using the M₃ statistic to assess the model's goodness of fit. Through simulations, however, we find that with sample sizes smaller than 2000 the effect of ignoring the blanks seems very minor, and the VSM model still yields reasonable parameter estimates. In the analysis of the 3000-level VLT data, the VSMB model does not pass the M₃ test for all clusters. Even when we reduce the trinary responses to binary responses to allow model comparison, the VSM model is found to be the most parsimonious, partly because the VSMB model requires 30 additional blank parameters. The RVSMB, VSM, and 2PL-IRT models all provide a reasonable fit to the moments of most of the clusters. In conclusion, the current account of leaving blanks in the VSMB model does not seem to capture well enough the underlying mechanism that produces a blank response, especially for examinees who answered most of the items correctly but left blanks on a cluster consisting entirely of easy items. In practice, the VSM model, which simply regards blanks as incorrect responses, can be applied to analyze the VLT data.

