Experiment three: Model Fit in real data - Markov Chain Monte Carlo Estimation for Multilevel Higher-Order Item Response Theory

CHAPTER 5 Results

5.3 Experiment three: Model Fit in real data

This section presents an application of the MHO-IRT model to the TASA 2007 fourth-grade mathematics assessment data. The purpose is to identify the model that is most suitable for this large-scale assessment. A total of 8,205 examinees participated in the TASA 2007 fourth-grade mathematics assessment. The HO-IRT and U-IRT based models (MHO-IRT, HO-IRT, MU-IRT, and U-IRT) are used to estimate the overall ability, and the results are shown in Table 5-17. According to Table 5-17, the multilevel models (MHO-IRT and MU-IRT) give higher overall ability estimates than the HO-IRT and U-IRT models.

Table 5-17

The overall ability estimates based on the real data

MHO-IRT HO-IRT MU-IRT U-IRT

Overall ability 0.1713 0.0411 0.1490 0.0645

The HO-IRT and M-IRT based models (MHO-IRT, HO-IRT, MM-IRT, and M-IRT) are used to estimate the domain abilities, and the results are shown in Table 5-18. Under the MHO-IRT model, the ability estimates increase across the domains in the order number, algebra, geometry, and statistics, so the lowest and highest abilities are in the number and statistics domains, respectively. The HO-IRT, MM-IRT, and M-IRT models yield the same ordering of the domain ability estimates. Moreover, the MHO-IRT and MM-IRT models provide similar domain ability estimates, as do the HO-IRT and M-IRT models.

Table 5-18

The domain ability estimates based on the real data

Domain MHO-IRT HO-IRT MM-IRT M-IRT

Number -0.136 0.001 -0.0212 0.0153

Algebra 0.073 0.187 0.0854 0.1035

Geometry 0.298 0.366 0.2183 0.3218

Statistics 0.312 0.381 0.3850 0.4084

There is usually uncertainty about the appropriate error structure and the predictor variables to include in a model. Adding more parameters may improve the fit; however, this may come at the expense of identifiability and generalizability. Model selection criteria assess whether improvements in fit measures such as likelihoods, deviances, or error sums of squares justify the inclusion of extra parameters in a model.
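As a rough illustration of this trade-off, the sketch below computes the three indices compared in this section from a deviance and a parameter count, using the standard textbook forms: AIC = D + 2k, BIC = D + k ln(n), and DIC = Dbar + pD with pD = Dbar - D(theta_bar). The function names and the toy numbers are hypothetical, and the exact definitions used in this dissertation (for example, which deviance enters AIC and BIC under MCMC estimation, or what counts as n) may differ.

```python
import math


def aic(deviance, n_params):
    """Akaike information criterion: deviance (-2 log-likelihood) plus 2k."""
    return deviance + 2 * n_params


def bic(deviance, n_params, n_obs):
    """Bayesian information criterion: deviance plus k * ln(n)."""
    return deviance + n_params * math.log(n_obs)


def dic(mean_deviance, deviance_at_posterior_mean):
    """Deviance information criterion: Dbar + pD, with pD = Dbar - D(theta_bar)."""
    p_d = mean_deviance - deviance_at_posterior_mean  # effective number of parameters
    return mean_deviance + p_d


# Hypothetical toy values, NOT taken from the dissertation.
base = {"deviance": 1000.0, "n_params": 10}
rich = {"deviance": 996.0, "n_params": 13}  # fits slightly better, 3 more parameters
n_obs = 8205  # assumption: the number of examinees is used as n in BIC

for name, model in (("base", base), ("rich", rich)):
    print(name, aic(**model), round(bic(n_obs=n_obs, **model), 2))
```

In this toy case the richer model lowers the deviance by 4 but adds 3 parameters, so the AIC penalty of 6 outweighs the gain and the simpler model is preferred; the BIC penalty is larger still because ln(8205) is greater than 2.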

Model selection using the AIC, BIC, and DIC indices is performed to compare the six models and identify the one that best describes the real data. Table 5-19 presents the values of the three indices for each model, which summarize how the models compare.

Table 5-19

Model selection indices in each model

Model AIC BIC DIC

MHO-IRT 42055.22 (1) 52611.73 (1) 115689.44 (1)

HO-IRT 42058.20 (2) 52657.25 (2) 115708.80 (2)

MM-IRT 42080.12 (3) 52659.08 (3) 115745.44 (3)

M-IRT 42142.93 (4) 52787.09 (6) 115785.58 (5)

MU-IRT 42145.91 (5) 52670.48 (4) 115780.47 (4)

U-IRT 42151.84 (6) 52752.59 (5) 115837.22 (6)

In Table 5-19, the values of the AIC, BIC, and DIC all indicate that the MHO-IRT is the best model because it has the smallest value on each index. The rank of each model on each index is shown in parentheses in Table 5-19. According to the AIC, the best model for the real data is the MHO-IRT, followed by the HO-IRT, MM-IRT, M-IRT, MU-IRT, and U-IRT models. The BIC differs slightly from the AIC in that it identifies the M-IRT, rather than the U-IRT, as the worst model. The DIC gives results similar to the AIC, differing only in the order of the M-IRT and MU-IRT models. All three indices identify the MHO-IRT as the most suitable model for the data, followed by the HO-IRT and MM-IRT models.
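The ranks given in parentheses in Table 5-19 can be reproduced by sorting the models on each index, with smaller values ranked first. The following is a minimal sketch that uses the index values from Table 5-19; the variable and function names are illustrative only.

```python
# Index values copied from Table 5-19: (AIC, BIC, DIC) for each model.
table = {
    "MHO-IRT": (42055.22, 52611.73, 115689.44),
    "HO-IRT": (42058.20, 52657.25, 115708.80),
    "MM-IRT": (42080.12, 52659.08, 115745.44),
    "M-IRT": (42142.93, 52787.09, 115785.58),
    "MU-IRT": (42145.91, 52670.48, 115780.47),
    "U-IRT": (42151.84, 52752.59, 115837.22),
}
indices = ("AIC", "BIC", "DIC")


def rank_models(values, column):
    """Return {model: rank} for one index column; rank 1 is the smallest value."""
    ordered = sorted(values, key=lambda model: values[model][column])
    return {model: position + 1 for position, model in enumerate(ordered)}


ranks = {name: rank_models(table, col) for col, name in enumerate(indices)}

# Print each model with its rank on AIC, BIC, and DIC.
for model in table:
    print(model, [ranks[name][model] for name in indices])
```

The three indices agree on the first three positions (MHO-IRT, HO-IRT, MM-IRT) and differ only in how they order the M-IRT, MU-IRT, and U-IRT models.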

CHAPTER 6 Conclusion and Discussion

This dissertation addresses the issue of ability estimation. Tests commonly comprise different domains, each measuring specific content. Although multidimensional IRT is extensively used, such tests are often treated as unidimensional. However, an overall ability estimate is inappropriate if unidimensionality is violated. In the real-data experiment, six different estimation models are applied to identify the model that is best suited to the real test. In this dissertation, an HO-IRT model incorporating background variables, the MHO-IRT model, is developed. The MHO-IRT model provides a coherent and general framework that subsumes the overall as well as the domain-specific abilities and incorporates the background variables. Using MCMC methods, the general and the latent structural parameters can be estimated simultaneously. The feasibility and effectiveness of the MHO-IRT approach are examined by a simulation experiment. The usefulness of the proposed model is also verified through its application to the TASA 2007 fourth-grade mathematics assessment data. Moreover, two studies distinguish the influence of the proposed model on the estimation of individual abilities and of population statistics.

Compared with currently available unidimensional or multidimensional methods, the proposed MHO-IRT model has two special features: (1) it explicitly models the general and multiple domain abilities in the same model and incorporates the correlation structure of the abilities and the background variables in the estimation process, so that the two types of abilities (i.e., the overall and domain abilities) are estimated in an integrative and efficient manner; and (2) it capitalizes on all the information contained in the students' test performance and borrows strength from the background information.

In the first experiment, the background variables are continuous and the simulated data are designed to investigate the individual ability estimates. The simulation study shows that the MHO-IRT and MU-IRT overall ability estimates are very similar in terms of RMSE. It also shows that the models that include the background variables are relatively efficient: the MHO-IRT, MM-IRT, and MU-IRT approaches provide more accurate estimates than the U-IRT, M-IRT, and HO-IRT approaches. In addition, better estimates are obtained with longer tests, larger sample sizes, and higher correlations between the overall ability and the background variables.

In the second experiment, the background variables are dichotomous and the simulated data are designed to investigate the population estimates. The population means are estimated well by all the models used. Compared with the U-IRT and HO-IRT models, the models that include the background variables are relatively efficient. The RMSE of the group mean decreases as the test length and sample size increase, and increasing the test length improves the estimates more than increasing the sample size. The differences between the models that include the background variables and those that do not are larger when estimating the population standard deviations. The MHO-IRT and MU-IRT models provide similar estimates and outperform the HO-IRT and U-IRT models.

Model selection using the AIC, BIC, and DIC indices is performed to compare these models and identify the one that best describes the real data. All three indices indicate that the MHO-IRT is the best model because it has the smallest values.

In applying the methods proposed in this study, some practical concerns need to be addressed, one of which is the choice between the MHO-IRT and HO-IRT models. Although the MHO-IRT model provides better estimates, in some cases only slightly better, it comes at the expense of additional parameters. Does the improvement warrant the additional cost? The answer depends on the information that is available and desired. The HO-IRT model should be used if no background variables that correlate highly with the abilities are available. However, if the background variables can account for a large proportion of the variability in the abilities, then the MHO-IRT model should be used. In summary, the MHO-IRT approach has important implications for parameter estimation. The MHO-IRT model fits the design of large-scale assessments and, more importantly, provides efficient estimation of the parameters.

