Conclusions and Recommendations - Hierarchical Item Response Theory Models and Their Equating Estimation Methods

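This study proposes a nonparametric parameter estimation method and a concurrent calibration method for equating under HIRT, and extends the parametric estimation method of de la Torre and Hong (2010) to parameter estimation for multi-factor HIRT models. Through simulations under varying conditions (number of examinees, number of items, ability distribution, and test structure) and an empirical analysis of the Taiwan Assessment of Student Achievement (TASA) database, it examines how accurately the new estimation methods recover scale scores, regression parameters, and item parameters.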

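The conclusions and recommendations drawn from the simulation studies and the empirical data analysis are as follows.

1. Effect of model misspecification on parameter estimation accuracy when the data have an HIRT structure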

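Comparison with the UIRT and MIRT estimates shows that, when the data have an HIRT structure, estimating domain scale scores and item parameters with the MIRT model gives results close to those obtained with the HIRT model. Misusing the UIRT model to estimate overall scale scores and item parameters, however, produces severe bias, and this bias is more pronounced under a within-item multidimensional test structure.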

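This result implies that, when analyzing a test with an HIRT structure, one should avoid estimating overall scale scores and item parameters with the UIRT model, which ignores the multidimensional traits underlying the items. The MIRT estimates of domain scale scores and item parameters, by contrast, show no such problem.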

2. Nonparametric estimation method for HIRT

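In simulation scenarios that include scale scores not drawn from a normal distribution, the parametric and nonparametric HIRT estimation methods were compared on the estimation accuracy, measured by RMSE, of scale scores, regression parameters, and item parameters. When the scale scores follow a standard normal distribution, the two methods show no significant difference in accuracy; when the scale scores are not normally distributed, the nonparametric method estimates the parameters more accurately. This result shows that the proposed nonparametric HIRT estimation method based on MH-within-Gibbs sampling correctly estimates scale scores, regression parameters, and item parameters, and also that the parametric method yields biased estimates when the scale scores do not come from a normal distribution.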
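
The RMSE index used in this comparison can be illustrated with a minimal sketch. The function name `rmse` and the toy values below are illustrative assumptions, not code or results from the study; RMSE is taken in its usual form, the square root of the mean squared difference between the estimates and the generating (true) values.

```python
import math


def rmse(estimates, true_values):
    """Root mean squared error between estimated and generating values."""
    if len(estimates) != len(true_values):
        raise ValueError("estimates and true_values must have the same length")
    if len(estimates) == 0:
        raise ValueError("at least one value is required")
    squared_errors = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy values (not from the study): generating scale scores and
# the estimates returned by two competing estimation methods.
true_theta = [-1.5, -0.5, 0.0, 0.5, 1.5]
theta_method_a = [-1.2, -0.4, 0.1, 0.4, 1.1]
theta_method_b = [-1.4, -0.5, 0.0, 0.6, 1.4]

print("RMSE, method A:", rmse(theta_method_a, true_theta))
print("RMSE, method B:", rmse(theta_method_b, true_theta))
```

A smaller RMSE indicates estimates closer to the generating values. In a simulation study the index would typically be averaged over replications and computed separately for each parameter type.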

3. Concurrent calibration method for HIRT equating

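The UIRT and MIRT models were compared with the proposed HIRT concurrent calibration method on simulated data. For item parameters, the UIRT and HIRT models show similar estimation accuracy; for overall scale scores, the HIRT model is more accurate because it accounts for the correlation between the higher-order and domain scale scores. This pattern matches the comparison of the UIRT and HIRT models under a single-group test design. The MIRT and HIRT models likewise show similar accuracy for scale scores and item parameters, with no significant difference between them. These findings show that the proposed HIRT concurrent calibration method correctly estimates scale scores, regression parameters, and item parameters during equating, and that model misspecification during equating also lowers the estimation accuracy of overall scale scores.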

4. Empirical data analysis

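A study of model-selection indices, based on item parameters obtained from the empirical data, found that three commonly used indices, AIC, BIC, and DIC, correctly distinguish among the UIRT, MIRT, and HIRT models. In addition, the model-selection indices and the standard errors of estimation indicate that the HIRT model is indeed appropriate for estimating scale scores, regression parameters, and item parameters from the TASA empirical data.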
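
The three indices can be written out as a short sketch using their standard definitions (Akaike, 1974; Schwarz, 1978; Spiegelhalter et al., 1998). The function names, the choice of the number of examinees as the sample size in BIC, and the numeric inputs are assumptions made for illustration; they are not the study's code or results.

```python
import math


def aic(log_likelihood, n_params):
    """AIC = -2 log L + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 log L + k ln(n); n is assumed here to be the number of examinees."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def dic(mean_deviance, deviance_at_posterior_mean):
    """DIC = Dbar + pD, with pD = Dbar - D(theta_bar) from MCMC output."""
    effective_params = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + effective_params


# Hypothetical inputs for two candidate models (not values from the study).
candidates = {
    "model_1": {"log_likelihood": -5120.0, "n_params": 40},
    "model_2": {"log_likelihood": -5050.0, "n_params": 46},
}
n_examinees = 1000

for name, fit in candidates.items():
    print(
        name,
        "AIC:", aic(fit["log_likelihood"], fit["n_params"]),
        "BIC:", bic(fit["log_likelihood"], fit["n_params"], n_examinees),
    )
```

For all three indices a smaller value indicates the preferred model. DIC needs the posterior mean deviance and the deviance at the posterior mean, so it is usually computed from the MCMC draws rather than from a single maximized likelihood.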

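In short, the HIRT model is receiving growing attention. Building on MH-within-Gibbs sampling and kernel smoothing, this study proposes a nonparametric parameter estimation method and a concurrent calibration method for the HIRT model, and uses the TASA 2006 fourth-grade mathematics test data as an example to provide theoretical and practical verification. Several related issues nevertheless require further investigation so that the HIRT model can be applied and developed more fully in test data analysis.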

References

Chinese-language references

吳慧珉 (2001) 。選項特徵曲線之研究－以核函數之平滑化為估計取向。國立臺中師範學院教育測驗統計研究所碩士論文，臺中市。

陳煥文 (2004) 。垂直等化連結特性之研究-四種連結方法的比較。國科會專題研究計畫。

臺灣學生學習成就評量資料庫 (2009) 。檢索日期：2009 年 11 月 20 日。檢自：http://tasa.naer.edu.tw/Release/index.aspx

劉湘川 (2001a) 。相關加權核平滑化無參數試題選項特徵曲線估計法及其 IORS 整合模式。第五屆華人社會心理與教育測驗學術研討會，1-10。臺北市：中國測驗學會、臺灣師範大學。

劉湘川 (2001b) 。核平滑化試題選項特徵曲線與選項關聯結構整合擴充模式。測驗統計年刊，9 (1) ，1-18。

謝典佑、林佳樺、郭伯臣、施淑娟 (2009，9 月)。高層次 IRT 模式適合度檢定之研究─以 TASA 數學科為例。「大型教育資料庫建置及相關議題」學術研討會。台中：國立台中教育大學。

謝典佑、曾彥鈞、廖晨惠、郭伯臣 (2009，10 月)。同時估計法於高層次試題反應理論之研究。中國測驗學會年會暨心理與教育測驗學術研討會。台北：國立台灣師範大學。

謝典佑、楊智為、許天維、郭伯臣 (2009，10 月)。整合無參數與 MH-within-Gibbs 技術提升高層次試題反應理論參數估計精準度之研究。中國測驗學會年會暨心理與教育測驗學術研討會。台北：國立台灣師範大學。

English-language references

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29 (1), 67-91.

Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22 (3), 37-51.

Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21 (1), 1-23.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19 (6), 716-723.

Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17 (3), 251-269.

American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.

Andersen, E. B., & Madsen, M. (1977). Estimating the parameters of a latent population distribution. Psychometrika, 42 (3), 357-374.

Baker, F. B., & Kim, S. H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York: Marcel Dekker.

Baker, F. B. (1998). An investigation of the item parameter recovery characteristics of a Gibbs sampling procedure. Applied Psychological Measurement, 22, 153-169.

Baker, F. B. (2004). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.

Baker, F. B., & Subkoviak, M. J. (1981). Analysis of test results via log-linear models. Applied Psychological Measurement, 5 (4), 503-515.

Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395-479). Reading, MA: Addison-Wesley.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46 (4), 443-459.

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35 (2), 179-197.

Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6 (4), 431-444.

Boulet, J. R. (1996). The effect of nonnormal ability distributions on IRT parameter estimation using full-information and limited-information methods (item response theory, nonlinear factor analysis). Dissertation abstracts online, University of Ottawa (Canada).

Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge, UK: Cambridge University Press.

Congdon, P. (2003). Applied Bayesian modelling. New York: John Wiley.

Cook, L. L., & Eignor, D. R. (1991). An NCME instructional module on IRT equating methods. Educational Measurement: Issues and Practice, 10 (3), 37-45.

Cressie, N., & Holland, P.W. (1983). Characterizing the manifest probabilities of latent trait models. Psychometrika, 48 (1), 129–141.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.

Cronbach, L. J., & Snow, R. E. (1977). Aptitude and instructional methods. New York:

de la Torre, J., & Douglas, J. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69 (3), 333-353.

de la Torre, J., & Hong, Y. (2010). Parameter estimation with small sample size: A higher-order IRT model approach. Applied Psychological Measurement, 34 (4), 267-285.

de la Torre, J., & Patz, R. (2005). Making the most of what we have: A practical application of multidimensional item response theory in test scoring. Journal of Educational and Behavioral Statistics, 30 (3), 295-311.

de la Torre, J., & Song, H. (2009). Simultaneous estimation of overall and domain abilities: A higher-order IRT model approach. Applied Psychological Measurement, 33 (8), 620-639.

Engelen, R. J. H. (1989). Parameter estimation in the logistic item response model. Doctoral dissertation, Universiteit Twente.

Ferrando, P. J. (2003). The accuracy of the E, N and P trait estimates: An empirical study using the EPQ-R. Personality and Individual Differences, 34 (4), 665-679.

Gelman, A. B., Carlin, J. S., Stern, H. S., & Rubin, D. B. (1995). Bayesian Data Analysis. London; New York: Chapman and Hall.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6 (6), 721-741.

Gustafsson, J. E., & Snow, R. E. (1979). Ability profiles. In R. F. Dillon (Ed.), Handbook on testing (pp. 107-135). Westport, CT: Greenwood Press.

Haebara, T. (1980). Equating Logistic Ability Scales by a Weighted Least Squares Method. Japanese Psychological Research, 22 (3), 144-149.

Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26 (1), 3-24.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57 (1), 97-109.

Hattie, J. (1981). Decision criteria for determining unidimensional and multidimensional normal ogive models of latent trait theory. Armidale, Australia: The University of New England, Center for Behavioral Studies.

Hoskens, M., & De Boeck, P. (1997). A parametric model for local dependencies among test items. Psychological Methods, 2 (3), 261-277.

Hsieh, T. Y., Kuo, B. C., & Shih, S. C. (2009). A Multi-factor High-order Item Response Model Based on MH with Gibbs Method. Paper presented at the Pacific Rim Objective Measurement Symposium, Hong Kong.

Hsieh, T. Y., Kuo, B. C., & Lin, C. H. (2011). The concurrent calibration method of high-order item response theory. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, Louisiana.

Jaeger, R. M. (1981). Some exploratory indices for selection of a test equating method. Journal of Educational Measurement, 18 (1), 23-38.

Kane, M. T., Mroch, A. A., Suh, Y., & Ripkey, D. R. (2009). Linear equating for the NEAT design: Parameter substitution models and chained linear relationship models. Measurement, 7, 125-146.

Kang, T., & Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement, 31 (4), 331-358.

Kim, S. H., & Cohen, A. S. (1998). A comparison of linking and concurrent calibration under item response theory. Applied Psychological Measurement, 22 (2), 131-143.

sampling under the two-parameter logistic model. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. New York: Springer-Verlag.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.

Kuo, B. C., Hsieh, T. Y., & Cheng, C. M. (2010). Comparing UIRT, MIRT, and HIRT based on model fitting and parameter recovery. Paper presented at the 7th Conference of the International Test Commission, Hong Kong.

Kuo, B. C., Hsieh, T. Y., & Wu, H. M. (2010). Hierarchical item response theory model with nonparametric prior distribution. Paper presented at the 7th Conference of the International Test Commission, Hong Kong.

Kuo, B. C., Hsieh, T. Y., Wu, H. M., & Lin, C. H. (2009). The comparison of one-factor high-order IRT model and multivariate IRT model. Paper presented at the Pacific Rim Objective Measurement Symposium, Hong Kong.

Li, F., Cohen, A. S., Kim, S. H., & Cho, S. J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33 (5), 353-373.

Lin, T. H., & Dayton, C. M. (1997). Model selection information criteria for non-nested latent class models. Journal of Educational and Behavioral Statistics, 22 (3), 249-264.

Liu, C. H., & Rubin, D. B. (1998). Ellipsoidally symmetric extensions of the general location model for mixed categorical and continuous data. Biometrika, 85 (3), 673-688.

Lord, F. M. (1975). Relative efficiency of number-right and formula scores. British Journal of Mathematical and Statistical Psychology, 28, 46-50.

McKinley, R. L. & Reckase, M. D. (1983). MAXLOG: A computer program for the estimation of the parameters of a multidimensional logistic model. Behavior Research Methods and Instrumentation, 15, 389-390.

Mellenbergh, G.J., & Vijn, P. (1981). The Rasch model as a loglinear model. Applied Psychological Measurement, 5 (3), 369–376.

Mislevy, R.J. (1984). Estimating latent distributions. Psychometrika, 49, 359–381.

Mislevy, R.J., & Bock, R.D. (1990). BILOG-3: Item analysis and test scoring with binary logistic models [Computer software]. Mooresville, IN: Scientific Software International.

Muraki, E. & Bock, R. D. (1996). PARSCALE: IRT based test scoring and item analysis for graded open-ended exercises and performance tasks (Version 3) [Computer software]. Chicago: Scientific Software.

OECD (2005). PISA 2003 technical report. Paris: OECD.

Patz, R. J., & Junker, B. W. (1997). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses (Technical Report No. 670). Pittsburgh: Carnegie Mellon University, Department of Statistics.

Patz, R. J., & Junker, B. W. (1999). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24 (2), 146-178.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: Macmillan.

Illustration from a Large Scale Testing Program. Applied measurement in education, 22 (1), 79-103.

Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56 (4), 611–630.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.

Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9 (4), 401-412.

Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer.

Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.

Sahu, S. K. (2002). Bayesian estimation and model choice in item response models. Journal of Statistical Computation and Simulation, 72 (3), 217-232.

Samejima, F. (1998). Efficient nonparametric approaches for estimating the operating characteristics of discrete item responses. Psychometrika, 63 (1), 111-130.

Schmitt, J. E., Mehta, P. D., Aggen, S. H., Kubarych, T. S., & Neale, M. C. (2006). Semi-nonparametric methods for detecting latent nonnormality: A fusion of latent trait and ordered latent class modeling. Multivariate Behavioral Research, 41 (4), 427-443.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6 (2), 461-464.

Sheng, Y., & Wikle, C. K. (2008). Bayesian multidimensional IRT models with a hierarchical structure. Educational and Psychological Measurement, 68 (3), 413-430.

Silverman, B. W. (1986). Density Estimation. London: Chapman and Hall.

Spearman, C. E. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15 (2), 201-293.

Spiegelhalter, D., Best, N., & Carlin, B. (1998). Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models (Research Report 98-009). Technical report, Division of Biostatistics, University of Minnesota.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7 (2), 201-211.

Stone, C. A., & Lane, S. (1991). Use of restricted item response theory models for examining the stability of item parameters estimates over time. Applied Measurement in Education, 4 (2), 125-141.

Thissen, D. (1991). MULTILOG user’s guide: Multiple categorical item analysis and test scoring using item response theory. Chicago: Scientific Software.

Thissen, D., & Mooney, J. A. (1989). Log-linear item response models, with applications to data from social surveys. Sociological Methodology, 19, 299-330.

Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monograph, No. 1.

Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22 (4), 1701-1762.

Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., Rosa, K., Nelson, L., Swygert, K. A., & Thissen, D. (2001). Augmented scores: "Borrowing strength" to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 343-387). Hillsdale, NJ: Lawrence Erlbaum Associates.

Wang, W., Wilson, M. & Cheng, Y. (2000). Local Dependence between Latent Traits when Common Stimuli are Used. Paper presented at the International Objective

Wilson, M., & Adams, R. J. (1995). Rasch models for item bundles. Psychometrika, 60 (2), 181-198.

Woods, C. M. (2006). Ramsay-curve item response theory to detect and correct for nonnormal latent variables. Psychological Methods, 11 (3), 253–270.

Woods, C. M. (2007). Ramsay-curve IRT for Likert-type data. Applied Psychological Measurement, 31 (3), 195–212.

Woods, C. M., & Lin, N. (2008). Item response theory with estimation of the latent density using Davidian curves. Applied Psychological Measurement, 33 (2), 102-117.

Woods, C. M., & Thissen, D. (2004). RCLOG v.1: Software for item response theory parameter estimation with the latent population distribution represented using spline-based densities (Technical Report). Chapel Hill, NC: L. L. Thurstone Psychometric Laboratory.

Woods, C. M., & Thissen, D. (2006). Item response theory with estimation of the
