The Hierarchical Structure Item Response Model and its Application to Computerized Adaptive Testing


Department of Educational Psychology and Counseling, National Taiwan Normal University
Doctoral Dissertation

The Hierarchical Structure Item Response Model and its Application to Computerized Adaptive Testing

Advisors: Dr. Po-Hsi Chen and Dr. Wen-Chung Wang
Graduate: Hung-Yu Huang

July, 2009

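Abstract (translated from the Chinese)

This study aimed to develop an item response model with hierarchically structured latent variables, referred to as the "hierarchical structure item response model," to apply it to computerized adaptive testing, and to examine its effectiveness. The dissertation comprises three simulation studies. The first study used Markov chain Monte Carlo estimation within Bayesian statistics to estimate model parameters and check model fit. The results showed that the model fit indices developed in this study and the Bayesian DIC are suitable for diagnosing the fit between model and data, and that Bayesian estimation provides good recovery of model parameters. The second study developed computerized adaptive testing algorithms for the hierarchical structure item response model. The results showed that the item selection and ability estimation procedure developed by modifying the computerized adaptive testing algorithm of the testlet model yielded the most efficient ability estimation. The third study modified the traditional maximum-information item selection method by adding a random component in the early stage of the test to control the error of ability estimation at that stage. The results showed that the modern item selection methods increased item pool usage and reduced item exposure rates and the mean test overlap rate, supporting the view that the new selection methods can balance item pool security and measurement precision. Finally, the author offers several suggestions for future research and practical application.

Keywords: hierarchical structure item response model, Bayesian estimation, computerized adaptive testing, item pool security, modern item selection methods, item response theory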

Abstract

This study aims to construct IRT models with a higher-order latent trait structure within a multidimensional IRT framework and to implement these models in a CAT with various modern item selection rules, in order to assess the effectiveness of the models through simulation studies. Sheng and Wikle (2008) proposed Bayesian multidimensional IRT models with a hierarchical structure (referred to in this study as the hierarchical structure item response model, HSIRM) and conducted several simulation studies to support their assumptions. However, certain questionable features of their simulation design made their findings unclear and left many questions unanswered. Unlike the original models proposed by Sheng and Wikle (2008), the HSIRM constructs its latent trait structure on the basis of factor analysis instead of principal components analysis. Because the original study is questionable, and because novel IRT models must be shown to be stable and reliable before they are implemented in a CAT environment, it is important to revise the proposed models and assess their estimation efficiency. Consequently, three separate studies on the HSIRM were conducted. The first study focused on the Bayesian estimation method and Bayesian model checking techniques and then examined model parameter recovery. The second study derived the CAT algorithm for the HSIRM and evaluated the accuracy of overall and domain ability estimates under a variety of conditions. Finally, modern item selection methods were incorporated into the HSIRM-based CAT to better control item exposure and overlap rates.

In the first study, simulations were conducted to assess the performance of Bayesian model checking techniques, including the posterior predictive model checking (PPMC) method, the pseudo-Bayes factor (PsBF) approach, and the Bayesian DIC, and then to evaluate model parameter recovery through comparison with the true model. The data sets were generated with a UIRT model, a MIRT model with identical latent trait (MIRT-I), and an HSIRM with either a high ability correlation (HSIRM-H) or a low ability correlation (HSIRM-L), in both 1PL and 2PL forms. The analytic models were the 1PL- and 2PL-HSIRMs. Five indicators were incorporated into the PPMC procedure: the SD of the biserial correlations (Bis), the Bayesian chi-square test (BChi), the reproduced correlation matrix test (Rcor), the test of the observed score covariance between subtests (Cov), and the identical latent trait correlation test (Id). The results suggest that, when implementing PPMC in the HSIRM, it is advisable to fit the data to the 1PL-HSIRM first and to screen out the inappropriate data sets generated from the UIRT, MIRT-I, and MIRT-S models according to the well-performing criteria, because the PPMC method works better when the 1PL-HSIRM is fitted. As for the relative model fit criteria, only the DIC consistently selected the correct model. The PsBF always preferred the simplest model regardless of which model was true. With respect to model parameter recovery, most estimators were unbiased, suggesting that Bayesian estimation via the MCMC procedure can measure the model parameters of the HSIRM precisely. Finally, as a one-stage approach, the HSIRM yielded the most accurate overall ability estimates in comparison with two-stage approaches such as the two-stage CFA and averaging procedures. In addition, the major advantage of the one-stage method over the two-stage methods is that the HSIRM provides a standard error of measurement for each examinee's overall ability directly, as the standard deviation of the posterior distribution, whereas only the two-stage CFA approach offered an approximate standard error of measurement, obtained indirectly through a formula. Most importantly, however, neither the two-stage CFA nor the averaging approach matched the test design structure used as the standard in this study; that is, a data set should be analyzed with the same structure under which the test was designed.

In the second study, three HSIRM-based CAT algorithms were proposed by the author: a multidimensional CAT, a unidimensional CAT, and an HSIRM-CAT approach. The two-stage methods, UCAT with CFA and UCAT with averaging, served as baselines for comparison with the one-stage methods. The results showed that, except for the unidimensional CAT approach, the one-stage methods always generated more accurate estimates of both overall and domain abilities than the two-stage methods, suggesting that the multidimensional CAT and the HSIRM-CAT approaches are reliable enough for administration in a CAT context. Neglecting the random effects of subtests made it difficult for the unidimensional CAT approach to estimate overall and domain abilities precisely, especially in the diverse factor loading setting and the 2PL-HSIRM condition. Of the two remaining methods, the HSIRM-CAT approach was recommended because its CAT was based on the HSIRM that was used to generate the item responses. In addition, a significant advantage of the HSIRM-CAT approach was that it yielded standard errors of measurement for the overall and domain ability estimates simultaneously after each adaptive item was administered, so that a fixed-precision stopping rule can be implemented if necessary, whereas the multidimensional CAT approach did not.

In the last study, the progressive (PG; Barrada, Olea, Ponsoda, & Abad, 2008; Revuelta & Ponsoda, 1998) and the proportional (PP; Barrada et al., 2008; Segall, 2004a) methods were incorporated into the CAT procedures based on the HSIRM to improve item pool security and measurement precision simultaneously, as compared to the point Fisher information (PFI) method. In addition, the Sympson and Hetter online freeze (SHOF; Chen, 2004, 2005) procedure and content balancing controls were implemented in the process. The results showed that the PG and the PP methods can reduce the item exposure rate as well as improve item pool usage, and that the effect becomes larger as the acceleration parameter increases. However, it was not possible to guarantee that the exposure rate of every item would stay below a pre-specified level unless the SHOF was implemented. As the acceleration parameter increased, the item overlap rate decreased for both the PG and the PP methods, but the overall RMSE did not always increase. When the PG method improved measurement precision by reducing the acceleration parameter, the difference in overall RMSEs between the PFI and the PG methods was much smaller. In sum, the HSIRM-CAT approach with both the PG and SHOF procedures can improve item bank security with little or no loss in measurement precision and provide test information for the duration of the CAT, as evidenced by overall RMSEs equivalent to those of the PFI method together with a lower test mean overlap rate. Finally, study limitations are noted and suggestions for future investigations are proposed.

Keywords: hierarchical structure item response model, Bayesian estimation, computerized adaptive testing, test security, modern item selection rules, item response theory
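The security indices named in the abstract (item exposure rate, item pool usage, and test mean overlap rate) can be illustrated with a small calculation. The following is a minimal sketch, not the procedure used in this study: the function name `exposure_and_overlap` and the toy data are hypothetical, every examinee is assumed to receive a test of the same length, and the overlap rate is computed directly as the average proportion of items shared by each pair of examinees.

```python
from math import comb


def exposure_and_overlap(tests, pool_size):
    """Summarize pool security for a set of administered tests.

    tests     : list of sets of item indices, one set per examinee
                (all tests assumed to have the same length; hypothetical input)
    pool_size : number of items in the pool
    Returns (exposure_rates, pool_usage_rate, mean_overlap_rate).
    """
    n_examinees = len(tests)
    test_length = len(tests[0])

    # How many examinees saw each item.
    counts = [0] * pool_size
    for test in tests:
        for item in test:
            counts[item] += 1

    # Exposure rate: proportion of examinees who received the item.
    exposure = [c / n_examinees for c in counts]

    # Pool usage: proportion of items administered at least once.
    usage = sum(c > 0 for c in counts) / pool_size

    # An item seen by c examinees is shared by comb(c, 2) pairs, so the
    # mean overlap is total shared items / (number of pairs * test length).
    shared = sum(comb(c, 2) for c in counts)
    overlap = shared / (comb(n_examinees, 2) * test_length)
    return exposure, usage, overlap


if __name__ == "__main__":
    toy_tests = [{0, 1}, {0, 2}, {0, 3}]  # three examinees, two items each
    rates, usage, overlap = exposure_and_overlap(toy_tests, pool_size=5)
    print(max(rates), usage, overlap)
```

In the toy data every examinee sees item 0, so its exposure rate is the maximum possible, and each pair of examinees shares one of two items.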

TABLE OF CONTENTS

CHAPTER ONE INTRODUCTION
  1.1 Motivation
  1.2 Significance and Contribution

CHAPTER TWO LITERATURE REVIEW
  2.1 Unidimensional IRT Models
  2.2 Multidimensional IRT Models
  2.3 Testlet Response Theory Model
  2.4 Multidimensional IRT Models with a Hierarchical Structure
    2.4.1 Factor Analysis Approach for the MIRT Model with a Hierarchical Structure
    2.4.2 Principal Components Analysis Approach for the MIRT Model with a Hierarchical Structure
    2.4.3 Comments for the Bayesian MIRT Models with a Hierarchical Structure
  2.5 Computerized Adaptive Testing
    2.5.1 Item Selection Procedures
    2.5.2 Ability Estimation
    2.5.3 Item Selection and Scoring for MIRT Models
  2.6 Item Bank Security and Modified Item Selection Procedures
    2.6.1 Content Balancing, Item Exposure Control, and Test Overlap Control
    2.6.2 Modern Item Selection Procedures
    2.6.3 Progressive Method
    2.6.4 Proportional Method
  2.7 Statement of the Problems

CHAPTER THREE METHOD
  3.1 Study 1: Model-data Fit and Model Parameter Recovery of the HSIRM
    3.1.1 Bayesian Estimation
    3.1.2 Bayesian Model Fit Checking Technique
    3.1.3 Posterior Predictive Model Checking
    3.1.4 Bayesian Approach Model Comparison
    3.1.5 Simulation Design
    3.1.6 Data Generation
    3.1.7 Analysis
  3.2 Study 2: The CAT Procedure based on the HSIRM
    3.2.1 Derivation of the Hierarchical Structure CAT Algorithm
    3.2.2 Simulation Design
  3.3 Study 3: Modern Item Selection Procedures in the Hierarchical Structure CAT
    3.3.1 Implementation of the PG and the PP Methods into the Hierarchical Structure CAT
    3.3.2 Simulation Design

CHAPTER FOUR RESULTS
  4.1 Study 1: Model-data Fit and Model Parameter Recovery of the HSIRM
    4.1.1 Model Fit Checking with the PPMC Methods
    4.1.2 Model Comparison and Model Selection Methods
    4.1.3 Model Parameter Recovery
    4.1.4 Examination of the Effect of the HSIRM
  4.2 Study 2: The CAT Procedure based on the HSIRM
    4.2.1 Measurement Precision in CAT under the HSIRM
    4.2.2 Test Reliability in CAT based on the HSIRM
    4.2.3 Relative Efficiency between CAT Algorithms under the HSIRM
    4.2.4 Utility of Each Subtest in CAT under the HSIRM
    4.2.5 Conditional Standard Errors for Overall Ability in CAT under the HSIRM
  4.3 Study 3: Modern Item Selection Procedures in CAT based on the HSIRM
    4.3.1 Magnitudes of Bias, Maximum Item Exposure Rate, and Item Pool Usage Rate under CAT based on the HSIRM
    4.3.2 Overall RMSEs and Test Mean Overlap Rate under CAT based on the HSIRM

CHAPTER FIVE DISCUSSION AND CONCLUSION
  5.1 Discussion and Conclusion
  5.2 Study Limitation and Suggestions for Further Investigations

REFERENCES

LIST OF TABLES

Table 3.1 The Cases Examined in the Simulation Studies
Table 4.1 Number of Misfit in Posterior Predictive Model Checking over 10 Replications
Table 4.2 Average of Model Selection Indices in Each Condition
Table 4.3 Model Recovery in Each Condition
Table 4.4 Parameter Recovery for the 1PL-HSIRM with High Ability Correlation
Table 4.5 Parameter Recovery for the 1PL-HSIRM with Low Ability Correlation
Table 4.6 Parameter Recovery for the 2PL-HSIRM with High Ability Correlation
Table 4.7 Parameter Recovery for the 2PL-HSIRM with Low Ability Correlation
Table 4.8 Magnitudes of Bias for the Overall and Domain Abilities across Five Proposed Methods in CAT
Table 4.9 Magnitudes of RMSEs for the Overall and Domain Abilities across Five Proposed Methods in CAT
Table 4.10 Factorial ANOVA in RMSEs according to Overall Ability Estimation
Table 4.11 Factorial ANOVA in RMSEs according to Subtest Ability Estimation
Table 4.12 Test Reliability for the Five Proposed Methods in CAT based on the HSIRM
Table 4.13 Ratios of Mean Square Errors comparing Two-stage Methods with One-stage Methods
Table 4.14 Mean Number of Items to be Administered in Each Subtest under One-stage Methods
Table 4.15 Distribution of Discrimination Parameters in Item Pool for the 2PL-HSIRM
Table 4.16 Maximum Item Exposure Rate and Item Pool Usage Rate under Multidimensional CAT Approach in 1PL-HSIRM with High Ability Correlation
Table 4.17 Maximum Item Exposure Rate and Item Pool Usage Rate under Multidimensional CAT Approach in 1PL-HSIRM with Low Ability Correlation
Table 4.18 Maximum Item Exposure Rate and Item Pool Usage Rate under Multidimensional CAT Approach in 2PL-HSIRM with High Ability Correlation
Table 4.19 Maximum Item Exposure Rate and Item Pool Usage Rate under Multidimensional CAT Approach in 2PL-HSIRM with Low Ability Correlation
Table 4.20 Maximum Item Exposure Rate and Item Pool Usage Rate under Unidimensional CAT Approach in 1PL-HSIRM with High Ability Correlation
Table 4.21 Maximum Item Exposure Rate and Item Pool Usage Rate under Unidimensional CAT Approach in 1PL-HSIRM with Low Ability Correlation
Table 4.22 Maximum Item Exposure Rate and Item Pool Usage Rate under Unidimensional CAT Approach in 2PL-HSIRM with High Ability Correlation
Table 4.23 Maximum Item Exposure Rate and Item Pool Usage Rate under Unidimensional CAT Approach in 2PL-HSIRM with Low Ability Correlation
Table 4.24 Maximum Item Exposure Rate and Item Pool Usage Rate under HSIRM-CAT Approach in 1PL-HSIRM with High Ability Correlation
Table 4.25 Maximum Item Exposure Rate and Item Pool Usage Rate under HSIRM-CAT Approach in 1PL-HSIRM with Low Ability Correlation
Table 4.26 Maximum Item Exposure Rate and Item Pool Usage Rate under HSIRM-CAT Approach in 2PL-HSIRM with High Ability Correlation
Table 4.27 Maximum Item Exposure Rate and Item Pool Usage Rate under HSIRM-CAT Approach in 2PL-HSIRM with Low Ability Correlation

LIST OF FIGURES

Figure 2.1 Factor Analysis Approach for the MIRT Model with a Hierarchical Structure
Figure 2.2 Principal Components Analysis Approach for the MIRT Model with a Hierarchical Structure
Figure 4.1 Estimates of Factor Loading when Fitting 1PL-HIRT Model to 1PL-UIRT Data over 10 Replications
Figure 4.2 Estimates of Factor Loading when Fitting 2PL-HIRT Model to 2PL-UIRT Data over 10 Replications
Figure 4.3 Magnitudes of Bias against Three Estimations for 1PL-HSIRM with High Ability Correlation across 10 Replications
Figure 4.4 Magnitudes of RMSE against Three Estimations for 1PL-HSIRM with High Ability Correlation across 10 Replications
Figure 4.5 Magnitudes of Bias against Three Estimations for 1PL-HSIRM with Low Ability Correlation across 10 Replications
Figure 4.6 Magnitudes of RMSE against Three Estimations for 1PL-HSIRM with Low Ability Correlation across 10 Replications
Figure 4.7 Magnitudes of Bias against Three Estimations for 2PL-HSIRM with High Ability Correlation across 10 Replications
Figure 4.8 Magnitudes of RMSE against Three Estimations for 2PL-HSIRM with High Ability Correlation across 10 Replications
Figure 4.9 Magnitudes of Bias against Three Estimations for 2PL-HSIRM with Low Ability Correlation across 10 Replications
Figure 4.10 Magnitudes of RMSE against Three Estimations for 2PL-HSIRM with Low Ability Correlation across 10 Replications
Figure 4.11 Magnitudes of the Mean Standard Errors in Four HSIRMs across 10 Replications
Figure 4.12 Interaction Between Methods and Ability Correlations in RMSEs according to Overall Ability Estimation
Figure 4.13 Interaction Between Methods and the Number of Parameters in RMSEs according to Subtest Ability Estimation
Figure 4.14 Interaction Between Methods and Ability Correlations in RMSEs according to Subtest Ability Estimation
Figure 4.15 Conditional Standard Errors over Overall Ability Levels in 1PL-HSIRM with High Ability Correlation
Figure 4.16 Conditional Standard Errors over Overall Ability Levels in 2PL-HSIRM with High Ability Correlation
Figure 4.17 Conditional Standard Errors over Overall Ability Levels in 1PL-HSIRM with Low Ability Correlation
Figure 4.18 Conditional Standard Errors over Overall Ability Levels in 2PL-HSIRM with Low Ability Correlation
Figure 4.19 Overall RMSE and Test Mean Overlap for the 600-item Pool with a Test Length of 30 Items in 1PL-HSIRM under High Ability Correlation
Figure 4.20 Overall RMSE and Test Mean Overlap for the 1200-item Pool with a Test Length of 30 Items in 1PL-HSIRM under High Ability Correlation
Figure 4.21 Overall RMSE and Test Mean Overlap for the 600-item Pool with a Test Length of 60 Items in 1PL-HSIRM under High Ability Correlation
Figure 4.22 Overall RMSE and Test Mean Overlap for the 1200-item Pool with a Test Length of 60 Items in 1PL-HSIRM under High Ability Correlation
Figure 4.23 Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 30 Items in 1PL-HSIRM under Low Ability Correlation
Figure 4.24 Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 30 Items in 1PL-HSIRM under Low Ability Correlation
Figure 4.25 Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 60 Items in 1PL-HSIRM under Low Ability Correlation
Figure 4.26 Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 60 Items in 1PL-HSIRM under Low Ability Correlation
Figure 4.27 Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 30 Items in 2PL-HSIRM under High Ability Correlation
Figure 4.28 Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 30 Items in 2PL-HSIRM under High Ability Correlation
Figure 4.29 Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 60 Items in 2PL-HSIRM under High Ability Correlation
Figure 4.30 Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 60 Items in 2PL-HSIRM under High Ability Correlation
Figure 4.31 Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 30 Items in 2PL-HSIRM under Low Ability Correlation
Figure 4.32 Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 30 Items in 2PL-HSIRM under Low Ability Correlation
Figure 4.33 Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 60 Items in 2PL-HSIRM under Low Ability Correlation
Figure 4.34 Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 60 Items in 2PL-HSIRM under Low Ability Correlation


CHAPTER ONE INTRODUCTION

1.1 Motivation

For a long time, educational testing focused mainly on paper-and-pencil (P&P) tests. In recent years, many tests have begun to be administered on computer as advances in computer technology have opened up new opportunities. Furthermore, the field has progressed from computer-based testing (CBT) to computerized adaptive testing (CAT). Compared to traditional paper-and-pencil testing, computerized adaptive testing can estimate examinees' traits more accurately with tailored items and can measure ability efficiently with a smaller number of test items (McBride & Martin, 1983; Parshall, Spray, Kalohn, & Davey, 2002; Ponsoda & Olea, 2003; Segall, 2004b; van der Linden & Glas, 2000; Wainer, 1990).

CAT procedures and algorithms, such as item selection and ability estimation, have been developed and implemented primarily on the basis of item response theory (IRT). Conventional IRT models share two fundamental assumptions, namely unidimensionality and local independence (Baker & Kim, 2004; Bond & Fox, 2001; Embretson & Reise, 2000; Hambleton & Swaminathan, 1985; Smith & Smith, 2004; van der Linden & Hambleton, 1997). When a test contains more than one dimension, or testlet-based items in which a group of items share a common stimulus (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow, & Du, 2000; Wainer & Kiely, 1987), however, it is not appropriate to apply conventional unidimensional IRT models. Recently, multidimensional IRT (MIRT) and testlet response theory (TRT) models have been proposed for such situations and have been adapted for use in CAT contexts (Segall, 1996; Segall & Moreno, 1999; Shih, 2007; Su, 2007; Wang & Chen, 2004).

Unidimensional and multidimensional IRT models are based on different assumptions about the latent dimensions. Unidimensional IRT models assume that all items measure a single ability. When more than one ability is being measured, multidimensional IRT models provide more precise estimates of the covariances between latent traits than unidimensional IRT models, because covariances computed from separate unidimensional analyses are contaminated by measurement error. A further advantage of multidimensional IRT models is that they can correct the overestimated test reliability and biased item parameter estimates that result when the multidimensional structure of the latent trait is ignored (Wang & Wilson, 2005b, 2005c).

In practice, achievement tests usually measure several abilities and are thus appropriately analyzed by multidimensional IRT models. For example, an English test may comprise three domain abilities: listening, reading, and writing. A multidimensional IRT model is very useful for estimating the correlations between the abilities while simultaneously calibrating multiple dimensions and item parameters. Consequently, three ability estimates can be obtained for each examinee, through maximum-likelihood or Bayesian estimation, to evaluate that examinee's performance. However, it is much harder to assess the overall performance of examinees when several trait levels exist. One could, for convenience, simply take the average of the three domain abilities estimated by the multidimensional IRT model as a criterion for an examinee, or fit each subtest to a unidimensional IRT model separately and average the results in the same way. However, such approaches neglect the measurement errors of the domain ability estimates, so using the average value as a measure of overall performance would be questionable. Clearly, it would be preferable to include an overall ability as well as the domain abilities in the model, so that both composite and subtest scores can be obtained in a single analysis.

Sheng and Wikle (2008) proposed so-called Bayesian multidimensional IRT models with a hierarchical structure, in which they attempted to build a hierarchical structure for the underlying ability dimensions. They ingeniously incorporated a second-order factor model (Kelloway, 1998; Schmid & Leiman, 1957) from the factor analytic framework into MIRT models. Using these models, overall and domain abilities can be estimated simultaneously, and multidimensional IRT models with a hierarchical structure can be used to administer a CAT that evaluates the overall performance of an examinee. Due to certain flaws in the design of their study, however, questions exist about how well the parameter estimates (factor loadings) were recovered in their simulation.
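The second-order structure described above can be written compactly. The following is an illustrative sketch only: the symbols are assumptions introduced here rather than the notation of this dissertation or of Sheng and Wikle (2008), and the item response function is written with a logistic link as an assumption.

```latex
% Illustrative notation (assumed, not taken from the source):
%   \theta_j^{(g)}    overall (second-order) ability of examinee j
%   \theta_{jd}       ability of examinee j on domain d, d = 1, ..., D
%   \lambda_d         factor loading of domain d on the overall ability
%   \varepsilon_{jd}  domain-specific residual
%   d(i)              the domain measured by item i
\begin{align}
  \theta_{jd} &= \lambda_d \, \theta_j^{(g)} + \varepsilon_{jd},
  \qquad d = 1, \dots, D, \\
  P\bigl(Y_{ji} = 1 \mid \theta_{jd(i)}\bigr)
  &= \frac{\exp\{a_i(\theta_{jd(i)} - b_i)\}}
          {1 + \exp\{a_i(\theta_{jd(i)} - b_i)\}} .
\end{align}
```

In the English test example, D = 3 (listening, reading, writing). Fixing every discrimination a_i to 1 gives a 1PL form, and freeing it gives a 2PL form. Unlike a simple average of the domain estimates, the loadings allow each domain to contribute to the overall ability to a different degree, and both levels are estimated in one step.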
Before implementing a novel IRT model in a CAT environment, the models must be reliable and stable. Consequently, in this study, the proposed models will be revised and the effects of estimation will be assessed. Use of the maximum-information item selection procedure to administer tailored set of items to examinees is very popular in CAT environments. There is a trade-off, however, between measurement accuracy and item pool security. If the test is compromised or overexposed, the advantages of a CAT may not be sustained. However, although item exposure control in a CAT may protect items from being overexposed, it also reduces the measurement precision to some extent. Further, the maximum-information item selection procedure is not appropriate for shorter tests and such a procedure tend to favor items with maximum information values at the wrong provisional ability, because the errors of the estimated ability at the beginning phase of the CAT are very large (Chang & Ying, 1996; van der Linden & Pashley, 2000; Veerkamp & Berger, 1997). As a result, several modified maximum-information item 2.

selection procedures have been proposed which simultaneously consider measurement accuracy and item bank security in short tests (Chen & Ankenmann, 2004; Chen, Ankenmann, & Chang, 2000; van der Linden, 1998; van der Linden & Pashley, 2000; Veerkamp & Berger, 1997). However, the highly complicated numerical integrations they require result in longer computation times, particularly for multidimensional CATs (Shih, 2007). An alternative to the modified maximum-information item selection criteria has been proposed to decrease computation time while taking both measurement accuracy and item pool security into account (Barrada, Olea, Ponsoda, & Abad, 2008). These novel item selection rules will be described in the following sections and implemented in a CAT environment based on multidimensional IRT models with a hierarchical structure.

1.2 Significance and Contribution

As discussed above, the advantageous psychometric properties of CATs can only be realized when they are based on reliable and well-established IRT models. Additionally, to sustain the quality of a CAT used in a high-stakes setting, item exposure and test overlap must be controlled at satisfactory levels relative to the standard maximum-information item selection procedure. This study aims to revise the multidimensional IRT models with a hierarchical structure proposed by Sheng and Wikle (2008) and to implement the models in CAT contexts. The resulting models will yield both domain and overall ability estimates. It is expected that practitioners will find it very useful to be able to obtain overall performance estimates for examinees and to check model fit for the multidimensional IRT models with a hierarchical structure. Furthermore, test practitioners will be able to build CAT systems which simultaneously estimate overall and domain abilities while arriving at a better balance of measurement accuracy and pool security.
Hence this study not only provides guidance on the implementation of multidimensional IRT models with a hierarchical structure, but also sheds new light on both theoretical and practical issues.

CHAPTER TWO LITERATURE REVIEW

This chapter first discusses a variety of dichotomous IRT and MIRT models and then extends these models to MIRT models with a hierarchical structure. Finally, the principles of the CAT algorithm and related issues are presented and applied to MIRT models with a hierarchical structure.

2.1 Unidimensional IRT Models

As a special case of generalized linear mixed models (GLMMs), the IRT models introduced below are based on the logistic distribution rather than the cumulative normal distribution (De Boeck & Wilson, 2004). The one-parameter logistic (1PL) IRT model, also named the Rasch model (Rasch, 1960), can be represented as follows:

$$P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}, \tag{2.1}$$

where $P(X_{ni} = 1)$ is the probability of a correct response; $b_i$ is the difficulty parameter for item i; and $\theta_n$ is the nth examinee's ability parameter for the administered test. A significant advantage of the Rasch model is that it provides measurement on an interval or even ratio scale, which is of great importance in the field of psychological and educational measurement, especially for large-scale and high-stakes tests (Embretson & Reise, 2000). The two-parameter logistic (2PL) IRT model is obtained when the item discrimination parameter ($a_i$) for item i is added to Equation (2.1) in the following manner (Birnbaum, 1968):

$$P(X_{ni} = 1 \mid \theta_n, b_i, a_i) = \frac{\exp[a_i(\theta_n - b_i)]}{1 + \exp[a_i(\theta_n - b_i)]}. \tag{2.2}$$

The 2PL model is useful for determining whether items are equally related to the latent trait by examining their estimated discrimination parameters, which function analogously to factor loadings in factor analysis (Reise, Widaman & Pugh, 1993; Smith & Reise, 1998). When a third parameter, the guessing parameter, is added to the 2PL model, it becomes the three-parameter logistic (3PL) IRT model, as follows:

$$P(X_{ni} = 1 \mid \theta_n, b_i, a_i, c_i) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_n - b_i)]}{1 + \exp[a_i(\theta_n - b_i)]}, \tag{2.3}$$

where $c_i$ represents the probability of endorsing item i for an examinee whose ability level is at the low extreme.

2.2 Multidimensional IRT Models

Multidimensional IRT (MIRT) models provide two or more parameters and their covariance structure to represent each person's trait level. The most general MIRT model, the Multidimensional Three-parameter Logistic Model (M3PLM), was proposed by Hattie (1981) and Reckase (1985); its item response function can be expressed as:

$$P(X_{ni} = 1 \mid \boldsymbol{\theta}_n, b_i, \mathbf{a}_i, c_i) = c_i + (1 - c_i)\,\frac{\exp[\mathbf{a}_i'(\boldsymbol{\theta}_n - b_i\mathbf{1})]}{1 + \exp[\mathbf{a}_i'(\boldsymbol{\theta}_n - b_i\mathbf{1})]}, \tag{2.4}$$

where $P(X_{ni} = 1)$ is the probability of a correct response; $\boldsymbol{\theta}_n' \equiv \{\theta_1, \theta_2, \ldots, \theta_p\}$ refers to the p-dimensional latent trait; $\mathbf{a}_i'$ is a 1×p vector of discrimination parameters for item i; $b_i$ and $c_i$ are the difficulty and guessing parameters for item i, respectively; and $\mathbf{1}$ is a p×1 vector of 1's. When there are only difficulty and discrimination parameters in the model, Equation (2.4) reduces to the Multidimensional Two-parameter Logistic Model (M2PLM), which can be seen in the study of Reckase (1997). When the discrimination parameter is removed from the M2PLM, it becomes the Multidimensional One-Parameter Logistic Model (M1PLM; McKinley & Reckase, 1982). In addition to the M1PLM, Adams, Wilson and Wang (1997) proposed the multidimensional random coefficients multinomial logit model (MRCMLM) for Rasch family models.
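The response functions in Equations (2.1) to (2.4) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the study: the function names `p_3pl` and `p_m3pl` and the use of NumPy are assumptions made here for the example.

```python
import numpy as np


def p_3pl(theta, b, a=1.0, c=0.0):
    """Eq. (2.3). With c=0 it is the 2PL of Eq. (2.2); with a=1 and c=0, the Rasch model of Eq. (2.1)."""
    z = a * (theta - b)
    # 1 / (1 + exp(-z)) equals exp(z) / (1 + exp(z))
    return c + (1.0 - c) / (1.0 + np.exp(-z))


def p_m3pl(theta, a, b, c=0.0):
    """Eq. (2.4): theta and a are length-p vectors; b and c are scalars for the item."""
    theta = np.asarray(theta, dtype=float)
    a = np.asarray(a, dtype=float)
    z = a @ (theta - b)  # a_i'(theta_n - b_i * 1)
    return c + (1.0 - c) / (1.0 + np.exp(-z))
```

For example, `p_3pl(0.0, 0.0)` returns 0.5 (an examinee whose ability equals the item difficulty under the Rasch model), and with p = 1 the multidimensional function coincides with the unidimensional one.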
As a member of the exponential family of distributions, the MRCMLM can be viewed as a generalized linear mixed model (De Boeck & Wilson, 2004; McCulloch & Searle, 2001; Rijmen, Tuerlinckx, De Boeck, & Kuppens, 2003; Wang & Wilson, 2005a; Wang & Wilson, 2005b). The Rasch testlet model (Wang & Wilson, 2005b), the linear logistic test model (LLTM; Fischer, 1973), the rating scale model (RSM; Andrich, 1978), and the partial credit model (PCM; Masters, 1982) are all special cases of the MRCMLM. The model can be expressed as

follows:

$$P(X_{nij} = 1; \boldsymbol{\xi} \mid \boldsymbol{\theta}_n) = \frac{\exp(\mathbf{b}_{ij}'\boldsymbol{\theta}_n + \mathbf{a}_{ij}'\boldsymbol{\xi})}{\sum_{u=1}^{K_i} \exp(\mathbf{b}_{iu}'\boldsymbol{\theta}_n + \mathbf{a}_{iu}'\boldsymbol{\xi})}, \tag{2.5}$$

where $P(X_{nij} = 1)$ is the probability of a response in category j of item i for person n; $K_i$ is the number of categories in item i; $\boldsymbol{\xi}$ is a vector of difficulty parameters; $\mathbf{b}_{ij}$ is a score vector given to category j of item i across the P latent traits; and $\mathbf{a}_{ij}$ is a design vector given to category j of item i that specifies the linear combination of the elements of $\boldsymbol{\xi}$. The commercial computer program ConQuest (Wu, Adams, & Wilson, 1998) can be used to calibrate the parameters of the MRCMLM. In addition to ConQuest, the SAS NLMIXED procedure (SAS Institute, 1999) and the STATA GLLAMM procedure (Skrondal & Rabe-Hesketh, 2004) are alternatives for fitting the MRCMLM and the nonlinear and generalized linear mixed models described above.

2.3 Testlet Response Theory Model

For a group of items in a test with a common stimulus, Bradlow et al. (1999) added an additional random effect to the 2PL IRT model to account for the dependence between items within the same testlet. Wainer, Bradlow, and Du (2000) further extended the 2PL TRT model to the 3PL TRT model by including guessing parameters. In addition to taking the guessing parameter into account, the latter model allows the variance of the random effects to differ across testlets, which makes it more flexible for assessing the influence of different testlets and for evaluating whether a testlet effect can be ignored when its variance is very small. Under the 3PL TRT model, the probability of a correct response to item i within testlet d(i) for a person with latent trait $\theta_n$ is

$$P(X_{ni} = 1 \mid \theta_n, b_i, a_i, c_i) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_n - b_i + \gamma_{nd(i)})]}{1 + \exp[a_i(\theta_n - b_i + \gamma_{nd(i)})]}, \tag{2.6}$$

where $P(X_{ni} = 1)$ is the probability of a correct response; $a_i$, $b_i$, and $c_i$ are the

discrimination, difficulty, and guessing parameters, respectively; and $\gamma_{nd(i)}$ is the random effect for person n on testlet d(i), which describes the interaction between persons and items (local item dependence) within the testlet. If we set the guessing parameter to 0 and the discrimination parameter to 1 in Equation (2.6), the model reduces to the Rasch testlet model proposed by Wang and Wilson (2005b). It should be noted that the TRT model is a special case of the MIRT model in which the multiple dimensions, including the latent trait and testlet effects, are constrained to be independent.

2.4 Multidimensional IRT Models with a Hierarchical Structure

Although MIRT models are very useful for explaining the correlations among abilities and simultaneously calibrating multiple dimensions and item parameters, it is much harder to assess the performance of examinees from so many trait levels if the purpose of a test is to evaluate an overall characteristic. With the increasing demand for cognitive diagnosis and for evaluating applicants by entrance test scores, recent attempts have been made to incorporate an overall ability dimension underlying several trait dimensions designed for individual test items (de la Torre & Douglas, 2004). For instance, suppose an English test comprises three subtests: listening, reading, and writing. We can fit a UIRT model and report a unified English ability for an examinee if necessary. It seems inappropriate, however, to ignore the separate abilities when the test is intended to assess individual proficiency.
Furthermore, ignoring the local item dependence in a test in order to fit the unidimensional model will result in overestimated test reliability and biased parameter estimates (Ip, 2000; Wainer, 1995; Wainer & Lukhele, 1997; Wainer & Thissen, 1996; Wainer & Wang, 2000; Wang & Wilson, 2005b; Wang & Wilson, 2005c; Wang, Cheng, & Wilson, 2005). The analytic approaches for separate and composite scores need to be integrated, giving simultaneous consideration to UIRT and MIRT models. Several methods have been proposed to achieve this end. The most convenient approach is to fit a MIRT model to empirical data to obtain the subtest scores and then average them to get the composite score, or to fit the UIRT model to each subtest separately and average the estimates in the same way. Overall ability estimates can also be obtained by computing factor scores for each examinee from the separate ability estimates. However, computing either mean scores or factor scores from point estimates of an examinee's distinct proficiencies, based on maximum likelihood or Bayesian estimation, will result in biased composite

scores and will overlook the correlation between subtests, since it neglects measurement error and sampling variance (Adams, Wilson, & Wu, 1997; Adams & Wu, 2000). Therefore it is preferable to estimate overall ability and domain abilities simultaneously in the same model in order to obtain composite and subtest scores at the same time. Sheng and Wikle (2008) proposed the so-called Bayesian MIRT models with a hierarchical structure, in which they attempted to build a hierarchical structure for the underlying ability dimensions. They ingeniously incorporated a second-order factor model (Kelloway, 1998; Schmid & Leiman, 1957) from the factor analytic framework into the context of MIRT models. Through Markov chain Monte Carlo (MCMC) simulation algorithms (Chib & Greenberg, 1995), the proposed models can be used to estimate person and item parameters and to apply model choice techniques such as model comparison and model checking. This study will present the two prominent models they suggested, discuss certain possible shortcomings of their models which may require supplementation, and extend the application of their models to computerized adaptive testing.

2.4.1 Factor Analysis Approach for the MIRT Model with a Hierarchical Structure

The first proposed model, a Bayesian MIRT model with a hierarchical structure, was formulated within the framework of factor analysis to construct the relationship between the overall and domain abilities. One can assume that each domain ability is a linear function of the overall ability, as shown in Figure 2.1; that is, the variance of each distinct dimension is assumed to be explained by the variance of the overall dimension, which is congruent with the definition of factor analysis in the multivariate statistics literature (Johnson, 1998). Under this assumption, Sheng and Wikle (2008) specified the linear function between the overall and domain abilities as follows:
$$\theta_{vn} = \beta_v\theta_{0n} + \varepsilon_{vn}, \tag{2.7}$$

where $\varepsilon_{vn} \sim N(0, 1)$; $\theta_{vn}$ refers to the domain ability parameter corresponding to the vth subtest; $\theta_{0n}$ is the overall ability representing the overall performance; and $\beta_v$ is a measure of association between the overall ability and the vth domain ability, that is, the square of the regression coefficient represents the amount of variance in each domain trait explained by the overall ability. Combining Equation (2.7) with the 2PL-MIRT model, the author revised the original model probability function

proposed by Sheng and Wikle (2008) by replacing the probit link function with a logit link function, so that the model can be written as follows:

$$P(y_{vni} = 1 \mid \theta_{0n}, \varepsilon_{vn}, a_{vi}, b_{vi}) = \frac{\exp(a_{vi}\beta_v\theta_{0n} - b_{vi} + a_{vi}\varepsilon_{vn})}{1 + \exp(a_{vi}\beta_v\theta_{0n} - b_{vi} + a_{vi}\varepsilon_{vn})}. \tag{2.8}$$

For model identification, constraints must be imposed on some parameters: $\theta_{0n}$ is assumed to follow N(0, 1), and $\boldsymbol{\varepsilon}_n$ has a multivariate normal distribution $N_V(\mathbf{0}, \mathbf{I})$, where $\mathbf{I}$ is an identity matrix with V dimensions (subtests). When the linear regression coefficients are all set to unity, Equation (2.8) collapses into the testlet model proposed by Bradlow et al. (1999); that is, the testlet model is a special case of the proposed model in which all $\beta_v$ are equal to unity. There seems to be a mathematical inconsistency, however, if we put constraints on $\theta_{0n}$ and $\boldsymbol{\varepsilon}_n$ as described above. The regression coefficients $\beta_v$ cannot be regarded as the correlations between the overall ability and each domain ability, because the variance of each domain ability no longer equals one.

Figure 2.1 Factor Analysis Approach for the MIRT Model with a Hierarchical Structure (path diagram relating the overall ability θ0 to the domain abilities θ1, θ2, and θ3)

2.4.2 Principal Components Analysis Approach for the MIRT Model with a Hierarchical Structure

On the other hand, if the overall ability is a linear combination of the remaining

domain abilities, one can obtain the linear relationship among the ability dimensions as follows (Sheng & Wikle, 2008):

$$\theta_{0n} = \sum_{v} \lambda_v\theta_{vn} + \varepsilon_n, \tag{2.9}$$

where $\varepsilon_n$ is ordinarily assumed to follow N(0, 1), so that $\lambda_v$ are the standardized regression coefficients for the domain ability parameters when both $\theta_{0n}$ and $\theta_{vn}$ are standard normal variates. This linear transformation is very similar to principal components analysis, a technique with which one can reduce the dimensionality of a data set and identify one or more meaningful underlying variates (Johnson, 1998). As shown in Figure 2.2, the causal paths in the regression function are opposite to those of the first model. The two MIRT models with a hierarchical structure proposed by Sheng and Wikle (2008) represent different beliefs concerning the relationship between the overall and domain abilities. When the linear function in Equation (2.9) is incorporated into the 2PL MIRT model, the revised model based on the logit link function can be written as:

$$P(y_{vni} = 1 \mid \theta_{vn}, \varepsilon_n, a_{vi}, \lambda_v) = \frac{\exp\left[a_{vi}\left(\sum_{v=1}^{M} \lambda_v\theta_{vn} + \varepsilon_n\right) - b_{vi}\right]}{1 + \exp\left[a_{vi}\left(\sum_{v=1}^{M} \lambda_v\theta_{vn} + \varepsilon_n\right) - b_{vi}\right]}, \tag{2.10}$$

where, for model identification, $\varepsilon_n \sim N(0, 1)$ and the domain abilities are assumed to have a multivariate normal distribution, $\boldsymbol{\theta}_n \sim N_M(\mathbf{0}, \mathbf{R})$, in which the covariance matrix has 1s on the diagonal and is therefore a correlation matrix $\mathbf{R}$. Sheng and Wikle (2008) regarded the relationship between overall ability and domain abilities as equivalent to that of principal components analysis. However, there seems to be a contradiction between the proposed model and the definition of principal components, because in principal components analysis the linear combination has no predictive error variance and the weights are simply chosen to maximize the variance of the composite score across examinees.
The following section discusses these issues and provides evidence that argues for a revision of the original designs of the above two models.
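The contrast drawn above between Equation (2.9) and a true principal component can be illustrated numerically. The following sketch assumes a hypothetical correlation matrix `R` for three domain abilities (the values are invented for illustration and are not taken from the study) and uses NumPy's symmetric eigendecomposition. It shows that the first principal component is an exact linear combination of the domain abilities, with no residual term, whose variance equals the largest eigenvalue.

```python
import numpy as np


def first_principal_component(R):
    """Return the largest eigenvalue of R and its unit-length eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(R, dtype=float))  # ascending eigenvalues
    lam, w = eigvals[-1], eigvecs[:, -1]
    if w.sum() < 0:  # eigenvectors are defined up to sign; fix it so the weights sum to a positive value
        w = -w
    return lam, w


# Hypothetical correlation matrix for three domain abilities (illustrative only).
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

lam, w = first_principal_component(R)

rng = np.random.default_rng(1)
theta = rng.multivariate_normal(np.zeros(3), R, size=100_000)
composite = theta @ w  # exact linear combination: no residual term is added

print("largest eigenvalue:", lam)
print("weights:", w)
print("sample variance of composite:", composite.var())
```

Any other unit-length weight vector yields a composite with variance no larger than `lam`, which is the sense in which the weights are chosen to maximize the variance of the composite score.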

Figure 2.2 Principal Components Analysis Approach for the MIRT Model with a Hierarchical Structure (path diagram relating the domain abilities θ1, θ2, and θ3 to the overall ability θ0)

2.4.3 Comments on the Bayesian MIRT Models with a Hierarchical Structure

Because the two MIRT models with a hierarchical structure were quite complicated for frequentist methods to handle, Sheng and Wikle (2008) used Bayesian methods to estimate the parameters of the proposed models and drew conclusions, through Bayesian model choice techniques, about which model better represented the simulated and empirical data sets. It is not entirely clear, however, how well the regression coefficients of the latent variables in both proposed models were recovered from the simulated data sets. In a series of pilot studies conducted before the dissertation experiments, the researcher replicated the simulation studies for the two models proposed by Sheng and Wikle (2008) to evaluate parameter recovery for each model, and found that the item parameters were recovered acceptably well but that the person-side parameters, such as the regression coefficients, were far from their generating values. Returning to the original article by Sheng and Wikle (2008), it was found that while their results showed that the item parameters could be recovered well by the fitted models, the recovery of the person-side parameters (factor loadings) was not reported anywhere in their article. It is reasonable to believe that the two proposed models suffer from the problem of scale indeterminacy (Wang, 2004), which usually makes it difficult to estimate the person parameters, especially when estimating the complicated relationships between different dimensions.

In addition, if the first model is to be considered comparable to the factor analysis model, the constraints on the parameter distributions should be redefined so that the regression coefficients of the overall ability in Equation (2.7) are exactly equal to factor loadings on a common factor. Suppose one observes a p-variate latent trait vector $\boldsymbol{\theta}$ from a population that has been standardized such that $\boldsymbol{\theta}_n \sim N_p(\mathbf{0}, \mathbf{R})$. The general FA model assumes there are m underlying factors, denoted by $\theta_{01}, \theta_{02}, \ldots, \theta_{0m}$, and the relationship between the latent traits and the higher-order factors can be written as follows:

$$\theta_j = \beta_{j1}\theta_{01} + \beta_{j2}\theta_{02} + \cdots + \beta_{jm}\theta_{0m} + \eta_j, \quad \text{for } j = 1, 2, \ldots, p, \text{ and } m < p. \tag{2.11}$$

In matrix form,

$$\boldsymbol{\theta} = \mathbf{B}\boldsymbol{\theta}_0 + \boldsymbol{\eta}, \tag{2.12}$$

where the common factors are assumed to have a multivariate normal distribution such that $\boldsymbol{\theta}_0 \sim N(\mathbf{0}, \mathbf{I})$; the weight matrix $\mathbf{B}$ measures the contribution of the common factors to each latent trait; and the residuals $\boldsymbol{\eta}$, representing specific factors, are ordinarily assumed to follow $N(\mathbf{0}, \boldsymbol{\Psi})$, where $\boldsymbol{\Psi} = \mathrm{diag}(\psi_1, \psi_2, \ldots, \psi_p)$. The variance of the specific ability $\theta_j$ can be represented as follows:

$$\mathrm{Var}(\theta_j) = \sum_{k=1}^{m} \beta_{jk}^2 + \psi_j. \tag{2.13}$$

If the domain ability is standardized, the residual variance for the jth domain ability can be written as follows:

$$\psi_j = 1 - \sum_{k=1}^{m} \beta_{jk}^2, \tag{2.14}$$

that is, the error term in Equations (2.7) and (2.8) need not follow $N(\mathbf{0}, \mathbf{I})$, as described in the study of Sheng and Wikle (2008), when the domain abilities are simultaneously constrained to follow the standard normal distribution. However, the constraints they placed on the residual variances resulted in

a scale transformation in which the variances of the domain abilities had to be greater than unity, so that the factor loadings were not necessarily recovered well in the simulation design. Apparently, the responses generated from a standard normal distribution for each domain ability were fitted to a misspecified model, and consequently it was much harder to obtain correct parameter estimates (Sheng & Wikle, 2008, p. 420). Furthermore, their original study produced an underidentified model because the number of unknown quantities exceeded the number of structural equations (Johnson, 1998; Kelloway, 1998). For instance, there were two subtests and one common factor in their simulation studies (Sheng & Wikle, 2008, pp. 420-422), thus requiring at least 4 equations to estimate 4 unknown parameters, but only 3 equations were available. As a consequence, no unique solution exists, because the number of common factors was greater than the number of domain abilities minus 1, divided by 2; equivalently, $m > (p - 1)/2$. The other criteria for judging model identification can be summarized as follows: when $m \le (p - 1)/2$ and $m = 1$, exactly one solution exists; when $m \le (p - 1)/2$ but $m \ge 2$, there is no unique solution, since any solution can be rotated into an infinite number of other solutions (Johnson, 1998). Thus, the reason that Sheng and Wikle (2008) did not report the parameter recovery for factor loadings may be that different solutions were obtained across replications, so that they investigated only the model comparison approaches; however, evaluations of model adequacy may be misleading with underidentified models. Nonetheless, the item parameter estimates were recovered quite well in their study, provided that the domain dimension of each item was identified correctly, regardless of the latent trait structure (Sheng & Wikle, 2008).
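The scale argument above can be checked with a small simulation. The sketch below is an illustration only: the function name `simulate_domain_abilities`, the loadings 0.8 and 0.6, the sample size, and the seed are assumptions made for this example. It draws domain abilities from Equation (2.7) twice, once with residuals following N(0, 1) as in the original constraint, and once with residual variance $1 - \beta_v^2$ as in Equation (2.14) with a single common factor.

```python
import numpy as np


def simulate_domain_abilities(beta, n, standardized, seed=0):
    """Draw theta_0 ~ N(0, 1) and domain abilities from Eq. (2.7) with one common factor.

    standardized=False uses eps ~ N(0, 1), the original constraint;
    standardized=True uses residual variance 1 - beta**2, as in Eq. (2.14) with m = 1.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    theta0 = rng.standard_normal(n)
    resid_sd = np.sqrt(1.0 - beta**2) if standardized else np.ones_like(beta)
    eps = rng.standard_normal((n, beta.size)) * resid_sd
    return theta0, theta0[:, None] * beta + eps


beta = np.array([0.8, 0.6])  # hypothetical loadings, illustrative only
for standardized in (False, True):
    theta0, theta = simulate_domain_abilities(beta, 200_000, standardized)
    variances = theta.var(axis=0)
    corrs = np.array([np.corrcoef(theta0, theta[:, v])[0, 1] for v in range(beta.size)])
    print("standardized:", standardized, "variances:", variances, "correlations:", corrs)
```

With unit residual variance, the population variance of the first domain ability is $1 + 0.8^2 = 1.64$ and its correlation with the overall ability is $0.8/\sqrt{1.64} \approx 0.62$ rather than 0.8. With residual variance $1 - \beta_v^2$, the domain abilities have unit variance and each $\beta_v$ equals the correlation with the overall ability, which is the factor loading interpretation.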
This good item parameter recovery implies that the kind of hierarchical structure imposed on the MIRT model has little influence on the item parameter estimates, which will be less biased as long as the generated or empirical data fit MIRT models. As for the second Bayesian MIRT model with a hierarchical structure proposed by Sheng and Wikle (2008), the structure of the linear relationship between the overall and domain abilities was clearly inconsistent with the principal components analysis model. The aim of principal components analysis is to find the first principal component, which accounts for as much of the variability in the data as possible, and the second and succeeding components, which account for as much of the remaining variability as possible (Johnson, 1998). If there is a p-variate latent trait vector $\boldsymbol{\theta}$ for each examinee in a population, p components will be sequentially generated through principal components analysis, and the first component is usually viewed as the most important, since it accounts for most of the variability present. There are

several approaches for deciding the number of components to be retained. The most popular is to retain components with eigenvalues greater than 1 in the standardized data set; the remaining components are probably less important and can be ignored to reduce the dimensionality of the data set. The largest eigenvalue represents the variance of the new composite scores formed by the linear function of the p latent traits, and its corresponding eigenvector gives the weights of that linear combination. According to the definition of principal components analysis, the linear function is just a linear transformation, so the predictive residual term that appears in Equations (2.9) and (2.10) should not be present. Since the second Bayesian MIRT model with a hierarchical structure was not actually built on principal components, the results the authors reported, including whether the principal components analysis model was better than the factor analysis model for a particular data set, were misleading; likewise, parameter recovery was again neglected, owing to the false assumption about the relationship between overall ability and domain abilities under the principal components approach. Either the factor analysis model or the principal components analysis model is a proper approach for constructing MIRT models with a hierarchical structure if the purpose of the test is to obtain an overall ability or total score to assist cognitive diagnosis and evaluate applicant quality. In the context of this study, only one common factor and one principal component were required in the factor analysis approach and the principal components analysis approach, respectively. According to the definition of the factor analysis model described above, a unique solution is guaranteed when there is only one overall ability with more than three domain abilities.
In addition, under the definition of principal components analysis there is only one solution for the largest eigenvalue and its corresponding eigenvector; this eigenvalue, which accounts for most of the variability, and its eigenvector are what this approach uses to represent the variance of the overall ability (composite score) and the weights of each domain ability in the linear transformation, respectively. However, the meanings of overall ability under the factor analysis model and the principal components analysis model are quite different, because the former tries to account for as much of the correlation among the domain abilities as possible, whereas the latter tries to account for as much of the variation in the domain abilities as possible. Which method should be used in data analysis is determined by the beliefs and philosophy of the researcher. However, the principal components analysis approach in the MIRT model with a hierarchical structure faces some problems. For example, the estimation of latent traits is always emphasized in IRT family models, and the estimation of

overall ability by a linear transformation of the domain abilities does not represent the latent trait structure. Furthermore, the largest eigenvalue and its corresponding eigenvector are retained to construct the new composite score, but the remaining components are discarded from the item response model. This will result in a relatively poorer model fit, since minor components are excluded from the model. Consequently, this study aims to evaluate the efficiency of the factor analysis approach by comparing its estimation of overall ability with the other convenient estimation methods described above, and to investigate, through model fit checking procedures, the situations in which that approach is appropriate. Hence, the term hierarchical structure item response model (HSIRM) used in the following sections refers to the factor analysis approach described above. Furthermore, it should be noted that the terms overall and domain abilities used in this study differ from the terminology of the GS model (general factor, specific factor model), because the overall and domain abilities refer to second- and first-order latent traits, respectively; namely, a higher-order model is implemented instead of a nested factor model (Gustafsson & Balke, 1993). Finally, the researcher will develop the CAT procedure based on the hierarchical structure item response model and further control item exposure and test overlap through the use of novel item selection rules.

2.5 Computerized Adaptive Testing

The purpose of CAT is to estimate each examinee's ability accurately with a small number of tailored items. As the field has developed, many issues have emerged concerning variables that affect the efficiency of CAT and need further investigation. In the following sections, the author will introduce important topics from the field of CAT, including item selection procedures and rules, ability estimation, item exposure control, and test overlap control.
2.5.1 Item Selection Procedures

Before conducting CAT with examinees, it is necessary to build an item pool through a series of pretests based on the intended IRT model. The better the quality of an item pool, the better the job the adaptive algorithm can do (Flaugher, 2000). The quality of an item pool can be assessed according to pool size, item parameter characteristics, and content structure (Chang & Ansley, 2003). Building a good pool requires coordinated efforts from a group that includes psychometricians, educational experts, test content experts, and test developers. Although this study aims to investigate the implementation of new models in a CAT

environment through simulation experiments, the importance of the quality of the item pool must be kept in mind. There are three main kinds of item selection procedures: maximum information selection (van der Linden & Glas, 2000; van der Linden & Pashley, 2000), Bayesian selection (Owen, 1969, 1975), and Kullback-Leibler information selection (Chang & Ying, 1996). Maximum information selection is the most common method and will be discussed in detail below. The idea of Bayesian selection is to select the item in the pool that minimizes the expected variance of the posterior distribution of the provisional ability parameter, given the items the examinee has already been administered. In Kullback-Leibler information selection, candidate items are evaluated by global rather than local information, using the distance between likelihoods (Chang & Ying, 1996).

Maximum Information Selection

The purpose of this item selection procedure is to administer to an examinee the item that provides the most information at the estimated trait level (van der Linden & Pashley, 2000). The information of an item at each ability level can be calculated through the Fisher information function, which represents the precision of measurement (Embretson & Reise, 2000; van der Linden & Glas, 2000). Under UIRT models, the information function can be formulated as in Equation (2.15):

$$I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)[1 - P_i(\theta)]}, \tag{2.15}$$

where $P_i(\theta)$ is the probability of a correct response to item i given latent trait $\theta$ and $P_i'(\theta)$ is the first derivative of $P_i(\theta)$. As the CAT begins, the initial ability is generally set at zero for each examinee except under specific circumstances.
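A minimal sketch of maximum information selection, assuming 3PL items and a small hypothetical item pool, is given below. The function names `item_information` and `select_max_info` and the pool values are illustrative assumptions, not part of the study; exposure and overlap constraints are deliberately left out. The information is computed directly from Equation (2.15) using the analytic derivative of the 3PL response function.

```python
import numpy as np


def item_information(theta, a, b, c):
    """Fisher information of Eq. (2.15) for 3PL items (c = 0 gives the 2PL; a = 1 and c = 0 the Rasch model)."""
    logistic = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    p = c + (1.0 - c) * logistic                      # P_i(theta), Eq. (2.3)
    dp = a * (1.0 - c) * logistic * (1.0 - logistic)  # P_i'(theta)
    return dp**2 / (p * (1.0 - p))


def select_max_info(theta_hat, a, b, c, administered=()):
    """Index of the unused item with the largest information at theta_hat."""
    a, b, c = (np.asarray(x, dtype=float) for x in (a, b, c))
    info = item_information(theta_hat, a, b, c)
    for j in administered:  # items already given cannot be selected again
        info[j] = -np.inf
    return int(np.argmax(info))


# Hypothetical three-item pool (values are illustrative only).
a = [1.0, 2.0, 1.5]
b = [0.0, 0.0, 2.0]
c = [0.0, 0.0, 0.0]
first = select_max_info(0.0, a, b, c)                         # CAT starts at theta = 0
second = select_max_info(0.0, a, b, c, administered={first})  # same theta, first item excluded
```

In this toy pool the second item (index 1) is chosen first, because it is the most discriminating item located at the starting ability of zero; once it is excluded, index 0 is chosen. In an operational CAT the provisional ability would be re-estimated between the two selections.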
After an examinee completes the first item, which is the item that maximizes the information function at a trait level of zero while meeting the other constraints (e.g., item exposure control or test overlap control), the provisional ability estimate is updated given the responses he or she has made. The CAT algorithm then looks for the item with maximum information at the updated ability level and administers that item to the same examinee, provided all constraints have been met. The process is repeated until the stopping criterion, fixed length or fixed precision, is met. For computational purposes, the general form of the item information function in

Equation (2.15) can be written in terms of the 3PL, 2PL, and 1PL UIRT models as follows:

$$I_i(\theta) = a_i^2\left\{\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}\right\}^2 \frac{1 - c_i}{c_i + \exp[a_i(\theta - b_i)]}, \tag{2.16}$$

$$I_i(\theta) = a_i^2\,\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}\left\{1 - \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}\right\}, \tag{2.17}$$

and

$$I_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}\left[1 - \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}\right], \tag{2.18}$$

respectively (Embretson & Reise, 2000; Hambleton & Swaminathan, 1985).

2.5.2 Ability Estimation

Before the next item with maximum information is selected for an examinee, a provisional ability estimate should be computed from the responses to the items administered so far. The ability estimate for an examinee is updated after each selected item is administered. Finally, once the stopping criterion is met, the final ability estimate is calculated from the responses to all administered items. There are three important ability estimators, namely, the maximum likelihood estimator (MLE), the maximum a posteriori (MAP) estimator, and the expected a posteriori (EAP) estimator.

Maximum Likelihood Estimator

Assuming an examinee responds to I items, one can compute the likelihood function by multiplying the probability functions of all I items under the assumption of local item independence. The likelihood function can be expressed as:

$$L(\theta \mid \mathbf{X}) = P(\mathbf{X} \mid \theta) = \prod_{i=1}^{I} P_i(\theta)^{x_i}[1 - P_i(\theta)]^{1 - x_i}, \tag{2.19}$$

where $X$ indicates a response vector; $x_i = 1$ if the examinee answers item $i$ correctly; and $x_i = 0$ otherwise. Because the probability for each item lies between zero and one, the product of the probabilities over the administered items can become too small to maintain numerical precision. To overcome this computational difficulty, the natural logarithm of the likelihood rather than the raw likelihood is usually used in further calculations. In addition, the maximum cannot be found analytically by setting the first derivative of the log-likelihood to zero, because the resulting equation has no closed-form solution. One popular alternative is an iterative Newton-Raphson procedure. The ratio of the first derivative to the second derivative of the log-likelihood is computed, and the new ability estimate is obtained by subtracting this ratio from the previous ability estimate. The procedure is repeated until the absolute value of the ratio is less than a pre-specified value such as 0.001. The strength of the MLE is that it is asymptotically efficient and unbiased, but it requires a large item pool, and no trait level estimate is available until the examinee has both endorsed and not endorsed at least one item (Embretson & Reise, 2000).

Maximum A Posteriori (MAP)

To overcome the critical problem with the MLE that no trait level can be estimated for examinees with all-endorsed or all-not-endorsed response vectors, one may combine prior information about the unknown parameter with the likelihood function to form a posterior distribution. Assume that the latent trait follows a normal distribution $N(\mu, \sigma^2)$, whose hyper-parameters can be obtained from empirical information or from experts' subjective judgment. The likelihood function and the prior information can then be combined as follows:

$$ g(\theta \mid X) \propto L(\theta \mid X)\, g(\theta), \tag{2.20} $$

where $g(\theta)$ is the prior distribution for the trait parameter; $L(\theta \mid X)$ is the likelihood function for the latent trait given the response vector; and $g(\theta \mid X)$ is the posterior distribution of the ability parameter for that examinee. In practice the logarithm of the posterior, $\log L(\theta \mid X) + \log g(\theta)$ up to an additive constant, is maximized. Because the IRT likelihood and the normal prior are not conjugate, the posterior distribution has no closed-form solution, so the iterative Newton-Raphson procedure has to be used to find the maximum. Apart from the additional prior term in Equation (2.20), the algorithm for finding the value that maximizes the posterior function is the same as for the MLE. The estimated ability that maximizes the posterior equals the posterior mode, which is why the procedure is sometimes called Bayes modal estimation.

The advantage of the MAP estimator is that ability estimates can be obtained for any response pattern. Moreover, the prior distribution can increase the precision of the trait level estimate. However, the MAP estimator has some drawbacks. The critical problem is that MAP estimates are biased when the number of items is small. Another issue is that the ability estimates will be seriously biased and misleading if the wrong prior distribution is adopted (Embretson & Reise, 2000). In sum, the more items administered to an examinee, the less the prior information influences the ability estimate.

Expected A Posteriori

Similar to the MAP estimator, the EAP estimator combines prior information about the unknown parameter with the likelihood function. The difference between the two is that the EAP estimator does not involve an iterative procedure, and its estimate is the mean of the posterior distribution rather than a mode, as in the MAP estimator or the MLE (Bock & Mislevy, 1982). The formula can be expressed in the form of an expected value as follows:

$$ E(\theta \mid X) = \frac{\int \theta \, L(\theta \mid X)\, g(\theta)\, d\theta}{\int L(\theta \mid X)\, g(\theta)\, d\theta}. \tag{2.21} $$

The strength of the EAP estimator is that the procedure is non-iterative and therefore computationally faster than iterative methods, although this advantage holds only for UIRT models. Because it is based on Bayesian estimation, the EAP estimator yields finite trait level estimates for all response patterns. Furthermore, the EAP estimator has the minimum mean square error over the population of ability (Bock & Mislevy, 1982).
On the other hand, the EAP estimator will be biased and misleading when the test contains only a limited number of items and an incorrect prior distribution is used (Embretson & Reise, 2000). In addition, Wainer and Thissen (1987) found that the ability estimates are seriously regressed toward the mean of the prior distribution when the number of administered items is small. As with the MAP estimator, the effect of the prior information diminishes as the number of administered items increases.
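The item selection rule and the three estimators described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in this study. The function names (`p3pl`, `item_information`, `select_item`, `derivatives`, `newton_raphson`, `eap`), the item pool, and the response pattern are hypothetical. The sketch assumes a scaling constant of D = 1, a normal prior with mean zero, and an equally spaced grid of 61 points on [-4, 4] in place of a formal quadrature rule. The derivatives are the one-dimensional 3PL forms, and the item information follows Equation (2.16).

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (scaling constant D = 1 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """3PL item information as in Equation (2.16); c = 0 gives the 2PL case."""
    e = math.exp(a * (theta - b))
    return a ** 2 * (e / (1.0 + e)) ** 2 * (1.0 - c) / (c + e)

def select_item(theta, pool, administered):
    """Index of the unused item with maximum information at theta."""
    unused = [i for i in range(len(pool)) if i not in administered]
    return max(unused, key=lambda i: item_information(theta, *pool[i]))

def derivatives(theta, items, x, prior_sd=None):
    """First and second derivatives of the log-likelihood; when prior_sd is
    given, the derivatives of an assumed N(0, prior_sd^2) log-prior are added."""
    d1 = d2 = 0.0
    for (a, b, c), xi in zip(items, x):
        p = p3pl(theta, a, b, c)
        d1 += a * (p - c) * (xi - p) / ((1.0 - c) * p)
        d2 += (a ** 2 * (1.0 - p) * (p - c) * (c * xi - p ** 2)
               / (p ** 2 * (1.0 - c) ** 2))
    if prior_sd is not None:
        d1 -= theta / prior_sd ** 2
        d2 -= 1.0 / prior_sd ** 2
    return d1, d2

def newton_raphson(items, x, prior_sd=None, start=0.0, tol=1e-3, max_iter=50):
    """MLE (prior_sd=None) or MAP estimate by Newton-Raphson iteration."""
    theta = start
    for _ in range(max_iter):
        d1, d2 = derivatives(theta, items, x, prior_sd)
        step = d1 / d2      # ratio of the first to the second derivative
        theta -= step       # new estimate = previous estimate minus the ratio
        if abs(step) < tol:
            break
    return theta

def eap(items, x, prior_sd=1.0, n_points=61, lo=-4.0, hi=4.0):
    """Posterior mean on an equally spaced grid, approximating Equation (2.21)."""
    num = den = 0.0
    for q in range(n_points):
        theta = lo + (hi - lo) * q / (n_points - 1)
        like = 1.0
        for (a, b, c), xi in zip(items, x):
            p = p3pl(theta, a, b, c)
            like *= p if xi == 1 else 1.0 - p
        weight = like * math.exp(-0.5 * (theta / prior_sd) ** 2)
        num += theta * weight
        den += weight
    return num / den

# Hypothetical 2PL item pool (a, b, c) and one mixed response pattern.
items = [(1.0, -1.0, 0.0), (1.2, 0.0, 0.0), (0.8, 1.0, 0.0), (1.5, 0.5, 0.0)]
x = [1, 1, 0, 0]
first_item = select_item(0.0, items, set())
theta_mle = newton_raphson(items, x)
theta_map = newton_raphson(items, x, prior_sd=1.0)
theta_eap = eap(items, x)
theta_eap_all_correct = eap(items, [1, 1, 1, 1])
```

The example pool uses 2PL items (c = 0) because the 2PL log-likelihood is concave, so the Newton-Raphson iteration behaves well from a starting value of zero. With nonzero guessing parameters the log-likelihood need not be concave, and a practical implementation would likely need safeguards such as step-size limits. The MLE is deliberately not computed for the all-correct pattern, since no finite maximum exists there, whereas the EAP estimate remains finite.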

2.5.3 Item Selection and Scoring for MIRT Models

When a CAT involves two or more abilities, a unidimensional CAT (UCAT) is no longer appropriate and a multidimensional CAT (MCAT) is required. A notable benefit of MCAT is that it can provide equal or higher precision with about one-third fewer items than required by UCAT (Segall & Moreno, 1999). The following presents the conceptual formulations and computational procedures for item selection and scoring in the MCAT environment, as originally proposed by Segall (1996).

MLE Scoring and Item Selection for MIRT Models

Analogous to the MLE procedure in the UCAT environment, the likelihood function is obtained by multiplying the probability functions according to the examinee's observed responses as follows:

$$ L(\theta \mid X) = P(X \mid \theta) = \prod_{i \in v} P_i(\theta)^{x_i} \left[ 1 - P_i(\theta) \right]^{1 - x_i}, \tag{2.22} $$

where $X$ and $\theta$ indicate a response vector and an ability vector with $p$ elements, respectively; $v$ is a vector that contains the identifiers of the administered items; and the probability function $P_i(\theta)$ can be derived from any MIRT model, with the M3PLM used here for demonstration (Hattie, 1981; Reckase, 1985). The next step is to find the $p$ ability estimates that maximize the log-likelihood function by setting the $p$ first partial derivatives to zero. This can be expressed as follows:

$$ \frac{\partial}{\partial \theta} \log L(\theta \mid X) = \begin{bmatrix} \dfrac{\partial}{\partial \theta_1} \log L(\theta \mid X) \\ \dfrac{\partial}{\partial \theta_2} \log L(\theta \mid X) \\ \vdots \\ \dfrac{\partial}{\partial \theta_p} \log L(\theta \mid X) \end{bmatrix} = 0, \tag{2.23} $$

where

$$ \frac{\partial}{\partial \theta_k} \log L(\theta \mid X) = \sum_{i \in v} \frac{a_{ki} \left[ P_i(\theta) - c_i \right] \left[ x_i - P_i(\theta) \right]}{(1 - c_i)\, P_i(\theta)}, \tag{2.24} $$

for $k = 1, 2, \ldots, p$. Because Equation (2.23) has no closed-form solution, the iterative Newton-Raphson procedure can again be used to find the maximum. Let $\theta^{(j)}$ be the $j$th approximation to the value of $\theta$ that maximizes the log-likelihood function. A better approximation can then be written as:

$$ \theta^{(j+1)} = \theta^{(j)} - \delta^{(j)}, \tag{2.25} $$

where $\delta^{(j)}$ is the $p \times 1$ vector

$$ \delta^{(j)} = \left[ H\!\left(\theta^{(j)}\right) \right]^{-1} \times \frac{\partial}{\partial \theta} \log L\!\left(\theta^{(j)} \mid X\right). \tag{2.26} $$

The matrix $H(\theta^{(j)})$ is the Hessian matrix of second partial derivatives of the log-likelihood function, evaluated at $\theta^{(j)}$. The Hessian matrix is a $p \times p$ symmetric matrix and can be expressed as follows:

$$ H(\theta) = \left. \begin{bmatrix} \dfrac{\partial^2}{\partial \theta_1^2} \log L(\theta \mid X) & \dfrac{\partial^2}{\partial \theta_1 \partial \theta_2} \log L(\theta \mid X) & \cdots & \dfrac{\partial^2}{\partial \theta_1 \partial \theta_p} \log L(\theta \mid X) \\ & \dfrac{\partial^2}{\partial \theta_2^2} \log L(\theta \mid X) & \cdots & \dfrac{\partial^2}{\partial \theta_2 \partial \theta_p} \log L(\theta \mid X) \\ & & \ddots & \vdots \\ & & & \dfrac{\partial^2}{\partial \theta_p^2} \log L(\theta \mid X) \end{bmatrix} \right|_{\theta = \theta^{(j)}}, \tag{2.27} $$

where each element in Equation (2.27) can be computationally simplified as

$$ \frac{\partial^2}{\partial \theta_k \partial \theta_l} \log L(\theta \mid X) = \sum_{i \in v} \frac{a_{ki}\, a_{li} \left[ 1 - P_i(\theta) \right] \left[ P_i(\theta) - c_i \right] \left[ c_i x_i - P_i^2(\theta) \right]}{P_i^2(\theta)\, (1 - c_i)^2}. \tag{2.28} $$

The iterative procedure is repeated until the elements of $\theta^{(j)}$ change very little from one iteration to the next. To ensure that the iteration converges, the Hessian matrix