

4.3 Study 3: Modern Item Selection procedures in CAT based on the

4.3.2 Overall RMSEs and Test Mean Overlap Rate under CAT based on

4.3.2.1 1PL-HSIRM under High Ability Correlation

Figures 4.19, 4.20, 4.21, and 4.22 show the results obtained with the point Fisher information (PFI) method and the modern item selection procedures (the PG and the PP methods) under the 1PL-HSIRM with a high ability correlation. Each figure reports the overall RMSEs of the overall and domain ability estimates together with the test mean overlap rate, for a 600-item pool with a test length of 30 items, a 1200-item pool with a test length of 30 items, a 600-item pool with a test length of 60 items, and a 1200-item pool with a test length of 60 items, respectively. The pattern of results was similar across item pool sizes and test lengths. Compared with the PFI method, both the PG and the PP methods substantially reduced the test mean overlap rate. In addition, the higher the acceleration parameter, the lower the test mean overlap rate under the PG and the PP methods. The tendency toward larger overall RMSEs as the acceleration parameter increased was not pronounced in most conditions, except for the PP method. With both the multidimensional CAT and the HSIRM-CAT approaches, the PP method yielded a smaller test mean overlap rate but larger overall RMSEs.
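As a reading aid for the two outcome measures reported in these figures, the following minimal Python sketch shows how an RMSE and a mean pairwise test overlap rate could be computed from simulation output. It assumes that the test mean overlap rate is the average proportion of items shared by two examinees, taken over all pairs of examinees; the function names and the toy data are hypothetical and are not taken from this study.

```python
import math
from itertools import combinations


def rmse(true_values, estimates):
    """Root mean squared error between true and estimated abilities."""
    squared = [(e - t) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def mean_overlap_rate(tests):
    """Average share of items that two examinees have in common.

    `tests` is a list of sets of administered item ids, one set per
    examinee, all with the same (fixed) test length.
    """
    test_length = len(tests[0])
    pairs = list(combinations(tests, 2))
    shared = sum(len(a & b) for a, b in pairs)
    return shared / (test_length * len(pairs))


# Toy data (hypothetical): three examinees, three items each.
administered = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
overlap = mean_overlap_rate(administered)
error = rmse([0.0, 1.0], [0.5, 0.5])
```

In this toy case the three pairs share 2, 1, and 2 items, so the overlap rate is 5/9; a selection rule that spreads administrations more evenly over the pool lowers this value.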

With the PFI method, both the multidimensional CAT and the HSIRM-CAT approaches were superior to the remaining approach in terms of overall RMSEs in all conditions. The PFI method, however, was not always the selection rule that provided the most accurate estimates. When the acceleration parameter of the PG method was reduced to improve the precision of the ability estimates, the difference in overall RMSEs between the PFI and the PG methods became very small. This indicates that the PG method, particularly with the multidimensional CAT and the HSIRM-CAT approaches, takes item pool security and estimation accuracy into account simultaneously. When SHOF and content balancing controls were incorporated into the CAT, SHOF had a larger effect on the test overlap rate than content balancing control did. The test mean overlap rate decreased considerably under the PFI method but to a lesser extent under the PG and the PP methods. The impact on overall RMSEs, however, was small for all methods.

When the item pool was larger, the test mean overlap rate decreased while the RMSEs remained stable, as can be seen by comparing Figure 4.19 with Figure 4.20 and Figure 4.21 with Figure 4.22. When the test length increased from 30 to 60 items, the test mean overlap rate increased but the RMSEs decreased, as can be seen by comparing Figure 4.19 with Figure 4.21 and Figure 4.20 with Figure 4.22.
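The direction of both effects on the overlap rate is what a simple lower bound would lead one to expect. As a rough sketch, assume N examinees, a pool of n items, a fixed test length L, and item exposure rates er_i (these symbols are introduced here for illustration and are not taken from this study), and take the overlap rate to be the mean proportion of items shared by a pair of examinees. Counting shared items pair by pair then gives:

```latex
\bar{T} \;=\; \frac{N}{L\,(N-1)} \sum_{i=1}^{n} er_i^{2} \;-\; \frac{1}{N-1},
\qquad \sum_{i=1}^{n} er_i = L .
% The sum of squares is smallest when every item has the same exposure rate, er_i = L/n:
\bar{T}_{\min} \;=\; \frac{N L / n - 1}{N-1} \;\approx\; \frac{L}{n}
\quad \text{for large } N .
```

Under these assumptions the floor is about 30/600 = 0.05 and 30/1200 = 0.025 for the 30-item tests, and about 60/600 = 0.10 and 60/1200 = 0.05 for the 60-item tests. Doubling the pool halves the floor and doubling the test length doubles it, whichever selection rule is used.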

In general, the HSIRM-CAT approach with the PG method can improve item bank security with little or no loss in measurement precision and can provide test information for the duration of the CAT. Under the 1PL-HSIRM with a high ability correlation, this alternative had RMSEs equivalent to those of the PFI method but a lower test mean overlap rate, and it came with standard errors of measurement for the overall and domain ability estimates, compared with the multidimensional CAT approach under the same control procedures.

Figure 4.19. Overall RMSE and Test Mean Overlap for the 600-item Pool with a Test Length of 30 Items in 1PL-HSIRM under High Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.20. Overall RMSE and Test Mean Overlap for the 1200-item Pool with a Test Length of 30 Items in 1PL-HSIRM under High Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.21. Overall RMSE and Test Mean Overlap for the 600-item Pool with a Test Length of 60 Items in 1PL-HSIRM under High Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.22. Overall RMSE and Test Mean Overlap for the 1200-item Pool with a Test Length of 60 Items in 1PL-HSIRM under High Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

4.3.2.2 1PL-HSIRM under Low Ability Correlation

Figures 4.23, 4.24, 4.25, and 4.26 show the results obtained with the PFI, the PG, and the PP methods under the 1PL-HSIRM with a low ability correlation. Each figure reports the overall RMSEs of the overall and domain ability estimates together with the test mean overlap rate, for a 600-item pool with a test length of 30 items, a 1200-item pool with a test length of 30 items, a 600-item pool with a test length of 60 items, and a 1200-item pool with a test length of 60 items, respectively. The pattern of results was similar across item pool sizes and test lengths. Compared with the PFI method, both the PG and the PP methods substantially reduced the test mean overlap rate. In addition, the higher the acceleration parameter, the lower the test mean overlap rate under the PG and the PP methods. The tendency toward larger overall RMSEs as the acceleration parameter increased was not pronounced in most conditions, except for the PP method. With both the multidimensional CAT and the HSIRM-CAT approaches, the PP method yielded a smaller test mean overlap rate but larger overall RMSEs.

With the PFI method, both the multidimensional CAT and the HSIRM-CAT approaches were superior to the remaining approach in terms of overall RMSEs in all conditions. The PFI method, however, was not always the selection rule that provided the most accurate estimates. When the acceleration parameter of the PG method was reduced to improve the precision of the ability estimates, the difference in overall RMSEs between the PFI and the PG methods became very small. This indicates that the PG method, particularly with the multidimensional CAT and the HSIRM-CAT approaches, takes item pool security and estimation accuracy into account simultaneously. When SHOF and content balancing controls were incorporated into the CAT, SHOF had a larger effect on the test overlap rate than content balancing control did. The test mean overlap rate decreased considerably under the PFI method but to a lesser extent under the PG and the PP methods. The impact on overall RMSEs, however, was small for all methods.

When the item pool was larger, the test mean overlap rate decreased while the RMSEs remained stable, as can be seen by comparing Figure 4.23 with Figure 4.24 and Figure 4.25 with Figure 4.26. When the test length increased from 30 to 60 items, the test mean overlap rate increased but the RMSEs decreased, as can be seen by comparing Figure 4.23 with Figure 4.25 and Figure 4.24 with Figure 4.26. Compared with the high ability correlation conditions described in the previous section, the RMSEs increased but the test mean overlap rate changed little.

In sum, the HSIRM-CAT approach with the PG method can improve item bank security with little or no loss in measurement precision and can provide test information for the duration of the CAT. Under the 1PL-HSIRM with a low ability correlation, this alternative had RMSEs equivalent to those of the PFI method but a lower test mean overlap rate, and it came with standard errors of measurement for the overall and domain ability estimates, compared with the multidimensional CAT approach under the same control procedures.

Figure 4.23. Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 30 Items in 1PL-HSIRM under Low Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.24. Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 30 Items in 1PL-HSIRM under Low Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.25. Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 60 Items in 1PL-HSIRM under Low Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.26. Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 60 Items in 1PL-HSIRM under Low Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

4.3.2.3 2PL-HSIRM under High Ability Correlation

Figures 4.27, 4.28, 4.29, and 4.30 show the results obtained with the PFI, the PG, and the PP methods under the 2PL-HSIRM with a high ability correlation. Each figure reports the overall RMSEs of the overall and domain ability estimates together with the test mean overlap rate, for a 600-item pool with a test length of 30 items, a 1200-item pool with a test length of 30 items, a 600-item pool with a test length of 60 items, and a 1200-item pool with a test length of 60 items, respectively. The pattern of results was similar across item pool sizes and test lengths. Compared with the PFI method, both the PG and the PP methods substantially reduced the test mean overlap rate. The higher the acceleration parameter, the lower the test mean overlap rate under the PG and the PP methods. Unlike under the 1PL-HSIRM, the RMSEs increased with the acceleration parameter for both the PG and the PP methods in most conditions. This suggests that, for the modern item selection rules, the pattern of results agrees with the earlier study by Barrada et al. (2008) only when multiple-parameter IRT models are implemented. In addition, with both the multidimensional CAT and the HSIRM-CAT approaches, the PP method yielded a smaller test mean overlap rate but larger overall RMSEs.
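To illustrate why a larger acceleration parameter can trade precision for security, the sketch below implements one progressive-type selection rule with an acceleration parameter, in the spirit of Barrada et al. (2008). It is an assumption that the PG method behaves like this rule; the weight formula, the function names, and the toy information values are illustrative and may differ from the definitions used in this study.

```python
import random


def progressive_weight(position, test_length, acceleration):
    """Weight placed on item information for the item at `position` (1-based).

    Assumed form: 0 for the first item, then the running sum of
    (f - 1) ** acceleration divided by its total over the whole test.
    """
    if position == 1:
        return 0.0
    numerator = sum((f - 1) ** acceleration for f in range(1, position + 1))
    denominator = sum((f - 1) ** acceleration for f in range(1, test_length + 1))
    return numerator / denominator


def select_item(information, position, test_length, acceleration, rng):
    """Return the index maximizing a blend of a random draw and information."""
    weight = progressive_weight(position, test_length, acceleration)
    ceiling = max(information)
    scores = [
        (1.0 - weight) * rng.uniform(0.0, ceiling) + weight * info
        for info in information
    ]
    return max(range(len(information)), key=scores.__getitem__)


# Hypothetical information values for five candidate items.
info = [0.10, 0.25, 0.05, 0.20, 0.15]
rng = random.Random(0)
last_pick = select_item(info, position=30, test_length=30, acceleration=1, rng=rng)
mid_weight_low = progressive_weight(15, 30, acceleration=1)
mid_weight_high = progressive_weight(15, 30, acceleration=2)
```

Under these assumptions the weight on information halfway through a 30-item test is smaller for an acceleration of 2 than for an acceleration of 1, so random selection dominates for longer. That is consistent with a lower overlap rate and, where items differ in discrimination, with larger RMSEs.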

With the PFI method, both the multidimensional CAT and the HSIRM-CAT approaches were superior to the remaining approach in terms of overall RMSEs in all conditions. The PFI method, however, was not always the selection rule that provided the most accurate estimates. When the acceleration parameter of the PG method was reduced to improve the precision of the ability estimates, the difference in overall RMSEs between the PFI and the PG methods became very small. This indicates that the PG method, particularly with the multidimensional CAT and the HSIRM-CAT approaches, takes item pool security and estimation accuracy into account simultaneously. When SHOF control was incorporated into the CAT, the test mean overlap rate decreased considerably under all methods and the RMSEs increased slightly. When content balancing control was incorporated, the test mean overlap rate decreased slightly and the RMSEs remained almost the same.

When the item pool was larger, the test mean overlap rate decreased and the RMSEs decreased slightly, as can be seen by comparing Figure 4.27 with Figure 4.28 and Figure 4.29 with Figure 4.30. When the test length increased from 30 to 60 items, the test mean overlap rate increased but the RMSEs decreased, as can be seen by comparing Figure 4.27 with Figure 4.29 and Figure 4.28 with Figure 4.30. Furthermore, compared with the 1PL-HSIRM with item pool size, test length, and item selection rule held constant, the test mean overlap rate increased but the RMSEs decreased. This indicates that item pool security is more seriously threatened under the 2PL-MIRT model than under the 1PL-HSIRM in the context of CAT, even though measurement precision is improved under the 2PL-MIRT model.

In general, the HSIRM-CAT approach with the PG method can improve item bank security with little or no loss in measurement precision and can provide test information for the duration of the CAT. Under the 2PL-HSIRM with a high ability correlation, this alternative had RMSEs equivalent to those of the PFI method but a lower test mean overlap rate, and it came with standard errors of measurement for the overall and domain ability estimates, compared with the multidimensional CAT approach under the same control procedures.

Figure 4.27. Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 30 Items in 2PL-HSIRM under High Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.28. Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 30 Items in 2PL-HSIRM under High Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.29. Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 60 Items in 2PL-HSIRM under High Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.30. Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 60 Items in 2PL-HSIRM under High Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

4.3.2.4 2PL-HSIRM under Low Ability Correlation

Figures 4.31, 4.32, 4.33, and 4.34 show the results obtained with the PFI, the PG, and the PP methods under the 2PL-HSIRM with a low ability correlation. Each figure reports the overall RMSEs of the overall and domain ability estimates together with the test mean overlap rate, for a 600-item pool with a test length of 30 items, a 1200-item pool with a test length of 30 items, a 600-item pool with a test length of 60 items, and a 1200-item pool with a test length of 60 items, respectively. The pattern of results was similar across item pool sizes and test lengths. Compared with the PFI method, both the PG and the PP methods substantially reduced the test mean overlap rate. The higher the acceleration parameter, the lower the test mean overlap rate under the PG and the PP methods. Unlike under the 1PL-HSIRM, the RMSEs increased with the acceleration parameter for both the PG and the PP methods in most conditions. This suggests that, for the modern item selection rules, the pattern of results agrees with the earlier study by Barrada et al. (2008) only when multiple-parameter IRT models are implemented. In addition, with both the multidimensional CAT and the HSIRM-CAT approaches, the PP method yielded a smaller test mean overlap rate but larger overall RMSEs.

With the PFI method, both the multidimensional CAT and the HSIRM-CAT approaches were superior to the remaining approach in terms of overall RMSEs in all conditions. The PFI method, however, was not always the selection rule that provided the most accurate estimates. When the acceleration parameter of the PG method was reduced to improve the precision of the ability estimates, the difference in overall RMSEs between the PFI and the PG methods became very small. This indicates that the PG method, particularly with the multidimensional CAT and the HSIRM-CAT approaches, takes item pool security and estimation accuracy into account simultaneously. When SHOF control was incorporated into the CAT, the test mean overlap rate decreased considerably under all methods and the RMSEs increased slightly. When content balancing control was incorporated, the test mean overlap rate decreased slightly and the RMSEs remained almost the same.

When the item pool was larger, the test mean overlap rate decreased and the RMSEs decreased slightly, as can be seen by comparing Figure 4.31 with Figure 4.32 and Figure 4.33 with Figure 4.34. When the test length increased from 30 to 60 items, the test mean overlap rate increased but the RMSEs decreased, as can be seen by comparing Figure 4.31 with Figure 4.33 and Figure 4.32 with Figure 4.34. Compared with the high ability correlation conditions described in the previous section, the RMSEs increased but the test mean overlap rate changed little. Furthermore, compared with the 1PL-HSIRM with item pool size, test length, and item selection rule held constant, the test mean overlap rate increased but the RMSEs decreased. This indicates that item pool security is more seriously threatened under the 2PL-MIRT model than under the 1PL-HSIRM in the context of CAT, even though measurement precision is improved under the 2PL-MIRT model.

In sum, the HSIRM-CAT approach with the PG method can improve item bank security with little or no loss in measurement precision and can provide test information for the duration of the CAT. Under the 2PL-HSIRM with a low ability correlation, this alternative had RMSEs equivalent to those of the PFI method but a lower test mean overlap rate, and it came with standard errors of measurement for the overall and domain ability estimates, compared with the multidimensional CAT approach under the same control procedures.

Figure 4.31. Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 30 Items in 2PL-HSIRM under Low Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.32. Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 30 Items in 2PL-HSIRM under Low Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.33. Overall RMSE and Test Overlap for the 600-item Pool with a Test Length of 60 Items in 2PL-HSIRM under Low Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

Figure 4.34. Overall RMSE and Test Overlap for the 1200-item Pool with a Test Length of 60 Items in 2PL-HSIRM under Low Ability Correlation.

Note. The numbers above the scatter dots denote the acceleration parameters.

CHAPTER FIVE

DISCUSSION AND CONCLUSION

5.1 Discussion and Conclusion

Although CAT has a number of advantages over the P&P test, it is vital that the IRT models implemented for a CAT be reliable and stable. Otherwise, researchers and practitioners risk drawing incorrect conclusions. The aim of this study was to construct a higher order latent trait structure within a multidimensional IRT framework, so that overall and domain abilities can be estimated simultaneously in the context of CAT. Therefore, the multidimensional IRT model with a hierarchical structure, also referred to as the hierarchical structure item response model (HSIRM), was investigated to assess its utility and efficiency under Bayesian inference. Sheng and Wikle (2008) proposed MIRT models with a hierarchical structure and conducted several simulation studies to support their assumptions. However, inappropriate features in their simulation design rendered their findings unclear and left a number of questions unanswered. It was therefore necessary to confirm the HSIRM with an appropriate simulation design and then implement the models in a CAT context. Additionally, modern item selection rules have gained increasing attention because they can enhance item bank security and measurement precision simultaneously, two long-standing problems in the CAT literature. Three studies were conducted to explore these essential issues; they are discussed briefly below.

Study 1: Model-data Fit and Model Parameter Recovery of the HSIRM

The simulations were conducted to assess the performance of Bayesian model checking techniques, including PPMC, PsBF, and DIC, and to evaluate parameter recovery given the true model. Only the 1PL- and 2PL-HSIRMs were examined, alongside the UIRT model, the MIRT model with identical latent trait (MIRT-I), the MIRT model with singular latent trait (MIRT-S), and the hierarchical structure item response model with a high ability correlation (HSIRM-H) and with a low ability correlation (HSIRM-L). Regarding the absolute model fit criteria, five indicators were presented and incorporated into the PPMC procedure: the SD of the biserial correlations (Bis), the Bayesian chi-square test (BChi), the reproduced correlation matrix test (Rcor), the observed score covariance between the subtests test (Cov), and the identical latent trait correlation test (Id). Below are six recommendations which investigators and
