3.3 Study 3: Modern Item Selection Procedures in the Hierarchical
3.3.1 Implementation of the PG and the PP Methods into the
Because the error in the ability estimates at the beginning of a CAT is not negligible, the maximum-information item selection criterion has to be modified to take measurement error into account. Several approaches, described in the previous chapter, have been proposed to improve the accuracy of ability estimates (Chen & Ankenmann, 2004; Chen, Ankenmann, & Chang, 2000; Chang & Ying, 1996; van der Linden, 1998; van der Linden & Glas, 2000; Veerkamp & Berger, 1997) and have been extended to the MCAT environment (Shih, 2007). These modified procedures have, however, been found to be efficient mainly in short tests, and they require integration to find the maximum-information item, which makes them more difficult to compute, especially in the MCAT environment. As alternatives to the modified maximum-information procedures, the PG and the PP methods (Barrada et al., 2008; Revuelta & Ponsoda, 1998; Segall, 2004a) were implemented in this study. The major advantage of these two methods is that they reduce the importance of item information at the beginning of the test, when the ability estimates are inaccurate, and administer more informative items toward the end of the test. They do so by incorporating a random component into the maximum-information function and adjusting the speed of the transition from random item selection to maximum-information item selection. In addition, item exposure and test overlap have been shown to be controlled more satisfactorily than with the general maximum-information criterion. However, these methods have not yet been applied to the hierarchical structure item response model, and it is necessary to evaluate the effect of this control process under that model.
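The shift from random to information-based selection described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names, the weight schedule (summing from the second item so that a negative acceleration parameter stays defined), and the tuning constant `h_max` are assumptions made for illustration, loosely following the descriptions in Barrada et al. (2008). Information values are assumed to be positive.

```python
import random

def progress_weight(q, test_length, t):
    """Weight on information for the q-th item (1-based); t is the acceleration parameter.

    Assumed schedule: 0 for the first item, rising to 1 at the last item.
    """
    if q == 1:
        return 0.0
    num = sum((f - 1) ** t for f in range(2, q + 1))
    den = sum((f - 1) ** t for f in range(2, test_length + 1))
    return num / den

def select_progressive(info, administered, q, test_length, t, rng):
    """PG-style rule: blend a random component with item information."""
    s = progress_weight(q, test_length, t)
    candidates = [i for i in range(len(info)) if i not in administered]
    max_info = max(info[i] for i in candidates)
    # Random component is drawn on the same scale as the information values.
    scores = {i: (1 - s) * rng.uniform(0, max_info) + s * info[i] for i in candidates}
    return max(candidates, key=scores.get)

def select_proportional(info, administered, q, test_length, t, h_max, rng):
    """PP-style rule: selection probability proportional to information ** H.

    h_max is a hypothetical constant; H grows from 0 (uniform) to h_max.
    """
    h = h_max * progress_weight(q, test_length, t)
    candidates = [i for i in range(len(info)) if i not in administered]
    weights = [info[i] ** h for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(1)
    info = [0.2, 0.9, 0.5, 0.7]
    for t in (-1, 0, 1, 2):
        print(t, [round(progress_weight(q, 10, t), 2) for q in (1, 4, 7, 10)])
    print(select_progressive(info, {1}, q=10, test_length=10, t=1, rng=rng))
```

With the weight at 0 the choice is purely random, and with the weight at 1 both rules favour the most informative remaining item; the acceleration parameter only changes how quickly the weight moves between those extremes.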
Although the PG and the PP methods can improve item exposure control, they do not guarantee that the exposure rate of every item in the pool stays below the pre-specified criterion (Barrada et al., 2008). The freeze procedure proposed by Revuelta and Ponsoda (1998) may be implemented in the CAT context together with the PG or the PP method to increase the efficiency of these item selection rules. The order in which the freeze procedure controls the presence of an item may, however, be learned by examinees. An online version of the Sympson and Hetter procedure (Ju, 2005) combined with the freeze procedure was therefore proposed to correct this shortcoming: the so-called "SH online procedure with freeze procedure" (SHOF; Chen, 2004; Chen, 2005; Su, 2007; Wu, 2006; Wu & Chen, 2008). To ensure that item exposure rates were kept below a pre-specified criterion, the SHOF procedure was incorporated into the CAT in this study.
Of the one-stage estimations, all except the unidimensional CAT approach are special cases of the MCAT procedure. In MCAT, the item that provides the largest decrement in the size of the posterior credibility region is selected. The information matrix can be expressed as minus the expected Hessian of the log posterior, as shown in Equation (2.29). Given the administered and candidate items, the provisional ability estimates, and the prior distribution for examinees (Bayesian estimation), the item with the largest determinant of the information matrix is administered next in MCAT (see Equations (2.31) and (2.37)). Consequently, the PG and the PP methods can be extended to the MCAT environment in a straightforward way by replacing the information value used in the UCAT with the determinant of the information matrix in MCAT.
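The determinant-based criterion can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Equations (2.29), (2.31), or (2.37): it assumes a generic compensatory multidimensional 2PL item with logistic probability of a·θ − b, a normal prior whose inverse covariance is passed as `prior_precision`, and hypothetical function names. The resulting scalar is what would stand in for the item information value in a UCAT selection rule.

```python
import math

def item_info_matrix(a, b, theta):
    """Information matrix of one item under an assumed M2PL model: P(1-P) * a a^T."""
    z = sum(ak * tk for ak, tk in zip(a, theta)) - b
    p = 1.0 / (1.0 + math.exp(-z))
    w = p * (1.0 - p)
    return [[w * ai * aj for aj in a] for ai in a]

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    d = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            d = -d  # a row swap flips the sign
        d *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return d

def posterior_det(candidate, administered, theta, prior_precision):
    """det(prior precision + information of administered items + candidate item)."""
    total = [row[:] for row in prior_precision]
    for a, b in list(administered) + [candidate]:
        info = item_info_matrix(a, b, theta)
        total = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(total, info)]
    return det(total)

if __name__ == "__main__":
    theta = [0.0, 0.0]                      # provisional ability estimates
    prior = [[1.0, 0.0], [0.0, 1.0]]        # identity prior precision (assumed)
    pool = [([1.0, 0.0], 0.0), ([2.0, 0.0], 0.0), ([0.5, 0.5], 1.0)]
    dets = [posterior_det(item, [], theta, prior) for item in pool]
    print(dets, dets.index(max(dets)))
```

The list of determinants can then be passed to a selection rule in place of unidimensional information values.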
3.3.2 Simulation Design
Nine independent variables were manipulated: (a) the three proposed direct estimations in the hierarchical structure CAT (one-stage estimations); (b) test length: 30 and 60 items; (c) item pool size: 600 items (200 per subtest) and 1200 items (400 per subtest); (d) the number of parameters in the HSIRM: the 1PL and 2PL HSIRMs; (e) the magnitude of the factor loadings: high (.9, .8, and .7) and diverse (.9, .6, and .3); (f) the item selection criteria: the general maximum-information criterion, the PG method, and the PP method; (g) four values of the acceleration parameter: -1, 0, 1, and 2; (h) content balancing based on the multinomial model (Chen & Ankenmann, 2004; Chen et al., 2003); and (i) the SHOF procedure, in which the maximum item exposure rate was set to .2, representing moderate item exposure control. For all design conditions, the item difficulty parameters were drawn from U(-3, 3) and the item discrimination parameters were drawn from N(1, 0.25).
The trait levels of the simulees were randomly generated from the hierarchical structure item response model as described in Study 1. For each condition, 1000 simulees were sampled. The starting ability estimates were set at 0. Five dependent variables were used to compare the methods across conditions: (a) bias, as shown in Equation (3.22); (b) RMSE and overall RMSE for accuracy, as shown in Equations (3.23) and (3.27), respectively; (c) item exposure rate, as shown in Equation (3.28); (d) mean test overlap rate, as shown in Equation (3.29); and (e) pool usage rate, as shown in Equation (3.30).
$$T = \frac{N}{L(N-1)} \sum_{i=1}^{M} r_i^{2} - \frac{1}{N-1}, \tag{3.29}$$

$$\text{Pool Usage} = \frac{m}{M}, \tag{3.30}$$
where θ and θ̂ were the generating value and the estimate, respectively, of the overall ability or a domain ability; N was equal to 1000, the sample size in the CAT; D was equal to 4, reflecting the one overall ability and three domain abilities manipulated in this study; h_i was the number of times item i was presented to simulees; L was the test length and M was the item pool size in a CAT; and m was the number of items in the pool that had been administered in a CAT.
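As a rough illustration of how the exposure-related indices could be computed from simulation output, the sketch below takes one set of administered item indices per simulee. It assumes that the item exposure rate is h_i / N (Equation (3.28) is not reproduced here) and that an item counts toward pool usage once it has been administered at least once; the function name and data layout are hypothetical.

```python
def exposure_summary(administered_sets, pool_size, test_length):
    """Return (exposure rates, mean test overlap rate, pool usage rate).

    administered_sets: one collection of item indices per simulee,
    each of size test_length (fixed-length CAT assumed).
    """
    n = len(administered_sets)
    counts = [0] * pool_size              # h_i: times item i was presented
    for items in administered_sets:
        for i in items:
            counts[i] += 1
    rates = [h / n for h in counts]       # assumed r_i = h_i / N
    # Overlap rate from the sum of squared exposure rates, as in Equation (3.29).
    overlap = n / (test_length * (n - 1)) * sum(r * r for r in rates) - 1 / (n - 1)
    # Share of the pool that was used at least once, as in Equation (3.30).
    usage = sum(1 for h in counts if h > 0) / pool_size
    return rates, overlap, usage

if __name__ == "__main__":
    tests = [{0, 1}, {0, 2}, {1, 3}]      # three simulees, two items each
    print(exposure_summary(tests, pool_size=6, test_length=2))
```

Two identical tests give an overlap rate of 1 and two disjoint tests give 0, which matches the interpretation of the index as the average proportion of items shared by a pair of examinees.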
CHAPTER FOUR RESULTS
4.1 Study 1: Model-data Fit and Model Parameter Recovery of the HSIRM
Because the fit of a model must be assessed before any conclusion is drawn from applying a statistical model to a data set (Kang & Cohen, 2007; Li et al., 2006; Sinharay, 2005), the first part of the following section focuses on the performance of the Bayesian model checking techniques. Of the model checking methods, the PPMC methods were implemented first to examine whether the data sets fitted the hierarchical structure item response model in terms of absolute model fit. It is true that when the correct model is known, it is more appropriate to use model selection tools to compare candidate models (Bayarri & Berger, 2000). The model comparison methods, PsBF and DIC, were then carried out to decide which of the 1PL and 2PL hierarchical structure item response models described the data better. It was expected that (a) the test statistics used in the PPMC methods would detect model misfit in this simulation study and (b) the model comparison indices would identify the better model. Finally, parameter recovery was evaluated with the criteria described above for the cases in which the fitted model and the generating model were the same hierarchical structure item response model.