CHAPTER 3 METHOD
3.1 Study 1: The Parameter Recovery of RHO-RDINA and RHO-RDINO
3.1.3 Recovery Analysis of the RHO-RDINA and RHO-RDINO Models
Before proceeding to DIF detection for the simulation study, a recovery analysis was
conducted to determine the extent to which the generating parameters could be
recovered from the simulated datasets by the RHO-RDINA and RHO-RDINO models.
The recovery analysis considered three issues: recovery of the simulated item
parameters (i.e., the slip and guessing parameters), recovery of the simulated attribute
difficulty parameters, and recovery of the attribute mastery classifications. Recovery
of the item parameters and attribute difficulty parameters was assessed using the root
mean squared error (RMSE) between the generating parameters and the parameter
estimates. The RMSE can be expressed as:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{b}_i - b_i\right)^{2}} \tag{3.1} \]
where $b_i$ is the generating value of either an item parameter or an attribute
parameter, $\hat{b}_i$ is the corresponding parameter estimate, and $n$ is the number of
iterations. Recovery of attribute mastery classification was assessed by calculating the
proportion of examinees who were correctly classified as masters or non-masters on
each attribute.
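The two recovery criteria can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names `rmse` and `classification_rate` and the example numbers are hypothetical, and the attribute patterns are assumed to be stored as examinee-by-attribute arrays of 0s and 1s.

```python
import numpy as np

def rmse(generating, estimates):
    """Root mean squared error over n iterations (Equation 3.1)."""
    generating = np.asarray(generating, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - generating) ** 2)))

def classification_rate(true_alpha, est_alpha):
    """Proportion of examinees correctly classified on each attribute.

    Both arguments are (examinees x attributes) arrays of 0/1 mastery
    indicators; the result has one proportion per attribute.
    """
    true_alpha = np.asarray(true_alpha)
    est_alpha = np.asarray(est_alpha)
    return (true_alpha == est_alpha).mean(axis=0)

# Made-up values for illustration only: one slip parameter with
# generating value 0.20, estimated in four iterations.
slip_rmse = rmse([0.20, 0.20, 0.20, 0.20], [0.18, 0.22, 0.21, 0.19])

# Made-up mastery patterns for four examinees and two attributes.
rates = classification_rate(
    [[1, 0], [1, 1], [0, 0], [0, 1]],
    [[1, 0], [0, 1], [0, 0], [0, 0]],
)
```

Passing the generating values as an array lets the same function cover designs in which the generating parameter is redrawn in every iteration.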
The results therefore address the following questions:
1. Do test length, ability distribution differences and DIF pattern affect the
parameter recovery of the two model-based methods?
2. Does the method based on the compensatory RHO-RDINO model perform as
well as the method based on the RHO-RDINA model under the same conditions?
3.2 Study 2: Comparing the Effectiveness of Traditional DIF Methods with Purification Procedure and Model Based Method within CDM Framework
3.2.1 The Simulation Design
The simulation study was designed to test whether the purification procedure improves
the two commonly used DIF detection methods, the Mantel-Haenszel (MH) and
logistic regression (LR) methods. The RHO-RDINA and RHO-RDINO models were
used to generate and analyze the data. Because the goal of study 2 is to investigate
whether scale purification affects the efficiency of different DIF detection methods,
some factors, such as test length and the Q-matrix, were fixed and were simulated to
approximate real tests.
[Figure 3.2 Simulation Design of Study 2. Main purpose of study: comparing the effectiveness of traditional DIF methods with iterative procedure and model-based method within the CDM framework.]
Controlled Variables.
Test length. The test length was not manipulated; a 20-item test was used to
represent a short test. This length was chosen to approximate most diagnostic
assessments to which cognitive diagnostic models have been applied (e.g., a total of
20 items in the fraction subtraction test; a total of 25 mathematics problems in the
Trends in International Mathematics and Science Study, TIMSS; a total of 25
mathematics problems in the Organisation for Economic Co-operation and
Development's Programme for International Student Assessment, PISA; see
de la Torre & Douglas, 2008; Dogan & Tatsuoka, 2008; Lee, Park & Taylan, 2011).
Structure of the Q-Matrix. The structure of the Q-matrix in study 2 was the
same as in study 1. Thus, the Q-matrix listed in Table 3.1 was used to generate
item responses.
Item Parameter and Attribute Difficulty. Slip and guessing parameters for each
item were generated from a uniform distribution between .1 and .3. For the
higher-order structure, the overall ability was generated from a normal distribution
with mean 0 and variance 1 for the reference group and with mean -1 and variance 1
for the focal group. The attribute difficulty parameters were set to
[-1.5, -1.0, -1.0, -0.5, -0.5] for both groups; the discrimination parameters, a_k,
were set at 1.5 and restricted to be equal across the items.
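The generating values above can be turned into a small data-generation sketch. Several parts of it are assumptions rather than details given in the text: the higher-order link is taken to be logit P(alpha_k = 1 | theta) = a(theta - b_k), the Q-matrix is a random placeholder standing in for Table 3.1, and the seed and the function name `simulate_group` are hypothetical. The DINA and DINO response rules are the standard conjunctive and disjunctive ones.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed (assumption)

J, K = 20, 5                                   # items, attributes
b = np.array([-1.5, -1.0, -1.0, -0.5, -0.5])   # attribute difficulties
a = 1.5                                        # common discrimination

# Placeholder Q-matrix (assumption); the study uses Table 3.1 instead.
Q = rng.integers(0, 2, size=(J, K))
Q[Q.sum(axis=1) == 0, 0] = 1   # every item must require some attribute

slip = rng.uniform(0.1, 0.3, size=J)
guess = rng.uniform(0.1, 0.3, size=J)

def simulate_group(n, theta_mean, rule="DINA"):
    """Simulate abilities, attribute patterns and item responses."""
    theta = rng.normal(theta_mean, 1.0, size=n)
    # Assumed link: logit P(alpha_k = 1 | theta) = a * (theta - b_k)
    p_master = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    alpha = (rng.random((n, K)) < p_master).astype(int)
    mastered_required = alpha @ Q.T          # (n x J) counts
    if rule == "DINA":
        # Conjunctive: all required attributes must be mastered
        eta = mastered_required == Q.sum(axis=1)
    else:
        # DINO, disjunctive: one required attribute is enough
        eta = mastered_required > 0
    p_correct = np.where(eta, 1.0 - slip, guess)
    responses = (rng.random((n, J)) < p_correct).astype(int)
    return theta, alpha, responses

# Reference group (mean 0) and focal group (mean -1)
theta_r, alpha_r, x_r = simulate_group(1000, 0.0)
theta_f, alpha_f, x_f = simulate_group(500, -1.0)
```

DIF would be introduced by giving the focal group different slip or guessing values on the studied items, which this sketch leaves out.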
Manipulated Variables.
Ability Distributions. As in study 1, two ability distribution conditions were
generated in study 2. In the first, abilities were generated from a standard normal
distribution (M = 0.0, SD = 1.0) for both groups. In the second, the mean was set at
0.0 for the reference group and -1.0 for the focal group. This difference in ability
distributions was assumed to affect the efficiency of DIF detection.
Percentage of DIF Items. A higher percentage of DIF items in a test results in
less accurate ability estimates; that is, the contamination of the matching variable
increases as the percentage of DIF items increases. As a result, power is likely to
decrease as the percentage of DIF items increases. In the framework of item
response models, the percentage of DIF items in a test is usually manipulated in a
range from 0% to 40% (e.g., three levels 10%, 15% and 30% in the study of Fidalgo,
Mellenbergh and Muniz, 2000; three levels 0%, 10% and 20% in the study of Finch &
French, 2007; three levels 0%, 10% and 20% in the study of French & Maller, 2007;
four levels 10%, 20%, 30% and 40% in the study of Shih & Wang, 2009). The
percentage of DIF items is expected to be low in operational testing. However, if a
test is not well developed, a higher percentage of DIF items may appear. Therefore,
the percentage of DIF items was set at four levels: 0%, 10%, 20%, and 30%.
Sample Size. The power and Type I error rates of DIF detection methods (e.g.,
LR, MH and SIBTEST) increase as sample size increases (Finch & French, 2007;
French & Maller, 2007; Rogers & Swaminathan, 1993). Additionally, de la Torre and
Douglas (2004) stated that estimation has sufficient power when the sample size
reaches 1,000 under the DINA model. In the present study, the RHO-RDINA and
RHO-RDINO models were implemented. Therefore, both small and large sample
sizes were included so that conditions with less statistical power could be evaluated.
Four combinations of focal (F) and reference (R) group sample sizes were
manipulated as follows: F500/R500, F500/R1000, F1000/R1000 and F1000/R2000.
Thus, different sample size ratios of the focal and reference groups can be compared.
DIF Magnitude. Bias was simulated in accordance with the manipulations used in
previous DIF studies (French & Maller, 2007; Narayanan & Swaminathan, 1996;
Rogers & Swaminathan, 1993). Three levels of DIF in the item slip and guessing
parameters (0.4, 0.6 and 0.8) were selected to represent small, moderate and large
differences, respectively.
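The four manipulated factors above can be enumerated as a design grid. The sketch below assumes a fully crossed design, which the text does not state explicitly; the 0% DIF conditions do not depend on DIF magnitude, so a real design might collapse them. The dictionary keys are hypothetical names.

```python
from itertools import product

TEST_LENGTH = 20  # fixed test length in study 2

ability_means = [(0.0, 0.0), (0.0, -1.0)]   # (reference, focal) means
dif_percentages = [0, 10, 20, 30]            # percent of DIF items
sample_sizes = [(500, 500), (500, 1000),     # (focal, reference) sizes
                (1000, 1000), (1000, 2000)]
dif_magnitudes = [0.4, 0.6, 0.8]             # small, moderate, large

conditions = []
for means, pct, sizes, magnitude in product(
    ability_means, dif_percentages, sample_sizes, dif_magnitudes
):
    conditions.append({
        "reference_mean": means[0],
        "focal_mean": means[1],
        "dif_percentage": pct,
        # Number of DIF items implied by the percentage on a 20-item test
        "n_dif_items": TEST_LENGTH * pct // 100,
        "n_focal": sizes[0],
        "n_reference": sizes[1],
        "dif_magnitude": magnitude,
    })

print(len(conditions))  # 2 * 4 * 4 * 3 = 96 cells under full crossing
```

On a 20-item test, the four DIF percentages correspond to 0, 2, 4 and 6 DIF items.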
Purification. The purification procedure was implemented in the LR and
Mantel-Haenszel (MH) methods and compared with the non-purification procedures.
The purification procedure for the LR and MH methods is as follows:
1. Conduct the LR / MH analysis for all items (N) with the total summed score as the
matching criterion.
2. Identify DIF items (n) based on the set criteria.
3. Rerun the analysis for all items with the total score on the N-n remaining items as
the matching criterion and identify DIF items.
4. Rerun the analysis for all items with the N-n total score as the matching criterion.
Repeat steps 3 and 4 until the same set of DIF items is identified in two
consecutive analyses or no further items are identified.
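The iterative steps above can be sketched for the MH method as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the flagging criterion is taken to be the MH chi-square with continuity correction exceeding the .05 critical value for 1 degree of freedom, the function names and the `max_iter` safeguard are hypothetical, and the matching score is the plain sum over unflagged items (whether the studied item is added back into its own matching score is not specified in the text). The LR variant would follow the same loop with a different test statistic.

```python
import numpy as np

CHI2_CRIT_05 = 3.841  # chi-square critical value, df = 1, alpha = .05

def mh_chi2(item, match, is_focal):
    """Mantel-Haenszel chi-square for one item, stratified by `match`."""
    sum_a = sum_e = sum_v = 0.0
    for score in np.unique(match):
        in_stratum = match == score
        ref = in_stratum & ~is_focal
        n_r = int(ref.sum())
        n_f = int((in_stratum & is_focal).sum())
        total = n_r + n_f
        if n_r == 0 or n_f == 0 or total < 2:
            continue  # stratum carries no between-group information
        m1 = int(item[in_stratum].sum())   # correct responses
        m0 = total - m1                    # incorrect responses
        sum_a += int(item[ref].sum())      # reference group correct
        sum_e += n_r * m1 / total
        sum_v += n_r * n_f * m1 * m0 / (total ** 2 * (total - 1))
    if sum_v == 0:
        return 0.0
    # Continuity correction, floored at zero
    return max(abs(sum_a - sum_e) - 0.5, 0.0) ** 2 / sum_v

def flag_items(X, is_focal, keep):
    """Flag items, matching on the summed score over the items in `keep`."""
    match = X[:, keep].sum(axis=1)
    return {j for j in range(X.shape[1])
            if mh_chi2(X[:, j], match, is_focal) > CHI2_CRIT_05}

def purify(X, is_focal, max_iter=20):
    """Iterative purification; returns the sorted list of flagged items."""
    all_items = np.arange(X.shape[1])
    flagged = flag_items(X, is_focal, all_items)        # steps 1 and 2
    for _ in range(max_iter):
        if not flagged:
            break                                       # nothing to purify
        keep = np.array([j for j in all_items if j not in flagged])
        if keep.size == 0:
            break                                       # no items left to match on
        new_flagged = flag_items(X, is_focal, keep)     # steps 3 and 4
        if new_flagged == flagged:
            break                                       # same set twice in a row
        flagged = new_flagged
    return sorted(flagged)
```

Here `X` is an examinee-by-item array of 0/1 responses and `is_focal` is a boolean vector marking focal-group examinees; `purify(X, is_focal)` returns the indices of the items flagged after purification.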