CHAPTER 3 METHOD
3.1 Study 1: The Parameter Recovery of RHO-RDINA and RHO-RDINO
3.1.1 The Simulation Design
Study 1 assessed the parameter recovery of the RHO-RDINA and RHO-RDINO
models. Several simulation conditions were considered to investigate the efficiency of
the two proposed DIF detection methods. Simulated datasets were used to allow direct
control of factors that influence DIF detection, such as the structure of the Q-matrix,
the percentage of DIF items, and the DIF magnitude. The conditions were nevertheless
chosen to approximate real testing conditions so that the results would generalize. To
ensure stable results, 25 replications were completed for each condition. The number of
replications was chosen (a) in accord with the number used in recent simulation work with CDMs
estimated by the MCMC algorithm (de la Torre & Douglas, 2004), (b) to compare
methods or criteria for correct DIF detection rates with CDMs (Li, 2008; Zhang,
2006), and (c) given the time-intensive nature of purification with DIF methods as
applied in this study.
Figure 3.1 Simulation Design of Study 1
Controlled Variables.
Structure of the Q-matrix. In practice, the structure of the Q-matrix is defined by
a number of content experts during test development. When the generated
Q-matrix approximates a reasonable test structure, the possible confounding effect of
the Q-matrix on DIF is removed (Zhang, 2006). In this study, a single 20 × 5
Q-matrix was constructed to balance complexity and effectiveness; it is given in
Table 3.1. The design of the
Q-matrix is similar to that of previous studies (e.g., de la Torre & Douglas, 2004; de la
Torre, Hong & Deng, 2010; Li, 2008). Items 1 to 5 each measure a single attribute,
Items 6 to 15 each measure two attributes, and Items 16 to 20 each measure
three attributes.
Unlike most previous CDM studies, which used a fixed test length, this study
manipulated test length, so the Q-matrix had to change across test length
conditions. Changing the structure of the Q-matrix along with the test length, however,
could confound inferences about the test length effect. Thus, the same
Q-matrix structure was copied two and three times as the test length increased.
Table 3.1 The Q-Matrix Structure for 20 items
Attributes Attributes
Items 1 2 3 4 5 Items 1 2 3 4 5
i1 1 0 0 0 0 i11 0 1 1 0 0
i2 0 1 0 0 0 i12 0 0 1 1 0
i3 0 0 1 0 0 i13 0 0 0 1 1
i4 0 0 0 1 0 i14 1 0 0 1 0
i5 0 0 0 0 1 i15 0 0 1 0 1
i6 1 1 0 0 0 i16 0 1 1 1 0
i7 1 0 1 0 0 i17 0 1 0 1 1
i8 1 0 0 1 0 i18 1 1 0 0 1
i9 1 0 0 0 1 i19 1 0 1 1 0
i10 0 1 0 0 1 i20 0 1 1 0 1
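The table above, and the copying of its structure for longer tests, can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the names Q20, stack_q, and Q_BY_LENGTH are hypothetical, and stacking the copies row by row in the original item order is an assumption, since the text does not state how the copies were ordered.

```python
# Table 3.1 transcribed row by row (items i1..i20, attributes 1..5).
Q20 = [
    [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],                                    # i1-i5: one attribute
    [1, 1, 0, 0, 0], [1, 0, 1, 0, 0], [1, 0, 0, 1, 0], [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1], [0, 1, 1, 0, 0], [0, 0, 1, 1, 0], [0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0], [0, 0, 1, 0, 1],                   # i6-i15: two attributes
    [0, 1, 1, 1, 0], [0, 1, 0, 1, 1], [1, 1, 0, 0, 1], [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1],                                    # i16-i20: three attributes
]


def stack_q(q, copies):
    """Repeat the rows of q `copies` times (assumed ordering: whole copies in sequence)."""
    return [row[:] for _ in range(copies) for row in q]


# One Q-matrix per test length: 20, 40, and 60 items.
Q_BY_LENGTH = {20 * c: stack_q(Q20, c) for c in (1, 2, 3)}

for n_items, q in Q_BY_LENGTH.items():
    attrs_per_item = [sum(row) for row in q]
    counts = {k: attrs_per_item.count(k) for k in (1, 2, 3)}
    print(n_items, counts)
```

Because each longer Q-matrix is built from whole copies, the proportion of one-, two-, and three-attribute items stays the same at every test length, which is the property the design relies on.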
Item Parameter and Attribute Difficulty. In the simulation study and real data
example presented in de la Torre and Douglas (2004), attribute difficulty ranged
from -1.5 to .5, and most slip and guessing parameters ranged from .1
to .3. A later study by de la Torre, Hong and Deng (2010) investigated
the impact of the level of the guessing and slip parameters on item parameter estimation. In
their study, the high level was defined as guessing and slip parameters
ranging from .20 to .30, whereas the low level ranged
from .05 to .15. Because DIF detection is usually carried out on a preliminary
version of a test, this dissertation set the guessing and slip parameters in the
range from .10 to .30 to match practical applications: the slip and guessing parameters
for each item were generated from a uniform distribution between .10 and .30. The
attribute difficulties were set to [-1.5, -1.0, -1.0, -0.5, -0.5] for both groups; the
discrimination parameter, a_k, was set at 1.5 and constrained to be equal across the
items.
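The generating values described above can be illustrated with a short Python sketch. Several parts are assumptions rather than details given in this section: the higher-order link is taken to be P(alpha_k = 1 | theta) = logistic(a_k (theta - b_k)), the DINA and DINO response rules are the standard conjunctive and disjunctive forms, and all function names are hypothetical.

```python
import math
import random

ATTR_DIFFICULTY = [-1.5, -1.0, -1.0, -0.5, -0.5]  # b_k, as stated in the text
DISCRIMINATION = 1.5                               # a_k, as stated in the text


def draw_item_params(n_items, rng):
    """Draw slip and guessing independently from Uniform(.10, .30)."""
    slip = [rng.uniform(0.10, 0.30) for _ in range(n_items)]
    guess = [rng.uniform(0.10, 0.30) for _ in range(n_items)]
    return slip, guess


def draw_attributes(theta, rng):
    """Draw one attribute profile given ability theta.

    ASSUMPTION: logistic higher-order link a_k * (theta - b_k).
    """
    profile = []
    for b in ATTR_DIFFICULTY:
        p_master = 1.0 / (1.0 + math.exp(-DISCRIMINATION * (theta - b)))
        profile.append(1 if rng.random() < p_master else 0)
    return profile


def p_correct(alpha, q_row, slip, guess, model="DINA"):
    """Probability of a correct response under the DINA or DINO rule."""
    required = [a for a, q in zip(alpha, q_row) if q == 1]
    if model == "DINA":
        eta = all(required)   # every required attribute mastered
    elif model == "DINO":
        eta = any(required)   # at least one required attribute mastered
    else:
        raise ValueError(f"unknown model: {model}")
    return 1.0 - slip if eta else guess


rng = random.Random(2024)                  # arbitrary seed for reproducibility
slip, guess = draw_item_params(20, rng)
alpha = draw_attributes(theta=0.0, rng=rng)
print(alpha, round(min(slip), 3), round(max(guess), 3))
```

An examinee who has mastered only one of two required attributes is treated as a non-master under the DINA rule (success probability equal to the guessing parameter) and as a master under the DINO rule (success probability equal to one minus the slip parameter).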
Percentage of DIF Items. A higher percentage of DIF items in a test
results in less accurate ability estimates; that is, the contamination of the
matching variable increases as the percentage of DIF items increases. As a result, power is
likely to decrease as the percentage of DIF items increases. The percentage of DIF
items depends on how well the test has been developed. In previous studies, both Zhang
(2006) and Li (2008) set 20% of the items to exhibit DIF, a level likely to occur in a
well-developed test. In Study 1, the percentage of DIF items was likewise set to 20%; that is,
four DIF items in the short test, eight in the moderate-length test, and twelve in the
long test.
DIF Magnitude. In the simulation study presented in Zhang (2006), two levels
of DIF magnitude were manipulated for the item slip and guessing parameters: .075
and .15. However, a DIF of .075 was not large enough to be detected. In
the simulation study of Li (2008), the amount of DIF was set at .10, which yielded
sufficient power. The amount of DIF was not the focus of Study 1.
Consequently, the DIF magnitude was not manipulated in Study 1 and was fixed
at .10. Because the item-level parameters in this study were reparameterized on the
logit scale, the DIF was introduced on that scale using the logit function, log[p/(1-p)]. If
one prefers the original probability formulation of the parameters, it is
simple to recover with the inverse function, s_i = exp(s_i*)/[1 + exp(s_i*)], where s_i*
denotes the logit-scale slip parameter of item i. Thus, in the
present study, the slip and guessing parameters in the two groups were formed by
adding or subtracting .27 logit in opposite directions (see Table 3.2), so that the
between-group DIF magnitude was .54 logit, which is close to the .10 difference
between subgroups in probability.
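The correspondence between the logit shift and the probability difference can be checked numerically. The sketch below assumes that the two groups are shifted .27 logit in opposite directions from a common baseline value, which is one reading of Table 3.2; the function names are hypothetical.

```python
import math

SHIFT = 0.27  # logit shift applied to each group, in opposite directions


def logit(p):
    """Map a probability to the logit scale: log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a logit back to a probability: exp(x) / [1 + exp(x)]."""
    return math.exp(x) / (1.0 + math.exp(x))


def probability_gap(p_base, shift=SHIFT):
    """Between-group difference in probability for a total gap of 2 * shift logits.

    ASSUMPTION: both groups start from the same baseline value p_base.
    """
    upper = inv_logit(logit(p_base) + shift)
    lower = inv_logit(logit(p_base) - shift)
    return upper - lower


for p in (0.10, 0.20, 0.30):
    print(f"baseline {p:.2f}: gap = {probability_gap(p):.3f}")
```

Under these assumptions the gap depends on the baseline value: it is about .09 at a baseline of .20 and about .11 at a baseline of .30, so the .10 figure is an approximation that holds best toward the upper part of the .10 to .30 range.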
Manipulated Variables. The factors described above were fixed in Study 1. The
following factors were manipulated: two ability distributions, three test lengths, and
three DIF patterns. In addition, two underlying CDMs were used
to generate the data.
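The resulting design can be enumerated with a short Python sketch. It assumes that the four factors are fully crossed, which the text implies but does not state explicitly; the labels and variable names are illustrative only.

```python
from itertools import product

ABILITY = ("equal means", "focal mean -1.0")
TEST_LENGTH = (20, 40, 60)
DIF_PATTERN = ("no DIF", "one-sided", "balanced")
MODEL = ("RHO-RDINA", "RHO-RDINO")
REPLICATIONS = 25          # per condition, as stated in Section 3.1.1
DIF_PERCENT = 20           # fixed percentage of DIF items

# ASSUMPTION: a fully crossed design (2 x 3 x 3 x 2).
conditions = [
    {
        "ability": ability,
        "n_items": n_items,
        "pattern": pattern,
        "model": model,
        # No DIF items in the baseline pattern; otherwise 20% of the test.
        "n_dif_items": 0 if pattern == "no DIF" else n_items * DIF_PERCENT // 100,
    }
    for ability, n_items, pattern, model in product(
        ABILITY, TEST_LENGTH, DIF_PATTERN, MODEL
    )
]

print(len(conditions), "conditions,", len(conditions) * REPLICATIONS, "datasets")
```

If the design is fully crossed, this gives 36 conditions and 900 simulated datasets in total.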
Test Length. Test length has been manipulated in several DIF simulation studies,
with a range of 20 to 80 items (e.g., Finch & French, 2007; French & Maller, 2007;
Narayanan & Swaminathan, 1996; Shih & Wang, 2009). Studies of DIF detection in
an IRT framework found that statistical power increases with longer tests (French &
Maller, 2007; Narayanan & Swaminathan, 1996). Because studies in the framework of
cognitive diagnostic measurement have usually adopted a fixed test length in DIF
detection (Li, 2008; Zhang, 2006), the test length effect is still uncertain. Thus,
test length was manipulated in line with past simulation studies:
20-item, 40-item, and 60-item tests were used to represent short, moderate, and
long tests, respectively.
Ability Distributions. Ability differences between groups influence DIF detection (e.g.,
they degrade the power of model-based detection methods; Li, 2008). Two conditions were
generated. In the first, abilities were generated from a standard normal distribution
(M = 0.0, SD = 1.0) for both groups. In the second, the means were set at 0.0 and -1.0
for the reference and focal groups, respectively. These differences were selected to
approximate actual test data and have been used in previous DIF detection studies (e.g.,
Finch & French, 2007; Li, 2008; Roussos & Stout, 1996; Shih & Wang, 2009).
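The two ability conditions can be sketched in Python as follows. The sample size per group and the standard deviation in the second condition are not given in this section, so N_PER_GROUP and the default sd of 1.0 below are placeholders, not values from the study; the names are hypothetical.

```python
import random

# Group means in the two ability conditions, as stated in the text.
ABILITY_CONDITIONS = {
    "equal means": {"reference": 0.0, "focal": 0.0},
    "unequal means": {"reference": 0.0, "focal": -1.0},
}

N_PER_GROUP = 1000  # HYPOTHETICAL: the section does not state the sample size


def draw_abilities(n_examinees, mean, rng, sd=1.0):
    """Draw abilities from Normal(mean, sd); sd = 1.0 is assumed for both conditions."""
    return [rng.gauss(mean, sd) for _ in range(n_examinees)]


rng = random.Random(7)  # arbitrary seed for reproducibility
for condition, means in ABILITY_CONDITIONS.items():
    for group, mu in means.items():
        theta = draw_abilities(N_PER_GROUP, mu, rng)
        sample_mean = sum(theta) / len(theta)
        print(f"{condition:13s} {group:9s} sample mean = {sample_mean:.2f}")
```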
DIF Pattern. In this simulation, DIF was created in three ways: by changing the
slip parameter, by changing the guessing parameter, and by changing both the guessing
and slip parameters for the focal group relative to the reference group. The following
three DIF patterns were examined; they are listed in Table 3.2:
1. The no-DIF pattern served as the baseline for evaluating Type I error
rates; the focal and reference groups received the same set of item
parameters. Both groups therefore had an equal probability of a correct
response for a given attribute pattern, and no DIF should occur between the
focal and the reference groups.
2. In the one-sided DIF pattern, all DIF items were set to favor the
same group (the reference group). That is, the DIF items were generated by
decreasing the slip parameter and increasing the guessing parameter of the reference
group by an equal amount (0.27 logit), with the opposite changes applied to the focal
group. In other words, the reference group had a higher probability of answering the
DIF items correctly than the focal group after the latent trait levels were controlled.
3. In the balanced DIF pattern, half of the DIF items were set to favor
the reference group and the other half were set to favor the focal group.
The DIF magnitude was the same in both directions, so that neither group was
favored overall.
Table 3.2 DIF Pattern Manipulation
DIF pattern        No DIF           One-sided        Balanced
                                    Favor R          Favor R          Favor F
Parameter          g      s         g      s         g      s         g      s
Focal group        equal  equal     -0.27  +0.27     -0.27  +0.27     +0.27  -0.27
Reference group    equal  equal     +0.27  -0.27     +0.27  -0.27     -0.27  +0.27
Note: "+" denotes adding the DIF amount; "-" denotes subtracting the DIF amount.
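The sign pattern in Table 3.2 can be expressed as a small Python sketch. It assumes that the shifts are applied on the logit scale to a common baseline value for each item, and that in the balanced pattern the first half of the DIF items favors the reference group; neither detail is stated in the text, and all names are hypothetical.

```python
import math

SHIFT = 0.27  # DIF amount in logits

# Signs from Table 3.2 as (guessing sign, slip sign) for each group.
SIGNS = {
    "favor_R": {"focal": (-1, +1), "reference": (+1, -1)},
    "favor_F": {"focal": (+1, -1), "reference": (-1, +1)},
}


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return math.exp(x) / (1.0 + math.exp(x))


def shift_item(guess, slip, group, direction):
    """Return (guess, slip) for one group after applying the Table 3.2 signs.

    ASSUMPTION: guess and slip are common baseline probabilities.
    """
    g_sign, s_sign = SIGNS[direction][group]
    return (
        inv_logit(logit(guess) + g_sign * SHIFT),
        inv_logit(logit(slip) + s_sign * SHIFT),
    )


def dif_directions(pattern, n_dif_items):
    """Direction of DIF for each DIF item under a given pattern."""
    if pattern == "no DIF":
        return []
    if pattern == "one-sided":
        return ["favor_R"] * n_dif_items
    if pattern == "balanced":
        half = n_dif_items // 2  # ASSUMPTION: first half favors the reference group
        return ["favor_R"] * half + ["favor_F"] * (n_dif_items - half)
    raise ValueError(f"unknown pattern: {pattern}")


print(dif_directions("balanced", 4))
print(shift_item(0.20, 0.20, "focal", "favor_R"))
print(shift_item(0.20, 0.20, "reference", "favor_R"))
```

For an item that favors the reference group, the reference group ends up with the higher guessing parameter and the lower slip parameter, so its probability of a correct response is higher for masters and non-masters alike.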