CHAPTER 3 METHOD

3.1 Study 1: The Parameter Recovery of RHO-RDINA and RHO-RDINO

3.1.1 The Simulation Design

Study 1 assessed the parameter recovery of the RHO-RDINA and RHO-RDINO models. Several simulation conditions were also considered to investigate the efficiency of the two proposed DIF detection methods. Simulated datasets were used to allow direct control of factors that influence DIF detection, such as the structure of the Q-matrix, the percentage of DIF items, and the DIF amount. The conditions were nevertheless chosen to approximate real testing conditions so that the results would generalize. To ensure stable results, 25 replications were completed for each condition. This number was chosen (a) in accord with the number used in recent simulation work with CDMs estimated by the MCMC algorithm (de la Torre & Douglas, 2004), (b) to allow comparison of methods or criteria for correct DIF detection rates with CDMs (Li, 2008; Zhang, 2006), and (c) in view of the time-intensive nature of purification with the DIF methods applied in this study.

Figure 3.1 Simulation Design of Study 1

Controlled Variables.

Structure of the Q-matrix. In practice, the structure of the Q-matrix is defined by a number of content experts during test development. When the generated Q-matrix approximates a reasonable test structure, the possible confounding effect of the Q-matrix on DIF is removed (Zhang, 2006). In this study, a single 20 × 5 Q-matrix was constructed to balance complexity and effectiveness; it is given in Table 3.1. Its design is similar to that of previous studies (e.g., de la Torre & Douglas, 2004; de la Torre, Hong, & Deng, 2010; Li, 2008). Items 1 to 5 each measure a single attribute, Items 6 to 15 each measure two attributes, and Items 16 to 20 each measure three attributes.

Unlike most previous CDM studies, which used a fixed test length, this study manipulated test length, so the Q-matrix had to change across test length conditions. Because a change in Q-matrix structure could threaten inferences about the effect of test length, the same 20-item Q-matrix structure was repeated two and three times as the test length increased to 40 and 60 items.

Table 3.1 The Q-Matrix Structure for 20 Items

         Attributes                Attributes
Items    1  2  3  4  5    Items    1  2  3  4  5
i1       1  0  0  0  0    i11      0  1  1  0  0
i2       0  1  0  0  0    i12      0  0  1  1  0
i3       0  0  1  0  0    i13      0  0  0  1  1
i4       0  0  0  1  0    i14      1  0  0  1  0
i5       0  0  0  0  1    i15      0  0  1  0  1
i6       1  1  0  0  0    i16      0  1  1  1  0
i7       1  0  1  0  0    i17      0  1  0  1  1
i8       1  0  0  1  0    i18      1  1  0  0  1
i9       1  0  0  0  1    i19      1  0  1  1  0
i10      0  1  0  0  1    i20      0  1  1  0  1
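The Q-matrix in Table 3.1 and its repetition for the longer tests can be sketched as follows. This is only an illustration in plain Python: the names `Q20` and `build_q_matrix` are hypothetical and do not come from the study, and stacking whole copies of the 20-item block is assumed to be what repeating the structure two and three times means.

```python
# 20 x 5 Q-matrix transcribed from Table 3.1
# (rows = items i1..i20, columns = attributes 1..5).
Q20 = [
    [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0], [1, 0, 1, 0, 0], [1, 0, 0, 1, 0], [1, 0, 0, 0, 1], [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0], [0, 0, 1, 1, 0], [0, 0, 0, 1, 1], [1, 0, 0, 1, 0], [0, 0, 1, 0, 1],
    [0, 1, 1, 1, 0], [0, 1, 0, 1, 1], [1, 1, 0, 0, 1], [1, 0, 1, 1, 0], [0, 1, 1, 0, 1],
]


def build_q_matrix(n_items):
    """Stack copies of the 20-item block to reach n_items (20, 40, or 60).

    Assumption: the longer tests reuse the block unchanged, one copy after another.
    """
    if n_items <= 0 or n_items % len(Q20) != 0:
        raise ValueError("test length must be a positive multiple of 20")
    copies = n_items // len(Q20)
    # Copy each row so that the stacked matrix does not share rows with Q20.
    return [row[:] for _ in range(copies) for row in Q20]


q60 = build_q_matrix(60)
attributes_per_item = [sum(row) for row in Q20]
```

Here `attributes_per_item` makes it easy to confirm the one-, two-, and three-attribute blocks described in the text.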

Item Parameter and Attribute Difficulty. In the simulation study and real data example presented in de la Torre and Douglas (2004), attribute difficulty ranged from -1.5 to .5, and most slip and guessing parameters ranged from .1 to .3. A further study by de la Torre, Hong, and Deng (2010) investigated the impact of the level of the guessing and slip parameters on item parameter estimation. In their study, the high level was defined as guessing and slip parameters ranging from .20 to .30, whereas the low level ranged from .05 to .15. Because DIF detection is usually carried out on a preliminary version of a test, this dissertation set the guessing and slip parameters in the range from .10 to .30 to match practical applications: the slip and guessing parameters for each item were generated from a uniform distribution between .10 and .30. The attribute difficulties were set to [-1.5, -1.0, -1.0, -0.5, -0.5] for both groups; the discrimination parameter, a_k, was set at 1.5 and constrained to be equal across the items.
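A minimal sketch of how these fixed values might be used to generate item parameters and attribute patterns is given below. The function names and the seed are hypothetical. The link between ability and attribute mastery, P(alpha_k = 1 | theta) = logistic(a(theta - b_k)), is an assumption made for illustration, because this section does not spell out the higher-order structural model.

```python
import math
import random

ATTRIBUTE_DIFFICULTY = [-1.5, -1.0, -1.0, -0.5, -0.5]  # b_k, same for both groups
DISCRIMINATION = 1.5                                    # a_k, held constant


def draw_item_parameters(n_items, rng, low=0.10, high=0.30):
    """Draw guessing and slip parameters for each item from U(low, high)."""
    guess = [rng.uniform(low, high) for _ in range(n_items)]
    slip = [rng.uniform(low, high) for _ in range(n_items)]
    return guess, slip


def mastery_probabilities(theta, a=DISCRIMINATION, b=ATTRIBUTE_DIFFICULTY):
    """Assumed higher-order link: P(alpha_k = 1 | theta) = logistic(a * (theta - b_k))."""
    return [1.0 / (1.0 + math.exp(-a * (theta - b_k))) for b_k in b]


def draw_attribute_pattern(theta, rng):
    """Sample a 0/1 mastery indicator for each attribute given ability theta."""
    return [int(rng.random() < p) for p in mastery_probabilities(theta)]


rng = random.Random(2024)  # arbitrary seed, for reproducibility of the sketch only
guess, slip = draw_item_parameters(20, rng)
alpha = draw_attribute_pattern(0.0, rng)
```

Under this assumed link, attributes with lower difficulty are mastered with higher probability at any given ability level.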

Percentage of DIF Items. A higher percentage of DIF items in a test results in less accurate ability estimates; that is, the contamination of the matching variable increases as the percentage of DIF items increases. As a result, power is likely to decrease as the percentage of DIF items increases. The percentage of DIF items depends on how well the test has been developed. In previous studies, both Zhang (2006) and Li (2008) set 20% of the items as DIF items, a proportion likely to occur in a well-developed test. In Study 1, the percentage of DIF items was likewise set to 20%: four DIF items in the short test, eight in the moderate-length test, and twelve in the long test.

DIF Magnitude. In the simulation study presented in Zhang (2006), two levels of DIF magnitude were manipulated for the item slip and guessing parameters: .075 and .15. A DIF of .075, however, was not large enough to be detected. In the simulation study of Li (2008), the amount of DIF was set at .10, which yielded sufficient power. Because the amount of DIF was not the focus of Study 1, the DIF magnitude was not manipulated and was fixed at .10. Because the item-level parameters in this study were reparameterized with the logit function, log[p/(1 - p)], this value had to be expressed on the logit scale. (If one prefers the original probability formulation, it is easily recovered with the inverse function, s_i = exp(s_i*)/[1 + exp(s_i*)], where s_i* denotes the slip parameter on the logit scale.) Thus, in the present study, the slip and guessing parameters of the two groups were formed by adding .27 logit to, or subtracting .27 logit from, the generated values (see Table 3.2), so that the groups differed by .54 logit, which is close to a .10 difference in probability between subgroups.
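The correspondence between the logit shift and the probability difference can be checked with a short calculation. The sketch below is illustrative only; the function names are hypothetical, and it assumes that the two groups sit .27 logit above and below a common baseline value.

```python
import math

DIF_SHIFT = 0.27  # logit units added to or subtracted from the baseline


def logit(p):
    """Map a probability to the logit scale: log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a logit back to a probability: exp(x) / [1 + exp(x)]."""
    return 1.0 / (1.0 + math.exp(-x))


def probability_gap(p, shift=DIF_SHIFT):
    """Probability difference between baseline + shift and baseline - shift."""
    center = logit(p)
    return inv_logit(center + shift) - inv_logit(center - shift)


# Gap on the probability scale for several baseline values in the .10 to .30 range.
for baseline in (0.10, 0.20, 0.25, 0.30):
    print(f"baseline {baseline:.2f}: gap = {probability_gap(baseline):.3f}")
```

Because the logit transformation is nonlinear, the gap on the probability scale depends on the baseline value, so the .10 figure should be read as an approximation rather than an exact equivalence.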

Manipulated Variables. The factors described above were fixed in Study 1. The following factors were manipulated: two ability distributions, three test lengths, and three scenarios with different DIF patterns. In addition, two underlying CDMs were used to generate the data.
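The resulting design can be enumerated as in the sketch below. It assumes that the four factors are fully crossed, which the text does not state explicitly; the condition labels and the helper `n_dif_items` are hypothetical names chosen for illustration.

```python
from itertools import product

ABILITY_CONDITIONS = ("equal_means", "focal_mean_minus_1")
TEST_LENGTHS = (20, 40, 60)
DIF_PATTERNS = ("no_dif", "one_sided", "balanced")
GENERATING_MODELS = ("RHO-RDINA", "RHO-RDINO")
N_REPLICATIONS = 25
DIF_PERCENT = 20  # percentage of DIF items, fixed in Study 1


def n_dif_items(test_length, pattern):
    """Number of DIF items: none under no-DIF, otherwise 20% of the test."""
    if pattern == "no_dif":
        return 0
    return test_length * DIF_PERCENT // 100


# One dictionary per cell of the (assumed) fully crossed design.
conditions = [
    {
        "model": model,
        "ability": ability,
        "test_length": length,
        "dif_pattern": pattern,
        "n_dif_items": n_dif_items(length, pattern),
    }
    for model, ability, length, pattern in product(
        GENERATING_MODELS, ABILITY_CONDITIONS, TEST_LENGTHS, DIF_PATTERNS
    )
]

total_datasets = len(conditions) * N_REPLICATIONS
```

Listing the cells this way makes the number of datasets to be generated explicit before any estimation is run.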

Test Length. Test length has been manipulated in several DIF simulation studies, with a range of 20 to 80 items (e.g., Finch & French, 2007; French & Maller, 2007; Narayanan & Swaminathan, 1996; Shih & Wang, 2009). Studies of DIF detection within an IRT framework have found that statistical power increases with longer tests (French & Maller, 2007; Narayanan & Swaminathan, 1996). Because studies in the framework of cognitive diagnostic measurement have usually adopted a fixed test length in DIF detection (Li, 2008; Zhang, 2006), the effect of test length there is still uncertain. Test length was therefore manipulated in line with past simulation studies: 20-item, 40-item, and 60-item tests represented short, moderate, and long tests, respectively.

Ability Distributions. Ability differences between groups influence DIF detection (e.g., by degrading the power of model-based detection methods; Li, 2008). Two conditions were generated. In the first, abilities were generated from a standard normal distribution (M = 0.0, SD = 1.0) for both groups. In the second, the means were set at 0.0 and -1.0 for the reference and focal groups, respectively. These differences were selected to approximate actual test data and have been used in previous DIF detection studies (e.g., Finch & French, 2007; Li, 2008; Roussos & Stout, 1996; Shih & Wang, 2009).
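The two ability conditions can be sketched as follows. The group sample size is not given in this section, so `n_per_group` is a hypothetical argument, and a standard deviation of 1.0 is assumed for both groups in the second condition as well.

```python
import random

# Focal-group mean under each ability condition; the reference mean is always 0.0.
FOCAL_MEAN = {"equal_means": 0.0, "focal_mean_minus_1": -1.0}


def draw_abilities(n_per_group, condition, rng):
    """Draw normal abilities for the reference and focal groups.

    Assumption: SD = 1.0 in both groups under both conditions.
    """
    focal_mean = FOCAL_MEAN[condition]
    theta_reference = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    theta_focal = [rng.gauss(focal_mean, 1.0) for _ in range(n_per_group)]
    return theta_reference, theta_focal


rng = random.Random(7)  # arbitrary seed
theta_ref, theta_foc = draw_abilities(20000, "focal_mean_minus_1", rng)
mean_ref = sum(theta_ref) / len(theta_ref)
mean_foc = sum(theta_foc) / len(theta_foc)
```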

DIF Pattern. In this simulation, DIF was created in three ways: by changing the slip parameter, by changing the guessing parameter, and by changing both the guessing and slip parameters in the focal group. The following three DIF patterns were examined; they are listed in Table 3.2:

1. The no-DIF pattern served as a baseline for comparing Type I error rates. The focal and reference groups received the same set of item parameters, so both groups had an equal probability of a correct response for a given attribute pattern, and no DIF should occur between the focal and reference groups.

2. In the one-sided DIF pattern, all DIF items were set to favor the same group (the reference group). The DIF items were generated by decreasing the slip parameter and increasing the guessing parameter by an equal amount (0.27 logit). In other words, the reference group had a higher probability of answering the DIF items correctly than the focal group after the latent trait levels were controlled.

3. In the balanced DIF pattern, half of the DIF items were set to favor the reference group and the other half to favor the focal group. The DIF magnitude was the same in both directions, so that neither group could be considered favored overall.

Table 3.2 DIF Pattern Manipulation

DIF pattern    No DIF            One-sided         Balanced
                                 Favor R           Favor R           Favor F
Parameter      g       s         g       s         g       s         g       s
Focal          equal   equal     -0.27   +0.27     -0.27   +0.27     +0.27   -0.27
Reference      equal   equal     +0.27   -0.27     +0.27   -0.27     -0.27   +0.27

Note: "+" denotes adding the DIF amount; "-" denotes subtracting the DIF amount.
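The shifts in Table 3.2 can be applied to a set of baseline item parameters as in the sketch below. It is an illustration under stated assumptions rather than the procedure used in the study: the function `apply_dif` is hypothetical, the choice of which items carry DIF is left to the caller, and under the balanced pattern the first half of the listed DIF items is arbitrarily taken to favor the reference group.

```python
import math

DIF_SHIFT = 0.27  # logit units, as in Table 3.2


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_dif(guess, slip, dif_items, pattern):
    """Return per-item (guess, slip) pairs for the reference and focal groups."""
    if pattern not in ("no_dif", "one_sided", "balanced"):
        raise ValueError(f"unknown DIF pattern: {pattern}")
    # direction: +1 favors the reference group, -1 the focal group, 0 means no DIF.
    direction = [0] * len(guess)
    if pattern != "no_dif":
        half = len(dif_items) // 2
        for rank, item in enumerate(dif_items):
            favors_focal = pattern == "balanced" and rank >= half
            direction[item] = -1 if favors_focal else 1
    reference, focal = [], []
    for g, s, d in zip(guess, slip, direction):
        shift = DIF_SHIFT * d
        # Favored group: guessing up, slip down; the other group gets the opposite.
        reference.append((inv_logit(logit(g) + shift), inv_logit(logit(s) - shift)))
        focal.append((inv_logit(logit(g) - shift), inv_logit(logit(s) + shift)))
    return reference, focal


baseline_guess = [0.2] * 20
baseline_slip = [0.2] * 20
ref_items, foc_items = apply_dif(baseline_guess, baseline_slip, [0, 1, 2, 3], "balanced")
```

Items that are not listed in `dif_items` receive identical parameters in both groups, which is also what the no-DIF pattern produces for every item.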
