CHAPTER 3 METHOD

3.1 Study 1: The Parameter Recovery of RHO-RDINA and RHO-RDINO

3.1.3 Recovery Analysis of the RHO-RDINA and RHO-RDINO Models

Before proceeding to DIF detection in the simulation study, a recovery analysis was conducted to determine the extent to which the generating parameters could be recovered from the simulated datasets by the RHO-RDINA and RHO-RDINO models. The recovery analysis considered three issues: recovery of the simulated item parameters (i.e., the slip and guessing parameters), recovery of the simulated attribute difficulty parameters, and recovery of the attribute mastery classifications. Recovery of the item parameters and the attribute difficulty parameters was assessed using the root mean squared error (RMSE) between the generating parameters and the parameter estimates. The RMSE can be expressed as:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{b}_i - b_i\right)^{2}} \tag{3.1}
\]

where $b_i$ is the generating value of an item or attribute parameter, $\hat{b}_i$ is the corresponding parameter estimate, and n is the number of iterations. Recovery of the attribute mastery classifications was assessed by calculating the proportion of examinees who were correctly classified as masters or non-masters on each attribute.
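The two recovery criteria can be illustrated with a minimal Python sketch. The function names `rmse` and `classification_accuracy`, the array layouts, and all numeric values are illustrative assumptions rather than part of the study; the sketch assumes NumPy and that the estimates of one parameter are stored across iterations.

```python
import numpy as np

def rmse(estimates, generating):
    """Root mean squared error of estimates around the generating value(s).

    estimates  : estimates of one parameter, one per iteration
    generating : generating value (scalar) or one value per iteration
    """
    estimates = np.asarray(estimates, dtype=float)
    generating = np.asarray(generating, dtype=float)
    return float(np.sqrt(np.mean((estimates - generating) ** 2)))

def classification_accuracy(alpha_true, alpha_est):
    """Proportion of examinees correctly classified on each attribute.

    Both inputs are (examinees x attributes) arrays of 0/1 mastery states.
    """
    alpha_true = np.asarray(alpha_true)
    alpha_est = np.asarray(alpha_est)
    return np.mean(alpha_true == alpha_est, axis=0)

# Made-up values, for illustration only
slip_estimates = [0.18, 0.22, 0.21, 0.19]   # four iterations
print(rmse(slip_estimates, 0.20))

alpha_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
alpha_est = [[1, 0], [1, 0], [0, 0], [1, 1]]
print(classification_accuracy(alpha_true, alpha_est))
```

In this toy example the RMSE is about 0.016, and three of the four examinees are classified correctly on each attribute.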

The results therefore address the following questions:

1. Do test length, differences in ability distribution, and DIF pattern affect the parameter recovery of the two model-based methods?

2. Does the method based on the compensatory RHO-RDINO model perform as well as the method based on the RHO-RDINA model under the same conditions?

3.2 Study 2: Comparing the Effectiveness of Traditional DIF Methods with Purification Procedure and Model Based Method within CDM Framework

3.2.1 The Simulation Design

The simulation study was designed to test whether the purification procedure improves two commonly used DIF detection methods: the MH and LR methods. The RHO-RDINA and RHO-RDINO models were used to generate and analyze the data. The goal of study 2 is to investigate whether scale purification affects the efficiency of different DIF detection methods. Thus, some factors, such as test length and the Q-matrix, were fixed and simulated to approximate real tests in study

Figure 3.2 Simulation Design of Study 2 (main purpose of study: comparing the effectiveness of traditional DIF methods with iterative procedure and model-based method within the CDM framework)

Controlled Variables.

Test length. The test length was not manipulated; a 20-item test was used to represent a short test. This length was chosen to approximate that of many diagnostic assessments (e.g., 20 items in the fraction subtraction test; 25 mathematics problems in the Trends in International Mathematics and Science Study, TIMSS; 25 mathematics problems in the Organisation for Economic Co-operation and Development Programme for International Student Assessment, PISA), to which cognitive diagnostic models have been applied (e.g., de la Torre & Douglas, 2008; Dogan & Tatsuoka, 2008; Lee, Park & Taylan, 2011).

Structure of the Q-Matrix. The structure of the Q-matrix in study 2 was the same as in study 1. Thus, the Q-matrix listed in Table 3.1 was used to generate item responses.

Item Parameter and Attribute Difficulty. Slip and guessing parameters for each item were generated from a uniform distribution between .1 and .3. For the higher-order structure, the overall ability was generated from a normal distribution with mean 0 and variance 1 for the reference group and with mean -1 and variance 1 for the focal group. The attribute difficulty parameters were set to [-1.5, -1.0, -1.0, -0.5, -0.5] for both groups; the discrimination parameters, a_k, were set at 1.5 and constrained to be equal across the items.
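The data-generating setup described above can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the generating code of the study: the Q-matrix is hypothetical (Table 3.1 is not reproduced in this section), the higher-order link is assumed to be logistic in a_k(theta - b_k), the plain slip/guessing form of the DINA and DINO rules stands in for the RHO-RDINA and RHO-RDINO parameterizations, and the name `simulate_group`, the seed, and the group size of 1,000 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2024)
N_ITEMS, N_ATTR = 20, 5

# HYPOTHETICAL Q-matrix (Table 3.1 is not reproduced in this section):
# items 1-10 measure one attribute, items 11-20 measure two.
Q = np.zeros((N_ITEMS, N_ATTR), dtype=int)
for j in range(N_ITEMS):
    Q[j, j % N_ATTR] = 1
    if j >= 10:
        Q[j, (j + 1) % N_ATTR] = 1

slip = rng.uniform(0.1, 0.3, size=N_ITEMS)     # slip parameters
guess = rng.uniform(0.1, 0.3, size=N_ITEMS)    # guessing parameters
b = np.array([-1.5, -1.0, -1.0, -0.5, -0.5])   # attribute difficulties
a = 1.5                                        # common discrimination

def simulate_group(n, theta_mean, rule="DINA"):
    """Simulate responses, attribute patterns, and abilities for one group."""
    theta = rng.normal(theta_mean, 1.0, size=n)
    # ASSUMPTION: logistic higher-order link a * (theta - b_k)
    p_master = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    alpha = (rng.random((n, N_ATTR)) < p_master).astype(int)
    n_mastered = alpha @ Q.T            # required attributes mastered, per item
    if rule == "DINA":                  # conjunctive: all required attributes
        eta = n_mastered == Q.sum(axis=1)
    else:                               # DINO, disjunctive: at least one
        eta = n_mastered >= 1
    p_correct = np.where(eta, 1.0 - slip, guess)
    X = (rng.random((n, N_ITEMS)) < p_correct).astype(int)
    return X, alpha, theta

X_ref, alpha_ref, theta_ref = simulate_group(1000, theta_mean=0.0)   # reference
X_foc, alpha_foc, theta_foc = simulate_group(1000, theta_mean=-1.0)  # focal
```

DIF would then be introduced by altering the slip or guessing parameters of selected items for the focal group before generating its responses.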

Manipulated Variables.

Ability Distributions. As in study 1, two ability distribution conditions were generated in study 2. In the first, abilities were generated from a standard normal distribution (M = 0.0, SD = 1.0) for both groups. In the second, the mean was set at 0.0 for the reference group and -1.0 for the focal group. This difference in ability distributions was expected to affect the efficiency of DIF detection.

Percentage of DIF Items. A higher percentage of DIF items in a test results in less accurate ability estimates; that is, the contamination of the matching variable increases as the percentage of DIF items increases. As a result, power is likely to decrease as the percentage of DIF items increases. In the framework of item response models, the percentage of DIF items in a test is usually manipulated in a range from 0% to 40% (e.g., three levels of 10%, 15%, and 30% in Fidalgo, Mellenbergh, and Muniz, 2000; three levels of 0%, 10%, and 20% in Finch & French, 2007; three levels of 0%, 10%, and 20% in French & Maller, 2007; four levels of 10%, 20%, 30%, and 40% in Shih & Wang, 2009). The percentage of DIF items is expected to be low in testing. However, if a test is not well developed, a higher percentage of DIF items may appear. Therefore, the percentage of DIF items was set at four levels: 0%, 10%, 20%, and 30%.

Sample Size. The power and Type I error rates of DIF detection methods (e.g., LR, MH, and SIBTEST) increase as sample size increases (Finch & French, 2007; French & Maller, 2007; Rogers & Swaminathan, 1993). Additionally, de la Torre and Douglas (2004) stated that estimation has sufficient power when the sample size reaches 1,000 under the DINA model. In the present study, the RHO-RDINA and RHO-RDINO models were implemented. Therefore, both small and large sample sizes were included so that conditions with less statistical power could be evaluated. Four sample size combinations were manipulated, as follows: F500/R500, F500/R1000, F1000/R1000, and F1000/R2000, where F and R denote the focal and reference groups. Thus, the effect of the sample size ratio between the focal and reference groups can also be compared.

DIF Magnitude. Bias was simulated in accordance with the manipulations used in previous DIF studies (French & Maller, 2007; Narayanan & Swaminathan, 1996; Rogers & Swaminathan, 1993). Three levels of DIF in the item slip and guessing parameters (0.4, 0.6, and 0.8) were selected to represent small, moderate, and large differences, respectively.
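The manipulated factors described so far can be enumerated with a short Python sketch. It assumes, as a labeled assumption rather than a statement from the text, that the four factors are fully crossed; the variable names and condition labels are illustrative.

```python
from itertools import product

N_ITEMS = 20  # fixed test length

ability_conditions = ["equal means (0.0 / 0.0)", "unequal means (0.0 / -1.0)"]
dif_percentages = [0, 10, 20, 30]
sample_sizes = [(500, 500), (500, 1000), (1000, 1000), (1000, 2000)]  # (focal, reference)
dif_magnitudes = [0.4, 0.6, 0.8]

# Number of DIF items implied by each percentage on a 20-item test
n_dif_items = {pct: N_ITEMS * pct // 100 for pct in dif_percentages}

# ASSUMPTION: the four factors are fully crossed
conditions = list(product(ability_conditions, dif_percentages,
                          sample_sizes, dif_magnitudes))

print(n_dif_items)
print(len(conditions))
```

On a 20-item test the four percentages correspond to 0, 2, 4, and 6 DIF items. A full crossing gives 96 cells, but DIF magnitude has no effect when no item has DIF, so only 80 of them would be distinct.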

Purification. The purification procedure was implemented in the LR and MH methods and compared with the non-purified procedures. The purification procedure for the LR and MH methods is as follows:

1. Conduct the LR / MH analysis for all items (N) with the total summed score as the matching criterion.

2. Identify DIF items (n) based on the set criteria.

3. Rerun the analysis for all items with the N-n total score as the matching criterion, and identify DIF items.

4. Rerun the analysis for all items with the N-n total score, updated to exclude the DIF items identified in step 3, as the matching criterion.

Repeat steps 3 and 4 until the same set of DIF items is identified in two consecutive analyses or no further items are identified.
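The iterative procedure above can be sketched in Python for the MH method. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `mh_chi2` and `purify_mh` are hypothetical, an item is flagged when the continuity-corrected MH chi-square exceeds 3.841 (df = 1, alpha = .05), the matching score is the summed score over the currently unflagged items, and the `max_iter` cap is an added safeguard against cycling.

```python
import numpy as np

CHI2_CRIT = 3.841  # chi-square critical value, df = 1, alpha = .05

def mh_chi2(item, group, match):
    """Continuity-corrected Mantel-Haenszel chi-square for one item.

    item  : 0/1 responses to the studied item
    group : 0 = reference, 1 = focal
    match : matching score used to form the strata
    """
    a_sum = e_sum = v_sum = 0.0
    for s in np.unique(match):
        in_s = match == s
        n_ref = np.sum(in_s & (group == 0))
        n_foc = np.sum(in_s & (group == 1))
        n_tot = n_ref + n_foc
        if n_ref == 0 or n_foc == 0 or n_tot < 2:
            continue                      # stratum carries no information
        m1 = np.sum(item[in_s])           # correct responses in the stratum
        m0 = n_tot - m1                   # incorrect responses in the stratum
        a_sum += np.sum(item[in_s & (group == 0)])   # observed, reference
        e_sum += n_ref * m1 / n_tot                  # expected under no DIF
        v_sum += n_ref * n_foc * m1 * m0 / (n_tot ** 2 * (n_tot - 1))
    if v_sum == 0:
        return 0.0
    return max(0.0, abs(a_sum - e_sum) - 0.5) ** 2 / v_sum

def purify_mh(X, group, max_iter=20):
    """Flag DIF items with MH, purifying the matching score iteratively."""
    n_items = X.shape[1]
    flagged = np.zeros(n_items, dtype=bool)   # step 1: all N items are matched on
    for _ in range(max_iter):
        match = X[:, ~flagged].sum(axis=1)    # total score on the N-n items
        new_flagged = np.array(
            [mh_chi2(X[:, j], group, match) > CHI2_CRIT for j in range(n_items)]
        )
        if np.array_equal(new_flagged, flagged):
            break                             # same set twice in a row: stop
        flagged = new_flagged
    return flagged
```

Given an examinees-by-items 0/1 matrix `X` and a group indicator `group`, `purify_mh(X, group)` returns a Boolean vector marking the flagged items. The same loop would apply to the LR method by replacing `mh_chi2` with an LR-based test.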

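Source document: An Investigation of the Effectiveness of Differential Item Functioning Detection within the Cognitive Diagnostic Measurement Framework (pp. 62-68)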