
MCMC method for multidimensional 3PL model

CHAPTER 3 A Multilevel Higher-Order item response model

3.3 Markov Chain Monte Carlo Methods for Other IRT Models

3.3.3 MCMC method for multidimensional 3PL model

The M3PL IRT model is given in equation 3.3.17. The MCMC method for the multidimensional 3PL model is similar to that for the multidimensional multilevel 3PL model; the only difference is the background variables. The prior distributions of the ability parameters are given below. In a Bayesian framework, the model can be expressed as:

)

Where $\mathbf{0}$ and $\Sigma$ are the mean vector and common (i.e., undifferentiated by examinee) covariance matrix of the multivariate normal distribution, respectively; $\nu_0$ is the degrees of freedom, and $\Lambda_0^{-1}$ is the $D \times D$ symmetric positive-definite scale matrix of the inverse-Wishart distribution.

The joint posterior distribution of the parameters given the observed item response X can be expressed as

Because this joint posterior distribution does not have a known form, draws cannot be obtained from it directly. Instead, draws can be taken from the full conditional distributions of , and IP. The joint posterior distribution can then be approximated by taking a large number of draws from these full conditional distributions. The full conditional distributions of , and IP are derived as follows:

Parameters for all the conditions were estimated using MCMC. The following is an outline of the MCMC algorithm for the MM-IRT models. At iteration t,

1. For , draw the candidate values * from N(λ(d)(t-1),λ2(d)(t-1)), and accept λ(d)*

with probability

}


with probability

}

with probability

}
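The accept/reject steps outlined above follow the general Metropolis-Hastings pattern: draw a candidate from a normal proposal centred at the previous value, then accept it with a probability determined by a ratio of posterior densities. The sketch below illustrates only that generic pattern in Python; it is not the author's Matlab code. The names `metropolis_step`, `log_post`, and `step` are hypothetical, and a standard normal density stands in for the actual full conditional distribution.

```python
import numpy as np

def metropolis_step(current, log_post, step, rng):
    """One random-walk Metropolis update for a scalar parameter."""
    candidate = rng.normal(current, step)            # symmetric normal proposal
    log_ratio = log_post(candidate) - log_post(current)
    # Accept with probability min(1, exp(log_ratio)).
    if np.log(rng.uniform()) < log_ratio:
        return candidate, True
    return current, False

def log_post(x):
    # Hypothetical stand-in target: standard normal log-density up to a constant.
    return -0.5 * x * x

rng = np.random.default_rng(0)
draws = np.empty(20000)
x = 0.0
accepted = 0
for t in range(draws.size):
    x, ok = metropolis_step(x, log_post, step=1.0, rng=rng)
    accepted += ok
    draws[t] = x

acceptance_rate = accepted / draws.size
chain_mean = draws[5000:].mean()   # summaries after discarding early draws
chain_sd = draws[5000:].std()
```

In a full sampler, one such update would be applied in turn to each block of parameters within every iteration, with `log_post` replaced by the log of the corresponding full conditional density.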


CHAPTER 4 Experiments

To evaluate the performance of the multilevel higher-order item response model in estimating data with or without background variables, the author applied the approaches to a variety of combinations of generating and fitting models. The response data were generated from two different models, each serving a different purpose. In experiment one, the data were generated from the multilevel higher-order item response (MHO-IRT) model with a correlation of 0, 0.35, or 0.7 between the background variables and the overall ability, and the background variables were continuous. This experiment examines how the correlation of the background variables influences parameter estimation. In experiment two, the data were generated from the higher-order item response (HO-IRT) model, and the background variables were dichotomous. The results show how incorporating the background variables improves the estimation of group statistics.

In experiment three, real data from the TASA 2007 fourth-grade mathematics assessment were used to check how well the proposed models fit the data.

4.1 Experiment one

4.1.1 Experiment Design

A simulation study was conducted to investigate the feasibility of the HO-IRT model incorporating student background variables in the estimation process and to show how the estimates obtained from this model are affected by different factors.

The response data are generated from the multilevel higher-order item response (MHO-IRT) model with a correlation of 0, 0.35, or 0.7 between the background variables and the overall ability, and the background variables are continuous. The combinations are shown in Table 4-1, with an asterisk marking the cases examined.

Table 4-1

The cases examined in the simulation studies

                  Fitted Model
Parameter         U-IRT   MU-IRT   M-IRT   MM-IRT   HO-IRT   MHO-IRT
Overall ability   *       *                         *        *
Domain ability                     *       *        *        *

The generated data are analyzed using the multilevel higher-order item response model (MHO-IRT), higher-order item response model (HO-IRT), multilevel multidimensional item response model (MM-IRT), multidimensional item response model (M-IRT), multilevel unidimensional item response model (MU-IRT) and unidimensional item response model (U-IRT).


Consequently, five factors with varied conditions were considered in the simulation study (de la Torre, 2009): (1) the generating model, MHO-IRT or HO-IRT; (2) the fitting model, one of the six models described above; (3) the sample size, 1,000 or 4,000; (4) the correlation between the background variables and the overall ability, 0, 0.35, or 0.7; and (5) the test length in each domain, 5, 10, or 20 items. The item discrimination parameters were drawn from log N(0.6, 1.13), the item difficulties from N(0, 1), and the guessing parameters from Beta(4, 16).

Table 4-2

The Setting of Manipulated variables

Manipulated variables                                                   Setting
Fitted models                                                           MHO-IRT, HO-IRT, MM-IRT, M-IRT, MU-IRT, U-IRT
Number of domains                                                       2
Correlation between the background variables and the overall ability   0, 0.35, and 0.7
Test lengths in each domain                                             5, 10, and 20 items
Sample size                                                             1,000 and 4,000

Each simulated data set contained 1,000 or 4,000 simulated students, and each simulation used 30 replications. Fully crossing the levels of these five factors yielded 108 conditions. The manipulated variables are shown in Table 4-2.

4.1.2 Data generation

The simulation study is divided into two experiments. The first experiment investigates model parameter recovery with continuous background variables, and the second investigates model parameter recovery with dichotomous background variables. The data generation process of experiment one was as follows:

(1) The overall ability parameters and the background variables were randomly generated from the multivariate normal distribution described in equation 4.1.1.
(2) The domain ability parameters were generated by multiplying the overall ability by the corresponding factor loadings and then adding residuals drawn from independent distributions, according to equation (3.1.4).
(3) The item difficulties were drawn from N(0, 1), the item discrimination parameters from log N(0.6, 1.13), and the guessing parameters from Beta(4, 16).
(4) Given the item and person parameters, the probabilities of the item responses were computed according to the MHO-IRT model.
(5) The cumulative probability was computed and compared to a number randomly generated from the uniform(0, 1) distribution. If the random number was less than or equal to the cumulative probability, the simulated item response was recorded as endorsing that item.
(6) The number of examinees was N = 1000 or 4000.
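The generation steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the author's Matlab program: a single continuous background variable is used, the factor loadings `lam` are hypothetical, the residual variance is chosen so that each domain ability has unit variance, the value 1.13 is treated as the log-scale standard deviation of the lognormal distribution, and the response function is the standard logistic 3PL form.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_domains, n_items = 1000, 2, 10   # examinees, domains, items per domain
rho = 0.35                            # ability/background correlation
lam = np.array([0.8, 0.8])            # hypothetical factor loadings

# (1) Overall ability and one background variable from a bivariate normal.
cov = np.array([[1.0, rho], [rho, 1.0]])
theta, y = rng.multivariate_normal(np.zeros(2), cov, size=N).T

# (2) Domain ability = loading * overall ability + independent residual
#     (residual variance 1 - lam**2 is an assumption).
resid = rng.normal(size=(N, n_domains)) * np.sqrt(1.0 - lam**2)
theta_d = theta[:, None] * lam + resid

# (3) Item parameters; 1.13 is assumed to be the log-scale standard deviation.
a = rng.lognormal(0.6, 1.13, size=(n_domains, n_items))
b = rng.normal(0.0, 1.0, size=(n_domains, n_items))
c = rng.beta(4, 16, size=(n_domains, n_items))

# (4) 3PL success probabilities with shape (N, n_domains, n_items).
z = np.clip(a[None] * (theta_d[:, :, None] - b[None]), -35, 35)  # avoid overflow
p = c[None] + (1.0 - c[None]) / (1.0 + np.exp(-z))

# (5) A response is scored 1 when a uniform(0, 1) draw is <= the probability.
x = (rng.uniform(size=p.shape) <= p).astype(int)
```

With several background variables, step (1) would draw from the full multivariate normal distribution of equation 4.1.1 instead of the bivariate case used here.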

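The formula for generating the overall ability parameters and the background variables is given below (de la Torre, 2003):

$$\begin{pmatrix} \theta \\ Y \end{pmatrix} \sim MVN\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma_{\theta\theta} & \Sigma_{\theta Y} \\ \Sigma_{Y\theta} & \Sigma_{YY} \end{pmatrix} \right) \qquad (4.1.1)$$

$\Sigma_{YY}$ is designed to be $\Sigma_{YY} = I$. Using the properties of conditional distributions of a MVN distribution (Johnson & Wichern, 1997; Mardia, Kent, & Bibby, 1979), the conditional distribution of $\theta$ given $Y = y$ is

$$\theta \mid Y = y \sim MVN\left( \Sigma_{\theta Y} \Sigma_{YY}^{-1} y,\; \Sigma_{\theta\theta} - \Sigma_{\theta Y} \Sigma_{YY}^{-1} \Sigma_{Y\theta} \right) \qquad (4.1.2)$$

4.2 Experiment two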

4.2.1 Experiment Design

In experiment two, the data are generated from the higher-order item response (HO-IRT) model, with the group means and standard deviations presented in Table 4-3, and the background variables are dichotomous. The generated data are analyzed using the multilevel higher-order item response model (MHO-IRT), higher-order item response model (HO-IRT), multilevel multidimensional item response model (MM-IRT), multidimensional item response model (M-IRT), multilevel unidimensional item response model (MU-IRT) and unidimensional item response model (U-IRT), respectively.

The simulation study is conducted to investigate the feasibility of the HO-IRT model incorporating student background variables in the estimation process, and show how the estimates obtained from this model are affected by different factors.

Consequently, five factors with varied conditions were considered in the simulation study (de la Torre, 2009): (1) the generating model, MHO-IRT or HO-IRT; (2) the fitting model, one of the six models described above; (3) the sample size, 1,000 or 4,000; (4) the correlation between the background variables and the overall ability, 0, 0.35, or 0.7; and (5) the test length in each domain, 5, 10, or 20 items. The item discrimination parameters were drawn from log N(0.6, 1.13), the item difficulties from N(0, 1), and the guessing parameters from Beta(4, 16).

Each simulated data set contained 1,000 or 4,000 simulated students, and each simulation used 30 replications. Fully crossing the levels of these five factors yielded 108 conditions.

4.2.2 Data Generation

This simulation study investigates model parameter recovery with dichotomous background variables. The data-generating procedure of experiment two differs slightly from that of experiment one. The main difference is that the overall ability is generated from the distributions shown in Table 4-3. The data were generated following von Davier, Gonzalez, and Mislevy (2009). The data are assumed to be crossed by two known background characteristics: school type, with levels A and B, and parental socioeconomic status (SES), with levels H and L. This approach results in four (2 × 2) distinct groups. The average difference between school types A and B was 0.000 in this simulation, while the average difference based on parental SES was 1.414: the average was +0.707 for the high (H) SES group and -0.707 for the low (L) SES group. Table 4-3 presents the means and standard deviations used to generate the response data. The standard deviation within each of these groups was set to 0.707, which yields a variance of about 0.5 within each of the four groups and an overall variance and standard deviation of 1.000. The remaining steps are similar to those of experiment one. For ease of reading, dummy coding, which uses only ones and zeros, conveys all of the necessary information on group membership.

In this simulated dataset, "00" represents the group from school A with SES L, "01" represents the group from school B with SES L, "10" represents the group from school A with SES H, and "11" represents the group from school B with SES H.


Table 4-3

Means and standard deviations used to generate the simulated dataset

         School
SES      A                B                Average
L        -0.707 (0.707)   -0.707 (0.707)   -0.707 (0.707)
H        +0.707 (0.707)   +0.707 (0.707)   +0.707 (0.707)
Total     0.000 (1.000)    0.000 (1.000)    0.000 (1.000)

The data generation process of the second experiment is as follows:

(1) The overall ability parameters and the background variables are randomly generated from the normal distributions described in Table 4-3.
(2) The domain ability parameters are generated by multiplying the overall ability by the corresponding factor loadings and then adding residuals drawn from independent distributions, according to equation (3.1.4).
(3) The item difficulties are drawn from N(0, 1), the item discrimination parameters from log N(0.6, 1.13), and the guessing parameters from Beta(4, 16).
(4) Given the item and person parameters, the probabilities of the item responses are computed according to the MHO-IRT model.
(5) The cumulative probability is computed and compared to a number randomly generated from the uniform(0, 1) distribution. If the random number is less than or equal to the cumulative probability, the simulated item response is recorded as endorsing that item.
(6) The number of examinees is N = 1000 or 4000.

4.2.3 Analysis

Two criteria were used to assess the quality of the multilevel HO-IRT estimates of the overall, domain, and item parameters: (1) the correlation between the parameter estimates and the true parameters (i.e., the generating values); and (2) the RMSE of the parameter estimates.

The correlation between the parameter estimates and the true parameters is computed as:

$$Corr(\hat{\theta}, \theta) = \frac{cov(\hat{\theta}, \theta)}{\sigma_{\hat{\theta}}\, \sigma_{\theta}} \qquad (4.1.3)$$

For each estimator, the root mean square error (RMSE) is computed as:

$$RMSE(\hat{\theta}) = \sqrt{\sum_{r=1}^{R} (\hat{\theta}_r - \theta)^2 / R} \qquad (4.1.4)$$

where R is equal to 50 in evaluating model parameter recovery and 4000 in ability estimates and averaging approaches for each replication; $\hat{\theta}_r$ represents the estimated value; and $\theta$ is the generating value.
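Both criteria are straightforward to compute once the generating values and the estimates are available. The following Python sketch is illustrative only; the function name `recovery_stats` and the toy numbers are hypothetical, and the averaging runs over whichever set of R values applies.

```python
import numpy as np

def recovery_stats(true, est):
    """Return (correlation, RMSE) between generating values and estimates."""
    true = np.asarray(true, dtype=float)
    est = np.asarray(est, dtype=float)
    corr = np.corrcoef(true, est)[0, 1]           # cov / (sd_true * sd_est)
    rmse = np.sqrt(np.mean((est - true) ** 2))    # root of the mean squared error
    return corr, rmse

# Toy example: every estimate is 0.5 above its generating value.
corr, rmse = recovery_stats([0.0, 1.0, 2.0, 3.0], [0.5, 1.5, 2.5, 3.5])
```

The toy example shows why both criteria are reported: a constant bias leaves the correlation at 1 while the RMSE equals the size of the bias (0.5).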

All the chains are simulated using a Matlab program written by the author. The prior distributions of the model parameters are specified as described before.

Previous studies have recommended burn-in lengths between 500 and 1000 as reasonable for the MCMC procedure and for the application of IRT models (Bolt, Cohen, & Wollack, 2001; Kang & Cohen, 2007; Li, Bolt, & Fu, 2006; Raftery & Lewis, 1996). For each condition, 10,000 iterations are run so that the chains of the proposed models reach their stationary distributions, with the first 5,000 iterations discarded as burn-in. The estimates of the parameters are based on the mean of the remaining draws. For example, $\theta_i$ is estimated as

$$\hat{\theta}_i = \frac{1}{5000} \sum_{t=5001}^{10000} \theta_i^{(t)} \qquad (4.1.2)$$
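The point estimate is simply the average of the retained draws. A minimal Python sketch, assuming the draws are stored as an array with one row per iteration (the function name `posterior_mean` and the simulated chain are hypothetical):

```python
import numpy as np

def posterior_mean(chain, burn_in=5000):
    """Average the draws that remain after discarding the burn-in iterations."""
    chain = np.asarray(chain, dtype=float)
    return chain[burn_in:].mean(axis=0)   # rows 5000..9999 are iterations 5001..10000

rng = np.random.default_rng(3)
chain = 0.8 + 0.1 * rng.normal(size=(10000, 3))   # hypothetical draws for 3 parameters
chain[:5000] += 5.0                               # mimic early draws far from stationarity
est = posterior_mean(chain)
```

Because the first 5,000 rows are dropped, the artificially shifted early draws have no influence on `est`.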

Figures 4-1–4-4 show the plots of the draws against the iterations, and the autocorrelation lag for some selected parameters using the most complex of the six models, the MHO-IRT model. Figures 4-1 and 4-2 show that the chains have reached their stationary distributions after about 500-800 draws, whereas Figures 4-3 and 4-4 show a low autocorrelation after a lag of 1000. These plots show good indications that the selected lengths of burn-ins and iterations are justified. In addition, it is verified that the chain length used in conjunction with the algorithm for estimating the parameters of the MHO-IRT model is sufficiently long.
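Autocorrelation plots of this kind are based on the sample autocorrelation function of each chain. The sketch below shows one common estimator in Python; it is an illustration, not the routine used to produce the figures. The function name `autocorr` is hypothetical, and an AR(1) series with known lag-1 autocorrelation 0.9 stands in for an MCMC chain.

```python
import numpy as np

def autocorr(chain, max_lag):
    """Sample autocorrelation of a 1-D chain for lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x) / x.size
    return np.array([np.dot(x[: x.size - k], x[k:]) / (x.size * var)
                     for k in range(max_lag + 1)])

# Stand-in chain: AR(1) process with coefficient 0.9.
rng = np.random.default_rng(4)
phi = 0.9
chain = np.zeros(5000)
for t in range(1, chain.size):
    chain[t] = phi * chain[t - 1] + rng.normal()

acf = autocorr(chain, max_lag=50)
```

A chain whose autocorrelation decays quickly toward zero carries more independent information per draw, which is what the low autocorrelation at long lags indicates.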

Figure 4-1 Simulated Data – Markov chains for selected beta, overall and domain ability, and lambda parameters (panels: Beta, Overall ability, Domain ability, Lambda)

Figure 4-2 Simulated Data – Markov chains for selected discrimination, difficulty and guessing item parameters (panels: Discrimination, Difficulty, Guessing)

Figure 4-3 Simulated Data – Estimated autocorrelation function for selected beta, overall and domain ability, and lambda parameters (panels: Beta, Overall ability, Domain ability, Lambda)

Figure 4-4 Simulated Data – Estimated autocorrelation function for selected discrimination, difficulty and guessing item parameters (panels: Discrimination, Difficulty, Guessing)

4.3 Experiment three

This experiment attempts to apply the MHO-IRT model to fit the 2007 fourth-grade mathematics assessment data sets from the Taiwan Assessment of Student Achievement (TASA).

TASA is administered by the National Academy for Educational Research, Taiwan. TASA measures trends in Chinese, English, Mathematics, Social Science and Science achievements in Taiwan. The participants of TASA are shown in Figure 4-5.

In TASA 2007, the fourth-grade mathematics items contained four content domains (i.e., number, algebra, geometry, and statistics). The proposed MHO-IRT model is applied to simultaneously estimate the overall (mathematics proficiency) and four domain-specific proficiencies.

Figure 4-5 Participants in TASA

To ensure broad content coverage without extending the test time, TASA uses the NEAT design to assemble the test. As an example, consider the TASA 2007 fourth-grade mathematics assessment, in which each fourth-grade participant took a test booklet comprising two blocks of mathematics items. There were 96 mathematics items broken into 10 blocks with 16 items in the anchored block M and 8 items in the other blocks (M1-M10). The test booklets designed for TASA 2007 fourth-grade mathematics assessment are shown in Table 4-4.

Figure 4-5 contents: Taiwan Assessment of Student Achievement, with the subjects Chinese, English, Mathematics, Social science, and Science, each assessed at some or all of the following levels: 4th Grade, 6th Grade, 8th Grade, 11th Grade (vocational school), and 11th Grade (high school).

Table 4-4

The test booklets design for TASA 2007 fourth-grade mathematics assessment

Booklet ID   Block I   Block II   Booklet ID   Block I   Block II
S1           M         M1         S6           M         M6
S2           M         M2         S7           M         M7
S3           M         M3         S8           M         M8
S4           M         M4         S9           M         M9
S5           M         M5         S10          M         M10

Therefore, an individual fourth-grade student took 24 items. Furthermore, the small number of mathematics items taken by individual students was divided into four content domains. The number of items in each is as follows: 52 items in the number domain, 21 items in algebra, 24 items in geometry, and 23 items in statistics.

The six models were implemented in the real data experiment to estimate the overall and domain abilities. For TASA 2007, the correlations between the overall ability and the background variables were calculated, and the background variables with the highest and lowest correlations are listed in Table 4-5. This experiment uses the AIC, BIC, and DIC to identify the best-fitting model. The HO-IRT and U-IRT based models are used to estimate the overall ability, whereas the HO-IRT and M-IRT based models are used to estimate the domain abilities.
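For reference, the three fit indices have the following standard definitions, sketched here in Python. How the log-likelihood and deviance were obtained in this study is not shown in this section, so the inputs below are hypothetical numbers chosen only to illustrate the arithmetic; smaller values indicate better fit for all three indices.

```python
import numpy as np

def aic(log_lik, k):
    """Akaike information criterion: -2 log L + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian information criterion: -2 log L + k ln(n)."""
    return -2.0 * log_lik + k * np.log(n)

def dic(deviance_draws, deviance_at_mean):
    """Deviance information criterion: mean deviance plus effective number of parameters."""
    d_bar = np.mean(deviance_draws)      # posterior mean of the deviance
    p_d = d_bar - deviance_at_mean       # effective number of parameters
    return d_bar + p_d

# Hypothetical inputs: log-likelihood -1000, 10 parameters, 1000 examinees.
aic_value = aic(-1000.0, 10)
bic_value = bic(-1000.0, 10, 1000)
dic_value = dic([2010.0, 2020.0, 2030.0], 2005.0)
```

Here `deviance_draws` would hold the deviance evaluated at each retained MCMC draw, and `deviance_at_mean` the deviance evaluated at the posterior mean of the parameters.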

Table 4-5

The questionnaire items from the TASA 2007 fourth-grade assessment

Item number                  Questionnaire                                                              Correlation
Student Questionnaire 9_4    Do you have a dictionary in your home?                                     0.031
Student Questionnaire 10_1   In this semester, have you ever gone to a cram school or tutoring class?   0.000

CHAPTER 5 Results

The estimation procedures based on the hierarchical item response models are performed to evaluate the efficiency of the six proposed MCMC algorithms, with the estimation implemented under each of the different models. Estimates of the regression coefficient, the correlation between abilities, the overall ability, the domain abilities, and the item parameters are compared with the generating parameters to determine how accurately the estimation methods recover the parameters.

Three studies are performed in this research. The first experiment aims to investigate the model parameter recovery with continuous background variables. The second experiment aims to investigate the model parameter recovery with dichotomous background variables. The third experiment aims to implement the proposed model using real data to identify the best fit model.

5.1 Experiment one: Model parameter recovery with continuous background variables

5.1.1 Overall ability estimates

Four models are used to estimate the overall abilities of the examinees. Two measures are computed to compare the quality of the ability estimates obtained using the different methods: the correlation between the true and estimated overall abilities, and the root mean squared error of the estimates.

Table 5-1 lists the RMSE according to the following factors: (a) the number of items administered in each domain (three levels: n = 5, 10, and 20); (b) the fitting model (four models: the multilevel higher-order item response model (MHO-IRT), higher-order item response model (HO-IRT), multilevel unidimensional item response model (MU-IRT), and unidimensional item response model (U-IRT)); (c) the correlation between the background variables and the overall ability (three levels: 0, 0.35, and 0.7); and (d) the number of examinees (two levels: N = 1000 or 4000).

The MU-IRT and MHO-IRT model approaches provide more accurate estimates of the overall ability than the U-IRT and HO-IRT model approaches.

Compared to the U-IRT and HO-IRT models, the results indicate that the models including the background variables are relatively efficient. For example, with n = 20, N = 1000, and a correlation of 0.7, the RMSE decreases from 0.5020 for HO-IRT to 0.4440 for MHO-IRT, and from 0.5517 for U-IRT to 0.4567 for MU-IRT. The results indicate that better estimates are obtained with the MHO-IRT model approach than with the HO-IRT model approach.
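To put these RMSE differences on a common scale, the relative reduction can be computed directly from the reported values. The short Python calculation below is a worked illustration using the four RMSE values quoted above; the function name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Proportional decrease in RMSE relative to the baseline model."""
    return (baseline - improved) / baseline

# n = 20, N = 1000, correlation 0.7 (values taken from Table 5-1).
ho_to_mho = relative_reduction(0.5020, 0.4440)   # HO-IRT  -> MHO-IRT
u_to_mu = relative_reduction(0.5517, 0.4567)     # U-IRT   -> MU-IRT
```

For this condition the reduction is roughly 12% when moving from HO-IRT to MHO-IRT and roughly 17% when moving from U-IRT to MU-IRT.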

In addition, better estimates are obtained with longer tests, larger sample sizes, and higher correlations between the overall ability and the background variable. For example, when N = 1000, n = 5, and the correlation is 0.7, the RMSE is 0.7092 for the MHO-IRT model; when the number of items administered in each domain increases to 10, the RMSE decreases to 0.5730; and an even better estimate results for n = 20, for which the RMSE is 0.4440. A large sample size is more representative of the population.

Although higher correlations do not afford the same magnitude of efficiency as longer tests and larger sample sizes, improvement is achieved when the ability and background variables have a moderately high correlation (i.e., 0.7 in this study).

The correlations between the true and estimated abilities are given in Table 5-2. Table 5-2 reports the correlation between the true and estimated overall abilities according to the following factors: (a) the number of items administered in each domain (three levels: n = 5, 10, and 20); (b) the fitting model (four models: the multilevel higher-order item response model (MHO-IRT), higher-order item response model (HO-IRT), multilevel unidimensional item response model (MU-IRT), and unidimensional item response model (U-IRT)); (c) the correlation between the background variables and the overall ability (three levels: 0, 0.35, and 0.7); and (d) the number of examinees (two levels: N = 1000 or 4000).

Table 5-1

RMSE of four model overall ability estimates

Examinees   Correlation   Test length   MHO-IRT   HO-IRT   MU-IRT   U-IRT
1000        0.7           5             0.7092    0.7630   0.7484   0.8440
1000        0.7           10            0.5730    0.6317   0.5894   0.6806
1000        0.7           20            0.4440    0.5020   0.4567   0.5517
1000        0.35          5             0.7306    0.7630   0.8008   0.8440
1000        0.35          10            0.6095    0.6317   0.6213   0.6806
1000        0.35          20            0.4699    0.5020   0.4742   0.5517
1000        0             5             0.7646    0.7630   0.7946   0.8440
1000        0             10            0.6322    0.6317   0.5784   0.6806
1000        0             20            0.5030    0.5020   0.5179   0.5517
4000        0.7           5             0.6546    0.7080   0.6933   0.7878
4000        0.7           10            0.5151    0.5776   0.6337   0.7257
4000        0.7           20            0.3860    0.4483   0.6005   0.6954
4000        0.35          5             0.6744    0.7080   0.7420   0.7878
4000        0.35          10            0.5585    0.5776   0.6711   0.7257
4000        0.35          20            0.3658    0.4483   0.6186   0.6954
4000        0             5             0.7134    0.7080   0.7614   0.7878
4000        0             10            0.5764    0.5776   0.7248   0.7257
4000        0             20            0.4471    0.4483   0.6244   0.6954

Table 5-2 indicates that similar correlations are found between the true overall ability and its MHO-IRT and MU-IRT estimates, and the former is more precise than the latter in terms of having higher correlation. In addition, the precision of both the MHO-IRT and HO-IRT estimates improves with longer tests, larger sample sizes, and higher correlations. The different factors have similar impacts on the precision of the overall ability estimates and on the correlation between the true and estimated overall ability shown in Table 5-2. The results show that the use of the background variables provides better ability estimates than the method that ignores this information. In addition, further improvement can be achieved by simultaneously using these different sources of information.

Table 5-2

Correlation between true overall abilities and estimated overall abilities

Examinees   Correlation   Test length   MHO-IRT   HO-IRT   MU-IRT   U-IRT
1000        0.7           5             0.8014    0.7725   0.8080   0.7947
1000        0.7           10            0.8489    0.8080   0.8566   0.8041
1000        0.7           20            0.8932    0.8393   0.8760   0.8250
1000        0.35          5             0.7845    0.7725   0.8046   0.7947
1000        0.35          10            0.8432    0.8080   0.8533   0.8041
1000        0.35          20            0.8927    0.8393   0.8673   0.8250
1000        0             5             0.7775    0.7725   0.7938   0.7947
1000        0             10            0.8357    0.8080   0.8400   0.8041
1000        0             20            0.8795    0.8393   0.8514   0.8250
4000        0.7           5             0.8115    0.7894   0.8194   0.7917
4000        0.7           10            0.8516    0.8198   0.8581   0.8046
4000        0.7           20            0.8985    0.8462   0.8960   0.8415
4000        0.35          5             0.8025    0.7894   0.8061   0.7917
4000        0.35          10            0.8332    0.8198   0.8438   0.8046
4000        0.35          20            0.8966    0.8462   0.8719   0.8415
4000        0             5             0.7878    0.7894   0.8113   0.7917
4000        0             10            0.8381    0.8198   0.8480   0.8046
4000        0             20            0.8816    0.8462   0.8584   0.8415

Figure 5-1 Scatter Plots of True and Estimated overall abilities: N=1000, n=20, correlation = 0.7 (panels: MHO-IRT model, R=0.8932; HO-IRT model, R=0.8393; MU-IRT model, R=0.8760; U-IRT model, R=0.8250; axes: True ability, Estimated ability; legend: Data, Fit, Y = T)

Figure 5-2 Scatter Plots of True and Estimated overall abilities: N=4000, n=20, correlation = 0.7 (panels: MHO-IRT model, HO-IRT model, MU-IRT model, U-IRT model)

Figure 5-1 shows that much better estimates are obtained when N = 1000, n = 20,
