Significance and contribution - Markov Chain Monte Carlo Estimation for Multilevel Higher-Order Item Response Theory

CHAPTER 1 INTRODUCTION

1.3 Significance and contribution

As discussed above, this dissertation has two major contributions to educational measurement and research. First, a multilevel HO-IRT model is proposed to incorporate background variables into the HO-IRT model. Incorporating the background variables can lead to unbiased estimates of population parameters, more precise ability estimates, and consistent model parameter estimates.

Second, an MCMC algorithm is proposed to estimate the overall ability, domain abilities, item parameters, and latent regression coefficients simultaneously. The proposed MHO-IRT model is an integrated model that can better capture the structure of the data and provide efficient estimates. The MCMC procedure developed to estimate the parameters of this complex model can also be used in conjunction with various item response models. Moreover, this dissertation compares the proposed model with five other models, and the experimental results suggest the situations in which the proposed model is appropriate.

CHAPTER 2 LITERATURE REVIEW

This chapter first discusses a variety of dichotomous IRT, MIRT, and HO-IRT models. The principle of the multilevel method is then presented and implemented in unidimensional and multidimensional item response theory models, followed by the Markov chain Monte Carlo (MCMC) algorithm as implemented in these models. Finally, model fit criteria are described.

2.1 Item Response Models

2.1.1 Unidimensional Item Response Models

Item response theory now contains a large family of models. The simplest of these is the Rasch (1960) model, also known as the one-parameter logistic (1PL) model. For the Rasch model, the dependent variable is the dichotomous response of a particular person to a specified item. The 1PL function gives the prediction as follows:

$$P(X_{ij}=1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \qquad (2.1.1)$$

where $P(X_{ij}=1 \mid \theta_i, b_j)$ is the probability that examinee $i$ answers item $j$ correctly; $b_j$ is the difficulty parameter of item $j$; and $\theta_i$ is the ability parameter of examinee $i$ for the administered test.
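To make Equation (2.1.1) concrete, the short Python sketch below evaluates the Rasch success probability for a few ability and difficulty values. The function name `rasch_prob` is hypothetical and is not part of any IRT package.

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """P(X_ij = 1 | theta_i, b_j) under the Rasch (1PL) model, Eq. (2.1.1)."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

print(rasch_prob(0.5, 0.5))             # 0.5: ability equal to difficulty
print(round(rasch_prob(1.5, 0.5), 4))   # 0.7311: one logit above the item
print(round(rasch_prob(-0.5, 0.5), 4))  # 0.2689: one logit below the item
```

When ability equals difficulty the probability is exactly 0.5, which is why $b_j$ can be read as the ability level at which an examinee has an even chance of answering item $j$ correctly.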

In the two-parameter logistic model (2PL), item discrimination is included in the measurement model. The model includes two parameters to represent item properties.

Both item difficulty, b_j, and item discrimination, a_j, are included in the exponential form of the logistic model (Birnbaum, 1968), as follows:

$$P(X_{ij}=1 \mid \theta_i, a_j, b_j) = \frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]} \qquad (2.1.2)$$

Notice that the item discrimination is a multiplier of the difference between trait level and item difficulty. Item discriminations are related to the biserial correlations between item responses and total scores.

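When a third parameter, the guessing parameter $c_j$, is added to the 2PL model, it becomes the three-parameter logistic (3PL) IRT model (Lord, 1980):

$$P(X_{ij}=1 \mid \theta_i, a_j, b_j, c_j) = c_j + (1 - c_j)\frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]}$$

where $c_j$ is the lower asymptote of the item response function, that is, the probability that an examinee answers item $j$ correctly as his or her ability level reaches the low extreme.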
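The hypothetical helper `irt_3pl` below evaluates $c_j + (1 - c_j)\,\mathrm{logistic}(a_j(\theta_i - b_j))$. Setting $c_j = 0$ recovers the 2PL of Equation (2.1.2), and additionally setting $a_j = 1$ recovers the Rasch model of Equation (2.1.1). The example is a sketch showing two points made in the text: $a_j$ scales the distance $\theta_i - b_j$, and the curve levels off at $c_j$ for very low abilities.

```python
import math

def irt_3pl(theta, a=1.0, b=0.0, c=0.0):
    """P(X_ij = 1) = c + (1 - c) * logistic(a * (theta - b)).

    c = 0 gives the 2PL; a = 1 and c = 0 give the Rasch/1PL model.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# The discrimination a_j multiplies theta_i - b_j, so the same distance
# produces a larger change in probability for a more discriminating item.
print(round(irt_3pl(1.0, a=0.5), 4))           # 0.6225
print(round(irt_3pl(1.0, a=2.0), 4))           # 0.8808
# For very low ability the 3PL curve levels off at the guessing parameter c_j.
print(round(irt_3pl(-10.0, a=1.2, c=0.2), 4))  # 0.2
```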

Patz and Junker (1999a) described a general Markov chain Monte Carlo strategy, based on Metropolis-Hastings sampling, for Bayesian inference in complex item response theory settings. They demonstrated the basic MCMC methodology using the two-parameter logistic (2PL) model. Patz and Junker (1999b) extended their basic methodology to address issues such as non-response, designed missingness, multiple raters, guessing behavior, and partial credit (polytomous) test items. The MCMC algorithm for the unidimensional 3PL model is described below.

The prior distributions of the ability and item parameters are given below. In a Bayesian framework, the estimation method can be expressed as follows (Patz & Junker, 1999a): formulation, the marginal distribution of domain ability can be shown to be the standard normal distribution.

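The joint posterior distribution of the parameters, given the observed item responses $\mathbf{X}$, can be expressed as

$$p(\boldsymbol{\theta}, \mathbf{IP} \mid \mathbf{X}) \propto p(\mathbf{X} \mid \boldsymbol{\theta}, \mathbf{IP})\, p(\boldsymbol{\theta})\, p(\mathbf{IP})$$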

The full conditional distributions of $\boldsymbol{\theta}$ and $\mathbf{IP}$ are derived as follows:

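At iteration $t$, the MCMC algorithm proceeds as follows.

Since $\boldsymbol{\theta}$ has independent components, the sampling can be done one examinee at a time. For examinee $i$, sample $\theta_i^*$ from $N(\theta_i^{(t-1)}, \sigma_c^2)$, where $\sigma_c^2$ is the fixed variance of the candidate-generating distribution, and accept $\theta_i^*$ with probability

$$\min\left\{\frac{p(\mathbf{x}_i \mid \theta_i^*, \mathbf{IP})\, p(\theta_i^*)}{p(\mathbf{x}_i \mid \theta_i^{(t-1)}, \mathbf{IP})\, p(\theta_i^{(t-1)})}, 1\right\}$$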
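A minimal Python sketch of this random-walk Metropolis step follows. It assumes a $N(0, 1)$ prior on each $\theta_i$, holds the 3PL item parameters fixed at their current values (in the full sampler they are updated in their own steps), and uses the hypothetical names `log_lik_3pl`, `mh_update_theta`, and `sigma_c` (the proposal standard deviation). For the demonstration, the item parameters are simply the generating values.

```python
import numpy as np

def log_lik_3pl(x_i, theta, a, b, c):
    """Log-likelihood of one examinee's 0/1 response vector x_i under the 3PL model."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return np.sum(x_i * np.log(p) + (1 - x_i) * np.log1p(-p))

def mh_update_theta(X, theta, a, b, c, sigma_c, rng):
    """One random-walk Metropolis step for every examinee (assumes theta_i ~ N(0, 1))."""
    theta_new = theta.copy()
    n_accept = 0
    for i in range(X.shape[0]):
        proposal = rng.normal(theta[i], sigma_c)          # theta_i* ~ N(theta_i^(t-1), sigma_c^2)
        # log-likelihood plus log N(0, 1) prior (up to a constant), proposal minus current
        log_ratio = (log_lik_3pl(X[i], proposal, a, b, c) - 0.5 * proposal**2) \
            - (log_lik_3pl(X[i], theta[i], a, b, c) - 0.5 * theta[i]**2)
        if np.log(rng.uniform()) < min(0.0, log_ratio):   # accept with prob min{ratio, 1}
            theta_new[i] = proposal
            n_accept += 1
    return theta_new, n_accept / X.shape[0]

# Demonstration on simulated data (hypothetical sizes and generating values).
rng = np.random.default_rng(1)
n, J = 200, 30
a, b, c = rng.uniform(0.8, 2.0, J), rng.normal(0.0, 1.0, J), np.full(J, 0.15)
true_theta = rng.normal(0.0, 1.0, n)
P = c + (1 - c) / (1 + np.exp(-a * (true_theta[:, None] - b)))
X = (rng.uniform(size=P.shape) < P).astype(int)

theta, draws = np.zeros(n), []
for t in range(300):
    theta, acc = mh_update_theta(X, theta, a, b, c, sigma_c=0.5, rng=rng)
    if t >= 100:                                           # discard burn-in iterations
        draws.append(theta)
eap = np.mean(draws, axis=0)                               # posterior-mean ability estimates
print(np.corrcoef(eap, true_theta)[0, 1])
```

Because the proposal is symmetric, the proposal densities cancel and only the likelihood-times-prior ratio enters the acceptance test; `sigma_c` would normally be tuned so that a moderate fraction of proposals is accepted.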

2.1.2 Multidimensional Item Response Models

Many assessments are designed to report not only overall ability but also domain abilities on a few domains or subskills, with a certain number of items in each domain.

Multidimensional IRT (MIRT) models use two or more parameters and their covariance structure to represent each person's trait level. The Multidimensional One-Parameter Logistic Model (M1PLM; McKinley & Reckase, 1982) can be expressed as:

$$P(X_{ij}=1 \mid \boldsymbol{\theta}_i, b_j) = \frac{\exp(\mathbf{1}'\boldsymbol{\theta}_i - b_j)}{1 + \exp(\mathbf{1}'\boldsymbol{\theta}_i - b_j)} \qquad (2.1.15)$$

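where $P(X_{ij}=1 \mid \boldsymbol{\theta}_i, b_j)$ is the probability of a correct response; $\boldsymbol{\theta}_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{ip})'$ denotes the $p$-dimensional abilities of examinee $i$; $b_j$ is the difficulty parameter of item $j$; and $\mathbf{1}$ is a $p \times 1$ vector of 1's.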

In addition to the M1PLM, Adams, Wilson, and Wang (1997) proposed the multidimensional random coefficients multinomial logit model (MRCMLM) for Rasch family models. As a member of the exponential family of distributions, the MRCMLM can be viewed as a generalized linear mixed model (De Boeck & Wilson, 2004; McCulloch & Searle, 2001; Rijmen, Tuerlinckx, De Boeck, & Kuppens, 2003; Wang & Wilson, 2005a; Wang & Wilson, 2005b). The Rasch testlet model (Wang & Wilson, 2005b), the linear logistic test model (LLTM; Fischer, 1973), the rating scale model (RSM; Andrich, 1978), and the partial credit model (PCM; Masters, 1982) are all special cases of the MRCMLM. The model can be expressed as

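$$P(X_{ijk}=1 \mid \boldsymbol{\theta}_i, \boldsymbol{\xi}) = \frac{\exp(\mathbf{b}_{jk}'\boldsymbol{\theta}_i + \mathbf{a}_{jk}'\boldsymbol{\xi})}{\sum_{u=1}^{O_j}\exp(\mathbf{b}_{ju}'\boldsymbol{\theta}_i + \mathbf{a}_{ju}'\boldsymbol{\xi})} \qquad (2.1.16)$$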

where $P(X_{ijk}=1 \mid \boldsymbol{\theta}_i, \boldsymbol{\xi})$ is the probability of a response in category $k$ of item $j$ for examinee $i$; $O_j$ is the number of categories of item $j$; $\boldsymbol{\xi}$ is a vector of difficulty parameters; $\mathbf{b}_{jk}$ is the score vector assigned to category $k$ of item $j$ across the $P$ abilities; and $\mathbf{a}_{jk}$ is the design vector assigned to category $k$ of item $j$, which combines the elements of $\boldsymbol{\xi}$ into a linear relationship. The commercial computer program ConQuest (Wu, Adams, & Wilson, 1998) can be used to calibrate the parameters of the MRCMLM.
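To illustrate how the score vectors $\mathbf{b}_{jk}$ and design vectors $\mathbf{a}_{jk}$ enter Equation (2.1.16), the Python sketch below computes the category probabilities of a single item. The function name is hypothetical, and the example assumes the common convention in which design entries of $-1$ make $\boldsymbol{\xi}$ act as step difficulties, so that the item behaves as a three-category partial credit item; this is not ConQuest's input format.

```python
import numpy as np

def mrcmlm_probs(theta, B_j, A_j, xi):
    """Category probabilities of one item under the MRCMLM, Eq. (2.1.16).

    theta : (P,)      ability vector of examinee i
    B_j   : (O_j, P)  score vectors b_jk, one row per category k
    A_j   : (O_j, Q)  design vectors a_jk, one row per category k
    xi    : (Q,)      item parameter vector
    """
    eta = B_j @ theta + A_j @ xi          # b_jk' theta + a_jk' xi for every category
    expo = np.exp(eta - eta.max())        # subtract the max for numerical stability
    return expo / expo.sum()              # normalise over the O_j categories

# A three-category partial credit item (scores 0, 1, 2) on one dimension.
B_j = np.array([[0.0], [1.0], [2.0]])
A_j = np.array([[0.0, 0.0], [-1.0, 0.0], [-1.0, -1.0]])
xi = np.array([-0.5, 0.5])                # hypothetical step difficulties
p = mrcmlm_probs(np.array([0.0]), B_j, A_j, xi)
print(p)                                  # approx. [0.274 0.452 0.274]
```

With two categories, score vectors $(0, 1)$, and design vectors $(0, -1)$, the same function reduces to the dichotomous Rasch probability of Equation (2.1.1).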

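When item discrimination is included in the model, it becomes the Multidimensional Two-Parameter Logistic Model (M2PLM; Reckase, 1997), defined as follows:

$$P(X_{ij}=1 \mid \boldsymbol{\theta}_i, \mathbf{a}_j, b_j) = \frac{\exp(\mathbf{a}_j'\boldsymbol{\theta}_i - b_j)}{1 + \exp(\mathbf{a}_j'\boldsymbol{\theta}_i - b_j)} \qquad (2.1.17)$$

where $\mathbf{a}_j$ is the $p \times 1$ vector of discrimination parameters of item $j$.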
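Assuming the M2PLM takes the form $\mathrm{logistic}(\mathbf{a}_j'\boldsymbol{\theta}_i - b_j)$, the hypothetical Python helper below evaluates it and shows that setting $\mathbf{a}_j = \mathbf{1}$ reproduces the M1PLM of Equation (2.1.15). The ability and slope values are illustrative only.

```python
import numpy as np

def m2pl_prob(theta_i, a_j, b_j):
    """P(X_ij = 1) assuming the M2PLM form logistic(a_j' theta_i - b_j)."""
    return 1.0 / (1.0 + np.exp(-(a_j @ theta_i - b_j)))

theta_i = np.array([0.5, -0.2, 1.0])      # p = 3 abilities (hypothetical values)
ones = np.ones(3)

# a_j = 1 gives the M1PLM: only the sum of the abilities matters.
print(m2pl_prob(theta_i, ones, 0.3))
# Unequal slopes weight the dimensions differently; a zero slope ignores a dimension.
print(m2pl_prob(theta_i, np.array([1.5, 0.0, 0.4]), 0.3))
```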

Reckase (1985) proposed a multidimensional IRT model as an extension of the 3PL. In his original formulation, a single item can measure two or more abilities.

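Extending the 3PL model to a multidimensional context, Reckase (1997) formulated the linear logistic multidimensional model as:

$$P(X_{ij}=1 \mid \boldsymbol{\theta}_i, \mathbf{a}_j, b_j, c_j) = c_j + (1 - c_j)\frac{\exp(\mathbf{a}_j'\boldsymbol{\theta}_i - b_j)}{1 + \exp(\mathbf{a}_j'\boldsymbol{\theta}_i - b_j)}$$

2.1.3 Higher-Order Item Response Model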

The HO-IRT model was developed for simultaneous estimation of the overall and domain abilities. In the HO-IRT model, a test is viewed as consisting of several unidimensional subdomains. That is, a single domain-specific ability $\theta_i^{(d)}$ accounts for examinee $i$'s performance on domain $d$, where $d = 1, 2, \ldots, D$. When different domains measure the same ability, the entire test is deemed unidimensional.

The correlations between different domain abilities can be accounted for by positing a higher-order ability $\theta_i$ that is viewed as the examinee's overall ability. Specifically, the domain abilities are expressed as linear functions of the overall ability (de la Torre & Song, 2009).

guaranteed to follow the same distribution as $\theta_i$, namely, the standard normal distribution $N(0,1)$. It is also assumed that the domain-level abilities are independent of each other given the overall ability. The correlation between the overall and domain abilities is given by $\lambda^{(d)}$, whereas the correlation between domain abilities $d$ and $d'$ is $\lambda^{(d)}\lambda^{(d')}$. Although $\lambda^{(d)}$ can be negative, it is expected to be non-negative in most educational applications, where domain abilities are positively correlated with the overall ability.
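These correlation properties can be checked by simulation. The Python sketch below assumes the linear form $\theta_i^{(d)} = \lambda^{(d)}\theta_i + \varepsilon_i^{(d)}$ with $\varepsilon_i^{(d)} \sim N(0, 1 - (\lambda^{(d)})^2)$ and $\theta_i \sim N(0, 1)$, which is the same form as Equation (3.1.4) later in this dissertation; the $\lambda^{(d)}$ values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(2009)
n = 100_000
lam = np.array([0.9, 0.7, 0.5])                  # hypothetical lambda^(d), D = 3 domains

theta = rng.standard_normal(n)                   # overall ability, N(0, 1)
eps = rng.standard_normal((n, lam.size)) * np.sqrt(1 - lam**2)
theta_d = theta[:, None] * lam + eps             # theta_i^(d) = lambda^(d) theta_i + eps

R = np.corrcoef(np.column_stack([theta, theta_d]), rowvar=False)
print(theta_d.std(axis=0))       # each close to 1: domain abilities stay standard normal
print(R[0, 1:])                  # close to lambda = [0.9, 0.7, 0.5]
print(R[1, 2], lam[0] * lam[1])  # close to 0.9 * 0.7 = 0.63
```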

The diagrammatic representation of the HO-IRT model is shown in Figure 2-1.

The first level of the figure shows the response of examinee $i$ to the $j$th item in domain $d$. On the second level, an examinee's domain-level responses are linked to the examinee's domain-specific ability $\theta_i^{(d)}$ and the item parameters $IP_j^{(d)}$ via IRT models. On the third level of the figure, the examinee's domain ability is related to his or her overall ability $\theta_i$ by the latent regression parameter $\lambda^{(d)}$.

Figure 2-1 A HO-IRT model applied to a D-domain test

Adapted from "A higher-order item response model: development and application," by Song, H., 2007, doctoral dissertation, The State University of New Jersey.

2.2 Multilevel IRT Models

Several methods are currently available for improving the estimation of domain abilities. The core idea shared by these methods is to incorporate background variables into the estimation process to improve the estimation of item parameters and person abilities (e.g., Mislevy, 1987; Mislevy & Sheehan, 1989; Adams, Wilson & Wu, 1997; de la Torre, 2009; von Davier & Sinharay, 2010). Background variables include examinees' demographic and educational background variables, examinees' performance on the overall test or on other subtests, and the correlation structure of the underlying abilities that are best estimated by the IRT scale scores (de la Torre, 2003; Wainer et al., 2001). Two kinds of models are reviewed in this section: the multilevel unidimensional IRT model (Mislevy, Johnson, & Muraki, 1992) and the multilevel multidimensional IRT model (de la Torre, 2009).

2.2.1 Multilevel Unidimensional IRT Model

Currently, several methods are available that are intended to provide more precise and reliable estimates by incorporating background variables. Research evidence has shown that incorporating student demographic and educational variables in the estimation process can lead to unbiased estimates of population parameters, more precise ability estimates, and consistent parameter estimates (Mislevy, 1984; Mislevy, 1987; Mislevy & Sheehan, 1989). Hierarchical modeling frameworks allow different models to be specified at different levels of the hierarchy, and IRT models are an example of such an approach: they integrate two models specified at two levels. At the first level is the item response function, which relates the examinee's ability and the item characteristics to the probability of a particular response; at the second level is the distribution function, which characterizes how ability is distributed in the population. The former can be viewed as modeling within-person variability and the latter as modeling between-person variability (Adams, Wilson, & Wu, 1997).

This idea was implemented in the scaling process for the National Assessment of Educational Progress (NAEP) (Mislevy, Johnson, & Muraki, 1992; Gonzalez, Galia, & Li, 2004). The NAEP scaling approach was originally devised for reporting population abilities on the overall test or test domains (Mislevy, Johnson, & Muraki, 1992). Instead of estimating ability for individual examinees, NAEP generates consistent estimates of population characteristics using marginal estimation techniques. The basic idea of the NAEP scaling procedure is to improve ability estimation by incorporating ancillary information from background surveys, an approach known as the plausible values methodology.

Plausible values methodology was developed as a way to address this issue by using all available data to estimate directly the characteristics of student populations and subpopulations, and then generating multiple imputed scores, called plausible values, from these distributions that can be used in analyses with standard statistical software. A detailed review of plausible values methodology was given in Mislevy (1991).

Suppose a sample statistic $t(\boldsymbol{\theta}, \mathbf{Y})$ is used to estimate a corresponding population parameter $T$, where $\boldsymbol{\theta}$ represents the latent ability values of all sampled examinees and $\mathbf{Y}$ represents the students' background variables. By treating $\boldsymbol{\theta}$ as missing data (Rubin, 1987), $t(\boldsymbol{\theta}, \mathbf{Y})$ can be evaluated through multiple imputation, and the imputed values are the plausible values. $t(\boldsymbol{\theta}, \mathbf{Y})$ is estimated by its expectation conditional on the observed data $(\mathbf{X}, \mathbf{Y})$:

$$E[t(\boldsymbol{\theta}, \mathbf{Y}) \mid \mathbf{X}, \mathbf{Y}] = \int t(\boldsymbol{\theta}, \mathbf{Y})\, p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})\, d\boldsymbol{\theta}$$

where $\mathbf{X}$ represents the responses of all sampled examinees to the test items and $\boldsymbol{\theta}$ is the vector of unknown abilities. In IRT measurement models, closed-form solutions for this equation are not available. Instead, the integral can be approximated by a Monte Carlo procedure that draws randomly from the conditional distribution $p(\theta_i \mid \mathbf{x}_i, \mathbf{y}_i)$ for each sampled examinee $i$. The posterior distribution $p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})$ is obtained by applying Bayes' theorem to the IRT measurement model and the observed background variables, with the item parameters treated as known values.

Assume that $p(\theta \mid \mathbf{y}_i)$ is a normal distribution and that $\theta_i$ is a linear function of the background variables $\mathbf{y}$ and their interactions, denoted by $\mathbf{y}_i^c$:

$$\theta_i = \boldsymbol{\Gamma}'\mathbf{y}_i^c + \varepsilon_i \qquad (2.2.3)$$

where $\varepsilon_i$ is assumed to be normally distributed with mean 0 and variance $\boldsymbol{\Sigma}$. $\boldsymbol{\Gamma}$ and $\boldsymbol{\Sigma}$ are parameters that can be estimated through maximum likelihood or Bayesian estimation procedures (see Mislevy, Johnson, & Muraki, 1992, p. 140 for details). The normalized likelihoods are used both for the estimation of $\boldsymbol{\Gamma}$ and $\boldsymbol{\Sigma}$ and for the generation of plausible values.
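The Python sketch below illustrates the Monte Carlo step with plausible values under simplifying assumptions that are not NAEP's operational procedure: a Rasch measurement model with known item difficulties, the latent regression of Equation (2.2.3) with a single ability so that $\boldsymbol{\Sigma}$ reduces to a variance $\sigma^2$, $\boldsymbol{\Gamma}$ and $\sigma$ treated as known, and a discrete grid approximation of each examinee's posterior. The statistic $t$ is taken to be the population mean, and the point estimate averages $t$ over the imputations. All function and variable names are hypothetical.

```python
import numpy as np

def draw_plausible_values(X, Y, b, gamma, sigma, n_pv=5, rng=None,
                          grid=np.linspace(-5.0, 5.0, 201)):
    """Draw n_pv plausible values per examinee from p(theta_i | x_i, y_i).

    Assumes a Rasch model with known difficulties b and the latent regression
    prior theta_i | y_i ~ N(y_i' gamma, sigma^2) with gamma and sigma known.
    The posterior is approximated on a discrete grid of theta values.
    """
    rng = np.random.default_rng() if rng is None else rng
    P = 1.0 / (1.0 + np.exp(-(grid[:, None] - b)))            # (G, J) success probabilities
    loglik = X @ np.log(P).T + (1 - X) @ np.log1p(-P).T        # (N, G) log p(x_i | theta)
    mu = Y @ gamma                                             # (N,) prior means y_i' gamma
    logprior = -0.5 * ((grid[None, :] - mu[:, None]) / sigma) ** 2
    logpost = loglik + logprior
    post = np.exp(logpost - logpost.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                    # normalised posterior per examinee
    cdf = post.cumsum(axis=1)
    u = rng.uniform(size=(X.shape[0], n_pv))
    idx = np.array([np.searchsorted(cdf[i], u[i]) for i in range(X.shape[0])])
    return grid[np.minimum(idx, grid.size - 1)]                # (N, n_pv) plausible values

# Simulated example with hypothetical values: one binary background variable.
rng = np.random.default_rng(7)
N, J = 1000, 20
Y = np.column_stack([np.ones(N), rng.integers(0, 2, N)])      # intercept + group indicator
gamma, sigma = np.array([0.0, 0.5]), 0.8
theta = Y @ gamma + rng.normal(0.0, sigma, N)
b = np.linspace(-2.0, 2.0, J)
X = (rng.uniform(size=(N, J)) < 1.0 / (1.0 + np.exp(-(theta[:, None] - b)))).astype(float)

pv = draw_plausible_values(X, Y, b, gamma, sigma, rng=rng)
t_per_imputation = pv.mean(axis=0)       # t(theta, Y) = population mean, one per imputation
print(t_per_imputation.mean(), theta.mean())
```

Each column of `pv` is one complete imputed data set, so any statistic can be computed once per column with standard software and then combined across columns.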

2.2.2 Multilevel Multidimensional IRT Model

de la Torre and Patz (2005) devised a method to improve the estimation of domain abilities by incorporating the correlation structure of the abilities. de la Torre (2009) proposed a general framework for ability estimation in which covariates (background variables) and the correlation structure of the abilities are incorporated into the estimation process within an integrated framework.

The extension of the 3PL model to the multidimensional context (Reckase, 1997) is given by

The prior distributions of the ability parameters are given below. For examinee i with ability θ_i,

The parameters were estimated using MCMC. An outline of the MCMC algorithm follows.

Iteration 0:

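1. Assign the following initial values to the parameters: the regression coefficients are set to 0, the covariance matrix is set to $\mathbf{I}$, and the abilities $\boldsymbol{\theta}$ are set to random draws from $MVN(\mathbf{0}, \mathbf{I})$.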

Iteration t:

2. For the regression parameters, the full conditional distribution of

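4. Finally, since $\boldsymbol{\theta}$ has independent components, the sampling can be done one examinee at a time. For examinee $i$, sample $\boldsymbol{\theta}_i^*$ from $MVN(\boldsymbol{\theta}_i^{(t-1)}, \boldsymbol{\Sigma}_c)$, where $\boldsymbol{\Sigma}_c$ is the fixed scale of the candidate-generating distribution. Accept the draw with probability

$$\min\left\{\frac{p(\mathbf{x}_i \mid \boldsymbol{\theta}_i^*)\, p(\boldsymbol{\theta}_i^* \mid \mathbf{y}_i)}{p(\mathbf{x}_i \mid \boldsymbol{\theta}_i^{(t-1)})\, p(\boldsymbol{\theta}_i^{(t-1)} \mid \mathbf{y}_i)}, 1\right\}$$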

Gelman, Carlin, Stern, and Rubin (1995, p. 409) provide an alternative method of sampling from this full conditional distribution that avoids the use of the Kronecker product. By recasting the matrices as follows,

where y_i is the background variables vector of examinee i (i.e., the transpose

They also suggested the use of matrix factorization to avoid inversion of large matrices in this algorithm.
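Because the recast matrices of Gelman et al. are not reproduced above, the following Python sketch should be read only as an illustration of the general idea under stated assumptions. For the multivariate regression $\boldsymbol{\Theta} = \mathbf{Y}\mathbf{B} + \mathbf{E}$, with rows of $\mathbf{E} \sim MVN(\mathbf{0}, \boldsymbol{\Sigma})$ and a flat prior on $\mathbf{B}$, the full conditional satisfies $\mathrm{vec}(\mathbf{B}) \sim N(\mathrm{vec}(\hat{\mathbf{B}}), \boldsymbol{\Sigma} \otimes (\mathbf{Y}'\mathbf{Y})^{-1})$. A draw can be formed from a QR factor of $\mathbf{Y}$ and a Cholesky factor of $\boldsymbol{\Sigma}$, so neither the Kronecker product nor $(\mathbf{Y}'\mathbf{Y})^{-1}$ is ever built. The function name is hypothetical.

```python
import numpy as np

def draw_regression_coefs(Y, Theta, Sigma, rng):
    """One draw of B from p(B | Theta, Sigma) for Theta = Y B + E, rows of E ~ MVN(0, Sigma).

    Assumes a flat prior on B, so vec(B) ~ N(vec(B_hat), Sigma kron (Y'Y)^{-1}).
    With Y = QR (so Y'Y = R'R) and Sigma = L L', the draw
    B = B_hat + R^{-1} Z L' has exactly that covariance.
    """
    Q, R = np.linalg.qr(Y)                      # reduced QR: R is K x K upper-triangular
    B_hat = np.linalg.solve(R, Q.T @ Theta)     # least-squares estimate, K x D
    L = np.linalg.cholesky(Sigma)               # lower-triangular Cholesky factor of Sigma
    Z = rng.standard_normal(B_hat.shape)        # K x D independent N(0, 1) draws
    return B_hat + np.linalg.solve(R, Z) @ L.T

# Check against the explicit Kronecker form on a small hypothetical problem.
rng = np.random.default_rng(0)
N = 50
Y = np.column_stack([np.ones(N), rng.normal(size=N)])     # K = 2 background columns
Theta = rng.normal(size=(N, 2))                            # D = 2 abilities
Sigma = np.array([[1.0, 0.4], [0.4, 0.5]])
draws = np.array([draw_regression_coefs(Y, Theta, Sigma, rng).ravel(order="F")
                  for _ in range(10000)])                  # column-stacked vec(B)
target = np.kron(Sigma, np.linalg.inv(Y.T @ Y))
emp = np.cov(draws, rowvar=False)
print(np.abs(emp - target).max())
```

The explicit Kronecker product appears only in the check; inside a sampler, the QR factor of $\mathbf{Y}$ would be computed once and reused at every iteration.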

In summary, both kinds of approach incorporate background variables into the estimation process in order to obtain more precise and reliable estimates of domain abilities. However, these methods are based on unidimensional or multidimensional IRT models, and none of them estimates the overall ability together with the domain abilities. This study therefore proposes a multilevel higher-order IRT estimation method.

2.3 Model fit

There is usually uncertainty about the appropriate error structure and the predictor variables to include in a model. Adding more parameters may improve fit, but possibly at the expense of identifiability and generalizability. Model selection criteria assess whether improvements in fit measures such as likelihoods, deviances, or error sums of squares justify the inclusion of extra parameters in a model. Both classical and Bayesian model choice methods may compare either measures of fit to the current data or cross-validatory fit to out-of-sample data. For example, the deviance statistics of generalized linear models (with Poisson, normal, binomial, or other exponential family outcomes) follow standard densities for comparisons of models nested within one another, at least approximately in large samples (McCullagh & Nelder, 1989).

Penalized measures of fit (Akaike, 1973) may also be used; these adjust the model log-likelihood or deviance to reflect the number of parameters in the model (Congdon, 2003).

In this dissertation, three criteria were used to assess model fit: (1) Akaike's information criterion, AIC (Congdon, 2003); (2) the Bayesian information criterion, BIC (Congdon, 2003); and (3) the deviance information criterion, DIC (Spiegelhalter, Best, & Carlin, 1998).

Let $L$ denote the likelihood and $D$ the deviance of a model involving $p$ parameters. The deviance may be defined simply as minus twice the log-likelihood, $D = -2\log L$. To allow for the number of parameters, one may then use criteria such as the Akaike Information Criterion (AIC), expressed as

$$\mathrm{AIC}(\text{Model}) = D + 2p \qquad (4.1.5)$$

So when the AIC is used to compare models, an increase in likelihood (a reduction in deviance) is offset by a greater penalty for more complex models. Another criterion, used generally as a penalized fit measure but also justified as an asymptotic approximation to the Bayesian posterior probability of a model, is the Schwarz Information Criterion (Schwarz, 1978), often called the Bayesian Information Criterion. Depending on the simplifying assumptions made, it may take different forms, but the most common version, for a sample of size $N$, is

$$\mathrm{BIC}(\text{Model}) = D + p\log(N) \qquad (4.1.6)$$

Spiegelhalter, Best and Carlin (1998) have developed a Bayesian alternative to both AIC and BIC, based on the deviance and called DIC. This criterion is more satisfactory than the two former alternatives because it takes into account the prior information and gives a natural penalization factor to the log-likelihood.

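$$\mathrm{DIC}(\text{Model}) = \bar{D} + p_D \qquad (4.1.7)$$

where $\bar{D}$ is the posterior mean of the deviance over the MCMC draws, and $p_D$, the effective number of parameters, is the difference between the posterior mean deviance and the deviance evaluated at the posterior mean of the parameters, $p_D = \bar{D} - D(\bar{\boldsymbol{\theta}})$.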
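The three criteria can be computed from MCMC output as in the Python sketch below. As an assumption, AIC and BIC are evaluated with the deviance at the posterior mean standing in for the deviance at the maximum likelihood estimate, and $\log$ is the natural logarithm; the function name and the input values are hypothetical.

```python
import numpy as np

def fit_criteria(deviance_draws, deviance_at_mean, n_params, n_obs):
    """AIC, BIC and DIC from MCMC output.

    deviance_draws   : D = -2 log L evaluated at each retained MCMC draw
    deviance_at_mean : D evaluated at the posterior mean of the parameters
    n_params, n_obs  : p and N in the AIC and BIC formulas
    """
    d_bar = float(np.mean(deviance_draws))        # posterior mean deviance, D-bar
    p_d = d_bar - deviance_at_mean                # effective number of parameters, p_D
    return {
        "AIC": deviance_at_mean + 2 * n_params,
        "BIC": deviance_at_mean + n_params * np.log(n_obs),
        "DIC": d_bar + p_d,
        "pD": p_d,
    }

crit = fit_criteria(np.array([1010.0, 1020.0, 1030.0]), 1000.0, n_params=10, n_obs=500)
print(crit)   # smaller values indicate a better trade-off between fit and complexity
```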

CHAPTER 3 A MULTILEVEL HIGHER-ORDER ITEM RESPONSE MODEL

3.1 Model Specification

In this chapter, a multilevel higher-order item response model is proposed that combines the higher-order item response model with background variables. In this model, a test is viewed as consisting of several unidimensional subtest domains. That is, a single domain-specific ability $\theta_i^{(d)}$ accounts for examinee $i$'s performance on domain $d$, where $d = 1, 2, \ldots, D$. The overall ability is assumed to be normally distributed.

It is assumed that students have been sampled from a normal population with mean $\mu$ and variance $\sigma^2$. That is:

$$f(\theta_i; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2}\exp\left[-\frac{(\theta_i - \mu)^2}{2\sigma^2}\right] \qquad (3.1.1)$$

or equivalently

$$\theta_i = \mu + E_i \qquad (3.1.2)$$

where $E_i \sim N(0, \sigma^2)$.

Adams et al. (1997) discuss how a natural extension of (3.1.2) is to replace the mean $\mu$ with the regression model $\mathbf{Y}_i'\boldsymbol{\beta}$, where $\mathbf{Y}_i$ is a vector of background variables with fixed and known values for student $i$, and $\boldsymbol{\beta}$ is the corresponding vector of regression coefficients. For example, $\mathbf{Y}_i$ could consist of student variables such as gender or socio-economic status. The population model for student $i$ then becomes

$$\theta_i = \mathbf{Y}_i'\boldsymbol{\beta} + E_i \qquad (3.1.3)$$

where the $E_i$ are assumed to be independently and identically normally distributed with mean zero and variance $\sigma^2$.

The correlations between different domain abilities can be accounted for by positing a higher-order ability _i that is viewed as the examinee’s overall ability.

Specifically, the domain abilities are expressed as linear functions of the overall ability:

$$\theta_i^{(d)} = \lambda^{(d)}\theta_i + \varepsilon_{id} \qquad (3.1.4)$$

where $\lambda^{(d)}$ is the latent regression coefficient, and $\varepsilon_{id}$ is the error term, assumed to be normally distributed with a mean of zero and a variance of $1 - (\lambda^{(d)})^2$.

The diagrammatic representation of the MHO-IRT model is shown in Figure 3-1.

The first level of the figure shows the response of examinee $i$ to the $j$th item in domain $d$. On the second level, an examinee's domain-level responses are linked to the examinee's domain-specific ability $\theta_i^{(d)}$ and the item parameters $IP_j^{(d)} = \{a_j^{(d)}, b_j^{(d)}, c_j^{(d)}\}$ via IRT models, where $a_j^{(d)}$, $b_j^{(d)}$, and $c_j^{(d)}$ are the discrimination, difficulty, and guessing parameters of item $j$. On the third level of the figure, the examinee's domain ability is related to his or her overall ability $\theta_i$ by the latent regression parameter $\lambda^{(d)}$. On the fourth level, the examinee's overall ability is related to his or her background variables $\mathbf{Y}$ by the latent regression coefficients $\boldsymbol{\beta}$.

Observed variables are in boxes; the remaining variables are to be estimated.

Figure 3-1 Multilevel HO-IRT method

Adapted from "A higher-order item response model: development and application," by Song, H., 2007, doctoral dissertation, The State University of New Jersey.

3.2 Parameter Estimation

For this study, the model parameters were estimated using MCMC methods. The proposed procedure, which uses simultaneous estimation and incorporates background variables, was compared with procedures that estimate abilities one at a time or ignore the background variables. In addition, although this dissertation focuses on the three-parameter logistic (3PL) model, the framework was formulated so that other item response models can be used in its place.

3.2.1 Prior Distributions

The prior distributions of the ability, item, and the latent regression parameters are given below. In a hierarchical Bayesian framework, the model can be expressed as:

background variables; $\lambda^{(d)}$ is the latent regression parameter between the overall and domain abilities; and the item characteristics, where $a_j^{(d)}$, $b_j^{(d)}$, and $c_j^{(d)}$ are the discrimination, difficulty, and guessing parameters of item $j$. Using this formulation, the marginal distribution of domain ability can be shown to be the standard normal distribution.

3.2.2 Joint and Conditional Posterior Distributions

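Let $\mathbf{X}$ denote the matrix of item responses; $\boldsymbol{\theta}$ the overall ability parameters; $\mathbf{Y}$ the matrix of background variables; $\boldsymbol{\theta}^{(d)} = \{\boldsymbol{\theta}^{(1)}, \boldsymbol{\theta}^{(2)}, \ldots, \boldsymbol{\theta}^{(D)}\}$ the domain ability parameters; $\mathbf{IP}$ the item parameters; and $\boldsymbol{\lambda} = \{\lambda^{(1)}, \lambda^{(2)}, \ldots, \lambda^{(D)}\}$ the latent regression parameters between the overall and domain abilities.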

The joint posterior distribution of the parameters, given the observed item responses $\mathbf{X}$ and background variables $\mathbf{Y}$, can be expressed as

Because this joint posterior distribution does not have a known form, it is impossible to obtain draws from it directly. Instead, draws can be taken from the full conditional
