CHAPTER 1 INTRODUCTION
1.3 Significance and contribution
As discussed above, this dissertation makes two major contributions to educational measurement and research. First, a multilevel HO-IRT model is proposed to incorporate background variables into the HO-IRT model. Incorporating the background variables can lead to unbiased estimates of population parameters, more precise ability estimates, and consistent model parameter estimates.
Second, an MCMC algorithm is proposed to estimate the overall ability, domain abilities, item parameters, and latent regression coefficients simultaneously. The proposed MHO-IRT model is an integrated model that can better capture the structure and provide efficient estimates. The MCMC procedure developed to estimate the parameters of this complex model can be used in conjunction with various item response models. Moreover, this dissertation compares the proposed model with five other models. The experimental results indicate the situations in which the proposed model is appropriate.
CHAPTER 2
LITERATURE REVIEW
This chapter first discusses a variety of dichotomous IRT, MIRT, and HO-IRT models. The principle of the multilevel method is then presented as implemented in unidimensional and multidimensional item response theory models. The Markov chain Monte Carlo algorithm is described for unidimensional and multidimensional item response theory models. Finally, model fit is described.
2.1 Item Response Models
2.1.1 Unidimensional Item Response Models
Item response theory now contains a large family of models. The simplest of these models is the Rasch (1960) model, which is also known as the one-parameter logistic model (1PL). For the Rasch model, the dependent variable is the dichotomous response of a particular person to a specified item. The 1PL function provides the prediction as follows:
$$P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \tag{2.1.1}$$
where $P(X_{ij} = 1)$ is the probability that examinee $i$ answers item $j$ correctly;
$b_j$ is the difficulty parameter for item $j$; and
$\theta_i$ is the $i$th examinee's ability parameter for the administered test.
In the two-parameter logistic model (2PL), item discrimination is included in the measurement model. The model includes two parameters to represent item properties.
Both item difficulty, $b_j$, and item discrimination, $a_j$, are included in the exponential form of the logistic model (Birnbaum, 1968), as follows:
$$P(X_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]} \tag{2.1.2}$$
Notice that the item discrimination is a multiplier of the difference between trait level and item difficulty. Item discriminations are related to the biserial correlations between item responses and total scores.
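When a third parameter, the guessing parameter, $c_j$, is added to the 2PL model, it becomes the three-parameter logistic (3PL) IRT model, as follows (Lord, 1980):
$$P(X_{ij} = 1 \mid \theta_i, a_j, b_j, c_j) = c_j + (1 - c_j)\frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]}$$
where $c_j$ is the probability that an examinee answers item $j$ correctly as his or her ability level reaches the low extreme.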
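The three logistic models can be sketched with a single function, since the 1PL and 2PL are special cases of the 3PL. This is a minimal illustration only: the function name `irt_probability` and the parameter values are hypothetical, and the 3PL is written in its standard form with $c_j$ as a lower asymptote.

```python
import math

def irt_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    With a=1 and c=0 this is the 1PL (Rasch) model; with c=0 it is the 2PL.
    """
    z = a * (theta - b)
    # Evaluate the logistic function in a numerically stable way.
    if z >= 0:
        logistic = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        logistic = e / (1.0 + e)
    return c + (1.0 - c) * logistic

# Hypothetical examinee and item values, chosen only for illustration.
p_1pl = irt_probability(theta=0.5, b=0.5)                  # ability equals difficulty
p_2pl = irt_probability(theta=1.0, b=0.0, a=2.0)           # steeper item
p_3pl = irt_probability(theta=0.5, b=0.5, a=1.2, c=0.2)    # guessing raises the floor
```

When ability equals difficulty, the 1PL and 2PL give a probability of 0.5, while the 3PL gives $c_j + (1 - c_j)/2$; as ability decreases, the 3PL probability approaches $c_j$ rather than zero.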
Patz and Junker (1999a) described a general Markov chain Monte Carlo strategy, based on Metropolis-Hastings sampling, for Bayesian inference in complex item response theory settings. They demonstrated the basic MCMC methodology using the two-parameter logistic (2PL) model. Patz and Junker (1999b) extended their basic MCMC methodology to address issues such as non-response, designed missingness, multiple raters, guessing behavior, and partial credit (polytomous) test items. The MCMC algorithm for the unidimensional 3PL model is described below.
The prior distributions of the ability and item parameters are given below. In a Bayesian framework, the estimation method can be expressed as (Patz & Junker, 1999a):
Using this formulation, the marginal distribution of domain ability can be shown to be the standard normal distribution.
The joint posterior distribution of the parameters, given the observed item response X, can be expressed as
The full conditional distributions of $\theta$ and IP are derived as follows:
At iteration $t$, the outline of the MCMC algorithm is as follows.
1. Because $\theta$ has independent components, the sampling can be done one examinee at a time. For examinee $i$, sample $\theta_i^*$ from $N(\theta_i^{(t-1)}, \sigma^{2(t-1)})$, and accept $\theta_i^*$ with the Metropolis-Hastings acceptance probability.
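The ability-sampling step can be sketched as a random-walk Metropolis-Hastings update for one examinee. This is an illustration under stated assumptions rather than the algorithm of Patz and Junker: the item parameters are treated as fixed and known, the prior on ability is assumed to be $N(0,1)$, the proposal standard deviation is held constant, and all function names and numeric values are hypothetical.

```python
import math
import random

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_posterior(theta, responses, items):
    """Log of (assumed N(0,1) prior) x (3PL likelihood), up to a constant."""
    log_p = -0.5 * theta ** 2
    for x, (a, b, c) in zip(responses, items):
        # Clamp to avoid log(0) at extreme ability values.
        p = min(max(p3pl(theta, a, b, c), 1e-12), 1.0 - 1e-12)
        log_p += math.log(p) if x == 1 else math.log(1.0 - p)
    return log_p

def mh_step(theta, responses, items, proposal_sd, rng):
    """One Metropolis-Hastings update with a symmetric normal proposal."""
    candidate = rng.gauss(theta, proposal_sd)
    log_ratio = (log_posterior(candidate, responses, items)
                 - log_posterior(theta, responses, items))
    accept_prob = math.exp(min(0.0, log_ratio))  # min{ratio, 1}
    return candidate if rng.random() < accept_prob else theta

def run_chain(responses, items, n_iter=2000, proposal_sd=1.0, seed=0):
    """Run the chain for one examinee and return all draws."""
    rng = random.Random(seed)
    theta, draws = 0.0, []
    for _ in range(n_iter):
        theta = mh_step(theta, responses, items, proposal_sd, rng)
        draws.append(theta)
    return draws

# Hypothetical item parameters (a, b, c) and one response pattern.
items = [(1.0, b, 0.2) for b in (-1.5, -0.5, 0.0, 0.5, 1.5)]
draws = run_chain([1, 1, 1, 0, 0], items)
posterior_mean = sum(draws[500:]) / len(draws[500:])  # discard burn-in
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of posterior densities at the candidate and current values. Updating the item parameters would require analogous steps that are not shown here.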
2.1.2 Multidimensional Item Response Models
Many assessments are designed to report not only overall ability but also domain abilities on a few domains or subskills, with a certain number of items in each domain.
Multidimensional IRT (MIRT) models provide two or more parameters and their covariance structure to represent each person's trait level. The Multidimensional One-Parameter Logistic Model (M1PLM; McKinley & Reckase, 1982) can be expressed as:
$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, b_j) = \frac{\exp(\mathbf{1}'\boldsymbol{\theta}_i - b_j)}{1 + \exp(\mathbf{1}'\boldsymbol{\theta}_i - b_j)} \tag{2.1.15}$$
where $P(X_{ij} = 1)$ is the probability of a correct response; $\boldsymbol{\theta}_i = \{\theta_1, \theta_2, \ldots, \theta_p\}$ refers to the $p$-dimensional abilities; $b_j$ is the difficulty parameter for item $j$; and $\mathbf{1}$ is a $p \times 1$ vector of 1's.
In addition to the M1PLM, Adams, Wilson, and Wang (1997) proposed the multidimensional random coefficients multinomial logit model (MRCMLM) for Rasch family models. Being a member of the exponential family of distributions, the MRCMLM can be viewed as a generalized linear mixed model (De Boeck & Wilson, 2004; McCulloch & Searle, 2001; Rijmen, Tuerlinckx, De Boeck, & Kuppens, 2003; Wang & Wilson, 2005a; Wang & Wilson, 2005b). The Rasch testlet model (Wang & Wilson, 2005b), the linear logistic test model (LLTM; Fischer, 1973), the rating scale model (RSM; Andrich, 1978), and the partial credit model (PCM; Masters, 1982) are all special cases of the MRCMLM. The model can be expressed as
$$P(X_{ijk} = 1 \mid \boldsymbol{\theta}_i, \boldsymbol{\xi}) = \frac{\exp(\mathbf{b}_{jk}'\boldsymbol{\theta}_i + \mathbf{a}_{jk}'\boldsymbol{\xi})}{\sum_{u=1}^{O_j} \exp(\mathbf{b}_{ju}'\boldsymbol{\theta}_i + \mathbf{a}_{ju}'\boldsymbol{\xi})} \tag{2.1.16}$$
where $P(X_{ijk} = 1)$ is the probability of a response to item $j$ in category $k$ for examinee $i$; $O_j$ is the number of categories in item $j$; $\boldsymbol{\xi}$ is a vector of difficulty parameters of that item; $\mathbf{b}_{jk}$ is a score vector given to category $k$ of item $j$ across the $p$ abilities; and $\mathbf{a}_{jk}$ is a design vector given to category $k$ of item $j$ that integrates the elements of $\boldsymbol{\xi}$ into a linear relationship. The commercial computer program ConQuest (Wu, Adams, & Wilson, 1998) can be used to calibrate the parameters based on the MRCMLM.
When discrimination is included in the model, Equation (2.1.17) gives the Multidimensional Two-Parameter Logistic Model (M2PLM; Reckase, 1997). The function is defined as follows:
Reckase (1985) proposed a multidimensional IRT model as an extension of the 3PL. In his original formulation, a single item can measure two or more abilities.
Extending the 3PL model to a multidimensional context, Reckase (1997) formulated the linear logistic multidimensional model as:
2.1.3 Higher-Order Item Response Model
The HO-IRT model was developed for simultaneous estimation of the overall and domain abilities. In the proposed HO-IRT model, a test is viewed as consisting of several unidimensional subdomains. That is, a single domain-specific ability $\theta_i^{(d)}$ accounts for examinee $i$'s performance on domain $d$, where $d = 1, 2, \ldots, D$. When different domains measure the same ability, the entire test is deemed unidimensional.
The correlations between different domain abilities can be accounted for by positing a higher-order ability $\theta_i$ that is viewed as the examinee's overall ability. Specifically, the domain abilities are expressed as linear functions of the overall ability (de la Torre & Song, 2009).
Using this formulation, the domain abilities are guaranteed to follow the same distribution as $\theta_i$, namely, the standard normal distribution $N(0,1)$. It is also assumed that the domain-level abilities are independent of each other given the overall ability. The correlation between the overall and domain abilities is given by $\lambda^{(d)}$, whereas the correlation between the domain abilities $d$ and $d'$ is $\lambda^{(d)}\lambda^{(d')}$. Although $\lambda^{(d)}$ can be negative, it is expected to be non-negative in most educational applications where domain abilities are positively correlated with the overall ability.
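The stated correlations can be checked with a short derivation. It assumes the linear specification $\theta_i^{(d)} = \lambda^{(d)}\theta_i + \varepsilon_{id}$ with $\theta_i \sim N(0,1)$ and $\varepsilon_{id} \sim N(0, 1 - (\lambda^{(d)})^2)$, which is the form given for the domain abilities in Chapter 3, and it assumes that the error terms are independent of $\theta_i$ and of each other.

```latex
\begin{align*}
\operatorname{Var}\bigl(\theta_i^{(d)}\bigr)
  &= (\lambda^{(d)})^2 \operatorname{Var}(\theta_i) + \operatorname{Var}(\varepsilon_{id})
   = (\lambda^{(d)})^2 + 1 - (\lambda^{(d)})^2 = 1, \\
\operatorname{Cov}\bigl(\theta_i^{(d)}, \theta_i\bigr)
  &= \lambda^{(d)} \operatorname{Var}(\theta_i) = \lambda^{(d)}, \\
\operatorname{Cov}\bigl(\theta_i^{(d)}, \theta_i^{(d')}\bigr)
  &= \lambda^{(d)} \lambda^{(d')} \operatorname{Var}(\theta_i)
   = \lambda^{(d)} \lambda^{(d')}, \qquad d \neq d'.
\end{align*}
```

Because every ability has unit variance under these assumptions, the covariances are also the correlations. For example, $\lambda^{(1)} = 0.9$ and $\lambda^{(2)} = 0.8$ would imply a correlation of 0.72 between the two domain abilities.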
The diagrammatic representation of the HO-IRT model is given in Figure 2-1.
The first level of the figure shows the response of examinee $i$ to the $j$th item in domain $d$. On the second level, an examinee's domain-level responses are linked to the examinee's domain-specific ability $\theta_i^{(d)}$ and the specific item parameters $IP_j^{(d)}$ via IRT models. On the third level of the figure, the examinee's domain ability is related to his or her overall ability $\theta_i$ by the latent regression parameter $\lambda^{(d)}$.
Figure 2-1 A HO-IRT model applied to a D-domain test
Adapted from “A higher-order item response model: development and application,” by Song, H., 2007, doctoral dissertation, The State University of New Jersey.
2.2 Multilevel IRT Models
Several methods are currently available for improving estimation of domain abilities. The core idea shared by these methods is to incorporate background variables into the estimation process for improving the estimation of item parameters and person abilities (e.g., Mislevy, 1987; Mislevy & Sheehan, 1989; Adams, Wilson, & Wu, 1997; de la Torre, 2009; von Davier & Sinharay, 2010). Background variables include examinees' demographic and educational background variables, examinees' performance on the overall test or on other subtests, and the correlation structure of the underlying abilities that are best estimated by the IRT scale scores (de la Torre, 2003; Wainer et al., 2001). Two kinds of models are included in this section: the multilevel unidimensional IRT model (Mislevy, Johnson, & Muraki, 1992) and the multilevel multidimensional IRT model (de la Torre, 2009).
2.2.1 Multilevel Unidimensional IRT Model
Several methods are currently available that are intended to provide more precise and reliable estimates by incorporating background variables. Research evidence has shown that incorporating student demographic and educational variables in the estimation process can lead to unbiased estimates of population parameters, more precise ability estimates, and consistent parameter estimates (Mislevy, 1984; Mislevy, 1987; Mislevy & Sheehan, 1989). A hierarchical modeling framework allows different models to be specified at different levels of the hierarchy. IRT models are examples of such an approach: they integrate two models specified at two levels. At the first level is the item response function that relates the examinee's ability and the item characteristics to the probability of a particular response; at the second level is the distribution function that characterizes how the ability is distributed in the population. One can view the former as modeling the within-person variability and the latter as modeling the between-person variability (Adams, Wilson, & Wu, 1997).
This idea was implemented in the scaling process for the National Assessment of Educational Progress (NAEP) (Mislevy, Johnson, & Muraki, 1992; Gonzalez, Galia, & Li, 2004). The NAEP scaling approach was originally devised for reporting population abilities on the overall test or test domains (Mislevy, Johnson, & Muraki, 1992). Instead of estimating ability for individual examinees, NAEP generates consistent estimates of population characteristics using marginal estimation techniques. The basic idea of the NAEP scaling procedure is to improve ability estimation by incorporating the ancillary information from background surveys, the so-called plausible values methodology.
Plausible values methodology was developed as a way to address this issue by using all available data to estimate directly the characteristics of student populations and subpopulations, and then generating multiple imputed scores, called plausible values, from these distributions that can be used in analyses with standard statistical software. A detailed review of plausible values methodology was given in Mislevy (1991).
Suppose a sample statistic $t(\theta, Y)$ is used for estimating a corresponding population parameter $T$, where $\theta$ represents the latent ability values for all sampled examinees, and $Y$ represents the vector of students' background variables. By treating $\theta$ as missing (Rubin, 1987), $t(\theta, Y)$ can be evaluated through multiple imputations, and the resultant values are plausible values. The estimate of $t(\theta, Y)$ by its expectation conditional on the observed data $(X, Y)$ is
$$E[t(\theta, Y) \mid X, Y] = \int t(\theta, Y)\, p(\theta \mid X, Y)\, d\theta$$
where $X$ represents the responses of all sampled examinees to test items, and $\theta$ is a vector of unknown abilities. In IRT measurement models, closed-form solutions for this equation are not available. Instead, the integration can be approximated using a Monte Carlo procedure by randomly drawing from the conditional distributions $p(\theta_i \mid x_i, y_i)$ for each sampled examinee $i$. The posterior distribution $p(\theta \mid X, Y)$ is obtained using Bayes' theorem, the IRT model, and the observed background variables. The item parameters are assumed to be known values.
Assume $p(\theta \mid y_i)$ is normally distributed, and $\theta$ is a linear function of the background variables $y$ and their interactions, denoted by $y^c$:
$$\theta = \Gamma' y^c + \varepsilon \tag{2.2.3}$$
where $\varepsilon$ is assumed to be normally distributed with mean 0 and variance $\Sigma$. $\Gamma$ and $\Sigma$ are the parameters that can be estimated through maximum likelihood and Bayesian estimation procedures (see Mislevy, Johnson, & Muraki, 1992, p. 140 for details). The normalized likelihood is used for the estimation of $\Gamma$ and $\Sigma$ and for the generation of plausible values.
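The plausible values idea can be sketched for a single examinee and a single ability. The sketch below is not the NAEP procedure: it assumes known 3PL item parameters, takes the conditional prior mean (the role played by the regression on background variables) and standard deviation as given inputs, and approximates the posterior on a fixed grid. The function name `plausible_values` and all numeric values are hypothetical.

```python
import math
import random

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def plausible_values(responses, items, prior_mean, prior_sd, n_draws=5, seed=0):
    """Draw plausible values from a grid approximation to p(theta | x, y).

    prior_mean stands in for the regression prediction from the examinee's
    background variables; prior_sd is the residual standard deviation.
    """
    rng = random.Random(seed)
    grid = [-4.0 + k / 50.0 for k in range(401)]  # theta from -4 to 4
    log_w = []
    for theta in grid:
        lw = -0.5 * ((theta - prior_mean) / prior_sd) ** 2  # normal prior
        for x, (a, b, c) in zip(responses, items):
            p = p3pl(theta, a, b, c)
            lw += math.log(p if x == 1 else 1.0 - p)        # IRT likelihood
        log_w.append(lw)
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]           # unnormalized posterior
    return rng.choices(grid, weights=weights, k=n_draws)

# Hypothetical items (a, b, c), responses, and conditional prior.
items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
pvs = plausible_values([1, 0, 1], items, prior_mean=0.3, prior_sd=0.9)
```

Each examinee receives several draws rather than one point estimate, so that subsequent analyses can carry the uncertainty in ability through to population-level statistics.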
2.2.2 Multilevel Multidimensional IRT Model
de la Torre and Patz (2005) devised a method to improve estimation of domain abilities by incorporating the correlation structure of the abilities. de la Torre (2009) proposed a method to provide a general framework for ability estimation where background variables found in the covariates and correlation structure of the abilities can be incorporated in the estimation process using an integrated framework.
The extension of the 3PL model to the multidimensional context (Reckase, 1997) is given by
The prior distributions of the ability parameters are given below. For examinee i with ability θi,
Parameters were estimated using MCMC. The following is an outline of the MCMC algorithm.
Iteration 0:
1. Assign the following initial values to the parameters: the regression coefficients are set to 0, the covariance matrix to I, and the abilities to random draws from MVN(0, I).
Iteration t:
2. For the regression parameters, the full conditional distribution of , )
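4. Finally, since $\boldsymbol{\theta}$ has independent components, the sampling can be done one examinee at a time. For examinee $i$, sample $\boldsymbol{\theta}_i^*$ from $MVN(\boldsymbol{\theta}_i^{(t-1)}, c)$, where $c$ is the fixed scale of the candidate-generating distribution. Accept the draw with the corresponding Metropolis-Hastings acceptance probability, which has the form $\min\{\cdot, 1\}$.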
Gelman, Carlin, Stern, and Rubin (1995, p. 409) provide an alternative method of sampling from this full conditional distribution that avoids the use of the Kronecker product. By recasting the matrices as follows,
where yi is the background variables vector of examinee i (i.e., the transpose
They also suggested the use of matrix factorization to avoid inversion of large matrices in this algorithm.
In summary, both kinds of approaches incorporate background variables into the estimation process in order to obtain more precise and reliable domain abilities. However, these methods are based on unidimensional or multidimensional IRT models, and none of them estimates the overall ability together with the domain abilities. This study proposes a multilevel higher-order IRT estimation method.
2.3 Model fit
There is usually uncertainty about appropriate error structure and predictor variables to include in models. Adding more parameters may improve fit, but maybe at the expense of identifiability and generalizability. Model selection criteria assess whether improvements in fit measures such as likelihoods, deviances or error sum of squares justify the inclusion of extra parameters in a model. Classical and Bayesian model choice methods may both involve comparison either of measures of fit to the current data or cross validatory fit to out of sample data. For example, the deviance statistics of general linear models (with Poisson, normal, binomial or other exponential family outcomes) follow standard densities for comparisons of models nested within one another, at least approximately in large samples (McCullagh and Nelder, 1989).
Penalised measures of fit (Akaike, 1973) may be used, involving an adjustment to the model log-likelihood or deviance to reflect the number of parameters in the model (Congdon, 2003).
In this dissertation, three criteria were used to assess the model fit: (1) the Akaike information criterion, AIC (Congdon, 2003), (2) the Bayesian information criterion, BIC (Congdon, 2003), and (3) the deviance information criterion, DIC (Spiegelhalter, Best & Carlin, 1998).
Let $L$ denote the likelihood and $D$ the deviance of a model involving $p$ parameters. The deviance may be simply defined as minus twice the log likelihood, $D = -2\log L$. Then, to allow for the number of parameters, one may use criteria such as the Akaike Information Criterion (or AIC), expressed as
$$\mathrm{AIC}(\text{Model}) = D + 2p \tag{4.1.5}$$
So when the AIC is used to compare models, an increase in likelihood and reduction in deviance is offset by a greater penalty for more complex models. Another criterion used generally as a penalized fit measure, though also justified as an asymptotic approximation to the Bayesian posterior probability of a model, is the Schwarz Information Criterion (Schwarz, 1978). This is also often called the Bayes Information Criterion. Depending on the simplifying assumptions made, it may take different forms, but the most common version, for a sample of size $N$, is
$$\mathrm{BIC}(\text{Model}) = D + p\log(N) \tag{4.1.6}$$
Spiegelhalter, Best and Carlin (1998) have developed a Bayesian alternative to both AIC and BIC, based on the deviance and called DIC. This criterion is more satisfactory than the two former alternatives because it takes into account the prior information and gives a natural penalization factor to the log-likelihood.
$$\mathrm{DIC}(\text{Model}) = \bar{D} + p_D \tag{4.1.7}$$
where $\bar{D}$ is the posterior mean of the deviance. The sum of the differences between the posterior mean of the model-level deviance and the deviance at each draw $i$ is $p_D$.
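The three criteria can be computed from quantities that an MCMC run already produces. The sketch below is illustrative only. The function names and numeric inputs are hypothetical, and for the DIC it assumes the usual definition from Spiegelhalter and colleagues, in which $p_D$ is the posterior mean deviance minus the deviance evaluated at the posterior means of the parameters.

```python
import math

def aic(deviance, n_params):
    """AIC = D + 2p."""
    return deviance + 2 * n_params

def bic(deviance, n_params, n_obs):
    """BIC = D + p log(N)."""
    return deviance + n_params * math.log(n_obs)

def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D), with p_D = mean deviance - deviance at posterior mean."""
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d, p_d

# Hypothetical values for two competing models fitted to the same data.
simple_model = aic(deviance=5210.0, n_params=40)
complex_model = aic(deviance=5195.0, n_params=55)
prefer_simple = simple_model < complex_model  # smaller values indicate better fit
```

For all three criteria, smaller values indicate a better trade-off between fit and complexity, and comparisons are meaningful only between models fitted to the same data.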
CHAPTER 3
A MULTILEVEL HIGHER-ORDER ITEM RESPONSE MODEL
3.1 Model Specification
In this chapter, a multilevel higher-order item response model is proposed to combine the higher-order item response model with background variables. In this model, a test is viewed as consisting of several unidimensional subtest domains. That is, a single domain-specific ability $\theta_i^{(d)}$ accounts for examinee $i$'s performance on domain $d$, where $d = 1, 2, \ldots, D$. The overall ability is assumed to be normally distributed.
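It is assumed that students have been sampled from a normal population with mean $\mu$ and variance $\sigma^2$. That is:
$$f(\theta; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{(\theta - \mu)^2}{2\sigma^2}\right] \tag{3.1.1}$$
or equivalently
$$\theta = \mu + E \tag{3.1.2}$$
where $E \sim N(0, \sigma^2)$.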
Adams et al. (1997) discuss how a natural extension of (3.1.2) is to replace the mean, $\mu$, with the regression model $\mathbf{Y}_i'\boldsymbol{\beta}$, where $\mathbf{Y}_i$ is a vector of background variables, fixed and known values for student $i$, and $\boldsymbol{\beta}$ is the corresponding vector of regression coefficients. For example, $\mathbf{Y}_i$ could be constituted of student variables such as gender or socio-economic status. Then the population model for student $i$ becomes
$$\theta_i = \mathbf{Y}_i'\boldsymbol{\beta} + E_i \tag{3.1.3}$$
where it is assumed that the $E_i$ are independently and identically normally distributed with mean zero and variance $\sigma^2$.
The correlations between different domain abilities can be accounted for by positing a higher-order ability $\theta_i$ that is viewed as the examinee's overall ability. Specifically, the domain abilities are expressed as linear functions of the overall ability:
$$\theta_i^{(d)} = \lambda^{(d)}\theta_i + \varepsilon_{id} \tag{3.1.4}$$
where $\lambda^{(d)}$ is the latent regression coefficient, and $\varepsilon_{id}$ is the error term that is assumed to be normally distributed with a mean of zero and variance of $1 - (\lambda^{(d)})^2$.
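The two structural levels of the model, background variables to overall ability and overall ability to domain abilities, can be illustrated by simulating abilities. This is a sketch for intuition rather than the estimation procedure: the background variables are assumed to be hypothetical binary indicators, and the function name, coefficient values, and sample size are all illustrative.

```python
import math
import random

def simulate_abilities(n_examinees, beta, sigma, lambdas, seed=0):
    """Simulate (Y_i, theta_i, domain abilities) for each examinee.

    theta_i     = Y_i' beta + E_i,            E_i ~ N(0, sigma^2)
    theta_i^(d) = lambda^(d) theta_i + e_id,  e_id ~ N(0, 1 - lambda^(d)^2)
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_examinees):
        # Hypothetical binary background variables (e.g., group indicators).
        y = [float(rng.randint(0, 1)) for _ in beta]
        theta = sum(b * v for b, v in zip(beta, y)) + rng.gauss(0.0, sigma)
        domains = [lam * theta + rng.gauss(0.0, math.sqrt(1.0 - lam ** 2))
                   for lam in lambdas]
        rows.append((y, theta, domains))
    return rows

# Illustrative values: one background variable and two domains.
rows = simulate_abilities(2000, beta=[0.8], sigma=1.0, lambdas=[0.9, 0.6])
group1 = [theta for y, theta, _ in rows if y[0] == 1.0]
group0 = [theta for y, theta, _ in rows if y[0] == 0.0]
mean_gap = sum(group1) / len(group1) - sum(group0) / len(group0)
```

In data simulated this way, the gap in mean overall ability between the two groups should be close to the regression coefficient, and a domain with a larger $\lambda^{(d)}$ should track the overall ability more closely.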
The diagrammatic representation of the MHO-IRT model is given in Figure 3-1.
The first level of the figure shows the response of examinee $i$ to the $j$th item in domain $d$. On the second level, an examinee's domain-level responses are linked to the examinee's domain-specific ability $\theta_i^{(d)}$ and the specific item characteristics $IP_j^{(d)} = \{a_j^{(d)}, b_j, c_j\}$ via IRT models, where $a_j^{(d)}$, $b_j$, and $c_j$ are the discrimination, difficulty, and guessing parameters of item $j$. On the third level of the figure, the examinee's domain ability is related to his or her overall ability $\theta_i$ by the latent regression parameter $\lambda^{(d)}$. On the fourth level, the examinee's overall ability is related to his or her background variables $\mathbf{Y}$ by the latent regression parameter $\boldsymbol{\beta}$.
Figure 3-1 Multilevel HO-IRT method
Observed variables are in boxes; the remaining variables are to be estimated.
Adapted from “A higher-order item response model: development and application,” by Song, H., 2007, doctoral dissertation, The State University of New Jersey.
3.2 Parameter Estimation
For this study, the model parameters were estimated using MCMC methods. The procedure, which uses simultaneous estimation and background variables, was compared with procedures that estimate abilities one at a time or ignore the background variables. In addition, although this study focuses on the three-parameter logistic (3PL) model, the framework was formulated such that other item response models can be used in its place.
3.2.1 Prior Distributions
The prior distributions of the ability, item, and the latent regression parameters are given below. In a hierarchical Bayesian framework, the model can be expressed as:
background variables; $\lambda^{(d)}$ is the latent regression parameter between the overall and domain abilities; and $a_j^{(d)}$, $b_j$, and $c_j$ are the discrimination, difficulty, and guessing parameters of item $j$. Using this formulation, the marginal distribution of domain ability can be shown to be the standard normal distribution.
3.2.2 Joint and Conditional Posterior Distributions
Let $X$ be the matrix of item responses; $\boldsymbol{\theta}$ be the overall ability parameters; $Y$ be the matrix of background variables; $\boldsymbol{\theta}^{(d)} = \{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(D)}\}$ represent the domain ability parameters; IP represent the item parameters; and $\boldsymbol{\lambda} = \{\lambda^{(1)}, \lambda^{(2)}, \ldots, \lambda^{(D)}\}$ be the latent regression parameters between the overall and domain abilities. The joint posterior distribution of the parameters, given the observed item responses $X$ and background variables $Y$, can be expressed as
As this joint posterior distribution is of an unknown form, it is impossible to obtain draws from it directly. Instead, draws can be taken from the full conditional