A generalized dimensionality discrepancy measure for dimensionality assessment in multidimensional item response theory


British Journal of Mathematical and Statistical Psychology (2011), 64, 208–232

© 2010 The British Psychological Society



Roy Levy* and Dubravka Svetina

Arizona State University, Tempe, Arizona, USA

A generalized dimensionality discrepancy measure is introduced to facilitate a critique of dimensionality assumptions in multidimensional item response models. Connections between dimensionality and local independence motivate the development of the discrepancy measure from a conditional covariance theory perspective. A simulation study and a real-data analysis demonstrate the utility of the discrepancy measure’s application at multiple levels of analysis in a posterior predictive model checking framework.

1. Introduction

Zhang and Stout (1999a) characterized approaches to dimensionality assessment for item response data in terms of those that (a) are more exploratory in nature and attempt full dimensionality assessment by estimating the number of latent dimensions and determining which items reflect which dimension(s) or (b) are more confirmatory in that they assess (departures from) unidimensionality. In addressing the latter of these, Stout (1987) framed the DIMTEST procedure as responding to Lord’s (1980, p. 21) call for a statistical procedure to assess unidimensionality in item response theory (IRT).

However, assessments are often multidimensional in the sense that performance on tasks depends on distinct though possibly related proficiencies or aspects of proficiency.

Advances in estimation capabilities, including those afforded by adopting Bayesian approaches to statistical modelling, have supported the growth in the popularity of research on and the use of multidimensional IRT (MIRT) models for item response and similar types of data (Béguin & Glas, 2001; Bolt & Lall, 2003; Clinton, Jackman, & Rivers, 2004; Yao & Boughton, 2007). With the emergence of such models comes the need for appropriate model checking and model criticism procedures. Thus, Lord’s (1980) call may be extended to a call for procedures for assessing dimensionality assumptions in multidimensional models. The analysis of the assumed dimensionality has long been recognized as an integral part of any application of unidimensional IRT models, and though evaluating dimensionality assumptions takes on no less importance when multidimensional models are employed, dimensionality and data–model fit procedures for multidimensional models are relatively underdeveloped (Swaminathan, Hambleton, & Rogers, 2007). To this end, we introduce a new discrepancy measure for assessing the dimensionality assumptions applicable to multidimensional (as well as unidimensional) models in the context of MIRT and present the application of this discrepancy measure using a posterior predictive model checking (PPMC) framework. PPMC has been successfully applied to unidimensional IRT models (e.g., Hoijtink, 2001; Sinharay, 2005, 2006; Sinharay, Johnson, & Stern, 2006), including the assessment of dimensionality for unidimensional models (Levy, 2010; Levy, Mislevy, & Sinharay, 2009; Sinharay et al., 2006). To date, no study has examined the use of PPMC for examining dimensionality assumptions when fitting MIRT models. In the current work, we capitalize on the flexibility of PPMC as a model checking framework and introduce a discrepancy measure suitable for assessing a model’s assumed (multi)dimensional structure for a wide class of latent variable models.

*Correspondence should be addressed to Dr Roy Levy, Arizona State University, PO Box 873701, Tempe, AZ 85287-3701, USA (e-mail: roy.levy@asu.edu).

DOI:10.1348/000711010X500483

The current work considers the multidimensional normal ogive IRT model for dichotomous observables (i.e., scored item responses), which specifies the probability of an observed value of 1 (i.e., a correct response) as

$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, \mathbf{a}_j, d_j, c_j) = c_j + (1 - c_j)\,\Phi(\mathbf{a}_j'\boldsymbol{\theta}_i + d_j), \tag{1}$$

where $X_{ij}$ is the observable response (coded as 0 or 1) of examinee $i$ to item $j$, $\boldsymbol{\theta}_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{iM})$ is a vector of $M$ latent variables that characterize examinee $i$, $\mathbf{a}_j = (a_{j1}, a_{j2}, \ldots, a_{jM})$ is a vector of $M$ coefficients for item $j$ that capture the discriminating power of the associated examinee variables, $c_j$ is a lower asymptote parameter for item $j$, $d_j$ is an intercept related to the marginal difficulty of the item, and $\Phi$ is the standard normal distribution function (e.g., Béguin & Glas, 2001; McDonald, 1997). The discrepancy measures and procedures presented in the current work also apply to MIRT models using logistic distributions, including models with conjunctive relationships (Embretson, 1997; Reckase, 1997).
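As a rough illustration of the item response function just described, the following sketch evaluates it for a single examinee and item, assuming the usual normal ogive form $c_j + (1 - c_j)\Phi(\mathbf{a}_j'\boldsymbol{\theta}_i + d_j)$. The function name and the parameter values are hypothetical and are not taken from the study.

```python
from statistics import NormalDist


def mirt_normal_ogive(theta, a, d, c=0.0):
    """Return P(X = 1) = c + (1 - c) * Phi(a'theta + d) for one examinee and item."""
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same number of dimensions")
    # Linear predictor: weighted sum of the latent variables plus the intercept.
    linear = sum(a_m * theta_m for a_m, theta_m in zip(a, theta)) + d
    # Phi is the standard normal cumulative distribution function.
    return c + (1.0 - c) * NormalDist().cdf(linear)


# Hypothetical two-dimensional item evaluated at the origin of the latent space.
p_no_guessing = mirt_normal_ogive(theta=[0.0, 0.0], a=[1.0, 0.5], d=0.0)
p_with_guessing = mirt_normal_ogive(theta=[0.0, 0.0], a=[1.0, 0.5], d=0.0, c=0.2)
```

With a zero linear predictor the normal ogive returns one half, so the lower asymptote simply rescales the probability from 0.5 to 0.2 + 0.8 × 0.5.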

In Section 2, the role of local independence and dimensionality in multidimensional models is discussed, which serves as the foundation for the subsequent development of the new discrepancy measure and its variants. Following an overview of the mechanics of PPMC, results from a simulation study in which the performance of the proposed discrepancy measure is evaluated and compared to existing procedures are presented, evidencing the utility of formulating the discrepancy measure at different levels of the analysis to support inference. A real-data example illustrates the procedures in which multiple graphical representations are employed in the analysis.

2. Dimensionality and local independence in multidimensional models

The relation between the assumptions of (a) proper specification of dimensionality and (b) local independence is not without ambiguity both in terminology and in meaning.

Owing to the primacy of unidimensional models, local independence is commonly phrased as the statistical independence of items conditional on a single underlying latent variable (Ip, 2001; Stout, 1987). However, the principle of local independence may hold for $M = 0, 1, 2, \ldots$ latent variables (Hattie, 1984). Accordingly, a more general treatment defines local independence as the statistical independence of items conditional on a possibly vector-valued latent variable (Hattie, Krakowski, Rogers, & Swaminathan, 1996; Stout, 1990). As typically formulated, this more general notion asserts (e.g., Stout, 1990) that for all values of a possibly vector-valued latent variable $\boldsymbol{\theta}$,

$$P(X_1, \ldots, X_J \mid \boldsymbol{\theta}) = \prod_{j=1}^{J} P(X_j \mid \boldsymbol{\theta}).$$

This definition views local independence with respect to a set of latent variables. Under this formulation, if the model specifies the correct number of variables (or more), local independence will hold. In this vein, Hattie et al. (1996, p. 1) argued that the principle of local independence means that ‘Once these trait values are fixed at a given value (i.e., conditioned on), the responses to items become statistically independent. Thus, in order to determine the dimensionality of a set of items, it is necessary and sufficient to identify the minimal set of traits such that at all fixed levels of these traits the item responses are independent’.

Accordingly, if the model underspecifies the number of variables, local independence will not hold. To illustrate, we extend Ip’s (2001) argument for the presence of local dependence when fitting a unidimensional model to multidimensional data to the case of multidimensional modelling. Once extended to multidimensional models, this argument is recast to advocate a more refined definition of local independence.

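Suppose the true model for a set of $J$ items is three-dimensional with latent variables $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3) \sim N(\mathbf{0}, \boldsymbol{\Sigma})$ with covariance matrix

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix},$$

where $\sigma_m^2$ is the variance of $\theta_m$ and $\rho_{mm'}$ is the correlation between $\theta_m$ and $\theta_{m'}$. Consider a linear model for continuous responses to the set of $J$ items:

$$Y_j = \mu + a_{j1}\theta_1 + a_{j2}\theta_2 + a_{j3}\theta_3 + e_j,$$

where $\mu$ is an overall mean parameter, $a_{j1}$, $a_{j2}$, and $a_{j3}$ are item-specific parameters capturing the dependence of the item on the latent variables, and $e_j$ is a random error due to measurement where each value of $e_j$ is independent and identically distributed with mean zero and variance $\sigma_{e_j}^2$. Any two items $Y_j$ and $Y_{j'}$ are conditionally (locally) independent given $\theta_1$, $\theta_2$, and $\theta_3$.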

However, if one conditions upon $\theta_1$ and $\theta_2$, $E(Y_j \mid \theta_1, \theta_2) = \mu + a_{j1}\theta_1 + a_{j2}\theta_2$ and

$$\mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2) = a_{j3}\,a_{j'3}\,\sigma_3^2\,(1 - \rho_{3|12}^2), \tag{2}$$

where $\rho_{3|12}^2$ is the squared multiple correlation between $\theta_3$ and $\theta_1$ and $\theta_2$. Non-zero values for $\mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2)$ indicate local dependence, and it is readily seen that local dependence will exist unless (a) at least one of the items does not depend on $\theta_3$ (i.e., $a_{j3}$ and/or $a_{j'3}$ equals zero) or (b) $\theta_3$ is perfectly correlated with some linear combination of $\theta_1$ and $\theta_2$. The argument can be extended in straightforward ways to situations with more true and conditioned dimensions and to situations in which there are multiple relevant dimensions (i.e., a dimension $\theta_m$ where both $a_{jm}$ and $a_{j'm}$ are non-zero) that are not conditioned on.
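The conditional covariance argument above can be checked numerically. The sketch below assumes the covariance takes the form $a_{j3}a_{j'3}\mathrm{Var}(\theta_3 \mid \theta_1, \theta_2)$, with the conditional variance obtained from the standard multivariate normal result; the function names and the example values are illustrative assumptions only.

```python
def conditional_variance_theta3(sigma):
    """Var(theta3 | theta1, theta2) for a 3x3 covariance matrix given as nested lists."""
    s11, s12, s22 = sigma[0][0], sigma[0][1], sigma[1][1]
    s13, s23, s33 = sigma[0][2], sigma[1][2], sigma[2][2]
    det = s11 * s22 - s12 * s12
    if det <= 0:
        raise ValueError("theta1 and theta2 must not be perfectly correlated")
    # Variance in theta3 explained by theta1 and theta2 (quadratic form with the 2x2 inverse).
    explained = (s22 * s13 * s13 - 2.0 * s12 * s13 * s23 + s11 * s23 * s23) / det
    return s33 - explained


def conditional_covariance(a_j3, a_k3, sigma):
    """Cov(Y_j, Y_k | theta1, theta2) implied by the linear model."""
    return a_j3 * a_k3 * conditional_variance_theta3(sigma)


def equicorrelated(rho):
    """Unit-variance 3x3 covariance matrix with a common correlation rho."""
    return [[1.0 if r == c else rho for c in range(3)] for r in range(3)]


cov_uncorrelated = conditional_covariance(1.0, 1.0, equicorrelated(0.0))
cov_correlated = conditional_covariance(1.0, 1.0, equicorrelated(0.5))
cov_item_free = conditional_covariance(0.0, 1.0, equicorrelated(0.5))
```

In this example, correlating the omitted dimension with the modelled ones shrinks the residual covariance from 1 to 2/3, and setting one loading on $\theta_3$ to zero removes it entirely, matching exception (a) in the text.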

The above perspective of formulating local independence with respect to a number of latent variables implicitly assumes that the model correctly specifies the pattern of dependence of the items on the dimensions (or allows all items to depend on all the dimensions, up to the limits of identification). However, the assertion that local independence follows from correctly specifying the number of latent variables will not necessarily hold if the pattern of dependence of the items on the dimensions is incorrect.

To illustrate, suppose a set of $J$ items follows a three-dimensional model, as above, and that the analyst (correctly) specifies that there are three dimensions. However, suppose the analyst incorrectly specifies that $a_{j3}$ and $a_{j'3}$ equal zero. In this case, the model-implied covariance conditional on $\theta_1$, $\theta_2$, and $\theta_3$ is equal to the model-implied covariance conditional on just $\theta_1$ and $\theta_2$; that is, $\mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2, \theta_3) = \mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2)$, in which case the items will be locally dependent (see (2)). More generally, we hypothesize that an incorrect specification of the dependence of the items on the latent variables (e.g., failing to model an item as dependent on a relevant latent variable, improperly constraining item parameters to be equal) will result in local dependence, even when the correct number of latent variables is specified. Synthesizing the above discussion, we make the assumption of correct model specification explicit and state that items are locally independent with respect to a model if

$$P(X_1, \ldots, X_J \mid \boldsymbol{\theta}, \boldsymbol{\omega}) = \prod_{j=1}^{J} P(X_j \mid \boldsymbol{\theta}, \boldsymbol{\omega}_j),$$

where $\boldsymbol{\omega}$ is the collection of item-specific parameters $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_J$; in the current case for the model in (1), $\boldsymbol{\omega}_j = (\mathbf{a}_j, d_j, c_j)$. The inclusion of $\boldsymbol{\omega}$ indicates the conditioning is with respect to the assumed structure, as well as the number of latent variables.

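A weaker form of local independence requires pairwise item independence (Ip, 2001; Stout, Habing, Douglas, Kim, Roussos & Zhang, 1996). For items $j$ and $j'$, weak local independence holds if

$$E_{\boldsymbol{\theta}}\left[\mathrm{Cov}(X_j, X_{j'} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\right] = 0, \tag{3}$$

where $E_{\boldsymbol{\theta}}$ denotes that the expectation is taken with respect to the distribution of $\boldsymbol{\theta}$.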

In practice, weak local independence is most often investigated in terms of pairs of observables as it is reasonable to assume that if variables are pairwise independent, higher-order dependencies, though possible, are highly implausible (McDonald, 1997; McDonald & Mok, 1995).

The current work adopts the perspective that local dependence can be thought of as unmodelled dimensionality, regardless of whether the extraneous dimensionality is of substantive interest or merely a nuisance (Ip, 2001). This supports a conditional covariance theory perspective on dimensionality assessment, which has provided the foundation for a number of dimensionality assessment procedures for unidimensional models (Levy et al., 2009; Stout et al., 1996; Zhang, 2007; Zhang & Stout, 1999a, 1999b).


3. The generalized dimensionality discrepancy measure

To develop a discrepancy measure and associated testing procedure to assess the assumed dimensionality, consider the model-based covariance (MBC; Reckase, 1997) for item pairs,

$$\mathrm{MBC}_{jj'} = \frac{1}{N}\sum_{i=1}^{N}\left[X_{ij} - E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)\right]\left[X_{ij'} - E(X_{ij'} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_{j'})\right], \tag{4}$$

where $N$ is the number of examinees and in the current context $E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$ is the conditional expectation of the value of the observable (i.e., the scored item response) for examinee $i$ to item $j$, which is given by the item response function of the model.

MBC is closely related to $Q_3$ (Yen, 1984); $Q_{3,jj'} = r_{e_{ij}e_{ij'}}$, where $r$ refers to the correlation and $e_{ij} = X_{ij} - E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$. For any pair of items, values of MBC and $Q_3$ that are greater (less) than zero indicate the presence of positive (negative) local dependence, whereas values of zero indicate local independence. Note that both MBC and $Q_3$ condition on the latent variables through their use of $E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$, the (conditional) expectation of the value of the response for examinee $i$ to item $j$. Research on discrepancy measures for dimensionality and local independence assumptions in unidimensional models has shown that measures that examine conditional associations, including MBC and $Q_3$, outperform those that examine marginal associations (Levy, 2010; Levy et al., 2009).
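To make the two item-pair measures concrete, the sketch below computes MBC as the mean product of residuals and $Q_3$ as the Pearson correlation of the same residuals. The helper names and the tiny data set are hypothetical; in practice the expected values would come from the fitted item response function.

```python
from math import sqrt


def residuals(x, expected):
    """Observed response minus model-implied expectation, per examinee."""
    return [x_i - e_i for x_i, e_i in zip(x, expected)]


def mbc(x_j, exp_j, x_k, exp_k):
    """Model-based covariance: mean cross-product of residuals for an item pair."""
    res_j, res_k = residuals(x_j, exp_j), residuals(x_k, exp_k)
    return sum(u * v for u, v in zip(res_j, res_k)) / len(res_j)


def q3(x_j, exp_j, x_k, exp_k):
    """Q3: Pearson correlation between the residuals of an item pair."""
    res_j, res_k = residuals(x_j, exp_j), residuals(x_k, exp_k)
    n = len(res_j)
    mean_j, mean_k = sum(res_j) / n, sum(res_k) / n
    cross = sum((u - mean_j) * (v - mean_k) for u, v in zip(res_j, res_k))
    ss_j = sum((u - mean_j) ** 2 for u in res_j)
    ss_k = sum((v - mean_k) ** 2 for v in res_k)
    return cross / sqrt(ss_j * ss_k)


# Four hypothetical examinees, each with model-implied probability 0.5 on every item.
expected = [0.5, 0.5, 0.5, 0.5]
item_a = [1, 0, 1, 0]
item_b = [1, 0, 1, 0]  # moves with item_a: positive local dependence
item_c = [0, 1, 0, 1]  # moves against item_a: negative local dependence

mbc_same, q3_same = mbc(item_a, expected, item_b, expected), q3(item_a, expected, item_b, expected)
mbc_opp, q3_opp = mbc(item_a, expected, item_c, expected), q3(item_a, expected, item_c, expected)
```

Both measures carry the sign of the conditional association, which is why they are natural choices at the level of a single item pair.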

Viewing MBC as an estimator of the conditional covariance in (3), we employ MBC for item pairs as a building block, constructing a generalized dimensionality discrepancy measure (GDDM) as the mean of the absolute values of MBC over unique item pairs

$$\mathrm{GDDM} = \frac{\sum_{j > j'} \left|\mathrm{MBC}_{jj'}\right|}{J(J-1)/2}. \tag{5}$$

It is clear that $\mathrm{GDDM} \ge 0$, with equality holding when weak local independence holds for all pairs of items. GDDM can be viewed as an estimator of

$$\frac{\sum_{j > j'} \left|E_{\boldsymbol{\theta}}\left[\mathrm{Cov}(X_j, X_{j'} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\right]\right|}{J(J-1)/2}, \tag{6}$$

which captures the average absolute amount of local dependence amongst the item pairs. The absolute value is taken because, when estimating a model, the presence of positive local dependence implies the presence of negative local dependence (Habing & Roussos, 2003). By taking the absolute value prior to aggregation, all bivariate information in the data is employed; in this way negative local dependence contributes to the total local dependence assessed in the model. Procedures that focus on positive local dependence (e.g., DIMTEST) might suffer in situations in which unmodelled multidimensionality most prominently manifests itself in terms of negative local dependence (Levy et al., 2009).
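A minimal sketch of GDDM as described above, computed as the mean absolute MBC over the unique item pairs, is given below. The function name, the nested-list data layout, and the toy data are assumptions made for illustration; the model-implied expectations are taken as given.

```python
from itertools import combinations


def gddm(x, expected):
    """Mean absolute model-based covariance over unique item pairs.

    x and expected are N x J nested lists of responses and model-implied expectations.
    """
    n_examinees, n_items = len(x), len(x[0])
    resid = [[x[i][j] - expected[i][j] for j in range(n_items)] for i in range(n_examinees)]
    pairs = list(combinations(range(n_items), 2))
    total = 0.0
    for j, k in pairs:
        mbc_jk = sum(resid[i][j] * resid[i][k] for i in range(n_examinees)) / n_examinees
        total += abs(mbc_jk)  # absolute value taken before aggregation
    return total / len(pairs)


# Hypothetical data: items 1 and 2 move together, item 3 moves against both.
responses = [[1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]
expectations = [[0.5, 0.5, 0.5] for _ in responses]
value = gddm(responses, expectations)
```

Here the three pairwise MBC values are 0.25, -0.25, and -0.25, so a signed average would largely cancel, whereas GDDM registers 0.25 of local dependence per pair. Passing a subset of item columns gives the subtest-level version.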

The quantity in (6) may be thought of as a multidimensional extension of the maximum possible value of the DETECT index (Stout et al., 1996). Whereas the DETECT index conditions on the (single) dimension of best measurement and seeks to identify clusters of items assuming simple structure (Stout et al., 1996; Zhang, 2007; Zhang & Stout, 1999b), the proposed quantity conditions on the possibly multiple dimensions in the model and the possibly factorially complex pattern of dependence of items on the dimensions. The quantity in (6) may similarly be viewed as a multidimensional extension of quantities associated with Stout’s notion of essential unidimensionality (see equation (14) of Stout, 1987). As such GDDM is, in spirit, an extension of the DIMTEST statistic to support confirmatory dimensionality assessment for multidimensional as well as unidimensional models.

GDDM may be formulated at the subtest level in a straightforward manner by taking $J$ in (5) to be the number of items to be investigated in the subtest. At the level of a single item pair (i.e., a subtest with $J = 2$), MBC or $Q_3$ is preferred, as they reflect the direction (positive or negative) of the conditional association in addition to its magnitude. In Section 4, we describe the model checking framework that supports the use of GDDM at the test and subtest level, and MBC and $Q_3$ at the item-pair level.

4. Posterior predictive model checking

The current section advances a Bayesian approach to modelling and model checking to facilitate inference regarding the model’s assumed dimensionality based on GDDM.

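Given the usual conditional independence assumptions, the posterior distribution of person parameters $\boldsymbol{\theta}$ and item parameters $\boldsymbol{\omega}$ given data $\mathbf{X}$ is

$$P(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \mathbf{X}) \propto \prod_{i=1}^{N}\prod_{j=1}^{J} P(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)\,\prod_{i=1}^{N} P(\boldsymbol{\theta}_i)\,\prod_{j=1}^{J} P(\boldsymbol{\omega}_j). \tag{7}$$

PPMC (Gelman, Meng, & Stern, 1996; Meng, 1994; Rubin, 1984) analyses characteristics of the observed data and/or the discrepancy between the data and the model by referring to the posterior predictive distribution

$$P(\mathbf{X}^{rep} \mid \mathbf{X}) = \int\!\!\int P(\mathbf{X}^{rep} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\,P(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \mathbf{X})\,d\boldsymbol{\theta}\,d\boldsymbol{\omega},$$

where $P(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \mathbf{X})$ is the posterior distribution in (7), and $\mathbf{X}^{rep}$ refers to replicated data of the same size (i.e., sample size and number of observables) as the original data.

PPMC is conducted by evaluating discrepancy measures that are constructed to capture relevant features of the data as well as the discrepancy between data and the model (i.e., GDDM). Data–model fit is evaluated by examining the realized values of the discrepancy measures based on the observed data, denoted $D(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\omega})$, relative to the values obtained by employing the posterior predictive distribution, denoted $D(\mathbf{X}^{rep}; \boldsymbol{\theta}, \boldsymbol{\omega})$. In addition to graphical procedures, a common way to summarize the results of a PPMC analysis is via the posterior predictive p (PPP) value (Gelman et al., 1996; Meng, 1994), the tail area of the posterior predictive distribution of the discrepancy measure corresponding to the realized value(s):

$$\mathrm{PPP} = P\left(D(\mathbf{X}^{rep}; \boldsymbol{\theta}, \boldsymbol{\omega}) \ge D(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\omega}) \mid \mathbf{X}\right).$$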

The current work adopts a perspective that views the results of PPMC as diagnostic pieces of evidence for, rather than a hypothesis test of, data–model (mis)fit (Gelman, 2003, 2007; Gelman et al., 1996; Gill, 2007; Stern, 2000). Such an interpretation is particularly relevant in the context of assessing dimensionality and local dependence (Chen & Thissen, 1997) and is consistent with the approach to model criticism that views statistical diagnostics as a component in the larger enterprise of evaluating model adequacy (Sinharay, 2005). From this perspective, PPP values have direct interpretations as expressions of our expectation of the extremity in future replications, conditional on the model (Gelman, 2003, 2007), and serve to summarize the results of the PPMC.

PPP values near .5 indicate that realized discrepancies fall in the middle of the posterior predictive distribution of the discrepancy, evidencing adequate data–model fit in terms of the features captured by the discrepancy measure. Extreme PPP values indicate a lack of data–model fit in terms of the features captured by the discrepancy measure.

For GDDM, a PPP value close to zero will result when the realized values consistently exceed those in the posterior predictive distribution, indicating that the observed data exhibit more local dependence than would be expected based on the model.
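In practice the PPP value is approximated from paired realized and posterior predicted discrepancies, one pair per posterior draw. The sketch below is a minimal illustration under that assumption; the function name and the numerical values are hypothetical, and ties are counted as exceedances, which is immaterial for continuous measures.

```python
def ppp_value(realized, predicted):
    """Proportion of draws whose predicted discrepancy is at least the realized one."""
    if not realized or len(realized) != len(predicted):
        raise ValueError("realized and predicted must be non-empty and of equal length")
    exceed = sum(1 for r, p in zip(realized, predicted) if p >= r)
    return exceed / len(realized)


# Hypothetical discrepancy values for four posterior draws.
realized_values = [0.0040, 0.0041, 0.0042, 0.0043]
predicted_values = [0.0038, 0.0042, 0.0039, 0.0044]
ppp = ppp_value(realized_values, predicted_values)
```

Two of the four predicted values reach or exceed their realized counterparts, giving a PPP of .5, the kind of value that indicates adequate fit for the feature being examined.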

It is noted that the propriety of interpreting PPP values from a hypothesis-testing perspective has been critiqued, as PPP values will not necessarily be uniformly distributed under the null hypothesis of exact data–model fit (Robins, van der Vaart, & Ventura, 2000). As a result, employing PPP values to conduct hypothesis testing could lead to conservative tests. Alternatives to PPP values include conditional predictive and partial predictive p values (Bayarri & Berger, 2000a), which do have desirable properties (from a frequentist, hypothesis-testing perspective) under null conditions (Bayarri & Berger, 2000a; Robins et al., 2000). However, these approaches have several drawbacks (Stern, 2000), including that they do not support discrepancy measures that are a function of the data and model parameters (Bayarri & Berger, 2000a). Research has shown that when investigating dimensionality and local independence assumptions, discrepancy measures that condition on (some function of) the model parameters and examine conditional associations outperform discrepancy measures that are not functions of the model parameters and examine marginal associations (Levy, 2010; Levy et al., 2009; McDonald & Mok, 1995). As a function of the model-implied expectations, GDDM is explicitly constructed to embody the principles of a conditional covariance theory perspective on (multi)dimensionality assessment, which precludes the use of conditional predictive and partial predictive p values. In contrast, PPMC handles GDDM (and other) discrepancy measures naturally. Importantly, it will not necessarily be the case that PPP values will yield overly conservative tests; appropriately chosen discrepancy measures should yield PPP values that perform quite well under null conditions (Bayarri & Berger, 2000b). Levy et al. (2009) found that PPP values for MBC – from a hypothesis-testing perspective – yielded Type I error rates only slightly below the nominal values in the case of unidimensional IRT. Moreover, researchers wishing to adopt a hypothesis-testing perspective may employ methods to calibrate the PPP values to obtain p values with desirable frequentist properties (Hjort, Dahl, & Steinbakk, 2006).

PPMC is a flexible framework for model criticism and holds many potential advantages over traditional techniques (Levy et al., 2009). PPMC makes no appeal to regularity conditions or asymptotic arguments to arrive at reference distributions; the ubiquity of ill-defined reference distributions has hampered traditional approaches to model checking in unidimensional IRT (e.g., Chen & Thissen, 1997). Such problems could potentially be exacerbated in the MIRT models studied here. As a consequence of the flexibility of PPMC, the analyst is free to choose from a broad class of functions, including those that pose difficulties for traditional model checking techniques (e.g., Levy et al., 2009). In addition, the results of PPMC may be compiled and communicated in a variety of ways, including numerically via PPP values or graphically (Gelman, 2003; Gelman et al., 1996; Levy et al., 2009; Sinharay et al., 2006). Furthermore, by using the posterior distribution rather than point estimates for the model parameters, PPMC incorporates the uncertainty in the estimation of the model parameters in the construction of the distributions of the discrepancy measures (Meng, 1994).


5. Simulation study

This section describes a Monte Carlo study examining the utility of GDDM in evaluating the dimensionality assumptions of MIRT models. As described below, data are generated from two- and three-dimensional MIRT models and fitted with a series of models to evaluate the behaviour of GDDM under various conditions.

5.1. Models

All data sets consisted of simulated responses from 1,000 simulated examinees on a test of 36 items following one of several multidimensional model structures of the form in (1) where, for simplicity, all $c_j$ parameters are equal to 0. Table 1 lays out the models’ structures in terms of the values of the parameters used to generate the data. The second column gives the values of the location parameter for the items used for all the models. The columns under the heading M0 give the values of $a_{j1}$ and $a_{j2}$, the discrimination parameters for $\theta_1$ and $\theta_2$, respectively, for M0. As a two-dimensional model with simple structure, M0 serves as the baseline model; the remaining models may be thought of as extensions of this model. M1, M2, and M3 are three-dimensional models in which some of the items are dependent on a third dimension, $\theta_3$. For each of these models, the structure of the dependence of the items on $\theta_1$ and $\theta_2$ remains the same as in M0.

The columns under the headings M1, M2, and M3 indicate which items reflect $\theta_3$ in each of these models, and give the value of $a_{j3}$ for these items. M1 is a three-dimensional model in which some of the items that depend on $\theta_1$ additionally depend on $\theta_3$. M2 is a three-dimensional model in which $\theta_3$ influences items that depend on $\theta_2$ as well as items that depend on $\theta_1$. M3 has a similar structure to M2, but has fewer items that depend on $\theta_3$. The structures in M1–M3 correspond to situations in which $\theta_3$ may be a substantive dimension or a dimension akin to a method factor, say, for use in modelling testlet structures. Finally, M4 is a two-dimensional model; the columns under M4 give the values of $a_{j1}$ and $a_{j2}$. M4 differs from M0 in that several items reflect both $\theta_1$ and $\theta_2$, corresponding to a scenario where multiple skills or proficiencies are involved in solving some of the items. For simplicity, we assume all bivariate correlations between the latent variables are constant, denoted by $\rho$. For each model structure, we consider conditions in which $\rho = 0$. In addition, we consider conditions in which $\rho = .5$ for each of M0, M1, M2, and M4.

5.2. Analyses

For each model, the following Monte Carlo procedures were replicated 50 times. For each replication, 1,000 values of $\theta_1$, $\theta_2$, and $\theta_3$ were generated from a multivariate normal distribution with all variances equal to one and correlations dictated by the condition. These values and the item parameters (Table 1) were entered into the MIRT item response function in (1) to yield the model-implied probability for each of the 1,000 simulated examinees to each of the 36 items. Each of these probabilities was compared to a simulated random variable from a uniform distribution over the unit interval. If the probability exceeded the random uniform draw, the value of the item response was set to one; otherwise it was set to zero.
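The data generation steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' program: the item parameters are hypothetical placeholders for the Table 1 values, the lower asymptote is fixed at zero, and the equicorrelated latent variables are produced with a shared-factor construction (valid for $0 \le \rho < 1$) rather than a general multivariate normal sampler.

```python
import random
from math import sqrt
from statistics import NormalDist


def draw_theta(rng, n_dims, rho):
    """Unit-variance normal latent variables with a common correlation rho (0 <= rho < 1)."""
    shared = rng.gauss(0.0, 1.0)
    return [sqrt(rho) * shared + sqrt(1.0 - rho) * rng.gauss(0.0, 1.0) for _ in range(n_dims)]


def simulate_responses(a, d, n_examinees, rho, seed):
    """Simulate dichotomous responses from a normal ogive MIRT model with c_j = 0."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    data = []
    for _ in range(n_examinees):
        theta = draw_theta(rng, len(a[0]), rho)
        row = []
        for a_j, d_j in zip(a, d):
            prob = phi(sum(a_jm * t_m for a_jm, t_m in zip(a_j, theta)) + d_j)
            # Response is 1 when the model-implied probability exceeds a uniform draw.
            row.append(1 if prob > rng.random() else 0)
        data.append(row)
    return data


# Hypothetical simple-structure items: two load on theta1, two on theta2.
discriminations = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
intercepts = [0.0, -0.5, 0.5, 0.0]
data = simulate_responses(discriminations, intercepts, n_examinees=1000, rho=0.5, seed=1)
```

Fixing the seed makes a replication reproducible; a full study would loop over seeds and over the generating structures.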

To explore the utility of using the proposed procedures, we conducted PPMC using GDDM at the test and subtest levels, and using MBC and $Q_3$ at the item-pair level for a variety of combinations of data generation models and data analysis models. A cross in Table 2 indicates which models were fitted to data from each of the models.

Table 1. The structure of the models in the simulation study

Table 2. Analysis plan of the simulation study

For each of the combinations listed in Table 2, the analysis proceeded as follows. For each of the data sets from the generation model, the analysis model is fitted using Markov chain Monte Carlo procedures (Béguin & Glas, 2001; Bolt & Lall, 2003; Clinton et al., 2004; Yao & Boughton, 2007) via the pscl package in R (Jackman, 2008). This estimation routine specifies independent standard normal priors for the latent examinee variables and requires user input of the prior distributions for the item parameters. The location parameters were assigned diffuse normal prior distributions

The function requires the input of prior distributions for all discrimination parameters.

Those specified by the model were assigned diffuse normal prior distributions

Those not specified by the model (e.g., discrimination parameters for items 1–18 on $\theta_2$ in M0) were assigned normal distributions with a mean of zero and a variance of $1 \times 10^{-12}$, effectively constraining the parameters to be equal to zero.

For five replications in each analysis condition (i.e., each combination listed in Table 2), trace plots and convergence diagnostics (Brooks & Gelman, 1998) were analysed and it was determined that, in all cases, the chains converged by 5,000 iterations. It was concluded that 5,000 iterations would be sufficient to discard as burn-in for the remaining replications in each condition. Thus, for each analysis, three chains were run from overdispersed starting-points for 2,000 iterations after the burn-in phase of 5,000 iterations. These iterations were thinned by 20 and the remaining iterations were pooled to yield 300 draws from the posterior distribution for use in conducting PPMC. These draws are employed with the observed data $\mathbf{X}$ to compute the realized values of the discrepancy measures and then used to generate the $\mathbf{X}^{rep}$, which are then used in computing the posterior predicted values of the discrepancy measures. The PPP value is estimated as the proportion of draws for which the posterior predicted value of the discrepancy measure exceeds the corresponding realized value. Programs to conduct the PPMC were written by the authors and are available upon request.
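The PPMC steps just described (realized value, replicated data, predicted value, PPP) can be sketched generically. The code below is not the authors' program: the `ppmc` function and its callable arguments are hypothetical, and a toy Bernoulli model with made-up draws stands in for the MIRT posterior so that the example stays self-contained. The count of 300 draws mirrors three chains of 2,000 retained iterations thinned by 20.

```python
import random


def ppmc(observed, draws, simulate, discrepancy, rng):
    """Run a posterior predictive check over a collection of posterior draws."""
    realized, predicted = [], []
    for draw in draws:
        # Realized discrepancy: observed data evaluated at this parameter draw.
        realized.append(discrepancy(observed, draw))
        # Replicated data of the same size, generated from the same draw.
        replicated = simulate(draw, len(observed), rng)
        predicted.append(discrepancy(replicated, draw))
    ppp = sum(1 for r, p in zip(realized, predicted) if p > r) / len(draws)
    return realized, predicted, ppp


def simulate_bernoulli(prob, n, rng):
    """Toy data model: n independent Bernoulli(prob) responses."""
    return [1 if rng.random() < prob else 0 for _ in range(n)]


def abs_mean_gap(data, prob):
    """Toy discrepancy: distance between the sample mean and the draw's probability."""
    return abs(sum(data) / len(data) - prob)


n_draws = 3 * (2000 // 20)  # three chains, 2,000 iterations each, thinned by 20
rng = random.Random(2011)
observed = simulate_bernoulli(0.6, 200, rng)
draws = [0.55 + 0.1 * rng.random() for _ in range(n_draws)]  # stand-in for posterior draws
realized, predicted, ppp = ppmc(observed, draws, simulate_bernoulli, abs_mean_gap, rng)
```

Swapping in a MIRT simulator and GDDM for the two callables would give the test-level analysis; the paired `realized` and `predicted` lists are also what a scatterplot against the unit line would display.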

To facilitate a comparison between the performance of the proposed methods and existing procedures, the models were fitted in NOHARM (Fraser & McDonald, 1988) and programs were written to compute two statistics targeting the goodness of fit of the model. Specifically, we consider a statistic introduced by Gessaroli and De Champlain (1996; Finch & Habing, 2005, 2007),

$$\chi^2_{GD} = (N - 3)\sum_{j=2}^{J}\sum_{j'=1}^{j-1} z_{jj'}^2,$$

where $j$ and $j'$ serve to index the items to define the unique pairings of items, and

$$z_{jj'} = \frac{1}{2}\ln\left(\frac{1 + r^{(r)}_{jj'}}{1 - r^{(r)}_{jj'}}\right)$$

is the Fisher z transformation of the residual correlation for pairing of items $j$ and $j'$, given by

$$r^{(r)}_{jj'} = \frac{p^{(r)}_{jj'}}{\sqrt{p^{(0)}_j\left(1 - p^{(0)}_j\right)p^{(0)}_{j'}\left(1 - p^{(0)}_{j'}\right)}},$$

where $p^{(0)}_j$ is the observed proportion of examinees getting item $j$ correct and $p^{(r)}_{jj'}$ is the residual covariance between items $j$ and $j'$. To facilitate hypothesis testing, the value of $\chi^2_{GD}$ is referred to a central $\chi^2$ distribution with degrees of freedom given by $0.5J(J-1) - t$, where $t$ is the number of parameters estimated in fitting the model (Finch & Habing, 2007).

In addition, we consider a statistic introduced by Gessaroli, De Champlain, and Folske (1997; Finch & Habing, 2005, 2007),

$$\mathrm{ALR} = \sum_{j=2}^{J}\sum_{j'=1}^{j-1} G^2_{jj'},$$

where

$$G^2_{jj'} = 2N\sum_{k_j=0}^{1}\sum_{k_{j'}=0}^{1} p_{k_jk_{j'}}\ln\left(\frac{p_{k_jk_{j'}}}{\hat{p}_{k_jk_{j'}}}\right),$$

in which $k_j$ and $k_{j'}$ are the scores (0 or 1) for items $j$ and $j'$, respectively, and $p_{k_jk_{j'}}$ and $\hat{p}_{k_jk_{j'}}$ are the observed and model-implied proportions of examinees with response patterns given by the combination of $k_j$ and $k_{j'}$, respectively (see Finch & Habing, 2005, for further details on the calculation using NOHARM output). To facilitate hypothesis testing, ALR is referred to the same $\chi^2$ distribution as $\chi^2_{GD}$ (Finch & Habing, 2007).
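For the residual-correlation statistic described above, a small numerical sketch may help. It assumes the commonly cited form of the statistic, $(N - 3)$ times the sum of squared Fisher z values over unique item pairs, and uses hypothetical residual covariances, proportions correct, and parameter count; none of these values come from the study or from NOHARM output.

```python
from math import atanh, sqrt


def residual_correlation(resid_cov, p_j, p_k):
    """Residual covariance scaled by the items' observed binomial standard deviations."""
    return resid_cov / sqrt(p_j * (1.0 - p_j) * p_k * (1.0 - p_k))


def chi2_gd(n_examinees, resid_cov, p_correct):
    """Assumed form: (N - 3) * sum of squared Fisher z residual correlations."""
    total = 0.0
    for j in range(1, len(p_correct)):
        for k in range(j):
            r = residual_correlation(resid_cov[j][k], p_correct[j], p_correct[k])
            total += atanh(r) ** 2  # atanh(r) equals 0.5 * ln((1 + r) / (1 - r))
    return (n_examinees - 3) * total


def reference_df(n_items, n_params):
    """Degrees of freedom 0.5 * J * (J - 1) - t for the chi-square reference distribution."""
    return n_items * (n_items - 1) // 2 - n_params


# Hypothetical three-item example with residual correlations of 0.1 for every pair.
p_correct = [0.5, 0.5, 0.5]
resid_cov = [[0.0, 0.025, 0.025], [0.025, 0.0, 0.025], [0.025, 0.025, 0.0]]
statistic = chi2_gd(1000, resid_cov, p_correct)

# For 36 items and a hypothetical count of 72 estimated parameters.
df_example = reference_df(36, 72)
```

With 36 items there are 630 unique pairs, so the reference distribution has 630 minus the number of estimated parameters as its degrees of freedom; the value 72 is only a placeholder.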

5.3. Results

Following Gelman et al. (1996), we recommend the use of graphical representations of the results of PPMC in an applied analysis. Figure 1 contains scatterplots of the realized and posterior predicted values of GDDM from an analysis of one data set generated from M1 with $\rho = 0$. Figure 1a contains a scatterplot of the realized and posterior predicted values of GDDM obtained by fitting M1, which is the correct model. Figure 1b contains the scatterplot obtained when the data were fitted with M0, which underspecifies the dimensionality. In each plot, the unit line is added as a reference. It is seen that in Figure 1a the points appear to be randomly distributed around the line, indicating that the realized and posterior predicted values are comparable, evidencing adequate data–model fit. In Figure 1b, the points fall below and to the right of the line, indicating that the realized values are frequently larger than the posterior predicted values, hence that the observed data exhibit more local dependence than the posterior predicted data, as captured by GDDM. The PPP values summarize these graphical representations as the proportions of points above and to the left of the unit line, which in this case are .46 and .01, respectively. For ease of exposition, the majority of the results of the simulation study are presented in terms of PPP values as summaries of each PPMC analysis.

[Figure 1, panels (a) and (b). Horizontal axis: Realized GDDM; vertical axis: Posterior predictive GDDM.]

Figure 1. Scatterplots of the realized and posterior predicted values of GDDM from the analysis of a data set from M1: (a) results from fitting M1; and (b) results from fitting M0.

Table 3 summarizes the results of the study for GDDM at the test level, $\chi^2_{GD}$, and ALR in each of the conditions. For each condition defined by the first two columns, the proportions of replications with p values below .10 and .05 obtained from fitting the correct model are given first, followed by the proportions of replications with p values below .10 and .05 obtained from fitting M0. For GDDM, the p values are PPP values based on PPMC; for $\chi^2_{GD}$ and ALR the p values are obtained via the $\chi^2$ reference distributions as described above.

Table 4 presents the proportion of PPP values below .10 and .05 for GDDM evaluated on subtests when M0 was fitted to data from the various conditions. The first and second subtests were defined as the first half of the test (items 1–18) and second half of the test (items 19–36), respectively. From the perspective of fitting M0, investigating GDDM for these subtests corresponds to investigating the dimensionality that is assumed for each modelled dimension. For all models except M1, the results of the PPP values from the two subtests were pooled following an assumption that the two subtests are symmetric with respect to their dimensionality. For M1, the subtests are not symmetric; the first half of the test reflects $\theta_1$ and $\theta_3$ while the second half of the test reflects $\theta_2$ only. For M1, the proportions of PPP values below .10 and .05 are listed for the two subtests separately.

Table 3. Proportions of p values for GDDM, $\chi^2_{GD}$, and ALR in the analysis conditions^a

Table 4. Proportions of PPP values beyond .10 and .05 for subtests when M0 was fitted to data from the generation models: except where noted, the results of the two subtests are pooled

Table 5 presents the median and proportion of extreme PPP values of MBC for different types of item pairs from fitting M0 to data in each of the conditions where ρ = 0. Extremely high PPP values are considered as well as extremely low PPP values because MBC is a directional measure of local dependence. For each model, item pairs are defined by types based on the dimension(s) they reflect. Item-pair types that have an exchangeable dimensional structure are pooled. For example, in M0 (1–1) refers to item pairs in which both items reflect θ1 and (2–2) refers to item pairs in which both items reflect θ2. These item pairs have the same dimensional structure in the sense that they both reflect one of the correctly modelled dimensions. As such they are pooled together but kept separate from item pair (1–2), in which one item reflects θ1 and the other item reflects θ2. Results for the analysis of Q3 are not presented as they were quite close to those of MBC. Similarly, results for MBC and Q3 when the latent variables were correlated (ρ = .5) are not presented as they exhibited patterns consistent with those in Table 5.

Table 5. Median PPP values and proportions of extreme PPP values for MBC by types of item pairs^a when M0 was fit to data from the generation models with ρ = 0
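Because MBC is directional, both tails of the PPP distribution are informative. The sketch below summarizes a set of item-pair PPP values by their median and by the proportion in each tail. It assumes that the stated level applies to each tail separately, which the text does not spell out; the function name `summarize_ppp` and the example values are hypothetical.

```python
import numpy as np

def summarize_ppp(ppp_values, level=0.05):
    """Median PPP value and the proportion of PPP values in each tail.

    Assumption: `level` is applied to each tail separately, so a PPP value
    is 'extremely low' below `level` and 'extremely high' above 1 - level.
    """
    ppp = np.asarray(ppp_values, dtype=float)
    return {
        "median": float(np.median(ppp)),
        "low_tail": float(np.mean(ppp < level)),          # positive local dependence
        "high_tail": float(np.mean(ppp > 1.0 - level)),   # negative local dependence
    }

# Hypothetical PPP values for eight item pairs of one type.
summary = summarize_ppp([0.01, 0.03, 0.45, 0.50, 0.52, 0.60, 0.97, 0.99])
print(summary)
```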

5.4. Discussion

Viewing a PPMC analysis as a diagnostic approach to evaluating data–model fit (Gelman, 2003; Gelman et al., 1996; Levy et al., 2009; Sinharay, 2005), the results in Table 3 indicate that when M0 is fitted to data from more dimensionally complex models, PPMC using GDDM is likely to yield patterns indicative of data–model misfit akin to that in Figure 1b.

When the correct model is fitted to data, PPMC is unlikely to yield such patterns. Instead, PPMC will typically yield patterns indicative of adequate data–model fit as in Figure 1a.

The results for M1, M2, and M3 indicate that GDDM is able to detect data–model misfit of the two-dimensional model when fitted to data that follow a three-dimensional model, both when the influence of the unmodelled dimension is concentrated on items that depend on one of the other dimensions (M1) and when its influence is more widely distributed across the total set of items (M2 and M3).

The results for fitting M0 to data that follow M4 indicate that GDDM is able to detect data–model misfit of the simpler two-dimensional model when fitted to data that follow a two-dimensional model with more complex structure. Similarly, the results from fitting M3 to data that follow M2 indicate that GDDM is able to detect data–model misfit when the analysis model correctly specifies three dimensions but the pattern of the dependence of the items on those dimensions is incorrectly specified. These results support the argument advanced above that local independence should be viewed not merely with respect to the number of dimensions, but with respect to the model as constituted by dimensions and the patterns of dependence of the observables on those dimensions.

A hypothesis-testing perspective views the proportions in columns 3 through 8 in Table 3 as being akin to empirical Type I error rates and the proportions in the remaining columns as akin to power rates for detecting data–model misfit due to the improperly specified dimensionality. Although the current work adopts a diagnostic perspective on PPMC, it is interesting to note that, from a hypothesis-testing perspective, GDDM has considerable power to detect data–model misfit in dimensionally misspecified models while maintaining Type I error rates. With only a few exceptions (under M4 with ρ = 0, M2 with ρ = .5, and M1 with ρ = .5 and α = .05), the Type I error rates were slightly below the nominal values, which is consistent with theoretical work on the distributions of PPP values under null conditions (Robins et al., 2000) and previous work on the use of PPP values in IRT (e.g., Sinharay et al., 2006). In the context of dimensionality and local dependence assessment for unidimensional models, Levy et al. (2009) found that MBC, Q3, and related indices exhibited empirical Type I error rates at or slightly below nominal values. The present study finds that GDDM, which extends MBC, yields similar rates in the multidimensional models studied here.

In contrast, the empirical Type I error rates for χ²_GD and ALR were generally well below nominal values. GDDM also considerably outperformed χ²_GD and ALR in terms of detecting data–model misfit when the data followed M1 and M3. GDDM performed as well as or better than χ²_GD and ALR when the data followed M2. In M1–M3, ALR outperformed χ²_GD, and performed nearly as well as GDDM when the data followed M2. When the data followed M4, χ²_GD and ALR performed similarly and outperformed GDDM.

Within any model structure, all of the discrepancy measures performed as well or better in detecting unmodelled dimensionality when the latent variables were uncorrelated compared to when they were correlated, as is consistent with theoretical and previous empirical work in unidimensional modelling (e.g., Levy et al., 2009; Nandakumar & Stout, 1993; Nandakumar, Yu, Li, & Stout, 1998; Stout, 1987; Zhang & Stout, 1999a). Overall, GDDM performed favourably as compared to χ²_GD and ALR.

The results from fitting M2, M3, and M0 to the data from the M2 condition illustrate the way GDDM captures increasingly poor data–model fit. In these conditions, M2 is the correct model, M3 departs from M2 by failing to model the dependencies of some of the items on θ3, and M0 further departs by failing to model the dependencies of any of the items on θ3. As such, M3 and M0 are increasingly misspecified models relative to the correct model structure of M2. Table 3 indicates that the use of PPP values from GDDM allows for the detection of data–model misfit associated with fitting M3 and M0 in every replication. To further illustrate how GDDM assesses the magnitude of the data–model misfit, Figure 2 plots the distributions of the means of the realized GDDM values from fitting each model to the data from the M2 condition with ρ = 0. The solid line represents the distribution of mean GDDM values obtained from fitting M2; the dotted and dashed lines represent the distributions of mean GDDM values obtained from fitting M3 and M0, respectively. As is evident, for all data sets the values of the mean GDDM from both M3 and M0 exceeded that from M2. Furthermore, the mean values from M0 (dashed) were in general larger than those from M3 (dotted). This finding supports the interpretation that larger values of GDDM are indicative of worse data–model fit.

Figure 2. Distributions of mean GDDM values (horizontal axis: mean GDDM) fitting M0 (dashed line), M2 (solid), and M3 (dotted) to data from M2.

The results in Table 4 indicate that GDDM at the subtest level exhibits patterns similar to those of GDDM at the test level. When the data are generated from M0 and fitted with M0, extreme PPP values occur rarely (i.e., in hypothesis-testing terms: slightly above the nominal level of .10 when ρ = 0, below the nominal level of .10 when ρ = .5, and at the nominal level of .05 in both conditions). When M0 is fitted to data from other models, the proportions are all larger than when fitted to data from M0, indicating GDDM at the subtest level is sensitive to the unmodelled dimensionality. The results for fitting M0 to data from M1 exhibit a pattern consistent with expectation: GDDM detects the presence of the extraneous dimensionality in the first subtest, but not in the second subtest. When the unmodelled dimension influences items on both subtests (M2), GDDM indicates data–model misfit on each subtest in all replications. Reducing the number of items that reflect the unmodelled dimension in M3 reduces the capacity for GDDM to detect data–model misfit at the subtest level. This further supports the previous findings of GDDM as being sensitive to the magnitude of the data–model misfit. Consistent with the findings at the test level, the performance of GDDM at the subtest level suffered when the latent variables were correlated.

The results for the investigation of MBC at the item-pair level when fitting M0 (Table 5) reveal a number of patterns. In the M0 condition the analysis model is correctly specified.


Accordingly, the median PPP values for the two types of item pairs are near .5, and the proportion of extreme PPP values in the tails beyond .10 and .05 is just under those respective values. These results parallel analogous results from the investigation of MBC in unidimensional models (Levy et al., 2009).

Under M1, the influence of θ3 is localized to the first half of the test, and does not influence any of the items that reflect θ2; associations involving these items should be well modelled. Accordingly, the pairings of items where one or both of the items reflect θ2 (labelled 1–2, 2–13, and 2–2) yield medians of PPP values at .5 and proportions of extreme PPP values just below the nominal levels, as in the M0 condition. A conditional covariance theory perspective on estimation and dimensionality (Levy et al., 2009; Stout et al., 1996; Zhang & Stout, 1999a) implies that the dimension estimated as θ1 when M0 is fitted to data that follow M1 is a complex combination of the true θ1 and θ3. As a result, pairings of items with the same dimensional structure (either 1–1 or 13–13) should exhibit positive local dependence, whereas item pairs with different dimensional structures with respect to the estimated θ1 (1–13) should exhibit negative local dependence. This is exactly what was observed in the M1 condition, where the 1–1 pairings and the 13–13 pairings yielded increasingly high proportions of small PPP values, indicating positive local dependence, and the 1–13 pairings yielded relatively high proportions of large PPP values, indicating negative local dependence. These patterns of positive and negative local dependence mirror those found using MBC and related indices in the analysis of unidimensional IRT in light of unmodelled multidimensionality (Levy, 2010; Levy et al., 2009).

Under M2, the influence of θ3 is distributed over both halves of the test. A conditional covariance theory perspective implies that the dimension estimated as θ1 when M0 is fitted to data is a complex combination of the true θ1 and θ3 and the dimension estimated as θ2 when M0 is fitted to data is a complex combination of the true θ2 and θ3. Akin to the results in M1, pairs in which both items reflect θ3 (13–13, 23–23, 13–23) yielded low PPP values, indicating positive local dependence, as did pairs where both items reflected just θ1 or θ2 (1–1, 2–2). Similarly, item pairs in which one item reflects θ3 and the other item reflects either θ1 or θ2 (1–13, 2–23, 1–23, 2–13) yielded high PPP values, indicating negative local dependence. Finally, the PPP values for the 1–2 pairs were somewhat smaller than .5. Under M3, pairings in which one or both items did not reflect θ3 yielded more moderate PPP values than their counterparts under M2. In contrast, the PPP values for item pairs in which both items reflect θ3 (13–13, 23–23, 13–23) were smaller than their counterparts under M2.

Under M4, the PPP values for the 12–12 item pairs are small, indicating positive local dependence. The PPP values for the remaining types of item pairs are closer to .5, indicating their associations are well modelled. Under M4, the unmodelled multidimensionality predominantly manifests itself in terms of item pairs in which both items reflect θ1 and θ2. We speculate that the results for the remaining item pairs are consistent with a conditional covariance theory perspective. However, it is unclear whether the results for the remaining types of item pairs indicate true patterns or are merely reflective of random variation around .5. Theoretical research on the extension of conditional covariance theory to multidimensional models and empirical research investigating patterns of local dependence in multidimensional contexts are necessary to further investigate this possibility.

Synthesizing across the results, it is clear that the higher level of aggregation (subtests above item pairs, test above subtests) yields better performance in terms of higher rates of indicating the presence of unmodelled multidimensionality. This is because GDDM aggregates the item-pair level local dependencies, akin to how measures of differential bundle functioning aggregate item-level differential item functioning, capitalizing on the amplification of the effects in the aggregation (Nandakumar, 1993).
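The aggregation from item pairs to subtests and to the whole test can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes that MBC for a pair of items is the average over examinees of the product of residuals from the model-implied expectations, and that GDDM is the mean of the absolute MBC values over the unique item pairs in the chosen set. The names `mbc_matrix` and `gddm` and the toy data are hypothetical. In a PPMC analysis the expectations would be recomputed for each posterior draw.

```python
import numpy as np
from itertools import combinations

def mbc_matrix(X, E):
    """Model-based covariances for all item pairs.

    X: (N, J) array of observed responses.
    E: (N, J) array of model-implied expected responses for one posterior draw.
    """
    residuals = np.asarray(X, dtype=float) - np.asarray(E, dtype=float)
    return residuals.T @ residuals / residuals.shape[0]

def gddm(X, E, items=None):
    """Mean absolute MBC over unique item pairs (assumed definition).

    `items` selects a subtest by zero-based column index; None uses all items.
    """
    M = mbc_matrix(X, E)
    if items is None:
        items = range(M.shape[0])
    pairs = list(combinations(items, 2))
    return float(np.mean([abs(M[j, k]) for j, k in pairs]))

# Toy data: items 0 and 1 move together, item 2 moves against them.
X = np.array([[1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]])
E = np.full(X.shape, 0.5)
print(mbc_matrix(X, E)[0, 2], gddm(X, E), gddm(X, E, items=[0, 1]))
```

Because absolute values are averaged, positive and negative item-pair dependencies both add to the aggregate rather than cancelling.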

6. Analysis of National Assessment of Educational Progress data

The procedures introduced above are illustrated in an analysis of item response data from the 1996 National Assessment of Educational Progress (NAEP). The NAEP science assessment framework specifies three content areas: physical science, earth science, and life science. Each item is classified in terms of one of these areas; subscale scores are reported on each of these dimensions (Allen, Carlson, & Zelenak, 1999). Item responses to 16 items in block S20 from students from the national sample were analysed. The block contained four items in life science and six items in each of physical science and earth science. Eight of the items were multiple-choice items scored dichotomously and eight were constructed response items scored via integers. For the purposes of this analysis, the responses to the constructed response items were dichotomized in a manner such that the collapsing of categories results in the most balanced dichotomous response frequencies for each item. Missing responses prior to the last observed response were regarded as intentional omissions and were scored as incorrect. The analysis was performed on 1,020 examinees with complete data.
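The dichotomization rule for the constructed response items can be illustrated with a short sketch. It assumes that "most balanced" means the cut point whose proportion of responses scored 1 is closest to .5, and that ties are resolved in favour of the lowest such cut; neither detail is stated in the text. The function name `most_balanced_cut` and the example scores are hypothetical.

```python
import numpy as np

def most_balanced_cut(scores):
    """Return (cut, dichotomized) where responses >= cut are scored 1.

    Assumption: the chosen cut makes the proportion of 1s closest to .5;
    ties go to the lowest cut.
    """
    scores = np.asarray(scores)
    cuts = np.unique(scores)[1:]  # a cut at the minimum would score everyone 1
    if cuts.size == 0:
        raise ValueError("at least two distinct score categories are required")
    proportions = np.array([np.mean(scores >= c) for c in cuts])
    best = cuts[np.argmin(np.abs(proportions - 0.5))]
    return best, (scores >= best).astype(int)

# Hypothetical integer scores for one constructed response item.
cut, dichotomized = most_balanced_cut([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
print(cut, dichotomized)
```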

6.1. MIRT model structure

A three-dimensional MIRT model is analysed, where the latent variables correspond to proficiency in the areas of physical science, earth science, and life science. Each item is modelled as reflecting one of the latent variables in accordance with the NAEP classifications of items in terms of the content areas. Figure 3 contains a path diagram representation of the hypothesized model.
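The hypothesized structure can be written as an item-by-dimension pattern matrix. The sketch below takes the item assignments from the labels in Figure 3 and pairs them with a response function. The response function is an assumption: equation (1) is defined earlier in the paper, and the form used here (a compensatory model with a logistic link and a lower asymptote) is only one plausible reading. The names `loading_pattern` and `p_correct` are hypothetical.

```python
import numpy as np

# Item assignments as labelled in Figure 3 (item numbers are 1-based).
AREAS = {
    "physical": [3, 6, 7, 11, 15, 16],   # theta_1
    "earth": [2, 4, 5, 8, 9, 12],        # theta_2
    "life": [1, 10, 13, 14],             # theta_3
}

def loading_pattern(areas, n_items=16):
    """0/1 matrix with a 1 where an item is modelled as reflecting a dimension."""
    Q = np.zeros((n_items, len(areas)), dtype=int)
    for m, items in enumerate(areas.values()):
        for j in items:
            Q[j - 1, m] = 1
    return Q

def p_correct(theta, a, d, c=0.0):
    """Assumed response function: c + (1 - c) * logistic(a . theta + d)."""
    z = np.dot(a, theta) + d
    return c + (1.0 - c) / (1.0 + np.exp(-z))

Q = loading_pattern(AREAS)
print(Q.sum(axis=0))  # number of items per dimension
print(p_correct(np.zeros(3), Q[2].astype(float), 0.0))  # item 3, c = 0
```

Setting `c=0.0` corresponds to the treatment of the constructed response items, for which no lower asymptote is estimated.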

6.2. Bayesian analysis and Markov chain Monte Carlo estimation

Figure 3. The MIRT model for the NAEP analysis. Physical science (θ1): X3, X6, X7, X11, X15, X16; earth science (θ2): X2, X4, X5, X8, X9, X12; life science (θ3): X1, X10, X13, X14.

For the eight multiple-choice items, the probability of a correct response from examinee i to item j was given via the MIRT model in (1). For the eight constructed response items, the probability of a correct response from examinee i to item j was given via the MIRT model in (1) where cj = 0. The model was identified by specifying the discrimination parameter for the first item on each latent variable to be unity; that is, the discrimination parameters for items 3, 2, and 1 for θ1, θ2, and θ3, respectively, were fixed at one. The remaining unknown discrimination parameters, lower asymptote parameters, and all of the location parameters were assigned prior distributions:

where I(0, ∞) for the specification of the prior on each ajm constrains the distribution to have support over the positive real line, modelling the hypothesis that the probability of a correct response monotonically increases with increases in any θm.

For each examinee, the prior distribution for the latent variables was multivariate normal with a mean vector set at 0 to identify the model

A diffuse inverse-Wishart prior distribution was specified for the covariance matrix

where I is the identity matrix of rank M = 3.

The model was estimated in WinBUGS 1.4 (Spiegelhalter, Thomas, Best, & Lunn, 2007).¹ Three chains from dispersed start values were run for 6,000 iterations using program-chosen sampling algorithms, including a Metropolis algorithm in which the variance of the normal proposal distribution is adapted for the first 4,000 iterations.

As measured by visual inspection of trace plots and Brooks–Gelman–Rubin diagnostics (Brooks & Gelman, 1998), the 4,000 iterations needed to adapt the proposal distribution in the Metropolis sampler were sufficient for the chains to converge. The remaining iterations were thinned by a factor of 20 and pooled to yield 300 draws used to conduct PPMC.
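The count of 300 draws follows from the run lengths stated above. A minimal sketch of the burn-in, thinning, and pooling arithmetic is given below; the placeholder array stands in for the sampled values of a single parameter, and which iterations within the retained stretch were kept is an assumption.

```python
import numpy as np

n_chains, n_iter, burn_in, thin = 3, 6000, 4000, 20

# Placeholder for sampled values of one parameter: one row per chain.
chains = np.zeros((n_chains, n_iter))

# Discard the adaptation period, keep every 20th remaining iteration,
# then pool the retained draws across chains.
kept = chains[:, burn_in::thin]
pooled = kept.reshape(-1)

print(kept.shape, pooled.size)
```

Each chain contributes (6,000 − 4,000) / 20 = 100 draws, so pooling three chains gives the 300 draws used for PPMC.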

6.3. PPMC analysis

A PPMC analysis was conducted on the NAEP data, evaluating GDDM at the test level, GDDM at the subtest level where three subtests are defined in terms of the three content areas (Figure 3), and MBC and Q3 at the item-pair level.

Figure 4 is a scatterplot of the 300 realized and posterior predicted values of GDDM evaluated on the entire test, where it is clearly seen that the realized values tend to be larger than their posterior predicted counterparts; the PPP value from this analysis was .08. This indicates that the solution to the model, in terms of the posterior distribution, suffers in terms of adequately accounting for the associations in the data.

Figure 5 shows scatterplots for the 300 realized and posterior predicted values of GDDM evaluated in each of three subtests. The PPP values for the physical science, earth science, and life science subtests were .38, .25, and .48, respectively. These results suggest that, within content areas, the associations among the items are well accounted for by the dimensional structure in the model.

Figure 4. Scatterplot of the realized and posterior predicted values of GDDM from the analysis of the NAEP data (horizontal axis: realized GDDM).

Figure 5. Scatterplot of the realized and posterior predicted values of the subtest GDDM from the analysis of the NAEP data (horizontal axis: realized GDDM): results from (a) the physical science subtest; (b) the earth science subtest; (c) the life science subtest.

Figure 6. Graphical representation of PPP values of MBC for item pairs from the analysis of NAEP data (items 1–16 along the diagonal; shading key with breakpoints at 0, .02, .05, .5, .95, .98, and 1).

¹ WinBUGS parameterizes distributions slightly differently from the conventions adopted here. Specifically, it employs the precision (i.e., the inverse of the variance) in specifying normal distributions and a parameterization of the beta distribution such that the priors are specified in WinBUGS as ajm ∼ N(0, 0.10)I(0, ∞), dj ∼ N(0, 0.10), and cj ∼ β(4, 16).

At the item-pair level, the results for MBC and Q3 were nearly identical; results for MBC will be presented and discussed. Figure 6 contains a matrix graphical plot of the PPP values for MBC at the item-pair level, where the numbers along the diagonal indicate the item in that row and column of the matrix. The shading of the square in each element in the matrix conveys the value of the PPP value as indicated by the key (e.g., a white square indicates that the PPP value lies between 0 and .02). Focusing on the more extreme PPP values, the results suggest that the strongest residual associations include: (a) the positive residual associations between items 10 and 12, 10 and 16, 1 and 12, 13 and 16, and to a lesser extent between 6 and 10, 7 and 14, and 8 and 13, and (b) the negative residual associations between items 7 and 10, 10 and 11, 5 and 6, and to a lesser extent between 5 and 8, and 3 and 16. With the exception of these last two pairings, these pairings involve items from different content areas.
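The statement about content areas can be checked against the item assignments shown in Figure 3. The sketch below is only a bookkeeping aid: the dictionary encodes the assignments as labelled in the figure, the list encodes the item pairs named in the text, and the variable names are hypothetical.

```python
# Item assignments as labelled in Figure 3.
AREA_ITEMS = {
    "physical": [3, 6, 7, 11, 15, 16],
    "earth": [2, 4, 5, 8, 9, 12],
    "life": [1, 10, 13, 14],
}
ITEM_AREA = {item: area for area, items in AREA_ITEMS.items() for item in items}

# Item pairs singled out in the text, in the order they are listed.
flagged_pairs = [
    (10, 12), (10, 16), (1, 12), (13, 16), (6, 10), (7, 14), (8, 13),  # positive
    (7, 10), (10, 11), (5, 6), (5, 8), (3, 16),                        # negative
]

same_area = [p for p in flagged_pairs if ITEM_AREA[p[0]] == ITEM_AREA[p[1]]]
cross_area = [p for p in flagged_pairs if ITEM_AREA[p[0]] != ITEM_AREA[p[1]]]
with_item_10 = [p for p in flagged_pairs if 10 in p]

print(same_area)
print(len(cross_area), len(with_item_10))
```

Only the pairs (5, 8) and (3, 16) fall within a single content area, and five of the flagged pairs involve item 10.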

Synthesizing the above results, the PPMC analysis using GDDM at the test level suggests that the three-dimensional MIRT model with dimensions defined by the content classifications of the items does not adequately account for the dependencies among the items. However, within subtests the associations are well accounted for, as evidenced by the results for GDDM at the subtest level. Rather, the additional dependencies appear to be in terms of several sets of item pairs drawn from different subtests, in particular several item pairs involving item 10. At this point, this diagnostic information could be leveraged to investigate substantive reasons for the weaknesses of the model with subject matter experts and the assessment design team. Of the many possible explanations, Wei and Mislevy (2007), who interpreted results from exploratory factor analyses of data from these items, suggested that a factor structure based on a distinction between conceptual understanding and practical reasoning may be more suitable than a factor structure based on content domains.

7. Conclusion

This paper has described a new discrepancy measure supported by a PPMC framework for assessing the dimensionality assumption in MIRT models. Grounded in conditional covariance theory, the discrepancy measure assesses the aggregate magnitude of estimated pairwise conditional covariances in the data, relying on connections between the assumptions of dimensionality and local independence. It was argued that local independence should be viewed with respect to a model in terms of both the number of hypothesized dimensions and the specified patterns of dependence of items on the dimensions.

GDDM is designed to assess the specified dimensional structure for a collection of measured variables. Ideally, the choice to evaluate GDDM on a subset of the items should be based on whether meaningful collections of the items can be determined a priori, as is warranted if the test is constructed to measure subdomains (as in the NAEP example), or if subscale results along those domains will be employed, or if the administered items may be viewed as testlets. In the absence of a priori defined subtests, we recommend the following procedure for critiquing the assumed dimensionality of a set of items. If an analysis of GDDM at the test level indicates the assumed dimensional structure of the model is untenable, the researcher may then follow up with an analysis of subtests if the grouping of items into subtests can be theoretically justified or with an analysis at the finer-grained level of item pairs. Simultaneously, considering the results over the item pairs can be suggestive of the specific weaknesses of the model and the types of structures that might better account for the relationships in the data.

GDDM is parametric in the sense that it employs the model-based expectations of the observed values. However, it is not restricted to MIRT models discussed here. GDDM is constructed to be sufficiently general to be applicable to assess dimensionality assumptions across a broad class of latent variable modelling paradigms that make a variety of distributional assumptions regarding the latent and observable variables, including factor analytic and latent class models in addition to item response models. Similarly, this work has demonstrated the use of PPMC for model criticism for MIRT models. As a flexible framework for evaluating model diagnostics, PPMC may be used to support data–model fit across a variety of psychometric modelling paradigms.

Acknowledgements

We wish to acknowledge the two anonymous reviewers whose comments prompted improvements to this paper.


References

Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.

Bayarri, M. J., & Berger, J. O. (2000a). P values for composite null models. Journal of the American Statistical Association, 95, 1127–1142. doi:10.2307/2669749

Bayarri, M. J., & Berger, J. O. (2000b). Rejoinder. Journal of the American Statistical Association, 95, 1168–1170. doi:10.2307/2669756

Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–562. doi:10.1007/BF02296195

Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27, 395–414. doi:10.1177/0146621603258350

Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455. doi:10.2307/1390675

Chen, W., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.

Clinton, J., Jackman, S., & Rivers, D. (2004). The statistical analysis of roll call data. American Political Science Review, 98, 355–370. doi:10.1017/S0003055404001194

Embretson, S. E. (1997). Multicomponent response models. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 305–321). New York: Springer.

Finch, H., & Habing, B. (2005). Comparison of NOHARM and DETECT in item cluster recovery: Counting dimensions and allocating items. Journal of Educational Measurement, 42, 149–169. doi:10.1111/j.1745-3984.2005.00008

Finch, H., & Habing, B. (2007). Performance of DIMTEST- and NOHARM-based statistics for testing unidimensionality. Applied Psychological Measurement, 31, 292–307. doi:10.1177/0146621606294490

Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares item factor analysis. Multivariate Behavioral Research, 23, 267–269. doi:10.1207/s15327906mbr2302_9

Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. International Statistical Review, 71, 369–382.

Gelman, A. (2007). Comment: Bayesian checking of the second levels of hierarchical models. Statistical Science, 22, 349–352. doi:10.1214/07-STS235A

Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–807.

Gessaroli, M. E., & De Champlain, A. F. (1996). Using an approximate chi-square statistic to test the number of dimensions underlying the responses to a set of items. Journal of Educational Measurement, 33, 157–192. doi:10.1111/j.1745-3984.1996.tb00487.x

Gessaroli, M. E., De Champlain, A. F., & Folske, J. C. (1997). Assessing dimensionality using a likelihood-ratio chi-square test based on a non-linear factor analysis of item response data. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, March.

Gill, J. (2007). Bayesian methods: A social and behavioral sciences approach. (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.

Habing, B., & Roussos, L. A. (2003). On the need for local item dependence. Psychometrika, 68, 435–451. doi:10.1007/BF02294736

Hattie, J. (1984). An empirical study of various indices for determining unidimensionality. Multivariate Behavioral Research, 19, 49–78. doi:10.1207/s15327906mbr1901_3

Hattie, J., Krakowski, K., Rogers, H. J., & Swaminathan, H. (1996). An assessment of Stout’s index of essential unidimensionality. Applied Psychological Measurement, 20, 1–14. doi:10.1177/014662169602000101


Hjort, N. L., Dahl, F. A., & Steinbakk, G. H. (2006). Post-processing posterior predictive p values. Journal of the American Statistical Association, 101, 1157–1174. doi:10.1198/016214505000001393

Hoijtink, H. (2001). Conditional independence and differential item functioning in the two-parameter logistic model. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays in item response theory (pp. 109–129). New York: Springer.

Ip, E. H. (2001). Testing for local dependency in dichotomous and polytomous item response models. Psychometrika, 66, 109–132. doi:10.1007/BF02295736

Jackman, S. (2008). pscl: Classes and methods for R developed in the political science computational laboratory, Stanford University. Department of Political Science, Stanford University, Stanford, CA. R package version 1.03. Retrieved from http://pscl.stanford.edu/

Levy, R. (2010). Posterior predictive model checking for conjunctive multidimensionality in item response theory. Journal of Educational and Behavioral Statistics. Advance online publication.

Levy, R., Mislevy, R. J., & Sinharay, S. (2009). Posterior predictive model checking for multidimensionality in item response theory. Applied Psychological Measurement, 33, 519–537. doi:10.1177/0146621608329504

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

McDonald, R. P. (1997). Normal-ogive multidimensional model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 257–269). New York: Springer.

McDonald, R. P., & Mok, M. M. C. (1995). Goodness of fit in item response models. Multivariate Behavioral Research, 30, 23–40. doi:10.1207/s15327906mbr3001_2

Meng, X. L. (1994). Posterior predictive p-values. Annals of Statistics, 22, 1142–1160. doi:10.1214/aos/1176325622

Nandakumar, R. (1993). Simultaneous DIF amplification and cancellation: Shealy-Stout’s test for DIF. Journal of Educational Measurement, 30, 293–311. doi:10.1111/j.1745-3984.1993.tb00428.x

Nandakumar, R., & Stout, W. F. (1993). Refinement of Stout’s procedure for assessing latent trait dimensionality. Journal of Educational Statistics, 18, 41–68. doi:10.2307/1165182

Nandakumar, R., Yu, F., Li, H., & Stout, W. (1998). Assessing unidimensionality of polytomous data. Applied Psychological Measurement, 22, 99–115. doi:10.1177/01466216980222001

Reckase, M. D. (1997). A linear logistic multidimensional model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271–286). New York: Springer.

Robins, J. M., van der Vaart, A., & Ventura, V. (2000). The asymptotic distribution of P values in composite null models. Journal of the American Statistical Association, 95, 1143–1172. doi:10.2307/2669750

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151–1172. doi:10.1214/aos/1176346785

Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42, 375–394. doi:10.1111/j.1745-3984.2005.00021.x

Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical and Statistical Psychology, 59, 429–449. doi:10.1348/000711005X66888

Sinharay, S., Johnson, M., & Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Applied Psychological Measurement, 30, 298–321. doi:10.1177/0146621605285517

Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2007). WinBUGS user manual: Version 1.4.3. Cambridge: MRC Biostatistics Unit. Retrieved from http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml