British Journal of Mathematical and Statistical Psychology (2011), 64, 208–232

© 2010 The British Psychological Society


**A generalized dimensionality discrepancy measure for dimensionality assessment in multidimensional item response theory**

### Roy Levy∗ and Dubravka Svetina

Arizona State University, Tempe, Arizona, USA

A generalized dimensionality discrepancy measure is introduced to facilitate a critique of dimensionality assumptions in multidimensional item response models. Connections between dimensionality and local independence motivate the development of the discrepancy measure from a conditional covariance theory perspective. A simulation study and a real-data analysis demonstrate the utility of the discrepancy measure’s application at multiple levels of analysis in a posterior predictive model checking framework.

**1. Introduction**

Zhang and Stout (1999a) characterized approaches to dimensionality assessment for item response data in terms of those that (a) are more exploratory in nature and attempt full dimensionality assessment by estimating the number of latent dimensions and determining which items reflect which dimension(s) or (b) are more confirmatory in that they assess (departures from) unidimensionality. In addressing the latter of these, Stout (1987) framed the DIMTEST procedure as responding to Lord’s (1980, p. 21) call for a statistical procedure to assess unidimensionality in item response theory (IRT).

However, assessments are often multidimensional in the sense that performance on tasks depends on distinct though possibly related proficiencies or aspects of proficiency.

Advances in estimation capabilities, including those afforded by adopting Bayesian approaches to statistical modelling, have supported the growth in the popularity of research on and the use of multidimensional IRT (MIRT) models for item response and similar types of data (Béguin & Glas, 2001; Bolt & Lall, 2003; Clinton, Jackman, & Rivers, 2004; Yao & Boughton, 2007). With the emergence of such models comes the need for appropriate model checking and model criticism procedures. Thus, Lord’s (1980) call may be extended to a call for procedures for assessing dimensionality assumptions in multidimensional models. The analysis of the assumed dimensionality has

∗Correspondence should be addressed to Dr Roy Levy, Arizona State University, PO Box 873701, Tempe, AZ 85287-3701, USA (e-mail: roy.levy@asu.edu).

DOI:10.1348/000711010X500483

long been recognized as an integral part of any application of unidimensional IRT models, and though evaluating dimensionality assumptions takes on no less importance when multidimensional models are employed, dimensionality and data–model fit procedures for multidimensional models are relatively underdeveloped (Swaminathan, Hambleton, & Rogers, 2007). To this end, we introduce a new discrepancy measure for assessing the
dimensionality assumptions applicable to multidimensional (as well as unidimensional)
models in the context of MIRT and present the application of this discrepancy measure
using a posterior predictive model checking (PPMC) framework. PPMC has been
successfully applied to unidimensional IRT models (e.g., Hoijtink, 2001; Sinharay, 2005,
2006; Sinharay, Johnson, & Stern, 2006), including the assessment of dimensionality for
unidimensional models (Levy, 2010; Levy, Mislevy, & Sinharay, 2009; Sinharay et al.,
2006). To date, no study has examined the use of PPMC for examining dimensionality
assumptions when fitting MIRT models. In the current work, we capitalize on the
flexibility of PPMC as a model checking framework and introduce a discrepancy measure
suitable for assessing a model’s assumed (multi)dimensional structure for a wide class of
latent variable models.

The current work considers the multidimensional normal ogive IRT model for dichotomous observables (i.e., scored item responses), which specifies the probability of an observed value of 1 (i.e., a correct response) as

$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, \mathbf{a}_j, c_j, d_j) = c_j + (1 - c_j)\,\Phi\bigl(\mathbf{a}_j'\boldsymbol{\theta}_i + d_j\bigr), \qquad (1)$$

where $X_{ij}$ is the observable response (coded as 0 or 1) of examinee $i$ to item $j$, $\boldsymbol{\theta}_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{iM})'$ is a vector of $M$ latent variables that characterize examinee $i$, $\mathbf{a}_j = (a_{j1}, a_{j2}, \ldots, a_{jM})'$ is a vector of $M$ coefficients for item $j$ that capture the discriminating power of the associated examinee variables, $c_j$ is a lower asymptote parameter for item $j$, $d_j$ is an intercept related to the marginal difficulty of the item, and $\Phi$ is the standard normal cumulative distribution function (e.g., Béguin & Glas, 2001; McDonald, 1997). The discrepancy measures and procedures presented in the current work apply to MIRT models using logistic distributions, including models with conjunctive relationships (Embretson, 1997; Reckase, 1997).
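To make the item response function concrete, the following sketch (our own illustration; the parameter values are hypothetical, not taken from the paper) evaluates the normal ogive model for a single examinee–item pair:

```python
import numpy as np
from math import erf

def normal_ogive_irf(theta, a, d, c=0.0):
    """Normal ogive MIRT item response function:
    P(X_ij = 1) = c_j + (1 - c_j) * Phi(a_j' theta_i + d_j)."""
    z = float(np.dot(a, theta) + d)
    phi = 0.5 * (1.0 + erf(z / np.sqrt(2.0)))  # standard normal CDF
    return c + (1.0 - c) * phi

# Hypothetical two-dimensional item with discriminations (1.0, 0.5)
a = np.array([1.0, 0.5])
p_low  = normal_ogive_irf(np.array([-1.0, -1.0]), a, d=0.0)
p_high = normal_ogive_irf(np.array([ 1.0,  1.0]), a, d=0.0)
```

With the lower asymptote and intercept at zero, an examinee at the origin of the latent space has response probability .5, and probabilities increase monotonically in the direction of the discrimination vector.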

In Section 2, the role of local independence and dimensionality in multidimensional models is discussed, which serves as the foundation for the subsequent develop- ment of the new discrepancy measure and its variants. Following an overview of the mechanics of PPMC, results from a simulation study in which the performance of the proposed discrepancy measure is evaluated and compared to existing pro- cedures are presented, evidencing the utility of formulating the discrepancy mea- sure at different levels of the analysis to support inference. A real-data example illustrates the procedures in which multiple graphical representations are employed in the analysis.

**2. Dimensionality and local independence in multidimensional models**

The relation between the assumptions of (a) proper specification of dimensionality and (b) local independence is not without ambiguity, both in terminology and in meaning.

Owing to the primacy of unidimensional models, local independence is commonly phrased as the statistical independence of items conditional on a single underlying latent variable (Ip, 2001; Stout, 1987). However, the principle of local independence

may hold for $M = 0, 1, 2, \ldots$ latent variables (Hattie, 1984). Accordingly, a more general
treatment defines local independence as the statistical independence of items conditional
on a possibly vector-valued latent variable (Hattie, Krakowski, Rogers, & Swaminathan,
1996; Stout, 1990). As typically formulated, this more general notion asserts (e.g., Stout, 1990) that, for all values of a possibly vector-valued latent variable $\boldsymbol{\theta}$,

$$P(\mathbf{X} = \mathbf{x} \mid \boldsymbol{\theta}) = \prod_{j=1}^{J} P(X_j = x_j \mid \boldsymbol{\theta}).$$

This definition views local independence with respect to a set of latent variables. Under this formulation, if the model specifies the correct number of variables (or more), local independence will hold. In this vein, Hattie et al. (1996, p. 1) argued that the principle
of local independence means that ‘Once these trait values are fixed at a given value (i.e.,
conditioned on), the responses to items become statistically independent. Thus, in order
to determine the dimensionality of a set of items, it is necessary and sufficient to identify
the minimal set of traits such that at all fixed levels of these traits the item responses are
independent’.

Accordingly, if the model underspecifies the number of variables, local independence will not hold. To illustrate, we extend Ip’s (2001) argument for the presence of local dependence when fitting a unidimensional model to multidimensional data to the case of multidimensional modelling. Once extended to multidimensional models, this argument is recast to advocate a more refined definition of local independence.

Suppose the true model for a set of $J$ items is three-dimensional with latent variables $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3)' \sim N(\mathbf{0}, \boldsymbol{\Sigma})$ with covariance matrix

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix},$$

where $\sigma_m^2$ is the variance of $\theta_m$ and $\rho_{mm'}$ is the correlation between $\theta_m$ and $\theta_{m'}$. Consider a linear model for continuous responses to the set of $J$ items:

$$Y_j = \mu + a_{j1}\theta_1 + a_{j2}\theta_2 + a_{j3}\theta_3 + e_j,$$

where $\mu$ is an overall mean parameter, $a_{j1}$, $a_{j2}$, and $a_{j3}$ are item-specific parameters capturing the dependence of the item on the latent variables, and $e_j$ is a random error due to measurement, where each value of $e_j$ is independent and identically distributed with mean zero and variance $\sigma_{e_j}^2$. Any two items $Y_j$ and $Y_{j'}$ are conditionally (locally) independent given $\theta_1$, $\theta_2$, and $\theta_3$.

However, if one conditions upon $\theta_1$ and $\theta_2$ only, $E(Y_j \mid \theta_1, \theta_2) = \mu + a_{j1}\theta_1 + a_{j2}\theta_2$ and

$$\mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2) = a_{j3}a_{j'3}\sigma_3^2\bigl(1 - \rho_{3|12}^2\bigr), \qquad (2)$$

where $\rho_{3|12}^2$ is the squared multiple correlation between $\theta_3$ and $(\theta_1, \theta_2)$. Non-zero values for $\mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2)$ indicate local dependence, and it is readily seen that local dependence will exist unless (a) at least one of the items does not depend on $\theta_3$ (i.e., $a_{j3}$ and/or $a_{j'3}$ equals zero) or (b) $\theta_3$ is perfectly correlated with some linear combination of $\theta_1$ and $\theta_2$. The argument can be extended in straightforward ways to situations with more true and conditioned dimensions, and to situations in which there are multiple relevant dimensions (i.e., a dimension $\theta_m$ where both $a_{jm}$ and $a_{j'm}$ are non-zero) that are not conditioned on.
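The algebra above can be corroborated by simulation. The following sketch (our own illustration with hypothetical loadings) generates two continuous items that both depend on the third latent variable, conditions on the first two variables by regressing them out, and compares the residual covariance to the theoretical value $a_{j3}a_{j'3}\sigma_3^2(1 - \rho_{3|12}^2)$:

```python
import numpy as np

rng = np.random.default_rng(7)
rho, n = 0.5, 200_000

# theta ~ N(0, Sigma): unit variances, common correlation rho
Sigma = np.full((3, 3), rho)
np.fill_diagonal(Sigma, 1.0)
theta = rng.multivariate_normal(np.zeros(3), Sigma, size=n)

# Two hypothetical items whose loadings on theta_3 are both non-zero
a_j  = np.array([1.0, 0.0, 0.8])
a_jp = np.array([0.0, 1.0, 0.6])
y_j  = theta @ a_j  + rng.normal(0, 0.5, n)
y_jp = theta @ a_jp + rng.normal(0, 0.5, n)

# Condition on theta_1 and theta_2 only: regress them out,
# then compute the covariance of the residuals
X = np.column_stack([np.ones(n), theta[:, :2]])
res_j  = y_j  - X @ np.linalg.lstsq(X, y_j,  rcond=None)[0]
res_jp = y_jp - X @ np.linalg.lstsq(X, y_jp, rcond=None)[0]
observed = np.cov(res_j, res_jp)[0, 1]

# Theory: a_j3 * a_j'3 * Var(theta_3 | theta_1, theta_2), where for a
# common correlation rho the conditional variance is 1 - 2*rho^2/(1 + rho)
predicted = a_j[2] * a_jp[2] * (1.0 - 2 * rho**2 / (1 + rho))
```

With these values the residual covariance is non-zero (about .32), so ignoring the third dimension induces local dependence exactly as the expression above indicates.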

The above perspective of formulating local independence with respect to a number of latent variables implicitly assumes that the model correctly specifies the pattern of dependence of the items on the dimensions (or allows all items to depend on all the dimensions, up to the limits of identification). However, the assertion that local independence follows from correctly specifying the number of latent variables will not necessarily hold if the pattern of dependence of the items on the dimensions is incorrect.

To illustrate, suppose a set of $J$ items follows a three-dimensional model, as above, and that the analyst (correctly) specifies that there are three dimensions. However, suppose the analyst incorrectly specifies that $a_{j3}$ and $a_{j'3}$ equal zero. In this case, the model-implied covariance conditional on $\theta_1$, $\theta_2$, and $\theta_3$ is equal to the model-implied covariance conditional on just $\theta_1$ and $\theta_2$; that is, $\mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2, \theta_3) = \mathrm{Cov}(Y_j, Y_{j'} \mid \theta_1, \theta_2)$, in which case the items will be locally dependent (see (2)). More generally, we hypothesize that an incorrect specification of the dependence of the items on the latent variables (e.g., failing to model an item as dependent on a relevant latent variable, improperly constraining item parameters to be equal) will result in local dependence, even when the correct number of latent variables is specified. Synthesizing the above discussion, we make the assumption of correct model specification explicit and state that items are locally independent with respect to a model if

$$P(\mathbf{X} = \mathbf{x} \mid \boldsymbol{\theta}, \boldsymbol{\omega}) = \prod_{j=1}^{J} P(X_j = x_j \mid \boldsymbol{\theta}, \boldsymbol{\omega}_j),$$

where $\boldsymbol{\omega}$ is the collection of item-specific parameters $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_J$; in the current case, for the model in (1), $\boldsymbol{\omega}_j = (\mathbf{a}_j, d_j, c_j)$. The inclusion of $\boldsymbol{\omega}$ indicates that the conditioning is with respect to the assumed structure, as well as the number of latent variables.

A weaker form of local independence requires pairwise item independence (Ip, 2001; Stout, Habing, Douglas, Kim, Roussos, & Zhang, 1996). For items $j$ and $j'$, weak local independence holds if

$$E_{\boldsymbol{\theta}}\bigl[\mathrm{Cov}(X_j, X_{j'} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\bigr] = 0, \qquad (3)$$

where $E_{\boldsymbol{\theta}}$ denotes that the expectation is taken with respect to the distribution of $\boldsymbol{\theta}$. In practice, weak local independence is most often investigated in terms of pairs of observables, as it is reasonable to assume that if variables are pairwise independent, higher-order dependencies, though possible, are highly implausible (McDonald, 1997; McDonald & Mok, 1995).

The current work adopts the perspective that local dependence can be thought
of as unmodelled dimensionality, regardless of whether the extraneous dimensionality
is of substantive interest or merely a nuisance (Ip, 2001). This supports a conditional
covariance theory perspective on dimensionality assessment, which has provided the
foundation for a number of dimensionality assessment procedures for unidimensional
*models (Levy et al., 2009; Stout et al., 1996; Zhang, 2007; Zhang & Stout, 1999a, 1999b).*

**3. The generalized dimensionality discrepancy measure**

To develop a discrepancy measure and associated testing procedure to assess the assumed dimensionality, consider the model-based covariance (MBC; Reckase, 1997) for item pairs,

$$\mathrm{MBC}_{jj'} = \frac{1}{N}\sum_{i=1}^{N}\bigl[X_{ij} - E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)\bigr]\bigl[X_{ij'} - E(X_{ij'} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_{j'})\bigr], \qquad (4)$$

where $N$ is the number of examinees and, in the current context, $E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$ is the conditional expectation of the value of the observable (i.e., the scored item response) for examinee $i$ to item $j$, which is given by the item response function of the model.

MBC is closely related to $Q_3$ (Yen, 1984): $Q_{3jj'} = r_{e_j e_{j'}}$, where $r$ refers to the correlation and $e_{ij} = X_{ij} - E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$. For any pair of items, values of MBC and $Q_3$ that are greater (less) than zero indicate the presence of positive (negative) local dependence, whereas values of zero indicate local independence. Note that both MBC and $Q_3$ condition on the latent variables through their use of $E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$, the (conditional) expectation of the value of the response for examinee $i$ to item $j$. Research on discrepancy measures for dimensionality and local independence assumptions in unidimensional models has shown that measures that examine conditional associations, including MBC and $Q_3$, outperform those that examine marginal associations (Levy, 2010; Levy et al., 2009).

Viewing MBC as an estimator of the conditional covariance in (3), we employ MBC for item pairs as a building block, constructing a generalized dimensionality discrepancy measure (GDDM) as the mean of the absolute values of MBC over unique item pairs:

$$\mathrm{GDDM} = \frac{\sum_{j<j'} \bigl|\mathrm{MBC}_{jj'}\bigr|}{J(J-1)/2}. \qquad (5)$$

It is clear that GDDM ≥ 0, with equality holding when weak local independence holds for all pairs of items. GDDM can be viewed as an estimator of

$$\frac{\sum_{j<j'} \bigl|E_{\boldsymbol{\theta}}\bigl[\mathrm{Cov}(X_j, X_{j'} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\bigr]\bigr|}{J(J-1)/2}, \qquad (6)$$

which captures the average absolute amount of local dependence amongst the item pairs. The absolute value is taken because, when estimating a model, the presence of positive local dependence implies the presence of negative local dependence (Habing & Roussos, 2003). By taking the absolute value prior to aggregation, all bivariate information in the data is employed; in this way negative local dependence contributes to the total local dependence assessed in the model. Procedures that focus on positive local dependence (e.g., DIMTEST) might suffer in situations in which unmodelled multidimensionality most prominently manifests itself in terms of negative local dependence (Levy et al., 2009).
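Computationally, MBC and GDDM reduce to a few matrix operations once the model-implied probabilities $E(X_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\omega}_j)$ are in hand. The following sketch is our own illustration (the function names are ours, not from the paper):

```python
import numpy as np

def mbc(X, P):
    """Model-based covariance for all item pairs.
    X: N x J matrix of 0/1 responses; P: N x J matrix of model-implied
    probabilities E(X_ij | theta_i, omega_j). Entry (j, j') of the result
    is the average cross-product of the residuals for items j and j'."""
    R = X - P                      # residuals X_ij - E(X_ij | theta_i, omega_j)
    return R.T @ R / X.shape[0]

def gddm(X, P):
    """Mean of |MBC| over the unique item pairs j < j'."""
    M = mbc(X, P)
    iu = np.triu_indices(M.shape[1], k=1)
    return np.mean(np.abs(M[iu]))
```

Within PPMC, `gddm` would be evaluated once per posterior draw, with `P` computed from that draw's person and item parameters; the same function serves at the subtest level by restricting the columns of `X` and `P` to the items of interest.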

The quantity in (6) may be thought of as a multidimensional extension of the maximum possible value of the DETECT index (Stout et al., 1996). Whereas the DETECT index conditions on the (single) dimension of best measurement and seeks to identify clusters of items assuming simple structure (Stout et al., 1996; Zhang, 2007; Zhang & Stout, 1999b), the proposed quantity conditions on the possibly multiple dimensions in the model and the possibly factorially complex pattern of dependence of items on the dimensions. The quantity in (6) may similarly be viewed as a multidimensional extension of quantities associated with Stout's notion of essential unidimensionality (see equation (14) of Stout, 1987). As such, GDDM is, in spirit, an extension of the DIMTEST statistic to support confirmatory dimensionality assessment for multidimensional as well as unidimensional models.

GDDM may be formulated at the subtest level in a straightforward manner by taking $J$ in (5) to be the number of items to be investigated in the subtest. At the level of a single item pair (i.e., a subtest with $J = 2$), MBC or $Q_3$ is preferred, as they reflect the direction (positive or negative) of the conditional association in addition to its magnitude. In Section 4, we describe the model checking framework that supports the use of GDDM at the test and subtest levels, and MBC and $Q_3$ at the item-pair level.

**4. Posterior predictive model checking**

The current section advances a Bayesian approach to modelling and model checking to facilitate inference regarding the model’s assumed dimensionality based on GDDM.

Given the usual conditional independence assumptions, the posterior distribution of person parameters $\boldsymbol{\theta}$ and item parameters $\boldsymbol{\omega}$ given data $\mathbf{X}$ is

$$P(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \mathbf{X}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\,P(\boldsymbol{\theta})\,P(\boldsymbol{\omega}). \qquad (7)$$

PPMC (Gelman, Meng, & Stern, 1996; Meng, 1994; Rubin, 1984) analyses characteristics of the observed data and/or the discrepancy between the data and the model by referring to the posterior predictive distribution

$$P(\mathbf{X}^{\mathrm{rep}} \mid \mathbf{X}) = \int\!\!\int P(\mathbf{X}^{\mathrm{rep}} \mid \boldsymbol{\theta}, \boldsymbol{\omega})\,P(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \mathbf{X})\,d\boldsymbol{\theta}\,d\boldsymbol{\omega}, \qquad (8)$$

where $P(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \mathbf{X})$ is the posterior distribution in (7), and $\mathbf{X}^{\mathrm{rep}}$ refers to replicated data of the same size (i.e., sample size and number of observables) as the original data.

PPMC is conducted by evaluating discrepancy measures that are constructed to capture relevant features of the data as well as the discrepancy between the data and the model (e.g., GDDM). Data–model fit is evaluated by examining the realized values of the discrepancy measures based on the observed data, denoted $D(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\omega})$, relative to the values obtained by employing the posterior predictive distribution, denoted $D(\mathbf{X}^{\mathrm{rep}}; \boldsymbol{\theta}, \boldsymbol{\omega})$. In addition to graphical procedures, a common way to summarize the results of a PPMC analysis is via the posterior predictive $p$ (PPP) value (Gelman et al., 1996; Meng, 1994), the tail area of the posterior predictive distribution of the discrepancy measure corresponding to the realized value(s):

$$\mathrm{PPP} = P\bigl[D(\mathbf{X}^{\mathrm{rep}}; \boldsymbol{\theta}, \boldsymbol{\omega}) \geq D(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\omega}) \mid \mathbf{X}\bigr]. \qquad (9)$$

The current work adopts a perspective that views the results of PPMC as diagnostic pieces of evidence for, rather than a hypothesis test of, data–model (mis)fit (Gelman, 2003, 2007; Gelman et al., 1996; Gill, 2007; Stern, 2000). Such an interpretation is particularly relevant in the context of assessing dimensionality and local dependence (Chen & Thissen, 1997) and is consistent with the approach to model criticism that views statistical diagnostics as a component in the larger enterprise of evaluating model adequacy (Sinharay, 2005). From this perspective, PPP values have direct interpretations as expressions of our expectation of the extremity in future replications, conditional on the model (Gelman, 2003, 2007), and serve to summarize the results of the PPMC.

PPP values near .5 indicate that realized discrepancies fall in the middle of the posterior predictive distribution of the discrepancy, evidencing adequate data–model fit in terms of the features captured by the discrepancy measure. Extreme PPP values indicate a lack of data–model fit in terms of the features captured by the discrepancy measure.

For GDDM, a PPP value close to zero will result when the realized values consistently exceed those in the posterior predictive distribution, indicating that the observed data exhibit more local dependence than would be expected based on the model.

It is noted that the propriety of interpreting PPP values from a hypothesis-testing perspective has been critiqued, as PPP values will not necessarily be uniformly distributed under the null hypothesis of exact data–model fit (Robins, van der Vaart, & Ventura, 2000). As a result, employing PPP values to conduct hypothesis testing could lead to conservative tests. Alternatives to PPP values include conditional predictive and partial predictive $p$ values (Bayarri & Berger, 2000a), which do have desirable properties (from a frequentist, hypothesis-testing perspective) under null conditions (Bayarri & Berger, 2000a; Robins et al., 2000). However, these approaches have several drawbacks (Stern, 2000), including that they do not support discrepancy measures that are a function of the data and model parameters (Bayarri & Berger, 2000a). Research has shown that when investigating dimensionality and local independence assumptions, discrepancy measures that condition on (some function of) the model parameters and examine conditional associations outperform discrepancy measures that are not functions of the model parameters and examine marginal associations (Levy, 2010; Levy et al., 2009; McDonald & Mok, 1995). As a function of the model-implied expectations, GDDM is explicitly constructed to embody the principles of a conditional covariance theory perspective on (multi)dimensionality assessment, and this precludes the usage of conditional and partial predictive $p$ values. In contrast, PPMC handles GDDM (and other) discrepancy measures naturally. Importantly, it will not necessarily be the case that PPP values will yield overly conservative tests; appropriately chosen discrepancy measures should yield PPP values that perform quite well under null conditions (Bayarri & Berger, 2000b). Levy et al. (2009) found that PPP values for MBC – from a hypothesis-testing perspective – yielded Type I error rates only slightly below the nominal values in the case of unidimensional IRT. Moreover, researchers wishing to adopt a hypothesis-testing perspective may employ methods to calibrate the PPP values to obtain $p$ values with desirable frequentist properties (Hjort, Dahl, & Steinbakk, 2006).

PPMC is a flexible framework for model criticism and holds many potential advantages over traditional techniques (Levy et al., 2009). PPMC makes no appeal to regularity conditions or asymptotic arguments to arrive at reference distributions; the ubiquity of ill-defined reference distributions has hampered traditional approaches to model checking in unidimensional IRT (e.g., Chen & Thissen, 1997). Such problems could potentially be exacerbated in the MIRT models studied here. As a consequence of the flexibility of PPMC, the analyst is free to choose from a broad class of functions, including those that pose difficulties for traditional model checking techniques (e.g., Levy et al., 2009). In addition, the results of PPMC may be compiled and communicated in a variety of ways, including numerically via PPP values or graphically (Gelman, 2003; Gelman et al., 1996; Levy et al., 2009; Sinharay et al., 2006). Furthermore, by using the posterior distribution rather than point estimates for the model parameters, PPMC incorporates the uncertainty in the estimation of the model parameters into the construction of the distributions of the discrepancy measures (Meng, 1994).

**5. Simulation study**

This section describes a Monte Carlo study examining the utility of GDDM in evaluating the dimensionality assumptions of MIRT models. As described below, data are generated from two- and three-dimensional MIRT models and fitted with a series of models to evaluate the behaviour of GDDM under various conditions.

**5.1. Models**

All data sets consisted of simulated responses from 1,000 simulated examinees on a test of 36 items following one of several multidimensional model structures of the form in (1) where, for simplicity, all $c_j$ parameters are equal to 0. Table 1 lays out the models' structures in terms of the values of the parameters used to generate the data. The second column gives the values of the location parameter for the items, used for all the models. The columns under the heading M0 give the values of $a_{j1}$ and $a_{j2}$, the discrimination parameters for $\theta_1$ and $\theta_2$, respectively, for M0. As a two-dimensional model with simple structure, M0 serves as the baseline model; the remaining models may be thought of as extensions of this model. M1, M2, and M3 are three-dimensional models in which some of the items are dependent on a third dimension, $\theta_3$. For each of these models, the structure of the dependence of the items on $\theta_1$ and $\theta_2$ remains the same as in M0. The columns under the headings M1, M2, and M3 indicate which items reflect $\theta_3$ in each of these models, and give the value of $a_{j3}$ for these items. M1 is a three-dimensional model in which some of the items that depend on $\theta_1$ additionally depend on $\theta_3$. M2 is a three-dimensional model in which $\theta_3$ influences items that depend on $\theta_2$ as well as items that depend on $\theta_1$. M3 has a similar structure to M2, but has fewer items that depend on $\theta_3$. The structures in M1–M3 correspond to situations in which $\theta_3$ may be a substantive dimension or a dimension akin to a method factor, say, for use in modelling testlet structures. Finally, M4 is a two-dimensional model; the columns under M4 give the values of $a_{j1}$ and $a_{j2}$. M4 differs from M0 in that several items reflect both $\theta_1$ and $\theta_2$, corresponding to a scenario where multiple skills or proficiencies are involved in solving some of the items. For simplicity, we assume all bivariate correlations between the latent variables are constant, denoted by $\rho$. For each model structure, we consider conditions in which $\rho = 0$. In addition, we consider conditions in which $\rho = .5$ for each of M0, M1, M2, and M4.

**5.2. Analyses**

For each model, the following Monte Carlo procedures were replicated 50 times. For each replication, 1,000 values of $\theta_1$, $\theta_2$, and $\theta_3$ were generated from a multivariate normal distribution with all variances equal to one and correlations dictated by the condition. These values and the item parameters (Table 1) were entered into the MIRT item response function in (1) to yield the model-implied probability for each of the 1,000 simulated examinees to each of the 36 items. Each of these probabilities was compared to a simulated random variable from a uniform distribution over the unit interval. If the probability exceeded the random uniform draw, the value of the item response was set to one; otherwise it was set to zero.
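This generation scheme can be sketched as follows (our own illustration; the item parameter values are placeholders patterned after M0's simple structure, since Table 1's entries are not reproduced here):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(42)
N, J, M = 1000, 36, 3
rho = 0.5                               # common latent correlation

# Latent variables: unit variances, common correlation rho
Sigma = np.full((M, M), rho)
np.fill_diagonal(Sigma, 1.0)
theta = rng.multivariate_normal(np.zeros(M), Sigma, size=N)

# Placeholder item parameters: items 1-18 load on theta_1,
# items 19-36 on theta_2 (an M0-like simple structure); all c_j = 0
A = np.zeros((J, M))
A[:18, 0] = 1.0
A[18:, 1] = 1.0
d = np.zeros(J)

# Normal ogive probabilities, then comparison to uniform draws
Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / 2**0.5)))
P = Phi(theta @ A.T + d)
X = (P > rng.uniform(size=(N, J))).astype(int)   # 1000 x 36 response matrix
```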

To explore the utility of using the proposed procedures, we conducted PPMC using GDDM at the test and subtest levels, and using MBC and $Q_3$ at the item-pair level, for a variety of combinations of data generation models and data analysis models. A cross in Table 2 indicates which models were fitted to data from each of the generation models.

**Table 1. The structure of the models in the simulation study**

For each of the combinations listed in Table 2, the analysis proceeded as follows. For each of the data sets from the generation model, the analysis model is fitted using Markov chain Monte Carlo procedures (Béguin & Glas, 2001; Bolt & Lall, 2003; Clinton et al., 2004; Yao & Boughton, 2007) via the pscl package in R (Jackman, 2008). This estimation routine specifies independent standard normal priors for the latent examinee variables

**Table 2. Analysis plan of the simulation study**

and requires user input of the prior distributions for the item parameters. The location parameters were assigned diffuse normal prior distributions. The function requires the input of prior distributions for all discrimination parameters. Those specified by the model were assigned diffuse normal prior distributions. Those not specified by the model (e.g., discrimination parameters for items 1–18 on $\theta_2$ in M0) were assigned normal distributions with a mean of zero and a variance of $1 \times 10^{-12}$, effectively constraining those parameters to be equal to zero.

For five replications in each analysis condition (i.e., each combination listed in
Table 2), trace plots and convergence diagnostics (Brooks & Gelman, 1998) were
analysed and it was determined that, in all cases, the chains converged by 5,000
iterations. It was concluded that 5,000 iterations would be sufficient to discard as
burn-in for the remaining replications in each condition. Thus, for each analysis, three
chains were run from overdispersed starting points for 2,000 iterations after the burn-in phase of 5,000 iterations. These iterations were thinned by 20 and the remaining
iterations were pooled to yield 300 draws from the posterior distribution for use in
conducting PPMC. These draws are employed with the observed data $\mathbf{X}$ to compute the realized values of the discrepancy measures, and then used to generate the $\mathbf{X}^{\mathrm{rep}}$,
which are then used in computing the posterior predicted values of the discrepancy
measures. The PPP value is estimated as the proportion of draws for which the
posterior predicted value of the discrepancy measure exceeds the corresponding realized
value. Programs to conduct the PPMC were written by the authors and are available
upon request.
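The estimation of the PPP value just described can be sketched as follows (our own illustration; `draws` stands in for the retained posterior draws, and the sketch assumes the normal ogive model with all $c_j = 0$; the helper names are ours):

```python
import numpy as np
from math import erf

Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / 2**0.5)))

def gddm(X, P):
    """Mean absolute model-based covariance over unique item pairs."""
    R = X - P
    M = R.T @ R / X.shape[0]
    iu = np.triu_indices(M.shape[1], k=1)
    return np.mean(np.abs(M[iu]))

def ppp_gddm(X_obs, draws, rng):
    """PPP value for GDDM: the proportion of posterior draws for which the
    posterior predicted discrepancy meets or exceeds the realized one.
    `draws` is an iterable of (theta, A, d) posterior samples."""
    exceed, n_draws = 0, 0
    for theta, A, d in draws:
        P = Phi(theta @ A.T + d)          # model-implied probabilities
        realized = gddm(X_obs, P)         # D(X; theta, omega)
        X_rep = (P > rng.uniform(size=P.shape)).astype(int)
        predicted = gddm(X_rep, P)        # D(X_rep; theta, omega)
        exceed += predicted >= realized
        n_draws += 1
    return exceed / n_draws
```

When the observed data carry more local dependence than the model reproduces, the realized values consistently exceed the predicted values and the estimated PPP value approaches zero.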

To facilitate a comparison between the performance of the proposed methods and existing procedures, the models were fitted in NOHARM (Fraser & McDonald, 1988) and programs were written to compute two statistics targeting the goodness of fit of the model. Specifically, we consider a statistic introduced by Gessaroli and De Champlain (1996; Finch & Habing, 2005, 2007),

$$\chi^2_{\mathrm{G/D}} = \sum_{j=2}^{J}\sum_{j'=1}^{j-1} (N-3)\,z_{jj'}^2,$$

where $j$ and $j'$ serve to index the items to define the unique pairings of items, and $z_{jj'}$ is the Fisher $z$ transformation of the residual correlation for the pairing of items $j$ and $j'$, given by

$$r_{jj'} = \frac{p^{(r)}_{jj'}}{\sqrt{p^{(0)}_{j}\bigl(1 - p^{(0)}_{j}\bigr)\,p^{(0)}_{j'}\bigl(1 - p^{(0)}_{j'}\bigr)}},$$

where $p^{(0)}_{j}$ is the observed proportion of examinees getting item $j$ correct and $p^{(r)}_{jj'}$ is the residual covariance between items $j$ and $j'$. To facilitate hypothesis testing, the value of $\chi^2_{\mathrm{G/D}}$ is referred to a central $\chi^2$ distribution with degrees of freedom given by $0.5J(J-1) - t$, where $t$ is the number of parameters estimated in fitting the model (Finch & Habing, 2007).

In addition, we consider a statistic introduced by Gessaroli, De Champlain, and Folske (1997; Finch & Habing, 2005, 2007),

$$\mathrm{ALR} = \sum_{j=2}^{J}\sum_{j'=1}^{j-1} G^2_{jj'},$$

where

$$G^2_{jj'} = 2N \sum_{k_j=0}^{1}\sum_{k_{j'}=0}^{1} p_{k_j k_{j'}} \ln\!\left(\frac{p_{k_j k_{j'}}}{\hat{p}_{k_j k_{j'}}}\right),$$

where $k_j$ and $k_{j'}$ are the scores (0 or 1) for items $j$ and $j'$, respectively, and $p_{k_j k_{j'}}$ and $\hat{p}_{k_j k_{j'}}$ are the observed and model-implied proportions of examinees with response patterns given by the combination of $k_j$ and $k_{j'}$, respectively (see Finch & Habing, 2005, for further details on the calculation using NOHARM output). To facilitate hypothesis testing, ALR is referred to the same $\chi^2$ distribution as $\chi^2_{\mathrm{G/D}}$ (Finch & Habing, 2007).
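For a single item pair, the likelihood-ratio ingredient of this statistic can be computed from the 2 × 2 table of response patterns. The following sketch reflects our reading of the formula (the helper name `pair_g2` is ours):

```python
import numpy as np

def pair_g2(x_j, x_jp, p_hat):
    """Likelihood-ratio chi-square for one item pair.
    x_j, x_jp: 0/1 response vectors; p_hat: 2 x 2 array of model-implied
    proportions for the patterns (0,0), (0,1), (1,0), (1,1)."""
    N = len(x_j)
    obs = np.zeros((2, 2))
    for a in (0, 1):
        for b in (0, 1):
            obs[a, b] = np.mean((x_j == a) & (x_jp == b))
    mask = obs > 0                  # empty observed cells contribute nothing
    return 2 * N * np.sum(obs[mask] * np.log(obs[mask] / p_hat[mask]))
```

Summing `pair_g2` over the unique item pairs yields ALR; each pairwise term is zero when the model-implied proportions reproduce the observed table exactly and grows as they diverge.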

**5.3. Results**

Following Gelman et al. (1996), we recommend the use of graphical representations of
the results of PPMC in an applied analysis. Figure 1 contains scatterplots of the realized
and posterior predicted values of GDDM from an analysis of one data set generated from
M1 with = 0. Figure 1a contains a scatterplot of the realized and posterior predicted
values of GDDM obtained by fitting M1, which is the correct model. Figure 1b contains
the scatterplot obtained when the data were fitted with M0, which underspecifies the
dimensionality. In each plot, the unit line is added as a reference. It is seen that in
Figure 1a the points appear to be randomly distributed around the line, indicating
that the realized and posterior predicted values are comparable, evidencing adequate


**Figure 1. Scatterplots of the realized and posterior predicted values of GDDM from the analysis**
of a data set from M1: (a) results from fitting M1; and (b) results from fitting M0.

data–model fit. In Figure 1b, the points fall below and to the right of the line, indicating that the realized values are frequently larger than the posterior predicted values, hence that the observed data exhibit more local dependence than the posterior predicted data, as captured by GDDM. The PPP values summarize these graphical representations as the proportions of points above and to the left of the unit line, which in this case are .46 and .01, respectively. For ease of exposition, the majority of the results of the simulation study are presented in terms of PPP values as summaries of each PPMC analysis.

Table 3 summarizes the results of the study for GDDM at the test level, $\chi^2_{\mathrm{G/D}}$, and ALR in each of the conditions. For each condition defined by the first two columns, the proportions of replications with $p$ values below .10 and .05 obtained from fitting the correct model are given first, followed by the proportions of replications with $p$ values below .10 and .05 obtained from fitting M0. For GDDM, the $p$ values are PPP values based on PPMC; for $\chi^2_{\mathrm{G/D}}$ and ALR, the $p$ values are obtained via the $\chi^2$ reference distributions as described above.

Table 4 presents the proportion of PPP values below .10 and .05 for GDDM evaluated on subtests when M0 was fitted to data from the various conditions. The first and second subtests were defined as the first half of the test (items 1–18) and second half of the test (items 19–36), respectively. From the perspective of fitting M0, investigating GDDM for these subtests corresponds to investigating the dimensionality that is assumed for each

**Table 3. Proportions of p values for GDDM, χ^{2}_{GD}, and ALR in the analysis conditions**^{a}

**Table 4. Proportions of PPP values beyond .10 and .05 for subtests when M0 was fitted to data**
from the generation models: except where noted, the results of the two subtests are pooled

modelled dimension. For all models except M1, the PPP-value results from the two subtests were pooled under the assumption that the two subtests are symmetric with respect to their dimensionality. For M1, the subtests are not symmetric; the first half of the test reflects θ_{1} and θ_{3} while the second half of the test reflects θ_{2} only. For M1, the proportions of PPP values below .10 and .05 are listed for the two subtests separately.

Table 5 presents the median and proportion of extreme PPP values of MBC for different types of item pairs from fitting M0 to data in each of the conditions where ρ = 0. Extremely high PPP values are considered as well as extremely low PPP values

**Table 5. Median PPP values and proportions of extreme PPP values for MBC by types of item**
pairs^{a} when M0 was fitted to data from the generation models with ρ = 0

because MBC is a directional measure of local dependence. For each model, item pairs are defined by types based on the dimension(s) they reflect. Item-pair types that have an exchangeable dimensional structure are pooled. For example, in M0 (1–1) refers to item pairs in which both items reflect θ_{1} and (2–2) refers to item pairs in which both items reflect θ_{2}. These item pairs have the same dimensional structure in the sense that they both reflect one of the correctly modelled dimensions. As such they are pooled together but kept separate from item pairs of type (1–2), in which one item reflects θ_{1} and the other item reflects θ_{2}. Results for the analysis of Q_{3} are not presented as they were quite close to those of MBC. Similarly, results for MBC and Q_{3} when the latent variables were correlated (ρ = .5) are not presented as they exhibited patterns consistent with those in Table 5.
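Because MBC is directional, both tails of the PPP distribution are informative. A minimal sketch of the two-sided flagging logic; the function name and the use of a single threshold α for both tails are illustrative assumptions:

```python
def classify_local_dependence(ppp, alpha=0.05):
    """Two-sided flagging for a directional discrepancy measure such as MBC:
    small PPP values indicate positive local dependence (realized values
    exceed posterior predictions), large PPP values indicate negative local
    dependence, and middling values indicate adequate data-model fit."""
    if ppp < alpha:
        return "positive LD"
    if ppp > 1 - alpha:
        return "negative LD"
    return "adequate"
```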

**5.4. Discussion**

Viewing a PPMC analysis as a diagnostic approach to evaluating data–model fit (Gelman, 2003; Gelman et al., 1996; Levy et al., 2009; Sinharay, 2005), the results in Table 3 indicate

that when M0 is fitted to data from more dimensionally complex models, PPMC using GDDM is likely to yield patterns indicative of data–model misfit akin to that in Figure 1b.

When the correct model is fitted to data, PPMC is unlikely to yield such patterns. Instead, PPMC will typically yield patterns indicative of adequate data–model fit as in Figure 1a.

The results for M1, M2, and M3 indicate that GDDM is able to detect data–model misfit of the two-dimensional model when fitted to data that follow a three-dimensional model, both when the influence of the unmodelled dimension is concentrated on items that depend on one of the other dimensions (M1) and when its influence is more widely distributed across the total set of items (M2 and M3).

The results for fitting M0 to data that follow M4 indicate that GDDM is able to detect data–model misfit of the simpler two-dimensional model when fitted to data that follow a two-dimensional model with more complex structure. Similarly, the results from fitting M3 to data that follow M2 indicate that GDDM is able to detect data–model misfit when the analysis model correctly specifies three dimensions but the pattern of the dependence of the items on those dimensions is incorrectly specified. These results support the argument advanced above that local independence should be viewed not merely with respect to the number of dimensions, but with respect to the model as constituted by dimensions and the patterns of dependence of the observables on those dimensions.

A hypothesis-testing perspective views the proportions in columns 3 through 8 in Table 3 as being akin to empirical Type I error rates and the proportions in the remaining columns as akin to power rates for detecting data–model misfit due to the improperly specified dimensionality. Although the current work adopts a diagnostic perspective on PPMC, it is interesting to note that, from a hypothesis-testing perspective, GDDM has considerable power to detect data–model misfit in dimensionally misspecified models while maintaining Type I error rates. With only a few exceptions (under M4 with ρ = 0, M2 with ρ = .5, and M1 with ρ = .5 and α = .05), the Type I error rates were slightly below the nominal values, which is consistent with theoretical work on the distributions of PPP values under null conditions (Robins et al., 2000) and previous work on the use of PPP values in IRT (e.g., Sinharay et al., 2006). In the context of dimensionality and local dependence assessment for unidimensional models, Levy et al. (2009) found that MBC, Q_{3}, and related indices exhibited empirical Type I error rates at or slightly below nominal values. The present study finds that GDDM, which extends MBC, yields similar rates in the multidimensional models studied here.

In contrast, the empirical Type I error rates for χ^{2}_{GD} and ALR were generally well below nominal values. GDDM also considerably outperformed χ^{2}_{GD} and ALR in terms of detecting data–model misfit when the data followed M1 and M3. GDDM performed as well as or better than χ^{2}_{GD} and ALR when the data followed M2. In M1–M3, ALR outperformed χ^{2}_{GD}, and performed nearly as well as GDDM when the data followed M2.

When the data followed M4, χ^{2}_{GD} and ALR performed similarly and outperformed GDDM.

Within any model structure, all of the discrepancy measures performed as well as or better in detecting unmodelled dimensionality when the latent variables were uncorrelated compared to when they were correlated, as is consistent with theoretical and previous empirical work in unidimensional modelling (e.g., Levy et al., 2009; Nandakumar & Stout, 1993; Nandakumar, Yu, Li, & Stout, 1998; Stout, 1987; Zhang & Stout, 1999a).

Overall, GDDM performed favourably as compared to χ^{2}_{GD} and ALR.

The results from fitting M2, M3, and M0 to the data from the M2 condition illustrate the way GDDM captures increasingly poor data–model fit. In these conditions, M2 is the correct model, M3 departs from M2 by failing to model the dependencies of some of


**Figure 2. Distributions of mean GDDM values fitting M0 (dashed line), M2 (solid), and M3 (dotted)**
to data from M2.

the items on θ_{3}, and M0 further departs by failing to model the dependencies of any of the items on θ_{3}. As such, M3 and M0 are increasingly misspecified models relative to the correct model structure of M2. Table 3 indicates that the use of PPP values from GDDM allows for the detection of data–model misfit associated with fitting M3 and M0 in every replication. To further illustrate how GDDM assesses the magnitude of the data–model misfit, Figure 2 plots the distributions of the means of the realized GDDM values from fitting each model to the data from the M2 condition with ρ = 0. The solid line represents the distribution of mean GDDM values obtained from fitting M2; the dotted and dashed lines represent the distributions of mean GDDM values obtained from fitting M3 and M0, respectively. As is evident, for all data sets the values of the mean GDDM from both M3 and M0 exceeded that from M2. Furthermore, the mean values from M0 (dashed) were in general larger than those from M3 (dotted). This finding supports the interpretation that larger values of GDDM are indicative of worse data–model fit.

The results in Table 4 indicate that GDDM at the subtest level exhibits patterns similar to those of GDDM at the test level. When the data are generated from M0 and fitted with M0, extreme PPP values occur rarely (i.e., in hypothesis-testing terms: slightly above the nominal level of .10 when ρ = 0, below the nominal level of .10 when ρ = .5, and at the nominal level of .05 in both conditions). When M0 is fitted to data from other models, the proportions are all larger than when fitted to data from M0, indicating GDDM at the subtest level is sensitive to the unmodelled dimensionality. The results for fitting M0 to data from M1 exhibit a pattern consistent with expectation where the model detects the presence of the extraneous dimensionality in the first subtest, but not in the second subtest. When the unmodelled dimension influences items on both subtests (M2), GDDM indicates data–model misfit on each subtest in all replications. Reducing the number of items that reflect the unmodelled dimension in M3 reduces the capacity for GDDM to detect data–model misfit at the subtest level. This further supports the previous findings of GDDM as being sensitive to the magnitude of the data–model misfit. Consistent with the findings at the test level, the performance of GDDM at the subtest level suffered when the latent variables were correlated.

The results for the investigation of MBC at the item-pair level when fitting M0 (Table 5) reveal a number of patterns. In the M0 condition the analysis model is correctly specified.

Accordingly, the median PPP values for the two types of item pairs are near .5, and the proportion of extreme PPP values in the tails beyond .10 and .05 is just under those respective values. These results parallel analogous results from the investigation of MBC in unidimensional models (Levy et al., 2009).

Under M1, the influence of θ_{3} is localized to the first half of the test, and does not influence any of the items that reflect θ_{2}; associations involving these items should be well modelled. Accordingly, the pairings of items where one or both of the items reflect θ_{2} (labelled 1–2, 2–13, and 2–2) yield medians of PPP values at .5 and proportions of extreme PPP values just below the nominal levels, as in the M0 condition. A conditional covariance theory perspective on estimation and dimensionality (Levy et al., 2009; Stout et al., 1996; Zhang & Stout, 1999a) implies that the dimension estimated as θ_{1} when M0 is fitted to data that follow M1 is a complex combination of the true θ_{1} and θ_{3}. As a result, pairings of items with the same dimensional structure (either 1–1 or 13–13) should exhibit positive local dependence, whereas item pairs with different dimensional structures with respect to the estimated θ_{1} (1–13) should exhibit negative local dependence. This is exactly what was observed in the M1 condition, where the 1–1 pairings and the 13–13 pairings yielded increasingly high proportions of small PPP values, indicating positive local dependence, and the 1–13 pairings yielded relatively high proportions of large PPP values, indicating negative local dependence. These patterns of positive and negative local dependence mirror those found using MBC and related indices in the analysis of unidimensional IRT in light of unmodelled multidimensionality (Levy, 2010; Levy et al., 2009).

Under M2, the influence of θ_{3} is distributed over both halves of the test. A conditional covariance theory perspective implies that the dimension estimated as θ_{1} when M0 is fitted to data is a complex combination of the true θ_{1} and θ_{3}, and the dimension estimated as θ_{2} when M0 is fitted to data is a complex combination of the true θ_{2} and θ_{3}. Akin to the results in M1, pairs in which both items reflect θ_{3} (13–13, 23–23, 13–23) yielded low PPP values, indicating positive local dependence, as did pairs where both items reflected just θ_{1} or θ_{2} (1–1, 2–2). Similarly, item pairs in which one item reflects θ_{3} and the other item reflects either θ_{1} or θ_{2} (1–13, 2–23, 1–23, 2–13) yielded high PPP values, indicating negative local dependence. Finally, the PPP values for the 1–2 pairs were somewhat smaller than .5. Under M3, pairings in which one or both items did not reflect θ_{3} yielded more moderate PPP values than their counterparts under M2. In contrast, the PPP values for item pairs in which both items reflect θ_{3} (13–13, 23–23, 13–23) were smaller than their counterparts under M2.

Under M4, the PPP values for the 12–12 item pairs are small, indicating positive local dependence. The PPP values for the remaining types of item pairs are closer to .5, indicating their associations are well modelled. Under M4, the unmodelled multidimensionality predominantly manifests itself in terms of item pairs in which both items reflect θ_{1} and θ_{2}. We speculate that the results for the remaining item pairs are consistent with a conditional covariance theory perspective. However, it is unclear whether the results for the remaining types of item pairs indicate true patterns or are merely reflective of random variation around .5. Theoretical research on the extension of conditional covariance theory to multidimensional models and empirical research investigating patterns of local dependence in multidimensional contexts are necessary to further investigate this possibility.

Synthesizing across the results, it is clear that the higher level of aggregation (subtests above item pairs, test above subtests) yields better performance in terms of higher rates of indicating the presence of unmodelled multidimensionality. This is because GDDM aggregates the item-pair level local dependencies, akin to how measures of differential bundle functioning aggregate item-level differential item functioning, capitalizing on the amplification of the effects in the aggregation (Nandakumar, 1993).

**6. Analysis of National Assessment of Educational Progress data**

The procedures introduced above are illustrated in an analysis of item response data from the 1996 National Assessment of Educational Progress (NAEP). The NAEP science assessment framework specifies three content areas: physical science, earth science, and life science. Each item is classified in terms of one of these areas; subscale scores are reported on each of these dimensions (Allen, Carlson, & Zelenak, 1999). Item responses to 16 items in block S20 from students from the national sample were analysed. The block contained four items in life science and six items in each of physical science and earth science. Eight of the items were multiple-choice items scored dichotomously and eight were constructed response items scored via integers. For the purposes of this analysis, the responses to the constructed response items were dichotomized in a manner such that the collapsing of categories results in the most balanced dichotomous response frequencies for each item. Missing responses prior to the last observed response were regarded as intentional omissions and were scored as incorrect. The analysis was performed on 1,020 examinees with complete data.
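The "most balanced" dichotomization described above can be sketched as follows. The paper states only the balance criterion, so the search over cut points and the function name are assumptions for illustration:

```python
import numpy as np

def most_balanced_cut(scores):
    """Collapse an integer-scored item to 0/1 by choosing the cut point whose
    pass rate is closest to .5 (the most balanced dichotomous frequencies).
    Responses at or above the chosen cut are scored 1. Illustrative sketch:
    the scoring direction and tie-breaking are not specified in the paper."""
    scores = np.asarray(scores)
    cuts = np.unique(scores)[1:]  # cutting at the minimum is degenerate (all 1s)
    rates = [(scores >= c).mean() for c in cuts]
    best = cuts[int(np.argmin([abs(r - 0.5) for r in rates]))]
    return (scores >= best).astype(int), int(best)

# Example: scores 0-3 on a constructed response item.
x = np.array([0, 0, 1, 1, 2, 2, 2, 3])
y, cut = most_balanced_cut(x)  # cut at 2 gives a perfectly balanced 4/4 split
```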

**6.1. MIRT model structure**

A three-dimensional MIRT model is analysed, where the latent variables correspond to proficiency in the areas of physical science, earth science, and life science. Each item is modelled as reflecting one of the latent variables in accordance with the NAEP classifications of items in terms of the content areas. Figure 3 contains a path diagram representation of the hypothesized model.

**6.2. Bayesian analysis and Markov chain Monte Carlo estimation**

For the eight multiple-choice items, the probability of a correct response from examinee *i* to item *j* was given via the MIRT model in (1). For the eight constructed response items, the probability of a correct response from examinee *i* to item *j* was given via the MIRT model in (1) with *c*_{j} = 0. The model was identified by specifying the discrimination parameter for the first item on each latent variable to be unity; that is, the discrimination

[Path diagram: physical science (θ_{1}), earth science (θ_{2}), and life science (θ_{3}), with directed arrows from each latent variable to its assigned items among *X*_{1}–*X*_{16}.]
**Figure 3. The MIRT model for the NAEP analysis.**

parameters for items 3, 2, and 1 for θ_{1}, θ_{2}, and θ_{3}, respectively, were fixed at one. The remaining unknown discrimination parameters, lower asymptote parameters, and all of the location parameters were assigned prior distributions:

*a*_{jm} ∼ N(0, 10)I(0, ∞), *d*_{j} ∼ N(0, 10), *c*_{j} ∼ Beta(4, 16),

where I(0, ∞) in the specification of the prior on each *a*_{jm} constrains the distribution to have support over the positive real line, modelling the hypothesis that the probability of a correct response monotonically increases with increases in any θ_{m}.
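Equation (1) itself is not reproduced in this excerpt. The following is a sketch of a common compensatory MIRT item response function consistent with the parameters named here (discriminations *a*_{jm}, location *d*_{j}, lower asymptote *c*_{j}); the logistic form and the function name are assumptions, not the paper's own notation:

```python
import math

def mirt_prob(theta, a, d, c=0.0):
    """Compensatory MIRT probability of a correct response:
    P = c + (1 - c) / (1 + exp(-(a . theta + d))).
    theta: examinee's latent variables; a: item discriminations (one per
    dimension); d: item location; c: lower asymptote (0 for the constructed
    response items in the text). A plausible reconstruction of (1), which
    is not shown in this excerpt."""
    z = sum(am * tm for am, tm in zip(a, theta)) + d
    return c + (1.0 - c) / (1.0 + math.exp(-z))
```

At theta = 0 and d = 0 the kernel is .5, so the response probability is c + (1 - c)/2, illustrating how the lower asymptote lifts the curve.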

For each examinee, the prior distribution for the latent variables was multivariate normal with a mean vector set at **0** to identify the model,

θ_{i} ∼ N(**0**, Σ).

A diffuse inverse-Wishart prior distribution was specified for the covariance matrix, with scale matrix **I**, the identity matrix of rank *M* = 3.

The model was estimated in WinBUGS 1.4 (Spiegelhalter, Thomas, Best, & Lunn,
2007).^{1} Three chains from dispersed start values were run for 6,000 iterations using
program-chosen sampling algorithms, including a Metropolis algorithm in which the
variance of the normal proposal distribution is adapted for the first 4,000 iterations.

As measured by visual inspection of trace plots and Brooks–Gelman–Rubin diagnostics (Brooks & Gelman, 1998), the 4,000 iterations needed to adapt the proposal distribution in the Metropolis sampler were sufficient for the chains to converge. The remaining iterations were thinned by a factor of 20 and pooled to yield 300 draws used to conduct PPMC.
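The bookkeeping behind the 300 pooled draws can be made explicit (the function name is illustrative):

```python
def ppmc_draw_count(chains=3, iterations=6000, burn_in=4000, thin=20):
    """Number of pooled posterior draws available for PPMC after discarding
    the adaptation/burn-in iterations and thinning each chain."""
    per_chain = (iterations - burn_in) // thin
    return chains * per_chain

# (6000 - 4000) / 20 = 100 draws per chain; pooling 3 chains gives 300.
```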

**6.3. PPMC analysis**

A PPMC analysis was conducted on the NAEP data, evaluating GDDM at the test level,
GDDM at the subtest level where three subtests are defined in terms of the three content
areas (Figure 3), and MBC and Q_{3} at the item-pair level.

Figure 4 is a scatterplot of the 300 realized and posterior predicted values of GDDM evaluated on the entire test, where it is clearly seen that the realized values tend to be larger than their posterior predicted counterparts; the PPP value from this analysis was .08. This indicates that the solution to the model, in terms of the posterior distribution, suffers in terms of adequately accounting for the associations in the data.

Figure 5 shows scatterplots for the 300 realized and posterior predicted values of GDDM evaluated in each of three subtests. The PPP values for the physical science, earth science, and life science subtests were .38, .25, and .48, respectively. These results

^{1} WinBUGS parameterizes distributions slightly differently from conventions adopted here. Specifically, it employs the precision (i.e., the inverse of the variance) in specifying normal distributions and a parameterization of the beta distribution such that the priors are specified in WinBUGS as *a*_{jm} ∼ N(0, 0.10)I(0, ∞), *d*_{j} ∼ N(0, 0.10), and *c*_{j} ∼ Beta(4, 16).


**Figure 4. Scatterplot of the realized and posterior predicted values of GDDM from the analysis of**
the NAEP data.


**Figure 5. Scatterplot of the realized and posterior predicted values of the subtest GDDM from**
the analysis of the NAEP data: results from (a) the physical science subtest; (b) the earth science
subtest; (c) the life science subtest.


**Figure 6. Graphical representation of PPP values of MBC for item pairs from the analysis of NAEP**
data.

suggest that, within content areas, the associations among the items are well accounted for by the dimensional structure in the model.

At the item-pair level, the results for MBC and Q_{3} were nearly identical; results for
MBC will be presented and discussed. Figure 6 contains a matrix graphical plot of the PPP
values for MBC at the item-pair level, where the numbers along the diagonal indicate the
item in that row and column of the matrix. The shading of the square in each element in
the matrix conveys the value of the PPP value as indicated by the key (e.g., a white square
indicates that the PPP value lies between 0 and .02). Focusing on the more extreme PPP
values, the results suggest that the strongest residual associations include: (a) the positive
residual associations between items 10 and 12, 10 and 16, 1 and 12, 13 and 16, and to
a lesser extent between 6 and 10, 7 and 14, and 8 and 13, and (b) the negative residual
associations between items 7 and 10, 10 and 11, 5 and 6, and to a lesser extent between
5 and 8, and 3 and 16. With the exception of these last two pairings, these item pairs
involve the pairings where the items come from different content areas.

Synthesizing the above results, the PPMC analysis using GDDM at the test level suggests that the three-dimensional MIRT model with dimensions defined by the content classifications of the items does not adequately account for the dependencies among the items. However, within subtests the associations are well accounted for, as evidenced by the results for GDDM at the subtest level. Rather, the additional dependencies appear to be in terms of several sets of item pairs drawn from different subtests, in particular several

item pairs involving item 10. At this point, this diagnostic information could be leveraged to investigate substantive reasons for the weaknesses of the model with subject matter experts and the assessment design team. Of the many possible explanations, Wei and Mislevy (2007), who interpreted results from exploratory factor analyses of data from these items, suggested that a factor structure based on a distinction between conceptual understanding and practical reasoning may be more suitable than a factor structure based on content domains.

**7. Conclusion**

This paper has described a new discrepancy measure supported by a PPMC framework for assessing the dimensionality assumption in MIRT models. Grounded in conditional covariance theory, the discrepancy measure assesses the aggregate magnitude of estimated pairwise conditional covariances in the data, relying on connections between the assumptions of dimensionality and local independence. It was argued that local independence should be viewed with respect to a model in terms of both the number of hypothesized dimensions and the specified patterns of dependence of items on the dimensions.

GDDM is designed to assess the specified dimensional structure for a collection of measured variables. Ideally, the choice to evaluate GDDM on a subset of the items should be based on whether meaningful collections of the items can be determined *a priori*, as is warranted if the test is constructed to measure subdomains (as in the NAEP example), or if subscale results along those domains will be employed, or if the administered items may be viewed as testlets. In the absence of *a priori* defined subtests, we recommend the following procedure for critiquing the assumed dimensionality of a set of items. If an analysis of GDDM at the test level indicates the assumed dimensional structure of the model is untenable, the researcher may then follow up with an analysis of subtests if the grouping of items into subtests can be theoretically justified or with an analysis at the finer-grained level of item pairs. Simultaneously considering the results over the item pairs can be suggestive of the specific weaknesses of the model and the types of structures that might better account for the relationships in the data.

GDDM is parametric in the sense that it employs the model-based expectations of the observed values. However, it is not restricted to the MIRT models discussed here. GDDM is constructed to be sufficiently general to be applicable for assessing dimensionality assumptions across a broad class of latent variable modelling paradigms that make a variety of distributional assumptions regarding the latent and observable variables, including factor analytic and latent class models in addition to item response models. Similarly, this work has demonstrated the use of PPMC for model criticism for MIRT models. As a flexible framework for evaluating model diagnostics, PPMC may be used to support assessments of data–model fit across a variety of psychometric modelling paradigms.
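The computation just described, aggregating model-based covariance residuals over item pairs, can be sketched as follows. This is an illustrative reconstruction from the conditional-covariance description in the text, not the paper's exact formula; the function name and the simple upper-triangle average are assumptions:

```python
import numpy as np

def gddm(x, p):
    """Sketch of a GDDM-style statistic: the mean, over item pairs, of the
    absolute average cross-product of model-based residuals x - E[x | theta].
    x: N x J matrix of 0/1 responses; p: N x J matrix of model-implied
    response probabilities E[x | theta] at one posterior draw. Larger values
    indicate more residual (local) dependence left unexplained by the model;
    consult the paper's equations for the exact definition."""
    r = np.asarray(x, float) - np.asarray(p, float)  # residuals per person/item
    n, j = r.shape
    cov = r.T @ r / n                 # J x J average residual cross-products
    iu = np.triu_indices(j, k=1)      # distinct item pairs only
    return float(np.abs(cov[iu]).mean())
```

In a PPMC analysis this statistic would be evaluated once on the observed data (realized value) and once on a posterior predicted data set at each retained draw, yielding the scatterplots and PPP values discussed above.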

**Acknowledgements**

We wish to acknowledge the two anonymous reviewers whose comments prompted improvements to this paper.

**References**

Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). *The NAEP 1996 technical report*. Washington, DC: National Center for Education Statistics.

Bayarri, M. J., & Berger, J. O. (2000a). P values for composite null models. *Journal of the American Statistical Association, 95*, 1127–1142. doi:10.2307/2669749

Bayarri, M. J., & Berger, J. O. (2000b). Rejoinder. *Journal of the American Statistical Association, 95*, 1168–1170. doi:10.2307/2669756

Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. *Psychometrika, 66*, 541–562. doi:10.1007/BF02296195

Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. *Applied Psychological Measurement, 27*, 395–414. doi:10.1177/0146621603258350

Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. *Journal of Computational and Graphical Statistics, 7*, 434–455. doi:10.2307/1390675

Chen, W., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. *Journal of Educational and Behavioral Statistics, 22*, 265–289.

Clinton, J., Jackman, S., & Rivers, D. (2004). The statistical analysis of roll call data. *American Political Science Review, 98*, 355–370. doi:10.1017/S0003055404001194

Embretson, S. E. (1997). Multicomponent response models. In W. J. van der Linden & R. K. Hambleton (Eds.), *Handbook of modern item response theory* (pp. 305–321). New York: Springer.

Finch, H., & Habing, B. (2005). Comparison of NOHARM and DETECT in item cluster recovery: Counting dimensions and allocating items. *Journal of Educational Measurement, 42*(2), 149–169. doi:10.1111/j.1745-3984.2005.00008

Finch, H., & Habing, B. (2007). Performance of DIMTEST- and NOHARM-based statistics for testing unidimensionality. *Applied Psychological Measurement, 31*, 292–307. doi:10.1177/0146621606294490

Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares item factor analysis. *Multivariate Behavioral Research, 23*, 267–269. doi:10.1207/s15327906mbr2302_9

Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. *International Statistical Review, 71*, 369–382.

Gelman, A. (2007). Comment: Bayesian checking of the second levels of hierarchical models. *Statistical Science, 22*, 349–352. doi:10.1214/07-STS235A

Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. *Statistica Sinica, 6*, 733–807.

Gessaroli, M. E., & De Champlain, A. F. (1996). Using an approximate chi-square statistic to test the number of dimensions underlying the responses to a set of items. *Journal of Educational Measurement, 33*, 157–192. doi:10.1111/j.1745-3984.1996.tb00487.x

Gessaroli, M. E., De Champlain, A. F., & Folske, J. C. (1997, March). *Assessing dimensionality using a likelihood-ratio chi-square test based on a non-linear factor analysis of item response data*. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago.

Gill, J. (2007). *Bayesian methods: A social and behavioral sciences approach* (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.

Habing, B., & Roussos, L. A. (2003). On the need for negative local item dependence. *Psychometrika, 68*, 435–451. doi:10.1007/BF02294736

Hattie, J. (1984). An empirical study of various indices for determining unidimensionality. *Multivariate Behavioral Research, 19*, 49–78. doi:10.1207/s15327906mbr1901_3

Hattie, J., Krakowski, K., Rogers, H. J., & Swaminathan, H. (1996). An assessment of Stout's index of essential unidimensionality. *Applied Psychological Measurement, 20*, 1–14. doi:10.1177/014662169602000101

Hjort, N. L., Dahl, F. A., & Steinbakk, G. H. (2006). Post-processing posterior predictive p values. *Journal of the American Statistical Association, 101*, 1157–1174. doi:10.1198/016214505000001393

Hoijtink, H. (2001). Conditional independence and differential item functioning in the two-parameter logistic model. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), *Essays on item response theory* (pp. 109–129). New York: Springer.

Ip, E. H. (2001). Testing for local dependency in dichotomous and polytomous item response models. *Psychometrika, 66*, 109–132. doi:10.1007/BF02295736

Jackman, S. (2008). *pscl: Classes and methods for R developed in the Political Science Computational Laboratory, Stanford University* (R package version 1.03). Department of Political Science, Stanford University, Stanford, CA. Retrieved from http://pscl.stanford.edu/

Levy, R. (2010). Posterior predictive model checking for conjunctive multidimensionality in item response theory. *Journal of Educational and Behavioral Statistics*. Advance online publication.

Levy, R., Mislevy, R. J., & Sinharay, S. (2009). Posterior predictive model checking for multidimensionality in item response theory. *Applied Psychological Measurement, 33*, 519–537. doi:10.1177/0146621608329504

Lord, F. M. (1980). *Applications of item response theory to practical testing problems*. Hillsdale, NJ: Erlbaum.

McDonald, R. P. (1997). Normal-ogive multidimensional model. In W. J. van der Linden & R. K. Hambleton (Eds.), *Handbook of modern item response theory* (pp. 257–269). New York: Springer.

McDonald, R. P., & Mok, M. M. C. (1995). Goodness of fit in item response models. *Multivariate Behavioral Research, 30*, 23–40. doi:10.1207/s15327906mbr3001_2

Meng, X. L. (1994). Posterior predictive p-values. *Annals of Statistics, 22*, 1142–1160. doi:10.1214/aos/1176325622

Nandakumar, R. (1993). Simultaneous DIF amplification and cancellation: Shealy–Stout's test for DIF. *Journal of Educational Measurement, 30*, 293–311. doi:10.1111/j.1745-3984.1993.tb00428.x

Nandakumar, R., & Stout, W. F. (1993). Refinement of Stout's procedure for assessing latent trait dimensionality. *Journal of Educational Statistics, 18*, 41–68. doi:10.2307/1165182

Nandakumar, R., Yu, F., Li, H., & Stout, W. (1998). Assessing unidimensionality of polytomous data. *Applied Psychological Measurement, 22*, 99–115. doi:10.1177/01466216980222001

Reckase, M. D. (1997). A linear logistic multidimensional model. In W. J. van der Linden & R. K. Hambleton (Eds.), *Handbook of modern item response theory* (pp. 271–286). New York: Springer.

Robins, J. M., van der Vaart, A., & Ventura, V. (2000). The asymptotic distribution of P values in composite null models. *Journal of the American Statistical Association, 95*, 1143–1172. doi:10.2307/2669750

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. *Annals of Statistics, 12*, 1151–1172. doi:10.1214/aos/1176346785

Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. *Journal of Educational Measurement, 42*, 375–394. doi:10.1111/j.1745-3984.2005.00021.x

Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. *British Journal of Mathematical and Statistical Psychology, 59*, 429–449. doi:10.1348/000711005X66888

Sinharay, S., Johnson, M., & Stern, H. S. (2006). Posterior predictive assessment of item response theory models. *Applied Psychological Measurement, 30*, 298–321. doi:10.1177/0146621605285517

Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2007). *WinBUGS user manual: Version 1.4.3*. Cambridge: MRC Biostatistics Unit. Retrieved from http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml