
Posterior Predictive Assessment of Item Response Theory Models


Sandip Sinharay, Educational Testing Service
Matthew S. Johnson, Baruch College

Hal S. Stern, University of California, Irvine

Model checking in item response theory (IRT) is an underdeveloped area. There is no universally accepted tool for checking IRT models. The posterior predictive model-checking method is a popular Bayesian model-checking tool because it has intuitive appeal, is simple to apply, has a strong theoretical basis, and can provide graphical or numerical evidence about model misfit. An important issue with the application of the posterior predictive model-checking method is the choice of a discrepancy measure (which plays a role like that of a test statistic in traditional hypothesis tests). This article examines the performance of a number of discrepancy measures for assessing different aspects of fit of the common IRT models and makes specific recommendations about what measures are most useful in assessing model fit. Graphical summaries of model-checking results are demonstrated to provide useful insights about model fit.

Index terms: Bayesian methods, discrepancy measures, model checking, odds ratio, p values

Introduction

Model checking remains a major hurdle to overcome for effective implementation of item response theory (Hambleton & Han, 2004). The possible number of response patterns (2^I for a test with I binary items) is large for even moderately long assessments, leading to sparse contingency tables—so the standard chi-square tests do not apply directly. Therefore, investigators assess the fit of item response theory (IRT) models by examining whether the model can explain various summaries (or collapsed versions) of the original data. van der Linden and Hambleton (1997) commented,

The approach to the assessment of fit with two- and three-parameter models often involves the collection of a wide variety of evidence about model fit, including statistical tests, and then making an informed judgment about model fit and usefulness of a model with a particular set of data. (p. 16)

Even though a significant amount of research has concentrated on assessing different aspects of model fit in IRT, there is no universally accepted model fit measure (recent work by Maydeu-Olivares & Joe, 2005, notwithstanding). Standard IRT software packages lack reliable model-fit indices, and worse, a number of IRT software packages produce model-fit indices that have well-known shortcomings (Hambleton & Han, 2004).

Applied Psychological Measurement, Vol. 30, No. 4, July 2006, 298–321
DOI: 10.1177/0146621605285517
© 2006 Sage Publications


The situation is hardly better for IRT modeling under the Bayesian framework. There has been a recent surge in the use of Bayesian estimation in IRT. Patz and Junker (1999), Bradlow, Wainer, and Wang (1999), Beguin and Glas (2001), and Karabatsos and Sheu (2004) are only some of the recent examples applying Bayesian estimation and Markov chain Monte Carlo (MCMC) algorithms (e.g., Gelman, Carlin, Stern, & Rubin, 2003) to fit complicated psychometric models. However, not much attention has been given to assessing the fit of IRT models from a Bayesian perspective.

The posterior predictive model-checking (PPMC) method (Rubin, 1984) is a popular Bayesian model diagnostic tool because it has intuitive appeal, is simple to apply, has a strong theoretical basis, and can provide graphical or numerical evidence about model misfit. The method compares the observed data with replicated data (data that are generated or predicted by the model) using a number of diagnostic measures that are sensitive to model misfit. Any systematic differences between aspects of the observed data set and those of the replicated data sets indicate a failure of the model to explain those aspects of the data. Graphical display is the most natural and easily comprehensible way to perform posterior predictive checks (PPC); in situations where graphical displays do not suffice or are cumbersome (e.g., for too many checks simultaneously), one can also use a tail-area probability, also known as a posterior predictive p-value (PPP-value). Examples of application of the PPMC method in educational measurement include Hoijtink (2001), Beguin and Glas (2001), and Karabatsos and Sheu (2004). However, most of these applications are specialized, and few of these focus in general on unidimensional IRT models. Before applying any technique to complicated models in a field, it is important to study its performance with simpler models (especially in IRT, where even the basic models are not so simple as far as model checking is concerned). However, attempts to thoroughly study model checking have been rare in the context of Bayesian IRT.

Motivated by the aforementioned, this article uses the PPMC method in a number of simulations and in a real data example to assess several aspects of fit of unidimensional IRT models for dichotomous items. Use of simple and natural discrepancy measures reveals useful information about model misfit when model assumptions are violated. This article emphasizes graphical displays of PPCs. For some aspects of the model, PPCs have to be carried out for a large number of measures (e.g., for all pairs of items). This article develops graphical approaches for summarizing the output of such high-dimensional model checks in a precise and easily comprehensible manner.

The PPMC method has previously been applied to examine person fit, item fit (Sinharay, in press), and differential item functioning (Hoijtink, 2001) for the simple IRT models, so these aspects are not covered in this article.

It is important to emphasize that model checking is related to but different from model selection.

The goal in model checking is to determine if the model is adequate or not. A model that is found to fit the data (or, more precisely, is not found to not fit the data) is not necessarily the best model or the correct model. In fact, there may be several models that might fit the data in the sense of being found acceptable by model checking. Although it is important to be aware of the possibility that other models may fit the data (see, e.g., Bickel, Buyske, Chang, & Ying, 2001), it remains important to assess whether one's chosen model is appropriate for analyzing the data at hand.

The organization for the remainder of this article is as follows. The first section introduces the PPMC method and discusses interpretation of model checks. The second section discusses the discrepancy measures examined in this study. The third section describes an outline of the simulation studies. The fourth section describes the results of the simulations, and the next section uses those results to examine the fit of the Rasch model to data from the U.S. Department of Agriculture's (USDA's) Measurement of Food Security program. The final section summarizes the work and provides the reader with recommendations about how to use posterior predictive checks in IRT.


Posterior Predictive Model-Checking (PPMC) Techniques

Description of the Method

Let p(y | ω) denote the likelihood for a statistical model applied to data y (examinee responses in this context), and let p(ω) be the prior distribution on the parameters, where ω denotes all the parameters in the model. Then the posterior distribution of ω is p(ω | y) ∝ p(y | ω) p(ω). The PPMC method (Rubin, 1984) suggests checking a model by examining whether the observed data y appear extreme with respect to the posterior predictive distribution of replicated data y^rep,

$$p(y^{rep} \mid y) = \int p(y^{rep} \mid \omega)\, p(\omega \mid y)\, d\omega. \qquad (1)$$

In practice, test quantities or discrepancy measures D(y, ω) are defined (Gelman, Meng, & Stern, 1996), and the posterior distribution of D(y, ω) is compared to the posterior predictive distribution (PPD) of D(y^rep, ω), with substantial differences between them indicating model misfit. A researcher may use D(y, ω) = D(y), a discrepancy measure depending on the data only (which can also be called a test statistic); in that case, the PPC consists of comparing D(y) to the PPD of D(y^rep). The PPMC method operates in the spirit of a diagnostic plot rather than as a hypothesis test (Stern, 2000), and thus the comparison of observed and replicated discrepancy measures is mostly performed using graphical plots. In cases where graphical checks are not conclusive or when one is interested in several checks simultaneously, it can be useful to examine a quantitative measure of lack of fit, a tail-area probability also known as the PPP-value:

$$P\left(D(y^{rep}, \omega) \ge D(y, \omega) \mid y\right) = \iint_{D(y^{rep}, \omega) \ge D(y, \omega)} p(y^{rep} \mid \omega)\, p(\omega \mid y)\, dy^{rep}\, d\omega, \qquad (2)$$

which can be a useful supplement to a graphical plot.

Because of the difficulty in dealing with equation (1) or (2) analytically for all but simple problems, Rubin (1984) suggested simulating replicated (or posterior predictive) data sets from the PPD in applications of the PPMC method. One draws N simulations ω_1, ω_2, ..., ω_N from the posterior distribution p(ω | y) of ω (most likely using an MCMC algorithm) and then draws y^{rep,n} from the distribution p(y | ω_n) for n = 1, 2, ..., N. The process results in N draws from the joint posterior distribution p(y^rep, ω | y). One then computes the predictive discrepancies D(y^{rep,n}, ω_n) and realized discrepancies D(y, ω_n), n = 1, 2, ..., N. It is possible then to create a graphical plot of D(y^{rep,n}, ω_n) versus D(y, ω_n), n = 1, 2, ..., N; points lying consistently above or below the 45-degree line indicate model misfit. The proportion of the N replications for which D(y^{rep,n}, ω_n) exceeds D(y, ω_n) provides an estimate of the PPP-value. Extreme PPP-values (close to 0, 1, or both, depending on the nature of the discrepancy measure) indicate model misfit. Figure 1 graphically describes the PPMC method.
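As a concrete illustration of this simulation scheme, the following sketch (in Python; not part of the article) estimates a PPP-value from posterior draws. The names ppp_value, simulate_data, and discrepancy are this illustration's own; the posterior draws are assumed to come from whatever MCMC sampler was used to fit the model.

```python
import numpy as np

def ppp_value(y, posterior_draws, simulate_data, discrepancy):
    """Estimate a posterior predictive p-value from N posterior draws.

    y               : observed data (e.g., an examinees-by-items 0/1 matrix)
    posterior_draws : iterable of parameter draws omega_1, ..., omega_N from p(omega | y)
    simulate_data   : function omega -> replicated data set drawn from p(y | omega)
    discrepancy     : function (data, omega) -> scalar discrepancy D(data, omega)
    """
    realized, predicted = [], []
    for omega in posterior_draws:
        y_rep = simulate_data(omega)                   # y_rep,n ~ p(y | omega_n)
        realized.append(discrepancy(y, omega))         # realized discrepancy D(y, omega_n)
        predicted.append(discrepancy(y_rep, omega))    # predictive discrepancy D(y_rep,n, omega_n)
    realized = np.asarray(realized)
    predicted = np.asarray(predicted)
    # proportion of replications with D(y_rep, omega) >= D(y, omega)
    return float(np.mean(predicted >= realized)), realized, predicted
```

Plotting the predicted values against the realized values and adding a 45-degree line reproduces the graphical check described above.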

Interpreting Posterior Predictive Model Diagnostics

Posterior predictive (PP) model checks are offered in the spirit of model diagnostics such as residual plots in regression. Plots suggest when the observed data do not look like replicated data under the model. Investigators can decide if the difference they see seems important enough to warrant fitting a new model (which is often suggested by the plots; Gelman et al., 2003, p. 176) or taking other remedial action.

Graphical displays are ambiguous on occasion; at other times, individuals prefer to make a judgment based on a quantitative measure. The PPP-values can be useful in this regard—they provide a quantitative measure of the degree to which observed data such as the ones at hand would be expected under the model. Extreme probabilities (less than .01 or greater than .99) are very suggestive that the model does not capture the corresponding feature of the data. Less extreme values (say, less than .05) are also suggestive of model failure, although, as described in the next paragraph, such a threshold should not be viewed as a significance level in the usual sense.

Gelman et al. (1996) commented that they "use the term assessment instead of testing to highlight the fundamental difference between assessing the discrepancies between a model and data and testing the correctness of a model" (p. 734). Thus, the PPP-values do not refer to an explicit hypothesis test. The PPP-values share some features of traditional hypothesis-testing p values; in particular, both are defined as tail-area probabilities, but there are important differences as well. The PP approach does not carry out a hypothesis test, and the PPP-values are not necessarily uniformly distributed when the fitted model is in fact correct; there is some evidence that PPP-values under the correct model tend to be closer to 0.5 more often than would be expected under a uniform distribution (Bayarri & Berger, 2000; Sinharay & Stern, 2003). One purpose of the simulation studies is to investigate the properties of the PP model diagnostics (including PPP-values) when they are applied to IRT models.

Application to Item Response Theory Models

The approach as it applies in the context of IRT models is described next. Let y_ij denote the binary score of the ith individual for the jth item in an educational assessment. Suppose that the IRT model of interest is the three-parameter logistic (3PL) model,

$$p_{ij} = P(y_{ij} = 1) = c_j + (1 - c_j)\left[1 + \exp\{-a_j(\theta_i - b_j)\}\right]^{-1},$$

with symbols having their usual meanings. In this context, if the examinee proficiencies θ_i are treated as nuisance parameters, then ω is the collection of all item parameters. The posterior distribution p(ω | y) is given by

$$p(\omega \mid y) \propto \text{likelihood} \times \text{prior} = \prod_i \left[ \int_{\theta_i} \left\{ \prod_j p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}} \right\} p(\theta_i)\, d\theta_i \right] p(\omega), \qquad (3)$$

where p(θ_i) is the population distribution on the θ_i. To assess the fit of the 3PL model to a data set, one has to repeat the following steps a large number of times:

Figure 1

Graph Describing the Posterior Predictive Model-Checking (PPMC) Method


1. Generate a draw of item parameters from the posterior distribution given by equation (3) (using, as in Patz & Junker, 1999, an MCMC algorithm to fit the 3PL model). Draw proficiency parameters θ_i from p(θ_i).

2. Draw a data set from the 3PL model, using the item parameters and proficiency parameters drawn in the above step.

3. Compute the values of the predictive and realized discrepancy measures from the above draws of parameters and data set.

One then creates plots and/or computes PPP-values for the discrepancy measures.
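A minimal Python sketch of one pass through steps 1 to 3 is given below; it assumes that one posterior draw of the item parameters (a, b, c) is already available from the MCMC fit and that θ_i ~ N(0, 1). The function names are this illustration's own.

```python
import numpy as np

def one_ppmc_iteration(a, b, c, y_obs, discrepancy, rng=None):
    """One pass through steps 1-3 for the 3PL model.

    a, b, c     : arrays holding one posterior draw of the item slopes,
                  difficulties, and guessing parameters
    y_obs       : observed examinees-by-items 0/1 response matrix
    discrepancy : function (data, item_params) -> scalar discrepancy measure
    """
    if rng is None:
        rng = np.random.default_rng()
    n_examinees = y_obs.shape[0]
    theta = rng.normal(0.0, 1.0, size=n_examinees)                 # step 1: theta_i ~ p(theta)
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta[:, None] - b)))  # 3PL response probabilities
    y_rep = rng.binomial(1, p)                                     # step 2: replicated data set
    # step 3: realized and predictive discrepancies for this parameter draw
    return discrepancy(y_obs, (a, b, c)), discrepancy(y_rep, (a, b, c))
```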

Discrepancy Measures

The choice of discrepancy measures is crucial in the application of the PPMC method. Although the flexibility of the method allows one to use any function of data and parameters as a discrepancy measure, all measures may not be useful. Ideally, discrepancy measures should be chosen to reflect aspects of the model that are relevant to the scientific purposes to which the inference will be applied and to measure features of the data not directly addressed by the probability model (Gelman et al., 2003, p. 172). It is easy to demonstrate that the observed proportions correct for the items are not helpful diagnostics because these quantities will be fit well even by an inappropriate IRT model as long as the model includes difficulty parameters that relate closely to the proportions correct. It was also found that the chi-square-type measure

$$\sum_i \frac{[y_{ij} - E(y_{ij})]^2}{\mathrm{Var}(y_{ij})},$$

the sum of squares of the outfit measures (e.g., van der Linden & Hambleton, 1997, p. 113), although intuitive and useful for other types of statistical models, is not useful for IRT model checking as it fails to detect problems with inadequate models. This last result suggests caution; before applying the PPMC method with a particular discrepancy measure, one should examine how useful that measure is for detecting failures of IRT models.

For unidimensional IRT models, a number of discrepancy measures may be of interest, depending on the context of the problem. These models assume unidimensionality, local independence (LI), a specific shape of the response function, and, often, normality of the ability distribution, and each of these assumptions should be checked using suitable discrepancy measures. There does not seem to exist a single, omnibus discrepancy measure that can detect violations of all of these assumptions together. In the sections that follow, this article shows how different discrepancy measures can be used to detect various model violations.

Discrepancy Measures Examined

Observed Score Distribution

Let NC_j denote the number of examinees getting exactly j items correct, j = 0, 1, 2, ..., J. Furthermore, let NC = (NC_0, NC_1, ..., NC_J)'. Ferrando and Lorenzo-Seva (2001) and Hambleton and Han (2004) suggested comparing the observed and predicted score distributions to measure overall model fit. To summarize the model fit to the observed score distribution in a single number, Beguin and Glas (2001) suggested the discrepancy measure

$$\chi^2_{NC} = \sum_j \frac{[NC_j - E(NC_j)]^2}{E(NC_j)},$$

where E(NC_j) is the expectation of NC_j. Although the statistic χ²_NC does not follow a chi-square distribution, the PPMC method automatically provides the relevant reference distribution.
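A short Python sketch of this measure is given below. How E(NC_j) is obtained is left open here (it could, for example, be approximated by averaging score counts over data sets simulated from the current parameter draw); the function names are this illustration's own.

```python
import numpy as np

def score_counts(y, n_items):
    """NC_j: number of examinees answering exactly j items correctly, j = 0, ..., J."""
    return np.bincount(y.sum(axis=1), minlength=n_items + 1)

def chi2_nc(y_obs, expected_counts):
    """Chi-square-type discrepancy comparing observed and expected score counts."""
    nc = score_counts(y_obs, len(expected_counts) - 1)
    return float(np.sum((nc - expected_counts) ** 2 / expected_counts))
```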


Biserial Correlation Coefficient

The correlation of examinee scores with the binary outcomes on a particular item, the biserial correlation coefficient, may be appropriate for detecting the inadequacy of IRT models that do not include a slope parameter, as shown in a small example in Albert and Ghosh (2000), where the Rasch model was found inadequate.
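The article does not spell out a computational formula for this coefficient; the sketch below simply uses the Pearson correlation between each item's 0/1 responses and the examinees' raw scores (a point-biserial correlation), which is one reasonable way to operationalize the measure. The function name is this illustration's own.

```python
import numpy as np

def item_total_correlations(y):
    """Pearson (point-biserial) correlation of each item with the examinee raw score."""
    raw_scores = y.sum(axis=1)
    return np.array([np.corrcoef(y[:, j], raw_scores)[0, 1] for j in range(y.shape[1])])
```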

Measures of Association Among the Item Pairs

Unidimensional IRT models imply specific types of associations between the responses of an individual to pairs of items. So, discrepancy measures that examine the association between item pairs can potentially detect lack of fit of such models. This article examines two such statistics.

Odds ratio (OR). Consider an item pair. Let n_kk' denote the number of individuals scoring k on the first item and k' on the second item, k, k' = 0, 1. This study uses the sample odds ratio (e.g., Agresti, 2002, p. 45),

$$OR = \frac{n_{11}\, n_{00}}{n_{10}\, n_{01}},$$

referred to as the odds ratio (OR) hereafter, as a test statistic with the PPMC method. Chen and Thissen (1997) used the OR to detect violations of the LI assumption in unidimensional IRT models; they argued that if LI is satisfied, a unidimensional model can fit the observed OR. However, LI may not hold for a particular application for reasons such as speededness or the test not being unidimensional in a psychological sense (e.g., Chen & Thissen, 1997); in that case, the observed OR will be larger (smaller) than what is expected under a unidimensional IRT model for within-cluster items¹ (between-cluster items²). Chen and Thissen found the standardized log-OR not to have a N(0, 1) null distribution and hence did not find it a useful diagnostic. The PPMC method overcomes this problem because it does not require an explicit distributional assumption for the discrepancy measure.
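A minimal Python sketch of the sample OR for one item pair follows (the function name is this illustration's own; no continuity correction is applied, so a zero cell count would need separate handling):

```python
import numpy as np

def odds_ratio(y, j, k):
    """Sample odds ratio n11*n00 / (n10*n01) for the 0/1 responses to items j and k."""
    a, b = y[:, j], y[:, k]
    n11 = np.sum((a == 1) & (b == 1))
    n00 = np.sum((a == 0) & (b == 0))
    n10 = np.sum((a == 1) & (b == 0))
    n01 = np.sum((a == 0) & (b == 1))
    return (n11 * n00) / (n10 * n01)
```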

Mantel-Haenszel (MH) statistic. Let us define the sample odds ratio for an item pair conditional on the rest score (i.e., the raw score on the test obtained by excluding the two items) r as

$$OR_r = \frac{n_{11r}\, n_{00r}}{n_{10r}\, n_{01r}},$$

where n_kk'r is the number of individuals with rest score r obtaining a score k on the first item and k' on the second, k, k' = 0, 1. Then the OR_r can be combined into an MH statistic (e.g., Agresti, 2002, p. 234), as

$$MH = \frac{\sum_r n_{11r}\, n_{00r} / n_r}{\sum_r n_{10r}\, n_{01r} / n_r},$$

where n_r is the number of examinees obtaining rest score r. If LI holds, using arguments as in Stout et al. (1996), the conditional covariance between the scores on the two items is close to zero, and the MH statistic should be near 1; if LI is violated, the conditional covariance is positive (which means the MH statistic is more than 1) for within-cluster items and negative (MH statistic less than 1) for between-cluster items. Hence, as with the OR, in situations where LI is violated, the observed MH statistics are likely to be higher (lower) than expected for within-cluster (between-cluster) item pairs.
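A direct Python translation of the MH statistic for a single item pair follows; as with the OR sketch above, no protection against a zero denominator is added, and the function name is this illustration's own.

```python
import numpy as np

def mantel_haenszel(y, j, k):
    """Mantel-Haenszel statistic for items j and k, conditioning on the rest score."""
    a, b = y[:, j], y[:, k]
    rest = y.sum(axis=1) - a - b            # raw score excluding the two items
    numerator = denominator = 0.0
    for r in np.unique(rest):
        at_r = rest == r
        n_r = at_r.sum()
        n11 = np.sum((a[at_r] == 1) & (b[at_r] == 1))
        n00 = np.sum((a[at_r] == 0) & (b[at_r] == 0))
        n10 = np.sum((a[at_r] == 1) & (b[at_r] == 0))
        n01 = np.sum((a[at_r] == 0) & (b[at_r] == 1))
        numerator += n11 * n00 / n_r
        denominator += n10 * n01 / n_r
    return numerator / denominator
```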

Outline of the Simulation Studies

To explore the performance of the PPMC method in detecting lack of fit for IRT models, we apply the method to a variety of combinations of the data-generating model and data analysis model. The combinations in Table 1 with an asterisk (*) in the corresponding cells are the cases examined. More details about the two-dimensional model and the "3PL, nonnormal p(θ)" model can be found in the sections where these are discussed. The simulation approach in this study is as follows.

Consider a data-generating model M_g and a data analysis model M_a. One hundred data sets are generated from M_g, each data set containing responses of 2,500 examinees to 30 items. The generating parameter values (for items and examinees) are the same for all the 100 data sets generated.


The proficiency parameters are generated from a N(0, 1) distribution. The item parameters used in M_g are similar to those encountered in real test data. For example, when M_g is the 3PL model, this work uses real item parameter estimates, shown in Table 2, from the National Assessment of Educational Progress (NAEP). For generating data from the Rasch or two-parameter logistic (2PL) model, relevant parameters from the same table are used. For each of the 100 data sets generated, the model M_a is fit using an MCMC algorithm (five chains of length 6,000 are run, which ensures convergence of the MCMC algorithm; the first 2,000 draws in each chain are discarded; and the remaining output is thinned by retaining every 20th draw to get a combined posterior sample of 1,000). The prior distributions used for fitting the models are

$$\log(a_j) \overset{iid}{\sim} N(0, 10), \quad b_j \overset{iid}{\sim} N(0, 10), \quad \mathrm{logit}(c_j) \overset{iid}{\sim} N(\mathrm{logit}(0.2) = -1.4,\ 1). \qquad (4)$$

The population distribution of proficiencies is taken to be θ_i ~ N(0, 1). The results of the PPCs in the simulation study are robust to reasonable changes to the prior distributions (as long as prior variances are not close to zero).

Table 1

The Cases Examined in the Simulation Studies

Data-Generating Model
Analysis Model    1PL    2PL    3PL    Two-Dimensional    3PL, Nonnormal p(θ)

1PL * * *

2PL * * *

3PL * * *

Note. 1PL= one-parameter logistic; 2PL = two-parameter logistic; 3PL = three-parameter logistic.

Table 2

The Generating Parameter Values for the Simulation Studies

Item ID 1 2 3 4 5 6 7 8 9 10

aj 0.70 1.89 1.32 1.36 1.17 0.56 1.10 1.68 1.01 0.88

bj 1.81 –0.52 0.26 –1.48 –0.52 0.44 –2.15 0.96 –0.87 1.70

cj 0.08 0.06 0.14 0.17 0.17 0.14 0.30 0.25 0.10 0.22

Item ID 11 12 13 14 15 16 17 18 19 20

aj 1.19 0.60 1.49 2.01 2.40 2.00 1.48 1.10 1.52 1.45

bj –2.41 –0.56 –0.27 0.04 –0.79 –0.38 0.14 –1.53 0.11 –0.21

cj 0.23 0.21 0.12 0.20 0.18 0.17 0.13 0.14 0.20 0.12

Item ID 21 22 23 24 25 26 27 28 29 30

aj 0.80 0.67 0.83 1.17 1.43 2.4 1.53 1.2 1.7 2.05

bj –0.52 –0.43 –1.25 –0.52 –0.27 –0.44 1.75 –0.80 0.40 –0.93

cj 0.11 0.12 0.18 0.13 0.15 0.40 0.25 0.24 0.16 0.26


The posterior samples are used to perform the PPCs. As a numerical summary, the PPP-value for each discrepancy measure is computed on each data set, resulting in 100 PPP-values for each discrepancy measure. Note that the PPP-values are computed here to allow efficient summarization of the results of the many model-checking simulations; in a real application, a graphical plot is recommended first and PPP-values, if required, later.

Results From the Simulation Studies

Analyses when the fitted model is the same as the generating model are useful for learning about the frequency with which misfit is wrongly detected by an approach. In this study’s case, such simulations suggest that PPCs are, if anything, conservative in that they tend not to show misfit of a correct model too often. For example, only 3% of the ORs had PPP-values less than .05 when the 3PL model was used to both generate and analyze the data.

The remainder of the discussion is concentrated on cases where the generating model and analysis model are different. Findings that indicate how different discrepancy measures help detect different types of lack of fit are emphasized. It is true that the correct model is known, that Bayarri and Berger (2000) recommended that it is more appropriate to use model selection tools (Bayes factor, Bayesian information criterion [BIC], etc.) than model checks if an investigator has a clearly formulated alternative to the hypothesized model H_0, and that Gelman et al. (1996) recommended the PPMC method for assessing the fit of a single model to the available data in the absence of an alternative model. However, this study's approach best reflects what happens in operational settings, where one often has a single preferred model at his or her disposal. Most of the operational tests using IRT employ a predetermined model, and there is no alternative model (e.g., the NAEP employs the 3PL model for dichotomous item calibration); if the model does not fit the data, analysts often employ remedial actions such as discarding items or collapsing item responses until the model fits the data. The advantage of generating from known models in these simulations, other than convenience, is that it allows one to know the type of misfit to expect so that it is possible to cross-check how the PPMC method has fared.

In a typical real application, an investigator will analyze one data set; so for most of the simulation scenarios, results for a single simulated data set appear first. However, it is also a goal to learn how the PPMC method performs in repeated analyses (Rubin, 1984); so, results for 100 data sets under each simulation scenario are also given. To compare results to those when the model fits, the plots under the true model (i.e., when the analysis model is the same as the data-generating model) are also shown in most of the figures.

Data-Generating Model: 2PL/3PL; Data Analysis Model: 1PL

The one-parameter logistic (1PL)/Rasch model is the simplest of the IRT models and makes restrictive assumptions (equality of item discriminations and no guessing); discrepancy measures that can exploit the limitations, if any, of these assumptions are most likely to identify misfit.

Because the biserial correlation coefficient is closely related to the item slope parameter, data generated from the 2PL/3PL model are likely to have more extreme biserial correlations (both high and low) than can be explained by the Rasch model.

The top left panel of Figure 2 shows, for one data set generated from the 3PL model, the observed biserial correlations, corresponding 90% PP intervals, and PP medians. Only a few of the PP intervals contain the observed biserial coefficients, indicating the success of the PPMC method in detecting the lack of fit of the Rasch model to the biserial correlations. The replicated medians are all very similar, around 0.57. This is not surprising as the Rasch model assumes that all items have the same slope parameter. The bottom left panel shows that these patterns are not seen when the correct model is used to analyze the simulated data. It is straightforward to construct a single summary measure involving the biserial correlations that demonstrates the lack of fit of the Rasch model. The top right panel of Figure 2 shows, as in Albert and Ghosh (2000), the observed and 1,000 replicated values of the standard deviation (SD) of the biserial correlations. Here, the large difference between the observed value (0.13) and the replicated values (the replicated median is only about 0.025), in contrast to the absence of any such difference in the bottom right panel of the figure, points to the inadequacy of the Rasch model for this data set. The same phenomenon occurs in the other 99 data sets as well (results not shown).

The different slope parameters in the 2PL/3PL model induce a degree of dependence among the items (Chen & Thissen, 1997, p. 269); the Rasch model, with the same slope for all items, fails to explain the dependence.


Figure 2

Fit of the 1PL Model and the 3PL Model to Biserial Correlations

Note. The top row shows plots for biserial correlations for a single data set generated from the three-parameter logistic (3PL) model and analyzed by the one-parameter logistic (1PL) model. The left-hand plot shows the observed biserial for each item as a solid ellipse, corresponding 90% posterior predictive (PP) intervals as a dotted line, and the PP median as a hollow ellipse; the right-hand plot shows a histogram as a summary of the PPD of the standard deviation (SD) of the biserial correlations. A dotted vertical line shows the observed SD. For comparison purposes, the bottom row shows similar plots when the same data are analyzed by the correct (3PL) model.


To illustrate, a case is considered in which the generating model is the 2PL model. Figure 3 plots the observed ORs involving Items 1 and 2, corresponding 90% PP intervals, and PP medians from the analysis of a single data set generated from the 2PL model. For example, for the item pair (1, 2), the observed OR is 1.8, whereas a corresponding 90% PP interval is (1.75, 2.46).

The figure reveals a number of interesting facts. First, the replicated medians of the ORs are near 2 for each item pair; this phenomenon has not been reported in the literature yet. Intuitively, the Rasch model assumes the same discrimination parameter for all the items and hence has all item response functions parallel to each other, causing the predicted ORs to be close to one another.

Haberman, Holland, and Sinharay (in press), motivated by these findings, recently provided some theoretical results regarding this phenomenon. Next, by comparing the observed ORs to the replicated ORs, it is obvious that the PPMC method succeeds in detecting the inadequacy of the Rasch model to capture the associations among the items in data sets that exhibit more complex structure.


Figure 3

Fit of the 1PL Model and the 2PL Model to ORs

Note. The top row shows plots for observed and replicated odds ratios (ORs) for a data set generated from the two-parameter logistic (2PL) model and analyzed by the one-parameter logistic (1PL) model. For each OR, a solid ellipse shows the observed OR, and a dotted line shows the corresponding 90% posterior predictive (PP) interval for the OR; the hollow ellipses show the PP medians; the left-hand (right-hand) plot shows results for the item pairs involving Item 1 (Item 2). The bottom row shows similar plots when the same data are analyzed by the correct (2PL) model.


The observed ORs for pairs involving Item 1 (an item with low discrimination) appear quite small compared to the reference distribution obtained under the Rasch model. The opposite is true for the ORs for pairs involving Item 2 (one with high discrimination). The plots in the bottom row, in sharp contrast, do not show any such patterns.

Figure 4 summarizes, from all 100 simulated data sets, PPP-values for the OR for each pair of items when data are generated from the 2PL model but analyzed with the Rasch model. The items are sorted in increasing order of their true discrimination parameters.

The plot is a variation of one used in Friendly (2002); it simultaneously displays two aspects of the simulation results. The circles below the main diagonal (the ones to the right of an imaginary 45-degree line through the middle of the plot) show the median of the 100 PPP-values for each item pair; the proportion of a circle that is shaded is equal to the corresponding median PPP-value.

Thus, item pair (1, 2) has a median PPP-value near 1, whereas item pair (29, 30) has a median PPP-value near zero. The circles above the main diagonal (those to the left of the imaginary 45-degree line) provide some information about how often the PPMC technique detects lack of fit for each item pair. The shaded proportion of a circle for an item pair is the same as the proportion of times the PPP-values are extreme (below 0.05 or above 0.95) for that item pair.


Figure 4

Display of PPP-Values for Odds Ratios (ORs) for 2PL Data Analyzed by the One-Parameter Logistic (1PL) Model

Note. The circles below the main diagonal show the median of 100 posterior predictive p values (PPP- values) for each item pair; the proportion of the circle that is shaded is equal to the median. In the same manner, the circles above the main diagonal show the proportion of simulated data sets with extreme PPP-values.


The two halves of Figure 4 tell essentially the same story—whenever there is an extreme median for an item pair (either extremely high or low), the proportion of extreme p values is high for the same item pair.

For comparison purposes, Figure 5 shows what happens when the correct model (i.e., the 2PL model) is fitted to the data. In Figure 5, the median PPP-values (below the diagonal) are all near 0.5, and the proportions of extreme PPP-values (above the diagonal) are small.
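Displays of this kind are not tied to any particular software; a rough matplotlib sketch that shades a fraction of each circle equal to the plotted value is given below. The function name, argument layout, and shading color are this illustration's own and are not taken from the article.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Wedge

def ppp_circle_plot(medians, extremes, ax=None):
    """Friendly-style display of PPP-value summaries for all item pairs.

    medians  : (J, J) array; for i > j, medians[i, j] is the median PPP-value
               for the item pair (plotted below the main diagonal)
    extremes : (J, J) array; for i < j, extremes[i, j] is the proportion of
               data sets with extreme PPP-values (plotted above the diagonal)
    """
    if ax is None:
        ax = plt.gca()
    n_items = medians.shape[0]
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue
            value = medians[i, j] if i > j else extremes[i, j]
            center = (j + 1, i + 1)
            ax.add_patch(Circle(center, 0.4, fill=False, linewidth=0.5))
            if value > 0:
                # shade a fraction of the circle equal to the plotted value
                ax.add_patch(Wedge(center, 0.4, 90, 90 + 360 * value, color="gray"))
    ax.set_xlim(0, n_items + 1)
    ax.set_ylim(0, n_items + 1)
    ax.set_aspect("equal")
    ax.invert_yaxis()   # item 1 in the top row, as in the article's figures
    ax.set_xlabel("Item")
    ax.set_ylabel("Item")
    return ax
```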

Data-Generating Model: 3PL; Data Analysis Model: 2PL

It proved relatively easy above to identify misfit of the Rasch model for data in which it was not appropriate. This next case is more difficult. The fit of the 2PL model has often been found adequate when data are generated from the 3PL model (see, e.g., Sinharay, in press).

Figure 6 compares, for a data set generated from the 3PL model, the observed number of examinees attaining a particular score to its PPD under the 2PL (top row) and 3PL (bottom row) models.

The top left panel of Figure 6 (similar to a plot used by Beguin & Glas, 2001) shows that the observed score frequencies lie outside the 90% interval in a number of places. There are fewer people at the low end of the observed score distribution than expected under the 2PL model because the 2PL model includes no guessing parameter.


Figure 5

Display of PPP-Values for Odds Ratios (ORs) for the Two-Parameter Logistic (2PL) Data Analyzed by the 2PL Model

Note. The circles below the main diagonal show the median of 100 posterior predictive p values (PPP-values) for each item pair; the proportion of the circle that is shaded is equal to the median. In the same manner, the circles above the main diagonal show the proportion of simulated data sets with extreme PPP-values.


More (fewer) people fall in the 10 to 12 (22 to 25) range than expected under the 2PL model. The summary discrepancy measure χ²_NC is very informative in this situation. The top right panel of Figure 6 (in contrast to the bottom right panel) shows that the realized values of χ²_NC are mostly larger than the predictive values for this data set.

Figure 7 shows that the patterns observed above in the one data set persist across repetitions.

Under the 2PL model, PPP-values are consistently high for low raw scores and those in the range of 22 to 25 correct, indicating that these raw scores occur less often than expected. The PPP-values are consistently low in the middle range and for the top scores, indicating that these raw scores occur more often than expected. The PPP-value for the aggregate measure χ²_NC is less than 0.04 for each of the 100 generated data sets. Therefore, the PPMC method demonstrates that the 2PL model cannot adequately explain the observed score distribution of data from the 3PL model.

Biserial correlation is another aspect of data generated from the 3PL model that the 2PL model fails to explain adequately. For one data set generated from the 3PL model, for a number of items (mostly those with high biserials), the observed biserial correlations are lower than expected under the 2PL model (although the discrepancy is not as extreme as observed for the Rasch model).


Figure 6

Fit of the 2PL Model and the 3PL Model to the Observed Score Distribution

Note. The top row shows plots of diagnostics based on the observed score distribution for a data set generated from the three-parameter logistic (3PL) model and analyzed by the two-parameter logistic (2PL) model. The left panel shows the observed score distribution (solid ellipses, connected by a solid line) and the corresponding posterior predictive distribution (PPD; hollow ellipses joined by dotted lines at posterior predictive [PP] medians and dotted lines indicating lower and upper bounds of PP 90% intervals); the right panel shows, along with a diagonal line, the realized and predictive discrepancies for χ²_NC. For comparison purposes, the bottom row shows similar plots when the same data are analyzed by the correct (3PL) model.


As an outcome, the observed mean and SD of the biserial correlations are considerably lower than expected under the 2PL model (see Figures 18 and 19 in Sinharay & Johnson, 2003). The same pattern persists across repetitions. Figure 8 shows histograms of the PPP-values for the mean and SD of the biserial correlations across the 100 simulated data sets for the 2PL and 3PL models. The PPP-values for both mean and SD for the 2PL model are concentrated on high values.

The finding that observed scores and biserial correlations can identify weaknesses in the fit of the 2PL model to 3PL data is important in that there has previously been little success distinguishing the 2PL and 3PL models.

Data-Generating Model: Multidimensional; Data Analysis Model: 3PL

To study the ability of the PPMC method to detect lack of fit when the local independence assumption of the unidimensional IRT models is violated, the case is considered in which the 3PL model is fit to data generated from the linear logistic 2-dimensional model (Reckase, 1997, and the references therein) given by

$$P(y_{ij} = 1) = c_j + (1 - c_j)\left(1 + e^{-(a_{1j}\theta_{1i} + a_{2j}\theta_{2i} - b_j)}\right)^{-1},$$

$$(\theta_{1i}, \theta_{2i}) \sim N_2(0, 0, 1, 1, \rho), \quad |\rho| \le 1. \qquad (5)$$


Figure 7

Fit of the 2PL Model and the 3PL Model to the Observed Score Distribution for 100 Data Sets

Note. The left panel shows boxplots showing the 100 posterior predictive p values (PPP-values) for each observed raw score across the 100 simulated data sets generated from the three-parameter logistic (3PL) model and analyzed by the two-parameter logistic (2PL) model. The width of the boxplot for any observed score is proportional to the mean number of people for that score. There are horizontal dotted lines at 0.05 and 0.95. For comparison purposes, the right panel shows similar plots when the same data sets are analyzed under the correct (3PL) model.


For generating data from the above model, the same difficulty and guessing parameters are used as in Table 2. The closer the value of ρ to 1, the more unidimensional are the data. The slope parameters for the two dimensions, a_{1j} and a_{2j}, are functions of the slopes a_j used in Table 2: a_{1j} = a_j I[j ≤ 15], a_{2j} = a_j I[j > 15]. Under this data-generating model, the first and second halves of the test measure different correlated skills.
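A minimal Python sketch of this data-generating step, assuming the slope, difficulty, and guessing values of Table 2 are supplied as arrays (only the data generation is shown, not the 3PL MCMC fit; the function name is this illustration's own):

```python
import numpy as np

def simulate_two_dimensional(a, b, c, rho, n_examinees, rng=None):
    """Generate 0/1 responses from the two-dimensional model in equation (5).

    The first 15 items load only on the first dimension and the remaining items
    only on the second, as in the simulation design described above.
    """
    if rng is None:
        rng = np.random.default_rng()
    item_index = np.arange(1, len(a) + 1)
    a1 = np.where(item_index <= 15, a, 0.0)    # a_{1j} = a_j * I[j <= 15]
    a2 = np.where(item_index > 15, a, 0.0)     # a_{2j} = a_j * I[j > 15]
    cov = np.array([[1.0, rho], [rho, 1.0]])
    theta = rng.multivariate_normal([0.0, 0.0], cov, size=n_examinees)
    logit = a1 * theta[:, [0]] + a2 * theta[:, [1]] - b
    p = c + (1.0 - c) / (1.0 + np.exp(-logit))
    return rng.binomial(1, p)
```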

The following discussion describes the results for the case with ρ= 0.7. Association among the items is an aspect the unidimensional 3PL model should fail to explain here. The MH statistic detects model misfit more often than the OR and is featured here. Figure 9, using the same symbols as in Figure 4, shows the PPP-values for a single data set above the main diagonal and the median PPP-values for 100 data sets below the main diagonal.

There is an abundance of extreme PPP-values. It is important to note here (and in other examples) that some extreme PPP-values are to be expected by chance even under a correct model given the large number of item pairs considered (435 in this case). One should interpret the plot as showing misfit only if the number of extreme p values is large and/or the extreme p values indicate an unusual pattern (e.g., all the extreme p values involving the same few items). Figure 9 clearly indicates that the 3PL model does not fit because there are many extreme PPP-values, and the extreme p values indicate that the items form two clusters. Observed associations (as measured by the MH statistic) for within-cluster³ (between-cluster) item pairs are mostly higher (lower) than expected under the 3PL model. This phenomenon is mentioned by, for instance, Stout et al. (1996), for multidimensional assessments, but Figure 9 provides an attractive graphical demonstration of the fact.


Figure 8

Histograms of the 100 Posterior Predictive p Values (PPP-Values) for the Average and Standard Deviation (SD) of Biserial Correlations for 100 Data Sets Generated From the Three-Parameter Logistic (3PL) Model and Analyzed by the Two-Parameter Logistic (2PL) Model (Top Panel) and the 3PL Model (Bottom Panel, for Comparison Purposes)

Note. The vertical dashed lines show the corresponding medians.


The patterns of Figure 9 are not present when the appropriate two-dimensional model is fit to the data and a similar plot (not shown) created. It is also possible to detect misfit of the 3PL model using the OR statistic (see Sinharay & Johnson, 2003, for results) and more often using the MH statistic (results not shown) when (a) data are generated from the testlet model (Bradlow et al., 1999), a special case of a multidimensional model (Wang & Wilson, 2005, p. 129), and (b) when data are generated from a speeded test. For both these situations, the local independence assumption of a 3PL model is known to be violated.

Data-Generating Model: 3PL With a Nonnormal Ability Distribution; Data Analysis Model: 3PL With N(0, 1) Ability Distribution

Here, the generating item characteristic curves (ICCs) are those of a 3PL model, but the generating ability distribution, p(θ), is a mixture of two normal distributions, N(−0.253, 0.609²) and N(2.192, 1.045²), with mixing proportions 0.897 and 0.103, respectively, as in Woods and Thissen (2004). As a result, p(θ) has mean 0 and SD 1, as in the N(0, 1) distribution, but has skewness 1.9 and kurtosis 5. A simple 3PL model using the N(0, 1) population distribution assumption is fit to the data. The observed score distribution, which is closely related to the underlying ability distribution, is an aspect of the data that the fitted model fails to describe adequately.
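Drawing proficiencies from this mixture is straightforward; a short sketch follows (the second arguments below are standard deviations, i.e., the square roots of the variances quoted above; the function name is this illustration's own):

```python
import numpy as np

def sample_mixture_theta(n, rng=None):
    """Draw abilities from the two-component normal mixture used to generate the data."""
    if rng is None:
        rng = np.random.default_rng()
    first_component = rng.random(n) < 0.897     # component 1 with probability 0.897
    return np.where(first_component,
                    rng.normal(-0.253, 0.609, size=n),
                    rng.normal(2.192, 1.045, size=n))
```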

The leftmost panel of Figure 10 shows the observed score distribution for one data set along with the PPD of the score distribution. Figure 10 clearly indicates model misfit (and hence represents a success for the PPMC method), especially for the higher raw scores.


Figure 9

Posterior Predictive p Values (PPP-Values) for One Data Set (Above Main Diagonal) and Median PPP-Values for 100 Data Sets (Below Main Diagonal) Using the Mantel-Haenszel (MH) Statistic as a Discrepancy Measure for Data Generated From a Two-Dimensional Model and Analyzed With the Three-Parameter Logistic (3PL) Model


There are more observed scores than expected near 20 and near 30 under the model and fewer observed scores than expected around 25. The middle panel of Figure 10 shows the standard N(0, 1) ability distribution on the same axes as the normal mixture. The unusual aspects of the observed score distribution relative to the 3PL model, assuming an N(0, 1) ability distribution, correspond directly to the areas in the ability distribution where the mixture differs from the typical distribution. The rightmost panel of Figure 10 shows the corresponding model check for the aggregate measure χ²_NC that pools information across all score levels. The realized discrepancies are much larger than those predicted under the model. The pattern persists across data sets. The PPP-value for the aggregate measure χ²_NC is less than 0.02 for all 100 generated data sets.

Real Data Example: Food Security Data

The USDA’s U.S. Food Security Measurement Project administers a battery of survey questions as a supplement to the Current Population Survey (CPS) in an attempt to estimate the proportion of the population that is (in)secure in their ability to obtain ‘‘enough food for an active, healthy life.’’

Respondents are classified into one of three food security levels (food secure, food insecure without hunger, food insecure with hunger) based on their responses to the 18 items listed in Table 3.

The USDA uses the Rasch model to measure the unidimensional latent construct "food insecurity" by dichotomizing all of the items listed above. Responses of "often" and "sometimes" to Items 1 through 6 are treated as successes (y_ij = 1), and "never" is treated as a failure (y_ij = 0).


Figure 10

Model-Checking Plots for a Data Set Generated Using a Nonnormal Ability Distribution and Analyzed Assuming an N(0, 1) Ability Distribution

Note. Leftmost plot shows the observed score distribution (solid ellipses connected by a solid line) and the posterior predictive distribution (PPD) of the observed scores (posterior medians, joined by dotted lines, as hollow ellipses and 90% posterior intervals as dotted lines). The middle plot shows the density functions of the N(0, 1) and normal mixture distribution. Rightmost plot shows realized and predictive discrepancy χ²_NC.


Follow-up Questions 8, 13, and 17 are dichotomized by treating responses of "only in 1 or 2 months" as failures. The Food Security Project treats all missing responses as failures (y_ij = 0), and this study does the same here.

A number of studies have suggested that the unidimensional Rasch model may not be adequate for the 18 questions listed above. Froelich (2002) investigated the unidimensionality assumption for the Food Security data from the 1999 CPS with the DETECT and DIMTEST (see, e.g., Stout et al., 1996, and the references therein) procedures and found evidence of at least two dimensions, with the items pertaining to children loading on one dimension and the items pertaining to adults/households loading on the second dimension. Nord (1999, as cited in Opsomer, Jensen, Nusser, Dringei, & Amemiya, 2002) noted that only individuals who pass a set of screening questions are administered the food security items. If it is believed that the entire population is normally distributed, then the mixing distribution employed in analysis should account for the nonrepresentative sample. Johnson (2004) employed a truncated normal distribution and fitted the 2PL model to Food Security data from households without children from the 2002 CPS. That paper found that differences in the discrimination parameters were statistically significant. Specifically, the paper found that the discrimination parameter for the item that asks if an "Adult was hungry but couldn't afford to eat" was significantly larger than the discrimination parameters for all other nonchild food security items.

This section demonstrates how the biserial correlation, observed score distribution, and Mantel-Haenszel posterior predictive diagnostics are used to examine the appropriateness of the Rasch model for the Food Security data. The responses of 12,722 households (with children) from the 2003 CPS (available from http://www.census.gov/) are examined.

Figure 11, using the same symbols as in Figure 2, compares the observed biserial correlations with their posterior predictive distributions. It was noted in the discussion of simulation studies that the biserial correlations for item responses generated from 2PL and 3PL items tend to be more variable than would be expected under the Rasch model.

Table 3

The 18 Food Security Items

1. We worried whether our food would run out.
2. Food bought didn't last.
3. Adult unable to eat balanced meals.
4. Relied on only a few kinds of low-cost foods to feed children.
5. Couldn't feed child(ren) balanced meals.
6. Child(ren) not eating enough.
7. Adult cut size or skipped meals.
8. . . . this happened in three or more months over the past year.
9. Adult ate less than they felt they should.
10. Adult hungry but couldn't afford to eat.
11. Adult lost weight because couldn't afford to eat enough.
12. Adult didn't eat for an entire day because couldn't afford to eat.
13. . . . this happened in three or more months.
14. Cut the size of child(ren)'s meals.
15. Child(ren) was(were) hungry.
16. Child(ren) skipped meals.
17. . . . this happened in three or more months.
18. Child(ren) did not eat for entire day.

Note. Respondents were asked if Statements 1 through 6 were ‘‘often,’’ ‘‘sometimes,’’ or ‘‘never’’ true over the past 12 months. For the remaining items, except 8, 13, and 17, respondents responded by answering yes or no. Follow-up Items 8, 13, and 17 asked respondents whether the preceding statement was true in ‘‘only 1 or 2 months,’’ ‘‘some months but not every month,’’ or ‘‘ almost every month.’’ Items 4 through 6 and 14 through 18 were presented only to those households in which there were children; the other 10 items were administered to all respondents.


In the right panel of Figure 11, the observed standard deviation of the biserial correlations (designated with a vertical dashed line), although above the mean, is not terribly extreme when compared to the posterior predictive distribution of the same measure. However, the observed biserial correlations for all items except Item 4 lie above the 90% posterior predictive interval. In fact, the mean of the observed biserial correlations is larger than the mean of the generated biserial correlations in 100% of the posterior predictive samples. This may be a consequence of a nonnormal latent variable distribution, as described below.

Figure 12, using the same symbols as in the leftmost plot in Figure 10, compares the observed raw score distribution with the posterior predictive score distributions. It is clear from the figure that the observed data set has significantly more respondents with a raw score of zero and significantly fewer respondents with scores 1 through 4 than expected under the analysis model. This finding is somewhat surprising because, under the Rasch model, the observed score is a sufficient statistic for the latent variable θ, and under joint maximum likelihood estimation, the observed score distribution should be estimated perfectly. However, this analysis assumes a normal distribution for the latent variables, which, in this case, is not flexible enough to fit the observed raw score distribution.

Finally, the unidimensionality assumption is evaluated with the Mantel-Haenszel posterior predictive diagnostic plot in Figure 13. Both halves of the plot (above and below the diagonal) show the same PPP-values, displayed with circles in the same manner as the half above the diagonal in Figure 9. The plot seems to indicate that there are some groups of items that are more closely related (empty circles) than expected under the unidimensional Rasch model that was fitted to the data. Specifically, Items 4 to 6 and Items 14 to 17 appear to form a single cluster of items that are more closely related than the Rasch model permits. These are all items that ask about the food availability for children in the home; Item 18 is the only child item that is not obviously part of the cluster.


Figure 11

Summaries of the PPDs of the Biserial Correlations for the Food Security Data

Note. The left-hand plot shows the observed biserial for each item as a solid ellipse, corresponding 90% posterior predictive (PP) intervals as a dotted line, and the PP median as a hollow ellipse. The right-hand plot shows a histogram as a summary of the posterior predictive distribution (PPD) of the standard deviation (SD) of the biserial correlations; a dotted vertical line shows the observed SD.


Items 7 through 13, which ask specifically about the food intake of an adult in the household, appear to make up another item cluster; Item 3 is the only adult item not included in the cluster. The first two items may be viewed as a third cluster. These results are very similar to those found in Froelich (2002).

The posterior predictive diagnostics examined here present strong evidence against the use of a Rasch model with a normal latent variable distribution for the analysis of the Food Security data.

However, it is not clear what extension of the model, if any, will correct the several problems uncovered by the posterior predictive diagnostics. One strategy might be to first fit a multidimensional Rasch model to address the findings of Figure 13, but it is not clear that a multidimensional model will correct the problems with the biserial correlations and the observed score distribution.

Another strategy might use a more flexible distribution for the latent variable, one that can capture the shape of the observed score distribution. However, any such distribution will most likely fail to handle the problems identified in Figure 13. A combination of the two remedies might be needed.

No matter what extension to the Rasch model is implemented, it remains important for the researcher to examine the appropriateness of the model assumptions.

Summary and Recommendations

This article, through several simulations and a real data example, demonstrates that PPMC provides a straightforward way to perform a collection of Bayesian model checks. The method is flexible enough to correctly detect lack of fit for a range of model violations that are of usual concern to psychometricians. One strong aspect of this article is the use of easily comprehensible and attractive graphical displays to present the model-checking results. Although the article deals with dichotomous items only, the techniques easily generalize to the case with polytomous items.

Results in a companion article (Sinharay, 2005) applying the PP method to several real data examples provide further support for the findings from these simulations.


Figure 12

Posterior Predictive Check Using the Observed Score Distribution for the Food Security Data

Note. The plot shows the observed score distribution (solid ellipses connected by a solid line) and the posterior predictive distribution (PPD) of the observed score distribution (posterior medians as hollow ellipses, joined by dotted lines, and 90% posterior intervals as dotted lines).


The choice of discrepancy measures is a vital issue in the application of the PPMC method. A number of discrepancy measures examined in this work are found useful in assessing IRT model fit. The OR and MH statistics, examining the association between item pairs, are found to successfully detect lack of fit whenever there is a violation of the LI assumption (e.g., for a multidimensional or a speeded test). The observed score distribution can be used to detect lack of fit when the assumed ability distribution is not correct. The discrepancy measures used in this work are natural quantities of interest in any educational assessment. Thus, another positive aspect of the model-checking approach presented here is that it provides a better understanding of the trade-offs among different types of IRT models. For example, the simulations suggest that when the items are of varying discrimination, then the Rasch model is not adequate to capture the dependencies in the data that are evident in the odds ratios. The simulations also show that the observed score distribution in the presence of guessing is not adequately described by the 2PL model, even though the model captures several other features of 3PL data.

The simulation results provide some advice concerning what has always been a difficult question about IRT model checking in real data settings—what aspects of model fit should be checked, and how should the checking be done? Naturally, the answer depends on an investigator's circumstance. If there is a specific concern, then one ought to use discrepancy measures that address the concern. For example, researchers have questioned whether the unidimensional Rasch model is adequate for the Food Security data discussed in the preceding section, and the Mantel-Haenszel statistic appears to confirm that the data are not unidimensional. Another example is provided by a real data example in Sinharay (2005), where there were some questions about speededness of a 45-item basic skills test, and PPC using the OR statistic confirmed the problem.


Figure 13

Display of the Posterior Predictive p Values (PPP-Values) From the Mantel-Haenszel (MH) Statistic for the Food Security Data Analyzed With the Rasch Model


If there are no specific a priori concerns for the model-checking exercise (as would happen with a model whose output is used for a number of different purposes), it is recommended to use a number of discrepancy measures (as suggested by van der Linden & Hambleton, 1997) for assessing different aspects of fit of the model; the simulations here suggest that the MH statistic and observed score distribution are especially useful. In the examination of the Food Security data, this study began by looking for evidence arguing against the form of the Rasch item response function but found instead, by examining the observed score distribution, that the assumption of a normal latent variable distribution is a bigger concern.

Only when the model fits all relevant aspects of the data adequately (as happened with a real example in Sinharay, 2005) can the investigator be confident that inferences made are reasonable.

It is not entirely clear what to do when the model does not fit one or more aspects of the data. Possible actions include employing the model to make inferences but being aware of the limitations and reporting its deficiencies (Gelman et al., 1996, p. 799), discarding items or pooling testlet items with the hope that the chosen model will fit the revised data, or using a more general model such as the testlet model or multidimensional model. Another key issue when misfit is found is judging when discrepancies between the test data and the model fit are important enough to address. Such decisions require additional statistical analysis and subject matter expertise in addition to the model-checking plots and p values. Sinharay (2005) elaborates on this issue.

The results of this study should be of great interest to a range of researchers. Clearly, the results are of interest to researchers using a Bayesian approach. For any IRT model analyzed in a Bayesian way, the PPMC method is convenient because it builds on the posterior distribution obtained naturally during data analysis. With ever-increasing demand from test consumers, the need for more complicated models and hence the use of Bayesian methods is likely to accelerate; therefore, it will be to the advantage of the psychometric research community to develop model-checking strategies that take advantage of this. The findings of this study, especially about which diagnostics appear most helpful, should also be of interest to those using a frequentist approach (e.g., using maximum likelihood estimation). A variation of the PPMC method can be applied by using a point estimate for the model parameters (rather than using a sample from the posterior distribution of the parameters) to generate a number of simulated data sets. The simulated data sets then provide a reference distribution for model diagnostics. However, this frequentist approach, unlike the PPMC method, does not take into account uncertainty in the estimation of the parameters (e.g., Bayarri & Berger, 2000).

Notes

1. Those that are influenced by a single trait that is different from the trait that the test intends to measure; the latter trait is common to each item in the test.

2. Those that are influenced by different traits.

3. Both clusters: Items 1 through 15 and Items 16 through 30.

References

Agresti, A. (2002). Categorical data analysis. Hoboken, NJ: John Wiley.

Albert, J., & Ghosh, M. (2000). Item response modeling. In D. K. Dey, S. Ghosh, & B. Mallick (Eds.), Generalized linear models: A Bayesian perspective (pp. 173-193). New York: Marcel-Dekker.

Bayarri, S., & Berger, J. (2000). P-values for composite null models. Journal of the American Statistical Association, 95, 1127-1142.

Beguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541-562.
