
Journal of Educational Measurement Winter 2005, Vol. 42, No. 4, pp. 375–394

Assessing Fit of Unidimensional Item Response Theory Models Using a Bayesian Approach

Sandip Sinharay
ETS, Princeton, NJ

Even though Bayesian estimation has recently become quite popular in item response theory (IRT), there is a lack of work on model checking from a Bayesian perspective. This paper applies the posterior predictive model checking (PPMC) method (Guttman, 1967; Rubin, 1984), a popular Bayesian model checking tool, to a number of real applications of unidimensional IRT models. The applications demonstrate how to exploit the flexibility of posterior predictive checks to meet the needs of the researcher. This paper also examines practical consequences of misfit, an area often ignored in the educational measurement literature when assessing model fit.

Before drawing any conclusions from the application of a statistical model to a data set, an investigator should assess the fit of the model (i.e., examine whether the model can adequately explain the aspects of the data set that are of practical interest). Otherwise, the investigator runs the risk of drawing incorrect conclusions regarding the scientific problem of interest. Substantial lack of fit should result in the replacement or extension of the model if possible. Even if a model has been adopted as the final model for an application, it is important to assess its fit in order to be aware of its limitations.

Assessing fit of item response theory (IRT) models is not a straightforward task.

The main difficulty is that the number of possible response patterns (2^I for a test with I binary items) is so large, even for a moderately long test, that the standard χ² test of goodness of fit does not directly apply. Therefore, as Glas and Suarez-Falcon (2003) noted, investigators assess the fit of IRT models by examining whether the model can explain various summaries (or collapsed versions) of the original data. van der Linden and Hambleton (1997) commented:

The approach to the assessment of fit with two- and three-parameter models often involves the collection of a wide variety of evidence about model fit, including sta- tistical tests, and then making an informed judgment about model fit and usefulness of a model with a particular set of data. (p. 16)

Researchers have proposed a significant number of techniques for assessing different aspects of fit of IRT models. For testing the assumption of unidimensionality/local independence of IRT models, there exist Q3 (e.g., Yen, 1993), DIMTEST (Stout, 1987), and factor analysis (e.g., McDonald, 1999), among others. For assessing fit at the item level (i.e., testing item fit), different measures have been suggested by Stone and Zhang (2003), Orlando and Thissen (2000), and others. One can check for differential item functioning using, for example, the normal/χ² test in Lord (1980) or the Mantel-Haenszel test of Holland (1985). For assessing person


fit, there exist a number of tests; see, e.g., Glas and Meijer (2003) for a review. Analysis of residuals has received limited attention (e.g., Reiser, 1996).

Still, there is no unanimously agreed-upon measure in any of the above areas. Standard IRT software packages lack reliable model fit indices, and worse, a number of IRT software packages produce model fit indices that have well-known shortcomings (Hambleton & Han, 2004). In summary, as Hambleton and Han (2004) noted, model checking remains a major hurdle to overcome for effective implementation of IRT.

The situation is even worse with application of IRT models under the Bayesian framework. There has been a recent surge in the use of Bayesian estimation in IRT. Patz and Junker (1999a, 1999b), Bradlow, Wainer, and Wang (1999), Janssen, Tuerlinckx, Meulders, and De Boeck (2000), Beguin and Glas (2001), Fox and Glas (2001, 2003), Bolt, Cohen, and Wollack (2002), Sinharay, Johnson, and Williamson (2003), and Wollack, Cohen, and Wells (2003) are only some of the recent examples of application of Bayesian estimation and Markov chain Monte Carlo (MCMC) algorithms (e.g., Gelman, Carlin, Stern, & Rubin, 2003) to fit complicated psychometric models. However, little attention has been given to assessing fit of IRT models from a Bayesian perspective. Besides, there is no standard software package that can perform Bayesian model checking.

The posterior predictive model checking (PPMC) method (Guttman, 1967; Rubin, 1984) is a popular Bayesian model checking tool because of its simplicity, strong theoretical basis, and obvious intuitive appeal. The method primarily consists of comparing the observed data with replicated data (those predicted by the model) using a number of test statistics. For a list of applications of the PPMC method in educational measurement, see Sinharay and Johnson (2003, p. 2). However, most of these applications are specialized, and few of them focus in general on unidimensional IRT models for dichotomous items. Before applying any technique to complicated models in a field, it is important to study its performance with the simpler models (especially because these models are complex enough so far as model checking is concerned). However, such studies have been rare for model checking with Bayesian IRT, with Glas and Meijer (2003) and Hoijtink (2001) being the exceptions; there is a need for more studies in this area. Finally, none of the above-mentioned works discussed checking a number of aspects as recommended by van der Linden and Hambleton (1997).

In light of the above facts, this paper uses the PPMC method in a number of real data applications to assess several aspects of fit of unidimensional IRT models for dichotomous items. The use of a variety of test statistics in the examples shows how the flexibility of the PPMC method can be utilized to suit the needs of the researcher.

The test statistics used here are natural and observed quantities in the context of test data—so researchers will find these intuitive and appealing.

The preferable way to perform posterior predictive checks is to use graphical plots (Gelman, Meng, & Stern, 1996; Stern, 2000). This paper adopts the same viewpoint and uses appropriately created graphical displays to demonstrate the ability of the PPMC method to provide graphical evidence of misfit, thus addressing the comment in Hambleton and Han (2004) that there is a need for more graphical approaches to evaluate IRT model misfit.


The issue of evaluating practical consequences of model misfit has been given little attention in the model checking literature in IRT (Hambleton & Han, 2004), whether frequentist or Bayesian. It is possible that discrepancies between the test data and predictions from a model are of no practical consequence even though a statistical test indicates misfit (van der Linden & Hambleton, 1997, p. 16). Therefore, finding an extreme p-value and thus rejecting a model should not be the end of an analysis (Gelman et al., 2003, p. 176); the investigator should determine whether the misfit of the model has substantial practical consequences for the particular problem at hand. This work tries to address that issue, whenever applicable, for the real applications. As it turns out, determining the practical consequences of model misfit mostly involves additional statistical analysis; a p-value or a diagnostic plot alone rarely provides any insight about the practical consequences of misfit.

Overall, this paper provides a comprehensive treatment of Bayesian model checking in the context of unidimensional IRT models. Bayesians should find this work useful, and frequentists should find this work, especially the discussions on practical consequences of misfit, of some interest.

There is a never-ending debate about the use of frequentist testing versus Bayesian testing (e.g., Casella & Berger, 1987), but this article does not delve into the issue.

Henceforth, it will be assumed that both frequentist and Bayesian testing (and p-values) are useful, and the focus will be on their application to assess model fit. The PPMC method has been applied to examine person fit (Glas & Meijer, 2003), item fit (Sinharay, 2003a), and differential item functioning (Hoijtink, 2001) for the simple IRT models, so these aspects are not covered in this paper. This paper also does not cover test statistics based on data and parameters (which are called discrepancy measures; see, e.g., Gelman et al., 1996).

This paper is organized as follows. The next section discusses the PPMC method and the role of test statistics in applications of the method; the section also provides a brief description of the test statistics examined in this study. The following four sections describe real data examples for which different aspects of the respective models are checked, depending upon the goals of the studies, without attempting to provide a complete model fit analysis for any one example (except for the last example). The examples involve an admissions test, a basic skills test, a mixed number subtraction test, and a test from the National Assessment of Educational Progress (NAEP).

Conclusions and future directions of work are provided at the end of this paper.

Posterior Predictive Model Checking Method

Let p(y | ω) denote the likelihood distribution for a statistical model applied to data (examinee responses in this context) y, where ω denotes the parameters in the model. Let p(ω) be the prior distribution on the parameters; the posterior distribution of ω is p(ω | y) ∝ p(y | ω) p(ω).

The PPMC method (Guttman, 1967; Rubin, 1984) suggests checking a model by comparing the observed data to the posterior predictive distribution of replicated data y^rep,

p(y^rep | y) = ∫ p(y^rep | ω) p(ω | y) dω,   (1)


as a reference distribution for the observed data y. In practice, a test statistic T(y) is defined to address the aspect of interest, and the observed value of T(y) is compared to the posterior predictive distribution of T(y^rep), with any significant difference between them indicating a model failure.

A popular summary of the comparison of observed data y to Equation 1 is the tail-area probability or posterior predictive p-value (PPP-value), the Bayesian counterpart of the classical p-value:

p = P(T(y^rep) ≥ T(y) | y) = ∫_{T(y^rep) ≥ T(y)} p(y^rep | y) dy^rep.   (2)

Because of the difficulty in dealing with Equations 1 or 2 analytically for all but simple problems, Rubin (1984) suggested simulating replicated data sets from the posterior predictive distribution in practical applications of the PPMC method. One generates N draws (mostly using an MCMC algorithm) ω_1, ω_2, ..., ω_N from the posterior distribution p(ω | y) of ω, and draws y^rep,n from the likelihood distribution p(y | ω_n), n = 1, 2, ..., N. The process results in N replicated data sets. A graphical plot of T(y) versus T(y^rep,n), n = 1, 2, ..., N, shows whether the model explains the aspect measured by T(y) adequately. Stern (2000) commented that the PPMC method is in the spirit of a diagnostic plot rather than a test, and this paper adopts the same viewpoint. Equation 2 implies that the proportion of the N replications for which T(y^rep,n) exceeds T(y) provides an estimate of the PPP-value. Extreme PPP-values (close to 0, or 1, or both, depending on the nature of the test statistic) indicate model misfit.

Because of the simplicity of implementing the PPMC method, the real data examples later skip any details about the issue; in each of the examples, an MCMC algorithm is run and a graphical plot is made or a PPP-value computed as described above. Unless otherwise stated, to fit a model, a normal prior distribution with mean 0 and variance 10 is assumed on the difficulty, logarithm of slope, and logit of guessing parameters (the prior distributions are weak to let the data completely determine the posterior distributions); a standard normal prior is assumed on the ability parameters.
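As an illustration of these mechanics, the following minimal sketch (Python, not the author's original C++/R programs) shows how replicated data sets and a PPP-value might be computed for the 3PL model from a stored posterior sample; the names `draws`, `theta`, `a`, `b`, `c`, and `stat` are hypothetical.

```python
# Minimal PPMC sketch for the 3PL model (illustrative; assumes a stored
# posterior sample is already available from an MCMC run).
import numpy as np

rng = np.random.default_rng(12345)

def p_3pl(theta, a, b, c):
    """3PL response probabilities: c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))

def replicate(theta, a, b, c):
    """Draw one replicated data set from the likelihood, given one posterior
    draw of the ability and item parameters."""
    prob = p_3pl(theta, a, b, c)
    return (rng.random(prob.shape) < prob).astype(int)

def ppmc(y, draws, stat):
    """Compare the observed value of a scalar test statistic with its
    posterior predictive distribution and estimate the PPP-value.

    y     : examinees x items array of 0/1 responses
    draws : iterable of posterior draws, each a dict with keys
            'theta' (length n) and 'a', 'b', 'c' (length J)
    stat  : function mapping a response matrix to a scalar test statistic
    """
    t_obs = stat(y)
    t_rep = np.array([stat(replicate(d["theta"], d["a"], d["b"], d["c"]))
                      for d in draws])
    ppp = np.mean(t_rep >= t_obs)   # estimate of Equation 2
    return t_obs, t_rep, ppp
```

Any of the test statistics discussed below can be supplied as the `stat` argument; for vector-valued statistics one would compare the components graphically rather than through a single PPP-value.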

The above description shows that the PPMC method is very similar to model checking based on Monte Carlo simulation (e.g., Stone & Zhang, 2003); both methods find the null distribution of the test statistics by simulating predicted/replicated data sets. The difference is that, as the generating parameter values, the former procedure uses draws from the posterior distribution while the latter uses the maximum likelihood estimate (MLE). However, a Monte Carlo-based approach does not properly take into account the uncertainty in the estimation of the parameters (e.g., Bayarri & Berger, 2000), unlike the PPMC method.

Empirical and theoretical studies so far have suggested that PPP-values generally have reasonable frequentist properties (e.g., Gelman et al., 1996). The PPMC method combines well with MCMC algorithms. Researchers such as Bayarri and Berger (2000) showed that PPP-values were conservative (i.e., often failed to detect model misfit), even asymptotically, for some choices of test statistics, such as when the test statistic is not centered. However, simulation studies in Sinharay and Johnson (2003) and Sinharay (2003a) demonstrated that the PPMC method with suitable test statistics could be useful to assess several aspects of fit of unidimensional IRT models.

Role of Test Statistics in the Posterior Predictive Model Checking Method

The choice of test statistics is very important in an application of the PPMC method, just as in a frequentist testing problem. Ideally, test statistics should be chosen to reflect aspects of the model that are relevant to the scientific purposes to which the inference will be applied (Gelman et al., 2003, p. 172).

Sinharay and Johnson (2003) suggested caution while applying the PPMC method to IRT models. While the flexibility of the method allows one to use any function of data and parameters as a test statistic, one should make sure that the measure has adequate power. For example, the outfit measure Σ_i Σ_j (y_ij − E(y_ij))² / var(y_ij) (e.g., van der Linden & Hambleton, 1997, p. 113), while intuitive and powerful for other types of models, was found not useful for IRT models by Sinharay and Johnson; even inadequate models predict the quantity adequately. Test statistics that relate to features of the data not directly addressed by the probability model are expected to be more useful (Gelman et al., 2003, p. 172). For example, Sinharay and Johnson reported that biserial correlation coefficients (e.g., Lord, 1980, p. 33) were not useful test statistics for the 2PL/3PL model (e.g., Lord), but they were useful for the Rasch model (e.g., Lord), as the latter model does not have any parameters to address the biserials.

For unidimensional IRT models, interest may be in assessing overall model fit, item fit, local independence (LI), differential item functioning, person fit, and so on, and a researcher has to employ the appropriate test statistics depending on the aspect(s) of fit that is (are) of primary concern; no single test statistic can test for all of the above-mentioned aspects together. The real data examples later in this paper demonstrate how one can use test statistics that are useful for assessing various aspects of model fit that are of immediate interest.

Test Statistics Examined

The following list describes most of the test statistics considered in the real data examples later.

1. Direct data display: A suitable display showing the observed data and a few replicated data sets may provide a rough idea about model fit by revealing interesting patterns of differences between the observed and replicated data (e.g., Gelman et al., 2003). For a data set from a large-scale assessment, plotting all data points may be prohibitive, but plots for a few subsamples (for example, a few examinees from each of the low-scoring, middle, and high-scoring groups) are possible.

2. Observed score distribution: Consider a test with J dichotomous items. Denote by NC_j the number of examinees getting exactly j items correct, and let NC = (NC_0, NC_1, ..., NC_J). A model that does not predict NC (the observed score distribution) may not provide a fair ranking of the individuals. Researchers such as Lord (1953), Ferrando and Lorenzo-Seva (2001), and Hambleton and Han (2004) suggested using a comparison of observed and predicted scores as a descriptive measure of overall model fit. However, Ferrando and Lorenzo-Seva commented that no distributional assumption could be made for the test statistic with a frequentist approach. The PPMC method does not face this problem; the application of the method does not need any distributional assumption for the test statistics. Sinharay and Johnson (2003) showed, using simulation studies, that the observed score distribution was a useful test statistic with the PPMC method in detecting misfit of an IRT model that assumed a normal ability distribution when the true ability distribution was not normal.

3. Biserial correlation coefficient: This statistic may be powerful, for example, to detect the misfit of the Rasch model. Sinharay and Johnson (2003) found the standard deviation of the biserial correlation coefficient to be useful in detecting misfit of the Rasch model when data are generated from the 2PL or 3PL model.

4. Measures of association among the item pairs: There are no parameters in unidimensional IRT models to address association/interaction among items. Therefore, test statistics that capture the interaction effects among the items have the potential to detect possible misfit of such models. Two such statistics are examined:

• Odds ratio: Consider an item pair in a test. Denote by n_kk′ the number of individuals scoring k on the first item and k′ on the second item, k, k′ = 0, 1. We use the sample odds ratio (see, for example, Chen & Thissen, 1997),

OR = (n_00 n_11) / (n_01 n_10),   (3)

referred to as the odds ratio (OR) hereafter, as a test statistic with the PPMC method.

Chen and Thissen (1997) applied these to detect violation of the LI assumption of IRT models; they argued that if LI was satisfied, a unidimensional model could predict the OR. However, LI may not hold for an application for reasons such as speededness, passage dependence, or the test not being unidimensional in the psychological sense (Yen, 1993; Chen & Thissen, 1997); in that case, the OR should be more than what is predicted by a unidimensional IRT model for within-cluster items (those that are influenced by a single trait that is different from the one trait that the test intends to measure; the latter trait is common to each item in the test) and less than what is predicted by a unidimensional IRT model for between-cluster items (those that are influenced by different traits). Chen and Thissen found the standardized log-odds ratio not to have a N(0, 1) null distribution and hence did not find it useful. They also found χ²-type test statistics based on the counts n_kk′ not to follow the hypothesized χ² distribution.

• Mantel-Haenszel statistic: For each item pair, it is possible to define an odds ratio conditional on the raw score of the examinees and then combine them. Let us define the odds ratio for an item pair conditional on the rest score r (i.e., the raw score on the test obtained by excluding the two items) as

OR_r = (n_11r n_00r) / (n_10r n_01r),   (4)

where n_kk′r is the number of individuals with rest score r obtaining a score k on one item and k′ on the other, k, k′ = 0, 1. Then it is possible to combine them into a pooled conditional odds ratio, also called the Mantel-Haenszel statistic (e.g., Holland, 1985), as

MH = [Σ_r n_11r n_00r / n_r] / [Σ_r n_10r n_01r / n_r],   (5)

where n_r is the number of examinees obtaining rest score r. This statistic, like the odds ratio defined in Equation 3, should be useful in detecting lack of LI of a unidimensional IRT model. If LI holds, using arguments as in, e.g., Yen (1993) and Stout (1987), the conditional covariance between the scores on the two items is close to zero and the Mantel-Haenszel statistic should be near 1; if LI is violated, the conditional covariance is positive (which means the Mantel-Haenszel statistic is more than 1) for within-cluster items and negative (Mantel-Haenszel statistic less than 1) for pairs of between-cluster items. Therefore, as with the OR, in situations where LI is violated, we will expect a unidimensional IRT model to underpredict the Mantel-Haenszel statistic for within-cluster items and overpredict the statistic for between-cluster items.

As Chen and Thissen (1997) commented on indices for detecting violation of LI, these statistics should be used not for hypothesis testing but, rather, for diagnostic purposes. Meaningful interpretation of the indices requires experience in IRT analysis and close examination of the item content. Sinharay and Johnson (2003), using detailed simulations, demonstrated these statistics to be useful in detecting violation of the LI assumption of unidimensional IRT models. A short computational sketch of both statistics follows.
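The sketch below (illustrative Python, not the author's programs) computes the two association measures of Equations 3 and 5 for one item pair of a 0/1 response matrix; zero cell counts would need special handling in practice.

```python
# Odds ratio (Equation 3) and Mantel-Haenszel statistic (Equation 5)
# for one item pair of a 0/1 response matrix y.
import numpy as np

def cell_count(yi, yj, k, kp, mask=None):
    """Number of examinees scoring k on the first item and kp on the second."""
    sel = (yi == k) & (yj == kp)
    return np.sum(sel if mask is None else sel & mask)

def odds_ratio(y, i, j):
    """OR = n00 * n11 / (n01 * n10)."""
    yi, yj = y[:, i], y[:, j]
    return (cell_count(yi, yj, 0, 0) * cell_count(yi, yj, 1, 1)) / \
           (cell_count(yi, yj, 0, 1) * cell_count(yi, yj, 1, 0))

def mantel_haenszel(y, i, j):
    """Pooled conditional odds ratio, conditioning on the rest score r."""
    yi, yj = y[:, i], y[:, j]
    rest = y.sum(axis=1) - yi - yj          # raw score excluding the two items
    num = den = 0.0
    for r in np.unique(rest):
        grp = rest == r
        nr = grp.sum()
        num += cell_count(yi, yj, 1, 1, grp) * cell_count(yi, yj, 0, 0, grp) / nr
        den += cell_count(yi, yj, 1, 0, grp) * cell_count(yi, yj, 0, 1, grp) / nr
    return num / den
```

Either function can be used as the test statistic in a posterior predictive check for a chosen item pair.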

Determining the Scoring Technique in an Admissions Test

Data and the Goal of the Analysis

Any large-scale testing program requires a rich collection of items. In an attempt to reduce expenses and improve item quality, there has been an increasing interest in using item models (Irvine & Kyllonen, 2002), classes from which it is possible to generate/produce items that are equivalent/isomorphic to other items from the same model (e.g., Bejar, 2002). A recent large-scale admissions test pretested items automatically generated from item models. The primary goal of the initiative was to calibrate each item model once and then to use the items generated from it in future computer-adaptive administrations of the test without the need to calibrate the items individually.

Sixteen item models were involved in this study; four main content areas were covered, with one item model for each of four difficulty levels (very easy, easy, hard, and very hard) for each content area. The automatic item generation tool generated 10 five-option multiple-choice items from each item model. All of the 160 items were pretested operationally; they were embedded within an operational computer adaptive test, or CAT (taken by 32,921 examinees), to make sure that the examinees were motivated while responding to these items. Each examinee received only four model-based items, one from each content area and one from each difficulty level. To avoid potential speededness and context effects, the within-section item positions were controlled. The number of examinees receiving any single one of these items varied from 663 to 1,016. Let us consider 100 items, 10 each from 10 item models, in this example.

For calibration (i.e., estimating item parameters) in such a situation, a simple way is to assume the same item response function (the 3PL model for this example) for all items belonging to the same item model. However, this approach ignores any variation between items within a model. If items generated from item models are administered in a future CAT, an examinee receiving a comparatively easier (harder) item from an item model may be getting an unfair advantage (disadvantage). Glas and van der Linden (2003) suggested an alternative calibration technique that used the 3PL model in the first stage, but added a hierarchical component by assuming that the item parameters of each item were normally distributed with a mean vector and a variance matrix that depended on the item model from which the item was generated. The statistical model (called the hierarchical model henceforth) is more complicated and more difficult to fit, but rightly accounts for the variation between the items in the same item model. The goal in this application will be to assess if the simple 3PL model approach suggested above is good enough or if one needs the more complicated hierarchical model.

Model Fitting, Choice of Test Statistic, and Results

The 3PL model (assuming the same response function for all items from an item model) and the hierarchical model are separately fitted (details about fitting the hierarchical model using an MCMC algorithm can be found in, e.g., Sinharay et al., 2003) to the data containing the 100 pretested items considered here. As each examinee faced only four pretested items, the usual N(0,1) prior distribution would lead to unstable estimates; so the posterior means and standard deviations (SD) from the operational CAT are used as the means and SDs, respectively, of the normal prior distributions on the proficiency parameters.

To assess whether the model fits the aspect of the data that is of main interest here, remember that the 3PL model approach makes a restrictive assumption about the variation between items within an item model, and hence a test statistic related to this aspect of the data should be used. Therefore, we first compute the proportion corrects for all 100 items. Then we compute the SD of the proportion corrects of the 10 items for each item model, resulting in 10 within-model SDs. These SDs will be the proper test statistics here. If the predicted SDs are close to the observed SDs, then we can be confident that the apparently restrictive assumption of the 3PL model is supported by the data, and hence a future test with items from these item models will be fair to the students.
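A sketch of this test statistic appears below (illustrative Python; the vector `item_model` assigning each pretested item to its item model, and the use of NaN for items an examinee did not see, are assumptions of the sketch rather than details given in the paper).

```python
# Within-item-model standard deviation of the item proportion corrects.
import numpy as np

def within_model_sds(y, item_model):
    """y          : examinees x items array of 0/1 responses, NaN if not presented
       item_model : length-items array of item-model labels"""
    p = np.nanmean(y, axis=0)                     # proportion correct per item
    return {m: np.std(p[item_model == m], ddof=1)
            for m in np.unique(item_model)}
```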


FIGURE 1. The observed and predicted within-model SD of proportion corrects for the 3PL model (left panel) and the hierarchical model (right panel).

The left panel of Figure 1 shows a plot of the observed and predicted SDs for the 3PL model. The dots denote the observed values while the boxplots denote the distribution of the predicted SDs. The 3PL model severely underpredicts the within-model SDs; the assumption of the same item response function for all the items within a model seems to be too restrictive to reproduce the SDs for the observed data. The right panel of Figure 1 shows a plot of the observed and predicted SDs for the more complicated hierarchical model; the model seems to explain the within-model SDs satisfactorily.

Evaluating Practical Consequences of Misfit

The simple 3PL model approach does not fit an aspect of the data that is of practical interest, while the hierarchical model does. However, the test administrators will be more interested in learning what the practical consequences are if the misfitting but simpler 3PL model is used instead of the complicated hierarchical model that fits the data better, and whether there would be a significant loss. Because Figure 1 does not address those issues, there is a need for further analysis.

As mentioned earlier, the primary goal in the pretesting process was to calibrate the item models once and then to use the items generated from them in future CATs.

Therefore, if using the simple 3PL model rather than the hierarchical model leads to substantial differences in the scores of individuals in a future CAT, then the misfit of the former model becomes of substantial practical concern.

FIGURE 2. The comparison of posterior means and SDs of examinee abilities under the 3PL model and the hierarchical model.

Suppose that the (future) CAT starts with four items and chooses Item 5 onward based on the posterior mean of ability (under a N(0,1) population distribution assumption) obtained from the responses to the first four items. A data generator simulates responses of 250 examinees to four items each under the hierarchical model assumption, using the estimates from the calibration process as the generating parameters. A scoring program then computes the posterior mean and SD of each examinee's ability under the 3PL model and the hierarchical model (for technical details about how to obtain the posterior distribution of ability under the two models, see, e.g., Glas & van der Linden, 2003). Figure 2 compares the posterior means and posterior SDs of the examinee abilities under the two models. There are clear patterns in the plots, the most noticeable one being the slight underestimation of the posterior SD by the 3PL model (an effect similar to that in Tsutakawa & Johnson, 1990, and Lewis, 2001). The posterior mean differs slightly as well, with the 3PL model mostly overestimating the values for extreme scores and underestimating them in the middle. Another relevant factor is computing time. The hierarchical model takes about twice as much time as the 3PL model for calibration (using C++ programs) and about 8 to 10 times as long as the 3PL model for scoring (using an R program; Ripley, 2001).
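As a rough illustration of the scoring step (not the operational scoring program), the posterior mean and SD of ability given a handful of item responses can be approximated on a grid; the sketch below does this under the 3PL model with fixed item parameters, whereas scoring under the hierarchical model would additionally integrate over the item-parameter distribution of each item model.

```python
# Grid approximation to the posterior mean and SD of ability under the 3PL
# model with a N(0,1) prior (illustrative Python; item parameters are treated
# as fixed point estimates here).
import numpy as np

def ability_posterior(y, a, b, c, grid=np.linspace(-4.0, 4.0, 161)):
    """y: 0/1 responses of one examinee; a, b, c: 3PL parameters of those items."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (grid[:, None] - b)))  # grid x items
    like = np.prod(np.where(y == 1, p, 1.0 - p), axis=1)          # likelihood
    post = like * np.exp(-0.5 * grid ** 2)                        # times N(0,1) prior
    post /= post.sum()                                            # normalize on grid
    mean = np.sum(grid * post)
    sd = np.sqrt(np.sum((grid - mean) ** 2 * post))
    return mean, sd
```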

Looking at Figure 2 and remembering the time taken by the two approaches, the test administrators can easily judge the practical consequences of misfit. If they think that these magnitudes of differences between the two models are significant and the additional computational time is affordable, they can use the more complicated hierarchical model in the future. Otherwise, they can proceed with the 3PL model, thus saving computational time. It is possible to perform a more rigorous analysis of practical consequences of misfit here by simulating a full CAT using a whole item pool and examining the difference of the estimates provided by the two approaches in the θ-scale or in the reported score scale, but this paper does not delve into that, mainly because of confidentiality issues.

Determining Speededness in a Basic Skills Test

Data and the Goal of the Analysis

This example involves the responses of 8,686 examinees to a separately timed section with 45 five-option multiple-choice items in the writing assessment part of a recently administered basic skills test. The first 25 items are of one type (on finding errors) while the last 20 are of another type (on correcting errors). Experts believe that the test is speeded. The objective in this example is to examine if the PPMC method can find the 3PL model failing in a way that can be attributed to speededness.

Choice of Test Statistic and Results

As described earlier, measures of association (such as the OR or the Mantel-Haenszel statistic) should be useful here because speededness leads to violation of the LI assumption of a unidimensional IRT model (Yen, 1993) and the failure of the model to explain the measures of association.

Figure 3 summarizes the PPP-values for the ORs (the Mantel-Haenszel statistic performs very similarly) for the 3PL model. For any item pair, an inverted triangle denotes that the 3PL model underpredicts the odds ratio while a triangle denotes an overprediction. No symbol for an item pair denotes a non-significant p-value. The plot shows that there exists substantial interaction among the items that the 3PL model cannot predict adequately. The model underpredicts the ORs for a significant number of pairs involving the last nine items (37–45) of the test. Also clearly visible are the large number of overpredictions for pairs involving one item from the group 38–45 and another from the group 1–25. Thus, Items 37–45 seem to load on a different dimension (in the words of, e.g., Stout, 1987) not measured by Items 1–36.

One explanation could be that Items 1–25 are of different type (finding errors) from Items 26–45 (correcting errors), causing violation of LI for the 3PL model. However, a plot like Figure 3 of the PPP-values for Items 26–45 only (obtained by fitting the model to responses for these items only) shows that Items 37–45 seem to load on a different dimension not measured by Items 26–36. It is highly likely that the failure of the model is caused by speededness in the test, especially because the last few items are discrete/stand-alone items, that is, they do not belong to a testlet (Bradlow et al., 1999), and a study of their contents does not suggest any additional connection among them.
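A display such as Figure 3 can be assembled by computing, for every item pair, the proportion of replicated data sets whose odds ratio is at least the observed one; the sketch below (illustrative Python, reusing an `odds_ratio` function like the one given earlier) returns that matrix of PPP-values, from which extreme entries can be marked with the triangle symbols of Figure 3.

```python
# PPP-value of the odds ratio for every item pair, given the observed data
# and a list of replicated data sets (illustrative code).
import numpy as np
from itertools import combinations

def odds_ratio(y, i, j):
    yi, yj = y[:, i], y[:, j]
    n = lambda k, kp: np.sum((yi == k) & (yj == kp))
    return (n(0, 0) * n(1, 1)) / (n(0, 1) * n(1, 0))

def or_ppp_matrix(y_obs, y_reps):
    """Values near 0 indicate underprediction of the OR by the model;
    values near 1 indicate overprediction."""
    J = y_obs.shape[1]
    ppp = np.full((J, J), np.nan)
    for i, j in combinations(range(J), 2):
        t_obs = odds_ratio(y_obs, i, j)
        t_rep = np.array([odds_ratio(yr, i, j) for yr in y_reps])
        ppp[i, j] = np.mean(t_rep >= t_obs)
    return ppp
```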

Evaluating Practical Consequences of Misfit

Though Figure 3 shows statistically significant model violation that can be explained by speededness, it does not tell whether the amount of speededness is practically significant; hence, there is a need for further analysis. Boughton, Larkin, and Yamamoto (2004) thoroughly examined the existence of speededness in a number of tests including this particular test (the test in this paper is referred to as PPW1 there).


FIGURE 3. PPP-values for the ORs for 3PL model fit for the basic skills test example.

They reported the percentage of examinees reaching all items in this test to be 85.2 (a value greater than 80 indicates a non-speeded test) and the ratio of the variance of the number of items not reached to the variance of the number right to be .14 (a value less than .15 indicates a non-speeded test); so the test is non-speeded according to the conventional measures. However, Boughton et al. argued that these were weaker measures of speededness, as many examinees may switch to a random response strategy as time runs out, leaving no item unanswered. They applied the HYBRID model (Yamamoto, 1989), which assumes that subsets of examinee response patterns are described by a discrete latent class model while the remaining responses are modeled by an IRT model. Boughton et al. suggested that a test is speeded from a practical viewpoint if the HYBRID model indicates that at least 20% of the examinees switch to a random response strategy before the end of the test. They found that for the test considered here, more than 20% of the examinees switch to a random response pattern by Item 38. The test was ultimately shortened by seven items for future operational administrations. Therefore, the model misfit suggested by the PPMC method was of substantial practical consequence in this example.


TABLE 1

The Items in the Mixed Number Subtraction Data

Item   Minuend   Subtrahend   Proportion correct
 1     3 1/2     2 3/2        .37
 2     6/7       4/7          .79
 3     3         2 1/5        .33
 4     3/4       3/4          .70
 5     3 7/8     2            .69
 6     4 4/12    2 7/12       .31
 7     4 1/3     2 4/3        .37
 8     11/8      1/8          .71
 9     3 4/5     3 2/5        .75
10     2         1/3          .38
11     4 5/7     1 4/7        .74
12     7 3/5     4/5          .34
13     4 1/10    2 8/10       .41
14     7         1 4/3        .26
15     4 1/3     1 5/3        .31

Demonstrating the Usefulness of Direct Data Display With a Mixed Number Subtraction Example

Data and the Goal of the Analysis

Let us consider a part of the often-analyzed mixed number data set introduced in Tatsuoka (1984). This data set played a central role in the development of the rule space methodology (Tatsuoka, 1983), a method capable of diagnosing cognitive errors and analyzing different methods for solving problems. The original data set has responses of 545 middle-school students to 40 mixed number subtraction problems.

The test was designed to diagnose erroneous rules of operation, 39 of which were found with the help of the PLATO system (Morimoto & Tatsuoka, 1983).

This paper considers a part of the data set with 15 items (for which the fractions have a common denominator) on the 325 students who were found, by the rule space method, to have used Method B (Tatsuoka, Linn, Tatsuoka, & Yamamoto, 1988; Tatsuoka, 1990). These 15 items were supposed to diagnose student mastery of up to five skills. Table 1 lists the items.

We use this data set to demonstrate the usefulness of a direct data display (discussed earlier) for a quick check of overall fit of an IRT model.

Results

Consider Figure 4, which shows the mixed number subtraction data and eight replicated data sets generated after fitting the Rasch model. A similar plot was used in Sinharay (in press) for assessing fit of a Bayesian network model. There are certain patterns in the observed data that are not present in the replicated data sets. For example, the top 7% of examinees (24 out of 325) get all items correct in the observed data (clear from the top of the leftmost column in the plot being completely black), but the corresponding percentage is considerably less in the replicated data sets. The examinees with the lowest raw scores could answer only two items correctly. Interestingly, these are Items 4 and 5 (in Table 1), both of which can be solved without any knowledge of mixed-number subtraction, using the idea that a minus a is 0 (Item 4) and knowledge of whole number subtraction (Item 5). The replicated data sets, however, often have correct responses for other items for these examinees. Further, the lower half of the examinees rarely get any hard items correct in the observed data, which is not true in the replicated data sets. These differences between the observed and replicated data sets suggest that the Rasch model is probably not satisfactory for the data set.

Note. The left column of Figure 4 (marked "Obs") shows all observed responses, with a little black box indicating a correct response. The items, sorted according to decreasing proportion correct, are shown along the x-axis; the examinees, sorted according to increasing raw score, are shown along the y-axis. The right columns (marked "Rep 1," "Rep 2," etc.) show eight replicated data sets. There are some clear differences between the observed and replicated data sets.

FIGURE 4. Mixed-number subtraction data and eight replicated data sets.

A plot like Figure 4, while not providing a definitive conclusion about model fit, can provide rough and quick feedback about overall model fit and can trigger further analysis regarding model fit. For example, the plot suggests that the model probably does not predict the observed score distribution and the responses to a number of items. Sinharay (2003a, 2003b) showed, using further statistical analysis, that the Rasch model did not adequately predict the observed score distribution, and more than half of the items (including Items 4 and 5) were found to be misfitting. The model also fails to explain the biserial correlations and the ORs, doing a poor job overall.
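A display in the spirit of Figure 4 can be produced with a few lines of plotting code; the sketch below (illustrative Python/matplotlib, not the author's R programs) draws the observed matrix next to a few replicated matrices, with items sorted by decreasing proportion correct and examinees by increasing raw score.

```python
# Direct data display: observed 0/1 responses beside replicated data sets.
import numpy as np
import matplotlib.pyplot as plt

def sort_for_display(y):
    item_order = np.argsort(-y.mean(axis=0))     # decreasing proportion correct
    person_order = np.argsort(y.sum(axis=1))     # increasing raw score
    return y[person_order][:, item_order]

def data_display(y_obs, y_reps):
    panels = [("Obs", y_obs)] + [(f"Rep {k + 1}", yr) for k, yr in enumerate(y_reps)]
    fig, axes = plt.subplots(1, len(panels), figsize=(1.5 * len(panels), 6))
    for ax, (title, y) in zip(np.atleast_1d(axes), panels):
        ax.imshow(sort_for_display(y), aspect="auto", origin="lower",
                  cmap="gray_r", interpolation="nearest")   # black = correct
        ax.set_title(title)
        ax.set_xticks([])
        ax.set_yticks([])
    return fig
```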

The fact that the PPMC method can detect model misfit for such a small data set is encouraging, if one remembers the occasional conservativeness of the approach (noted by, e.g., Bayarri & Berger, 2000). This paper does not try to evaluate practical consequences of misfit in this example. The test was constructed to diagnose different skills of examinees and hence the Rasch model was not the appropriate model for these data, which the findings corroborated.

Demonstrating Adequate Fit of the 3PL Model to a Data Set from the National Assessment of Educational Progress

Data and the Goal of the Analysis

It is often argued that an IRT model rarely fits real test data. This example is meant to demonstrate that the 3PL model can be a good fit to a real data set that is not too small. Consider a part of a data set from the NAEP Math OnLine (MOL) special study (Sandene et al., 2005). The MOL study was the first to compare computer testing and paper testing with a nationally representative sample of school-age students.

The data considered here consisted of the responses of 1,014 8th grade examinees to 16 multiple-choice items in one (paper-based) test form. As is typical with NAEP, the MOL study used the 3PL model operationally to provide a number of summaries and to compare computer testing and paper testing—therefore, it is important to make sure that the 3PL model is adequate for the data set.

Choice of Test Statistics and Results

Because results from the fitted model were used to learn about different aspects of the data, it is important to perform a collection of model checks here, as suggested by Hambleton (1989), and then make an informed judgment.

Figure 5 shows the observed data and eight replicated data sets after fitting the 3PL model to the data. Unlike Figure 4, Figure 5 does not suggest any obvious discrepancies between the observed data and the replicated data sets. The points for the most difficult item show an interesting pattern; the proportion of black (i.e., correct answers) in any vertical segment does not seem to be much more than that in a lower vertical segment. This is an outcome of the item having an extremely low slope parameter (.5). The pattern is also observed in the replicated data sets.

FIGURE 5. NAEP MOL data and eight replicated data sets.

Figure 6 compares the observed and predicted score distributions. For any raw score (ranging between 0 and 16), a boxplot (with whiskers extending to the 2.5th and 97.5th percentiles) depicts the (empirical) posterior predictive distribution of the number of examinees obtaining that particular raw score. A notch toward the middle of each box denotes the median. The points indicate the observed number of examinees obtaining a particular raw score; a solid line joins the points.

FIGURE 6. NAEP MOL data: Observed and predicted raw score distributions.

Beguin and Glas (2001) used a similar plot for detecting violation of a multidimensional IRT model. The figure shows that the 3PL model performs respectably in predicting the observed score distribution, with the observed count lying within a 95% posterior predictive interval for each raw score.
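A comparison along the lines of Figure 6 can be produced as follows (illustrative Python/matplotlib, not the operational NAEP programs): the observed score distribution NC is computed for the data and for each replicated data set, and the replicated counts at each raw score are summarized by a boxplot with the observed count overlaid.

```python
# Observed versus posterior predictive raw score distribution.
import numpy as np
import matplotlib.pyplot as plt

def score_counts(y, n_items):
    """NC = (NC_0, ..., NC_J): number of examinees at each raw score."""
    return np.bincount(y.sum(axis=1), minlength=n_items + 1)

def score_distribution_plot(y_obs, y_reps):
    n_items = y_obs.shape[1]
    obs = score_counts(y_obs, n_items)
    rep = np.array([score_counts(yr, n_items) for yr in y_reps])  # reps x (J+1)
    fig, ax = plt.subplots()
    ax.boxplot([rep[:, s] for s in range(n_items + 1)],
               positions=range(n_items + 1), whis=(2.5, 97.5))
    ax.plot(range(n_items + 1), obs, "o-", label="observed")
    ax.set_xlabel("Raw score")
    ax.set_ylabel("Number of examinees")
    ax.legend()
    return fig
```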

The 3PL model adequately predicts the biserial correlation coefficients and the measures of association among the items (OR and Mantel-Haenszel statistic) as well.

For example, in a plot like Figure 3, there are only three significant PPP-values (out of a total of 120) at the 5% level. Therefore, the LI assumption is not violated for the data set.

Sinharay (2003a) performed an item fit analysis on this data set and found only one item misfitting at the 5% level. Sandene et al. (2005) found no differential item functioning for the items of this test. Overall, this data set provides an example where the 3PL model performs extremely well and will make its user quite confident about the applicability of the model for making several types of inferences.

Conclusions

The assessment of fit of IRT models usually involves collecting a wide variety of diagnostics for model fit and then making an informed judgment about model fit and the usefulness of a model with a particular set of data (van der Linden & Hambleton, 1997; Hambleton, 1989). Even though Bayesian estimation has recently become very popular in educational measurement, not too many studies on assessing model fit under a Bayesian framework address the above comment of van der Linden and Hambleton.

As this paper shows with four real applications, the PPMC method provides a straightforward way to perform a collection of model checks aimed at different aspects of the model. The method can be used to obtain graphical evidence about model misfit, as demonstrated by the easily comprehensible and attractive graphical displays in this paper. Whenever applicable, this paper also evaluates practical consequences of misfit, an area often ignored in the psychometrics literature. Overall, this paper attempts to provide a comprehensive treatment of Bayesian model checking for unidimensional IRT models.

The choice of test statistics is a crucial issue in the application of PPMC methods. A number of test statistics, most of which are observed quantities of natural interest in the context of educational testing, appear promising in assessing IRT model fit. The direct data display can give a quick and rough idea about model fit. The OR statistic, examining the association between item pairs, promises to be a useful diagnostic. The examples here, together with findings in Sinharay and Johnson (2003), indicate that the ORs may be useful in PPMC for detecting lack of LI (caused by, for example, speededness or lack of unidimensionality in a psychological sense) in educational test data.

This paper handles the simple IRT models (1-, 2-, and 3PL) except for performing one check on a hierarchical IRT model. However, the same statistics can be applied in a similar manner to more complicated Bayesian IRT models (e.g., those discussed in the works mentioned in the introduction section of this paper) fitted with MCMC. Beguin and Glas (2001), who dealt with multidimensional IRT models, had one such example. This area needs further investigation.

No standard software performs posterior predictive checks. However, one can use WinBUGS (Spiegelhalter, Thomas, & Best, 1999) to run an MCMC algorithm, store the posterior sample in a file, and perform the checks using a separate computer program (this paper used the R software of Ripley, 2001). Computation time is another potential problem with the PPMC method. However, a standard practice among practitioners of MCMC algorithms is to store the posterior sample obtained while fitting a model and to use it to learn about different aspects of the problem in the future; if a posterior sample already exists, the computations require several minutes for small to moderate size assessments and at most a couple of hours for large-scale assessments.

One might wonder if these checks can be of any help in an application where a model is fitted using a frequentist approach (e.g., using maximum likelihood estimation). One may generate a number of simulated data sets in a frequentist framework using the MLE of the parameters and then implement some of the methods suggested here, such as the direct data display (e.g., Figure 4). This is the spirit behind the item fit analysis in Stone and Zhang (2003). Frequentists should also find the way this paper evaluates practical consequences of misfit useful.

Acknowledgments

The author thanks Matthew Johnson, Shelby Haberman, Hal Stern, Neil Dorans, Dan Eignor, John Donoghue, Paul Holland, Andreas Oranje, Matthias von Davier, Robert Mislevy, Michael Kolen, Charlie Lewis, Ronald Hambleton, and the anonymous reviewers for their invaluable advice. The author wishes to thank Kikumi Tatsuoka for kindly giving permission to use her mixed number data set and for her advice with the mixed number subtraction example. The author also wishes to thank Keith Boughton and Kevin Larkin for providing the data set for the basic skills test and for their help with the data set. The author gratefully acknowledges the editorial assistance of Elizabeth Brophy, Rochelle Stern, and Kim Fryer.

Note

Any opinions expressed in this publication are those of the author and not necessarily of Educational Testing Service.

References

Bayarri, S., & Berger, J. (2000). P-values for composite null models. Journal of the American Statistical Association, 95, 1127–1142.

Beguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541–562.

Bejar, I. I. (2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development. Mahwah, NJ: Lawrence Erlbaum Associates.

Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39(4), 331–348.

Boughton, K., Larkin, K., & Yamamoto, K. (2004). Modeling differential speededness using a hybrid psychometric approach. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.

Casella, G., & Berger, R. L. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association, 82, 106–111.

Chen, W., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289.

Ferrando, P. J., & Lorenzo-Seva, U. (2001). Checking the appropriateness of item response theory models by predicting the distribution of observed scores: The program EP-fit. Educational and Psychological Measurement, 61(5), 895–902.

Fox, J. P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66, 269–286.

Fox, J. P., & Glas, C. A. W. (2003). Bayesian modeling of measurement error in predictor variables using item response theory. Psychometrika, 68, 169–191.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis. New York: Chapman & Hall.

Gelman, A., Meng, X. L., & Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–807.

Glas, C. A. W., & Meijer, R. R. (2003). A Bayesian approach to person fit analysis in item response theory models. Applied Psychological Measurement, 27(3), 217–233.

Glas, C. A. W., & Suarez-Falcon, J. C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106.


Glas, C. A. W., & van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied Psychological Measurement, 27(4), 247–261.

Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society B, 29, 83–100.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 143–200). New York: Macmillan.

Hambleton, R. K., & Han, N. (2004). Assessing the fit of IRT models: Some approaches and graphical displays. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Hoijtink, H. (2001). Conditional independence and differential item functioning in the two-parameter logistic model. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory. New York: Springer-Verlag.

Holland, P. W. (1985). On the study of differential item performance without IRT. Proceedings of the 27th annual conference of the Military Testing Association, Vol. I (pp. 282–287). San Diego, CA: Navy Personnel Research and Development Center.

Irvine, S. H., & Kyllonen, P. C. (Eds.) (2002). Item generation for test development. Mahwah, NJ: Lawrence Erlbaum Associates.

Janssen, R., Tuerlinckx, F., Meulders, M., & De Boeck, P. (2000). A hierarchical IRT model for criterion-referenced measurement. Journal of Educational and Behavioral Statistics, 25(3), 285–306.

Lewis, C. (2001). Expected response functions. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory. New York: Springer-Verlag.

Lord, F. M. (1953). The relation of test score to the latent trait underlying the test. Educational and Psychological Measurement, 13, 517–548.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.

Morimoto, Y., & Tatsuoka, K. K. (1983). Analysis of misconceptions in fraction problems: Interactive diagnostic system on the PLATO system. Proceedings of ICMI-JSME Regional Conference on Mathematics in Education, Tokyo, Japan.

Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.

Patz, R., & Junker, B. (1999a). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146–178.

Patz, R., & Junker, B. (1999b). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.

Reiser, M. (1996). Analysis of residuals for the multinomial item response models. Psychometrika, 61(3), 509–528.

Ripley, B. D. (2001). The R project in statistical computing. MSOR Connections. The newsletter of the LTSN Maths, Stats & OR Network, 1(1), 23–25.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151–1172.

Sandene, B., Horkay, N., Bennett, R., Allen, N., Braswell, J., Kaplan, B., & Oranje, A. (2005). Online assessment in mathematics and writing: Reports from the NAEP Technology-Based Assessment Project, Research and Development Series (NCES 2005-457). U.S. Department of Education, National Center for Education Statistics. Washington, DC: U.S. Government Printing Office.


Sinharay, S. (2003a). Bayesian item fit analysis for dichotomous item response theory models (ETS RR-03-34). Princeton, NJ: ETS. Retrieved November 1, 2004, from http://www.ets.org/research/newpubs.html

Sinharay, S. (2003b). Practical applications of posterior predictive model checking for assessing fit of common item response theory models (ETS RR-03-33). Princeton, NJ: ETS. Retrieved November 1, 2004, from http://www.ets.org/research/newpubs.html

Sinharay, S. (in press). Assessing fit of Bayesian networks using the posterior predictive model checking method. Journal of Educational and Behavioral Statistics.

Sinharay, S., & Johnson, M. S. (2003). Simulation studies applying posterior predictive model checking for assessing fit of the common item response theory models. Manuscript in preparation. A preliminary version retrieved November 1, 2004, from http://www.ets.org/research/newpubs.html

Sinharay, S., Johnson, M. S., & Williamson, D. M. (2003). Calibrating item families and summarizing the results using family expected response functions. Journal of Educational and Behavioral Statistics, 28(4), 295–313.

Spiegelhalter, D. J., Thomas, A., & Best, N. G. (1999). WinBUGS version 1.2 user manual [Computer software manual]. Cambridge, UK: MRC Biostatistics Unit.

Stern, H. (2000). Comments on “P-values for composite null models.” Journal of the American Statistical Association, 95, 1157–1160.

Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331–352.

Stout, W. F. (1987). A nonparametric approach for assessing latent trait dimensionality. Psychometrika, 52, 589–617.

Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4), 345–354.

Tatsuoka, K. K. (1984). Analysis of errors in fraction addition and subtraction problems. Urbana, IL: University of Illinois, Computer-Based Education Research (NIE Final Rep. for Grant No. NIE-G-81-002).

Tatsuoka, K. K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ: Erlbaum.

Tatsuoka, K. K., Linn, R. L., Tatsuoka, M. M., & Yamamoto, K. (1988). Differential item functioning resulting from the use of different solution strategies. Journal of Educational Measurement, 25(4), 301–319.

Tsutakawa, R. K., & Johnson, J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55, 371–390.

van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York: Springer.

Wollack, J. A., Cohen, A. S., & Wells, C. S. (2003). A method for maintaining scale stability in the presence of test speededness. Journal of Educational Measurement, 40(4), 307–330.

Yamamoto, K. (1989). HYBRID model of IRT and latent class models. Princeton, NJ: ETS (RR-89-41).

Yen, W. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.

Author

SANDIP SINHARAY is a Research Scientist, Educational Testing Service, MS 12-T, Rosedale Road, Princeton, NJ 08541; ssinharay@ets.org. His primary research interests include Bayesian statistics, model checking and model selection, and educational statistics.
