Assessing Fit of Unidimensional Graded Response Models Using Bayesian Methods

(1)

Spring 2011, Vol. 48, No. 1, pp. 81–97

Assessing Fit of Unidimensional Graded Response Models Using Bayesian Methods

Xiaowen Zhu Data Recognition Corporation

Clement A. Stone University of Pittsburgh

The posterior predictive model checking method is a flexible Bayesian model- checking tool and has recently been used to assess fit of dichotomous IRT models. This paper extended previous research to polytomous IRT models. A simula- tion study was conducted to explore the performance of posterior predictive model checking in evaluating different aspects of fit for unidimensional graded response models. A variety of discrepancy measures (test-level, item-level, and pair-wise mea- sures) that reflected different threats to applications of graded IRT models to perfor- mance assessments were considered. Results showed that posterior predictive model checking exhibited adequate power in detecting different aspects of misfit for graded IRT models when appropriate discrepancy measures were used. Pair-wise measures were found more powerful in detecting violations of the unidimensionality and local independence assumptions.

It has become increasingly common to use Bayesian methods for estimating different IRT models (c.f., Li, Bolt, & Fu, 2006; Patz, Junker, Johnson, & Mariano 2002; Wang, Bradlow, & Wainer, 2002). As for traditional IRT applications, infer- ences from Bayesian IRT applications are valid only when the model fits the data or there is reasonable correspondence between the model and the item responses or data. One model checking tool that has recently proved useful with Bayesian IRT applications is posterior predictive model checking.

Posterior predictive model checking involves simulating data using the posterior distributions for model parameters and comparing features of simulated data against observed data (Rubin, 1984). Any systematic differences indicate potential misfit of the model. The advantages of using this method with IRT model fit are threefold: (1) uncertainty in parameter estimation is accounted for by using posterior distributions for model parameters rather than point estimates; (2) null sampling distributions are constructed empirically rather than relying on analytically derived distributions; and (3) fit of complex IRT models which can only be estimated using Bayesian methods can be assessed.

The focus of previous research with posterior predictive model checking has been on unidimensional IRT models for dichotomously scored items. For example, Sinharay (2005, 2006) explored the performance of posterior predictive model checking in evaluating different aspects of misfit (e.g., violation of local

(2)

Zhu and Stone

independence and item misfit) for unidimensional dichotomous IRT models. Levy, Mislevy, and Sinharay (2009) examined the utility of posterior predictive model checking in detecting misfit of unidimensional 2PL models to multidimensional dichotomous response data. Glas and Meijer (2003) used this method to assess person fit of 3PL models. These studies demonstrated that posterior predictive model checking is a flexible and promising technique for detecting a range of possible model violations from multidimensionality and local dependence to item or person misfit for dichotomous IRT models.

The purpose of this study was to extend this body of research to polytomous IRT model applications. A number of polytomous models have been developed and used for performance assessment applications (Lane & Stone, 2006). This study focused on one of the commonly used models—Samejima’s (1969) unidimensional graded response model, and investigated the performance of posterior predictive model checking in assessing different sources of potential misfit for unidimensional graded IRT model applications. Discrepancy measures that reflected specific threats to applications of the model to performance assessments were considered (e.g., multidimensionality and local item dependence), and the usefulness of various discrepancy measures in detecting different types of misfit was compared.

Introduction to Posterior Predictive Model Checking

Let y be the observed data and y^repbe the replicated data based on the model and the posterior distribution of model parametersδ (item and ability parameters) that were estimated from the observed data. Posterior predictive model checking (Rubin, 1984) assesses the fit of a model by examining whether the observed data y appears extreme with respect to replicated data y^repfrom the posterior predictive distribution:

p (y^rep|y) =

p (y^rep, δ|y) dδ =

p (y^rep|δ)p (δ|y) dδ, (1)

where p(δ|y) is the posterior distribution of δ. The rationale underlying posterior predictive model checking is that if a chosen model fits the data, then observed data should be similar to replicated data generated from the posterior distributions for model parameters (Gelman, Carlin, Stern, & Rubin, 2003).

A discrepancy measure is employed to measure the difference between the observed and predicted data. Any systematic difference between the observed (or re- alized) discrepancy measure D(y, δ) based on the observed data, and the predic- tive discrepancy measure D(y^rep,δ) based on the posterior predictive distribution is indicative of model-data misfit. The comparison of D(y,δ) and D(y^rep,δ) can be performed using (1) graphical displays which provide a diagnostic tool for evalu- ating model fit (Gelman et al., 2003); and (2) posterior predictive p-values which provide numerical summaries of model fit and reflect the relative occurrence of the value for a discrepancy measure D(y,δ) for the observed data y in the distribution of

(3)

discrepancy values from replicated data D(y^rep,δ):

p-value= P (D (y^rep, δ) ≥ D (y, δ) |y)

=

D(y^rep,δ)≥D(y,δ)p (y^rep|δ)p (δ|y) dy^repdδ. (2)

Posterior predictive p-values near .5 indicate that there is no systematic difference between realized and predictive discrepancy measures, and thus indicate adequate fit of the model. Posterior predictive p-values near 0 or 1 suggest that realized discrepan- cies are inconsistent with posterior predictive discrepancy values and hence suggest model-data misfit. It should be noted that posterior predictive p-values should not be interpreted as traditional hypothesis-testing p-values (Sinharay, Johnson, & Stern, 2006).

Discrepancy Measures

Choosing discrepancy measures is a key issue in applications of posterior predictive model checking. They should be chosen to reflect potential sources of misfit most relevant to a testing application. For example, when a unidimensional IRT model is used in a testing application where item responses may reflect a multidimensional structure, discrepancy measures should be chosen that are sensitive to this threat to model fit. It is recommended that a number of different discrepancy measures be used for evaluating different sources of possible model misfit for a particular testing application (Sinharay et al., 2006). Thus, a variety of discrepancy measures were employed in this study.

Test-Level Measure

A common measure uses the test score distribution (i.e., number of examinees with each total test score) to compare observed and posterior predictive test score distributions. If the observed score distribution approximates the posterior predictive test score distribution, there is no evidence of model misfit at the test level. Dif- ferences between observed frequencies in the test score distribution and expected frequencies from the posterior predictive test score distribution can be evaluated using a chi-squared statistic (B´eguin & Glas, 2001). Previous research has found that this measure can be used with posterior predictive model checking to detect misfit for dichotomous IRT models (B´eguin & Glas, 2001; Sinharay et al., 2006).

Item-Level Measures

An item score distribution represents the number of examinees responding to each response category. Similar to the test score distribution, the difference between observed and posterior predictive item score distributions can also be summarized using a chi-squared statistic for each item. Although this measure was not found useful with dichotomous IRT models (Levy et. al. 2009) and a polytomous fusion model (Fu, Bolt, & Li, 2005), this study investigated its use with graded IRT models.

The item-test score correlation is the correlation between examinees’ total test scores and their scores on a particular item. Sinharay et al. (2006) found this

(4)

Zhu and Stone

correlation was effective in detecting misfit of 1PL models to data that were generated from a 2PL or 3PL model, but effectiveness of this correlation for detecting other aspects of misfit has not been investigated. The item-total score correlation may also be effective for detecting “local dependence” among items. Local dependence has been found to affect estimation of item discrimination (Yen, 1993), and item discrimination is related to item-test score correlations. Note that in this study, Pearson correlations were used to estimate associations between items and total test scores since there were five discrete response categories for each item.

A number of statistics have been proposed to assess fit of IRT models at the item level (e.g., Orlando & Thissen, 2000; Stone, 2000; Yen, 1981). All of these statistics compare observed response score distributions and expected response score distributions under the estimated model for each item. The performance of the Orlando and Thissen (2000) item-fit statistic with posterior predictive model checking has been examined by Sinharay (2006) for dichotomous items. In this study, one tradi- tional item-fit statistic (Yen’s Q1) and one alternative item-fit statistic (Stone’s item-fit statistic) developed specifically for use with performance assessments were used as discrepancy measures.

Pair-Wise Measures

Traditionally, dimensionality and local independence assumptions underlying IRT models have been evaluated using pair-wise test statistics reflecting associations between responses to item pairs. Pair-wise discrepancy measures used with posterior predictive model checking have been found to be more powerful than test- and item- level measures in detecting these sources of misfit for dichotomous IRT models (e.g., Fu et al., 2005; Levy et al., 2009). In the current study, two local dependence indices (Yen’s Q₃and an odds ratio) and a multidimensionality index based on item covariance residuals were employed to evaluate their use with posterior predictive model checking and the graded IRT model.

Yen’s Q₃ (1993) statistic measures the correlations between pairs of items after accounting for the latent ability level in examinees. It is computed from correlations based on deviations between observed and expected responses for item pairs. Levy et al. (2009) found that the Q3discrepancy measure with posterior predictive model checking was powerful in detecting local dependence among responses to dichotomous items.

Chen and Thissen (1997) used odds ratios to evaluate local dependence for dichotomous items. The odds ratios are computed from 2×2 contingency tables for item pairs (j and j^∗) by n₀₀n₁₁/.n01n₁₀, where npq is the observed number of ex- aminees having response p on Item j (0 or 1) and response q on Item j^∗(0 or 1).

Discrepancy measures based on odds ratios with posterior predictive model checking have also been found to be effective in detecting local dependence in the context of dichotomous IRT models (c.f., Levy et al., 2009; Sinharay et al., 2006).

In contrast with dichotomously scored items, multiple odds ratios can be com- puted with polytomous items since the contingency table is R × C (R > 2 and C > 2). For this study, a global odds ratio (Agresti, 2002) was used as a possi- ble discrepancy measure. For any two items, the R× C contingency table may be

(5)

reduced to a 2×2 contingency table by dichotomizing the response categories of each item. The global odds ratio was then defined as the cross-ratio of this pooled 2×2 table. In this study, the dichotomization was based on score rubrics typically used with performance assessments. For items with 5 response categories (0–4), Categories 3 and 4 were treated as “correct” responses, and Categories 0, 1, and 2 were treated as

“incorrect” responses.

For any pair of items, the absolute item covariance residual was defined as the absolute value of discrepancy between the observed and expected item covariances after estimating the model. This covariance residual measure has been shown to be relatively effective for detecting the departure of response data from unidimensionality (McDonald & Mok, 1995). Recently, Fu et al. (2005) and Levy et al. (2009) found that this measure was also effective in detecting certain types of misfit with posterior predictive model checking.

Simulation Study

In order to explore the performance of posterior predictive model checking in evaluating different sources of misfit for unidimensional graded IRT models, a number of response data sets were generated based on different graded models and all generated data were then estimated using unidimensional graded models. Table 1 presents the design and specific conditions used in this simulation, and Table 2 gives item parameters for each graded model. For each condition, 20 response data sets were generated using the data-generating model with each data set containing responses for 2,000 simulated examinees to 15 polytomous items with 5 response categories.

The test length and the number of response categories were fixed at typical values in performance assessment applications. A large sample size of 2,000 was used in

Table 1

Design and Conditions in this Simulation Study

Data-Generating Condition Violated

Analysis Model Model Number Assumption

Unidimensional Unidimensional Graded Model 1 None Graded Model Two-dimensional simple-structure Graded

Model

2 Unidimensionality

Case1: inter-dimension correlation= .3 Case2: inter-dimension correlation= .6 Two-dimensional complex-structure

Graded Model (one dominant dimension)

3 Local Independence

Case1: mild dependence a2/a1= .5 Case2: large dependence a2/a1= 1.0

Testlet Graded Model 4 Local Independence

Case1: mild dependenceσd(i )² = 0.5 Case2: large dependenceσ²d(i )= 1.0

(6)

Table 2

Item Parameters for Different Graded Models under Conditions 1–4

Conditions 1 and 4 Condition 2 Condition 3

Item a b1 b2 b3 b4 a1 a2 a1 a2(case1/case2)

1 1.0 –2.0 –1.0 .0 1.0 1.0 .0 1.0 .5(1.0)

2 1.0 –1.5 –.5 .5 1.5 1.7 .0 1.0 .5(1.0)

3 1.0 –1.0 .0 1.0 2.0 2.4 .0 1.0 .5(1.0)

4 1.0 –3.0 –1.5 –.5 1.0 1.0 .0 1.0 .5(1.0)

5 1.0 –1.0 .5 1.5 3.0 1.7 .0 1.0 .5(1.0)

6 1.7 –2.0 –1.0 .0 1.0 2.4 .0 1.7 .0

7 1.7 –1.5 –.5 .5 1.5 1.0 .0 1.7 .0

8 1.7 –1.0 .0 1.0 2.0 1.7 .0 1.7 .0

9 1.7 –3.0 –1.5 –.5 1.0 .0 2.4 1.7 .0

10 1.7 –1.0 .5 1.5 3.0 .0 1.0 1.7 .0

11 2.4 –2.0 –1.0 .0 1.0 .0 1.7 2.4 .0

12 2.4 –1.5 –.5 .5 1.5 .0 2.4 2.4 .0

13 2.4 –1.0 .0 1.0 2.0 .0 1.0 2.4 .0

14 2.4 –3.0 –1.5 –.5 1.0 .0 1.7 2.4 .0

15 2.4 –1.0 .5 1.5 3.0 .0 2.4 2.4 .0

Note. Conditions 2 and 3 had the same threshold values as for Conditions 1 and 4.

order to ensure that the graded IRT model was estimated precisely and the posterior predictive model checking results would not be affected by any inaccuracy in model parameters estimation.

Condition 1 represents the null condition in which the generating model and analysis model were equivalent or a condition under which the model fits the data. The unidimensional data were generated under the graded IRT model:

P_{i x}^∗ (θ) = exp [ai(θ − bi x)]

1+ exp [ai(θ − bi x)], (3)

where P_{i x}^∗(θ) is the probability of responding in category x or higher to item i, θ is the ability level for an examinee, ai is the discrimination (slope) parameter for item i, and bixis the threshold parameter for category x of item i. The configuration of item parameters for the graded model (see Table 2) involved a combination of three levels for the slope parameter: 1.0 for Items 1–5, 1.7 for Items 6–10, and 2.4 for Items 11–15 (reflecting low, average and high discrimination when no scaling constant, D= 1.7, is included in the model) and five levels for threshold parameters:

(1) –2.0, –1.0, .0, 1.0; (2) –1.5, –.5, .5, 1.5; (3) –1.0, .0, 1.0, 2.0; (4) –3.0, –1.5, –.5, 1.0; (5) –1.0, .5, 1.5, 3.0. Examinees were randomly simulated from a N(0,1) distribution.

(7)

For Conditions 2 and 3, the multidimensional graded model discussed by De Ayala (1994) was used to generate two-dimensional simple- and complex-structure data:

P_{i x}^∗() = exp

h

a_{i h}(θh− bi x)

1+ exp

h

a_{i h}(θh− bi x)

, (4)

whereθhis examinee’s ability level on dimension h, aihis the discrimination (slope) parameter of item i on dimension h, and bixis the threshold parameter for category x of item i.

In Condition 2, the test measured two dimensions but each item measured only one dimension (i.e., simple-structure). Specifically, the first eight items were designed to measure one dimension, and the remaining seven items measured the second dimen- sion. As shown in Table 2, the slope parameters on Dimension 1 (a1) were 1.0, 1.7.

2.4, 1.0, 1.7, 2.4, 1.0, 1.7 for Items 1–8, and 0 for Items 9–15. In contrast, the slope parameters on Dimension 2 (a2) were 0 for Items 1–8, and 2.4, 1.0, 1.7, 2.4, 1.0, 1.7, 2.4 for Items 9–15. The threshold parameters for these 15 items were the same as Condition 1. Two levels of inter-dimensional correlation were considered to reflect a low (.3) and moderate degree (.6) in the relationship between dimensions. The .6 case represents a correlation found in performance assessments such as math assessments where examinees are asked for an answer and an explanation (Lane, Stone, Ankenmann, & Liu, 1995). In these assessments one dimension reflects math ability and the other dimension reflects communication ability. Ability parameters for the two dimensions were randomly selected from a bivariate normal distribution (0,1) with the specified correlation.

Condition 3 simulated responses to a test designed to measure a dominant ability (e.g., math), but a subset of items also measure a nuisance or construct- irrelevant ability (e.g., reading). Although this condition reflects multidimensionality (complex-structure), this nuisance factor also reflects local dependence among the subset of items. Specifically, all 15 items measured a dominant dimension, but the first five items were designed to also measure a nuisance dimension. The correlation between two dimensions was fixed at a low level .3 because the test was developed to measure one dominant dimension. The degree to which item performance depended on the nuisance ability was captured by the ratio of the slope (a1) for the dominant ability to the slope (a2) for the nuisance ability. The ratio of a2to a1was set to two levels (.5 or 1.0) for the first five items. The thresholds and slope parameters a₁were the same as Condition 1 (see Table 2). The values of a₂ for Items 1–5 were .5 for Case 1 (a₂/a₁= .5) and 1.0 for Case 2 (a2/a₁= 1.0). Other items had values of 0 for a₂. Ability parameters for two dimensions were randomly selected from a bivariate normal distribution (0,1) with the correlation .3.

Condition 4 was designed to simulate responses to a test with a testlet (e.g., items with a shared stimulus). The items in the testlet would be locally dependent. The locally dependent data were generated under a modified graded IRT model for testlets

(8)

Zhu and Stone

proposed by Wang, Bradlow, and Wainer (2002):

P_{j i x}^∗ (θ) = exp ai

θj− bi x− γj d(i )

1+ exp

ai

θj− bi x− γj d(i )

 . (5)

In this model, a random testlet effect (γj d(i )) is introduced to reflect an interaction between person j and the testlet d(i), andγj d(i )is assumed to be distributed as N(0, σ²_{d(i )}). Whenσ²_{d(i )}= 0, the items within the testlet are conditionally independent. As the variance increases, the amount of local dependence increases. In this condition, item parameters were the same as that for the graded model in Condition 1 except one testlet was simulated (Items 6, 7, and 8). The variance of the testlet effectσ_{d(i )}² was specified at two levels: .5 and 1.0, reflecting mild and large degrees of local dependence, respectively. Ability parameters were randomly selected from N(0,1), and the testlet effectγj d(i )was randomly selected from a N(0,σ²_{d(i )}) for the items in the testlet.

For each of the 20 data sets generated in each condition, a unidimensional graded model was estimated using Markov chain Monte Carlo estimation in WinBUGS 1.4 (Spiegelhalter, Thomas, Best, & Lunn, 2003). The following pri- ors were specified:θi ∼ Normal(0, 1)for all persons i, and aj ∼ Lognormal(0, 1), b_{j 1}∼ Normal(0, 0.25) and bj (k+1)∼ Normal(0, 0.25)I (bj k) for all items j, where the notation I (bj k) indicates that the threshold b_{j (k+1)} was sampled to be larger than bj k as required under the graded model. Consistent with WinBUGS notation, precision parameters rather than variance parameters were specified in these prior distributions.

One chain of 4,000 iterations was run, and the first 3,500 iterations in each chain were discarded. Estimation of model parameters and posterior predictive model checking were based on the 500 remaining iterations (posterior distributions). Re- covery of the model parameters was evaluated using the root mean square error between the true (i.e., generating) parameters and posterior estimates based on mean values from the posterior distributions summarized across the 20 replications. The average root mean square error across all items was .08 for slope parameters and .07 for threshold parameters, indicating that a posterior sample of 500 was adequate for accurate recovery of item parameters for graded models. This posterior sample size was consistent with previous studies (e.g., Fu et al., 2005; Levy et al., 2009; Li et al., 2006).

For each condition, each of the 20 generated data sets served as an “observed”

data set, and for each “observed” data, the 500 posterior predictive (i.e., replicated) data sets under the unidimensional graded model were simulated within WinBUGS (one posterior predictive data set based on each of the 500 iterations). Values of the proposed discrepancy measures were calculated for the observed data as well as for each of the 500 predicted sets of data (y^rep) and then compared using graphical plots and posterior predictive p-values. In this study, posterior predictive p-values below .05 or above .95 were classified as extreme values and the null hypothesis of model fit was rejected. Use of discrepancy measures to reject model fit and computing re- jection rates across 20 replications allowed for the examination of Type I error rates

(9)

when the generating and analysis models were the same. Empirical power rates were derived when the generating and analysis models were not the same.

Results

Condition 1 (Analysis Model = Generating Model = Unidimensional Graded IRT Model)

For the null condition, median posterior predictive p-values and average Type I er- ror rates (i.e., proportions of replications with extreme posterior predictive p-values) were pooled across all 15 items (for item-level measures), all 105 item pairs (for pair-wise measures) and also pooled across the 20 replications. Results showed that median posterior predictive p-values for all measures were around .50 indicating no systematic differences between realized and predictive discrepancy values. This suggested adequate model-data fit, as was expected under this condition. The pro- portions of replications with extreme posterior predictive p-values were below the level of .10. The empirical Type I error rates for the global odds ratio and Yen’s Q₃ measures were about .05 and .06, respectively. Other measures had even lower Type I error rates. The results confirmed previous findings of conservativeness when using posterior predictive model checking (c.f., Fu et al., 2005; Levy et al., 2009; Sinharay, 2006).

For each pair-wise measure, there were 105 posterior predictive p-values for each replication. In order to summarize the results across the 20 replications more effi- ciently, pie plots similar to those used by Sinharay et al. (2006) were also constructed.

For example, Figure 1(a) displays median posterior predictive p-values for Yen’s Q3

measure for each item pair across the 20 replications. There is one pie plot for each item pair, and the proportion of a circle that is filled corresponds to the magnitude of the median posterior predictive p-value. As expected under the null condition, the median posterior predictive p-values for all item pairs were around .5. Pie plots for the other two pair-wise measures displayed a similar pattern.

Condition 2 (Generating Model= Two-Dimensional Simple-Structure Graded Model)

Table 3 presents the results for the misfitting conditions (Conditions 2–4). In Con- dition 2, the generated data reflected two dimensions (the first 8 items associated with Dimension 1 and the last seven items associated with Dimension 2), but the estimated model was a unidimensional graded model. Two interdimensional correlation cases were considered (ρ = .3 or .6). The pooled median posterior predictive p-value and the average proportion of extreme posterior predictive p-values across the 20 replications (i.e., empirical power) for each discrepancy measure under each correlation condition are presented in Table 3 (see Condition 2 column). For each item-level measure, the median posterior predictive p-values and proportions were pooled across items in each of the two dimensions (Dimension 1 or Dimension 2) and across replications. For the pair-wise measures, there were three types of item pairs: item pairs from the first dimension (Dimension 1, Dimension 1), item pairs from the second dimension (Dimension 2, Dimension 2), and item pairs from

(10)

(a) Condition 1 (b) Condition 2/Case 2

(c) Condition 3/Case 2 (d) Condition 4/Case 2

Figure 1. Display of median posterior predictive p-values for item pairs for Yen’s Q3for Conditions 1–4.

different dimensions (Dimension 1, Dimension 2). The results for the pair- wise measures were pooled from the same type of item pairs across the 20 replications.

As can be seen from the results, the three pair-wise measures were sufficiently powerful in detecting misfit of the unidimensional graded model to the two- dimensional data modeled for both cases. Median posterior predictive p-values were extreme and empirical power rates were high. Yen’s Q₃ index appeared to perform best in terms of empirical power. The global odds ratio and Yen’s Q₃measures also reflect the direction in the discrepancy between realized and posterior predictive val- ues. Median p-values close to 0 for item pairs from the same dimension indicated that the observed association between these item pairs was systematically higher than predicted under the unidimensional graded model (i.e., the unidimensional model underestimated item relationships). Conversely, median p-values close to 1 for item pairs from the different dimensions indicated that the observed associations were consistently lower than expected under the graded model (i.e., the model overesti- mated item relationships).

Results in Table 3 also illustrate that the test-level and item-level measures were not as effective as the pair-wise measures in detecting multidimensionality among

(11)

Table3 MedianPosteriorPredictivep-ValuesandEmpiricalPowerRatesforallMeasures—Conditions2–4 Condition2(TwoDimensions)Condition3(LocalDependence)Condition4(TestletEffect) Case1(ρ=.3)Case2(ρ=.6)Case1(mild)Case2(large)Case1(mild)Case2(lar TypeType MedianMedianMedianMedianMedianMedian Typep-ValuePowerp-ValuePowerp-ValuePowerp-ValuePowerp-ValuePowerp-valuePo TestScoreDistribution –.06.25.41.10–.29.10.25.10–.64.05.49 ItemScoreDistribution Dim1.50.00.50.002dim.50.00.50.00testlet.50.00.50 Dim2.50.00.50.001dim.50.00.50.00indep.50.00.50 Item-totalScoreCorrelation Dim11.00.91.28.092dim.33.00.17.00testlet.14.00.05 Dim2.00.99.66.031dim.57.00.65.00indep.54.00.63 Yen’sQ1Index Dim1.50.00.52.002dim.53.00.55.00testlet.45.00.47 Dim2.50.00.53.001dim.51.00.52.00indep.50.00.50 Stone’sItemFitStatistic Dim1.50.00.51.012dim.56.01.51.03testlet.50.02.40 Dim2.50.01.53.001dim.52.02.49.01indep.51.02.49 GlobalOddsRatio (Dim1,Dim1).07.45.01.74(2dim,2dim).18.20.00.94(testlet,testlet).001.00.00 (Dim1,Dim2).99.79.98.68(2dim,1dim).60.06.76.13(testlet,indep).70.10.80 (Dim2,Dim2).00.87.00.80(1dim,1dim).44.05.42.06(indep,indep).40.05.43 Yen’sQ3 (Dim1,Dim1).01.74.00.96(2dim,2dim).02.66.001.00(testlet,testlet).001.00.00 (Dim1,Dim2)1.00.981.00.97(2dim,1dim).65.09.94.45(testlet,indep).91.40.98 (Dim2,Dim2).00.95.00.97(1dim,1dim).44.06.28.10(indep,indep).39.08.36 AbsoluteItemCovarianceResidual (Dim1,Dim1).08.44.00.87(2dim,2dim).15.18.001.00(testlet,testlet).001.00.00 (Dim1,Dim2).00.93.01.83(2dim,1dim).53.00.38.01(testlet,indep).45.01.28 (Dim2,Dim2).00.86.00.92(1dim,1dim).52.00.50.00(indep,indep).52.00.50 Note:InCondition2,“Dim1”denotestheitemsinDimension1,and“Dim2”denotestheitemsinDimension2;InCondition3,“1dim”denotestheitemsonlymeasu thedominantdimension,and“2dim”denotestheitemsmeasuringboththedominantandnuisancedimensions;InCondition4,“testlet”denotesthetestletitems, “indep”denotestheindependentitems.

(12)

Zhu and Stone

the data whenρ = .6. Median posterior predictive p-values were not extreme and proportions of extreme p-values (i.e., empirical power) were small. However, when ρ = .3, the test-level measure and the item-total score correlation measure were more sensitive to the modeled misfit. The test-level measure exhibited some power (.25), and the item-total correlation exhibited sufficient power (.91 for Dim1 items, and .99 for Dim2 items) to detect the modeled multidimensionality. Thus, performance of these two discrepancy measures in posterior predictive model checking increased as the relationship between the dimensions decreased (i.e., uniqueness of dimensions increased).

As for Condition 1, graphical evidence for model fit was also examined. Figure 1(b) displays the pie plots for Yen’s Q₃measure whenρ = .6. The large number of extreme median posterior predictive p-values in this plot clearly indicated that the unidimensional graded model did not fit the data. Moreover, the pattern in this plot differed clearly from the pattern under Condition 1 (see Figure 1(a)). The 15 items fell into two clusters which matched the modeled factor structure—Items 1–8 formed one cluster, and Items 9–15 formed another cluster.

Condition 3 (Generating Model = Two-Dimensional Complex-Structure Graded Model)

In Condition 3, the generated data were two-dimensional with complex-structure (Items 1–5 measured a dominant dimension as well as a nuisance dimension, and Items 6–15 only measured the dominant dimension). Two levels of dependence among the first five items were modeled: mild (a₂/a₁= .5), and large (a2/a₁= 1.0).

Results for this condition are presented in Table 3 (see Condition 3 column). Based on the modeled structure, the items are labeled as follows in the table: items measuring both dimensions (Items 1–5 labeled 2dim), and items measuring only the dominant dimension (Items 6–15 labeled 1dim). For each item-level measure, the median posterior predictive p-values and empirical power rate were pooled across items of the same type. In addition, there were three types of item pairs: item pairs measuring both dimensions (2dim, 2dim), item pairs measuring the dominant dimension (1dim, 1dim), and pairs reflecting items measuring both dimensions and items measuring the dominant dimension (2dim, 1dim). The results for the pair-wise measures were pooled for the same type of item pairs across the 20 replications.

Results showed that the test-level and item-level measures were not effective in detecting the modeled misfit among the first 5 items. As for the pair-wise measures, the global odds ratio and item covariance residual measures exhibited low power (.20 and .18, respectively), while Yen’s Q₃ showed moderate power (.66) in detecting mild local dependence (Case 1) among the first 5 items. The median posterior pre- dictive p-value for Yen’s Q₃index for these items (2dim, 2dim) was .02. This value (∼0) indicated that the realized Q3 values were consistently larger than predictive values under the unidimensional graded model, or the graded model underestimated the association among the first 5 items. In other words, the first 5 items had more dependence than expected under the unidimensional model.

As the level of dependence on the nuisance dimension increased (Case 2), performance of the pair-wise measures improved as would be expected. For this case, both

(13)

Yen’s Q3index and the item covariance residual measure exhibited full power (1.00) in detecting the local dependence among the first five items. The median posterior predictive p-values were .00, implying that all the realized values were larger than the predictive values. In addition, the global odds ratio measure exhibited sufficient power (.94) for this case, and the median posterior predictive p-value was also close to 0. Overall, all three pair-wise measures were effective in detecting the large degree of simulated dependence among the first five items. For the mild dependence case, only Yen’s Q₃appeared to display adequate power.

It is worthy of note that as the degree of dependence increased, Yen’s Q₃measure also detected dependence between the modeled dependent and independent items (2dim, 1dim). For the large degree of dependence case, Yen’s Q₃ showed moderate power (.45) for the (2dim, 1dim) pairs, and the corresponding median posterior predictive p-value was .94. This high value indicated that the realized Q3values for the (2dim, 1dim) pairs were consistently smaller than the predictive values under the unidimensional graded model.

As for the previous conditions, pie plots were used to examine the pattern in the posterior predictive p-values. Figure 1(c) display the median posterior predictive p- values for Yen’s Q3 for the large dependence case. As can be seen from this plot, the median posterior predictive p-values were around .50 for the (1dim, 1dim) pairs, close to 0 for the (2dim, 2dim) pairs, and close to 1 for the (2dim, 1dim) pairs. The pattern is different from that under the null condition (Condition 1), indicating misfit between the model and the data.

Condition 4 (Generating Model= Testlet Graded Model)

In this condition, the effectiveness of different discrepancy measures in detecting local dependence among responses to testlet items was investigated. Items 6, 7 and 8 were simulated to reflect a testlet, and two levels of dependence among items were considered: mild (σ²_{d(i )}= 0.5), and large (σ²_{d(i )}= 1.0). The column “Condition 4” in Table 3 presents the median posterior predictive p-values and proportion of extreme p-values (i.e., power) pooled from the same type of items (or item pairs) across the 20 replications. In this table, “testlet” represents the testlet items, that is, Items 6–

8; “indep” represents the modeled locally independent items. As can be seen, the three pair-wise measures demonstrated full power (1.00) in detecting misfit of the unidimensional graded model to the modeled dependence among the testlet items.

In addition, as the degree of modeled dependence among the testlet items increased, the power of the two directional measures (the global odds ratio and Yen’s Q3) in detecting the misfit of the graded model to combinations of testlet and independent items (testlet, indep) increased. Yen’s Q₃ measure performed better than the global odds ratio measure in terms of empirical power. Specifically, for the mild and large dependence cases, the power rates for Yen’s Q₃ were .40 and .58 for the (testlet, indep) pairs, respectively. The global odds ratio measure had a power rate of .20 for the (testlet, indep) pairs for the large dependence case, and a power rate of .10 for the mild dependence condition.

The two directional measures also provided more information about the modeled misfit. For example, posterior predictive p-values for Yen’s Q3for the (testlet, testlet)

(14)

Zhu and Stone

pairs were 0, indicating that realized associations among the testlet items were con- sistently larger than predicted under the graded model. Posterior predictive p-values for the (testlet, indep) pairs were close to 1, implying that realized associations between the testlet and independent items were lower than predicted under the model.

In contrast, median posterior predictive p-values for all possible pairs of independent items were close to .50, indicating realized associations between the independent items were consistent with expected values. The absolute item covariance measure was also effective in detecting misfit of the model for testlet items. Median posterior predictive p-values of 0 for the (testlet, testlet) pairs suggested that observed abso- lute residuals were systematically larger than predicted residuals under the model.

However, as previously discussed, the direction of misfit cannot be determined (i.e., whether the model overestimates or underestimates associations between item pairs).

Figure 1(d) displays the pie plots for the large dependence case based on Yen’s Q3measure. The items appeared to fall into two clusters: Items 6–8 in one, and the remaining items in another. Any pairs including testlet items had extreme posterior predictive p-values. This pattern was clearly different from the corresponding plots under the null condition (Figure 1(a)), providing strong evidence about the misfit of the graded model to the modeled testlet effect.

For this condition, the item-total score correlation measure was found to be in- effective in detecting a mild degree of item dependence. However, power increased as the degree of dependence increased. The pooled median posterior predictive p- values were .14 and .05 for the mild and large dependence cases, respectively. The corresponding power increased from no power (.00) to moderate power (.52). For the large dependence case, the median posterior predictive p-values tended to be 0 for testlet items, indicating that the observed correlations for these items were higher than expected. In contrast, the median posterior predictive p-values for the indepen- dent items were not extreme (.63), indicating adequate fit of the graded model to these items.

Discussion

The purpose of this study was to evaluate the use of posterior predictive model checking with Bayesian estimation of polytomous IRT models for performance assessment applications. A simulation study was conducted to explore the performance of posterior predictive model checking in evaluating the fit of the unidimensional graded IRT model under different threats to model fit: dimensionality and local dependence. Different discrepancy measures at the test- and item-levels as well as several item-pair measures were used to determine which measures were more effective in evaluating the different threats.

Among the different measures, the pair-wise measures were found to be more powerful in detecting violations of the unidimensionality and local independence assumptions than the test- and item-level measures. This may be expected since the unidimensional graded model has no parameters to model associations between responses to pairs of items. Among the three pair-wise measures, the directional measures (global odds ratio and Yen’s Q3) may be preferred over a non-directional

(15)

measure (absolute item covariance residual). Posterior predictive p-values for the di- rectional measures indicate the direction of any misfit, or whether the model over- or underestimates item-pair associations. The pattern in posterior predictive p-values can also be used to indicate how items may be grouped into clusters or dimensions and to explore dimensionality in item responses. Though the three pair-wise measures were almost equally effective in several conditions, Yen’s Q3index performed best in terms of empirical power. The relatively low performance of the global odds ratio measure might be due to the dichotomization of the polytomous item responses.

However, Levy et al. (2009) also found that Yen’s Q₃index was more powerful than an odds ratio measure for evaluating dichotomous IRT models.

Regarding the performance of the test- and item-level measures in detecting violations in the unidimensionality or local independence assumptions, the test score distribution measure, item score distribution measure, and the two item-fit statistics were found to be least effective. However, when the inter-dimensional correlation was lower (i.e., two dimensions became more distinct), or the dependence among item responses increased, the item-total score correlation measure was effective in detecting the modeled misfit.

Consistent with previous research, the two classical item-fit statistics (Yen’s Q1

index and Stone’s fit statistic) were not effective in detecting item misfit due to multidimensionality or local dependence (e.g., Zhang & Stone, 2008; Kang & Chen, 2011). However, these item fit statistics have been found useful with posterior predictive model checking for detecting violations in the form of the IRT model. For example, Zhu (2009) simulated responses to models with boundary category curves that did not follow the logistic functions under the graded model. Under these conditions, Stone’s item fit statistic exhibited high power in detecting the modeled misfitting items and performed better than Yen’s Q₁index with posterior predictive model checking.

This study also confirmed previous findings for posterior predictive model checking in the context of the graded IRT model and performance assessments. For example, conservativeness of the posterior predictive model checking procedure under the null condition was confirmed. The effectiveness of pair-wise measures in assessing the violation of unidimensionality, and the uselessness of the item-score distribution measure were also confirmed. In addition, this study further provides support for using graphical plots to present the posterior predictive model checking results.

The plots may be more appealing than tables of posterior predictive p-values. Re- searchers may also be able to discern patterns from the plots which may indicate the applicability of an alternative model. For example, as shown in Condition 2, when a unidimensional graded model was estimated with two-dimensional data, the pie plots displayed two clear item clusters, implying that a two-dimensional model would be more appropriate.

An important implication from the above findings is that the choice of the discrepancy measures is important in posterior predictive model checking applications. In practice, researchers should consider the potential threats to model misfit, and choose measures most sensitive to these sources of misfit. In addition, as suggested by Sin- haray et al. (2006), a number of measures should be used in the assessment of model fit.

(16)

Zhu and Stone

Though the conditions in this study were carefully designed, some factors were fixed at values realistic to typical performance assessments. Therefore, results may not generalize to other conditions not considered in the current study. For example, this study was limited in terms of the length of tests (15 items), number of response categories (5-category), specific polytomous model estimated (graded model), and number of modeled dimensions (2 dimensions). The effect of factors such as sample size, number of items, number of dimensions, and other multidimensional structures could be further explored in the context of Bayesian evaluation of IRT models with polytomous items. Another possible limitation to the design was that 20 replications at each combination of experimental conditions were used to compute Type I error and empirical power rates. Although this was a small number, it was consistent with previous research focusing on Bayesian methods and posterior predictive model checking. While more replications could be used to obtain more reliable results, the general patterns in the results that were observed would not likely change with addi- tional replications.

References

Agresti, A. (2002). Categorical data analysis. Hoboken, NJ: John Wiley.

B´eguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541–562.

Chen, W., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289.

De Ayala, R. J. (1994). The influence of dimensionality on the graded response model. Applied Psychological Measurement, 18, 155–170.

Fu, J., Bolt, D. M., & Li, Y. (2005). Evaluating item fit for a polytomous Fusion model us- ing posterior predictive checks. Paper presented at the meeting of the National Council on Measurement in Education, Montreal, Canada.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis. New York, NY: Chapman & Hall.

Glas, C. A. W., & Meijer, R. R. (2003). A Bayesian approach to person fit analysis in item response theory models. Applied Psychological Measurement, 27(3), 217–233.

Kang, T., & Chen, T. T. (2011). Performance of the generalized S-X² item-fit index for the graded response model. Asia Pacific Education Review. In press. Available at http://www.springerlink.com/content/n2570055685w849l/.

Lane, S., Stone, C. A., Ankenmann, R. D., & Liu, M. (1995). Examination of the assumptions and properties of the graded item response model: An example using a mathematics performance assessment. Applied Measurement in Education, 8, 313–340.

Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: American Council on Education/Praeger.

Levy, R., Mislevy, R. J., & Sinharay, S. (2009). Posterior predictive model checking for mul- tidimensionality in item response theory. Applied Psychological Measurement, 33(7), 519–

537.

Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30(1), 3–21.

McDonald, R. P., & Mok, M. M.-C. (1995). Goodness of fit in item response models. Multi- variate Behavioral Research, 30, 23–40.

Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50–64.

(17)

Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data.

Journal of Educational and Behavioral Statistics, 27, 341–384.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151–1172.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.

Psychometrika Monograph Supplement No. 17.

Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42, 375–394.

Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory mod- els. British Journal of Mathematical & Statistical Psychology, 59, 429–449.

Sinharay, S., Johnson, M. S., & Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Applied Psychological Measurement, 30, 298–321.

Spiegelhalter, D. J., Thomas, A., Best, N., & Lunn, D. (2003). WINBUGS Version 1.4 User’s manual [Computer software manual]. Cambridge, UK: MRC Biostatistics Unit.

Stone, C. A. (2000). Monte Carlo based null distribution for an alternative goodness-of-fit test statistic in IRT models. Journal of Educational Measurement, 37(1), 58–75.

Wang, X., Bradlow, E. T., & Wainer, H. (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26, 109–128.

Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psycho- logical Measurement, 5, 245–262.

Yen, W. M. (1993). Scaling performance assessments: strategies for managing local Item de- pendence. Journal of Educational Measurement, 30, 187–213.

Zhang, B., & Stone, C. A. (2008). Evaluating item fit for multidimensional item response models. Educational and Psychological Measurement, 68, 181–196.

Zhu, X. (2009). Checking fit of item response models for performance assessments using Bayesian analysis. Unpublished dissertation. University of Pittsburgh.

Authors

XIAOWEN ZHU is a Psychometrician, Data Recognition Corporation, 13490 Bass Lake Road, Maple Grove, MN 55311; [email protected]. Her primary research interests include item response theory and Bayesian methods.

CLEMENT A. STONE is a Professor at the Department of Psychology in Education, School of Education, University of Pittsburgh, 5920 Wesley W. Posvar Hall, 230 South Bouquet Street, Pittsburgh, PA 15260; [email protected]. His primary research interests include psycho- metric methods.