**Model Diagnostics for Bayesian Networks**

**Sandip Sinharay**
*Educational Testing Service*

*Bayesian networks are frequently used in educational assessments, primarily for learning about students' knowledge and skills. There is a lack of research on assessing the fit of Bayesian networks. This article employs the posterior predictive model checking method, a popular Bayesian model checking tool, to assess the fit of simple Bayesian networks. A number of aspects of model fit, those of usual interest to practitioners, are assessed using various diagnostic tools. This article suggests a direct data display for assessing overall fit, suggests several diagnostics for assessing item fit, suggests a graphical approach to examine whether the model can explain the association among the items, and suggests a version of the Mantel–Haenszel statistic for assessing differential item functioning. Limited simulation studies and a real data application demonstrate the effectiveness of the suggested model diagnostics.*

*Keywords: discrepancy measure, Mantel–Haenszel statistic, p-values, posterior predictive*
*model checking*

**1. Introduction**

Bayesian inference networks or Bayesian networks (BIN or BN; Pearl, 1988) are frequently used in educational assessments (Mislevy, 1995; Mislevy, Almond, Yan, & Steinberg, 2001; Mislevy, Steinberg, & Almond, 2003; Weaver & Junker, 2003), especially in the context of evidence-centered design (ECD; Almond & Mislevy, 1999), for learning about students' knowledge and skills given information about their performance in an assessment.

A BN is an appropriate tool when an examination of the items in an assessment
suggests that solving each item requires at least a certain level of one or more of a
number of prespeciﬁed skills. A BN ﬁrst assumes that for each skill, each exami-
nee has a corresponding discrete proﬁciency variable that states the skill level of
the examinee. The next step in a BN is to describe a probability model for the

This research was supported by Educational Testing Service. The author thanks Shelby Haberman, Russell Almond, Robert Mislevy, Paul Holland, Andrew Gelman, Hal Stern, Matthias von Davier, the two anonymous referees, and Howard Wainer for their invaluable advice. The author also thanks Andrew Gelman and Michael Friendly for their help in creating some of the graphical plots. The author is grateful to Kikumi Tatsuoka for providing the mixed number subtraction data set. The author grate- fully acknowledges the help of Rochelle Stern and Kim Fryer with proofreading. Any opinions expressed in this article are those of the author and not of Educational Testing Service.

response of an examinee to an item based on his or her attainment of the skills required to solve the item. For example, a simple BN might assume that the skill variables are binary (i.e., one either has the skill or does not), that the examinees who have all the required skills for item j have a certain probability π_j1 of solving the item, while those who do not have the required skills have probability π_j0 (where π_j0 < π_j1) of solving the item. Finally, a BN specifies prior distributions over the skill variables, usually with the help of a graphic model (Pearl, 1988), and over the item parameters.

Model checking for BNs is not straightforward. For even a moderately large number of items, the possible number of response patterns is huge, and the standard χ² test of goodness of fit does not apply directly. Standard item response theory (IRT) model diagnostics do not apply to BNs because of the presence of discrete and multivariate proficiency variables in the latter models. As a result, the use of statistical diagnostic tools in this context is notably lacking (Williamson, Almond, & Mislevy, 2000).

Recently, Yan, Mislevy, and Almond (2003) and Sinharay, Almond, and Yan (2004) suggested a number of model checking tools for BNs, including item fit plots and an item-fit test statistic. However, a number of questions remain unanswered in this area. Hambleton (1989) comments that the assessment of fit of IRT models usually involves collecting a wide variety of evidence about model fit and then making an informed judgment about model fit and the usefulness of a model with a particular set of data. Hambleton's comment applies equally well to BNs but is yet to be addressed completely for these models. For example, there is no work yet assessing differential item functioning (DIF) for BNs. In summary, there is significant scope for further research in this area.

The posterior predictive model checking (PPMC) method (Rubin, 1984) is a
popular Bayesian model checking tool because of its simplicity, strong theoretical
basis, and obvious intuitive appeal. The method primarily consists of comparing the
*observed data with replicated data—those predicted by the model—using a num-*
*ber of discrepancy measures; a discrepancy measure is like a test statistic in a fre-*
quentist analysis that measures the difference between an aspect of the observed
data set and a replicated data set. This article applies the PPMC method to assess
several aspects of ﬁt of BNs.

First, this article suggests a direct data display (like those suggested by Gelman, Carlin, Stern, & Rubin, 2003) as a quick overall check of model ﬁt. This article then uses the PPMC method to examine item ﬁt for BNs using a number of discrepancy measures.

Sinharay and Johnson (2003) and Sinharay (2005) apply the PPMC method to ﬁnd the odds ratios (e.g., Agresti, 2002) corresponding to the responses to pairs of items to be useful discrepancy measures in detecting misﬁt of the unidimensional IRT models. This article uses the PPP-values for the odds ratios to check whether the BNs can explain the association among the items adequately.

This article then assesses DIF using the Mantel–Haenszel test suggested by Holland (1985). The test uses “matching groups” based on raw scores of examinees;


however, these groups may not be homogeneous in the context of BNs. Therefore, this article applies the PPMC method using a discrepancy based on a version of the Mantel–Haenszel test that uses the equivalence class memberships (rather than raw scores) to form the “matching groups.”

The BNs deal with latent variables; hence there is a possibility of nonidentifiability or weak identifiability of some parameters. Weak identifiability may lead to nonapplicability of standard asymptotic theory, problems with MCMC convergence, and misleading inferences. Therefore, a diagnostic for model identifiability, although not directly a model-checking tool, is a necessary tool for assessing the performance of a BN. This article uses plots of prior versus posterior distributions of the model parameters to assess their identifiability.

Section 2 introduces the BNs. Section 3 introduces the mixed-number subtraction example (Tatsuoka, 1990) that will form the basis of a number of analyses later.

Section 4 provides a brief discussion of the PPMC method and the PPP-value. Section 5 introduces the discrepancy measures to be used for assessing different aspects of model fit. Section 6 applies the direct data display to the mixed-number data and a simulated data set. Section 7 applies the suggested item fit measures to the mixed-number subtraction data set and a simulated data set. Section 8 discusses results of examining association among item pairs. Section 9 discusses results of the DIF analysis. Section 10 assesses identifiability of the parameters. The conclusions of this study are discussed in Section 11.

**2. Bayesian Networks**

Increasingly, users of educational assessments want more than a single sum-
mary statistic out of an assessment. They would like to see a proﬁle of the state of
acquisition of a variety of knowledge, skills, and proﬁciencies for each learner. One
*technique for profile scoring is the use of a BN.*

The idea of a Bayesian inference network (BIN), also known as a Bayesian network or Bayes net (BN) (Pearl, 1988), comes primarily from the theory of graphical models, where it is defined as an acyclic directed graph in which each node represents a random variable, or uncertain quantity, which can take two or more possible values. To a statistician, BNs are statistical models that describe a number of observations from a process with the help of a number of latent (or unobserved) categorical variables. In the context of educational assessments, BNs are appropriate when the task analysis of an assessment suggests that the satisfactory completion of any task requires an examinee to have at least a certain level of a number of skills.

The unobserved (or latent) variables, representing the level of the skills in the examinees, affect the probability distribution of the observed responses in a BN. BNs are so called because they use Bayes's rule for inference. Mislevy (1995), Almond and Mislevy (1999), Mislevy et al. (2001), Mislevy et al. (2003), and Weaver and Junker (2003) show how BNs may be useful in psychometrics.

Consider an assessment with I examinees and J items. Suppose X_ij denotes the response of the i-th examinee to the j-th item. In general, X_ij can be a vector-valued quantity; that is, a single "task" could consist of multiple "items." It can contain polytomous outcomes as well. In our example, each item produces a single dichotomous outcome, which is 1 if the response is correct and 0 if it is incorrect. Denote the ability/skill variable vector for the i-th examinee as θ_i = (θ_i1, θ_i2, . . . , θ_ip)′, with θ_ik denoting the skill level of Examinee i for Skill k, and the item parameter vector for the j-th item as π_j = (π_j1, π_j2, . . . , π_jm)′. A standard assumption is that of local independence, that is, independence of the responses conditional on the skills. Suppose that a cognitive analysis of the items produced a Q-matrix (e.g., Tatsuoka, 1990), an incidence matrix showing for each item the skills required to solve that item. The next step is to define P(X_ij = 1 | θ_i, π_j). As an example of a simple BN, assume that solving the j-th item needs a number of skills, define θ_i to consist of the indicators of attainment of the skills, let π_j = (π_j0, π_j1)′, and choose

$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, \boldsymbol{\pi}_j) = \begin{cases} \pi_{j1} & \text{if the examinee has the skills required to solve item } j,\\ \pi_{j0} & \text{otherwise.} \end{cases}$$
^{j1}To perform a Bayesian analysis of a BN, one needs a prior (population) distribu-
*tion p(***) on the ability variables . Mislevy (1995) showed that forming a graphic**
model (Lauritzen & Spiegelhalter, 1988; Pearl, 1988) over the skill variables provides
a convenient way to specify this distribution.

BNs are similar to the deterministic inputs, noisy "and" gate (DINA) model of Junker and Sijtsma (2001) (the only difference between the two being the graphic model for p(θ) in a BN), the noisy inputs, deterministic "and" gate (NIDA) model of Maris (1999), the higher-order latent trait models (de la Torre & Douglas, 2004), and the fusion model of Hartz, Roussos, and Stout (2002), all of which fall in the family of cognitive diagnostic models.
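To make the structure concrete, the sketch below (illustrative code, not from the article; the function name, the example Q-matrix row, and the parameter values are assumptions) computes the conditional response probability of the simple BN above from a Q-matrix row.

```python
import numpy as np

def prob_correct(theta, q_row, pi_1, pi_0):
    """Simple-BN response probability: pi_1 if the examinee's binary
    skill vector `theta` covers every skill the item requires (the 1s
    of the Q-matrix row `q_row`), and pi_0 otherwise."""
    has_all_skills = np.all(theta >= q_row)
    return pi_1 if has_all_skills else pi_0

# Illustrative item requiring Skills 1 and 3 (out of five skills)
q_row = np.array([1, 0, 1, 0, 0])
theta = np.array([1, 0, 1, 1, 0])  # examinee has Skills 1, 3, and 4
print(prob_correct(theta, q_row, pi_1=0.85, pi_0=0.15))  # -> 0.85
```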

**3. The Mixed-Number Subtraction Example**

Here, we introduce what will be a running example used throughout this article, one regarding a test of mixed-number subtraction. This example is grounded in a cognitive analysis of middle school students' solutions of mixed-number subtraction problems. The focus here is on the students learning to use the following method (discussed in, e.g., Tatsuoka, Linn, Tatsuoka, & Yamamoto, 1988):

*Method B: Separate mixed numbers into whole number and fractional parts; subtract as two subproblems, borrowing one from the minuend whole number if necessary; and then simplify and reduce if necessary.*

As is usual with a typical application of a BN, the analysis started with a cognitive analysis (e.g., Tatsuoka, 1984) of such items to determine the "attributes" that are important for solving different kinds of problems. The cognitive analysis mapped out a flowchart for applying Method B to a universe of fraction-subtraction problems. A number of key procedures appear, a subset of which is required to solve a given problem according to its structure. To simplify the model, we eliminate the items for which the fractions do not have a common denominator (leaving us with


15 items). The procedures are as follows: (a) Skill 1: Basic fraction subtraction;

(b) Skill 2: Simplify/reduce fraction or mixed number; (c) Skill 3: Separate whole number from fraction; (d) Skill 4: Borrow one from the whole number in a given mixed number; and (e) Skill 5: Convert a whole number to a fraction. The cognitive analysis identiﬁed Skill 3 as a prerequisite of Skill 4; that is, no students have Skill 4 without also having Skill 3. Thus, there are only 24 possible combinations of the ﬁve skills that a given student can possess.

We consider the responses of 325 examinees from a data set discussed in,
for example, Tatsuoka (1984) and Tatsuoka et al. (1988). This data set played a
*central role in the development of the rule space methodology (e.g., Tatsuoka,*
1990). The 325 examinees considered here were found to have used Method B
(Mislevy, 1995). Table 1 lists 15 items from the data set that will be studied here,
characterized by the skills they require in the column “Skills” that also represents
*the Q-matrix. An “x” for a skill denotes that the skill is required by the item while*
an “o” denotes the opposite. For example, Item 9 requires Skills 1, 3, and 4. The
table also shows the proportion-correct scores (“Prop. Corr.”) for each item. The
items are ordered with respect to an increasing order of difficulty (or decreasing
order of proportion-correct score). The last five columns of the table are dis-
*cussed later. Many rows of the Q-matrix are identical, corresponding to a group*
of items that require the same set of skills to solve. Following the terminology
of evidence-centered design (ECD; Almond & Mislevy, 1999), we call the patterns corresponding to the rows evidence models (EM); these are shown in Column 4 of the table.

Sinharay et al. (2004) discussed the fact that Items 5 and 6 may not require any mixed number subtraction skills; for example, an examinee may solve Item 5 just by noticing that any number minus the same number results in zero.

Certain patterns of skills will be indistinguishable on the basis of the results of
this test. For example, because every item requires Skill 1, the 12 proﬁles (i.e., skill
patterns) that lack Skill 1 are indistinguishable on the basis of this data. Similar
*logic reveals that there are only nine identiﬁable equivalence classes of student*
proﬁles. This property of the test design will manifest itself later in the analysis.

Table 2 describes the classes by relating them to the EMs. Often, distinctions among members of the same equivalence class are instructionally irrelevant. For example, students judged to be in Equivalence Class 1 would all be assigned reme- dial work in basic subtraction, so no further distinction is necessary.
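The collapsing of the skill profiles into nine equivalence classes can be verified directly; the following sketch (an illustration under the stated Q-matrix, not code from the article) enumerates the admissible profiles and groups them by the evidence models they can solve.

```python
from itertools import product

# Skill requirements of the six evidence models (EMs), read off Table 1;
# Skill 3 is a prerequisite of Skill 4.
EM_SKILLS = [{1}, {1, 2}, {1, 3}, {1, 3, 4}, {1, 3, 4, 5}, {1, 2, 3, 4}]

profiles = [p for p in product([0, 1], repeat=5)
            if not (p[3] == 1 and p[2] == 0)]  # no Skill 4 without Skill 3
print(len(profiles))  # -> 24 admissible skill profiles

def solvable(profile):
    """Indicators of whether a profile covers the skills of each EM."""
    owned = {k + 1 for k, has in enumerate(profile) if has}
    return tuple(req <= owned for req in EM_SKILLS)

classes = {solvable(p) for p in profiles}
print(len(classes))  # -> 9 identifiable equivalence classes
```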

Mislevy (1995) analyzed the data using a BN. Let θ_i = (θ_i1, . . . , θ_i5) denote the attribute/skill vector for Examinee i. In the context of ECD, where the BNs have found most use, the θ_ik s are called proficiency variables. These are latent variables describing the knowledge, skills, and abilities of the examinees we wish to draw inferences about. (The term variable is used to emphasize its person-specific nature.)

*Mixed-Number Subtraction Link Models*

The “link models” in ECD terminology state the likelihood distribution of the data
given the model parameters and latent variables.

TABLE 1

*Skill Requirements and Some P-Values for the Mixed-Number Subtraction Problems*

| Item No. | Skills (1 2 3 4 5) | EM | Prop. Corr. | r_pbis | P_pbis | P_{D_j^{eq,χ}(y,ω)} | P_{D_j^{raw,χ}(y,ω)} | P_{D_j^{raw,G}(y,ω)} |
|---|---|---|---|---|---|---|---|---|
| 1 | x o o o o | 1 | .79 | .50 | .40 | .74 | .26 | .19 |
| 5 | x o o o o | 1 | .70 | .52 | .00* | .00* | .00* | .00* |
| 4 | x x o o o | 2 | .71 | .56 | .10 | .43 | .06 | .23 |
| 2 | x o x o o | 3 | .75 | .59 | .14 | .28 | .05 | .03* |
| 3 | x o x o o | 3 | .74 | .61 | .02* | .19 | .03* | .02* |
| 6 | x o x o o | 3 | .69 | .23 | .08 | .30 | .00* | .00* |
| 9 | x o x x o | 4 | .37 | .80 | .05 | .06 | .07 | .11 |
| 10 | x o x x o | 4 | .37 | .76 | .45 | .56 | .42 | .30 |
| 11 | x o x x o | 4 | .34 | .74 | .36 | .49 | .06 | .08 |
| 14 | x o x x o | 4 | .31 | .77 | .43 | .41 | .30 | .28 |
| 7 | x o x x o | 4 | .41 | .70 | .08 | .04* | .01* | .04* |
| 8 | x o x x x | 5 | .38 | .71 | .00* | .00* | .00* | .03* |
| 12 | x o x x x | 5 | .33 | .66 | .04* | .04* | .00* | .00* |
| 15 | x o x x x | 5 | .26 | .75 | .06 | .52 | .30 | .38 |
| 13 | x x x x o | 6 | .31 | .78 | .23 | .49 | .53 | .52 |

*Note.* All p-values are for the 2LC model; an asterisk marks p-values significant at the 5% level.

It is assumed that a participant needs to have mastered all of the skills required for an item in order to solve it. In an application, however, we will get false-positive and false-negative results. The link model described below follows that intuition. The BN applied by Mislevy (1995), referred to as the 2-parameter latent class (2LC) model hereafter, uses two parameters for each item, the true-positive and false-positive probabilities, in

$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, \boldsymbol{\pi}_j) = \begin{cases} \pi_{j1} & \text{if Examinee } i \text{ mastered all the skills needed to solve Item } j,\\ \pi_{j0} & \text{otherwise.} \end{cases} \quad (1)$$

Suppose the j-th item uses Evidence Model s, s = 1, 2, . . . , 6. Although s is determined by the item, this notation does not reflect that. Let δ_i(s) be the 0/1 indicator denoting whether Examinee i has mastered the skills needed for tasks using Evidence Model s or not. The quantity δ_i(s) for any examinee is completely determined by the values of θ_1, θ_2, . . . , θ_5 for that examinee. The likelihood of the response of the i-th examinee to the j-th item is then expressed as

$$X_{ij} \mid \boldsymbol{\pi}_j, \delta_{i(s)} \sim \text{Bernoulli}\left(\pi_{j\,\delta_{i(s)}}\right). \quad (2)$$

The model relies on the assumption of local independence; that is, given the proficiency θ_i, the responses of an examinee to the different items are assumed independent. The probability π_j1 represents a "true-positive" probability for the item; that is, it is the probability of getting the item right for students who have mastered all of the required skills. The probability π_j0 represents a "false-positive" probability; it is the probability of getting the item right for students who have yet to master at least one of the required skills. The probabilities π_j0 and π_j1 are allowed to differ over j (i.e., from item to item). However, we use the same independent priors for all items:

$$\pi_{j0} \sim \text{Beta}(3.5,\ 23.5), \qquad \pi_{j1} \sim \text{Beta}(23.5,\ 3.5). \quad (3)$$


TABLE 2

*Description of Each Equivalence Class, With the Evidence Models Whose Items Should Be Solved by Examinees of the Class*

| Equivalence Class | Description | EM 1 | EM 2 | EM 3 | EM 4 | EM 5 | EM 6 |
|---|---|---|---|---|---|---|---|
| 1 | No Skill 1 |  |  |  |  |  |  |
| 2 | Only Skill 1 | x |  |  |  |  |  |
| 3 | Skills 1 & 3 | x |  | x |  |  |  |
| 4 | Skills 1, 3, & 4 | x |  | x | x |  |  |
| 5 | Skills 1, 3, 4, & 5 | x |  | x | x | x |  |
| 6 | Skills 1 & 2 | x | x |  |  |  |  |
| 7 | Skills 1, 2, & 3 | x | x | x |  |  |  |
| 8 | Skills 1, 2, 3, & 4 | x | x | x | x |  | x |
| 9 | All Skills | x | x | x | x | x | x |

*Mixed-Number Subtraction Proficiency Model*

The prior distribution of the proficiency variables is known as the *proficiency model* in ECD literature. In a BN, the prior distribution P(θ) is expressed as a discrete Bayesian network (this is where the name of the model comes from) or graphic model (Lauritzen & Spiegelhalter, 1988; Pearl, 1988). Prior analyses revealed that Skill 3 is a prerequisite to Skill 4. A three-level auxiliary variable θ_WN incorporates this constraint. Level 0 of θ_WN corresponds to participants who have mastered neither skill; Level 1 represents participants who have mastered Skill 3 but not Skill 4; Level 2 represents participants who have mastered both skills.

The dependence relationship (see, e.g., Figure 2 of Sinharay et al., 2004, for a graphic representation of the relationship) among the skill variables provided by the expert analysis corresponds to the factorization

$$p(\boldsymbol{\theta}) = p(\theta_3 \mid \theta_{WN})\, p(\theta_4 \mid \theta_{WN})\, p(\theta_{WN} \mid \theta_1, \theta_2, \theta_5)\, p(\theta_5 \mid \theta_1, \theta_2)\, p(\theta_2 \mid \theta_1)\, p(\theta_1).$$

The parameters λ are defined as follows:

$$\begin{aligned}
\lambda_1 &= P(\theta_1 = 1);\\
\lambda_{2,m} &= P(\theta_2 = 1 \mid \theta_1 = m), \quad m = 0, 1;\\
\lambda_{5,m} &= P(\theta_5 = 1 \mid \theta_1 + \theta_2 = m), \quad m = 0, 1, 2; \text{ and}\\
\lambda_{WN,m,n} &= P(\theta_{WN} = n \mid \theta_1 + \theta_2 + \theta_5 = m), \quad m = 0, 1, 2, 3,\ n = 0, 1, 2.
\end{aligned}$$

Finally, one requires prior distributions P(λ). We assume that λ_1, λ_2, λ_5, and λ_WN are a priori independent.

The natural conjugate priors for the components of λ are either Beta or Dirichlet distributions. The hyperparameters are chosen, as in Mislevy et al. (2001), so that they sum to 27 (relatively strong numbers given the sample size of 325). With such a complex latent structure, strong priors such as those here are necessary to prevent problems with identifiability. These must be supported by relatively expensive elicitation from the experts. The prior probabilities give a chance of 87% of acquiring a skill when the previous skills are mastered and 13% of acquiring the same skill when the previous skills are not mastered. They are given below:

$$\begin{gathered}
\lambda_1 \sim \text{Beta}(23.5,\ 3.5); \quad \lambda_{2,0} \sim \text{Beta}(3.5,\ 23.5); \quad \lambda_{2,1} \sim \text{Beta}(23.5,\ 3.5);\\
\lambda_{5,0} \sim \text{Beta}(3.5,\ 23.5); \quad \lambda_{5,1} \sim \text{Beta}(13.5,\ 13.5); \quad \lambda_{5,2} \sim \text{Beta}(23.5,\ 3.5);\\
\lambda_{WN,0,\cdot} \sim \text{Dirichlet}(15,\ 7,\ 5); \quad \lambda_{WN,1,\cdot} \sim \text{Dirichlet}(11,\ 9,\ 7);\\
\lambda_{WN,2,\cdot} \sim \text{Dirichlet}(7,\ 9,\ 11); \quad \lambda_{WN,3,\cdot} \sim \text{Dirichlet}(5,\ 7,\ 15).
\end{gathered}$$
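As an illustration of the proficiency model, the sketch below (hypothetical code; the function and parameter names, and the example values set at the prior means, are assumptions) draws one skill profile from the factorization above.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_profile(lam1, lam2, lam5, lam_wn):
    """Draw (theta_1, ..., theta_5) from the proficiency model; lam2[m],
    lam5[m], and lam_wn[m] are the conditional probabilities given that
    m of the conditioning skills are mastered."""
    t1 = rng.random() < lam1
    t2 = rng.random() < lam2[int(t1)]
    t5 = rng.random() < lam5[int(t1) + int(t2)]
    # theta_WN in {0, 1, 2} encodes (theta_3, theta_4): 0 = neither,
    # 1 = Skill 3 only, 2 = both (Skill 3 is a prerequisite of Skill 4)
    wn = rng.choice(3, p=lam_wn[int(t1) + int(t2) + int(t5)])
    t3, t4 = wn >= 1, wn == 2
    return tuple(int(t) for t in (t1, t2, t3, t4, t5))

# Illustrative parameter values, roughly at the prior means
profile = draw_profile(
    lam1=0.87,
    lam2=[0.13, 0.87],
    lam5=[0.13, 0.50, 0.87],
    lam_wn=[[15/27, 7/27, 5/27], [11/27, 9/27, 7/27],
            [7/27, 9/27, 11/27], [5/27, 7/27, 15/27]],
)
print(profile)
```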

*Fitting the 2LC Model to the Mixed-Number Subtraction Data*

The WinBUGS software (Spiegelhalter, Thomas, & Best, 1999) is used to fit the model to the data set with an MCMC algorithm. For details of the model fitting, see, for example, Sinharay et al. (2004). Five chains of size 2,000 each are used, after a burn-in of 1,000 each, resulting in a final posterior sample size of 10,000.

**4. Posterior Predictive Model Checking Method**

Let p(y | ω) denote the likelihood distribution for a statistical model applied to data (examinee responses in this context) y, where ω denotes all the parameters in the model. Let p(ω) be the prior distribution on the parameters. Then the posterior distribution of ω is

$$p(\omega \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \omega)\, p(\omega)}{\int p(\mathbf{y} \mid \omega)\, p(\omega)\, d\omega}.$$

The posterior predictive model checking (PPMC) method (Rubin, 1984, building on the work in Guttman, 1967) suggests checking a model using the posterior predictive distribution of replicated data y^rep,

$$p(\mathbf{y}^{rep} \mid \mathbf{y}) = \int p(\mathbf{y}^{rep} \mid \omega)\, p(\omega \mid \mathbf{y})\, d\omega, \quad (4)$$

as a reference distribution for the observed data y.

The basis of the PPMC method is to compare the observed data y to its reference distribution (Equation 4). In practice, test quantities or discrepancy measures D(y, ω) are defined (Gelman, Meng, & Stern, 1996), and the posterior distribution of D(y, ω) is compared to the posterior predictive distribution of D(y^rep, ω), with any significant difference between them indicating a model failure. A researcher may use D(y, ω) = D(y), a discrepancy measure depending on the data only, if appropriate.

A popular summary of the comparison is the tail-area probability or posterior predictive p-value (PPP-value), the Bayesian counterpart of the classical p-value:

$$p = P\left[ D(\mathbf{y}^{rep}, \omega) \ge D(\mathbf{y}, \omega) \mid \mathbf{y} \right] = \iint I_{\left[ D(\mathbf{y}^{rep}, \omega) \ge D(\mathbf{y}, \omega) \right]}\, p(\mathbf{y}^{rep} \mid \omega)\, p(\omega \mid \mathbf{y})\, d\mathbf{y}^{rep}\, d\omega, \quad (5)$$

where I_[A] denotes the indicator function for the event A.

Because of the difficulty in dealing with Equations 4 or 5 analytically for all but simple problems, Rubin (1984) suggests simulating replicated data sets from the posterior predictive distribution in practical applications of the PPMC method. One draws N simulations ω^1, ω^2, . . . , ω^N from the posterior distribution p(ω | y) of ω, and draws y^rep,n from the likelihood distribution p(y | ω^n), n = 1, 2, . . . , N.


The process results in N draws from the joint posterior distribution p(y^rep, ω | y) and, equivalently, from p(y^rep | y). The expression in Equation 5 implies that the proportion of the N replications for which D(y^rep,n, ω^n) exceeds D(y, ω^n) provides an estimate of the PPP-value. Extreme PPP-values (close to 0, or 1, or both, depending on the nature of the discrepancy measure) indicate model misfit.
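A minimal sketch of this estimation process, assuming the posterior draws and the replicated data sets are already available as arrays (all names, and the simple data-only discrepancy used for illustration, are assumptions):

```python
import numpy as np

def ppp_value(discrepancy, y_obs, y_reps, omegas):
    """Proportion of replications n with D(y_rep_n, omega_n) >= D(y, omega_n),
    an estimate of the posterior predictive p-value."""
    exceed = [discrepancy(y_rep, om) >= discrepancy(y_obs, om)
              for y_rep, om in zip(y_reps, omegas)]
    return float(np.mean(exceed))

# Illustrative use with fake draws and a data-only discrepancy D(y)
rng = np.random.default_rng(1)
y_obs = rng.integers(0, 2, size=(325, 15))  # 0/1 response matrix
omegas = [None] * 100                       # placeholder posterior draws
y_reps = [rng.integers(0, 2, size=(325, 15)) for _ in omegas]
print(ppp_value(lambda y, om: y.sum(), y_obs, y_reps, omegas))
```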

Gelman et al. (1996) suggest that the preferable way to perform PPMC is to compare the realized discrepancies D(y, ω^n) and the replicated/predicted discrepancies D(y^rep,n, ω^n) by plotting the pairs {D(y, ω^n), D(y^rep,n, ω^n)}, n = 1, 2, . . . , N, in a scatterplot.

Researchers like Robins, van der Vaart, and Ventura (2000) have shown that PPP-values are conservative (i.e., they often fail to detect model misfit), even asymptotically, for some choices of discrepancy measure, e.g., when the discrepancy measure is not centered. However, a conservative diagnostic with reasonable power is better than tests with unknown properties or poor Type I error rates, because evidence of misfit often leads to removal of test items from the item pool, and it may be to the advantage of the administrators to fail to reject the hypothesis for a borderline item and thus retain the item for future use. Gelman et al. (1996) comment that the PPMC method is useful if one thinks of the current model as a plausible ending point, with modifications to be made only if substantial lack of fit is found; we are dealing with exactly such applications here.

Empirical and theoretical studies suggest that PPP-values generally have rea- sonable long-run frequentist properties (Gelman et al., 1996, p. 754). The PPMC method combines well with the Markov chain Monte Carlo (MCMC) algorithms (e.g., Gelman et al., 2003).

**5. The Suggested Model Diagnostics**

A discrepancy measure that many will be tempted to apply here is the χ²-type measure

$$\sum \frac{\left[ X_{ij} - E(X_{ij} \mid \boldsymbol{\pi}_j, \boldsymbol{\theta}_i) \right]^2}{\text{Var}(X_{ij} \mid \boldsymbol{\pi}_j, \boldsymbol{\theta}_i)},$$

where the summation is over the examinees, items, or both. Although measures of this type are intuitive and useful for other types of models, this measure was found not useful by Yan et al. (2003) for the 2LC model. The presence of the latent variable vector θ_i in the 2LC model ensures that the residual for an individual cannot be too unusual, and even an inadequate model predicts the χ²-type quantity adequately; the same phenomenon is observed by Gelman, Goegebeur, Tuerlinckx, and van Mechelen (2000) in a discrete data regression model. The χ²-type measure was also found to be of little use when computed using E(X_ij | π_j) and Var(X_ij | π_j), i.e., when the expectation and variance of X_ij are computed by marginalizing over the distribution of θ_i. Therefore, the χ²-type measure is not considered further in this article. The remainder of this section describes the measures that were found useful in this research.

*Direct Data Display*

A suitable display showing the observed data and a few replicated data sets may be a powerful tool to provide a rough idea about model fit by revealing interesting patterns of differences between the observed and replicated data (e.g., Gelman et al., 2003). For large data sets, however, plotting all data points is prohibitive, but plots for a sample (for example, a few examinees each from the low-scoring, middle, and high-scoring groups) are possible.
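A sketch of such a display with matplotlib follows (an assumption-laden illustration: the sorting conventions mimic Figure 1 below, and all names are made up here).

```python
import matplotlib.pyplot as plt
import numpy as np

def data_display(y_obs, y_reps, n_panels=4):
    """Show the observed 0/1 response matrix beside a few replicated
    matrices, examinees sorted by raw score, items by proportion correct."""
    rows = np.argsort(y_obs.sum(axis=1))
    cols = np.argsort(-y_obs.mean(axis=0))
    panels = [y_obs] + list(y_reps[:n_panels - 1])
    titles = ["observed"] + ["replicated"] * (n_panels - 1)
    fig, axes = plt.subplots(1, n_panels, figsize=(3 * n_panels, 4))
    for ax, y, title in zip(axes, panels, titles):
        ax.imshow(y[rows][:, cols], aspect="auto", cmap="gray_r")
        ax.set_title(title)
        ax.set_xlabel("items")
    axes[0].set_ylabel("examinees")
    plt.show()
```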

*Item Fit Analysis*

This article first examines whether the fitted model can predict a set of simple and natural quantities, the point biserial correlations (e.g., Lord, 1980, p. 33), that is, the simple correlations between the item scores and raw (number-correct) scores of the examinees. Sinharay and Johnson (2003) and Sinharay (2005) apply a similar approach successfully to the unidimensional IRT models.
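A minimal sketch of this quantity (illustrative names; the response matrix is assumed to be a NumPy 0/1 array):

```python
import numpy as np

def point_biserials(y):
    """Point biserial correlation of each item score with the raw
    (number-correct) score, for a 0/1 response matrix y."""
    raw = y.sum(axis=1)
    return np.array([np.corrcoef(y[:, j], raw)[0, 1]
                     for j in range(y.shape[1])])
```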

*Item Fit Measures Based on Equivalence Class Membership*

Here, this article examines item fit measures based on equivalence class memberships of the examinees. We group the examinees according to the equivalence classes because the equivalence class membership of an examinee, although latent, is a natural quantity in this context. The natural groups are then the nine equivalence classes defined in Table 2. Equivalence Class 1 represents no skills and Equivalence Class 9 represents all skills. The classes in between are roughly ordered in order of increasing skills; however, the ordering is only partial. Even though the memberships are unknown (latent) quantities, the Markov chain Monte Carlo (MCMC) algorithm used for fitting the model produces values of the latent abilities in each iteration, which determine the equivalence class memberships and allow the grouping.

The first discrepancy measure examined is the proportion of examinees Õ_jk in equivalence class k, k = 1, 2, . . . , K = 9, who answer item j correctly, j = 1, 2, . . . , J. The quantity Õ_jk is not a truly observed quantity (reflected in the notation) but depends on the parameter values of the model (that is why they are referred to as discrepancy measures here, as in Gelman et al., 1996). They become known if one knows the equivalence class memberships of the examinees, which happens in each iteration of the MCMC algorithm. For a replicated data set, there is one such proportion, denoted Õ_jk^rep. For each combination of item j and equivalence class k, comparison of the values of Õ_jk and Õ_jk^rep over the iterations of the MCMC provides an idea regarding the fit of the item. This article computes the PPP-value for each combination of item and equivalence class. Even though the Õ_jk s depend on the latent equivalence class memberships and are unknown, the PPMC method integrates (sums, in this context) out the unknown quantities with respect to their posterior distribution to provide natural Bayesian p-values. These p-values provide


useful information about the fit of the items, and may suggest possible improve-
*ments of the model. For example, significant p-values for most items for Equiv-*
alence Class 1 (examinees with no Skill 1) will mean that the model is not
flexible enough to explain examinees in that class and that one should add a com-
ponent to the model to address that issue (this happens in the mixed-number
example later).

The above-mentioned item fit p-values provide useful feedback about item fit, but it will be useful to summarize the fit information for each item into one plot or one number, preferably a p-value. To achieve that, this article uses two test statistics (χ²-type and G²-type), somewhat similar to those in Sinharay (2003), as discrepancy measures.

Consider for this discussion that the values of the θ_i s and π_j s are known (which is true for each iteration of the MCMC algorithm, the generated values of the θ_i s and π_j s acting as known values), which means, for example, that the equivalence class memberships of all examinees are known. For each item, one can then compute Õ_jk, k = 1, 2, . . . , K, and Õ_jk^rep, k = 1, 2, . . . , K, which are described above. Let N_k, k = 1, 2, . . . , K, denote the number of examinees in Equivalence Class k. Given the θ_i s and π_j s, the expected probability of an examinee in Equivalence Class k answering Item j correctly, E_jk, is the suitable π_{j δ_i(s)}. For example, for examinees in Equivalence Class 1 (those without Skill 1), E_j1 = π_j0 for all j, because these examinees do not have the necessary skills to solve any item. On the other hand, for examinees in Equivalence Class 2 (with Skill 1 only), E_j2 = π_j1 I_[j ∈ {1,5}] + π_j0 I_[j ∉ {1,5}], because these examinees have the necessary skills to solve Items 1 and 5 only. Let us denote the set of all item parameters and ability variables, that is, the set of all θ_i s and π_j s, as ω.

The χ²-type measure for item j, denoted henceforth as D_j^{eq,χ}(y, ω), is given by

$$D_j^{eq,\chi}(\mathbf{y}, \omega) = \sum_{k=1}^{K} \frac{N_k \left( \tilde{O}_{jk} - E_{jk} \right)^2}{E_{jk}\left( 1 - E_{jk} \right)}. \quad (6)$$

The G²-type measure, denoted henceforth as D_j^{eq,G}(y, ω), is given by

$$D_j^{eq,G}(\mathbf{y}, \omega) = 2 \sum_{k=1}^{K} N_k \left[ \tilde{O}_{jk} \log\left( \frac{\tilde{O}_{jk}}{E_{jk}} \right) + \left( 1 - \tilde{O}_{jk} \right) \log\left( \frac{1 - \tilde{O}_{jk}}{1 - E_{jk}} \right) \right]. \quad (7)$$

Each of these two statistics summarizes the fit information for an item, with large values indicating poor fit. A comparison between the posterior distribution of D_j^{eq,χ}(y, ω) [or D_j^{eq,G}(y, ω)] and the posterior predictive distribution of D_j^{eq,χ}(y^rep, ω) [or D_j^{eq,G}(y^rep, ω)] provides a summary regarding the fit of item j. The comparison can be done using a graphical plot (e.g., Figure 4). Another convenient summary is the posterior predictive p-value (PPP-value) for the discrepancy measure, which can be estimated using the process described in Section 4.

Here, a PPP-value very close to 0 indicates a problem (i.e., the variability in the data set appears unusually large compared to that predicted by the model).
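A sketch of Equation 6 for a single item follows, assuming the equivalence class labels of the current MCMC iteration and the class-wise expected probabilities E_jk are in hand (all names are illustrative):

```python
import numpy as np

def d_eq_chisq(x_j, classes, e_jk):
    """Chi-square-type item fit discrepancy (Equation 6): compares the
    class-wise proportions correct on one item with the expected
    probabilities e_jk, weighted by the class sizes N_k."""
    d = 0.0
    for k in range(len(e_jk)):
        in_k = classes == k
        n_k = in_k.sum()
        if n_k == 0:  # empty class contributes nothing (or pool classes)
            continue
        o_jk = x_j[in_k].mean()
        d += n_k * (o_jk - e_jk[k]) ** 2 / (e_jk[k] * (1 - e_jk[k]))
    return d
```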

Adding the discrepancy measures over the items, an overall discrepancy measure for assessing the fit of the model can be obtained as

$$D_{overall}^{eq,\chi}(\mathbf{y}, \omega) = \sum_j D_j^{eq,\chi}(\mathbf{y}, \omega). \quad (8)$$

The PPP-value corresponding to this measure will provide an idea about the overall ﬁt of the model to the data set.

The computation of these PPP-values does not need any approximation (e.g., a rough χ² approximation, as in Sinharay et al., 2004), and hence results in natural Bayesian p-values summarizing the fit of the items. The cells (item-group combinations) with small frequencies (especially for low or high raw scores) are not a problem with the PPMC approach because it does not need a χ² assumption. However, for more stability of the discrepancy measures and the PPP-values, equivalence classes with too few examinees can be pooled. Examining the distribution of the values of D_j^{eq,χ}(y^rep,m, ω^m), m = 1, 2, . . . , M, is a way to check for stability. These quantities should look like draws from a χ²-distribution with (K − d − 1) d.f., where K is the number of equivalence classes for the problem and d is the adjustment in the d.f. due to pooling, if there is a sufficient number of examinees for each class; departure from the χ² distribution will point to the instability of the discrepancy measure. It is possible to examine the mean, variance, and histogram of the replicated discrepancies to ensure the stability of the measure.

*Item Fit Measures Based on Raw Scores*

The latent quantities (equivalence class memberships) on which the above item fit measures are based may themselves be estimated with error, especially for a small data set, and hence may lead to a test with poor error rates. Therefore, this part examines item fit using examinee groups based on observed raw scores (as in Orlando & Thissen, 2000). Even though the raw scores do not have a clear interpretation in the context of the multidimensional 2LC model, they are natural quantities for any assessment with dichotomous items. The approach used here to assess item fit is very similar to that in Sinharay (2003).

The first discrepancy examined is the proportion of examinees in raw score group k, k = 1, 2, . . . , (J − 1), who answer item j correctly, j = 1, 2, . . . , J. Denote the observed proportion of examinees in group k answering item j correctly as O_jk. For each replicated data set, there is one (replicated) proportion correct for the j-th item and k-th group, denoted O_jk^rep. For each item j, comparison of the values of O_jk and O_jk^rep, k = 1, 2, . . . , (J − 1), provides an idea regarding the fit of the item (O_j0 = 0 and O_jJ = 1, so they are not included). One way to make the comparison is to compute the posterior predictive p-value (PPP-value) for each item-group combination. This article suggests the use of item fit plots to make the comparison. The approach will be described in detail with the data examples later.


The item fit plots promise to provide useful feedback about item fit, but it will be useful to summarize the fit information for each item into one number, preferably a p-value. To achieve that, this article uses two test statistics (χ²-type and G²-type) similar to those suggested by Orlando and Thissen (2000) as discrepancy measures. Because the measures depend on E_jk, the expected proportion of examinees in group k answering item j correctly, the next paragraph describes the computation of E_jk.

The computation of E_jk, the expected proportion of examinees in group k answering item j correctly, is not trivial. Orlando and Thissen (2000) used the recursive approach of Lord and Wingersky (1984) to compute the expectations in the context of IRT models, and this article uses a slight variation of that; a brief description of the approach follows. Let T_j(θ) be the probability of success on Item j for proficiency θ (remember that for the 2LC model, this will be either π_j1 or π_j0, depending on θ and the skill requirements of the item). Suppose S_k(θ) denotes the probability of obtaining a raw score of k on a J-item test by an examinee with proficiency variable θ. For convenience, the dependence of S_k(θ) on π and λ is not reflected in the notation. Further, denote S_k^{*j}(θ) to be the probability of a raw score k at proficiency θ on Items 1, 2, . . . , j − 1, j + 1, . . . , J (i.e., omitting Item j from the set of all items). Then,

$$E_{jk} = \frac{\sum_{\theta} T_j(\theta)\, S_{k-1}^{*j}(\theta)\, p(\theta)}{\sum_{\theta} S_k(\theta)\, p(\theta)}, \quad (9)$$

where the summation is over all possible values of θ (of which there are 24 for this problem) and p(θ) is the prior distribution on θ. For an IRT model, the summation in Equation 9 is replaced by an integration (Orlando & Thissen, 2000). For computing S_k^{*j}(θ) and S_k(θ), this article uses the recursive approach of Lord and Wingersky (1984) and Orlando and Thissen (2000); the approach involves the terms T_j(θ). The quantity E_jk does not depend on the latent proficiency variables θ_i.

The suggested χ²-type measure, denoted henceforth as D_j^{raw,χ}(y, ω), is given by

$$D_j^{raw,\chi}(\mathbf{y}, \omega) = \sum_{k=1}^{J-1} \frac{N_k \left( O_{jk} - E_{jk} \right)^2}{E_{jk}\left( 1 - E_{jk} \right)}. \quad (10)$$

The G²-type measure, denoted henceforth as D_j^{raw,G}(y, ω), is given by

$$D_j^{raw,G}(\mathbf{y}, \omega) = 2 \sum_{k=1}^{J-1} N_k \left[ O_{jk} \log\left( \frac{O_{jk}}{E_{jk}} \right) + \left( 1 - O_{jk} \right) \log\left( \frac{1 - O_{jk}}{1 - E_{jk}} \right) \right]. \quad (11)$$

Each of the above two statistics summarizes the fit information for an item, with large values indicating poor fit. One can use Σ_j D_j^{raw,χ}(y, ω) or Σ_j D_j^{raw,G}(y, ω) to provide an idea regarding the overall fit of the model.
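The recursion and Equation 9 can be sketched as follows (an illustration assuming a finite list of proficiency values with prior weights p(θ) and an array of success probabilities T_j(θ); all names are assumptions):

```python
import numpy as np

def score_dist(probs):
    """Lord-Wingersky recursion: raw-score distribution over the items
    whose success probabilities are in `probs`, at one fixed theta."""
    s = np.array([1.0])
    for p in probs:
        s = np.append(s, 0.0) * (1 - p) + np.append(0.0, s) * p
    return s  # s[k] = P(raw score = k | theta)

def expected_prop(j, T, prior):
    """Equation 9: expected proportion correct on item j for raw-score
    groups k = 1, ..., J-1; T is a (num_theta, J) array of T_j(theta),
    `prior` the prior probabilities p(theta)."""
    J = T.shape[1]
    num = np.zeros(J + 1)  # numerators, indexed by raw score k
    den = np.zeros(J + 1)
    for t, w in zip(T, prior):
        den += score_dist(t) * w              # S_k(theta) p(theta)
        s_star = score_dist(np.delete(t, j))  # S_k^{*j}(theta)
        num[1:] += t[j] * s_star * w          # T_j S_{k-1}^{*j} p(theta)
    return num[1:J] / den[1:J]  # groups k = 1, ..., J-1
```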


As discussed for the equivalence class-based measures earlier, a researcher can use graphical plots or PPP-values for D_j^{raw,χ}(y, ω) [or D_j^{raw,G}(y, ω)] to summarize item fit, and these measures can be monitored for stability and examinee groups can be pooled, if required.

*Examining Association Among the Items*

Consider an item pair in an assessment. Denote n_{kk′} to be the number of individuals scoring k on the first item and k′ on the second item, k, k′ = 0, 1. We use the sample odds ratio (see, for example, Agresti, 2002, p. 45),

$$OR = \frac{n_{11}\, n_{00}}{n_{10}\, n_{01}},$$

referred to as the odds ratio hereafter, as a test statistic with the PPMC method. Chen and Thissen (1997) apply these to detect violation of the local independence assumption of simple IRT models. Sinharay and Johnson (2003) and Sinharay (2005), applying the PPMC method, find the odds ratios to be useful discrepancy measures in detecting misfit of the simple IRT models. Examining the model prediction of odds ratios, which are measures of association, should help a researcher detect whether the fitted model can adequately explain the association among the test items. The BNs assume the ability variables to be multidimensional and attempt to explain the associations among items; therefore, examining how the model failed to reproduce the odds ratios might also provide some insight regarding any associations that the model failed to capture and might lead to an improved model.
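A minimal sketch of this discrepancy for one item pair (illustrative names; the 0.5 continuity correction is a convenience added here, not something the article specifies):

```python
import numpy as np

def odds_ratio(x_a, x_b):
    """Sample odds ratio n11*n00/(n10*n01) for two 0/1 item-score vectors,
    with a 0.5 correction in each cell to guard against empty cells."""
    n11 = np.sum((x_a == 1) & (x_b == 1)) + 0.5
    n00 = np.sum((x_a == 0) & (x_b == 0)) + 0.5
    n10 = np.sum((x_a == 1) & (x_b == 0)) + 0.5
    n01 = np.sum((x_a == 0) & (x_b == 1)) + 0.5
    return n11 * n00 / (n10 * n01)
```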

*Differential Item Functioning*

Holland (1985) suggested the Mantel–Haenszel test statistic for studying DIF in the context of item response data. The test has the strength that it does not require fitting any model and is straightforward to apply.

*Suppose one is interested in examining if a given item j shows DIF over a “focal*
*group” F and a “reference group” R. In an application of the Mantel–Haenszel test,*
*the examinees are divided into K matching groups, typically based on their raw*
*scores. For an item, the data from the k-th matched group of reference and focal group*
members are arranged as a 2 × 2 table, as shown in Table 3. The Mantel–Haenszel


TABLE 3

*Table for the Mantel–Haenszel Statistic*

| Group | Right on an Item | Wrong on an Item | Total |
|---|---|---|---|
| Reference | A_k | B_k | n_Rk |
| Focal | C_k | D_k | n_Fk |
| Total | R_k | W_k | n_{+k} |

test statistic for an item is then given by

$$X_{MH}^2 = \frac{\left( \left| \sum_k A_k - \sum_k \mu_k \right| - 0.5 \right)^2}{\sum_k \sigma_k^2}, \quad (12)$$

where μ_k = n_Rk R_k / n_{+k} and σ_k² = (n_Rk n_Fk R_k W_k) / [n_{+k}²(n_{+k} − 1)]. This article applies the above test statistic to determine DIF for BNs.
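A sketch of Equation 12 from per-group counts (illustrative names; each matched group is assumed to contain at least two examinees so that the variance term is defined):

```python
import numpy as np

def mantel_haenszel(a, n_r, n_f, r):
    """Mantel-Haenszel statistic (Equation 12). Inputs are arrays over the
    K matched groups: a[k] = reference-group members answering the item
    right, n_r[k]/n_f[k] = reference/focal group sizes, r[k] = total right."""
    n = n_r + n_f                  # n_{+k}
    w = n - r                      # W_k, total wrong on the item
    mu = n_r * r / n               # E(A_k) under the no-DIF hypothesis
    var = n_r * n_f * r * w / (n ** 2 * (n - 1))
    return (abs(a.sum() - mu.sum()) - 0.5) ** 2 / var.sum()
```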

_{+k}However, the test employs matched groups based on raw scores of examinees, which are not the most natural quantities in the context of BNs. No DIF in an appli- cation of a BN means that success probabilities of the reference group and the focal group are the same conditional on the skill variables, and raw scores do not neces- sarily contain all information about the skills; a test incorporating this fact may per- form better than the Mantel–Haenszel test.

One simple alternative to the Mantel–Haenszel test is to fit the model separately to the two groups (focal and reference) and compare the values of the estimates of the π_j s for the two groups using, for example, a normal or a χ² test (as in Lord, 1980, in testing for DIF in IRT models). The approach may be costly in a setting such as here, where an MCMC algorithm is used to fit the model, especially if an investigator is interested in a number of different focal groups and/or reference groups. Furthermore, the approach may be problematic if one or both of the two groups are small; the estimates obtained by fitting a BN to a small data set may not be stable. This line of work is not pursued further in this article.

This article examines a discrepancy measure based on the Mantel–Haenszel test statistic, forming matched groups with respect to the latent skills. This approach does not need running the MCMC more than once. Matching with respect to skills (or, equivalently, equivalence classes) is a natural concept in this context; however, the problem in doing so is that the skills are unknown. Fortunately, the PPMC method and the draws from the MCMC algorithm provide us a way to get around that problem.

Suppose we know the skill variables for each examinee and hence know their equivalence class memberships (this will be true for each iteration of the MCMC, the generated values acting as the known values). For each equivalence class, we can compute a table such as Table 3 and can compute a discrepancy such as Equation 12. Let us denote the measure for Item j as D_j^{MH}(y, ω). It is straightforward to obtain an estimate of the PPP-value corresponding to D_j^{MH}(y, ω) by computing the proportion of times D_j^{MH}(y^rep,m, ω^m) exceeds D_j^{MH}(y, ω^m), where ω^m, m = 1, 2, . . . , M, is a posterior sample and y^rep,m, m = 1, 2, . . . , M, are the corresponding posterior predictive data sets.

*Discussion*

One disadvantage of these techniques is the conservativeness of the PPMC methods. However, as argued earlier, a conservative test is better than a test with unknown properties, and the conservativeness of the PPMC technique may be a boon here. Another disadvantage is that these techniques are based on the MCMC algorithm and hence are computation-intensive. Another issue, which might make some potential users uneasy, is that these techniques are Bayesian in nature and use


prior distributions. In the absence of substantial prior information, one can use noninformative prior distributions to reduce the effect of the prior distribution on the inference. Alternatively, one can perform a sensitivity analysis to detect the effect of the prior distributions.

However, the disadvantages are somewhat compensated by the advantages that (a) the suggested techniques provide natural probability statements from a Bayesian point of view, and (b) computations for the techniques require neither complicated theoretical derivations nor extensive simulation studies once an MCMC algorithm has been used to fit the model.

*Assessing Identifiability of Model Parameters*

The BNs deal with latent classes (see, e.g., Sinharay et al., 2004, and the references therein), and hence there is a possibility of nonidentifiability or weak identifiability, which means that the data provide little information about some parameters.

Lindley (1971) argues that even a weakly identifiable Bayesian model may be valid in the sense that it can still describe the data via its identifiable parameters and information supplied a priori. However, although weakly identifiable models can be valid, they are not optimal for summarizing population characteristics.

By failing to recognize weak identiﬁability, one may make conclusions based on a weakly identiﬁed parameter when one has no direct evidence beyond prior beliefs to do so (Garrett & Zeger, 2000). Furthermore, such a model increases the chances of nonapplicability of standard asymptotic theory (e.g., as follows from Haberman, 1981) and problems with MCMC convergence. Therefore, a diagnostic for model identiﬁability, although not directly a model-checking tool, is a necessary tool for assessing the performance of a BN.

This article uses plots of prior versus posterior distributions of the model parameters (the π s and λ s) to assess their identifiability. The closer the two densities, the less is the information provided by the data on the parameter (because the posterior is proportional to the likelihood times the prior), and the weaker is the identifiability of the parameter. It is possible to use numerical measures such as the percentage overlap between the prior and the posterior (e.g., Garrett & Zeger, 2000), but this article does not use those measures.
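The percentage-overlap idea mentioned above can be sketched as follows (an assumption: the parameter has a Beta prior and its posterior is summarized by MCMC draws; names are illustrative):

```python
import numpy as np
from scipy.stats import beta, gaussian_kde

def prior_posterior_overlap(draws, prior_a, prior_b, grid_size=512):
    """Approximate overlap between a Beta prior density and a posterior
    density estimated from MCMC draws; values near 1 indicate that the
    data added little information (weak identifiability)."""
    grid = np.linspace(0.001, 0.999, grid_size)
    prior_pdf = beta.pdf(grid, prior_a, prior_b)
    post_pdf = gaussian_kde(draws)(grid)
    return np.trapz(np.minimum(prior_pdf, post_pdf), grid)

# Illustrative: a posterior concentrated far from a Beta(3.5, 23.5) prior
draws = np.random.default_rng(2).beta(60, 20, size=2000)
print(prior_posterior_overlap(draws, prior_a=3.5, prior_b=23.5))  # small
```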

The following sections apply the suggested model checking tools to the mixed-number subtraction data set and a data set simulated from the 2LC model.

**6. Direct Data Display**

Figure 1 presents a direct data display discussed in the section on ﬁtting the 2LC
model to the mixed-number subtraction data. The ﬁgure shows the mixed-number
subtraction data and eight replicated data sets (chosen randomly) from the 10,000
obtained during the ﬁtting of the model. The plot is somewhat similar to Figures
6.6 and 6.7 of Gelman et al. (2003) that are used to assess the ﬁt of a logistic regres-
sion model. Figure 1 shows that there are certain patterns in the observed data that
are not present in the replicated data sets. For example, the examinees with the low-
est scores could answer only two items correctly (interestingly, these are Items 5 and
6, the idiosyncrasies of which were discussed earlier—both of these can be solved
without any knowledge of mixed-number subtraction). In the replicated data sets,

however, these examinees get other items correct as well. The smartest examinees get all items correct in the observed data, which is not true for the replicated data sets. Furthermore, the weakest half of the examinees rarely get any hard items (those belonging to Evidence Models 4 and 5) correct in the observed data, which is not true in the replicated data sets. These patterns suggest that the 2LC model is not entirely satisfactory for the data set.

To examine the performance of the 2LC model when data come from the same model, a limited simulation study is performed. The above analysis of the mixed- number data using the 2LC model produces 10,000 posterior predictive data sets;

each of these can be considered to have been generated from the 2LC model. We randomly pick one of them, ﬁt the 2LC model to the data, and generate 10,000 pos- terior predictive data sets as before.

Figure 2 shows the data simulated from the 2LC model and eight replicated data sets from its analysis. The data set on the left seems to fit in with the replicated data sets on the right. The direct data display does not indicate any deficiency of the model, rightfully so.

**7. Item Fit Analysis**

*Fit of the 2LC Model to the Mixed-Number Subtraction Data*

This section examines, using various measures, item ﬁt for the 2LC model with the mixed-number subtraction data set.

*Sinharay*

*FIGURE 1. Mixed-number subtraction data and eight replicated data sets. Note. The left*
*column in Figure 1 shows all observed responses, with a little black box indicating a cor-*
*rect response. The items, sorted according to decreasing proportion correct, are shown*
*along the x-axis; the examinees, sorted according to increasing raw scores, are shown*
*along the y-axis. Right columns show eight replicated data sets from the 2LC model.*

*There are some clear differences between the observed and replicated data sets.*

*Point Biserial Correlations*

Columns 6 and 7 in Table 1 show the observed point biserials (denoted r_pbis) and the p-values (denoted P_pbis) corresponding to each item. The p-values significant at the 5% level are marked with an asterisk. The values show that the 2LC model underpredicts most of the point biserial correlations for this data set, and significantly so for Items 3, 5, 8, and 12.

*FIGURE 2. 2LC data and eight replicated data sets.*

*FIGURE 3. Observed and predicted average biserial correlations when the 2LC model is fitted to the mixed-number subtraction data.*

Figure 3 shows a histogram of the predicted average biserial correlations (from the 10,000 replicated data sets). A vertical line shows the observed average (averaged over all items), which appears improbable under the 2LC model (PPP-value of .00). This indicates some inadequacy of the 2LC model for these data.

*Measures Based on Equivalence Classes*

Figure 4 shows, for Items 12 and 13, plots of the realized discrepancy D_j^{eq,χ}(y, ω) and the predicted discrepancy D_j^{eq,χ}(y^rep, ω). A diagonal line is drawn for convenience; points consistently below the diagonal (showing that the realized discrepancy is mostly larger than the predicted discrepancy) indicate a problem.

The plot toward the left shows that the model cannot explain the responses for the item adequately, the model-predicted discrepancy being almost always smaller than the realized discrepancy. The PPP-value is .04. The plot toward the right shows the opposite—the model seems to reproduce the discrepancy and hence explain the responses for the item adequately (PPP-value = .49).

Column 8 in Table 1 shows the item fit p-values (denoted P_{D_j^{eq,χ}(y,ω)}) for the measure D_j^{eq,χ}(y, ω). Because the p-values for the χ²-type and G²-type measures are very close, we show and discuss the results for the former only. The p-value for assessing overall fit, the one corresponding to the discrepancy measure in Equation 8, is .01 and indicates that the model cannot explain the data set adequately.

Figure 5 shows the p-values for each item for each equivalence class. The plot is like one used in Murdoch and Chow (1996), where elliptical glyphs represent correlations. For a correlation ρ, an ellipse is drawn with the shape of the contour of a bivariate normal distribution with unit variances and correlation ρ. Here, we plot the quantity 2 × (p-value − .5) for each combination of item and equivalence class. A p-value close to zero (one) will be transformed to −1 (1) and will be represented by almost a line with slope −1 (1); one such example is provided by Item 5, Class 9 (Item 8, Class 1). The figure provides more insight regarding items with extreme p-values. For example, the 2LC model often results in inaccurate predictions for Equivalence Class 1, with a number of extreme p-values, and does not perform too

*FIGURE 4. Comparison of realized and predicted discrepancies when the 2LC model is*
*fit to mixed-number subtraction data.*