
Model Diagnostics for Bayesian Networks

Sandip Sinharay
Educational Testing Service

Journal of Educational and Behavioral Statistics, Spring 2006, Vol. 31, No. 1, pp. 1–33

Bayesian networks are frequently used in educational assessments, primarily for learning about students’ knowledge and skills. There is a lack of work on assessing the fit of Bayesian networks. This article employs the posterior predictive model checking method, a popular Bayesian model checking tool, to assess the fit of simple Bayesian networks. A number of aspects of model fit, those of usual interest to practitioners, are assessed using various diagnostic tools. This article suggests a direct data display for assessing overall fit, several diagnostics for assessing item fit, a graphical approach to examine whether the model can explain the association among the items, and a version of the Mantel–Haenszel statistic for assessing differential item functioning. Limited simulation studies and a real data application demonstrate the effectiveness of the suggested model diagnostics.

Keywords: discrepancy measure, Mantel–Haenszel statistic, p-values, posterior predictive model checking

1. Introduction

Bayesian inference networks or Bayesian networks (BIN or BN; Pearl, 1988) are frequently used in educational assessments (Mislevy, 1995; Mislevy, Almond, Yan, & Steinberg, 2001; Mislevy, Steinberg, & Almond, 2003; Weaver & Junker, 2003), especially in the context of evidence-centered design (ECD; Almond & Mislevy, 1999), for learning about students’ knowledge and skills given information about their performance in an assessment.

A BN is an appropriate tool when an examination of the items in an assessment suggests that solving each item requires at least a certain level of one or more of a number of prespecified skills. A BN first assumes that for each skill, each examinee has a corresponding discrete proficiency variable that states the skill level of the examinee.

This research was supported by Educational Testing Service. The author thanks Shelby Haberman, Russell Almond, Robert Mislevy, Paul Holland, Andrew Gelman, Hal Stern, Matthias von Davier, the two anonymous referees, and Howard Wainer for their invaluable advice. The author also thanks Andrew Gelman and Michael Friendly for their help in creating some of the graphical plots. The author is grateful to Kikumi Tatsuoka for providing the mixed number subtraction data set. The author gratefully acknowledges the help of Rochelle Stern and Kim Fryer with proofreading. Any opinions expressed in this article are those of the author and not of Educational Testing Service.


The next step in a BN is to describe a probability model for the response of an examinee to an item based on his or her attainment of the skills required to solve the item. For example, a simple BN might assume that the skill variables are binary (i.e., one either has the skill or does not), that the examinees who have all the required skills for item j have a certain probability πj1 of solving the item, while those who do not have the required skills have probability πj0 (where πj0 < πj1) of solving the item. Finally, a BN specifies prior distributions over the skill variables, usually with the help of a graphical model (Pearl, 1988), and over the item parameters.

Model checking for BNs is not straightforward. For even a moderately large number of items, the possible number of response patterns is huge, and the standard χ2 test of goodness of fit does not apply directly. Standard item response theory (IRT) model diagnostics do not apply to BNs because of the presence of discrete and multivariate proficiency variables in the latter models. As a result, the use of statistical diagnostic tools in this context is notably lacking (Williamson, Almond, & Mislevy, 2000).

Recently, Yan, Mislevy, and Almond (2003) and Sinharay, Almond, and Yan (2004) suggested a number of model checking tools for BNs, including item fit plots and an item-fit test statistic. However, a number of questions remain unanswered in this area. Hambleton (1989) comments that the assessment of fit of IRT models usually involves collecting a wide variety of evidence about model fit and then making an informed judgment about the fit and usefulness of a model with a particular set of data. Hambleton’s comment applies equally well to BNs, but is yet to be addressed completely for these models. For example, there is no work yet assessing differential item functioning (DIF) for BNs. In summary, there is significant scope for further research in this area.

The posterior predictive model checking (PPMC) method (Rubin, 1984) is a popular Bayesian model checking tool because of its simplicity, strong theoretical basis, and obvious intuitive appeal. The method primarily consists of comparing the observed data with replicated data—those predicted by the model—using a number of discrepancy measures; a discrepancy measure is like a test statistic in a frequentist analysis that measures the difference between an aspect of the observed data set and a replicated data set. This article applies the PPMC method to assess several aspects of fit of BNs.

First, this article suggests a direct data display (like those suggested by Gelman, Carlin, Stern, & Rubin, 2003) as a quick overall check of model fit. This article then uses the PPMC method to examine item fit for BNs using a number of discrepancy measures.

Sinharay and Johnson (2003) and Sinharay (2005) apply the PPMC method and find the odds ratios (e.g., Agresti, 2002) corresponding to the responses to pairs of items to be useful discrepancy measures in detecting misfit of unidimensional IRT models. This article uses the PPP-values for the odds ratios to check whether the BNs can explain the association among the items adequately.

This article then assesses DIF using the Mantel–Haenszel test suggested by Holland (1985). The test uses “matching groups” based on raw scores of examinees;


however, these groups may not be homogeneous in the context of BNs. Therefore, this article applies the PPMC method using a discrepancy based on a version of the Mantel–Haenszel test that uses the equivalence class memberships (rather than raw scores) to form the “matching groups.”

The BNs deal with latent variables, hence there is a possibility of nonidentifiability or weak identifiability of some parameters. Weak identifiability may lead to nonapplicability of standard asymptotic theory, problems with MCMC convergence, and misleading inferences. Therefore, a diagnostic for model identifiability, although not directly a model-checking tool, is a necessary tool for assessing the performance of a BN. This article uses plots of prior versus posterior distributions of the model parameters to assess their identifiability.

Section 2 introduces the BNs. Section 3 introduces the mixed-number subtraction example (Tatsuoka, 1990) that will form the basis of a number of analyses later. Section 4 provides a brief discussion of the PPMC method and the PPP-value. Section 5 introduces the discrepancy measures to be used for assessing different aspects of model fit. Section 6 applies the direct data display to the mixed-number data and a simulated data set. Section 7 applies the suggested item fit measures to the mixed-number subtraction data set and a simulated data set. Section 8 discusses results of examining association among item pairs. Section 9 discusses results of the DIF analysis. Section 10 assesses identifiability of the parameters. The conclusions of this study are discussed in Section 11.

2. Bayesian Networks

Increasingly, users of educational assessments want more than a single summary statistic out of an assessment. They would like to see a profile of the state of acquisition of a variety of knowledge, skills, and proficiencies for each learner. One technique for profile scoring is the use of a BN.

The idea of a Bayesian inference network (BIN)—also known as a Bayesian network or Bayes net (BN) (Pearl, 1988)—comes primarily from the theory of graphical models, where it is defined as an acyclic directed graph in which each node represents a random variable, or uncertain quantity, which can take two or more possible values. To a statistician, BNs are statistical models that describe a number of observations from a process with the help of a number of latent (or unobserved) categorical variables. In the context of educational assessments, BNs are appropriate when the task analysis of an assessment suggests that the satisfactory completion of any task requires an examinee to have at least a certain level of a number of skills.

The unobserved (or latent) variables, representing the level of the skills in the examinees, affect the probability distribution of the observed responses in a BN. BNs are called so because they use Bayes’s rule for inference. Mislevy (1995), Almond and Mislevy (1999), Mislevy et al. (2001), Mislevy et al. (2003), and Weaver and Junker (2003) show how BNs may be useful in psychometrics.

Consider an assessment with I examinees and J items. Suppose Xij denotes the response of the i-th examinee to the j-th item. In general, Xij can be a vector-valued quantity; that is, a single “task” could consist of multiple “items.” It can contain


polytomous outcomes as well. In our example, each item produces a single dichotomous outcome, which is 1 if the response is correct and 0 if it is incorrect. Denote the ability/skill variable vector for the i-th examinee by θi = (θi1, θi2, . . . , θip)′, θik denoting the skill level of Examinee i for Skill k, and the item parameter vector for the j-th item by πj = (πj1, πj2, . . . , πjm)′. A standard assumption is that of local independence, that is, independence of the responses conditional on the skills. Suppose that a cognitive analysis of the items produced a Q-matrix (e.g., Tatsuoka, 1990), an incidence matrix showing for each item the skills required to solve that item. The next step is to define P(Xij = 1 | θi, πj). As an example of a simple BN, assume that solving the j-th item needs a number of skills, define θi to consist of the indicators of attainment of the skills, let πj = (πj0, πj1)′, and choose

    P(Xij = 1 | θi, πj) = πj1  if the examinee has the skills required to solve item j,
                          πj0  otherwise.

To perform a Bayesian analysis of a BN, one needs a prior (population) distribution p(θ) on the ability variables θ. Mislevy (1995) showed that forming a graphical model (Lauritzen & Spiegelhalter, 1988; Pearl, 1988) over the skill variables provides a convenient way to specify this distribution.

BNs are similar to several models in the family of cognitive diagnostic models: the deterministic inputs, noisy “and” gate (DINA) model of Junker and Sijtsma (2001) (the only difference between the two being the graphical model for p(θ) in a BN), the noisy inputs, deterministic “and” gate (NIDA) model of Maris (1999), the higher-order latent trait models (de la Torre & Douglas, 2004), and the fusion model of Hartz, Roussos, and Stout (2002).

3. The Mixed-Number Subtraction Example

Here, we introduce what will be a running example used through this article, one regarding a test of mixed-number subtraction. This example is grounded in a cognitive analysis of middle school students’ solutions of mixed-number subtraction problems. The focus here is on the students learning to use the following method (discussed in, e.g., Tatsuoka, Linn, Tatsuoka, & Yamamoto, 1988):

Method B: Separate mixed numbers into whole number and fractional parts;

subtract as two subproblems, borrowing one from minuend whole number if necessary; and then simplify and reduce if necessary.

As is usual with a typical application of a BN, the analysis started with a cognitive analysis (e.g., Tatsuoka, 1984) of such items to determine the “attributes,” which are important for solving different kinds of problems. The cognitive analysis mapped out a flowchart for applying Method B to a universe of fraction–subtraction problems. A number of key procedures appear, a subset of which is required to solve a given problem according to its structure. To simplify the model, we eliminate the items for which the fractions do not have a common denominator (leaving us with 15 items). The procedures are as follows: (a) Skill 1: Basic fraction subtraction;

(b) Skill 2: Simplify/reduce fraction or mixed number; (c) Skill 3: Separate whole number from fraction; (d) Skill 4: Borrow one from the whole number in a given mixed number; and (e) Skill 5: Convert a whole number to a fraction. The cognitive analysis identified Skill 3 as a prerequisite of Skill 4; that is, no students have Skill 4 without also having Skill 3. Thus, there are only 24 possible combinations of the five skills that a given student can possess.

We consider the responses of 325 examinees from a data set discussed in, for example, Tatsuoka (1984) and Tatsuoka et al. (1988). This data set played a central role in the development of the rule space methodology (e.g., Tatsuoka, 1990). The 325 examinees considered here were found to have used Method B (Mislevy, 1995). Table 1 lists 15 items from the data set that will be studied here, characterized by the skills they require in the column “Skills,” which also represents the Q-matrix. An “x” for a skill denotes that the skill is required by the item, while an “o” denotes the opposite. For example, Item 9 requires Skills 1, 3, and 4. The table also shows the proportion-correct score (“Prop. Corr.”) for each item. The items are ordered with respect to an increasing order of difficulty (or decreasing order of proportion-correct score). The last five columns of the table are discussed later. Many rows of the Q-matrix are identical, corresponding to a group of items that require the same set of skills to solve. Following the terminology of evidence-centered design (ECD; Almond & Mislevy, 1999), we call the patterns corresponding to the rows evidence models (EM); these are shown in Column 4 of the table.

Sinharay et al. (2004) discussed the fact that Items 5 and 6 may not require any mixed number subtraction skills; for example, an examinee may solve Item 5 just by noticing that any number minus the same number results in zero.

Certain patterns of skills will be indistinguishable on the basis of the results of this test. For example, because every item requires Skill 1, the 12 profiles (i.e., skill patterns) that lack Skill 1 are indistinguishable on the basis of this data. Similar logic reveals that there are only nine identifiable equivalence classes of student profiles. This property of the test design will manifest itself later in the analysis.

Table 2 describes the classes by relating them to the EMs. Often, distinctions among members of the same equivalence class are instructionally irrelevant. For example, students judged to be in Equivalence Class 1 would all be assigned remedial work in basic subtraction, so no further distinction is necessary.
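To make the counting of identifiable classes concrete, the following minimal sketch (a hypothetical illustration, not part of the original analysis; the function and variable names are invented) enumerates the skill profiles that satisfy the prerequisite constraint and groups them by the set of evidence models they allow an examinee to solve. The skill requirements of the six evidence models are read from Table 1.

```python
from itertools import product

# Skill sets required by the six evidence models (from the Q-matrix patterns in Table 1):
# EM1: Skill 1; EM2: Skills 1,2; EM3: Skills 1,3; EM4: Skills 1,3,4;
# EM5: Skills 1,3,4,5; EM6: Skills 1,2,3,4.
EM_SKILLS = [{1}, {1, 2}, {1, 3}, {1, 3, 4}, {1, 3, 4, 5}, {1, 2, 3, 4}]

def valid_profiles():
    """All skill patterns respecting the prerequisite: Skill 4 implies Skill 3."""
    for bits in product((0, 1), repeat=5):
        skills = {k + 1 for k, b in enumerate(bits) if b}
        if 4 in skills and 3 not in skills:
            continue  # Skill 3 is a prerequisite of Skill 4
        yield frozenset(skills)

profiles = list(valid_profiles())
print(len(profiles))  # 24 possible skill combinations

# Group the profiles by the evidence models whose items they can solve.
classes = {}
for prof in profiles:
    solvable = tuple(req <= prof for req in EM_SKILLS)
    classes.setdefault(solvable, []).append(sorted(prof))

print(len(classes))  # 9 identifiable equivalence classes, as in Table 2
```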

Mislevy (1995) analyzed the data using a BN. Let θi = (θi1, . . . , θi5) denote the attribute/skill vector for Examinee i. In the context of ECD, where the BNs have found most use, the θik s are called proficiency variables. These are latent variables describing the knowledge, skills, and abilities of the examinees that we wish to draw inferences about. (The term variable is used to emphasize its person-specific nature.)

Mixed-Number Subtraction Link Models

The “link models” in ECD terminology state the likelihood distribution of the data given the model parameters and latent variables.


TABLE 1

Skill Requirements and Some P-Values for the Mixed-Number Subtraction Problems

     Skills                Prop.            Ppbis   PDjeq,χ(y,ω)   PDjraw,χ(y,ω)   PDjraw,G(y,ω)
No.  1 2 3 4 5   EM        Corr.   rpbis    2LC     2LC            2LC             2LC

 1   x o o o o    1         .79     .50     .40     .74            .26             .19
 5   x o o o o    1         .70     .52     .00*    .00*           .00*            .00*
 4   x x o o o    2         .71     .56     .10     .43            .06             .23
 2   x o x o o    3         .75     .59     .14     .28            .05             .03*
 3   x o x o o    3         .74     .61     .02*    .19            .03*            .02*
 6   x o x o o    3         .69     .23     .08     .30            .00*            .00*
 9   x o x x o    4         .37     .80     .05     .06            .07             .11
10   x o x x o    4         .37     .76     .45     .56            .42             .30
11   x o x x o    4         .34     .74     .36     .49            .06             .08
14   x o x x o    4         .31     .77     .43     .41            .30             .28
 7   x o x x o    4         .41     .70     .08     .04*           .01*            .04*
 8   x o x x x    5         .38     .71     .00*    .00*           .00*            .03*
12   x o x x x    5         .33     .66     .04*    .04*           .00*            .00*
15   x o x x x    5         .26     .75     .06     .52            .30             .38
13   x x x x o    6         .31     .78     .23     .49            .53             .52

[The “Text of the Item” column, giving the fraction subtraction problems, is not reproduced here.]


It is assumed that a participant needs to have mastered all of the skills required for an item in order to solve it. In an application, however, we will get false positive and false negative results. The link model described below follows that intuition. The BN applied by Mislevy (1995), referred to as the 2-parameter latent class (2LC) model hereafter, uses two parameters for each item, the true-positive and false-positive probabilities, in

    P(Xij = 1 | θi, πj) = πj1  if Examinee i mastered all the skills needed to solve Item j,
                          πj0  otherwise.                                                  (1)

Suppose the j-th item uses Evidence Model s, s = 1, 2, . . . , 6. Although s is determined by the item, this notation does not reflect that. Let δi(s) be the 0/1 indicator denoting whether Examinee i has mastered the skills needed for tasks using Evidence Model s or not. The quantity δi(s) for any examinee is completely determined by the values of θ1, θ2, . . . , θ5 for that examinee. The likelihood of the response of the i-th examinee to the j-th item is then expressed as

    Xij | πjδi(s), θi ~ Bernoulli(πjδi(s)).                                                (2)

The model relies on the assumption of local independence; that is, given the proficiency θi, the responses of an examinee to the different items are assumed independent. The probability πj1 represents a “true-positive” probability for the item; that is, it is the probability of getting the item right for students who have mastered all of the required skills. The probability πj0 represents a “false-positive” probability; it is the probability of getting the item right for students who have yet to master at least one of the required skills. The probabilities πj0 and πj1 are allowed to differ over j (i.e., from item to item). However, we use the same independent priors for all items:

    πj0 ~ Beta(3.5, 23.5),    πj1 ~ Beta(23.5, 3.5).                                       (3)
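As an illustration of how the 2LC link model generates data, the following minimal sketch (a hypothetical illustration, not code from the original study; the function and variable names are invented) simulates responses for examinees with known skill profiles. The evidence-model skill requirements and the item-to-EM mapping are read from Table 1, and the item parameters are drawn from the priors in Equation 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Skill requirements of the 6 evidence models (from Table 1).
EM_SKILLS = [{1}, {1, 2}, {1, 3}, {1, 3, 4}, {1, 3, 4, 5}, {1, 2, 3, 4}]
# Evidence model used by each of the 15 items, in item order 1..15 (column "EM" of Table 1).
ITEM_EM = [1, 3, 3, 2, 1, 3, 4, 5, 4, 4, 4, 5, 6, 4, 5]

def simulate_2lc(skills, pi0, pi1):
    """Simulate a 0/1 response matrix under the 2LC link model (Equations 1-2).

    skills : list of sets, the skill profile of each examinee
    pi0, pi1 : arrays of false- and true-positive probabilities, one per item
    """
    X = np.empty((len(skills), len(ITEM_EM)), dtype=int)
    for i, prof in enumerate(skills):
        for j, em in enumerate(ITEM_EM):
            delta = EM_SKILLS[em - 1] <= prof   # delta_i(s): has all required skills?
            p = pi1[j] if delta else pi0[j]
            X[i, j] = rng.binomial(1, p)
    return X

# Item parameters drawn from the priors in Equation 3.
J = len(ITEM_EM)
pi0 = rng.beta(3.5, 23.5, size=J)   # false-positive probabilities
pi1 = rng.beta(23.5, 3.5, size=J)   # true-positive probabilities

# Two example examinees: one with all skills, one with Skill 1 only.
X = simulate_2lc([{1, 2, 3, 4, 5}, {1}], pi0, pi1)
print(X)
```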

TABLE 2

Description of Each Equivalence Class, With the Evidence Models (EM) Whose Items Should Be Solved by Examinees of the Equivalence Class

Equivalence                              EM Class
Class         Description            1    2    3    4    5    6

1             No Skill 1
2             Only Skill 1           x
3             Skills 1 & 3           x         x
4             Skills 1, 3, & 4       x         x    x
5             Skills 1, 3, 4, & 5    x         x    x    x
6             Skills 1 & 2           x    x
7             Skills 1, 2, & 3       x    x    x
8             Skills 1, 2, 3, & 4    x    x    x    x         x
9             All Skills             x    x    x    x    x    x


Mixed-Number Subtraction Proficiency Model

The prior distribution of the proficiency variables is known as the proficiency model in the ECD literature. In a BN, the prior distribution P(θ) is expressed as a discrete Bayesian network (this is where the name of the model comes from) or graphical model (Lauritzen & Spiegelhalter, 1988; Pearl, 1988). Prior analyses revealed that Skill 3 is a prerequisite to Skill 4. A three-level auxiliary variable θWN incorporates this constraint. Level 0 of θWN corresponds to the participants who have mastered neither skill; Level 1 represents participants who have mastered Skill 3 but not Skill 4; Level 2 represents participants who have mastered both skills.

The dependence relationship (see, e.g., Figure 2 of Sinharay et al., 2004, for a graphical representation of the relationship) among the skill variables provided by the expert analysis corresponds to the factorization

    p(θ) = p(θ3 | θWN) p(θ4 | θWN) p(θWN | θ1, θ2, θ5) p(θ5 | θ1, θ2) p(θ2 | θ1) p(θ1).

The parameters λ are defined as follows:

    λ1 = P(θ1 = 1),
    λ2,m = P(θ2 = 1 | θ1 = m),                 for m = 0, 1,
    λ5,m = P(θ5 = 1 | θ1 + θ2 = m),            for m = 0, 1, 2,
    λWN,m,n = P(θWN = n | θ1 + θ2 + θ5 = m),   for m = 0, 1, 2, 3, and n = 0, 1, 2.

Finally, one requires prior distributions P(λ). We assume that λ1, λ2, λ5, and λWN are a priori independent.

The natural conjugate priors for the components of λ are either Beta or Dirichlet distributions. The hyperparameters are chosen, as in Mislevy et al. (2001), so that they sum to 27 (relatively strong numbers given the sample size of 325). With such a complex latent structure, strong priors such as those here are necessary to prevent problems with identifiability. These must be supported by relatively expensive elicitation from the experts. The prior probabilities give a chance of 87% of acquiring a skill when the previous skills are mastered and 13% of acquiring the same skill when the previous skills are not mastered. They are given below:

    λ1 ~ Beta(23.5, 3.5),
    λ2,0 ~ Beta(3.5, 23.5),    λ2,1 ~ Beta(23.5, 3.5),
    λ5,0 ~ Beta(3.5, 23.5),    λ5,1 ~ Beta(13.5, 13.5),    λ5,2 ~ Beta(23.5, 3.5),
    λWN,0,. = (λWN,0,0, λWN,0,1, λWN,0,2) ~ Dirichlet(15, 7, 5),    λWN,1,. ~ Dirichlet(11, 9, 7),
    λWN,2,. ~ Dirichlet(7, 9, 11),    λWN,3,. ~ Dirichlet(5, 7, 15).
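To make the proficiency model concrete, here is a minimal sketch (an illustration under the factorization above, not code from the original study; the function names are invented) that draws a skill profile θ given a set of λ values set to their prior means.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_profile(lam):
    """Draw one skill vector (theta1,...,theta5) from the proficiency model.

    lam is a dict matching the parameterization in the text:
      lam['1']     : P(theta1 = 1)
      lam['2'][m]  : P(theta2 = 1 | theta1 = m),                  m = 0, 1
      lam['5'][m]  : P(theta5 = 1 | theta1 + theta2 = m),         m = 0, 1, 2
      lam['WN'][m] : P(thetaWN = . | theta1 + theta2 + theta5 = m), m = 0..3
    """
    t1 = rng.binomial(1, lam['1'])
    t2 = rng.binomial(1, lam['2'][t1])
    t5 = rng.binomial(1, lam['5'][t1 + t2])
    wn = rng.choice(3, p=lam['WN'][t1 + t2 + t5])
    t3 = int(wn >= 1)   # Level 1 or 2 of thetaWN: Skill 3 mastered
    t4 = int(wn == 2)   # Level 2 of thetaWN: Skills 3 and 4 both mastered
    return (t1, t2, t3, t4, t5)

# Lambda values set to their prior means (all hyperparameters sum to 27).
lam = {
    '1': 23.5 / 27,
    '2': [3.5 / 27, 23.5 / 27],
    '5': [3.5 / 27, 13.5 / 27, 23.5 / 27],
    'WN': [np.array(a) / 27 for a in ([15, 7, 5], [11, 9, 7], [7, 9, 11], [5, 7, 15])],
}
print(draw_profile(lam))
```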


Fitting the 2LC Model to the Mixed-Number Subtraction Data

The WinBUGS software (Spiegelhalter, Thomas, & Best, 1999) is used to fit the model to the data set with an MCMC algorithm. For details of the model fitting, see, for example, Sinharay et al. (2004). Five chains of size 2,000 each are used, after a burn-in of 1,000 each, resulting in a final posterior sample size of 10,000.

4. Posterior Predictive Model Checking Method

Let p(y | ω) denote the likelihood distribution for a statistical model applied to data (examinee responses in this context) y, where ω denotes all the parameters in the model. Let p(ω) be the prior distribution on the parameters. Then the posterior distribution of ω is

    p(ω | y) = p(y | ω) p(ω) / ∫ p(y | ω) p(ω) dω.

The posterior predictive model checking (PPMC) method (Rubin, 1984, building on the work in Guttman, 1967) suggests checking a model using the posterior predictive distribution of replicated data yrep,

    p(yrep | y) = ∫ p(yrep | ω) p(ω | y) dω,                                               (4)

as a reference distribution for the observed data y.

The basis of the PPMC method is to compare the observed data y to its reference distribution (Equation 4). In practice, test quantities or discrepancy measures D(y, ω) are defined (Gelman, Meng, & Stern, 1996), and the posterior distribution of D(y, ω) is compared to the posterior predictive distribution of D(yrep, ω), with any significant difference between them indicating a model failure. A researcher may use D(y, ω) = D(y), a discrepancy measure depending on the data only, if appropriate.

A popular summary of the comparison is the tail-area probability or posterior predictive p-value (PPP-value), the Bayesian counterpart of the classical p-value (Bayesian p-value):

    pD = P[D(yrep, ω) ≥ D(y, ω) | y] = ∫∫ I[D(yrep, ω) ≥ D(y, ω)] p(yrep | ω) p(ω | y) dyrep dω,        (5)

where I[A] denotes the indicator function for the event A.

Because of the difficulty in dealing with Equations 4 or 5 analytically for all but simple problems, Rubin (1984) suggests simulating replicated data sets from the posterior predictive distribution in practical applications of the PPMC method. One draws N simulations ω1, ω2, . . . , ωN from the posterior distribution p(ω | y) of ω, and draws yrep,n from the likelihood distribution p(y | ωn), n = 1, 2, . . . , N. The


process results in N draws from the joint posterior distribution p(yrep, ω | y) and, equivalently, from p(yrep | y). The expression in Equation 5 implies that the proportion of the N replications for which D(yrep,n, ωn) exceeds D(y, ωn) provides an estimate of the PPP-value. Extreme PPP-values (close to 0, or 1, or both, depending on the nature of the discrepancy measure) indicate model misfit.

Gelman et al. (1996) suggest that the preferable way to perform PPMC is to compare the realized discrepancies D(y, ωn) and the replicated/predicted discrepancies D(yrep,n, ωn) by plotting the pairs {D(y, ωn), D(yrep,n, ωn)}, n = 1, 2, . . . , N, in a scatter-plot.
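The following minimal sketch (an illustration of the estimation just described, with invented function names) computes a PPP-value and the realized/predicted discrepancy pairs from posterior draws, assuming the user supplies a discrepancy function, the draws ωn, and the replicated data sets yrep,n.

```python
import numpy as np

def ppp_value(discrepancy, y_obs, omega_draws, y_rep_draws):
    """Estimate the posterior predictive p-value for a discrepancy measure.

    discrepancy(y, omega) -> float; omega_draws and y_rep_draws are lists of
    the posterior draws omega^n and the replicated data sets y^rep,n.
    """
    realized = np.array([discrepancy(y_obs, w) for w in omega_draws])
    predicted = np.array([discrepancy(y_rep, w)
                          for y_rep, w in zip(y_rep_draws, omega_draws)])
    # Proportion of replications in which the predicted discrepancy exceeds the
    # realized one (Equation 5); the pairs can also be shown in a scatter-plot.
    return np.mean(predicted >= realized), realized, predicted
```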

Researchers like Robins, van der Vaart, and Ventura (2000) have shown that PPP-values are conservative (i.e., often fail to detect model misfit), even asymptotically, for some choices of discrepancy measure, for example, when the discrepancy measure is not centered. However, a conservative diagnostic with reasonable power is better than tests with unknown properties or poor Type I error rates, because evidence of misfit often leads to removal of test items from the item pool, and it may be to the advantage of the administrators to fail to reject the hypothesis for a borderline item and thus retain the item for future use. Gelman et al. (1996) comment that the PPMC method is useful if one thinks of the current model as a plausible ending point, with modifications to be made only if substantial lack of fit is found; we are dealing with applications exactly as referred to by them.

Empirical and theoretical studies suggest that PPP-values generally have reasonable long-run frequentist properties (Gelman et al., 1996, p. 754). The PPMC method combines well with the Markov chain Monte Carlo (MCMC) algorithms (e.g., Gelman et al., 2003).

5. The Suggested Model Diagnostics

A discrepancy measure that many will be tempted to apply here is the χ2-type measure

    Σ [Xij − E(Xij | πj, θi)]² / Var(Xij | πj, θi),

where the summation is over the examinees, items, or both. Although measures of this type are intuitive and useful for other types of models, this measure was found not useful by Yan et al. (2003) for the 2LC model. The presence of the latent variable vector θi in the 2LC model ensures that the residual for an individual cannot be too unusual, and even an inadequate model predicts the χ2-type quantity adequately; the same phenomenon is observed by Gelman, Goegebeur, Tuerlinckx, and van Mechelen (2000) in a discrete data regression model. The χ2-type measure was also found to be of little use when computed using E(Xij | πj, λ) and Var(Xij | πj, λ), that is, when the expectation and variance of Xij are computed by marginalizing over the distribution of θi. Therefore, the χ2-type measure is not considered any more in this article. The remainder of this section describes the measures that were found useful in this research.

Direct Data Display

A suitable display showing the observed data and a few replicated data sets may be a powerful tool to provide a rough idea about model fit by revealing interesting patterns of differences between the observed and replicated data (e.g., Gelman et al., 2003). For large data sets, however, plotting all data points is prohibitive—but plots for a sample (for example, a few examinees each from low-scoring, high-scoring, and middle groups) are possible.
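A minimal sketch of such a display (an illustration only; the plotting choices and function names are not from the original article, and matplotlib is assumed to be available) shows the observed 0/1 response matrix next to a few replicated matrices, with rows sorted by raw score and columns by proportion correct.

```python
import numpy as np
import matplotlib.pyplot as plt

def data_display(y_obs, y_reps):
    """Plot the observed response matrix beside a few replicated matrices."""
    order_items = np.argsort(-y_obs.mean(axis=0))     # easiest item first
    order_people = np.argsort(y_obs.sum(axis=1))      # lowest raw score first
    mats = [y_obs] + list(y_reps)
    titles = ['observed'] + [f'rep {k + 1}' for k in range(len(y_reps))]
    fig, axes = plt.subplots(1, len(mats), figsize=(3 * len(mats), 4))
    for ax, m, title in zip(axes, mats, titles):
        ax.imshow(m[np.ix_(order_people, order_items)], aspect='auto', cmap='Greys')
        ax.set_title(title)
        ax.set_xlabel('items')
    axes[0].set_ylabel('examinees')
    plt.show()
```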

Item Fit Analysis

This article first examines if the fitted model can predict a set of simple and natural quantities, the point biserial correlations (e.g., Lord, 1980, p. 33), that is, the simple correlations between the item scores and raw (number-correct) scores of the examinees. Sinharay and Johnson (2003) and Sinharay (2005) apply a similar approach successfully to the unidimensional IRT models.
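For completeness, a small sketch (with invented names, not code from the article) of the point biserial correlation for one item; it can be computed for the observed data and for each replicated data set to obtain a PPP-value as described in Section 4.

```python
import numpy as np

def point_biserial(y, j):
    """Correlation between item j's scores and the examinees' raw scores."""
    raw = y.sum(axis=1)
    return np.corrcoef(y[:, j], raw)[0, 1]
```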

Item Fit Measures Based on Equivalence Class Membership

Here, this article examines item fit measures based on equivalence class memberships of the examinees. We group the examinees according to the equivalence classes because the equivalence class membership of an examinee, although latent, is a natural quantity in this context. The natural groups are then the nine equivalence classes defined in Table 2. Equivalence Class 1 represents no skills and Equivalence Class 9 represents all skills. The classes in between are roughly ordered in order of increasing skills; however, the ordering is only partial. Even though the memberships are unknown (latent) quantities, the Markov chain Monte Carlo (MCMC) algorithm used for fitting the model produces values of the latent abilities in each iteration, which determine the equivalence class memberships and allow the grouping.

The first discrepancy measure examined is the proportion of examinees Õjk in equivalence class k, k = 1, 2, . . . , K = 9, who answer item j correctly, j = 1, 2, . . . , J.

The quantity Õjk is not a truly observed quantity (reflected in the notation), but depends on the parameter values of the model (that is why these quantities are referred to as discrepancy measures here, as in Gelman et al., 1996). They become known if one knows the equivalence class membership of the examinees, which happens in each iteration of the MCMC algorithm. For a replicated data set, there is one such proportion, denoted Õjkrep. For each combination of item j and equivalence class k, comparison of the values of Õjk and Õjkrep over the iterations of the MCMC provides an idea regarding the fit of the item. This article computes the PPP-value for each combination of item and equivalence class. Even though the Õjk s depend on the latent equivalence class membership and are unknown, the PPMC method integrates (sums in this context) out the unknown quantities with respect to their posterior distribution to provide natural Bayesian p-values. These p-values provide


useful information about the fit of the items, and may suggest possible improvements of the model. For example, significant p-values for most items for Equivalence Class 1 (examinees with no Skill 1) will mean that the model is not flexible enough to explain examinees in that class and that one should add a component to the model to address that issue (this happens in the mixed-number example later).

The above mentioned item fit p-values provide useful feedback about item fit, but it will be useful to summarize the fit information for each item into one plot or one number, preferably a p-value. To achieve that, this paper uses the two test statistics (χ2-type and G2-type), somewhat similar to those in Sinharay (2003), as discrepancy measures.

Consider for this discussion that the values of the θi s and πj s are known (which is true for each iteration of the MCMC algorithm, the generated values of the θi s and πj s acting as known values), which means, for example, that the equivalence class memberships of all examinees are known. For each item, one can then compute Õjk, k = 1, 2, . . . , K, and Õjkrep, k = 1, 2, . . . , K, which are described above. Let Nk, k = 1, 2, . . . , K, denote the number of examinees in Equivalence Class k. Given the θi s and πj s, the expected probability of an examinee in Equivalence Class k answering Item j correctly, Ejk, is the suitable πjδi(s). For example, for examinees in Equivalence Class 1 (those without Skill 1), Ej1 = πj0 for all j, because these examinees do not have the necessary skills to solve any item. On the other hand, for examinees in Equivalence Class 2 (with Skill 1 only), Ej2 = πj1 I[j∈{1,5}] + πj0 I[j∉{1,5}], because these examinees have the necessary skills to solve Items 1 and 5 only. Let us denote the set of all item parameters and ability variables, that is, the set of all θi s and πj s, as ω.

The χ2-type measure for item j, denoted henceforth as Djeq,χ(y, ω), is given by

    Djeq,χ(y, ω) = Σ_{k=1}^{K} Nk (Õjk − Ejk)² / [Ejk (1 − Ejk)].                          (6)

The G2-type measure, denoted henceforth as Djeq,G(y, ω), is given by

    Djeq,G(y, ω) = 2 Σ_{k=1}^{K} Nk [ Õjk log(Õjk / Ejk) + (1 − Õjk) log{(1 − Õjk) / (1 − Ejk)} ].     (7)

Each of these two statistics summarizes the fit information for an item, with large values indicating poor fit. A comparison between the posterior distribution of Djeq,χ(y, ω) [or Djeq,G(y, ω)] and the posterior predictive distribution of Djeq,χ(yrep, ω) [or Djeq,G(yrep, ω)] provides a summary regarding the fit of item j. The comparison can be done using a graphical plot (e.g., Figure 4). Another convenient summary is the posterior predictive p-value (PPP-value) for the discrepancy measure, which can be estimated using the process described in Section 4. Here, a PPP-value very close to 0 indicates a problem (indicating that the variability in the data set appears unusually large compared to that predicted by the model).
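A minimal sketch of the χ2-type discrepancy in Equation 6 (names are invented for this illustration; it assumes a hard assignment of examinees to equivalence classes at one MCMC draw and the corresponding expected probabilities Ejk):

```python
import numpy as np

def d_eq_chisq(y, item, classes, e_jk, K=9):
    """Chi-square-type discrepancy for one item (Equation 6).

    y       : 0/1 response matrix (examinees x items)
    classes : array giving the equivalence class (1..K) of each examinee at this draw
    e_jk    : model probability of a correct answer to `item` for each class
    """
    d = 0.0
    for k in range(1, K + 1):
        members = classes == k
        n_k = members.sum()
        if n_k == 0:
            continue                      # empty classes contribute nothing
        o_jk = y[members, item].mean()    # observed proportion correct in class k
        d += n_k * (o_jk - e_jk[k - 1]) ** 2 / (e_jk[k - 1] * (1 - e_jk[k - 1]))
    return d
```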

Adding the discrepancy measures over the items, an overall discrepancy measure for assessing the fit of the model can be obtained as

    Doveralleq,χ(y, ω) = Σj Djeq,χ(y, ω).                                                  (8)

The PPP-value corresponding to this measure will provide an idea about the overall fit of the model to the data set.

The computation of these PPP-values does not need any approximation (e.g., a rough χ2 approximation, as in Sinharay et al., 2004), and hence results in natural Bayesian p-values summarizing the fit of the items. The cells (item-group combinations) with small frequencies (especially for low or high raw scores) are not a problem with the PPMC approach because it does not need a χ2 assumption.

However, for more stability of the discrepancy measures and the PPP-values, equivalence classes with too few examinees can be pooled. Examining the distribution of the values of Djeq,χ(yrep,m, ωm), m = 1, 2, . . . , M, is a way to check for stability. These quantities should look like draws from a χ2-distribution with (K − d − 1) d.f., where K is the number of equivalence classes for the problem and d is the adjustment in the d.f. due to pooling, if there is a sufficient number of examinees for each class—departure from the χ2 distribution will point to the instability of the discrepancy measure. It is possible to examine the mean, variance, and histogram of the replicated discrepancies to ensure the stability of the measure.

Item Fit Measures Based on Raw Scores

The latent quantities (equivalence class memberships), on which the above item fit measure is based, may themselves be estimated with error, especially for a small data set, and hence may lead to a test with poor error rates. Therefore, this part examines item fit using examinee groups based on observed raw scores (as in Orlando & Thissen, 2000). Even though the raw scores do not have a clear interpretation in the context of the multidimensional 2LC model, they are natural quantities for any assessment with dichotomous items. The approach used here to assess item fit is very similar to that in Sinharay (2003).

The first discrepancy examined is the proportion of examinees in raw score group k, k = 1, 2, . . . , (J − 1), who answer item j correctly, j = 1, 2, . . . , J. Denote the observed proportion of examinees in group k answering item j correctly as Ojk. For each replicated data set, there is one (replicated) proportion correct for the j-th item and k-th group, denoted Ojkrep. For each item j, comparison of the values of Ojk and Ojkrep, k = 1, 2, . . . , (J − 1), provides an idea regarding the fit of the item (Oj0 = 0 and OjJ = 1; so they are not included). One way to make the comparison is to compute the posterior predictive p-value (PPP-value) for each item-group combination. This article suggests use of item fit plots to make the comparison. The approach will be described in detail with the data examples later.



The item fit plots promise to provide useful feedback about item fit, but it will be useful to summarize the fit information for each item into one number, preferably a p-value. To achieve that, this article uses two test statistics (χ2-type and G2-type) similar to those suggested by Orlando and Thissen (2000) as discrepancy measures. Because the measures depend on Ejk = E(Ojk | ω), the next paragraph describes the computation of Ejk.

The computation of Ejk, the expected proportion of examinees in group k answering item j correctly, is not trivial. Orlando and Thissen (2000) used the recursive approach of Lord and Wingersky (1984) to compute the expectations in the context of IRT models, and this paper uses a slight variation of that; a brief description of the approach follows. Let Tj(θ) be the probability of success on Item j for proficiency θ (remember that for the 2LC model, this will be either πj1 or πj0, depending on θ and the skill requirements of the item). Suppose Sk(θ) denotes the probability of obtaining a raw score of k in a J-item test by an examinee with proficiency variable θ. For convenience, the dependence of Sk(θ) on the πj s is not reflected in the notation. Further, denote Sk*j(θ) to be the probability of a raw score k at proficiency θ on items 1, 2, . . . , j − 1, j + 1, . . . , J (i.e., omitting Item j from the set of all items). Then,

    Ejk = Σθ Tj(θ) Sk−1*j(θ) p(θ) / Σθ Sk(θ) p(θ),                                         (9)

where the summation is over all possible values of θ (which is 24 for this problem) and p(θ) is the prior distribution on θ. For an IRT model, the summation in Equation 9 is replaced by an integration (Orlando & Thissen, 2000). For computing Sk*j(θ) and Sk(θ), this article uses the recursive approach of Lord and Wingersky (1984) and Orlando and Thissen (2000); the approach involves the terms Tj(θ). The quantity Ejk does not depend on the latent proficiency variables θi.

The suggested χ2-type measure, denoted henceforth as Djraw,χ(y, ω), is given by

    Djraw,χ(y, ω) = Σ_{k=1}^{J−1} Nk (Ojk − Ejk)² / [Ejk (1 − Ejk)].                       (10)

The G2-type measure, denoted henceforth as Djraw,G(y, ω), is given by

    Djraw,G(y, ω) = 2 Σ_{k=1}^{J−1} Nk [ Ojk log(Ojk / Ejk) + (1 − Ojk) log{(1 − Ojk) / (1 − Ejk)} ].     (11)

Each of the above two statistics summarizes the fit information for an item, with large values indicating poor fit. One can use Σj Djraw,χ(y, ω) or Σj Djraw,G(y, ω) to provide an idea regarding the overall fit of the model. As discussed for the equivalence class-based measures earlier, a researcher can use graphical plots or PPP-values for Djraw,χ(y, ω) [or Djraw,G(y, ω)] to summarize item fit, and these measures can be monitored for stability and examinee groups can be pooled, if required.
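A minimal sketch of the Lord–Wingersky recursion for Sk(θ) and of the resulting Ejk (invented names; the values of θ, the success probabilities Tj(θ), and the prior p(θ) are assumed to be supplied):

```python
import numpy as np

def score_dist(probs):
    """Lord-Wingersky recursion: distribution of the raw score given success
    probabilities `probs` (one per item) at a fixed proficiency theta."""
    s = np.array([1.0])                       # P(raw score = 0) before any item
    for p in probs:
        s = np.concatenate([s * (1 - p), [0.0]]) + np.concatenate([[0.0], s * p])
    return s                                   # length J+1: P(score = 0..J)

def expected_prop(j, t_mat, p_theta):
    """E_jk of Equation 9 for item j and raw score groups k = 1..J-1.

    t_mat   : array (n_theta x J) of T_j(theta) for every value of theta
    p_theta : prior probabilities of the theta values
    """
    J = t_mat.shape[1]
    num = np.zeros(J + 1)
    den = np.zeros(J + 1)
    for t_row, p in zip(t_mat, p_theta):
        s_all = score_dist(t_row)                    # S_k(theta)
        s_omit = score_dist(np.delete(t_row, j))     # S*_k^j(theta), scores 0..J-1
        num[1:] += t_row[j] * s_omit * p             # T_j(theta) S*_{k-1}(theta) p(theta)
        den += s_all * p
    with np.errstate(invalid='ignore', divide='ignore'):
        e = num / den
    return e[1:J]                                    # groups k = 1..J-1
```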

Examining Association Among the Items

Consider an item pair in an assessment. Denote by nkk′ the number of individuals scoring k on the first item and k′ on the second item, k, k′ = 0, 1. We use

    OR = n11 n00 / (n10 n01),

the sample odds ratio (see, for example, Agresti, 2002, p. 45), referred to as the odds ratio hereafter, as a test statistic with the PPMC method. Chen and Thissen (1997) apply these to detect violation of the local independence assumption of simple IRT models. Sinharay and Johnson (2003) and Sinharay (2005), applying the PPMC method, find the odds ratios to be useful discrepancy measures in detecting misfit of the simple IRT models. Examining the model prediction of odds ratios, which are measures of association, should help a researcher to detect if the fitted model can adequately explain the association among the test items. The BNs assume the ability variables to be multidimensional and attempt to explain the associations among items; therefore, examining how the model failed to reproduce the odds ratios might also provide some insight regarding any associations that the model failed to capture and might lead to an improved model.
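A small sketch of the sample odds ratio for an item pair (illustrative only; the 0.5 continuity correction added here to guard against empty cells is a common convention and an assumption, not something stated in the article):

```python
import numpy as np

def odds_ratio(y, j1, j2, correction=0.5):
    """Sample odds ratio for the responses to items j1 and j2 of matrix y."""
    n11 = np.sum((y[:, j1] == 1) & (y[:, j2] == 1)) + correction
    n00 = np.sum((y[:, j1] == 0) & (y[:, j2] == 0)) + correction
    n10 = np.sum((y[:, j1] == 1) & (y[:, j2] == 0)) + correction
    n01 = np.sum((y[:, j1] == 0) & (y[:, j2] == 1)) + correction
    return (n11 * n00) / (n10 * n01)
```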

Differential Item Functioning

Holland (1985) suggested the Mantel–Haenszel test statistic for studying DIF in the context of item response data. The test has the strength that it does not require fitting any model and is straightforward to apply.

Suppose one is interested in examining if a given item j shows DIF over a “focal group” F and a “reference group” R. In an application of the Mantel–Haenszel test, the examinees are divided into K matching groups, typically based on their raw scores. For an item, the data from the k-th matched group of reference and focal group members are arranged as a 2 × 2 table, as shown in Table 3.


TABLE 3

Table for Mantel–Haenszel Statistic

Group        Right on an Item    Wrong on an Item    Total

Reference    Ak                  Bk                  nRk
Focal        Ck                  Dk                  nFk
Total        Rk                  Wk                  n+k


The Mantel–Haenszel test statistic for an item is then given by

    χ2MH = ( | Σk Ak − Σk μk | − 0.5 )² / Σk σk²,                                          (12)

where μk = nRk Rk / n+k and σk² = nRk nFk Rk Wk / [n+k² (n+k − 1)]. This article applies the above test statistic to determine DIF for BNs.
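A minimal sketch of the Mantel–Haenszel statistic in Equation 12 (illustrative only; the 2 × 2 tables for the matching groups are assumed to be given as counts in the layout of Table 3):

```python
import numpy as np

def mantel_haenszel(tables):
    """Mantel-Haenszel chi-square (Equation 12).

    tables: list of (A_k, B_k, C_k, D_k) counts for each matching group,
    following Table 3 (reference right/wrong, focal right/wrong).
    """
    a_sum, mu_sum, var_sum = 0.0, 0.0, 0.0
    for a, b, c, d in tables:
        n_r, n_f = a + b, c + d          # reference and focal group sizes
        r, w = a + c, b + d              # total right and wrong in the group
        n = n_r + n_f
        if n <= 1:
            continue                     # degenerate stratum carries no information
        a_sum += a
        mu_sum += n_r * r / n
        var_sum += n_r * n_f * r * w / (n ** 2 * (n - 1))
    return (abs(a_sum - mu_sum) - 0.5) ** 2 / var_sum
```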

However, the test employs matched groups based on raw scores of examinees, which are not the most natural quantities in the context of BNs. No DIF in an application of a BN means that success probabilities of the reference group and the focal group are the same conditional on the skill variables, and raw scores do not necessarily contain all information about the skills; a test incorporating this fact may perform better than the Mantel–Haenszel test.

One simple alternative to the Mantel–Haenszel test is to fit the model separately to the two groups (focal and reference) and compare the estimates of the πj s for the two groups using, for example, a normal or a χ2-test (as in Lord, 1980, in testing for DIF in IRT models). The approach may be costly in a setting such as this, where an MCMC algorithm is used to fit the model, especially if an investigator is interested in a number of different focal groups and/or reference groups. Furthermore, the approach may be problematic if one or both of the two groups are small—the estimates obtained by fitting a BN to a small data set may not be stable. This line of work is not pursued further in this article.

This article examines a discrepancy measure based on the Mantel–Haenszel test statistic, forming matched groups with respect to the latent skills. This approach does not need running the MCMC more than once. Matching with respect to skills (or, equivalently, equivalence classes) is a natural concept in this context; however, the problem in doing so is that the skills are unknown. Fortunately, the PPMC method and the draws from the MCMC algorithm provide us a way to get around that problem.

Suppose we know the skill variables for each examinee and hence know their equivalence class memberships (this will be true for each iteration of MCMC, the generated values acting as the known values). For each equivalence class, we can compute a table such as Table 3 and can compute a discrepancy such as Equation 12. Let us denote the measure for Item j as DjMH(y, ω). It is straightforward to obtain an estimate of the PPP-value corresponding to DjMH(y, ω) by computing the proportion of times DjMH(yrep,m, ωm) exceeds DjMH(y, ωm), where ωm, m = 1, 2, . . . , M, is a posterior sample and yrep,m, m = 1, 2, . . . , M, are the corresponding posterior predictive data sets.

Discussion

One disadvantage of these techniques is the conservativeness of the PPMC method. However, as argued earlier, a conservative test is better than a test with unknown properties, and the conservativeness of the PPMC technique may be a boon here. Another disadvantage is that these techniques are based on the MCMC algorithm and hence are computation-intensive. Another issue, which might make some potential users uneasy, is that these techniques are Bayesian in nature and use


prior distributions. In the absence of substantial prior information, one can use noninformative prior distributions to reduce the effect of the prior distribution on the inference. Alternatively, one can perform sensitivity analysis to detect the effect of the prior distributions.

However, the disadvantages are somewhat compensated by the advantages that (a) the suggested techniques provide natural probability statements from a Bayesian point of view, and (b) computations for the techniques require neither complicated theoretical derivations nor extensive simulation studies once an MCMC algorithm has been used to fit the model.

Assessing Identifiability of Model Parameters

The BNs deal with latent classes (see, e.g., Sinharay et al., 2004 and the references therein) and hence there is a possibility of nonidentifiability or weak identifiability, which means that data provide little information about some parameters.

Lindley (1971) argues that even a weakly identifiable Bayesian model may be valid in the sense that it can still describe the data via its identifiable parameters and information supplied a priori. On the other hand, although weakly identifiable models can be valid, they are not optimal for summarizing population characteristics.

By failing to recognize weak identifiability, one may draw conclusions based on a weakly identified parameter when one has no direct evidence beyond prior beliefs to do so (Garrett & Zeger, 2000). Furthermore, such a model increases the chances of nonapplicability of standard asymptotic theory (e.g., as follows from Haberman, 1981) and problems with MCMC convergence. Therefore, a diagnostic for model identifiability, although not directly a model-checking tool, is a necessary tool for assessing the performance of a BN.

This article uses plots of prior versus posterior distributions of the model parameters, the πj s and the λ s, to assess their identifiability. The closer the two densities, the less the information provided by the data on the parameter (because posterior ∝ likelihood × prior), and the weaker the identifiability of the parameter. It is possible to use numerical measures such as the percentage overlap between the prior and posterior (e.g., Garrett & Zeger, 2000), but this article does not use those measures.
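For readers who do want a numerical summary, a minimal sketch (an illustration of the overlap idea attributed above to Garrett and Zeger, 2000, not a procedure used in this article; names and example values are invented) estimates the overlap between prior and posterior draws of a parameter on a common grid.

```python
import numpy as np

def prior_posterior_overlap(prior_draws, posterior_draws, bins=50):
    """Approximate the overlap between two densities from Monte Carlo draws.

    Returns a value in [0, 1]; values near 1 suggest the data add little
    information about the parameter (weak identifiability)."""
    lo = min(prior_draws.min(), posterior_draws.min())
    hi = max(prior_draws.max(), posterior_draws.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(prior_draws, bins=edges, density=True)
    q, _ = np.histogram(posterior_draws, bins=edges, density=True)
    width = edges[1] - edges[0]
    return np.sum(np.minimum(p, q)) * width

# Example: overlap between the Beta(3.5, 23.5) prior for pi_j0 and hypothetical
# posterior draws of that parameter (a stand-in for actual MCMC output).
rng = np.random.default_rng(2)
prior = rng.beta(3.5, 23.5, size=10_000)
posterior = rng.beta(10.0, 60.0, size=10_000)
print(prior_posterior_overlap(prior, posterior))
```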

The following sections apply the suggested model checking tools to the mixed-number subtraction data set and a data set simulated from the 2LC model.

6. Direct Data Display

Figure 1 presents a direct data display for the analysis discussed in the section on fitting the 2LC model to the mixed-number subtraction data. The figure shows the mixed-number subtraction data and eight replicated data sets (chosen randomly) from the 10,000 obtained during the fitting of the model. The plot is somewhat similar to Figures 6.6 and 6.7 of Gelman et al. (2003), which are used to assess the fit of a logistic regression model. Figure 1 shows that there are certain patterns in the observed data that are not present in the replicated data sets. For example, the examinees with the lowest scores could answer only two items correctly (interestingly, these are Items 5 and 6, the idiosyncrasies of which were discussed earlier—both of these can be solved without any knowledge of mixed-number subtraction). In the replicated data sets,


however, these examinees get other items correct as well. The smartest examinees get all items correct in the observed data, which is not true for the replicated data sets. Furthermore, the weakest half of the examinees rarely get any hard items (those belonging to Evidence Models 4 and 5) correct in the observed data, which is not true in the replicated data sets. These patterns suggest that the 2LC model is not entirely satisfactory for the data set.

To examine the performance of the 2LC model when data come from the same model, a limited simulation study is performed. The above analysis of the mixed-number data using the 2LC model produces 10,000 posterior predictive data sets; each of these can be considered to have been generated from the 2LC model. We randomly pick one of them, fit the 2LC model to the data, and generate 10,000 posterior predictive data sets as before.

Figure 2 shows the data simulated from the 2LC model and eight replicated data sets from its analysis. The data set on the left seems to fit in with the replicated data sets on the right. The direct data display does not indicate any deficiency of the model, rightly so.

7. Item Fit Analysis

Fit of the 2LC Model to the Mixed-Number Subtraction Data

This section examines, using various measures, item fit for the 2LC model with the mixed-number subtraction data set.


FIGURE 1. Mixed-number subtraction data and eight replicated data sets. Note. The left column in Figure 1 shows all observed responses, with a little black box indicating a correct response. The items, sorted according to decreasing proportion correct, are shown along the x-axis; the examinees, sorted according to increasing raw scores, are shown along the y-axis. Right columns show eight replicated data sets from the 2LC model.

There are some clear differences between the observed and replicated data sets.


Point Biserial Correlations

Columns 6 and 7 in Table 1 show the observed point biserials (denoted rpbis) and the p-values (denoted Ppbis 2LC) corresponding to each item. The p-values significant at the 5% level are marked with an asterisk. The values show that the 2LC model underpredicts most of the point biserial correlations for this data set, and significantly so for Items 3, 5, 8, and 12.

Figure 3 shows a histogram of the predicted average biserial correlations (from the 10,000 replicated data sets). A vertical line shows the observed average (averaged over all items), which appears improbable under the 2LC model (PPP-value of .00). This indicates some inadequacy of the 2LC model for these data.

FIGURE 2. 2LC data and eight replicated data sets.

FIGURE 3. Observed and predicted average biserial correlations when the 2LC model is fitted to the mixed-number subtraction data.



Measures Based on Equivalence Classes

Figure 4 shows, for Items 12 and 13, plots of the realized discrepancy Djeq,χ(y, ω) and the predicted discrepancy Djeq,χ(yrep, ω). A diagonal line is there for convenience—points consistently below the diagonal (showing that the realized discrepancy is mostly larger than the predicted discrepancy) indicate a problem.

The plot toward the left shows that the model cannot explain the responses for the item adequately, the model-predicted discrepancy being almost always smaller than the realized discrepancy. The PPP-value is .04. The plot toward the right shows the opposite—the model seems to reproduce the discrepancy and hence explain the responses for the item adequately (PPP-value = .49).

Column 8 in Table 1 shows the item fit p-values (denoted "PDjeq,χ(y,ω) 2LC") for the measure Djeq,χ(y, ω). Because the p-values for the χ2-type and G2-type measures are very close, we show and discuss the results for the former only. The p-value for assessing overall fit, that corresponding to the discrepancy measure in Equation 8, is .01 and indicates that the model cannot explain the data set adequately.

Figure 5 shows the p-values for each item for each equivalence class. The plot is like one used in Murdoch and Chow (1996), where elliptical glyphs represent correlations. For a correlation ρ, an ellipse is drawn with the shape of the contour of a bivariate normal distribution with unit variances and correlation ρ. Here, we plot the quantity 2 * (p-value − .5) for each combination of item and equivalence class.

A p-value close to zero (one) will be transformed to −1 (1) and will be represented by almost a line with slope −1 (1); one such example is provided by Item 5, Class 9 (Item 8, Class 1). The figure provides more insight regarding items with extreme p-values. For example, the 2LC model often results in inaccurate predictions for Equivalence Class 1, with a number of extreme p-values, and does not perform too

FIGURE 4. Comparison of realized and predicted discrepancies when the 2LC model is fit to mixed-number subtraction data.
