## Evaluating Cognitive Theory:

## A Joint Modeling Approach Using Responses and Response Times

### Rinke H. Klein Entink

University of Twente

### Jörg-Tobias Kuhn

University of Münster

### Lutz F. Hornke

RWTH Aachen University

### Jean-Paul Fox

University of Twente

In current psychological research, the analysis of data from computer-based assessments or experiments is often confined to accuracy scores. Response times, although being an important source of additional information, are either neglected or analyzed separately. In this article, a new model is developed that allows the simultaneous analysis of accuracy scores and response times of cognitive tests with a rule-based design. The model is capable of simultaneously estimating ability and speed on the person side as well as difficulty and time intensity on the task side, thus dissociating information that is often confounded in current analysis procedures. Further, by integrating design matrices on the task side, it becomes possible to assess the effects of design parameters (e.g., cognitive processes) on both task difficulty and time intensity, offering deeper insights into the task structure. A Bayesian approach, using Markov chain Monte Carlo methods, has been developed to estimate the model. An application of the model in the context of educational assessment is illustrated using a large-scale investigation of figural reasoning ability.

*Keywords: rule-based item design, response times, item response theory*

An important facet reflecting cognitive processes is captured by response times (RTs). In experimental psychology, RTs have been a central source for inferences about the organization and structure of cognitive processes (Luce, 1986). However, in educational measurement, RT data have been largely ignored until recently, probably because recording RTs for single items in paper-and-pencil tests seemed difficult. With the advent of computer-based testing, item RTs have become easily available to test administrators. Taking RTs into account can lead to a better understanding of test and item scores, and it can result in practical improvements of a test, for example, by investigating differential speededness (van der Linden, Scrams, & Schnipke, 1999).

The systematic combination of educational assessment techniques with RT analysis remains scarce in the literature. The purpose of the present article is to present a model that allows the integration of RT information into an item response theory (IRT) framework in the context of educational assessment. More specifically, the approach advanced here allows for the simultaneous estimation of ability and speed on the person side, while offering difficulty and time-intensity parameters pertaining to specific cognitive operations on the item side. First, we briefly outline the cognitive theory for test design and IRT models that are capable of integrating cognitive theories into educational assessment. Next, current results from the RT literature with respect to educational measurement are summarized. We then develop a new model within a Bayesian framework that integrates both strands of research, and we demonstrate its application with an empirical example.

Rinke H. Klein Entink and Jean-Paul Fox, Department of Research Methodology, Measurement, and Data Analysis, University of Twente, Enschede, the Netherlands; Jörg-Tobias Kuhn, Department of Psychology, University of Münster, Münster, Germany; Lutz F. Hornke, Department of Industrial and Organizational Psychology, RWTH Aachen University, Aachen, Germany.

Software for the application presented in this article is freely available as a package for the R statistical environment on www.kleinentink.eu

Correspondence concerning this article should be addressed to Rinke H. Klein Entink, University of Twente, Department of Research Methodology, Measurement, and Data Analysis, P.O. Box 217, 7500 AE Enschede, the Netherlands. E-mail: r.h.kleinentink@gw.utwente.nl

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

### Cognitive Theory in Educational Assessment

One of the core interests of psychological research pertains to the analysis of cognitive processes. Research paradigms in cognitive psychology often assume that to successfully solve a task or test item, a subject must perform certain associated cognitive processes, either serially or in parallel.

Subjects then can be differentiated on the basis of the processing times necessary for specific processes. For example, an important strand of research in cognitive psychology is concerned with analyzing parameters from individual RT distributions on simple tasks, and these parameters can be theoretically connected to psychological processes, such as attention fluctuations or executive control (Schmiedek, Oberauer, Wilhelm, Süss, & Wittmann, 2007; Spieler, Balota, & Faust, 2000). Complex models of reaction times obtained in experimental settings have been developed, for example, focusing on a decomposition of reaction times, comparing competing models or complex cognitive architectures (Dzhafarov & Schweickert, 1995). However, many of the reaction times analyzed in experimental psychology are based on very elementary tasks that are often psychophysical in nature (van Zandt, 2002). The central difference between RT analysis in experimental psychology and educational assessment lies with the cognitive phenomena under investigation and the complexity of the tasks involved.

In experimental psychology, research commonly focuses on elementary cognitive processes related to stimulus discrimination, attention, categorization, or memory retrieval (e.g., Ratcliff, 1978; Rouder, Lu, Morey, Sun, & Speckman, 2008; Spieler et al., 2000). In this research tradition, mostly simple choice tasks are utilized that usually do not tap subjects' reasoning or problem-solving abilities. Further, in experimental research on reaction times with mathematical processing models, the focus has often been either on RTs *or* accuracy scores but not on both at the same time, with item parameters sometimes not being modeled at all (e.g., Rouder, Sun, Speckman, Lu, & Zhou, 2003; but see Ratcliff, 1978, for an approach that allows simultaneous modeling of experimental accuracy and RT data). This is because such models often imply a within-subject design with many replications of the same simple items, a procedure not usually followed in educational measurement.

Things look different in educational assessment. Here, differentiating subjects according to latent variables (e.g., intelligence) as measured by psychological tests is of primary interest. Latent variables represent unobservable entities that are invoked to provide theoretical explanations for observed data patterns (Borsboom, Mellenbergh, & van Heerden, 2003; Edwards & Bagozzi, 2000). Recently, cognitive theories pertaining to test design as well as latent variable models have been merged in the field of educational assessment to provide meaningful results. To improve construct-valid item generation, contemporary test development often incorporates findings from theories of cognition (Mislevy, 2006). With respect to construct validation, Embretson (1983, 1998) has proposed a distinction between *construct representation*, involving the identification of cognitive components affecting task performance, and *nomothetic span*, which refers to the correlation of test scores with other constructs. Whereas traditional methods of test construction have almost exclusively focused on establishing correlations of test scores with other measures to establish construct validity (nomothetic span), contemporary test development methods focus on integrating parameters reflecting task strategies, processes, or knowledge bases into item design (construct representation). Hence, the cognitive model on which a test is founded lends itself to direct empirical investigation, which is a central aspect of test validity (Borsboom, Mellenbergh, & van Heerden, 2004).

After a set of cognitive rules affecting item complexity has been defined on the basis of prior research, these rules can be systematically combined to produce items of varying difficulty. In a final step, the theoretical expectations can then be compared with empirical findings.

The integration of cognitive theory into educational assessment is usually based on an information-processing approach and assumes unobservable mental operations as fundamental to the problem-solving process (Newell & Simon, 1972). The main purpose of educational assessment under an information-processing perspective is to design tasks that allow conclusions pertaining to the degree of mastery of some or all task-specific mental operations that an examinee has acquired. That is, by specifying a set of known manipulations of task structures and contents a priori, psychological tests can be built in a rule-based manner, which in turn allows more fine-grained analyses of cognitive functioning (Irvine, 2002). In the process of test design, it is therefore entirely feasible (and generally desirable) to experimentally manipulate the difficulty of the items across the test by selecting which cognitive operations must be conducted to solve which item correctly. Hence, as is outlined in the next section, some extensions of classical IRT models are capable of modeling the difficulty of cognitive components in a psychometric test. The basic requirement for such a procedure, however, is a strong theory relating specific item properties to the difficulty of the required cognitive operations (Gorin, 2006). Because classical test theory focuses on the true score that a subject obtains on a whole test, that is, on the sum score of correct test items, it is not well suited to model cognitive processes on specific test items. In contrast, numerous IRT models have been developed that are capable of doing so (cf. Junker & Sijtsma, 2001; Leighton & Gierl, 2007).

In the context of educational assessment, language-free tests lend themselves to rule-based item design, which can be understood as the systematic combination of test-specific rules that are connected to cognitive operations. A large body of research in rule-based test design has focused on figural matrices tests, which allow the assessment of reasoning ability with nonverbal content. In these tests, items usually consist of nine cells organized in 3 × 3 matrices, with each cell except the last one containing one or more geometric elements. The examinee is supposed to detect the rules that meaningfully connect these elements across cells and to correctly apply these rules to find the content of the empty cell. A typical item found in such a test is given in Figure 1.

Several investigations into the structure and design of cognitive rules in figural matrices tests have been conducted. Jacobs and Vandeventer (1972) and Ward and Fitzpatrick (1973) have reported taxonomies of rules utilized in the item design of existing figural matrices tests. In a review of the current literature, Primi (2001) has described four main design factors (*radicals* according to Irvine, 2002) that affect item difficulty: (1) number of elements, (2) number of rules, (3) type of rules, and (4) perceptual organization of elements. In line with recent research, the first two radicals are associated with the amount of information that must be processed while working on an item (Carpenter, Just, & Shell, 1990; Mulholland, Pellegrino, & Glaser, 1980): More information requires more working memory capacity and additionally results in longer RTs (Embretson, 1998). Working memory, a construct grounded in cognitive psychology that has repeatedly been shown to correlate highly with intelligence (e.g., Engle, Tuholski, Laughlin, & Conway, 1999), refers to the cognitive system that is capable of simultaneously processing and storing information.

Carpenter et al. (1990) assumed that in addition to working memory capacity, abstraction capacity (that is, the ability to represent information in a more conceptual way) plays a role in item solving: Examinees who are capable of processing item features in a more abstract fashion are more capable of discovering the correct solution. The third radical (type of rules) has been examined in several studies (e.g., Bethell-Fox, Lohman, & Snow, 1984; Carpenter et al., 1990; Embretson, 1998; Hornke & Habon, 1986; Primi, 2001). In one study (Carpenter et al., 1990), which analyzed performance on the Advanced Progressive Matrices (Raven, 1962), evidence was presented that easier rules taxing the working memory system less are considered before harder ones. On the basis of this finding, Embretson (1998) proposed that the difficulty of understanding and applying item rules correctly is related to working memory capacity. Finally, the fourth radical, perceptual organization, refers to how the figural elements in an item are grouped. For example, Primi (2001) distinguished between *harmonic* and *disharmonic* items, in which the latter introduce conflicting combinations between visual and conceptual figural elements, whereas the former display more congruent relationships. Primi showed that perceptual organization had a strong effect on item difficulty, even stronger than the number and type of rules (i.e., radicals taxing working memory capacity). In contrast, both Carpenter et al. and Embretson found larger effects of item features relating to working memory.

### Psychometric Analysis of Rule-Based Test Items

The analysis of tests and items with a cognitive design is usually cast in an IRT framework (Rupp & Mislevy, 2007). One of the most basic IRT models is the Rasch model (Rasch, 1960). It is the building block for numerous more advanced IRT models. The Rasch model assumes unidimensionality and local item independence, which can be regarded as equivalent to each other (McDonald, 1981). In the Rasch model, the probability of a correct response of examinee $i$, $i = 1, 2, \ldots, N$, to a test item $k$, $k = 1, 2, \ldots, K$, is given by

$$P(Y_{ik} = 1 \mid \theta_i, b_k) = \frac{\exp(\theta_i - b_k)}{1 + \exp(\theta_i - b_k)}, \qquad (1)$$

where $\theta_i$ denotes the ability of test taker $i$, and $b_k$ denotes the difficulty of item $k$. The Rasch model represents a saturated model with respect to the items, because each item has its own difficulty parameter. Therefore, the model does not allow any statements pertaining to the cognitive operations that are assumed to underlie performance on the items.
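As a quick numerical illustration of Equation 1 (a sketch; the function name is ours, not from the article):

```python
import math

def rasch_p(theta, b):
    """Rasch model (Equation 1): probability of a correct response,
    given ability theta and item difficulty b."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# When ability equals difficulty, the model predicts a 50% success chance;
# the probability rises monotonically with ability.
```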

Another IRT model, the linear logistic test model (LLTM; Fischer, 1973), which is nested in the Rasch model, allows the decomposition of the item difficulties $b_k$ such that

$$P(Y_{ik} = 1 \mid \theta_i, q_k, \eta) = \frac{\exp\left(\theta_i - \sum_{j=1}^{J} q_{kj}\eta_j\right)}{1 + \exp\left(\theta_i - \sum_{j=1}^{J} q_{kj}\eta_j\right)}, \qquad (2)$$

where the $\eta_j$, $j = 1, \ldots, J$, are so-called "basic parameters" representing the difficulty of a specific design rule or cognitive operation in the items, and the $q_{kj}$ are indicators reflecting the presence or absence of rule $j$ in item $k$. The LLTM is therefore capable of determining the difficulty of specific cognitive operations that must be carried out to solve an item.
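The decomposition in Equation 2 can be made concrete with a small sketch (the Q-matrix row and basic parameters below are invented for illustration):

```python
import math

def lltm_p(theta, q_k, eta):
    """LLTM (Equation 2): the item difficulty is the weighted sum of the
    basic parameters eta, with design indicators q_k as weights."""
    b_k = sum(q * e for q, e in zip(q_k, eta))
    return math.exp(theta - b_k) / (1.0 + math.exp(theta - b_k))

eta = [0.8, 1.2, 0.5]   # illustrative basic parameters for J = 3 design rules
q_item = [1, 0, 1]      # this item contains rules 1 and 3
b_item = sum(q * e for q, e in zip(q_item, eta))  # implied difficulty: 1.3
```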

*Figure 1.* Example of a figural reasoning item.


Both the Rasch model and the LLTM assume that all items discriminate equally well across examinees. This is a rather strict assumption that can be relaxed. The 2-parameter logistic (2PL) model (Lord & Novick, 1968) is defined as

$$P(Y_{ik} = 1 \mid \theta_i, a_k, b_k) = \frac{\exp(a_k(\theta_i - b_k))}{1 + \exp(a_k(\theta_i - b_k))}, \qquad (3)$$

with $a_k$ denoting the item discrimination parameter of item $k$. The 2PL model therefore is an extension of the Rasch model in that it allows the estimation of item-specific difficulty and discrimination parameters. Conceptually connecting the 2PL model with the LLTM, Embretson (1999) suggested the 2PL-constrained model, which is given by

$$P(Y_{ik} = 1 \mid \theta_i, q_k, \eta, \tau) = \frac{\exp\left[\left(\sum_{j=1}^{J} q_{kj}\tau_j\right)\left(\theta_i - \sum_{j=1}^{J} q_{kj}\eta_j\right)\right]}{1 + \exp\left[\left(\sum_{j=1}^{J} q_{kj}\tau_j\right)\left(\theta_i - \sum_{j=1}^{J} q_{kj}\eta_j\right)\right]}, \qquad (4)$$

with $\tau_j$ reflecting the basic parameters of the $J$ design variables with respect to item discrimination. In addition to decomposing item difficulties, this model can therefore test whether the presence of certain design features in an item increases or decreases its discriminatory power. The 2PL-constrained model is nested in the 2PL model and therefore allows a direct comparison of model fit.
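Under the same illustrative Q-matrix logic, Equation 4 derives both item parameters from the design vector; a minimal sketch (names and values are ours):

```python
import math

def twopl_constrained_p(theta, q_k, eta, tau):
    """2PL-constrained model (Equation 4): discrimination a_k and
    difficulty b_k are both predicted from the item's design vector q_k."""
    a_k = sum(q * t for q, t in zip(q_k, tau))
    b_k = sum(q * e for q, e in zip(q_k, eta))
    z = a_k * (theta - b_k)
    return math.exp(z) / (1.0 + math.exp(z))
```

With all $\tau_j$ summing to 1 for an item, the expression reduces to the LLTM form for that item, which is one way to see the nesting structure.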

Both the LLTM and the 2PL-constrained model make the strong assumption that all item difficulties can be perfectly predicted from the basic parameters; that is, there is no error term in the regression of the item difficulties and/or discrimination parameters on the respective item design features. An implication of this assumption is that all items with the same design structure must have the same item parameters; for example, in the LLTM, all items with the same design vector $q_k$ must have the same item difficulty $b_k$. It has been shown that there can still be considerable variation in the item difficulties after accounting for item design features (Embretson, 1998). To take this into account, an error term must be introduced into the model. Janssen, Schepers, and Peres (2004) have presented an application of this approach for the LLTM, where item difficulty $b_k$ is decomposed as

$$b_k = \sum_{j=1}^{J} q_{kj}\eta_j + \epsilon_k, \qquad (5)$$

with $\epsilon_k \sim N(0, \sigma_\epsilon^2)$. The error term $\epsilon_k$ now captures the residual variance not explained by the design parameters. This approach can be generalized to the 2PL-constrained model as well; that is, the discrimination parameter $a_k$ can be assumed to show variation between structurally equal items. A framework allowing the analysis of such effects has been suggested by Glas and van der Linden (2003) and by De Jong, Steenkamp, and Fox (2007). By allowing random error in these models, the amount of variance in the item parameters that is explained by the cognitive design can be evaluated and, hence, the quality of the proposed cognitive model can be assessed.
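Equation 5 is an ordinary linear regression of item difficulties on the design matrix. A small simulation sketch (all values invented, not from the article's data) shows how the basic parameters and the residual variance can be recovered by least squares:

```python
import numpy as np

# Simulate item difficulties from a known design (Equation 5) and recover
# the basic parameters eta by least squares.
rng = np.random.default_rng(1)
J, K = 3, 40
eta_true = np.array([0.6, 1.0, 0.3])
Q = rng.integers(0, 2, size=(K, J)).astype(float)   # design matrix (K items x J rules)
b = Q @ eta_true + rng.normal(0.0, 0.2, size=K)     # b_k = q_k' eta + eps_k

eta_hat, *_ = np.linalg.lstsq(Q, b, rcond=None)
resid_var = float(np.var(b - Q @ eta_hat))          # variance not explained by the design
```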

### RTs in Educational Assessment

Traditional data analysis in educational assessment is founded on accuracy scores. Results obtained in classical, unidimensional IRT models (such as the 2PL model) usually provide information on person and item parameters: For each person, a person parameter reflecting latent ability is estimated, and for each item, a difficulty parameter and a discrimination parameter are obtained (Embretson & Reise, 2000). In such models, RTs are not modeled. However, RTs are easily available in times of computerized testing, and they can contain important information beyond accuracy scores. For example, RTs are helpful in detecting faking behavior on personality questionnaires (Holden & Kroner, 1992), and they can provide information on the speededness of a psychometric test or test items (Schnipke & Scrams, 1997; van der Linden et al., 1999) or aberrant response patterns (van der Linden & van Krimpen-Stoop, 2003).

Apart from these issues pertaining to test administration, RTs in psychometric tests with a design potentially contain vital information concerning the underlying cognitive processes. Importantly, RT analysis may allow new insights into cognitive processes that transcend those obtained by IRT modeling. For example, one might be interested in the relationship between the difficulty and time intensity of a cognitive process: Are difficult processes the most time-intensive? What is the relationship between latent ability (e.g., intelligence) and speed? Does the test format affect this relationship? To investigate these questions, a unified treatment of accuracy scores and RTs is required.

Three different strategies have been used in the past to extract RT-related information from psychometric tests. Under the first strategy, RTs are modeled *exclusively*. This strategy is usually applied to speed tests that are based on very simple items administered with a strict time limit, for which accuracy data offer only limited information. For example, in his linear exponential model, Scheiblechner (1979) has suggested that the RT $T$ for person $i$ responding to item $k$ is exponentially distributed with density

$$f(t_{ik}) = (\xi_i + \gamma_k)\exp[-(\xi_i + \gamma_k)t_{ik}], \qquad (6)$$

where $\xi_i$ is a person speed parameter, and $\gamma_k$ is an item speed parameter. Analogous to the LLTM, the item speed parameter $\gamma_k$ can now be decomposed into component processes that are necessary to solve the item:


$$\gamma_k = \sum_{j=1}^{J} a_{kj}\nu_j, \qquad (7)$$

where $\nu_j$ indicates the speed of component process $j$, and $a_{kj}$ is a weight indicating whether component process $j$ is present in item $k$. Maris (1993) suggested a similar model on the basis of the gamma distribution. Note that these models focus on RTs exclusively, whereas accuracy scores are not taken into consideration.
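A brief numerical sketch of Equations 6 and 7 (function names and values are illustrative, following the notation above):

```python
import math

def exp_rt_density(t, xi_i, gamma_k):
    """Scheiblechner's linear exponential model (Equation 6): exponential
    RT density with rate xi_i + gamma_k."""
    rate = xi_i + gamma_k
    return rate * math.exp(-rate * t)

def item_speed(a_k, nu):
    """Equation 7: item speed as a weighted sum of component-process speeds."""
    return sum(a * n for a, n in zip(a_k, nu))

# The mean RT implied by Equation 6 is 1 / (xi_i + gamma_k): faster persons
# and faster items both shorten the expected RT.
```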

A second strategy chosen by several authors implies a *separate* analysis of RTs and accuracy scores. For example, Gorin (2005) decomposed the difficulty of reading comprehension items using the LLTM and, in a second step, regressed the log-transformed RTs on the basic parameters. A similar approach was chosen by Embretson (1998) and Primi (2001) with a figural reasoning task, whereas Mulholland et al. (1980) used analyses of variance to predict RTs by item properties in a figural analogies test, separately for correct and wrong answers (cf. Sternberg, 1977). In contrast, Bejar and Yocom (1991) compared both difficulty parameters and the shape of cumulative RT distributions of item isomorphs, that is, parallel items, in two figural reasoning test forms. Separate analyses provide some information on both accuracy scores and RTs, but the relation between these two variables cannot be modeled, as they are assumed to vary independently. For an analysis that overcomes this difficulty, a model is needed that can simultaneously estimate RT parameters and IRT parameters. This has been done in a third strategy of analyses on the basis of the *joint* modeling of both RTs and accuracy scores. Recently, several models have been proposed for the investigation of RTs in a psychometric test within an IRT framework. One of the first models was introduced by Thissen (1983), which describes the log-transformed RT of person $i$ to item $k$ as

$$\log(T_{ik}) = \mu + s_i + u_k - b z_{ik} + \epsilon_{ik}, \qquad (8)$$

with $\epsilon_{ik} \sim N(0, \sigma_\epsilon^2)$. In this model, $\mu$ reflects the overall mean log RT; $s_i$ and $u_k$ are person- and item-related slowness parameters, respectively; $-b$ represents the log-linear relation between RT and ability; $z_{ik}$ is the logit estimated from a 2PL model; and $\epsilon_{ik}$ is an error term. The new parameter in this model is $-b$, which reflects the relationship of ability and item difficulty with the RT. The model suggested by Thissen (1983) is descriptive rather than explanatory in nature in that it does not provide a decomposition of item parameters reflecting cognitive operations (but see Ferrando & Lorenzo-Seva, 2007, for a recent extension of Thissen's model to questionnaire data).

The model proposed by Roskam (1997), which conceptually is very similar to the model by Verhelst, Verstralen, and Jansen (1997), specifies the probability of a correct response of person $i$ to item $k$ as

$$P(Y_{ik} = 1 \mid T_{ik}) = \frac{\theta_i T_{ik}}{\theta_i T_{ik} + \varepsilon_k} = \frac{\exp(\vartheta_i + t_{ik} - \delta_k)}{1 + \exp(\vartheta_i + t_{ik} - \delta_k)}, \qquad (9)$$

where $\theta_i$ represents the person ability, $\varepsilon_k$ is item difficulty, $T_{ik}$ is RT, and $\vartheta_i$, $t_{ik}$, and $\delta_k$ represent the natural logarithms of $\theta_i$, $T_{ik}$, and $\varepsilon_k$, respectively. In this model, RT is parameterized as a predictor for the solution probability of item $k$ by person $i$. As can be seen, if $T_{ik}$ goes to infinity, the probability of a correct solution approaches 1 irrespective of item difficulty. The model, therefore, is more suitable for speed tests than for power tests, because items in a speed test usually have very low item difficulties under conditions without a time limit. This is not the case for items in a power test, even with a moderate time limit.
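The equality of the two forms in Equation 9 can be checked numerically; the following sketch (function names ours) also makes the limiting behavior explicit:

```python
import math

def roskam_p_ratio(theta, rt, eps):
    """Left-hand (ratio) form of Equation 9."""
    return theta * rt / (theta * rt + eps)

def roskam_p_logit(theta, rt, eps):
    """Right-hand (logit) form of Equation 9, working on the log scale."""
    z = math.log(theta) + math.log(rt) - math.log(eps)
    return math.exp(z) / (1.0 + math.exp(z))

# For any positive theta, rt, eps the two forms agree; as rt grows, the
# success probability tends to 1 regardless of item difficulty eps.
```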

A model more suitable for power tests under time-limit conditions was proposed by Wang and Hanson (2005), who extended the traditional three-parameter logistic model by including RTs as well as parameters reflecting item slowness and person slowness, respectively:

$$P(Y_{ik} = 1 \mid \theta_i, \rho_i, a_k, b_k, c_k, d_k, T_{ik}) = c_k + \frac{1 - c_k}{1 + e^{-1.7 a_k [\theta_i - (\rho_i d_k / T_{ik}) - b_k]}}, \qquad (10)$$

where $a_k$, $b_k$, and $c_k$ are the discrimination, difficulty, and guessing parameters of item $k$; $\theta_i$ is an ability parameter for person $i$; $d_k$ is an item slowness parameter; $\rho_i$ is a person slowness parameter; and $T_{ik}$ is the RT of subject $i$ on item $k$. In this model, RTs are treated as an additional predictor, but, in contrast to the model by Roskam (1997), as RT goes to infinity, a classical three-parameter logistic model is obtained. A similar model for speeded reasoning tests, with presentation time as an experimentally manipulated variable, was developed by Wright and Dennis (1999) in a Bayesian framework. The model allows the dissociation of time parameters with respect to persons and items, thereby avoiding the problematic assumptions described above. However, a major problem here pertains to the RTs, which are modeled as fixed parameters. It is a common assumption across the literature that RT is a random variable (Luce, 1986). By treating a variable assumed to be random as fixed, systematic bias in parameter estimation can occur. Further, the joint distribution of item responses and RTs cannot be analyzed. The model by Wang and Hanson (2005), therefore, can only be regarded as a partial model, as stated by the authors.
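A sketch of Equation 10 (function names ours) makes the limiting argument concrete: as the RT grows, the slowness term vanishes and the ordinary three-parameter logistic model remains.

```python
import math

def wang_hanson_p(theta, rho, a, b, c, d, rt):
    """Wang and Hanson (2005) model (Equation 10): ability is penalized
    by the slowness term rho * d / rt, which shrinks as rt grows."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - rho * d / rt - b)))

def three_pl_p(theta, a, b, c):
    """Classical three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))
```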

A different approach was chosen by van Breukelen (2005).

He used a bivariate mixed logistic regression model, predicting log-normalized RTs as well as the log-odds of correct responses simultaneously. For the log-odds, the model assumed the log-normalized RTs and item-related design parameters with random effects (Rijmen & De Boeck, 2002) as predictors. Similarly, the RTs were predicted by item-related design parameters as well as accuracy scores. However, this approach can be problematic. van Breukelen (2005), for example, took the log-normalized RTs into account but did not specify parameters reflecting the test taker's speed or the time intensity of the items. If RTs are both regarded as a person-related predictor and as being implicitly equal to processing speed, as was done in the model by van Breukelen, the assumption is made that the time intensity of the items is equal, although their difficulties are not. This assumption can be avoided by including explicit time parameters in the model, reflecting the time intensity of the items and the speed of the test takers, respectively.

To conclude, several IRT models have been developed recently that are capable of incorporating RTs, but these suffer from some conceptual or statistical drawbacks for the application to time-limited tests. Further, they cannot relate the design structure of the utilized items to the RTs and accuracy scores simultaneously. A model that can overcome these difficulties, on the basis of the model developed by van der Linden (2007), is described below.

### A Model for Response Accuracies and RTs

With responses and RTs, we have two sources of information on a test. The first provides us with information on the response accuracy of test takers on a set of items. The RTs result from the required processing time to solve the items. Naturally, test takers differ in their speed of working, and different items require different amounts of cognitive processing to solve them. This leads us to consider RTs as resulting from person effects and item effects, a separation similar to that made in IRT. A framework will be developed that deploys separate models for the responses and the RTs as measurement models for ability and speed, respectively.

At a higher level, a population model for the person parameters (ability and speed) is deployed to take account of the possible dependencies between the person parameters. This hierarchical modeling approach was recently introduced by van der Linden (2007). The focus of this article, however, is on the item parameter side. A novel model is presented in which the item parameters of both measurement models can be modeled as a function of underlying design factors.

*Level 1 Measurement Model for Accuracy*

The probability that person $i = 1, \ldots, N$ answers item $k = 1, \ldots, K$ correctly ($Y_{ik} = 1$) is assumed to follow the 2-parameter normal ogive model (Lord & Novick, 1968):

$$P(Y_{ik} = 1 \mid \theta_i, a_k, b_k) = \Phi(a_k\theta_i - b_k), \qquad (11)$$

where $\theta_i$ denotes the ability parameter of test taker $i$, and $a_k$ and $b_k$ denote the discrimination and difficulty parameters of item $k$, respectively. $\Phi(\cdot)$ denotes the cumulative normal distribution function. This normal ogive form of the 2-parameter IRT model is adopted for computational convenience, as was shown by Albert (1992). Its latent variable form lends itself well to Bayesian estimation and is given by

$$Z_{ik} = a_k\theta_i - b_k + \varepsilon_{ik}, \qquad (12)$$

where $Z_{ik} \geq 0$ when $Y_{ik} = 1$ and $Z_{ik} < 0$ otherwise, and with $\varepsilon_{ik} \sim N(0, 1)$. With this data augmentation approach (Albert, 1992; Lanza, Collins, Schafer, & Flaherty, 2005), it is possible to change from dichotomous response variables to continuous latent responses. Also, as is shown below, after a suitable transformation of the RTs to normality, the simultaneous distribution of the responses and RTs turns out to be a bivariate normal one. This allows us to view the entire structure as a multivariate normal model, thereby simplifying the statistical inferences as well as the estimation procedure.
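Equations 11 and 12 can be illustrated with a short simulation (a sketch, not the authors' estimation code): thresholding the latent response at zero reproduces the normal ogive probability.

```python
import math
import random

def phi(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def draw_response(theta, a, b, rng):
    """Equation 12: draw a latent response Z and threshold it at zero."""
    z = a * theta - b + rng.gauss(0.0, 1.0)
    return int(z >= 0)

rng = random.Random(7)
theta, a, b = 0.5, 1.2, 0.3         # illustrative person and item parameters
p_model = phi(a * theta - b)        # Equation 11
p_sim = sum(draw_response(theta, a, b, rng) for _ in range(20000)) / 20000
# p_sim approximates p_model up to Monte Carlo error
```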

*Level 1 Measurement Model for Speed*

As a result of a natural lower bound at zero, the distribution of RTs is skewed to the right. Various types of distributions are able to describe such data. For instance, the Poisson, gamma, Weibull, inverse normal, exponential, and lognormal distributions have been employed to describe RT distributions in psychometric applications. The reader is referred to Maris (1993), Roskam (1997), Rouder et al. (2003), Thissen (1983), van Breukelen (1995), Schnipke and Scrams (1997), and van der Linden (2006) for examples. However, in this application, the lognormal model is chosen to model the RT distributions for specific reasons. First of all, Schnipke and Scrams and van der Linden have shown that the lognormal model is well suited to describe such distributions, and it generally performs well with respect to model fit, as we experienced during the analyses of several data sets. Second, the lognormal model fits well within the larger framework for responses and RTs. It is assumed that the log-transformed RTs are normally distributed. Thereby, as mentioned above, the simultaneous distribution of the latent responses and log-transformed RTs can be viewed as a bivariate normal one. This is a strong advantage over other possible RT distributions, because its generalization to a hierarchical model becomes straightforward. Also, the properties of multivariate normal distributions are well known (Anderson, 1984), which simplifies the statistical inferences.

By analogous reasoning, an RT model will be developed that is similar in structure to the 2-parameter IRT model. Test takers tend to differ in their speed of working on a test; therefore, a person speed parameter ζ_i is introduced. Like ability in IRT, speed is assumed to be the underlying construct for the RTs. Also, it is assumed that test takers work with a constant speed during a test and that, given speed, the RTs on a set of items are conditionally independent. That is, the speed parameter captures all the systematic variation within the population of test takers. These assumptions are similar to the assumptions of constant ability and conditional independence in the IRT model.

However, test takers do not divide their time uniformly over the test, because items have different time intensities. The expected RT on an item is modeled by a time intensity parameter λ_k. Basically, an item that requires more steps to obtain its solution can be expected to be more time intensive, which is then reflected in a higher time intensity. It can be seen that λ_k is the analogue of the difficulty parameter b_k, reflecting the time needed to solve the item. As an example, running 100 m will be less time consuming than running 200 m. Clearly, the latter item takes more steps to be solved and will have a higher time intensity. An illustration of the effect of time intensity on the expected RTs is given in Figure 2. In this figure, item characteristic curves (ICCs) for the IRT model (left panel) and response time characteristic curves (RTCCs) (right panel) are plotted against the latent trait. The RTCCs show the decrease in expected RT as a function of speed. For both measurement models, two curves are plotted that show the shift in probability/time as a result of a shift in difficulty/time intensity. In this example, the upper curve would reflect running 200 m, whereas the lower curve reflects the expected RTs on the 100-m distance. Note, however, that it is not necessarily so that running 200 m is more difficult than running 100 m.

Now, for the expectation of the log-RT T_ik of person i on item k, we have obtained that E(T_ik) = −ζ_i + λ_k. However, a straightforward yes–no question might show less variability around its mean λ_k than predicted by ζ_i. Such an effect can be considered as the discriminative power of an item, and therefore a time discrimination parameter φ_k is introduced. This parameter controls the decrease in expected RT on an item for a one-step increase in the speed of a test taker. It is the analogue of the discrimination parameter a_k in Equation 12. The effect of item discrimination on the ICCs and RTCCs is illustrated in Figure 3. It can be seen that the difference in expected RTs between test takers working at different speed levels is less for the lower discriminating item.

Finally, the log-RT T_ik of person i on item k follows a normal model according to

T_ik = −φ_k ζ_i + λ_k + ε_ik,  (13)

where ε_ik ~ N(0, σ²_k) models the residual variance.
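As a quick numerical check of Equation 13, the lognormal RT model can be simulated directly; the speed, time intensity, and residual values below are invented for illustration and are not estimates from the article.

```python
import random

def simulate_log_rts(zeta, phi_k, lam, sigma, rng):
    """Draw log-RTs from Equation 13: T_ik = -phi_k * zeta_i + lam_k + eps,
    eps ~ N(0, sigma_k^2). Observed RTs are exp(T_ik)."""
    return [[-phi_k[k] * z + lam[k] + rng.gauss(0, sigma[k])
             for k in range(len(lam))]
            for z in zeta]

rng = random.Random(7)
zeta = [rng.gauss(0, 0.5) for _ in range(5000)]  # person speeds
phi_k = [1.0, 1.0]                               # time discriminations
lam = [3.0, 4.0]                                 # item 2 is more time intensive
sigma = [0.4, 0.4]                               # residual SDs
T = simulate_log_rts(zeta, phi_k, lam, sigma, rng)

# With speeds centered at zero, the mean log-RT of each item should be
# close to its time intensity lambda_k.
mean_t1 = sum(row[0] for row in T) / len(T)
mean_t2 = sum(row[1] for row in T) / len(T)
```

The one-unit gap between the two sample means mirrors the 100-m versus 200-m example: a higher λ_k shifts the whole RT distribution upward without saying anything about difficulty.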

*Level 2 Model for the Person Parameters*

In IRT, it is common to view observations as nested within persons. Local independence between observations is assumed conditional on the ability of a test taker. That is, a test taker is seen as a sample randomly drawn from a population distribution of test takers. Usually, a normal population model is adopted, so

θ_i ~ N(μ_θ, σ²_θ).  (14)
The local independence assumption has a long tradition in
IRT (e.g., Lord, 1980; Holland & Rosenbaum, 1986). Sim-
ilarly, the speed parameter models the heterogeneity in the
observed RTs between test takers. Therefore, conditional on
the (random) speed parameters, there should be no covaria-
tion left between the RTs on different items. In other words,
conditional independence is assumed in the RT model as
well. Now a third assumption of conditional independence
follows from the previous two. If test takers work with
constant speed and constant ability during a test, then within

*Figure 2.* Item characteristic curves (ICCs) and response time characteristic curves (RTCCs) for two items with differing time intensity and difficulty, where a = φ = 1.

an item these parameters should capture all possible covariation between the responses and RTs. That is, the responses and RTs on an item are assumed to be conditionally independent given the levels of ability and speed of the test takers. At the second level of modeling, these person parameters are assumed to follow a bivariate normal distribution:

(θ_i, ζ_i)ᵗ = μ_P + e_P,  e_P ~ N(0, Σ_P),  (15)

where μ_P = (μ_θ, μ_ζ), and the covariance structure is specified by

Σ_P = [ σ²_θ   ρ
        ρ      σ²_ζ ].  (16)

The parameter ρ in the model for the person parameters reflects possible dependencies between speed and accuracy of the test takers. For instance, when ρ is negative, this means that persons who work faster than average on the test are expected to have below-average abilities. When ρ = 0, there is independence between ability and speed. However, this is not necessarily equivalent to independence between the responses and RTs, because such a dependency can occur via the item side of the model as well, as is discussed below.
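To see what the population covariance means in practice, person parameters can be drawn from the bivariate normal model of Equation 15. The sketch below assumes unit variances (so the covariance equals the correlation) and an illustrative value of −0.4; both choices are ours, not the article's.

```python
import math
import random

def draw_person_parameters(n, rho, rng):
    """Draw (theta_i, zeta_i) pairs from the bivariate normal population
    model of Equation 15, using a manual Cholesky factorization. Unit
    variances are assumed so rho is both covariance and correlation."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        theta = z1
        zeta = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        pairs.append((theta, zeta))
    return pairs

rng = random.Random(2)
persons = draw_person_parameters(20000, rho=-0.4, rng=rng)

# The sample covariance of ability and speed should be close to the
# generating value of rho.
mt = sum(th for th, _ in persons) / len(persons)
mz = sum(ze for _, ze in persons) / len(persons)
cov = sum((th - mt) * (ze - mz) for th, ze in persons) / len(persons)
```

A negative draw structure like this one generates a population in which faster workers tend to have lower abilities, which is a between-person statement and not a speed-accuracy trade-off within any single test taker.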

This hierarchical approach, which was first presented by van der Linden (2007), models a connection between the two Level 1 measurement models. Note that Equation 15 is a population model and is therefore entirely different from what is known as the speed–accuracy trade-off (Luce, 1986). The latter is a within-person phenomenon, reflecting the trade-off between accuracy and speed of working for a specific test taker, and it is often assumed to be negative.

That is, it assumes that a test taker chooses a certain speed level of working and, given that speed, attains a certain ability. If he or she chooses to work faster, the trade-off then predicts that this test taker will make more errors and, as a result, will attain a lower ability. On the contrary, the model given in Equation 15 describes the relationship between ability and speed at the population level. It is perfectly reasonable that, within a population, the dependency between ability and speed is positive, reflecting that faster working test takers are also the higher ability candidates. In the analysis of real test data, we have found positive as well as negative dependencies between ability and speed (Klein Entink, Fox, & van der Linden, in press).

So far, the model is equivalent to that presented by van der Linden (2007) and as described in Fox, Klein Entink, and van der Linden (2007). Another possible bridge between the two Level 1 models can be built on the item side. That one is developed now and presents a novel extension of the model that allows us to describe item parameters as a function of underlying cognitive structures, which is the focus of this article.

*Level 2 Model for the Item Parameters*

The hierarchical approach is easily extended to the item side of the model. As discussed in the overview in the introduction section, several approaches have been developed to model underlying item design structures in IRT. However, some of these approaches made rather strict assumptions by incorporating the design model into the IRT model. We present an approach in which this is avoided by introducing possible underlying design features at the second level of modeling.

Interest centers on explaining differences between items resulting from the item design structure. Because the characteristics of the items are represented by their item parameters, it seems straightforward to study the differences in the estimated item parameters as a function of the design features. Moreover, it should be possible to assess to what extent the differences in these parameters can be explained by the design features. To do so, the hierarchical modeling approach is first extended to the item side of the model.

*Figure 3.* Item characteristic curves (ICCs) and response time characteristic curves (RTCCs) for two items with differing discrimination, where b = 0 and λ = 4.

Similarly to Equation 15, the vector ξ_k = (a_k, b_k, φ_k, λ_k) is assumed to follow a multivariate normal distribution,

ξ_k ~ N(μ_I, Σ_I),  (17)

where Σ_I specifies the covariance structure of the item parameters:

Σ_I = [ σ²_a    σ_ab    σ_aφ    σ_aλ
        σ_ab    σ²_b    σ_bφ    σ_bλ
        σ_aφ    σ_bφ    σ²_φ    σ_φλ
        σ_aλ    σ_bλ    σ_φλ    σ²_λ ].  (18)

Σ_I is the second bridge between the Level 1 models. It allows us to study dependencies between the item parameters. If there is a dependency between item difficulty and time intensity, this is reflected by the covariance component between these parameters: for instance, a positive estimate for σ_bλ indicates that more difficult items also tend to be more time consuming.

Now suppose we have a test in which items are formulated using either squares, circles, or triangles, and we are interested in whether such items differ in their difficulty. This leads us to consider the following model, in which we develop an analysis of variance approach to model the effects of each rule. That is, the means of the item parameters are decomposed into a general mean and deviations from that mean as a result of the underlying item construction rules used to formulate the items. To reflect the three symbols used to formulate the items, two dummy variables are constructed. The first variable, denoted by A₁, of length K, contains a 1 for circles, a 0 for triangles, and a −1 for squares. The second variable A₂ contains a 0 for circles, a 1 for triangles, and also a −1 for squares. Now, following Equation 5, the difficulty of item k can be modeled as

b_k = γ₀^(b) + A_{1k} γ₁^(b) + A_{2k} γ₂^(b) + e_k^(b).  (19)

This indicator variable approach models the difficulty of item k as a deviation from the base level γ₀ as a result of the figure used to construct the item. That is, if item k is constructed using circles, its difficulty is predicted by γ₀ + γ₁. If triangles are used, its difficulty is given by γ₀ + γ₂. In the case that squares are used, its difficulty is modeled as γ₀ − γ₁ − γ₂. Note that when γ₃ denotes the effect for squares, it must equal −γ₁ − γ₂ because otherwise the model is overparameterized. Let A = (1, A₁, A₂) and γ^(b) = (γ₀, γ₁, γ₂)ᵗ; then the model can be represented as

b_k = A_k γ^(b) + e_k^(b).  (20)
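The effect coding of Equations 19 and 20 can be made concrete with a small hypothetical item set; the symbols and the γ values below are invented purely for illustration.

```python
# Effect coding for the three-symbol example, with squares as the
# reference category whose effect is -(gamma1 + gamma2).
symbols = ["circle", "triangle", "square", "circle", "square"]

def effect_code(symbol):
    """Return the design row (1, A1, A2) for one item."""
    return {"circle":   (1,  1,  0),
            "triangle": (1,  0,  1),
            "square":   (1, -1, -1)}[symbol]

A = [effect_code(s) for s in symbols]

# Hypothetical effects: gamma0 = base difficulty, gamma1 = circle
# effect, gamma2 = triangle effect.
gamma_b = (0.5, 0.3, -0.1)

# Predicted difficulty for each item, ignoring the residual e_k^(b):
# circles: 0.5 + 0.3 = 0.8; triangles: 0.5 - 0.1 = 0.4;
# squares: 0.5 - 0.3 + 0.1 = 0.3.
predicted_b = [sum(a_j * g_j for a_j, g_j in zip(row, gamma_b))
               for row in A]
```

Because the three category effects sum to zero, γ₀ keeps its interpretation as the overall mean difficulty, which is what prevents the overparameterization noted in the text.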
In the previous example, the interest was only in dissociating the heterogeneity in the item difficulty parameters into three possible groups of items. However, if we are interested in validating a cognitive model that underlies the item design, it makes sense to extend the model to the other item parameters as well. The full multivariate model for the item parameters can be generalized to

a_k = A_k γ^(a) + e_k^(a),  (21)
b_k = A_k γ^(b) + e_k^(b),  (22)
φ_k = A_k γ^(φ) + e_k^(φ),  (23)
λ_k = A_k γ^(λ) + e_k^(λ),  (24)

where the error terms are assumed to follow a multivariate normal distribution, that is, e ~ N(0, Σ_I). This is a generalization of Equation 5, not only by allowing for residual variance in item parameters other than b but also by modeling covariance components between the item parameters. Further, A is a design matrix containing zeros and ones denoting which construction rules are used for each item; γ^(a), γ^(b), γ^(φ), and γ^(λ) are the vectors of effects of the construction rules on discrimination, difficulty, time discrimination, and item time intensity, respectively.

The complete model structure is represented in Figure 4.

The ovals denote the measurement models for accuracy (left) and speed (right). The circles at Level 2 denote the covariance structures that connect the Level 1 model parameters. The structural model is denoted by the square box. The square containing A_I denotes the design matrix containing item-specific information that allows for explaining variance between the item parameters. This approach is not limited to rule-based test construction but can just as well be used to test hypotheses of, for instance, differences in cognitive processing when data are presented in a table versus presented in a figure.

By the conditional independence assumption and by taking the possible dependencies to a second level of modeling, this framework becomes very flexible. It allows for the incorporation of any measurement model for either accuracy or speed. For example, the measurement model for the dichotomous responses could be replaced by a model for polytomous items. When needed, independence between the two Level 1 models can be obtained by restricting Σ_I and Σ_P to be diagonal matrices. However, the strength of the framework comes from the simultaneous modeling of two data sources on test items. The two likelihoods at Level 1, linked via the covariance structures at Level 2, allow us to use the RTs as collateral information in the estimation of the response parameters and vice versa.

Bayesian Inference and Estimation

This section deals with the statistical treatment of the model. The model is estimated in a fully Bayesian framework. Before discussing the estimation procedures, however, first the basic principles of the Bayesian approach are introduced. For a general introduction to the Bayesian approach and its estimation methods, see Gelman, Carlin, Stern, and Rubin (2004). Bayesian estimation of IRT models is discussed in, for instance, Albert (1992), Patz and Junker (1999), and Fox and Glas (2001).

*Bayesian Approach*

In the classical approach to statistics, a parameter θ is assumed to be an unknown, but fixed, quantity. A random sample from a population indexed by θ is obtained. On the basis of the observed sample, the value of θ can be estimated. Instead, in the Bayesian approach, θ is assumed to be random. That is, there is uncertainty about its value, which is reflected by specifying a probability distribution for θ. This is called the *prior* distribution and reflects the subjective belief of the researcher before the data are seen. Subsequently, a sample is obtained from the distribution indexed by θ, and the prior distribution is then updated. The updated distribution is called the *posterior* and is obtained via Bayes' rule. Let p(θ) denote the prior and f(x|θ) denote the sampling distribution; then the posterior density of θ|x is

p(θ|x) = f(x|θ)p(θ)/m(x),  (25)

where m(x) denotes the marginal distribution of x (Casella & Berger, 2002, p. 324).
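Equation 25 becomes especially simple when the prior is conjugate to the likelihood. The beta-binomial case below (our own illustrative example, not one used in the article) shows the prior-to-posterior update without ever evaluating the marginal m(x).

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Bayes' rule with a conjugate Beta(alpha, beta) prior and a
    binomial likelihood: the posterior is again Beta, so the marginal
    m(x) in Equation 25 cancels and never needs computing."""
    return alpha + successes, beta + trials - successes

# A uniform Beta(1, 1) prior updated with 7 successes in 10 trials
# yields a Beta(8, 4) posterior with mean 8 / 12.
a_post, b_post = beta_binomial_update(1, 1, 7, 10)
posterior_mean = a_post / (a_post + b_post)
```

For the joint model of this article no such closed form exists, which is exactly why the sampling-based methods of the next section are needed.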

*Markov Chain Monte Carlo (MCMC) Methods*
The posterior distributions of the model parameters are the objects of interest in Bayesian inference. For simple models, obtaining these estimates can be done analytically. However, for complex models such as the one presented above, it is impossible to do so. Sampling-based estimation procedures, known as MCMC methods, solve these problems easily. A strong feature of these methods is that their application remains straightforward even as model complexity increases.

The MCMC algorithm applied in this article is known as the Gibbs sampler (Geman & Geman, 1984). To obtain samples from the posterior distributions of all model parameters, a Gibbs sampling algorithm requires that all the conditional distributions of the parameters can be specified.

Basically, a complex multivariate distribution from which it is hard to sample is broken down into smaller univariate distributions, conditional on the other model parameters, from which it is easy to draw samples. After giving the algorithm some arbitrary starting values for all parameters, it alternates between the conditional distributions for M iterations. Thereby, every step depends only on the last draws of the other model parameters. Hence, (under some broad conditions) a Markov chain is obtained that converges toward a target distribution. It has been shown that as the number of iterations goes to infinity, the target distribution can be approximated to any desired accuracy (Robert & Casella, 2004).

To illustrate the approach, consider estimation of the RT model given by Equation 13. For simplicity of this example, we assume that φ_k = 1 and independence from the response model. First, (independent) prior distributions for ζ, λ, and σ² are specified. Now, because it does not depend on m(t), the posterior distribution is proportional to p(ζ, λ, σ²|t) ∝ f(t|ζ, λ, σ²)p(ζ)p(λ)p(σ²). After providing the algorithm with starting values ζ^(0), λ^(0), and σ^{2(0)}, the algorithm proceeds as follows:

1. At iteration m, draw the person parameters ζ from p(ζ | λ^(m−1), σ^{2(m−1)}, t).

2. Using the new values ζ^(m), draw λ from p(λ | ζ^(m), σ^{2(m−1)}, t).

3. Using the new values λ^(m), draw σ² from p(σ² | ζ^(m), λ^(m), t).

4. Increment m by 1 and repeat the above steps for M iterations.

Now M values for each of the parameters have been obtained.
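The four steps above can be sketched for this simplified model in pure Python. The priors and their conjugate full conditionals below are our own standard choices, not taken from the article, and the chain is kept deliberately short; the true values used to generate the synthetic data are arbitrary.

```python
import random

def gibbs_rt(t, n_iter=1200, burn=200, seed=3):
    """Gibbs sampler for the simplified RT model of Equation 13 with all
    time discriminations fixed at 1: t_ik = -zeta_i + lambda_k + eps_ik,
    eps_ik ~ N(0, sigma2). Assumed priors: zeta_i ~ N(0, 1),
    lambda_k ~ N(0, 100), sigma2 ~ Inv-Gamma(1, 1)."""
    rng = random.Random(seed)
    N, K = len(t), len(t[0])
    zeta, lam, sigma2 = [0.0] * N, [0.0] * K, 1.0
    lam_sum, kept = [0.0] * K, 0
    for m in range(n_iter):
        # Step 1: draw each zeta_i from its normal full conditional.
        for i in range(N):
            prec = K / sigma2 + 1.0
            mean = sum(lam[k] - t[i][k] for k in range(K)) / sigma2 / prec
            zeta[i] = rng.gauss(mean, (1.0 / prec) ** 0.5)
        # Step 2: draw each lambda_k from its normal full conditional.
        for k in range(K):
            prec = N / sigma2 + 1.0 / 100.0
            mean = sum(t[i][k] + zeta[i] for i in range(N)) / sigma2 / prec
            lam[k] = rng.gauss(mean, (1.0 / prec) ** 0.5)
        # Step 3: draw sigma2 from its inverse-gamma full conditional.
        ssr = sum((t[i][k] + zeta[i] - lam[k]) ** 2
                  for i in range(N) for k in range(K))
        sigma2 = 1.0 / rng.gammavariate(1.0 + N * K / 2.0,
                                        1.0 / (1.0 + ssr / 2.0))
        # Step 4: accumulate posterior means after the burn-in period.
        if m >= burn:
            kept += 1
            for k in range(K):
                lam_sum[k] += lam[k]
    return [s / kept for s in lam_sum], sigma2

# Synthetic data: true time intensities (2.0, 3.0), residual SD 0.5.
rng = random.Random(11)
true_lam = [2.0, 3.0]
t = [[-z + l + rng.gauss(0, 0.5) for l in true_lam]
     for z in [rng.gauss(0, 1) for _ in range(300)]]
lam_hat, sigma2_hat = gibbs_rt(t)
```

With 300 simulated test takers the posterior means of the time intensities land close to their generating values, illustrating the parameter recovery the full sampler is designed to achieve.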

Before descriptive statistics such as the posterior mean and posterior variance can be obtained, issues such as autocorrelation of the samples and convergence of the Markov chain must be checked. Most statistical software packages provide means to obtain autocorrelations. Convergence can be checked by making trace plots, that is, plotting the drawn samples against their iteration number. This allows for a visual inspection to determine whether stationarity has been reached. Dividing the MCMC chain into two or more subsets of equal sample size and comparing the posterior mean and standard deviation also provides information on convergence. Another approach is rerunning the algorithm using different starting values. This is also helpful to determine whether the chain has really converged to a global optimum. Other (numerical) methods to assess convergence issues are discussed in Gelman et al. (2004, Section 11.6).

*Figure 4.* Schematic representation of the modeling structure. IRT = item response theory; RT = response time.

Because the first samples are influenced by the starting values, a "burn-in" period is used, which means that the first samples of the chain are discarded. The posterior means and variances of the parameters are then obtained from the remaining Q samples. This is usually done by checking convergence of the chain and, when this seems to be reached, running the algorithm for another few thousand iterations on which the inferences can be based. The BOA software for use in the S-PLUS or R statistical environment provides several of these diagnostic tools (numerical and graphical) to assess convergence of the MCMC chains (Smith, 2007).

The model presented above lends itself to a full Gibbs sampling approach. This is a consequence of the multivariate normality of the responses and RTs after the data augmentation step. The derivation of the conditional distributions for the Gibbs sampling algorithm is discussed in the Appendix of this article. The algorithm has been implemented in Visual FORTRAN Pro 8.0.

*Model Checking and Evaluation*

In a Bayesian framework, goodness-of-fit tests can be performed using posterior predictive checks (Gelman et al., 2004; Gelman, Meng, & Stern, 1996). Model fit can be evaluated by comparing replications of the data x^rep, drawn from the posterior predictive distribution of the model, with the observed data. A discrepancy between model and data is measured by a test quantity T(x, θ) (e.g., mean squared error), where x denotes the data and θ denotes the vector of model parameters. A Bayesian p value, p*, can be estimated as the probability that the replicated data under the model are more extreme than the observed data:

p* = P(T(x^rep, θ) ≥ T(x, θ) | x),  (26)

where p values close to 0 or 1 indicate extreme observations under the model. Using the samples drawn at each iteration of the Gibbs sampler, these estimates of the p values are easily obtained as a by-product from the MCMC chain. For more details, see Gelman et al. (1996).
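Equation 26 can be sketched for a toy normal model; the posterior draws below are invented for illustration, with the maximum as the test quantity, and are not the odds ratio or observed score statistics used in the article.

```python
import random

def ppp_value(x, draws, stat, rng):
    """Estimate the posterior predictive p value of Equation 26: the
    proportion of posterior draws whose replicated data set yields a
    test quantity at least as large as the observed one."""
    n, extreme = len(x), 0
    for mu, sigma in draws:
        x_rep = [rng.gauss(mu, sigma) for _ in range(n)]
        if stat(x_rep) >= stat(x):
            extreme += 1
    return extreme / len(draws)

rng = random.Random(5)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]   # "observed" data

# Hypothetical posterior draws of (mu, sigma) under a well-fitting
# model and under a badly misfitting one (mu far from the truth).
good = [(rng.gauss(0.0, 0.07), 1.0) for _ in range(500)]
bad = [(5.0, 1.0)] * 500
p_good = ppp_value(x, good, stat=max, rng=rng)
p_bad = ppp_value(x, bad, stat=max, rng=rng)
```

Under the misfitting posterior nearly every replicated maximum exceeds the observed one, pushing the p value toward 1, which is exactly the kind of extreme value the text flags as evidence of misfit.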

Next, appropriate test quantities have to be chosen. An important assumption of the model is that of local independence. Therefore, an odds ratio statistic was used to test for possible violations of local independence between response patterns on items. For an impression of the overall fit of the response model, an observed score statistic was estimated to assess whether the model was able to replicate the observed response patterns of the test takers. For a detailed description of these two statistics, see Sinharay (2005) and Sinharay, Johnson, and Stern (2006).

Residual analysis is another useful means to examine the appropriateness of a statistical model. The basic idea is that the observed residuals, that is, the differences between the observed values and the expected values under the model, should reflect the assumed properties of the error term. To assess the fit of the RT model, van der Linden and Guo (in press) proposed a Bayesian residual analysis. More specifically, by evaluating the actual observation t_ik under the posterior density, the probability of observing a value smaller than t_ik can be approximated by

u_ik ≈ Σ_{m=1}^{M} Φ(t_ik | ζ_i^(m), λ_k^(m), σ_k^(m)) / M,  (27)

from M iterations of the MCMC chain. According to the probability integral transform theorem (Casella & Berger, 2002, p. 54), under a good-fitting model these probabilities should be distributed U_ik ~ U(0, 1). Model fit can then be checked graphically by plotting the posterior p values against their expected values under the U(0, 1) distribution. When the model fits well, these plots should approximate the identity line.
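For a single person-item pair, the average in Equation 27 amounts to evaluating a normal CDF at the observed log-RT for each posterior draw; the draws below are invented stand-ins for MCMC output, not values from the article.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """Normal CDF evaluated at (x - mu) / sigma via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical posterior draws of (zeta_i, lambda_k, sigma_k) for one
# person-item pair, standing in for M = 1000 MCMC iterations.
rng = random.Random(9)
draws = [(rng.gauss(0.0, 0.1), rng.gauss(2.0, 0.1), 0.5)
         for _ in range(1000)]

# Equation 27: average the conditional CDF of the observed log-RT
# t_ik over the posterior draws, with mean -zeta + lambda per draw.
t_ik = 2.1
u_ik = sum(normal_cdf(t_ik, -z + l, s) for z, l, s in draws) / len(draws)
```

Collecting such u_ik values over all persons and items and plotting their empirical quantiles against U(0, 1) quantiles produces the identity-line check described above.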

*Model Selection*

Research hypotheses are usually reformulated so that two competing statistical models are obtained that explain the observed data. An appropriate test criterion then has to be selected that evaluates these two models with respect to their explanatory power for the data. The Bayes factor (Kass & Raftery, 1995; Klugkist, Laudy, & Hoijtink, 2005) can be used to test a model M₁ against another model M₀ for the data at hand. The Bayes factor is defined as the ratio of the marginal likelihoods of these models:

BF = p(y|M₀) / p(y|M₁).  (28)

The marginal likelihood is the average of the density of the data taken over all parameter values admissible by the prior, that is, p(y|M) = ∫ p(y|γ, M)p(γ|M)dγ, where γ is the vector of model parameters. Because the Bayes factor weighs the two models against each other, a value near one means that both models are equally likely. A value of 3 or