Educational and Psychological Measurement 73(3) 491–511
© The Author(s) 2012
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0013164412454431
epm.sagepub.com

### Higher Order Testlet Response Models for Hierarchical Latent Traits and Testlet-Based Items

Hung-Yu Huang^{1} and Wen-Chung Wang^{2}

Abstract

Both testlet design and hierarchical latent traits are fairly common in educational and psychological measurements. This study aimed to develop a new class of higher order testlet response models that consider both local item dependence within testlets and a hierarchy of latent traits. Due to high dimensionality, the authors adopted the Bayesian approach implemented in the WinBUGS freeware for parameter estimation. A series of simulations were conducted to evaluate parameter recovery, consequences of model misspecification, and effectiveness of model–data fit statistics.

Results show that the parameters of the new models can be recovered well. Ignoring the testlet effect led to a biased estimation of item parameters, underestimation of factor loadings, and overestimation of test reliability for the first-order latent traits.

The Bayesian deviance information criterion and the posterior predictive model checking were helpful for model comparison and model–data fit assessment. Two empirical examples of ability tests and nonability tests are given.

Keywords

item response theory, testlet response theory, higher order models, hierarchical latent trait, Bayesian methods

Testlet design, in which a set of items share a common stimulus (e.g., a reading comprehension passage or a figure), has been widely used in educational and

^{1}Taipei Municipal University of Education, Taipei, Taiwan

^{2}The Hong Kong Institute of Education, Hong Kong, Hong Kong SAR

Corresponding Author:

Wen-Chung Wang, 10 Lo Ping Road, Tai Po, New Territories, Hong Kong SAR.

Email: wcwang@ied.edu.hk

psychological tests. Many test developers find testlet design attractive because of its efficiency in item writing and administration. However, testlet design poses a challenge to standard item response theory (IRT) models, because items within a testlet are connected with the same stimulus such that the usual assumption of local item independence in standard IRT models may not hold. Fitting standard IRT models to such data leads to biased parameter estimation and overestimation of test reliability (Thissen, Steinberg, & Mooney, 1989; Wainer & Lukhele, 1997; Wainer & Thissen, 1996; Wainer & Wang, 2000). Various testlet response models have been developed to account for the local dependence among items within a testlet by adding a set of random-effect parameters to standard IRT models, one for each testlet (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow, & Du, 2000; Wainer, Bradlow, & Wang, 2007; W.-C. Wang & Wilson, 2005).

In the human sciences, latent traits may be hierarchical. For example, a hierarchical structure orders mental abilities with g (general ability) at the top of the hierarchy and other broad abilities (e.g., mathematical, spatial-mechanical, and verbal reasoning) lower down. Specifically, the Wechsler Adult Intelligence Scale measures a total of three orders of latent traits. At the first order, there are 13 subtests, each measuring a specific ability. At the second order, these 13 specific abilities are grouped into four broad abilities: verbal comprehension, perceptual reasoning, working memory, and processing speed. At the third order, these four broad abilities constitute a general mental ability (Ryan & Schnakenberg-Ott, 2003). The SAT Reasoning Test consists of three sections: critical reading, mathematics, and writing. Each section receives a score on a scale of 200 to 800. Total scores are calculated by adding up the scores of the three sections. One can treat the tests as measuring three first-order latent traits (critical reading, mathematics, and writing) and one second-order latent trait (scholastic aptitude). Similarly, the Graduate Record Examinations consist of three sections of verbal, quantitative, and analytical writing. One can treat the examinations as measuring three first-order latent traits (verbal reasoning, quantitative reasoning, and analytical writing skills) and one second-order latent trait (scholastic aptitude).

Recently, several higher order IRT models for hierarchical latent traits have been developed. These include the cognitive diagnosis models by de la Torre and Douglas (2004), multidimensional models with a hierarchical structure by Sheng and Wikle (2008), and higher order IRT models by de la Torre and Song (2009), de la Torre and Hong (2010), and Huang, Wang, and Chen (2010). In the IRT literature, testlet response models and higher order IRT models were developed separately. Since hierarchical latent traits may be measured by testlet-based items, it is of great value to develop a class of IRT models that consider both testlet design and hierarchical structure in latent traits, which is the main purpose of this study. This article is organized as follows. First, higher order IRT models are briefly introduced and a new class of higher order testlet response models is formulated. Second, parameter estimation and model–data fit assessment for the new class of models are described.

Third, simulations are conducted to assess parameter recovery, effects of model misspecification, and model–data fit indices using the WinBUGS computer program (Spiegelhalter, Thomas, & Best, 2003), and the results are summarized. Fourth, two empirical examples of achievement and personality assessments are presented to demonstrate the applications of the new class of higher order testlet response models.

Fifth, conclusions are drawn and suggestions for future study are provided.

Higher Order IRT Models

Assume the hierarchy of latent traits has two orders. Each of the first-order latent traits is measured by a set of unidimensional items, and these first-order latent traits are governed by the same second-order latent trait. Let $\theta^{(2)}_{n}$ be the second-order latent trait and $\theta^{(1)}_{nv}$ be the $v$th first-order latent trait for person $n$. The relationship between these two orders of latent traits is assumed to be

$$\theta^{(1)}_{nv} = \beta_{v}\theta^{(2)}_{n} + \varepsilon^{(1)}_{nv}, \quad (1)$$

where $\varepsilon^{(1)}_{nv}$ is assumed to be normally distributed with mean zero and independent of the other $\varepsilon$s and $\theta$s, and $\beta_{v}$ is the regression weight (factor loading) between the second-order latent trait and the $v$th first-order latent trait. The model can be readily generalized to more orders of latent traits (e.g., a third order).
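To make Equation 1 concrete, the following sketch simulates a second-order trait and three first-order traits. The loadings (.9, .8, .7) match the simulation design used later in the article; the residual variances $1 - \beta_v^2$ are our own choice for this illustration, made so that each first-order trait has unit variance and $\beta_v$ equals the correlation between the two orders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons = 100_000
loadings = np.array([0.9, 0.8, 0.7])  # beta_v for the three first-order traits

# Second-order trait: theta2 ~ N(0, 1)
theta2 = rng.standard_normal(n_persons)

# Equation 1: theta1_nv = beta_v * theta2_n + eps_nv,
# with Var(eps_nv) = 1 - beta_v**2 so that Var(theta1_nv) = 1
eps = rng.standard_normal((n_persons, 3)) * np.sqrt(1.0 - loadings**2)
theta1 = theta2[:, None] * loadings + eps

# With unit variances, beta_v is the correlation between theta2 and theta1_v
emp_corr = np.array([np.corrcoef(theta2, theta1[:, v])[0, 1] for v in range(3)])
```

With 100,000 simulated persons, the empirical correlations land within about .01 of the generating loadings.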

For a dichotomous item measuring the first-order latent trait, the item response function can be the one-, two-, or three-parameter logistic model (Birnbaum, 1968; Rasch, 1960) or another IRT function. In the three-parameter logistic model, the probability of a correct response to item $i$ in test $v$ for person $n$ is

$$P_{ni1v} = \pi_{iv} + (1 - \pi_{iv}) \times \frac{\exp[\alpha_{iv}(\theta^{(1)}_{nv} - \delta_{iv})]}{1 + \exp[\alpha_{iv}(\theta^{(1)}_{nv} - \delta_{iv})]}, \quad (2)$$

where $\alpha_{iv}$ is the slope (discrimination) parameter, $\delta_{iv}$ is the location (difficulty) parameter, and $\pi_{iv}$ is the asymptotic (pseudo-guessing) parameter of item $i$ in test $v$. Combining Equations 1 and 2 leads to

$$P_{ni1v} = \pi_{iv} + (1 - \pi_{iv}) \times \frac{\exp[\alpha_{iv}(\beta_{v}\theta^{(2)}_{n} - \delta_{iv} + \varepsilon^{(1)}_{nv})]}{1 + \exp[\alpha_{iv}(\beta_{v}\theta^{(2)}_{n} - \delta_{iv} + \varepsilon^{(1)}_{nv})]}. \quad (3)$$

If the item response function follows the two-parameter logistic model, $\pi_{iv}$ in Equation 3 becomes zero. If it follows the one-parameter logistic model, $\pi_{iv}$ and $\alpha_{iv}$ in Equation 3 become zero and one, respectively. Equation 3 is referred to as the three-parameter higher order IRT model (3P-HIRT). If the item response function is the two-parameter logistic model, then it is called the 2P-HIRT. If the item response function is the one-parameter logistic model, then it is called the 1P-HIRT (de la Torre & Hong, 2010; Huang et al., 2010).
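A minimal sketch of the 3P-HIRT response function in Equation 3; the function and argument names are ours, not the article's. Setting the pseudo-guessing parameter to zero (and additionally the slope to one) recovers the 2P-HIRT (1P-HIRT), as described above.

```python
import math

def p_3p_hirt(theta2, beta_v, a_iv, d_iv, c_iv, eps_nv=0.0):
    """Equation 3: probability of a correct response to item i in test v.

    theta2 -- second-order latent trait of person n
    beta_v -- factor loading of first-order trait v on theta2
    a_iv, d_iv, c_iv -- slope, difficulty, and pseudo-guessing parameters
    eps_nv -- first-order residual from Equation 1 (default 0, its mean)
    """
    z = a_iv * (beta_v * theta2 - d_iv + eps_nv)
    logistic = math.exp(z) / (1.0 + math.exp(z))
    return c_iv + (1.0 - c_iv) * logistic
```

For an average person on an average item (all arguments zero, guessing zero), the probability is .5; a positive guessing parameter lifts the lower asymptote above zero.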

For a polytomous item measuring the first-order latent trait, the item response function can be the generalized partial credit model (Muraki, 1992), the partial credit model (Masters, 1982), the rating scale model (Andrich, 1978), or the graded response model (Samejima, 1969). Take the generalized partial credit model as an example. The log odds are defined as

$$\log\frac{P_{nijv}}{P_{ni(j-1)v}} = \alpha_{iv}(\theta^{(1)}_{nv} - \delta_{iv} - \tau_{ijv}), \quad (4)$$

where $P_{nijv}$ and $P_{ni(j-1)v}$ are the probabilities of scoring $j$ and $j - 1$ on item $i$ in test $v$ for person $n$, respectively; $\delta_{iv}$ is the overall difficulty of item $i$ in test $v$; $\tau_{ijv}$ is the $j$th threshold parameter of item $i$ in test $v$; and the others are defined as above. Combining Equations 1 and 4 leads to

$$\log\frac{P_{nijv}}{P_{ni(j-1)v}} = \alpha_{iv}(\beta_{v}\theta^{(2)}_{n} - \delta_{iv} - \tau_{ijv} + \varepsilon^{(1)}_{nv}). \quad (5)$$

If the item response function follows the partial credit model, then $\alpha_{iv} = 1$. If it follows the rating scale model, then $\alpha_{iv} = 1$ and $\tau_{ijv} = \tau_{jv}$. If it follows the graded response model, then the cumulative logit should be used as the link function and a common set of step parameters is used across items. The corresponding higher order IRT model can be easily formed (Huang et al., 2010). When the item response function is the generalized partial credit model, the corresponding higher order IRT model is denoted as the GP-HIRT. Likewise, when the item response function is the partial credit model, rating scale model, or graded response model, the corresponding higher order IRT model is denoted as the PC-HIRT, RS-HIRT, or GR-HIRT, respectively.
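The adjacent-category logits in Equation 4 determine the full set of category probabilities for an item. A sketch (our own helper, with hypothetical parameter values in the example call):

```python
import numpy as np

def gpcm_probs(theta, a, d, taus):
    """Category probabilities implied by Equation 4 for one item:
    log(P_j / P_{j-1}) = a * (theta - d - tau_j) for j = 1..J.

    taus -- thresholds tau_1..tau_J; returns probabilities for
    categories 0..J (category 0 carries no threshold).
    """
    logits = a * (theta - d - np.atleast_1d(taus))
    # Running sums of the adjacent-category logits give the numerators
    cum = np.concatenate(([0.0], np.cumsum(logits)))
    num = np.exp(cum - cum.max())   # subtract max for numerical stability
    return num / num.sum()

# Example: a four-point item (J = 3) with hypothetical parameters
p = gpcm_probs(theta=0.5, a=1.2, d=0.3, taus=[-1.0, 0.0, 1.0])
```

Exponentiating the running sums and normalizing reproduces the stated log odds exactly for every adjacent pair of categories.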

Higher Order Testlet Response Models

Assume that testlet-based items are used and the testlet effect exists in each testlet. To account for the testlet effect, one can add a random-effect parameter to Equation 2 (dichotomous item) or Equation 4 (polytomous item):

$$P_{ni1v} = \pi_{iv} + (1 - \pi_{iv}) \times \frac{\exp[\alpha_{iv}(\theta^{(1)}_{nv} - \delta_{iv} + \gamma_{nd(i)v})]}{1 + \exp[\alpha_{iv}(\theta^{(1)}_{nv} - \delta_{iv} + \gamma_{nd(i)v})]}, \quad (6)$$

$$\log\frac{P_{nijv}}{P_{ni(j-1)v}} = \alpha_{iv}(\theta^{(1)}_{nv} - \delta_{iv} - \tau_{ijv} + \gamma_{nd(i)v}), \quad (7)$$

where $\gamma_{nd(i)v}$ is an additional latent trait that accounts for local dependence among items within testlet $d$ of test $v$ and is assumed to be normally distributed and independent of $\theta$ and the other $\gamma$s (Wainer et al., 2007; W.-C. Wang & Wilson, 2005). The $\gamma$ variance depicts the magnitude of the testlet effect: the larger the variance, the stronger the testlet effect. When the latent traits are hierarchical, one can combine Equations 1 and 6 or Equations 1 and 7:

$$P_{ni1v} = \pi_{iv} + (1 - \pi_{iv}) \times \frac{\exp[\alpha_{iv}(\beta_{v}\theta^{(2)}_{n} - \delta_{iv} + \varepsilon^{(1)}_{nv} + \gamma_{nd(i)v})]}{1 + \exp[\alpha_{iv}(\beta_{v}\theta^{(2)}_{n} - \delta_{iv} + \varepsilon^{(1)}_{nv} + \gamma_{nd(i)v})]}, \quad (8)$$

$$\log\frac{P_{nijv}}{P_{ni(j-1)v}} = \alpha_{iv}(\beta_{v}\theta^{(2)}_{n} - \delta_{iv} - \tau_{ijv} + \varepsilon^{(1)}_{nv} + \gamma_{nd(i)v}). \quad (9)$$

Equation 8 is referred to as the three-parameter higher order testlet response model (3P-HTM). If $\pi_{iv} = 0$ for all $i$ and $v$, then the 3P-HTM reduces to the two-parameter higher order testlet response model (2P-HTM). If $\pi_{iv} = 0$ and $\alpha_{iv} = 1$ for all $i$ and $v$, then the 3P-HTM reduces to the one-parameter higher order testlet response model (1P-HTM). Equation 9 is referred to as the generalized partial credit higher order testlet response model (GP-HTM), because its item response function is the generalized partial credit model. Likewise, if the item response function is the partial credit model, rating scale model, or graded response model, then the model is denoted as the PC-HTM, RS-HTM, or GR-HTM, respectively.
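For illustration, responses for one 20-item test under the 2P-HTM (Equation 8 with $\pi_{iv} = 0$) can be simulated as follows. The loading, sample size, and four-testlet layout echo the simulation design described later in the article, but the code itself is our sketch, not the authors' program.

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items, n_testlets = 2000, 20, 4
testlet_of = np.repeat(np.arange(n_testlets), n_items // n_testlets)

beta_v = 0.8                                   # loading of this test
a = np.abs(rng.normal(1.0, 0.5, n_items))      # slopes, sd 0.5 = sqrt(0.25);
                                               # abs() stands in for truncation
d = rng.uniform(-2.0, 2.0, n_items)            # difficulties

theta2 = rng.standard_normal(n_persons)                        # 2nd-order trait
eps = rng.standard_normal(n_persons) * np.sqrt(1 - beta_v**2)  # Equation 1
gamma = rng.standard_normal((n_persons, n_testlets))           # testlet effects

# Equation 8 kernel with pi_iv = 0 (the 2P-HTM)
z = a * (beta_v * theta2[:, None] - d + eps[:, None] + gamma[:, testlet_of])
prob = 1.0 / (1.0 + np.exp(-z))
responses = (rng.random((n_persons, n_items)) < prob).astype(int)
```

Every item in the same testlet shares the person's $\gamma$ draw, which is what induces the within-testlet dependence the model is built to capture.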

In the new class of HTM, the item response function is not limited to a specific function. Users are allowed to assign any function to any item. The HTM can accommodate the situation where a test consists of both independent items and testlet-based items and both dichotomous items and polytomous items. In addition, it is feasible to assign different item response functions to different items, even in the same test. For example, some dichotomous items may follow the one-parameter logistic model, and others follow the two-parameter or three-parameter logistic model. Some polytomous items follow the rating scale model, and others follow the generalized partial credit model. Users are also allowed to create their own customized item response functions.

For model identification, one can set $\theta^{(2)}_{n} \sim N(0, 1)$ and $\theta^{(1)}_{nv} \sim N(0, 1)$ so that $\beta_{v}$ can be interpreted as the correlation between $\theta^{(2)}$ and $\theta^{(1)}_{v}$, as in traditional linear factor analysis. When the item response function follows the family of Rasch models, the common slope parameter across all items in the same test can be freely estimated.

MCMC Estimation and Bayesian Model–Data Fit Assessment

The new class of higher order testlet response models often involves many random-effect parameters: $\theta$, $\gamma$, and $\varepsilon$. Although it is possible to develop likelihood-based estimation methods for higher order testlet response models, these methods become computationally burdensome because of high-dimensional numerical integration and may fail to converge. Bayesian estimation with Markov chain Monte Carlo (MCMC) methods, which have been widely used in complicated IRT models (Bolt & Lall, 2003; Bradlow et al., 1999; Fox & Glas, 2003; Kang & Cohen, 2007; Klein Entink, Fox, & van der Linden, 2009; Wainer et al., 2000, 2007), were thus used to estimate the parameters. In Bayesian estimation, specifications of a statistical model, prior distributions of model parameters, and observed data are required to produce a joint posterior distribution for the model parameters. MCMC methods provide an alternative and simple way to simulate the joint posterior distribution of the unknown quantities and obtain simulation-based estimates of the posterior parameters of interest.

The Metropolis-Hastings and the Gibbs sampling algorithms (Gelfand & Smith, 1990; Geman & Geman, 1984) are two major procedures for constructing a Markov chain whose transition density converges to the target density. In this study, we used the WinBUGS freeware (Spiegelhalter et al., 2003) to simulate Markov chains, and the mean of the joint posterior distribution was treated as the parameter estimate.

WinBUGS is flexible enough to allow users to specify different models for different items within or between tests. For example, a test may consist of both dichotomous items and polytomous items, in which some dichotomous items follow the 1PLM, some dichotomous items follow the 3PLM, some polytomous items follow the partial credit model, and some polytomous items follow the rating scale model. A test may consist of both independent items (no testlet effects) and testlet-based items, exclusively independent items, or exclusively testlet-based items. All these combinations can be easily formulated in WinBUGS.

Posterior predictive model checking (PPMC; Gelman, Meng, & Stern, 1996) is a Bayesian model–data fit checking technique that assesses the plausibility of posterior predictive replicated data against the observed data; it has the advantages of a strong theoretical basis and an intuitively appealing simplicity. Let $y$ be the observed data and $y^{rep}$ be the replicated data. The posterior predictive density function of $y^{rep}$, given model parameter $\omega$, is

$$P(y^{rep} \mid y) = \int P(y^{rep} \mid \omega)\, P(\omega \mid y)\, d\omega. \quad (10)$$

A test statistic T is chosen to detect the systematic discrepancy between the observed data and the replicated data. The posterior predictive p value is defined as follows to summarize the comparison between the two test statistics over a large number of iterations:

$$p \equiv P[T(y^{rep}) \geq T(y) \mid y] = \int_{T(y^{rep}) \geq T(y)} P(y^{rep} \mid y)\, dy^{rep}. \quad (11)$$

An extreme p value (close to 0 or 1) indicates a poor model–data fit.
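In practice, the posterior predictive p value of Equation 11 is estimated as the proportion of MCMC draws whose replicated-data statistic equals or exceeds the observed-data statistic. A schematic helper (our own, with made-up numbers in the example):

```python
import numpy as np

def ppp_value(T_obs, T_rep):
    """Monte Carlo estimate of Equation 11: the share of posterior draws
    for which T(y_rep) >= T(y). T_obs may be a scalar or, when T also
    depends on the parameters, one value per draw."""
    return float(np.mean(np.asarray(T_rep) >= np.asarray(T_obs)))

# Replicated statistics straddling the observed one give p near .5;
# values close to 0 or 1 would flag misfit
p = ppp_value(1.0, [0.5, 0.9, 1.1, 1.5])
```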

According to the nature of the higher order testlet response model (HTM), four statistics were chosen to assess model–data fit in this study. The first one is the Bayesian chi-square test (Sinharay, Johnson, & Stern, 2006), which evaluates the overall model–data fit:

$$\sum_{n}\sum_{i}\frac{[y_{ni} - E(y_{ni} \mid \omega)]^{2}}{\mathrm{Var}(y_{ni} \mid \omega)}, \quad (12)$$

where subscript $n$ is for person and subscript $i$ is for item. The second statistic evaluates the factor structure. It compares the reproduced correlation matrix with the original correlation matrix. If the difference between the two matrices is huge, then the factors do not account for a great deal of the variance in the original correlation matrix, and thus, the model–data fit is poor.
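For one posterior draw, the discrepancy in Equation 12 reduces to a sum of squared standardized residuals; for a dichotomous item, $E(y_{ni} \mid \omega)$ is the model probability $p$ and $\mathrm{Var}(y_{ni} \mid \omega)$ is $p(1 - p)$. A sketch with hypothetical values:

```python
import numpy as np

def bayes_chisq(y, expected, variance):
    """Equation 12: sum over persons n and items i of
    (y_ni - E[y_ni | omega])**2 / Var(y_ni | omega),
    evaluated at a single posterior draw omega."""
    y, expected, variance = map(np.asarray, (y, expected, variance))
    return float(np.sum((y - expected) ** 2 / variance))

# Dichotomous toy data: 2 persons x 2 items, Bernoulli moments p and p(1-p)
p = np.array([[0.5, 0.8], [0.2, 0.5]])
y = np.array([[1, 1], [0, 0]])
stat = bayes_chisq(y, p, p * (1 - p))
```

In the PPMC, this statistic would be computed for both the observed and the replicated data at each retained draw and the two streams compared as in Equation 11.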

The third and fourth statistics measure the association among item pairs within a testlet for dichotomous items and polytomous items, respectively. For dichotomous items, we compute the odds ratios between item pairs within a testlet to depict uncontrolled dependence between items. Let $N_{kk'}$ be the number of persons scoring $k$ on the first item and $k'$ on the second item. The odds ratio (OR) between an item pair is given by

$$\mathrm{OR} = \frac{N_{11}N_{00}}{N_{10}N_{01}}. \quad (13)$$

For polytomous items, the Spearman rank correlation between item pairs within a testlet is computed to assess between-item association. If any systematic discrepancy is found between the odds ratios (or the rank correlations) obtained from the observed data and those obtained from the replicated data, misfit due to the unexplained variance from item dependence is detected (Sinharay et al., 2006). To reduce the computational burden and concentrate on testlet effects, only the odds ratios and the correlations for item pairs within a testlet are computed; those across testlets are not.
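The odds ratio in Equation 13 is straightforward to compute from a pair of dichotomous item-response vectors; the helper and the eight-person data below are illustrative only.

```python
import numpy as np

def odds_ratio(item1, item2):
    """Equation 13: OR = (N11 * N00) / (N10 * N01), where N_kk' counts
    persons scoring k on the first item and k' on the second."""
    x, z = np.asarray(item1), np.asarray(item2)
    n11 = int(np.sum((x == 1) & (z == 1)))
    n00 = int(np.sum((x == 0) & (z == 0)))
    n10 = int(np.sum((x == 1) & (z == 0)))
    n01 = int(np.sum((x == 0) & (z == 1)))
    return (n11 * n00) / (n10 * n01)

# Eight hypothetical persons answering two items in the same testlet
or_obs = odds_ratio([1, 1, 0, 0, 1, 0, 1, 0],
                    [1, 0, 0, 1, 1, 0, 1, 0])
```

An observed OR systematically larger than the ORs from replicated data signals within-testlet dependence the fitted model leaves unexplained.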

For model comparison, one can use the Bayesian deviance information criterion (DIC), which simultaneously takes into account model fit and model complexity (Spiegelhalter, Best, Carlin, & van der Linde, 2002):

$$\mathrm{DIC} = \bar{D} + p_{D}, \quad (14)$$

where $\bar{D}$ is the posterior expectation of the deviance and $p_{D}$ is defined as the difference between the posterior mean of the deviance and the deviance at the posterior mean of the parameters.
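From MCMC output, Equation 14 needs only the sampled deviances and the deviance evaluated at the posterior-mean parameters; the numbers below are invented for illustration.

```python
import numpy as np

def dic(deviance_draws, deviance_at_post_mean):
    """Equation 14: DIC = Dbar + pD, with Dbar the posterior mean of the
    deviance and pD = Dbar - D(posterior-mean parameters); smaller DIC is
    preferred in model comparison."""
    d_bar = float(np.mean(deviance_draws))
    p_d = d_bar - deviance_at_post_mean
    return d_bar + p_d

val = dic([100.0, 102.0, 104.0], 100.0)   # Dbar = 102, pD = 2, DIC = 104
```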

Method

Simulation Design

The simulations comprised two major parts: dichotomous items and polytomous items. Each part had three first-order latent traits and one second-order latent trait.

Each test measuring the first-order latent trait consisted of either 20 dichotomous testlet-based items or 20 four-point polytomous testlet-based items. All the $\gamma$ variances (testlet effects) were set at one, suggesting a rather large testlet effect. The number of testlets in the 20-item test was either two (each testlet had 10 items, a long testlet) or four (each testlet had 5 items, a short testlet). Although the $\gamma$ variances were all equal across testlets, their effects were stronger in the long testlets.

For dichotomous items, the item response functions in each test followed the one-parameter, two-parameter, or three-parameter testlet response model. For polytomous items, the item response functions in each test followed the generalized partial credit, partial credit, rating scale, or graded response testlet response model. The factor loadings (the correlations between the second-order latent trait and the three first-order latent traits) were set at .9, .8, and .7, respectively.

In the dichotomous items, the difficulty parameters were sampled from U(−2, 2), the slope parameters were sampled from N(1, 0.25) and truncated above 0, and the pseudo-guessing parameters were sampled from U(0, 0.3). In the polytomous items, the step difficulty parameters were generated from U(−2.5, 2.5), and the location parameters were generated from U(−1.5, 1.5) with −1, 0, and 1 as the category threshold parameters, respectively. The specifications of item parameters were in line with those commonly found in practice. The item parameters were generated independently, consistent with the IRT literature. If necessary, a multivariate normal distribution can be assumed for the item parameters (van der Linden, Klein Entink, & Fox, 2010). A total of 5,000 persons were generated when the item response function of the dichotomous items followed the three-parameter model, because a large sample size is required for estimating the pseudo-guessing parameters; a total of 2,000 persons were generated when the item response function followed the other models. All the variances of latent traits were set at one.
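The generating distributions described above can be sketched as follows; folding the slope draws to positive values is our simplification of the truncation above 0, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n_items = 20

# Dichotomous items
difficulty = rng.uniform(-2.0, 2.0, n_items)             # U(-2, 2)
slope = np.abs(rng.normal(1.0, np.sqrt(0.25), n_items))  # N(1, 0.25), kept > 0
guessing = rng.uniform(0.0, 0.3, n_items)                # U(0, 0.3)

# Polytomous items: overall locations plus fixed category thresholds
location = rng.uniform(-1.5, 1.5, n_items)               # U(-1.5, 1.5)
thresholds = np.array([-1.0, 0.0, 1.0])                  # four-point items
```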

The simulated data sets were analyzed in two ways: First, the data-generating model was also the analysis model (i.e., the HTM was fit to HTM data). Second, the standard higher order IRT model (HIRT) was fit to HTM data. We wrote a Matlab computer program to generate item responses. Ten replications were made under each condition, mainly because each replication could take several days of computer time, and there were many conditions. In our experience, 10 replications appeared to be sufficient to gain reliable inferences because the sampling variation across replications appeared to be rather small. Other studies that used Bayesian estimation for complicated IRT models also used 10 replications or fewer (Bolt & Lall, 2003; Cao & Stokes, 2008; Klein Entink et al., 2009; Li, Bolt, & Fu, 2006; van der Linden et al., 2010).

Analysis

The freeware WinBUGS 1.4 with MCMC methods was used to estimate the parameters. We specified a normal prior with mean 0 and variance 4 for all the location and threshold parameters, a lognormal prior with mean 0 and variance 1 for the slope parameters, a beta prior with both hyperparameters equal to 1 for the pseudo-guessing parameters, a normal prior with mean 0.5 and variance 10, truncated to lie between 0 and 1, for the factor loadings, and a gamma prior with both hyperparameters equal to 0.01 for the inverse of the $\gamma$ variances (testlet effects). A total of 15,000 iterations were obtained, with the first 5,000 iterations used as the burn-in. These settings were consistent with the literature on fitting IRT models in WinBUGS.

After monitoring the convergence diagnostic according to the multivariate potential scale reduction factor (Brooks & Gelman, 1998) with three parallel chains for the first simulated data set across all analysis models, we found that the chain lengths were sufficient to reach stationarity for all structural parameters. In addition, the last 200 MCMC samples for latent trait estimates were used to compute test reliability (precision of person measures).

For each estimator, we computed the bias and the root mean square error (RMSE):

$$\mathrm{Bias}(E(\zeta_{r})) = \sum_{r=1}^{R}(\mathrm{EAP}(\hat{\zeta}_{r}) - \zeta)/R, \quad (15)$$

$$\mathrm{RMSE}(E(\zeta_{r})) = \sqrt{\sum_{r=1}^{R}(\mathrm{EAP}(\hat{\zeta}_{r}) - \zeta)^{2}/R}, \quad (16)$$

where $R = 10$ is the number of replications, $\zeta$ is the generating value, and $\mathrm{EAP}(\hat{\zeta}_{r})$ is the expected a posteriori estimate in replication $r$.
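Equations 15 and 16 amount to the mean error and root mean squared error of the EAP estimates over replications; a small helper (ours) with made-up estimates:

```python
import numpy as np

def bias_rmse(eap_estimates, true_value):
    """Equations 15 and 16 for one parameter: bias and RMSE of the EAP
    estimates across the R replications."""
    errors = np.asarray(eap_estimates, dtype=float) - true_value
    bias = float(errors.mean())
    rmse = float(np.sqrt((errors ** 2).mean()))
    return bias, rmse

# Four hypothetical replications of an estimate whose true value is 1.0
b, r = bias_rmse([0.9, 1.1, 1.0, 1.2], true_value=1.0)
```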

It was expected that (a) the parameter recovery would be satisfactory when the HTM was fit to HTM data because the data-generating model and the analysis model were identical; (b) the parameter estimation would be biased and the test reliability would be overestimated when the HIRT was fit to HTM data because the testlet effect existed but was not considered; (c) the longer the testlet, the worse the estimation of parameters and test reliability would be when the testlet effect was not considered;

and (d) the DIC would favor the data-generating HTM rather than the corresponding HIRT, and the PPMC would show a good model–data fit for the generating HTM and a poor fit for the corresponding HIRT.

Results

Parameter Recovery of Dichotomous Items

Due to space constraints, the bias and RMSE for individual parameters are not reported; instead, their means and standard deviations are provided. Table 1 summarizes the bias values when the data-generating HTM (left-hand side) and the HIRT (right-hand side) were fit to HTM data. The bias values yielded by the HTM appeared to be very close to zero and were consistent with those found in the literature when data-generating dichotomous testlet response models were fit (Bradlow et al., 1999; Wainer et al., 2000; W.-C. Wang & Wilson, 2005). In comparison, the 1P-HTM yielded bias values slightly better than those by the 2P-HTM, which were slightly better than those by the 3P-HTM. The difference in bias values between long and short testlets was small. When the testlet effect was ignored and the HIRT was fit to HTM data, the bias values were many times more extreme than those when the HTM was fit. The underestimation of the three factor loadings by the HIRT was the most significant, with bias values ranging from −0.193 to −0.043.

Table 1. Bias of Dichotomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data.
Note: HTM = higher order testlet response model; HIRT = higher order IRT model.

Table 2. Root Mean Square Error of Dichotomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data.
Note: HTM = higher order testlet response model; HIRT = higher order IRT model.

Table 2 summarizes the RMSE values when the data-generating HTM (left-hand side) and the HIRT (right-hand side) were fit to HTM data. The HTM yielded small RMSE values, especially for the 1P-HTM. In contrast, when the testlet effect was ignored and the HIRT was fit, the RMSE values became many times larger than those in the HTM, especially for the factor loadings. A factorial analysis of variance was conducted to evaluate the effects of the three independent variables: (a) method (HIRT or HTM), (b) testlet length (short or long), and (c) model (1P, 2P, or 3P).

When the RMSE of the difficulty parameters was the dependent variable, the partial $\eta^{2}$ was 17% for the main effect of model, 9% for the main effect of method, and less than 2% for the other effects. When the RMSE of the discrimination parameters was the dependent variable, the partial $\eta^{2}$ was 10% for the main effect of method and less than 2% for the other effects. When the RMSE of the pseudo-guessing parameters was the dependent variable, the partial $\eta^{2}$ was 8% for the main effect of method and less than 1% for the other effects. When the RMSE of the factor loadings was the dependent variable, the partial $\eta^{2}$ was 81% for the main effect of method, 52% for the main effect of testlet length, 43% for the interaction of method and testlet length, and less than 3% for the other effects. In short, method had a large impact on parameter recovery, which demonstrated that ignoring the testlet effects by fitting the HIRT to HTM data yielded poor parameter estimation.

Parameter Recovery of Polytomous Items

Table 3 summarizes the bias values when the data-generating HTM (left-hand side) and the HIRT (right-hand side) were fit to HTM data. Similar to the results for dichotomous items, the bias values yielded by the HTM were very close to zero and were consistent with those found in the literature when data-generating polytomous testlet response models were fit (W.-C. Wang & Wilson, 2005; X. Wang, Bradlow, & Wainer, 2002). In comparison, the Rasch family models (partial credit and rating scale models) had a slightly smaller bias than the two-parameter models (generalized partial credit and graded response models). The difference in bias values between long and short testlets was small. When the testlet effect was ignored and the HIRT was fit to HTM data, most bias values were many times more extreme than those when the HTM was fit. The three factor loadings were substantially underestimated.

Table 3. Bias of Polytomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data.
Note: HTM = higher order testlet response model; HIRT = higher order IRT model.

Table 4. Root Mean Square Error of Polytomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data.
Note: HTM = higher order testlet response model; HIRT = higher order IRT model.

Table 4 summarizes the RMSE values when the data-generating HTM (left-hand side) and the HIRT (right-hand side) were fit to HTM data. The results were consistent with those found for dichotomous items (Table 2). The HTM yielded small RMSE values, whereas the HIRT yielded RMSE values many times larger than those in the HTM. According to the factorial analysis of variance, when the RMSE of the location parameters was the dependent variable, the partial $\eta^{2}$ was 43% for the main effect of method and less than 3% for the other effects. When the RMSE of the discrimination parameters was the dependent variable, the partial $\eta^{2}$ was 5% for the main effect of method and less than 1% for the other effects. When the RMSE of the factor loadings was the dependent variable, the partial $\eta^{2}$ was 73% for the main effect of method, 48% for the main effect of testlet length, 42% for the interaction of method and testlet length, 16% for the interaction of method and model, 10% for the main effect of model, and less than 2% for the other effects. Apparently, method played a major role in parameter recovery, and ignoring the testlet effect yielded poor parameter estimation.

Test Reliability

The mean test reliabilities for dichotomous items and polytomous items across 10 replications are shown in Table 5. The test reliability obtained from the HTM was treated as a gold standard to which the test reliability obtained from the HIRT was compared, because the HTM was the data-generating model. It was found that fitting the HIRT to the HTM data (both dichotomous and polytomous items) substantially overestimated the test reliability for the first-order latent traits, because the testlet effect was mistakenly treated as true variance. However, the test reliability of the second-order latent trait appeared to be less affected, which might be because the overestimation in the test reliability of the first-order latent traits and the underestimation of the factor loadings cancelled out such that the test reliability of the second-order latent trait was estimated appropriately in the HIRT.

Posterior Predictive Model Checking

When the Bayesian chi-square and the reproduced correlation were used to evaluate model–data fit through the PPMC, as Table 6 shows, fitting the HTM to HTM data consistently yielded a good model–data fit (slightly overconservative); however, fitting the HIRT to the HTM data yielded a poor fit only for the one-parameter models. In other words, these two statistics appeared not to be sensitive enough to model misspecification in the multiparameter models. With respect to the odds ratio statistic for dichotomous items, the mean number of misfits across 120 item pairs was computed over 10 replications. According to this statistic, fitting the HTM yielded a very small number of misfits, whereas fitting the HIRT yielded a very large number of misfits, especially for the one-parameter models. In addition, the longer the testlets, the more powerful the odds ratio statistic. With respect to the rank correlation for polytomous items, the mean number of misfits across 120 item pairs was likewise computed. Like the odds ratio statistic for dichotomous items, the rank correlation performed fairly well in distinguishing the true and wrong models. In addition to the PPMC, the Bayesian DIC was computed to compare the HTM and HIRT; the HTM always had a smaller DIC value than the corresponding HIRT. Given the small number of replications, the above PPMC results should be interpreted with caution.
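The odds ratio discrepancy used in the PPMC can be sketched as follows. The data here are random placeholders rather than draws from a fitted model, and the 0.5 continuity correction is a common convention, not necessarily the one used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def odds_ratio(x: np.ndarray, y: np.ndarray) -> float:
    """Odds ratio for two dichotomous item-score vectors; 0.5 is added
    to each cell to guard against empty cells (a common convention)."""
    n11 = np.sum((x == 1) & (y == 1)) + 0.5
    n10 = np.sum((x == 1) & (y == 0)) + 0.5
    n01 = np.sum((x == 0) & (y == 1)) + 0.5
    n00 = np.sum((x == 0) & (y == 0)) + 0.5
    return (n11 * n00) / (n10 * n01)

# PPMC sketch for one item pair: compare the realized odds ratio with its
# posterior predictive distribution. Real use would simulate `replicated`
# datasets from posterior parameter draws; random placeholders stand in here.
observed = rng.integers(0, 2, size=(1000, 2))
replicated = [rng.integers(0, 2, size=(1000, 2)) for _ in range(200)]

realized = odds_ratio(observed[:, 0], observed[:, 1])
pp_p = np.mean([odds_ratio(r[:, 0], r[:, 1]) >= realized for r in replicated])
# Posterior predictive p-values near 0 or 1 flag residual local dependence
# that the fitted model failed to absorb.
```

Repeating this over all item pairs and counting extreme p-values gives the "number of misfits" summarized in Table 6.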

Two Empirical Examples

Example 1: Ability Tests With Dichotomous Items

In Taiwan, junior high school graduates who wish to enter senior high schools have to take the Basic Competence Tests for Junior High School Students, which consist of five subjects: language, mathematics, English, social sciences, and nature sciences.

In the 2006 tests, there were 48 items in language, 33 items in mathematics, 45 items in English, 63 items in social sciences, and 58 items in nature sciences; all were in multiple-choice format. Every test consisted of both independent items and testlet-based items. Altogether, there were 23 testlets in the five tests. Each of the five tests was treated as measuring a first-order latent trait. The five first-order latent traits were treated as governed by the same second-order latent trait, referred to as "academic ability." A total of 5,000 examinees (2,639 boys and 2,361 girls) were randomly sampled from more than 300,000 examinees.

Table 5. Mean Test Reliability for Dichotomous and Polytomous Items Across Replications.

                                   Long              Short
                                HTM    HIRT       HTM    HIRT
Dichotomous items
  1-Parameter
    Test 1                      .67    .81        .72    .79
    Test 2                      .65    .81        .71    .79
    Test 3                      .63    .81        .70    .79
    2nd order                   .66    .65        .71    .71
  2-Parameter
    Test 1                      .66    .80        .71    .77
    Test 2                      .66    .83        .72    .80
    Test 3                      .62    .80        .68    .77
    2nd order                   .66    .66        .70    .70
  3-Parameter
    Test 1                      .62    .76        .67    .72
    Test 2                      .62    .78        .67    .75
    Test 3                      .59    .73        .63    .70
    2nd order                   .62    .61        .67    .66
Polytomous items
  Partial credit
    Test 1                      .71    .92        .79    .90
    Test 2                      .70    .92        .78    .90
    Test 3                      .68    .92        .77    .90
    2nd order                   .70    .69        .75    .76
  Generalized partial credit
    Test 1                      .71    .91        .78    .89
    Test 2                      .70    .92        .79    .90
    Test 3                      .69    .91        .76    .88
    2nd order                   .70    .69        .75    .75
  Rating scale
    Test 1                      .72    .92        .79    .90
    Test 2                      .70    .92        .78    .90
    Test 3                      .69    .92        .77    .90
    2nd order                   .70    .70        .76    .77
  Graded response
    Test 1                      .70    .91        .78    .89
    Test 2                      .70    .92        .78    .90
    Test 3                      .68    .91        .77    .89
    2nd order                   .69    .68        .75    .74

Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order IRT model. All the standard deviations of the test reliabilities were between .01 and .02.

The HTM and HIRT with item response functions following the 1PLM, 2PLM, and 3PLM were fit to the data, resulting in a total of six models. WinBUGS was used for parameter estimation. According to the Bayesian DIC, the 3P-HTM had the best fit. According to the PPMC with the Bayesian chi-square, reproduced correlation, and odds ratio statistics, the 3P-HTM also had a good fit. The estimates were between 0.60 and 10.71 (M = 2.60) for the slope parameters, between −2.72 and 1.53 (M = −0.19) for the difficulty parameters, and between 0.02 and 0.47 (M = 0.21) for the pseudo-guessing parameters. The factor loadings were .95, .90, .94, .96, and .98, and the test reliabilities were .95, .94, .94, .96, and .96, for language, English, mathematics, social sciences, and nature sciences, respectively. The test reliability for the second-order latent trait was .97. The high factor loadings were not surprising, because these five tests were designed to measure basic competence for junior high school students. The standard deviations of γ for the 23 testlets were between 0.10 and 0.52 (M = 0.26). Compared with the unit standard deviation of the first-order latent traits (set at 1 for model identification), the random testlet effects ranged from trivial to moderate. When the testlet effects were not considered by fitting the 3P-HIRT to the data, the resulting factor loadings and test reliabilities were very similar to those found in the 3P-HTM, mainly because the testlet effects were rather small.

Table 6. Frequencies of Misfit in Posterior Predictive Model Checking in 10 Replications.

                                   Long              Short
                                HTM    HIRT       HTM    HIRT
1-Parameter
  Bayesian chi-square           0      8          0      0
  Reproduced correlation        0      8          0      0
  Odds ratio (mean)             0.34   9.38       0.42   9.06
2-Parameter
  Bayesian chi-square           0      0          0      0
  Reproduced correlation        0      0          0      0
  Odds ratio (mean)             0.00   8.71       0.07   6.03
3-Parameter
  Bayesian chi-square           0      0          0      0
  Reproduced correlation        0      0          0      0
  Odds ratio (mean)             0.02   8.73       0.06   5.85
Partial credit
  Bayesian chi-square           0      0          0      10
  Reproduced correlation        0      1          0      10
  Rank correlation (mean)       0.15   10.00      0.21   9.70
Generalized partial credit
  Bayesian chi-square           0      0          0      0
  Reproduced correlation        0      0          0      0
  Rank correlation (mean)       0.00   9.68       0.01   9.23
Rating scale
  Bayesian chi-square           0      0          0      10
  Reproduced correlation        0      2          0      10
  Rank correlation (mean)       0.14   10.00      0.31   9.70
Graded response
  Bayesian chi-square           0      0          0      0
  Reproduced correlation        0      0          0      0
  Rank correlation (mean)       0.00   9.68       0.00   9.34

Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order IRT model.
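The Bayesian DIC used for these model comparisons can be computed from posterior deviance draws. A minimal sketch with hypothetical deviance samples (not values from the study); the formula follows Spiegelhalter et al. (2002):

```python
import numpy as np

def dic(deviance_draws: np.ndarray, deviance_at_post_mean: float) -> float:
    """DIC = Dbar + pD, where pD = Dbar - D(posterior mean)
    (Spiegelhalter et al., 2002)."""
    d_bar = deviance_draws.mean()
    p_d = d_bar - deviance_at_post_mean  # effective number of parameters
    return d_bar + p_d

# Hypothetical posterior deviance draws for two competing models.
rng = np.random.default_rng(2)
dev_htm = rng.normal(12000.0, 20.0, 2000)
dev_hirt = rng.normal(12400.0, 20.0, 2000)

# The model with the smaller DIC offers the better fit-complexity trade-off,
# as with the HTM versus the HIRT in this study.
better_is_htm = dic(dev_htm, 11950.0) < dic(dev_hirt, 12350.0)
```

WinBUGS reports Dbar, D(posterior mean), pD, and DIC directly, so in practice no hand computation is needed.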

Example 2: Nonability Tests With Polytomous Items

Three tests were used to measure pathological Internet use in Taiwan (Lin & Liu, 2003): the Internet Hostility Questionnaire (31 four-point items), the Chinese Internet Addiction Scale (26 four-point items), and the Questionnaire of Cognitive Distortions of Pathological Internet Use (16 four-point items). A single total score was reported for each test: the higher the total score, the more serious the Internet hostility, Internet addiction, or cognitive distortion. Each of the three tests was treated as measuring a first-order latent trait. The three first-order latent traits were treated as governed by a second-order latent trait, referred to as "pathological Internet use." Each test consisted of several "subtests," which were treated as testlets, because after partitioning out the target latent trait (θ), the specific latent trait (γ) left for each subtest was considered unimportant (nuisance). There were 15 testlets altogether in the three tests. A total of 987 Taiwanese Internet users (625 male and 362 female college students) responded to the three tests.

The HTM and HIRT with item response functions following the PCM, GPCM, RSM, and GRM were fit to the data, resulting in a total of eight models. WinBUGS was used. According to the Bayesian DIC, the GPC-HTM had the best fit. According to the PPMC with the Bayesian chi-square, reproduced correlation, and rank correlation, the GPC-HTM also had a good fit. The estimates were between 0.10 and 6.77 (M = 1.59) for the slope parameters and between −8.88 and 6.49 (M = 0.44) for the location parameters. The factor loadings were .68, .61, and .83, and the test reliabilities were .82, .89, and .81 for Internet addiction, Internet hostility, and cognitive distortions, respectively. The test reliability for the second-order latent trait was .71. It thus appeared that cognitive distortions contributed more to pathological Internet use than Internet hostility and Internet addiction did. The standard deviations of γ for the 15 testlets were between 0.40 and 3.23 (M = 1.41). Compared with the unit standard deviation of the first-order latent traits, most of the testlet effects were very large. When these large testlet effects were not considered by fitting the GPC-HIRT to the data, the factor loadings were .60, .61, and .80, and the test reliabilities were .91, .94, and .88, for Internet addiction, Internet hostility, and cognitive distortions, respectively. The test reliability for the second-order latent trait was .70. Apparently, ignoring the testlet effects led to an underestimation of the factor loadings and an overestimation of the test reliabilities for the first-order latent traits, which is consistent with the simulation results.

Discussion and Conclusion

The new class of HTM was developed for tests measuring hierarchical latent traits with testlet-based items. The HTM is very flexible because the item response function is not limited to any specific form; different items can have different item response functions, even within the same test. Due to high dimensionality, we adopted the Bayesian approach with MCMC methods implemented in WinBUGS for parameter estimation. A series of simulations were conducted to evaluate parameter recovery of the HTM, the consequences of ignoring testlet effects on the estimation of parameters and test reliability, and the effectiveness of model–data fit statistics. The results show that (a) the parameters of the HTM were recovered very well; (b) when a testlet effect existed but was not considered by fitting the HIRT, the resulting item parameter estimates were biased, the factor loadings were underestimated, and the test reliability of the first-order latent traits was overestimated; and (c) the Bayesian DIC and PPMC were helpful in model comparison and model–data fit checking.
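The two-order structure that the HTM posits can be illustrated with a small data-generating sketch. The response function below uses the 2PL testlet form of Bradlow et al. (1999); the design, loadings, testlet standard deviation, and item parameters are hypothetical choices, not values from the study:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: 3 first-order traits, each measured by 2 testlets
# of 5 dichotomous items (all numeric values below are invented).
n_persons, n_traits, n_testlets, n_items = 500, 3, 2, 5
loadings = np.array([0.8, 0.7, 0.9])  # second-order factor loadings
testlet_sd = 0.5                      # SD of the random testlet effect
a = rng.uniform(0.8, 2.0, size=(n_traits, n_testlets, n_items))  # slopes
b = rng.normal(0.0, 1.0, size=(n_traits, n_testlets, n_items))   # difficulties

# Higher-order structure: first-order traits driven by one second-order
# trait, with residual variance 1 - loading**2 so each first-order trait
# keeps unit variance (the identification constraint used in the paper).
theta2 = rng.normal(0.0, 1.0, n_persons)
theta1 = (loadings[None, :] * theta2[:, None]
          + rng.normal(0.0, np.sqrt(1.0 - loadings**2)[None, :],
                       (n_persons, n_traits)))

# Person-by-testlet random effects induce local dependence within testlets.
g = rng.normal(0.0, testlet_sd, (n_persons, n_traits, n_testlets))

# 2PL testlet response function: logit P = a * (theta - b - g).
eta = a[None] * (theta1[:, :, None, None] - b[None] - g[..., None])
p = 1.0 / (1.0 + np.exp(-eta))
responses = (rng.random(p.shape) < p).astype(int)
```

Fitting the HIRT to such data amounts to forcing testlet_sd to zero, which is exactly the misspecification studied in the simulations.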

The simulations were conducted on personal computers with 2.66-GHz Intel Core i5 processors. Each replication took approximately 2 to 6 days. Such computation time is feasible for most real data analyses but not for comprehensive simulations, which was the main reason why only 10 replications were conducted in this study. To facilitate parameter estimation of the HTM, one may adopt standard IRT computer programs, such as BILOG-MG, MULTILOG, or PARSCALE, to fit each test at the first order and then use the parameter estimates as starting values for the HTM in WinBUGS. Better still, a stand-alone computer program should be developed for the HTM.

In the HTM, we formulated two orders of latent traits. The HTM can easily be extended to accommodate more orders, but at the cost of additional computational burden. In recent years, multilevel IRT models have been developed to describe multilevel structure in persons (e.g., persons nested within schools; Fox, 2005). It would be of great value to combine the HTM with multilevel models. Future studies can explore these theoretical issues and develop more efficient computer programs for these complicated models.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The first author was supported by the National Science Council (Grant No. NSC 100-2410-H-133-015).

References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.

Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27, 395-414.

Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.

Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434-455.

Cao, J., & Stokes, S. L. (2008). Bayesian IRT guessing models for partial guessing behaviors. Psychometrika, 73, 209-230.

de la Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333-353.

de la Torre, J., & Hong, Y. (2010). Parameter estimation with small sample size: A higher order IRT model approach. Applied Psychological Measurement, 34, 267-285.

de la Torre, J., & Song, H. (2009). Simultaneous estimation of overall and domain abilities: A higher order IRT model approach. Applied Psychological Measurement, 33, 620-639.

Fox, J.-P. (2005). Multilevel IRT using dichotomous and polytomous items. British Journal of Mathematical and Statistical Psychology, 58, 145-172.

Fox, J.-P., & Glas, C. A. W. (2003). Bayesian modeling of measurement error in predictor variables using item response theory. Psychometrika, 68, 169-191.

Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Gelman, A., Meng, X.-L., & Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733-807.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Huang, H.-Y., Wang, W.-C., & Chen, P.-H. (2010). An item response model with hierarchical latent traits. Paper presented at the annual meeting of the American Educational Research Association, Denver, CO.

Kang, T., & Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement, 31, 331-358.

Klein Entink, R. H., Fox, J.-P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74, 21-48.

Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30, 3-21.
