
Simultaneous Estimation of Overall and Domain Abilities: A Higher-Order IRT Model Approach

Jimmy de la Torre

Rutgers, The State University of New Jersey

Hao Song

American Board of Internal Medicine

Assessments consisting of different domains (e.g., content areas, objectives) are typically multidimensional in nature but are commonly assumed to be unidimensional for estimation purposes. The different domains of these assessments are further treated as multi-unidimensional tests for the purpose of obtaining diagnostic information. However, when the domains are disparate, assuming a single underlying ability across the domains is not tenable. Moreover, estimating domain proficiencies based on short tests can result in unreliable scores.

This article presents a higher-order item response theory framework where an overall and multiple domain abilities are specified in the same model. Using a Markov chain Monte Carlo method in a hierarchical Bayesian framework, the overall and domain-specific abilities, and their correlations, are estimated simultaneously. The feasibility and effectiveness of the proposed model are investigated under varied conditions in a simulation study and illustrated using actual assessment data. Implications of the model for future test analysis and ability estimation are also discussed.

Index terms: higher-order ability estimation, item response theory, multidimensionality, domain scoring, Markov chain Monte Carlo

Many educational and psychological assessments, including those for large-scale applications, are inherently multidimensional in that they measure multiple abilities or constructs, which could be due to item multidimensionality (Reckase, 1985, 1997) or the intended content or construct structure of the assessment (Ackerman, Gierl, & Walker, 2003). This implies that item response theory (IRT) models based on a single underlying construct may not suffice for some data (e.g., Traub, 1983). However, in practice, conventional unidimensional IRT (CU-IRT) analysis of tests estimates the overall ability using all the items in a test while ignoring its multidimensionality; under this framework, domain abilities are estimated by repeating the approach multiple times using subsets of the items. In the former, depending on the disparity of the domains, the general or overall ability estimate may not be valid because of the extent to which the unidimensional assumption is violated; whereas in the latter, with the correlation between the domain abilities ignored, the multiple applications of the CU-IRT approach are suboptimal, and provide domain ability estimates that are unreliable when the number of items in each domain is small (Ackerman, 1992; de la Torre & Patz, 2005; Wainer et al., 2001).

November 2009 620-639 © 2009 SAGE Publications 10.1177/0146621608326423 http://apm.sagepub.com hosted at http://online.sagepub.com


Several methods are currently available for improving estimation of the domain abilities. The core idea shared by these methods is to incorporate ancillary information into the estimation process, not only of abilities, but of the item parameters as well (e.g., Ackerman & Davey, 1991; Kahraman & Kamata, 2004; Mislevy, 1987; Mislevy & Sheehan, 1989). Ancillary information in this context can include examinees’ performance on the overall test or other subtests, the correlational structure of the underlying abilities, and information about the examinees’ demographic and educational standings. In a Bayesian IRT procedure introduced by Yen (1987), objective scores are stabilized by incorporating an examinee’s performance on the overall test. Wainer et al.’s (2001) approach is a multivariate expansion of Kelley’s (1927) regressed scores and capitalizes on the examinees’ performance on other subtests. Similarly, using information on other subtests, de la Torre and Patz (2005), Wang, Chen, and Cheng (2004), and Yao and Boughton (2007) used a multidimensional IRT framework to obtain more precise ability estimates for test domains or subsections. Finally, the method for improving ability estimates based on a small number of items found in the National Assessment of Educational Progress scaling procedure involves incorporating the examinee’s demographic and educational background information (Mislevy, Johnson, & Muraki, 1992). However, although all these methods provide better domain ability estimates, none of them estimates the overall ability together with the domain abilities.

Hierarchical Modeling of Abilities

Contemporaneously, a hierarchical structure of ability organization has been well accepted in psychological research and practice (e.g., Carroll, 1993; Cronbach & Snow, 1977). The hierarchical model of ability organization unifies two prominent but opposing theories of human intelligence, the unitary intelligence view (Spearman, 1904) and the multiple intelligence view (e.g., Thurstone, 1938), by positing the general ability (G) at the top and multiple more specialized abilities at the lower levels (Gustafsson & Snow, 1997). In large-scale assessment settings, where multiple tests and multicomponent tests are common, it has been recognized that both general and domain-specific abilities are being tested and that estimates of these abilities serve different purposes and uses.

Although hierarchical linear models have their own limitations, the hierarchical modeling framework has been employed in two recent works. De la Torre and Douglas (2004) proposed hierarchical latent trait models in the context of cognitive diagnosis. They posited higher-order latent traits for modeling the joint distribution of binary attributes at the lower level. The basic idea behind this method is that the probability of mastery of a skill or attainment of knowledge is expressed as a function of a person’s higher-order latent traits. In other words, examinees with higher abilities are more likely to show mastery of a skill or attainment of knowledge. This method is limited in that it is parameterized for binary attributes at the lower level, whereas in most IRT applications, both levels involve continuous traits. Another recent study is Sheng’s (2007) two-parameter normal ogive hierarchical model. This model is a hierarchical IRT model for estimating overall and domain-specific abilities. But Sheng’s work was limited in several aspects. First, it was only used for abilities in a two-dimensional space. Second, it was used with the two-parameter normal ogive model (equivalently, the two-parameter logistic model), whereas a more generalized method of modeling assessment data requires the three-parameter logistic (3PL) model. Finally, no constraints were imposed on the regression of the domain abilities on the overall ability. Consequently, the marginal distribution of a given domain ability may not be on the same scale as the overall or other domain abilities.

Objectives

To address the need for a unified framework, this study proposes a higher-order multidimensional IRT approach to simultaneous estimation of the overall and domain abilities.

Using a one-factor higher-order item response theory (HO-IRT) model formulation, it is posited that an examinee’s performance in each domain is accounted for by a domain-specific ability, whereas the correlations among domain abilities are accounted for by a single higher-order ability that can be viewed as the examinee’s overall ability. The HO-IRT model is a general and parsimonious model that subsumes the multiple applications of the CU-IRT model as a special case. Estimates of the HO-IRT model parameters are obtained through Markov chain Monte Carlo (MCMC) estimation in a hierarchical Bayesian framework. The formulation of the model and how it relates to other models such as the bifactor and testlet models are discussed in the next section.

This article focuses on how the HO-IRT model can be used to score examinees’ responses and seeks to answer the following primary questions: Can the overall ability and domain abilities be accurately estimated using the proposed model? How do test length, number of domains, correlation between domain abilities, and number of examinees affect these ability estimates? How do the HO-IRT overall ability estimates compare with the CU-IRT overall ability estimates? Are the HO-IRT domain abilities more efficiently estimated than their CU-IRT counterparts? The article also seeks to investigate how the different factors affect the estimation of the correlational structure of the abilities, which is expressed as a function of the regression parameters. Finally, this article seeks to examine the applicability of the proposed model in analyzing real test data.

Method

Model Specification

In the proposed model, a test is viewed as a multi-unidimensional test. That is, each domain is considered to be unidimensional, and a single domain-specific ability $\theta_i^{(d)}$ accounts for the performance of examinee $i$ on domain $d$, where $d = 1, 2, \ldots, D$. When the different domains measure the same ability, the entire test is deemed unidimensional. The correlations between the different domain abilities are accounted for by positing a higher-order overall ability $\theta_i \sim N(0, 1)$. Specifically, the domain abilities are expressed as linear functions of the overall ability, that is, $\theta_i^{(d)} = \lambda^{(d)}\theta_i + e_{id}$, where $\lambda^{(d)}$ is the latent regression coefficient of domain ability $d$ on the overall ability, $e_{id}$ is the error term that is independent of other error terms and follows a normal distribution with a mean of zero and variance of $1 - (\lambda^{(d)})^2$, and $|\lambda^{(d)}| \le 1$. The domain-level abilities are assumed to be independent of each other conditional on the overall ability. The correlation between the overall and domain abilities is given by $\lambda^{(d)}$, whereas the correlation between the abilities in domains $d$ and $d'$ is given by $\lambda^{(d)} \times \lambda^{(d')}$. Mathematically, $\lambda^{(d)}$ can be negative, but in most applications it can be expected to be nonnegative because domain abilities, if they are related to the overall ability, are typically positively correlated. In addition, the constraint imposed on the regression parameter ensures that the overall and domain abilities are on the same scale. That is, the overall ability is standard normal by assumption, and the marginal distribution of each domain ability is also standard normal (i.e., $\theta_i^{(d)} \sim N(0, 1)$) because $\mathrm{Var}(\theta_i^{(d)}) = (\lambda^{(d)})^2 + 1 - (\lambda^{(d)})^2 = 1$.
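The scale and correlation properties of this structure can be checked numerically. The following is a minimal sketch, not the authors' code (which was written in Ox); the function name `simulate_abilities`, the sample size, and the three regression coefficients are illustrative assumptions.

```python
import numpy as np

def simulate_abilities(n_examinees, lambdas, rng):
    """Draw overall and domain abilities under the higher-order structure."""
    lambdas = np.asarray(lambdas, dtype=float)
    theta = rng.standard_normal(n_examinees)           # overall ability, N(0, 1)
    resid_sd = np.sqrt(1.0 - lambdas**2)               # sd of the domain-specific error
    errors = rng.standard_normal((n_examinees, lambdas.size)) * resid_sd
    theta_d = theta[:, None] * lambdas + errors        # domain abilities, one column per domain
    return theta, theta_d

rng = np.random.default_rng(0)
lambdas = [0.9, 0.8, 0.6]                              # hypothetical values, not from the article
theta, theta_d = simulate_abilities(200_000, lambdas, rng)

marginal_var = theta_d.var(axis=0)                     # should be close to 1 for every domain
corr = np.corrcoef(theta_d, rowvar=False)              # off-diagonals should approach lambda_d * lambda_d'
```

With these illustrative coefficients, the sample correlation between the first two domains should be close to 0.9 × 0.8 = 0.72, and each domain variance close to 1.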

A diagrammatic representation of the HO-IRT model is given in Figure 1. The first level of the figure shows the responses of examinee $i$ to the $j$th item of the $D$ domains, $X_{ij}^{(1)}, X_{ij}^{(2)}, \ldots, X_{ij}^{(D)}$. On the second level, the examinee’s domain-level response is linked to the examinee’s domain-specific ability $\theta_i^{(d)}$ and the specific item characteristic $\boldsymbol{\beta}_j^{(d)}$ via some IRT model. In this article, we will employ the 3PL model with the 1.7 scaling constant. Hence, $\boldsymbol{\beta}_j^{(d)}$ consists of $a_j^{(d)}$, $b_j^{(d)}$, and $g_j^{(d)}$, which are the slope, difficulty, and guessing parameters, respectively. Finally, the third level of the figure shows the examinee’s domain ability as a function of his or her overall ability $\theta_i$.
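For concreteness, the 3PL item response function with the 1.7 scaling constant can be written as a short function. This is a sketch of the standard 3PL form; the function name `p_correct_3pl` is an assumption, and the example values are those of Item 1 in Table 1.

```python
import numpy as np

def p_correct_3pl(theta, a, b, g, scale=1.7):
    """Probability of a correct response under the 3PL model.

    theta : ability; a : slope; b : difficulty; g : guessing parameter.
    """
    theta = np.asarray(theta, dtype=float)
    return g + (1.0 - g) / (1.0 + np.exp(-scale * a * (theta - b)))

# Item 1 of Table 1: a = 0.90, b = 0.95, g = 0.18.
# At theta = b the logistic term is 1/2, so the probability is g + (1 - g) / 2.
p_at_difficulty = p_correct_3pl(0.95, 0.90, 0.95, 0.18)
```

When the ability equals the item difficulty, the probability is halfway between the guessing parameter and 1, which is 0.59 for this item.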

Estimation

Although the HO-IRT model can also be used to improve the calibration process, this article focuses on scoring the examinees’ responses. Hence, the item parameters are assumed to be known (i.e., have been previously estimated) throughout the article.

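Figure 1

A Diagrammatic Representation of the HO-IRT Model

[Figure not reproduced. The diagram links the responses $X_{ij}^{(1)}, X_{ij}^{(2)}, \ldots, X_{ij}^{(D)}$ to the domain abilities $\theta_i^{(1)}, \theta_i^{(2)}, \ldots, \theta_i^{(D)}$ and the item parameters $\boldsymbol{\beta}_j^{(1)}, \boldsymbol{\beta}_j^{(2)}, \ldots, \boldsymbol{\beta}_j^{(D)}$, and links the domain abilities to the overall ability $\theta_i$ through $\lambda^{(1)}, \lambda^{(2)}, \ldots, \lambda^{(D)}$. Observed variables are in circles; fixed variables are in boxes; the remaining variables are to be estimated.]

Note: HO-IRT = higher-order item response theory.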


In addition to estimation of the overall and domain abilities for each examinee, the latent regression parameters $\lambda^{(1)}, \lambda^{(2)}, \ldots, \lambda^{(D)}$ of the model also need to be estimated. Using a hierarchical Bayesian formulation, the model can be expressed as follows:

$$\theta_i \sim N(0, 1), \qquad \lambda^{(d)} \sim U(-1.0, 1.0),$$

and

$$\theta_i^{(d)} \mid \theta_i, \lambda^{(d)} \sim N\left(\lambda^{(d)}\theta_i,\; 1 - (\lambda^{(d)})^2\right).$$

If the regression parameters are assumed to be known, the overall and domain ability estimates can be easily obtained using traditional scoring methods (e.g., maximum likelihood estimation) by treating each examinee one at a time. However, the complexity and dimensionality of the problem are greatly increased by estimating the regression parameters together with the abilities. Therefore, for this study, the model parameters were estimated using MCMC. In this article, draws for $\theta_i^{(d)}$ and $\lambda^{(d)}$ were obtained using the Metropolis-Hastings algorithm, whereas draws for $\theta_i$ were sampled using the full conditional distribution

$$N\left(c \sum_{d=1}^{D} \frac{\lambda^{(d)}\theta_i^{(d)}}{1 - (\lambda^{(d)})^2},\; c\right), \quad \text{where } c^{-1} = 1 + \sum_{d=1}^{D} \frac{(\lambda^{(d)})^2}{1 - (\lambda^{(d)})^2}.$$

From this conditional distribution, it can be seen that the overall ability $\theta_i$ is simply a weighted average of the domain abilities $\theta_i^{(d)}$. To this extent, the overall ability is similar to the unidimensional latent composite described by Zhang and Stout (1999a, 1999b). Finally, the parameter estimates and their standard errors were based on the means and standard deviations of the draws after the burn-in, respectively. For a general overview of MCMC, refer to Casella and George (1995), Chib and Greenberg (1995), Gamerman (1997), Gelman, Carlin, Stern, and Rubin (2003), Gilks, Richardson, and Spiegelhalter (1996), and Tierney (1994). For an overview of MCMC as applied to IRT, refer to Patz and Junker (1999a, 1999b).
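The Gibbs step for the overall ability can be sketched as follows. This is an illustration under stated assumptions rather than the authors' Ox implementation: the function name `draw_overall_ability` is hypothetical, and the precision is taken as $1 + \sum_d (\lambda^{(d)})^2 / (1 - (\lambda^{(d)})^2)$, which is what the normal-normal conjugate update yields for the prior and conditional distribution stated in the model.

```python
import numpy as np

def draw_overall_ability(theta_d, lambdas, rng):
    """One Gibbs draw of the overall ability for every examinee.

    theta_d : (n_examinees, D) current draws of the domain abilities
    lambdas : (D,) current regression parameters, each with |lambda| < 1
    Returns the draws, the conditional means, and the conditional variance c.
    """
    theta_d = np.asarray(theta_d, dtype=float)
    lambdas = np.asarray(lambdas, dtype=float)
    resid_var = 1.0 - lambdas**2                           # variance of each domain-specific error
    c = 1.0 / (1.0 + np.sum(lambdas**2 / resid_var))       # conditional variance
    mean = c * (theta_d @ (lambdas / resid_var))           # weighted combination of domain abilities
    draws = mean + np.sqrt(c) * rng.standard_normal(theta_d.shape[0])
    return draws, mean, c
```

When every regression parameter is zero, the function returns a conditional mean of 0 and a variance of 1, so the draw reverts to the standard normal prior.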

As noted above, the correlation between domains $d$ and $d'$ is given by $\lambda^{(d)} \cdot \lambda^{(d')}$. When only two dimensions are involved, the estimates of the regression parameters will not be unique because different sets of $\lambda^{(1)}$ and $\lambda^{(2)}$ resulting in the same correlation coefficient can be found. Consequently, the two regression parameters cannot be separately estimated because of model indeterminacy. Thus, when the HO-IRT model involves only two domains, an additional constraint needs to be imposed for the regression parameters to be estimable. In this article, the regression parameters are constrained to be equal (i.e., $\lambda^{(1)} = \lambda^{(2)}$). If three dimensions are involved, three correlations between the domains exist. Because the current formulation of the HO-IRT model requires estimating three regression parameters, the model always perfectly fits the data. When $D \ge 4$, there exist more correlations between the domains than there are regression parameters. As a result, the true correlational structure of the abilities may be more complex than what a linear model can fit.
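The three-domain case can be illustrated by solving the three product equations directly. The sketch below assumes all three correlations are positive; the function name and the numeric correlations are hypothetical, and the function does not check that the solutions satisfy the constraint that each coefficient is at most 1 in absolute value.

```python
import math

def lambdas_from_correlations(r12, r13, r23):
    """Solve r_dd' = lambda_d * lambda_d' for three positively correlated domains."""
    lam1 = math.sqrt(r12 * r13 / r23)
    lam2 = math.sqrt(r12 * r23 / r13)
    lam3 = math.sqrt(r13 * r23 / r12)
    return lam1, lam2, lam3

# Hypothetical correlations generated by coefficients (0.9, 0.8, 0.6):
# 0.9 * 0.8 = 0.72, 0.9 * 0.6 = 0.54, 0.8 * 0.6 = 0.48.
lam = lambdas_from_correlations(0.72, 0.54, 0.48)

# With two domains and the equality constraint, a single correlation r
# determines the common coefficient as sqrt(r).
lam_two_domains = math.sqrt(0.49)
```

Three correlations determine three coefficients exactly, whereas with two domains one correlation can only determine one coefficient, which is why the equality constraint is needed.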

Finally, it should be noted that when all the dimensions are uncorrelated, some algorithms may encounter convergence problems in estimating the regression parameters. Specifically, for all correlation coefficients to be zero, only $D - 1$ regression parameters must be equal to zero; the remaining parameter can be of any value. However, this should not be a cause of serious concern because the HO-IRT model should not be applied unless one has a priori knowledge that the domains are correlated.

HO-IRT and Other Similar Models

Under a different formulation, the HO-IRT model can be viewed as the testlet model of Wang, Bradlow, and Wainer (2002), and the mathematical relationship between the two models can be easily established. Although not its primary purpose, the HO-IRT model can also be used to determine a testlet effect: Under the current model, the testlet effect is said to be present when $\lambda^{(d)} < 1$. Despite their similarities, a difference of practical importance exists between the two formulations: whereas the testlet model views the domain-specific deviation as a random component representing the person–testlet interaction, the HO-IRT formulation explicitly models the variation in examinees’ performance across the different domains as due to domain-specific fixed abilities. Consequently, the testlet model can only directly provide the examinees’ overall abilities, whereas the HO-IRT model can provide both the overall and domain-specific scores at the same time.

Another model that is mathematically related to the HO-IRT and the testlet models is the bifactor model, a hierarchical factor model in which an item has nonzero loadings on the general factor and one specific factor. Yung, Thissen, and McLeod (1999) showed that the hierarchical factor model (e.g., bifactor) and the higher-order factor model (e.g., testlet) can be transformed into each other. The general factor and specific factors of the bifactor model are orthogonal to one another, and computer packages such as TESTFACT (Bock et al., 2000) implement the item analysis using a full-information item factor analysis algorithm (Gibbons & Hedeker, 1992). Like the testlet model, the setup of the bifactor model allows for the standard error of the general factor to be estimated accurately even when the test is not entirely unidimensional (McLeod, Swygert, & Thissen, 2001).

Although related, the bifactor and HO-IRT models have some differences. Unlike the bifactor model, the general factor or the overall ability in the proposed HO-IRT model has no direct effect on examinees’ performance; the performance variability is accounted for solely by the specific domain factors or abilities. Moreover, in contrast to the bifactor model, the HO-IRT model can be formulated to have more than one general factor and to allow the higher-order factor and group level factors to relate in a nonlinear fashion.

Simulation Study

Design

A simulation study was conducted to evaluate the feasibility of the proposed model and how the ability estimates obtained from this model are affected by different factors. In the simulation study, four factors and their varied conditions were considered: (a) number of subtests or domains, $D$ = 2 or 5; (b) number of items in each domain, $J$ = 10, 20, or 30; (c) correlation between the domains, $r$ = 0.0, 0.4, 0.7, or 0.9; and (d) number of examinees, $N$ = 1,000, 2,000, or 4,000. Fully crossing the different levels of these four factors yielded 72 conditions. Moreover, the total test lengths considered in this study ranged from 20 to 150 items.
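The fully crossed design can be enumerated in a few lines, which also confirms the number of conditions and the range of total test lengths. The variable names are illustrative.

```python
from itertools import product

domains = (2, 5)                      # D
items_per_domain = (10, 20, 30)       # J
correlations = (0.0, 0.4, 0.7, 0.9)   # r
sample_sizes = (1000, 2000, 4000)     # N

# Every combination of the four factors is one simulation condition.
conditions = list(product(domains, items_per_domain, correlations, sample_sizes))

# Total test length is the number of domains times the items per domain.
test_lengths = sorted({d * j for d, j, _, _ in conditions})
```

The crossing gives 2 × 3 × 4 × 3 = 72 conditions, with total test lengths running from 2 × 10 = 20 to 5 × 30 = 150 items.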

Items in the simulation study were obtained from a pool of 550 nationally standardized mathematics items. Given in Table 1 are the 10 selected 3PL items that have mean information functions closest to that of all the items. These 10 items were then replicated to produce longer tests of 20 or more items. The same items were used for all the domains.

For a specific sample size, $N$ overall abilities were drawn from $N(0, 1)$. The correlation between the abilities was converted into the regression coefficient, $\lambda = \sqrt{r}$, and a symmetric structure among domain abilities was assumed. Given $\lambda$ and the overall ability, the domain abilities of examinee $i$ were generated from $N(\lambda\theta_i, 1 - \lambda^2)$. The domain abilities were used in simulating the item responses. Although only a single sample was drawn for each condition, investigation of the statistical properties of the overall and domain ability estimates to answer the primary questions of this study was possible due to the presence of multiple examinees within each draw. In addition, even with single replicates, the results below indicate that the regression parameters can be reasonably estimated, and clear and discernible patterns in the ability estimates can be observed.
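The data-generation steps for one condition can be sketched as below. This is an illustrative reconstruction, not the authors' Ox code: the function name `simulate_condition` and the seed are assumptions, the item parameters are those listed in Table 1, and the number of items per domain is assumed to be a multiple of 10.

```python
import numpy as np

# (a, b, g) for the 10 items listed in Table 1
ITEMS = np.array([
    [0.90,  0.95, 0.18], [0.50,  0.13, 0.18], [1.22,  0.21, 0.27],
    [1.13, -0.75, 0.06], [0.69,  0.34, 0.13], [0.79, -1.60, 0.20],
    [1.24,  1.17, 0.12], [0.52, -1.78, 0.00], [1.01,  0.95, 0.21],
    [1.12, -0.05, 0.25],
])

def simulate_condition(n_examinees, n_domains, items_per_domain, r, rng):
    """Generate abilities and 0/1 responses for one simulation condition."""
    lam = np.sqrt(r)                                        # common regression coefficient
    theta = rng.standard_normal(n_examinees)                # overall abilities
    errors = rng.standard_normal((n_examinees, n_domains)) * np.sqrt(1.0 - lam**2)
    theta_d = lam * theta[:, None] + errors                 # domain abilities
    # Replicate the 10 items to the required length; same items in every domain.
    a, b, g = np.tile(ITEMS, (items_per_domain // 10, 1)).T
    p = g + (1.0 - g) / (1.0 + np.exp(-1.7 * a * (theta_d[:, :, None] - b)))
    responses = (rng.random(p.shape) < p).astype(int)       # shape (N, D, J)
    return theta, theta_d, responses

rng = np.random.default_rng(2009)
theta, theta_d, responses = simulate_condition(20_000, 2, 20, 0.7, rng)
```

The returned response array has one slice per domain, so each domain can be scored separately or the slices can be concatenated for a single unidimensional analysis.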

For each condition, four chains started at random were run. All the chains had the same number of burn-in iterations (i.e., 5,000), but had different chain lengths. The total number of iterations ranged from 15,000 to 65,000 to ensure that the structural parameters (i.e., the regression parameters) had converged. The convergence criterion throughout this article was based on the multivariate potential scale reduction factor (MPSRF; Brooks & Gelman, 1998); the chain lengths were determined to ensure that all MPSRFs were less than 1.20. Initial estimates of the regression parameters, and overall and domain abilities were based on the draws after the burn-in of each chain. The final estimates were obtained by averaging the estimates across the four chains. All the estimation code in this article was implemented in Ox (Doornik, 2003) and can be made available to readers by the authors upon request.
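As a rough illustration of the convergence check, the sketch below computes the univariate potential scale reduction factor for a single parameter. Note that this is a simpler stand-in for the multivariate MPSRF used in the article, not the same statistic; the function name `psrf` is an assumption.

```python
import numpy as np

def psrf(chains):
    """Univariate potential scale reduction factor for one parameter.

    chains : (m, n) array holding m chains of n post-burn-in draws each.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()           # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)        # B: scaled variance of the chain means
    pooled = (n - 1) / n * within + between / n          # pooled estimate of the posterior variance
    return np.sqrt(pooled / within)
```

Values near 1 indicate that the chains agree; applying a cutoff such as 1.20 to each regression parameter would mimic, only approximately, the criterion described above.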

Table 1

Item Parameters in the Simulation Study

Item      a       b       g
   1   0.90    0.95    0.18
   2   0.50    0.13    0.18
   3   1.22    0.21    0.27
   4   1.13   −0.75    0.06
   5   0.69    0.34    0.13
   6   0.79   −1.60    0.20
   7   1.24    1.17    0.12
   8   0.52   −1.78    0.00
   9   1.01    0.95    0.21
  10   1.12   −0.05    0.25

To provide a basis of comparison for the HO-IRT estimates, the CU-IRT estimate of the overall ability and its precision were also computed based on the posterior mean and variance. A standard normal prior was used in obtaining the CU-IRT estimate, and the integration over the posterior distribution was approximated using 141 quadrature nodes from −3.50 to 3.50. The CU- and HO-IRT overall ability estimates were compared with the true overall ability and with each other. Specifically, the correlation between the true and estimated abilities, and the posterior variance and the mean squared error (MSE) of the estimates were computed. The same statistics were computed for the HO-IRT estimates of the domain abilities. In addition, because the condition $r = 0.0$ is equivalent to the CU-IRT estimation of the domain abilities, the efficiency of the proposed method relative to the CU-IRT approach was computed by taking the ratio between the MSEs of the estimates when $r = 0.0$ and $r \neq 0.0$. Finally, the quality of the regression parameter estimates was also investigated by comparing the correlations derived from the estimates of $\lambda$ to the true correlations between the domains.
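The CU-IRT scoring step, a posterior mean and variance under a standard normal prior, can be sketched with a simple quadrature rule. The sketch assumes the 141 nodes are equally spaced (a spacing of 0.05), which the article does not state explicitly; the function name `eap_score` is hypothetical, and the example uses the first five items of Table 1.

```python
import numpy as np

NODES = np.linspace(-3.5, 3.5, 141)        # assumed equally spaced quadrature nodes
PRIOR = np.exp(-0.5 * NODES**2)            # standard normal prior, unnormalized

def eap_score(responses, a, b, g):
    """Posterior mean and variance of ability for one examinee's 0/1 responses."""
    responses = np.asarray(responses)
    a, b, g = (np.asarray(x, dtype=float) for x in (a, b, g))
    # 3PL probabilities at every node for every item: shape (nodes, items)
    p = g + (1.0 - g) / (1.0 + np.exp(-1.7 * a * (NODES[:, None] - b)))
    likelihood = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    weights = likelihood * PRIOR
    weights /= weights.sum()               # normalized posterior weights on the grid
    mean = np.sum(weights * NODES)
    var = np.sum(weights * (NODES - mean) ** 2)
    return mean, var

a = np.array([0.90, 0.50, 1.22, 1.13, 0.69])
b = np.array([0.95, 0.13, 0.21, -0.75, 0.34])
g = np.array([0.18, 0.18, 0.27, 0.06, 0.13])
mean_all_right, var_all_right = eap_score([1, 1, 1, 1, 1], a, b, g)
mean_all_wrong, var_all_wrong = eap_score([0, 0, 0, 0, 0], a, b, g)
```

Applying the same function to the items of a single domain gives the CU-IRT domain score, while applying it to all items at once gives the CU-IRT overall score.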

Results

Overall Ability Estimates

Based on the correlations between the true (i.e., generating) and estimated overall abilities, the CU- and HO-IRT models provided very similar estimates of the overall ability. Table 2 shows that nearly identical correlations were found between the true ability and the CU- and HO-IRT estimates of the overall ability, particularly when $r > 0$. In addition, when $r > 0$, better estimates were obtained with longer tests, higher dimensions, and higher correlations between the domains for both methods. These figures also suggest that, in improving the overall ability estimates, the number of dimensions had greater impact than the number of items: for any fixed $r > 0$, better estimates can be obtained from a shorter 50-item test with five domains than from a longer 60-item test with two domains. However, as the size of the correlation increases, the magnitude of the difference decreases. Moreover, as was expected, the sample size had no impact on the quality of the overall ability estimates. Finally, it should be noted that the true and estimated abilities were nearly uncorrelated when $r = 0$ because a common underlying ability cannot be estimated from unrelated sets of responses or uncorrelated domain abilities.

In comparison, the posterior variances in Table 3 indicate that the CU-IRT estimates of the overall ability were more precise than the HO-IRT estimates for all values of $r$. For the HO-IRT, the estimates also showed improvement in precision with longer tests and more domains when $r > 0$. These effects were true for the CU-IRT estimates for all values of $r$. The different factors had similar impacts on the precision of the HO-IRT estimates as on the correlation between the true and HO-IRT estimated overall abilities. In contrast, the precision of the CU-IRT estimates was affected differently: only the overall test length (i.e., test length multiplied by the number of domains) had an impact on the precision of the ability estimates in that, regardless of the number of domains involved, tests with more items provided higher precision. It is not clear, however, why the correlations between the domains had no impact on the precision of the CU-IRT overall ability estimates. That is, for different values of $r$, the posterior variance remained relatively unchanged.

Although CU- and HO-IRT estimates correlated to the same degree with the true overall ability, and the former were more precise than the latter, Table 4 shows that the MSEs of the HO-IRT estimates were equal to or smaller than those of the CU-IRT estimates for all except three of the conditions considered. The difference was less evident with higher $r$, and the two estimates were largely comparable when the abilities had at least moderately high correlation (i.e., $r \ge 0.7$). Finally, the different factors affected the MSEs of both methods the same way they did the correlations between the true and estimated abilities. Specifically, lower MSEs were obtained for the overall ability estimates when longer tests and more domains in conjunction with higher correlations between domain abilities were involved.

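The large difference in the MSE of the HO- and CU-IRT overall ability estimates when $r = 0$ can be understood by considering their respective posterior distributions. As given earlier, the posterior distribution of the overall ability of examinee $i$ using HO-IRT is

$$\theta_i \mid \mathbf{X}_i \sim N\left(c \sum_{d=1}^{D} \frac{\lambda^{(d)}\theta_i^{(d)}}{1 - (\lambda^{(d)})^2},\; c\right), \quad \text{where } c^{-1} = 1 + \sum_{d=1}^{D} \frac{(\lambda^{(d)})^2}{1 - (\lambda^{(d)})^2}.$$

However, when $r = 0$, $\lambda^{(d)} = 0$. Consequently, $\theta_i \mid \mathbf{X}_i \sim N(0, 1)$. Thus, $\theta_i$ is estimated by $\tilde{\theta}_i^{(HO)} = E(\theta_i \mid \mathbf{X}_i) = 0$, and has a posterior variance $\mathrm{Var}(\tilde{\theta}_i^{(HO)} \mid \mathbf{X}_i) = 1$, which is a constant. The MSE of $\tilde{\theta}_i^{(HO)}$ is

$$\mathrm{MSE}(\tilde{\theta}_i^{(HO)}) = E\left[(\tilde{\theta}_i^{(HO)} - \theta_i)^2\right] = E(\theta_i^2) = \mathrm{Var}(\theta_i) = 1.$$

Thus, the posterior variance and the MSE of $\tilde{\theta}_i^{(HO)}$ are expected to be identical and equal to 1.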

Table 2

Correlation Between True Overall Ability and CU- and HO-IRT Estimated Overall Abilities

                    r = 0.0         r = 0.4         r = 0.7         r = 0.9
    N   D   J      CU     HO       CU     HO       CU     HO       CU     HO
1,000   2  10    0.04   0.03     0.64   0.64     0.77   0.77     0.87   0.87
1,000   2  20   −0.01   0.01     0.69   0.69     0.85   0.85     0.92   0.92
1,000   2  30   −0.03   0.02     0.72   0.72     0.87   0.87     0.93   0.93
1,000   5  10    0.06   0.06     0.80   0.80     0.91   0.91     0.94   0.94
1,000   5  20   −0.01  −0.02     0.84   0.84     0.92   0.92     0.96   0.96
1,000   5  30   −0.03   0.05     0.85   0.86     0.94   0.94     0.97   0.97
2,000   2  10   −0.01   0.00     0.63   0.63     0.79   0.79     0.87   0.87
2,000   2  20   −0.01  −0.03     0.71   0.71     0.85   0.85     0.91   0.91
2,000   2  30    0.03  −0.03     0.72   0.72     0.86   0.86     0.93   0.93
2,000   5  10    0.03  −0.02     0.79   0.80     0.90   0.90     0.94   0.94
2,000   5  20    0.01   0.01     0.84   0.84     0.93   0.93     0.96   0.96
2,000   5  30   −0.03  −0.04     0.85   0.85     0.94   0.94     0.97   0.97
4,000   2  10    0.01   0.00     0.65   0.65     0.81   0.81     0.87   0.87
4,000   2  20    0.01  −0.01     0.70   0.70     0.84   0.84     0.92   0.92
4,000   2  30    0.02  −0.02     0.70   0.70     0.87   0.87     0.93   0.93
4,000   5  10    0.00   0.02     0.80   0.81     0.91   0.91     0.94   0.94
4,000   5  20    0.04   0.01     0.84   0.84     0.93   0.93     0.96   0.96
4,000   5  30   −0.01  −0.03     0.85   0.85     0.94   0.94     0.97   0.97

Note: N = number of examinees; D = number of domains; J = number of items in each domain; CU = conventional unidimensional; HO = higher-order; IRT = item response theory.


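In comparison, the posterior distribution of the overall ability of examinee $i$ using CU-IRT is $p(\theta_i \mid \mathbf{X}_i) \propto p(\mathbf{X}_i)\,p(\theta_i)$. This can be obtained by noting that

$$p(\theta_i \mid \mathbf{X}_i) \propto p(\mathbf{X}_i \mid \theta_i)\,p(\theta_i) = \int p(\mathbf{X}_i, \boldsymbol{\theta}_i^{(*)} \mid \theta_i)\, d\boldsymbol{\theta}_i^{(*)}\; p(\theta_i),$$

where $\boldsymbol{\theta}_i^{(*)}$ is the vector of domain abilities. Moreover, the quantity inside the integral can be written as

$$p(\mathbf{X}_i, \boldsymbol{\theta}_i^{(*)} \mid \theta_i) = p(\mathbf{X}_i \mid \boldsymbol{\theta}_i^{(*)}, \theta_i)\,p(\boldsymbol{\theta}_i^{(*)} \mid \theta_i) = p(\mathbf{X}_i \mid \boldsymbol{\theta}_i^{(*)})\,p(\boldsymbol{\theta}_i^{(*)}).$$

The last expression, which is the joint distribution of $\mathbf{X}_i$ and $\boldsymbol{\theta}_i^{(*)}$, is due to the conditional independence of the responses given the domain abilities, and the independence between the overall and domain abilities when $r = 0$. By integrating out $\boldsymbol{\theta}_i^{(*)}$, we get the posterior distribution of $\theta_i$ under CU-IRT.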

Unlike its HO-IRT counterpart, the posterior distribution of the CU-IRT overall ability estimate cannot be expressed in closed form due to the nonlinear relationship between $\boldsymbol{\theta}_i^{(*)}$ and $\mathbf{X}_i$. However, it is instructive to examine the behavior of this posterior distribution when the responses are assumed to be linear in the domain abilities. For illustration purposes, we can assume that $X_{ij}^{(d)} = \theta_i^{(d)} + e_{ij}^{(d)}$, where $e_{ij}^{(d)}$ is independently distributed as $N(0, s_e^2)$. It can be shown that the marginal distribution of $\mathbf{X}_i$ is $N(\mathbf{0}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ is a block-diagonal matrix that consists of $\boldsymbol{\Sigma}^{(d)}_{J \times J} = \mathbf{1}\mathbf{1}' + s_e^2\mathbf{I}$, for $d = 1, \ldots, D$, $\mathbf{1}$ is a $J \times 1$ vector of ones, and $\mathbf{I}$ is the $J \times J$ identity matrix. Furthermore, the mean of examinee $i$'s $JD$ responses, $\bar{X}_i$, is distributed as $\bar{X}_i \sim N(0, (J + s_e^2)/JD)$.

Table 3

Posterior Variance of the CU- and HO-IRT Overall Ability Estimates

                    r = 0.0         r = 0.4         r = 0.7         r = 0.9
    N   D   J      CU     HO       CU     HO       CU     HO       CU     HO
1,000   2  10    0.19   1.00     0.19   0.55     0.19   0.36     0.19   0.27
1,000   2  20    0.11   1.00     0.11   0.53     0.11   0.27     0.11   0.16
1,000   2  30    0.07   1.00     0.07   0.53     0.08   0.27     0.08   0.12
1,000   5  10    0.08   0.98     0.08   0.36     0.09   0.18     0.09   0.12
1,000   5  20    0.04   0.99     0.04   0.29     0.04   0.14     0.05   0.07
1,000   5  30    0.03   0.99     0.03   0.26     0.03   0.12     0.03   0.05
2,000   2  10    0.19   1.00     0.19   0.56     0.19   0.36     0.19   0.23
2,000   2  20    0.10   1.00     0.11   0.51     0.11   0.26     0.11   0.16
2,000   2  30    0.07   1.00     0.07   0.49     0.07   0.25     0.08   0.13
2,000   5  10    0.08   0.96     0.08   0.35     0.09   0.19     0.09   0.11
2,000   5  20    0.04   0.99     0.04   0.30     0.05   0.13     0.05   0.07
2,000   5  30    0.03   0.99     0.03   0.27     0.03   0.12     0.03   0.06
4,000   2  10    0.19   1.00     0.19   0.59     0.19   0.37     0.19   0.25
4,000   2  20    0.10   1.00     0.11   0.51     0.11   0.28     0.11   0.16
4,000   2  30    0.07   1.00     0.07   0.49     0.08   0.24     0.08   0.13
4,000   5  10    0.08   0.99     0.08   0.35     0.09   0.18     0.09   0.12
4,000   5  20    0.04   1.00     0.04   0.30     0.05   0.14     0.05   0.07
4,000   5  30    0.03   0.99     0.03   0.28     0.03   0.12     0.03   0.06

Note: N = number of examinees; D = number of domains; J = number of items in each domain; CU = conventional unidimensional; HO = higher-order; IRT = item response theory.


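It follows that

$$\theta_i \mid \bar{X}_i \sim N\left(\frac{JD}{JD + J + s_e^2}\,\bar{X}_i,\; \frac{J + s_e^2}{JD + J + s_e^2}\right).$$

Thus, under the CU-IRT framework, the overall ability of examinee $i$ is estimated by $\tilde{\theta}_i^{(CU)} = E(\theta_i \mid \bar{X}_i) = JD\,\bar{X}_i/(JD + J + s_e^2)$, and has a less than unity posterior variance that decreases primarily as $D$, and to some extent $J$, increases. In addition,

$$\mathrm{MSE}(\tilde{\theta}_i^{(CU)}) = E\left[(\tilde{\theta}_i^{(CU)} - \theta_i)^2\right] = \mathrm{Var}(\theta_i) + \mathrm{Var}(\tilde{\theta}_i^{(CU)})$$

because $\theta_i$ and $\tilde{\theta}_i^{(CU)}$ are uncorrelated, and their expected values are zero. Again, $\mathrm{Var}(\theta_i) = 1$, whereas

$$\mathrm{Var}(\tilde{\theta}_i^{(CU)}) = \mathrm{Var}\left(\frac{JD}{JD + J + s_e^2}\,\bar{X}_i\right) = \frac{JD(J + s_e^2)}{(JD + J + s_e^2)^2}.$$

Table 4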

MSE of the CU- and HO-IRT Overall Ability Estimates

                    r = 0.0         r = 0.4         r = 0.7         r = 0.9
    N   D   J      CU     HO       CU     HO       CU     HO       CU     HO
1,000   2  10    1.46   1.02     0.63   0.61     0.39   0.39     0.24   0.24
1,000   2  20    1.51   1.00     0.53   0.53     0.27   0.27     0.16   0.16
1,000   2  30    1.49   1.00     0.46   0.53     0.25   0.27     0.13   0.12
1,000   5  10    1.25   1.09     0.35   0.34     0.19   0.19     0.11   0.11
1,000   5  20    1.15   0.96     0.33   0.30     0.14   0.13     0.07   0.07
1,000   5  30    1.20   1.00     0.30   0.27     0.12   0.11     0.06   0.06
2,000   2  10    1.43   1.00     0.63   0.56     0.36   0.36     0.24   0.23
2,000   2  20    1.41   1.00     0.51   0.51     0.27   0.26     0.17   0.16
2,000   2  30    1.39   1.00     0.49   0.49     0.26   0.25     0.13   0.13
2,000   5  10    1.20   1.04     0.38   0.36     0.20   0.19     0.12   0.12
2,000   5  20    1.23   1.05     0.31   0.28     0.14   0.13     0.07   0.07
2,000   5  30    1.22   1.01     0.29   0.27     0.14   0.12     0.06   0.06
4,000   2  10    1.44   1.00     0.59   0.59     0.34   0.37     0.25   0.25
4,000   2  20    1.43   1.00     0.52   0.51     0.28   0.28     0.16   0.16
4,000   2  30    1.44   1.00     0.50   0.49     0.25   0.24     0.13   0.13
4,000   5  10    1.23   1.02     0.36   0.34     0.19   0.18     0.11   0.11
4,000   5  20    1.16   1.00     0.32   0.29     0.14   0.13     0.08   0.07
4,000   5  30    1.17   0.99     0.31   0.28     0.13   0.12     0.06   0.06

Note: N = number of examinees; D = number of domains; J = number of items in each domain; MSE = mean squared error; CU = conventional unidimensional; HO = higher-order; IRT = item response theory.


Thus,

$$\mathrm{MSE}(\tilde{\theta}_i^{(CU)}) = 1 + \frac{JD(J + s_e^2)}{(JD + J + s_e^2)^2},$$

which is always larger than 1 and decreases primarily when $D$ increases. Although the derivation of the posterior distribution of $\tilde{\theta}_i^{(CU)}$ is based on linear responses, it clearly demonstrates how the CU-IRT estimate of the overall ability can have a smaller posterior variance but much larger MSE.
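The linear-response argument can be checked by simulation. The sketch below assumes $r = 0$ (so the overall ability is unrelated to the domain abilities), treats $\bar{X}_i$ as the mean of all $JD$ responses, and takes the analytic MSE as $1 + \mathrm{Var}(\tilde{\theta}_i^{(CU)})$ with $\mathrm{Var}(\tilde{\theta}_i^{(CU)}) = JD(J + s_e^2)/(JD + J + s_e^2)^2$; the function name and the chosen values of $J$, $D$, and $s_e^2$ are illustrative.

```python
import numpy as np

def linear_cu_mse(J, D, s2, n_examinees, rng):
    """Simulated and analytic MSE of the CU estimate under linear responses, r = 0."""
    theta = rng.standard_normal(n_examinees)                 # overall ability
    theta_d = rng.standard_normal((n_examinees, D))          # independent domain abilities
    noise = rng.standard_normal((n_examinees, D, J)) * np.sqrt(s2)
    x_bar = (theta_d[:, :, None] + noise).mean(axis=(1, 2))  # mean of the J*D responses
    shrink = J * D / (J * D + J + s2)
    estimate = shrink * x_bar                                # CU posterior mean
    simulated = np.mean((estimate - theta) ** 2)
    analytic = 1.0 + J * D * (J + s2) / (J * D + J + s2) ** 2
    post_var = (J + s2) / (J * D + J + s2)                   # CU posterior variance
    return simulated, analytic, post_var

rng = np.random.default_rng(7)
simulated, analytic, post_var = linear_cu_mse(J=10, D=2, s2=1.0, n_examinees=100_000, rng=rng)
```

For these illustrative values, the posterior variance is well below 1 while both the simulated and analytic MSE exceed 1, which mirrors the pattern seen for $r = 0$ in Tables 3 and 4.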

Domain Ability Estimates

As noted above, CU-IRT estimation of the domain abilities is a special case of the HO-IRT approach. Although the results are not presented, it was verified that the domain ability estimates using the CU-IRT method were identical to the HO-IRT estimates when $r = 0$. Therefore, only results for the latter, which represent the average results across the different domains, are presented in this article.

The correlation between the true and HO-IRT estimated domain abilities in Table 5 ranged from 0.82 (10-item tests measuring uncorrelated abilities) to 0.96 (five 30-item tests measuring highly correlated abilities). The degree of correlation improved with a greater number of domains when r ≥ 0.4, particularly for r = 0.7 and 0.9. The correlations between the true and estimated domain abilities were higher for longer tests, and the impact of a longer test was most evident when moving from J = 10 to J = 20 and r ≤ 0.4. The improvements in the quality of the domain ability estimates in switching from CU-IRT estimation (column r = 0.0) to HO-IRT estimation (columns r > 0.0) are given in each row of Table 5. The highest gains that can be obtained from using the HO-IRT method involved five 10-item tests and r ≥ 0.7. Under these conditions, the correlation improved from about 0.82 using CU-IRT to about 0.87 through 0.92 using HO-IRT. These results indicate that, although HO-IRT estimation is expected to provide better estimates than CU-IRT estimation, the degree of improvement can be negligible when the correlations between the domain abilities are low or when the domain abilities have already been well estimated using long tests. Finally, based on the correlation values, the sample size had no noticeable impact on the quality of the ability estimates.

Table 5

Correlation Between True and HO-IRT Estimated Domain Abilities

| Number of Examinees | Number of Domains | J | r = 0.0 | r = 0.4 | r = 0.7 | r = 0.9 |
|---|---|---|---|---|---|---|
| N = 1,000 | D = 2 | 10 | 0.82 | 0.83 | 0.84 | 0.87 |
| | | 20 | 0.89 | 0.90 | 0.91 | 0.93 |
| | | 30 | 0.93 | 0.93 | 0.93 | 0.94 |
| | D = 5 | 10 | 0.83 | 0.84 | 0.88 | 0.92 |
| | | 20 | 0.90 | 0.90 | 0.91 | 0.95 |
| | | 30 | 0.93 | 0.93 | 0.94 | 0.96 |
| N = 2,000 | D = 2 | 10 | 0.82 | 0.82 | 0.85 | 0.88 |
| | | 20 | 0.90 | 0.90 | 0.91 | 0.92 |
| | | 30 | 0.93 | 0.93 | 0.93 | 0.94 |
| | D = 5 | 10 | 0.82 | 0.84 | 0.87 | 0.92 |
| | | 20 | 0.90 | 0.90 | 0.92 | 0.95 |
| | | 30 | 0.92 | 0.93 | 0.94 | 0.96 |
| N = 4,000 | D = 2 | 10 | 0.82 | 0.83 | 0.85 | 0.87 |
| | | 20 | 0.89 | 0.90 | 0.91 | 0.93 |
| | | 30 | 0.93 | 0.93 | 0.93 | 0.95 |
| | D = 5 | 10 | 0.82 | 0.84 | 0.88 | 0.92 |
| | | 20 | 0.90 | 0.91 | 0.92 | 0.95 |
| | | 30 | 0.93 | 0.93 | 0.94 | 0.96 |

Note: HO-IRT = higher-order item response theory.
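The row-wise comparison described for Table 5 (each r > 0 column against the r = 0.0 column, which corresponds to CU-IRT estimation) can be made explicit with a short sketch. The values below are transcribed from the N = 1,000 rows of Table 5; the variable and function names are hypothetical, and the gains are differences of rounded correlations, so they are only approximate.

```python
# Table 5, N = 1,000: correlation between true and estimated domain abilities.
# Keys are (D, J); values correspond to r = 0.0, 0.4, 0.7, 0.9.
TABLE5_N1000 = {
    (2, 10): [0.82, 0.83, 0.84, 0.87],
    (2, 20): [0.89, 0.90, 0.91, 0.93],
    (2, 30): [0.93, 0.93, 0.93, 0.94],
    (5, 10): [0.83, 0.84, 0.88, 0.92],
    (5, 20): [0.90, 0.90, 0.91, 0.95],
    (5, 30): [0.93, 0.93, 0.94, 0.96],
}


def gains_over_cu(row):
    """Gain of each r > 0 column over the r = 0.0 (CU-IRT) column."""
    baseline = row[0]
    return [round(value - baseline, 2) for value in row[1:]]


gains = {condition: gains_over_cu(row) for condition, row in TABLE5_N1000.items()}
# Condition with the largest gain at r = 0.9.
best = max(gains, key=lambda condition: gains[condition][-1])

for (d, j), gain in gains.items():
    print(f"D = {d}, J = {j}: gains at r = 0.4, 0.7, 0.9 -> {gain}")
print("Largest gain at r = 0.9:", best)
```

In this transcription the largest gain occurs for five 10-item tests, and the gains for two 30-item tests are close to zero, in line with the pattern described in the text.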

As can be seen from the posterior variances in Table 6, the different factors had a similar pattern of impact on the precision of the domain ability estimates. As with the correlation between the true and estimated abilities, the precision of the ability estimates improved when scoring responses from more domains involving abilities of at least moderately high correlations, and the gains from longer tests were more notable when moving from J = 10 to J = 20 than from J = 20 to J = 30. These results show that CU-IRT estimated abilities from a 10-item test had a posterior variance of 0.32. In comparison, when HO-IRT was employed in conjunction with five short tests measuring highly correlated abilities, the posterior variance dropped to only about 0.15, indicating that under the optimal condition investigated in this study (i.e., D = 5, J = 10, and r = 0.9), HO-IRT domain estimates can be, on average, at least twice as precise as CU-IRT estimates.

Table 6

Posterior Variance of the HO-IRT Domain Ability Estimates

| Number of Examinees | Number of Domains | J | r = 0.0 | r = 0.4 | r = 0.7 | r = 0.9 |
|---|---|---|---|---|---|---|
| N = 1,000 | D = 2 | 10 | 0.32 | 0.31 | 0.28 | 0.25 |
| | | 20 | 0.20 | 0.19 | 0.17 | 0.14 |
| | | 30 | 0.14 | 0.14 | 0.13 | 0.10 |
| | D = 5 | 10 | 0.32 | 0.29 | 0.23 | 0.15 |
| | | 20 | 0.19 | 0.18 | 0.15 | 0.10 |
| | | 30 | 0.14 | 0.13 | 0.12 | 0.08 |
| N = 2,000 | D = 2 | 10 | 0.32 | 0.31 | 0.28 | 0.22 |
| | | 20 | 0.20 | 0.19 | 0.17 | 0.14 |
| | | 30 | 0.14 | 0.14 | 0.13 | 0.11 |
| | D = 5 | 10 | 0.32 | 0.29 | 0.24 | 0.15 |
| | | 20 | 0.20 | 0.18 | 0.15 | 0.11 |
| | | 30 | 0.14 | 0.13 | 0.12 | 0.08 |
| N = 4,000 | D = 2 | 10 | 0.32 | 0.31 | 0.28 | 0.24 |
| | | 20 | 0.19 | 0.19 | 0.17 | 0.14 |
| | | 30 | 0.14 | 0.14 | 0.13 | 0.11 |
| | D = 5 | 10 | 0.32 | 0.29 | 0.23 | 0.16 |
| | | 20 | 0.19 | 0.18 | 0.15 | 0.10 |
| | | 30 | 0.14 | 0.13 | 0.11 | 0.08 |

Note: HO-IRT = higher-order item response theory.

The MSE of the HO-IRT domain ability estimates and the efficiency of the proposed method relative to CU-IRT estimation are given in Table 7. The different factors affected the MSE in the same way that they affected the correlation between true and estimated abilities and the posterior variances. That is, smaller MSEs can be expected when longer tests are used, or when more domains measuring abilities with at least moderately high correlations are considered. As with the previous statistics, the improvements due to test length tapered off when longer tests were involved.

To a large extent, these factors impact the relative efficiency in the same way. Notably large relative efficiency (i.e., at least 1.20) can only be observed when D = 5, J = 10, and r ≥ 0.7, and as expected, is highest when r = 0.9. Under this optimal condition, the efficiency of the HO-IRT method relative to the CU-IRT method was at least 2.02 across the different sample sizes, indicating that the quality of the domain ability estimates using the HO-IRT approach was equivalent to the quality of CU-IRT estimates obtained from tests consisting of at least 20 items. It should be noted that although longer tests have lower relative efficiency, the increase in terms of the number of additional items was actually higher for these tests. For example, the relative efficiency when D = 5, J = 30, and r = 0.9 was at least 1.70, which is equivalent to an additional 21 items. In contrast, a relative efficiency of 2.02 under the optimal condition is only equivalent to 10 additional items.
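The conversion from relative efficiency to additional items can be written out as a worked equation. This is a sketch that assumes, as the comparisons in the text appear to, that the MSE of a CU-IRT domain estimate is roughly inversely proportional to test length; the symbols RE, J_eq, and ΔJ are introduced here only for illustration.

```latex
\begin{align*}
J_{\mathrm{eq}} &= \mathrm{RE} \times J, &
\Delta J &= J_{\mathrm{eq}} - J = (\mathrm{RE} - 1)\,J \\
D = 5,\ J = 30,\ r = 0.9:\quad & \mathrm{RE} = 1.70 &
\Delta J &= 0.70 \times 30 = 21 \\
D = 5,\ J = 10,\ r = 0.9:\quad & \mathrm{RE} = 2.02 &
\Delta J &= 1.02 \times 10 = 10.2 \approx 10
\end{align*}
```

The first case shows why a lower relative efficiency on a longer test can still correspond to more additional items than a higher relative efficiency on a shorter test.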

Table 7

MSE of HO-IRT Domain Ability Estimates: Relative Efficiency in Parentheses

| Number of Examinees | Number of Domains | J | r = 0.0 | r = 0.4 | r = 0.7 | r = 0.9 |
|---|---|---|---|---|---|---|
| N = 1,000 | D = 2 | 10 | 0.34 (—) | 0.31 (1.13) | 0.28 (1.21) | 0.24 (1.46) |
| | | 20 | 0.21 (—) | 0.19 (1.12) | 0.17 (1.25) | 0.14 (1.54) |
| | | 30 | 0.15 (—) | 0.13 (1.13) | 0.13 (1.15) | 0.11 (1.43) |
| | D = 5 | 10 | 0.32 (—) | 0.30 (1.06) | 0.23 (1.36) | 0.16 (2.02) |
| | | 20 | 0.19 (—) | 0.19 (1.00) | 0.15 (1.23) | 0.10 (1.84) |
| | | 30 | 0.14 (—) | 0.14 (1.02) | 0.11 (1.28) | 0.08 (1.74) |
| N = 2,000 | D = 2 | 10 | 0.33 (—) | 0.32 (1.06) | 0.28 (1.22) | 0.23 (1.43) |
| | | 20 | 0.19 (—) | 0.19 (1.00) | 0.18 (1.07) | 0.15 (1.31) |
| | | 30 | 0.15 (—) | 0.14 (1.05) | 0.13 (1.17) | 0.11 (1.36) |
| | D = 5 | 10 | 0.33 (—) | 0.29 (1.12) | 0.24 (1.37) | 0.16 (2.03) |
| | | 20 | 0.19 (—) | 0.18 (1.07) | 0.15 (1.27) | 0.10 (1.88) |
| | | 30 | 0.15 (—) | 0.14 (1.07) | 0.12 (1.25) | 0.08 (1.83) |
| N = 4,000 | D = 2 | 10 | 0.33 (—) | 0.32 (1.02) | 0.27 (1.20) | 0.24 (1.37) |
| | | 20 | 0.20 (—) | 0.19 (1.08) | 0.17 (1.18) | 0.14 (1.41) |
| | | 30 | 0.14 (—) | 0.14 (1.04) | 0.13 (1.08) | 0.11 (1.31) |
| | D = 5 | 10 | 0.32 (—) | 0.29 (1.09) | 0.24 (1.36) | 0.15 (2.14) |
| | | 20 | 0.19 (—) | 0.18 (1.06) | 0.15 (1.25) | 0.10 (1.81) |
| | | 30 | 0.14 (—) | 0.13 (1.04) | 0.11 (1.21) | 0.08 (1.70) |

Note: MSE = mean squared error; HO-IRT = higher-order item response theory.


Regression Parameter Estimates

The complete specification of the HO-IRT model includes the regression parameters, and these parameters can be better understood when converted into correlations between the domain abilities. Estimates of these correlations are given in Table 8. The table shows that the correlations between the abilities can be well estimated using the HO-IRT model when r = 0.0. As a whole, better estimates can be obtained when more domains were tested or the sample size was larger. However, increasing the number of domains resulted in higher improvements when N = 1,000. This is primarily due to inaccurate estimates when two domains and small sample sizes were involved. Finally, the impact of test length on the accuracy of the correlation estimates was not as pronounced as expected.
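The conversion from regression parameters to correlations can be sketched as follows. The block assumes that the higher-order structure takes a standardized linear form, with each domain ability regressed on the overall ability and unit-variance abilities; this form and the notation are assumptions made here for illustration and should be checked against the model specification given earlier in the article.

```latex
\begin{align*}
\theta_i^{(d)} &= \lambda^{(d)}\,\theta_i + \varepsilon_i^{(d)}, \qquad
\theta_i \sim N(0, 1), \quad
\varepsilon_i^{(d)} \sim N\!\left(0,\; 1 - \bigl(\lambda^{(d)}\bigr)^{2}\right) \\
\operatorname{Corr}\!\left(\theta_i^{(d)}, \theta_i^{(d')}\right) &= \lambda^{(d)}\,\lambda^{(d')}, \qquad d \neq d' \\
\text{equal loadings:}\quad r &= \lambda^{2}, \qquad
\text{e.g. } r = 0.9 \iff \lambda = \sqrt{0.9} \approx 0.949
\end{align*}
```

Under this form, a pair of regression parameters maps directly to one entry of the correlation matrix among the domain abilities.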

A Real Data Example

Data

Table 8

Estimate of Correlation Between Domain Abilities

| Number of Examinees | Number of Domains | J | r = 0.0 | r = 0.4 | r = 0.7 | r = 0.9 |
|---|---|---|---|---|---|---|
| N = 1,000 | D = 2 | 10 | 0.00 | 0.43 | 0.70 | 0.86 |
| | | 20 | 0.00 | 0.38 | 0.71 | 0.89 |
| | | 30 | 0.00 | 0.35 | 0.68 | 0.92 |
| | D = 5 | 10 | 0.00 | 0.39 | 0.71 | 0.90 |
| | | 20 | 0.00 | 0.40 | 0.70 | 0.90 |
| | | 30 | 0.00 | 0.42 | 0.69 | 0.91 |
| N = 2,000 | D = 2 | 10 | 0.00 | 0.42 | 0.70 | 0.93 |
| | | 20 | 0.00 | 0.40 | 0.72 | 0.90 |
| | | 30 | 0.00 | 0.40 | 0.70 | 0.91 |
| | D = 5 | 10 | 0.00 | 0.40 | 0.68 | 0.91 |
| | | 20 | 0.00 | 0.40 | 0.72 | 0.89 |
| | | 30 | 0.00 | 0.41 | 0.70 | 0.90 |
| N = 4,000 | D = 2 | 10 | 0.00 | 0.38 | 0.68 | 0.88 |
| | | 20 | 0.00 | 0.40 | 0.69 | 0.90 |
| | | 30 | 0.00 | 0.40 | 0.72 | 0.90 |
| | D = 5 | 10 | 0.00 | 0.40 | 0.71 | 0.89 |
| | | 20 | 0.00 | 0.40 | 0.69 | 0.90 |
| | | 30 | 0.00 | 0.39 | 0.70 | 0.90 |

To illustrate the applicability of the proposed HO-IRT approach, the scoring procedure was carried out for a large-scale standardized assessment administered to 2,255 Grade 9 examinees obtained from CTB/McGraw-Hill. The assessment involved four domains, namely, Math (MA, 25 items), Math Computation (MC, 20 items), Spelling (SP, 20 items), and Social Studies (SS, 25 items). These were the same data analyzed by de la Torre and Patz (2005). The MCMC algorithm employed in this analysis is similar to that in the simulation study. Five chains were started at random, each with 3,000 iterations as burn-in, and a total of 15,000 iterations. The MPSRF was 1.05, indicating that the chains for the structural parameters had reached the approximate stationary distribution. The regression parameters and ability estimates were based on the draws of all the chains.
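The idea behind a multi-chain convergence check of this kind can be sketched with the univariate potential scale reduction factor, which compares between-chain and within-chain variability for a single parameter. This is a simplified stand-in, not the multivariate statistic (MPSRF) reported above, and the toy chains, their lengths, and the function name are hypothetical.

```python
import random
from statistics import mean, variance


def psrf(chains):
    """Univariate potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean(variance(chain) for chain in chains)  # W: average within-chain variance
    between_over_n = variance(chain_means)              # B / n: variance of the chain means
    pooled = (n - 1) / n * within + between_over_n      # pooled estimate of the target variance
    return (pooled / within) ** 0.5


rng = random.Random(1)
burn_in = 300

# Five toy chains drawn from the same distribution, with the burn-in draws discarded.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1500)][burn_in:] for _ in range(5)]
# Five toy chains stuck around different values, which should fail the check.
stuck = [[rng.gauss(float(k), 1.0) for _ in range(1500)][burn_in:] for k in range(5)]

psrf_mixed = psrf(mixed)
psrf_stuck = psrf(stuck)
print(f"well-mixed chains: {psrf_mixed:.3f}")
print(f"stuck chains:      {psrf_stuck:.3f}")
```

Values near 1 suggest that the chains are sampling from a common distribution, whereas values well above 1 indicate that they have not yet mixed.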

Results

Correlation between domains. The estimated correlational structure of the four test domains under the HO-IRT model is given in Table 9. The correlations among the domains ranged from 0.59 to 0.86, with the highest correlation found between MA and MC, and the lowest between SP and SS. These estimates were similar to those reported in de la Torre and Patz (2005), except that the correlations of SP with MC and SS were slightly lower. The method employed by de la Torre and Patz put no constraint on the correlational structure of the domains, but the comparability of the results suggests that a linear model can be used to approximate the correlational structure of the four domains tested. Finally, the substantial correlations between content domains warranted the application of the HO-IRT approach.
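One way to see what a linear higher-order structure implies for these correlations is to check whether a single set of domain loadings reproduces Table 9. The sketch below assumes that each pairwise correlation is the product of two domain loadings; it recovers the loadings with a simple ad hoc triad calculation, not with the authors' MCMC estimation, and the names used are hypothetical.

```python
from math import sqrt

# Estimated correlations transcribed from Table 9 (Grade 9 data).
TABLE9 = {
    ("MA", "MC"): 0.86, ("MA", "SP"): 0.67, ("MA", "SS"): 0.78,
    ("MC", "SP"): 0.65, ("MC", "SS"): 0.76, ("SP", "SS"): 0.59,
}
DOMAINS = ["MA", "MC", "SP", "SS"]


def corr(a, b):
    """Look up a correlation regardless of the order of the two domains."""
    return TABLE9[(a, b)] if (a, b) in TABLE9 else TABLE9[(b, a)]


def loading_from_triad(d, e, f):
    """If corr(x, y) = lam_x * lam_y, then lam_d ** 2 = corr(d, e) * corr(d, f) / corr(e, f)."""
    return sqrt(corr(d, e) * corr(d, f) / corr(e, f))


# Anchor on MA, then derive the remaining loadings from their correlations with MA.
loadings = {"MA": loading_from_triad("MA", "MC", "SP")}
for other in DOMAINS[1:]:
    loadings[other] = corr("MA", other) / loadings["MA"]

# Compare the correlations implied by the loadings with the reported ones.
implied = {pair: loadings[pair[0]] * loadings[pair[1]] for pair in TABLE9}
max_gap = max(abs(implied[pair] - TABLE9[pair]) for pair in TABLE9)

for domain in DOMAINS:
    print(f"{domain}: loading {loadings[domain]:.3f}")
print(f"largest gap between implied and reported correlations: {max_gap:.3f}")
```

In this sketch the product of the recovered loadings matches every reported correlation to within rounding, which is what a one-factor linear structure would produce.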

CU- and HO-IRT estimates of the four domain abilities. Both CU- and HO-IRT methods were used to estimate the examinees’ abilities across the four content domains. The domain ability estimates are comparable for the two methods; however, the HO-IRT estimates showed consistently higher precision than their CU-IRT counterparts, as shown in Table 10.

Because the MSE cannot be known for real data, and the simulation results indicated that the average bias across the ability continuum was very small (i.e., posterior variance is very similar to MSE), the posterior variances were used in place of the MSEs in computing the approximate relative efficiency of the proposed method for the real data. Table 10 shows that the relative efficiency in employing the HO-IRT model for these data ranged from 1.18 to 1.38, with the two highly correlated mathematics sections showing the highest gains. The relative efficiency indicated that application of the HO-IRT approach provided more accurate estimates of domain abilities and is equivalent to adding about 4 to 10 items in each content area. These represented modest but important gains in actual test settings, where only small numbers of items can be given in each domain due to a relatively broad range of contents to be covered in a limited testing time.
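The "about 4 to 10 items" figure can be reproduced from the reported relative efficiencies and the domain test lengths. The sketch below assumes that a relative efficiency of RE on a J-item test corresponds to (RE − 1) × J additional items; the dictionary and function names are hypothetical.

```python
# Domain test lengths (from the Data section) and relative efficiencies (from Table 10).
TEST_LENGTH = {"MA": 25, "MC": 20, "SP": 20, "SS": 25}
RELATIVE_EFFICIENCY = {"MA": 1.38, "MC": 1.35, "SP": 1.21, "SS": 1.18}


def additional_items(n_items, relative_efficiency):
    """Extra items a CU-IRT test would need to match HO-IRT precision (assumed rule)."""
    return round(n_items * (relative_efficiency - 1.0), 1)


extra = {
    domain: additional_items(TEST_LENGTH[domain], RELATIVE_EFFICIENCY[domain])
    for domain in TEST_LENGTH
}
for domain, items in extra.items():
    print(f"{domain}: about {items} additional items")
```

Recomputing the relative efficiency directly from the rounded posterior variances in Table 10 gives slightly different ratios, presumably because the table reports the variances to two decimals.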

Table 9

Estimated Correlation Between the Four Domains of the Grade 9 Data

| Domain | MC | SP | SS |
|---|---|---|---|
| MA | 0.86 | 0.67 | 0.78 |
| MC | | 0.65 | 0.76 |
| SP | | | 0.59 |

Note: MA = Math; MC = Math Computation; SP = Spelling; SS = Social Studies.


Summary and Discussion

This article proposes a cohesive framework for analyzing assessment data that allows integration of one general and several domain-specific abilities in the same model. A higher-order linear factor model formulation is used to relate the two types of abilities.

The resulting model is a general framework that subsumes the conventional IRT estimation of the overall and domain abilities as special cases. Estimates of the model parameters, which include the latent regression parameters in addition to the abilities, can be obtained using an MCMC algorithm.

Obtaining accurate and reliable estimates of the overall and domain-specific abilities greatly enhances the effective use of large-scale assessments: on one hand, the overall ability estimate is useful for important decisions such as rank-ordering the examinees; on the other hand, the domain ability estimates complement the overall ability estimate by providing finer-grained diagnosis of examinees’ strengths and weaknesses. Compared with currently available methods for improving estimation of domain-specific abilities, the HO-IRT approach provides a more elegant framework for modeling the multilevel abilities tested in large-scale assessment settings and conforms to our current understanding of the hierarchy of abilities.

The simulation study shows that the CU- and HO-IRT overall ability estimates are very similar to each other in terms of their correlation with the true ability and MSE when the domain abilities are correlated. However, the latter shows less bias and is generally more efficient than the former, particularly when the domain abilities are uncorrelated. It is worth noting that in estimating the overall ability, test dimensionality affects the accuracy of CU-IRT, but not its precision. More specifically, as test dimensionality increases (i.e., as domain abilities become less correlated), the bias of the estimates also increases, but the precision, which is solely a function of test length, remains unchanged. Therefore, one needs to exercise caution in using the overall ability estimated from multi-unidimensional tests via the conventional approach: At worst, the estimate can be extremely biased, and at best, it can be precise, but still biased.

In addition, the simulation study shows that, by using the HO-IRT model to estimate the domain abilities, more efficient estimates can be obtained. The improvement in the relative efficiency can be sizeable when multiple short tests measuring highly

Table 10

Posterior Variance and Approximate Relative Efficiency of HO-IRT Domain Ability Estimates for the Grade 9 Data

| Method / Domain | MA | MC | SP | SS |
|---|---|---|---|---|
| CU-IRT | 0.16 | 0.18 | 0.27 | 0.14 |
| HO-IRT | 0.12 | 0.13 | 0.23 | 0.12 |
| Relative efficiency | 1.38 | 1.35 | 1.21 | 1.18 |

Note: MA = Math; MC = Math Computation; SP = Spelling; SS = Social Studies; CU = conventional unidimensional; HO-IRT = higher-order item response theory.